# Sigmie — Full Documentation

> A modern, developer-friendly Elasticsearch and OpenSearch library for PHP and Laravel. Fluent search, semantic and hybrid retrieval, AI-ready, no boilerplate.

This file concatenates the full Sigmie documentation for one-shot LLM ingestion. Individual pages live under /docs/v2/{slug}.md.



---

<!-- source: https://sigmie.com/docs/v2/introduction -->

# Introduction

Sigmie is a Laravel-inspired PHP library for Elasticsearch and OpenSearch. It replaces deeply nested query arrays with a fluent, chainable API that reads like the feature you're building.

```php
$results = $sigmie->newSearch('products')
    ->properties($props)
    ->queryString('lapto')
    ->typoTolerance()
    ->filters('price<=1500 AND in_stock:true')
    ->facets('brand category price:100')
    ->highlighting(['title', 'description'])
    ->sort('_score:desc price:asc')
    ->get();
```

That's a typo-tolerant, faceted, highlighted search with sorting and filtering — in one expression.

## When to use Sigmie

Reach for Sigmie when you want to:

- Build search features without becoming an Elasticsearch DSL expert.
- Get typo tolerance, highlighting, faceting, and filtering with sensible defaults.
- Add semantic search with vector embeddings.
- Stay close to your domain code instead of writing query JSON.
- Use Laravel Scout with a serious Elasticsearch backend.

Use the raw Elasticsearch client instead when you need every possible Elasticsearch feature, when you have a large existing codebase built on raw queries, or when you need direct control over query parameters Sigmie does not expose.

## High-level field types

Sigmie wraps Elasticsearch's `text`, `keyword`, `number`, and friends in **semantic field types** that tell Elasticsearch how to treat your data:

```php
$props = new NewProperties;
$props->title('name');         // short searchable titles
$props->category('brand');     // exact-match filterable
$props->price();               // numeric ranges and histograms
$props->text('bio')->keyword(); // full-text search + sortable
```

Each high-level type configures the right analyzers, sub-fields, and queries underneath. See [Mappings & Properties](mappings.md).

## Human-readable filters

Skip the boolean query DSL. Write filters the way you'd describe them:

```php
->filters('category:"electronics" AND price:100..500 AND in_stock:true')
```

See [Filter Parser](filter-parser.md) for the full syntax.

## Semantic search

Add `->semantic()` to a text field and Sigmie generates vectors at index time using whatever embeddings API you registered:

```php
$props->text('description')->semantic(api: 'embeddings', dimensions: 384);

$sigmie->registerApi('embeddings', $embeddingsApi);

$results = $sigmie->newSearch('products')
    ->properties($props)
    ->semantic()
    ->queryString('portable work computer')   // finds "laptop", "notebook"
    ->get();
```

See [Semantic Search](semantic-search.md).

## Faceted search

Build sidebar filters for e-commerce or content discovery with a single method:

```php
$response = $sigmie->newSearch('products')
    ->queryString('laptop')
    ->facets('brand category price:100')
    ->get();

$facets = $response->json('facets');
// ['brand' => ['Apple' => 12, 'Dell' => 8], 'price' => ['min' => 599, ...]]
```

See [Facets](facets.md).

## End-to-end example

```php
use Sigmie\Sigmie;
use Sigmie\Mappings\NewProperties;
use Sigmie\Document\Document;

$sigmie = Sigmie::create(hosts: ['127.0.0.1:9200']);

$props = new NewProperties;
$props->title('name');
$props->text('description');
$props->category('brand')->facetDisjunctive();
$props->price();
$props->bool('in_stock');

$sigmie->newIndex('products')->properties($props)->create();

$sigmie->collect('products', refresh: true)
    ->properties($props)
    ->merge([
        new Document([
            'name' => 'MacBook Pro 16"',
            'description' => 'High-performance laptop for professionals',
            'brand' => 'Apple',
            'price' => 2499,
            'in_stock' => true,
        ]),
        new Document([
            'name' => 'ThinkPad X1 Carbon',
            'description' => 'Lightweight business notebook',
            'brand' => 'Lenovo',
            'price' => 1599,
            'in_stock' => true,
        ]),
    ]);

$results = $sigmie->newSearch('products')
    ->properties($props)
    ->queryString('profesional latop')        // typos are fine
    ->typoTolerance()
    ->filters('in_stock:true')
    ->facets('brand price:100')
    ->weight(['name' => 3, 'description' => 1])
    ->sort('_score:desc price:asc')
    ->get();

foreach ($results->hits() as $hit) {
    echo $hit['name'] . ' - $' . $hit['price'] . "\n";
}
```

## Requirements

- PHP 8.1 or higher
- Elasticsearch 7.x, 8.x, or 9.x — or OpenSearch 2.x or 3.x
- Composer

## What's next

- New to the library: [Installation](installation.md), then [Quick Start](quick-start.md).
- Laravel user: [Laravel Scout](laravel-scout.md).
- Building AI agents: [Laravel AI SDK](laravel-ai.md) and [Retrieval and Agents](rag.md).
- Exploring the model: [Core Concepts](core-concepts.md).


---

<!-- source: https://sigmie.com/docs/v2/installation -->

# Installation

## Requirements

- PHP 8.1 or higher
- Elasticsearch 7.x, 8.x, or 9.x — or OpenSearch 2.x or 3.x
- Composer

## Install via Composer

```bash
composer require sigmie/sigmie
```

Sigmie uses PSR-4 autoloading. If your project doesn't load Composer's autoloader yet, require it:

```php
require_once 'vendor/autoload.php';
```

## Connect

The fastest way to connect is `Sigmie::create()`:

```php
use Sigmie\Sigmie;

$sigmie = Sigmie::create(hosts: ['127.0.0.1:9200']);
```

That covers local development against an Elasticsearch instance with security disabled.

### Connect to a secured cluster

For production clusters, pass auth credentials in the `config` array:

```php
$sigmie = Sigmie::create(
    hosts: ['https://elasticsearch.example.com:9200'],
    config: [
        'auth' => ['elastic', 'your-password'],
        'verify' => true,
    ]
);
```

The `config` array accepts any [Guzzle HTTP client option](https://docs.guzzlephp.org/en/stable/request-options.html).
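
As a hedged illustration, a `config` array that combines the Guzzle options mentioned in this guide might look like the sketch below. `auth`, `verify`, `connect_timeout`, and `timeout` are standard Guzzle request options; the CA bundle path, the credentials, and the timeout values are placeholders chosen for the example, not recommendations.

```php
// Guzzle request options; every value below is a placeholder.
$config = [
    'auth' => ['elastic', 'your-password'],      // HTTP basic auth: [username, password]
    'verify' => '/etc/ssl/certs/ca-bundle.crt',  // hypothetical CA bundle path; true uses the system default
    'connect_timeout' => 5,                      // seconds to wait while connecting
    'timeout' => 30,                             // seconds to wait for the full response
];

// Passed to the client the same way as in the example above:
// $sigmie = Sigmie::create(hosts: ['https://elasticsearch.example.com:9200'], config: $config);
```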

### Connect to OpenSearch

Specify the engine type:

```php
use Sigmie\Enums\SearchEngineType;

$sigmie = Sigmie::create(
    hosts: ['https://localhost:9200'],
    engine: SearchEngineType::OpenSearch,
    config: [
        'auth' => ['admin', 'MyStrongPass123!'],
        'verify' => false,   // self-signed cert
    ]
);
```

See [OpenSearch](opensearch.md) for differences and AWS OpenSearch setup.

### Connect to multiple nodes

```php
$sigmie = Sigmie::create(
    hosts: ['10.0.0.1:9200', '10.0.0.2:9200', '10.0.0.3:9200']
);
```

Sigmie distributes requests across nodes round-robin and retries the next node on failure.
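
To make the round-robin and retry behavior concrete, here is a minimal plain-PHP sketch of the idea. It is not Sigmie's internal code: `sendWithFailover()` and the `$send` callable are hypothetical names used only for illustration, and the "request" is simulated.

```php
/**
 * Illustrative only: round-robin host selection with failover.
 * $send is a hypothetical callable that performs the request
 * and throws a RuntimeException on connection failure.
 */
function sendWithFailover(array $hosts, int &$cursor, callable $send): mixed
{
    $count = count($hosts);
    $lastError = null;

    for ($attempt = 0; $attempt < $count; $attempt++) {
        // Pick the next host in rotation, then advance the cursor.
        $host = $hosts[$cursor % $count];
        $cursor++;

        try {
            return $send($host);
        } catch (RuntimeException $e) {
            $lastError = $e; // remember the failure and try the next node
        }
    }

    throw new RuntimeException('All nodes failed', 0, $lastError);
}

$hosts = ['10.0.0.1:9200', '10.0.0.2:9200', '10.0.0.3:9200'];
$cursor = 0;

// Simulate the first node being down.
$send = function (string $host): string {
    if ($host === '10.0.0.1:9200') {
        throw new RuntimeException("$host unreachable");
    }

    return "ok from $host";
};

echo sendWithFailover($hosts, $cursor, $send), "\n"; // ok from 10.0.0.2:9200
echo sendWithFailover($hosts, $cursor, $send), "\n"; // ok from 10.0.0.3:9200
```

The cursor is shared between calls, so consecutive requests start from different nodes even when no node fails.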

## Run Elasticsearch locally

If you don't have a cluster yet, the fastest local setup is Docker:

```bash
docker run -d \
  --name elasticsearch \
  -p 9200:9200 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  -e "ES_JAVA_OPTS=-Xms512m -Xmx512m" \
  docker.elastic.co/elasticsearch/elasticsearch:9.0.0
```

Verify it's running:

```bash
curl http://localhost:9200
```

For the full AI-powered local stack (embeddings, reranking, vector models), see [Docker](docker.md).

## Verify your connection

```php
if ($sigmie->isConnected()) {
    echo "Connected.\n";
}
```

You're ready to build your first search — continue with the [Quick Start](quick-start.md).

## Troubleshooting

**`cURL error 7: Failed to connect to localhost port 9200`**
Elasticsearch isn't running, or the host/port in your config is wrong.

**`cURL error 60: SSL certificate problem`**
In development, set `'verify' => false`. In production, point `verify` at your CA bundle path.

**`401 Unauthorized`**
Auth credentials are wrong, or auth isn't enabled where you think it is. Check cluster logs.

**`cURL error 28: Operation timed out`**
Increase `connect_timeout` and `timeout` in your `config` array.

For full authentication and SSL options, see [Connection Setup](connection.md).


---

<!-- source: https://sigmie.com/docs/v2/quick-start -->

# Quick Start

In five minutes you'll have a product search that handles typos, filters by stock and price, and returns relevance-ranked results.

## Prerequisites

- Sigmie [installed](installation.md)
- Elasticsearch running on `127.0.0.1:9200`

Quick sanity check:

```php
use Sigmie\Sigmie;

$sigmie = Sigmie::create(hosts: ['127.0.0.1:9200']);

$sigmie->isConnected();   // true
```

## Step 1: Define a schema

`NewProperties` is your schema builder. Use high-level types — they wire up the right analyzers and queries underneath.

```php
use Sigmie\Mappings\NewProperties;

$props = new NewProperties;
$props->title('name');         // full-text searchable title
$props->text('description');   // long-form searchable text
$props->category('category');  // exact-match category
$props->price();               // numeric, filterable by range
$props->bool('in_stock');
```

## Step 2: Create the index

```php
$sigmie->newIndex('products')
    ->properties($props)
    ->create();
```

## Step 3: Index documents

```php
use Sigmie\Document\Document;

$sigmie->collect('products', refresh: true)
    ->merge([
        new Document([
            'name' => 'Laptop Pro',
            'description' => 'High-performance laptop for professionals',
            'category' => 'electronics',
            'price' => 1299,
            'in_stock' => true,
        ]),
        new Document([
            'name' => 'Wireless Mouse',
            'description' => 'Ergonomic wireless mouse with precision tracking',
            'category' => 'accessories',
            'price' => 49,
            'in_stock' => true,
        ]),
        new Document([
            'name' => 'USB-C Cable',
            'description' => 'Fast charging and data transfer cable',
            'category' => 'accessories',
            'price' => 15,
            'in_stock' => false,
        ]),
    ]);
```

`refresh: true` makes documents immediately searchable. Omit it in production for better bulk-indexing performance.

## Step 4: Search

```php
$results = $sigmie->newSearch('products')
    ->properties($props)
    ->queryString('laptop')
    ->get();

foreach ($results->hits() as $hit) {
    echo $hit['name'] . "\n";
}
// Laptop Pro
```

## Step 5: Tolerate typos

```php
$results = $sigmie->newSearch('products')
    ->properties($props)
    ->queryString('lapto')                        // typo
    ->typoTolerance()
    ->get();
// Laptop Pro
```

Defaults: one typo allowed for terms of 3+ characters, two typos for 6+. Override with `typoTolerance(oneTypoChars: 4, twoTypoChars: 8)`.
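
The length thresholds can be pictured with a small plain-PHP sketch. This is an illustration of the rule stated above, not Sigmie's implementation: `allowedTypos()` and `matchesWithTypos()` are hypothetical helpers, and PHP's `levenshtein()` stands in for whatever edit-distance measure the search engine applies.

```php
/**
 * Illustrative sketch of length-based typo tolerance.
 * Returns how many edits a term may contain under the documented defaults.
 */
function allowedTypos(string $term, int $oneTypoChars = 3, int $twoTypoChars = 6): int
{
    $length = strlen($term); // byte length; fine for the ASCII terms used here

    if ($length >= $twoTypoChars) {
        return 2;
    }

    return $length >= $oneTypoChars ? 1 : 0;
}

/** True when $query is within the allowed edit distance of $indexed. */
function matchesWithTypos(string $query, string $indexed): bool
{
    return levenshtein($query, $indexed) <= allowedTypos($query);
}

var_dump(allowedTypos('tv'));                   // int(0)
var_dump(allowedTypos('lapto'));                // int(1)
var_dump(allowedTypos('profesional'));          // int(2)
var_dump(matchesWithTypos('lapto', 'laptop'));  // bool(true)
var_dump(matchesWithTypos('tv', 'tb'));         // bool(false)
```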

## Step 6: Filter

The [filter parser](filter-parser.md) reads like a sentence:

```php
$results = $sigmie->newSearch('products')
    ->properties($props)
    ->queryString('cable')
    ->filters('in_stock:true')
    ->get();
// (no results — USB-C Cable is out of stock)
```

Combine clauses with `AND`, `OR`, and `NOT`:

```php
->filters('category:"accessories" AND price<=100')
->filters('price:100..500 AND in_stock:true')
->filters('NOT category:"books"')
```
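
For comparison, this is roughly the Elasticsearch boolean query you would otherwise write by hand for `category:"accessories" AND price:100..500 AND in_stock:true`. The `bool`, `term`, and `range` clauses are standard Elasticsearch DSL; that Sigmie emits exactly this structure, and that the `..` range is inclusive at both ends, are assumptions made for the sketch.

```php
// Hand-written equivalent (assumed, see the note above) of:
//   category:"accessories" AND price:100..500 AND in_stock:true
$query = [
    'bool' => [
        'filter' => [
            ['term' => ['category' => 'accessories']],
            ['range' => ['price' => ['gte' => 100, 'lte' => 500]]],
            ['term' => ['in_stock' => true]],
        ],
    ],
];

echo json_encode(['query' => $query], JSON_PRETTY_PRINT), "\n";
```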

## Step 7: Sort and paginate

```php
$results = $sigmie->newSearch('products')
    ->properties($props)
    ->queryString('mouse cable')
    ->sort('_score:desc price:asc')
    ->from(0)
    ->size(10)
    ->get();
```

`_score:desc` is the default. `_score:asc` is not allowed: relevance is always ranked highest-first.

## Step 8: Weight fields

```php
$results = $sigmie->newSearch('products')
    ->properties($props)
    ->queryString('laptop')
    ->weight(['name' => 3, 'description' => 1])
    ->get();
```

A match in `name` now scores 3× higher than the same match in `description`.

## Putting it together

```php
$results = $sigmie->newSearch('products')
    ->properties($props)
    ->queryString('wireless laptop')
    ->typoTolerance()
    ->filters('category:"electronics" AND price:100..2000 AND in_stock:true')
    ->weight(['name' => 3, 'description' => 1])
    ->sort('_score:desc price:asc')
    ->size(20)
    ->get();

echo "Found {$results->total()} products\n";

foreach ($results->hits() as $hit) {
    printf("%s — $%d\n", $hit['name'], $hit['price']);
}
```

## Add facets

Faceted navigation is one method away:

```php
$props->category('brand')->facetDisjunctive();   // enable faceting

$results = $sigmie->newSearch('products')
    ->properties($props)
    ->queryString('laptop')
    ->facets('brand category price:100')
    ->get();

$facets = $results->json('facets');
// ['brand' => ['Apple' => 5, ...], 'price' => ['min' => 999, 'max' => 2499, ...]]
```

See [Facets](facets.md) for the full reference.

## Add semantic search

When you want results by meaning, not just keywords:

```php
use Sigmie\AI\APIs\OpenAIEmbeddingsApi;

$sigmie->registerApi('embeddings', new OpenAIEmbeddingsApi('sk-...'));

$props = new NewProperties;
$props->title('name');
$props->text('description')->semantic(
    api: 'embeddings',
    dimensions: 1536,
);

$results = $sigmie->newSearch('products')
    ->properties($props)
    ->semantic()
    ->queryString('portable computer for work')   // matches "laptop", "notebook"
    ->get();
```

See [Semantic Search](semantic-search.md) for embeddings setup, accuracy levels, and similarity functions.

## Where to go next

- [Filter Parser](filter-parser.md) — every operator and clause.
- [Facets](facets.md) — sidebar filters with conjunctive/disjunctive logic.
- [Search](search.md) — every `NewSearch` method.
- [Mappings & Properties](mappings.md) — all field types.
- [Laravel Scout](laravel-scout.md) — Eloquent integration.


---

<!-- source: https://sigmie.com/docs/v2/core-concepts -->

# Core Concepts

Sigmie has four moving parts:

1. **The client** (`Sigmie`) — your connection.
2. **Indices** — containers for documents, with mappings and settings.
3. **Documents** — the JSON records you store and search.
4. **Properties** — the schema that tells Elasticsearch how to treat each field.

Everything in Sigmie composes from these four pieces.

## The client

```php
use Sigmie\Sigmie;

$sigmie = Sigmie::create(hosts: ['127.0.0.1:9200']);
```

The client is the entry point to every other operation:

```php
$sigmie->newIndex('movies');                // build a new index
$sigmie->index('movies');                   // load an existing one
$sigmie->collect('movies');                 // get a writable collection
$sigmie->newSearch('movies');               // build a search
$sigmie->newQuery('movies');                // build a raw query
$sigmie->newRecommend('movies');            // build a recommendation
```

See [Installation](installation.md) and [Connection Setup](connection.md) for connection options.

## Indices

An index is a container for related documents — closer to a database table than a file. Sigmie builds an index with a schema and a set of analysis settings:

```php
use Sigmie\Mappings\NewProperties;

$props = new NewProperties;
$props->title('title');
$props->name('director');
$props->number('year')->integer();

$sigmie->newIndex('movies')
    ->properties($props)
    ->create();
```

Index settings — sharding, replication, custom analyzers — are configured before `create()`:

```php
$sigmie->newIndex('movies')
    ->properties($props)
    ->shards(3)
    ->replicas(1)
    ->lowercase()
    ->tokenizeOnWhitespaces()
    ->create();
```

See [Indices](index.md) for the full lifecycle: creation, update, deletion.

## Documents

Documents are JSON objects you store in an index:

```php
use Sigmie\Document\Document;

$movie = new Document([
    'title' => 'The Matrix',
    'director' => 'The Wachowskis',
    'year' => 1999,
]);
```

Documents are written through a **collection**:

```php
$sigmie->collect('movies', refresh: true)
    ->merge([
        new Document(['title' => 'The Matrix', 'year' => 1999]),
        new Document(['title' => 'Inception', 'year' => 2010]),
    ]);
```

`refresh: true` makes documents immediately searchable. Use it in tests; skip it in production.

See [Documents](document.md) for adding, iterating, and bulk operations.

## Properties

Properties define the schema. They control:

- **Type** — how values are interpreted (`text`, `number`, `bool`).
- **Analysis** — how text is broken into tokens.
- **Sub-fields** — for example, a `text` field with a `.keyword` sub-field for sorting.
- **Queries** — which query types match each field.

```php
$props = new NewProperties;
$props->title('title');             // optimized for searchable titles
$props->name('director');           // optimized for personal/place names
$props->category('genre');          // exact-match filterable
$props->price('ticket_price');      // numeric, supports ranges
$props->date('release_date');
$props->text('description')->keyword();   // full-text + sortable/filterable
```

The high-level types are wrappers over Elasticsearch's `text`, `keyword`, `number`, etc. They wire up the right analyzers and queries so you don't have to.

See [Mappings & Properties](mappings.md) for every type.

## How the pieces fit

Use the same `NewProperties` instance for indexing and searching:

```php
$props = new NewProperties;
$props->title('title');
$props->name('director');

// At index creation
$sigmie->newIndex('movies')->properties($props)->create();

// At search time
$sigmie->newSearch('movies')->properties($props)->queryString('matrix')->get();
```

This is how Sigmie knows which queries to generate for each field. Skip it and your search falls back to a basic match query.

## Analysis: how text becomes searchable

Elasticsearch transforms text at index time so it can search it fast at query time. The same transformation runs on your query string at search time.

```
Document text                             Query string
   "The Matrix"                              "Matrix"
        │                                       │
        ▼                                       ▼
   [Char filters]                          [Char filters]
        │                                       │
        ▼                                       ▼
   [Tokenizer]                             [Tokenizer]
   ["The", "Matrix"]                       ["Matrix"]
        │                                       │
        ▼                                       ▼
   [Token filters]                         [Token filters]
   ["matrix"]    (lowercase, stopwords)    ["matrix"]
        │                                       │
        └──────────► term match ◄───────────────┘
```

The text "The Matrix" is stored as the single token `matrix`. The query "Matrix" gets lowercased to `matrix` too — so it matches.
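
The pipeline in the diagram can be mimicked in a few lines of plain PHP. This is a toy model for intuition only: the `analyze()` function and its four-word stopword list are hypothetical and far simpler than a real Elasticsearch analyzer.

```php
/**
 * Toy analyzer mirroring the diagram: a tokenizer followed by token filters.
 * The stopword list is a tiny hypothetical sample.
 */
function analyze(string $text, array $stopwords = ['the', 'a', 'an', 'of']): array
{
    // Tokenizer: split on runs of whitespace.
    $tokens = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);

    // Token filter 1: lowercase every token.
    $tokens = array_map('strtolower', $tokens);

    // Token filter 2: drop stopwords, then reindex the array.
    return array_values(array_diff($tokens, $stopwords));
}

$indexed = analyze('The Matrix');   // ['matrix']
$query   = analyze('Matrix');       // ['matrix']

// A document matches when the query shares at least one term with it.
var_dump(count(array_intersect($query, $indexed)) > 0);   // bool(true)
```

Because both sides pass through the same function, differences in case and filler words disappear before the terms are compared.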

Sigmie exposes index analysis through the index builder:

```php
$sigmie->newIndex('movies')
    ->tokenizeOnWhitespaces()
    ->lowercase()
    ->trim()
    ->create();
```

You can verify what an analyzer produces:

```php
$tokens = $sigmie->index('movies')->analyze('The Matrix');   // e.g. ["matrix"] with lowercase + stopword filters
```

See [Analysis](analysis.md) for the full pipeline, and [Tokenizers](tokenizers.md) / [Token Filters](token-filters.md) for the building blocks.

## Search vs. query

Two ways to retrieve documents:

```php
// High-level: user-facing search, with typo tolerance, highlighting, facets
$sigmie->newSearch('movies')
    ->properties($props)
    ->queryString('matrix sci-fi')
    ->typoTolerance()
    ->highlighting(['title'])
    ->facets('genre')
    ->get();

// Low-level: raw Elasticsearch boolean queries
$sigmie->newQuery('movies')
    ->properties($props)
    ->bool(function ($bool) {
        $bool->must()->match('title', 'matrix');
        $bool->filter()->range('year', ['>' => 1990]);
        $bool->should()->term('genre', 'sci-fi');
    })
    ->get();
```

Reach for [`newSearch()`](search.md) for user-facing search. Reach for [`newQuery()`](query.md) when you need full control over the Elasticsearch DSL.

## What's next

- Building your first feature: [Quick Start](quick-start.md).
- Designing your schema: [Mappings & Properties](mappings.md).
- Managing data: [Indices](index.md) and [Documents](document.md).
- Searching: [Search](search.md) and [Query Builder](query.md).


---

<!-- source: https://sigmie.com/docs/v2/index -->

# Indices

An index is a container for related documents — closer to a database table than a folder. Unlike a relational database, you don't have to define columns before inserting rows: Elasticsearch will create the index and infer field types on first write. But for serious applications you almost always define a schema first.

## Create an index

The simplest possible index:

```php
$sigmie->newIndex('movies')->create();
```

That works, but Elasticsearch will guess at field types as you index documents. For control over how fields are stored and searched, pass [properties](mappings.md):

```php
use Sigmie\Mappings\NewProperties;

$props = new NewProperties;
$props->title('title');
$props->name('director');
$props->category('genre');
$props->number('year')->integer();
$props->date('release_date');

$sigmie->newIndex('movies')
    ->properties($props)
    ->create();
```

## Configure analysis

Index-level analysis controls how text is tokenized and normalized at index time:

```php
$sigmie->newIndex('movies')
    ->properties($props)
    ->tokenizeOnWhitespaces()       // split on whitespace
    ->lowercase()                   // normalize to lowercase
    ->trim()                        // strip surrounding whitespace
    ->create();
```

The same analyzer runs on query strings at search time, so a query for `Matrix` matches a document storing `matrix`.

See [Analysis](analysis.md) and [Token Filters](token-filters.md) for the full pipeline.

### Language analyzers

```php
use Sigmie\Languages\English\English;

$sigmie->newIndex('articles')
    ->properties($props)
    ->language(new English)
    ->englishStemmer()
    ->englishStopwords()
    ->englishLowercase()
    ->create();
```

See [Languages](language.md) for English, German, and Greek builders.

### Autocomplete

```php
$props = new NewProperties;
$props->title('title');
$props->text('description');
$props->autocomplete();

$sigmie->newIndex('movies')
    ->properties($props)
    ->autocomplete(['title', 'description'])
    ->create();
```

## Sharding and replication

```php
$sigmie->newIndex('movies')
    ->shards(3)
    ->replicas(1)
    ->create();
```

A shard is a smaller index that holds a subset of documents. An index with 3 shards and 8 documents distributes like:

```
movies
├─ shard 1
│  ├─ document 1
│  ├─ document 2
│  └─ document 3
├─ shard 2
│  ├─ document 4
│  ├─ document 5
│  └─ document 6
└─ shard 3
   ├─ document 7
   └─ document 8
```

A replica is a copy of a shard on another node, for fault tolerance. With 3 primary shards and 2 replicas of each, spread across 3 nodes:

```
cluster
├─ node 1
│  ├─ primary 1
│  ├─ replica of primary 2
│  └─ replica of primary 3
├─ node 2
│  ├─ primary 2
│  ├─ replica of primary 1
│  └─ replica of primary 3
└─ node 3
   ├─ primary 3
   ├─ replica of primary 1
   └─ replica of primary 2
```

If a node fails, the surviving nodes still hold every document. Replicas are promoted to primaries automatically.

For most workloads, keep each shard under 30 GB.
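
The shard diagram above shows documents in sequence for readability. In practice Elasticsearch picks a shard by hashing a routing value, which is the document ID by default. The sketch below shows the simplified form of that rule; `shardFor()` is a hypothetical helper, and `crc32()` is only a stand-in for the Murmur3 hash Elasticsearch actually uses, so the concrete shard numbers will differ from a real cluster.

```php
/**
 * Simplified shard routing: hash the routing value (the document ID by
 * default) and take it modulo the number of primary shards.
 * crc32() is a stand-in; assumes 64-bit PHP, where it is never negative.
 */
function shardFor(string $documentId, int $primaryShards): int
{
    return crc32($documentId) % $primaryShards;
}

$placement = [];
foreach (range(1, 8) as $n) {
    $id = "doc-$n";
    $placement[shardFor($id, 3)][] = $id;
}

ksort($placement);
print_r($placement);   // each document lands on exactly one of shards 0..2
```

Since the result depends on the shard count, changing the number of primary shards would send existing IDs to different shards, which is why that number is normally chosen at index creation.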

## Add documents

To write documents into an index, get a **collection** for it:

```php
use Sigmie\Document\Document;

$sigmie->collect('movies')
    ->merge([
        new Document(['title' => 'Cinderella']),
        new Document(['title' => 'Snow White']),
        new Document(['title' => 'Sleeping Beauty']),
    ]);
```

To make documents immediately searchable (for tests), pass `refresh: true`:

```php
$sigmie->collect('movies', refresh: true)->merge($documents);
```

See [Documents](document.md) for the full collection API.

## Update an index

An index's analysis settings are effectively fixed once documents have been analyzed: you can't re-analyze existing documents without re-indexing them. Sigmie provides an `update()` method that handles this transparently using **aliases**:

```php
use Sigmie\Index\UpdateIndex;

$sigmie->index('movies')->update(function (UpdateIndex $update) {
    $update->properties($newProperties);
    $update->lowercase();
});
```

Behind the scenes, `update()`:

1. Creates a new physical index with a timestamp suffix.
2. Reindexes every document into the new index.
3. Switches the `movies` alias to point at the new index.
4. Deletes the old index.

```
Step 1: Create new index
movies (alias) ──► movies_20221122210823379774
                   ├─ Cinderella
                   ├─ Snow White
                   └─ Sleeping Beauty

movies_20221222210823379774   (empty)

Step 2: Reindex
movies (alias) ──► movies_20221122210823379774
                   ├─ Cinderella
                   ├─ Snow White
                   └─ Sleeping Beauty

movies_20221222210823379774
├─ Cinderella
├─ Snow White
└─ Sleeping Beauty

Step 3: Swap alias
movies_20221122210823379774   (orphaned)

movies (alias) ──► movies_20221222210823379774
                   ├─ Cinderella
                   ├─ Snow White
                   └─ Sleeping Beauty

Step 4: Delete old index
movies (alias) ──► movies_20221222210823379774
```

To clients, the index name is unchanged. There's no downtime.
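
The alias swap in step 3 corresponds to Elasticsearch's `POST /_aliases` endpoint, which applies a list of actions atomically. The sketch below builds such a request body as a PHP array using the index names from the diagram. Whether Sigmie sends exactly this payload is an assumption; the `remove` and `add` actions themselves are standard Elasticsearch API.

```php
// Index names taken from the diagram above.
$old = 'movies_20221122210823379774';
$new = 'movies_20221222210823379774';

// Body for Elasticsearch's `POST /_aliases` endpoint. Both actions are
// applied atomically, so the alias never points at zero or two indices.
$body = [
    'actions' => [
        ['remove' => ['index' => $old, 'alias' => 'movies']],
        ['add'    => ['index' => $new, 'alias' => 'movies']],
    ],
];

echo json_encode($body, JSON_PRETTY_PRINT), "\n";
```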

> **Warning:** Index settings are **not** merged. Anything you don't re-declare in the `update()` callback is dropped. Re-set everything you want to keep.

## Inspect an index

```php
$index = $sigmie->index('movies');

$index->mappings;                        // index mappings
$index->mappings->properties();          // property definitions
$index->raw;                             // raw Elasticsearch response
$index->analyze('The Matrix');           // run text through the analyzer
```

## Delete an index

```php
$sigmie->index('movies')->delete();
```

## Advanced settings

Apply any [Elasticsearch index module setting](https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules.html) with `config()`:

```php
$sigmie->newIndex('movies')
    ->config('index.max_ngram_diff', 3)
    ->create();
```


---

<!-- source: https://sigmie.com/docs/v2/document -->

# Documents

A `Document` is a JSON object stored in an index. Sigmie treats every index as a writable collection: you add `Document` instances, iterate over them, query them, and update them.

## Create documents

```php
use Sigmie\Document\Document;

$doc = new Document(['title' => 'The Matrix', 'year' => 1999]);
```

Documents can hold any JSON-serializable structure:

```php
$doc = new Document([
    'title' => 'Inception',
    'director' => [
        'name' => 'Christopher Nolan',
        'born' => 1970,
    ],
    'cast' => [
        ['name' => 'Leonardo DiCaprio', 'role' => 'Cobb'],
        ['name' => 'Marion Cotillard', 'role' => 'Mal'],
    ],
    'metadata' => [
        'runtime' => 148,
        'budget' => 160_000_000,
    ],
]);
```

### Custom document IDs

Pass an ID as the second argument:

```php
$doc = new Document(['title' => 'The Matrix'], 'matrix_1999');
```

Custom IDs let you re-index a document later by writing the same ID — Elasticsearch overwrites it in place.

## Get a collection

```php
$movies = $sigmie->collect('movies');
```

For tests, `refresh: true` makes documents immediately searchable:

```php
$movies = $sigmie->collect('movies', refresh: true);
```

> **Warning:** Don't use `refresh: true` in production. It forces a costly refresh on every write.

## Add documents

A single document:

```php
$movies->add(new Document(['title' => 'Mickey Mouse']));
```

Many documents (much faster than calling `add()` in a loop):

```php
$movies->merge([
    new Document(['title' => 'Snow White']),
    new Document(['title' => 'Cinderella']),
    new Document(['title' => 'Sleeping Beauty']),
]);
```

## Validate with properties

Pass properties to the collection and Sigmie validates each document against the schema before indexing:

```php
use Sigmie\Mappings\NewProperties;

$props = new NewProperties;
$props->title('title');
$props->date('release_date');
$props->number('rating')->float();

$movies = $sigmie->collect('movies')->properties($props);

$movies->merge([
    new Document([
        'title' => 'The Matrix',
        'release_date' => '1999-03-31T00:00:00Z',
        'rating' => 8.7,
    ]),
]);
```

Invalid data (a non-numeric `rating`, an unparseable `release_date`) is caught at indexing time.

## Update a document

To update, write the same ID:

```php
$movies->add(new Document([
    'title' => 'The Matrix',
    'year' => 1999,
    'rating' => 8.7,
], 'matrix_1'));
```

Elasticsearch indexes the new version under the same `_id`, replacing the previous one.

## Iterate over a collection

`each()` streams every document without loading the index into memory. Sigmie pages through results internally using a Point-in-Time and `search_after`, so writes during iteration don't corrupt the cursor.

```php
$movies->each(function (Document $doc): void {
    echo $doc['title'] . "\n";
});
```

The default page size is 500. Override it with `chunk()`:

```php
$movies->chunk(100)->each(function (Document $doc): void {
    processOne($doc);
});
```
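
The paging strategy mentioned above (`search_after`) is a form of keyset pagination: each page asks for "everything after the last value I saw" instead of "skip N results". The plain-PHP sketch below imitates it over an in-memory list. `pages()` is a hypothetical helper for illustration and does not talk to Elasticsearch.

```php
/**
 * Keyset ("search after") pagination over an in-memory, ID-sorted list.
 * Illustrative only: a real cursor would send each page's last sort value
 * back to Elasticsearch as `search_after`.
 */
function pages(array $sortedIds, int $pageSize): Generator
{
    $after = null; // sort value of the last hit on the previous page

    while (true) {
        // "Query": everything strictly after the cursor, capped at the page size.
        $page = array_slice(
            array_values(array_filter(
                $sortedIds,
                fn (int $id): bool => $after === null || $id > $after
            )),
            0,
            $pageSize
        );

        if ($page === []) {
            return;
        }

        yield $page;
        $after = end($page); // remember where this page stopped
    }
}

foreach (pages([1, 2, 3, 4, 5], 2) as $page) {
    echo implode(',', $page), "\n";
}
// 1,2
// 3,4
// 5
```

Because the cursor is a value rather than an offset, the cost of fetching a page does not grow with how deep into the result set you are.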

For iteration over a **subset** (filtered, sorted), use `NewSearch::each()` or `NewSearch::lazy()` instead. See [Iterating over all matching hits](search.md#iterating-over-all-matching-hits).

## Other collection methods

```php
$movies->count();                       // total document count
$movies->has('matrix_1');               // does this ID exist
$movies->get('matrix_1');               // fetch one by ID
$movies->getMany(['matrix_1', 'inception_1']);  // fetch many by ID
$movies->random(5);                     // 5 random documents
$movies->remove('matrix_1');            // delete by ID
$movies->clear();                       // delete every document
$movies->toArray();                     // load all into memory (small indices only)
```

`Document` implements `ArrayAccess`:

```php
$doc['title'] = 'New Title';
$title = $doc['title'];
isset($doc['year']);
unset($doc['description']);
```

## Complex data types

### Dates

```php
$props->date('created_at');

new Document([
    'title' => 'New Movie',
    'created_at' => '2023-04-07T12:38:29.000000Z',
]);
```

### Geo points

```php
$props->geoPoint('location');

new Document([
    'venue' => 'Cinema Downtown',
    'location' => ['lat' => 40.7128, 'lon' => -74.0060],
]);
```

### Nested arrays of objects

```php
$props->nested('cast', function (NewProperties $props) {
    $props->name('actor');
    $props->keyword('role');
});

new Document([
    'title' => 'Avengers',
    'cast' => [
        ['actor' => 'Robert Downey Jr.', 'role' => 'Iron Man'],
        ['actor' => 'Chris Evans', 'role' => 'Captain America'],
    ],
]);
```

Nested fields preserve the relationship between sibling values during search — see [Filter Parser](filter-parser.md#nested-field-filtering) for nested filtering syntax.
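
To see what "preserving the relationship" means, the plain-PHP sketch below compares the two behaviors on the `cast` data from the example. It simulates matching in memory and runs no real query: without a nested mapping, Elasticsearch flattens an array of objects into one list of values per field, so a query can match values that belong to different objects.

```php
$cast = [
    ['actor' => 'Robert Downey Jr.', 'role' => 'Iron Man'],
    ['actor' => 'Chris Evans', 'role' => 'Captain America'],
];

// Without a nested mapping, an array of objects is flattened into one
// list of values per field, losing which actor had which role.
$flattened = [
    'cast.actor' => array_column($cast, 'actor'),
    'cast.role'  => array_column($cast, 'role'),
];

// Query: actor = "Chris Evans" AND role = "Iron Man" (a pairing that does not exist).
$flatMatch = in_array('Chris Evans', $flattened['cast.actor'], true)
    && in_array('Iron Man', $flattened['cast.role'], true);

// With nested semantics, both conditions must hold inside the same object.
$nestedMatch = false;
foreach ($cast as $member) {
    if ($member['actor'] === 'Chris Evans' && $member['role'] === 'Iron Man') {
        $nestedMatch = true;
    }
}

var_dump($flatMatch);    // bool(true): a false positive
var_dump($nestedMatch);  // bool(false)
```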

## When are writes visible to search

By default Elasticsearch operates in "near real-time" — writes become searchable about a second later:

```php
$movies = $sigmie->collect('movies');
$movies->add(new Document(['title' => 'Snow White']));
$movies->count();   // 0 (immediately)
sleep(1);
$movies->count();   // 1
```

`refresh: true` makes them visible immediately:

```php
$movies = $sigmie->collect('movies', refresh: true);
$movies->add(new Document(['title' => 'Snow White']));
$movies->count();   // 1
```

For batch processing, use the default and refresh once when you're done:

```php
$movies = $sigmie->collect('movies');
$movies->merge($manyDocuments);
$sigmie->index('movies')->refresh();
```


---

<!-- source: https://sigmie.com/docs/v2/mappings -->

# Mappings & Properties

Properties tell Elasticsearch what each field is and how to search it. Sigmie exposes both **native Elasticsearch types** (`text`, `keyword`, `number`, `bool`, `date`, `geoPoint`) and **high-level types** (`title`, `name`, `category`, `price`, `email`) that wrap the natives with sensible defaults.

You build mappings with the `NewProperties` builder and pass the same instance to your index, your collection, and your searches:

```php
use Sigmie\Mappings\NewProperties;

$props = new NewProperties;
$props->title('title');
$props->name('director');
$props->number('year')->integer();

$sigmie->newIndex('movies')->properties($props)->create();
$sigmie->collect('movies')->properties($props);
$sigmie->newSearch('movies')->properties($props)->queryString('matrix')->get();
```

Reusing the same `$props` is what lets Sigmie generate the right queries for each field.

## High-level types

### Title

For short, searchable text — movie titles, article titles, product names.

```php
$props->title('name');
```

### Name

For personal and place names. Tuned for autocomplete-style matching.

```php
$props->name('director');
$props->name('first_name');
$props->name('city');
```

### Category

For exact-match classification — genres, departments, brands.

```php
$props->category('genre');
$props->category('brand');
```

Categories can opt into faceting (see [Facets](facets.md)):

```php
$props->category('brand')->facetDisjunctive();
$props->category('color')->facetConjunctive();
```

### Tags

For multi-value fields — product tags, attributes, skills.

```php
$props->tags('skills');
```

### Price

For monetary values — supports range queries and histograms.

```php
$props->price();             // defaults to field name 'price'
$props->price('amount');
```

### Long text

For descriptions, summaries, comments, articles.

```php
$props->longText('description');
$props->longText('synopsis');
```

### Short text

For brief, single-line text content.

```php
$props->shortText('headline');
```

### HTML

Strips HTML tags before indexing. Useful for crawled content.

```php
$props->html('content');
```

### Searchable number

Numbers users actually type into a search box — years, phone numbers, reservation IDs.

```php
$props->searchableNumber('birth_year');
$props->searchableNumber('phone');
```

Reach for `number()` instead when you're filtering or sorting numerically but not searching.

### Identifier

For primary and foreign keys — filterable, groupable.

```php
$props->id('user_id');
$props->id('product_id');
```

### Email and address

```php
$props->email('contact_email');
$props->address('shipping_address');
```

### Case-sensitive keyword

Keyword fields lowercase their values by default. Use this when case matters:

```php
$props->caseSensitiveKeyword('promo_code');
```

### Path

For hierarchical paths, indexed at each level (`/a`, `/a/b`, `/a/b/c`):

```php
$props->path('file_path');
```
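
To make "indexed at each level" concrete, here is a minimal plain-PHP sketch of the expansion. `pathLevels()` is a hypothetical helper written for this illustration, not part of Sigmie; the real expansion happens inside Elasticsearch at index time.

```php
/**
 * Expand a path into every ancestor prefix, shortest first.
 * Hypothetical helper for illustration only.
 */
function pathLevels(string $path, string $delimiter = '/'): array
{
    $levels = [];
    $current = '';

    foreach (explode($delimiter, $path) as $segment) {
        // Skip empty segments produced by leading or duplicate delimiters.
        if ($segment === '') {
            continue;
        }

        $current .= $delimiter . $segment;
        $levels[] = $current;
    }

    return $levels;
}

$levels = pathLevels('/a/b/c'); // ['/a', '/a/b', '/a/b/c']
```

Because every prefix is stored, a document saved under `/a/b/c` can be found through any of its ancestor paths.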

### Boost

For per-document score boosts:

```php
$props->boost();
```

### Autocomplete

For prefix-style suggestions:

```php
$props->autocomplete();
```

## Native types

### Text

The Elasticsearch workhorse for unstructured text:

```php
$props->text('description');
```

Add a `.keyword` sub-field if you need to sort, filter, or aggregate on the same field:

```php
$props->text('category')->keyword();          // can search AND filter
$props->text('category')->keyword()->makeSortable();
```

Other modifiers:

```php
$props->text('name')->searchAsYouType();      // search-as-you-type
$props->text('content')->indexPrefixes();     // index prefixes for Prefix queries
$props->text('description')->unstructuredText();   // explicit (it's the default)
```

### Keyword

Stored as-is, no analysis. Use for exact matching, sorting, and aggregations:

```php
$props->keyword('ISBN');
$props->keyword('status');
```

### Number

```php
$props->number('rating')->float();
$props->number('count')->integer();
$props->number('amount')->double();
$props->number('precise_amount')->scaledFloat();
```

There are also convenience methods:

```php
$props->integer('count');
$props->float('rating');
$props->long('big_number');
$props->double('high_precision');
```

### Boolean

```php
$props->bool('is_active');
```

### Date

```php
$props->date('created_at');
```

Default format is the PHP ISO format (`Y-m-d\TH:i:s.uP`). Format dates with:

```php
(new DateTime())->format('Y-m-d\TH:i:s.uP');
```

Supported out of the box:

- `2023-04-07T12:38:29.000000Z`
- `2023-04-07T12:38:29Z`
- `2023-04-07T12:38:29`
- `2023-04-07`
- `2023-04-07T12:38:29.000000+02:00`
- `2023-04-07T12:38:29+02:00`

For other formats, pass an Elasticsearch date pattern:

```php
$props->date('created_at', 'MM/dd/yyyy');
```
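
As a small sketch of producing the default format from application code, the helper below normalises any `DateTimeInterface`. The function name `toDefaultDateFormat()` is an assumption made for this example, not a Sigmie API.

```php
// Hypothetical helper: format a date using the default pattern shown above.
function toDefaultDateFormat(DateTimeInterface $date): string
{
    return $date->format('Y-m-d\TH:i:s.uP');
}

// A fixed offset keeps the example independent of the server's timezone.
$createdAt = new DateTimeImmutable('2023-04-07 12:38:29', new DateTimeZone('+02:00'));

echo toDefaultDateFormat($createdAt), PHP_EOL;
// 2023-04-07T12:38:29.000000+02:00
```

Note that PHP's `P` specifier prints UTC as `+00:00` rather than `Z`; both shapes appear in the supported list above.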

### Geo point

```php
$props->geoPoint('location');
```

Documents store coordinates as `['lat' => 12.34, 'lon' => 56.78]`. See [Filter Parser](filter-parser.md#geo-location-filtering) for proximity filters.

## Complex types

### Object

For a single embedded object (not an array of objects). Its fields are flattened into the parent document:

```php
$props->object('director', function (NewProperties $props) {
    $props->name('name');
    $props->number('birth_year')->integer();
    $props->email('contact');
});
```

Filter with dot notation: `director.name:"Nolan"`.

### Nested

For arrays of objects where you need to preserve the relationship between sibling values:

```php
$props->nested('cast', function (NewProperties $props) {
    $props->name('actor');
    $props->keyword('character');
    $props->number('screen_time')->integer();
});
```

Filter with curly braces: `cast:{actor:"Keanu Reeves" AND character:"Neo"}`.

Use `object()` when each document has one of these things; use `nested()` when each document has a list and the fields within each item belong together.
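
The difference is easiest to see with a list. The plain-PHP sketch below simulates both matching strategies in memory; `matchesFlattened()` and `matchesNested()` are hypothetical helpers for illustration and say nothing about how Sigmie or Elasticsearch implement the types internally.

```php
$cast = [
    ['actor' => 'Keanu Reeves', 'character' => 'Neo'],
    ['actor' => 'Carrie-Anne Moss', 'character' => 'Trinity'],
];

// Flattened fields: each field is one pool of values for the whole document.
function matchesFlattened(array $items, string $actor, string $character): bool
{
    return in_array($actor, array_column($items, 'actor'), true)
        && in_array($character, array_column($items, 'character'), true);
}

// Nested items: both conditions must hold inside the same item.
function matchesNested(array $items, string $actor, string $character): bool
{
    foreach ($items as $item) {
        if ($item['actor'] === $actor && $item['character'] === $character) {
            return true;
        }
    }

    return false;
}

var_dump(matchesFlattened($cast, 'Keanu Reeves', 'Trinity')); // bool(true), a false positive
var_dump(matchesNested($cast, 'Keanu Reeves', 'Trinity'));    // bool(false)
var_dump(matchesNested($cast, 'Keanu Reeves', 'Neo'));        // bool(true)
```

With flattened fields the query "Keanu Reeves played Trinity" matches, because both values exist somewhere in the document. Keeping each item separate is what rules that out.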

## Semantic fields

Make any text field semantic by chaining `->semantic()`. Sigmie generates vector embeddings at index time using whichever embeddings API you registered:

```php
$props->text('description')->semantic(api: 'embeddings', dimensions: 384);
```

The `api` name matches what you passed to `Sigmie::registerApi()`:

```php
use Sigmie\AI\APIs\OpenAIEmbeddingsApi;

$sigmie->registerApi('embeddings', new OpenAIEmbeddingsApi('sk-...'));
```

Tune accuracy (1 = fast, 5 = high quality; see [Semantic Search](semantic-search.md) for the full scale):

```php
$props->text('content')->semantic(
    api: 'embeddings',
    dimensions: 512,
    accuracy: 3,
);
```

Choose a similarity metric:

```php
use Sigmie\Enums\VectorSimilarity;

$props->text('content')->semantic(
    api: 'embeddings',
    similarity: VectorSimilarity::Cosine,           // default
);
```

Add multiple vector representations of the same field:

```php
$props->text('job_description')
    ->semantic(api: 'embeddings', accuracy: 3, dimensions: 512)
    ->semantic(api: 'embeddings', accuracy: 5, dimensions: 512,
        similarity: VectorSimilarity::Euclidean);
```

See [Semantic Search](semantic-search.md) for the full feature.

## Custom analyzers

Override analysis on a per-field basis:

```php
use Sigmie\Index\NewAnalyzer;

$props->text('email')
    ->withNewAnalyzer(function (NewAnalyzer $analyzer) {
        $analyzer->tokenizeOnPattern('(@|\.)');
        $analyzer->lowercase();
    });
```

## Custom query logic

Define which queries to run for a given field:

```php
use Sigmie\Query\Queries\Term\Prefix;
use Sigmie\Query\Queries\Term\Term;
use Sigmie\Query\Queries\Text\Match_;

$props->text('email')
    ->unstructuredText()
    ->indexPrefixes()
    ->keyword()
    ->withQueries(function (string $queryString) {
        return [
            new Match_('email', $queryString),
            new Prefix('email', $queryString),
            new Term('email.keyword', $queryString),
        ];
    });
```

## Custom property classes

For reusable types, extend a base class:

```php
use Sigmie\Mappings\Types\Text;
use Sigmie\Index\NewAnalyzer;
use Sigmie\Query\Queries\Term\Prefix;
use Sigmie\Query\Queries\Term\Term;
use Sigmie\Query\Queries\Text\Match_;

class Color extends Text
{
    public string $name = 'color';

    public function configure(): void
    {
        $this->unstructuredText()->indexPrefixes()->keyword();
    }

    public function analyze(NewAnalyzer $analyzer): void
    {
        $analyzer->tokenizeOnWhitespaces();
        $analyzer->lowercase();
    }

    public function queries(string $queryString): array
    {
        return [
            new Match_($this->name, $queryString),
            new Prefix($this->name, $queryString),
            new Term("{$this->name}.keyword", $queryString),
        ];
    }
}
```

Register the type:

```php
$props->type(new Color);
```

For shipping field types as a reusable package, see [Extending Sigmie](extending.md).

## Inspect properties

```php
$properties = $props->get();
$fields = $properties->fieldNames();          // ['title', 'cast.actor', ...]
$allFields = $properties->fieldNames(true);   // include intermediate objects
```

Validate a value against a property:

```php
[$valid, $message] = $properties['created_at']->validate('created_at', '2023-04-07');
```

## Quick reference

**Native types:** `text`, `keyword`, `number`, `bool`, `date`, `geoPoint`

**High-level types:** `title`, `name`, `category`, `tags`, `price`, `longText`, `shortText`, `html`, `email`, `address`, `searchableNumber`, `id`, `caseSensitiveKeyword`, `path`, `boost`, `autocomplete`

**Complex types:** `object`, `nested`

**Semantic:** `->semantic(api:, dimensions:, accuracy:, similarity:)`


---

<!-- source: https://sigmie.com/docs/v2/search -->

# Search

`newSearch()` is the high-level entry point for user-facing search: typo tolerance, faceting, highlighting, weighting, semantic matching, all in one fluent chain.

For lower-level access to Elasticsearch's boolean query DSL, see [Advanced Queries](query.md).

```php
use Sigmie\Mappings\NewProperties;

$props = new NewProperties;
$props->name();
$props->text('description');

$results = $sigmie->newSearch('fairy-tales')
    ->properties($props)
    ->queryString('snow white')
    ->get();
```

Two arguments are required: the **properties** (so Sigmie knows how to query each field) and the **query string**.

## Query string

The user input you're searching for:

```php
$sigmie->newSearch('fairy-tales')
    ->properties($props)
    ->queryString('snow white')
    ->get();
```

Add multiple query strings with different weights to bias the score:

```php
$sigmie->newSearch('characters')
    ->properties($props)
    ->queryString('Mickey', weight: 2)
    ->queryString('Goofy', weight: 1)
    ->get();
```

## Limit which fields are searched

By default, every searchable field in your properties is queried. Narrow to specific fields with `fields()`:

```php
$sigmie->newSearch('fairy-tales')
    ->properties($props)
    ->queryString('Snow White')
    ->fields(['name'])                              // only search `name`
    ->get();
```

## Limit which fields are returned

Reduce response size by selecting only the fields you need:

```php
$sigmie->newSearch('fairy-tales')
    ->properties($props)
    ->queryString('Snow White')
    ->retrieve(['name', 'description'])
    ->get();
```

## Filter

The [filter parser](filter-parser.md) reads filters in a human-friendly syntax:

```php
$sigmie->newSearch('fairy-tales')
    ->properties($props)
    ->queryString('Sleeping Beauty')
    ->filters('stock>0 AND is:active AND NOT category:"Drama"')
    ->get();
```

Filters narrow the result set but don't affect relevance scoring.

## Sort

```php
$sigmie->newSearch('fairy-tales')
    ->properties($props)
    ->queryString('Snow White')
    ->sort('_score:desc name:asc')
    ->get();
```

`_score:desc` is the default. `_score:asc` is not allowed. See [Sort Parser](sort-parser.md) for full syntax.

## Typo tolerance

```php
$sigmie->newSearch('fairy-tales')
    ->properties($props)
    ->queryString('Sleping Buety')                  // typos OK
    ->typoTolerance()
    ->get();
```

The default policy: one typo allowed for terms 3+ characters long, two typos for 6+. Override the thresholds:

```php
->typoTolerance(oneTypoChars: 4, twoTypoChars: 8)
```
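
The length thresholds can be read as a simple rule per term. The sketch below restates that rule in plain PHP; `allowedTypos()` is a hypothetical function that mirrors the policy described above, not the code Sigmie runs.

```php
// Hypothetical restatement of the policy: how many typos a term may contain.
function allowedTypos(string $term, int $oneTypoChars = 3, int $twoTypoChars = 6): int
{
    // Count Unicode characters rather than bytes.
    $length = (int) preg_match_all('/./su', $term);

    if ($length >= $twoTypoChars) {
        return 2;
    }

    return $length >= $oneTypoChars ? 1 : 0;
}

allowedTypos('to');            // 0: too short for any typo
allowedTypos('Buety');         // 1: five characters
allowedTypos('Sleping');       // 2: seven characters
allowedTypos('Sleping', 4, 8); // 1: stricter thresholds
```

Raising the thresholds makes short terms match more strictly, which tends to reduce noisy results for short queries.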

Restrict typos to specific fields:

```php
$sigmie->newSearch('fairy-tales')
    ->properties($props)
    ->queryString('Sleping Buety')
    ->typoTolerance()
    ->typoTolerantAttributes(['name'])
    ->get();
```

## Highlight matches

Wrap matching tokens in HTML for direct display:

```php
$sigmie->newSearch('fairy-tales')
    ->properties($props)
    ->queryString('sleeping beauty')
    ->highlighting(
        ['name'],
        prefix: '<mark>',
        suffix: '</mark>',
    )
    ->get();
```

Default prefix/suffix is `<em>` / `</em>`.
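
To give a feel for the output shape, the sketch below wraps matching words in plain PHP. `highlightTokens()` is a hypothetical helper; real highlighting runs inside Elasticsearch on analyzed tokens, so its matches can differ from this simple whole-word comparison.

```php
// Hypothetical helper: wrap whole-word, case-insensitive matches of each query token.
function highlightTokens(string $text, string $query, string $prefix = '<em>', string $suffix = '</em>'): string
{
    // Split the query on whitespace and escape each token for use in a regex.
    $tokens = array_map(
        fn (string $token): string => preg_quote($token, '/'),
        preg_split('/\s+/', $query, -1, PREG_SPLIT_NO_EMPTY)
    );

    if ($tokens === []) {
        return $text;
    }

    // The callback keeps the original casing of each matched word.
    return preg_replace_callback(
        '/\b(?:' . implode('|', $tokens) . ')\b/iu',
        fn (array $match): string => $prefix . $match[0] . $suffix,
        $text
    );
}

echo highlightTokens('Sleeping Beauty and the Beast', 'sleeping beauty', '<mark>', '</mark>');
// <mark>Sleeping</mark> <mark>Beauty</mark> and the Beast
```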

## Weight fields

Give certain fields more influence on relevance:

```php
$sigmie->newSearch('fairy-tales')
    ->properties($props)
    ->queryString('sleeping beauty')
    ->weight(['name' => 4, 'description' => 1])
    ->get();
```

A match in `name` now scores 4× higher than the same match in `description`.

## Minimum score

Drop low-relevance results:

```php
$sigmie->newSearch('fairy-tales')
    ->properties($props)
    ->queryString('Mickey')
    ->weight(['name' => 5])
    ->minScore(2)
    ->get();
```

## Paginate

```php
$sigmie->newSearch('fairy-tales')
    ->properties($props)
    ->queryString('sleeping beauty')
    ->from(10)
    ->size(10)
    ->get();
```

`from(10)->size(10)` returns the second page (skip the first 10, take the next 10).

`page()` is a shortcut:

```php
->page(2, 20)               // page 2, 20 per page (== from(20)->size(20))
```
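
The relationship between a page number and `from`/`size` is plain arithmetic. The sketch below spells it out; `pageWindow()` is a hypothetical helper used only to show the calculation.

```php
// Hypothetical helper: translate a 1-based page number into from/size values.
function pageWindow(int $page, int $perPage): array
{
    if ($page < 1 || $perPage < 1) {
        throw new InvalidArgumentException('Page and page size must be at least 1.');
    }

    return ['from' => ($page - 1) * $perPage, 'size' => $perPage];
}

pageWindow(1, 20); // ['from' => 0,  'size' => 20]
pageWindow(2, 20); // ['from' => 20, 'size' => 20]
pageWindow(2, 10); // ['from' => 10, 'size' => 10]
```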

## Deduplicate

Return one hit per value of a field. Useful for product variants:

```php
$sigmie->newSearch('products')
    ->properties($props)
    ->queryString('sneakers')
    ->uniqueBy('product_id')
    ->get();
```

Include the next best matches from each group as inner hits:

```php
->uniqueBy('product_id', top: 3)
```

The collapse field must be single-valued, for example a `keyword` or numeric field.
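
Conceptually, deduplication keeps the best-scoring hit for each distinct value. The in-memory sketch below shows that idea with made-up hits; `collapseBy()` is a hypothetical helper, and the real work is done by Elasticsearch during the search.

```php
// Hypothetical helper: keep the highest-scoring hit per distinct field value.
function collapseBy(array $hits, string $field): array
{
    $best = [];

    foreach ($hits as $hit) {
        $key = $hit['_source'][$field];

        if (!isset($best[$key]) || $hit['_score'] > $best[$key]['_score']) {
            $best[$key] = $hit;
        }
    }

    // Order the surviving hits by score, best first.
    $collapsed = array_values($best);
    usort($collapsed, fn (array $a, array $b): int => $b['_score'] <=> $a['_score']);

    return $collapsed;
}

$hits = [
    ['_id' => 'a', '_score' => 3.2, '_source' => ['product_id' => 'p1', 'size' => 42]],
    ['_id' => 'b', '_score' => 2.9, '_source' => ['product_id' => 'p1', 'size' => 43]],
    ['_id' => 'c', '_score' => 2.5, '_source' => ['product_id' => 'p2', 'size' => 42]],
];

$unique = collapseBy($hits, 'product_id');
// Two hits remain: 'a' for p1 and 'c' for p2.
```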

## Facets

Build sidebar filters with one method. See [Facets](facets.md):

```php
$response = $sigmie->newSearch('products')
    ->properties($props)
    ->queryString('laptop')
    ->facets('brand category price:100')
    ->get();

$facets = $response->json('facets');
```

## Semantic search

Enable vector matching alongside keyword search:

```php
$sigmie->newSearch('articles')
    ->properties($props)
    ->semantic()
    ->queryString('artificial intelligence')
    ->get();
```

Use vectors only (no keyword matching):

```php
->semantic()->disableKeywordSearch()
```

See [Semantic Search](semantic-search.md) for embeddings setup and accuracy levels.

## Autocomplete

```php
$response = $sigmie->newSearch('fairy-tales')
    ->properties($props)
    ->autocompletePrefix('m')
    ->fields(['name'])
    ->retrieve(['name'])
    ->get();

$suggestions = $response->json('autocomplete');
```

## Multi-language

Search across multiple indices:

```php
$result = $sigmie->newSearch("$germanIndex,$englishIndex")
    ->properties($props)
    ->queryString('door tür')
    ->get();
```

## Nested fields

Search and retrieve nested fields with dot notation:

```php
$sigmie->newSearch('users')
    ->properties($props)
    ->queryString('Pluto')
    ->fields(['contact.dog.name'])
    ->retrieve(['contact.dog.name'])
    ->get();
```

## Reading results

```php
$response = $sigmie->newSearch('fairy-tales')
    ->properties($props)
    ->queryString('mickey')
    ->get();

$response->total();                  // total matching documents
$response->hits();                   // array of hits
$response->json('hits');             // raw hits array
$response->json('hits.0._source');   // a specific value via dot notation
```

## Empty query strings

By default, an empty query string returns every document. To return nothing instead:

```php
->noResultsOnEmptySearch()
```

## Async execution

`promise()` returns a Guzzle promise instead of executing immediately:

```php
$promise = $sigmie->newSearch('fairy-tales')
    ->properties($props)
    ->queryString('mickey')
    ->promise();
```

## Iterating over all matching hits

`size()` is for UIs. For exports, migrations, or bulk re-processing, use `each()` or `lazy()` to stream every matching document. Both reuse your filters, query string, and field scoping, and page internally using Point-in-Time + `search_after` — so concurrent writes don't break the cursor.

### With a callback

```php
use Sigmie\Document\Hit;

$sigmie->newSearch('orders')
    ->properties($props)
    ->filters('status:completed')
    ->each(function (Hit $hit) use ($csv): void {
        $csv->writeRow($hit->_source);
    });
```

Each `Hit` exposes `_id`, `_source`, and `_score`.

### With a generator

```php
$generator = $sigmie->newSearch('orders')
    ->properties($props)
    ->filters('status:completed')
    ->lazy();

foreach ($generator as $hit) {
    processHit($hit);
}
```

### Page size

Default 500 per page. Tune for memory vs. round-trips:

```php
$sigmie->newSearch('products')
    ->properties($props)
    ->chunk(100)
    ->each(function (Hit $hit): void {
        // 100 at a time
    });
```

### Sort during iteration

Point-in-Time needs a deterministic sort. Sigmie handles this for you:

- **`NewSearch::sort()`** — your sort string is kept. Sigmie appends a stable tiebreaker (`_shard_doc` on Elasticsearch, `_id` on OpenSearch) if you didn't already provide one. `_score`-only or `_doc`-only sorts are replaced by the tiebreaker.
- **`NewQuery::sortString()` / `sort(array)`** — call before the query method (`matchAll`, `bool`, etc.). Omit sort entirely to stream in stable but unranked order. Use field names that exist in your mapping (often a `.keyword` sub-field for text).
- **`raw()`** — include a top-level `sort` key in the body you pass.

```php
$multi = $sigmie->newMultiSearch();

$multi->raw('orders', [
    'query' => ['match_all' => (object) []],
    'sort' => [['processed_at' => 'asc']],
]);
```

When the body includes `collapse`, Sigmie does not append the tiebreaker — Elasticsearch only allows one sort key with `collapse` + `search_after`, and that's your responsibility.
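
To show why the tiebreaker matters, here is an in-memory sketch of cursor-style paging. `streamAll()` is a hypothetical stand-in for a `search_after` loop, and the field names `processed_at` and `id` are assumptions for this example; Sigmie's actual iteration talks to Elasticsearch.

```php
// Hypothetical helper: page through documents by resuming after the last sort key.
function streamAll(array $docs, int $chunk): Generator
{
    // Sort by the chosen field, then by a unique tiebreaker so the order is total.
    usort($docs, fn (array $a, array $b): int =>
        [$a['processed_at'], $a['id']] <=> [$b['processed_at'], $b['id']]);

    $after = null; // sort key of the last document already yielded

    while (true) {
        // One "request": everything strictly after the cursor, limited to one page.
        $remaining = array_filter($docs, fn (array $doc): bool =>
            $after === null || [$doc['processed_at'], $doc['id']] > $after);
        $page = array_slice($remaining, 0, $chunk);

        if ($page === []) {
            return;
        }

        foreach ($page as $doc) {
            yield $doc;
        }

        $last = end($page);
        $after = [$last['processed_at'], $last['id']];
    }
}

$orders = [
    ['id' => 3, 'processed_at' => '2024-01-02'],
    ['id' => 1, 'processed_at' => '2024-01-01'],
    ['id' => 2, 'processed_at' => '2024-01-01'],
    ['id' => 4, 'processed_at' => '2024-01-03'],
    ['id' => 5, 'processed_at' => '2024-01-03'],
];

$ids = array_column(iterator_to_array(streamAll($orders, 2), false), 'id');
// [1, 2, 3, 4, 5]
```

If the cursor held only `processed_at`, the second page would end at order 4 and the next request would skip order 5, which shares the same timestamp. The unique second key prevents that.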

### Multi-search

`newMultiSearch()` registers multiple queries; a single `_msearch` returns one page each. To stream **all** matching hits across registered queries, call `each()` or `lazy()` on the multi-search:

```php
use Sigmie\Document\Hit;

$multi = $sigmie->newMultiSearch();

$multi->newSearch('orders')
    ->properties($orderProps)
    ->filters('status:pending')
    ->chunk(200);

$multi->newQuery('products')->matchAll();

$multi->raw('orders', [
    'query' => ['term' => ['status' => 'pending']],
])->chunk(200);

foreach ($multi->lazy() as $hit) {
    exportRow($hit);
}
```

Each registered search runs its own PIT iteration; results yield in registration order. Set `chunk()` per query — the multi-search has no global chunk size.

> **Note:** `each()` and `lazy()` ignore `from()`, `size()`, `page()`, and `highlighting()` — these are pagination/display concerns. Sort is honored as described above.


---

<!-- source: https://sigmie.com/docs/v2/query -->

# Advanced Queries

`newQuery()` gives you direct access to Elasticsearch's boolean query DSL. Reach for it when you need control [`newSearch()`](search.md) doesn't expose — custom scoring, nested boolean logic, or features specific to your Elasticsearch version.

```php
use Sigmie\Query\Queries\Compound\Boolean;

$response = $sigmie->newQuery('disney-movies')
    ->properties($props)
    ->sortString('name:asc')
    ->bool(function (Boolean $bool) {
        $bool->filter()->matchAll();
        $bool->filter()->multiMatch('goofy', ['name', 'description']);
        $bool->must()->term('is_active', true);
        $bool->mustNot()->term('is_active', false);
        $bool->mustNot()->wildcard('foo', '**/*');
        $bool->should()->bool(fn (Boolean $bool) =>
            $bool->must()->match('name', 'Mickey')
        );
    })
    ->from(0)
    ->size(15)
    ->get();
```

## When to use each builder

Use **[`newSearch()`](search.md)** for:

- User-facing search.
- Built-in features (typo tolerance, highlighting, facets, semantic search).
- Most ordinary search code.

Use **`newQuery()`** for:

- Complex boolean logic.
- Custom scoring (`scriptScore`, `functionScore`).
- Direct Elasticsearch DSL features Sigmie doesn't wrap.
- Migrating from raw Elasticsearch queries.

## Basic queries

```php
use Sigmie\Mappings\NewProperties;

$props = new NewProperties;
$props->name('name');
$props->number('age')->integer();

$response = $sigmie->newQuery('users')
    ->properties($props)
    ->matchAll()
    ->get();
```

Properties are required for complex queries — they let Sigmie parse field names and route queries correctly.

## Boolean queries

A boolean query has four clause types, each handling matches differently:

| Clause | Behavior | Affects score |
|--------|----------|---------------|
| `must()` | Must match | Yes |
| `mustNot()` | Must not match | No |
| `should()` | Should match (OR) | Yes |
| `filter()` | Must match | No |

```php
$sigmie->newQuery('movies')
    ->properties($props)
    ->bool(function (Boolean $bool) {
        $bool->must()->match('title', 'matrix');
        $bool->filter()->range('year', ['>' => 1990]);
        $bool->should()->term('genre', 'sci-fi');
        $bool->mustNot()->term('rating', 'R');
    })
    ->get();
```
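
The clause rules in the table can be restated as ordinary PHP. The sketch below evaluates one document against closures standing in for clauses; `matchesBool()` is a hypothetical helper, and it assumes Elasticsearch's default behaviour in which `should` clauses are optional whenever a `must` or `filter` clause is present.

```php
// Hypothetical helper: decide whether a document satisfies a set of bool clauses.
function matchesBool(array $doc, array $clauses): bool
{
    // must and filter: every clause has to match (filter only skips scoring).
    foreach (array_merge($clauses['must'] ?? [], $clauses['filter'] ?? []) as $clause) {
        if (!$clause($doc)) {
            return false;
        }
    }

    // mustNot: any matching clause excludes the document.
    foreach ($clauses['mustNot'] ?? [] as $clause) {
        if ($clause($doc)) {
            return false;
        }
    }

    // should: optional next to must/filter, otherwise at least one has to match.
    $should = $clauses['should'] ?? [];
    $hasRequired = !empty($clauses['must']) || !empty($clauses['filter']);

    if ($should === [] || $hasRequired) {
        return true;
    }

    foreach ($should as $clause) {
        if ($clause($doc)) {
            return true;
        }
    }

    return false;
}

$clauses = [
    'must' => [fn (array $d): bool => stripos($d['title'], 'matrix') !== false],
    'filter' => [fn (array $d): bool => $d['year'] > 1990],
    'should' => [fn (array $d): bool => $d['genre'] === 'sci-fi'],
    'mustNot' => [fn (array $d): bool => $d['rating'] === 'R'],
];

$movie = ['title' => 'The Matrix', 'year' => 1999, 'genre' => 'sci-fi', 'rating' => 'R'];

var_dump(matchesBool($movie, $clauses)); // bool(false): excluded by mustNot
```

The sketch only answers "does it match?". In a real query, `must` and `should` matches also contribute to `_score`, while `filter` and `mustNot` do not.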

### Must — AND

Every clause inside `must()` must match:

```php
$sigmie->newQuery('products')
    ->bool(function (Boolean $bool) {
        $bool->must()->term('is_active', true);
        $bool->must()->range('stock', ['>' => 0]);
    });
```

SQL equivalent:

```sql
SELECT * FROM products WHERE is_active = TRUE AND stock > 0;
```

### Must Not — NOT

Documents matching any `mustNot()` clause are excluded:

```php
$sigmie->newQuery('products')
    ->bool(function (Boolean $bool) {
        $bool->mustNot()->term('is_active', false);
        $bool->mustNot()->term('stock', 0);
    });
```

### Should — OR

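A document matches when at least one `should()` clause matches; multiple clauses are OR'd. Under Elasticsearch's defaults this requirement applies only when the bool query has no `must()` or `filter()` clauses. Next to those, `should()` clauses are optional and only raise the score: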

```php
$sigmie->newQuery('movies')
    ->bool(function (Boolean $bool) {
        $bool->should()->term('category', 'fantasy');
        $bool->should()->term('category', 'musical');
    });
```

### Filter — AND without scoring

Same logic as `must()`, but doesn't influence `_score`. Because filter clauses skip scoring, Elasticsearch can cache their results. Use them whenever scoring doesn't matter:

```php
$sigmie->newQuery('movies')
    ->bool(function (Boolean $bool) {
        $bool->filter()->term('is_active', true);
    });
```

## Standalone queries

Outside a boolean wrapper, query types can be called directly on the builder:

```php
// Instead of wrapping in bool:
$sigmie->newQuery('movies')
    ->bool(function (Boolean $bool) {
        $bool->filter()->term('active', true);
    });

// Call directly:
$sigmie->newQuery('movies')->term('active', true);
```

## Query types

### Match all / match none

```php
$query->matchAll();
$query->matchNone();
```

### Term and terms

`term()` finds an exact value, which suits `keyword`, `bool`, `integer`, and similar non-analyzed fields:

```php
$query->term('active', true);
$query->term('user_id', 13);
```

`terms()` matches any of several values:

```php
$query->terms('category', ['horror', 'action']);
```

> **Note:** `term()` against an analyzed text field usually doesn't work — the field is tokenized. Add a `.keyword` sub-field if you need exact matching:
>
> ```php
> $props->text('category')->keyword();
> // then:
> $query->term('category.keyword', 'drama');
> ```

### Match

Analyzed query — best for text fields:

```php
$query->match('name', 'mickey');
```

### Multi-match

Match across multiple fields:

```php
$query->multiMatch(['name', 'username'], 'mickey');
```

### Range

Filter numeric and date ranges:

```php
$query->range('count', ['>=' => 233]);
$query->range('price', ['>=' => 30, '<=' => 130]);
```

Operators: `>=`, `>`, `<=`, `<`.
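
In raw Elasticsearch DSL these operators are spelled `gt`, `gte`, `lt`, and `lte`. The sketch below shows one way such a translation could look; `rangeClause()` is a hypothetical helper, and the exact DSL Sigmie emits is an assumption here (use `getDSL()` to inspect the real output).

```php
// Hypothetical helper: translate comparison operators into a range clause.
function rangeClause(string $field, array $conditions): array
{
    // Elasticsearch's names for each comparison operator.
    $operators = ['>' => 'gt', '>=' => 'gte', '<' => 'lt', '<=' => 'lte'];
    $bounds = [];

    foreach ($conditions as $operator => $value) {
        if (!isset($operators[$operator])) {
            throw new InvalidArgumentException("Unsupported range operator: {$operator}");
        }

        $bounds[$operators[$operator]] = $value;
    }

    return ['range' => [$field => $bounds]];
}

echo json_encode(rangeClause('price', ['>=' => 30, '<=' => 130]));
// {"range":{"price":{"gte":30,"lte":130}}}
```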

### Exists

Document has any value for the field:

```php
$query->exists('director');
```

### Ids

Match by document `_id`:

```php
$query->ids(['dkKwMe4UBAUb2dMteRe2', 'wd6Me4UBAUb2dMJT']);
```

### Regex, wildcard, prefix, fuzzy

```php
$query->regex('category', '(horror|action)');
$query->wildcard('name', 'john*');
$query->prefix('name', 'john');
$query->fuzzy('name', 'john');
```

## Parsing a filter string

For ad-hoc queries built from human input, `parse()` accepts the same syntax as [Filter Parser](filter-parser.md):

```php
$sigmie->newQuery('movies')
    ->properties($props)
    ->parse('name:"John Doe" AND age<21')
    ->get();
```

## Custom scoring

### Script score

Replace or multiply the score with a custom Painless script:

```php
$sigmie->newQuery('movies')
    ->properties($props)
    ->matchAll()
    ->scriptScore(
        source: "Math.log(2 + doc['popularity'].value)",
        boostMode: 'replace',
    )
    ->get();
```

### Function score

```php
$sigmie->newQuery('movies')
    ->properties($props)
    ->functionScore()
    ->get();
```

## Boosting

Boost a query's contribution to `_score`:

```php
$query->matchAll(boost: 5);
```

## Aggregations and facets

Add facets the same way as in `newSearch()`:

```php
$response = $sigmie->newQuery('products')
    ->properties($props)
    ->matchAll()
    ->facets('category')
    ->get();

$facets = $response->json('aggregations');
```

For raw aggregations, see [Aggregations](aggregations.md).

## Sorting

Call `sort()` or `sortString()` **before** the query method (`matchAll`, `bool`, `parse`, etc.). Each call replaces the previous sort — put all fields in a single string.

```php
$query->sortString('name:asc created_at:desc');
$query->sort([['year' => 'desc'], ['_score' => 'desc']]);
```

`_score:asc` is not allowed.

> **Note:** Sorting on text fields requires a `.keyword` sub-field. Add one with `$props->text('name')->keyword()->makeSortable()`.

See [Sort Parser](sort-parser.md) for full syntax.

## Pagination

`from` and `size` are on the `Search` instance returned after the query method:

```php
$response = $sigmie->newQuery('movies')
    ->properties($props)
    ->sortString('title:asc')
    ->matchAll()
    ->from(0)
    ->size(20)
    ->get();
```

## Reading responses

```php
$response = $sigmie->newQuery('movies')
    ->properties($props)
    ->matchAll()
    ->get();

$response->json();                       // full response
$response->json('hits.hits');            // hits array
$response->json('hits.total.value');     // total count
```

## Debugging

`getDSL()` returns the underlying Elasticsearch JSON:

```php
$dsl = $query->getDSL();
```

## Performance

- Use `filter()` instead of `must()` when scoring doesn't matter. Filter queries are cached.
- Prefer `term()` over `match()` for exact matches.
- Limit `size()` to what you need.
- Use `retrieve()` (on `newSearch()`) to drop unused fields from the response.

```php
$sigmie->newQuery('products')
    ->bool(function (Boolean $bool) {
        $bool->filter()->term('status', 'active');     // cached, no scoring
        $bool->must()->match('title', $searchTerm);    // scored
    })
    ->size(10)
    ->get();
```


---

<!-- source: https://sigmie.com/docs/v2/semantic-search -->

# Semantic Search

Semantic search matches documents by **meaning**, not just keywords. "Portable computer for work" can match documents containing "laptop", "notebook", or "MacBook" — none of which share a word with the query.

Sigmie does this by:

1. Generating vector embeddings from your text at index time.
2. Generating an embedding for the query at search time.
3. Returning the documents whose vectors are most similar.
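
The three steps can be sketched in plain PHP with toy data. The three-dimensional vectors below are made up for illustration (real embeddings have hundreds of dimensions and come from the embeddings API), and `cosineSimilarity()` is a hypothetical helper rather than a Sigmie function.

```php
// Hypothetical helper: cosine similarity between two equal-length vectors.
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;

    foreach ($a as $i => $value) {
        $dot += $value * $b[$i];
        $normA += $value ** 2;
        $normB += $b[$i] ** 2;
    }

    return $dot / (sqrt($normA) * sqrt($normB));
}

// Step 1: vectors stored at index time (made-up values).
$documents = [
    'laptop' => [0.9, 0.1, 0.0],
    'notebook' => [0.8, 0.3, 0.1],
    'banana' => [0.0, 0.2, 0.9],
];

// Step 2: the query embedding, produced by the same model at search time.
$query = [1.0, 0.2, 0.0];

// Step 3: rank documents by similarity to the query, best first.
$scores = array_map(fn (array $vector): float => cosineSimilarity($query, $vector), $documents);
arsort($scores);

$ranking = array_keys($scores); // ['laptop', 'notebook', 'banana']
```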

You bring an embeddings API (OpenAI, Cohere, Voyage, Infinity, or anything implementing `EmbeddingsApi`) and register it with the client:

```php
use Sigmie\AI\APIs\OpenAIEmbeddingsApi;

$sigmie->registerApi('embeddings', new OpenAIEmbeddingsApi('sk-...'));
```

The name `'embeddings'` is yours to choose — refer to it from your field definitions.

## Define a semantic field

```php
use Sigmie\Mappings\NewProperties;

$props = new NewProperties;
$props->title('title')->semantic(api: 'embeddings', dimensions: 1536);
$props->text('description')->semantic(api: 'embeddings', dimensions: 1536);

$sigmie->newIndex('articles')->properties($props)->create();
```

Match the `dimensions` to the model you registered. OpenAI's `text-embedding-3-small` outputs 1536-dim vectors; Infinity's `bge-small-en-v1.5` outputs 384-dim.

## Index documents

When a property has `->semantic()`, Sigmie generates embeddings automatically as documents flow through `merge()` and `add()`:

```php
use Sigmie\Document\Document;

$sigmie->collect('articles', refresh: true)
    ->properties($props)
    ->merge([
        new Document([
            'title' => 'Introduction to Machine Learning',
            'description' => 'A primer on supervised and unsupervised learning.',
        ]),
        new Document([
            'title' => 'Deep Learning Fundamentals',
            'description' => 'Neural networks form the basis of deep learning.',
        ]),
    ]);
```

## Search

Enable semantic matching with `->semantic()`:

```php
$response = $sigmie->newSearch('articles')
    ->properties($props)
    ->semantic()
    ->queryString('artificial intelligence basics')
    ->get();
```

By default this combines semantic and keyword matching. Documents matched by both rank higher than documents matched by only one.

### Pure semantic search

Drop keyword matching entirely:

```php
$sigmie->newSearch('articles')
    ->properties($props)
    ->semantic()
    ->disableKeywordSearch()
    ->queryString('machine learning algorithms')
    ->get();
```

### Score multipliers

Bias the blend between keyword and semantic scores:

```php
$sigmie->newSearch('articles')
    ->properties($props)
    ->semantic()
    ->textScoreMultiplier(1.0)
    ->semanticScoreMultiplier(2.0)        // emphasize semantic matches
    ->queryString('quantum computing')
    ->get();
```

## Embedding providers

Sigmie ships clients for several providers — all implement `Sigmie\AI\Contracts\EmbeddingsApi`:

```php
use Sigmie\AI\APIs\OpenAIEmbeddingsApi;
use Sigmie\AI\APIs\CohereEmbeddingsApi;
use Sigmie\AI\APIs\VoyageEmbeddingsApi;
use Sigmie\AI\APIs\InfinityEmbeddingsApi;

// Pick one provider; each call registers it under the name 'embeddings'.
$sigmie->registerApi('embeddings', new OpenAIEmbeddingsApi('sk-...'));
$sigmie->registerApi('embeddings', new CohereEmbeddingsApi('co-...'));
$sigmie->registerApi('embeddings', new VoyageEmbeddingsApi('pa-...'));

// Local Infinity service (see Docker docs)
$sigmie->registerApi('embeddings', new InfinityEmbeddingsApi(
    baseUrl: 'http://localhost:7997',
    model: 'BAAI/bge-small-en-v1.5',
));
```

### Custom provider

Implement `EmbeddingsApi`:

```php
use Sigmie\AI\Contracts\EmbeddingsApi;
use GuzzleHttp\Promise\Promise;

class MyEmbeddings implements EmbeddingsApi
{
    public function embed(string $text, int $dimensions): array { /* ... */ }
    public function batchEmbed(array $payload): array { /* ... */ }
    public function promiseEmbed(string $text, int $dimensions): Promise { /* ... */ }
    public function model(): string { /* ... */ }
}

$sigmie->registerApi('embeddings', new MyEmbeddings());
```

## Accuracy

The `accuracy` parameter controls the HNSW index parameters under the hood. Higher accuracy means better recall at the cost of more memory and slower indexing:

```php
$props->text('content')->semantic(api: 'embeddings', dimensions: 512, accuracy: 1);
// Fast: m=16, ef_construction=80

$props->text('content')->semantic(api: 'embeddings', dimensions: 512, accuracy: 3);
// Balanced (default): m=64, ef_construction=300

$props->text('content')->semantic(api: 'embeddings', dimensions: 512, accuracy: 5);
// High quality: m=128, ef_construction=800

$props->text('content')->semantic(api: 'embeddings', dimensions: 512, accuracy: 7);
// Script-score: exact vectors, slowest, highest quality
```

Reach for higher accuracy on important fields (titles, primary content). Use lower accuracy on less critical fields (tags, supporting text).

## Similarity functions

```php
use Sigmie\Enums\VectorSimilarity;

$props->text('content')->semantic(
    api: 'embeddings',
    dimensions: 512,
    similarity: VectorSimilarity::Cosine,            // default
);

// Other options:
VectorSimilarity::DotProduct;
VectorSimilarity::Euclidean;
VectorSimilarity::MaxInnerProduct;
```

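- **Cosine**: the standard choice for text similarity. It compares direction only, so vector magnitude does not matter.
- **Dot product**: efficient, but meant for vectors that are already normalized to unit length.
- **Euclidean**: distance-based, sensitive to magnitude.
- **Max inner product**: like dot product, but without the normalization requirement, so magnitude influences the score.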
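
To make the differences concrete, here is a minimal plain-PHP sketch of the underlying measures. It does not use Sigmie or Elasticsearch (which turn these raw values into scores in their own way), and the function names are illustrative only:

```php
/** Dot product of two equal-length vectors. */
function dotProduct(array $a, array $b): float
{
    $sum = 0.0;
    foreach ($a as $i => $value) {
        $sum += $value * $b[$i];
    }

    return $sum;
}

/** Cosine similarity: the dot product divided by both magnitudes. */
function cosineSimilarity(array $a, array $b): float
{
    return dotProduct($a, $b) / (sqrt(dotProduct($a, $a)) * sqrt(dotProduct($b, $b)));
}

/** Euclidean (L2) distance: smaller means closer. */
function euclideanDistance(array $a, array $b): float
{
    $sum = 0.0;
    foreach ($a as $i => $value) {
        $sum += ($value - $b[$i]) ** 2;
    }

    return sqrt($sum);
}

$short = [1.0, 2.0];
$long  = [2.0, 4.0];   // same direction, twice the magnitude

$cosine    = cosineSimilarity($short, $long);   // ≈ 1.0: the direction is identical
$dot       = dotProduct($short, $long);         // 10.0: grows with magnitude
$euclidean = euclideanDistance($short, $long);  // sqrt(5) ≈ 2.236: the points are apart
```

Cosine treats the two vectors as identical, while dot product and Euclidean distance both react to the difference in magnitude.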

## Multiple vectors per field

Index the same text with different similarity functions or accuracies:

```php
$props->text('job_description')
    ->semantic(
        api: 'embeddings',
        accuracy: 3,
        dimensions: 512,
        similarity: VectorSimilarity::Cosine,
    )
    ->semantic(
        api: 'embeddings',
        accuracy: 5,
        dimensions: 512,
        similarity: VectorSimilarity::Euclidean,
    );
```

## Field-specific semantic search

Restrict semantic matching to specific fields:

```php
$sigmie->newSearch('articles')
    ->properties($props)
    ->semantic()
    ->fields(['title'])
    ->queryString('deep learning neural networks')
    ->get();
```

## Working with arrays

Semantic fields work with array values — each entry is embedded independently:

```php
new Document([
    'experience' => [
        'Artist',
        'Graphic Design',
        'Creative Director',
    ],
]);

// "drawing illustration" matches "Artist" semantically
$sigmie->newSearch('professionals')
    ->semantic()
    ->properties($props)
    ->queryString('drawing illustration')
    ->get();
```

## Pre-computed embeddings

To skip Sigmie's embedding pipeline (for backfills or batched offline embedding), include vectors directly in the document:

```php
new Document([
    'title' => 'AI Research Paper',
    'content' => 'Artificial intelligence has evolved significantly...',
    '_embeddings' => [
        'title_vector' => [0.1, 0.2, 0.3, /* ... */],
        'content_vector' => [0.4, 0.5, 0.6, /* ... */],
    ],
]);
```

## Batched embedding calls

When you pass many documents to `merge()`, Sigmie collects the texts from every doc and sends them to your embedding provider in batched requests — not one request per document.

### What gets batched together

Sigmie groups texts by `(api, dimensions, modality)`. Two docs whose `title` field uses the same provider and the same vector dimensions share a single request. Text and image inputs are never mixed in the same request, even when the same provider can handle both.

### Chunk size

Each group is split into chunks of `min(100, $provider->maxBatchSize())`. The provider cap reflects what each backend accepts in one call — OpenAI and Jina allow 2048, Voyage 128, Cohere 96, Infinity 512.

```php
$movies->merge($thirtyDocs);   // 1 request of 30
$movies->merge($twoHundredDocs); // 2 requests: 100 + 100
```
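
The arithmetic behind those request counts can be sketched in plain PHP. `requestSizes()` is a hypothetical helper, not part of Sigmie, and it assumes one text per document in a single `(api, dimensions, modality)` group:

```php
/**
 * Hypothetical helper: sizes of the embedding requests for one group of texts,
 * following the min(100, provider cap) rule.
 */
function requestSizes(int $textCount, int $providerMaxBatchSize): array
{
    $chunkSize = min(100, $providerMaxBatchSize);

    // Chunking a placeholder list yields the same split a batcher would produce.
    $chunks = array_chunk(array_fill(0, $textCount, null), $chunkSize);

    return array_map('count', $chunks);
}

$thirty     = requestSizes(30, 2048);   // [30]
$twoHundred = requestSizes(200, 2048);  // [100, 100]
$lowCap     = requestSizes(200, 96);    // [96, 96, 8]
```

With a provider cap below 100 (96 in the last call), the cap decides the chunk size and the same 200 texts need three requests.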

`add()` and `replace()` go through the same code path, so a single document is just a one-item batch.

### What stays per-document

- Docs that already carry an `_embeddings` block for a field (see [Pre-computed embeddings](#pre-computed-embeddings)) are excluded from the batch for that field.
- After vectors come back, strategy formatting (concatenate / average / script-score) and [score multipliers](#score-multipliers) still run on each doc individually.

### Tradeoffs

The win is fewer roundtrips and less rate-limit pressure when indexing in bulk. The cost: if a chunk fails, the whole `merge()` fails — Sigmie does not split-and-retry, because partial upserts produce inconsistent indexes. Keep merge batches at a size your provider can serve reliably.

## Reranking

For higher-quality top-K results, rerank with a cross-encoder after retrieval:

```php
use Sigmie\AI\APIs\InfinityRerankApi;

$sigmie->registerApi('my-rerank', new InfinityRerankApi(
    baseUrl: 'http://localhost:7998',
    model: 'cross-encoder/ms-marco-MiniLM-L-6-v2',
));

$response = $sigmie->newSearch('articles')
    ->properties($props)
    ->semantic()
    ->queryString('quantum computing applications')
    ->size(20)
    ->get();

$top5 = $response->rerank('my-rerank', ['content'], topK: 5);
```

See [Retrieval and Agents](rag.md) for the full retrieval-then-generate pattern.

## Empty queries

By default, an empty query string returns every document. To return nothing instead:

```php
->noResultsOnEmptySearch()
```

## Multilingual

Multilingual embedding models place text from different languages in the same vector space, so a query in English can find documents in French, provided the model supports both:

```php
$sigmie->newSearch('documents')
    ->semantic()
    ->properties($props)
    ->queryString('machine learning')
    // matches "apprentissage automatique", "机器学习", etc.
    ->get();
```

## Patterns

### E-commerce search

```php
$props = new NewProperties;
$props->name('name')->semantic(api: 'embeddings', dimensions: 512);
$props->text('description')->semantic(api: 'embeddings', dimensions: 512);
$props->category('category');
$props->price();

$sigmie->newSearch('products')
    ->properties($props)
    ->semantic()
    ->queryString('comfortable running shoes')
    ->filters('price<=100')
    ->get();
```

### Content recommendation

```php
$sigmie->newSearch('articles')
    ->semantic()
    ->disableKeywordSearch()
    ->properties($props)
    ->queryString('machine learning deep learning neural networks')
    ->size(5)
    ->get();
```

For "similar items" using existing documents as seeds, see [Recommendations](recommendations.md) — it gives you RRF fusion and MMR diversification out of the box.

## Troubleshooting

**No results.** Check that documents have embeddings — index a sample and inspect with `$collection->get($id)`. If they don't, verify your `api:` name matches the registered API.

**Slow search.** Drop to lower accuracy or smaller dimensions. Add filters to narrow the candidate pool before vector ranking.

**Memory pressure.** Lower accuracy and dimensions. High accuracy with 1536-dim vectors needs serious RAM at scale.

## See also

- [Recommendations](recommendations.md) — "similar items" via vector retrieval with RRF and MMR.
- [Retrieval and Agents](rag.md) — combining retrieval, reranking, and generation.
- [Magic Tags](magic-tags.md) — LLM-generated taxonomy tags backed by embeddings.
- [Docker](docker.md) — running local Infinity embeddings and reranker.


---

<!-- source: https://sigmie.com/docs/v2/aggregations -->

# Aggregations

Aggregations summarize and analyze your indexed data. Use them to power analytics dashboards, statistical summaries, and the underlying data for filter UIs.

Sigmie has two paths into aggregations:

1. **[Facets](facets.md)** — high-level, integrated with properties. The right choice for filter sidebars.
2. **Raw aggregations** — direct access to all Elasticsearch aggregation types. The right choice for analytics.

This page covers the raw aggregations API.

## Basic usage

```php
use Sigmie\Query\Aggs;

$response = $sigmie->newQuery('orders')
    ->matchAll()
    ->aggregate(function (Aggs $agg) {
        $agg->sum(name: 'turnover', field: 'price');
    })
    ->get();

$response->aggregation('turnover.value');     // 54.403
```

## Metric aggregations

Metric aggregations compute numeric values over the matched documents. Most return a single value; `stats` returns several at once.

### Sum

```php
$agg->sum(name: 'stock_sum', field: 'stock');
$response->aggregation('stock_sum.value');
```

SQL equivalent: `SELECT SUM(stock)`.

### Max / Min / Avg

```php
$agg->max(name: 'max_price', field: 'price');
$agg->min(name: 'min_price', field: 'price');
$agg->avg(name: 'avg_rating', field: 'rating');
```

Access with `$response->aggregation('max_price.value')`.

### Value count

Counts the values of a field across the matched documents, duplicates included:

```php
$agg->valueCount(name: 'categories_count', field: 'category');
```

### Cardinality

Approximate count of distinct values. Elasticsearch estimates the number rather than counting exactly, which keeps memory bounded on large fields:

```php
$agg->cardinality(name: 'unique_users', field: 'user_id');
```
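
The difference between the two counts is easy to see in plain PHP. This sketch only illustrates the semantics: it counts exactly, whereas Elasticsearch's cardinality is an estimate.

```php
$categories = ['books', 'games', 'books', 'music', 'games', 'books'];

$valueCount  = count($categories);                // 6: every value, duplicates included
$cardinality = count(array_unique($categories));  // 3: distinct values only
```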

### Stats

A quick statistical summary:

```php
$agg->stats(name: 'sales_stats', field: 'amount');
$response->aggregation('sales_stats');
// [
//     'count' => 133,
//     'min'   => 5.33,
//     'max'   => 128.58,
//     'avg'   => 73.53,
//     'sum'   => 9779.49,
// ]
```

## Bucket aggregations

Bucket aggregations group documents by criteria — each bucket holds the documents that match.

### Terms

Group by the unique values of a field. Use a `keyword` field (or a `text` field with a `.keyword` sub-field):

```php
$agg->terms(name: 'category_terms', field: 'category')->missing('N/A');

$response->aggregation('category_terms.buckets');
// [
//     ['key' => 'Fantasy', 'doc_count' => 20],
//     ['key' => 'Musical', 'doc_count' => 18],
//     ['key' => 'Adventure', 'doc_count' => 13],
//     ['key' => 'N/A', 'doc_count' => 7],
// ]
```

`missing('N/A')` puts documents without the field into a bucket with that key.

### Range

Group by explicit numeric ranges:

```php
// All arguments positional: PHP rejects a positional argument after named ones.
$agg->range('price_ranges', 'price', [
    ['key' => '0-100', 'to' => 100],
    ['key' => '100-200', 'from' => 100, 'to' => 200],
    ['key' => '200+', 'from' => 200],
]);

$response->aggregation('price_ranges.buckets');
// [
//     '0-100'   => ['to' => 100, 'doc_count' => 803],
//     '100-200' => ['from' => 100, 'to' => 200, 'doc_count' => 422],
//     '200+'    => ['from' => 200, 'doc_count' => 343],
// ]
```

### Histogram

Fixed-width buckets across a numeric field:

```php
$agg->histogram(name: 'price_histogram', field: 'price', interval: 50);
```
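
Each value lands in the bucket whose key is the value rounded down to a multiple of the interval. A minimal plain-PHP sketch of that rule, with a hypothetical `bucketKey()` helper and made-up prices:

```php
/** Lower bound of the fixed-width bucket that $value falls into. */
function bucketKey(float $value, float $interval): float
{
    return floor($value / $interval) * $interval;
}

$prices = [12.0, 49.99, 50.0, 74.5, 120.0];
$histogram = [];

foreach ($prices as $price) {
    $key = (int) bucketKey($price, 50);
    $histogram[$key] = ($histogram[$key] ?? 0) + 1;
}

// $histogram is [0 => 2, 50 => 2, 100 => 1]
```

A price of exactly 50 belongs to the `50` bucket, not the `0` bucket: the lower bound is inclusive and the upper bound exclusive.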

### Date histogram

Time-bucket documents:

```php
$agg->dateHistogram(name: 'sales_over_time', field: 'created_at', interval: 'month');
```

### Auto date histogram

Let Elasticsearch pick the bucket interval:

```php
$agg->autoDateHistogram(name: 'timeline', field: 'created_at', buckets: 12);
```

## Sub-aggregations

Nest aggregations to compute metrics per bucket:

```php
$agg->terms(name: 'category_terms', field: 'category')
    ->subAggregation(function (Aggs $sub) {
        $sub->avg(name: 'avg_price', field: 'price');
        $sub->max(name: 'max_price', field: 'price');
    });
```

Each category bucket now carries `avg_price` and `max_price` alongside `doc_count`.
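
A sketch of reading those per-bucket metrics, using a literal array in the shape Elasticsearch returns (each sub-aggregation under its own name with a `value` key). The numbers are made up, and whether `$response->aggregation('category_terms.buckets')` hands back exactly this array is an assumption based on the Terms example above:

```php
// Sample buckets; in practice these would come from the search response.
$buckets = [
    ['key' => 'Fantasy', 'doc_count' => 20, 'avg_price' => ['value' => 14.5], 'max_price' => ['value' => 32.0]],
    ['key' => 'Musical', 'doc_count' => 18, 'avg_price' => ['value' => 11.0], 'max_price' => ['value' => 25.0]],
];

// Flatten each bucket into a row keyed by category.
$summary = [];
foreach ($buckets as $bucket) {
    $summary[$bucket['key']] = [
        'count' => $bucket['doc_count'],
        'avg'   => $bucket['avg_price']['value'],
        'max'   => $bucket['max_price']['value'],
    ];
}

// $summary['Fantasy'] is ['count' => 20, 'avg' => 14.5, 'max' => 32.0]
```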

## Pipeline aggregations

Operate on the output of other aggregations:

```php
$agg->terms(name: 'monthly_sales', field: 'month')
    ->subAggregation(function (Aggs $sub) {
        $sub->sum(name: 'total_sales', field: 'amount');
    })
    ->pipelineAggregation(function (Aggs $pipe) {
        $pipe->avgBucket(name: 'avg_monthly_sales', bucketsPath: 'monthly_sales>total_sales');
    });
```
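
Conceptually, `avgBucket` follows the `bucketsPath` into each bucket, collects the referenced metric, and averages it. A plain-PHP sketch of that calculation over made-up bucket data:

```php
// One entry per monthly_sales bucket, each carrying its total_sales sub-aggregation.
$monthlyBuckets = [
    ['key' => '2024-01', 'total_sales' => ['value' => 1200.0]],
    ['key' => '2024-02', 'total_sales' => ['value' => 900.0]],
    ['key' => '2024-03', 'total_sales' => ['value' => 1500.0]],
];

// Follow "monthly_sales>total_sales": pick the metric out of every bucket...
$totals = array_map(fn (array $bucket) => $bucket['total_sales']['value'], $monthlyBuckets);

// ...then average across buckets.
$avgMonthlySales = array_sum($totals) / count($totals);   // 1200.0
```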

## Filtered aggregations

Run an aggregation over a filtered subset of the query results:

```php
$agg->filter(name: 'expensive_products', filter: ['range' => ['price' => ['gte' => 100]]])
    ->subAggregation(function (Aggs $sub) {
        $sub->terms(name: 'expensive_categories', field: 'category');
    });
```

## Combined with the query builder

```php
$response = $sigmie->newQuery('products')
    ->properties($props)
    ->matchAll()
    ->facets('category price:50')
    ->scriptScore(
        source: "Math.log(2 + doc['popularity'].value)",
        boostMode: 'replace',
    )
    ->get();

$hits = $response->json('hits.hits');
$facets = $response->json('facets');
$rawAggs = $response->json('aggregations');
```

## Analytics-only requests

For pure analytics (no documents needed), set `size(0)`:

```php
$response = $sigmie->newQuery('sales')
    ->matchAll()
    ->aggregate(function (Aggs $agg) {
        $agg->dateHistogram('sales_over_time', 'date', 'month')
            ->subAggregation(function (Aggs $sub) {
                $sub->sum('monthly_revenue', 'amount');
            });

        $agg->terms('top_products', 'product_id')
            ->size(10)
            ->subAggregation(function (Aggs $sub) {
                $sub->sum('product_revenue', 'amount');
            });
    })
    ->size(0)
    ->get();
```

## Performance

- Use `keyword` fields for term aggregations — `text` fields require `.keyword` sub-fields.
- Limit bucket size — `terms(...)->size(10)` for top 10.
- Put non-scoring conditions in a `filter()` boolean clause so Elasticsearch can cache them; the aggregations then run over the filtered set.
- Cardinality aggregations on high-cardinality fields use significant memory.

```php
$sigmie->newQuery('products')
    ->properties($props)
    ->bool(function ($bool) {
        $bool->filter()->term('status', 'active');     // cached
        $bool->must()->match('title', $searchTerm);
    })
    ->facets('category:10 brand:10')                   // top 10 per facet
    ->size(20)
    ->get();
```

## See also

- [Facets](facets.md) — high-level facets for filter UIs.
- [Mappings & Properties](mappings.md) — choosing the right field type for aggregation.
- [Advanced Queries](query.md) — combining aggregations with custom queries.


---

<!-- source: https://sigmie.com/docs/v2/magic-tags -->

# Magic Tags

Magic Tags adds a `keyword` field whose values come from an LLM: short, reusable labels (typically kebab-case) that describe another field's content. The pipeline favors **reusing** existing tags so your vocabulary stays stable for search, filtering, and downstream agent tooling.

> **Note:** Magic Tags is **not** part of this repository. It's a separate [Sigmie package](extending.md) that registers a `magicTags()` macro on `NewProperties` and a `CollectionHook` for indexing. This page documents the intended behavior. Examples assume the package is installed and registered.

Behind the scenes, the package maintains a **sidecar index** of unique tags with semantic embeddings on the tag text. The sidecar uses the same embeddings API and dimensions as your source field, so vector operations on tags stay consistent with the rest of your data.

## Install the package

```php
use Sigmie\Sigmie;
use Vendor\MagicTags\MagicTagsPackage;

$sigmie = new Sigmie($connection);
$sigmie->extend(new MagicTagsPackage());
```

`extend()` registers the macro and hook on **this Sigmie instance**. See [Extending Sigmie](extending.md) for the package interface.

## How it fits together

```
Main index documents                       Sidecar index (tag registry)
+---------------------------+              +-----------------------------+
| content (semantic text)   |              | magic_field_path (keyword)  |
| topic (magic_tags)        |    sync      | tag (short text + vectors)  |
| _embeddings.content ...   |  ─────────►  | _embeddings.tag ...         |
+---------------------------+              +-----------------------------+
        │                                            │
        │                                            │
   LLM + optional                          Same embedding API as
   classify-first                          the source field
```

The **main index** stores tags as an array of strings on each document (mapped as `keyword` with `meta.type` `magic_tags`).

The **sidecar index** name defaults to `{logicalName}__sigmie_magic_tags`. Each row is one `(magic_field_path, tag)` pair with a deterministic `_id` (`md5(path::tag)`) so repeated writes upsert rather than duplicate.
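
Why a deterministic `_id` leads to upserts can be sketched in a few lines. `tagRowId()` is a hypothetical helper; the exact separator and any normalization the package applies are assumptions here:

```php
/**
 * Hypothetical helper mirroring the md5(path::tag) scheme.
 * Not the package's actual code.
 */
function tagRowId(string $magicFieldPath, string $tag): string
{
    return md5($magicFieldPath.'::'.$tag);
}

$first  = tagRowId('topic', 'circuit-breaker');
$second = tagRowId('topic', 'circuit-breaker');
$other  = tagRowId('topic', 'fuse-box');

// $first === $second, so writing the same pair again targets the same row.
// $other differs, so a different tag gets its own row.
```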

## Define magic tags on a mapping

The source field must be a **semantic** text field — the package reads its embeddings configuration to set up the sidecar:

```php
use Sigmie\Mappings\NewProperties;

$props = new NewProperties;

$props->text('content')
    ->semantic(api: 'my-embeddings', accuracy: 1, dimensions: 1024);

$props->magicTags('topic', fromField: 'content')
    ->api('my-llm');
```

Register the same API names on the collection:

```php
$collection = $sigmie->collect('kb', refresh: true)
    ->properties($props)
    ->apis([
        'my-llm' => $llmApi,
        'my-embeddings' => $embeddingsApi,
    ]);
```

Now `merge()` and `add()` run the magic-tag pipeline:

```php
use Sigmie\Document\Document;

$collection->merge([
    new Document(['content' => 'How to reset a circuit breaker.']),
]);
```

The document gets a `topic` array populated by the LLM, with classification as a fast path when enough tags already exist.

## Generation order

When you index a document:

1. **Classify-first** (optional). If `classifyFirst(true)` and the sidecar has enough tags, the package embeds the source text and scores it against centroids built from sample passages per tag. Tags above the confidence threshold are applied without an LLM call.
2. **LLM fallback.** If classification returns nothing, the LLM generates tags from the source text plus a prompt listing existing tags for reuse.
3. **Dedup.** New tags are deduplicated against existing ones using embedding similarity.
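
The classify-first step can be pictured with a small plain-PHP sketch. It assumes cosine similarity against the mean of each tag's sample vectors; the function names and the toy two-dimensional vectors are illustrative, not the package's implementation:

```php
/** Cosine similarity between two equal-length vectors. */
function centroidCosine(array $a, array $b): float
{
    $dot = $normA = $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot   += $value * $b[$i];
        $normA += $value ** 2;
        $normB += $b[$i] ** 2;
    }

    return $dot / (sqrt($normA) * sqrt($normB));
}

/** Mean of several equal-length vectors. */
function centroid(array $vectors): array
{
    $sum = array_fill(0, count($vectors[0]), 0.0);
    foreach ($vectors as $vector) {
        foreach ($vector as $i => $value) {
            $sum[$i] += $value;
        }
    }

    return array_map(fn (float $total) => $total / count($vectors), $sum);
}

/** Tags whose centroid similarity reaches $confidence, best first, capped at $maxTags. */
function classify(array $textVector, array $samplesPerTag, float $confidence, int $maxTags): array
{
    $scores = [];
    foreach ($samplesPerTag as $tag => $sampleVectors) {
        $score = centroidCosine($textVector, centroid($sampleVectors));
        if ($score >= $confidence) {
            $scores[$tag] = $score;
        }
    }
    arsort($scores);

    return array_slice(array_keys($scores), 0, $maxTags);
}

// Toy vectors; real embeddings have hundreds of dimensions.
$samples = [
    'electrical' => [[1.0, 0.0], [0.9, 0.1]],
    'plumbing'   => [[0.0, 1.0], [0.1, 0.9]],
];

$tags = classify([0.95, 0.05], $samples, 0.3, 5);   // ['electrical']
```

An empty result corresponds to the case where classification returns nothing and the LLM fallback takes over.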

## Configure classification and dedup

```php
$props->magicTags('topic', fromField: 'content')
    ->api('my-llm')
    ->embeddingsApi('my-embeddings')
    ->embeddingDimensions(1024)
    ->classifyFirst(true)
    ->minTagsForClassification(10)        // need 10+ tags before classifying
    ->classifyConfidence(0.3)             // minimum centroid similarity
    ->classifySamplesPerTag(5)            // passages per tag for centroid
    ->similarityThreshold(0.85)           // dedup threshold
    ->maxTags(5);
```

Disable classification entirely:

```php
$props->magicTags('topic', fromField: 'content')
    ->api('my-llm')
    ->classifyFirst(false);
```

## Custom prompt

Override the default LLM instructions:

```php
$props->magicTags('topic', fromField: 'content')
    ->api('my-llm')
    ->prompt(
        'You tag property-management support content. Return up to 5 lowercase '.
        'kebab-case tags. Prefer reusing tags from the existing list when applicable.'
    );
```

## Share one registry across indices

Point several mappings at the same `tagIndex()` logical name to share a single sidecar. Main index names stay different:

```php
$shared = 'property_app';

$kb = new NewProperties;
$kb->text('content')->semantic(api: 'my-embeddings', accuracy: 1, dimensions: 1024);
$kb->magicTags('topic', fromField: 'content')
    ->api('my-llm')
    ->tagIndex($shared);

$memory = new NewProperties;
$memory->text('content')->semantic(api: 'my-embeddings', accuracy: 1, dimensions: 1024);
$memory->magicTags('topic', fromField: 'content')
    ->api('my-llm')
    ->tagIndex($shared);
```

Both collections write to `property_app__sigmie_magic_tags`. Tag rows record `magic_field_path` so you can tell which source field produced each tag.

> **Note:** The "existing tags" list shown to the LLM during generation is fetched from the **current** main index only. If you want a global vocabulary across collections for the prompt, merge tag lists yourself before calling `merge()`.

## Skip the pipeline for a batch

When documents already carry final tag values:

```php
$collection->withoutHooks()->merge($documents);
```

## Use tags in an agent tool

A chatbot or filter UI typically wants a list of available tags. Run a terms aggregation on the magic-tag field:

```php
use Sigmie\Query\Aggs;
use Sigmie\Query\Queries\MatchAll;
use Sigmie\Query\Search as QuerySearch;

$aggs = new Aggs;
$aggs->terms('by_topic', 'topic')->size(20);

$response = (new QuerySearch($connection))
    ->index('kb')
    ->query(new MatchAll)
    ->aggs($aggs)
    ->size(0)
    ->get();

$buckets = $response->json('aggregations.by_topic.buckets');
// [['key' => 'returns', 'doc_count' => 42], ['key' => 'shipping', 'doc_count' => 18], ...]
```

This is separate from the internal tag list used during generation (which uses a larger `size`, often 500, so the LLM sees a broad vocabulary).

See [Aggregations](aggregations.md).

## What the package contains

A Magic Tags package typically registers:

- **`NewProperties::macro('magicTags', ...)`** so mappings can call `magicTags()`.
- **A `CollectionHook`** via `$sigmie->addCollectionHook(...)` implementing:
  - `shouldRun()` — checks `Properties::fieldsOfType(MagicTags::class)` so unrelated collections skip the hook.
  - `beforeBatch()` — ensures the sidecar index exists.
  - `processBatch()` — LLM + classification + dedup.
  - `afterBatch()` — upserts tag rows into the sidecar.

See [Extending Sigmie](extending.md) for the `Package` interface and the hook lifecycle.

## See also

- [Extending Sigmie](extending.md) — packages, macros, and collection hooks.
- [Semantic Search](semantic-search.md) — semantic fields and embeddings.
- [Mappings & Properties](mappings.md) — property builders.
- [Documents](document.md) — collections, `add`, `merge`.
- [Aggregations](aggregations.md) — terms buckets for tool selection.


---

<!-- source: https://sigmie.com/docs/v2/facets -->

# Facets

Facets are the aggregated counts that drive filter sidebars: "Brand: Apple (12), Dell (8)" or "Price: $0–$100 (124), $100–$500 (89)". Sigmie generates them automatically from your property definitions — define a `category('brand')`, request `facets('brand')`, and you get back the counts.

```php
use Sigmie\Mappings\NewProperties;

$props = new NewProperties;
$props->category('brand');
$props->price();

$response = $sigmie->newSearch('products')
    ->properties($props)
    ->queryString('laptop')
    ->facets('brand price:100')
    ->get();

$facets = $response->json('facets');
// ['brand' => ['Apple' => 5, 'Dell' => 3], 'price' => [...]]
```

## Term facets

Category and keyword fields produce term counts:

```php
$props = new NewProperties;
$props->category('brand');
$props->keyword('color');

$response = $sigmie->newSearch('products')
    ->properties($props)
    ->queryString('shoes')
    ->facets('brand color')
    ->get();

$response->json('facets');
// ['brand' => ['Nike' => 15, 'Adidas' => 12], 'color' => ['black' => 10, ...]]
```

Use `category()` for categorical data (brand, department, genre). Use `keyword()` for exact-match strings (SKU, status).

## Price facets

Price fields return min, max, and a histogram. The argument after `:` is the bucket size:

```php
$props->price();

$response = $sigmie->newSearch('products')
    ->properties($props)
    ->queryString('laptop')
    ->facets('price:100')        // $100 buckets
    ->get();

$price = $props->get()['price']->facets($response->facetAggregations());
// [
//     'min' => 299,
//     'max' => 1499,
//     'histogram' => [
//         200 => 3,    // 3 in $200–$299
//         300 => 8,
//         400 => 5,
//         ...
//     ],
// ]
```

Pick interval size to match your data range — $10 for cheap items, $100 for big-ticket.

## Number facets

Number fields return statistics:

```php
$props->number('rating');

$response = $sigmie->newSearch('products')
    ->properties($props)
    ->queryString('laptop')
    ->facets('rating')
    ->get();

$stats = $props->get()['rating']->facets($response->facetAggregations());
// [
//     'count' => 127,
//     'min' => 1.0,
//     'max' => 5.0,
//     'avg' => 4.3,
//     'sum' => 546.1,
// ]
```

## Text facets

Text fields need a `.keyword` sub-field for faceting:

```php
$props->text('author')->keyword();

$response = $sigmie->newSearch('articles')
    ->properties($props)
    ->queryString('technology')
    ->facets('author')
    ->get();
```

## Filtering with facets

### Global filters

`filters()` applies to both results and facet counts:

```php
$sigmie->newSearch('products')
    ->properties($props)
    ->queryString('laptop')
    ->filters("brand:'Apple' AND price:500..1500")
    ->facets('brand category price:100')
    ->get();
```

### Facet-specific filters

Pass a filter string as the second argument to `facets()`:

```php
$sigmie->newSearch('products')
    ->properties($props)
    ->queryString('laptop')
    ->facets('brand category color', "brand:'Apple' AND category:'electronics'")
    ->get();
```

## Disjunctive vs conjunctive

E-commerce facets usually want **disjunctive** logic: selecting two brands should show items from either brand, and both brand options should remain visible in the sidebar.

### Disjunctive (OR within a field)

```php
$props->category('color')->facetDisjunctive();
$props->category('size')->facetDisjunctive();

$sigmie->newSearch('products')
    ->properties($props)
    ->queryString('shirt')
    ->facets('color size', "color:'red' color:'blue' size:'lg'")
    ->get();
```

- Multiple values for the same field combine with **OR**: `color:red OR color:blue`.
- Different fields combine with **AND**: `(color) AND (size)`.
- Both red and blue stay visible in color facets.

### Conjunctive (AND within a field)

```php
$props->category('color')->facetConjunctive();
$props->category('material')->facetConjunctive();
```

Multiple values combine with **AND**: only items matching every selected value are returned. Use this when filters narrow a set of multi-valued documents (a product with multiple tags).

### Self-exclusion

With disjunctive facets, a field's own filter doesn't affect that field's facet counts, so selecting "Apple" still shows how many Dell, HP, and Lenovo items exist:

```php
$sigmie->newSearch('products')
    ->properties($props)
    ->filters("stock>0")
    ->facets('color size', "color:'green'")
    ->get();
```

- Color facets show **every** color available (not just green).
- Size facets reflect sizes available for green items.
- Results contain only green items.

This is the standard pattern for filter UIs.
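
The counting rule can be sketched in plain PHP over an in-memory product list. This illustrates the logic only, not how Sigmie builds the Elasticsearch query; `passes()` and `facetCounts()` are hypothetical names:

```php
$products = [
    ['color' => 'green', 'size' => 'md'],
    ['color' => 'green', 'size' => 'lg'],
    ['color' => 'red',   'size' => 'lg'],
    ['color' => 'blue',  'size' => 'sm'],
];

$selected = ['color' => ['green']];   // active facet filters: field => chosen values

/** True when the product passes every active filter except the one on $skipField. */
function passes(array $product, array $selected, ?string $skipField = null): bool
{
    foreach ($selected as $field => $values) {
        if ($field === $skipField) {
            continue;   // self-exclusion: a facet ignores its own filter
        }
        if (! in_array($product[$field], $values, true)) {
            return false;   // values within a field are OR-ed, fields are AND-ed
        }
    }

    return true;
}

/** Facet counts for one field, computed with that field's own filter removed. */
function facetCounts(array $products, array $selected, string $field): array
{
    $counts = [];
    foreach ($products as $product) {
        if (passes($product, $selected, $field)) {
            $counts[$product[$field]] = ($counts[$product[$field]] ?? 0) + 1;
        }
    }

    return $counts;
}

$results = array_values(array_filter($products, fn (array $p) => passes($p, $selected)));
$colors  = facetCounts($products, $selected, 'color');  // ['green' => 2, 'red' => 1, 'blue' => 1]
$sizes   = facetCounts($products, $selected, 'size');   // ['md' => 1, 'lg' => 1]
```

The results contain only the two green items, the color facet still lists red and blue, and the size facet reflects green items only.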

## Nested fields

Use dot notation:

```php
$props->nested('attributes', function (NewProperties $p) {
    $p->keyword('color');
    $p->price();
});

$response = $sigmie->newSearch('products')
    ->properties($props)
    ->queryString('shirt')
    ->facets('attributes.color attributes.price:50')
    ->get();

$compiled = $props->get();
$colors = $compiled->get('attributes.color')->facets($response->facetAggregations());
```

Multi-level nesting works too:

```php
$props->nested('product', function (NewProperties $p) {
    $p->nested('variants', function (NewProperties $p) {
        $p->keyword('size');
        $p->price();
    });
});

$response = $sigmie->newSearch('products')
    ->properties($props)
    ->queryString('shirt')
    ->facets('product.variants.size product.variants.price:25')
    ->get();
```

## Reading facet data

### From the response JSON

```php
$allFacets = $response->json('facets');
$brand = $response->json('facets.brand');
$color = $response->json('facets.color');
```

### Through property objects

For price and number facets, the property object computes structured data:

```php
$compiled = $props->get();

$price = $compiled['price']->facets($response->facetAggregations());
$min = $price['min'];
$max = $price['max'];
$histogram = $price['histogram'];

$rating = $compiled['rating']->facets($response->facetAggregations());
$avg = $rating['avg'];
$count = $rating['count'];
```

## Combined example

A realistic e-commerce facet setup:

```php
$props = new NewProperties;
$props->category('category')->facetDisjunctive();
$props->category('brand')->facetDisjunctive();
$props->category('color')->facetDisjunctive();
$props->category('size')->facetDisjunctive();
$props->price();
$props->number('rating');
$props->number('stock');

$response = $sigmie->newSearch('products')
    ->properties($props)
    ->queryString($searchTerm)
    ->filters("stock>0")
    ->facets(
        'category brand color size price:50 rating',
        "category:'electronics' brand:'apple' price:500..1500"
    )
    ->get();

$compiled = $props->get();
$brand = $response->json('facets.brand');
$color = $response->json('facets.color');
$price = $compiled['price']->facets($response->facetAggregations());
$rating = $compiled['rating']->facets($response->facetAggregations());
```

## Empty search with facets

Browsing without a query string:

```php
$response = $sigmie->newSearch('products')
    ->properties($props)
    ->queryString('')
    ->facets('category brand price:100')
    ->get();
```

Returns facets across the entire dataset.

## See also

- [Aggregations](aggregations.md) — raw `terms`, `range`, `histogram`, `stats` aggregations.
- [Filter Parser](filter-parser.md) — the syntax used in `filters()` and `facets()`.
- [Mappings & Properties](mappings.md) — `facetDisjunctive()` / `facetConjunctive()` on field definitions.


---

<!-- source: https://sigmie.com/docs/v2/recommendations -->

# Recommendations

`newRecommend()` finds documents similar to one or more **seed documents** you already have. It's the right call for "You might also like…", "Related articles", and "Customers who viewed this also viewed…" widgets.

```php
use Sigmie\Mappings\NewProperties;

$props = new NewProperties;
$props->text('name')->semantic(api: 'embeddings', dimensions: 384);
$props->text('category')->semantic(api: 'embeddings', dimensions: 384);
$props->text('description')->semantic(api: 'embeddings', dimensions: 384);
$props->number('price');

$recommendations = $sigmie->newRecommend('products')
    ->properties($props)
    ->seedIds(['product-123', 'product-456'])
    ->field('category', weight: 2.0)
    ->field('name', weight: 1.0)
    ->filter('price<=100')
    ->topK(5)
    ->hits();
```

## How it works

1. **Fetch seeds.** Sigmie loads the documents you reference by ID.
2. **Extract vectors.** For each field you weighted, it pulls the stored embeddings off the seed documents.
3. **Multi-search.** It runs a semantic search per (seed × field) using those vectors.
4. **Fuse.** Results are combined with Reciprocal Rank Fusion (RRF).
5. **Diversify (optional).** Maximal Marginal Relevance (MMR) reorders candidates so near-duplicates don't crowd the top of the list.

No new embeddings are generated — you're searching with the vectors you already indexed.

## Use `newSearch()` instead when

- The user types a search query (no seed IDs).
- You need keyword search alongside semantic.
- You want highlighting, facets, autocomplete.

Use `newRecommend()` when:

- You have an existing document and want similar ones.
- Different fields should contribute different amounts to similarity.
- You want to fuse multiple seeds (browse history, multi-item carts).
- You want diversity in the results.

## Field weighting

Each `field()` call specifies a semantic field on the seed documents and how much it should influence the final ranking:

```php
$sigmie->newRecommend('products')
    ->properties($props)
    ->seedIds(['product-42'])
    ->field('category', weight: 3.0)
    ->field('brand', weight: 2.0)
    ->field('description', weight: 1.0)
    ->topK(10)
    ->hits();
```

| Weight | Meaning |
|--------|---------|
| 0.5 | Refinement field. |
| 1.0 | Baseline. |
| 2.0–3.0 | Important: should strongly drive results. |
| 5.0+ | Dominant: overrides everything else. |

> **Note:** Only semantic fields participate. A non-semantic field passed to `field()` is silently skipped.

## API methods

### `properties()`

Required. Sigmie uses your property definitions to determine which fields are semantic.

```php
$recommendations->properties($props);
```

### `seedIds()`

Documents must exist in the index and must have been indexed with `populateEmbeddings()` (the default) so their vectors are stored.

```php
$recommendations->seedIds(['product-123']);                       // single seed
$recommendations->seedIds(['product-123', 'product-456']);        // RRF across seeds
```

### `field()`

```php
$recommendations->field('category', weight: 2.0);
$recommendations
    ->field('category', weight: 3.0)
    ->field('brand', weight: 2.0)
    ->field('description', weight: 1.0);
```

### `filter()`

[Filter parser](filter-parser.md) syntax — narrow the candidate pool:

```php
$recommendations->filter('price>=50 AND price<=200');
$recommendations->filter('in_stock:true AND rating>=4');
```

### `topK()`

```php
$recommendations->topK(5);     // default 10
```

### `rrf()`

Configure Reciprocal Rank Fusion:

```php
$recommendations->rrf(rankConstant: 60);
```

A higher `rankConstant` flattens the gap between ranks, so lower-ranked results contribute relatively more to the fused score.

### `mmr()`

Enable Maximal Marginal Relevance for result diversity:

```php
$recommendations->mmr(lambda: 0.5);     // balanced (default)
$recommendations->mmr(lambda: 0.8);     // favor relevance
$recommendations->mmr(lambda: 0.2);     // favor diversity
```

### `make()` / `get()` / `hits()`

```php
$search = $recommendations->make();        // get the Search object without running it
$rawDsl = $search->toRaw();                // inspect the Elasticsearch query

$response = $recommendations->get();       // full Elasticsearch response
$hits = $recommendations->hits();          // just the hits array
```

## Reciprocal Rank Fusion

RRF combines multiple ranked lists into one. For each document, its RRF score is:

```
score = Σ (1 / (k + rank))
```

…summed across every result list where the document appears.

If a document appears at rank 1 in seed A's results and rank 3 in seed B's results, with default `k = 60`:

```
score = 1 / (60 + 1) + 1 / (60 + 3) = 0.0164 + 0.0159 = 0.0323
```

RRF is robust to outliers, needs no score normalization, and rewards documents that appear in multiple result sets.
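
The formula can be checked with a few lines of plain PHP. This is a minimal sketch, not Sigmie's internal implementation: the `rrfScores()` helper and the document IDs are hypothetical, and each input list is assumed to be ordered best-first.

```php
/**
 * Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.
 * Hypothetical helper for illustration; not part of Sigmie's API.
 *
 * @param array<int, array<int, string>> $rankedLists each list ordered best-first
 * @return array<string, float> document ID => fused score, highest first
 */
function rrfScores(array $rankedLists, int $k = 60): array
{
    $scores = [];

    foreach ($rankedLists as $list) {
        foreach (array_values($list) as $index => $docId) {
            $rank = $index + 1; // ranks are 1-based
            $scores[$docId] = ($scores[$docId] ?? 0.0) + 1 / ($k + $rank);
        }
    }

    arsort($scores); // highest fused score first

    return $scores;
}

// "doc-x" is rank 1 for seed A and rank 3 for seed B.
$seedA = ['doc-x', 'doc-y', 'doc-z'];
$seedB = ['doc-y', 'doc-w', 'doc-x'];

$scores = rrfScores([$seedA, $seedB]);

echo round($scores['doc-x'], 4), "\n"; // 0.0323
```

Raising `$k` shrinks the gap between `1 / (k + 1)` and `1 / (k + 3)`, which is why a larger rank constant treats lower ranks more leniently.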

## Maximal Marginal Relevance

Without MMR, you can get ten near-duplicate results (ten blue Nike sneakers with slightly different SKUs). MMR diversifies the list by penalizing each candidate's similarity to results already selected.

```
mmr_score = λ × relevance − (1 − λ) × similarity_to_selected
```

- `λ = 1.0` — pure relevance, no diversity.
- `λ = 0.5` — balanced (default).
- `λ = 0.0` — pure diversity.

Use MMR for product recommendations and content discovery. Skip it when precision is critical (medical, legal) or when you need very similar matches.

```php
// Without MMR — risks 10 identical-looking blue Nike sneakers
$sigmie->newRecommend('products')
    ->seedIds(['blue-nike-running-shoe'])
    ->field('category', weight: 2.0)
    ->field('color', weight: 1.0)
    ->topK(10)
    ->hits();

// With MMR — same starting point, more varied results
$sigmie->newRecommend('products')
    ->seedIds(['blue-nike-running-shoe'])
    ->field('category', weight: 2.0)
    ->field('color', weight: 1.0)
    ->mmr(lambda: 0.5)
    ->topK(10)
    ->hits();
```

MMR is applied per-field before the final RRF fusion, so each field contributes diverse candidates that then get blended.

> **Note:** MMR is O(n²) over candidates. Sigmie retrieves `topK × 10` before running MMR, so very large `topK` values are expensive. Filters help by shrinking the candidate pool.
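
To make the scoring rule concrete, here is a minimal greedy MMR sketch in plain PHP. It is an illustration under assumptions, not Sigmie's implementation: `cosine()` and `mmrSelect()` are hypothetical helpers, the relevance values and two-dimensional vectors are made up, and "similarity to selected" is taken as the maximum cosine similarity to any already-selected result (a common formulation that Sigmie may or may not share).

```php
/** Cosine similarity between two equal-length vectors. */
function cosine(array $a, array $b): float
{
    $dot = $normA = $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot += $value * $b[$i];
        $normA += $value ** 2;
        $normB += $b[$i] ** 2;
    }

    return $dot / (sqrt($normA) * sqrt($normB));
}

/**
 * Greedy MMR selection. Hypothetical helper, not Sigmie's implementation.
 *
 * @param array<string, array{relevance: float, vector: array<int, float>}> $candidates
 * @return array<int, string> selected IDs in pick order
 */
function mmrSelect(array $candidates, int $topK, float $lambda = 0.5): array
{
    $selected = [];

    while (count($selected) < $topK && $candidates !== []) {
        $bestId = null;
        $bestScore = -INF;

        foreach ($candidates as $id => $candidate) {
            // Similarity to the closest already-selected result (0 when none yet).
            $maxSimilarity = 0.0;
            foreach ($selected as $picked) {
                $maxSimilarity = max($maxSimilarity, cosine($candidate['vector'], $picked['vector']));
            }

            $score = $lambda * $candidate['relevance'] - (1 - $lambda) * $maxSimilarity;

            if ($score > $bestScore) {
                $bestScore = $score;
                $bestId = $id;
            }
        }

        $selected[$bestId] = $candidates[$bestId];
        unset($candidates[$bestId]);
    }

    return array_keys($selected);
}

$candidates = [
    'blue-nike-a' => ['relevance' => 0.95, 'vector' => [1.0, 0.0]],
    'blue-nike-b' => ['relevance' => 0.94, 'vector' => [0.99, 0.1]],
    'red-adidas'  => ['relevance' => 0.80, 'vector' => [0.2, 1.0]],
];

print_r(mmrSelect($candidates, 2, 0.5)); // blue-nike-a, red-adidas
```

With `lambda` set to `1.0` the same call returns both blue Nike entries, because the similarity penalty disappears. The nested loops also show where the quadratic cost comes from: every remaining candidate is compared with every selected result on each pick.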

## End-to-end example

```php
use Sigmie\AI\APIs\OpenAIEmbeddingsApi;
use Sigmie\Mappings\NewProperties;

$sigmie->registerApi('embeddings', new OpenAIEmbeddingsApi(getenv('OPENAI_API_KEY')));

$props = new NewProperties;
$props->text('name')->semantic(api: 'embeddings', dimensions: 1536);
$props->text('category')->semantic(api: 'embeddings', dimensions: 1536, accuracy: 4);
$props->text('description')->semantic(api: 'embeddings', dimensions: 1536);
$props->text('brand')->semantic(api: 'embeddings', dimensions: 1536);
$props->number('price');
$props->number('rating');
$props->bool('in_stock');

$recommendations = $sigmie->newRecommend('products')
    ->properties($props)
    ->seedIds(['macbook-pro-16-2023'])
    ->field('category', weight: 3.0)
    ->field('brand', weight: 2.0)
    ->field('name', weight: 1.5)
    ->field('description', weight: 1.0)
    ->mmr(lambda: 0.6)
    ->filter('in_stock:true AND price<=2000 AND rating>=4')
    ->topK(10)
    ->hits();

foreach ($recommendations as $hit) {
    $p = $hit['_source'];
    echo "{$p['name']} — {$p['brand']} | \${$p['price']} | {$p['rating']}/5\n";
}
```

## Debugging

Inspect the generated query:

```php
$search = $sigmie->newRecommend('products')
    ->properties($props)
    ->seedIds(['running-shoes-nike-pegasus'])
    ->field('category', weight: 2.0)
    ->topK(5)
    ->make();

print_r($search->toRaw());
```

List your semantic fields:

```php
$semanticFields = $props->get()
    ->nestedSemanticFields()
    ->filter(fn ($f) => $f->isSemantic())
    ->map(fn ($f) => $f->fullPath)
    ->toArray();
```

## Troubleshooting

**No results.** Filters too restrictive, or seeds don't have stored embeddings (re-index with `populateEmbeddings()`).

**Poor quality.** Verify the fields you're weighting are actually semantic. Tune weights and accuracy. Try seeds from different parts of your dataset.

**Slow.** Reduce `topK`, narrow with filters, drop semantic accuracy, or disable MMR if you don't need diversity.

## See also

- [Semantic Search](semantic-search.md) — semantic fields and embeddings.
- [Filter Parser](filter-parser.md) — filter syntax used in `filter()`.
- [Mappings & Properties](mappings.md) — defining semantic fields.


---

<!-- source: https://sigmie.com/docs/v2/analysis -->

# Text Analysis

Analysis is how Elasticsearch transforms text into searchable tokens. Every text field is analyzed at index time, and every query string is analyzed the same way at search time. When both sides apply identical transformations, matching becomes a fast set operation.

## The pipeline

Every analyzer has three stages:

```
input text
   │
   ▼
[Char filters]    pre-process raw characters (strip HTML, map symbols)
   │
   ▼
[Tokenizer]       split into tokens
   │
   ▼
[Token filters]   transform tokens (lowercase, remove stopwords, stem)
   │
   ▼
indexed tokens
```

Every stage except tokenization is optional. The tokenizer is the one required component; char filters and token filters are added as needed.

## A worked example

Index this HTML text:

```
"<span>Some people are worth melting for</span>"
```

With this analyzer:

```
Analyzer
├─ Char filters
│  └─ Strip HTML
├─ Tokenizer
│  └─ Whitespace
└─ Token filters
   ├─ Lowercase
   └─ Stopwords (drop "are", "for")
```

### Step 1: Char filters

The HTML strip removes tags:

```
"<span>Some people are worth melting for</span>"   →   "Some people are worth melting for"
```

### Step 2: Tokenize

Whitespace tokenizer splits on spaces:

```
"Some people are worth melting for"

→ "Some"
→ "people"
→ "are"
→ "worth"
→ "melting"
→ "for"
```

### Step 3: Token filters

Lowercase normalizes case; stopwords drops common words:

```
"Some"    → "some"
"people"  → "people"
"are"     → (dropped)
"worth"   → "worth"
"melting" → "melting"
"for"     → (dropped)
```

The indexed tokens are:

```
"some" "people" "worth" "melting"
```
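
The three steps can be mimicked in plain PHP to see how they compose. This is a toy sketch for intuition only: Elasticsearch performs the real analysis, and `analyzeText()` is a hypothetical helper, not a Sigmie function.

```php
/**
 * Toy analyzer mirroring the three stages above. Illustration only:
 * Elasticsearch performs the real analysis, and analyzeText() is hypothetical.
 *
 * @param array<int, string> $stopwords lowercase stopwords to drop
 * @return array<int, string>
 */
function analyzeText(string $input, array $stopwords): array
{
    // 1. Char filter: strip HTML tags from the raw string.
    $clean = strip_tags($input);

    // 2. Tokenizer: split on runs of whitespace.
    $tokens = preg_split('/\s+/', trim($clean), -1, PREG_SPLIT_NO_EMPTY);

    // 3. Token filters: lowercase, then drop stopwords.
    $tokens = array_map('strtolower', $tokens);
    $tokens = array_filter($tokens, fn (string $t) => !in_array($t, $stopwords, true));

    return array_values($tokens);
}

$tokens = analyzeText('<span>Some people are worth melting for</span>', ['are', 'for']);

print_r($tokens); // some, people, worth, melting
```

Running a query string through the same function yields tokens in the same normalized form, which is what makes index-time and search-time tokens comparable.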

## Query analysis

A query string goes through the **same** analyzer. Query "Some people worth melting":

```
"Some people worth melting"
   │
   ▼ (no HTML to strip)
   │
   ▼ Whitespace tokenizer
"Some" "people" "worth" "melting"
   │
   ▼ Lowercase + stopwords
"some" "people" "worth" "melting"
```

Now Elasticsearch can match the query tokens against the indexed tokens of each document. Suppose the index holds two documents:

```
Query Term    Document 1    Document 2
"some"           ✓             ✓
"people"         ✓             ✓
"worth"                        ✓
"melting"        ✓             ✓
```

Document 2 matches all four query terms and Document 1 only three, so, all else being equal, Document 2 scores higher.
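
The table boils down to a set intersection per document. The sketch below counts overlapping terms for two hypothetical analyzed documents. It is a simplification: Elasticsearch's actual relevance score (BM25 by default) also weighs term frequency and field length.

```php
// Hypothetical analyzed documents; Document 1 lacks the token "worth".
$index = [
    'Document 1' => ['some', 'people', 'melting'],
    'Document 2' => ['some', 'people', 'worth', 'melting'],
];

$queryTokens = ['some', 'people', 'worth', 'melting'];

$matches = [];
foreach ($index as $name => $docTokens) {
    // Number of query terms that also occur in the document.
    $matches[$name] = count(array_intersect($queryTokens, $docTokens));
}

arsort($matches); // most matching terms first

print_r($matches); // Document 2 => 4, Document 1 => 3
```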

## Configure analysis in Sigmie

Index-level analysis runs on every text field unless a field overrides it:

```php
$sigmie->newIndex('movies')
    ->tokenizeOnWhitespaces()        // tokenizer
    ->lowercase()                    // token filter
    ->trim()                         // token filter
    ->stripHTML()                    // char filter
    ->create();
```

See [Tokenizers](tokenizers.md), [Token Filters](token-filters.md), and [Character Filters](char-filters.md) for every option.

## Per-field analysis

Override analysis on a single field:

```php
use Sigmie\Index\NewAnalyzer;

$props->text('email')
    ->withNewAnalyzer(function (NewAnalyzer $analyzer) {
        $analyzer->tokenizeOnPattern('(@|\.)');
        $analyzer->lowercase();
    });
```

## Test the analyzer

`analyze()` runs text through the index's analyzer and returns the resulting tokens:

```php
$sigmie->index('movies')->analyze('The Matrix Reloaded');
// e.g. ["matrix", "reloaded"] when the analyzer lowercases and drops stopwords
```

Use this to verify a field is being tokenized the way you expect before re-indexing the world.

## Language-specific analysis

English, German, and Greek have purpose-built analyzers with stemmers, stopwords, and normalizers — see [Languages](language.md).

```php
use Sigmie\Languages\English\English;

$sigmie->newIndex('articles')
    ->language(new English)
    ->englishStemmer()
    ->englishStopwords()
    ->englishLowercase()
    ->create();
```


---

<!-- source: https://sigmie.com/docs/v2/tokenizers -->

# Tokenizers

The tokenizer is the middle stage of the [analysis pipeline](analysis.md). It takes a string and produces tokens — typically words, but the rules depend on which tokenizer you pick.

For text like:

```
"Make your user's search experience great"
```

A whitespace tokenizer produces:

```
"Make" "your" "user's" "search" "experience" "great"
```

Sigmie has tokenizers for word boundaries, whitespace, patterns, paths, non-letters, and a no-op that keeps the input as one token.

## Word boundaries

Produces a token at every word boundary (handles punctuation):

```php
use Sigmie\Index\Analysis\Tokenizers\WordBoundaries;

$analyzer->tokenizer(new WordBoundaries(name: 'word_boundaries', maxTokenLength: 255));

// Or via the builder shortcut:
$analyzer->tokenizeOnWordBoundaries(maxTokenLength: 255);
```

`maxTokenLength` defaults to 255.

Example:

```
"Aw shucks, pluto. I can't be mad at ya!"

→ "Aw"
→ "shucks"
→ "pluto"
→ "I"
→ "can't"
→ "be"
→ "mad"
→ "at"
→ "ya"
```

Punctuation between words is discarded, while the apostrophe inside `can't` is kept.

## Whitespace

Splits on whitespace characters only — punctuation stays attached to neighboring tokens:

```php
use Sigmie\Index\Analysis\Tokenizers\Whitespace;

$analyzer->tokenizer(new Whitespace(name: 'whitespace_tokenizer'));

// Or:
$analyzer->tokenizeOnWhitespaces();
```

Same input as above:

```
"Aw" "shucks," "pluto." "I" "can't" "be" "mad" "at" "ya!"
```

`shucks,`, `pluto.`, and `ya!` keep their punctuation.
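
The difference between the two tokenizers can be approximated with two regular expressions. This is a rough sketch: Elasticsearch's word-boundary tokenization follows Unicode text segmentation rules, which the second regex only imitates for simple English text.

```php
$text = "Aw shucks, pluto. I can't be mad at ya!";

// Whitespace: split on whitespace only, punctuation stays attached.
$whitespace = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

// Rough word-boundary approximation: runs of letters or digits,
// optionally joined by an apostrophe (so "can't" stays whole).
preg_match_all("/[\p{L}\p{N}]+(?:'[\p{L}\p{N}]+)*/u", $text, $m);
$wordBoundaries = $m[0];

print_r($whitespace);     // Aw, shucks,, pluto., I, can't, be, mad, at, ya!
print_r($wordBoundaries); // Aw, shucks, pluto, I, can't, be, mad, at, ya
```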

## No-op

Treats the entire input as a single token. Useful when you want exact-match behavior on text fields:

```php
use Sigmie\Index\Analysis\Tokenizers\Noop;

$analyzer->tokenizer(new Noop(name: 'noop_tokenizer'));

// Or:
$analyzer->dontTokenize();
```

```
"If you ain't scared, you ain't alive."

→ "If you ain't scared, you ain't alive."
```

## Pattern

Splits at every match of a regular expression. The matched text **is not** included in any token:

```php
use Sigmie\Index\Analysis\Tokenizers\Pattern;

$analyzer->tokenizer(new Pattern('pattern_tokenizer', ','));   // name, then pattern

// Or:
$analyzer->tokenizeOnPattern(',');
```

```
"Though at times it may feel like the sky is falling around you, never give up, for every day is a new day"

→ "Though at times it may feel like the sky is falling around you"
→ " never give up"
→ " for every day is a new day"
```

## Simple pattern

Outputs each match of the pattern as a token (the inverse of `Pattern`):

```php
use Sigmie\Index\Analysis\Tokenizers\SimplePattern;

$analyzer->tokenizer(new SimplePattern('simple_pattern', "'.*'"));   // name, then pattern

// Or:
$analyzer->tokenizeOnPatternMatch("'.*'");
```

```
"I remember daddy told me 'Fairytales can come true'."

→ "'Fairytales can come true'"
```

Only the quoted phrase becomes a token.
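
In plain PHP terms, the two tokenizers behave like `preg_split()` and `preg_match_all()` respectively. The sketch below uses PCRE, so it only approximates the regex dialect Elasticsearch applies; the shortened input sentence is made up for illustration.

```php
// Pattern-style: split at every match; the commas themselves are discarded.
$split = preg_split('/,/', 'never give up, for every day, is a new day');

// Simple-pattern-style: keep only what the pattern matches.
preg_match_all("/'.*'/", "I remember daddy told me 'Fairytales can come true'.", $m);
$matched = $m[0];

print_r($split);   // "never give up", " for every day", " is a new day"
print_r($matched); // "'Fairytales can come true'"
```

Note the leading spaces left over after splitting on commas; a trim token filter removes them.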

## Path hierarchy

Produces a token at every level of a hierarchical path:

```php
use Sigmie\Index\Analysis\Tokenizers\PathHierarchy;

$analyzer->tokenizer(new PathHierarchy(delimiter: '/'));

// Or:
$analyzer->tokenizePathHierarchy(delimiter: '/');
```

Default delimiter is `/`.

```
"Disney/Movies/Musical/Sleeping Beauty"

→ "Disney"
→ "Disney/Movies"
→ "Disney/Movies/Musical"
→ "Disney/Movies/Musical/Sleeping Beauty"
```

Useful for filtering on path prefixes — searching "Disney/Movies" matches every nested entry.
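
The token list is simply every prefix of the path, cut at the delimiter. A minimal sketch, assuming a hypothetical `pathHierarchyTokens()` helper that is not part of Sigmie:

```php
/**
 * Emit one token per level of a delimited path.
 * Hypothetical helper for illustration only.
 *
 * @return array<int, string>
 */
function pathHierarchyTokens(string $path, string $delimiter = '/'): array
{
    $tokens = [];
    $current = '';

    foreach (explode($delimiter, $path) as $i => $segment) {
        // Append the next segment to the prefix built so far.
        $current = $i === 0 ? $segment : $current . $delimiter . $segment;
        $tokens[] = $current;
    }

    return $tokens;
}

$tokens = pathHierarchyTokens('Disney/Movies/Musical/Sleeping Beauty');

// A prefix filter matches when the prefix is one of the stored tokens.
var_dump(in_array('Disney/Movies', $tokens, true)); // bool(true)
```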

## Non-letter

Splits on any character that isn't a letter:

```php
use Sigmie\Index\Analysis\Tokenizers\NonLetter;

$analyzer->tokenizer(new NonLetter);

// Or:
$analyzer->tokenizeOnNonLetter();
```

```
"To infinity … and beyond!"

→ "To"
→ "infinity"
→ "and"
→ "beyond"
```
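
A close PHP equivalent is splitting on runs of non-letter characters with the Unicode property `\P{L}`. This is a sketch of the idea, not of Elasticsearch's implementation, and it assumes UTF-8 input.

```php
// Split wherever one or more non-letter characters occur.
$tokens = preg_split('/\P{L}+/u', 'To infinity … and beyond!', -1, PREG_SPLIT_NO_EMPTY);

print_r($tokens); // To, infinity, and, beyond

// Digits are not letters, so they act as separators too.
$mixed = preg_split('/\P{L}+/u', 'R2D2', -1, PREG_SPLIT_NO_EMPTY);

print_r($mixed); // R, D
```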

## See also

- [Analysis](analysis.md) — the full pipeline.
- [Token Filters](token-filters.md) — transforming tokens after tokenization.
- [Character Filters](char-filters.md) — pre-processing before tokenization.


---

<!-- source: https://sigmie.com/docs/v2/token-filters -->

# Token Filters

Token filters run after the [tokenizer](tokenizers.md). Each filter transforms or removes tokens — lowercasing, stemming, dropping stopwords, applying synonyms.

Filters run in the order you declare them, and the order matters. Stopword lists are usually defined in lowercase, so lowercase first and apply stopwords second; the other way around, a capitalized stopword such as "The" slips past the stopword filter.
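
The effect of ordering is easy to reproduce with two small closures. This is a plain PHP sketch with made-up tokens, not Sigmie code:

```php
$tokens = ['The', 'Matrix', 'Reloaded'];
$stopwords = ['the'];

$lowercase = fn (array $tokens) => array_map('strtolower', $tokens);
$dropStopwords = fn (array $tokens) => array_values(
    array_filter($tokens, fn (string $t) => !in_array($t, $stopwords, true))
);

// Lowercase first: "The" becomes "the" and is then dropped.
$lowerFirst = $dropStopwords($lowercase($tokens));

// Stopwords first: "The" does not equal "the", so it survives.
$stopFirst = $lowercase($dropStopwords($tokens));

print_r($lowerFirst); // matrix, reloaded
print_r($stopFirst);  // the, matrix, reloaded
```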

## Stemming

Reduces words to a root form so "going" matches "go":

```php
$analyzer->stemming([
    ['go', ['going']],
]);
```

```
"Where" "are" "you" "going"
   │
   ▼ Stemming
"Where" "are" "you" "go"
```

## Stopwords

Drop common words:

```php
$analyzer->stopwords(['but', 'not']);
```

```
"Ladies" "do" "not" "start" "fights" "but" "they" "can" "finish" "them"
   │
   ▼ Stopwords ("not", "but")
"Ladies" "do" "start" "fights" "they" "can" "finish" "them"
```

## Trim

Remove leading and trailing whitespace from each token:

```php
$analyzer->trim();
```

Useful after pattern-based tokenization that can leave whitespace attached:

```
" never give up"   →   "never give up"
" for every day"   →   "for every day"
```

## Unique

Remove duplicate tokens:

```php
$analyzer->unique(onlyOnSamePosition: false);
```

```
"I" "was" "hiding" "under" "your" "porch" "because" "I" "love" "you"
   │
   ▼ Unique
"I" "was" "hiding" "under" "your" "porch" "because" "love" "you"
```

## Synonyms

### One-way

Replace specified terms with a canonical form:

```php
$analyzer->oneWaySynonyms([
    'ipod' => ['i-pod', 'i pod'],
]);
```

Wherever `i-pod` or `i pod` appears, it is replaced with `ipod`. The mapping runs in one direction only: `ipod` itself is never expanded to `i-pod` or `i pod`.

### Two-way

Map a set of terms to each other:

```php
$analyzer->synonyms([
    ['joy', 'fun'],
]);
```

`fun` and `joy` are interchangeable — either matches documents containing the other.

```
"It's" "kind" "of" "fun" "to" "do" "the" "impossible"
   │
   ▼ Synonyms (fun ↔ joy)
"It's" "kind" "of" "fun" "joy" "to" "do" "the" "impossible"
```

## Lowercase / Uppercase

```php
$analyzer->lowercase();
$analyzer->uppercase();
```

Lowercase is part of nearly every analyzer — without it, "Matrix" doesn't match a query for "matrix".

```
"You" "better" "be" "back" "ASAP"
   │
   ▼ Lowercase
"you" "better" "be" "back" "asap"
```

## Decimal digit

Convert non-ASCII digits to ASCII:

```php
$analyzer->decimalDigit();
```

```
"໑" "໒" "໓" "໔" "໕"     (Lao digits)
   │
   ▼ Decimal Digit
"1" "2" "3" "4" "5"
```

## ASCII folding

Strip diacritics:

```php
$analyzer->asciiFolding();
```

```
"manténgase"   →   "mantengase"
```

Useful when users might or might not type accents.

## Token limit

Keep only the first N tokens:

```php
$analyzer->tokenLimit(maxTokenCount: 5);
```

```
"I" "was" "hiding" "under" "your" "porch" "because" "I" "love" "you"
   │
   ▼ Token Limit 5
"I" "was" "hiding" "under" "your"
```

## Truncate

Limit each token's length:

```php
$analyzer->truncate(length: 10);
```

```
"Supercalifragilisticexpialidocious"
   │
   ▼ Truncate 10
"Supercalif"
```

## Keywords

Protect specific terms from later filters — for example, prevent stemming on a brand name:

```php
$analyzer
    ->keywords(['going'])
    ->stemming([
        ['go', ['going']],
    ]);
```

```
"Where" "are" "you" "going"
   │
   ▼ Keywords protect "going"
   ▼ Stemming would normally turn "going" into "go" — but doesn't here
"Where" "are" "you" "going"
```
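
The interaction between the two filters can be sketched as a lookup table with an exception list. `stemTokens()` is a hypothetical helper written for this illustration; it borrows the `[stem, [variants]]` rule shape from the example above but is not how Sigmie or Elasticsearch implement stemming.

```php
/**
 * Dictionary stemmer with protected keywords. Hypothetical sketch.
 *
 * @param array<int, string> $tokens
 * @param array<int, array{0: string, 1: array<int, string>}> $rules [stem, [variants]]
 * @param array<int, string> $protected tokens that must pass through unchanged
 * @return array<int, string>
 */
function stemTokens(array $tokens, array $rules, array $protected = []): array
{
    // Build a variant => stem lookup table.
    $lookup = [];
    foreach ($rules as [$stem, $variants]) {
        foreach ($variants as $variant) {
            $lookup[$variant] = $stem;
        }
    }

    return array_map(
        fn (string $t) => in_array($t, $protected, true) ? $t : ($lookup[$t] ?? $t),
        $tokens
    );
}

$rules = [['go', ['going']]];
$tokens = ['Where', 'are', 'you', 'going'];

print_r(stemTokens($tokens, $rules));            // Where, are, you, go
print_r(stemTokens($tokens, $rules, ['going'])); // Where, are, you, going
```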

## Custom token filters

Register your own filter classes by name:

```php
use Sigmie\Index\Analysis\TokenFilter\TokenFilter;

TokenFilter::filterMap([
    'skroutz_greeklish' => SkroutzGreeklish::class,
    'skroutz_stem_greek' => SkroutzGreekStemmer::class,
]);
```

`SkroutzGreeklish` and `SkroutzGreekStemmer` are your classes implementing the token filter contract.

## See also

- [Tokenizers](tokenizers.md) — splitting text into tokens before the filters run.
- [Character Filters](char-filters.md) — preprocessing before tokenization.
- [Analysis](analysis.md) — the full pipeline.
- [Languages](language.md) — pre-built filter chains for English, German, Greek.


---

<!-- source: https://sigmie.com/docs/v2/char-filters -->

# Character Filters

Character filters run **before** the [tokenizer](tokenizers.md). They operate on the raw string — stripping HTML, mapping characters, applying regex substitutions — so the tokenizer sees clean input.

Sigmie ships three: HTML strip, character mapping, and pattern replace.

## HTML strip

Remove HTML tags from input text:

```php
use Sigmie\Index\Analysis\CharFilter\HTMLStrip;

$analyzer->charFilter(new HTMLStrip);

// Or:
$analyzer->stripHTML();
```

```
"<span>Some people are worth melting for.</span>"
   │
   ▼ Strip HTML
"Some people are worth melting for."
```

Use this for crawled web content, rich-text fields, or anywhere user input might contain HTML.

## Character mapping

Replace specific substrings with fixed strings:

```php
use Sigmie\Index\Analysis\CharFilter\Mapping;

$analyzer->charFilter(new Mapping(
    name: 'mapping_char_filter',
    mappings: [
        ':)' => 'happy',
        ':(' => 'sad',
    ],
));

// Or:
$analyzer->mapChars([
    ':)' => 'happy',
    ':(' => 'sad',
]);
```

```
"Even miracles take a little time. :)"
   │
   ▼ Map Chars (":)" → "happy")
"Even miracles take a little time. happy"
```

Mappings are literal substring substitutions — not regex.

## Pattern replace

Apply a regex substitution:

```php
use Sigmie\Index\Analysis\CharFilter\Pattern;

$analyzer->charFilter(new Pattern(
    name: 'pattern_replace_char_filter',
    pattern: ':D|:\)',
    replace: 'happy',
));

// Or:
$analyzer->patternReplace(pattern: ':D|:\)', replace: 'happy');
```

```
"This is the perfect time to panic! :D :)"
   │
   ▼ Pattern Replace (":D|:\)" → "happy")
"This is the perfect time to panic! happy happy"
```

Use pattern replace for edge cases mapping can't handle — variable-length matches, alternations, anchors.
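
The distinction between the last two filters mirrors PHP's own `strtr()` and `preg_replace()`. The sketch below reproduces both examples in plain PHP; it illustrates the semantics only and uses PCRE rather than the regex engine Elasticsearch runs.

```php
// Character mapping: literal substring replacement, like strtr().
$mapped = strtr('Even miracles take a little time. :)', [
    ':)' => 'happy',
    ':(' => 'sad',
]);

// Pattern replace: regex substitution, like preg_replace().
$replaced = preg_replace('/:D|:\)/', 'happy', 'This is the perfect time to panic! :D :)');

echo $mapped, "\n";   // Even miracles take a little time. happy
echo $replaced, "\n"; // This is the perfect time to panic! happy happy
```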

## See also

- [Tokenizers](tokenizers.md) — what runs after the char filters.
- [Token Filters](token-filters.md) — transforming individual tokens.
- [Analysis](analysis.md) — the full pipeline.


---

<!-- source: https://sigmie.com/docs/v2/language -->

# Languages

Sigmie ships purpose-built analyzers for **English**, **German**, and **Greek**. Each language has its own stemmers, stopword lists, lowercase normalizer, and (where appropriate) script-specific normalizers.

To use a language, pass an instance to `language()` on the index builder, then chain the filters you want:

```php
use Sigmie\Languages\English\English;

$sigmie->newIndex('articles')
    ->language(new English)
    ->englishStemmer()
    ->englishStopwords()
    ->englishLowercase()
    ->create();
```

`language()` returns a builder typed to that language, so the chained methods are language-specific and discoverable.

## English

```php
use Sigmie\Languages\English\English;

$sigmie->newIndex('articles')
    ->language(new English)
    ->englishStemmer()
    ->englishStopwords()
    ->englishLowercase()
    ->create();
```

### Available filters

| Filter | Purpose |
|--------|---------|
| `englishStemmer()` | Standard English stemmer. |
| `englishPorter2Stemmer()` | Porter2 (Snowball) stemmer. More aggressive than the default. |
| `englishLightStemmer()` | Lighter stemming — keeps more of the original form. |
| `englishLovinsStemmer()` | The Lovins algorithm. |
| `englishMinimalStemmer()` | Minimal stemming for high-precision use cases. |
| `englishPossessiveStemming()` | Strip trailing `'s` and `'`. |
| `englishStopwords()` | Drop English stopwords. |
| `englishLowercase()` | Lowercase tokens. |

Pick **one** stemmer per analyzer — they overlap. Porter2 is the standard choice; the light/minimal variants help when stemming reduces precision too much.

```php
$sigmie->newIndex('articles')
    ->language(new English)
    ->englishPorter2Stemmer()
    ->englishPossessiveStemming()
    ->englishStopwords()
    ->englishLowercase()
    ->create();
```

## German

```php
use Sigmie\Languages\German\German;

$sigmie->newIndex('artikel')
    ->language(new German)
    ->germanNormalize()
    ->germanStemmer()
    ->germanStopwords()
    ->germanLowercase()
    ->create();
```

### Available filters

| Filter | Purpose |
|--------|---------|
| `germanStemmer()` | Default German stemmer. |
| `germanStemmer2()` | Alternate German stemmer (variant 2). |
| `germanLightStemmer()` | Lighter stemming. |
| `germanMinimalStemmer()` | Minimal stemming. |
| `germanNormalize()` | Normalize umlauts and ß. |
| `germanStopwords()` | Drop German stopwords. |
| `germanLowercase()` | Lowercase tokens. |

`germanNormalize()` is usually worth including — it folds `ü→u`, `ö→o`, `ä→a`, `ß→ss`, so queries match regardless of how users type umlauts.

## Greek

```php
use Sigmie\Languages\Greek\Greek;

$sigmie->newIndex('arthra')
    ->language(new Greek)
    ->greekLowercase()
    ->greekStemmer()
    ->greekStopwords()
    ->create();
```

### Available filters

| Filter | Purpose |
|--------|---------|
| `greekStemmer()` | Greek stemmer. |
| `greekStopwords()` | Drop Greek stopwords. |
| `greekLowercase()` | Lowercase Greek tokens (handles σ → ς word-final form). |

`greekLowercase()` does more than ASCII lowercase — it handles the Greek-specific final-sigma form. Use it instead of the generic `lowercase()` for Greek text.

## Multi-language indices

Per-field analysis lets you mix languages in one index. Define one analyzer per language-specific field:

```php
use Sigmie\Index\NewAnalyzer;
use Sigmie\Languages\English\English;
use Sigmie\Languages\German\German;

$props->text('description_en')
    ->withNewAnalyzer(function (NewAnalyzer $analyzer) {
        $analyzer->tokenizeOnWordBoundaries();
        // language-specific filters via separate builder
    });
```

In most cases it is simpler to keep one index per language and search across them:

```php
$sigmie->newSearch("articles_de,articles_en")
    ->properties($props)
    ->queryString('Tür door')
    ->get();
```

## See also

- [Analysis](analysis.md) — how text becomes searchable.
- [Token Filters](token-filters.md) — generic, language-agnostic filters.
- [Indices](index.md) — index configuration.


---

<!-- source: https://sigmie.com/docs/v2/filter-parser -->

# Filter Parser

The Filter Parser turns human-readable expressions into Elasticsearch boolean queries. You write `active:true AND price:100..500`; Sigmie compiles it into the right combination of `term`, `range`, and `bool` queries.

```php
use Sigmie\Parse\FilterParser;
use Sigmie\Mappings\NewProperties;

$props = new NewProperties;
$props->keyword('category');
$props->bool('active');
$props->number('stock');

$parser = new FilterParser($props());
$query = $parser->parse('active:true AND category:"sports"');
```

In `newSearch()` and `newQuery()`, you almost never instantiate the parser directly — you pass a filter string to `filters()` or `parse()`:

```php
$sigmie->newSearch('products')
    ->properties($props)
    ->filters('active:true AND stock>0 AND price:100..500')
    ->queryString('laptop')
    ->get();
```

## Why use the parser

- **Reads like a sentence** — `active:true AND price:100..500` instead of nested `bool` arrays.
- **Type-aware** — validated against your property definitions.
- **Errors early** — invalid syntax raises a `ParseException` before hitting Elasticsearch.

## Exact match

```
category:"sports"
color:'red'
name:'John Doe'
name:"John Doe"
```

Single and double quotes are interchangeable.

### Numbers

Numbers don't need quotes:

```
price:100
stock:50
```

### Booleans

```
active:true
published:false
```

Boolean values are lowercase, no quotes.

### Field exists

```
email:*               # has any value
NOT email:*           # has no value
```

## Multiple values (IN)

Match any value from an array:

```
category:["sports", "action", "horror"]
status:["active", "pending"]
```

Whitespace inside an array is trimmed. Empty arrays match nothing.

## Wildcards

```
phone:'*650'         # ends with 650
phone:'2353*'        # starts with 2353
title:'*manager*'    # contains "manager"
```

## Ranges

### Comparison operators

```
price>=100
price<=200
stock>0
created_at>="2023-05-01"
```

Operators: `>`, `<`, `>=`, `<=`.

### Inclusive range

```
price:100..500
created_at:"2023-01-01".."2023-12-31"
```

`price:100..500` is equivalent to `price>=100 AND price<=500`.
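
In Elasticsearch terms, both spellings describe one inclusive `range` clause. The array below is standard Elasticsearch query DSL for that range, written by hand for illustration; the exact structure Sigmie emits (for example, how it wraps the clause in a `bool` query) may differ.

```php
// Inclusive range: gte = greater than or equal, lte = less than or equal.
$range = [
    'range' => [
        'price' => ['gte' => 100, 'lte' => 500],
    ],
];

echo json_encode($range), "\n";
// {"range":{"price":{"gte":100,"lte":500}}}
```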

## Logical operators

```
active:true AND category:"sports" AND stock>0
category:"action" OR category:"horror"
NOT category:"sports"
active:true AND NOT stock:0
```

Group with parentheses:

```
active:true AND (category:"action" OR category:"horror") AND stock>0
```

> **Note:** Multiple clauses without an operator throw `ParseException`:
>
> ```
> color:'red' size:'large'             # error
> color:'red' AND size:'large'         # OK
> ```

## Object properties

For flattened object fields, use dot notation:

```
contact.active:true
contact.name:"John Doe"
user.profile.settings.notifications:true
```

```php
$props->object('contact', function (NewProperties $p) {
    $p->bool('active');
    $p->name('name');
    $p->email('email');
});

$parser->parse('contact.active:true AND contact.name:"Alice"');
```

## Nested fields

For arrays of objects (mapped as `nested()`), use curly braces. All conditions inside the braces must match the **same** array element:

```
contact:{ active:true }
contact:{ name:"John Doe" AND verified:true }
```

```php
$props->nested('vehicles', function (NewProperties $p) {
    $p->keyword('make');
    $p->keyword('model');
});

$parser->parse("vehicles:{ make:'Powell Motors' AND model:'Canyonero' }");
```

This finds documents with a vehicle whose `make` is "Powell Motors" **and** whose `model` is "Canyonero". It does not match documents where one vehicle has the make "Powell Motors" and a different vehicle has the model "Canyonero".

### Deep nesting

```
contact:{ address:{ city:"Berlin" AND marker:"X" } }
```

### Object vs nested syntax

Same data, different mappings — different syntax:

| Mapping | Syntax |
|---------|--------|
| `object()` | `contact.active:true AND contact.city:"Berlin"` |
| `nested()` | `contact:{ active:true AND city:"Berlin" }` |

`nested` preserves the relationship; `object` flattens everything to root.
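
The difference shows up as soon as a document holds more than one array element. The sketch below simulates both behaviours on a made-up document in plain PHP; it models the matching semantics only, not how Elasticsearch stores the data.

```php
// Hypothetical document: no single vehicle is a Powell Motors Canyonero.
$document = [
    'vehicles' => [
        ['make' => 'Powell Motors', 'model' => 'Homer'],
        ['make' => 'Acme', 'model' => 'Canyonero'],
    ],
];

// object(): values are flattened per field, so element boundaries are lost.
$flattened = [
    'vehicles.make' => array_column($document['vehicles'], 'make'),
    'vehicles.model' => array_column($document['vehicles'], 'model'),
];
$objectMatch = in_array('Powell Motors', $flattened['vehicles.make'], true)
    && in_array('Canyonero', $flattened['vehicles.model'], true);

// nested(): every condition must hold within the same element.
$nestedMatch = false;
foreach ($document['vehicles'] as $vehicle) {
    if ($vehicle['make'] === 'Powell Motors' && $vehicle['model'] === 'Canyonero') {
        $nestedMatch = true;
        break;
    }
}

var_dump($objectMatch, $nestedMatch); // bool(true), bool(false)
```

The flattened form reports a match that no single vehicle satisfies, which is the false positive `nested()` exists to prevent.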

## Geo-location

Filter by proximity to a point:

```
location:1km[51.49,13.77]
location:5mi[40.7128,-74.0060]
contact:{ location:1km[51.16,13.49] AND active:true }
```

The format is `field:distance[lat,lon]`. Supported units: `km`, `mi`, `m`, `yd`, `ft`, `nmi`, `cm`, `in`.

> **Note:** Zero distance returns nothing, even on an exact coordinate match. Use a small positive distance:
>
> ```
> location:0km[51.16,13.49]       # returns nothing
> location:1m[51.16,13.49]        # OK
> ```
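
For intuition about what a distance filter checks, the sketch below computes the great-circle distance between two coordinates with the Haversine formula. Elasticsearch does the real calculation; the `haversineKm()` helper, the spherical Earth with a 6371 km radius, and the sample coordinates are all assumptions made for this example.

```php
/**
 * Great-circle distance in kilometres using the Haversine formula.
 * Assumes a spherical Earth with a mean radius of 6371 km.
 */
function haversineKm(float $lat1, float $lon1, float $lat2, float $lon2): float
{
    $earthRadiusKm = 6371.0;

    $dLat = deg2rad($lat2 - $lat1);
    $dLon = deg2rad($lon2 - $lon1);

    $a = sin($dLat / 2) ** 2
        + cos(deg2rad($lat1)) * cos(deg2rad($lat2)) * sin($dLon / 2) ** 2;

    // min() guards against rounding pushing sqrt($a) slightly above 1.
    return 2 * $earthRadiusKm * asin(min(1.0, sqrt($a)));
}

// Is a (hypothetical) document location within 1 km of the filter point?
$distance = haversineKm(51.49, 13.77, 51.495, 13.77);

var_dump($distance < 1.0); // bool(true)
```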

## Empty values

```
database:""
tags:[]
```

Empty arrays match no documents.

## Escaping and special characters

### Quotes inside strings

```
description:"She said \"Hello World\""
title:'It\'s working'
```

### Dashes, spaces, parentheses

Values with spaces, dashes, or special characters need quotes:

```
status:'in-progress'
category:"crime & drama"
job_title:"Chief Executive Officer (CEO)"
industry:["Renewables & Environment"]
```

## Error handling

Invalid syntax raises `ParseException`:

```php
use Sigmie\Parse\ParseException;

try {
    $query = $parser->parse('color:"red" color:"blue"');     // missing operator
} catch (ParseException $e) {
    // log, surface to the user, etc.
}
```

Common causes:

- Missing logical operator between clauses.
- Referencing a field that isn't in your mappings.
- Mismatched parentheses or brackets.
- Excessive nesting (the parser has a depth limit).

### Field validation

The parser validates field names against your property definitions:

```php
$props = new NewProperties;
$props->keyword('category');

$parser = new FilterParser($props());

$parser->parse('category:"sports"');                  // OK
$parser->parse('subject_service:{ id:"23" }');        // error — field unknown

if (!empty($parser->errors())) {
    // handle errors
}
```

## Syntax cheatsheet

| Operation | Syntax | Example |
|-----------|--------|---------|
| Exact match | `field:"value"` | `category:"sports"` |
| Number | `field:123` | `price:100` |
| Boolean | `field:true` | `active:true` |
| Field exists | `field:*` | `email:*` |
| IN | `field:[v1,v2]` | `status:["active","pending"]` |
| Wildcard | `field:'*pat*'` | `phone:'*650'` |
| Range | `field:min..max` | `price:100..500` |
| Greater than | `field>value` | `stock>0` |
| Less than | `field<value` | `price<100` |
| AND | `a AND b` | `active:true AND stock>0` |
| OR | `a OR b` | `cat:"a" OR cat:"b"` |
| NOT | `NOT a` | `NOT category:"books"` |
| Object | `obj.field:value` | `contact.active:true` |
| Nested | `field:{condition}` | `contact:{active:true}` |
| Geo | `field:dist[lat,lon]` | `location:1km[51,13]` |


---

<!-- source: https://sigmie.com/docs/v2/sort-parser -->

# Sort Parser

The Sort Parser turns space-separated sort expressions into Elasticsearch sort arrays. You write `_score rating:desc name:asc`; Sigmie generates the right JSON.

## In `newSearch()` and `newQuery()`

Pass a sort string to `sort()` on `NewSearch`, or `sortString()` on `NewQuery`:

```php
$sigmie->newSearch('movies')
    ->properties($props)
    ->sort('_score rating:desc name:asc');

$sigmie->newQuery('movies')
    ->properties($props)
    ->sortString('_score rating:desc name:asc');
```

> **Note:** On `NewQuery`, call `sortString()` **before** the query method (`matchAll`, `bool`, `parse`, etc.). Each call replaces the previous sort.

## Syntax

```
_score:desc rating:desc name:asc
```

Each clause is `field` or `field:asc` / `field:desc`. Clauses are space-separated. The default direction depends on the field: `_score` defaults to `desc`, everything else to `asc`.

> **Note:** `_score:asc` is **not allowed**. Elasticsearch can't sort relevance ascending. Use `_score` or `_score:desc`.
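
The rules above fit in a short function. This is a simplified illustration of the syntax, not Sigmie's `SortParser`: `parseSortString()` is a hypothetical helper that ignores geo sorts and property lookups and returns plain `[field, direction]` pairs.

```php
/**
 * Parse a space-separated sort string into [field, direction] pairs.
 * Simplified illustration of the rules above; not Sigmie's SortParser
 * (no geo sort, no property lookup).
 *
 * @return array<int, array{0: string, 1: string}>
 */
function parseSortString(string $sort): array
{
    $clauses = [];

    foreach (preg_split('/\s+/', trim($sort), -1, PREG_SPLIT_NO_EMPTY) as $clause) {
        [$field, $direction] = array_pad(explode(':', $clause, 2), 2, null);

        // Default direction: desc for _score, asc for everything else.
        $direction ??= $field === '_score' ? 'desc' : 'asc';

        if (!in_array($direction, ['asc', 'desc'], true)) {
            throw new InvalidArgumentException("Unknown direction in '{$clause}'.");
        }

        if ($field === '_score' && $direction === 'asc') {
            throw new InvalidArgumentException('_score cannot be sorted ascending.');
        }

        $clauses[] = [$field, $direction];
    }

    return $clauses;
}

print_r(parseSortString('_score rating:desc name'));
// [_score, desc], [rating, desc], [name, asc]
```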

## With properties

When you pass properties, the parser routes text fields to their `.keyword` sub-field automatically:

```php
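use Sigmie\Mappings\NewProperties;
use Sigmie\Parse\SortParser;

$props = new NewProperties;
$props->bool('active');
$props->number('rating');            // sorted on below, so it is declared here
$props->text('name')->keyword();     // text field with a .keyword sub-field
$props->text('category');

$parser = new SortParser($props());
$parser->parse('_score rating:desc name:asc');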
```

The compiled output:

```json
[
    "_score",
    { "rating": "desc" },
    { "name.keyword": "asc" }
]
```

Without properties, the parser passes field names through unchanged — which usually fails for text fields, since Elasticsearch can't sort an analyzed `text` field directly. Always pass properties when sorting on text.

## Direct use

```php
use Sigmie\Parse\SortParser;

$parser = new SortParser($props());
$sort = $parser->parse('_score rating:desc name:asc');
```

The result is a valid Elasticsearch sort array suitable for the `sort` key in a raw query body.

## Geo sort

For `geoPoint` fields:

```
location[40.71,-74.00]:km:asc
```

The format is `field[lat,lon]:unit:direction`. Units match the filter parser: `km`, `mi`, `m`, `yd`, etc.

## See also

- [Filter Parser](filter-parser.md) — same human-friendly syntax for `WHERE` clauses.
- [Search](search.md#sort) — using sort with `newSearch()`.
- [Advanced Queries](query.md#sorting) — using sort with `newQuery()`.


---

<!-- source: https://sigmie.com/docs/v2/connection -->

# Connection Setup

The [Installation](installation.md) guide covers basic local connections. This page covers everything else: production auth, SSL, multi-node clusters, and cloud providers.

Sigmie uses the same connection API for Elasticsearch and OpenSearch. Only the engine type and credentials change.

## Authentication

### Basic auth

```php
use Sigmie\Sigmie;

$sigmie = Sigmie::create(
    hosts: ['https://elasticsearch.example.com:9200'],
    config: [
        'auth' => ['elastic', 'your-password'],
    ]
);
```

For lower-level control, build the client manually:

```php
use Sigmie\Http\JSONClient;
use Sigmie\Base\Http\ElasticsearchConnection;
use Sigmie\Base\Drivers\Elasticsearch;

$client = JSONClient::createWithBasic(
    hosts: ['https://elasticsearch.example.com:9200'],
    username: 'elastic',
    password: 'your-password'
);

$connection = new ElasticsearchConnection($client, new Elasticsearch);
$sigmie = new Sigmie($connection);
```

### API key

Generate the key in Elasticsearch:

```bash
curl -X POST "localhost:9200/_security/api_key" \
  -H 'Content-Type: application/json' \
  -u elastic:your-password \
  -d '{"name": "my-api-key", "expiration": "90d"}'
```

Base64-encode the `id` and `api_key` values from the response, joined by a colon (`id:api_key`), and pass the result in the `Authorization` header:

```php
$sigmie = Sigmie::create(
    hosts: ['https://elasticsearch.example.com:9200'],
    config: [
        'headers' => [
            'Authorization' => 'ApiKey ' . base64_encode('id:api_key'),
        ],
    ]
);
```
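
The literal `'id:api_key'` above is a placeholder. A minimal sketch of building the header from real values follows; `apiKeyHeader()` and the sample credentials are hypothetical.

```php
/**
 * Builds the Authorization header value from the `id` and `api_key`
 * fields of the create-API-key response. Hypothetical helper.
 */
function apiKeyHeader(string $id, string $apiKey): string
{
    return 'ApiKey ' . base64_encode($id . ':' . $apiKey);
}

// Placeholder credentials; in practice, read them from the environment.
$header = apiKeyHeader('my-key-id', 'my-key-secret');
```

Pass `$header` as the `Authorization` value in the `headers` config shown above.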

### Bearer token

```php
$sigmie = Sigmie::create(
    hosts: ['https://elasticsearch.example.com:9200'],
    config: [
        'headers' => [
            'Authorization' => 'Bearer your-token-here',
        ],
    ]
);
```

## SSL/TLS

### Self-signed certificates (development)

```php
$sigmie = Sigmie::create(
    hosts: ['https://localhost:9200'],
    config: ['verify' => false],
);
```

> **Warning:** Never disable SSL verification in production.

### Custom CA certificates

```php
$sigmie = Sigmie::create(
    hosts: ['https://elasticsearch.example.com:9200'],
    config: ['verify' => '/path/to/ca-certificate.pem'],
);
```

## Multiple nodes

```php
$sigmie = Sigmie::create(
    hosts: [
        '10.0.0.1:9200',
        '10.0.0.2:9200',
        '10.0.0.3:9200',
    ],
    config: ['auth' => ['elastic', 'your-password']],
);
```

Requests are distributed round-robin. If a node fails, Sigmie retries the next one.
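
To make that behaviour concrete, here is a toy round-robin selector with failover. It illustrates the idea only and is not Sigmie's internal implementation; the `RoundRobinHosts` class and the simulated transport are assumptions.

```php
/**
 * Toy round-robin host selector with failover.
 * Illustration only; not Sigmie's internal implementation.
 */
final class RoundRobinHosts
{
    private int $next = 0;

    /** @param list<string> $hosts */
    public function __construct(private array $hosts)
    {
        if ($hosts === []) {
            throw new InvalidArgumentException('At least one host is required.');
        }
    }

    /**
     * Calls $send with each host in turn until one succeeds.
     *
     * @param callable(string): mixed $send Throws RuntimeException on failure.
     */
    public function request(callable $send): mixed
    {
        $count = count($this->hosts);
        $lastError = null;

        for ($attempt = 0; $attempt < $count; $attempt++) {
            $host = $this->hosts[$this->next];
            $this->next = ($this->next + 1) % $count; // rotate for the next call

            try {
                return $send($host);
            } catch (RuntimeException $e) {
                $lastError = $e; // this node failed, try the next one
            }
        }

        throw new RuntimeException('All hosts failed.', 0, $lastError);
    }
}

$hosts = new RoundRobinHosts(['10.0.0.1:9200', '10.0.0.2:9200', '10.0.0.3:9200']);

// Simulated transport: the first node is down.
$send = function (string $host): string {
    if ($host === '10.0.0.1:9200') {
        throw new RuntimeException("Cannot reach {$host}");
    }

    return $host;
};

$first = $hosts->request($send);  // 10.0.0.1 fails, 10.0.0.2 answers
$second = $hosts->request($send); // rotation continues with 10.0.0.3
```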

## OpenSearch

Specify the engine type:

```php
use Sigmie\Enums\SearchEngineType;

$sigmie = Sigmie::create(
    hosts: ['https://localhost:9200'],
    engine: SearchEngineType::OpenSearch,
    config: [
        'auth' => ['admin', 'MyStrongPass123!@#'],
        'verify' => false,
    ]
);
```

See [OpenSearch](opensearch.md) for the full integration.

## Cloud providers

### Elastic Cloud

```php
$sigmie = Sigmie::create(
    hosts: ['https://my-deployment.es.us-east-1.aws.found.io:9243'],
    config: ['auth' => ['elastic', 'your-cloud-password']],
);
```

Find your endpoint in the deployment dashboard under "Elasticsearch endpoint."

### AWS OpenSearch Service

Domains that use IAM-based access control require requests signed with AWS Signature Version 4. Build a Guzzle handler with the AWS SDK:

```php
use Aws\Credentials\CredentialProvider;
use Aws\Signature\SignatureV4;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Middleware;
use Sigmie\Http\JSONClient;
use Sigmie\Base\Http\ElasticsearchConnection;
use Sigmie\Base\Drivers\Opensearch;

$credentials = CredentialProvider::defaultProvider()();
$signer = new SignatureV4('es', 'us-east-1');
$handler = HandlerStack::create();
$handler->push(Middleware::mapRequest(fn ($request) =>
    $signer->signRequest($request, $credentials)
));

$client = JSONClient::create(
    hosts: ['https://search-domain.us-east-1.es.amazonaws.com:443'],
    config: ['handler' => $handler]
);

$connection = new ElasticsearchConnection($client, new Opensearch);
$sigmie = new Sigmie($connection);
```

## Configuration options

The `config` array accepts any [Guzzle HTTP option](https://docs.guzzlephp.org/en/stable/request-options.html). Common ones:

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `connect_timeout` | int | 10 | Seconds to wait for the connection. |
| `timeout` | int | 30 | Seconds to wait for the response. |
| `verify` | bool\|string | true | SSL verification (boolean or CA path). |
| `auth` | array | null | Basic auth `['username', 'password']`. |
| `headers` | array | [] | Custom HTTP headers. |

### Timeouts

```php
$sigmie = Sigmie::create(
    hosts: ['127.0.0.1:9200'],
    config: [
        'connect_timeout' => 15,
        'timeout' => 120,           // long for bulk operations
    ]
);
```

### Environment-based configuration

```php
$sigmie = Sigmie::create(
    hosts: explode(',', $_ENV['ELASTICSEARCH_HOSTS']),
    config: [
        'auth' => [$_ENV['ES_USER'], $_ENV['ES_PASSWORD']],
        'verify' => $_ENV['ES_VERIFY_SSL'] === 'true',
    ]
);
```
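
Reading `$_ENV` directly raises a warning when a key is missing, and a missing `ES_VERIFY_SSL` would compare unequal to `'true'` and switch verification off. The sketch below adds defaults; `connectionSettings()` is a hypothetical helper and the default values are assumptions.

```php
/**
 * Builds connection settings from an environment array, with defaults for
 * missing keys. Variable names follow the example above; the defaults are
 * assumptions.
 *
 * @param array<string, string> $env
 * @return array{hosts: list<string>, config: array<string, mixed>}
 */
function connectionSettings(array $env): array
{
    // Split on commas, trim whitespace, drop empty entries.
    $raw = explode(',', $env['ELASTICSEARCH_HOSTS'] ?? '127.0.0.1:9200');
    $hosts = array_values(array_filter(
        array_map('trim', $raw),
        fn (string $host): bool => $host !== ''
    ));

    $config = [
        // True for "1", "true", "on", "yes"; false for anything else.
        // A missing key keeps verification on.
        'verify' => filter_var($env['ES_VERIFY_SSL'] ?? 'true', FILTER_VALIDATE_BOOLEAN),
    ];

    // Only send credentials when a user is configured.
    if (($env['ES_USER'] ?? '') !== '') {
        $config['auth'] = [$env['ES_USER'], $env['ES_PASSWORD'] ?? ''];
    }

    return ['hosts' => $hosts, 'config' => $config];
}

$settings = connectionSettings([
    'ELASTICSEARCH_HOSTS' => '10.0.0.1:9200, 10.0.0.2:9200',
    'ES_USER' => 'elastic',
    'ES_PASSWORD' => 'secret',
    'ES_VERIFY_SSL' => 'false',
]);
```

The result can be spread into the call: `Sigmie::create(hosts: $settings['hosts'], config: $settings['config'])`.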

## Verify the connection

```php
if ($sigmie->isConnected()) {
    echo "Connected.\n";
}

foreach ($sigmie->indices() as $index) {
    echo $index->name . "\n";
}
```

## Troubleshooting

**`cURL error 7: Failed to connect`**
The cluster isn't running, or your host/port is wrong. Try `curl http://localhost:9200`.

**`cURL error 60: SSL certificate problem`**
Use `'verify' => false` in development, a valid certificate in production, or `'verify' => '/path/to/ca.pem'` for a custom CA.

**`401 Unauthorized`**
Wrong credentials, or auth isn't configured the way you think. Check cluster security logs.

**`cURL error 28: Operation timed out`**
Increase `connect_timeout` and `timeout`. For bulk operations, 60–120 seconds is common.


---

<!-- source: https://sigmie.com/docs/v2/docker -->

# Docker

Sigmie ships a `docker-compose.yml` that runs Elasticsearch (or OpenSearch) together with local embedding and reranking services. The Infinity-based services let you build semantic search without paid APIs in development.

## Start everything

```bash
docker-compose up -d
```

The first start downloads models and takes 5–10 minutes. After that, restarts are quick.

> **Note:** Elasticsearch and OpenSearch both bind port 9200. Start one or the other, not both.

## Services

| Service | Port | Model | Purpose |
|---------|------|-------|---------|
| `elasticsearch` | 9200 | — | Elasticsearch 9.1.3, security disabled |
| `opensearch` | 9200 | — | OpenSearch 3.0 with default admin auth |
| `embeddings` | 7997 | BAAI/bge-small-en-v1.5 (384-dim) | Text embeddings |
| `reranker` | 7998 | cross-encoder/ms-marco-MiniLM-L-6-v2 | Result reranking |
| `image-embeddings` | 7996 | TinyCLIP ViT-8M-16 | Image/text embeddings |
| `llm` | 7999 | Ollama (app-side only) | Optional; Sigmie has no LLM client |

## Start only what you need

```bash
# Minimal: keyword search only
docker-compose up -d elasticsearch

# Add semantic search
docker-compose up -d elasticsearch embeddings

# Semantic search + reranking
docker-compose up -d elasticsearch embeddings reranker

# Image search
docker-compose up -d elasticsearch image-embeddings
```

## Connect Sigmie to the local services

Register the local embeddings service with Sigmie:

```php
use Sigmie\AI\APIs\InfinityEmbeddingsApi;

$sigmie->registerApi('embeddings', new InfinityEmbeddingsApi(
    baseUrl: 'http://localhost:7997',
    model: 'BAAI/bge-small-en-v1.5',
));
```

Register the reranker:

```php
use Sigmie\AI\APIs\InfinityRerankApi;

$sigmie->registerApi('my-rerank', new InfinityRerankApi(
    baseUrl: 'http://localhost:7998',
    model: 'cross-encoder/ms-marco-MiniLM-L-6-v2',
));

$response = $sigmie->newSearch('docs')
    ->properties($props)
    ->queryString('return policy')
    ->get();

$reranked = $response->rerank('my-rerank', ['content']);
```

See [Semantic Search](semantic-search.md) for using these in mappings, and [Retrieval and Agents](rag.md) for combining them with generation.

## Connect to Elasticsearch

```php
use Sigmie\Sigmie;

$sigmie = Sigmie::create(hosts: ['127.0.0.1:9200']);
```

## Connect to OpenSearch

```php
use Sigmie\Sigmie;
use Sigmie\Enums\SearchEngineType;

$sigmie = Sigmie::create(
    hosts: ['https://localhost:9200'],
    engine: SearchEngineType::OpenSearch,
    config: [
        'auth' => ['admin', 'MyStrongPass123!@#'],
        'verify' => false,
    ]
);
```

## Environment variables

Copy `.env.example` to `.env` to customize service URLs:

```ini
LOCAL_EMBEDDING_URL=http://localhost:7997
LOCAL_RERANK_URL=http://localhost:7998
LOCAL_CLIP_URL=http://localhost:7996
```

For cloud API keys:

```ini
OPENAI_API_KEY=sk-...
VOYAGE_API_KEY=pa-...
COHERE_API_KEY=...
MIXEDBREAD_API_KEY=...
```

Sigmie itself doesn't read these — your application registers the API clients via `registerApi()`.

## Health checks

```bash
docker-compose ps                                  # all services
curl http://localhost:7997/health                  # embeddings
curl http://localhost:7998/health                  # reranker
curl http://localhost:7996/health                  # image embeddings
curl http://localhost:9200/_cluster/health         # Elasticsearch
curl -u 'admin:MyStrongPass123!@#' -k \
    https://localhost:9200/_cluster/health         # OpenSearch
```
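
Because the first start can take several minutes, a setup script may want to wait for a service before indexing. The following is a minimal sketch: both functions are hypothetical helpers, and `httpOk()` assumes `allow_url_fopen` is enabled.

```php
/**
 * Polls $probe until it returns true or the attempts run out.
 * Hypothetical helper for scripts that must wait for the containers.
 *
 * @param callable(): bool $probe
 */
function waitUntilHealthy(callable $probe, int $attempts = 30, int $delaySeconds = 10): bool
{
    for ($i = 0; $i < $attempts; $i++) {
        if ($probe()) {
            return true;
        }

        // No need to sleep after the final attempt.
        if ($i < $attempts - 1) {
            sleep($delaySeconds);
        }
    }

    return false;
}

/** Returns true when the URL answers with a successful HTTP response. */
function httpOk(string $url): bool
{
    $context = stream_context_create(['http' => ['timeout' => 2]]);

    // file_get_contents returns false on connection errors and HTTP error statuses.
    return @file_get_contents($url, false, $context) !== false;
}

// Usage: waitUntilHealthy(fn (): bool => httpOk('http://localhost:7997/health'));
```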

## Data persistence

All data lives in `./data/`:

```
./data/
├── embeddings/         # downloaded model
├── reranker/           # downloaded model
├── image-embeddings/   # downloaded model
├── llm/                # Ollama models (if used)
├── elasticsearch/      # indices and documents
└── opensearch/         # indices and documents
```

Reset everything:

```bash
docker-compose down -v
rm -rf ./data/
```

> **Warning:** This deletes all indices, documents, and downloaded models.

## Logs

```bash
docker-compose logs -f                # all services, follow
docker-compose logs embeddings        # one service
docker-compose logs elasticsearch
```

## Resource budget

| Service | RAM | Disk |
|---------|-----|------|
| Embeddings | 1–2 GB | ~500 MB |
| Reranker | 1–2 GB | ~400 MB |
| Image embeddings | 1–2 GB | ~300 MB |
| Elasticsearch | 2–4 GB | varies |
| OpenSearch | 2–4 GB | varies |

For Elasticsearch + embeddings + reranker, give Docker at least 8 GB.
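
The 8 GB figure follows from adding up the upper bounds in the table. A short sketch of that arithmetic, with `ramBudget()` as a hypothetical helper:

```php
// RAM ranges in GB as [min, max], copied from the table above.
$ram = [
    'elasticsearch' => [2, 4],
    'opensearch' => [2, 4],
    'embeddings' => [1, 2],
    'reranker' => [1, 2],
    'image-embeddings' => [1, 2],
];

/**
 * Sums the minimum and maximum RAM for the selected services.
 *
 * @param array<string, array{int, int}> $ram
 * @param list<string> $services
 * @return array{int, int} [min, max] in GB
 */
function ramBudget(array $ram, array $services): array
{
    $min = 0;
    $max = 0;

    foreach ($services as $service) {
        [$low, $high] = $ram[$service];
        $min += $low;
        $max += $high;
    }

    return [$min, $max];
}

// 2 + 1 + 1 = 4 GB at the low end, 4 + 2 + 2 = 8 GB at the high end.
$budget = ramBudget($ram, ['elasticsearch', 'embeddings', 'reranker']);
```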

## Troubleshooting

**Port 9200 already in use.** Stop the engine you're not using:

```bash
docker-compose stop elasticsearch
docker-compose up -d opensearch
```

**Embeddings service won't start.** Check the logs — the first start downloads the model:

```bash
docker-compose logs embeddings
```

Wait for "Model loaded successfully."

**Out of memory.** Allocate more RAM in Docker Desktop preferences.


---

<!-- source: https://sigmie.com/docs/v2/opensearch -->

# OpenSearch

Sigmie supports OpenSearch 2.x and 3.x — including AWS OpenSearch Service — with the same API as Elasticsearch. You change one parameter and everything else continues to work.

## Connect

Specify the engine type:

```php
use Sigmie\Sigmie;
use Sigmie\Enums\SearchEngineType;

$sigmie = Sigmie::create(
    hosts: ['https://localhost:9200'],
    engine: SearchEngineType::OpenSearch,
    config: [
        'auth' => ['admin', 'MyStrongPass123!@#'],
        'verify' => false,    // self-signed cert in dev
    ]
);
```

## Supported versions

- OpenSearch 2.4 – 2.11
- OpenSearch 3.0+
- AWS OpenSearch Service

Sigmie adapts to the version automatically.

## Semantic search

Define semantic fields the same way as with Elasticsearch:

```php
use Sigmie\Mappings\NewProperties;

$props = new NewProperties;
$props->text('title')->semantic(dimensions: 384, api: 'embeddings');
$props->text('content')->semantic(dimensions: 384, accuracy: 3, api: 'embeddings');

$sigmie->newIndex('articles')->properties($props)->create();
```

Sigmie configures OpenSearch's KNN settings under the hood.

### Accuracy

| Level | Use case |
|-------|----------|
| 1 | Fastest indexing, large corpora |
| 2 | Balanced |
| 3 | Recommended default |
| 4–5 | Highest precision |

### Similarity metrics

```php
use Sigmie\Enums\VectorSimilarity;

$props->text('description')->semantic(
    dimensions: 384,
    similarity: VectorSimilarity::Cosine,        // default
);

$props->text('abstract')->semantic(
    dimensions: 384,
    similarity: VectorSimilarity::DotProduct,
);

$props->text('content')->semantic(
    dimensions: 384,
    similarity: VectorSimilarity::Euclidean,
);
```
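
For intuition about what the three metrics measure, here are plain PHP versions of the underlying formulas. This is a sketch of the math only: the engine may rescale these values into scores, and the function names are hypothetical.

```php
/**
 * @param list<int|float> $a
 * @param list<int|float> $b
 */
function dotProduct(array $a, array $b): float
{
    if (count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must have the same length.');
    }

    $sum = 0.0;
    foreach ($a as $i => $value) {
        $sum += $value * $b[$i];
    }

    return $sum;
}

/** Cosine of the angle between the vectors: 1 = same direction, 0 = orthogonal. */
function cosineSimilarity(array $a, array $b): float
{
    $norm = sqrt(dotProduct($a, $a)) * sqrt(dotProduct($b, $b));
    if ($norm === 0.0) {
        throw new InvalidArgumentException('Cosine similarity is undefined for zero vectors.');
    }

    return dotProduct($a, $b) / $norm;
}

/** Straight-line distance between the vectors: smaller means more similar. */
function euclideanDistance(array $a, array $b): float
{
    if (count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must have the same length.');
    }

    $sum = 0.0;
    foreach ($a as $i => $value) {
        $sum += ($value - $b[$i]) ** 2;
    }

    return sqrt($sum);
}

$distance = euclideanDistance([0, 0], [3, 4]); // 3-4-5 triangle
```

Cosine ignores vector length, dot product rewards it, and Euclidean distance penalizes any offset. With normalized embeddings, cosine and dot product rank results identically.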

## Searching

Search is identical to Elasticsearch:

```php
$results = $sigmie->newSearch('articles')
    ->properties($props)
    ->semantic()
    ->queryString('quantum computing')
    ->size(20)
    ->get();
```

## Migrate from Elasticsearch

Change one parameter:

```php
// Before
$sigmie = Sigmie::create(hosts: ['https://localhost:9200']);

// After
$sigmie = Sigmie::create(
    hosts: ['https://localhost:9200'],
    engine: SearchEngineType::OpenSearch,
);
```

The rest of your code is unchanged.

## AWS OpenSearch Service

For username/password auth:

```php
$sigmie = Sigmie::create(
    hosts: ['https://your-domain.region.es.amazonaws.com'],
    engine: SearchEngineType::OpenSearch,
    config: ['auth' => ['username', 'password']],
);
```

For IAM-signed requests, see the AWS section of [Connection Setup](connection.md).


---

<!-- source: https://sigmie.com/docs/v2/laravel-scout -->

# Laravel Scout

`sigmie/elasticsearch-scout` is a [Laravel Scout](https://laravel.com/docs/scout) driver. It plugs into Scout's model lifecycle so writes and deletes flow into Elasticsearch automatically, and `Model::search()` runs through Sigmie's search builder.

## Install

Install Scout first:

```bash
composer require laravel/scout
```

Publish its config:

```bash
php artisan vendor:publish --provider="Laravel\Scout\ScoutServiceProvider"
```

Install the Sigmie driver:

```bash
composer require sigmie/elasticsearch-scout
```

Set the Scout driver in your `.env`:

```ini
SCOUT_DRIVER=elasticsearch
```

Publish the Sigmie config (optional, for customization):

```bash
php artisan vendor:publish --provider="Sigmie\ElasticsearchScout\ElasticsearchScoutServiceProvider"
```

This creates `config/elasticsearch-scout.php`:

```php
return [
    'hosts' => env('ELASTICSEARCH_HOSTS', '127.0.0.1:9200'),
    'auth' => [
        'type' => env('ELASTICSEARCH_AUTH_TYPE', 'none'),
        'user' => env('ELASTICSEARCH_USER', ''),
        'password' => env('ELASTICSEARCH_PASSWORD', ''),
        'token' => env('ELASTICSEARCH_TOKEN', ''),
        'headers' => [],
    ],
    'guzzle_config' => [
        'allow_redirects' => false,
        'http_errors' => false,
        'connect_timeout' => 15,
    ],
    'index-settings' => [
        'shards' => env('ELASTICSEARCH_INDEX_SHARDS', 1),
        'replicas' => env('ELASTICSEARCH_INDEX_REPLICAS', 2),
    ],
];
```

## Make a model searchable

Use **Sigmie's** `Searchable` trait instead of Laravel Scout's. They have the same name; the Sigmie version adds the methods Sigmie needs:

```php
use Illuminate\Database\Eloquent\Model;
use Sigmie\ElasticsearchScout\Searchable;
use Sigmie\Mappings\NewProperties;

class Movie extends Model
{
    use Searchable;

    public function elasticsearchProperties(NewProperties $properties): void
    {
        $properties->title('title');
        $properties->name('director');
        $properties->category('genre');
        $properties->date('created_at');
        $properties->date('updated_at');
    }
}
```

`elasticsearchProperties()` is required — it defines the index schema for this model.

## Build the index

```bash
php artisan scout:index "App\Models\Movie"
```

## Import existing rows

```bash
php artisan scout:import "App\Models\Movie"
```

## Update the index settings

Unlike other Scout drivers, Sigmie requires the model name when re-syncing:

```bash
php artisan scout:sync-index-settings "App\Models\Movie"
```

This re-applies your `elasticsearchProperties()` and `elasticsearchIndex()` configuration.

## Customize the search

Define `elasticsearchSearch()` to use any [`NewSearch`](search.md) feature:

```php
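use Illuminate\Database\Eloquent\Model;
use Sigmie\ElasticsearchScout\Searchable;
use Sigmie\Mappings\NewProperties;
use Sigmie\Search\NewSearch;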

class Movie extends Model
{
    use Searchable;

    public function elasticsearchProperties(NewProperties $props): void
    {
        $props->title('title');
        $props->name('director');
        $props->category('genre');
    }

    public function elasticsearchSearch(NewSearch $search): void
    {
        $search->typoTolerance();
        $search->typoTolerantAttributes(['title', 'director']);
        $search->retrieve(['title', 'director']);
        $search->fields(['title', 'director']);
        $search->highlighting(
            ['title', 'director'],
            '<span class="font-bold">',
            '</span>',
        );
    }
}
```

For one-off customization, pass a closure to `Model::search()`:

```php
use Sigmie\Search\NewSearch;

Movie::search('Star Wars', function (NewSearch $search) {
    $search->weight(['title' => 5]);
})->get();
```

## Customize index analysis

```php
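use Illuminate\Database\Eloquent\Model;
use Sigmie\ElasticsearchScout\Searchable;
use Sigmie\Index\NewIndex;
use Sigmie\Mappings\NewProperties;
use Sigmie\Search\NewSearch;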

class Movie extends Model
{
    use Searchable;

    public function elasticsearchProperties(NewProperties $props): void { /* ... */ }
    public function elasticsearchSearch(NewSearch $search): void { /* ... */ }

    public function elasticsearchIndex(NewIndex $index): void
    {
        $index->tokenizeOnWordBoundaries()
            ->lowercase()
            ->trim()
            ->shards(3)
            ->replicas(3);
    }
}
```

If you don't define `elasticsearchIndex()`, Sigmie defaults to tokenizing on word boundaries, lowercasing, and trimming.

## Accessing hit metadata

Each model returned by Scout carries the raw Elasticsearch hit on `$model->hit`:

```php
$movie = Movie::search('Star Wars')->get()->first();

$movie->hit['_score'];                          // 32.343453
$movie->hit['highlight']['title'][0];           // <span class="font-bold">Star Wars</span>
```

## Date formatting

Sigmie expects dates in `Y-m-d H:i:s.u`. Laravel's default `toSearchableArray` converts Eloquent timestamps automatically; if you override `toSearchableArray()`, do the conversion yourself:

```php
public function toSearchableArray(): array
{
    $array = $this->toArray();

    $array['created_at'] = $this->created_at?->format('Y-m-d H:i:s.u');
    $array['updated_at'] = $this->updated_at?->format('Y-m-d H:i:s.u');

    return $array;
}
```

Or use a different format and tell Sigmie about it:

```php
public function elasticsearchProperties(NewProperties $props): void
{
    $props->date('created_at')->format('MM/dd/yyyy');
    $props->date('updated_at')->format('MM/dd/yyyy');
}
```
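
Note that the two snippets above use different pattern languages: `Y-m-d H:i:s.u` is a PHP date pattern, while `MM/dd/yyyy` looks like an Elasticsearch (Java-style) pattern. Assuming that reading is right, the values you index need the matching PHP pattern, as this sketch with a fixed sample timestamp shows:

```php
// Fixed sample timestamp so the output is predictable.
$createdAt = new DateTimeImmutable('2024-01-15 10:30:00.123456', new DateTimeZone('UTC'));

// PHP pattern for the default format described above.
$default = $createdAt->format('Y-m-d H:i:s.u'); // 2024-01-15 10:30:00.123456

// PHP equivalent of the Elasticsearch pattern MM/dd/yyyy.
$custom = $createdAt->format('m/d/Y');          // 01/15/2024
```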

## Authentication

### Basic auth

```ini
ELASTICSEARCH_AUTH_TYPE=basic
ELASTICSEARCH_USER=elastic
ELASTICSEARCH_PASSWORD=your-password
```

### Bearer token

```ini
ELASTICSEARCH_AUTH_TYPE=token
ELASTICSEARCH_TOKEN=your-token-here
```

### Custom headers

If you need API keys or custom auth headers, populate `'headers'` in `config/elasticsearch-scout.php`:

```php
'headers' => [
    'X-App-Token' => env('APP_TOKEN'),
    'Authorization' => 'ApiKey ' . env('ELASTICSEARCH_API_KEY'),
],
```

## Multiple hosts

```ini
ELASTICSEARCH_HOSTS=10.0.0.1:9200,10.0.0.2:9200,10.0.0.3:9200
```

## Guzzle configuration

Tune the underlying HTTP client in `config/elasticsearch-scout.php`:

```php
'guzzle_config' => [
    'allow_redirects' => false,
    'http_errors' => false,
    'connect_timeout' => 15,
    'timeout' => 30,
],
```

See [Connection Setup](connection.md) for the full list of Guzzle options Sigmie understands.

## See also

- [Search](search.md) — every option available in `elasticsearchSearch()`.
- [Mappings & Properties](mappings.md) — types available in `elasticsearchProperties()`.
- [Indices](index.md) — analysis options for `elasticsearchIndex()`.
- [Laravel AI SDK](laravel-ai.md) — expose Scout-indexed models as AI agent tools.


---

<!-- source: https://sigmie.com/docs/v2/laravel-ai -->

# Laravel AI SDK

`SigmieIndexTool` exposes a Sigmie index as a [Laravel AI SDK](https://laravel.com/docs/ai-sdk) tool. The AI agent gets full access to your search builder — query, filters, sorts, facets, pagination — with a description auto-generated from your property definitions.

## Quick start

```php
use Laravel\Ai\Contracts\Agent;
use Laravel\Ai\Contracts\HasTools;
use Laravel\Ai\Promptable;
use Sigmie\AI\SigmieIndexTool;

class ShoppingAssistant implements Agent, HasTools
{
    use Promptable;

    public function instructions(): string
    {
        return 'You help users find products in our catalog.';
    }

    public function tools(): array
    {
        return [
            new SigmieIndexTool(app(ProductIndex::class)),
        ];
    }
}
```

The agent now searches `products` end-to-end, with filtering, sorting, and facets.

## The `AsTool` trait

For convenience, add `AsTool` to your `SigmieIndex` subclass:

```php
use Sigmie\AI\AsTool;
use Sigmie\SigmieIndex;
use Sigmie\Mappings\NewProperties;

class ProductIndex extends SigmieIndex
{
    use AsTool;

    public function name(): string
    {
        return 'products';
    }

    public function properties(): NewProperties
    {
        $props = new NewProperties;
        $props->name('name');
        $props->category('brand');
        $props->number('price');
        $props->bool('in_stock');
        return $props;
    }
}
```

Now `toTool()` builds the agent tool:

```php
public function tools(): array
{
    return [
        app(ProductIndex::class)->toTool(),
    ];
}
```

## Base filters

Pass `baseFilters` to scope every query the AI makes. This is how you enforce multi-tenancy or per-user authorization — the AI can't bypass these filters and can't see them in its tool description:

```php
new SigmieIndexTool(
    app(OrderIndex::class),
    baseFilters: "user_id:{$user->id}",
);

// Or via the trait:
app(OrderIndex::class)->toTool(baseFilters: "user_id:{$user->id}");
```

Base filters are wrapped in parentheses and AND-ed with whatever the AI passes:

```
(user_id:3) AND (status:'shipped' OR status:'delivered')
```
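
The wrapping can be sketched as a plain string operation. This illustrates the documented behaviour and is not `SigmieIndexTool`'s actual code; `scopeFilter()` is a hypothetical name.

```php
/**
 * Combines a base filter with an agent-supplied filter so that the base
 * filter always applies. Sketch only, not SigmieIndexTool's actual code.
 */
function scopeFilter(string $baseFilter, string $agentFilter): string
{
    $baseFilter = trim($baseFilter);
    $agentFilter = trim($agentFilter);

    // With no agent filter, the base filter is the whole expression.
    if ($agentFilter === '') {
        return "({$baseFilter})";
    }

    return "({$baseFilter}) AND ({$agentFilter})";
}

$userId = 3; // keep IDs typed as integers before interpolating them
$filter = scopeFilter("user_id:{$userId}", "status:'shipped' OR status:'delivered'");
```

The parentheses matter: without them, `user_id:3 AND status:'shipped' OR status:'delivered'` could match delivered orders from any user. Concatenation alone is only safe when the agent's filter is a complete, balanced expression, which this sketch does not verify.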

## The auto-generated description

The tool description is built from your properties. For:

```php
$props->name('name');
$props->category('brand');
$props->number('price');
$props->bool('in_stock');
$props->date('created_at');
```

The AI sees:

```
Search the 'products' index.

Available fields:
- name [text] (sortable, facetable): name:'value' name:['a','b']
- brand [text] (sortable, facetable): brand:'value' brand:['a','b']
- price [number] (sortable, facetable): price>n price<=n price:min..max
- in_stock [boolean] (sortable): in_stock:true in_stock:false
- created_at [date] (sortable): created_at>'2024-01-01' created_at<'2024-12-31'

Filter operators: AND, OR, AND NOT
Negation: NOT field:'value'
Grouping: (field:'a' OR field:'b') AND other>10
Exists check: field:*
Sort: field:asc field:desc _score (space-separated)
Geo sort: field[lat,lon]:km:asc
Facets: field1 field2:20 (space-separated, optional :size for keywords or :interval for numbers)
```

## Tool parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `query` | string (required) | Search query text. |
| `filters` | string | Filter expression. |
| `sort` | string | Sort expression. |
| `facets` | string | Space-separated facet fields. |
| `facet_filters` | string | Active facet filter values. |
| `per_page` | int (default 10) | Results per page. |
| `page` | int (default 1) | Page number. |
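
The tool's internals are not shown here. Assuming `page` and `per_page` translate to an offset and a size the way most search APIs do, the arithmetic looks like this (`pageWindow()` is a hypothetical helper):

```php
/**
 * Converts a 1-based page number into an offset and size.
 * Assumed mapping; the tool's real implementation may differ.
 *
 * @return array{from: int, size: int}
 */
function pageWindow(int $page = 1, int $perPage = 10): array
{
    // Clamp invalid input from the agent to sensible minimums.
    $page = max(1, $page);
    $perPage = max(1, $perPage);

    return ['from' => ($page - 1) * $perPage, 'size' => $perPage];
}

$window = pageWindow(page: 3, perPage: 10); // skips the first 20 results
```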

## Filter syntax

Filters use the [Filter Parser](filter-parser.md). Quick reference by field type:

### Keyword

```
brand:'toyota'
brand:['toyota','honda','ford']
brand:toy*
```

### Number / price

```
price>100
price<=50
price:10..100
```

### Boolean

```
in_stock:true
in_stock:false
```

### Date

```
created_at>'2024-01-01'
created_at<'2024-12-31'
```

### Geo

```
location:10km[40.71,-74.00]
```

### Nested

```
variants:{color:'red' AND size>10}
```

### Object

```
meta.author:'John'
```

### Combining

```
brand:'toyota' AND price:10000..50000
(brand:'toyota' OR brand:'honda') AND in_stock:true
NOT status:'discontinued'
brand:'toyota' AND NOT color:'red'
```
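
When your own code builds these strings, for example for `baseFilters`, small helpers keep the syntax consistent. A minimal sketch follows: both functions are hypothetical, and because quote escaping is not covered on this page, values containing a single quote are rejected rather than escaped.

```php
/**
 * Builds a keyword list filter such as brand:['toyota','honda'].
 *
 * @param list<string> $values
 */
function keywordIn(string $field, array $values): string
{
    $quoted = array_map(function (string $value): string {
        // Escaping rules are not documented here, so refuse quotes outright.
        if (str_contains($value, "'")) {
            throw new InvalidArgumentException('Values must not contain single quotes.');
        }

        return "'{$value}'";
    }, $values);

    return $field . ':[' . implode(',', $quoted) . ']';
}

/** Builds an inclusive range filter such as price:10..100. */
function numberRange(string $field, int|float $min, int|float $max): string
{
    return "{$field}:{$min}..{$max}";
}

$filter = keywordIn('brand', ['toyota', 'honda'])
    . ' AND '
    . numberRange('price', 10000, 50000);
```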

## Sort

Space-separated, optional `:asc` / `:desc`:

```
price:asc
created_at:desc price:asc
_score
```

Geo:

```
location[40.71,-74.00]:km:asc
```

## Facets

```
brand
brand:20 price:50
```

When facets are requested, the tool response includes a `facets` key with the aggregation data.

## See also

- [Filter Parser](filter-parser.md) — every filter operator.
- [Sort Parser](sort-parser.md) — sort expression syntax.
- [Facets](facets.md) — facet behavior and structure.
- [Search](search.md) — the underlying search builder.
- [MCP Server](mcp.md) — connect AI agents to Sigmie's own documentation.


---

<!-- source: https://sigmie.com/docs/v2/mcp -->

# MCP Server

Sigmie runs a remote [MCP (Model Context Protocol)](https://modelcontextprotocol.io) server that gives AI agents semantic search across the full Sigmie documentation. Any MCP-compatible client — Claude Code, Cursor, Windsurf, or a custom agent — can search, browse, and read these docs without leaving the editor.

## Add the server

In your project's `.mcp.json`:

```json
{
  "mcpServers": {
    "sigmie-docs": {
      "type": "http",
      "url": "https://sigmie.com/mcp"
    }
  }
}
```

Or add it globally in `~/.claude.json` so it's available in every project.

Restart your agent. Three tools become available:

- **`search_docs`** — semantic search across all documentation.
- **`read_doc`** — read a specific page in full.
- **`list_docs`** — list every available page.

## How it works

```
AI agent (Claude Code, Cursor, etc.)
   │
   │  HTTPS (Streamable HTTP)
   ▼
sigmie.com/mcp ──► Node.js MCP server
                       │
          ┌────────────┼────────────┐
          │            │            │
    search_docs   read_doc     list_docs
          │            │            │
          ▼            ▼            ▼
    Elasticsearch    docs/*.md    docs/*.md
    hybrid search    (full page)  (file list)
    (649 sections)
```

`search_docs` runs a hybrid query — keyword + 384-dim vectors — against an Elasticsearch index built from the docs.

## Available tools

### `search_docs`

```
search_docs({ query: "how to configure semantic search" })
```

Returns the top 10 matching sections with title, page slug, version, URL, and content. Ranked by a combination of keyword and vector scores.

### `read_doc`

```
read_doc({ page: "search", version: "v2" })
```

Returns the full Markdown of a page. Use after `search_docs` to pull complete context — including code examples — into the agent's working memory.

### `list_docs`

```
list_docs({ version: "v2" })
```

Returns every page slug for a version. Useful for the agent to discover what's available.

## Use cases

### AI-assisted development

Your coding assistant can look up the exact API while writing code:

```
"How do I add typo tolerance?"
→ search_docs returns the relevant section with code
→ The agent adapts the example to your codebase
```

### Onboarding

New developers ask natural-language questions and get the right doc section back instead of browsing the table of contents.

### Self-configuring agents

If you build an AI agent on top of Sigmie (using the [Laravel AI SDK](laravel-ai.md)), the MCP server helps the agent understand its own search tool.

## Client configuration

### Claude Code (project)

`.mcp.json` in the project root:

```json
{
  "mcpServers": {
    "sigmie-docs": {
      "type": "http",
      "url": "https://sigmie.com/mcp"
    }
  }
}
```

### Claude Code (global)

`~/.claude.json`:

```json
{
  "mcpServers": {
    "sigmie-docs": {
      "type": "http",
      "url": "https://sigmie.com/mcp"
    }
  }
}
```

### Cursor

In Cursor's MCP settings:

```json
{
  "mcpServers": {
    "sigmie-docs": {
      "url": "https://sigmie.com/mcp"
    }
  }
}
```

## See also

- [Laravel AI SDK](laravel-ai.md) — expose your own Sigmie indices as agent tools.
- [Semantic Search](semantic-search.md) — the same technology powers the MCP server.
- [Retrieval and Agents](rag.md) — build retrieval-augmented generation with Sigmie.


---

<!-- source: https://sigmie.com/docs/v2/packages -->

# Packages

`sigmie/sigmie` is a meta-package that pulls in everything you need for typical use. If you want a leaner install — for example, you only need the filter parser, or you're building a tool that uses just the HTTP client — you can require individual packages directly.

## Standard installation

```bash
composer require sigmie/sigmie
```

This installs everything below as transitive dependencies. Most applications start (and stay) here.

## Individual packages

| Package | Purpose |
|---------|---------|
| `sigmie/base` | Driver abstractions for Elasticsearch and OpenSearch. |
| `sigmie/http` | HTTP client built on Guzzle, with auth and multi-host support. |
| `sigmie/index` | Index builders, analyzers, and language modules. |
| `sigmie/document` | The `Document` and `AliveCollection` classes. |
| `sigmie/mappings` | Property types and the `NewProperties` builder. |
| `sigmie/query` | The low-level query builder (`NewQuery`). |
| `sigmie/search` | The high-level search builder (`NewSearch`). |
| `sigmie/parse` | Filter and sort string parsers. |
| `sigmie/testing` | Test utilities and assertions. |
| `sigmie/english` | English language analyzers and filters. |
| `sigmie/german` | German language analyzers and filters. |
| `sigmie/greek` | Greek language analyzers and filters. |

```bash
composer require sigmie/parse           # filter/sort parsing only
composer require sigmie/mappings        # build property mappings
composer require sigmie/search          # high-level search
composer require sigmie/testing         # test helpers
```

## Integration packages

Separately maintained:

| Package | Purpose |
|---------|---------|
| `sigmie/elasticsearch-scout` | Laravel Scout driver. |

```bash
composer require sigmie/elasticsearch-scout
```

See [Laravel Scout](laravel-scout.md).

## Extension packages

For shipping field types and document-processing hooks, see [Extending Sigmie](extending.md). Each external package registers itself on a `Sigmie` instance via `$sigmie->extend()`.


---

<!-- source: https://sigmie.com/docs/v2/rag -->

# Retrieval and Agents

Sigmie is a **retrieval and indexing** library. It gives you:

- Indices and mappings.
- Keyword and semantic search.
- Reranking on search responses.
- Embeddings as a first-class field type.

It does **not** ship:

- An LLM client.
- A prompt builder.
- A RAG orchestrator (no single "search → context → model → answer" API).

For text generation, use your preferred HTTP client, vendor SDK, or framework. The application code below shows the pattern.

## What stays in Sigmie

| Area | Sigmie API |
|------|------------|
| Retrieval | `newSearch()`, `newMultiSearch()`, `newQuery()`, `newRecommend()` |
| Reranking | `$response->rerank(...)` with a registered `RerankApi` |
| Embeddings | `EmbeddingsApi` + `->semantic()` on text fields |
| Taxonomy tags | Optional [Magic Tags](magic-tags.md) package |

## The pattern: retrieve, rerank, generate

```php
use Sigmie\AI\APIs\OpenAIEmbeddingsApi;
use Sigmie\AI\APIs\CohereRerankApi;

$sigmie->registerApi('embeddings', new OpenAIEmbeddingsApi('sk-...'));
$sigmie->registerApi('reranker', new CohereRerankApi('co-...'));

// 1. Retrieve.
$response = $sigmie->newSearch('docs')
    ->properties($props)
    ->semantic()
    ->queryString('What is your return policy?')
    ->size(20)
    ->get();

// 2. Rerank.
$top5 = $response->rerank('reranker', ['content'], topK: 5);

// 3. Generate (your code, not Sigmie's).
$context = collect($top5)->pluck('_source.content')->implode("\n\n");

$answer = $yourOpenAiClient->chat([
    ['role' => 'system', 'content' => 'Answer using only the provided context.'],
    ['role' => 'user', 'content' => "Context:\n{$context}\n\nQuestion: What is your return policy?"],
]);
```
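
The example above uses Laravel's `collect()` helper. Outside Laravel, the context string can be assembled with plain PHP. This sketch assumes each hit exposes its text under `_source.content`, as the `pluck()` call suggests; `buildContext()` and the byte budget are hypothetical.

```php
/**
 * Joins hit contents into one context string, stopping before the byte
 * budget is exceeded. Hypothetical helper; the hit shape is an assumption.
 *
 * @param list<array<string, mixed>> $hits
 */
function buildContext(array $hits, int $maxBytes = 4000): string
{
    $parts = [];
    $used = 0;

    foreach ($hits as $hit) {
        $content = trim($hit['_source']['content'] ?? '');
        if ($content === '') {
            continue; // skip hits without usable text
        }

        $length = strlen($content);
        if ($used + $length > $maxBytes) {
            break; // keep the highest-ranked passages, drop the rest
        }

        $parts[] = $content;
        $used += $length;
    }

    return implode("\n\n", $parts);
}

$hits = [
    ['_source' => ['content' => 'Returns within 30 days.']],
    ['_source' => ['content' => '']],
    ['_source' => ['content' => 'Refunds take 5 days.']],
];

$context = buildContext($hits);
```

Capping the context keeps the prompt within the model's input limit, and because the hits arrive reranked, truncation drops the least relevant passages first.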

## Reranking

`$response->rerank()` accepts either a registered API name or a concrete `RerankApi`:

```php
$response->rerank('reranker', ['content']);
$response->rerank('reranker', ['title', 'content'], topK: 3);
$response->rerank('reranker', ['content'], 'return policy');
```

The signature is:

```php
rerank(
    RerankApi|string $reranker,
    array $fields,
    ?string $query = null,        // defaults to the search's query string
    ?int $topK = null,            // defaults to the search's size
): array;
```
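
Conceptually, a reranker rescores each hit against the query, sorts by the new score, and keeps the top K. The toy below shows that flow with a stand-in scoring callable; a real `RerankApi` calls a cross-encoder model, and `rerankHits()` is a hypothetical function, not Sigmie's implementation.

```php
/**
 * Toy reranker: scores each hit, sorts descending, keeps $topK.
 * The $score callable stands in for a real reranking model.
 *
 * @param list<array<string, mixed>> $hits
 * @param callable(array<string, mixed>): (int|float) $score
 * @return list<array<string, mixed>>
 */
function rerankHits(array $hits, callable $score, ?int $topK = null): array
{
    $scored = array_map(
        fn (array $hit): array => ['hit' => $hit, 'score' => $score($hit)],
        $hits
    );

    // Highest score first; usort is stable in PHP 8, so ties keep their order.
    usort($scored, fn (array $a, array $b): int => $b['score'] <=> $a['score']);

    $ranked = array_column($scored, 'hit');

    // In this sketch a null $topK keeps every hit.
    return array_slice($ranked, 0, $topK ?? count($ranked));
}

$hits = [
    ['id' => 1, 'content' => 'shipping times'],
    ['id' => 2, 'content' => 'return policy and return labels'],
    ['id' => 3, 'content' => 'how to return an item'],
];

// Stand-in scorer: counts occurrences of the word "return".
$top2 = rerankHits($hits, fn (array $hit): int => substr_count($hit['content'], 'return'), 2);
```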

For advanced cases, build a rerank manually with `Sigmie\Search\NewRerank`.

## Optional: conversation history

`Sigmie\AI\History\Index` is a standalone index for storing conversation turns. It uses embeddings for semantic recall, but stays decoupled from generation — your app reads from it before composing each prompt.

## See also

- [Semantic Search](semantic-search.md) — embeddings and similarity.
- [Recommendations](recommendations.md) — RRF and MMR for similar-item retrieval.
- [Magic Tags](magic-tags.md) — taxonomy tags backed by embeddings.
- [Laravel AI SDK](laravel-ai.md) — expose Sigmie indices as tools for an AI agent.
- [MCP Server](mcp.md) — Sigmie's docs as an MCP tool for AI coding assistants.


---

<!-- source: https://sigmie.com/docs/v2/extending -->

# Extending Sigmie

Sigmie has a single registration point for external packages. A package can add custom field-type builder methods to `NewProperties` and document-processing hooks that fire during `merge()` and `add()`.

The [Magic Tags](magic-tags.md) package is a real-world example: it adds a `magicTags()` builder and a `CollectionHook` that calls an LLM and writes to a sidecar index. Core Sigmie knows nothing about either — only the package does.

## Bootstrap a package

```php
use Sigmie\Sigmie;
use Vendor\MagicTags\MagicTagsPackage;

$sigmie = new Sigmie($connection);
$sigmie->extend(new MagicTagsPackage());
```

`extend()` calls the package's `register()` immediately and binds any hooks the package adds to **this** `Sigmie` instance, not to process-wide static state. Two clients in the same PHP process can have different extensions registered.

> **Note:** `NewProperties` macros are process-global. In tests that need isolation, call `NewProperties::flushMacros()` between cases.
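
The difference between instance-bound hooks and process-global macros can be illustrated without Sigmie at all. The sketch below uses two hypothetical stand-in classes, `FakeClient` and `FakeProperties`, and assumes the macro registry is a static array (as in Laravel's `Macroable` trait). It is not Sigmie's actual implementation.

```php
// Stand-in for NewProperties: macros live in a static, per-process registry.
final class FakeProperties
{
    /** @var array<string, callable> */
    private static array $macros = [];

    public static function macro(string $name, callable $macro): void
    {
        self::$macros[$name] = $macro;
    }

    public static function hasMacro(string $name): bool
    {
        return isset(self::$macros[$name]);
    }

    public static function flushMacros(): void
    {
        self::$macros = [];
    }
}

// Stand-in for Sigmie: hooks live on the instance.
final class FakeClient
{
    /** @var list<object> */
    private array $hooks = [];

    public function addCollectionHook(object $hook): void
    {
        $this->hooks[] = $hook;
    }

    public function hookCount(): int
    {
        return count($this->hooks);
    }
}

$a = new FakeClient();
$b = new FakeClient();

// "Extend" only client A: the hook stays on A, the macro is visible everywhere.
$a->addCollectionHook(new stdClass());
FakeProperties::macro('magicTags', fn () => null);

$hooksOnA = $a->hookCount();
$hooksOnB = $b->hookCount();
$macroVisibleBeforeFlush = FakeProperties::hasMacro('magicTags');

// What a test teardown would do to restore isolation between cases.
FakeProperties::flushMacros();
$macroVisibleAfterFlush = FakeProperties::hasMacro('magicTags');
```

Client B never sees A's hook, but a macro registered on behalf of A is callable from any `FakeProperties` until it is flushed.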

## The `Package` interface

A package implements `Sigmie\Contracts\Package`:

```php
namespace Vendor\MagicTags;

use Sigmie\Contracts\Package;
use Sigmie\Mappings\NewProperties;
use Sigmie\Sigmie;

class MagicTagsPackage implements Package
{
    public function register(Sigmie $sigmie): void
    {
        NewProperties::macro('magicTags', /* ... */);
        $sigmie->addCollectionHook(new MagicTagsCollectionHook());
    }
}
```

`register()` runs once per `extend()` call.

## Field-type macros

`NewProperties::macro()` adds a method that behaves like a built-in field type:

```php
use Sigmie\Mappings\NewProperties;
use Vendor\MagicTags\Types\MagicTags;

NewProperties::macro('magicTags', function (string $name, string $fromField): MagicTags {
    $field = new MagicTags($name, $fromField);
    $this->add($name, $field);

    return $field;
});
```
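
Inside the macro closure, `$this` refers to the properties object the method was called on, which is why `$this->add()` works. A minimal sketch of how that kind of binding is commonly done with `Closure::bindTo()` follows. `TinyProperties` and `TinyField` are hypothetical stand-ins, and Sigmie's own mechanism may differ.

```php
final class TinyField
{
    public function __construct(public string $name, public string $fromField) {}
}

final class TinyProperties
{
    /** @var array<string, Closure> */
    private static array $macros = [];

    /** @var array<string, TinyField> */
    public array $fields = [];

    public static function macro(string $name, Closure $macro): void
    {
        self::$macros[$name] = $macro;
    }

    public function add(string $name, TinyField $field): void
    {
        $this->fields[$name] = $field;
    }

    public function __call(string $method, array $arguments): mixed
    {
        if (! isset(self::$macros[$method])) {
            throw new BadMethodCallException("Method {$method} does not exist.");
        }

        // Rebind the closure so that $this inside it is the current instance.
        $bound = self::$macros[$method]->bindTo($this, self::class);

        return $bound(...$arguments);
    }
}

TinyProperties::macro('magicTags', function (string $name, string $fromField): TinyField {
    $field = new TinyField($name, $fromField);
    $this->add($name, $field);

    return $field; // returning the field lets callers keep chaining on it
});

$props = new TinyProperties();
$field = $props->magicTags('topic', 'content');
```

Returning the field rather than the properties object is what allows chained configuration calls on the new field after the macro is invoked.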

After registration, callers use it like any native type:

```php
$props = new NewProperties;
$props->text('content')->semantic(api: 'embeddings', accuracy: 1, dimensions: 1024);
$props->magicTags('topic', fromField: 'content')
    ->api('llm')
    ->embeddingsApi('embeddings');
```

## Collection hooks

`CollectionHook` lets your package intervene around document indexing through `merge()` and `add()`. Register a hook on the same `Sigmie` instance:

```php
$sigmie->addCollectionHook(new MagicTagsCollectionHook());
```

A hook implements `Sigmie\Document\Contracts\CollectionHook`:

```php
namespace Vendor\MagicTags;

use Sigmie\Document\Contracts\CollectionHook;
use Sigmie\Mappings\Properties;
use Sigmie\Sigmie;
use Vendor\MagicTags\Types\MagicTags;

class MagicTagsCollectionHook implements CollectionHook
{
    public function shouldRun(Properties $properties): bool
    {
        return $properties->fieldsOfType(MagicTags::class)->isNotEmpty();
    }

    public function beforeBatch(
        string $indexName,
        Sigmie $sigmie,
        Properties $properties,
        array $apis
    ): void {
        // Ensure the sidecar index exists with the right mapping.
    }

    public function processBatch(
        array $documents,
        Properties $properties,
        array $apis
    ): array {
        // Classify, call the LLM, dedup tags. Return updated documents.
        return $documents;
    }

    public function afterBatch(
        array $documents,
        string $indexName,
        Sigmie $sigmie,
        Properties $properties,
        array $apis
    ): void {
        // Upsert (magic_field_path, tag) rows into the sidecar.
    }
}
```
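
The method names suggest a per-batch lifecycle: gate, prepare, transform, then follow up. The sketch below shows a driver that calls hooks in that order. `BatchHook`, `runBatch()`, `RecordingHook`, and the `$index` callback are hypothetical, the parameter lists are simplified, and Sigmie's real pipeline may sequence the calls differently.

```php
// Simplified stand-in for CollectionHook (fewer parameters than the real contract).
interface BatchHook
{
    public function shouldRun(array $properties): bool;
    public function beforeBatch(string $indexName): void;
    public function processBatch(array $documents): array;
    public function afterBatch(array $documents, string $indexName): void;
}

/**
 * Run every applicable hook around a single indexing call.
 *
 * @param list<BatchHook> $hooks
 * @param callable(array): void $index writes the documents to the index
 */
function runBatch(array $hooks, array $properties, string $indexName, array $documents, callable $index): array
{
    $active = array_filter($hooks, fn (BatchHook $hook) => $hook->shouldRun($properties));

    foreach ($active as $hook) {
        $hook->beforeBatch($indexName);
    }

    foreach ($active as $hook) {
        $documents = $hook->processBatch($documents); // each hook sees the previous hook's output
    }

    $index($documents);

    foreach ($active as $hook) {
        $hook->afterBatch($documents, $indexName);
    }

    return $documents;
}

// A tiny hook that records the order in which it is called.
final class RecordingHook implements BatchHook
{
    public array $calls = [];

    public function shouldRun(array $properties): bool
    {
        return in_array('magic_tags', $properties, true);
    }

    public function beforeBatch(string $indexName): void
    {
        $this->calls[] = 'beforeBatch';
    }

    public function processBatch(array $documents): array
    {
        $this->calls[] = 'processBatch';

        return array_map(fn (array $doc) => $doc + ['topic' => 'untagged'], $documents);
    }

    public function afterBatch(array $documents, string $indexName): void
    {
        $this->calls[] = 'afterBatch';
    }
}

$hook = new RecordingHook();
$indexed = [];

$result = runBatch([$hook], ['text', 'magic_tags'], 'kb', [['content' => 'hello']], function (array $docs) use (&$indexed) {
    $indexed = $docs;
});

// Without a matching field type, the hook is never invoked.
$skipped = new RecordingHook();
runBatch([$skipped], ['text'], 'kb', [['content' => 'hello']], fn (array $docs) => null);
```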

### `shouldRun()`

Gate the hook on whether the collection's properties contain your field type. This keeps the hook from firing on unrelated indices — including any sidecar indices your package creates:

```php
public function shouldRun(Properties $properties): bool
{
    return $properties->fieldsOfType(MagicTags::class)->isNotEmpty();
}
```

### The `$apis` array

`beforeBatch()`, `processBatch()`, and `afterBatch()` each receive `$apis`, a map of registered API name → instance, populated from `Sigmie::registerApi()` and per-collection `apis()`.

Core Sigmie only registers `EmbeddingsApi` and `RerankApi` implementations. If your package needs an LLM client, inject it in the package constructor (or resolve it from your application container):

```php
$embeddings = $apis['my-embeddings'] ?? null;    // EmbeddingsApi
$rerank = $apis['my-rerank'] ?? null;            // RerankApi
$llm = $this->llmClient;                         // your own dependency
```
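
Because `$apis` is a plain array keyed by name, a hook may prefer to fail fast with a clear message rather than carry `null` values around. The helper below is one possible sketch. `requireApi()` is a hypothetical function, and `EmbeddingsApiStandIn` is a local stand-in for whichever interface your real API object implements.

```php
/**
 * Fetch a named API from the $apis map, or throw a descriptive error.
 *
 * @param array<string, object> $apis
 * @param class-string $type
 */
function requireApi(array $apis, string $name, string $type): object
{
    $api = $apis[$name] ?? null;

    if ($api === null) {
        $known = implode(', ', array_keys($apis)) ?: '(none)';
        throw new InvalidArgumentException("API '{$name}' is not registered. Registered: {$known}.");
    }

    if (! $api instanceof $type) {
        throw new InvalidArgumentException("API '{$name}' must implement {$type}, got " . get_class($api) . '.');
    }

    return $api;
}

// Local stand-ins so the sketch runs on its own.
interface EmbeddingsApiStandIn {}
final class FakeEmbeddings implements EmbeddingsApiStandIn {}

$apis = ['my-embeddings' => new FakeEmbeddings()];

$embeddings = requireApi($apis, 'my-embeddings', EmbeddingsApiStandIn::class);

try {
    requireApi($apis, 'my-rerank', EmbeddingsApiStandIn::class);
    $error = null;
} catch (InvalidArgumentException $e) {
    $error = $e->getMessage();
}
```

Listing the registered names in the error message makes a mistyped API key easy to spot during indexing.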

## Skip hooks on demand

`withoutHooks()` indexes documents without running any registered hooks — useful when documents already carry the values your hook would generate:

```php
$sigmie->collect('kb')->withoutHooks()->merge($documents);
```
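
When only some documents already carry generated values, one option is to split the batch and skip hooks for just that part. The sketch below assumes documents are plain associative arrays, which is a simplification. `partitionByField()` is a hypothetical helper, not a Sigmie function.

```php
/**
 * Split documents into those that already carry $field and those that do not.
 *
 * @return array{0: list<array>, 1: list<array>} [already tagged, still untagged]
 */
function partitionByField(array $documents, string $field): array
{
    $tagged = [];
    $untagged = [];

    foreach ($documents as $document) {
        // Treat missing, null, and empty values as "not yet generated".
        if (! empty($document[$field])) {
            $tagged[] = $document;
        } else {
            $untagged[] = $document;
        }
    }

    return [$tagged, $untagged];
}

[$tagged, $untagged] = partitionByField([
    ['content' => 'Return policy', 'topic' => ['returns']],
    ['content' => 'Shipping times'],
    ['content' => 'Warranty', 'topic' => []],
], 'topic');

// $tagged could then be indexed via withoutHooks(), $untagged via the normal path.
```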

## Full example

```php
namespace Vendor\MagicTags;

use Sigmie\Contracts\Package;
use Sigmie\Mappings\NewProperties;
use Sigmie\Sigmie;
use Vendor\MagicTags\Types\MagicTags;

class MagicTagsPackage implements Package
{
    public function register(Sigmie $sigmie): void
    {
        NewProperties::macro('magicTags', function (string $name, string $fromField): MagicTags {
            $field = new MagicTags($name, $fromField);
            $this->add($name, $field);

            return $field;
        });

        $sigmie->addCollectionHook(new MagicTagsCollectionHook());
    }
}
```

Application bootstrap:

```php
use Sigmie\Mappings\NewProperties;
use Sigmie\Sigmie;
use Vendor\MagicTags\MagicTagsPackage;

$sigmie = new Sigmie($connection);
$sigmie->extend(new MagicTagsPackage());

$props = new NewProperties;
$props->text('content')->semantic(api: 'embeds', accuracy: 1, dimensions: 1024);
$props->magicTags('topic', fromField: 'content')->api('llm')->embeddingsApi('embeds');

$sigmie->collect('kb', refresh: true)
    ->properties($props)
    ->apis([
        'llm' => $llmApi,
        'embeds' => $embeddingsApi,
    ])
    ->merge([/* documents */]);
```

Every `merge()` / `add()` on a collection whose properties contain `MagicTags` fields runs the hook for that batch, provided the collection comes from the same `Sigmie` instance you called `extend()` on.

## See also

- [Magic Tags](magic-tags.md) — a complete package built on this API.
- [Mappings & Properties](mappings.md) — field types and properties.
- [Documents](document.md) — collection lifecycle.
