Search & RAG — reference

Exhaustive reference for content search and the three streaming inference surfaces: every parameter, field, mode, limit, error code, the stream event vocabulary, and an honest "Notes & limits" section stating what each feature does not do.

For request/response wire detail at the endpoint level, see the generated API reference (the OpenAPI / Scalar spec). This page documents the SDK-level surface and the behavior that the spec alone does not capture.

Version note. The API spec is at 0.27.0. Nothing on the search side of this page is 0.27-only, so it works on any current client. The optional inference flag allowGlobalRegion (region serving, below) is a 0.27 addition. Where the SDK method name differs from a raw field name, the SDK name is given.

Content search — `client.search.content(req)`

Unified search across documents and records. Requires the search:r scope.

Request parameters

Parameter	Type	Default	Notes
`query`	string	— (required)	Natural-language or keyword query.
`mode`	`'TEXT' \| 'SEMANTIC' \| 'HYBRID'`	`HYBRID`	Keyword relevance / meaning-based similarity / fused ranking.
`limit`	integer	20	Max results. Valid range 1–100; out-of-range is rejected with `400`.
`offset`	integer	0	Results to skip — paging (see Paging below).
`contentTypes`	`('documents' \| 'records')[]`	both	Narrow to one content type. Omitted, empty, or both values ⇒ unified.
`typeName`	string	—	Restrict record hits to one schema type. Implicitly narrows to records (skips documents) unless `contentTypes` includes documents.
`folderId`	string (uuid)	—	Restrict to content in this exact folder.
`rootFolderId`	string (uuid)	—	Restrict to this folder and all descendants. Use instead of `folderId`.
`userId`	string (uuid)	—	Ownership filter — content owned by this user.
`orgId`	string (uuid)	—	Ownership filter — content of this organization.
`clientId`	string (uuid)	—	Ownership filter — content of this client.
`filters`	object	—	Field-level metadata filters; AND-combined across keys. See Filter grammar.
`createdAfter`	string (ISO-8601)	—	Content created at/after this UTC timestamp.
`createdBefore`	string (ISO-8601)	—	Content created at/before this UTC timestamp.
`uniqueDocuments`	boolean	false	When true, at most one hit per source document.
`minSimilarity`	number 0.0–1.0	—	Minimum semantic similarity; hits below are excluded (semantic / hybrid).
`minTextRelevance`	number 0.0–1.0	—	Relative keyword-relevance floor, as a fraction of the top hit's score (e.g. 0.5 keeps hits at least half as relevant as the best). Applies to `TEXT` / `HYBRID`. Omit or ≤0 keeps all.
`textMode`	`'OR' \| 'AND' \| 'PHRASE' \| 'COMPLEX'`	`OR`	Keyword sub-mode (see Keyword sub-modes). Applies to `TEXT` / `HYBRID`.
`slop`	integer ≥0	0	Phrase-match slop: intervening positions tolerated between terms when `textMode='PHRASE'`. Ignored otherwise.
`requireComplete`	boolean	false	Fail-closed override: return `503` instead of degraded partial results when a leg is unavailable.

Keyword sub-modes (`textMode`)

For TEXT and HYBRID searches, textMode controls how query terms combine in the keyword leg:

OR (default) — match any term; broadest recall.
AND — require all terms; higher precision.
PHRASE — require terms as a contiguous sequence (tunable with slop).
COMPLEX — full keyword query syntax (boolean operators, field-scoped clauses, range filters). Use only when you need expression-level control; the query is parsed as a structured expression rather than a bag of terms.
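
The sub-modes above differ only in the request fields, which the following sketch spells out. `SearchRequest` here is a local, partial stand-in written for this example (it is not the SDK's exported type), and `phraseSearch` is a hypothetical helper, not an SDK function.

```typescript
// Local, partial stand-in for the documented request shape (assumption, not the SDK type).
type TextMode = 'OR' | 'AND' | 'PHRASE' | 'COMPLEX';

interface SearchRequest {
  query: string;
  mode?: 'TEXT' | 'SEMANTIC' | 'HYBRID';
  textMode?: TextMode;
  slop?: number;
}

// Hypothetical helper: build a keyword phrase search.
// slop only has an effect with textMode 'PHRASE', so it is set here and nowhere else.
function phraseSearch(query: string, slop = 0): SearchRequest {
  if (!Number.isInteger(slop) || slop < 0) {
    throw new RangeError('slop must be an integer >= 0');
  }
  return { query, mode: 'TEXT', textMode: 'PHRASE', slop };
}

// Broadest recall: any term may match (the default sub-mode, written out).
const anyTerm: SearchRequest = { query: 'invoice overdue', mode: 'TEXT', textMode: 'OR' };
// Higher precision: every term must match.
const allTerms: SearchRequest = { ...anyTerm, textMode: 'AND' };
// Phrase match tolerating up to two intervening positions.
const nearPhrase: SearchRequest = phraseSearch('invoice overdue', 2);
```

Any of these objects can be passed to `client.search.content(...)`; the example stops short of the call so that it stays runnable without a configured client.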

Filter grammar (`filters`)

Each top-level key is a metadata field declared filterable on the schema (or a built-in document field). Top-level keys are AND-combined. Each value is one of:

Scalar (string / number / boolean) — equality, e.g. { "status": "open" }.
Array of scalars — OR-set (match any), e.g. { "tag": ["red", "blue"] }.
Operator map — closed set of operators:
- Scalar operand: $eq, $ne, $gt, $gte, $lt, $lte. Operators in one map are AND-combined, so { "price": { "$gte": 100, "$lte": 500 } } is a closed range.
- Array operand: $in, $nin. Cannot be combined with other operators in the same map.

Numbers and booleans are matched as typed values (the field must have been ingested under a typed schema). Dates may be ISO-8601 strings or epoch millis. Filter keys are validated against ^[A-Za-z_][A-Za-z0-9_-]*$; unknown operators, non-scalar operands, malformed keys, and any attempt to filter on a reserved tenancy/ownership key are rejected with 400. You cannot widen your access through the filter map: ownership scope is enforced separately.
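
The grammar above can be checked client-side before sending a request. The sketch below is an illustration of the stated rules only, not the server's validator: `checkFilters` is a hypothetical helper, it does not know which fields are declared filterable, and it cannot detect reserved tenancy/ownership keys because this page does not list them.

```typescript
type Scalar = string | number | boolean;

const KEY_PATTERN = /^[A-Za-z_][A-Za-z0-9_-]*$/;
const SCALAR_OPS = new Set(['$eq', '$ne', '$gt', '$gte', '$lt', '$lte']);
const ARRAY_OPS = new Set(['$in', '$nin']);

const isScalar = (v: unknown): v is Scalar =>
  typeof v === 'string' || typeof v === 'number' || typeof v === 'boolean';

// Returns a list of problems; an empty list means the map passes these local checks.
function checkFilters(filters: Record<string, unknown>): string[] {
  const problems: string[] = [];
  for (const [key, value] of Object.entries(filters)) {
    if (!KEY_PATTERN.test(key)) {
      problems.push(`malformed key: ${key}`);
      continue;
    }
    if (isScalar(value)) continue; // scalar: equality
    if (Array.isArray(value)) {    // array of scalars: OR-set
      if (!value.every(isScalar)) problems.push(`${key}: array must hold scalars`);
      continue;
    }
    if (value === null || typeof value !== 'object') {
      problems.push(`${key}: unsupported value`);
      continue;
    }
    // Operator map: closed operator set, $in/$nin must stand alone.
    const ops = Object.entries(value as Record<string, unknown>);
    if (ops.some(([op]) => ARRAY_OPS.has(op)) && ops.length > 1) {
      problems.push(`${key}: $in/$nin cannot be combined with other operators`);
    }
    for (const [op, operand] of ops) {
      if (SCALAR_OPS.has(op)) {
        if (!isScalar(operand)) problems.push(`${key}.${op}: operand must be a scalar`);
      } else if (ARRAY_OPS.has(op)) {
        if (!Array.isArray(operand) || !operand.every(isScalar)) {
          problems.push(`${key}.${op}: operand must be an array of scalars`);
        }
      } else {
        problems.push(`${key}: unknown operator ${op}`);
      }
    }
  }
  return problems;
}
```

A passing local check does not guarantee acceptance; the server remains the authority and may still return 400.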

Response shape

search.content returns a flat object (it is not wrapped in the { data, nextCursor } cursor envelope used by list/lookup endpoints — see Paging):

Field	Type	Notes
`results`	`SearchResult[]`	Matched chunks, ranked (highest `score` first). May be empty.
`totalResults`	integer	Total matches found. `0` for a miss.
`searchTimeMs`	integer	Server-side execution time. Reported even on an empty result.
`degraded`	boolean	True when one leg was unavailable and results came from the survivor only.
`degradedLegs`	`string[]`	Which legs were unavailable: `"text"` (keyword) and/or `"vector"` (semantic). Empty when not degraded.

Each SearchResult:

Field	Type	Notes
`documentId`	string (uuid)	Source entity id. Use with `getDocument` / `getRecord`. (This is the source id, not an internal index id.)
`sourceType`	`'PartnerDocument' \| 'GenericRecord'`	Document vs. record discriminator — the two literal strings the API returns; branch on it when rendering mixed results.
`score`	number	Fused relevance score; primary sort key, higher is more relevant.
`textScore`	number	Keyword (relevance) sub-score. Non-zero in `HYBRID` when the keyword leg contributed. Always 0 in `TEXT`-only mode — that path ranks by result-array order, not by an exposed per-hit score.
`semanticScore`	number	Semantic similarity sub-score. Non-zero in `SEMANTIC` / `HYBRID`.
`chunkText`	string	The specific chunk that matched. Feed this (or `contextText`) to a model.
`contextText`	string	The wider surrounding passage containing the chunk — better grounding context.
`snippet`	`string \| null`	Highlighted excerpt with query terms emphasized, for display. May be null for a semantic-only hit (use `chunkText`).
`metadata`	object	Metadata supplied at ingest (title, folderId, custom fields).
`createdAt`	string (ISO-8601)	Source content creation time.

Paging

search.content has no nextCursor — it is not enveloped. Page by combining limit with offset:

const page1 = await client.search.content({ query, mode, limit: 20, offset: 0 });
const page2 = await client.search.content({ query, mode, limit: 20, offset: 20 });

limit is capped at 100; consecutive pages are disjoint. This differs from list/lookup endpoints, which return { data, nextCursor } and are drained by feeding nextCursor back as startFrom. For pulling recent content deterministically, prefer a createdAfter window over deep offset paging.
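
Offset paging can be wrapped in a small loop. This is a minimal sketch under two assumptions: that `totalResults` reports the total across all pages (the page only says "Total matches found"), and that a page shorter than `limit` is the last one. `nextOffset` and `allPages` are hypothetical helpers, and `search` stands in for a bound SDK call.

```typescript
// Only the fields the pager needs; a partial stand-in for the documented response.
interface SearchPage {
  results: unknown[];
  totalResults: number;
}

// Offset for the next page, or null when there is nothing more to fetch.
function nextOffset(offset: number, limit: number, page: SearchPage): number | null {
  const next = offset + limit;
  if (page.results.length < limit) return null; // short page: treat as the last one
  if (next >= page.totalResults) return null;   // reached the reported total
  return next;
}

// Yield pages until exhausted. `search` is expected to wrap something like
// (offset, limit) => client.search.content({ query, mode, limit, offset }).
async function* allPages(
  search: (offset: number, limit: number) => Promise<SearchPage>,
  limit = 20,
): AsyncGenerator<SearchPage> {
  let offset: number | null = 0;
  while (offset !== null) {
    const page: SearchPage = await search(offset, limit);
    yield page;
    offset = nextOffset(offset, limit, page);
  }
}
```

Consume it with `for await (const page of allPages(search)) { ... }`, and keep `limit` within the documented 1–100 range.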

Notes & limits — search

Index/search mode must line up. Content indexed for one strategy only will not appear in a search requiring the other. Indexing mode is a property of the content (set at ingest / on the schema); search mode is a property of the query.
Only searchable fields participate in keyword relevance. A query matching only a non-searchable field returns nothing — by design.
Sensitive fields never enter the index. They cannot be searched, ranked, or surfaced under any scope (index-time exclusion).
textScore is 0 in TEXT-only mode — that path returns highlighted snippets but does not expose a per-hit keyword score; rank order is the signal.
minTextRelevance applies only to TEXT / HYBRID; minSimilarity applies only to SEMANTIC / HYBRID.
A miss is a 200, not a 404. Empty results, totalResults: 0.
Degradation is silent unless you check. Inspect degraded / degradedLegs, or set requireComplete: true to turn a degraded leg into a 503.
Cross-content search returns mixed types. Always branch on sourceType when rendering unified results.
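
Several of the notes above (degradation, misses as 200, mixed types, nullable snippets) come together when rendering a response. The sketch below uses local, partial types mirroring the documented fields; `renderHit` and `renderResponse` are hypothetical helpers for illustration.

```typescript
interface Hit {
  documentId: string;
  sourceType: 'PartnerDocument' | 'GenericRecord';
  score: number;
  chunkText: string;
  snippet: string | null;
}

interface SearchResponse {
  results: Hit[];
  totalResults: number;
  degraded: boolean;
  degradedLegs: string[];
}

function renderHit(hit: Hit): string {
  // snippet may be null for a semantic-only hit; fall back to the matched chunk.
  const text = hit.snippet ?? hit.chunkText;
  switch (hit.sourceType) {
    case 'PartnerDocument':
      return `[doc ${hit.documentId}] ${text}`;
    case 'GenericRecord':
      return `[record ${hit.documentId}] ${text}`;
  }
}

function renderResponse(res: SearchResponse): string[] {
  const lines: string[] = [];
  if (res.degraded) {
    // Degradation is silent unless checked, so surface it explicitly.
    lines.push(`warning: partial results, unavailable legs: ${res.degradedLegs.join(', ')}`);
  }
  if (res.totalResults === 0) {
    lines.push('no matches'); // a miss is a 200 with an empty result list
  }
  return lines.concat(res.results.map(renderHit));
}
```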

Inference surfaces — `client.inference.*`

Three streaming surfaces, all returning an async iterable of SSE events, all requiring the inference:r scope, all sharing one pre-flight check sequence. Inference runs against AWS-hosted models inside the Vectros perimeter (in-perimeter for the partner data plane).

Streaming model (shared)

Each surface returns an async iterable; iterate it to consume events in arrival order. Every event carries an event field naming its type, so a consumer can dispatch on one key. On the wire it is standard Server-Sent Events (event: <type> / data: <json> framed by a blank line) — any compliant SSE reader works; the SDK presents it as an async iterator.

Shared event vocabulary:

Event	Fields	When
`content_delta`	`delta` (string)	One chunk of generated text. Append each to build the answer.
`done`	`inputTokens`, `outputTokens`, `model`, `platformCreditsCharged`, `inferenceBalanceCentsCharged`, optionally `cacheReadTokens` / `cacheCreateTokens`	Terminal event with token counts, resolved model id, and per-call cost. Exactly one.
`error`	`message`, `code`	A mid-stream model failure.

Surface-specific events are listed under each surface below.
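
Dispatching on the `event` field can be sketched as a reducer over the shared vocabulary. The `StreamEvent` union below is a local approximation of the documented events (the type of `code` is an assumption, and cost fields on `done` are omitted for brevity); `applyEvent` and `collectAnswer` are hypothetical helpers.

```typescript
type StreamEvent =
  | { event: 'content_delta'; delta: string }
  | { event: 'done'; inputTokens: number; outputTokens: number; model: string }
  | { event: 'error'; message: string; code: string | number } // code type is assumed
  | { event: 'search_results' | 'truncation_warning' | 'document_context' };

interface Answer {
  text: string;
  usage?: { inputTokens: number; outputTokens: number; model: string };
}

// Fold one event into the accumulated answer; throws on a mid-stream error event.
function applyEvent(answer: Answer, ev: StreamEvent): Answer {
  switch (ev.event) {
    case 'content_delta':
      return { ...answer, text: answer.text + ev.delta };
    case 'done':
      return {
        ...answer,
        usage: { inputTokens: ev.inputTokens, outputTokens: ev.outputTokens, model: ev.model },
      };
    case 'error':
      throw new Error(`stream error ${ev.code}: ${ev.message}`);
    default:
      return answer; // surface-specific events are ignored in this sketch
  }
}

// Consume any of the three surfaces' async iterables in arrival order.
async function collectAnswer(stream: AsyncIterable<StreamEvent>): Promise<Answer> {
  let answer: Answer = { text: '' };
  for await (const ev of stream) {
    answer = applyEvent(answer, ev);
  }
  return answer;
}
```

Keeping the per-event logic in a pure function makes it easy to unit-test without opening a stream.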

Pre-flight checks (shared, fixed order)

Run before any model invocation; cheaper checks first:

1. Action scope → 403. A scoped token must carry inference:r (or a wildcard). Root sk_* keys carry wildcard scope and pass by construction. The 403 does not enumerate the missing action.
2. Monthly credit limit → 402. Once the period's cumulative credits exceed the plan's ceiling, inference rejects until the period rolls or the plan is upgraded.
3. Burst rate limit → 429. Per-tenant request-rate ceiling, scaling with plan tier.
4. Inference billing gate → 402. In balance mode (default on lower tiers), a per-partner pre-funded balance must be positive, else 402 Insufficient inference balance. In usage mode (Enterprise-shaped, post-billed), accumulated usage is checked against a contractual cap, else 402 when the cap is reached.

The token cost of a call is metered and recorded on stream finalization (after the stream closes, or on a broken pipe with partial output). A transient accounting failure does not fail the in-flight response — the balance on the next call may briefly lag.

Grounded corpus answers — `client.inference.ragInference(req)`

Retrieve-then-generate over your indexed content.

Request parameters:

Parameter	Type	Default	Notes
`query`	string	— (required)	The question to ground and answer.
`model`	string	tier default	Model alias. See Model catalog.
`maxTokens`	integer	1024	Output cap. Capped at 4096 (tighter than chat — retrieved context shares the input budget).
`temperature`	number	0.3	Sampling temperature.
`instructions`	string	—	Optional extra instructions for the answer.
`search`	object	—	Retrieval params (below).

search sub-object mirrors content search: mode (default HYBRID), limit (default 10, capped at 50 — this is the RAG topK), userId, orgId, clientId, folderId, rootFolderId, typeName, filters, contentTypes, createdAfter, createdBefore, requireComplete.

Event sequence: search_results → optional truncation_warning → content_delta+ → done.

Event	Fields	Notes
`search_results`	`results[]`, `totalResults`, `searchTimeMs`, `degraded`, `degradedLegs`	Always emitted, even when `results` is empty. Each entry: `documentId`, `score`, `textScore`, `semanticScore`, `chunkText`, `contextText`, `snippet`, `metadata`, `sourceType`, `typeName`, `createdAt`. These are your citations.
`truncation_warning`	`resultsRequested`, `resultsUsed`, `reason`	Emitted before the answer if retrieved passages had to be dropped to fit the context budget.

Behavior:

With search.requireComplete: true, a degraded retrieval leg causes the call to reject before the stream opens (503) instead of grounding on partial results.
An empty retrieval still emits search_results (empty) and still streams an answer (typically stating that nothing relevant was found).
A scoped token's data scope is enforced on the retrieval step.
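
Because `search_results` entries are chunk-level, several may point at the same source. One way to turn them into a citation list is sketched below: it keeps the best-scoring chunk per `documentId`. The `RagHit` type is a partial local stand-in, and `toCitations` is a hypothetical helper; whether one citation per source is the right granularity is an application decision.

```typescript
interface RagHit {
  documentId: string;
  sourceType: string;
  score: number;
  chunkText: string;
  snippet: string | null;
}

interface Citation {
  documentId: string;
  sourceType: string;
  bestScore: number;
  excerpt: string;
}

// Collapse chunk-level hits into one citation per source, ordered by best score.
function toCitations(results: RagHit[]): Citation[] {
  const bySource = new Map<string, Citation>();
  for (const r of results) {
    const seen = bySource.get(r.documentId);
    if (seen === undefined || r.score > seen.bestScore) {
      bySource.set(r.documentId, {
        documentId: r.documentId,
        sourceType: r.sourceType,
        bestScore: r.score,
        excerpt: r.snippet ?? r.chunkText, // snippet can be null; fall back to the chunk
      });
    }
  }
  return Array.from(bySource.values()).sort((x, y) => y.bestScore - x.bestScore);
}
```

An empty `results` array yields an empty citation list, which matches the empty-retrieval behavior described above.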

Single-document Q&A — `client.inference.documentAsk(req)`

Ask one question against one document's full text. No retrieval step.

Request parameters:

Parameter	Type	Default	Notes
`id`	string (uuid)	— (required)	The document to ask against (in the request body).
`prompt`	string	— (required)	The question.
`model`	string	tier default	Model alias.
`maxTokens`	integer	2048	Output cap. Capped at 8192.

Event sequence: document_context → content_delta+ → done.

Event	Fields	Notes
`document_context`	`documentId`, `title`, `textBytes`, `model`	The document loaded, its size, and the resolved model. Fires before any generated text.

Errors:

409 (before the stream opens) — the document is not askable yet: it is still processing (not yet fully indexed), it failed ingest, or it was ingested without its text retained (storeText was not set, so there is no full text to load). Freshly-ingested documents commonly return 409 until indexing completes.
413 (before the stream opens, no credits charged) — the document's estimated input size exceeds the cap (32,000 input tokens, ~25 pages). Payload: message, estimatedTokens, limitTokens. Branch on this and re-route to RAG.
404 — the document does not exist, belongs to another tenant, or is out of your token's scope. All three return the identical 404; the endpoint never reveals existence outside your scope.
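
The three pre-stream errors call for different reactions, which can be sketched as a small decision function. How the SDK surfaces an HTTP failure is not specified on this page, so `AskFailure` is a hypothetical shape (a status plus the documented 413 payload fields) and `planAfterAskFailure` is an illustrative helper.

```typescript
// Hypothetical failure shape; adapt to however your client exposes status and body.
interface AskFailure {
  status: number;
  body?: { message?: string; estimatedTokens?: number; limitTokens?: number };
}

type AskPlan =
  | { action: 'retry-later' }
  | { action: 'use-rag'; overBy?: number }
  | { action: 'not-found' }
  | { action: 'rethrow' };

function planAfterAskFailure(err: AskFailure): AskPlan {
  switch (err.status) {
    case 409:
      // Retrying only helps the "still processing" case; failed ingest or
      // missing stored text will keep returning 409, so bound the attempts.
      return { action: 'retry-later' };
    case 413: {
      // Too large for document-ask: re-route the question to RAG.
      const est = err.body?.estimatedTokens;
      const lim = err.body?.limitTokens;
      return est !== undefined && lim !== undefined
        ? { action: 'use-rag', overBy: est - lim }
        : { action: 'use-rag' };
    }
    case 404:
      // Missing, cross-tenant, and out-of-scope are indistinguishable by design.
      return { action: 'not-found' };
    default:
      return { action: 'rethrow' };
  }
}
```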

Stateless chat — `client.inference.chatInference(req)`

Single-turn completion. No retrieval, no stored state.

Request parameters:

Parameter	Type	Default	Notes
`messages`	`{ role, content }[]`	— (required)	Conversation. A `system` role message becomes the system prompt; `user` / `assistant` messages pass through.
`model`	string	tier default	Model alias.
`maxTokens`	integer	2048	Output cap. Capped at 8192.
`temperature`	number	0.7	Sampling temperature.
`topP`	number	—	Nucleus-sampling parameter (optional).

Event sequence: content_delta+ → done.

Chat stores nothing. For multi-turn, append the assistant's reply to your messages array and re-send the whole array next turn.
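
Caller-managed history amounts to appending to an array between calls. A minimal sketch, using a local `Message` type that mirrors the documented `{ role, content }` shape and a hypothetical `nextTurn` helper:

```typescript
type Role = 'system' | 'user' | 'assistant';

interface Message {
  role: Role;
  content: string;
}

// Build the messages array for the next request: prior history, the assistant's
// streamed reply (assembled from content_delta events), then the new user message.
// The input array is left untouched.
function nextTurn(
  history: readonly Message[],
  assistantReply: string,
  userMessage: string,
): Message[] {
  return [
    ...history,
    { role: 'assistant', content: assistantReply },
    { role: 'user', content: userMessage },
  ];
}
```

The returned array is what the next `chatInference` call would receive as `messages`. Trimming old turns to fit the model's context window is left to the caller and is not shown here.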

Model catalog — `client.inference.listInferenceModels()`

Lists the models the calling key's plan tier can reach.

Response:

Field	Type	Notes
`models`	`Model[]`	Available models for this key.
`defaultModel`	string	The alias used when a call omits `model`. Reachable on the free plan.

Each Model:

Field	Type	Notes
`id`	string	Alias, e.g. `claude-haiku-4-5`, `claude-sonnet-4-5`, `claude-sonnet-4-6`, `claude-opus-4-7`. Matches the model vendor's marketing names.
`name`	string	Display name.
`provider`	string	Model provider.
`contextWindow`	integer	Context window size in tokens.
`inputCreditsPer1kTokens`	number	Input token credit rate.
`outputCreditsPer1kTokens`	number	Output token credit rate.
`availableOn`	`string[]`	Plan tiers that may call this model (e.g. `free`, `starter`, `pro`, `scale`, `enterprise`).

Requesting a model your plan does not include returns a 402 pointing to upgrade. A lighter model is available on every tier; more capable models require higher tiers.
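
To avoid the 402 for an out-of-plan model, a caller can check a preferred alias against the catalog first. The sketch below assumes the response has already been fetched with `listInferenceModels()`; `Catalog` is a partial local type and `pickModel` is a hypothetical helper.

```typescript
interface CatalogModel {
  id: string;
  availableOn: string[];
}

interface Catalog {
  models: CatalogModel[];
  defaultModel: string;
}

// Use the preferred alias only if the catalog for this key lists it;
// otherwise fall back to the catalog's default alias.
function pickModel(catalog: Catalog, preferred?: string): string {
  if (preferred !== undefined && catalog.models.some((m) => m.id === preferred)) {
    return preferred;
  }
  return catalog.defaultModel;
}
```

Since the catalog can change at any time, a cached copy may be stale; handling a 402 on the call itself is still needed.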

Region serving (`allowGlobalRegion`)

All three inference surfaces (chat, RAG, document-ask) accept an optional boolean allowGlobalRegion in the request body.

Field	Type	Default	Meaning
`allowGlobalRegion`	boolean	tenant residency default	Opt this request into the lower-cost global (non-US) region path.

The tenant's residency default is US serving, applied when the flag is omitted. US serving is the fail-closed default and carries a region premium.
Setting allowGlobalRegion: true lets an entitled tenant serve the request from the global region at a lower rate. Entitlement is gated on a signed global-processing waiver.
If allowGlobalRegion: true is sent by a tenant that is not entitled, the request is rejected with 403 — it is not silently downgraded or upgraded. Region choice never changes which content is retrieved, only where the model runs and the price.

Error codes (inference)

Code	Surface	Meaning
`400`	all	Malformed request (e.g. missing `query` / `messages` / `prompt`, bad filter key).
`402`	all	Monthly credit limit exceeded, insufficient inference balance, usage cap reached, or a requested model the plan does not include.
`403`	all	Token scope does not permit inference (`inference:r` missing); or `allowGlobalRegion: true` was sent but this tenant is not entitled to global-region serving (no signed global-processing waiver).
`404`	document-ask	Document not found / cross-tenant / out-of-scope (uniform — existence never revealed).
`409`	document-ask	Document not askable yet — still processing (not yet indexed), failed ingest, or text not retained (ingested without `storeText`). Returned before the stream opens.
`413`	document-ask	Document exceeds the input-token cap (before the stream opens; no credits charged).
`429`	all	Burst rate limit exceeded.
`503`	RAG, search	A retrieval/search leg was unavailable and `requireComplete: true` (search) or `search.requireComplete: true` (RAG) was set.

Notes & limits — inference

Hard output caps per surface: chat 8192, RAG 4096, document-ask 8192 output tokens. Document-ask additionally caps input at 32,000 tokens (~25 pages) with a 413 before the stream opens. A maxTokens above a surface's cap is clamped down to the cap.
RAG topK capped at 50. search.limit defaults to 10, max 50 — retrieved context shares the model's input budget, so unbounded topK would push the prompt past the context window.
Chat is stateless; there is no managed conversation state. No server-side thread store, no assistants registry. Multi-turn is the caller's responsibility — re-send the messages array each turn and budget the history against the model's context window.
Document-ask is single-document. No multi-document Q&A endpoint. For multi-document grounding, use RAG (retrieval picks relevant passages across the corpus) or stitch multiple documentAsk calls at the application layer.
Cost is recorded on finalization. The done event carries platformCreditsCharged and inferenceBalanceCentsCharged; an accounting hiccup will not fail an in-flight response, so the next call's balance may briefly lag.
Cache-token fields are forward-declared. done may carry cacheReadTokens / cacheCreateTokens; the current billing formula does not yet apply a cache discount. Consumers already reading these will see the reduction when it lands, with no code change.
In-perimeter scope. The in-perimeter (no third-party model-vendor egress) guarantee is for the partner data plane — the content you store and retrieve through Vectros. Specific compliance coverage terms are addressed in the security and compliance documentation, not asserted here.
The model catalog is the source of truth. Handlers gate on the live catalog at request time, so a model going generally available or being retired takes effect immediately — what listInferenceModels returns is what the deployed handlers accept.
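
The numeric caps listed above can be mirrored client-side so the effective values are explicit in your own code. This is a sketch only: the constants copy the numbers on this page and would need updating if the limits change, the lower bound of 1 is an assumption, and `clampMaxTokens` / `clampTopK` are hypothetical helpers.

```typescript
// Output-token caps per surface, copied from this page.
const OUTPUT_CAPS = { chat: 8192, rag: 4096, documentAsk: 8192 } as const;
type Surface = keyof typeof OUTPUT_CAPS;

// RAG search.limit ceiling, copied from this page.
const RAG_TOP_K_CAP = 50;

// Clamp a requested maxTokens into [1, cap] for the given surface.
function clampMaxTokens(surface: Surface, requested: number): number {
  return Math.min(Math.max(1, Math.floor(requested)), OUTPUT_CAPS[surface]);
}

// Clamp a requested RAG search.limit into [1, 50].
function clampTopK(requested: number): number {
  return Math.min(Math.max(1, Math.floor(requested)), RAG_TOP_K_CAP);
}
```

For example, `clampMaxTokens('rag', 10000)` yields 4096, the value the server would apply anyway, and `clampTopK(200)` yields 50.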

Where to go next

how-to.md — runnable guides for each call on this page.
explanation.md — the concepts behind the modes, grounding context, and the three inference surfaces.
../data-model/reference.md — schema field declarations (searchable, filterable, sensitive) that govern what search indexes.
../operations-trust/compliance.md — sensitive-field protections, isolation guarantees, and the in-perimeter inference posture.