Search & RAG — reference
Exhaustive reference for content search and the three streaming inference surfaces: every parameter, field, mode, limit, error code, the stream event vocabulary, and an honest "Notes & limits" section stating what each feature does not do.
For request/response wire detail at the endpoint level, see the generated API reference (the OpenAPI / Scalar spec). This page documents the SDK-level surface and the behavior that the spec alone does not capture.
Version note. The API spec is at 0.27.0. Nothing on the search side of this page is 0.26-only, so it works on any current client. The optional inference flag `allowGlobalRegion` (region serving, below) is a 0.27 addition. Where the SDK method name differs from a raw field name, the SDK name is given.
Content search — client.search.content(req)
Unified search across documents and records. Requires the search:r scope.
Request parameters
| Parameter | Type | Default | Notes |
|---|---|---|---|
query | string | — (required) | Natural-language or keyword query. |
mode | 'TEXT' \| 'SEMANTIC' \| 'HYBRID' | HYBRID | TEXT: keyword relevance. SEMANTIC: meaning-based similarity. HYBRID: fused ranking of both. |
limit | integer | 20 | Max results. Valid range 1–100; out-of-range is rejected with 400. |
offset | integer | 0 | Results to skip — paging (see Paging below). |
contentTypes | ('documents' \| 'records')[] | both | Narrow to one content type. Omitted, empty, or both values ⇒ unified. |
typeName | string | — | Restrict record hits to one schema type. Implicitly narrows to records (skips documents) unless contentTypes includes documents. |
folderId | string (uuid) | — | Restrict to content in this exact folder. |
rootFolderId | string (uuid) | — | Restrict to this folder and all descendants. Use instead of folderId. |
userId | string (uuid) | — | Ownership filter — content owned by this user. |
orgId | string (uuid) | — | Ownership filter — content of this organization. |
clientId | string (uuid) | — | Ownership filter — content of this client. |
filters | object | — | Field-level metadata filters; AND-combined across keys. See Filter grammar. |
createdAfter | string (ISO-8601) | — | Content created at/after this UTC timestamp. |
createdBefore | string (ISO-8601) | — | Content created at/before this UTC timestamp. |
uniqueDocuments | boolean | false | When true, at most one hit per source document. |
minSimilarity | number 0.0–1.0 | — | Minimum semantic similarity; hits below are excluded (semantic / hybrid). |
minTextRelevance | number 0.0–1.0 | — | Relative keyword-relevance floor, as a fraction of the top hit's score (e.g. 0.5 keeps hits at least half as relevant as the best). Applies to TEXT / HYBRID. Omit or ≤0 keeps all. |
textMode | 'OR' \| 'AND' \| 'PHRASE' \| 'COMPLEX' | OR | Keyword sub-mode (see Keyword sub-modes). Applies to TEXT / HYBRID. |
slop | integer ≥0 | 0 | Phrase-match slop: intervening positions tolerated between terms when textMode='PHRASE'. Ignored otherwise. |
requireComplete | boolean | false | Fail-closed override: return 503 instead of degraded partial results when a leg is unavailable. |
Keyword sub-modes (textMode)
For TEXT and HYBRID searches, textMode controls how query terms combine in the
keyword leg:
- `OR` (default): match any term; broadest recall.
- `AND`: require all terms; higher precision.
- `PHRASE`: require terms as a contiguous sequence (tunable with `slop`).
- `COMPLEX`: full keyword query syntax (boolean operators, field-scoped clauses, range filters). Use only when you need expression-level control; the query is parsed as a structured expression rather than a bag of terms.
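As a minimal sketch of how the sub-modes map onto a request, the helper below builds a keyword-leg request and attaches `slop` only for `PHRASE`. The `KeywordSearchRequest` type and the `keywordRequest` helper are illustrative names, not part of the SDK; only the field names come from the parameter table above.

```typescript
// Hypothetical request type covering only the fields used in this sketch.
type TextMode = 'OR' | 'AND' | 'PHRASE' | 'COMPLEX';

interface KeywordSearchRequest {
  query: string;
  mode: 'TEXT' | 'HYBRID';
  textMode: TextMode;
  slop?: number;
}

// Builds a keyword-leg request. slop is attached only for PHRASE,
// because the reference says it is ignored for every other sub-mode.
function keywordRequest(
  query: string,
  textMode: TextMode = 'OR',
  slop = 0,
  mode: 'TEXT' | 'HYBRID' = 'HYBRID',
): KeywordSearchRequest {
  if (!Number.isInteger(slop) || slop < 0) {
    throw new RangeError('slop must be an integer >= 0');
  }
  const req: KeywordSearchRequest = { query, mode, textMode };
  if (textMode === 'PHRASE' && slop > 0) req.slop = slop;
  return req;
}
```

The result can be passed to `client.search.content(...)`, for example `keywordRequest('data retention policy', 'PHRASE', 2)`.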
Filter grammar (filters)
Each top-level key is a metadata field declared filterable on the schema (or a built-in document field). Top-level keys are AND-combined. Each value is one of:
- Scalar (string / number / boolean): equality, e.g. `{ "status": "open" }`.
- Array of scalars: OR-set (match any), e.g. `{ "tag": ["red", "blue"] }`.
- Operator map: closed set of operators:
  - Scalar operand: `$eq`, `$ne`, `$gt`, `$gte`, `$lt`, `$lte`. Operators in one map are AND-combined, so `{ "price": { "$gte": 100, "$lte": 500 } }` is a closed range.
  - Array operand: `$in`, `$nin`. Cannot be combined with other operators in the same map.
Numbers and booleans match typed (the field must have been ingested under a typed schema).
Dates may be ISO-8601 strings or epoch millis. Filter keys are validated (`^[A-Za-z_][A-Za-z0-9_-]*$`); unknown operators, non-scalar operands, malformed keys, and any attempt
to filter on a reserved tenancy/ownership key are rejected with 400. You cannot widen your
access through the filter map — ownership scope is enforced separately.
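The grammar above is small enough to check on the client before sending a request. The following is a rough sketch of such a pre-check, not the server's validator: `checkFilters` is a hypothetical helper, treating an empty operator map as an error is an assumption, and the reserved tenancy/ownership keys are not checked because their names are not listed here.

```typescript
type Scalar = string | number | boolean;

const KEY_RE = /^[A-Za-z_][A-Za-z0-9_-]*$/;
const SCALAR_OPS = new Set(['$eq', '$ne', '$gt', '$gte', '$lt', '$lte']);
const ARRAY_OPS = new Set(['$in', '$nin']);

function isScalar(v: unknown): v is Scalar {
  return typeof v === 'string' || typeof v === 'number' || typeof v === 'boolean';
}

// Returns a list of problems; an empty list means the map follows the grammar.
function checkFilters(filters: Record<string, unknown>): string[] {
  const problems: string[] = [];
  for (const [key, value] of Object.entries(filters)) {
    if (!KEY_RE.test(key)) { problems.push(`malformed key: ${key}`); continue; }
    if (isScalar(value)) continue;                        // scalar: equality
    if (Array.isArray(value)) {                           // array: OR-set
      if (!value.every(isScalar)) problems.push(`${key}: array must hold scalars`);
      continue;
    }
    if (value === null || typeof value !== 'object') {
      problems.push(`${key}: unsupported value`);
      continue;
    }
    const ops = Object.entries(value as Record<string, unknown>);
    // Assumption: an empty operator map is treated as an error here.
    if (ops.length === 0) problems.push(`${key}: empty operator map`);
    for (const [op, operand] of ops) {
      if (SCALAR_OPS.has(op)) {
        if (!isScalar(operand)) problems.push(`${key}.${op}: operand must be a scalar`);
      } else if (ARRAY_OPS.has(op)) {
        if (!Array.isArray(operand) || !operand.every(isScalar)) {
          problems.push(`${key}.${op}: operand must be an array of scalars`);
        }
        if (ops.length > 1) problems.push(`${key}.${op}: cannot be combined with other operators`);
      } else {
        problems.push(`${key}: unknown operator ${op}`);
      }
    }
  }
  return problems;
}
```

A client-side check like this only saves a round trip; the server remains the authority and still rejects invalid maps with `400`.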
Response shape
search.content returns a flat object (it is not wrapped in the { data, nextCursor }
cursor envelope used by list/lookup endpoints — see Paging):
| Field | Type | Notes |
|---|---|---|
results | SearchResult[] | Matched chunks, ranked (highest score first). May be empty. |
totalResults | integer | Total matches found. 0 for a miss. |
searchTimeMs | integer | Server-side execution time. Reported even on an empty result. |
degraded | boolean | True when one leg was unavailable and results came from the survivor only. |
degradedLegs | string[] | Which legs were unavailable: "text" (keyword) and/or "vector" (semantic). Empty when not degraded. |
Each SearchResult:
| Field | Type | Notes |
|---|---|---|
documentId | string (uuid) | Source entity id. Use with getDocument / getRecord. (This is the source id, not an internal index id.) |
sourceType | 'PartnerDocument' \| 'GenericRecord' | Document vs. record discriminator: the two literal strings the API returns. Branch on it when rendering mixed results. |
score | number | Fused relevance score; primary sort key, higher is more relevant. |
textScore | number | Keyword (relevance) sub-score. Non-zero in HYBRID when the keyword leg contributed. Always 0 in TEXT-only mode — that path ranks by result-array order, not by an exposed per-hit score. |
semanticScore | number | Semantic similarity sub-score. Non-zero in SEMANTIC / HYBRID. |
chunkText | string | The specific chunk that matched. Feed this (or contextText) to a model. |
contextText | string | The wider surrounding passage containing the chunk — better grounding context. |
snippet | string | Highlighted excerpt with query terms emphasized, for display. May be null for a semantic-only hit (use chunkText). |
metadata | object | Metadata supplied at ingest (title, folderId, custom fields). |
createdAt | string (ISO-8601) | Source content creation time. |
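To illustrate how the `SearchResult` fields are typically consumed, here is a small sketch that branches on `sourceType` and falls back from `snippet` to `chunkText`. The `SearchHit` interface is a hand-written subset for this example, and `describeHit` is a hypothetical helper.

```typescript
// Hand-written subset of SearchResult, limited to the fields used below.
interface SearchHit {
  documentId: string;
  sourceType: 'PartnerDocument' | 'GenericRecord';
  score: number;
  chunkText: string;
  snippet: string | null;
}

// Chooses display text: the highlighted snippet when present, otherwise
// the matched chunk (a semantic-only hit may have no snippet).
function displayText(hit: SearchHit): string {
  return hit.snippet ?? hit.chunkText;
}

// Produces one label per hit, branching on the sourceType discriminator.
function describeHit(hit: SearchHit): string {
  const kind = hit.sourceType === 'PartnerDocument' ? 'document' : 'record';
  return `[${kind}] ${hit.documentId}: ${displayText(hit)}`;
}
```

The same branch is where an application would decide between `getDocument` and `getRecord` when it needs the full source entity.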
Paging
search.content has no nextCursor — it is not enveloped. Page by combining limit
with offset:
const page1 = await client.search.content({ query, mode, limit: 20, offset: 0 });
const page2 = await client.search.content({ query, mode, limit: 20, offset: 20 });
limit is capped at 100; consecutive pages are disjoint. This differs from list/lookup
endpoints, which return { data, nextCursor } and are drained by feeding nextCursor back
as startFrom. For pulling recent content deterministically, prefer a createdAfter
window over deep offset paging.
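The offset loop can be sketched as follows. The `search` argument is an injected function standing in for a call such as `client.search.content({ query, mode, limit, offset })`; `nextOffset`, `searchAll`, and the `maxHits` guard are illustrative names and choices, not SDK features.

```typescript
interface Page<T> { results: T[]; totalResults: number; }

// Returns the offset of the next page, or null when there is nothing left.
function nextOffset(offset: number, limit: number, page: Page<unknown>): number | null {
  if (page.results.length === 0) return null;            // empty page: stop
  const next = offset + limit;
  return next < page.totalResults ? next : null;
}

// Drains a search by offset. `search` is supplied by the caller and is
// expected to wrap the real SDK call; maxHits bounds deep paging.
async function searchAll<T>(
  search: (offset: number, limit: number) => Promise<Page<T>>,
  limit = 20,
  maxHits = 200,
): Promise<T[]> {
  const hits: T[] = [];
  let offset: number | null = 0;
  while (offset !== null && hits.length < maxHits) {
    const page: Page<T> = await search(offset, limit);
    hits.push(...page.results);
    offset = nextOffset(offset, limit, page);
  }
  return hits.slice(0, maxHits);
}
```

The `maxHits` bound reflects the advice above: for large, deterministic pulls a `createdAfter` window is usually a better fit than walking far into the offset range.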
Notes & limits — search
- Index/search mode must line up. Content indexed for one strategy only will not appear in a search requiring the other. Indexing mode is a property of the content (set at ingest / on the schema); search mode is a property of the query.
- Only searchable fields participate in keyword relevance. A query matching only a non-searchable field returns nothing — by design.
- Sensitive fields never enter the index. They cannot be searched, ranked, or surfaced under any scope (index-time exclusion).
- `textScore` is 0 in `TEXT`-only mode: that path returns highlighted snippets but does not expose a per-hit keyword score; rank order is the signal.
- `minTextRelevance` applies only to `TEXT` / `HYBRID`; `minSimilarity` applies only to `SEMANTIC` / `HYBRID`.
- A miss is a `200`, not a `404`. Empty `results`, `totalResults: 0`.
- Degradation is silent unless you check. Inspect `degraded` / `degradedLegs`, or set `requireComplete: true` to turn a degraded leg into a `503`.
- Cross-content search returns mixed types. Always branch on `sourceType` when rendering unified results.
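One way to make the miss and degradation notes explicit in application code is to classify each response before rendering it. This is a sketch under the response shape documented above; `classifySearch` and the `SearchOutcome` union are hypothetical names.

```typescript
// Subset of the search.content response used for classification.
interface SearchOutcomeInput {
  results: unknown[];
  totalResults: number;
  degraded: boolean;
  degradedLegs: string[];
}

type SearchOutcome =
  | { kind: 'miss' }
  | { kind: 'complete'; count: number }
  | { kind: 'partial'; count: number; missingLegs: string[] };

// A degraded response is reported as partial even when it is empty,
// because an empty result from one surviving leg is not a confirmed miss.
function classifySearch(res: SearchOutcomeInput): SearchOutcome {
  if (res.degraded) {
    return { kind: 'partial', count: res.results.length, missingLegs: res.degradedLegs };
  }
  if (res.results.length === 0) return { kind: 'miss' };
  return { kind: 'complete', count: res.results.length };
}
```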
Inference surfaces — client.inference.*
Three streaming surfaces, all returning an async iterable of SSE events, all requiring the
inference:r scope, all sharing one pre-flight check sequence. Inference runs against
AWS-hosted models inside the Vectros perimeter (in-perimeter for the partner data plane).
Streaming model (shared)
Each surface returns an async iterable; iterate it to consume events in arrival order. Every
event carries an event field naming its type, so a consumer can dispatch on one key. On the
wire it is standard Server-Sent Events (event: <type> / data: <json> framed by a blank
line) — any compliant SSE reader works; the SDK presents it as an async iterator.
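For readers consuming the wire format without the SDK, the framing can be sketched as a small parser. This version assumes the whole text is already in memory and that every `data:` payload is JSON; a real reader would buffer partial frames as bytes arrive. `parseSseFrames` is an illustrative name.

```typescript
interface SseEvent { event: string; data: unknown; }

// Parses complete SSE frames from a text buffer. Frames are separated by a
// blank line; each frame has an optional `event:` line and `data:` lines.
function parseSseFrames(text: string): SseEvent[] {
  const events: SseEvent[] = [];
  for (const frame of text.replace(/\r\n/g, '\n').split('\n\n')) {
    let event = 'message';              // SSE default when no event: line is present
    const data: string[] = [];
    for (const line of frame.split('\n')) {
      if (line.startsWith('event:')) event = line.slice(6).trim();
      else if (line.startsWith('data:')) data.push(line.slice(5).replace(/^ /, ''));
    }
    // Frames without data (including the trailing empty split) are skipped.
    if (data.length > 0) events.push({ event, data: JSON.parse(data.join('\n')) });
  }
  return events;
}
```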
Shared event vocabulary:
| Event | Fields | When |
|---|---|---|
content_delta | delta (string) | One chunk of generated text. Append each to build the answer. |
done | inputTokens, outputTokens, model, platformCreditsCharged, inferenceBalanceCentsCharged, optionally cacheReadTokens / cacheCreateTokens | Terminal event with token counts, resolved model id, and per-call cost. Exactly one. |
error | message, code | A mid-stream model failure. |
Surface-specific events are listed under each surface below.
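Dispatching on the `event` key can be sketched as a fold over the shared vocabulary. The `StreamEvent` union below is a hand-written subset (the type of `code` on `error` is an assumption), and the function takes an array for simplicity; with the SDK the same switch would sit inside a `for await` loop over the returned iterable.

```typescript
// Hand-written subset of the shared events. The type of `code` is assumed.
type StreamEvent =
  | { event: 'content_delta'; delta: string }
  | { event: 'done'; inputTokens: number; outputTokens: number; model: string }
  | { event: 'error'; message: string; code: string };

interface Answer {
  text: string;
  finished: boolean;
  outputTokens: number | null;
  error: string | null;
}

// Appends each delta, records the terminal done event, and keeps the
// message of a mid-stream error if one arrives.
function foldEvents(events: StreamEvent[]): Answer {
  const answer: Answer = { text: '', finished: false, outputTokens: null, error: null };
  for (const ev of events) {
    switch (ev.event) {
      case 'content_delta': answer.text += ev.delta; break;
      case 'done': answer.finished = true; answer.outputTokens = ev.outputTokens; break;
      case 'error': answer.error = ev.message; break;
    }
  }
  return answer;
}
```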
Pre-flight checks (shared, fixed order)
Run before any model invocation; cheaper checks first:
- Action scope → `403`. A scoped token must carry `inference:r` (or a wildcard). Root `sk_*` keys carry wildcard scope and pass by construction. The `403` does not enumerate the missing action.
- Monthly credit limit → `402`. Once the period's cumulative credits exceed the plan's ceiling, inference rejects until the period rolls or the plan is upgraded.
- Burst rate limit → `429`. Per-tenant request-rate ceiling, scaling with plan tier.
- Inference billing gate → `402`. In balance mode (default on lower tiers), a per-partner pre-funded balance must be positive, else `402 Insufficient inference balance`. In usage mode (Enterprise-shaped, post-billed), accumulated usage is checked against a contractual cap, else `402` when the cap is reached.
The token cost of a call is metered and recorded on stream finalization (after the stream closes, or on a broken pipe with partial output). A transient accounting failure does not fail the in-flight response — the balance on the next call may briefly lag.
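The fixed ordering matters when several conditions fail at once: the caller sees only the first. The sketch below is a client-side model of that ordering for reasoning and testing, not the server implementation; every field of `PreflightState` is a hypothetical stand-in for server-side state.

```typescript
// Hypothetical snapshot of the state the pre-flight checks look at.
interface PreflightState {
  hasInferenceScope: boolean;
  monthlyCreditsUsed: number;
  monthlyCreditLimit: number;
  burstLimitExceeded: boolean;
  billingMode: 'balance' | 'usage';
  balanceCents: number;    // consulted in balance mode
  usageCents: number;      // consulted in usage mode
  usageCapCents: number;   // consulted in usage mode
}

// Returns the first rejecting status in the documented order, or null
// when all four checks pass and the model may be invoked.
function preflight(s: PreflightState): 403 | 402 | 429 | null {
  if (!s.hasInferenceScope) return 403;                                   // action scope
  if (s.monthlyCreditsUsed > s.monthlyCreditLimit) return 402;            // monthly credit limit
  if (s.burstLimitExceeded) return 429;                                   // burst rate limit
  if (s.billingMode === 'balance' && s.balanceCents <= 0) return 402;     // balance gate
  if (s.billingMode === 'usage' && s.usageCents >= s.usageCapCents) return 402; // usage cap
  return null;
}
```

One consequence worth noting: a `429` implies the scope and monthly-limit checks already passed, while a `402` on its own does not say which of the two billing checks produced it.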
Grounded corpus answers — client.inference.ragInference(req)
Retrieve-then-generate over your indexed content.
Request parameters:
| Parameter | Type | Default | Notes |
|---|---|---|---|
query | string | — (required) | The question to ground and answer. |
model | string | tier default | Model alias. See Model catalog. |
maxTokens | integer | 1024 | Output cap. Capped at 4096 (tighter than chat — retrieved context shares the input budget). |
temperature | number | 0.3 | Sampling temperature. |
instructions | string | — | Optional extra instructions for the answer. |
search | object | — | Retrieval params (below). |
search sub-object mirrors content search: mode (default HYBRID), limit
(default 10, capped at 50 — this is the RAG topK), userId, orgId, clientId,
folderId, rootFolderId, typeName, filters, contentTypes, createdAfter,
createdBefore, requireComplete.
Event sequence: search_results → optional truncation_warning → content_delta+ →
done.
| Event | Fields | Notes |
|---|---|---|
search_results | results[], totalResults, searchTimeMs, degraded, degradedLegs | Always emitted, even when results is empty. Each entry: documentId, score, textScore, semanticScore, chunkText, contextText, snippet, metadata, sourceType, typeName, createdAt. These are your citations. |
truncation_warning | resultsRequested, resultsUsed, reason | Emitted before the answer if retrieved passages had to be dropped to fit the context budget. |
Behavior:
- With `search.requireComplete: true`, a degraded retrieval leg causes the call to reject before the stream opens (`503`) instead of grounding on partial results.
- An empty retrieval still emits `search_results` (empty) and still streams an answer (typically stating that nothing relevant was found).
- A scoped token's data scope is enforced on the retrieval step.
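The documented event order for a successful RAG stream can be expressed as a simple check, which is handy in tests that record event names. This is a sketch: `isValidRagSequence` is a hypothetical helper, and it deliberately does not model a stream that ends with a mid-stream `error` event.

```typescript
// Checks the success-path order: search_results, an optional
// truncation_warning, one or more content_delta events, then done.
function isValidRagSequence(eventNames: string[]): boolean {
  const pattern = /^search_results( truncation_warning)?( content_delta)+ done$/;
  return pattern.test(eventNames.join(' '));
}
```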
Single-document Q&A — client.inference.documentAsk(req)
Ask one question against one document's full text. No retrieval step.
Request parameters:
| Parameter | Type | Default | Notes |
|---|---|---|---|
id | string (uuid) | — (required) | The document to ask against (in the request body). |
prompt | string | — (required) | The question. |
model | string | tier default | Model alias. |
maxTokens | integer | 2048 | Output cap. Capped at 8192. |
Event sequence: document_context → content_delta+ → done.
| Event | Fields | Notes |
|---|---|---|
document_context | documentId, title, textBytes, model | The document loaded, its size, and the resolved model. Fires before any generated text. |
Errors:
- `409` (before the stream opens): the document is not askable yet. It is still processing (not yet fully indexed), it failed ingest, or it was ingested without its text retained (`storeText` was not set, so there is no full text to load). Freshly ingested documents commonly return `409` until indexing completes.
- `413` (before the stream opens, no credits charged): the document's estimated input size exceeds the cap (32,000 input tokens, ~25 pages). Payload: `message`, `estimatedTokens`, `limitTokens`. Branch on this and re-route to RAG.
- `404`: the document does not exist, belongs to another tenant, or is out of your token's scope. All three return the identical `404`; the endpoint never reveals existence outside your scope.
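A caller can turn these three statuses into an explicit plan. The sketch below assumes the HTTP status and the `413` payload fields are available to the caller; how the SDK surfaces them (exception shape, property names) is not specified here, so `planAskFallback` takes them as plain arguments.

```typescript
type AskFallback =
  | { action: 'retry-later' }
  | { action: 'use-rag'; estimatedTokens?: number; limitTokens?: number }
  | { action: 'not-found' }
  | { action: 'fail'; status: number };

// Maps a document-ask rejection to a next step.
// Note: 409 also covers failed ingest and missing stored text, where
// retrying will not help; callers may want to inspect the document first.
function planAskFallback(
  status: number,
  payload: { estimatedTokens?: number; limitTokens?: number } = {},
): AskFallback {
  switch (status) {
    case 409: return { action: 'retry-later' };
    case 413: return { action: 'use-rag', estimatedTokens: payload.estimatedTokens, limitTokens: payload.limitTokens };
    case 404: return { action: 'not-found' };
    default: return { action: 'fail', status };
  }
}
```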
Stateless chat — client.inference.chatInference(req)
Single-turn completion. No retrieval, no stored state.
Request parameters:
| Parameter | Type | Default | Notes |
|---|---|---|---|
messages | { role, content }[] | — (required) | Conversation. A system role message becomes the system prompt; user / assistant messages pass through. |
model | string | tier default | Model alias. |
maxTokens | integer | 2048 | Output cap. Capped at 8192. |
temperature | number | 0.7 | Sampling temperature. |
topP | number | — | Nucleus-sampling parameter (optional). |
Event sequence: content_delta+ → done.
Chat stores nothing. For multi-turn, append the assistant's reply to your messages array
and re-send the whole array next turn.
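Because the service keeps no history, the caller owns the `messages` array. A minimal sketch of advancing a conversation by one turn, with `ChatMessage` and `nextTurn` as illustrative names:

```typescript
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

// Returns a new array for the next request: the previous history, the
// assistant reply that was just streamed, and the user's follow-up.
function nextTurn(history: ChatMessage[], assistantReply: string, userMessage: string): ChatMessage[] {
  return [
    ...history,
    { role: 'assistant', content: assistantReply },
    { role: 'user', content: userMessage },
  ];
}
```

The returned array is what would be passed as `messages` on the next `chatInference` call; trimming old turns to fit the model's context window is left to the application.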
Model catalog — client.inference.listInferenceModels()
Lists the models the calling key's plan tier can reach.
Response:
| Field | Type | Notes |
|---|---|---|
models | Model[] | Available models for this key. |
defaultModel | string | The alias used when a call omits model. Reachable on the free plan. |
Each Model:
| Field | Type | Notes |
|---|---|---|
id | string | Alias, e.g. claude-haiku-4-5, claude-sonnet-4-5, claude-sonnet-4-6, claude-opus-4-7. Matches the model vendor's marketing names. |
name | string | Display name. |
provider | string | Model provider. |
contextWindow | integer | Context window size in tokens. |
inputCreditsPer1kTokens | number | Input token credit rate. |
outputCreditsPer1kTokens | number | Output token credit rate. |
availableOn | string[] | Plan tiers that may call this model (e.g. free, starter, pro, scale, enterprise). |
Requesting a model your plan does not include returns a 402 pointing to upgrade. A lighter
model is available on every tier; more capable models require higher tiers.
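The catalog fields lend themselves to two small client-side helpers: filtering by tier and a rough cost estimate. Both are sketches. `CatalogModel` is a hand-written subset of `Model`, and the estimate assumes cost is simply tokens times the per-1k rates, which may not match the actual billing formula (region premium, for example); the authoritative figure is `platformCreditsCharged` on the `done` event.

```typescript
// Hand-written subset of the Model shape from the catalog response.
interface CatalogModel {
  id: string;
  inputCreditsPer1kTokens: number;
  outputCreditsPer1kTokens: number;
  availableOn: string[];
}

// Models that list the given plan tier in availableOn.
function modelsForTier(models: CatalogModel[], tier: string): CatalogModel[] {
  return models.filter((m) => m.availableOn.includes(tier));
}

// Rough estimate only: assumes cost = tokens / 1000 * per-1k rate.
function estimateCredits(model: CatalogModel, inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1000) * model.inputCreditsPer1kTokens
    + (outputTokens / 1000) * model.outputCreditsPer1kTokens;
}
```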
Region serving (allowGlobalRegion)
All three inference surfaces (chat, RAG, document-ask) accept an optional boolean
allowGlobalRegion in the request body.
| Field | Type | Default | Meaning |
|---|---|---|---|
allowGlobalRegion | boolean | tenant residency default | Opt this request into the lower-cost global (non-US) region path. |
- The tenant's residency default is US serving, applied when the flag is omitted. US serving is the fail-closed default and carries a region premium.
- Setting `allowGlobalRegion: true` lets an entitled tenant serve the request from the global region at a lower rate. Entitlement is gated on a signed global-processing waiver.
- If `allowGlobalRegion: true` is sent by a tenant that is not entitled, the request is rejected with `403`; it is not silently downgraded or upgraded. Region choice never changes which content is retrieved, only where the model runs and the price.
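The three rules above reduce to a small decision function. This is a model of the documented behavior for illustration, not the server's code; `resolveRegion` and the `entitledToGlobal` flag are hypothetical names.

```typescript
type RegionDecision = { region: 'us' } | { region: 'global' } | { rejected: 403 };

// Omitted or false: US serving (the residency default described above).
// true + entitled: global region. true + not entitled: rejected, never downgraded.
function resolveRegion(allowGlobalRegion: boolean | undefined, entitledToGlobal: boolean): RegionDecision {
  if (allowGlobalRegion !== true) return { region: 'us' };
  return entitledToGlobal ? { region: 'global' } : { rejected: 403 };
}
```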
Error codes (inference)
| Code | Surface | Meaning |
|---|---|---|
400 | all | Malformed request (e.g. missing query / messages / prompt, bad filter key). |
402 | all | Monthly credit limit exceeded, insufficient inference balance, usage cap reached, or a requested model the plan does not include. |
403 | all | Token scope does not permit inference (inference:r missing); or allowGlobalRegion: true was sent but this tenant is not entitled to global-region serving (no signed global-processing waiver). |
404 | document-ask | Document not found / cross-tenant / out-of-scope (uniform — existence never revealed). |
409 | document-ask | Document not askable yet — still processing (not yet indexed), failed ingest, or text not retained (ingested without storeText). Returned before the stream opens. |
413 | document-ask | Document exceeds the input-token cap (before the stream opens; no credits charged). |
429 | all | Burst rate limit exceeded. |
503 | RAG, search | A retrieval/search leg was unavailable and requireComplete: true (search) or search.requireComplete: true (RAG) was set. |
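The table suggests a natural split between statuses worth retrying and statuses that need a change on the caller's side. The policy below is an assumption for illustration, not documented behavior: it treats `429` and `503` as retryable with exponential backoff and everything else in the table as non-retryable without intervention. The delay values are arbitrary.

```typescript
// Assumed policy: only rate-limit and unavailable-leg rejections are retried.
function isRetryable(status: number): boolean {
  return status === 429 || status === 503;
}

// Exponential backoff: baseMs * 2^attempt, bounded by capMs.
function backoffMs(attempt: number, baseMs = 500, capMs = 30000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```

Statuses such as `402` and `403` call for a plan, balance, or token change, so retrying them unchanged would only repeat the rejection.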
Notes & limits — inference
- Hard output caps per surface: chat 8192, RAG 4096, document-ask 8192 output tokens. Document-ask additionally caps input at 32,000 tokens (~25 pages) with a `413` before the stream opens. A `maxTokens` above a surface's cap is clamped to the cap.
- RAG topK capped at 50. `search.limit` defaults to 10, max 50; retrieved context shares the model's input budget, so unbounded topK would push the prompt past the context window.
- Chat is stateless; there is no managed conversation state. No server-side thread store, no assistants registry. Multi-turn is the caller's responsibility: re-send the `messages` array each turn and budget the history against the model's context window.
- Document-ask is single-document. No multi-document Q&A endpoint. For multi-document grounding, use RAG (retrieval picks relevant passages across the corpus) or stitch multiple `/ask` calls at the application layer.
- Cost is recorded on finalization. The `done` event carries `platformCreditsCharged` and `inferenceBalanceCentsCharged`; an accounting hiccup will not fail an in-flight response, so the next call's balance may briefly lag.
- Cache-token fields are forward-declared. `done` may carry `cacheReadTokens` / `cacheCreateTokens`; the current billing formula does not yet apply a cache discount. Consumers already reading these will see the reduction when it lands, with no code change.
- In-perimeter scope. The in-perimeter (no third-party model-vendor egress) guarantee is for the partner data plane: the content you store and retrieve through Vectros. Specific compliance coverage terms are addressed in the security and compliance documentation, not asserted here.
- The model catalog is the source of truth. Handlers gate on the live catalog at request time, so a model going generally available or being retired takes effect immediately: what `listInferenceModels` returns is what the deployed handlers accept.
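The output caps and defaults listed on this page can be mirrored on the client to predict the effective `maxTokens` for a call. This is a sketch built from the numbers above; the constant and function names are illustrative.

```typescript
// Caps and defaults as stated in the parameter tables on this page.
const OUTPUT_CAPS = { chat: 8192, rag: 4096, documentAsk: 8192 } as const;
const OUTPUT_DEFAULTS = { chat: 2048, rag: 1024, documentAsk: 2048 } as const;
const DOCUMENT_ASK_INPUT_CAP = 32000;

type Surface = keyof typeof OUTPUT_CAPS;

// The value a surface would use: the default when omitted,
// otherwise the request reduced to the surface's cap.
function effectiveMaxTokens(surface: Surface, requested?: number): number {
  return Math.min(requested ?? OUTPUT_DEFAULTS[surface], OUTPUT_CAPS[surface]);
}

// True when an estimated input size is within the document-ask cap.
function fitsDocumentAsk(estimatedInputTokens: number): boolean {
  return estimatedInputTokens <= DOCUMENT_ASK_INPUT_CAP;
}
```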
Where to go next
- how-to.md — runnable guides for each call on this page.
- explanation.md — the concepts behind the modes, grounding context, and the three inference surfaces.
- ../data-model/reference.md — schema field declarations (searchable, filterable, sensitive) that govern what search indexes.
- ../operations-trust/compliance.md — sensitive-field protections, isolation guarantees, and the in-perimeter inference posture.