Search & RAG — reference

Exhaustive reference for content search and the three streaming inference surfaces: every parameter, field, mode, limit, error code, the stream event vocabulary, and an honest "Notes & limits" section stating what each feature does not do.

For request/response wire detail at the endpoint level, see the generated API reference (the OpenAPI / Scalar spec). This page documents the SDK-level surface and the behavior that the spec alone does not capture.

Version note. The API spec is at 0.27.0. Nothing on the search side of this page is 0.27-only, so it works on any current client. The optional inference flag allowGlobalRegion (region serving, below) is a 0.27 addition. Where the SDK method name differs from a raw field name, the SDK name is given.


Content search — client.search.content(req)

Unified search across documents and records. Requires the search:r scope.

Request parameters

| Parameter | Type | Default | Notes |
| --- | --- | --- | --- |
| query | string | (required) | Natural-language or keyword query. |
| mode | 'TEXT' \| 'SEMANTIC' \| 'HYBRID' | HYBRID | Keyword relevance / meaning-based similarity / fused ranking. |
| limit | integer | 20 | Max results. Valid range 1–100; out-of-range is rejected with 400. |
| offset | integer | 0 | Results to skip, for paging (see Paging below). |
| contentTypes | ('documents' \| 'records')[] | both | Narrow to one content type. Omitted, empty, or both values ⇒ unified. |
| typeName | string | | Restrict record hits to one schema type. Implicitly narrows to records (skips documents) unless contentTypes includes documents. |
| folderId | string (uuid) | | Restrict to content in this exact folder. |
| rootFolderId | string (uuid) | | Restrict to this folder and all descendants. Use instead of folderId. |
| userId | string (uuid) | | Ownership filter: content owned by this user. |
| orgId | string (uuid) | | Ownership filter: content of this organization. |
| clientId | string (uuid) | | Ownership filter: content of this client. |
| filters | object | | Field-level metadata filters; AND-combined across keys. See Filter grammar. |
| createdAfter | string (ISO-8601) | | Content created at/after this UTC timestamp. |
| createdBefore | string (ISO-8601) | | Content created at/before this UTC timestamp. |
| uniqueDocuments | boolean | false | When true, at most one hit per source document. |
| minSimilarity | number 0.0–1.0 | | Minimum semantic similarity; hits below are excluded (semantic / hybrid). |
| minTextRelevance | number 0.0–1.0 | | Relative keyword-relevance floor, as a fraction of the top hit's score (e.g. 0.5 keeps hits at least half as relevant as the best). Applies to TEXT / HYBRID. Omit or ≤0 keeps all. |
| textMode | 'OR' \| 'AND' \| 'PHRASE' \| 'COMPLEX' | OR | Keyword sub-mode (see Keyword sub-modes). Applies to TEXT / HYBRID. |
| slop | integer ≥0 | 0 | Phrase-match slop: intervening positions tolerated between terms when textMode='PHRASE'. Ignored otherwise. |
| requireComplete | boolean | false | Fail-closed override: return 503 instead of degraded partial results when a leg is unavailable. |

Keyword sub-modes (textMode)

For TEXT and HYBRID searches, textMode controls how query terms combine in the keyword leg:

  • OR (default) — match any term; broadest recall.
  • AND — require all terms; higher precision.
  • PHRASE — require terms as a contiguous sequence (tunable with slop).
  • COMPLEX — full keyword query syntax (boolean operators, field-scoped clauses, range filters). Use only when you need expression-level control; the query is parsed as a structured expression rather than a bag of terms.

Filter grammar (filters)

Each top-level key is a metadata field declared filterable on the schema (or a built-in document field). Top-level keys are AND-combined. Each value is one of:

  • Scalar (string / number / boolean) — equality, e.g. { "status": "open" }.
  • Array of scalars — OR-set (match any), e.g. { "tag": ["red", "blue"] }.
  • Operator map — closed set of operators:
    • Scalar operand: $eq, $ne, $gt, $gte, $lt, $lte. Operators in one map are AND-combined, so { "price": { "$gte": 100, "$lte": 500 } } is a closed range.
    • Array operand: $in, $nin. Cannot be combined with other operators in the same map.

Numbers and booleans are matched as typed values (the field must have been ingested under a typed schema). Dates may be ISO-8601 strings or epoch millis. Filter keys are validated against ^[A-Za-z_][A-Za-z0-9_-]*$; unknown operators, non-scalar operands, malformed keys, and any attempt to filter on a reserved tenancy/ownership key are rejected with 400. You cannot widen your access through the filter map: ownership scope is enforced separately.
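
The grammar above can be mirrored in a small client-side type plus a key pre-check, which may catch malformed keys before a round trip. This is a minimal sketch, not part of the SDK: `FilterValue`, `FilterMap`, and `invalidFilterKeys` are illustrative names, and the key rule is assumed to be "a letter or underscore, then letters, digits, underscores, or hyphens". The server remains the authority on validation.

```typescript
// Illustrative types for the filter grammar; these names are not SDK exports.
type Scalar = string | number | boolean;

type OperatorMap =
  | { $eq?: Scalar; $ne?: Scalar; $gt?: Scalar; $gte?: Scalar; $lt?: Scalar; $lte?: Scalar }
  | { $in: Scalar[] }
  | { $nin: Scalar[] };

type FilterValue = Scalar | Scalar[] | OperatorMap;
type FilterMap = Record<string, FilterValue>;

// Assumed key rule: a letter or underscore, then letters, digits, "_" or "-".
const FILTER_KEY = /^[A-Za-z_][A-Za-z0-9_-]*$/;

// Returns the keys that fail the assumed key rule (an empty array means all pass).
function invalidFilterKeys(filters: FilterMap): string[] {
  return Object.keys(filters).filter((key) => !FILTER_KEY.test(key));
}

const filters: FilterMap = {
  status: "open",                    // scalar: equality
  tag: ["red", "blue"],              // array of scalars: match any
  price: { $gte: 100, $lte: 500 },   // operator map: closed range
  region: { $nin: ["eu", "apac"] },  // array operator, alone in its map
};
```

The type only documents the shapes; it does not stop an array operator from being mixed with scalar operators in one map, so that rule is still enforced only by the server's 400.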

Response shape

search.content returns a flat object (it is not wrapped in the { data, nextCursor } cursor envelope used by list/lookup endpoints — see Paging):

| Field | Type | Notes |
| --- | --- | --- |
| results | SearchResult[] | Matched chunks, ranked (highest score first). May be empty. |
| totalResults | integer | Total matches found. 0 for a miss. |
| searchTimeMs | integer | Server-side execution time. Reported even on an empty result. |
| degraded | boolean | True when one leg was unavailable and results came from the survivor only. |
| degradedLegs | string[] | Which legs were unavailable: "text" (keyword) and/or "vector" (semantic). Empty when not degraded. |

Each SearchResult:

| Field | Type | Notes |
| --- | --- | --- |
| documentId | string (uuid) | Source entity id. Use with getDocument / getRecord. (This is the source id, not an internal index id.) |
| sourceType | 'PartnerDocument' \| 'GenericRecord' | Document vs. record discriminator. These are the two literal strings the API returns; branch on it when rendering mixed results. |
| score | number | Fused relevance score; primary sort key, higher is more relevant. |
| textScore | number | Keyword (relevance) sub-score. Non-zero in HYBRID when the keyword leg contributed. Always 0 in TEXT-only mode: that path ranks by result-array order, not by an exposed per-hit score. |
| semanticScore | number | Semantic similarity sub-score. Non-zero in SEMANTIC / HYBRID. |
| chunkText | string | The specific chunk that matched. Feed this (or contextText) to a model. |
| contextText | string | The wider surrounding passage containing the chunk; better grounding context. |
| snippet | string | Highlighted excerpt with query terms emphasized, for display. May be null for a semantic-only hit (use chunkText). |
| metadata | object | Metadata supplied at ingest (title, folderId, custom fields). |
| createdAt | string (ISO-8601) | Source content creation time. |

Paging

search.content has no nextCursor — it is not enveloped. Page by combining limit with offset:

const page1 = await client.search.content({ query, mode, limit: 20, offset: 0 });
const page2 = await client.search.content({ query, mode, limit: 20, offset: 20 });

limit may not exceed 100 (larger values are rejected with 400); consecutive pages are disjoint. This differs from list/lookup endpoints, which return { data, nextCursor } and are drained by feeding nextCursor back as startFrom. For pulling recent content deterministically, prefer a createdAfter window over deep offset paging.
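
The limit/offset walk can be wrapped in an async generator so callers iterate hits rather than pages. The sketch below is one possible shape, not an SDK helper: `pageThrough`, `SearchFn`, and the `maxHits` depth bound are hypothetical, and the search call is passed in as a function so the loop does not depend on the client's exact typings.

```typescript
// Minimal shapes for this sketch; the real SDK types may carry more fields.
interface SearchPage<T> {
  results: T[];
  totalResults: number;
}

type SearchFn<T> = (req: { limit: number; offset: number }) => Promise<SearchPage<T>>;

// Walks limit/offset pages until a short page or the reported total is reached.
// maxHits bounds how deep the walk goes; it is checked between pages.
async function* pageThrough<T>(
  search: SearchFn<T>,
  pageSize = 20,
  maxHits = 200,
): AsyncGenerator<T> {
  if (!Number.isInteger(pageSize) || pageSize < 1 || pageSize > 100) {
    throw new RangeError("pageSize must be an integer between 1 and 100");
  }
  let offset = 0;
  while (offset < maxHits) {
    const page = await search({ limit: pageSize, offset });
    for (const hit of page.results) yield hit;
    offset += page.results.length;
    const exhausted = page.results.length < pageSize || offset >= page.totalResults;
    if (exhausted) return;
  }
}
```

Against the real client this might be driven as `pageThrough((p) => client.search.content({ query, mode, ...p }))`, assuming search.content resolves to the flat response object described above.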

Notes & limits — search

  • Index/search mode must line up. Content indexed for one strategy only will not appear in a search requiring the other. Indexing mode is a property of the content (set at ingest / on the schema); search mode is a property of the query.
  • Only searchable fields participate in keyword relevance. A query matching only a non-searchable field returns nothing — by design.
  • Sensitive fields never enter the index. They cannot be searched, ranked, or surfaced under any scope (index-time exclusion).
  • textScore is 0 in TEXT-only mode — that path returns highlighted snippets but does not expose a per-hit keyword score; rank order is the signal.
  • minTextRelevance applies only to TEXT / HYBRID; minSimilarity applies only to SEMANTIC / HYBRID.
  • A miss is a 200, not a 404. Empty results, totalResults: 0.
  • Degradation is silent unless you check. Inspect degraded / degradedLegs, or set requireComplete: true to turn a degraded leg into a 503.
  • Cross-content search returns mixed types. Always branch on sourceType when rendering unified results.
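
Three of the points above (silent degradation, a miss being a normal response, and mixed source types) tend to meet in rendering code. The following sketch shows one way to handle them together; `Hit`, `SearchResponse`, `describeHit`, and `summarize` are illustrative names that model only the fields used here.

```typescript
// Subset of the documented response fields used by this sketch.
interface Hit {
  documentId: string;
  sourceType: "PartnerDocument" | "GenericRecord";
  chunkText: string;
  snippet: string | null;
}

interface SearchResponse {
  results: Hit[];
  totalResults: number;
  degraded: boolean;
  degradedLegs: string[];
}

// One display line per hit: label by sourceType, fall back to chunkText when snippet is null.
function describeHit(hit: Hit): string {
  const label = hit.sourceType === "PartnerDocument" ? "document" : "record";
  return `[${label}] ${hit.snippet ?? hit.chunkText}`;
}

// Summarizes a response, surfacing degradation instead of letting it pass unnoticed.
function summarize(res: SearchResponse): string[] {
  const lines: string[] = [];
  if (res.degraded) {
    lines.push(`warning: partial results, unavailable legs: ${res.degradedLegs.join(", ")}`);
  }
  if (res.totalResults === 0) lines.push("no matches");
  return lines.concat(res.results.map(describeHit));
}
```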

Inference surfaces — client.inference.*

Three streaming surfaces, all returning an async iterable of SSE events, all requiring the inference:r scope, all sharing one pre-flight check sequence. Inference runs against AWS-hosted models inside the Vectros perimeter (in-perimeter for the partner data plane).

Streaming model (shared)

Each surface returns an async iterable; iterate it to consume events in arrival order. Every event carries an event field naming its type, so a consumer can dispatch on one key. On the wire it is standard Server-Sent Events (event: <type> / data: <json> framed by a blank line) — any compliant SSE reader works; the SDK presents it as an async iterator.
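
For consumers reading the wire format without the SDK, the framing described above can be decoded with very little code. The sketch below parses a complete text buffer into frames; it is deliberately minimal (only the event: and data: fields, LF or CRLF line endings, no incremental chunk handling), so a production reader should prefer an established SSE library. `SseFrame` and `parseSseFrames` are illustrative names.

```typescript
interface SseFrame {
  event: string;
  data: unknown;
}

// Parses a complete SSE text buffer into frames separated by blank lines.
// Lines that are neither event: nor data: (comments, id:, retry:) are ignored.
function parseSseFrames(buffer: string): SseFrame[] {
  const frames: SseFrame[] = [];
  for (const block of buffer.split(/\r?\n\r?\n/)) {
    let event = "message"; // SSE default when a frame has no event: field
    const dataLines: string[] = [];
    for (const line of block.split(/\r?\n/)) {
      if (line.startsWith("event:")) {
        event = line.slice(6).trim();
      } else if (line.startsWith("data:")) {
        // Per the SSE format, one leading space after the colon is dropped.
        dataLines.push(line.slice(5).replace(/^ /, ""));
      }
    }
    if (dataLines.length > 0) {
      frames.push({ event, data: JSON.parse(dataLines.join("\n")) });
    }
  }
  return frames;
}
```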

Shared event vocabulary:

| Event | Fields | When |
| --- | --- | --- |
| content_delta | delta (string) | One chunk of generated text. Append each to build the answer. |
| done | inputTokens, outputTokens, model, platformCreditsCharged, inferenceBalanceCentsCharged, optionally cacheReadTokens / cacheCreateTokens | Terminal event with token counts, resolved model id, and per-call cost. Exactly one. |
| error | message, code | A mid-stream model failure. |

Surface-specific events are listed under each surface below.
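
Dispatching on the event field might look like the sketch below, which accumulates deltas, keeps the usage from done, and turns a mid-stream error into an exception. It assumes the SDK yields plain objects shaped like the shared vocabulary; `InferenceEvent` and `collectAnswer` are illustrative names, only some done fields are modeled, and the type of code is a guess.

```typescript
// Event shapes from the shared vocabulary; only the fields this sketch reads are typed.
type InferenceEvent =
  | { event: "content_delta"; delta: string }
  | { event: "done"; inputTokens: number; outputTokens: number; model: string }
  | { event: "error"; message: string; code: string | number }; // code type is assumed

interface Collected {
  text: string;
  usage: { inputTokens: number; outputTokens: number; model: string } | null;
}

// Consumes a stream in arrival order and returns the full text plus usage from done.
async function collectAnswer(stream: AsyncIterable<InferenceEvent>): Promise<Collected> {
  let text = "";
  let usage: Collected["usage"] = null;
  for await (const ev of stream) {
    switch (ev.event) {
      case "content_delta":
        text += ev.delta;
        break;
      case "done":
        usage = { inputTokens: ev.inputTokens, outputTokens: ev.outputTokens, model: ev.model };
        break;
      case "error":
        throw new Error(`inference failed mid-stream (${ev.code}): ${ev.message}`);
    }
  }
  return { text, usage };
}
```

A usage of null after the loop means the stream ended without a done event, which a caller may want to treat as an incomplete answer.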

Pre-flight checks (shared, fixed order)

Run before any model invocation; cheaper checks first:

  1. Action scope → 403. A scoped token must carry inference:r (or a wildcard). Root sk_* keys carry wildcard scope and pass by construction. The 403 does not enumerate the missing action.
  2. Monthly credit limit → 402. Once the period's cumulative credits exceed the plan's ceiling, inference rejects until the period rolls or the plan is upgraded.
  3. Burst rate limit → 429. Per-tenant request-rate ceiling, scaling with plan tier.
  4. Inference billing gate → 402. In balance mode (default on lower tiers), a per-partner pre-funded balance must be positive, else 402 Insufficient inference balance. In usage mode (Enterprise-shaped, post-billed), accumulated usage is checked against a contractual cap, else 402 when the cap is reached.

The token cost of a call is metered and recorded on stream finalization (after the stream closes, or on a broken pipe with partial output). A transient accounting failure does not fail the in-flight response — the balance on the next call may briefly lag.


Grounded corpus answers — client.inference.ragInference(req)

Retrieve-then-generate over your indexed content.

Request parameters:

| Parameter | Type | Default | Notes |
| --- | --- | --- | --- |
| query | string | (required) | The question to ground and answer. |
| model | string | tier default | Model alias. See Model catalog. |
| maxTokens | integer | 1024 | Output cap. Capped at 4096 (tighter than chat, because retrieved context shares the input budget). |
| temperature | number | 0.3 | Sampling temperature. |
| instructions | string | | Optional extra instructions for the answer. |
| search | object | | Retrieval params (below). |

search sub-object mirrors content search: mode (default HYBRID), limit (default 10, capped at 50 — this is the RAG topK), userId, orgId, clientId, folderId, rootFolderId, typeName, filters, contentTypes, createdAfter, createdBefore, requireComplete.

Event sequence: search_results → optional truncation_warning → content_delta+ → done.

| Event | Fields | Notes |
| --- | --- | --- |
| search_results | results[], totalResults, searchTimeMs, degraded, degradedLegs | Always emitted, even when results is empty. Each entry: documentId, score, textScore, semanticScore, chunkText, contextText, snippet, metadata, sourceType, typeName, createdAt. These are your citations. |
| truncation_warning | resultsRequested, resultsUsed, reason | Emitted before the answer if retrieved passages had to be dropped to fit the context budget. |

Behavior:

  • With search.requireComplete: true, a degraded retrieval leg causes the call to reject before the stream opens (503) instead of grounding on partial results.
  • An empty retrieval still emits search_results (empty) and still streams an answer (typically stating that nothing relevant was found).
  • A scoped token's data scope is enforced on the retrieval step.
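
A consumer of the RAG stream usually wants the citations and the answer together, plus a record of anything that weakened grounding. The sketch below is one way to fold the event sequence into a single result; `RagEvent`, `RagAnswer`, and `consumeRag` are illustrative names, and only a few fields of each event are modeled.

```typescript
interface Citation {
  documentId: string;
  score: number;
  chunkText: string;
}

// The RAG event sequence, reduced to the fields this sketch reads.
type RagEvent =
  | { event: "search_results"; results: Citation[]; degraded: boolean; degradedLegs: string[] }
  | { event: "truncation_warning"; resultsRequested: number; resultsUsed: number; reason: string }
  | { event: "content_delta"; delta: string }
  | { event: "done" }
  | { event: "error"; message: string };

interface RagAnswer {
  answer: string;
  citations: Citation[];
  grounded: boolean; // false when retrieval came back empty
  notes: string[];   // degradation and truncation remarks
}

async function consumeRag(stream: AsyncIterable<RagEvent>): Promise<RagAnswer> {
  const out: RagAnswer = { answer: "", citations: [], grounded: false, notes: [] };
  for await (const ev of stream) {
    if (ev.event === "search_results") {
      out.citations = ev.results;
      out.grounded = ev.results.length > 0;
      if (ev.degraded) out.notes.push(`retrieval degraded: ${ev.degradedLegs.join(", ")}`);
    } else if (ev.event === "truncation_warning") {
      out.notes.push(`used ${ev.resultsUsed} of ${ev.resultsRequested} passages (${ev.reason})`);
    } else if (ev.event === "content_delta") {
      out.answer += ev.delta;
    } else if (ev.event === "error") {
      throw new Error(ev.message);
    }
  }
  return out;
}
```

Keeping grounded separate from the answer text lets a UI flag an answer produced from an empty retrieval rather than relying on the model to say so.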

Single-document Q&A — client.inference.documentAsk(req)

Ask one question against one document's full text. No retrieval step.

Request parameters:

| Parameter | Type | Default | Notes |
| --- | --- | --- | --- |
| id | string (uuid) | (required) | The document to ask against (in the request body). |
| prompt | string | (required) | The question. |
| model | string | tier default | Model alias. |
| maxTokens | integer | 2048 | Output cap. Capped at 8192. |

Event sequence: document_context → content_delta+ → done.

| Event | Fields | Notes |
| --- | --- | --- |
| document_context | documentId, title, textBytes, model | The document loaded, its size, and the resolved model. Fires before any generated text. |

Errors:

  • 409 (before the stream opens) — the document is not askable yet: it is still processing (not yet fully indexed), it failed ingest, or it was ingested without its text retained (storeText was not set, so there is no full text to load). Freshly-ingested documents commonly return 409 until indexing completes.
  • 413 (before the stream opens, no credits charged) — the document's estimated input size exceeds the cap (32,000 input tokens, ~25 pages). Payload: message, estimatedTokens, limitTokens. Branch on this and re-route to RAG.
  • 404 — the document does not exist, belongs to another tenant, or is out of your token's scope. All three return the identical 404; the endpoint never reveals existence outside your scope.
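
Because all three errors arrive before the stream opens, a caller can decide what to do next from the status alone. The sketch below maps each status to a follow-up action. The `AskFailure` shape is hypothetical, since this page does not specify how the SDK surfaces HTTP failures; `AskPlan` and `planAfterAskFailure` are likewise illustrative.

```typescript
// Hypothetical failure shape: how the SDK surfaces HTTP errors is not specified here.
interface AskFailure {
  status: number;
  estimatedTokens?: number; // part of the 413 payload
  limitTokens?: number;     // part of the 413 payload
}

type AskPlan =
  | { action: "retry-later"; reason: string }
  | { action: "use-rag"; reason: string }
  | { action: "stop"; reason: string };

function planAfterAskFailure(err: AskFailure): AskPlan {
  switch (err.status) {
    case 409:
      // Retrying only helps while the document is still processing, so bound the attempts.
      return { action: "retry-later", reason: "document not askable yet" };
    case 413:
      return {
        action: "use-rag",
        reason: `document too large: ~${err.estimatedTokens ?? "?"} tokens, limit ${err.limitTokens ?? "?"}`,
      };
    case 404:
      return { action: "stop", reason: "not found or outside this token's scope" };
    default:
      return { action: "stop", reason: `unexpected status ${err.status}` };
  }
}
```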

Stateless chat — client.inference.chatInference(req)

Single-turn completion. No retrieval, no stored state.

Request parameters:

| Parameter | Type | Default | Notes |
| --- | --- | --- | --- |
| messages | { role, content }[] | (required) | Conversation. A system role message becomes the system prompt; user / assistant messages pass through. |
| model | string | tier default | Model alias. |
| maxTokens | integer | 2048 | Output cap. Capped at 8192. |
| temperature | number | 0.7 | Sampling temperature. |
| topP | number | | Nucleus-sampling parameter (optional). |

Event sequence: content_delta+ → done.

Chat stores nothing. For multi-turn, append the assistant's reply to your messages array and re-send the whole array next turn.
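
Since the history lives entirely on the caller's side, a small pure helper keeps the bookkeeping explicit. This is a sketch with illustrative names (`ChatMessage`, `nextTurn`); it returns a new array rather than mutating the previous turn's history.

```typescript
type Role = "system" | "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

// Returns a new history with the assistant's reply and the next user message appended.
function nextTurn(
  history: readonly ChatMessage[],
  assistantReply: string,
  userMessage: string,
): ChatMessage[] {
  return [
    ...history,
    { role: "assistant", content: assistantReply },
    { role: "user", content: userMessage },
  ];
}
```

The returned array is what would be passed as messages on the next chatInference call; the caller remains responsible for trimming it to fit the model's context window.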


Model catalog — client.inference.listInferenceModels()

Lists the models the calling key's plan tier can reach.

Response:

| Field | Type | Notes |
| --- | --- | --- |
| models | Model[] | Available models for this key. |
| defaultModel | string | The alias used when a call omits model. Reachable on the free plan. |

Each Model:

| Field | Type | Notes |
| --- | --- | --- |
| id | string | Alias, e.g. claude-haiku-4-5, claude-sonnet-4-5, claude-sonnet-4-6, claude-opus-4-7. Matches the model vendor's marketing names. |
| name | string | Display name. |
| provider | string | Model provider. |
| contextWindow | integer | Context window size in tokens. |
| inputCreditsPer1kTokens | number | Input token credit rate. |
| outputCreditsPer1kTokens | number | Output token credit rate. |
| availableOn | string[] | Plan tiers that may call this model (e.g. free, starter, pro, scale, enterprise). |

Requesting a model your plan does not include returns a 402 pointing to upgrade. A lighter model is available on every tier; more capable models require higher tiers.
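
The catalog fields are enough to pick a model client-side and avoid the 402. The sketch below filters by tier and gives a rough credit estimate; `modelsForTier` and `estimateCredits` are illustrative names, and the estimate assumes credits scale linearly with tokens at the listed per-1k rates, which this page does not confirm (it also ignores any region premium).

```typescript
// Subset of the documented Model fields used by this sketch.
interface Model {
  id: string;
  inputCreditsPer1kTokens: number;
  outputCreditsPer1kTokens: number;
  availableOn: string[];
}

// Models callable on a tier, cheapest output rate first. The input array is not modified.
function modelsForTier(models: Model[], tier: string): Model[] {
  return models
    .filter((m) => m.availableOn.includes(tier))
    .sort((a, b) => a.outputCreditsPer1kTokens - b.outputCreditsPer1kTokens);
}

// Rough estimate only: assumes a linear per-1k-token formula.
function estimateCredits(model: Model, inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens / 1000) * model.inputCreditsPer1kTokens +
    (outputTokens / 1000) * model.outputCreditsPer1kTokens
  );
}
```

The authoritative cost of a call is still the platformCreditsCharged value on the done event.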


Region serving (allowGlobalRegion)

All three inference surfaces (chat, RAG, document-ask) accept an optional boolean allowGlobalRegion in the request body.

| Field | Type | Default | Meaning |
| --- | --- | --- | --- |
| allowGlobalRegion | boolean | tenant residency default | Opt this request into the lower-cost global (non-US) region path. |

  • The tenant's residency default is US serving, applied when the flag is omitted. US serving is the fail-closed default and carries a region premium.
  • Setting allowGlobalRegion: true lets an entitled tenant serve the request from the global region at a lower rate. Entitlement is gated on a signed global-processing waiver.
  • If allowGlobalRegion: true is sent by a tenant that is not entitled, the request is rejected with 403 — it is not silently downgraded or upgraded. Region choice never changes which content is retrieved, only where the model runs and the price.

Error codes (inference)

| Code | Surface | Meaning |
| --- | --- | --- |
| 400 | all | Malformed request (e.g. missing query / messages / prompt, bad filter key). |
| 402 | all | Monthly credit limit exceeded, insufficient inference balance, usage cap reached, or a requested model the plan does not include. |
| 403 | all | Token scope does not permit inference (inference:r missing); or allowGlobalRegion: true was sent but this tenant is not entitled to global-region serving (no signed global-processing waiver). |
| 404 | document-ask | Document not found / cross-tenant / out-of-scope (uniform; existence is never revealed). |
| 409 | document-ask | Document not askable yet: still processing (not yet indexed), failed ingest, or text not retained (ingested without storeText). Returned before the stream opens. |
| 413 | document-ask | Document exceeds the input-token cap (before the stream opens; no credits charged). |
| 429 | all | Burst rate limit exceeded. |
| 503 | RAG, search | A retrieval/search leg was unavailable and requireComplete: true (search) or search.requireComplete: true (RAG) was set. |
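
Only some of these codes are worth retrying: 429 (burst limit), 503 (a leg was unavailable), and 409 while a document is still processing. The sketch below encodes that split with an exponential backoff. The backoff policy, its base and cap values, and the name `retryDelayMs` are assumptions of this example; the page does not define a retry contract or a Retry-After header.

```typescript
// Statuses where a later attempt can succeed without changing the request.
const RETRYABLE = new Set<number>([409, 429, 503]);

// Returns a delay in ms before the next attempt, or null when retrying cannot help
// (400, 402, 403, 404, 413 need a different request, plan, or token instead).
function retryDelayMs(
  status: number,
  attempt: number, // 0 for the first retry
  baseMs = 500,
  capMs = 30000,
): number | null {
  if (!RETRYABLE.has(status)) return null;
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```

For a 503, dropping requireComplete and accepting degraded results is an alternative to waiting.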

Notes & limits — inference

  • Hard output caps per surface: chat 8192, RAG 4096, document-ask 8192 output tokens. Document-ask additionally caps input at 32,000 tokens (~25 pages) with a 413 before the stream opens. A maxTokens above a surface's cap is clamped down to the cap.
  • RAG topK capped at 50. search.limit defaults to 10, max 50 — retrieved context shares the model's input budget, so unbounded topK would push the prompt past the context window.
  • Chat is stateless; there is no managed conversation state. No server-side thread store, no assistants registry. Multi-turn is the caller's responsibility — re-send the messages array each turn and budget the history against the model's context window.
  • Document-ask is single-document. No multi-document Q&A endpoint. For multi-document grounding, use RAG (retrieval picks relevant passages across the corpus) or stitch multiple /ask calls at the application layer.
  • Cost is recorded on finalization. The done event carries platformCreditsCharged and inferenceBalanceCentsCharged; an accounting hiccup will not fail an in-flight response, so the next call's balance may briefly lag.
  • Cache-token fields are forward-declared. done may carry cacheReadTokens / cacheCreateTokens; the current billing formula does not yet apply a cache discount. Consumers already reading these will see the reduction when it lands, with no code change.
  • In-perimeter scope. The in-perimeter (no third-party model-vendor egress) guarantee is for the partner data plane — the content you store and retrieve through Vectros. Specific compliance coverage terms are addressed in the security and compliance documentation, not asserted here.
  • The model catalog is the source of truth. Handlers gate on the live catalog at request time, so a model going generally available or being retired takes effect immediately — what listInferenceModels returns is what the deployed handlers accept.
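
The per-surface output caps and defaults listed on this page can be mirrored client-side to predict the output budget a call will actually get. A minimal sketch, with illustrative names (`Surface`, `effectiveMaxTokens`) and the numbers copied from this page; the server's behavior is what counts if the two ever disagree.

```typescript
type Surface = "chat" | "rag" | "documentAsk";

// Output caps and maxTokens defaults as documented on this page.
const OUTPUT_CAP: Record<Surface, number> = { chat: 8192, rag: 4096, documentAsk: 8192 };
const DEFAULT_MAX: Record<Surface, number> = { chat: 2048, rag: 1024, documentAsk: 2048 };

// The output budget a request ends up with: the requested value (or the default),
// limited to the surface's hard cap.
function effectiveMaxTokens(surface: Surface, requested?: number): number {
  return Math.min(requested ?? DEFAULT_MAX[surface], OUTPUT_CAP[surface]);
}
```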

Where to go next

  • how-to.md — runnable guides for each call on this page.
  • explanation.md — the concepts behind the modes, grounding context, and the three inference surfaces.
  • ../data-model/reference.md — schema field declarations (searchable, filterable, sensitive) that govern what search indexes.
  • ../operations-trust/compliance.md — sensitive-field protections, isolation guarantees, and the in-perimeter inference posture.