Search & RAG — concepts
Search and retrieval are the substrate underneath almost every Vectros-powered application. Once you have written documents and records into a context, search is how you find them again — by keyword, by meaning, or by both — and retrieval-augmented generation (RAG) is how you put a model in front of that content so it answers questions grounded in your data rather than the model's general training.
This page explains the mental model: what content search is and why it works the way it does, what the three search modes mean for a developer choosing between them, and how the three streaming inference surfaces — grounded answers over your whole corpus, single- document Q&A, and stateless chat — fit together. For runnable guides see how-to.md; for the exhaustive field-and-limit list see reference.md.
One search surface over everything you store
Vectros indexes two kinds of content into one searchable surface: documents (text you
ingest directly or upload as a file, which Vectros extracts and chunks) and records
(typed, schema'd entities). A single search call queries both at once. Each result carries
a sourceType discriminator so you can tell a document hit from a record hit and branch
your rendering accordingly, but you do not run two searches and merge them yourself —
unified, cross-content search is the default, and you narrow to one kind only when you want
to.
What flows into the index is governed by your schema. For records, only fields you declare searchable participate in keyword relevance; for documents, the title and the extracted body text are indexed. This matters in two directions. First, a query that only matches a non-searchable field returns nothing — that is the designed behavior, not a bug. Second, and importantly for regulated workloads, fields you mark sensitive never enter the search index at all. Sensitive data is excluded at index time, so it cannot be surfaced, ranked, or leaked through a search query under any scope. (This index-time exclusion is one of three independent protections for sensitive fields; the other two — redaction at write time and masking on read — are covered in the security and compliance docs.)
Every searchable row also carries its ownership and tenancy by construction. A search is always bounded to the caller's context first, and a scoped key narrowed to a particular user, organization, or client is enforced against the content itself — not bolted on as a filter the caller could widen. You cannot inject a tenancy or ownership filter through the search request to see content outside your scope.
The three search modes
Vectros runs two complementary retrieval strategies and lets you choose how much of each
you want per query. The mode parameter selects one of three behaviors.
TEXT — keyword relevance. Classic keyword matching: a result scores higher when it
contains the query's terms, downweighted for terms that appear everywhere, and length-
normalized so a short record does not lose to a long document just for being short. Text
mode is fast and cheap, and it is the right choice for exact-phrase lookups, known
identifiers, boolean term logic, and any case where the user is typing words they expect to
appear verbatim. It returns highlighted snippets so a UI can show why each result
matched.
SEMANTIC — meaning-based similarity. Instead of matching words, semantic mode matches
meaning. The query and the indexed content are compared as embeddings, so a search for
"exposure therapy for panic" can surface a passage about treating panic disorder with
graded exposure even when the exact words differ. Semantic mode is the right choice for
natural-language questions and conceptual recall, where the user cares about the idea, not
the vocabulary. It also returns the surrounding passage around each match (see Grounding
context below), which is what makes it useful for feeding a model.
HYBRID — fused ranking (the default, and the recommended choice). Hybrid runs both
strategies and fuses their rankings into one result list. The two cover each other's
weaknesses: keyword relevance is precise but brittle — it misses content phrased
differently from the query — while semantic similarity understands meaning but over-
retrieves loosely related material. Fusing them on rank position (not on raw scores,
which are not comparable between the two strategies) yields a result set that holds up
across exact lookups, natural-language questions, and the messy mix of the two that real
users actually type. A hit that both strategies rank highly rises to the top. For most
applications, hybrid is simply the best default; reach for TEXT or SEMANTIC only when
you have a specific reason to want one strategy alone.
A practical consequence worth internalizing: a document indexed for one strategy only will not appear in a search that requires the other. Indexing mode is a property of the content, chosen when you ingest it or define its schema; search mode is a property of the query. They have to line up for a result to surface.
Grounding context: chunks and their surrounding passage
When Vectros indexes a document for semantic search, it splits the text into small chunks
so that a match is precise — a query lands on the specific passage that is relevant, not
on the whole document it was buried inside. But a small chunk is often too little context
for a model to reason over confidently. So each result can carry two pieces of text: the
specific chunk that matched (chunkText) and a larger surrounding passage that contains it
(contextText). The wider passage is what you feed a model when you want it to ground an
answer without losing the thread — it is the difference between handing the model a
sentence and handing it the paragraph that sentence lives in.
This is why search and RAG are the same machinery viewed from two angles. Search returns ranked hits with their grounding context; RAG runs that exact search, takes the grounding context, and streams a model's answer built on top of it.
Resilience: graceful degradation
A hybrid search runs its two strategies independently. If one becomes temporarily
unavailable, the request does not fail — it returns results from the surviving strategy and
flags the response as degraded, naming which leg was missing. Your results may be less
complete than a full hybrid search, but you still get an answer. If completeness matters
more than availability for a given call, you can opt into fail-closed behavior so that a
degraded search is rejected outright rather than returning a partial set. Single-mode
TEXT and SEMANTIC searches never degrade — their single strategy is the request.
Grounded answers: the three inference surfaces
Vectros exposes three ways to put a model in front of content. All three stream their responses back as Server-Sent Events (SSE) over one connection, all three run the model inside the Vectros perimeter, and all three share the same sequence of pre-flight checks so that failures look the same regardless of which one you called.
Grounded answers over your corpus (RAG). You pass a question; Vectros runs a search across your indexed content, emits the matched results to you as a citation event before any text is generated, then streams the model's answer grounded on those results. This is the integrated retrieve-then-generate path: you get both the citations (so your UI can show the user what the answer was built on) and the answer, in one call. Retrieval reuses the same search engine described above, with the same modes, filters, and scope enforcement.
Single-document Q&A. You pass a question and one document id; Vectros loads that one document's full extracted text and streams an answer scoped to it. There is no retrieval step — the whole document is the context. This is the right surface when the user is already looking at a specific document and wants to ask about it, and it is bounded by a hard input-size cap (roughly 25 pages of text). For questions that span many documents, use RAG instead, whose retrieval step picks the relevant passages across the whole corpus.
Stateless chat. A plain single-turn completion: you pass a message array, Vectros streams the model's reply. Chat does not retrieve anything and does not store any conversation state — it is stateless by design. If you want multi-turn behavior, your application keeps the history and re-sends the full message array on each turn. Chat is the right surface when you are managing your own context (perhaps assembled from your own searches) and just want a model to complete it.
In-perimeter inference
Whichever surface you call, the model runs against AWS-hosted models from inside the Vectros AWS account. A prompt — a chat message, a RAG query with its retrieved passages, or a single document's text — does not cross out to a third-party model vendor's API. For teams building on regulated data, this is the architectural payoff: content in a prompt and content in a retrieved passage stay inside the same perimeter the rest of the platform runs in. This in-perimeter guarantee applies to the partner data plane — the data you store and retrieve through Vectros. (The precise terms of compliance coverage are addressed in the security and compliance documentation, not here.)
By default, inference is served from a US region — the fail-closed default for every
tenant. A tenant that is entitled to it (via a signed global-processing waiver) can opt an
individual request into a lower-cost global region by sending allowGlobalRegion: true;
without that entitlement the flag is rejected rather than honored. Region choice affects
only where the model runs and the price, never which content is retrieved. See the
reference for the field and its error.
The model catalog is plan-gated
Each inference call resolves a model. The set of models a given key can reach is governed by the plan tier — a lighter, fast model is available on every tier, and more capable models are available on higher tiers. You can list the catalog the calling key can reach, and if you request a model your plan does not include, the call is rejected with a clear "upgrade" signal rather than silently substituting one. Omit the model and Vectros falls back to a sensible default that every tier can reach — so a brand-new developer can make a working call without configuring anything first.
Why pre-flight checks, and in what order
Before any model is invoked, each inference request runs a fixed sequence of checks: permission (does this key's scope permit inference at all), monthly plan allowance, request- rate ceiling, and inference balance. The order is deliberate — the cheapest checks run first, so a request that would fail on balance never pays for a search round-trip, and a request that would fail on permission never touches your usage counters. The token cost of a call is metered and recorded only after the stream finishes, and the terminal event of every stream reports exactly what the call cost.
Where to go next
- how-to.md — runnable guides: hybrid search with filters, a grounded RAG answer consuming the stream and citations, single-document Q&A, and a stateless chat call.
- reference.md — every parameter, field, mode, limit, error code, and the honest "what this does not do" notes for search and inference.
- ../data-model/explanation.md — schemas, records, and the searchable / sensitive field declarations that govern what search sees.
- ../operations-trust/compliance.md — the three independent sensitive-field protections, tenant and context isolation, and the in-perimeter inference posture in full.