Search & RAG — concepts

Search and retrieval are the substrate underneath almost every Vectros-powered application. Once you have written documents and records into a context, search is how you find them again — by keyword, by meaning, or by both — and retrieval-augmented generation (RAG) is how you put a model in front of that content so it answers questions grounded in your data rather than the model's general training.

This page explains the mental model: what content search is and why it works the way it does, what the three search modes mean for a developer choosing between them, and how the three streaming inference surfaces — grounded answers over your whole corpus, single- document Q&A, and stateless chat — fit together. For runnable guides see how-to.md; for the exhaustive field-and-limit list see reference.md.

One search surface over everything you store

Vectros indexes two kinds of content into one searchable surface: documents (text you ingest directly or upload as a file, which Vectros extracts and chunks) and records (typed, schema'd entities). A single search call queries both at once. Each result carries a sourceType discriminator so you can tell a document hit from a record hit and branch your rendering accordingly, but you do not run two searches and merge them yourself — unified, cross-content search is the default, and you narrow to one kind only when you want to.

What flows into the index is governed by your schema. For records, only fields you declare searchable participate in keyword relevance; for documents, the title and the extracted body text are indexed. This matters in two directions. First, a query that only matches a non-searchable field returns nothing — that is the designed behavior, not a bug. Second, and importantly for regulated workloads, fields you mark sensitive never enter the search index at all. Sensitive data is excluded at index time, so it cannot be surfaced, ranked, or leaked through a search query under any scope. (This index-time exclusion is one of three independent protections for sensitive fields; the other two — redaction at write time and masking on read — are covered in the security and compliance docs.)

Every searchable row also carries its ownership and tenancy by construction. A search is always bounded to the caller's context first, and a scoped key narrowed to a particular user, organization, or client is enforced against the content itself — not bolted on as a filter the caller could widen. You cannot inject a tenancy or ownership filter through the search request to see content outside your scope.

The three search modes

Vectros runs two complementary retrieval strategies and lets you choose how much of each you want per query. The mode parameter selects one of three behaviors.

TEXT — keyword relevance. Classic keyword matching: a result scores higher when it contains the query's terms, downweighted for terms that appear everywhere, and length- normalized so a short record does not lose to a long document just for being short. Text mode is fast and cheap, and it is the right choice for exact-phrase lookups, known identifiers, boolean term logic, and any case where the user is typing words they expect to appear verbatim. It returns highlighted snippets so a UI can show why each result matched.

SEMANTIC — meaning-based similarity. Instead of matching words, semantic mode matches meaning. The query and the indexed content are compared as embeddings, so a search for "exposure therapy for panic" can surface a passage about treating panic disorder with graded exposure even when the exact words differ. Semantic mode is the right choice for natural-language questions and conceptual recall, where the user cares about the idea, not the vocabulary. It also returns the surrounding passage around each match (see Grounding context below), which is what makes it useful for feeding a model.

HYBRID — fused ranking (the default, and the recommended choice). Hybrid runs both strategies and fuses their rankings into one result list. The two cover each other's weaknesses: keyword relevance is precise but brittle — it misses content phrased differently from the query — while semantic similarity understands meaning but over- retrieves loosely related material. Fusing them on rank position (not on raw scores, which are not comparable between the two strategies) yields a result set that holds up across exact lookups, natural-language questions, and the messy mix of the two that real users actually type. A hit that both strategies rank highly rises to the top. For most applications, hybrid is simply the best default; reach for TEXT or SEMANTIC only when you have a specific reason to want one strategy alone.

A practical consequence worth internalizing: a document indexed for one strategy only will not appear in a search that requires the other. Indexing mode is a property of the content, chosen when you ingest it or define its schema; search mode is a property of the query. They have to line up for a result to surface.

Grounding context: chunks and their surrounding passage

When Vectros indexes a document for semantic search, it splits the text into small chunks so that a match is precise — a query lands on the specific passage that is relevant, not on the whole document it was buried inside. But a small chunk is often too little context for a model to reason over confidently. So each result can carry two pieces of text: the specific chunk that matched (chunkText) and a larger surrounding passage that contains it (contextText). The wider passage is what you feed a model when you want it to ground an answer without losing the thread — it is the difference between handing the model a sentence and handing it the paragraph that sentence lives in.

This is why search and RAG are the same machinery viewed from two angles. Search returns ranked hits with their grounding context; RAG runs that exact search, takes the grounding context, and streams a model's answer built on top of it.

Resilience: graceful degradation

A hybrid search runs its two strategies independently. If one becomes temporarily unavailable, the request does not fail — it returns results from the surviving strategy and flags the response as degraded, naming which leg was missing. Your results may be less complete than a full hybrid search, but you still get an answer. If completeness matters more than availability for a given call, you can opt into fail-closed behavior so that a degraded search is rejected outright rather than returning a partial set. Single-mode TEXT and SEMANTIC searches never degrade — their single strategy is the request.

Grounded answers: the three inference surfaces

Vectros exposes three ways to put a model in front of content. All three stream their responses back as Server-Sent Events (SSE) over one connection, all three run the model inside the Vectros perimeter, and all three share the same sequence of pre-flight checks so that failures look the same regardless of which one you called.

Grounded answers over your corpus (RAG). You pass a question; Vectros runs a search across your indexed content, emits the matched results to you as a citation event before any text is generated, then streams the model's answer grounded on those results. This is the integrated retrieve-then-generate path: you get both the citations (so your UI can show the user what the answer was built on) and the answer, in one call. Retrieval reuses the same search engine described above, with the same modes, filters, and scope enforcement.

Single-document Q&A. You pass a question and one document id; Vectros loads that one document's full extracted text and streams an answer scoped to it. There is no retrieval step — the whole document is the context. This is the right surface when the user is already looking at a specific document and wants to ask about it, and it is bounded by a hard input-size cap (roughly 25 pages of text). For questions that span many documents, use RAG instead, whose retrieval step picks the relevant passages across the whole corpus.

Stateless chat. A plain single-turn completion: you pass a message array, Vectros streams the model's reply. Chat does not retrieve anything and does not store any conversation state — it is stateless by design. If you want multi-turn behavior, your application keeps the history and re-sends the full message array on each turn. Chat is the right surface when you are managing your own context (perhaps assembled from your own searches) and just want a model to complete it.

In-perimeter inference

Whichever surface you call, the model runs against AWS-hosted models from inside the Vectros AWS account. A prompt — a chat message, a RAG query with its retrieved passages, or a single document's text — does not cross out to a third-party model vendor's API. For teams building on regulated data, this is the architectural payoff: content in a prompt and content in a retrieved passage stay inside the same perimeter the rest of the platform runs in. This in-perimeter guarantee applies to the partner data plane — the data you store and retrieve through Vectros. (The precise terms of compliance coverage are addressed in the security and compliance documentation, not here.)

By default, inference is served from a US region — the fail-closed default for every tenant. A tenant that is entitled to it (via a signed global-processing waiver) can opt an individual request into a lower-cost global region by sending allowGlobalRegion: true; without that entitlement the flag is rejected rather than honored. Region choice affects only where the model runs and the price, never which content is retrieved. See the reference for the field and its error.

The model catalog is plan-gated

Each inference call resolves a model. The set of models a given key can reach is governed by the plan tier — a lighter, fast model is available on every tier, and more capable models are available on higher tiers. You can list the catalog the calling key can reach, and if you request a model your plan does not include, the call is rejected with a clear "upgrade" signal rather than silently substituting one. Omit the model and Vectros falls back to a sensible default that every tier can reach — so a brand-new developer can make a working call without configuring anything first.

Why pre-flight checks, and in what order

Before any model is invoked, each inference request runs a fixed sequence of checks: permission (does this key's scope permit inference at all), monthly plan allowance, request- rate ceiling, and inference balance. The order is deliberate — the cheapest checks run first, so a request that would fail on balance never pays for a search round-trip, and a request that would fail on permission never touches your usage counters. The token cost of a call is metered and recorded only after the stream finishes, and the terminal event of every stream reports exactly what the call cost.

Where to go next

  • how-to.md — runnable guides: hybrid search with filters, a grounded RAG answer consuming the stream and citations, single-document Q&A, and a stateless chat call.
  • reference.md — every parameter, field, mode, limit, error code, and the honest "what this does not do" notes for search and inference.
  • ../data-model/explanation.md — schemas, records, and the searchable / sensitive field declarations that govern what search sees.
  • ../operations-trust/compliance.md — the three independent sensitive-field protections, tenant and context isolation, and the in-perimeter inference posture in full.