Search & RAG — how-to guides

Goal-oriented, runnable guides for searching your content and getting grounded answers from a model. Every snippet here uses real SDK calls that pass against a live Vectros environment. All data is synthetic — fictional names and values only.

SDK version. These guides use the Node SDK. The API spec is at 0.27.0; anything that needs a newer client is marked inline. Nothing on this page requires 0.26, so the core calls work on any current client, including the 0.23 staging build that the React toolkit and reference apps pin and the 0.26 build that the CLI and MCP server bundle. The optional global-region inference flag (allowGlobalRegion) is a 0.27 addition.

Client setup

All guides assume a constructed client. The constructor takes a token and an environment (the API base URL):

import { VectrosClient } from '@vectros-ai/sdk';

const client = new VectrosClient({
  token: process.env.VECTROS_API_KEY!,          // sk_*, ssk_*, or st_*
  environment: process.env.VECTROS_API_BASE_URL!, // e.g. https://api.vectros.ai
});

Sub-clients are grouped by area: client.search.*, client.inference.*, client.documents.*, client.records.*, client.schemas.*, client.folders.*, and client.auth.* (used for minting scoped tokens later in this guide).

A note on indexing latency. Writes are indexed asynchronously — a document or record is not searchable the instant the write call returns. In application code you typically do not wait; you write, and the content becomes searchable shortly after. In tests and demos, poll until the content surfaces (search by a unique marker phrase, or scope retrieval with createdAfter to your run's start time) rather than sleeping a fixed interval.
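The polling advice above can be sketched as a small helper. This is a minimal sketch, not part of the Vectros SDK: pollUntil, its option names, and its default timings are all hypothetical. It takes a probe function so it stays independent of any particular client call.

```typescript
// Hypothetical helper (not an SDK function): call `probe` until it returns a
// defined value, or fail once the deadline would be exceeded.
async function pollUntil<T>(
  probe: () => Promise<T | undefined>,
  opts: { timeoutMs?: number; intervalMs?: number } = {},
): Promise<T> {
  const timeoutMs = opts.timeoutMs ?? 30_000; // assumed default, tune per test suite
  const intervalMs = opts.intervalMs ?? 500;  // assumed default
  const deadline = Date.now() + timeoutMs;

  for (;;) {
    const value = await probe();
    if (value !== undefined) return value;
    if (Date.now() + intervalMs > deadline) {
      throw new Error(`content did not surface within ${timeoutMs} ms`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Possible use in a test, with `marker` being a unique phrase you wrote into the content:
// const hit = await pollUntil(async () => {
//   const r = await client.search.content({ query: marker, mode: 'TEXT', limit: 1 });
//   return r.results?.[0];
// });
```

Passing the search call in as a probe keeps the wait bounded and makes the helper reusable for a createdAfter-scoped query as well.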

Search your content with hybrid mode and filters

Goal: find content across documents and records, then narrow with filters and read the results.

Prerequisites: a key with search:r; some indexed content.

Steps

A unified hybrid search — the default mode, querying both documents and records:

const results = await client.search.content({
  query: 'lifestyle changes for stage 1 hypertension',
  mode: 'HYBRID',     // 'TEXT' | 'SEMANTIC' | 'HYBRID' (default HYBRID)
  limit: 10,
});

for (const hit of results.results ?? []) {
  console.log(hit.sourceType, hit.documentId, hit.score);
  console.log('matched chunk:', hit.chunkText);
  console.log('grounding context:', hit.contextText);
}
console.log(`${results.totalResults} total, ${results.searchTimeMs} ms`);

Each hit carries a sourceType discriminator, the source documentId (use it with getDocument / getRecord), the fused score, the matched chunkText, the wider contextText, a highlighted snippet, and the metadata supplied at ingest time. Branch on sourceType to tell a document hit from a record hit: the literal values the API returns are the strings "PartnerDocument" and "GenericRecord".
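Branching on sourceType might look like the sketch below. The SearchHit interface is a hypothetical local shape that lists only the fields the sketch reads; it is not the SDK's exported type. Hits with an unrecognized discriminator are kept in a separate bucket rather than dropped, which is a design assumption and not documented behavior.

```typescript
// Hypothetical local shape: only the fields this sketch reads.
interface SearchHit {
  sourceType?: string;
  documentId?: string;
  score?: number;
}

// Split hits into document hits and record hits using the two literal
// sourceType strings the guide names.
function partitionHits<T extends SearchHit>(hits: T[]): {
  documents: T[];
  records: T[];
  unknown: T[];
} {
  const documents: T[] = [];
  const records: T[] = [];
  const unknown: T[] = [];
  for (const hit of hits) {
    if (hit.sourceType === 'PartnerDocument') documents.push(hit);
    else if (hit.sourceType === 'GenericRecord') records.push(hit);
    else unknown.push(hit); // missing or unrecognized discriminator: keep it visible
  }
  return { documents, records, unknown };
}
```

With the buckets in hand, document hits can be resolved with getDocument and record hits with getRecord.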

Narrow to one content type with contentTypes:

// Documents only
const docsOnly = await client.search.content({
  query: 'ACE inhibitor dosing',
  mode: 'HYBRID',
  contentTypes: ['documents'],
});

// Records only
const recordsOnly = await client.search.content({
  query: 'ACE inhibitor dosing',
  mode: 'HYBRID',
  contentTypes: ['records'],
});

Narrow records to one schema type with typeName (this implicitly restricts to records):

const patientRecords = await client.search.content({
  query: 'follow-up scheduled',
  mode: 'TEXT',
  typeName: 'patient_visit',   // only records of this schema type
  limit: 50,
});

Scope to a folder (exact folder), or to a folder and all its descendants:

const inFolder = await client.search.content({
  query: 'intake notes',
  mode: 'HYBRID',
  folderId: '<folder-uuid>',        // this exact folder only
  // rootFolderId: '<folder-uuid>', // this folder AND all descendants
});

Filter by ownership (user / org / client) and by metadata fields:

const scoped = await client.search.content({
  query: 'medication review',
  mode: 'HYBRID',
  clientId: '<client-uuid>',                 // ownership scope
  filters: {
    status: 'open',                          // equality
    tag: ['anxiety', 'depression'],          // OR-set: match any
    visitCount: { $gte: 2, $lte: 10 },       // closed range
  },
});

Filter values may be a scalar (equality), an array (match any), or an operator map ($eq, $ne, $gt, $gte, $lt, $lte, $in, $nin). Metadata filter keys must be fields declared filterable on the schema; you cannot use the filters map to inject a tenancy or ownership key to widen your access — those are enforced separately.
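The three filter forms can be checked on the client before a request is sent. The sketch below is illustrative only: classifyFilter and classifyFilters are hypothetical helpers, and the choice to treat only strings, numbers, and booleans as scalars is an assumption. The operator list is the one given above. The sketch does not check whether a key is declared filterable on the schema; that remains a server-side rule.

```typescript
// Operator names as listed in the guide.
const FILTER_OPERATORS = ['$eq', '$ne', '$gt', '$gte', '$lt', '$lte', '$in', '$nin'];

type FilterKind = 'equality' | 'any-of' | 'operators';

// Hypothetical helper: name the form a single filter value takes, or throw
// if it is none of the three forms the guide describes.
function classifyFilter(value: unknown): FilterKind {
  if (Array.isArray(value)) return 'any-of';
  if (typeof value === 'string' || typeof value === 'number' || typeof value === 'boolean') {
    return 'equality'; // assumption: scalars are string, number, or boolean
  }
  if (value !== null && typeof value === 'object') {
    const keys = Object.keys(value);
    const unknownKeys = keys.filter((k) => !FILTER_OPERATORS.includes(k));
    if (keys.length > 0 && unknownKeys.length === 0) return 'operators';
    throw new Error(`unsupported operator map: ${unknownKeys.join(', ') || '(empty)'}`);
  }
  throw new Error(`unsupported filter value: ${String(value)}`);
}

// Classify every entry of a filters map, failing fast on the first bad one.
function classifyFilters(filters: Record<string, unknown>): Record<string, FilterKind> {
  const kinds: Record<string, FilterKind> = {};
  for (const [field, value] of Object.entries(filters)) {
    kinds[field] = classifyFilter(value);
  }
  return kinds;
}
```

Failing fast on a misspelled operator such as $gte written as $ge gives a clearer message than waiting for the API to reject the request.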

Expected result: a results array plus totalResults, searchTimeMs, and the degraded / degradedLegs fields. An empty result set for a query that matches nothing is a normal 200 with results: [] and totalResults: 0 — not an error.

Paging search results

search.content is not wrapped in the cursor envelope that list and lookup endpoints use — it has no nextCursor. Page with limit and offset instead:

const page1 = await client.search.content({ query, mode: 'TEXT', limit: 20, offset: 0 });
const page2 = await client.search.content({ query, mode: 'TEXT', limit: 20, offset: 20 });

limit is capped at 100 and offset at 200 (an offset above 200 is rejected with 400). Consecutive pages are disjoint. For pulling a known set of recent content deterministically — or for reaching past the 200-row offset ceiling — prefer a createdAfter window over deep offset paging.
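A paging loop that respects both caps could be sketched as follows. collectPages is a hypothetical helper, and it takes a fetchPage callback rather than a client so the sketch stays self-contained; in real code the callback would wrap client.search.content and return its results array.

```typescript
const MAX_LIMIT = 100;  // per the guide: limit is capped at 100
const MAX_OFFSET = 200; // per the guide: an offset above 200 is rejected with 400

// Hypothetical helper: read consecutive pages until a short page arrives or
// the next offset would exceed the ceiling.
async function collectPages<T>(
  fetchPage: (limit: number, offset: number) => Promise<T[]>,
  limit = 20,
): Promise<T[]> {
  if (limit < 1 || limit > MAX_LIMIT) {
    throw new RangeError(`limit must be between 1 and ${MAX_LIMIT}`);
  }
  const all: T[] = [];
  for (let offset = 0; offset <= MAX_OFFSET; offset += limit) {
    const page = await fetchPage(limit, offset);
    all.push(...page);
    if (page.length < limit) break; // short page: nothing further to read
  }
  return all;
}

// Possible use:
// const hits = await collectPages(async (limit, offset) => {
//   const r = await client.search.content({ query, mode: 'TEXT', limit, offset });
//   return r.results ?? [];
// });
```

The loop never requests an offset above 200, so the most it can return is the rows up to offset 200 plus one final page. Anything deeper needs the createdAfter window described above.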

Collapse multiple chunks of the same document

A long document can match on several of its chunks and produce several hits with the same documentId. To get at most one hit per source document, set uniqueDocuments: true:

const collapsed = await client.search.content({
  query: 'hypertension monitoring',
  mode: 'TEXT',
  uniqueDocuments: true,   // at most one hit per source document
  limit: 50,
});

Get a grounded answer over your whole corpus (RAG)

Goal: ask a natural-language question and stream back a model answer grounded on your indexed content, with citations.

Prerequisites: a key with inference:r (and search:r for the retrieval step); some indexed content; a positive inference balance on balance-mode plans.

Steps

ragInference returns an async iterable of SSE events. The first informative event is search_results (the citations), followed by content_delta events (the streamed answer); a single done event ends the stream. The loop below also handles the truncation_warning and error events:

const stream = await client.inference.ragInference({
  query: 'What treatment is recommended for stage 1 hypertension?',
  model: 'claude-sonnet-4-5',     // optional; omit for the tier default
  search: {
    mode: 'HYBRID',
    limit: 5,                     // retrieval topK (default 10, max 50)
    // narrow retrieval the same way you narrow search:
    // clientId, userId, orgId, folderId, typeName, createdAfter, createdBefore
  },
  maxTokens: 512,                 // capped at 4096 for RAG
});

let answer = '';
let citations: Array<{ documentId: string; chunkText: string }> = [];

for await (const event of stream) {
  switch (event.event) {
    case 'search_results':
      citations = event.results;          // show these so the user can verify grounding
      break;
    case 'truncation_warning':
      // some retrieved passages were dropped to fit the context budget
      console.warn('grounding truncated:', event.reason);
      break;
    case 'content_delta':
      answer += event.delta;              // append each chunk
      break;
    case 'done':
      console.log('tokens:', event.inputTokens, '→', event.outputTokens);
      console.log('charged:', event.inferenceBalanceCentsCharged, 'cents');
      break;
    case 'error':
      throw new Error(event.message);
  }
}

Each entry in search_results.results carries documentId, score, chunkText, contextText, snippet, metadata, sourceType, typeName, and createdAt — the same shape as a search hit, so you can render the citations exactly as you would render search results. UIs typically show the citations above the answer so the user can verify the grounding while the model is still generating.

If retrieval finds nothing, the search_results event still fires (with an empty results array) and the model still answers — typically stating that it found no relevant content. The contract is that search_results is always emitted, even when empty.
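Because search_results always fires, a consumer can record whether the answer was actually grounded. The sketch below folds a stream into one result object. RagEvent and Citation are hypothetical local shapes that cover only the fields read here (the SDK's own event types carry more), and collectRag and the grounded flag are illustrative names.

```typescript
// Hypothetical local shapes: a subset of what the stream delivers.
interface Citation { documentId: string; chunkText: string }
type RagEvent =
  | { event: 'search_results'; results: Citation[] }
  | { event: 'truncation_warning'; reason: string }
  | { event: 'content_delta'; delta: string }
  | { event: 'done' }
  | { event: 'error'; message: string };

interface RagOutcome {
  answer: string;
  citations: Citation[];
  grounded: boolean;  // false when retrieval returned an empty results array
  truncated: boolean; // true if a truncation_warning was seen
  completed: boolean; // true once the done event was seen
}

// Fold the event stream into a single outcome; an error event becomes a throw.
async function collectRag(stream: AsyncIterable<RagEvent>): Promise<RagOutcome> {
  const out: RagOutcome = {
    answer: '', citations: [], grounded: false, truncated: false, completed: false,
  };
  for await (const ev of stream) {
    switch (ev.event) {
      case 'search_results':
        out.citations = ev.results;
        out.grounded = ev.results.length > 0;
        break;
      case 'truncation_warning':
        out.truncated = true;
        break;
      case 'content_delta':
        out.answer += ev.delta;
        break;
      case 'done':
        out.completed = true;
        break;
      case 'error':
        throw new Error(ev.message);
    }
  }
  return out;
}
```

A UI could use grounded === false to label the answer as not backed by any retrieved content instead of rendering an empty citation list.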

Refuse a partial answer. If you would rather fail than ground on a degraded retrieval, set search.requireComplete: true. The call then returns an error before the stream opens when a search leg is unavailable, instead of grounding on partial results.

Expected result: the assembled answer string, a citations array you can display, and a per-call cost on the done event.

Scoping RAG retrieval with a scoped token

A scoped token's data scope is enforced on the retrieval step, so RAG over a scoped token only grounds on content that token is allowed to see. Mint a token scoped to a user and RAG through it:

const minted = await client.auth.mintToken({
  userId: '<user-uuid>',
  scope: {
    allowedActions: ['inference:r', 'search:r'],
    dataScope: { userId: ['<user-uuid>'] },
  },
});

const scoped = new VectrosClient({
  token: minted.token,
  environment: process.env.VECTROS_API_BASE_URL!,
});

const stream = await scoped.inference.ragInference({
  query: 'What treatments are discussed?',
  search: { mode: 'HYBRID', limit: 10, userId: '<user-uuid>' },
  maxTokens: 256,
});

Because the data scope lists a single userId with no null sentinel, every call must include the matching userId. To also reach tenant-level (owner-less) content under the same token, include null in the scope list: dataScope: { userId: ['<user-uuid>', null] }.
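The two scope shapes described above differ only in the null sentinel. A small builder makes that choice explicit; userDataScope is a hypothetical helper, and its return value is meant to be passed as the dataScope field of the mintToken call shown earlier.

```typescript
// Shape of the userId part of a data scope, as used in the mintToken example.
type UserDataScope = { userId: Array<string | null> };

// Hypothetical helper: build a data scope for one user, optionally adding the
// null sentinel so tenant-level (owner-less) content is reachable too.
function userDataScope(userId: string, includeTenantLevel = false): UserDataScope {
  return { userId: includeTenantLevel ? [userId, null] : [userId] };
}
```

Defaulting includeTenantLevel to false keeps the narrower scope as the default, so widening access is always a deliberate choice at the call site.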

Ask a single document

Goal: ask one question about one specific document and stream the answer.

Prerequisites: a key with inference:r; the document's id; the document indexed with its text stored.

Steps

documentAsk takes the document id and a prompt in the request body. It streams a document_context event first (the document it loaded), then the answer:

const stream = await client.inference.documentAsk({
  id: '<document-uuid>',
  prompt: 'Which medications does this document describe, and how do they work?',
  model: 'claude-haiku-4-5',   // optional; omit for the tier default
  maxTokens: 256,              // output cap 8192
});

let answer = '';
for await (const event of stream) {
  switch (event.event) {
    case 'document_context':
      console.log('asking against:', event.documentId, event.title, `${event.textBytes} bytes`);
      break;
    case 'content_delta':
      answer += event.delta;
      break;
    case 'done':
      console.log('charged:', event.inferenceBalanceCentsCharged, 'cents');
      break;
  }
}

Handle the oversize case. A document larger than the input cap (~25 pages, 32,000 estimated input tokens) is rejected with a 413 before the stream opens — and before any credits are charged. The error payload carries estimatedTokens and limitTokens so you can branch and re-route to RAG:

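// askViaRag is your own fallback helper (for example, a wrapper around ragInference).
async function askDocument(id: string, prompt: string) {
  try {
    const stream = await client.inference.documentAsk({ id, prompt, maxTokens: 256 });
    // ... consume stream
  } catch (err: any) {
    if (err.statusCode === 413) {
      // too large for single-document ask: use corpus RAG instead
      return askViaRag(prompt);
    }
    throw err;
  }
}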

A not-yet-ready document returns 409. A document that is not yet fully indexed, that failed ingest, or that was ingested without its text stored returns a 409 before the stream opens. Asking a document the instant after you ingest it commonly hits this — wait for indexing to complete, and ingest with the text stored if you intend to ask against it.

Out-of-scope and missing documents both return 404. Asking about a document that does not exist, that belongs to another tenant, or that your token's scope cannot see all return the same 404 — the endpoint never reveals whether a document exists outside your scope.
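The three pre-stream failures call for different handling, which a small mapping can make explicit. This is a sketch under assumptions: classifyAskStatus and the outcome names are hypothetical, and it assumes the thrown error exposes the HTTP status as a number, as the 413 example above does with err.statusCode.

```typescript
type AskFailure = 'too_large' | 'not_ready' | 'not_visible' | 'other';

// Hypothetical helper: map the HTTP status of a failed documentAsk call to
// the handling the guide describes.
function classifyAskStatus(statusCode: number | undefined): AskFailure {
  switch (statusCode) {
    case 413:
      return 'too_large';   // re-route to corpus RAG
    case 409:
      return 'not_ready';   // retrying helps only if indexing is still pending
    case 404:
      return 'not_visible'; // missing, other tenant, or out of scope: indistinguishable
    default:
      return 'other';       // rethrow
  }
}
```

Only a 409 caused by pending indexing is worth retrying; a failed ingest or a document stored without its text will keep returning 409 until it is re-ingested.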

Expected result: a streamed answer grounded only on that one document, or a structured error: 413 (too large), 409 (not ready), or 404 (not visible to you).

Make a stateless chat call

Goal: a single-turn model completion you manage the context for yourself.

Prerequisites: a key with inference:r; a positive inference balance on balance-mode plans.

Steps

chatInference takes a messages array. A system message sets the system prompt; the rest is passed through. It streams content_delta events then a done:

const stream = await client.inference.chatInference({
  messages: [
    { role: 'system', content: 'You are a concise clinical scribe.' },
    { role: 'user',   content: 'Summarize this visit note in three bullets: ...' },
  ],
  model: 'claude-haiku-4-5',  // optional
  maxTokens: 512,             // output cap 8192
});

let reply = '';
for await (const event of stream) {
  if (event.event === 'content_delta') reply += event.delta;
  if (event.event === 'done') {
    console.log('model:', event.model);
    console.log('tokens:', event.inputTokens, '→', event.outputTokens);
  }
}

Multi-turn is your responsibility. Chat stores nothing. To carry a conversation, append the model's reply to your messages array and re-send the whole array on the next turn:

const history = [
  { role: 'user',      content: 'My care plan emphasizes lifestyle changes.' },
  { role: 'assistant', content: 'Noted — lifestyle-first care plan.' },
  { role: 'user',      content: 'What did I just say my care plan emphasized?' },
];
const stream = await client.inference.chatInference({ messages: history, maxTokens: 64 });
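Since the caller owns the history, it helps to keep the append steps in one place. The helpers below are a hypothetical sketch: ChatMessage mirrors the role and content fields used in the examples above, and both functions return a new array instead of mutating the one passed in.

```typescript
type ChatRole = 'system' | 'user' | 'assistant';
interface ChatMessage { role: ChatRole; content: string }

// Hypothetical helper: record the model's reply after a turn completes.
function withAssistantReply(history: readonly ChatMessage[], reply: string): ChatMessage[] {
  return [...history, { role: 'assistant', content: reply }];
}

// Hypothetical helper: add the next user message; the result is the full
// messages array to send on the next call.
function withUserMessage(history: readonly ChatMessage[], text: string): ChatMessage[] {
  return [...history, { role: 'user', content: text }];
}
```

After each stream finishes, pass the accumulated reply to withAssistantReply, then call withUserMessage for the next question and send the whole array again.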

Expected result: the streamed reply and a done event with token counts and the resolved model id.

List the models your key can reach

Goal: discover which inference models the calling key's plan permits, before you call.

const catalog = await client.inference.listInferenceModels();

console.log('default model:', catalog.defaultModel);
for (const m of catalog.models) {
  console.log(m.id, '— context window', m.contextWindow, '— plans:', m.availableOn);
}

Each entry carries the alias id (e.g. claude-haiku-4-5), a display name, the provider, the contextWindow, per-1k-token credit rates, and availableOn (the plan tiers that may call it). defaultModel is what an inference call resolves to when you omit model — and it is reachable on the free plan, so a brand-new key can make a working call immediately.
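Choosing a model from the catalog might be sketched as a filter over availableOn and contextWindow. The interfaces below are hypothetical local shapes. The sketch assumes availableOn is an array of plan-name strings and that you know your own plan's name, neither of which this page confirms; check reference.md for the actual field types.

```typescript
// Hypothetical local shapes: only the catalog fields this sketch reads.
interface ModelEntry { id: string; contextWindow: number; availableOn: string[] }
interface ModelCatalog { defaultModel: string; models: ModelEntry[] }

// Hypothetical helper: ids of models the given plan may call that also meet a
// minimum context window. Assumes plan names are compared as exact strings.
function modelsForPlan(catalog: ModelCatalog, plan: string, minContextWindow = 0): string[] {
  return catalog.models
    .filter((m) => m.availableOn.includes(plan) && m.contextWindow >= minContextWindow)
    .map((m) => m.id);
}
```

If the filter returns nothing, omitting model and letting the call resolve to defaultModel is the safe fallback, since the default is reachable on the free plan.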

Drive search and RAG from an AI agent (MCP)

If you are building with an agent over the Model Context Protocol rather than calling the SDK directly, the Vectros MCP server exposes the same capabilities as tools:

hybrid_search: wraps content search. Same modes, filters, ownership scope, folder scope, and uniqueDocuments / minSimilarity knobs. (The MCP tool caps results lower than the API, with a default of 3 and a max of 10, to protect the agent's context window; paginate with offset for more.)
rag_ask: wraps corpus RAG. The agent gets the assembled answer plus the citations and usage in one tool result; progress notifications keep the call alive during generation. (Its retrieval defaults to 5 results, max 10, which is also lower than the API.)
document_ask: wraps single-document Q&A, including the structured oversize signal.

Why agent results can look sparse. The MCP tools deliberately default to far fewer results than the API (hybrid_search and record_query default to 3, rag_ask retrieval to 5; the API defaults are 10–100). This protects the agent's context window; it is not a bug. Raise each tool's limit (up to its max of 10) when an agent needs wider recall. The full per-tool cap table is in clients/mcp.md.

The agent never sees content outside the scoped key the MCP server is configured with — the same scope enforcement applies. See the blueprint walkthroughs for the end-to-end no-code agent path.
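When an agent framework lets you set tool arguments, the MCP defaults and caps quoted in this section can be applied before the call is made. The table and effectiveLimit helper below are a hypothetical sketch. Whether the server clamps or rejects an over-max limit is not stated here, so the sketch clamps on the client side as an assumption.

```typescript
// Defaults and maxima as quoted in this section; see clients/mcp.md for the
// authoritative per-tool table.
const MCP_LIMITS = {
  hybrid_search: { default: 3, max: 10 },
  record_query: { default: 3, max: 10 },
  rag_ask: { default: 5, max: 10 },
} as const;

type McpTool = keyof typeof MCP_LIMITS;

// Hypothetical helper: the limit to send for a tool, clamped to 1..max, or
// the tool's default when the agent did not ask for a specific value.
function effectiveLimit(tool: McpTool, requested?: number): number {
  const { default: fallback, max } = MCP_LIMITS[tool];
  if (requested === undefined) return fallback;
  return Math.min(Math.max(1, Math.floor(requested)), max);
}
```

An agent that needs wider recall than 10 results per call has to page hybrid_search with offset instead of raising limit further.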

Where to go next

reference.md — every parameter, field, limit, and error code for search and the three inference surfaces.
explanation.md — the concepts: why the three modes exist, grounding context, and how search and RAG are the same machinery.
../data-model/how-to.md — defining schemas with searchable and sensitive fields, and writing the records you search over.
../identity-access/how-to.md — minting scoped tokens and the data-scope / null-sentinel rules the RAG scoping example relies on.