Skip to content

RAG Seeding

The RAG index is a collection of ~2,400 snippets stored in the Qdrant vector database. These snippets describe the InteLIS database schema, business rules, and clinical terminology — they give the LLM the context it needs to generate accurate SQL.

This page explains how to build, refresh, and reset the index.

Quick Reference

Command What it does When to use
make rag-refresh Re-exports schema, rebuilds snippets, uploads to Qdrant Schema changed, business rules updated, field guide updated
make rag-reset Deletes the Qdrant collection, then does a full rag-refresh Switching embedding models, corrupt index, major schema overhaul

Which one should I use?

Use make rag-refresh most of the time. It's incremental and faster. Use make rag-reset only when you need a clean slate (e.g. after changing the embedding model or if search results seem wrong).

How the Pipeline Works

The seeding process has three steps:

┌──────────────────┐     ┌──────────────────────┐     ┌─────────────────┐
│  export-schema   │     │  build-rag-snippets   │     │   rag-upsert    │
│                  │     │                       │     │                 │
│  Reads MySQL     │────▶│  Schema + rules +     │────▶│  Uploads to     │
│  INFORMATION_    │     │  field guide →         │     │  Qdrant via     │
│  SCHEMA          │     │  JSONL snippets       │     │  RAG API        │
│                  │     │                       │     │                 │
│  → var/          │     │  → corpus/            │     │                 │
│    schema.json   │     │    snippets.jsonl     │     │                 │
└──────────────────┘     └──────────────────────┘     └─────────────────┘

Step 1: Export Database Schema

php bin/export-schema.php

Queries INFORMATION_SCHEMA from the InteLIS database and writes var/schema.json. This captures:

  • All tables and their columns (name, type, nullable, key, extra)
  • Foreign key relationships
  • Reference/lookup tables (tables with fewer than 50 rows)
  • Sample values from reference columns

Output: var/schema.json (~125 tables typically)

Step 2: Build RAG Snippets

php bin/build-rag-snippets.php

Combines the schema with business rules and field guide into JSONL snippets. Reads from three sources:

Source What it provides
var/schema.json Database structure (tables, columns, foreign keys)
config/business-rules.php Privacy rules, query constraints, validation
config/field-guide.php Terminology, clinical thresholds, column semantics

Snippet types generated:

Type Description Count (~)
column Column descriptions with types and semantics ~1,500
table Table descriptions with column lists ~125
syn Terminology synonyms and mappings ~80
rule Business rules (privacy, defaults, formatting) ~50
validation Field validation rules ~40
exemplar Example query patterns ~30
relationship Foreign key relationships between tables ~23
threshold Clinical thresholds (VL suppression, etc.) ~20
test_type Test type logic (VL, EID, COVID, TB) ~15

Output: corpus/snippets.jsonl (~2,400 lines)

Step 3: Upload to Qdrant

php bin/rag-upsert.php corpus/snippets.jsonl

Batch uploads the JSONL snippets to the RAG API, which embeds them and stores the vectors in Qdrant.

Options:

  • Second argument is batch size (default: 500)

All-in-One: rag-refresh.sh

The shell script bin/rag-refresh.sh runs all three steps plus health checks and verification:

bash bin/rag-refresh.sh

Options:

Flag Effect
--reset Delete and recreate the Qdrant collection before uploading
--batch N Change the upload batch size (default: 500)
# Full reset + rebuild
bash bin/rag-refresh.sh --reset

The Makefile targets are wrappers around this script:

  • make rag-refresh → runs rag-refresh.sh inside the Docker container
  • make rag-reset → runs rag-refresh.sh --reset inside the Docker container

rag-refresh vs rag-reset

make rag-refresh make rag-reset
Deletes existing data? No — upserts (adds/updates) Yes — drops the collection and recreates it
Speed Faster (incremental) Slower (rebuilds from scratch)
Use when Schema changed, rules updated, new field guide entries Switching embedding models, index seems corrupted, major schema overhaul
Data loss risk None — old snippets stay unless overwritten by ID All existing snippets are deleted first
Command make rag-refresh make rag-reset

Verification

After seeding, verify the index is working:

curl -X POST http://localhost:8089/v1/search \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "viral load suppressed by lab",
    "k": 5,
    "filters": {"type": ["rule", "threshold", "column"]}
  }'

You should see relevant snippets about VL suppression thresholds, result categories, and facility columns.

When to Re-seed

Change Action needed
InteLIS schema changes (new tables, renamed columns) make rag-refresh
Updated config/business-rules.php make rag-refresh
Updated config/field-guide.php make rag-refresh
Changed the embedding model (EMBEDDING_MODEL in .env) make rag-reset
Search returns irrelevant or outdated results make rag-reset
First setup after importing InteLIS data make rag-refresh