RAG Seeding
The RAG index is a collection of ~2,400 snippets stored in the Qdrant vector database. These snippets describe the InteLIS database schema, business rules, and clinical terminology — they give the LLM the context it needs to generate accurate SQL.
This page explains how to build, refresh, and reset the index.
Quick Reference
| Command | What it does | When to use |
|---|---|---|
| make rag-refresh | Re-exports schema, rebuilds snippets, uploads to Qdrant | Schema changed, business rules updated, field guide updated |
| make rag-reset | Deletes the Qdrant collection, then does a full rag-refresh | Switching embedding models, corrupt index, major schema overhaul |
Which one should I use?
Use make rag-refresh most of the time. It's incremental and faster. Use make rag-reset only when you need a clean slate (e.g. after changing the embedding model or if search results seem wrong).
How the Pipeline Works
The seeding process has three steps:
┌──────────────────┐     ┌──────────────────────┐     ┌─────────────────┐
│  export-schema   │     │  build-rag-snippets  │     │   rag-upsert    │
│                  │     │                      │     │                 │
│  Reads MySQL     │────▶│  Schema + rules +    │────▶│  Uploads to     │
│  INFORMATION_    │     │  field guide →       │     │  Qdrant via     │
│  SCHEMA          │     │  JSONL snippets      │     │  RAG API        │
│                  │     │                      │     │                 │
│  → var/          │     │  → corpus/           │     │                 │
│    schema.json   │     │    snippets.jsonl    │     │                 │
└──────────────────┘     └──────────────────────┘     └─────────────────┘
Step 1: Export Database Schema
php bin/export-schema.php
Queries INFORMATION_SCHEMA from the InteLIS database and writes var/schema.json. This captures:
- All tables and their columns (name, type, nullable, key, extra)
- Foreign key relationships
- Reference/lookup tables (tables with fewer than 50 rows)
- Sample values from reference columns
Output: var/schema.json (~125 tables typically)
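Because the later steps read var/schema.json, it can help to confirm that the export actually produced data before continuing. The following is a minimal sketch of such a guard; the export_schema function name and its error message are illustrative and not part of the project.

```shell
# Run the schema export, then fail if the output file is missing or empty.
# $1 = expected output path (defaults to var/schema.json).
export_schema() {
    out=${1:-var/schema.json}
    php bin/export-schema.php || return 1
    # -s is true only when the file exists and is larger than zero bytes.
    if [ ! -s "$out" ]; then
        echo "schema export produced no data: $out" >&2
        return 1
    fi
}
```

Calling export_schema with no argument checks the default var/schema.json path.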
Step 2: Build RAG Snippets
php bin/build-rag-snippets.php
Combines the schema with the business rules and the field guide into JSONL snippets. It reads from three sources:
| Source | What it provides |
|---|---|
| var/schema.json | Database structure (tables, columns, foreign keys) |
| config/business-rules.php | Privacy rules, query constraints, validation |
| config/field-guide.php | Terminology, clinical thresholds, column semantics |
Snippet types generated:
| Type | Description | Approx. count |
|---|---|---|
| column | Column descriptions with types and semantics | 1,500 |
| table | Table descriptions with column lists | 125 |
| syn | Terminology synonyms and mappings | 80 |
| rule | Business rules (privacy, defaults, formatting) | 50 |
| validation | Field validation rules | 40 |
| exemplar | Example query patterns | 30 |
| relationship | Foreign key relationships between tables | 23 |
| threshold | Clinical thresholds (VL suppression, etc.) | 20 |
| test_type | Test type logic (VL, EID, COVID, TB) | 15 |
Output: corpus/snippets.jsonl (~2,400 lines)
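To compare a freshly built corpus against the approximate counts above, the snippets can be tallied by type. This sketch assumes each JSONL line is an object carrying a string "type" field, which the search filter on this page suggests but does not confirm; the actual snippet layout may differ, and count_snippet_types is a hypothetical helper name.

```shell
# Tally snippets per type, printing "<type> <count>" lines sorted by type.
# Assumption: every line contains a field of the form "type": "<name>".
count_snippet_types() {
    sed -n 's/.*"type"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1" \
        | sort | uniq -c | awk '{ print $2, $1 }'
}
```

Running count_snippet_types corpus/snippets.jsonl after a build gives one line per snippet type, which makes an unexpectedly empty category easy to spot.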
Step 3: Upload to Qdrant
php bin/rag-upsert.php corpus/snippets.jsonl
Batch uploads the JSONL snippets to the RAG API, which embeds them and stores the vectors in Qdrant.
Options:
- The optional second argument sets the batch size (default: 500)
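A smaller batch size means more, lighter requests. As a rough illustration, the number of upload requests for a given file can be estimated as below. This assumes the script sends one request per full or partial batch, which this page does not spell out, and batch_count is a hypothetical helper.

```shell
# Estimate how many upload requests a JSONL file needs.
# $1 = path to the JSONL file, $2 = batch size (defaults to 500).
batch_count() {
    lines=$(wc -l < "$1" | tr -d ' ')
    size=${2:-500}
    # Integer ceiling of lines / size.
    echo $(( (lines + size - 1) / size ))
}
```

For a corpus of 2,400 lines this gives 5 requests at the default size. To upload in smaller batches, pass the size as the second argument, for example php bin/rag-upsert.php corpus/snippets.jsonl 200.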
All-in-One: rag-refresh.sh
The shell script bin/rag-refresh.sh runs all three steps plus health checks and verification:
bash bin/rag-refresh.sh
Options:
| Flag | Effect |
|---|---|
| --reset | Delete and recreate the Qdrant collection before uploading |
| --batch N | Change the upload batch size (default: 500) |
# Full reset + rebuild
bash bin/rag-refresh.sh --reset
The Makefile targets are wrappers around this script:
- make rag-refresh → runs rag-refresh.sh inside the Docker container
- make rag-reset → runs rag-refresh.sh --reset inside the Docker container
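For debugging, the three steps can also be chained by hand. The sketch below only mirrors the step order described on this page; it leaves out the health checks and verification that rag-refresh.sh performs, and run_pipeline is a hypothetical name.

```shell
# Run the three seeding steps in order, stopping at the first failure.
# $1 = upload batch size (defaults to 500).
run_pipeline() {
    php bin/export-schema.php &&
    php bin/build-rag-snippets.php &&
    php bin/rag-upsert.php corpus/snippets.jsonl "${1:-500}"
}
```

Run it from the directory that contains bin/, in an environment where php can reach the InteLIS database and the RAG API.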
rag-refresh vs rag-reset
| | make rag-refresh | make rag-reset |
|---|---|---|
| Deletes existing data? | No: upserts (adds/updates) | Yes: drops the collection and recreates it |
| Speed | Faster (incremental) | Slower (rebuilds from scratch) |
| Use when | Schema changed, rules updated, new field guide entries | Switching embedding models, index seems corrupted, major schema overhaul |
| Data loss risk | None: old snippets stay unless overwritten by ID | All existing snippets are deleted first |
| Command | make rag-refresh | make rag-reset |
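The difference in data loss risk comes down to upsert semantics. The toy model below is not Qdrant code: it treats a collection as a tab-separated file of ID and text, and the toy_refresh and toy_reset names are made up for illustration. It shows why a snippet whose source was removed survives a refresh but not a reset.

```shell
# Toy model: a collection is a file of "<id><TAB><text>" lines.
# $1 = existing collection, $2 = newly built snippets.

# Refresh: merge new snippets into the old collection.
# A repeated ID is overwritten; IDs missing from the new file are kept.
toy_refresh() {
    awk -F '\t' '{ row[$1] = $0 } END { for (id in row) print row[id] }' "$1" "$2" | sort
}

# Reset: discard the old collection and keep only the new snippets.
toy_reset() {
    sort "$2"
}
```

With an old collection holding IDs a and b and a new corpus holding a and c, toy_refresh yields a (with the new text), b (stale) and c, while toy_reset yields only a and c.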
Verification
After seeding, verify the index is working:
curl -X POST http://localhost:8089/v1/search \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "viral load suppressed by lab",
    "k": 5,
    "filters": {"type": ["rule", "threshold", "column"]}
  }'
You should see relevant snippets about VL suppression thresholds, result categories, and facility columns.
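For scripted checks, the same request can be wrapped in small helper functions. This is a sketch: search_payload and search_status are hypothetical helpers, the query is assumed to contain no characters that need JSON escaping, and treating HTTP 200 as success is an assumption about the RAG API.

```shell
# Build the JSON body for /v1/search.
# $1 = query text (no quotes or backslashes), $2 = result count (default 5).
search_payload() {
    printf '{"query": "%s", "k": %d, "filters": {"type": ["rule", "threshold", "column"]}}' "$1" "${2:-5}"
}

# POST the query and print only the HTTP status code.
search_status() {
    curl -s -o /dev/null -w '%{http_code}' \
        -X POST http://localhost:8089/v1/search \
        -H 'Content-Type: application/json' \
        -d "$(search_payload "$1" "${2:-5}")"
}

# Example:
#   [ "$(search_status 'viral load suppressed by lab')" = "200" ] && echo "search endpoint responded"
```

A status check only confirms the endpoint is reachable; the returned snippets still need a manual look to judge relevance.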
When to Re-seed
| Change | Action needed |
|---|---|
| InteLIS schema changes (new tables, renamed columns) | make rag-refresh |
| Updated config/business-rules.php | make rag-refresh |
| Updated config/field-guide.php | make rag-refresh |
| Changed the embedding model (EMBEDDING_MODEL in .env) | make rag-reset |
| Search returns irrelevant or outdated results | make rag-reset |
| First setup after importing InteLIS data | make rag-refresh |
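The embedding-model case in this table can be partly automated. The sketch below reads EMBEDDING_MODEL from .env and compares it with the value recorded at the last seeding; where that previous value is stored is left to the caller, and both helper names are hypothetical. It assumes a plain EMBEDDING_MODEL=value line without quotes.

```shell
# Print the EMBEDDING_MODEL value from an env file ($1, default .env).
read_embedding_model() {
    sed -n 's/^EMBEDDING_MODEL=//p' "${1:-.env}" | head -n 1
}

# Pick the Makefile target: reset when the model changed, refresh otherwise.
# $1 = current model, $2 = model used for the last seeding.
choose_target() {
    if [ "$1" != "$2" ]; then
        echo rag-reset
    else
        echo rag-refresh
    fi
}
```

A wrapper could then call make with the output of choose_target, passing the current value from read_embedding_model and the previously recorded one.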