Training Concepts¶
Understanding how the training system works is the key to writing effective training data and diagnosing inaccurate SQL output.
What Is RAG Training?¶
The DBN Analytics POC uses a Retrieval-Augmented Generation (RAG) pattern. Instead of fine-tuning the underlying language model (which is expensive and slow), it teaches Vanna what your database looks like by storing examples in a vector database (ChromaDB).
At query time, Vanna:
- Embeds the user's question into a vector.
- Searches ChromaDB for the top-10 most similar training records.
- Injects those records as context into the LLM prompt.
- Generates SQL that is grounded in your real schema and verified examples.
flowchart LR
subgraph Training Phase
T1["DDL Schema"] --> Embed["Text Embedder\n(OpenAI ada-002)"]
T2["Question + SQL"] --> Embed
T3["Documentation"] --> Embed
Embed --> Chroma[("ChromaDB\nVector Index")]
end
subgraph Query Phase
Q["User Question"] --> QEmbed["Text Embedder"]
QEmbed --> Search["Similarity Search\n(top-10)"]
Chroma --> Search
Search --> Prompt["LLM Prompt\n= context + question"]
Prompt --> GPT["OpenAI GPT-4o"]
GPT --> SQL["Generated SQL"]
end
The Three Types of Training Data¶
Vanna supports three distinct record types, each serving a different purpose:
1. DDL — Database Schema¶
DDL (Data Definition Language) statements teach Vanna the structure of your database: table names, column names, data types, and relationships.
CREATE TABLE "msmeloans" (
"SeqNo" REAL,
"Gender" TEXT,
"Sector" TEXT,
"AmountGranted" INTEGER,
"pred_default_prob" REAL,
...
)
When to use: Always train on DDL first. Without it, Vanna cannot know what columns exist.
2. SQL — Question–SQL Pairs¶
The most impactful training type. Each record pairs a natural language question with its verified, correct SQL answer.
Question: Which sector has the highest average predicted default probability?
SQL:
SELECT Sector, AVG(pred_default_prob) AS avg_risk
FROM msmeloans
GROUP BY Sector
ORDER BY avg_risk DESC
LIMIT 5;
When to use: Add verified SQL examples for every common query pattern, especially ones involving filters, aggregations, or joins.
3. Documentation — Business Context¶
Plain-English definitions of business rules, terminology, and data semantics that aren't obvious from the schema.
"A loan is classified as high-risk when pred_default_prob > 0.5."
"StartUp_norm = 1 means the business was registered within the last 3 years."
"PFI refers to a Participating Financial Institution, i.e., a bank or microfinance lender."
When to use: Whenever analysts use domain-specific language that differs from column names, or when threshold/classification rules must be applied correctly.
ChromaDB Storage¶
All training records are persisted in chroma.sqlite3 in the project root — this is ChromaDB's local file-based storage. Each record is stored as:
| ChromaDB Field | Content |
|---|---|
id |
A deterministic UUID (e.g., b1304bfd-...-sql) |
document |
The raw text of the DDL, SQL, or documentation |
embedding |
A 1536-dimensional float vector (OpenAI text-embedding-ada-002) |
metadata |
{ "training_data_type": "ddl" | "sql" | "documentation" } |
Idempotent inserts
Adding the same training record twice is safe — ChromaDB will log a warning but will not create a duplicate. The ID is computed from the content hash.
How Similarity Search Works¶
At query time, Vanna calls chromadb_collection.query(query_texts=[question], n_results=10). ChromaDB computes cosine similarity between the question embedding and all stored embeddings, returning the top-10 closest records.
User: "What is the average loan by state?"
↓
Embedding: [0.021, -0.048, 0.013, ..., 0.007] (1536 dims)
↓
Similarity search in ChromaDB:
Score 0.94 → "What is the average loan amount by sector?" + SQL
Score 0.89 → DDL: CREATE TABLE msmeloans (...)
Score 0.81 → "Show avg loan-to-turnover ratio by employees bucket" + SQL
...
↓
Top-10 records injected into LLM prompt
↓
GPT-4o generates: SELECT State, AVG(AmountGranted) FROM msmeloans GROUP BY State;
This means: the more relevant examples you have in training, the more accurate the generated SQL will be.