Training Concepts¶

Understanding how the training system works is the key to writing effective training data and diagnosing inaccurate SQL output.

What Is RAG Training?¶

The DBN Analytics POC uses a Retrieval-Augmented Generation (RAG) pattern. Instead of fine-tuning the underlying language model (which is expensive and slow), it teaches Vanna what your database looks like by storing examples in a vector database (ChromaDB).

At query time, Vanna:

Embeds the user's question into a vector.
Searches ChromaDB for the top-10 most similar training records.
Injects those records as context into the LLM prompt.
Generates SQL that is grounded in your real schema and verified examples.

flowchart LR
    subgraph Training Phase
        T1["DDL Schema"] --> Embed["Text Embedder\n(OpenAI ada-002)"]
        T2["Question + SQL"] --> Embed
        T3["Documentation"] --> Embed
        Embed --> Chroma[("ChromaDB\nVector Index")]
    end

    subgraph Query Phase
        Q["User Question"] --> QEmbed["Text Embedder"]
        QEmbed --> Search["Similarity Search\n(top-10)"]
        Chroma --> Search
        Search --> Prompt["LLM Prompt\n= context + question"]
        Prompt --> GPT["OpenAI GPT-4o"]
        GPT --> SQL["Generated SQL"]
    end

The Three Types of Training Data¶

Vanna supports three distinct record types, each serving a different purpose:

1. DDL — Database Schema¶

DDL (Data Definition Language) statements teach Vanna the structure of your database: table names, column names, data types, and relationships.

CREATE TABLE "msmeloans" (
  "SeqNo" REAL,
  "Gender" TEXT,
  "Sector" TEXT,
  "AmountGranted" INTEGER,
  "pred_default_prob" REAL,
  ...
)

When to use: Always train on DDL first. Without it, Vanna cannot know what columns exist.

2. SQL — Question–SQL Pairs¶

The most impactful training type. Each record pairs a natural language question with its verified, correct SQL answer.

Question: Which sector has the highest average predicted default probability?
SQL:
  SELECT Sector, AVG(pred_default_prob) AS avg_risk
  FROM msmeloans
  GROUP BY Sector
  ORDER BY avg_risk DESC
  LIMIT 5;

When to use: Add verified SQL examples for every common query pattern, especially ones involving filters, aggregations, or joins.

3. Documentation — Business Context¶

Plain-English definitions of business rules, terminology, and data semantics that aren't obvious from the schema.

"A loan is classified as high-risk when pred_default_prob > 0.5."
"StartUp_norm = 1 means the business was registered within the last 3 years."
"PFI refers to a Participating Financial Institution, i.e., a bank or microfinance lender."

When to use: Whenever analysts use domain-specific language that differs from column names, or when threshold/classification rules must be applied correctly.

ChromaDB Storage¶

All training records are persisted in chroma.sqlite3 in the project root — this is ChromaDB's local file-based storage. Each record is stored as:

ChromaDB Field	Content
`id`	A deterministic UUID (e.g., `b1304bfd-...-sql`)
`document`	The raw text of the DDL, SQL, or documentation
`embedding`	A 1536-dimensional float vector (OpenAI `text-embedding-ada-002`)
`metadata`	`{ "training_data_type": "ddl" \| "sql" \| "documentation" }`

Idempotent inserts

Adding the same training record twice is safe — ChromaDB will log a warning but will not create a duplicate. The ID is computed from the content hash.

How Similarity Search Works¶

At query time, Vanna calls chromadb_collection.query(query_texts=[question], n_results=10). ChromaDB computes cosine similarity between the question embedding and all stored embeddings, returning the top-10 closest records.

User: "What is the average loan by state?"
                        ↓
Embedding: [0.021, -0.048, 0.013, ..., 0.007]  (1536 dims)
                        ↓
Similarity search in ChromaDB:
  Score 0.94 → "What is the average loan amount by sector?" + SQL
  Score 0.89 → DDL: CREATE TABLE msmeloans (...)
  Score 0.81 → "Show avg loan-to-turnover ratio by employees bucket" + SQL
  ...
                        ↓
Top-10 records injected into LLM prompt
                        ↓
GPT-4o generates: SELECT State, AVG(AmountGranted) FROM msmeloans GROUP BY State;

This means: the more relevant examples you have in training, the more accurate the generated SQL will be.