# RAG Pipeline
This page explains how the Retrieval-Augmented Generation (RAG) pipeline converts a natural language question into an accurate SQL query.
## What is Vanna?
Vanna is an open-source Python library that combines:
- A vector store (ChromaDB) to store and retrieve relevant SQL context
- An LLM (OpenAI GPT-4o) to generate syntactically valid SQL
The key insight is that LLMs generate far more accurate SQL when provided with examples of correct SQL alongside the schema — this is the RAG pattern.
## The Three-Step Pipeline

### Step 1: Index (Training)
Before querying, Vanna must be "trained" with knowledge about your database. This is a one-time setup (already done in the notebook):
```python
# Train on the table schema (DDL)
df_ddl = vn.run_sql("SELECT type, sql FROM sqlite_master WHERE sql IS NOT NULL")
for ddl in df_ddl['sql'].to_list():
    vn.train(ddl=ddl)

# Train with example question-SQL pairs
vn.train(
    question="Which sector has the highest average predicted default probability?",
    sql="""
    SELECT Sector, AVG(pred_default_prob) AS avg_risk
    FROM msmeloans
    GROUP BY Sector
    ORDER BY avg_risk DESC
    LIMIT 5;
    """
)
```
Each piece of training data is embedded and stored in ChromaDB (`chroma.sqlite3`).
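Conceptually, the indexing step embeds each training item and stores the vector alongside its type and raw text. The sketch below is a miniature stand-in for what Vanna and ChromaDB do internally: a toy bag-of-words "embedding" and an in-memory list replace the real embedding model and vector store (`embed`, `train`, and `index` here are illustrative names, not Vanna's API):

```python
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: bag-of-words token counts.
    return Counter(text.lower().split())

index = []  # in-memory stand-in for the ChromaDB collection

def train(kind: str, content: str) -> None:
    # Store the item's "vector" together with its type and raw text.
    index.append({"kind": kind, "vector": embed(content), "text": content})

train("ddl", 'CREATE TABLE "msmeloans" (...)')
train("sql", "SELECT Sector, AVG(pred_default_prob) FROM msmeloans GROUP BY Sector")
```

In the real pipeline the vectors come from a neural embedding model, so semantically similar questions land near each other even when they share no exact words.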
### Step 2: Retrieve (at Query Time)
When a user asks a question, Vanna searches ChromaDB for the top-N most similar training examples:
```mermaid
sequenceDiagram
    participant User
    participant API
    participant ChromaDB
    participant OpenAI
    User->>API: POST /ask {"question": "..."}
    API->>ChromaDB: similarity_search(question, n=10)
    ChromaDB-->>API: [DDL schema, similar SQL examples]
    API->>OpenAI: prompt = system_msg + schema + examples + question
    OpenAI-->>API: SQL string
    API-->>User: { sql, data, plotly_code }
```
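The retrieval step scores every stored item against the question and keeps the top-N. A minimal sketch of that idea, using cosine similarity over toy bag-of-words vectors (this is an illustration of the principle, not ChromaDB's actual implementation):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; real systems use a neural embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Standard cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_n(question: str, corpus: list[str], n: int = 2) -> list[str]:
    # Rank stored items by similarity to the question; return the N best.
    q = embed(question)
    return sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)[:n]

corpus = [
    "Which sector has the highest average predicted default probability?",
    "What is the total loan amount per state?",
    "List the top borrowers by outstanding balance.",
]
print(top_n("highest default probability by sector", corpus, n=1))
```

The retrieved examples are exactly what gets spliced into the prompt in the next step.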
### Step 3: Generate (LLM Call)
The retrieved context (schema + example SQL) is injected into a prompt sent to OpenAI:
```text
You are a SQLite expert. Please help to generate a SQL query to answer the question.
Your response should ONLY be based on the given context and follow the response guidelines
and format instructions.

===Tables===
CREATE TABLE "msmeloans" (
    "SeqNo" REAL,
    "Gender" TEXT,
    "Sector" TEXT,
    ...
)

===Additional Context===
Q: Which sector has the highest average predicted default probability?
A: SELECT Sector, AVG(pred_default_prob) AS avg_risk FROM msmeloans GROUP BY Sector ORDER BY avg_risk DESC LIMIT 5;

===Response Guidelines===
...

Question: What is the average loan amount by sector?
```
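Assembling this prompt amounts to string concatenation of the retrieved pieces. A hedged sketch of that shape (`build_prompt` is an illustrative name, not a Vanna function):

```python
def build_prompt(schema: str, examples: list[tuple[str, str]], question: str) -> str:
    # Stitch the retrieved context into the prompt layout shown above.
    parts = [
        "You are a SQLite expert. Please help to generate a SQL query to answer the question.",
        "===Tables===",
        schema,
        "===Additional Context===",
    ]
    for q, sql in examples:
        parts.append(f"Q: {q}\nA: {sql}")
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)

prompt = build_prompt(
    schema='CREATE TABLE "msmeloans" ("Sector" TEXT, ...)',
    examples=[(
        "Which sector has the highest average predicted default probability?",
        "SELECT Sector, AVG(pred_default_prob) AS avg_risk FROM msmeloans "
        "GROUP BY Sector ORDER BY avg_risk DESC LIMIT 5;",
    )],
    question="What is the average loan amount by sector?",
)
```

Because the schema and worked examples sit directly in the prompt, the LLM only has to adapt a known-good query rather than invent one from scratch.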
## Training Data Stored
The ChromaDB index contains the following types of training data:
| Type | Description | Example |
|---|---|---|
| `ddl` | Table schema definitions | `CREATE TABLE "msmeloans" (...)` |
| `sql` | Question + SQL pairs | Q: Top borrowers? → `SELECT FullNames ...` |
| `documentation` | Business context (if added) | "High-risk means `pred_default_prob` > 0.5" |
## Custom Vanna Class
The application defines a custom class that combines ChromaDB and OpenAI Chat:
```python
from vanna.chromadb import ChromaDB_VectorStore
from vanna.openai import OpenAI_Chat

class MyVanna(ChromaDB_VectorStore, OpenAI_Chat):
    def __init__(self, config=None):
        ChromaDB_VectorStore.__init__(self, config=config)
        OpenAI_Chat.__init__(self, config=config)
```
Python's MRO (Method Resolution Order) determines which base class supplies each method: `ChromaDB_VectorStore` provides the vector-store operations, while `OpenAI_Chat` provides the LLM calls, so the combined class can do both.
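The same mixin pattern can be demonstrated with two toy base classes (all names below are illustrative stand-ins, not Vanna's classes):

```python
class VectorStoreMixin:
    # Toy stand-in for ChromaDB_VectorStore: supplies retrieval.
    def retrieve(self, question):
        return f"context for: {question}"

class LLMMixin:
    # Toy stand-in for OpenAI_Chat: supplies generation.
    def generate(self, prompt):
        return f"SQL for: {prompt}"

class Assistant(VectorStoreMixin, LLMMixin):
    def ask(self, question):
        # Each call resolves, via the MRO, to whichever base defines it.
        return self.generate(self.retrieve(question))

print([c.__name__ for c in Assistant.__mro__])
# MRO: Assistant -> VectorStoreMixin -> LLMMixin -> object
```

Because the two mixins implement disjoint sets of methods, the MRO never has to break a tie; each lookup simply lands on the one base that defines the method.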