RAG Pipeline

This page explains how the Retrieval-Augmented Generation (RAG) pipeline converts a natural language question into an accurate SQL query.


What is Vanna?

Vanna is an open-source Python library that combines:

  • A vector store (ChromaDB) to store and retrieve relevant SQL context
  • An LLM (OpenAI GPT-4o) to generate syntactically valid SQL

The key insight is that LLMs generate far more accurate SQL when provided with examples of correct SQL alongside the schema — this is the RAG pattern.
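The pattern can be sketched in a few lines of plain Python. This is a toy illustration only: it ranks stored examples by naive word overlap instead of vector similarity, and the `LoanAmount` column and helper names are hypothetical, not part of the real schema or Vanna's API.

```python
# Toy sketch of the RAG pattern for text-to-SQL: retrieve the most similar
# stored example, then splice it into the prompt alongside the schema.
EXAMPLES = [
    ("Which sector has the highest average predicted default probability?",
     "SELECT Sector, AVG(pred_default_prob) AS avg_risk FROM msmeloans "
     "GROUP BY Sector ORDER BY avg_risk DESC LIMIT 5;"),
    ("What is the average loan amount by sector?",
     "SELECT Sector, AVG(LoanAmount) FROM msmeloans GROUP BY Sector;"),
]

def retrieve(question: str, n: int = 1):
    """Rank stored examples by word overlap (stand-in for embedding search)."""
    q_words = set(question.lower().split())
    scored = sorted(EXAMPLES,
                    key=lambda ex: len(q_words & set(ex[0].lower().split())),
                    reverse=True)
    return scored[:n]

def build_prompt(question: str, schema: str) -> str:
    """Assemble schema + retrieved examples + question into one LLM prompt."""
    context = "\n".join(f"Q: {q}\nA: {sql}" for q, sql in retrieve(question))
    return (f"===Tables===\n{schema}\n\n"
            f"===Additional Context===\n{context}\n\n"
            f"Question: {question}")

prompt = build_prompt("What is the average loan amount by sector?",
                      'CREATE TABLE "msmeloans" (...)')
```

In the real pipeline, ChromaDB's embedding search replaces `retrieve`, and the assembled prompt goes to OpenAI, but the shape of the prompt is the same.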


The Three-Step Pipeline

Step 1: Index (Training)

Before querying, Vanna must be "trained" with knowledge about your database. This is a one-time setup (already done in the notebook):

# Train on the table schema (DDL)
df_ddl = vn.run_sql("SELECT type, sql FROM sqlite_master WHERE sql is not null")
for ddl in df_ddl['sql'].to_list():
    vn.train(ddl=ddl)

# Train with example question-SQL pairs
vn.train(
    question="Which sector has the highest average predicted default probability?",
    sql="""
        SELECT Sector, AVG(pred_default_prob) AS avg_risk
        FROM msmeloans
        GROUP BY Sector
        ORDER BY avg_risk DESC
        LIMIT 5;
    """
)

Each piece of training data is embedded and stored in ChromaDB (chroma.sqlite3).
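One useful property of content-addressed storage is that re-running the training cell is harmless. The sketch below shows the idea with a deterministic id derived from the content hash, so duplicate training items collapse into one entry; Vanna's actual id scheme may differ.

```python
import hashlib

def training_id(content: str, kind: str) -> str:
    """Deterministic id per training item (sketch of the idempotency idea;
    Vanna's real id scheme may differ): training the same DDL twice yields
    the same id, so the store keeps a single copy."""
    digest = hashlib.sha256(content.encode()).hexdigest()
    return f"{digest[:16]}-{kind}"

store = {}
# Simulate running the training loop twice with the same DDL:
for ddl in ['CREATE TABLE "msmeloans" (...)', 'CREATE TABLE "msmeloans" (...)']:
    store[training_id(ddl, "ddl")] = ddl
```

After both iterations, `store` holds a single entry, which is why the one-time setup in the notebook can be re-run safely.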


Step 2: Retrieve (at Query Time)

When a user asks a question, Vanna searches ChromaDB for the top-N most similar training examples:

sequenceDiagram
    participant User
    participant API
    participant ChromaDB
    participant OpenAI

    User->>API: POST /ask {"question": "..."}
    API->>ChromaDB: similarity_search(question, n=10)
    ChromaDB-->>API: [DDL schema, similar SQL examples]
    API->>OpenAI: prompt = system_msg + schema + examples + question
    OpenAI-->>API: SQL string
    API-->>User: { sql, data, plotly_code }
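The end-to-end `/ask` flow in the diagram can be sketched with a stubbed LLM and an in-memory SQLite database. The stub returns canned SQL in place of the OpenAI call; the table and column names follow the examples in this page, but the row data is invented for the demo.

```python
import sqlite3

# In-memory stand-in for the real database, using columns from the docs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE msmeloans (Sector TEXT, pred_default_prob REAL)")
conn.executemany("INSERT INTO msmeloans VALUES (?, ?)",
                 [("Retail", 0.7), ("Agri", 0.2), ("Retail", 0.5)])

def fake_llm(prompt: str) -> str:
    """Stand-in for the OpenAI call (Step 3): returns canned SQL."""
    return ("SELECT Sector, AVG(pred_default_prob) AS avg_risk FROM msmeloans "
            "GROUP BY Sector ORDER BY avg_risk DESC LIMIT 5;")

def ask(question: str) -> dict:
    sql = fake_llm(question)             # generate SQL from the prompt
    rows = conn.execute(sql).fetchall()  # execute it against the database
    return {"sql": sql, "data": rows}    # shape of the API response

result = ask("Which sector has the highest average predicted default probability?")
```

Here `result["data"]` lists sectors by descending average risk, mirroring the `{ sql, data, plotly_code }` payload the API returns (minus the chart code).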

Step 3: Generate (LLM Call)

The retrieved context (schema + example SQL) is injected into a prompt sent to OpenAI:

You are a SQLite expert. Please help to generate a SQL query to answer the question.
Your response should ONLY be based on the given context and follow the response guidelines
and format instructions.

===Tables===
CREATE TABLE "msmeloans" (
  "SeqNo" REAL,
  "Gender" TEXT,
  "Sector" TEXT,
  ...
)

===Additional Context===
Q: Which sector has the highest average predicted default probability?
A: SELECT Sector, AVG(pred_default_prob) AS avg_risk FROM msmeloans GROUP BY Sector ORDER BY avg_risk DESC LIMIT 5;

===Response Guidelines===
...

Question: What is the average loan amount by sector?

Training Data Stored

The ChromaDB index contains the following types of training data:

| Type | Description | Example |
|------|-------------|---------|
| `ddl` | Table schema definitions | `CREATE TABLE "msmeloans" (...)` |
| `sql` | Question + SQL pairs | Q: Top borrowers? → `SELECT FullNames ...` |
| `documentation` | Business context (if added) | "High-risk means `pred_default_prob > 0.5`" |
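A simplified dispatcher shows how a single `train()` entry point can route each call into one of these three types. This is illustrative only, not Vanna's source: the routing logic and error message are assumptions.

```python
# Sketch of routing train() arguments into the three training-data types.
def train(store, ddl=None, documentation=None, question=None, sql=None):
    """Append one typed training item to the store (illustrative dispatcher)."""
    if ddl:
        store.append(("ddl", ddl))
    elif documentation:
        store.append(("documentation", documentation))
    elif question and sql:
        store.append(("sql", f"Q: {question}\nA: {sql}"))
    else:
        raise ValueError("provide ddl, documentation, or question+sql")
    return store

store = []
train(store, ddl='CREATE TABLE "msmeloans" (...)')
train(store, documentation="High-risk means pred_default_prob > 0.5")
train(store, question="Top borrowers?", sql="SELECT FullNames FROM msmeloans ...")
```

Each tuple in `store` corresponds to one row type in the table above.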

Custom Vanna Class

The application defines a custom class that combines ChromaDB and OpenAI Chat:

services/vanna_service.py
from vanna.chromadb import ChromaDB_VectorStore
from vanna.openai import OpenAI_Chat

class MyVanna(ChromaDB_VectorStore, OpenAI_Chat):
    def __init__(self, config=None):
        ChromaDB_VectorStore.__init__(self, config=config)
        OpenAI_Chat.__init__(self, config=config)

Because MyVanna lists ChromaDB_VectorStore before OpenAI_Chat, Python's Method Resolution Order (MRO) resolves each method to the first base class that implements it: vector-store operations (embedding, storage, retrieval) come from ChromaDB_VectorStore, while LLM calls come from OpenAI_Chat.
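The lookup rule can be demonstrated with stand-in base classes (these toy classes and method names are illustrative, not the real Vanna bases):

```python
# Toy illustration of Python's MRO with two stand-in base classes.
class VectorStore:                      # plays the role of ChromaDB_VectorStore
    def generate_embedding(self):
        return "chromadb embedding"

class Chat:                             # plays the role of OpenAI_Chat
    def submit_prompt(self):
        return "openai completion"
    def generate_embedding(self):
        return "openai embedding"       # shadowed: VectorStore comes first

class MyVanna(VectorStore, Chat):
    pass

vn = MyVanna()
# VectorStore is listed first, so its generate_embedding wins the MRO lookup,
# while submit_prompt still resolves to Chat, which is the only class to define it.
```

Here `vn.generate_embedding()` returns the VectorStore version even though Chat also defines it, which is exactly why the base-class order in `class MyVanna(ChromaDB_VectorStore, OpenAI_Chat)` matters.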