Skip to content

Retraining

Retraining is the process of adding, correcting, or removing entries in the ChromaDB vector store after the initial bootstrap. It is how you improve accuracy over time and adapt to schema or business rule changes.


When to Retrain

Trigger Action
Users report incorrect SQL Add a corrected question–SQL pair
A column is renamed or a new table is added Remove old DDL, add new DDL
A business rule changes (e.g., new risk threshold) Update or replace the documentation record
A new type of question is frequently asked Add 3–5 representative SQL examples
A stale example is producing wrong queries Delete the bad record by ID

Listing Existing Training Data

Before modifying anything, inspect the current state of the vector store:

GET /api/v1/training/data HTTP/1.1

Response:

{
  "count": 12,
  "records": [
    {
      "id": "b1304bfd-1f4d-5401-9361-e2e48ff68cc7-sql",
      "question": "Which sector has the highest average predicted default probability?",
      "content": "SELECT Sector, AVG(pred_default_prob)...",
      "training_data_type": "sql"
    },
    {
      "id": "c1ada977-d080-5caf-b24f-3826d1a6a9f9-ddl",
      "question": null,
      "content": "CREATE TABLE \"msmeloans\" (...)",
      "training_data_type": "ddl"
    }
  ]
}

Save the id values — you will need them to delete specific records.


Adding New SQL Examples

The most common retraining operation. Add a new example whenever:

  • A user query was answered with wrong SQL
  • A new business question type has been identified
POST /api/v1/training/sql HTTP/1.1
Content-Type: application/json

{
  "question": "How many female borrowers are there per sector?",
  "sql": "SELECT Sector, COUNT(*) AS female_count FROM msmeloans WHERE Gender = 'F' GROUP BY Sector ORDER BY female_count DESC;"
}

Response:

{
  "id": "9a7f3c2e-...-sql",
  "message": "Question–SQL pair successfully added with id=9a7f3c2e-...-sql"
}

Add 3–5 variants for new patterns

If you want Vanna to handle a new category of query well (e.g., gender-based filtering), add multiple semantically distinct examples — not just one.


Updating Business Rules

ChromaDB does not support in-place updates. To change a documentation record:

1. Find the old record's ID:

GET /api/v1/training/data HTTP/1.1

2. Delete the old record:

DELETE /api/v1/training/d3f1b2e0-...-doc HTTP/1.1

Response:

{
  "success": true,
  "message": "Training record 'd3f1b2e0-...-doc' successfully deleted."
}

3. Add the updated documentation:

POST /api/v1/training/documentation HTTP/1.1
Content-Type: application/json

{
  "documentation": "A loan is classified as high-risk when pred_default_prob >= 0.4 (updated threshold from 0.5)."
}


Removing Stale or Incorrect Records

If a training record is causing Vanna to generate consistently wrong SQL, delete it:

curl -X DELETE \
  "http://127.0.0.1:8000/api/v1/training/b1304bfd-1f4d-5401-9361-e2e48ff68cc7-sql"
import httpx

bad_id = "b1304bfd-1f4d-5401-9361-e2e48ff68cc7-sql"
resp = httpx.delete(f"http://127.0.0.1:8000/api/v1/training/{bad_id}")
print(resp.json())
# {'success': True, 'message': "Training record '...' successfully deleted."}

Re-Syncing the Schema after a Migration

If your SQLite schema changes (new table, renamed column, added index), re-sync the DDL:

Step 1 — Delete all old DDL records:

import httpx

BASE_URL = "http://127.0.0.1:8000/api/v1/training"

# Get all records
data = httpx.get(f"{BASE_URL}/data").json()

# Delete every DDL record
for record in data["records"]:
    if record["training_data_type"] == "ddl":
        resp = httpx.delete(f"{BASE_URL}/{record['id']}")
        print(f"Deleted DDL: {record['id']}{resp.json()['success']}")

Step 2 — Re-import from the updated database:

POST /api/v1/training/from-database HTTP/1.1


Retraining Decision Flowchart

flowchart TD
    A["🔍 User reports wrong SQL output"] --> B{What's wrong?}

    B -->|Wrong table or column name| C["Delete old DDL\nRe-run /training/from-database"]
    B -->|Wrong filter / threshold| D["Find documentation record\nDelete → Re-add updated doc"]
    B -->|Wrong aggregation logic| E["Add corrected question–SQL pair\nDelete old pair if it exists"]
    B -->|Unrelated / hallucinated SQL| F["Add 3–5 examples of the\ncorrect query pattern"]

    C --> G["✅ Test with /chat/ask"]
    D --> G
    E --> G
    F --> G

Monitoring Training Data Quality

Run this Python snippet periodically to audit your training set:

scripts/audit_training.py
import httpx
from collections import Counter

resp = httpx.get("http://127.0.0.1:8000/api/v1/training/data")
data = resp.json()

records = data["records"]
types = Counter(r["training_data_type"] for r in records)

print(f"Total records : {data['count']}")
print(f"  DDL         : {types.get('ddl', 0)}")
print(f"  SQL pairs   : {types.get('sql', 0)}")
print(f"  Docs        : {types.get('documentation', 0)}")

# Warn if SQL pairs are sparse
if types.get("sql", 0) < 10:
    print("\n⚠️  Warning: fewer than 10 SQL training pairs. Accuracy may be low.")
    print("   → Run scripts/seed_training.py to add more examples.")