Retraining¶

Retraining is the process of adding, correcting, or removing entries in the ChromaDB vector store after the initial bootstrap. It is how you improve accuracy over time and adapt to schema or business rule changes.

When to Retrain¶

Trigger	Action
Users report incorrect SQL	Add a corrected question–SQL pair
A column is renamed or a new table is added	Remove old DDL, add new DDL
A business rule changes (e.g., new risk threshold)	Update or replace the documentation record
A new type of question is frequently asked	Add 3–5 representative SQL examples
A stale example is producing wrong queries	Delete the bad record by ID

Listing Existing Training Data¶

Before modifying anything, inspect the current state of the vector store:

GET /api/v1/training/data HTTP/1.1

Response:

{
  "count": 12,
  "records": [
    {
      "id": "b1304bfd-1f4d-5401-9361-e2e48ff68cc7-sql",
      "question": "Which sector has the highest average predicted default probability?",
      "content": "SELECT Sector, AVG(pred_default_prob)...",
      "training_data_type": "sql"
    },
    {
      "id": "c1ada977-d080-5caf-b24f-3826d1a6a9f9-ddl",
      "question": null,
      "content": "CREATE TABLE \"msmeloans\" (...)",
      "training_data_type": "ddl"
    }
  ]
}

Save the id values — you will need them to delete specific records.

Adding New SQL Examples¶

The most common retraining operation. Add a new example whenever:

A user query was answered with wrong SQL
A new business question type has been identified

POST /api/v1/training/sql HTTP/1.1
Content-Type: application/json

{
  "question": "How many female borrowers are there per sector?",
  "sql": "SELECT Sector, COUNT(*) AS female_count FROM msmeloans WHERE Gender = 'F' GROUP BY Sector ORDER BY female_count DESC;"
}

Response:

{
  "id": "9a7f3c2e-...-sql",
  "message": "Question–SQL pair successfully added with id=9a7f3c2e-...-sql"
}

Add 3–5 variants for new patterns

If you want Vanna to handle a new category of query well (e.g., gender-based filtering), add multiple semantically distinct examples — not just one.

Updating Business Rules¶

ChromaDB does not support in-place updates. To change a documentation record:

1. Find the old record's ID:

GET /api/v1/training/data HTTP/1.1

2. Delete the old record:

DELETE /api/v1/training/d3f1b2e0-...-doc HTTP/1.1

Response:

{
  "success": true,
  "message": "Training record 'd3f1b2e0-...-doc' successfully deleted."
}

3. Add the updated documentation:

POST /api/v1/training/documentation HTTP/1.1
Content-Type: application/json

{
  "documentation": "A loan is classified as high-risk when pred_default_prob >= 0.4 (updated threshold from 0.5)."
}

Removing Stale or Incorrect Records¶

If a training record is causing Vanna to generate consistently wrong SQL, delete it:

curlPython

curl -X DELETE \
  "http://127.0.0.1:8000/api/v1/training/b1304bfd-1f4d-5401-9361-e2e48ff68cc7-sql"

import httpx

bad_id = "b1304bfd-1f4d-5401-9361-e2e48ff68cc7-sql"
resp = httpx.delete(f"http://127.0.0.1:8000/api/v1/training/{bad_id}")
print(resp.json())
# {'success': True, 'message': "Training record '...' successfully deleted."}

Re-Syncing the Schema after a Migration¶

If your SQLite schema changes (new table, renamed column, added index), re-sync the DDL:

Step 1 — Delete all old DDL records:

import httpx

BASE_URL = "http://127.0.0.1:8000/api/v1/training"

# Get all records
data = httpx.get(f"{BASE_URL}/data").json()

# Delete every DDL record
for record in data["records"]:
    if record["training_data_type"] == "ddl":
        resp = httpx.delete(f"{BASE_URL}/{record['id']}")
        print(f"Deleted DDL: {record['id']} → {resp.json()['success']}")

Step 2 — Re-import from the updated database:

POST /api/v1/training/from-database HTTP/1.1

Retraining Decision Flowchart¶

flowchart TD
    A["🔍 User reports wrong SQL output"] --> B{What's wrong?}

    B -->|Wrong table or column name| C["Delete old DDL\nRe-run /training/from-database"]
    B -->|Wrong filter / threshold| D["Find documentation record\nDelete → Re-add updated doc"]
    B -->|Wrong aggregation logic| E["Add corrected question–SQL pair\nDelete old pair if it exists"]
    B -->|Unrelated / hallucinated SQL| F["Add 3–5 examples of the\ncorrect query pattern"]

    C --> G["✅ Test with /chat/ask"]
    D --> G
    E --> G
    F --> G

Monitoring Training Data Quality¶

Run this Python snippet periodically to audit your training set:

scripts/audit_training.py

import httpx
from collections import Counter

resp = httpx.get("http://127.0.0.1:8000/api/v1/training/data")
data = resp.json()

records = data["records"]
types = Counter(r["training_data_type"] for r in records)

print(f"Total records : {data['count']}")
print(f"  DDL         : {types.get('ddl', 0)}")
print(f"  SQL pairs   : {types.get('sql', 0)}")
print(f"  Docs        : {types.get('documentation', 0)}")

# Warn if SQL pairs are sparse
if types.get("sql", 0) < 10:
    print("\n⚠️  Warning: fewer than 10 SQL training pairs. Accuracy may be low.")
    print("   → Run scripts/seed_training.py to add more examples.")