Retraining¶
Retraining is the process of adding, correcting, or removing entries in the ChromaDB vector store after the initial bootstrap. It is how you improve accuracy over time and adapt to schema or business rule changes.
When to Retrain¶
| Trigger | Action |
|---|---|
| Users report incorrect SQL | Add a corrected question–SQL pair |
| A column is renamed or a new table is added | Remove old DDL, add new DDL |
| A business rule changes (e.g., new risk threshold) | Update or replace the documentation record |
| A new type of question is frequently asked | Add 3–5 representative SQL examples |
| A stale example is producing wrong queries | Delete the bad record by ID |
Listing Existing Training Data¶
Before modifying anything, inspect the current state of the vector store:
Response:
{
"count": 12,
"records": [
{
"id": "b1304bfd-1f4d-5401-9361-e2e48ff68cc7-sql",
"question": "Which sector has the highest average predicted default probability?",
"content": "SELECT Sector, AVG(pred_default_prob)...",
"training_data_type": "sql"
},
{
"id": "c1ada977-d080-5caf-b24f-3826d1a6a9f9-ddl",
"question": null,
"content": "CREATE TABLE \"msmeloans\" (...)",
"training_data_type": "ddl"
}
]
}
Save the id values — you will need them to delete specific records.
Adding New SQL Examples¶
The most common retraining operation. Add a new example whenever:
- A user query was answered with wrong SQL
- A new business question type has been identified
POST /api/v1/training/sql HTTP/1.1
Content-Type: application/json
{
"question": "How many female borrowers are there per sector?",
"sql": "SELECT Sector, COUNT(*) AS female_count FROM msmeloans WHERE Gender = 'F' GROUP BY Sector ORDER BY female_count DESC;"
}
Response:
{
"id": "9a7f3c2e-...-sql",
"message": "Question–SQL pair successfully added with id=9a7f3c2e-...-sql"
}
Add 3–5 variants for new patterns
If you want Vanna to handle a new category of query well (e.g., gender-based filtering), add multiple semantically distinct examples — not just one.
Updating Business Rules¶
ChromaDB does not support in-place updates. To change a documentation record:
1. Find the old record's ID:
2. Delete the old record:
Response:
3. Add the updated documentation:
POST /api/v1/training/documentation HTTP/1.1
Content-Type: application/json
{
"documentation": "A loan is classified as high-risk when pred_default_prob >= 0.4 (updated threshold from 0.5)."
}
Removing Stale or Incorrect Records¶
If a training record is causing Vanna to generate consistently wrong SQL, delete it:
Re-Syncing the Schema after a Migration¶
If your SQLite schema changes (new table, renamed column, added index), re-sync the DDL:
Step 1 — Delete all old DDL records:
import httpx
BASE_URL = "http://127.0.0.1:8000/api/v1/training"
# Get all records
data = httpx.get(f"{BASE_URL}/data").json()
# Delete every DDL record
for record in data["records"]:
if record["training_data_type"] == "ddl":
resp = httpx.delete(f"{BASE_URL}/{record['id']}")
print(f"Deleted DDL: {record['id']} → {resp.json()['success']}")
Step 2 — Re-import from the updated database:
Retraining Decision Flowchart¶
flowchart TD
A["🔍 User reports wrong SQL output"] --> B{What's wrong?}
B -->|Wrong table or column name| C["Delete old DDL\nRe-run /training/from-database"]
B -->|Wrong filter / threshold| D["Find documentation record\nDelete → Re-add updated doc"]
B -->|Wrong aggregation logic| E["Add corrected question–SQL pair\nDelete old pair if it exists"]
B -->|Unrelated / hallucinated SQL| F["Add 3–5 examples of the\ncorrect query pattern"]
C --> G["✅ Test with /chat/ask"]
D --> G
E --> G
F --> G
Monitoring Training Data Quality¶
Run this Python snippet periodically to audit your training set:
import httpx
from collections import Counter
resp = httpx.get("http://127.0.0.1:8000/api/v1/training/data")
data = resp.json()
records = data["records"]
types = Counter(r["training_data_type"] for r in records)
print(f"Total records : {data['count']}")
print(f" DDL : {types.get('ddl', 0)}")
print(f" SQL pairs : {types.get('sql', 0)}")
print(f" Docs : {types.get('documentation', 0)}")
# Warn if SQL pairs are sparse
if types.get("sql", 0) < 10:
print("\n⚠️ Warning: fewer than 10 SQL training pairs. Accuracy may be low.")
print(" → Run scripts/seed_training.py to add more examples.")