Training Best Practices¶
High-quality training data is the single most important factor for accurate SQL generation. Following these guidelines will maximise the reliability of the API.
General Principles¶
| Principle | Explanation |
|---|---|
| Correctness over quantity | 10 perfectly written SQL examples outperform 100 sloppy ones |
| Diversity over repetition | Cover distinct query patterns, not variations of the same one |
| Real SQL only | Never train with SQL that has never been run against the database |
| Always train DDL first | The model can't reference columns it doesn't know exist |
Writing Effective Question–SQL Pairs¶
✅ Good Example¶
{
"question": "What is the average loan amount by sector?",
"sql": "SELECT Sector, AVG(AmountGranted) AS avg_loan FROM msmeloans GROUP BY Sector ORDER BY avg_loan DESC;"
}
Why it works:
- Question is phrased naturally, as a user would type it
- SQL uses the correct column name AmountGranted (not loan_amount)
- Includes ORDER BY for a richer result
- Has been verified to run without errors
❌ Bad Example¶
Why it fails:
- The question is too vague to be useful in similarity search
- SELECT * teaches the model to be lazy — it won't learn patterns
- The pairing gives no signal about what the user actually wants
Question Writing Guidelines¶
| Do | Don't |
|---|---|
| Use complete sentences | Use single words or abbreviations |
| Mirror how analysts speak | Use SQL terminology in questions |
| Be specific about aggregations | Ask "show me data" without context |
| Include the metric name | Use pronouns without referents |
Examples of well-formed questions:
- ✅
"Which age group has the highest default rate?" - ✅
"How many first-time borrowers are in each sector?" - ✅
"What is the total loan amount disbursed by PFI?" - ❌
"default by age"— too vague - ❌
"SELECT FROM msmeloans WHERE..."— never put SQL in a question
SQL Writing Guidelines¶
Use Explicit Column References¶
-- ✅ Good
SELECT "PFI ID", AVG(pred_default_prob) AS avg_risk
FROM msmeloans
GROUP BY "PFI ID";
-- ❌ Avoid implicit or aliased column names not in schema
SELECT pfi, AVG(default_prob) FROM loans GROUP BY pfi;
Include ORDER BY When Applicable¶
A sorted result gives the LLM more context about what "top" or "highest" means:
-- ✅ Good — teaches ranking pattern
SELECT Sector, COUNT(*) AS loan_count
FROM msmeloans
GROUP BY Sector
ORDER BY loan_count DESC;
Use Correct SQLite Syntax¶
Vanna is connected to SQLite — avoid PostgreSQL or MySQL-specific functions:
| Avoid | Use Instead |
|---|---|
ILIKE |
LIKE (case-insensitive by default in SQLite) |
EXTRACT(YEAR FROM date) |
strftime('%Y', date) |
COUNT(*) FILTER (WHERE ...) |
SUM(CASE WHEN ... THEN 1 ELSE 0 END) |
STRING_AGG(...) |
GROUP_CONCAT(...) |
FILTER clause compatibility
The COUNT(*) FILTER (WHERE pred_default_prob > 0.5) syntax used in the original notebook works in SQLite 3.30.0+. If deploying on older environments, replace with SUM(CASE WHEN pred_default_prob > 0.5 THEN 1 ELSE 0 END).
Coverage Strategy¶
Aim to cover at least one SQL example per major query pattern:
| Pattern | Example Question |
|---|---|
| Simple count | "How many loans are in the database?" |
| Filtered count | "How many female borrowers are there?" |
| Aggregation (AVG) | "What is the average loan amount by sector?" |
| Aggregation (SUM) | "What is the total amount disbursed by PFI?" |
| TOP-N | "Show the top 10 borrowers with the highest risk." |
| Percentage | "What percentage of loans are high-risk?" |
| Multi-group | "Show default rate by sector and age group." |
| Boolean filter | "How many startup borrowers are in Agriculture?" |
| Date-based | "How many loans were disbursed in 2024?" |
| Ratio/derived | "Show the loan-to-turnover ratio by employees bucket." |
Documentation String Guidelines¶
Good Documentation¶
"pred_default_prob is a value between 0.0 and 1.0 representing the probability of loan default."
"risk_category is derived from pred_default_prob: Low Risk (< 0.25), Medium Risk (0.25–0.5), High Risk (> 0.5)."
"The msmeloans table uses State as a numeric ID (integer), not a state name."
"ES_Rating values are: A (lowest risk), B, C, D, E (highest risk)."
Bad Documentation¶
"This table has loans." ← Too vague, adds no signal
"See the schema for column types." ← Non-informative
"sector is a column." ← Already in DDL, redundant
Retraining Cadence Recommendations¶
| Scenario | Recommended Frequency |
|---|---|
| Active development | After every schema migration |
| Production with new analyst queries | Weekly review and addition |
| Stable production | Monthly audit using the audit script |
| After user feedback session | Immediately after session ends |
Red Flags in Training Data¶
Review your training set for these issues and fix them immediately:
- Wrong column names — SQL references columns that don't exist → causes runtime SQL errors
- Duplicate questions — Multiple pairs with identical questions but different SQL confuse the similarity search
- Contradictory documentation — E.g., two docs saying different thresholds for "high-risk"
- Hallucinated table names — SQL references tables not in the schema
- Untested SQL — SQL was never run against the real database
Use the Retraining audit script to monitor for these issues regularly.