Training Best Practices¶

High-quality training data is the single most important factor for accurate SQL generation. Following these guidelines will maximise the reliability of the API.

General Principles¶

Principle	Explanation
Correctness over quantity	10 perfectly written SQL examples outperform 100 sloppy ones
Diversity over repetition	Cover distinct query patterns, not variations of the same one
Real SQL only	Never train with SQL that has never been run against the database
Always train DDL first	The model can't reference columns it doesn't know exist

Writing Effective Question–SQL Pairs¶

✅ Good Example¶

{
  "question": "What is the average loan amount by sector?",
  "sql": "SELECT Sector, AVG(AmountGranted) AS avg_loan FROM msmeloans GROUP BY Sector ORDER BY avg_loan DESC;"
}

Why it works: - Question is phrased naturally, as a user would type it - SQL uses the correct column name AmountGranted (not loan_amount) - Includes ORDER BY for a richer result - Has been verified to run without errors

❌ Bad Example¶

{
  "question": "loans",
  "sql": "SELECT * FROM msmeloans"
}

Why it fails: - The question is too vague to be useful in similarity search - SELECT * teaches the model to be lazy — it won't learn patterns - The pairing gives no signal about what the user actually wants

Question Writing Guidelines¶

Do	Don't
Use complete sentences	Use single words or abbreviations
Mirror how analysts speak	Use SQL terminology in questions
Be specific about aggregations	Ask "show me data" without context
Include the metric name	Use pronouns without referents

Examples of well-formed questions:

✅ "Which age group has the highest default rate?"
✅ "How many first-time borrowers are in each sector?"
✅ "What is the total loan amount disbursed by PFI?"
❌ "default by age" — too vague
❌ "SELECT FROM msmeloans WHERE..." — never put SQL in a question

SQL Writing Guidelines¶

Use Explicit Column References¶

-- ✅ Good
SELECT "PFI ID", AVG(pred_default_prob) AS avg_risk
FROM msmeloans
GROUP BY "PFI ID";

-- ❌ Avoid implicit or aliased column names not in schema
SELECT pfi, AVG(default_prob) FROM loans GROUP BY pfi;

Include ORDER BY When Applicable¶

A sorted result gives the LLM more context about what "top" or "highest" means:

-- ✅ Good — teaches ranking pattern
SELECT Sector, COUNT(*) AS loan_count
FROM msmeloans
GROUP BY Sector
ORDER BY loan_count DESC;

Use Correct SQLite Syntax¶

Vanna is connected to SQLite — avoid PostgreSQL or MySQL-specific functions:

Avoid	Use Instead
`ILIKE`	`LIKE` (case-insensitive by default in SQLite)
`EXTRACT(YEAR FROM date)`	`strftime('%Y', date)`
`COUNT(*) FILTER (WHERE ...)`	`SUM(CASE WHEN ... THEN 1 ELSE 0 END)`
`STRING_AGG(...)`	`GROUP_CONCAT(...)`

FILTER clause compatibility

The COUNT(*) FILTER (WHERE pred_default_prob > 0.5) syntax used in the original notebook works in SQLite 3.30.0+. If deploying on older environments, replace with SUM(CASE WHEN pred_default_prob > 0.5 THEN 1 ELSE 0 END).

Coverage Strategy¶

Aim to cover at least one SQL example per major query pattern:

Pattern	Example Question
Simple count	`"How many loans are in the database?"`
Filtered count	`"How many female borrowers are there?"`
Aggregation (AVG)	`"What is the average loan amount by sector?"`
Aggregation (SUM)	`"What is the total amount disbursed by PFI?"`
TOP-N	`"Show the top 10 borrowers with the highest risk."`
Percentage	`"What percentage of loans are high-risk?"`
Multi-group	`"Show default rate by sector and age group."`
Boolean filter	`"How many startup borrowers are in Agriculture?"`
Date-based	`"How many loans were disbursed in 2024?"`
Ratio/derived	`"Show the loan-to-turnover ratio by employees bucket."`

Documentation String Guidelines¶

Good Documentation¶

"pred_default_prob is a value between 0.0 and 1.0 representing the probability of loan default."
"risk_category is derived from pred_default_prob: Low Risk (< 0.25), Medium Risk (0.25–0.5), High Risk (> 0.5)."
"The msmeloans table uses State as a numeric ID (integer), not a state name."
"ES_Rating values are: A (lowest risk), B, C, D, E (highest risk)."

Bad Documentation¶

"This table has loans."            ← Too vague, adds no signal
"See the schema for column types." ← Non-informative
"sector is a column."              ← Already in DDL, redundant

Retraining Cadence Recommendations¶

Scenario	Recommended Frequency
Active development	After every schema migration
Production with new analyst queries	Weekly review and addition
Stable production	Monthly audit using the audit script
After user feedback session	Immediately after session ends

Red Flags in Training Data¶

Review your training set for these issues and fix them immediately:

Wrong column names — SQL references columns that don't exist → causes runtime SQL errors
Duplicate questions — Multiple pairs with identical questions but different SQL confuse the similarity search
Contradictory documentation — E.g., two docs saying different thresholds for "high-risk"
Hallucinated table names — SQL references tables not in the schema
Untested SQL — SQL was never run against the real database

Use the Retraining audit script to monitor for these issues regularly.