Training Best Practices

High-quality training data is the single most important factor for accurate SQL generation. Following these guidelines will maximise the reliability of the API.


General Principles

| Principle | Explanation |
|---|---|
| Correctness over quantity | 10 perfectly written SQL examples outperform 100 sloppy ones |
| Diversity over repetition | Cover distinct query patterns, not variations of the same one |
| Real SQL only | Never train with SQL that has never been run against the database |
| Always train DDL first | The model can't reference columns it doesn't know exist |
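The last two principles can be enforced mechanically: load the DDL first, then accept a question–SQL pair only if its SQL actually runs against that schema. A minimal sketch, assuming a small two-column fragment of the msmeloans schema:

```python
import sqlite3

# Assumed schema fragment for illustration; the real msmeloans DDL has more columns.
DDL = 'CREATE TABLE msmeloans (Sector TEXT, AmountGranted REAL)'

def is_runnable(conn: sqlite3.Connection, sql: str) -> bool:
    """Return True only if the SQL executes against the real schema."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.Error:
        return False

conn = sqlite3.connect(":memory:")
conn.execute(DDL)  # DDL first: nothing can be validated against an unknown schema

good = "SELECT Sector, AVG(AmountGranted) FROM msmeloans GROUP BY Sector"
bad = "SELECT loan_amount FROM msmeloans"  # column does not exist

print(is_runnable(conn, good), is_runnable(conn, bad))  # True False
```

Running every candidate pair through a check like this before training guarantees the "Real SQL only" rule by construction.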

Writing Effective Question–SQL Pairs

✅ Good Example

{
  "question": "What is the average loan amount by sector?",
  "sql": "SELECT Sector, AVG(AmountGranted) AS avg_loan FROM msmeloans GROUP BY Sector ORDER BY avg_loan DESC;"
}

Why it works:

  • Question is phrased naturally, as a user would type it
  • SQL uses the correct column name AmountGranted (not loan_amount)
  • Includes ORDER BY for a richer result
  • Has been verified to run without errors


❌ Bad Example

{
  "question": "loans",
  "sql": "SELECT * FROM msmeloans"
}

Why it fails:

  • The question is too vague to be useful in similarity search
  • SELECT * teaches the model to be lazy — it won't learn patterns
  • The pairing gives no signal about what the user actually wants


Question Writing Guidelines

| Do | Don't |
|---|---|
| Use complete sentences | Use single words or abbreviations |
| Mirror how analysts speak | Use SQL terminology in questions |
| Be specific about aggregations | Ask "show me data" without context |
| Include the metric name | Use pronouns without referents |

Examples of well-formed questions:

  • "Which age group has the highest default rate?"
  • "How many first-time borrowers are in each sector?"
  • "What is the total loan amount disbursed by PFI?"

Examples to avoid:

  • "default by age" — too vague
  • "SELECT FROM msmeloans WHERE..." — never put SQL in a question
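The Do/Don't rules above can be approximated with a small lint pass before a question enters the training set. A sketch with illustrative heuristics (the four-word threshold and the keyword list are assumptions, not part of any Vanna API):

```python
def lint_question(question: str) -> list[str]:
    """Flag questions that are too vague or that contain SQL terminology."""
    issues = []
    # Heuristic: well-formed analyst questions are rarely under four words.
    if len(question.split()) < 4:
        issues.append("too short / vague")
    # Heuristic: distinctive SQL phrases should never appear in a question.
    q = question.lower()
    if any(kw in q for kw in ("select ", " join ", "group by", "order by")):
        issues.append("contains SQL terminology")
    return issues

print(lint_question("Which age group has the highest default rate?"))  # []
print(lint_question("default by age"))                # ['too short / vague']
print(lint_question("SELECT FROM msmeloans WHERE..."))
```

A linter like this catches the obvious red flags cheaply; borderline cases still need human review.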

SQL Writing Guidelines

Use Explicit Column References

-- ✅ Good
SELECT "PFI ID", AVG(pred_default_prob) AS avg_risk
FROM msmeloans
GROUP BY "PFI ID";

-- ❌ Avoid implicit or aliased column names not in schema
SELECT pfi, AVG(default_prob) FROM loans GROUP BY pfi;
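One way to avoid implicit or wrong column names is to list the real schema before writing training SQL. A sketch using SQLite's PRAGMA table_info against an assumed two-column fragment of msmeloans:

```python
import sqlite3

# Assumed schema fragment; the real table has more columns.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE msmeloans ("PFI ID" TEXT, pred_default_prob REAL)')

# PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk) per column.
columns = [row[1] for row in conn.execute("PRAGMA table_info(msmeloans)")]
print(columns)  # ['PFI ID', 'pred_default_prob']

# Identifiers containing spaces must be double-quoted, exactly as in the good example:
conn.execute('SELECT "PFI ID", AVG(pred_default_prob) FROM msmeloans GROUP BY "PFI ID"')
```

Cross-checking every trained query against this column list removes the most common source of runtime SQL errors.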

Include ORDER BY When Applicable

A sorted result gives the LLM more context about what "top" or "highest" means:

-- ✅ Good — teaches ranking pattern
SELECT Sector, COUNT(*) AS loan_count
FROM msmeloans
GROUP BY Sector
ORDER BY loan_count DESC;
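Run against a tiny assumed dataset, the ranking pattern behaves as described, with the largest group first:

```python
import sqlite3

# Illustrative sample data; sectors and counts are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE msmeloans (Sector TEXT)")
conn.executemany("INSERT INTO msmeloans VALUES (?)",
                 [("Agriculture",), ("Trade",), ("Trade",), ("Services",), ("Trade",)])

rows = conn.execute("""
    SELECT Sector, COUNT(*) AS loan_count
    FROM msmeloans
    GROUP BY Sector
    ORDER BY loan_count DESC
""").fetchall()
print(rows)  # ('Trade', 3) comes first; tie order among the rest is unspecified
```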

Use Correct SQLite Syntax

Vanna is connected to SQLite — avoid PostgreSQL or MySQL-specific functions:

| Avoid | Use Instead |
|---|---|
| ILIKE | LIKE (case-insensitive by default in SQLite) |
| EXTRACT(YEAR FROM date) | strftime('%Y', date) |
| COUNT(*) FILTER (WHERE ...) | SUM(CASE WHEN ... THEN 1 ELSE 0 END) |
| STRING_AGG(...) | GROUP_CONCAT(...) |
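The substitutions in the table can be verified directly against an in-memory SQLite database. The DisbursedDate column and sample rows here are assumptions for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE msmeloans (Sector TEXT, DisbursedDate TEXT)")
conn.executemany("INSERT INTO msmeloans VALUES (?, ?)",
                 [("Trade", "2024-03-01"), ("Trade", "2023-07-15")])

# strftime('%Y', ...) instead of EXTRACT(YEAR FROM ...)
years = [r[0] for r in conn.execute(
    "SELECT strftime('%Y', DisbursedDate) FROM msmeloans")]
print(years)  # ['2024', '2023']

# GROUP_CONCAT(...) instead of STRING_AGG(...); concatenation order is unspecified
concat = conn.execute(
    "SELECT GROUP_CONCAT(DisbursedDate) FROM msmeloans WHERE Sector = 'Trade'"
).fetchone()[0]
print(concat)
```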

FILTER clause compatibility

The COUNT(*) FILTER (WHERE pred_default_prob > 0.5) syntax used in the original notebook works in SQLite 3.30.0+. If deploying on older environments, replace with SUM(CASE WHEN pred_default_prob > 0.5 THEN 1 ELSE 0 END).
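The portable SUM(CASE ...) form can be checked like this; the column name and 0.5 threshold follow the note above, and sqlite3.sqlite_version reports the runtime library version to compare against 3.30.0:

```python
import sqlite3

print(sqlite3.sqlite_version)  # FILTER on aggregates needs SQLite >= 3.30.0

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE msmeloans (pred_default_prob REAL)")
conn.executemany("INSERT INTO msmeloans VALUES (?)", [(0.1,), (0.6,), (0.9,)])

# Portable replacement for COUNT(*) FILTER (WHERE pred_default_prob > 0.5)
portable = conn.execute(
    "SELECT SUM(CASE WHEN pred_default_prob > 0.5 THEN 1 ELSE 0 END) FROM msmeloans"
).fetchone()[0]
print(portable)  # 2
```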


Coverage Strategy

Aim to cover at least one SQL example per major query pattern:

| Pattern | Example Question |
|---|---|
| Simple count | "How many loans are in the database?" |
| Filtered count | "How many female borrowers are there?" |
| Aggregation (AVG) | "What is the average loan amount by sector?" |
| Aggregation (SUM) | "What is the total amount disbursed by PFI?" |
| TOP-N | "Show the top 10 borrowers with the highest risk." |
| Percentage | "What percentage of loans are high-risk?" |
| Multi-group | "Show default rate by sector and age group." |
| Boolean filter | "How many startup borrowers are in Agriculture?" |
| Date-based | "How many loans were disbursed in 2024?" |
| Ratio/derived | "Show the loan-to-turnover ratio by employees bucket." |
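A simple way to audit coverage is to tag each training pair with the pattern it exercises and diff against the pattern list. The tags and the tiny training set below are illustrative assumptions, not a Vanna feature:

```python
# Pattern names mirror the coverage table above.
PATTERNS = {
    "simple count", "filtered count", "aggregation (avg)", "aggregation (sum)",
    "top-n", "percentage", "multi-group", "boolean filter", "date-based",
    "ratio/derived",
}

# Hypothetical training set with manually assigned pattern tags.
training_set = [
    {"question": "How many loans are in the database?", "pattern": "simple count"},
    {"question": "Show the top 10 borrowers with the highest risk.", "pattern": "top-n"},
]

covered = {pair["pattern"] for pair in training_set}
missing = sorted(PATTERNS - covered)
print(missing)  # the eight patterns this training set does not yet cover
```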

Documentation String Guidelines

Good Documentation

"pred_default_prob is a value between 0.0 and 1.0 representing the probability of loan default."
"risk_category is derived from pred_default_prob: Low Risk (< 0.25), Medium Risk (0.25–0.5), High Risk (> 0.5)."
"The msmeloans table uses State as a numeric ID (integer), not a state name."
"ES_Rating values are: A (lowest risk), B, C, D, E (highest risk)."
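The documented risk_category thresholds translate into a SQL CASE expression that a well-trained model should be able to reproduce. A sketch against assumed sample values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE msmeloans (pred_default_prob REAL)")
conn.executemany("INSERT INTO msmeloans VALUES (?)", [(0.1,), (0.3,), (0.8,)])

# Thresholds from the documentation string above:
# Low Risk (< 0.25), Medium Risk (0.25-0.5), High Risk (> 0.5)
rows = conn.execute("""
    SELECT pred_default_prob,
           CASE WHEN pred_default_prob < 0.25 THEN 'Low Risk'
                WHEN pred_default_prob <= 0.5 THEN 'Medium Risk'
                ELSE 'High Risk' END AS risk_category
    FROM msmeloans
""").fetchall()
print(rows)  # [(0.1, 'Low Risk'), (0.3, 'Medium Risk'), (0.8, 'High Risk')]
```

Keeping the documentation string and SQL training examples consistent on these thresholds avoids the "contradictory documentation" red flag described below.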

Bad Documentation

"This table has loans."            ← Too vague, adds no signal
"See the schema for column types." ← Non-informative
"sector is a column."              ← Already in DDL, redundant

Retraining Cadence Recommendations

| Scenario | Recommended Frequency |
|---|---|
| Active development | After every schema migration |
| Production with new analyst queries | Weekly review and addition |
| Stable production | Monthly audit using the audit script |
| After user feedback session | Immediately after session ends |

Red Flags in Training Data

Review your training set for these issues and fix them immediately:

  • Wrong column names — SQL references columns that don't exist → causes runtime SQL errors
  • Duplicate questions — Multiple pairs with identical questions but different SQL confuse the similarity search
  • Contradictory documentation — E.g., two docs saying different thresholds for "high-risk"
  • Hallucinated table names — SQL references tables not in the schema
  • Untested SQL — SQL was never run against the real database
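Several of these red flags can be caught automatically. A minimal sketch of such an audit, checking for duplicate questions and for SQL that fails against the real schema (the training pairs and schema fragment are assumptions):

```python
import sqlite3
from collections import Counter

# Assumed schema fragment standing in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE msmeloans (Sector TEXT, AmountGranted REAL)")

# Hypothetical training set containing both red flags.
pairs = [
    {"question": "How many loans are there?", "sql": "SELECT COUNT(*) FROM msmeloans"},
    {"question": "How many loans are there?", "sql": "SELECT COUNT(Sector) FROM msmeloans"},
    {"question": "Average loan amount?", "sql": "SELECT AVG(loan_amount) FROM msmeloans"},
]

# Duplicate questions confuse similarity search.
duplicates = [q for q, n in Counter(p["question"] for p in pairs).items() if n > 1]

# Untested SQL and wrong column/table names surface as execution errors.
broken = []
for p in pairs:
    try:
        conn.execute(p["sql"])
    except sqlite3.Error:
        broken.append(p["sql"])

print(duplicates)  # ['How many loans are there?']
print(broken)      # ['SELECT AVG(loan_amount) FROM msmeloans']
```

Contradictory documentation still needs a human eye, but the structural red flags above are cheap to machine-check on every retraining run.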

Use the Retraining audit script to monitor for these issues regularly.