Does this model deserve to be trusted?
A language model classifies each real borrower as default or no default and reports how sure it is. Then every verdict is checked against the borrower’s real recorded outcome. The dial up top sets how sure the model must be before a case clears without a human.
Run the live evaluation
Nothing on this page is pre-computed. Paste your Google Gemini API key and the model will classify real held‑out borrowers one by one, then score itself against the known outcomes. The labels are revealed to the scorer only after each verdict is in.
Borrowers in, three clear piles out
The model sorts every borrower and routes it by how sure it is. Confident cases clear on their own. Only the uncertain ones reach the review queue, so a person spends time where it actually matters.
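The routing can be sketched in a few lines. This is a minimal illustration, not the page's actual code: the `Verdict` class, the `"no_default"` label spelling, the pile names, and the rule that a confident "default" verdict is flagged rather than cleared are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    status: str        # "default" or "no_default" (label spelling is assumed)
    confidence: float  # the model's self-reported certainty, 0.0 to 1.0
    reason: str        # short free-text justification from the model


def route(verdict: Verdict, threshold: float) -> str:
    """Return the pile for one verdict (pile names are hypothetical)."""
    if verdict.confidence < threshold:
        # Not sure enough: a person decides.
        return "human_review"
    if verdict.status == "no_default":
        # Confident and clean: clears without a human.
        return "auto_approve"
    # Confident prediction of default: flagged.
    return "flag"
```

Raising `threshold` sends more borrowers to `human_review`; lowering it clears more automatically, which is the trade-off the dial controls.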
Open the Evaluation scorecard tab, paste a Gemini key, and run the evaluation to populate these piles.
From this portfolio demo to a system that handles 10,000+ borrowers
This page is a working blueprint, not a throwaway. The parts that are hard to get right, the classification logic and the proof that it works, carry over to production unchanged. What changes is the plumbing around them.
What transfers unchanged
- The classification prompt and the typed {status, confidence, reason} verdict
- The escalation gate: confidence vs threshold routes auto-approve / flag / human
- The evaluation harness: scoring against ground truth, precision, recall, F1, accuracy-at-confidence. The hardest part, and it ports as-is
- The triage console and scorecard become the live monitoring dashboard
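As a rough sketch of what the evaluation harness computes, the function below scores predictions against ground truth. It treats "default" as the positive class and reads "accuracy-at-confidence" as accuracy over the cases at or above the threshold; both choices, and the `score` name, are assumptions rather than the page's confirmed definitions.

```python
def score(predicted, actual, confidences, threshold, positive="default"):
    """Precision, recall, F1, accuracy, and accuracy on confident cases only."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == positive and a == positive for p, a in pairs)
    fp = sum(p == positive and a != positive for p, a in pairs)
    fn = sum(p != positive and a == positive for p, a in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = sum(p == a for p, a in pairs) / len(pairs) if pairs else 0.0

    # Keep only the verdicts the gate would clear without a human.
    confident = [pa for pa, c in zip(pairs, confidences) if c >= threshold]
    accuracy_at_confidence = (
        sum(p == a for p, a in confident) / len(confident) if confident else None
    )
    coverage = len(confident) / len(pairs) if pairs else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        "accuracy_at_confidence": accuracy_at_confidence,
        "coverage": coverage,
    }
```

Reporting `coverage` next to `accuracy_at_confidence` shows how much of the book the confident pile actually covers at a given threshold.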
What changes to scale
- Classifying runs server-side in batches, not one live call at a time in a browser
- Borrower records live in a database, not a static file
- Caching, retries and audit logs replace skip-on-error
- A periodic re-score watches for model and policy drift
| Concern | This portfolio demo | Production at 10k+ |
|---|---|---|
| Execution | Browser, one live call at a time, your key | Server-side batch job (Gemini Batch API, roughly half the cost, async) or a worker pool with rate-limit handling and resume |
| Data | 150 rows embedded in a JS file | Real database (Postgres / AlloyDB / BigQuery); verdicts written back to a table |
| Idempotency | Re-runs every borrower each time | Cache verdicts per record, never re-classify the same borrower |
| Reliability | Skip on error | Retries, dead-letter queue, monitoring, full audit log |
| Human queue | In-page approve / flag buttons | A real review workflow: assignment, fairness checks, sign-off |
| Drift | One-shot run | Re-score the labelled hold-out on a schedule to catch model or policy drift |
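To make the reliability row concrete, here is one possible shape for a batch loop with retries, a dead-letter list, and an audit trail. It is a sketch under assumptions: `classify` stands in for whatever function calls the model, records are assumed to be dicts with an `"id"` field, and a real system would persist the audit log and queue rather than keep them in memory.

```python
import time


def classify_batch(records, classify, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Classify each record, retrying failures with exponential backoff.

    Returns (verdicts, dead_letters, audit). `classify` is a hypothetical
    callable that takes one record and returns its verdict or raises.
    """
    verdicts = {}       # record id -> verdict
    dead_letters = []   # ids that failed every attempt, for later inspection
    audit = []          # (record id, attempt number, outcome) for every call

    for record in records:
        record_id = record["id"]
        for attempt in range(1, attempts + 1):
            try:
                verdicts[record_id] = classify(record)
                audit.append((record_id, attempt, "ok"))
                break
            except Exception as exc:
                audit.append((record_id, attempt, f"error: {exc}"))
                if attempt == attempts:
                    dead_letters.append(record_id)
                else:
                    # Wait 1x, 2x, 4x ... the base delay before trying again.
                    sleep(base_delay * 2 ** (attempt - 1))
    return verdicts, dead_letters, audit
```

Unlike skip-on-error, a failed record is never silently dropped: it either succeeds on a later attempt or lands in `dead_letters` with its errors recorded.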
How accuracy is verified
This is the honest answer to “you only tested 50 to 150 borrowers, how can that be ready for thousands of users?”: you do not convince anyone by testing more users. You show that the tested sample is statistically sufficient and representative, and let the numbers prove it.
The sample validates the decision rule, not the users
A national election can be forecast from a poll of about a thousand people, not the whole country. You do not drink the whole pot to check the soup; you stir and taste a spoonful. What matters is that the tested borrowers are a fair, random, representative slice, not that there are as many of them as there are real customers.
The sample size you need does not grow with your user count
The number of labelled cases needed to pin accuracy to a given margin is essentially the same whether you serve a thousand customers or ten million; a small, finite population can only lower the requirement, never raise it. Going to production scale does not require a bigger validation set. So “thousands of users, therefore test thousands” is a misconception: the required test sample is fixed by the margin of error and confidence level you want, not by the population size.
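The standard sample-size formula for a proportion shows this directly. The sketch below uses the normal approximation with an optional finite population correction; the function name and defaults are illustrative, and `expected_accuracy=0.5` is the conservative worst case.

```python
import math


def required_sample_size(margin, expected_accuracy=0.5, z=1.96, population=None):
    """Labelled cases needed to estimate accuracy within +/- margin.

    z=1.96 corresponds to a 95% confidence level. `population` is optional:
    when given, the finite population correction is applied.
    """
    n = z ** 2 * expected_accuracy * (1 - expected_accuracy) / margin ** 2
    if population is not None:
        # The correction only ever shrinks n; it matters for small populations.
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)


print(required_sample_size(0.05))                         # 385
print(required_sample_size(0.05, population=10_000_000))  # 385
print(required_sample_size(0.05, population=1_000))       # 278
```

Population size appears only in the correction term, which vanishes as the population grows, so ten million customers need no more labelled cases than the uncorrected formula asks for.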
The confidence interval states exactly how much to trust each number
Every metric on the scorecard carries a 95% confidence interval, for example accuracy 88% (95% CI 82–92%, n=150). That is a measured margin of error, not a promise. Nobody has to take “150 is enough” on faith: the interval tells them, and it visibly tightens as the sample grows.
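The page does not say which interval method the scorecard uses. The Wilson score interval is one common choice for a proportion, and as a sketch it reproduces the example above if 88% is taken to mean 132 correct out of 150 (an assumption).

```python
import math


def wilson_interval(correct, n, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives about 95%."""
    p = correct / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width


low, high = wilson_interval(132, 150)
print(f"{low:.2f} to {high:.2f}")  # 0.82 to 0.92
```

Running the same function on 880 correct out of 1,000 gives a visibly narrower interval at the same 88% accuracy, which is the tightening the scorecard shows as the sample grows.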
The sample is representative of the real book
The tested borrowers preserve the real portfolio’s 21% default rate and all six loan purposes in proportion. A sample that mirrors the population is trustworthy; a cherry-picked one is not. This is what makes the spoonful a fair taste of the whole pot.
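One way to draw such a sample is proportional stratified sampling. The sketch below is an illustration, not necessarily how this page's sample was built: the field names `purpose` and `defaulted` are hypothetical, and per-stratum rounding means the result can differ from `size` by a few records.

```python
import random
from collections import defaultdict


def stratified_sample(records, size, keys=("purpose", "defaulted"), seed=0):
    """Sample about `size` records, keeping each stratum's share of the whole."""
    strata = defaultdict(list)
    for record in records:
        strata[tuple(record[k] for k in keys)].append(record)

    rng = random.Random(seed)  # fixed seed so the same sample is drawn each time
    fraction = size / len(records)
    sample = []
    for key in sorted(strata):
        group = strata[key]
        take = min(len(group), round(len(group) * fraction))
        sample.extend(rng.sample(group, take))
    return sample
```

Because every (purpose, outcome) group is sampled at the same rate, the default rate and the mix of loan purposes in the sample track those of the full book.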
The small sample is a demo budget, not a method limit
50 to 150 is small only because each borrower is a live, paid API call running in your browser. The method itself scales to any validation size. In production you swap the live run for a one-time offline batch of around a thousand labelled cases (cheap on the Batch API), then re-score new cases monthly as their real outcomes arrive, so accuracy is watched continuously, not proven once.
Why results stay consistent
For an evaluation project, consistency is credibility: a scorecard you can reproduce is one people trust. The result here does not depend on who runs it or which API key they use. It depends only on the model, a fixed prompt, and a fixed sampling setting.
The API key is only authentication and billing; it has no effect on the verdict. Three things determine the output, none of them the key: the model version, the sampling setting (fixed at temperature 0, so the model always picks its single most likely answer), and the prompt and borrower data, both fixed and in a fixed order so the same borrowers are scored every run.
| How you run it | Model version | What you get |
|---|---|---|
| Different key, same time | Identical | Most reproducible: same model, prompt and setting. Near-identical, often exactly identical. |
| Same key, different time | May differ | Mostly the same; can drift only if the model snapshot is updated between runs. |
| Different key, different time | May differ | Same as the row above. The time gap is the variable, never the key. |
The aggregate is stable by design
Picking the single most likely answer (temperature 0) makes each verdict highly consistent, though not bit-for-bit guaranteed: tiny differences in how the hardware adds up numbers, or how requests are batched on the server, can flip a borderline borrower. Across the whole set that moves accuracy by well under 1%, which is inside the confidence interval the scorecard already shows. So small run-to-run wobble is expected and within the stated margin, not a contradiction.
Making it exact for production
- Pin the model to a dated snapshot instead of the floating alias, so a server-side model update can never quietly drift the numbers
- Cache every verdict so a re-run reuses prior answers instead of re-asking, which a production system does anyway for cost and consistency
- Together these make the scorecard fully reproducible: the same borrowers, the same pinned model, the same cached answers, give the same numbers every time
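A verdict cache along these lines could look like the sketch below. The cache key is built from the things that determine the output (model snapshot, prompt, temperature, borrower record) and deliberately excludes the API key. The class, the `classify` callable, and the in-memory dict are assumptions; production would back this with a database table.

```python
import hashlib
import json


def cache_key(model_snapshot, prompt, temperature, record):
    """Stable fingerprint of everything that determines a verdict."""
    payload = json.dumps(
        {
            "model": model_snapshot,
            "prompt": prompt,
            "temperature": temperature,
            "record": record,
        },
        sort_keys=True,  # same content always serialises to the same string
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class VerdictCache:
    """Re-use prior verdicts instead of re-asking the model."""

    def __init__(self, classify):
        self._classify = classify  # hypothetical function: record -> verdict
        self._store = {}

    def get(self, model_snapshot, prompt, temperature, record):
        key = cache_key(model_snapshot, prompt, temperature, record)
        if key not in self._store:
            self._store[key] = self._classify(record)
        return self._store[key]
```

Changing the pinned snapshot or the prompt changes the key, so stale verdicts from an older configuration are never mixed into a new scorecard.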
A realistic production architecture
The same brain and proof, wrapped in standard data infrastructure:
→ batch classifier (Gemini Batch API, governed tool access) writes {status, confidence, reason} per record
→ escalation gate applies the threshold, routes to approve / flag / review queues
→ evaluation service re-scores the labelled hold-out on a schedule, raises drift alerts
→ this dashboard (scorecard + triage) becomes the monitoring surface
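The scheduled re-score in the evaluation service can be as simple as comparing fresh accuracy on the labelled hold-out with the lower bound of the baseline confidence interval. The rule below is an assumed, deliberately simple alerting policy for illustration; a real service might prefer a formal statistical test or per-segment checks.

```python
def drift_check(baseline_ci_low, correct, n):
    """Flag drift when re-scored accuracy falls below the baseline interval.

    `baseline_ci_low` is the lower bound of the confidence interval recorded
    when the model was first validated (an assumed input to this sketch).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    accuracy = correct / n
    return {"accuracy": accuracy, "alert": accuracy < baseline_ci_low}


print(drift_check(0.82, 132, 150))  # {'accuracy': 0.88, 'alert': False}
print(drift_check(0.82, 120, 150))  # {'accuracy': 0.8, 'alert': True}
```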