Borrower Risk Evaluator

The proof

Does this model deserve to be trusted?

A language model classifies each real borrower as default or no default and reports how sure it is. Then every verdict is checked against the borrower’s real recorded outcome. The dial up top sets how sure the model must be before a case clears without a human.

Run the live evaluation

Nothing on this page is pre-computed. Paste your Google Gemini API key and the model will classify real held‑out borrowers one by one, then score itself against the known outcomes. The labels are revealed to the scorer only after each verdict is in.

Your key stays in your browser and is sent only to Google’s Gemini API. Get a free key at aistudio.google.com/apikey.

What it does

Borrowers in, three clear piles out

The model sorts every borrower and routes it by how sure it is. Confident cases clear on their own. Only the uncertain ones reach the review queue, so a person spends time where it actually matters.
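The routing can be sketched in a few lines of Python. This is a minimal sketch, not the page's actual code. The `Verdict` fields follow the {status, confidence, reason} verdict this page describes, but the label names, the pile names, and the rule that a confident "default" is flagged while a confident "no default" is auto-approved are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Verdict:
    status: str        # "default" or "no_default" (label names are assumed)
    confidence: float  # the model's self-reported certainty, 0.0 to 1.0
    reason: str

def route(verdict: Verdict, threshold: float) -> str:
    """Return the pile a verdict lands in (pile names are hypothetical)."""
    if verdict.confidence < threshold:
        return "human_review"   # not sure enough: a person decides
    if verdict.status == "default":
        return "flag"           # confident the borrower will default
    return "auto_approve"       # confident the borrower will repay

# Invented example verdicts, not real borrowers.
verdicts = [
    Verdict("no_default", 0.95, "stable income, low utilisation"),
    Verdict("default", 0.91, "prior delinquencies"),
    Verdict("no_default", 0.62, "thin credit file"),
]
piles = [route(v, threshold=0.80) for v in verdicts]
# piles == ["auto_approve", "flag", "human_review"]
```

Raising `threshold` sends more cases to the review queue; lowering it lets more clear on their own.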

The honest path

From this portfolio demo to a system that handles 10,000+ borrowers

This page is a working blueprint, not a throwaway. The parts that are hard to get right, the classification logic and the proof that it works, carry over to production unchanged. What changes is the plumbing around them.

What transfers unchanged

The classification prompt and the typed {status, confidence, reason} verdict
The escalation gate: confidence vs threshold routes auto-approve / flag / human
The evaluation harness: scoring against ground truth, precision, recall, F1, accuracy-at-confidence. The hardest part, and it ports as-is
The triage console and scorecard become the live monitoring dashboard
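As a rough illustration of what the evaluation harness computes, here is a minimal Python sketch. It treats "default" as the positive class and reads "accuracy-at-confidence" as accuracy over only the cases at or above the threshold. Both readings are assumptions, as are the function name and the toy data.

```python
def score(predicted, actual, confidences, threshold, positive="default"):
    """Score verdicts against ground truth; `positive` is the class of interest."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == positive and a == positive for p, a in pairs)
    fp = sum(p == positive and a != positive for p, a in pairs)
    fn = sum(p != positive and a == positive for p, a in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Accuracy-at-confidence: accuracy over only the cases the gate would
    # clear without a human (confidence at or above the threshold).
    cleared = [(p, a) for (p, a), c in zip(pairs, confidences) if c >= threshold]
    acc_at_conf = (sum(p == a for p, a in cleared) / len(cleared)
                   if cleared else 0.0)
    return {
        "precision": precision, "recall": recall, "f1": f1,
        "accuracy": sum(p == a for p, a in pairs) / len(pairs),
        "accuracy_at_confidence": acc_at_conf,
        "coverage": len(cleared) / len(pairs),  # share cleared without review
    }

# Toy data, invented for illustration only.
predicted   = ["default", "no_default", "default", "no_default", "no_default"]
actual      = ["default", "no_default", "no_default", "default", "no_default"]
confidences = [0.95, 0.90, 0.60, 0.55, 0.85]
metrics = score(predicted, actual, confidences, threshold=0.80)
```

On this toy data both mistakes sit below the threshold, so overall accuracy is 60% while accuracy on the cleared cases is 100%. That gap is the reason to report the two numbers separately.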

What changes to scale

Classifying runs server-side in batches, not one live call at a time in a browser
Borrower records live in a database, not a static file
Caching, retries and audit logs replace skip-on-error
A periodic re-score watches for model and policy drift

Concern	This portfolio demo	Production at 10k+
Execution	Browser, one live call at a time, your key	Server-side batch job (Gemini Batch API, roughly half the cost, async) or a worker pool with rate-limit handling and resume
Data	150 rows embedded in a JS file	Real database (Postgres / AlloyDB / BigQuery); verdicts written back to a table
Idempotency	Re-runs every borrower each time	Cache verdicts per record, never re-classify the same borrower
Reliability	Skip on error	Retries, dead-letter queue, monitoring, full audit log
Human queue	In-page approve / flag buttons	A real review workflow: assignment, fairness checks, sign-off
Drift	One-shot run	Re-score the labelled hold-out on a schedule to catch model or policy drift
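The reliability row can be made concrete with a small Python sketch of retries with exponential backoff and a dead-letter list. It is a sketch under stated assumptions: `classify` is a hypothetical callable standing in for the real model call, records are assumed to carry an `id` field, and a real worker would catch narrower exceptions and persist the dead-letter entries.

```python
import time

def classify_with_retries(records, classify, max_attempts=3, base_delay=1.0,
                          sleep=time.sleep):
    """Classify each record, retrying failures with exponential backoff.

    Records that still fail after `max_attempts` go to a dead-letter list
    instead of being silently skipped.
    """
    verdicts, dead_letter = {}, []
    for record in records:
        for attempt in range(1, max_attempts + 1):
            try:
                verdicts[record["id"]] = classify(record)
                break
            except Exception as exc:  # a real system would catch narrower errors
                if attempt == max_attempts:
                    dead_letter.append({"id": record["id"], "error": str(exc)})
                else:
                    sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
    return verdicts, dead_letter

# Stub classifier: "b2" fails once then succeeds, "b3" always fails.
calls = {}
def flaky(record):
    calls[record["id"]] = calls.get(record["id"], 0) + 1
    if record["id"] == "b2" and calls["b2"] < 2:
        raise RuntimeError("rate limited")
    if record["id"] == "b3":
        raise RuntimeError("malformed response")
    return {"status": "no_default", "confidence": 0.9, "reason": "stub"}

verdicts, dead = classify_with_retries(
    [{"id": "b1"}, {"id": "b2"}, {"id": "b3"}], flaky, sleep=lambda s: None)
```

Here `b2` recovers on its second attempt and `b3` lands in the dead-letter list after three tries, so nothing disappears without a record.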

The credibility question

How accuracy is verified

This is the honest answer to “you only tested 50 to 150 borrowers, how can that be ready for thousands of users?” You do not convince anyone by testing more users. You show that the tested sample is statistically sufficient and representative, and let the numbers prove it.

The sample validates the decision rule, not the users

A national election is forecast from a poll of about a thousand people, not the whole country. You do not drink the whole pot to check the soup; you stir and taste a spoonful. What matters is that the tested borrowers are a fair, random, representative slice, not that there are as many of them as there are real customers.

The sample size you need does not grow with your user count

The number of labelled cases needed to pin accuracy to a given margin is essentially the same whether you serve a thousand customers or ten million (for a small population it can only be smaller, never larger). Going to production scale does not require a bigger validation set. So “thousands of users, therefore test thousands” is a misconception: the required test sample is fixed by the margin of error and confidence level you want, not by the population size.
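The standard normal-approximation formula makes this concrete: n = z² · p(1 − p) / e², where e is the margin of error and p = 0.5 is the worst case. The Python sketch below applies it, with an optional finite population correction. The function name and the ±3 point margin are illustrative choices, not values taken from this page.

```python
import math
from statistics import NormalDist

def required_sample_size(margin, confidence=0.95, p=0.5, population=None):
    """Labelled cases needed to estimate a proportion within +/- margin.

    Uses n0 = z^2 * p * (1 - p) / margin^2, with p = 0.5 as the worst case.
    If `population` is given, applies the finite population correction,
    which can only shrink the answer.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    if population is not None:
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)

n_unbounded = required_sample_size(0.03)                       # no population cap
n_10k = required_sample_size(0.03, population=10_000)          # 10,000 customers
n_10m = required_sample_size(0.03, population=10_000_000)      # 10 million customers
```

For a ±3 point margin at 95% confidence the uncapped answer is 1,068 cases, and the population-aware figures are never larger than that. Ten million customers need almost exactly the same validation set as ten thousand.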

The confidence interval states exactly how much to trust each number

Every metric on the scorecard carries a 95% confidence interval, like accuracy 88% (95% CI 82–92%, n=150). That is a measured margin of error, not a promise. Nobody has to take “150 is enough” on faith: the interval tells them, and it visibly tightens as you raise the sample.
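One common way to compute such an interval is the Wilson score interval, sketched below in Python. The page does not say which method the scorecard uses, so treat Wilson as an assumption; the count of 132 correct out of 150 is simply the 88% example written as a fraction.

```python
import math
from statistics import NormalDist

def wilson_interval(correct, n, confidence=0.95):
    """Wilson score interval for a proportion such as accuracy."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = correct / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half

low, high = wilson_interval(132, 150)           # 132 / 150 = 88% accuracy
low_big, high_big = wilson_interval(880, 1000)  # same accuracy, larger sample
```

With n=150 this gives roughly 82% to 92%, matching the example above. At n=1000 the same 88% narrows to roughly 86% to 90%, which is the tightening the scorecard shows as the sample grows.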

The sample is representative of the real book

The tested borrowers preserve the real portfolio’s 21% default rate and all six loan purposes in proportion. A sample that mirrors the population is trustworthy; a cherry-picked one is not. This is what makes the spoonful a fair taste of the whole pot.
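A proportional stratified draw is one way to get a sample with this property. The Python sketch below is illustrative only: the field names `purpose` and `defaulted`, the two purposes, and the synthetic portfolio are invented, and the page does not say how its own sample was drawn.

```python
import random
from collections import defaultdict

def stratified_sample(records, size, seed=0):
    """Draw about `size` records, keeping each (purpose, outcome) stratum's share.

    Per-stratum counts are rounded, so the total can differ from `size`
    by a record or two.
    """
    strata = defaultdict(list)
    for r in records:
        strata[(r["purpose"], r["defaulted"])].append(r)
    rng = random.Random(seed)  # fixed seed so the same sample is drawn each run
    sample = []
    for key in sorted(strata):
        group = strata[key]
        k = round(size * len(group) / len(records))  # proportional allocation
        sample.extend(rng.sample(group, min(k, len(group))))
    return sample

# Synthetic portfolio: 1,000 borrowers, 21% defaults within each purpose.
portfolio = (
    [{"purpose": "car", "defaulted": True}] * 126
    + [{"purpose": "car", "defaulted": False}] * 474
    + [{"purpose": "education", "defaulted": True}] * 84
    + [{"purpose": "education", "defaulted": False}] * 316
)
sample = stratified_sample(portfolio, size=100)
default_rate = sum(r["defaulted"] for r in sample) / len(sample)
```

On this synthetic portfolio the 100-record sample keeps the 21% default rate and the 60/40 purpose split.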

The small sample is a demo budget, not a method limit

50 to 150 is small only because each borrower is a live, paid API call running in your browser. The method itself scales to any validation size. In production you swap the live run for a one-time offline batch of around a thousand labelled cases (cheap on the Batch API), then re-score new cases monthly as their real outcomes arrive, so accuracy is watched continuously, not proven once.

Reproducibility

Why results stay consistent

For an evaluation project, consistency is credibility: a scorecard you can reproduce is one people trust. The result here does not depend on who runs it or which API key they use. It depends only on the model version, a fixed prompt, and a fixed sampling setting.

The API key is only authentication and billing; it has no effect on the verdict. Three things determine the output, none of them the key: the model version, the sampling setting (fixed at temperature 0, so the model always picks its single most likely answer), and the prompt and borrower data, both fixed and in a fixed order so the same borrowers are scored every run.

How you run it	Model version	What you get
Different key, same time	Identical	Most reproducible: same model, prompt and setting. Near-identical, often exactly identical.
Same key, different time	May differ	Mostly the same; can drift only if the model snapshot is updated between runs.
Different key, different time	May differ	Same as the row above. The time gap is the variable, never the key.

The aggregate is stable by design

Picking the single most likely answer (temperature 0) makes each verdict highly consistent, though not bit-for-bit guaranteed: tiny differences in how the hardware adds up numbers, or how requests are batched on the server, can flip a borderline borrower. On a 150-borrower run, one flipped verdict moves accuracy by about 0.7 percentage points, which is well inside the confidence interval the scorecard already shows. So small run-to-run wobble is expected and within the stated margin, not a contradiction.

Making it exact for production

Pin the model to a dated snapshot instead of the floating alias, so a server-side model update can never quietly drift the numbers
Cache every verdict so a re-run reuses prior answers instead of re-asking, which a production system does anyway for cost and consistency
Together these make the scorecard fully reproducible: the same borrowers, the same pinned model and the same cached answers give the same numbers every time
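A minimal Python sketch of a verdict cache keyed by the pinned model, the prompt and the record shows how the two steps fit together. The snapshot name, the class and the in-memory store are hypothetical stand-ins; a production system would back this with a database table.

```python
import hashlib
import json

def verdict_key(model_snapshot, prompt, record):
    """Stable cache key: any change to model, prompt or record changes the key."""
    payload = json.dumps(
        {"model": model_snapshot, "prompt": prompt, "record": record},
        sort_keys=True,  # the key must not depend on dict ordering
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class VerdictCache:
    def __init__(self):
        self._store = {}  # in-memory stand-in for a database table
        self.misses = 0

    def get_or_classify(self, model_snapshot, prompt, record, classify):
        key = verdict_key(model_snapshot, prompt, record)
        if key not in self._store:
            self.misses += 1
            self._store[key] = classify(record)  # only called on a cache miss
        return self._store[key]

# Hypothetical snapshot name and stub classifier, for illustration only.
cache = VerdictCache()
stub = lambda record: {"status": "no_default", "confidence": 0.9, "reason": "stub"}
borrower = {"id": "b1", "income": 52000}
first = cache.get_or_classify("example-model-2025-01-01", "prompt v1", borrower, stub)
again = cache.get_or_classify("example-model-2025-01-01", "prompt v1", borrower, stub)
```

The second call reuses the stored verdict without calling the model. Changing the snapshot or the prompt produces a new key, so old answers are never silently mixed with new ones.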

A realistic production architecture

The same brain and proof, wrapped in standard data infrastructure:

Borrower records → database (AlloyDB / BigQuery)
→ batch classifier (Gemini Batch API, governed tool access) writes {status, confidence, reason} per record
→ escalation gate applies the threshold, routes to approve / flag / review queues
→ evaluation service re-scores the labelled hold-out on a schedule, raises drift alerts
→ this dashboard (scorecard + triage) becomes the monitoring surface
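The drift-alert step of the evaluation service could be as simple as the Python sketch below: re-score the labelled hold-out, compare against the baseline, and alert when the drop exceeds a tolerance. The function name and the 5 point tolerance are assumptions. A real service might instead alert when the new score falls outside the baseline's confidence interval.

```python
def drift_alert(baseline_accuracy, correct, total, tolerance=0.05):
    """Return an alert message if re-scored accuracy fell by more than `tolerance`."""
    current = correct / total
    drop = baseline_accuracy - current
    if drop > tolerance:
        return (f"drift: accuracy {current:.1%} is {drop:.1%} below "
                f"baseline {baseline_accuracy:.1%}")
    return None  # within tolerance: no alert

# Invented re-score results against an 88% baseline.
quiet = drift_alert(0.88, correct=130, total=150)  # small dip, no alert
alert = drift_alert(0.88, correct=120, total=150)  # 80% accuracy, alert raised
```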

The leap from demo to production is real work, but it is plumbing around a core that already exists here. The classification logic and the evaluation harness, the parts most builds skip, are done and proven on this page.