Borrower Risk Evaluator - Loan-Default Classifier That Proves Its Own Accuracy

📋 Project Overview & Problem Statement

Challenge: A credit officer at a small lender cannot carefully review every loan application, so they either rush (and approve loans that default and cost real money) or over-review (and waste days on safe applicants). They have no defensible way to state how accurate their triage actually is. The market offers either a black-box credit score that is never measured on the lender's own loan book, or a heavyweight machine-learning platform with no per-record human-in-the-loop workflow.

Solution: Borrower Risk Evaluator classifies each borrower as default or no default with a confidence score, routes each one to auto-approve, auto-flag, or send-to-human, and then proves its accuracy live against the known repayment outcomes. The headline skill is the evaluation: a quality scorecard measured against ground truth, with a 95% confidence interval on every number.

Key Benefits

Measured, not promised: accuracy, precision, recall, and F1 are scored against the real outcome label, each with a confidence interval
Confidence where it counts: accuracy on the cases the model cleared on its own, the pile that matters most
Tunable safety dial: drag the confidence threshold and watch precision, recall, and manual-review volume trade off live
Plain-language reasons: every verdict comes with the one or two facts that drove it, and a tick or cross showing whether it matched the real outcome
Honest about scale: a built-in guide explains exactly how the demo becomes a production system for thousands of borrowers

🖥️ Application Features

📊 Evaluation Scorecard

The headline. Accuracy, precision, recall, F1, a confusion matrix, and accuracy-at-high-confidence, all measured against the real Current_loan_status label, each carrying a 95% confidence interval so the audience knows exactly how much to trust the number.
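
To make the scorecard concrete, here is a minimal sketch of how the confusion matrix, the headline metrics, and a 95% interval could be computed. The source does not say which interval method the demo uses; the Wilson score interval below is an assumption, and the function names are hypothetical. Labels are encoded as 1 = default, 0 = no default.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN with 1 = default treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion; z = 1.96 gives roughly 95%."""
    if n == 0:
        return (0.0, 1.0)  # no data: the interval is uninformative
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))

def scorecard(y_true, y_pred):
    """Accuracy, precision, and recall with intervals, plus F1 as a point value."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    n = tp + fp + tn + fn
    proportions = {
        "accuracy": (tp + tn, n),
        "precision": (tp, tp + fp),
        "recall": (tp, tp + fn),
    }
    out = {}
    for name, (k, m) in proportions.items():
        out[name] = {"value": k / m if m else None, "ci95": wilson_interval(k, m)}
    p, r = out["precision"]["value"], out["recall"]["value"]
    # F1 is not a simple proportion, so this sketch reports it without an interval.
    out["f1"] = {"value": 2 * p * r / (p + r) if p and r else 0.0}
    return out
```

Because F1 is a ratio of two proportions, an interval for it would need a different technique such as bootstrapping, which this sketch leaves out.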

🎚️ Safety Dial

A confidence-threshold slider that re-splits the auto-pass and escalate piles live and redraws an accuracy-versus-coverage tradeoff curve. Raise the bar and the cleared pile gets more accurate but fewer cases clear.
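
The tradeoff the slider exposes can be sketched as a sweep over thresholds. This is an illustration under assumptions, not the demo's actual code: the record fields `confidence`, `predicted`, and `actual` are hypothetical names, and confidence is assumed to lie between 0 and 1.

```python
def tradeoff_curve(records, thresholds):
    """For each threshold, report how many cases auto-clear and how accurate they are."""
    if not records:
        return []
    points = []
    for t in thresholds:
        # Cases at or above the threshold clear automatically; the rest escalate.
        cleared = [r for r in records if r["confidence"] >= t]
        correct = sum(1 for r in cleared if r["predicted"] == r["actual"])
        points.append({
            "threshold": t,
            "coverage": len(cleared) / len(records),
            "accuracy": correct / len(cleared) if cleared else None,
            "manual_review": len(records) - len(cleared),
        })
    return points
```

Plotting `accuracy` against `coverage` for the returned points gives the accuracy-versus-coverage curve; `manual_review` is the size of the human queue at each setting.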

🗂️ Triage Console

Every borrower is routed into one of three lanes: auto-approve, auto-flag, or a human review queue. Only the uncertain cases reach the queue. A by-loan-type breakdown shows how each purpose splits.
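
The by-loan-type breakdown amounts to a two-level count. A minimal sketch, assuming each routed record carries hypothetical `loan_type` and `lane` fields (the real field names are not given in the source):

```python
from collections import Counter, defaultdict

def lanes_by_loan_type(routed):
    """Count how many borrowers of each loan type landed in each lane."""
    table = defaultdict(Counter)
    for record in routed:
        table[record["loan_type"]][record["lane"]] += 1
    # Convert to plain dicts so the result is easy to serialise or render.
    return {loan_type: dict(lanes) for loan_type, lanes in table.items()}
```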

🧭 Portfolio to Production

A dedicated tab covering how accuracy is verified, why results stay consistent, and what changes to scale the demo into a system handling 10,000+ borrowers.

🤖 AI Integration & Intelligence

🧠 Typed Verdict (Gemini AI)

Each borrower is classified by Google Gemini using structured output: a typed verdict of status, confidence, and reason. The prompt sees the borrower's facts only, never the true label, so the scoring is honest.
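
A typed verdict is only useful if it is validated before anything downstream trusts it. The sketch below shows one way to parse and check the three fields named above. It is written in Python for illustration, while the demo itself runs in the browser. The allowed status strings and the 0-to-1 confidence scale are assumptions, not confirmed details of the demo's schema.

```python
import json
from dataclasses import dataclass

ALLOWED_STATUS = {"default", "no_default"}  # assumed label strings

@dataclass(frozen=True)
class Verdict:
    status: str
    confidence: float
    reason: str

def parse_verdict(raw_json: str) -> Verdict:
    """Parse a model response and reject anything that does not fit the schema."""
    data = json.loads(raw_json)
    status = data["status"]
    confidence = float(data["confidence"])
    reason = str(data["reason"]).strip()
    if status not in ALLOWED_STATUS:
        raise ValueError(f"unexpected status: {status!r}")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    if not reason:
        raise ValueError("reason must not be empty")
    return Verdict(status=status, confidence=confidence, reason=reason)
```

Rejecting malformed responses early keeps a bad verdict from silently entering the scorecard.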

✅ Measured Against Ground Truth

The genuinely new work: every prediction is compared to the borrower's recorded outcome to produce a real, statistical scorecard. Most AI demos report a confidence number that is never checked. This one checks it.

🚦 Deterministic Escalation Gate

A short, deterministic rule, not an LLM, reads the confidence and routes the borrower to auto-approve, auto-flag, or send-to-human. Not every component needs a model; the gate is auditable by design.
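
A gate of this kind can be a few lines long. The sketch below is one plausible reading, not the demo's exact rule: it assumes that a confident "default" verdict is auto-flagged, a confident "no default" verdict is auto-approved, and anything under the threshold goes to a human. The 0.8 default threshold and the status strings are hypothetical.

```python
def route(status: str, confidence: float, threshold: float = 0.8) -> str:
    """Deterministically map a verdict to one of three lanes."""
    if confidence < threshold:
        return "send-to-human"  # uncertain cases always go to a person
    if status == "default":
        return "auto-flag"      # confident prediction of default
    return "auto-approve"       # confident prediction of no default
```

Because the rule has no randomness and no model call, the same verdict always lands in the same lane, which is what makes it auditable.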

🎯 Calibration as the Point

Language-model confidence is often miscalibrated, which is exactly why measuring it matters. The accuracy-at-confidence view shows whether the model's stated certainty matches how often it is actually right.
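
One common way to check calibration is to bin predictions by stated confidence and compare each bin's average confidence with its observed accuracy. The sketch below assumes hypothetical `confidence` and `correct` fields and arbitrary bin edges; how the demo actually builds its accuracy-at-confidence view is not specified in the source.

```python
def accuracy_by_confidence(records, edges=(0.5, 0.6, 0.7, 0.8, 0.9, 1.0)):
    """Group predictions into confidence bins and report accuracy per bin."""
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        is_last = hi == edges[-1]
        # Bins are half-open [lo, hi); the last bin also includes its upper edge.
        # Confidences below the first edge are ignored in this sketch.
        members = [
            r for r in records
            if lo <= r["confidence"] < hi or (is_last and r["confidence"] == hi)
        ]
        if not members:
            rows.append({"bin": (lo, hi), "n": 0, "mean_confidence": None,
                         "accuracy": None, "gap": None})
            continue
        mean_conf = sum(r["confidence"] for r in members) / len(members)
        accuracy = sum(1 for r in members if r["correct"]) / len(members)
        # A positive gap means the model claimed more certainty than it earned.
        rows.append({"bin": (lo, hi), "n": len(members), "mean_confidence": mean_conf,
                     "accuracy": accuracy, "gap": mean_conf - accuracy})
    return rows
```

With only a small hold-out set, each bin holds few cases, so per-bin accuracy should be read alongside its sample size `n`.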

🛠️ Technical Architecture & Implementation

Frontend Stack

Single-file HTML
Vanilla JavaScript
Inline SVG charts
No build step

AI & Data

Google Gemini (gemini-2.5-flash)
Structured JSON output
BYO-key, client-side
Python + pandas (data prep)

Deployment & Infrastructure

GitHub Pages
Static asset (borrowers.js)
No backend

System Architecture

No backend, no server cost: the page calls Gemini directly from the browser with the visitor's own key
Real data as a static asset: a keyless Python script cleans the 32,577-row dataset and emits 150 stratified real borrowers as data/borrowers.js, loaded via <script src> so it works on GitHub Pages and offline
Honest scoring: the true label travels with each borrower but is used only after the model answers, never shown to it
Typed verdict to UI: Gemini's status and confidence map to a default probability that drives the confusion matrix, metrics, and tradeoff chart
Self-contained: classification, scoring, charts, and the safety dial all run in one HTML file
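
Two of the points above, withholding the label and turning a verdict into a default probability, can be sketched together. The label field name `Current_loan_status` comes from the scorecard description; everything else here is an assumption, including the mapping itself, which treats confidence as the model's certainty in whichever status it chose.

```python
LABEL_FIELD = "Current_loan_status"

def split_facts_and_label(borrower: dict):
    """Separate what the model may see (facts) from what it must not (the label)."""
    facts = {k: v for k, v in borrower.items() if k != LABEL_FIELD}
    return facts, borrower[LABEL_FIELD]

def default_probability(status: str, confidence: float) -> float:
    """Convert (status, confidence) into an estimated probability of default.

    Assumed mapping: a confident "default" verdict keeps its confidence,
    and a confident "no_default" verdict becomes 1 - confidence.
    """
    return confidence if status == "default" else 1.0 - confidence
```

Only `facts` would go into the prompt; the label is held back until the verdict returns and is then used for scoring.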

📖 Setup & How to Run

Prerequisites

A free Gemini API key from Google AI Studio (aistudio.google.com/apikey)
A modern browser. No install, no Node, no Python needed to run the demo

Run the Demo

# Open the live demo, or run locally:
git clone https://github.com/lyven81/ai-project.git
cd ai-project/projects/borrower-risk-evaluator

# Open index.html in a browser, then:
# 1. Paste your Gemini API key
# 2. Choose how many borrowers to test (50 / 100 / 150)
# 3. Click "Run evaluation"
            

Regenerate the Data Asset (optional)

# Re-sample real borrowers from the source dataset (keyless)
pip install pandas
python prep_data.py
# -> writes data/borrowers.js and data/meta.json
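
For readers curious what a stratified draw involves, here is a dependency-free sketch. It is not the contents of prep_data.py, which uses pandas and is not shown in this document. It assumes stratification on a single key (for example the outcome label), a proportional quota per stratum, and a fixed seed; all names are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, n, seed=42):
    """Draw n rows so each stratum keeps roughly its share of the full data."""
    if n > len(rows):
        raise ValueError("sample size exceeds number of rows")
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    total = len(rows)
    # Proportional quota per stratum: floor first, then hand out the
    # leftover slots to the strata with the largest fractional parts.
    exact = {g: n * len(members) / total for g, members in groups.items()}
    quota = {g: int(x) for g, x in exact.items()}
    leftover = n - sum(quota.values())
    by_fraction = sorted(exact, key=lambda g: exact[g] - quota[g], reverse=True)
    for g in by_fraction[:leftover]:
        quota[g] += 1
    sample = []
    for g, members in groups.items():
        sample.extend(rng.sample(members, quota[g]))
    rng.shuffle(sample)
    return sample
```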
            

🚀 Deployment

# Fully static. Deployed on GitHub Pages, no server.
# Live at:
# https://lyven81.github.io/ai-project/projects/borrower-risk-evaluator/demo.html
            

Production Notes

The demo classifies a labelled hold-out set live with the visitor's key; production swaps this for a one-time offline batch plus monthly re-scoring
For exact reproducibility, pin the model to a dated snapshot and cache verdicts (covered in the Portfolio to Production tab)
Frame as decision support: the system triages, a human owns the final call on any flag
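
The caching idea in the notes above can be sketched as a lookup keyed by the pinned model identifier plus the borrower's facts, so that re-running the evaluation returns the stored verdict instead of a fresh, possibly different one. The model identifier below is a placeholder, not a real snapshot name, and `classify` stands in for whatever function actually calls the model.

```python
import hashlib
import json

MODEL_ID = "pinned-model-snapshot"  # placeholder; substitute a real dated snapshot

def cache_key(model_id: str, facts: dict) -> str:
    """Stable key: the same model and the same facts always hash to the same value."""
    payload = json.dumps({"model": model_id, "facts": facts}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_classify(facts: dict, classify, cache: dict, model_id: str = MODEL_ID):
    """Return the stored verdict if present; otherwise classify once and store it."""
    key = cache_key(model_id, facts)
    if key not in cache:
        cache[key] = classify(facts)
    return cache[key]
```

Including the model identifier in the key means a change of snapshot invalidates old verdicts automatically rather than mixing results from two models.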

📊 Key Metrics

32,577 Real Labelled Borrowers in Source Data
150 Held-Out Borrowers Scored Live
95% Confidence Interval on Every Metric
0 Backend Servers (Fully Client-Side)

Business Value

Trust through evidence: proves accuracy on the lender's own labelled history, not a vendor's black box
Less manual review: confident cases clear automatically so officers spend time only where it matters
A tunable safety dial: the lender chooses where to sit on the precision-versus-recall tradeoff
Audit-ready: plain-language reasons and a measured scorecard a regulator could read
Closes the evaluation gap: demonstrates that an AI's output can be measured, not just produced

🔁 Potential Use Cases

The same design framework is not specific to lending: classify each record, score its confidence, auto-clear the confident cases, escalate the uncertain ones, and prove accuracy against a known answer key. It transfers to any problem that meets two conditions:

A per-item categorical decision: each record is sorted into classes (match or no match, default or no default), not a free-form output.
A historical answer key: past records carry the real outcome, so predictions can be scored against ground truth. Without a label there is no evaluation, only a guess.

Two examples that fit, each with the condition that makes the framework work:

🧑‍💼 Job Matching

Classify each candidate-and-role pair as a match or not, with a confidence score. Auto-shortlist the confident matches, send the borderline ones to a recruiter, and score the model against who was actually hired and how they performed.

Condition to reuse the framework: you hold historical hiring outcomes (hired, retained, rated). The catch is selection bias: you only observe outcomes for people who were actually hired, so the answer key is partial. Use it as decision support with a fairness check, never as an automatic reject.

💞 Couple Matching

Classify each pair of people as compatible or not, with a confidence score. Auto-suggest the confident matches, hold the uncertain ones, and score against what actually happened between matched pairs.

Condition to reuse the framework: you define a concrete proxy for success up front, for example a mutual like within seven days, or still together at six months. The label here is softer, sparser, and slower than a loan outcome, so the scorecard is honest only when the proxy is stated plainly.

The rule of thumb: the stronger and more complete the answer key, the more trustworthy the scorecard. Loan default is close to ideal because the outcome is objective and eventually known for every record. Job matching and couple matching work too, as long as the scorecard states clearly where their labels are biased or soft.
