SERVICES · 04 / ML FOR RISK

ML for fraud, risk,
and underwriting.

Your rules-based system is missing fraud you're paying for, or rejecting customers you should approve. A model could do better — but your risk team needs explainability, your auditors need calibration documentation, and your operations need monitoring that catches drift before customers notice. Production ML for regulated decisions is its own discipline.

SCOPE: Fixed
TIMELINE: 6–16 weeks
PRICE: From USD 25K
EXPLAINABILITY: Built in
DECISION · UNDERWRITING MODEL v2.4.1
APPROVED
94% CONFIDENCE
TOP CONTRIBUTING FEATURES
Credit history
Account tenure
Income verification
Address stability
Transaction patterns
MODEL v2.4.1 · TRAINED 14 DAYS AGO · LAST AUDIT: PASS
DRIFT MONITOR · LIVE
No alerts · last 30 days
CALIBRATION · WEEKLY
Within tolerance · weekly check
02 / WHERE MODELS FAIL

Where ML models fail in production fintech.

Five specific failure modes we see across fintech ML engagements. The model might be technically excellent — but production failure modes are about data, drift, calibration, and explainability, not just AUC.

01

Training data leakage from the future.

The training set includes features that wouldn't be available at decision time — a chargeback flag updated three days after the transaction, an account-closure timestamp that postdates the prediction window. The model scores beautifully in backtesting because it's seeing the future. In production, the features aren't there, the model degrades, and no one knows why for a quarter.

INVISIBLE IN BACKTESTING
02

Concept drift no one's watching.

The model was calibrated for fraud patterns from six months ago. Fraud actors evolved. Your customer behavior evolved (new product launch, geographic expansion, marketing shift). The model's accuracy decays month over month, but the dashboard shows only a binary 'is the model running' signal, not 'is the model still accurate.' By the time the loss numbers reveal the drift, the model has been decaying silently for weeks.

SILENT DECAY
03

Miscalibrated probabilities driving wrong thresholds.

The model outputs a fraud probability of 0.85. The team sets the block threshold at 0.80. But the model is poorly calibrated: its 0.85 actually corresponds to a real fraud rate of 0.40. You're blocking three legitimate customers for every two fraudulent ones. The model's AUC looked fine; its calibration was never measured.

THRESHOLD ≠ PROBABILITY
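
For a group of blocked transactions with an observed fraud rate of 0.40, the split between legitimate and fraudulent customers can be checked in a few lines. This is a minimal sketch: the function name `legit_blocked_per_fraud` is hypothetical, and it assumes every blocked transaction sits at the same observed fraud rate.

```python
from fractions import Fraction


def legit_blocked_per_fraud(observed_fraud_rate: Fraction) -> Fraction:
    """Legitimate customers blocked for each fraudulent customer blocked."""
    if not 0 < observed_fraud_rate <= 1:
        raise ValueError("fraud rate must be in (0, 1]")
    return (1 - observed_fraud_rate) / observed_fraud_rate


# What the team believes it is getting at a stated probability of 0.85.
print(legit_blocked_per_fraud(Fraction(85, 100)))  # 3/17

# What it actually gets when the observed fraud rate is 0.40.
print(legit_blocked_per_fraud(Fraction(40, 100)))  # 3/2
```

The gap between 3/17 and 3/2 is the cost of setting a threshold on an uncalibrated score: roughly 0.18 expected false blocks per fraud caught versus 1.5 in practice.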
04

No explainability when adverse-action notices are required.

The model rejects a credit application. ECOA in the US, or the equivalent rules elsewhere (UK FCA, EU), requires you to tell the applicant *why*. Your model outputs a score, not a reason. The compliance team is left writing generic adverse-action language that may not match the actual decision driver, creating regulatory risk on every rejection.

REGULATORY EXPOSURE
05

Fairness gaps that emerge in production.

The model performed equitably across protected classes in the training set. Production traffic is different — a new customer acquisition channel skews demographics, a geographic expansion changes the distribution. The model's behavior on the new population was never measured. By the time disparate impact shows up in compliance review, the model has made decisions on tens of thousands of applications.

UNMEASURED DISPARITY
03 / WHAT WE BUILD

How we build models that survive audit.

Production ML for regulated decisions — with explainability built into every prediction, calibration measured continuously, drift detection on real traffic, and fairness monitoring across the populations that matter.

ML PIPELINE · REGULATORY-GRADE
01
FEATURES
02
TRAINING
03
INFERENCE
04
EXPLAIN
05
MONITOR
DRIFT DETECTION · CALIBRATION · FAIRNESS · AUDIT TRAILS · VERSION CONTROL

Feature engineering with leak-aware validation.

Every feature is validated against a strict 'available at decision time' check. We build feature stores with explicit timestamping, so the model can't accidentally see future state during training. Backtests run against historical feature snapshots — the way the world actually looked when the prediction would have been made.

NO FUTURE IN TRAINING DATA
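
One way to picture the 'available at decision time' check is a point-in-time lookup against timestamped feature history. The sketch below is illustrative only: `value_as_of`, the feature names, and the dates are hypothetical, and a real feature store would handle this lookup for you.

```python
from bisect import bisect_right
from datetime import datetime


def value_as_of(history, decision_time):
    """Return the latest value recorded at or before decision_time.

    history is a list of (timestamp, value) pairs sorted by timestamp.
    Returns None when nothing had been recorded yet at decision time.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, decision_time)
    return history[idx - 1][1] if idx else None


decision_time = datetime(2024, 3, 1, 12, 0)

# Hypothetical histories: the chargeback flag only appears three days later.
chargeback_history = [(datetime(2024, 3, 4, 9, 0), True)]
tenure_history = [
    (datetime(2024, 1, 1), "under_1_year"),
    (datetime(2024, 6, 1), "over_1_year"),
]

print(value_as_of(chargeback_history, decision_time))  # None
print(value_as_of(tenure_history, decision_time))      # under_1_year
```

Training rows built this way see the chargeback flag as missing, which is how the model will see it in production.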

Calibration measured continuously.

Every model is calibrated, and that calibration is monitored on real production traffic, not just at training time. When predictions stated at 0.85 stop corresponding to an observed rate of 0.85, an alert fires before threshold-based decisions start drifting. Reliability diagrams are generated automatically every week.

STATED PROB = REAL RATE
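
The numbers behind a reliability diagram are simple to compute. The following is a minimal sketch, assuming equal-width probability bins and a hypothetical alert tolerance of 0.05; the function names and the sample data are illustrative, not a description of any specific monitoring stack.

```python
def reliability_table(probs, outcomes, n_bins=10):
    """Bucket predictions into equal-width bins.

    Returns (mean_predicted, observed_rate, count) for each non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    rows = []
    for bucket in bins:
        if bucket:
            n = len(bucket)
            mean_pred = sum(p for p, _ in bucket) / n
            observed = sum(y for _, y in bucket) / n
            rows.append((mean_pred, observed, n))
    return rows


def expected_calibration_error(rows):
    """Count-weighted average gap between stated and observed rates."""
    total = sum(n for _, _, n in rows)
    return sum(n * abs(pred - obs) for pred, obs, n in rows) / total


# Illustrative data: scores of 0.85 that turn out to be fraud 40% of the time.
probs = [0.85] * 10 + [0.15] * 10
outcomes = [1] * 4 + [0] * 6 + [1] * 1 + [0] * 9

rows = reliability_table(probs, outcomes)
ece = expected_calibration_error(rows)
TOLERANCE = 0.05  # hypothetical alert threshold
print(round(ece, 3), "ALERT" if ece > TOLERANCE else "ok")  # 0.25 ALERT
```

Each row is one point on the reliability diagram: stated probability on one axis, observed rate on the other.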

Explainability for every prediction.

Every model output comes with feature attributions: SHAP values, LIME, or simpler interpretable models, depending on the use case. Reason codes for adverse-action notices are drawn directly from the top contributing features. Your compliance team has the audit trail they need, in the format regulators ask for, on every decision.

ADVERSE-ACTION READY
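
As a rough illustration of how attributions become reason codes, the sketch below takes per-feature attributions that have already been computed (for example, SHAP values) and picks the features that pushed hardest toward rejection. The sign convention (negative means toward rejection), the feature names, and the reason wording are all assumptions made for the example.

```python
# Hypothetical mapping from feature name to notice wording.
REASON_TEXT = {
    "credit_history": "Insufficient length of credit history",
    "income_verification": "Income could not be verified",
    "address_stability": "Insufficient time at current address",
    "account_tenure": "Insufficient account tenure",
}


def adverse_action_reasons(attributions, k=2):
    """Return wording for the k features that pushed most toward rejection.

    attributions maps feature name to a signed contribution; this sketch
    assumes negative values push the score toward rejection.
    """
    adverse = [(name, value) for name, value in attributions.items() if value < 0]
    adverse.sort(key=lambda item: item[1])  # most negative first
    return [REASON_TEXT.get(name, name) for name, _ in adverse[:k]]


attributions = {
    "credit_history": -0.31,
    "account_tenure": 0.05,
    "income_verification": -0.12,
    "address_stability": -0.02,
}
print(adverse_action_reasons(attributions))
```

Because the reasons come from the same attributions that explain the score, the notice reflects the actual decision driver rather than generic language.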

Drift detection on real traffic.

Three things monitored continuously: data drift (input distributions changing), concept drift (relationship between inputs and outcomes changing), and prediction drift (output distributions changing). Alerts fire when any of the three crosses thresholds — long before the loss numbers tell you the model is decaying.

DECAY CAUGHT IN WEEKS · NOT QUARTERS
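
A common way to quantify data drift or prediction drift is the population stability index (PSI), which compares a baseline distribution with recent traffic over the same bins. This is a minimal sketch: the function name `psi` is hypothetical, and the 0.25 alert level is a widely used rule of thumb adopted here as an assumption, not a fixed standard.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population stability index between two binned distributions.

    Both inputs are counts over the same bins. Shares are floored at eps
    so that empty bins do not cause a division by zero or log(0).
    """
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_share = max(e / e_total, eps)
        a_share = max(a / a_total, eps)
        score += (a_share - e_share) * math.log(a_share / e_share)
    return score


baseline = [50, 50]   # e.g. training-time share of two transaction types
this_week = [80, 20]  # the same bins on recent production traffic

ALERT_LEVEL = 0.25  # rule-of-thumb threshold, treated as configurable
score = psi(baseline, this_week)
print(round(score, 3), "ALERT" if score > ALERT_LEVEL else "ok")
```

Applied to input features, this tracks data drift; applied to binned model scores, it tracks prediction drift. Concept drift needs labeled outcomes, so it is usually measured separately as outcomes arrive.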

Fairness monitoring across protected classes.

Disparate impact analysis built into the monitoring stack. The model's behavior is tracked across protected classes, geographic regions, customer segments — whatever populations matter for your regulatory environment. Disparities surface as alerts, not as compliance-review surprises six months later.

DISPARITY → ALERT · NOT SURPRISE
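
A disparate impact check can be as small as a ratio of approval rates. The sketch below is illustrative: the group labels and counts are made up, and the 0.8 cutoff is the common four-fifths rule of thumb, used here as an assumption. The appropriate threshold depends on your regulatory environment.

```python
def approval_rate(decisions):
    """Share of approvals in a list of 0/1 decisions."""
    return sum(decisions) / len(decisions)


def disparate_impact_ratios(decisions_by_group, reference_group):
    """Approval rate of each group divided by the reference group's rate."""
    reference_rate = approval_rate(decisions_by_group[reference_group])
    return {
        group: approval_rate(decisions) / reference_rate
        for group, decisions in decisions_by_group.items()
        if group != reference_group
    }


# Illustrative data: 1 = approved, 0 = rejected.
decisions_by_group = {
    "group_a": [1] * 80 + [0] * 20,
    "group_b": [1] * 56 + [0] * 44,
}

THRESHOLD = 0.8  # four-fifths rule of thumb, treated as configurable
ratios = disparate_impact_ratios(decisions_by_group, reference_group="group_a")
alerts = [group for group, ratio in ratios.items() if ratio < THRESHOLD]
print({group: round(ratio, 2) for group, ratio in ratios.items()}, alerts)
```

Running the same check on live decisions, segmented by acquisition channel or region, is what turns a compliance-review finding into an alert.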
04 / DELIVERABLES

What's in the box.

Concrete deliverables of an ML engagement. Everything ships to your repo, in your stack, under your control — including the audit documentation your regulators expect.

Production model

Deployed and serving predictions in your infrastructure, with versioning and rollback in place.

Training and inference pipelines

Reproducible, versioned, with leak-aware validation built into CI. Your team can retrain and redeploy without our involvement.

Explainability infrastructure

Per-prediction feature attributions in a format your compliance team can use to write adverse-action notices.

Calibration monitoring

Continuous calibration tracking with weekly reliability diagrams and alert thresholds.

Drift detection

Data, concept, and prediction drift monitors with configurable alert thresholds, integrated with your observability stack.

Fairness monitoring dashboard

Disparate impact tracking across protected classes and other populations relevant to your regulatory environment.

Model documentation for audit

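Model card, training data documentation, validation results, fairness analysis, and calibration evidence, in a format your compliance or legal team can use for regulatory submissions.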

Handoff training and 30-day support

Two sessions covering retraining, monitoring, troubleshooting, plus direct Slack access for 30 days post-launch.

05 / ENGAGEMENT

How an ML engagement actually runs.

Six to sixteen weeks, broken into four phases. Predictable rhythm, transparent progress, regulatory-grade documentation produced as we go — not bolted on at the end.

01

Data audit and feature scoping.

Weeks 1–3. We audit your existing data — what's available, what's reliable, what has leakage risk. We define the prediction problem precisely (what's being predicted, when the decision happens, what features are available at that moment). We scope the model approach, the explainability strategy, and the monitoring requirements. You get a written scope document and a written data assessment by end of phase.

WEEKS 1–3 · SCOPE + DATA AUDIT
02

Model development with rigorous validation.

Weeks 3–10 typically. Model development with leak-aware backtesting, calibration analysis, and fairness evaluation built in from the start. Multiple model approaches compared on the metrics that matter for your use case — not just AUC. Weekly demos, code in your repo, your data team reviewing PRs.

WEEKS 3–10 · VALIDATED MODELS
03

Production deployment with monitoring from day one.

Weeks 8–14 typically. Model deployed to your infrastructure with shadow mode first (running predictions but not driving decisions), then phased rollout. Drift detection, calibration monitoring, and fairness tracking live before the first real decision is made. Compliance review walkthrough with your team if useful.

WEEKS 8–14 · MONITORED FROM DAY ONE
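
Shadow mode can be sketched as a wrapper that always returns the live system's decision while recording what the new model would have done. Everything named here (`decide`, the two stand-in models, the log format) is hypothetical and simplified for illustration.

```python
def decide(application, live_model, shadow_model, shadow_log):
    """Return the live decision; record the shadow model's output on the side."""
    live_decision = live_model(application)
    entry = {"id": application["id"], "live": live_decision}
    try:
        entry["shadow"] = shadow_model(application)
    except Exception as exc:  # a shadow failure must never affect the customer
        entry["shadow_error"] = repr(exc)
    shadow_log.append(entry)
    return live_decision


# Stand-ins for the existing rules engine and the candidate model.
def rules_engine(app):
    return "approve" if app["score"] >= 600 else "reject"


def candidate_model(app):
    return "approve" if app["score"] >= 640 else "reject"


log = []
result = decide({"id": "a-1", "score": 620}, rules_engine, candidate_model, log)
print(result, log)
```

Disagreements collected in the log are the raw material for the phased rollout decision: they show where the new model would change outcomes before it is allowed to.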
04

Handoff and 30-day support.

Final week of engagement plus 30 days after. Documentation finalized including the model card and regulatory-submission docs. Two handoff training sessions with your team. Direct Slack access for 30 days post-launch. After day 30, optional retainer available — many ML engagements include a quarterly retainer for retraining and audit support.

WEEK N + 30 DAYS · OPTIONAL RETAINER
06 / QUESTIONS

Questions worth answering before the call.

Things buyers commonly ask about ML engagements for regulated decisions. If your question isn't here, the call is the easiest way to get an answer.

Whatever your team uses or can support. We've shipped production ML on scikit-learn, XGBoost, LightGBM, PyTorch, TensorFlow, and SageMaker. Inference layers in Python, Go, Rust, or via managed services. Feature stores in Feast, Tecton, or custom. We don't bring a proprietary ML platform — we work in the stack your team can maintain after handoff.

Yes, and we often start there. Many engagements begin with an audit of an existing model — checking for leakage, calibration issues, drift handling, fairness gaps, and explainability coverage. Sometimes the existing model is fundamentally sound and just needs better monitoring infrastructure around it. Sometimes the underlying approach needs to change. We'll tell you honestly which it is, with a written assessment, within the first three weeks.

Depends on the model type and your regulatory environment. For tree-based models (XGBoost, LightGBM, random forests), SHAP values give per-prediction feature attributions that translate cleanly into reason codes. For more complex models, we often use surrogate interpretable models — a simpler, fully explainable model trained to approximate the complex model's decisions. Your compliance team gets reason codes in the format the regulator expects, on every decision, automatically.

Three layers. First, training-time analysis: disparate impact, equal opportunity, and demographic parity measured across protected classes before the model goes live. Second, production monitoring: the same metrics tracked on live decisions, segmented by population. Third, configurable alerts: when disparity crosses thresholds, your team gets notified before the issue compounds. The specific fairness metrics we prioritize depend on your regulatory environment: US ECOA looks different from the EU AI Act, which looks different from UK FCA expectations.
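
To make one of these metrics concrete, equal opportunity compares approval rates among applicants who turned out to be creditworthy. The sketch below assumes each record is an `(approved, repaid)` pair and that the repayment outcome is known; the group labels, data, and function names are hypothetical.

```python
def true_positive_rate(records):
    """Approval rate among applicants who went on to repay.

    records is a list of (approved, repaid) pairs of 0/1 values.
    """
    approvals_among_repaid = [approved for approved, repaid in records if repaid]
    if not approvals_among_repaid:
        raise ValueError("no repaid applicants in this group")
    return sum(approvals_among_repaid) / len(approvals_among_repaid)


def equal_opportunity_gap(records_by_group):
    """Largest difference in true positive rate between any two groups."""
    rates = {g: true_positive_rate(r) for g, r in records_by_group.items()}
    return max(rates.values()) - min(rates.values()), rates


# Illustrative data: ten applicants per group, all of whom repaid.
records_by_group = {
    "group_a": [(1, 1)] * 9 + [(0, 1)] * 1,
    "group_b": [(1, 1)] * 6 + [(0, 1)] * 4,
}

gap, rates = equal_opportunity_gap(records_by_group)
print(round(gap, 2), rates)
```

In lending, outcomes are only observed for approved applicants, so a production version of this metric needs a stated approach to rejected applications. That choice is part of the fairness analysis rather than something this sketch resolves.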

We produce the technical documentation — model card, training data documentation, validation results, fairness analysis, calibration evidence — in a format your compliance or legal team can use for regulatory submissions. We don't file submissions on your behalf (that's a regulated activity in most jurisdictions), but the artifacts your team needs to submit are part of the engagement deliverables.

READY TO TALK?

Have an ML project in mind?

First call is 30 minutes. You describe the decision you're trying to model and what the regulatory environment looks like. We ask technical questions about your data, your existing approach, and your monitoring posture. By the end of the call, we'll both know whether this is something we should build together.

RESPONSE: Within 1 business day
FORMAT: 30 minutes · No deck
FIT: Figured out together
OUTCOME: Yes, no, or a referral