SERVICES · 02 / AGENTS & MCP

AI agents and MCP for finance ops.

Reconciliation runs, exception handling, KYC ops, transaction monitoring — the multi-step workflows that eat your operations team's hours and don't scale with revenue. Off-the-shelf agents hallucinate on numbers, skip steps your auditor cares about, and break when the upstream API changes.

SCOPE
Fixed
TIMELINE
6–12 weeks
PRICE
From USD 25K
INTEGRATIONS
Your stack
MCP
AGENT · RUN #847291
ERP
BANK API
AUDIT
LATEST RUN · 2 MIN AGO
247 reconciliations · 0 errors
AUDIT TRAIL · LIVE
Every action logged
02 / WHERE AGENTS FAIL

Where off-the-shelf agents fall apart on finance ops.

Five specific failure modes we see across fintech agent engagements. The patterns repeat — once you've shipped through them, you recognize them on the first call.

01

Hallucinated numbers in arithmetic-sensitive workflows.

The agent reconciles two ledger entries. The amounts are off by $0.01 due to rounding. The LLM 'helpfully' rounds the difference to zero and reports a match. The downstream system books a reconciliation that's technically wrong, and you discover it in next quarter's audit. Agents need deterministic calculation paths for anything involving money.

~35% OF FAILURES
02

Skipped steps that the agent doesn't flag.

The agent's plan calls for six steps. Step 4 fails silently — an upstream API returns a 500, or a webhook never fires. The agent moves to step 5 with stale data and produces a confident-looking output. No alert, no retry, no human-in-the-loop. The error surfaces three weeks later in a customer complaint.

INVISIBLE STEP FAILURES
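
One way to rule out silent step failures is a plan runner that refuses to continue once a step has failed. The sketch below is a minimal illustration in Python, not a description of any particular runtime. `run_plan`, `StepFailed`, and the `max_attempts` retry count are hypothetical names chosen for the example, and each step is assumed to be a plain function that takes and returns a state dict.

```python
from typing import Callable, Dict, List, Tuple


class StepFailed(RuntimeError):
    """Raised when a plan step cannot complete; the run stops here."""


def run_plan(steps: List[Tuple[str, Callable[[Dict], Dict]]], max_attempts: int = 2) -> Dict:
    """Run steps in order. A step that keeps failing halts the whole run."""
    state: Dict = {}
    for name, step in steps:
        last_error = None
        for _ in range(max_attempts):
            try:
                state = step(state)
                last_error = None
                break
            except Exception as exc:  # any failure counts, e.g. an upstream 500
                last_error = exc
        if last_error is not None:
            # No silent fall-through: later steps never run on stale state.
            raise StepFailed(f"step {name!r} failed after {max_attempts} attempts") from last_error
    return state
```

A caller would catch `StepFailed` at the top of the run, raise an alert, and hand the case to a reviewer instead of letting step 5 proceed.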
03

Tool calls without audit context.

The agent makes a tool call to update a customer's KYC status. The call succeeds. There's no log of why the agent made that decision — what input data, what intermediate reasoning, what authorization. When the compliance reviewer asks 'why did this account get flagged?', no one can answer in fewer than four hours of forensic work.

SOC 2 NIGHTMARE
04

Brittleness when upstream APIs change.

The agent works because it has the schema of your ERP API memorized via prompt engineering. The ERP vendor adds a field. The agent's tool calls now fail or — worse — succeed with corrupted data. Without typed contracts between the agent and the tools it uses, every upstream change is a P1 incident.

EVERY API UPDATE BREAKS IT
05

No way to verify agent behavior before deploy.

Agents are nondeterministic by nature. Same input, slightly different output across runs. Without a deterministic eval framework — replaying real production traces with assertion checks — you discover behavior changes after they affect customers. 'It worked in dev' becomes 'we shipped a regression on Friday afternoon.'

UNTRACKED BEHAVIOR DRIFT
03 / WHAT WE BUILD

How we build agents that survive production.

Production agents with deterministic guardrails, typed tool contracts, full audit trails, and an eval framework that catches behavior drift before it ships.

AGENT RUNTIME · WITH GUARDRAILS
01
INTENT
02
PLAN
03
TOOLS via MCP
04
VERIFY
05
AUDIT
EVALS · OBSERVABILITY · AUDIT LOGS · HUMAN-IN-THE-LOOP

Typed tool contracts via MCP.

Every tool the agent can call is defined with a strict input/output contract via Model Context Protocol. When an upstream API changes, the contract surface lights up — you find out in CI, not in production. MCP gives the agent typed access to your ERP, payment processor, ledger, KYC provider, or any system that exposes an API.

TYPED INTEGRATIONS · TESTABLE
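
To make "strict input/output contract" concrete, here is a minimal sketch in plain Python. It deliberately does not use an MCP SDK; it only illustrates the kind of check a typed contract implies. The tool, the `LedgerEntry` shape, and the field names are assumptions invented for the example.

```python
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation


class ContractError(ValueError):
    """Raised when a payload does not match the declared tool contract."""


@dataclass(frozen=True)
class LedgerEntry:
    # Hypothetical output contract for a ledger lookup tool.
    entry_id: str
    amount: Decimal
    currency: str


# Field name -> type expected in the raw upstream response.
REQUIRED_FIELDS = {"entry_id": str, "amount": str, "currency": str}


def parse_ledger_entry(payload: dict) -> LedgerEntry:
    """Validate a raw upstream response against the contract, or fail loudly."""
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in payload:
            raise ContractError(f"missing field: {name}")
        if not isinstance(payload[name], expected_type):
            raise ContractError(f"wrong type for field: {name}")
    unexpected = set(payload) - set(REQUIRED_FIELDS)
    if unexpected:
        # Strict mode: an upstream schema change is reported here.
        raise ContractError(f"unexpected fields: {sorted(unexpected)}")
    try:
        amount = Decimal(payload["amount"])
    except InvalidOperation as exc:
        raise ContractError("amount is not a decimal string") from exc
    return LedgerEntry(payload["entry_id"], amount, payload["currency"])
```

Running a check like this against recorded upstream responses in CI is what lets a new or renamed field show up as a failed test. Whether extra fields should be rejected or tolerated is a per-tool design choice; the sketch takes the strict option.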

Deterministic paths for arithmetic and rules.

LLMs are unreliable at exact arithmetic. So we don't ask them to do it. Arithmetic, rule evaluation, threshold checks, and any other deterministic logic run in code the agent calls, not in the model. The model decides what to do; the code does it correctly. Your reconciliations are exact, your fraud rules are auditable.

LLMS PLAN · CODE EXECUTES
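
As an illustration of a deterministic calculation path, the sketch below compares two amounts with Python's `decimal` module, so a one-cent difference stays a one-cent difference. The function name `reconcile` and the `tolerance` parameter are assumptions for the example, and amounts are assumed to arrive as decimal strings.

```python
from decimal import Decimal


def reconcile(ledger_amount: str, bank_amount: str, tolerance: str = "0.00") -> dict:
    """Compare two amounts exactly; a difference is reported, never rounded away."""
    ledger = Decimal(ledger_amount)
    bank = Decimal(bank_amount)
    difference = ledger - bank
    matched = abs(difference) <= Decimal(tolerance)
    return {"matched": matched, "difference": str(difference)}


print(reconcile("1042.17", "1042.16"))
# {'matched': False, 'difference': '0.01'}
```

The model's job in this arrangement is to decide that a reconciliation is needed and to call the function. The comparison itself never passes through the model.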

Step-level audit trails.

Every step of every agent run is logged: input state, decision made, tool called, output received, time elapsed. When compliance asks 'why was this account flagged on the 14th?', the answer is one query away — and includes the model's reasoning at each step, not just the final action.

EVERY ACTION · EXPLAINABLE
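
A step-level audit record can be as simple as one row per step. The sketch below uses Python's built-in `sqlite3` module purely for illustration; the table name, the columns, and the helper functions are hypothetical, and a real log schema would be designed around your auditor's query patterns and retention policy.

```python
import json
import sqlite3
from datetime import datetime, timezone


def open_audit_log(path: str = ":memory:") -> sqlite3.Connection:
    """Create the (hypothetical) audit table if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS audit_steps ("
        "run_id TEXT, step INTEGER, logged_at TEXT, account_id TEXT, "
        "input_state TEXT, decision TEXT, tool TEXT, output TEXT, elapsed_ms INTEGER)"
    )
    return conn


def log_step(conn, run_id, step, account_id, input_state, decision, tool, output, elapsed_ms):
    """Append one step record; input state and output are stored as JSON text."""
    conn.execute(
        "INSERT INTO audit_steps VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (run_id, step, datetime.now(timezone.utc).isoformat(), account_id,
         json.dumps(input_state, sort_keys=True), decision, tool,
         json.dumps(output, sort_keys=True), elapsed_ms),
    )
    conn.commit()


def steps_for_account(conn, account_id):
    """The compliance question, expressed as a single query."""
    return conn.execute(
        "SELECT run_id, step, decision, tool FROM audit_steps "
        "WHERE account_id = ? ORDER BY run_id, step",
        (account_id,),
    ).fetchall()
```

With records shaped like this, "why was this account flagged?" becomes a call to `steps_for_account` with the account's ID, returning each decision and the tool it led to.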

Human-in-the-loop for high-stakes actions.

Some actions need a human to approve before execution — moving money over a threshold, flagging an account as fraudulent, escalating to a regulator. We wire approval workflows into the agent's plan, with clear UIs for the reviewer and SLA tracking. Not every step needs human review — only the ones that matter.

APPROVAL WHERE STAKES JUSTIFY IT
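
The routing decision itself is a good candidate for plain code. The sketch below shows one possible shape, assuming a hypothetical `ProposedAction` type, an illustrative transfer threshold, and made-up action names; the real thresholds and action kinds would come out of scoping.

```python
from dataclasses import dataclass
from decimal import Decimal
from enum import Enum


class Route(Enum):
    AUTO_EXECUTE = "auto_execute"
    NEEDS_APPROVAL = "needs_approval"


@dataclass(frozen=True)
class ProposedAction:
    kind: str                      # e.g. "transfer" or "flag_account_fraud"
    amount: Decimal = Decimal("0")


# Illustrative policy values, not recommendations.
ALWAYS_REVIEW = {"flag_account_fraud", "escalate_to_regulator"}
TRANSFER_THRESHOLD = Decimal("10000")


def route_action(action: ProposedAction) -> Route:
    """Decide in code, not in the model, whether a human must approve."""
    if action.kind in ALWAYS_REVIEW:
        return Route.NEEDS_APPROVAL
    if action.kind == "transfer" and action.amount > TRANSFER_THRESHOLD:
        return Route.NEEDS_APPROVAL
    return Route.AUTO_EXECUTE
```

Keeping the policy in a small function like this means the approval rules can be unit-tested and reviewed like any other business rule.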

Eval framework that catches behavior drift.

Real production traces, replayed through new agent versions with assertion checks at each step. Did the new model make the same decisions? Did the tool calls land in the same order? Did the final output match expectations? Behavior drift gets caught in CI before it ships — not after a customer complains.

REPLAY · ASSERT · BLOCK ON REGRESSION
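
The replay idea can be sketched in a few lines. This is a simplified illustration, assuming the agent under test can be called as a plain function and that each recorded trace stores its input, the expected tool calls, and the expected output. The trace format and the stand-in `candidate_agent` are hypothetical.

```python
from typing import Callable, List


def replay(traces: List[dict], agent: Callable[[dict], dict]) -> List[str]:
    """Re-run recorded inputs through an agent and collect assertion failures."""
    failures = []
    for index, trace in enumerate(traces):
        result = agent(trace["input"])
        if result["tool_calls"] != trace["expected_tool_calls"]:
            failures.append(f"trace {index}: tool calls changed")
        if result["output"] != trace["expected_output"]:
            failures.append(f"trace {index}: final output changed")
    return failures


# Hypothetical recorded trace and a stand-in agent, for illustration only.
TRACES = [{
    "input": {"ledger": "100.00", "bank": "100.00"},
    "expected_tool_calls": ["get_ledger_entry", "get_bank_entry", "reconcile"],
    "expected_output": "match",
}]


def candidate_agent(task: dict) -> dict:
    output = "match" if task["ledger"] == task["bank"] else "mismatch"
    return {"tool_calls": ["get_ledger_entry", "get_bank_entry", "reconcile"], "output": output}


failures = replay(TRACES, candidate_agent)
# In CI, a non-empty list would block the deploy.
assert not failures, failures
```

The assertions here check tool-call order and final output. A fuller harness would add per-step checks.
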
04 / DELIVERABLES

What's in the box.

Concrete deliverables of an agent engagement. Everything ships to your repo, in your stack, under your control.

Production agent runtime
Deployed and running in your stack, integrated with your existing services via MCP.
Typed MCP integrations
Tool contracts for every system the agent touches (ERP, ledger, payment processor, KYC, custom APIs).
Audit log infrastructure
Step-level logging in a format your SOC 2 auditor can query, retained per your compliance policy.
Human-in-the-loop UIs
Reviewer interfaces for high-stakes actions, wired into your existing tools (Slack, email, custom admin).
Eval framework with replay suite
Production trace replay, assertion harness, CI integration. Behavior regressions caught before deploy.
Observability dashboards
Run latency, step success rates, tool call failures, model token costs. Integrated with your existing observability stack.
Handoff training
Two sessions with your team to walk through the runtime, the evals, and how to add new tools or agents.
30-day post-launch support
Direct Slack access for the first month after handoff, with response within one business day.
05 / ENGAGEMENT

How an agent engagement actually runs.

Six to twelve weeks, broken into four phases. Predictable rhythm, transparent progress, code in your repo from week two.

01

Workflow audit and scoping.

Weeks 1–2. We map the workflows you want to automate — reconciliation runs, exception handling, KYC ops, whatever it is. We identify which steps are deterministic (rules, arithmetic, lookups) and which need agent reasoning. We scope the MCP integrations needed against your existing systems. You get a written scope document at the end of week 2.

WEEKS 1–2 · WRITTEN SCOPE
02

Build the agent, wire the MCP integrations.

Weeks 2–7 typically. Agent runtime shipped to your stack. Typed MCP contracts for every tool the agent calls. Step-level audit logging from day one. Weekly demos with real runs against staging data. Code in your repo from week 2, reviewed by your team in PRs.

WEEKS 2–7 · CODE IN YOUR REPO
03

Eval framework and human-in-the-loop.

Weeks 6–10 typically. Eval framework built on trace replay (synthetic traces at first, then real production traces). Human-in-the-loop UIs for high-stakes actions, wired to your existing approval tools. Observability dashboards. Compliance review walkthrough if useful.

WEEKS 6–10 · COMPLIANCE-READY
04

Handoff and 30-day support.

Final week of engagement plus 30 days after. Documentation finalized, two handoff training sessions with your team, direct Slack access for 30 days post-launch. After day 30, optional retainer available. No lock-in, no platform fees, no surprise renewals.

WEEK N + 30 DAYS
06 / QUESTIONS

Questions worth answering before the call.

Things buyers commonly ask about agent engagements. If your question isn't here, the call is the easiest way to get an answer.

What's MCP and do we need to adopt it as a standard?

MCP (Model Context Protocol) is an open protocol from Anthropic for connecting language models to tools. We use it because it's typed, testable, and provider-agnostic — but you don't need to adopt MCP as a company standard to use it inside an agent. The agent uses MCP internally to call your existing APIs; your existing APIs don't need to change. If you later want to expose your services as MCP servers for other consumers, the work is straightforward.

Can the agent run on our existing infrastructure?

Yes. We build the agent runtime to run wherever your other services run — Kubernetes, ECS, Lambda, bare VMs, whatever. The agent calls models via the provider of your choice (OpenAI, Anthropic, Bedrock, Vertex, self-hosted). No proprietary BlueSoft infrastructure required, no platform fees after we leave.

How do you handle nondeterministic agent behavior?

Three things. First, we route deterministic logic (arithmetic, rule checks, threshold evaluation) to code, not to the model. Second, we use a deterministic eval framework (production trace replay with assertion checks) to catch behavior drift in CI before it ships. Third, we add human-in-the-loop approval for actions where the cost of being wrong justifies it. The combination gives you predictable agent behavior on the actions that matter.

What if our auditor asks why the agent did something six months ago?

They can query the audit log directly. Every agent run produces a step-by-step record: input state at each step, decision made, tool calls executed, outputs received, model reasoning at each branch point. We design the log schema with audit query patterns in mind — your auditor's question becomes a database query, not a four-hour forensic investigation.

Can you work with our existing observability stack?

Yes. The agent emits metrics, logs, and traces in whatever format your stack expects — Datadog, Grafana, New Relic, Honeycomb, custom Prometheus/Loki, whatever you use. The dashboards we build live in your existing observability tool, not in a separate BlueSoft portal.

READY TO TALK?

Have an agent project in mind?

First call is 30 minutes. You describe the workflows you're trying to automate and what's in the way. We ask technical questions about your existing systems, your compliance constraints, and your eval requirements. By the end of the call, we'll both know whether this is something we should build together.

RESPONSE
Within 1 business day
FORMAT
30 minutes · No deck
FIT
Figured out together
OUTCOME
Yes, no, or a referral