B2B SaaS, US

AI infrastructure cleanup: cost & reliability overhaul

−63%
monthly token spend
4.1s → 0.9s
median response time
0
silent failures post-launch

The situation

The client inherited an AI feature shipped by a previous developer. It demoed well, but in production it was burning through the monthly model budget, responding slowly, and — worst of all — failing quietly. When retrieval missed, the system guessed instead of saying it didn't know, and nobody could tell the difference until a customer complained.

What we found

The audit surfaced three root causes. Every request re-embedded the full document corpus instead of querying a persisted index. There were no guardrails between retrieval and generation, so empty retrievals still produced confident answers. And there was no logging at the boundaries, which is why failures stayed invisible.

What we did

We re-architected the retrieval layer around a persisted vector store, added a grounding step that refuses to answer when retrieval confidence is low, and instrumented every stage so cost and failure modes became observable. We delivered the diagnosis as a developer-ready roadmap first, then executed the refactor.

The outcome

Token spend dropped by 63%, median latency went from over four seconds to under one, and silent failures went to zero — replaced by honest "I don't have that" responses and alerts the team can act on.

Building something ambitious, or fixing something that's gone sideways?

Tell us where you are and where you're trying to get to. We'll tell you honestly whether — and how — we can help.