AI infrastructure cleanup: cost & reliability overhaul

The situation

The client inherited an AI feature shipped by a previous developer. It demoed well, but in production it was burning through the monthly model budget, responding slowly, and — worst of all — failing quietly. When retrieval missed, the system guessed instead of saying it didn't know, and nobody could tell the difference until a customer complained.

What we found

The audit surfaced three root causes. First, every request re-embedded the full document corpus instead of querying a persisted index. Second, there were no guardrails between retrieval and generation, so empty retrievals still produced confident answers. Third, there was no logging at the stage boundaries, which is why failures stayed invisible.

What we did

We re-architected the retrieval layer around a persisted vector store, added a grounding step that refuses to answer when retrieval confidence is low, and instrumented every stage so cost and failure modes became observable. We delivered the diagnosis as a developer-ready roadmap first, then executed the refactor.
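
To make the grounding step concrete, here is a minimal sketch of the idea in Python. It is not the client's code: the names `Hit`, `answer`, `search`, `generate`, and the `MIN_SCORE` threshold are hypothetical, and it assumes the vector store returns a similarity score between 0 and 1 where higher means a closer match.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("rag.pipeline")

MIN_SCORE = 0.75  # hypothetical threshold; would be tuned against real queries
REFUSAL = "I don't have that."


@dataclass
class Hit:
    text: str
    score: float  # assumed similarity in [0, 1], higher is better


def answer(question, search, generate, min_score=MIN_SCORE):
    """Answer only when retrieval found something trustworthy.

    search(question) -> list[Hit]        (stand-in for a persisted vector store query)
    generate(question, context) -> str   (stand-in for the model call)
    """
    hits = search(question)
    grounded = [h for h in hits if h.score >= min_score]

    # Log at the retrieval/generation boundary so misses are visible.
    logger.info("retrieval hits=%d grounded=%d", len(hits), len(grounded))

    if not grounded:
        # Refuse instead of letting the model guess from empty context.
        logger.warning("refusing: no passage met min_score=%.2f", min_score)
        return REFUSAL

    context = "\n\n".join(h.text for h in grounded)
    return generate(question, context)
```

The point of the design is that the model is never called with empty or weak context, and the warning log gives alerting something to hook into.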

The outcome

Token spend dropped by 63%, median latency fell from over four seconds to under one second, and silent failures went to zero. In their place are honest "I don't have that" responses and alerts the team can act on.
