Building Agents Remains Challenging: Lessons on SDKs, Caching, Reinforcement, and Design
Building agents remains a complex and challenging endeavor, despite the rapid advancements in AI. After recent hands-on experience, several key insights have emerged that highlight the ongoing difficulties in agent design and execution. One of the most significant realizations is that SDK abstractions, while helpful in theory, often fall short when it comes to real-world tool use. The choice of SDK, whether it's OpenAI's, Anthropic's, or a higher-level option like the Vercel AI SDK, can have a major impact. The Vercel AI SDK, for instance, provides a clean interface for basic interactions, but it struggles with provider-specific tools. For example, Anthropic's web search tool frequently disrupts message history when used with the Vercel SDK, and error messages are often unclear. In contrast, working directly with Anthropic's SDK offers better cache control and more predictable behavior. As a result, the team now believes that building the agent loop from scratch, without relying on high-level abstractions, provides greater control and reliability, especially as model differences become more pronounced.

Caching is another area where platform choices matter deeply. Anthropic requires explicit cache management, which initially seemed cumbersome. However, this approach offers far greater predictability in cost and performance. With manual control, it's possible to split conversations, perform context editing, and maintain cache efficiency across iterations. The team uses three cache points: after the system prompt, at the start of the conversation, and dynamically adjusted as the conversation progresses. This setup allows for better control over context and helps avoid unnecessary token waste.

Reinforcement within the agent loop has proven more valuable than expected. Each tool call presents an opportunity to inject feedback: reminding the agent of goals, summarizing progress, or suggesting alternative approaches after a failure.
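The hand-rolled loop described above can be sketched in a few lines. This is a minimal illustration, not the team's actual implementation: `call_model`, the message shapes, and the tool table are all stand-ins for whatever provider SDK sits underneath.

```python
# Minimal hand-rolled agent loop: no framework, just a model call and a
# tool dispatch table. `call_model` and the message shapes are assumptions
# standing in for a real provider SDK.

def agent_loop(call_model, tools, messages, max_steps=20):
    """Run the loop until the model stops requesting tools."""
    for _ in range(max_steps):
        reply = call_model(messages)           # provider call, abstracted away
        messages.append({"role": "assistant", "content": reply})
        if reply.get("tool") is None:          # no tool requested: we're done
            return reply["text"], messages
        handler = tools[reply["tool"]]         # dispatch via our own tool table
        result = handler(**reply.get("args", {}))
        messages.append({"role": "user",       # feed the result back in
                         "content": {"tool_result": result}})
    raise RuntimeError("agent did not converge within max_steps")
```

Owning this loop directly is what makes the cache control, reinforcement, and failure-isolation techniques below possible, since every message appended to the transcript passes through code you control.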
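The three cache points can be sketched as a request builder. This assumes the Anthropic Messages API convention of marking cache breakpoints with `cache_control: {"type": "ephemeral"}` on content blocks; the helper itself and its simplified message shapes are hypothetical.

```python
# Sketch of the three-cache-point layout: one breakpoint after the system
# prompt, one at the start of the conversation, and one that moves forward
# as the conversation grows. Helper names are hypothetical.

CACHE = {"type": "ephemeral"}

def build_request(system_prompt, history):
    """Assemble a messages-API payload with explicit cache breakpoints."""
    # Breakpoint 1: the static system prompt caches once and stays valid.
    system = [{"type": "text", "text": system_prompt, "cache_control": CACHE}]

    messages = []
    for i, msg in enumerate(history):
        content = [{"type": "text", "text": msg["text"]}]
        # Breakpoint 2: the first message pins the conversation prefix.
        # Breakpoint 3: the latest message advances each iteration, so the
        # entire prior transcript is reused from cache on the next turn.
        if i == 0 or i == len(history) - 1:
            content[-1]["cache_control"] = CACHE
        messages.append({"role": msg["role"], "content": content})

    return {"system": system, "messages": messages}
```

Because only the final breakpoint moves between iterations, each new turn pays full price only for the tokens added since the previous turn, which is where the predictability in cost comes from.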
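Injecting that feedback can be as simple as appending a short reminder message after each tool result. A sketch, with hypothetical names and message shapes:

```python
def inject_reinforcement(messages, goal, failures):
    """Append a reminder after a tool result, nudging the agent back
    toward the goal and suggesting a change of approach on failure."""
    reminder = f"Reminder: the goal is {goal}."
    if failures:
        # Summarize only the most recent failure; the full transcript
        # of failed attempts stays out of the main context.
        reminder += (
            f" The last {len(failures)} attempt(s) failed"
            f" ({failures[-1]}); consider a different approach."
        )
    messages.append({"role": "user", "content": reminder})
    return messages
```

The same hook is a natural place to re-echo a todo list or flag that a required output tool has not been called yet.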
Self-reinforcement tools, like the todo write tool in Claude Code, which simply echoes back tasks, have proven surprisingly effective at keeping the agent on track. Reinforcement also helps the agent adapt when background state changes, such as recovering from a failed execution by stepping back and retrying earlier steps.

Failure isolation is critical. To prevent cascading failures, tasks are often run in subagents that iterate independently until success. Only the final outcome and brief summaries of failed attempts are reported back. This allows the agent to learn from mistakes without cluttering the main context. Context editing, available in Anthropic models, offers another way to prune failed attempts from the context, but it comes at the cost of invalidating caches, making the trade-off difficult to justify in many cases.

Shared state is essential for complex workflows. A virtual file system serves as the backbone, enabling tools like code execution and inference to share data seamlessly. Whether generating an image or unpacking a zip file, all components access the same shared storage via file paths. This prevents dead ends and supports flexible, multi-step workflows.

Output generation is surprisingly tricky. Using a dedicated output tool, such as one that sends an email, requires careful prompting to control tone and content. Attempts to refine output using a secondary model like Gemini 2.5 Flash increased latency and degraded quality, likely due to context loss and misalignment. The team now tracks whether the output tool was used and injects reinforcement if it's missing.

Model selection remains task-dependent. Haiku and Sonnet are still top choices for agent loops due to their strong tool-calling ability and transparency. For subtasks involving document summarization or image analysis, Gemini 2.5 excels, especially where Sonnet models trigger safety filters. Cost isn't just about token price: efficiency matters more.
A better tool caller may use fewer tokens overall, making it cheaper in practice.

Testing and evaluation remain the hardest part. Because agents are inherently dynamic and context-sensitive, traditional evaluation methods fail. Instrumenting real runs and using observability data is necessary, but no current solution has proven fully satisfactory.

Finally, coding agents continue to evolve slowly. The team is experimenting with Amp, not because it's objectively superior, but because its design, especially the interaction between subagents like the Oracle, feels more thoughtful and product-driven than many others in the space. Other notable observations include the value of minimalism in agent tooling, the decline of small open-source libraries due to AI-generated alternatives, and the importance of Tmux for managing interactive systems.

The underlying theme is clear: agent design is still far from solved, and the best path forward often involves rejecting abstractions and building with intention.
