AI++ // memory, search, evals and risk; what it takes to be an agent

Claude Mythos is here, except it’s called Fable 5 and comes with a few restrictions. It appears to be the largest model released and, according to the benchmarks, the most accomplished, even more so than Opus 4.8, which was released only two weeks ago. It’s also the most expensive model, so you might want to think twice before swapping it into your RAG support chatbot.

While it’s impressive to see the frontier march forward, this week in AI++ we’ll look at some of the techniques people are using to build agents for production. We’ll also cover a new Langflow release and round up some takes on search and memory for agents.

Phil Nash

Developer relations engineer for IBM

🛠️ Building with AI, Agents & MCP

Agents in production

To understand whether your agent is doing its job, you need to evaluate its performance. The guide at How to Evaluate AI Agents is a great start for understanding what you’re targeting and the ways to go about it. The OpenAI team walk us through building self-improving tax agents with Codex and show how evals help you build loops that hill-climb to the best results.
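To make the idea of an eval-driven loop concrete, here is a minimal sketch in Python. It is not the approach from either linked article: the eval cases, the toy `make_agent` function and its `precision` setting are all hypothetical stand-ins for a real agent and whatever you would tune on it (a prompt, a tool list, a model choice). The loop scores the current setting, tries its neighbours, and moves only when the score improves.

```python
from typing import Callable

# Hypothetical eval set: (input, expected output) pairs.
EVAL_CASES = [
    ("2 + 2", "4"),
    ("10 / 4", "2.5"),
    ("3 * 7", "21"),
]


def make_agent(precision: int) -> Callable[[str], str]:
    """Toy stand-in for an agent; `precision` plays the role of a tunable setting."""

    def agent(question: str) -> str:
        left, op, right = question.split()
        a, b = float(left), float(right)
        result = {"+": a + b, "-": a - b, "*": a * b, "/": a / b}[op]
        return f"{round(result, precision):g}"

    return agent


def score(agent: Callable[[str], str]) -> float:
    """Fraction of eval cases the agent answers exactly right."""
    passed = sum(agent(question) == expected for question, expected in EVAL_CASES)
    return passed / len(EVAL_CASES)


def hill_climb(start: int, max_steps: int = 10) -> tuple[int, float]:
    """Move to a neighbouring setting only while the eval score keeps improving."""
    current, current_score = start, score(make_agent(start))
    for _ in range(max_steps):
        neighbours = [current + 1, max(current - 1, 0)]
        candidate = max(neighbours, key=lambda p: score(make_agent(p)))
        candidate_score = score(make_agent(candidate))
        if candidate_score <= current_score:
            break  # no neighbour is better, so stop climbing
        current, current_score = candidate, candidate_score
    return current, current_score


best_setting, best_score = hill_climb(start=0)
print(best_setting, best_score)
```

In a real system the scoring step is the expensive part, so you would usually cache results per setting and keep a held-out set of cases to check that the loop is not just overfitting to the evals.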

The Anthropic team wrote about containing Claude across the different products in which it exists, identifying risks, reducing blast radius and identifying what to trust. Along similar lines, Sean Goedecke compares the risks and benefits of agents over pipelines.

Finally, LangChain describe how Lyft built their own agent platform, sharing how the agents are evaluated in production with tracing and monitoring.

Brand new Langflow

Great news in the world of Langflow with the release of version 1.10. This version upgrades the Langflow Assistant from building components to building whole flows with you. It also adds Memory Bases that persist conversation context across sessions in a flow, and configurable database connectors for Knowledge Bases.

Memory and Search

We’ll start with this in-depth look into agentic search that was originally a talk from the AI Engineer Europe conference. Watch the talk or walk through the examples yourself. A recent study showed that grep is all you need, but was it right? Was the harness doing a lot of the work instead?

Have you considered what to do with images in RAG? The team at Kapa have, and they describe how they index images for RAG.

Finally, the team at mem0 do a rundown of how popular agent harnesses manage their memory. There is a lot of work to be done in the memory space, and this is a great overview of techniques and drawbacks.

🧠 New models

Anthropic introduced Claude Opus 4.8 and then Claude Fable 5 and Claude Mythos 5, and I can’t keep up
Meanwhile, Microsoft released 7 new models of their own, including thinking, coding, transcription, voice and image generation. The image model, MAI-Image-2.5, has surpassed Nano Banana in image editing and does very well on text-to-image generation on Arena leaderboards
ElevenLabs released a new music generation model
Cartesia released a top-rated speech-to-text model called Ink
Google released a 12B size of the popular Gemma 4 that fits nicely in 16GB of memory. They also released the experimental DiffusionGemma, which generates text 4x quicker. It does so by generating whole paragraphs at a time and then refining them, rather than the token-by-token behavior we expect
Google also released Gemini 3.5 Live Translate that detects and translates speech to 70+ languages

🗞️ Other news

Check out the latest episode of The Flow with Prathmesh Patel, CEO of MCPJam, covering how to beat token bloat with MCP
Talk to your database with no code, just Claude
Ollama has improved performance and model support with GGUF
LLMs are not the Black Box you were promised
Did you see that attackers could convince the Instagram support agent to reset passwords to an arbitrary email address? This is a reminder that your agents should not step around security for your users
PerformativeUI (https://vorpus.github.io/performativeUI/) is a tongue-in-cheek AI-native React component library that promises “Components that signal how oversubscribed your funding round is”

🧑‍💻 Code & Libraries

OpenRAG, the open source RAG distribution, released version 0.5.0 with an upgraded Langflow version, a Streamable HTTP MCP server and much more
AWS released an MCP server for comprehensive threat modeling
Homecrew is a package manager for agent skills
Sandboxd provides isolated environments, with coding agents running inside, and a live URL to preview the work
mnemo is a memory sidecar that builds local knowledge graphs for your agents

🔦 Langflow Spotlight

Did you see that you can now apply policies to agent actions in Langflow? Policies turn natural language rules into guards for tools directly within the agent. Prompts can guide behavior, but Policies constrain execution. Learn about how Policies work in this blog post.
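If the difference between guiding and constraining feels abstract, here is a minimal sketch of the general pattern in Python. This is not Langflow's API or how Policies are implemented: the `Policy` class, the `guarded` wrapper and the `issue_refund` tool are all hypothetical, and the check is written by hand, where Langflow derives its guards from the natural language rule. The point is only that the guard runs in code before the tool does, so the model cannot talk its way past it.

```python
from dataclasses import dataclass
from typing import Any, Callable


class PolicyViolation(Exception):
    """Raised when a tool call breaks a policy."""


@dataclass
class Policy:
    rule: str  # the natural language rule, kept for error messages
    allows: Callable[[dict[str, Any]], bool]  # hand-written check standing in for the rule


def guarded(tool: Callable[..., Any], policies: list[Policy]) -> Callable[..., Any]:
    """Wrap a tool so every policy is checked before the tool executes."""

    def wrapper(**kwargs: Any) -> Any:
        for policy in policies:
            if not policy.allows(kwargs):
                raise PolicyViolation(policy.rule)
        return tool(**kwargs)

    return wrapper


def issue_refund(order_id: str, amount: float) -> str:
    """Hypothetical tool an agent might call."""
    return f"Refunded {amount:.2f} for order {order_id}"


refund_limit = Policy(
    rule="Refunds over 100 must be approved by a human",
    allows=lambda args: args["amount"] <= 100,
)

safe_refund = guarded(issue_refund, [refund_limit])

print(safe_refund(order_id="A1", amount=25.0))
try:
    safe_refund(order_id="A2", amount=500.0)
except PolicyViolation as violation:
    print(f"Blocked: {violation}")
```

The agent is given `safe_refund` rather than `issue_refund`, so a blocked call surfaces as an error it can report or escalate instead of an action it has already taken.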

🗓️ Events

The AI Coding Summit will be in London and online on July 6th and 7th with talks and workshops on MCP, agentic systems, AI-driven testing & debugging, and real-world best practices.

Use the promo code AI++ for a 10% discount on tickets.