A year of reasoning, agents, and compressed innovation cycles

If 2024 was the year of the chatbot, 2025 has been the year AI started to think—or at least, the year we started debating what “thinking” really means. It has been a year of profound shifts: from simple instruction following to complex reasoning, from “vibes” to verifiable actions, and from general-purpose models to specialized agents. It has also been a year of significant change for me personally, as I moved on from Google to explore new challenges.

This recap is based on a set of monthly reports I wrote throughout 2025, originally as a way to keep track of the fast‑moving AI landscape for myself and for my team at Google. Over time, those notes became a disciplined way to separate signal from noise and to understand how individual launches and papers fit into a broader trajectory.

2025 turned out to be a year of both acceleration and recalibration. Breakthroughs in reasoning models, agents, multimodality, and efficiency continued at a remarkable pace, while questions around economics, safety, and real‑world impact became harder to ignore. Stepping back each month made it easier to see which themes truly mattered.

What follows is a month‑by‑month synthesis of those notes, curated and lightly expanded for this recap. The goal is not to be exhaustive, but to capture how the year actually unfolded. Let me know if I missed something important. I am also including direct links to each month in case you have a favorite!

January: The Efficiency Shock and the “Peak Data” Debate

The year began with a wake-up call on efficiency. DeepSeek R1 made headlines not just for its performance, but for its remarkably low training cost, sparking a massive debate on whether we were entering a new era of “Post-Training” efficiency. While some celebrated this as a democratization moment, it’s important to read the fine print: techniques like distillation and fine-tuning are powerful, but they often rely on the existence of larger, more expensive frontier models. At the same time, we saw healthy skepticism emerge regarding agents, with papers arguing that for simple tasks, we might be over-engineering solutions.
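To make that fine print concrete, here is a minimal sketch of the distillation idea in PyTorch: a smaller student is trained to match a teacher’s softened output distribution, which only works because a stronger (and more expensive) teacher already exists. This is a didactic illustration, not DeepSeek’s actual recipe; the shapes and temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Classic soft-label distillation: the student mimics the teacher's
    # output distribution, softened by a temperature.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between the two distributions, scaled by T^2 to keep
    # gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, soft_targets,
                    reduction="batchmean") * temperature ** 2

# Toy example: a batch of 4 token positions over a 32k-token vocabulary.
student_logits = torch.randn(4, 32000, requires_grad=True)
teacher_logits = torch.randn(4, 32000)  # would come from the frontier model
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
```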

January set the stage for a year where data quality, reasoning reliability, and evaluation rigor would repeatedly resurface as central constraints rather than secondary concerns.

February: The “DeepSeek Moment” and Grok’s Rise

February was dominated by the aftermath of DeepSeek and the surprise arrival of Grok 3. We spent weeks dissecting the DeepSeek papers, realizing that while it wasn’t necessarily a fundamental research breakthrough, it was a fantastic engineering feat that cleverly optimized known techniques. Meanwhile, Grok 3, developed by a small team in just 18 months, arrived and shook up the leaderboards. We also began to realize that benchmarks were becoming increasingly “gameable” via test-time compute, making “Price vs. Performance” the new metric that matters.

February made clear that efficiency breakthroughs amplify—not reduce—the need for transparency, reproducibility, and robust evaluation.

March: “Agentic Moore’s Law”

By March, the conversation shifted heavily toward agents. We started seeing real data suggesting an “Agentic Moore’s Law,” where the length of tasks agents can solve autonomously doubles roughly every 7 months. This was also the month Andrej Karpathy dropped his “Deep Dive,” reminding us that despite the hype, LLMs still struggle with basic token-level tasks (like counting ‘r’s in “strawberry”) and that prompting is still very much an art form.
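To put numbers on that doubling claim, here is a quick back-of-the-envelope calculation. The one-hour starting horizon is an assumption for illustration, not a figure from the underlying studies.

```python
# If the autonomous task horizon doubles every 7 months, how does it
# grow over three years? The 1-hour starting point is illustrative.
DOUBLING_MONTHS = 7
start_horizon_hours = 1.0

for month in range(0, 37, 6):
    horizon = start_horizon_hours * 2 ** (month / DOUBLING_MONTHS)
    print(f"month {month:2d}: ~{horizon:5.1f} hours of autonomous work")
# After 36 months the horizon is ~35 hours: roughly a week of focused work.
```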

March previewed a broader shift toward autonomy and structured reasoning that would accelerate throughout the year.

April: Peering Inside the Box

April was about peering inside the black box. Anthropic released fascinating research on tracing the internal “thoughts” of models, while Meta’s Llama 4 release highlighted a crucial finding: Reinforcement Learning was proving far more important than Supervised Fine-Tuning (SFT). In fact, the data suggested that too much SFT can actually hurt performance. This was also the month OpenAI released GPT-4.1, which felt like a minor iteration compared to the architectural shifts we were seeing elsewhere.
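As a rough illustration of why the two training stages behave so differently, here is a minimal sketch contrasting the SFT objective (imitate fixed reference tokens) with a simple policy-gradient RL objective (reinforce whole sampled responses in proportion to a reward). This is a didactic simplification under my own assumptions, not anyone’s production training code.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, target_ids):
    # SFT: cross-entropy against fixed reference tokens, pushing the
    # model toward one "gold" answer per prompt.
    return F.cross_entropy(logits.view(-1, logits.size(-1)),
                           target_ids.view(-1))

def policy_gradient_loss(sampled_token_logprobs, rewards):
    # REINFORCE-style RL: sampled responses are scored by a reward
    # (e.g., a verifier), and their log-likelihood is scaled by the
    # mean-centered reward. Good behavior is reinforced without
    # prescribing exact tokens, which is one intuition for why heavy
    # SFT can constrain a model in ways RL does not.
    advantages = rewards - rewards.mean()            # simple baseline
    per_sample = sampled_token_logprobs.sum(dim=-1)  # (batch,)
    return -(per_sample * advantages).mean()
```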

April reinforced that evaluation quality and RL‑driven training were no longer optional; they were becoming core pillars of progress. At the same time, serious questions emerged about the validity of public benchmarks.

May: Ecosystem Moves and “Vibe Coding”

May felt like a consolidation month. OpenAI went on an acquisition spree (buying Jony Ive’s startup and Windsurf), signaling a push into hardware and broader ecosystems. Meanwhile, Mark Zuckerberg was on a podcast tour with a refreshing level of honesty, admitting that Llama is essentially a byproduct of Meta’s internal needs rather than a purely altruistic developer play. We also saw “vibe coding,” a development style that prioritizes speed and flow over rigorous syntax, gain legitimacy. Coding, multimodality, and enterprise applications increasingly shared the same underlying capabilities, even as models became more specialized at the surface.

June: Superintelligence and “AI-nxiety”

In June, the race for “Superintelligence” became explicit, with Meta forming a dedicated lab and aggressively poaching talent (reportedly with compensation packages running from seven to nine figures). But alongside this race, we started seeing the human cost: “AI-nxiety.” Developers and users alike expressed exhaustion at the relentless pace of updates. We also saw Apple release their “Illusion of Thinking” paper, a controversial splash arguing that current reasoning models might be shallower than we assume, a debate that is still ongoing.

June underscored a recurring theme: progress is increasingly gated by infrastructure, incentives, and workflows—not by model quality alone.

July: Agents Getting Real and Practical

July saw agents moving from cool demos to practical, integrated products. OpenAI merged their Operator and Deep Research teams, signaling that agents are the new search. We also saw smart implementation tactics from companies like Shopify, which built innovative agents that access internal data via the Model Context Protocol (MCP). The economic argument also solidified this month: we began to see compelling evidence that LLM inference costs are dropping fast enough that answering a query with an LLM could soon be cheaper than serving a traditional search result.
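For readers unfamiliar with the pattern, here is a minimal sketch of what exposing internal data over MCP can look like with the official Python SDK’s FastMCP helper. The server name and inventory tool are hypothetical, not Shopify’s actual implementation.

```python
# A minimal MCP server exposing one internal-data tool to an agent.
# Requires the official MCP Python SDK: pip install mcp
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("internal-data")  # server name is illustrative

@mcp.tool()
def get_inventory(sku: str) -> dict:
    """Look up stock levels for a SKU (hypothetical internal lookup)."""
    # A real deployment would query an internal database here.
    return {"sku": sku, "available": 42, "warehouse": "EU-1"}

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default so an agent can call the tool
```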

July made clear that productivity gains are earned through structural change, not incremental tooling.

August: The “Bubble” Panic and GPT-5’s Arrival

August was a rollercoaster of sentiment. We had a wave of “AI Bubble” articles from major publications asking if we had peaked, citing failed pilots and high costs. Then, almost on cue, GPT-5 launched. While the rollout was rocky and required adjustments, the sheer user numbers (700M weekly users) and valuation ($500B) largely silenced the “dead end” narrative. It was a reminder that while the hype might fluctuate, the utility is scaling.

The month surfaced genuine doubts about real progress in AI, even as a wave of launches suggested the doubters were probably wrong.

September: “Nano Banana” and Scientific Breakthroughs

September was a huge month for Google, led by the viral success of the Nano Banana image editing model. But beyond the consumer hype, we saw incredible scientific work from DeepMind on fluid dynamics and a continued push for safer, more accessible AI.

September signaled a new baseline: autonomy and reliability were becoming table stakes.

October: Economic Value over Benchmarks

October’s focus was heavily on the new possibilities of AI video creation and on proving the economic value of AI. The GDPval paper was a standout, proposing that we measure frontier models not by academic exams, but by their ability to perform “economically valuable” tasks.

Progress was increasingly evaluated economically, not just technically.

November: The New SOTA Battleground

November brought the year’s biggest shakeup: the launch of Gemini 3. Google’s latest model, accompanied by the new “Deep Think” reasoning mode and the “Google Antigravity” agentic platform, immediately topped the charts. Just days later, Anthropic countered with Claude Opus 4.5, marketed as the ultimate coding model with massive improvements in agentic workflows. The market share data reflects this shift—ChatGPT is no longer the default for everyone.

November felt like a visible inflection point in both capability and market momentum. SOTA leadership, once measured in years, was now clearly measured in weeks.

December: Code Red and the Grand Finale

The year ended with high drama. Feeling the heat from Gemini 3 and Opus 4.5, OpenAI declared a “Code Red,” reminiscent of Google’s similar move back in 2022. This urgency birthed GPT-5.2, a rapid iteration designed to reclaim the throne, alongside new features like ChatGPT Images. Meanwhile, at NeurIPS 2025 in San Diego, the buzz was all about embodied agents, with DeepMind unveiling SIMA 2, a generalist agent for 3D worlds that feels like a real step towards general-purpose robotics.

  • OpenAI Code Red - Sam Altman rallying the troops as competition heats up.
  • GPT-5.2 Launch - OpenAI’s rapid response to the shifting benchmarks.
  • ChatGPT Images - Bringing native image capabilities to the forefront.
  • Gemini 3 Flash - A speed-optimized beast that still manages ~78% on SWE-bench Verified.
  • Shipping Sora for Android - A fascinating look at how OpenAI used Codex to build their own app in just 28 days.
  • DeepMind SIMA 2 - A generalist embodied agent for 3D worlds, unveiled just in time for the conference season.
  • NeurIPS 2025 conference and proceedings. Also, interestingly, several VCs are now publishing good summaries of the conference: here is the one by Amplify Partners and here is the one by Radical Ventures.

The year closed with a clear signal: fast iteration now coexists with renewed investment in long‑horizon research.

What I Got Wrong in 2025

Looking back at my own predictions (and anxieties) from the start of the year, a few things stand out:

  • The Bubble That Wasn’t: In August, amidst the “Peak AI” narrative, I worried we were heading for a winter. I was wrong. The utility of these models in coding and enterprise workflows has created a floor for value that is much higher than I anticipated.
  • Agents Are Harder than We Thought: I expected autonomous agents to be “solved” by mid-year. Instead, we found that reliability at scale is an immense challenge. The “Agentic Moore’s Law” is real, but the slope is shallower than I hoped.
  • The Persistence of Open Weights: I feared the gap between closed and open models would widen to a chasm. Instead, thanks to DeepSeek, Mistral, and Meta, the open ecosystem is arguably healthier than ever, keeping the giants honest on price.

Bonus: A Deeper Dive

Before closing out the year, I sat down for an in-depth, 2-hour conversation with Jon Hernandez on his “Inteligencia Artificial” podcast. We covered everything from my transition out of Google and the internal dynamics of big labs, to why I believe “AGI” is a distracting term and why we should focus on specialized agents instead. If you want the unfiltered, “director’s cut” version of my take on 2025 and beyond, this is it.

Conclusion

As I close out my recap of 2025, I am struck by how much the narrative has changed. We are no longer just awed by the fluency of LLMs; we are now demanding fidelity, reasoning, and autonomy. The battles of November and December proved that no lead is safe, and “SOTA” is a title you hold for weeks, not months.

Leaving Google this year has given me a fresh perspective on this ecosystem. The rate of change is dizzying, but it is also exhilarating. As we head into 2026, I am more convinced than ever that we are just scratching the surface of what is possible when we combine powerful foundation models with verifiable reasoning and agentic workflows. I look forward to exploring all of this in the incredible space of travel from my new position as Chief AI and Data Officer at Expedia Group.

Here’s to a 2026 full of verified rewards and fewer hallucinations. Happy New Year!