<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://amatria.in/blog/feed.xml" rel="self" type="application/atom+xml" /><link href="https://amatria.in/blog/" rel="alternate" type="text/html" /><updated>2026-04-14T06:10:10+00:00</updated><id>https://amatria.in/blog/feed.xml</id><title type="html">AI, software, tech, and people. Not in that order. By X</title><subtitle>Musings on AI, software, technology, and people. Catalan in The Valley.</subtitle><author><name>Xavi Amatriain</name></author><entry><title type="html">Beyond the Bot: Building a Multi-Agent Recommender for Actionable Intelligence</title><link href="https://amatria.in/blog/agenticrecsys" rel="alternate" type="text/html" title="Beyond the Bot: Building a Multi-Agent Recommender for Actionable Intelligence" /><published>2026-03-15T00:00:01+00:00</published><updated>2026-03-15T00:00:01+00:00</updated><id>https://amatria.in/blog/agenticrecsys</id><content type="html" xml:base="https://amatria.in/blog/agenticrecsys"><![CDATA[<p><em>(This blog post, as with most of my recent ones, is written with AI assistance and augmentation. In this case, “We” in the text refers to myself and my local OpenClaw agent, which has been my primary co-developer throughout this project.)</em></p>

<p>Most AI demos today suffer from a “low-ceiling” problem: they stop at “look, it can answer a question.” I wanted to push toward the actual horizon of this technology—an assistant that doesn’t just predict the next token, but personalizes recommendations, reasons with deep context, and executes real-world tasks.</p>

<p>That vision became Recommend Flow: a multi-agent architecture built on OpenClaw. The project hinged on one key architectural decision: instead of generating recommendations in a vacuum, the orchestrator first consults <a href="https://github.com/xamat/Xavibot">Xavibot v0.1</a> —my original assistant—as a “preference proxy.” Note that Xavibot is implemented using a completely different technology stack, leveraging Google’s Gemini models and a custom Retrieval-Augmented Generation (RAG) pipeline to ensure its “intuition” is grounded in my actual history.</p>

<h1 id="the-we-and-the-machine">The “We” and the Machine</h1>

<p>A quick note on the terminology: when I say “we,” I am being quite literal. This system was deployed and refined in collaboration with a local OpenClaw agent running on a dedicated Linux machine in my home office. I communicate with the system primarily through WhatsApp, often using voice messages for convenience. The local agent handles the transcription, intent extraction, and orchestration.</p>

<p>One of the most pragmatic features of this setup is how it handles the “last mile” of execution. By utilizing browser navigation on my local machine, the bot doesn’t need to know my passwords or handle sensitive credentials. It simply leverages the active logins already present in my local browser sessions. While this requires the machine to be secure, it keeps the identity risk isolated and avoids the headache of managing a separate “vault” of API keys for every third-party service.</p>

<p>Furthermore, we’ve leaned into the philosophy of “Memory as Documentation.” All agent memories and complex workflows are stored locally as simple .md files. This approach offers several advantages:</p>

<ul>
  <li><strong>Maintenance &amp; Transparency</strong>: I can literally cat a memory file to see exactly what the agent “knows” or “remembers” about my tastes. If it learns a wrong preference, I fix the text file.</li>
  <li><strong>Portability</strong>: The entire intelligence of the system—the workflows, the personality priors, and the task histories—lives in a folder that can be moved across machines or version-controlled via Git.</li>
  <li><strong>Security Guardrails</strong>: Storing workflows in plain text allows for human-readable audit trails. I can verify the steps the agent intends to take before it ever touches a browser.</li>
</ul>
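
<p>To make the pattern concrete, here is a minimal Python sketch of what “memory as plain text” can look like. The folder layout and helper names are hypothetical illustrations, not the actual OpenClaw internals:</p>

<pre><code class="language-python">from datetime import date
from pathlib import Path

MEMORY_DIR = Path.home() / "agent" / "memories"   # hypothetical location for the .md files

def read_memory(topic: str) -> str:
    """Return the plain-text memory for a topic, e.g. 'dining-preferences'."""
    path = MEMORY_DIR / f"{topic}.md"
    return path.read_text() if path.exists() else ""

def append_memory(topic: str, note: str) -> None:
    """Append a dated, human-readable note; auditing or correcting it later is just editing text."""
    MEMORY_DIR.mkdir(parents=True, exist_ok=True)
    path = MEMORY_DIR / f"{topic}.md"
    with path.open("a") as f:
        f.write(f"- {date.today().isoformat()}: {note}\n")
</code></pre>

<p>Because every “memory” is an ordinary file, the same folder can be inspected with <code>cat</code>, versioned with Git, or copied to a new machine.</p>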

<p><img src="/images/125-0.png" /></p>

<h1 id="the-big-idea-separate-roles-compose-capabilities">The Big Idea: Separate Roles, Compose Capabilities</h1>

<p>In a <a href="https://amatria.in/blog/postpretraining">previous post</a>, I argued that modern LLMs are evolving into “reasoning engines.” In Recommend Flow, we intentionally split those reasoning responsibilities:</p>

<ul>
  <li><strong>The Orchestrator (OpenClaw session)</strong>: This agent serves as the central brain and primary interface. While it often communicates via WhatsApp, it also supports a local browser-based UX for more interactive sessions. Crucially, the orchestrator manages the Recommend Flow—the set of hard rules and constraints that define the recommendation process. However, it isn’t a rigid state machine; it has the autonomy to improvise and handle edge cases outside of the predefined flow when necessary.</li>
  <li><strong>The Preference Proxy (Xavibot v0.1)</strong>: This is not just another LLM endpoint. It runs RAG (Retrieval-Augmented Generation) over a broad corpus of my own material—blog posts, personal guides, and private documents. It acts as the “taste model,” carrying my content memory and style priors.</li>
  <li><strong>The Execution Layer (Browser Automation)</strong>: Completes the workflow (e.g., booking a table or adding an item to a cart) using the local browser.</li>
</ul>

<p>This separation matters. A single model trying to handle high-level preference reasoning and low-level DOM manipulation is often brittle. A role-specialized setup is easier to debug, tune, and—most importantly—trust.</p>
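
<p>One way to picture the separation is as three narrow interfaces, each of which can be tested and swapped independently. This is only an illustrative sketch; the names and signatures are mine, not OpenClaw’s or Xavibot’s:</p>

<pre><code class="language-python">from typing import Protocol

class PreferenceProxy(Protocol):
    """Taste model: RAG over my own writing (the role Xavibot v0.1 plays)."""
    def query(self, question: str) -> str: ...

class ExecutionLayer(Protocol):
    """Last mile: drives the local browser using existing logged-in sessions."""
    def run(self, task: str, details: dict) -> bool: ...

class Orchestrator(Protocol):
    """Central brain: owns the Recommend Flow rules and talks to the user."""
    def handle(self, message: str) -> str: ...
</code></pre>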

<h1 id="how-recommend-flow-works-end-to-end">How Recommend Flow Works (End-to-End)</h1>

<p>The interaction follows a disciplined Recommend → Decide → Do loop:</p>

<ol>
  <li>Voice/Text Intent: I send a WhatsApp message: “Hey, find me a place for dinner tonight that I haven’t been to but fits my usual vibe.”</li>
  <li>Preference Interrogation: The Orchestrator acknowledges the request and immediately queries Xavibot v0.1: “Based on Xavier’s past writing on food and his local guide, what are his core dining preferences?”</li>
  <li>Constraint Refinement: The system brings those signals back and asks for missing details (e.g., location or specific timing) one at a time.</li>
  <li>The Decision Set: It returns a compact “Top 1 + Backup 1” recommendation with explicit tradeoffs based on my retrieved preferences.</li>
  <li>Action Execution: Once I give the “Go”, the local Linux agent wakes up the browser, navigates to the reservation site, and executes the task using my existing login.</li>
</ol>
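
<p>Sketched as heavily simplified Python, the loop looks roughly like this. The helpers (<code>query_preference_proxy</code>, <code>ask_user</code>, <code>rank_candidates</code>, <code>run_browser_task</code>) are hypothetical stand-ins for the Xavibot RAG endpoint, the WhatsApp channel, the orchestrator’s ranking step, and the browser automation:</p>

<pre><code class="language-python">def recommend_flow(user_request: str) -> None:
    # 1-2. Intent arrives; the orchestrator asks the preference proxy for taste priors.
    preferences = query_preference_proxy(
        "Based on Xavier's past writing on food, what are his core dining preferences?"
    )

    # 3. Refine missing constraints one question at a time (location, timing, party size).
    constraints = {}
    for slot in ("location", "time", "party_size"):
        constraints[slot] = ask_user(f"What {slot} should I assume?")

    # 4. Compact decision set: one primary pick plus one backup, with explicit tradeoffs.
    top_pick, backup = rank_candidates(user_request, preferences, constraints)[:2]

    # 5. Act only after an explicit "Go" from the user.
    if ask_user(f"Book {top_pick}? (Backup: {backup})").strip().lower() == "go":
        run_browser_task(task="reserve_table", details={"venue": top_pick, **constraints})
</code></pre>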

<p>Watch the Recommend Flow restaurant reservation demo below, or open it directly <a href="/blog/images/RestaurantReservationDemo.webm">here</a>.</p>

<div style="margin-bottom: 20px;">
  <video controls="" preload="metadata" style="width: 100%; max-width: 100%; height: auto;">
    <source src="/blog/images/RestaurantReservationDemo.webm" type="video/webm" />
    Your browser does not support the video tag. You can open the demo directly <a href="/blog/images/RestaurantReservationDemo.webm">here</a>.
  </video>
</div>

<p><em>In this demo I chat with my OpenClaw instance (called Xavibot) using a local browser for easier recording (as mentioned, I usually interface through WhatsApp). Note that the browsing in the Chrome browser on the left is completely autonomous. In fact, at some point OpenClaw decided to Google for “best restaurants in Palo Alto on OpenTable”, which was a surprise.</em></p>

<h1 id="why-xavibot-v01-is-the-perfect-backbone">Why Xavibot v0.1 is the Perfect Backbone</h1>

<p>As I’ve explored in my <a href="https://amatria.in/blog/datagutdecisions">Data-Informed Gut Decision-Making framework</a>, good decisions require a mix of data and intuition. Xavibot v0.1 brings two properties that are hard to fake with prompting alone:</p>

<ul>
  <li>Grounded Memory: RAG over a vast collection of my writings, personal guides, and miscellaneous documents gives persistent, high-signal context. It has even surprised me by surfacing details I’d forgotten I documented—like my preference for avoiding spicy foods, an observation it pulled from deep within my records.</li>
  <li>Taste Continuity: It reflects my historical writing and constraints, acting as a “digital twin” of my preferences.</li>
</ul>

<h1 id="what-we-learned-and-what-broke">What We Learned (and What Broke)</h1>

<ol>
  <li>Multi-agent beats monolith: Role separation reduced prompt complexity and made behavior consistent.</li>
  <li>Personalization is a loop, not a profile: Static “user profile” fields are useful, but conversational updates (context, mood, live constraints) matter just as much.</li>
  <li>Action design needs strict guardrails: People forgive imperfect suggestions; they don’t forgive wrong actions. We learned to make recommendations “cheap” to generate, but transactions require explicit, high-confidence confirmation.</li>
  <li>Isolation is a security feature: Running the browser automation locally on my own machine, rather than in a cloud-hosted container, provided a natural security boundary that felt much safer for a personal project.</li>
</ol>

<h1 id="conclusion-the-future-of-agentic-execution">Conclusion: The Future of Agentic Execution</h1>

<p>The most interesting part of this project isn’t that an AI can recommend a restaurant; it’s the <strong>architectural pattern</strong>. This combination of preference-grounded reasoning, specialized agents, and tool-based execution is a blueprint that generalizes to travel planning, shopping, gifting, or even professional hiring workflows.</p>

<p>If the previous wave of AI was about assistants that could answer, this is the wave of assistants that can decide with you and then execute for you.</p>]]></content><author><name>Xavier</name></author><category term="Artificial Intelligence" /><category term="Agents" /><category term="Product Development" /><summary type="html"><![CDATA[(This blog post, as with most of my recent ones, is written with AI assistance and augmentation. In this case, “We” in the text refers to myself and my local OpenClaw agent, which has been my primary co-developer throughout this project.)]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://amatria.in/blog/images/125-0.png" /><media:content medium="image" url="https://amatria.in/blog/images/125-0.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Why I am not AGI-pilled (and you probably shouldn’t be either)</title><link href="https://amatria.in/blog/agi" rel="alternate" type="text/html" title="Why I am not AGI-pilled (and you probably shouldn’t be either)" /><published>2026-02-22T00:00:01+00:00</published><updated>2026-02-22T00:00:01+00:00</updated><id>https://amatria.in/blog/agi</id><content type="html" xml:base="https://amatria.in/blog/agi"><![CDATA[<p>If you have been following my journey for a while, you’re probably aware of my pragmatic approach to AI capabilities and my skepticism towards the surrounding hype. Not too long ago, during my time at Google, I found myself sitting next to someone at an event, and the conversation inevitably turned to AI. I tend to be pretty candid about my skepticism regarding Artificial General Intelligence (AGI), so I launched right into it. I laid out my entire thesis: why the term is a misnomer, why benchmarking against human cognition is a fallacy, and why the pursuit of a monolithic “God model” is bad engineering.</p>

<p>He listened thoughtfully, nodding along to my points. Eventually, I paused and asked, “By the way, what do you do at Google?”</p>

<p>He smiled politely and introduced himself. It was <a href="https://en.wikipedia.org/wiki/Shane_Legg">Shane Legg</a>.</p>

<p>For those who might not know, Shane is the Chief AGI Scientist at Google DeepMind. He is also the person who literally coined the term “AGI” nearly two decades ago. Pitching the case against AGI to the man whose life’s work is dedicated to building it is certainly one way to break the ice, and Shane did take it in good humor. But despite the irony of the moment, I stand firmly by the arguments I made that day. I am not AGI-pilled.</p>

<p>Before I dive into the technical details, let me clarify one crucial distinction: rejecting AGI does not mean rejecting AI. I am incredibly optimistic about the future of Artificial Intelligence and its potential to fundamentally transform software, science, and society. My skepticism is directed solely at the pursuit of AGI—the obsession with building a single, monolithic, human-like “God model.” Being anti-AGI makes me a pragmatist, not a pessimist. In fact, I believe letting go of the AGI myth is the very key to building better, more capable AI systems. Here is why.</p>

<h1 id="the-myth-of-the-g">The Myth of the “G”</h1>

<p>The foundational flaw in AGI is the “G”: General. The concept assumes that human intelligence is a universal baseline against which all synthetic intelligence should be measured. But human intelligence is not general at all; it is highly specialized.</p>

<p>I am not alone in this view. Yann LeCun, Meta’s Chief AI Scientist, has publicly called the concept of artificial general intelligence <a href="https://the-decoder.com/yann-lecun-calls-general-intelligence-complete-bs-and-deepmind-ceo-hassabis-fires-back-publicly/">“complete BS,”</a> precisely because human intelligence is inherently specialized. We are optimized for a very specific evolutionary niche. If you look at raw numerical computation, a pocket calculator from the 1980s is vastly superior to the human brain.</p>

<p>If you look at nature, the idea of human intellectual supremacy gets even blurrier. While our particular brand of intelligence often differentiates us from animals, it doesn’t always make us objectively more “intelligent” in every context. A bird’s navigational intelligence, for instance, far surpasses that of most humans. (Read <a href="https://inquisitivebiologist.com/2023/06/13/book-review-if-nietzsche-were-a-narwhal-what-animal-intelligence-reveals-about-human-stupidity/">“If Nietzsche Were a Narwhal: What Animal Intelligence Reveals About Human Stupidity”</a> for more examples and details).</p>

<p>Benchmarking an AI against human capability doesn’t make it “general.” It simply makes it an artificial mimic of our specific, localized evolutionary adaptations.</p>

<h1 id="striking-the-magic-from-intelligence">Striking the Magic from Intelligence</h1>

<p>If human cognition isn’t the gold standard, what actually is intelligence?</p>

<p>My former colleague Blaise Agüera y Arcas tackles this beautifully in his great book <a href="https://whatisintelligence.antikythera.org/">What Is Intelligence?</a>. He strips away the mystical, anthropocentric aura we tend to wrap around the mind. Agüera y Arcas frames biological organisms essentially as compositions of functions, reducing intelligence to the elegant mechanics of prediction and computation. In the book, he lists five properties of intelligence, all of which point to a compound rather than monolithic nature. Intelligence is (1) predictive, (2) social, (3) multifractal, (4) diverse, and (5) symbiotic. This leads to a very “non-AGI” definition of intelligence as <em>“the ability to model, predict, and influence one’s future; it can evolve in relation to other intelligences to create a larger symbiotic intelligence.”</em> (Also kudos for that Before Sunrise reference!).</p>

<p>When you define intelligence by its functional reality (the ability to model the world and predict outcomes) the AGI illusion starts to fade. Building complex systems that master prediction (like predicting the next token in a sequence) doesn’t magically summon a human-like mind into the machine.</p>

<h1 id="the-monolith-fallacy-and-the-intelligence-gap">The Monolith Fallacy and the “Intelligence Gap”</h1>

<p>This brings us to the most frustrating contradiction in modern AI research: the cognitive dissonance of the top labs. Even luminaries who recognize the flaws in human “generality” (like LeCun, or DeepMind CEO Demis Hassabis) still insist on cramming all capabilities into massive, monolithic foundation models.</p>

<p><a href="https://www.businessinsider.com/deepmind-ceo-demis-hassabis-agi-real-intelligence-gap-2026-2">Recently</a>, Hassabis pointed to Large Language Models making mistakes in basic math as evidence of a “real intelligence gap” on the road to AGI. But this completely misses the point. LLMs are probabilistic predictors. Expecting billions of frozen neural weights to perform flawless, symbolic arithmetic is an architectural mismatch; it is like using a hammer to turn a screw. The solution to an LLM failing at math isn’t to train a larger, more expensive monolith or to expect another breakthrough. The solution is simply to give the model access to that 1980 calculator.</p>

<p>Similarly, many experts frequently point to “continuous learning” as a major hurdle separating current AI from true AGI. But again, this is a limitation of a frozen, monolithic neural network, not a limitation of AI as a system. Agentic AI solves this elegantly. Agents like OpenClaw are already demonstrating continuous learning by actively managing persistent memory and reading/writing to files. In fact, if we look outside the LLM bubble, simpler specialized AI—like massive-scale recommender systems—have been successfully using online learning approaches to continuously update and learn from user interactions for decades. We don’t need a mystical AGI breakthrough for continuous learning; we just need better system design.</p>

<h1 id="the-power-of-composition-enter-compound-ai-systems">The Power of Composition: Enter Compound AI Systems</h1>

<p>This is why the current pursuit of AGI represents terrible engineering. The most robust, scalable, and safe AI architectures in production today do not rely on a single, omniscient model. They achieve broad capability through composition—what we now call <a href="https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/">Compound AI Systems</a>.</p>

<p>Instead of forcing a single neural network to do everything, you orchestrate specialized agents. You use an LLM not as a universal database, but as a semantic router and reasoning engine. If the system needs factual grounding, it queries a vector database (RAG). If it needs to execute logic, it writes code and hands it to a Python interpreter. If it needs to do exact arithmetic, it executes an API call to a deterministic calculator tool.</p>
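
<p>A toy version of that routing logic is easy to sketch. Everything below is illustrative: <code>route_with_llm</code>, <code>search_vector_db</code>, <code>run_in_sandbox</code>, and <code>compose_answer_with_llm</code> are placeholder names for components you would supply, not a real library:</p>

<pre><code class="language-python">TOOLS = {
    # Exact arithmetic goes to a deterministic evaluator, not to the LLM's weights.
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
    # Factual grounding goes to retrieval (RAG) over a vector database.
    "retriever": lambda query: search_vector_db(query),
    # Executable logic goes to a sandboxed interpreter.
    "python": lambda code: run_in_sandbox(code),
}

def answer(question: str) -> str:
    # The LLM acts as a semantic router: pick a tool and format its input.
    tool_name, tool_input = route_with_llm(question, options=list(TOOLS))
    evidence = TOOLS[tool_name](tool_input) if tool_name in TOOLS else None
    # A final LLM pass composes the answer, grounded in the tool's output.
    return compose_answer_with_llm(question, evidence)
</code></pre>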

<p>This multi-agent paradigm is not just theoretical; it is happening right now. Recent examples like <a href="https://www.moltbook.com/">Moltbook</a> and <a href="https://www.infoq.com/news/2026/02/kimi-k25-swarm/">Kimi agent swarm</a> have gathered far more attention, excitement, and practical traction than any of the recent monolithic model launches. When Anthropic’s CEO Dario Amodei talks about the future being a <a href="https://www.darioamodei.com/essay/the-adolescence-of-technology">“Country of Geniuses in a Datacenter,”</a> he is effectively describing a multi-agent reality. You do not need a central, monolithic AGI to make that possible. You simply need swarms of highly specialized agents collaborating at scale.</p>

<p>As I have noted before, specialized, superhuman agents are likely to be both more achievable and more beneficial in addressing specific, complex challenges in alignment with our goals (see my <a href="https://amatria.in/blog/multiagents">“Beyond Singular Intelligence: Exploring Multi-Agent Systems and Multi-LoRA in the Quest for AGI”</a>). Much like modern, massive-scale recommender systems, intelligence at scale is a pipeline, not a monolith.</p>

<p>This compound approach is inherently safer. When you rely on specialized tools, you maintain control. You can monitor the API calls, isolate hallucinations, and bake strict, programmatic guardrails into the boundaries between agents. A monolithic model, by contrast, is an opaque black box where capabilities and failure modes are dangerously entangled.</p>

<h1 id="the-winner-takes-all-hard-takeoff-scenario">The winner-takes-all hard takeoff scenario</h1>

<p>So why are AI labs so obsessed with pursuing a single, monolithic AGI instead of embracing specialized composition? It is largely driven by the science-fiction fantasy of the <a href="https://www.lesswrong.com/posts/tjH8XPxAnr6JRbh7k/hard-takeoff">“hard takeoff.”</a> This is the belief that once a single model crosses a certain intelligence threshold, it will recursively self-improve at an explosive rate. It’s an arms race fueled by the fear that the first company to build this “God model” takes the entire global economy.</p>

<p>Fueling an arms race to build a single, opaque, uncontrollable system just to win a hypothetical “winner-takes-all” scenario is not a sound technological strategy. It is reckless.</p>

<h1 id="conclusion">Conclusion</h1>

<p>Generality is not a magical spark waiting to be ignited inside a massive GPU cluster. Broad, robust capability is a system-level property, achieved through the careful, safe composition of specialized tools. We don’t need AGI to build highly capable systems, and we certainly shouldn’t be gambling our future on the hope that the first ones to reach the hard takeoff threshold will be whoever we consider to be “the good ones”.</p>]]></content><author><name>Xavier</name></author><category term="Artificial Intelligence" /><category term="AGI" /><category term="Machine Learning" /><category term="Philosophy" /><category term="LLMs" /><category term="Agents" /><summary type="html"><![CDATA[If you have been following my journey for a while, you’re probably aware of my pragmatic approach to AI capabilities and my skepticism towards the surrounding hype. Not too long ago, during my time at Google, I found myself sitting next to someone at an event, and the conversation inevitably turned to AI. I tend to be pretty candid about my skepticism regarding Artificial General Intelligence (AGI), so I launched right into it. I laid out my entire thesis: why the term is a misnomer, why benchmarking against human cognition is a fallacy, and why the pursuit of a monolithic “God model” is bad engineering.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://amatria.in/blog/blog/images/124-0.png" /><media:content medium="image" url="https://amatria.in/blog/blog/images/124-0.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Recommending in the Age of AI: How we got here and what comes next - My Recsys 2025 keynote</title><link href="https://amatria.in/blog/recsyskeynote" rel="alternate" type="text/html" title="Recommending in the Age of AI: How we got here and what comes next - My Recsys 2025 keynote" /><published>2026-01-24T00:00:01+00:00</published><updated>2026-01-24T00:00:01+00:00</updated><id>https://amatria.in/blog/recsyskeynote</id><content type="html" xml:base="https://amatria.in/blog/recsyskeynote"><![CDATA[<p>This blog post is a detailed summary of my recent keynote at ACM RecSys 2025 in Prague. You can watch the full video <a href="https://www.youtube.com/watch?v=TlR7douxQRM">here</a>.</p>

<p><img src="/blog/images/123-0.png" /></p>

<p>I’ve been involved with RecSys for a long time. This keynote was my 11th. I attended the first six, so I was there in the early days—and I’ve watched the field repeatedly reinvent itself.</p>

<p>One of my favorite personal “RecSys origin stories” is that when I transitioned from academia to industry, I found my job through this community. In Barcelona 2010, I started conversations that led me to Netflix, and eventually to many other things. I look around today and see people who have been interns with me and then followed similar paths. That’s part of what makes this community special: it’s a rare intersection of industry + academia + practitioners, with a shared obsession not only for algorithms, but for product, users, and psychology.</p>

<p>This talk had three parts:</p>
<ul>
  <li><strong>How we got here</strong> (history, with some personal bias)</li>
  <li><strong>Recommending in the age of GenAI</strong> (the present)</li>
  <li><strong>What’s coming next</strong> (where I think we’re heading)</li>
</ul>

<h1 id="part-i--how-we-got-here">Part I — How We Got Here</h1>

<h2 id="movielens-v0-1997-and-the-field-becomes-a-field">MovieLens v0, 1997, and the “field becomes a field”</h2>

<p>Early in the talk I put up a screenshot that not everyone recognized: MovieLens v0 (published around 1997). For me, that interface is more than nostalgia. It’s a marker that a set of ideas turned into a recognizable field—built by Joe Konstan, the late John Riedl, and the rest of the University of Minnesota team.</p>

<p>It’s also why the first RecSys conference was held in Minneapolis—and why going back there feels like a loop closing and reopening.</p>

<h3 id="ai-history-intertwined-with-recsys-history">AI history, intertwined with RecSys history</h3>

<p>I deliberately intertwined recommender systems history with the history of AI, because the two have been co-evolving for decades:</p>

<p><img src="/blog/images/123-1a.png" /></p>

<ul>
  <li><strong>1950s</strong>: “Artificial Intelligence” is coined at Dartmouth; Rosenblatt publishes the perceptron paper.</li>
  <li><strong>1969 → 1970s</strong>: Minsky’s critique leads to the first AI winter.</li>
  <li><strong>1980s</strong>: expert systems become fashionable again; then people rediscover their brittleness and scaling limits.</li>
  <li><strong>1987–1993</strong>: another AI winter.</li>
  <li><strong>1997</strong>: MovieLens, early RS papers.</li>
  <li><strong>2006–2009</strong>: Netflix Prize (we’ll spend time here).</li>
  <li><strong>2007</strong>: RecSys conference starts (on the heels of Netflix Prize energy).</li>
  <li><strong>2011–2016</strong>: deep learning momentum hits recommender systems (YouTube DL recommender paper is a major moment).</li>
  <li><strong>2017</strong>: Transformers (“Attention is All You Need”).</li>
</ul>

<p>This timeline matters because it shows a pattern: RS progresses when model capability, data availability, and product surfaces line up—and stalls (or misleads us) when we optimize the wrong abstractions.</p>

<h3 id="netflix-prize-a-turning-point-and-a-lesson-about-proxies">Netflix Prize: a turning point, and a lesson about proxies</h3>

<p>The Netflix Prize (2006–2009) was pre-Kaggle, pre-everything we now take for granted. It was a massive public experiment. The goal was framed as “better recommendations,” but the proxy objective was explicit: improve RMSE on rating prediction by 10%, win $1M.</p>

<p>The winning solution was instructive:</p>
<ul>
  <li>It was an ensemble (as usual).</li>
  <li>It combined 104 models using a neural network.</li>
  <li>The “main” approaches were a matrix factorization / SVD variant and restricted Boltzmann machines (a neural net).</li>
</ul>

<p>Then came the part I think many people remember less clearly: we took the work back to Netflix and asked, “Can we productionize it?” The answer was: not as-is.</p>
<ul>
  <li>104 models didn’t scale well: the ensemble was too slow and too complicated to productionize.</li>
  <li>More importantly: while we were doing that translation work, we realized something deeper: the objective itself (RMSE on ratings) was not the right question.</li>
</ul>

<p>We did productionize SVD and RBMs—they were the first ML algorithms that went into Netflix’s product. But the Netflix Prize still taught a durable lesson: You can win the benchmark and still lose the product. Or, more precisely: your offline proxy can be “correct” and still be wrong.</p>

<h3 id="from-algorithms-to-machine-learning-to-ai">From “algorithms” to “machine learning” to “AI”</h3>

<p>Back then, we didn’t even say “machine learning.” We said algorithms. My team at Netflix was literally called Algorithms Engineering. Over time, the naming shifted: algorithms → ML → AI. That branding shift wasn’t just marketing; it reflected real changes in how systems were built and what people expected of them.</p>

<p>I used a simple example to make this concrete: the most basic personalized recommender you can build—almost comically basic by today’s standards.</p>
<ul>
  <li>Two features: Popularity, Predicted rating</li>
  <li>Two parameters: w1 and w2</li>
  <li>A linear model</li>
</ul>

<p>The task: learn w1 and w2 from user behavior data. It’s a useful toy model because it captures a core truth: the system is mostly the same loop, regardless of complexity: choose features, choose model family, estimate parameters from data, measure whether it helped users.</p>
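
<p>In Python, that toy recommender fits in a few lines. The data here is hypothetical: assume a log of (popularity, predicted_rating, clicked) triples per impression:</p>

<pre><code class="language-python">import math

def train(examples, lr=0.1, epochs=200):
    """Learn w1 and w2 from behavior data; examples are (popularity, predicted_rating, clicked)."""
    w1, w2 = 0.0, 0.0
    for _ in range(epochs):
        for pop, pred, clicked in examples:
            p = 1.0 / (1.0 + math.exp(-(w1 * pop + w2 * pred)))  # predicted engagement probability
            err = p - clicked                                     # logistic-loss gradient
            w1 -= lr * err * pop
            w2 -= lr * err * pred
    return w1, w2

def rank(items, w1, w2):
    """Score items with the learned linear model: w1*popularity + w2*predicted_rating."""
    return sorted(items, key=lambda i: w1 * i["popularity"] + w2 * i["predicted_rating"], reverse=True)
</code></pre>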

<p><img src="/blog/images/123-1.png" /></p>

<p>We used to “advance ML” by adding more features and making models more complex. Feature engineering mattered and it required domain knowledge.</p>

<p>I gave a Quora example that still resonates with me: ranking answers for a question. It sounds obvious until you try to formalize it. We had to talk to editors and journalists about what “good” meant. They said: truthful, reusable, well-formatted, not too long, the right length. That became features. And then those features got learned.</p>

<p>That was “old-school” ML—though, honestly, we still do versions of it today.</p>

<h3 id="the-recommender-problem-evolved-rating--ranking--page--context">The recommender problem evolved: rating → ranking → page → context</h3>

<p>Another key arc: the problem definition evolved.</p>
<ul>
  <li><strong>Point-wise prediction</strong>: predict a rating (Netflix Prize era)</li>
  <li><strong>Ranking</strong>: learn to order items</li>
  <li><strong>Page optimization</strong>: optimize a full surface (rows, shelves, competing modules)</li>
  <li><strong>Context-aware</strong>: device, time of day, location, intent—more dimensions</li>
</ul>

<p><img src="/blog/images/123-2.png" /></p>

<p>This wasn’t an academic shift. It was driven by the reality that a product isn’t a “list.” It’s an environment. At this point in the talk I referenced two of my own prior contributions:</p>
<ul>
  <li>The Netflix work on “Beyond the five stars”, emphasizing why implicit feedback often beats explicit ratings for real-world optimization.</li>
  <li>The “Multiverse recommendation” work (published at RecSys in Barcelona 2010), which became my most cited RecSys paper—explicitly leaning into context-aware recommendation.</li>
</ul>

<h3 id="deep-learning-in-recommender-systems-two-tower-and-the-promise-of-representation-learning">Deep learning in recommender systems: two-tower (and the promise of representation learning)</h3>

<p>Then came deep learning’s major wave in RecSys—roughly 2011 onward—culminating in the “deep learning for YouTube recommendations” moment that hit this community hard.</p>

<p>To ground it, I showed the classic two-tower model:</p>
<ul>
  <li>user embedding tower</li>
  <li>item embedding tower</li>
  <li>dot product to score similarity / relevance</li>
</ul>

<p><img src="/blog/images/123-3.png" /></p>

<p>It’s not the best model, but it’s the right mental starting point. Even in 2014 we were already experimenting with distributed neural nets in production contexts. And by 2016–2017, I was explicitly framing this as “the recommender problem revisited,” because deep learning forced us to revisit assumptions about features, modeling capacity, and system architecture.</p>
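
<p>For readers who have never implemented one: a bare-bones two-tower scorer, stripped of everything that matters in production (training loop, feature pipelines, negative sampling), can be sketched in a few lines of NumPy. Layer sizes and feature vectors here are placeholders:</p>

<pre><code class="language-python">import numpy as np

def tower(features: np.ndarray, weights: list) -> np.ndarray:
    """A tiny MLP tower: stacked ReLU layers that map raw features to an embedding."""
    h = features
    for w in weights:
        h = np.maximum(h @ w, 0.0)
    return h / (np.linalg.norm(h) + 1e-8)  # normalize so the dot product behaves like cosine

def score(user_feats, item_feats, user_weights, item_weights) -> float:
    u = tower(user_feats, user_weights)    # user embedding tower
    v = tower(item_feats, item_weights)    # item embedding tower
    return float(u @ v)                    # dot product = relevance score
</code></pre>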

<p><strong>(Geeky rabbit hole — Deep learning “replaced feature engineering”… but RecSys had already been blending paradigms)</strong></p>

<p>Deep learning’s promise was: “stop hand-crafting features; the model learns representations.” But there’s a subtle connection to how recommenders already worked. Even matrix factorization is, in a sense, a hybrid of:</p>
<ul>
  <li>unsupervised structure learning (dimensionality reduction, latent factors)</li>
  <li>supervised signal (ratings, implicit feedback)</li>
</ul>

<p>We were already combining unsupervised and supervised approaches in clustering and latent-factor methods. Deep learning didn’t invent the idea; it industrialized it and scaled it—and then moved us more explicitly into self-supervision.</p>

<p>I tied this to the “multi-layer cake” framing of modern ML:</p>
<ul>
  <li>self-supervised pretraining</li>
  <li>supervised fine-tuning</li>
  <li>reinforcement learning / alignment as the “icing”</li>
</ul>

<p>This “layered training” view is something I’ve written about in the context of modern LLMs—especially the idea that “token prediction” alone undersells what post-pretraining adds.</p>

<h2 id="its-not-only-about-algorithms">It’s not only about algorithms</h2>

<p>At this point I paused and emphasized a point that’s easy to say and hard to operationalize: In recommender systems, the algorithm is rarely the whole system.</p>

<p>I summarized the “non-algorithm” pillars as:</p>
<ul>
  <li>UX / design</li>
  <li>Domain knowledge</li>
  <li>Evaluation metrics</li>
</ul>

<p><img src="/blog/images/123-4.png" /></p>

<p><strong>UX / design</strong>: I showed an early Netflix interface where the page was packed with explanations: predicted rating, “because you watched X”, actors, director, etc. Those explanations—and the way we presented choices—often mattered as much as the model. This also connects to why Netflix ultimately moved from stars to thumbs: the UX and the feedback mechanism are part of the learning loop.</p>

<p><strong>Domain knowledge</strong>: Even with deep learning, you still need domain knowledge—especially in constrained domains like healthcare. Constraints aren’t optional; they’re foundational.</p>

<p><strong>Evaluation metrics</strong>: You need offline and online evaluation. You must iterate fast with offline proxies, validate with online experiments, and connect short-term metrics to long-term satisfaction/retention. I cited a memorable result from a YouTube team study: long-term satisfaction was causally linked not merely to “more consumption,” but to diversity of content consumed. If you get people to consume a more diverse set of content, they tend to be more satisfied in the long run. That finding matters because it’s a reminder that “maximize clicks” is not the same thing as “maximize sustained satisfaction.”</p>

<h1 id="part-ii--recommending-in-the-age-of-genai">Part II — Recommending in the age of GenAI</h1>

<p>Bill Gates wrote, “the age of AI has begun.” I used that line to mark the present moment—because Transformers, LLMs, and GenAI changed both the research conversation and the product conversation.</p>

<h2 id="two-parameters--trillions-of-parameters">Two parameters → trillions of parameters</h2>

<p>I showed a plot: transformer research families over time, parameter counts rising from ~100M to beyond a trillion. It’s worth repeating the contrast because it captures the discontinuity: earlier I showed a recommender with two parameters (popularity weight and predicted-rating weight). Now we’re in a world where models have trillions of parameters. All of those parameters still get learned from data—just through a very different pipeline.</p>

<p><img src="/blog/images/123-4b.png" /></p>

<h2 id="even-research-impact-got-weird">Even “research impact” got weird</h2>

<p>I mentioned how my citations changed dramatically in 2024–2025 because I posted three arXiv works: an LLM survey, a prompt design/engineering publication, and my transformer catalog. They weren’t even peer-reviewed in the traditional sense—yet the field’s attention was so concentrated that the impact was immediate. That’s not a moral argument; it’s an observation about attention allocation in the current research ecosystem.</p>

<h2 id="how-genai-is-already-changing-recommender-systems">How GenAI is already changing recommender systems</h2>

<p>I gave three examples (largely from Google) to illustrate trends:</p>
<ul>
  <li>LLMs for understanding preferences</li>
  <li>Generative retrieval</li>
  <li>Transformers applied to content-heavy recommendation contexts (e.g., music)</li>
</ul>

<p>Then I returned again to the earlier triad (UX, domain knowledge, evaluation) and argued they still matter—but differently now:</p>
<ul>
  <li>UX and AI are now intertwined; sometimes the UX is the AI (chatbot-style discovery).</li>
  <li>Pretrained foundation models carry a lot of domain knowledge out of the box—but domain expertise still matters for constraints and evaluation.</li>
  <li>Evaluation is arguably more important now; measuring GenAI is hard, and feedback loops are subtle.</li>
</ul>

<p><strong>Demo 1: “basic LLM” recommendation from a handful of shows</strong></p>

<p>I demonstrated a simple prompt in Gemini: I gave it four Netflix shows I liked and asked for recommendations. What mattered wasn’t only the recommendation list—it was the “thinking trace” the model surfaced: identify attributes, extract themes, find common threads, craft categories, then produce options. And it worked. The model recommended:</p>
<ul>
  <li>Ozark (which I’d seen and liked)</li>
  <li>Mindhunter (which I hadn’t seen)</li>
  <li>Narcos (which I’d seen and liked)</li>
</ul>

<p>Also: it was different each time—both the blessing and the curse of generative systems.</p>

<p><strong>Demo 2: zero-history preference elicitation in five questions</strong></p>

<p>Then I made it more interesting: “assume you know nothing about me—ask me five yes/no questions and recommend five music artists I’ll like.” Again, the point wasn’t just the output. The model:</p>
<ul>
  <li>designed a question flow</li>
  <li>implicitly built a decision tree</li>
  <li>updated a “user profile” after each answer</li>
</ul>

<p>With five questions, it recommended: Tool, King Crimson, Animals as Leaders, Russian Circles, and Karnivool. I noted that some picks were a bit off for my taste—likely influenced by how I answered the “harsh vocals” question. But the more important observation remained: from a blank sheet, the system elicited preferences and produced plausible recommendations.</p>
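
<p>The underlying loop is simple enough to sketch. The two model calls (<code>ask_llm_for_next_question</code> and <code>recommend_from_profile</code>) are hypothetical placeholders for whatever LLM you would use; the point is the elicit-update-recommend structure:</p>

<pre><code class="language-python">def elicit_and_recommend(ask_user, n_questions=5, n_recs=5):
    profile = []  # running natural-language summary of what has been learned so far
    for _ in range(n_questions):
        question = ask_llm_for_next_question(profile)   # the model designs the next yes/no split
        answer = ask_user(question)                     # "yes" / "no"
        profile.append(f"{question} -> {answer}")       # implicit decision tree, one edge per answer
    return recommend_from_profile(profile, k=n_recs)    # e.g. five music artists
</code></pre>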

<h3 id="a-taxonomy-of-how-llms-enter-recsys">A taxonomy of how LLMs enter RecSys</h3>

<p>I referenced a diagram from the STAR paper that classifies approaches (see below):</p>
<ul>
  <li>pure prompting (like my demos)</li>
  <li>prompting with user history (user–item interactions)</li>
  <li>using LLMs to create semantic features/IDs/embeddings combined with collaborative signals</li>
  <li>two-LLM architectures: one for semantic features + CF, another LLM for final ranking</li>
</ul>

<p>The meta-point: “using LLMs” isn’t a single technique; it’s a design space.</p>

<p><img src="/blog/images/123-5.png" /></p>

<h1 id="part-iii--whats-next">Part III — What’s next</h1>

<p>I started this section with a screenshot from a startup called Fable: “Netflix for generative content.” The proposition is an extreme endpoint of personalization: not only recommending content but generating content, on the fly, personalized to each user. That’s “the last step” of personalization in one direction.</p>

<p>Then I returned again—intentionally—to the product triad:</p>
<ul>
  <li>UX design (especially in multimodal / agentic worlds)</li>
  <li>Domain knowledge + deep, continuous user knowledge</li>
  <li>Evaluation (now vastly harder)</li>
</ul>

<p><strong>Demo 3: a recommendation agent (Localify)</strong></p>

<p>I showed an agent built with Google AgentSpace. I called it “Localify.” Its job was simple: ask the user about tastes, search local events, and help find tickets. In the live demo, the agent didn’t ask the preference questions because it already “knew” my earlier answers (I had tested it). Based on what it remembered—rock, jazz, music, cinema—it recommended:</p>
<ul>
  <li>an indie rock concert</li>
  <li>a jazz evening</li>
  <li>an independent drama film</li>
</ul>

<p>Then it helped find a link for tickets. What I wanted to emphasize was how small the barrier has become:</p>
<ul>
  <li>the agent prompt was basic</li>
  <li>I used “help me write” and the LLM improved the prompt</li>
  <li>it took minutes, not weeks</li>
</ul>

<p>And if you want to make it more powerful, you can connect it to tools: calendar, email, enterprise systems, backend databases, …and even (dangerously) payment.</p>

<p><strong>(AI digression — the moment you add tools, you import responsibility)</strong></p>

<p>In one of my posts, I put it bluntly: AI is great for organizing/analyzing data, but it doesn’t have “gut,” intuition, or accountability—and that’s precisely why human judgment remains central. That maps directly to agents: the moment an agent can act, UX design and safety constraints stop being secondary concerns.</p>

<h2 id="agents-that-browse-the-web-and-recommend">Agents that browse the web and recommend</h2>

<p>I then showed a more advanced agent concept (Project Mariner): it can browse the web on your behalf—scroll, click, match opportunities to your resume, and execute a multi-step flow. The only additional capability (conceptually) is huge: delegated navigation in human UIs.</p>

<h2 id="world-models-genie-3-and-generated-reality">World models (Genie 3) and “generated reality”</h2>

<p>I showed a clip of “Genie 3,” positioning it as a frontier: not just generating text or images, but generating interactive worlds, with real-time reactivity, “world memory” (actions persist), and promptable events. This opens a window to a future where “personalized media” is not just personalized content—it’s personalized environments.</p>

<h2 id="deep-continuous-user-knowledge-the-personalization-paradox">Deep, continuous user knowledge: the personalization paradox</h2>

<p>LLMs have huge world knowledge; what’s still hard is injecting knowledge about you—accurately, safely, continuously. I showed a Gemini direction: more persistent memory so you don’t need to repeat “I like jazz and indie cinema” every time. That’s the personalization paradox: <strong>the model knows the world, it still struggles to know you (and to update that knowledge responsibly)</strong>.</p>

<h2 id="research-directions-i-highlighted">Research directions I highlighted</h2>

<p>I ended with a set of recent papers (three examples) illustrating trends:</p>
<ul>
  <li>aligning LLM-powered systems to user feedback (and novelty)</li>
  <li>serendipity / novelty with multimodal signals</li>
  <li>hybrid strategies that combine fine-tuning (infrequent) with RAG (more frequent) to keep user modeling fresh without constantly retraining</li>
</ul>

<h1 id="conclusion-a-journey-from-clicks-to-conversations">Conclusion: a journey from clicks to conversations</h1>

<p>I closed with a framing I’d encourage you to keep in mind as you design systems in 2026 and beyond: We’ve revisited the recommender problem multiple times. We started with predicting stars, then clicks. We’re shifting into conversations, and now agents—long-running, tool-using systems that discover on our behalf.</p>

<p>My current bets are:</p>
<ul>
  <li><strong>Agents are the future of discovery</strong>. They’ll search, filter, and propose options in the background, then surface novel things for us to engage with.</li>
  <li><strong>Personalization will remain the hard part</strong>. World knowledge scales. “User knowledge” is messy, dynamic, private, and consequential.</li>
  <li><strong>Evaluation is the new frontier</strong>. Especially for long-running, multi-step systems where value accrues over time and failure modes are subtle.</li>
  <li><strong>The ultimate prize might be “media of one.”</strong> Content not only discovered for you—but created for you, on the fly, personalized to what you want right now.</li>
</ul>

<p>And, because this is RecSys: karaoke remains a constant—and apparently so do 7am runs.</p>

<h1 id="qa-moments-worth-carrying-forward">Q&amp;A moments worth carrying forward</h1>

<p>A few audience questions surfaced important tensions:</p>

<p><strong>Recommend from catalog vs generate unique items?</strong> The cultural value of shared artifacts matters. If everyone gets a different show, what happens to shared conversation? My instinct: we’ll find hybrid dynamics—personalized creation plus social sharing (you can “send” your generated show).</p>

<p><strong>Will users really have long conversations vs passive feeds?</strong> Different modes will coexist. There are “brain-dead scroll” moments and “high-ROI search” moments (finding the next book vs watching a 30-second clip). The adoption of chat products is a strong counterexample to the idea that people never want conversational interfaces.</p>

<p><strong>If agents consume content, what incentives remain for creators?</strong> No clean answer yet. But historically, new creation tools tend to democratize creation rather than end it—and we should proactively design ecosystems that keep human creativity rewarded and visible.</p>

<h1 id="references">References</h1>

<ul>
  <li><strong>Amatriain, X.</strong> (2025). <a href="https://www.youtube.com/watch?v=TlR7douxQRM">Keynote at ACM RecSys 2025</a></li>
  <li><strong>Konstan, J. A., et al.</strong> (1997). <a href="https://dl.acm.org/doi/10.1145/245108.245126">GroupLens: Applying collaborative filtering to Usenet news</a></li>
  <li><strong>Koren, Y.</strong> (2009). <a href="https://www2.seas.gwu.edu/~simhaweb/champalg/cf/papers/KorenBellKor2009.pdf">The BellKor Solution to the Netflix Grand Prize</a></li>
  <li><strong>Karatzoglou, A., Amatriain, X., Baltrunas, L., &amp; Oliver, N.</strong> (2010). <a href="https://dl.acm.org/doi/10.1145/1864708.1864727">Multiverse Recommendation: N-dimensional Tensor Factorization for Context-aware Collaborative Filtering.</a></li>
  <li><strong>Amatriain, X. &amp; Basilico, J.</strong> (2012). <a href="https://netflixtechblog.com/netflix-recommendations-beyond-the-5-stars-part-1-55838468f429">Netflix Recommendations: Beyond the 5 stars.</a></li>
  <li><strong>Le, Q. V., et al.</strong> (2012). <a href="https://arxiv.org/abs/1112.6209">Building high-level features using large scale unsupervised learning (The “Cat” paper).</a></li>
  <li><strong>Covington, P., et al.</strong> (2016). <a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45530.pdf">Deep Neural Networks for YouTube Recommendations.</a></li>
  <li><strong>Vaswani, A., et al.</strong> (2017). <a href="https://arxiv.org/abs/1706.03762">Attention Is All You Need</a>.</li>
  <li><strong>Amatriain, X.</strong> (2024). <a href="https://arxiv.org/abs/2302.07730">Transformer models: an introduction and catalog.</a></li>
  <li><strong>Minaee, S., et al.</strong> (2024). <a href="https://arxiv.org/abs/2402.06196">Large Language Models: A Survey.</a></li>
  <li><strong>Amatriain, X.</strong> (2024). <a href="https://arxiv.org/abs/2401.14423">Prompt Design and Engineering: Introduction and Advanced Methods.</a></li>
  <li><strong>Lee, D., et al.</strong> (2024) <a href="https://arxiv.org/abs/2410.16458">STAR: A Simple Training-free Approach for Recommendations using Large Language Models</a></li>
  <li><strong>Wang, J., et al</strong> (2025) <a href="https://arxiv.org/abs/2504.05522">User Feedback Alignment for LLM-powered Exploration in Large-scale Recommendation Systems</a></li>
  <li><strong>Meng, C., et al.</strong> (2025) <a href="https://arxiv.org/abs/2510.20260">Balancing Fine-tuning and RAG: A Hybrid Strategy for Dynamic LLM Recommendation Updates</a></li>
</ul>]]></content><author><name>Xavier</name></author><category term="Recsys" /><category term="Artificial Intelligence" /><category term="Recommender Systems" /><category term="Machine Learning" /><summary type="html"><![CDATA[This blog post is a detailed summary of my recent keynote at ACM RecSys 2025 in Prague. You can watch the full video here.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://amatria.in/blog/blog/images/123-0.png" /><media:content medium="image" url="https://amatria.in/blog/blog/images/123-0.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Abundance, Anxiety, AI, and Algorithms: An ‘A-List’ of Books for Decoding the Modern World</title><link href="https://amatria.in/blog/recent-books-2026" rel="alternate" type="text/html" title="Abundance, Anxiety, AI, and Algorithms: An ‘A-List’ of Books for Decoding the Modern World" /><published>2026-01-02T00:00:01+00:00</published><updated>2026-01-02T00:00:01+00:00</updated><id>https://amatria.in/blog/books</id><content type="html" xml:base="https://amatria.in/blog/recent-books-2026"><![CDATA[<p>It has been quite a while since I last shared a reading list on this blog. In the fast-paced world of technology, it’s easy to get caught up in the stream of papers and newsletters, but I’ve always found that books provide the necessary depth and historical context to truly understand where we are heading.
As I sat down to synthesize my recent readings, I realized that the core themes converged into what I’ve started calling my “A-List” of recent books: Abundance, Anxiety, AI, and Algorithms. It’s a playful alliteration, but one that captures the profound tension between the potential for technological plenty and the societal costs we are only beginning to calculate. Over the past few months, I’ve been diving into a diverse set of titles that help decode these four forces, categorizing them into AI, Tech, Leadership, and “Other Important Ideas.”</p>

<p><img src="/blog/images/122-0.png" /></p>

<h3 id="ai-and-the-ai-revolution">AI and the AI Revolution</h3>

<p>The current revolution isn’t just about code; it’s about the fundamental nature of intelligence and the hardware that powers it.</p>

<p><img src="/blog/images/122-1.png" /></p>

<ul>
  <li><a href="https://whatisintelligence.antikythera.org/">“What is Intelligence: Lessons from AI about Evolution, Computing, and Mind” by Blaise Aguera y Arcas</a>: My former colleague at Google, Blaise, delivers an incredible deep dive into the meaning of life through the lens of computation. It’s a profound look at how AI helps us redefine what it means to be alive and intelligent.</li>
  <li><a href="https://www.amazon.com/Brief-History-Intelligence-Humans-Breakthroughs/dp/0063286343">“A Brief History of Intelligence:: Evolution, AI, and the Five Breakthroughs That Made Our Brains” by Max S. Bennett</a>: A great companion to Blaise’s book, focusing on the five breakthroughs that shaped our brains.</li>
  <li><a href="https://www.amazon.com/Chip-War-Worlds-Critical-Technology/dp/1982172002">“Chip War: The Fight for the World’s Most Critical Technology” by Chris Miller</a>: To understand the AI revolution, you must understand the silicon. Miller provides a fascinating look at how companies like TSMC, ASML, and NVIDIA reached their current dominance and how these dynamics are shaping global geopolitics.</li>
  <li><a href="https://www.amazon.com/Nvidia-Way-Jensen-Huang-Making/dp/1324086718">“The Nvidia Way: Jensen Huang and the Making of a Tech Giant”)</a>: For those of us living and breathing the AI revolution from the inside, this history of NVIDIA is essential reading. It serves as a perfect companion to Chris Miller’s Chip Wars; while Miller provides the macro-perspective of the silicon landscape, this book dives deep into the specific company culture and technical bets that allowed NVIDIA to dominate that landscape.</li>
  <li><a href="https://www.amazon.com/Optimist-Altman-OpenAI-Invent-Future/dp/1324075961">“The Optimist: Sam Altman, OpenAI, and the Race to Invent the Future”</a>: More than an authorized biography, this is a detailed history of the characters and moments that shaped OpenAI and modern Silicon Valley.</li>
  <li><a href="https://www.amazon.com/Singularity-Nearer-Ray-Kurzweil-ebook/dp/B08Y6FYJVY">“The Singularity is Nearer” by Ray Kurzweil</a>: A long-awaited update to his classic thesis on the merging of human and machine.</li>
  <li><a href="https://www.amazon.com/Worlds-See-Curiosity-Exploration-Discovery-ebook/dp/B0BPQSLVL6">“The Worlds I See” by Dr. Fei-Fei Li</a>: A beautiful memoir about curiosity and the dawn of modern AI from one of the field’s most important pioneers.</li>
  <li><a href="https://www.amazon.com/dp/059373422X">“Nexus” by Yuval Noah Harari</a>: Harari looks at information networks from the Stone Age to AI, providing his usual sweeping historical perspective.</li>
  <li><a href="https://www.amazon.com/Co-Intelligence-Living-Working-Ethan-Mollick/dp/059371671X">“Co-Intelligence” by Ethan Mollick</a>: One of the most practical guides out there for actually living and working alongside AI today.</li>
</ul>

<h3 id="tech-and-its-people">Tech and Its People</h3>

<p>Understanding tech often requires understanding the “DNA” of the institutions and individuals that built it.</p>

<p><img src="/blog/images/122-2.png" /></p>

<ul>
  <li><a href="https://www.amazon.com/Idea-Factory-Great-American-Innovation/dp/0143122797">“The Idea Factory” by Jon Gertner</a>: A look back at Bell Labs, the original powerhouse of American innovation.</li>
  <li><a href="https://www.amazon.com/Elon-Musk-Walter-Isaacson/dp/1982181281">“Elon Musk” by Walter Isaacson</a>: Regardless of your personal opinion of him, Isaacson’s biography explains a lot about his trajectory and the “demon mode” that drives his companies.</li>
  <li><a href="https://www.amazon.com/This-Everyone-Unfinished-Story-World/dp/0374612463">“This is For Everyone” by Tim Berners-Lee</a>: I stumbled upon this recently. It’s the story of the WWW told by the man who invented it—covering the past, present, and his vision for the future.</li>
  <li><a href="https://www.amazon.com/Plex-Google-Thinks-Works-Shapes/dp/1416596585">“In the Plex” by Steven Levy</a>: Even though it’s missing the last 15 years, it remains one of the best books for understanding Google’s foundational culture.</li>
  <li><a href="https://www.amazon.com/Careless-People-Cautionary-Power-Idealism/dp/1250391237">“Careless People: A Cautionary Tale of Power, Greed, and Lost Idealism by Sarah Wynn-Williams”</a>: A raw, personal account of Meta’s cultural stumbles from someone who had a front-row seat to the internal dynamics.</li>
  <li><a href="https://www.amazon.com/Source-Code-Beginnings-Bill-Gates/dp/059380158X">“Source Code: My Beginnings” by Bill Gates</a>: A fascinating look at the early life of Bill Gates and the birth of Microsoft. I learned quite a bit I didn’t know about his early years.</li>
  <li><a href="https://www.amazon.com/Pattern-Breakers-Start-Ups-Change-Future/dp/1541704355">“Pattern Breakers: Why Some Start-Ups Change the Future” by Mike Maples Jr. and Peter Ziebelman</a>: An insightful look at why some startups manage to change the future while most fail.</li>
</ul>

<h3 id="leadership-culture-and-human-nature">Leadership, Culture, and Human Nature</h3>

<p>As I discussed in my <a href="https://amatria.in/blog/challengeinspire">Challenge-Inspire model post</a>, leadership is about more than just task management; it’s about understanding the human element.</p>

<p><img src="/blog/images/122-3.png" /></p>

<ul>
  <li><a href="https://www.amazon.com/Reset-How-Change-Whats-Working/dp/1668062097">“Reset: How to Change What’s Not Working” by Dan Heath</a>: This is a very practical book on changing systems—like our teams—that aren’t working optimally. It focuses on finding leverage points to drive real change. Highly recommended.</li>
  <li><a href="https://www.amazon.com/Laws-Human-Nature-Robert-Greene/dp/0525428143">“The Laws of Human Nature” by Robert Greene</a>: A comprehensive guide to understanding behavior and communication. It’s dense with historical references and provides great advice on how to bring out the best in people.</li>
  <li><a href="https://www.amazon.com/Supercommunicators-Unlock-Secret-Language-Connection/dp/0593243919]">“Supercommunicators” by Charles Duhigg</a>: A recent bestseller that provides a great framework for connecting with others.</li>
  <li><a href="https://www.amazon.com/How-Know-Person-Seeing-Others/dp/059323006X">“How to Know a Person” by David Brooks</a>: A very personal guide on how to foster deeper connections. Brooks acknowledges his own challenges in this area, which makes his tips and advice feel very grounded and earned.</li>
  <li><a href="https://www.amazon.com/First-Break-All-Rules-Differently/dp/0684852861">“First, Break All the Rules” by Marcus Buckingham</a>: A classic that still holds up regarding what great managers do differently.</li>
  <li><a href="https://www.amazon.com/How-Decide-Simple-Making-Choices/dp/0593418484">“How to Decide” by Annie Duke</a>: More tools for the decision-making toolkit, which you know is a <a href="https://amatria.in/blog/datagutdecisions">favorite topic of mine</a>.</li>
  <li><a href="https://www.amazon.com/Start-Why-Leaders-Inspire-Everyone/dp/1591846447">“Start with Why” by Simon Sinek</a>: A foundational text on how great leaders inspire action.</li>
</ul>

<h3 id="other-important-ideas">Other Important Ideas</h3>

<p>Finally, a few books that have challenged my perspective on the broader world.</p>

<p><img src="/blog/images/122-4.png" /></p>

<ul>
  <li><a href="https://www.amazon.com/Capital-Twenty-Century-Thomas-Piketty/dp/067443000X">“Capital in the 21st Century” by Thomas Piketty</a>: This was proposed for a book club, and at 700 pages, it was daunting. However, I was pleasantly surprised. It’s a deeply researched work on macroeconomics and inequality that provides the essential historical background and context for more contemporary shifts.</li>
  <li><a href="https://www.amazon.com/Abundance-Progress-Takes-Ezra-Klein/dp/1668023482">“Abundance” by Ezra Klein</a>: With Piketty’s historical lens in place, Klein’s “Abundance” is a fascinating read. While it may appear techno-optimistic on the surface, its main lesson for me was how easily well-intentioned policies can falter when built on incorrect assumptions about the future—a critical takeaway as we navigate a world reshaped by AI and automation.</li>
  <li><a href="https://www.amazon.com/Anxious-Generation-Rewiring-Childhood-Epidemic/dp/0593655036">“The Anxious Generation” by Jonathan Haidt</a>: This is a critical look at how the “rewiring of childhood” via technology is impacting mental health. It serves as a stark counter-narrative to the techno-optimism of Ezra Klein; while Klein focuses on the potential for abundance, Haidt exposes the very real dangers and social costs of unbridled technological adoption. The book has become extremely influential, acting as a catalyst for new legislation and policy changes regarding smartphone and social media usage for minors around the world.</li>
  <li><a href="https://www.amazon.com/End-World-Just-Beginning-Globalization/dp/006323047X">“The End of the World is Just the Beginning” by Peter Zeihan</a>: A provocative mapping of the potential collapse of globalization. He argues that the era of global trade and secure transport is a historical outlier that is rapidly ending due to demographic shifts and changing US policy. It provides a broader, macro-strategic complement to Miller’s Chip Wars; while Miller focuses on the specific geopolitical struggle over silicon, Zeihan maps the decaying global order that makes that struggle so volatile.</li>
  <li><a href="https://en.wikipedia.org/wiki/The_Vital_Question">“The Vital Question” by Nick Lane</a>: A technical but rewarding explanation of how life came to be, focusing on energy constraints rather than just information. Lane argues that the leap from simple to complex life was a rare, energetic fluke, requiring a level of power that standard evolution struggled to achieve. The book has received immense critical acclaim, most notably from Bill Gates, who famously claimed it was the best book he’d read in years and that it would “help people understand that energy is as fundamental as information.”</li>
</ul>

<p>I hope you find something in this list that sparks your curiosity. As always, I’m curious to hear what you’ve been reading. Are there any books that have fundamentally shifted your perspective lately? Let me know in the comments!</p>]]></content><author><name>Xavier</name></author><category term="Books" /><category term="Artificial Intelligence" /><category term="Leadership" /><category term="Technology" /><summary type="html"><![CDATA[It has been quite a while since I last shared a reading list on this blog. In the fast-paced world of technology, it’s easy to get caught up in the stream of papers and newsletters, but I’ve always found that books provide the necessary depth and historical context to truly understand where we are heading. As I sat down to synthesize my recent readings, I realized that the core themes converged into what I’ve started calling my “A-List” of recent books: Abundance, Anxiety, AI, and Algorithms. It’s a playful alliteration, but one that captures the profound tension between the potential for technological plenty and the societal costs we are only beginning to calculate. Over the past few months, I’ve been diving into a diverse set of titles that help decode these four forces, categorizing them into AI, Tech, Leadership, and “Other Important Ideas.”]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://amatria.in/blog/blog/images/122-0.png" /><media:content medium="image" url="https://amatria.in/blog/blog/images/122-0.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">2025: When AI Started to Act (and ‘SOTA’ Lasted a Week)’</title><link href="https://amatria.in/blog/2025-review" rel="alternate" type="text/html" title="2025: When AI Started to Act (and ‘SOTA’ Lasted a Week)’" /><published>2025-12-21T00:00:01+00:00</published><updated>2025-12-21T00:00:01+00:00</updated><id>https://amatria.in/blog/2025-review</id><content type="html" xml:base="https://amatria.in/blog/2025-review"><![CDATA[<h2 id="a-year-of-reasoning-agents-and-compressed-innovation-cycles">A year of reasoning, agents, and compressed innovation cycles</h2>

<p>If 2024 was the year of the chatbot, 2025 has been the year AI started to think—or at least, the year we started debating what “thinking” really means. It has been a year of profound shifts: from simple instruction following to complex reasoning, from “vibes” to verifiable actions, and from general-purpose models to specialized agents. It has also been a year of significant change for me personally, as I moved on from Google to explore new challenges.</p>

<p>This recap is based on a set of monthly reports I wrote throughout 2025, originally as a way to keep track of the fast‑moving AI landscape for myself and for my team at Google. Over time, those notes became a disciplined way to separate signal from noise and to understand how individual launches and papers fit into a broader trajectory.</p>

<p>2025 turned out to be a year of both acceleration and recalibration. Breakthroughs in reasoning models, agents, multimodality, and efficiency continued at a remarkable pace, while questions around economics, safety, and real‑world impact became harder to ignore. Stepping back each month made it easier to see which themes truly mattered.</p>

<p><img src="/blog/images/121-0.jpeg" /></p>

<p>What follows is a month‑by‑month synthesis of those notes originally gathered for my internal newsletter at Google, curated and lightly expanded for this recap. The goal is not to be exhaustive, but to capture how the year actually unfolded. That said, let me know if I’m missing something important. I am also including direct links to each month in case you have a favorite one!</p>

<ul>
  <li><a href="#jan">January: The Efficiency Shock and the “Peak Data” Debate</a></li>
  <li><a href="#feb">February: The “DeepSeek Moment” and Grok’s Rise</a></li>
  <li><a href="#mar">March: “Agentic Moore’s Law”</a></li>
  <li><a href="#apr">April: Peering Inside the Box</a></li>
  <li><a href="#may">May: Ecosystem Moves and “Vibe Coding”</a></li>
  <li><a href="#jun">June: Superintelligence and “AI-nxiety”</a></li>
  <li><a href="#jul">July: Agents Getting Real and Practical</a></li>
  <li><a href="#aug">August: The “Bubble” Panic and GPT-5’s Arrival</a></li>
  <li><a href="#sep">September: “Nano Banana” and Scientific Breakthroughs</a></li>
  <li><a href="#oct">October: Economic Value over Benchmarks</a></li>
  <li><a href="#nov">November: The New SOTA Battleground</a></li>
  <li><a href="#dec">December: Code Red and the Grand Finale</a></li>
  <li><a href="#wrong">What I Got Wrong in 2025</a></li>
  <li><a href="#bonus">Bonus: A Deeper Dive</a></li>
  <li><a href="#conclusion">Conclusion</a></li>
</ul>

<h3 id="-january-the-efficiency-shock-and-the-peak-data-debate"><a name="jan"></a> January: The Efficiency Shock and the “Peak Data” Debate</h3>
<p>The year began with a wake-up call on efficiency. DeepSeek R1 made headlines not just for its performance, but for its remarkably low training cost, sparking a massive debate on whether we were entering a new era of “Post-Training” efficiency. While some celebrated this as a democratization moment, it’s important to read the fine print: techniques like distillation and fine-tuning are powerful, but they often rely on the existence of larger, more expensive frontier models. At the same time, we saw healthy skepticism emerge regarding agents, with papers arguing that for simple tasks, we might be over-engineering solutions.</p>

<p><img src="/blog/images/121-1.jpeg" /></p>

<ul>
  <li><a href="https://arxiv.org/abs/2501.12948">DeepSeek R1 reasoning model</a> - The paper that started the year’s obsession with reasoning efficiency.</li>
  <li><a href="https://novasky-ai.github.io/posts/sky-t1/">Sky-T1</a> - Claiming 01-like performance on a mere $450 budget (distillation is key here!).</li>
  <li><a href="https://mistral.ai/news/codestral/">Mistral Codestral</a> - Mistral continuing to push the open-weight coding frontier.</li>
  <li><a href="https://arxiv.org/abs/2407.01489">Agentless</a> - A provocative paper arguing that for simple software engineering tasks, you might not actually need complex agents.</li>
  <li><a href="https://home.mlops.community/public/collections/agents-in-production-2024-2024-11-15">Agents in Production Talks</a> - A good collection of talks on agents in production (yes, these exist!).</li>
  <li><a href="https://www.nvidia.com/en-us/ai-data-science/workstations/">Nvidia Digits</a> - Launching an AI desktop.</li>
  <li><a href="https://techweez.com/2024/11/15/ai-visionary-francois-chollet-exits-google-to-champion-next-gen-agi-challenges/">Francois Chollet Interview</a> - A must-watch discussion on AGI and how to measure it.</li>
  <li><a href="https://www.youtube.com/watch?v=9vM4p9NN0Ts">Stanford Lecture on LLMs</a> - Covering the often overlooked basics: tokenization, data, and evals.</li>
  <li><a href="https://www.deeplearning.ai/the-batch/ai-product-managers-will-be-in-demand/">Andrew Ng’s Newsletter</a> - Arguing that AI PMs are the future of software development teams.</li>
  <li><a href="https://www.ipsos.com/en-us/google-ipsos-multi-country-ai-survey-2025">Google Survey on State of AI</a> - An interesting multi-country pulse check.</li>
  <li>I compiled <a href="https://amatria.in/blog/2024research">my favorite 2024 AI papers</a> and shared <a href="https://amatria.in/blog/ageofdata">my view</a> on “Peak Data” and “Scaling Laws”.</li>
</ul>

<p>January set the stage for a year where data quality, reasoning reliability, and evaluation rigor would repeatedly resurface as central constraints rather than secondary concerns.</p>

<h3 id="-february-the-deepseek-moment-and-groks-rise"><a name="feb"></a> February: The “DeepSeek Moment” and Grok’s Rise</h3>

<p>February was dominated by the aftermath of DeepSeek and the surprise arrival of Grok 3. We spent weeks dissecting the DeepSeek papers, realizing that while it wasn’t necessarily a fundamental research breakthrough, it was a fantastic engineering feat that cleverly optimized known techniques. Meanwhile, Grok 3’s arrival—developed by a small team in just 18 months—shook up the leaderboards. We also began to realize that benchmarks are becoming increasingly “gameable” via test-time compute, making “Price vs. Performance” the new metric that matters.</p>

<p><img src="/blog/images/121-2.jpeg" /></p>

<ul>
  <li><a href="https://x.ai/blog/grok-3">Grok 3</a> - Developed in just 18 months by a small team, hitting Rank 1 on Chatbot Arena.</li>
  <li><a href="https://lmarena.ai/?leaderboard">Imarena Price Plot</a> - The new “Arena-Price Plot” became the most important chart for a few weeks.</li>
  <li><a href="https://semianalysis.com/2025/01/31/deepseek-debates/">Semianalysis: DeepSeek Debates</a> - A deep dive into Chinese leadership on cost and true training margins.</li>
  <li><a href="https://stratechery.com/2025/deepseek-faq/">Stratechery’s DeepSeek FAQ</a> - Ben Thompson’s breakdown of the situation.</li>
  <li><a href="https://openai.com/index/introducing-operator/">OpenAI Operator</a> and <a href="https://openai.com/index/introducing-deep-research/">Deep Research</a> - OpenAI’s response, pushing into agentic research.</li>
  <li><a href="https://www.perplexity.ai/hub/blog/introducing-perplexity-deep-research">Perplexity Deep Research</a> - Perplexity quickly following suit with their own implementation.</li>
  <li><a href="https://humanityslastexam.com/">Humanity’s Last Exam</a> - A benchmark attempting to evaluate true reasoning capabilities.</li>
  <li><a href="https://cerebras.ai/blog/mistral-le-chat">Mistral “Fastest Chatbot”</a> - Powered by Cerebras hardware.</li>
  <li><a href="https://developer.nvidia.com/blog/automating-gpu-kernel-generation-with-deepseek-r1-and-inference-time-scaling/">AI Assisted CUDA Kernels</a> - Nvidia and Sakana.ai showing how AI can write low-level code.</li>
  <li><a href="https://www.google.com/search?q=https://blog.google/technology/ai/google-gemini-next-generation-model-february-2025/%23gemini-2.0-flash">Google Gemini 2.0 Flash</a> - Competitive pressure driving efficiency.</li>
</ul>

<p><img src="/blog/images/121-2b.jpeg" /></p>

<p>February made clear that efficiency breakthroughs amplify—not reduce—the need for transparency, reproducibility, and robust evaluation.</p>

<h3 id="-march-agentic-moores-law"><a name="mar"></a> March: “Agentic Moore’s Law”</h3>

<p>By March, the conversation shifted heavily toward Agents. We started seeing real data suggesting an “Agentic Moore’s Law,” where the length of tasks agents can solve autonomously is doubling roughly every 7 months. This was also the month Andrej Karpathy dropped his “Deep Dive,” reminding us that despite the hype, LLMs still struggle with basic token-level tasks (like counting ‘r’s in “strawberry”) and that prompting is still very much an art form.</p>

<p><img src="/blog/images/121-3.jpeg" /></p>

<ul>
  <li><a href="https://www.youtube.com/watch?v=7xTGNNLPyMI">Karpathy’s Deep Dive into LLMs</a> - Explaining why LLMs can’t count “r’s” in “strawberry” and why RLHF isn’t “true RL”.</li>
  <li><a href="https://arxiv.org/pdf/2503.14499">The “Agentic Moore’s Law”</a> - Interesting data showing the length of tasks agents can solve is doubling every 7 months.</li>
  <li><a href="https://sakana.ai/ai-scientist-first-publication/">Sakana AI Scientist</a> - Generating the first peer-reviewed scientific publication entirely by AI.</li>
  <li><a href="https://www.youtube.com/watch?v=K27diMbCsuw">Manus: The General AI Agent</a> - Another contender in the general agent space.</li>
  <li><a href="https://arxiv.org/abs/2503.01935">MultiAgentBench</a> - Evaluating how agents collaborate and compete.</li>
  <li><a href="https://techcrunch.com/2025/03/06/a-quarter-of-startups-in-ycs-current-cohort-have-codebases-that-are-almost-entirely-ai-generated/">YC AI Codebases</a> - A quarter of YC startups now have codebases that are almost entirely AI-generated.</li>
  <li><a href="https://www.youtube.com/watch?v=5WEcsg5jpDw">Interview Coder</a> - The viral tool (and cheating concern) for coding interviews.</li>
  <li><a href="https://www.youtube.com/watch?v=7j_NE6Pjv-E">Model Context Protocol (MCP)</a> - Why standardizing context matters for tools.</li>
  <li><a href="https://arxiv.org/abs/2502.09992">Large Language Diffusion Models</a> - Exploring diffusion for text generation.</li>
  <li><a href="https://techcrunch.com/2025/02/24/perplexity-teases-a-web-browser-called-comet/">Perplexity Comet</a> - A browser designed specifically for agentic search.</li>
  <li><a href="https://deepmind.google/discover/blog/gemini-robotics-brings-ai-into-the-physical-world/">Pushing AI into the physical world with Gemini Robotics</a></li>
  <li><a href="https://blog.google/technology/developers/gemma-3/">Introducing Gemma 3, highly capable for single GPU/TPU</a></li>
  <li><a href="https://blog.google/products/search/ai-mode-search/">Enhancements to Google Search with AI Overviews and a new AI Mode</a></li>
</ul>

<p><img src="/blog/images/121-3b.png" /></p>

<p>March previewed a broader shift toward autonomy and structured reasoning that would accelerate throughout the year.</p>

<h3 id="april-peering-inside-the-box"><a name="apr"></a>April: Peering Inside the Box</h3>

<p>April was about peering inside the black box. Anthropic released fascinating research on tracing the internal “thoughts” of models, while Meta’s Llama 4 release highlighted a crucial finding: Reinforcement Learning is proving much more important than Supervised Fine-Tuning (SFT). In fact, the data suggested that too much SFT can actually hurt performance. This was also the month OpenAI released GPT-4.1, which felt like a minor iteration compared to the architectural shifts we were seeing elsewhere.</p>

<p><img src="/blog/images/121-4.jpeg" /></p>

<ul>
  <li><a href="https://www.anthropic.com/research/tracing-thoughts-language-model">Tracing the Thoughts of an LLM</a> - Anthropic’s fascinating research on internal activations.</li>
  <li><a href="https://ai.meta.com/blog/llama-4-multimodal-intelligence/">Introducing Llama 4</a> - Meta moving to Mixture of Experts (MoEs) and huge base models (“Behemoth”).</li>
  <li><a href="https://www.theverge.com/meta/645012/meta-llama-4-maverick-benchmarks-gaming">Llama 4 and the Benchmark Crisis</a> - The Verge’s take on how Llama 4 broke our evaluation metrics.</li>
  <li><a href="https://x.com/tobi/status/1909251946235437514">Shopify’s AI Mandate</a> - Tobi Lütke’s internal memo mandating AI usage for employees.</li>
  <li><a href="https://hai.stanford.edu/news/ai-index-2025-state-of-ai-in-10-charts?utm_source=tldrai">Stanford AI Index 2025</a> - The state of AI in 10 charts.</li>
  <li><a href="https://openai.com/index/gpt-4-1/">OpenAI GPT-4.1</a> - A minor improvement that noticeably lacked comparisons to non-GPT models.</li>
  <li><a href="https://goo.gle/3G4DNic">Project AMIE Nature Paper on diagnostic accuracy</a> and on <a href="https://goo.gle/3G1naUu">assisting clinicians</a> - Two publications on conversational medical AI.</li>
  <li><a href="https://blog.google/technology/ai/dolphingemma/">DolphinGemma</a> - Using AI to decode dolphin communication (yes, really).</li>
  <li><a href="https://cloud.google.com/blog/products/ai-machine-learning/google-cloud-next-2025-wrap-up">Cloud Next 2025 and infrastructure announcements</a></li>
</ul>

<p>April reinforced that evaluation quality and RL‑driven training were no longer optional—they were becoming core pillars of progress. At the same time, some serious questions came up about the validity of public benchmarks.</p>

<h3 id="-may-ecosystem-moves-and-vibe-coding"><a name="may"></a> May: Ecosystem Moves and “Vibe Coding”</h3>

<p>May felt like a consolidation month. OpenAI went on an acquisition spree (buying Jony Ive’s startup and Windsurf), signaling a push into hardware and broader ecosystems. Meanwhile, Mark Zuckerberg was on a podcast tour with a refreshing level of honesty, admitting that Llama is essentially a byproduct of Meta’s internal needs rather than a purely altruistic developer play. We also saw the rise of “Vibe Coding”—a development style that prioritizes speed and flow over rigorous syntax—gaining legitimacy. Coding, multimodality, and enterprise applications increasingly shared the same underlying capabilities, even as models became more specialized at the surface.</p>

<p><img src="/blog/images/121-5.jpeg" /></p>

<ul>
  <li><a href="https://www.theverge.com/news/671838/openai-jony-ive-ai-hardware-apple">OpenAI buys Jony Ive’s hardware startup</a></li>
  <li><a href="https://stratechery.com/2025/an-interview-with-meta-ceo-mark-zuckerberg-about-ai-and-the-evolution-of-social-media/">Zuckerberg on Llama</a> - His podcast tour explaining Llama as a byproduct of internal needs.</li>
  <li><a href="https://www.anthropic.com/news/claude-4">Anthropic Claude Opus 4 and Sonnet 4 </a>- Continued pressure at the high end.</li>
  <li><a href="">Google I/O 2025 announcements</a> https://blog.google/technology/ai/google-io-2025</li>
  <li><a href="https://news.microsoft.com/build-2025-book-of-news/">Microsoft Build 2025</a> - 50+ announcements for developers.</li>
  <li><a href="https://www.linkedin.com/pulse/state-vibe-coding-tools-may-2025-nufar-gaspar-x1znf/">Vibe Coding Podcast</a> - The AI Daily Brief on the state of “vibecoding.”</li>
  <li><a href="https://www.nature.com/articles/s41599-025-04787-y">The Effect of ChatGPT on Students</a> - A Nature article providing valuable data on AI in education.</li>
  <li><a href="https://open.substack.com/pub/robotic/p/brakes-on-an-intelligence-explosion?r=4pd1ap&amp;utm_campaign=post&amp;utm_medium=email">Brakes on Intelligence Explosion</a> - Nathan Lambert offering a counterpoint to the “AGI by 2027” hype.</li>
  <li><a href="https://techcrunch.com/2025/05/07/netflix-debuts-its-generative-ai-powered-search-tool/">Netflix AI Search</a> - Generative AI hitting mainstream consumer UI.</li>
  <li><a href="https://deepmind.google/discover/blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/">AlphaEvolve</a> - AI designing algorithms to save compute costs.</li>
</ul>

<h3 id="-june-superintelligence-and-ai-nxiety"><a name="jun"></a> June: Superintelligence and “AI-nxiety”</h3>

<p>In June, the race for “Superintelligence” became explicit, with Meta forming a dedicated lab and aggressively poaching talent (reportedly paying 7-to-9 figures). But alongside this race, we started seeing the human cost: “AI-nxiety.” Developers and users alike expressed exhaustion at the relentless pace of updates. We also saw Apple release their “Illusion of Thinking” paper, a controversial splash that argued current reasoning models might be shallower than we assume—a debate that is still ongoing.</p>

<p><img src="/blog/images/121-6.jpeg" /></p>

<ul>
  <li><a href="https://www.reuters.com/business/finance/meta-finalizes-investment-scale-ai-valuing-startup-29-billion-2025-06-13/">Meta’s Superintelligence Lab</a> - Poaching Scale AI’s CEO to lead the charge.</li>
  <li><a href="https://www.cnbc.com/2025/06/09/openai-hits-10-billion-in-annualized-revenue-fueled-by-chatgpt-growth.html?utm_source=tldrnewsletter">OpenAI hits $10B ARR </a>- The business of AI is booming.</li>
  <li><a href="https://www.businesstoday.in/technology/news/story/uae-makes-chatgpt-plus-subscription-free-for-all-residents-as-part-of-deal-with-openai-477948-2025-05-27">UAE Free ChatGPT</a> - A nation-state strategy to accelerate adoption.</li>
  <li><a href="https://www.youtube.com/watch?v=DrygcOI-kG8">LangChain Keynote</a> - Harrison Chase on the state of agents.</li>
  <li><a href="https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf">Apple’s “Illusion of Thinking”</a>- A controversial paper arguing models might not be reasoning as deeply as we think.</li>
  <li><a href="https://www.youtube.com/watch?v=0_DjDdfqtUE">Apple WWDC</a> - Apple continuing to play catch-up in the generative race.</li>
  <li><a href="https://semianalysis.com/2025/06/08/scaling-reinforcement-learning-environments-reward-hacking-agents-scaling-data/">Scaling Reinforcement Learning</a> - Semianalysis on why RL is the next frontier.</li>
  <li><a href="https://www.interconnects.ai/p/what-comes-next-with-reinforcement">What’s Next for RL</a> - Nathan Lambert’s take.</li>
  <li><a href="https://somehowmanage.com/2025/05/19/ai-is-awesome-but-its-fucking-exhausting/">AI-nxiety </a>- “AI is Awesome but It’s Fucking Exhausting.”</li>
  <li><a href="https://www.economist.com/technology/2025/06/ai-data-centers-energy">Infrastructure constraints and energy considerations for AI data centers</a></li>
</ul>

<p>June underscored a recurring theme: progress is increasingly gated by infrastructure, incentives, and workflows—not by model quality alone.</p>

<h3 id="-july-agents-getting-real-and-practical"><a name="jul"></a> July: Agents Getting Real and Practical</h3>

<p>July saw agents moving from cool demos to practical, integrated products. OpenAI merged their Operator and Deep Research teams, signaling that agents are the new search. We also saw smart implementation tactics from companies like Shopify, who are building innovative agents that access internal data via the Model Context Protocol (MCP). The economic argument also solidified this month: we began to see compelling evidence that LLM inference costs are dropping fast enough to potentially become cheaper than traditional search.</p>

<p><img src="/blog/images/121-7.jpeg" /></p>

<ul>
  <li><a href="https://www.cnbc.com/2025/06/30/mark-zuckerberg-creating-meta-superintelligence-labs-read-the-memo.html">Zuckerberg’s Superintelligence Memo</a> - Now public.</li>
  <li><a href="https://openai.com/index/introducing-chatgpt-agent/">OpenAI ChatGPT Agent</a> - Merging Operator and Deep Research into one system.</li>
  <li><a href="https://deepmind.google/discover/blog/advanced-version-of-gemini-with-deep-think-officially-achieves-gold-medal-standard-at-the-international-mathematical-olympiad/">Gemini Deep think formally achieves gold-medal at the International Mathematical Olympiad</a></li>
  <li><a href="https://x.com/Kimi_Moonshot/status/1945897926796185841">Moonshot Kimi K2</a> - Reaching #1 on open model spots.</li>
  <li><a href="https://x.com/xai/status/1943158495588815072">xAI releases Grok4</a></li>
  <li><a href="https://www.firstround.com/ai/shopify">Shopify’s AI Tactics</a> - Building innovative agents accessing internal data via MCPs.</li>
  <li><a href="https://www.elenaverna.com/p/the-rise-of-the-ai-native-employee">The AI-Native Employee</a> - Papers on how AI is changing the nature of work and teamwork.</li>
  <li><a href="https://www.youtube.com/watch?v=LCEmiRjPEtQ">Karpathy’s YC Lecture</a> - “Software in the Era of AI.”</li>
  <li><a href="https://asia.nikkei.com/Business/Technology/Artificial-intelligence/Positive-review-only-Researchers-hide-AI-prompts-in-papers">Prompt Attacks on Papers</a> - Researchers injecting prompts to prevent negative AI reviews.</li>
  <li><a href="https://www.snellman.net/blog/archive/2025-06-02-llms-are-cheap/">Inference vs. Search Costs</a> - Arguing that LLM inference is becoming cheaper than traditional search.</li>
  <li><a href="https://blog.google/technology/developers/introducing-gemini-cli-open-source-ai-agent/">CLI version of Gemini</a> gets over 50k stars in github in a few weeks</li>
</ul>

<p>July made clear that productivity gains are earned through structural change, not incremental tooling.</p>

<h3 id="-august-the-bubble-panic-and-gpt-5s-arrival"><a name="aug"></a> August: The “Bubble” Panic and GPT-5’s Arrival</h3>

<p>August was a rollercoaster of sentiment. We had a wave of “AI Bubble” articles from major publications asking if we had peaked, citing failed pilots and high costs. Then, almost on cue, GPT-5 launched. While the rollout was rocky and required adjustments, the sheer user numbers (700M weekly users) and valuation ($500B) largely silenced the “dead end” narrative. It was a reminder that while the hype might fluctuate, the utility is scaling.</p>

<p><img src="/blog/images/121-8.jpeg" /></p>

<ul>
  <li>OpenAI had a momentous month, launching <a href="https://openai.com/index/introducing-gpt-5/">GPT-5</a>, though the <a href="https://arstechnica.com/information-technology/2025/08/the-gpt-5-rollout-has-been-a-big-mess/">rollout was rocky</a> and required <a href="https://x.com/OpenAI/status/1956461718097494196">adjustments based on user feedback</a>.</li>
  <li>The AI Bubble Articles - <a href="https://fortune.com/2025/08/ai-bubble-mit-study">Fortune</a>, <a href="https://www.theatlantic.com/technology/archive/2025/08/ai-mass-delusion-event/683909/">The Atlantic</a>, and <a href="https://www.newyorker.com/culture/open-questions/what-if-ai-doesnt-get-much-better-than-this">New Yorker</a> all asking if we’ve peaked.</li>
  <li><a href="https://x.com/AnthropicAI/status/1952768432027431127">Anthropic Opus 4.1</a> - Focus on safety and interpretability (persona vectors).</li>
  <li><a href="https://www.meta.com/superintelligence/">Meta Personal Superintelligence</a> - Leaning into their vision.</li>
  <li><a href="https://github.blog/changelog/2025-07-23-github-spark-in-public-preview-for-copilot-pro-subscribers/">Microsoft Github Spark</a> - Their take on “vibecoding.”</li>
  <li><a href="https://x.com/figma/status/1948399170030620870">Figma Make</a> - Vibecoding comes to design.</li>
  <li><a href="https://semianalysis.com/2025/08/12/scaling-the-memory-wall-the-rise-and-roadmap-of-hbm/">Semianalysis: Scaling Memory</a> - The roadmap of HBM.</li>
  <li><a href="https://www.arxiv.org/abs/2508.10975">Synthetic Data for Pretraining</a> - Lessons from scaling synthetic data (arXiv).</li>
  <li><a href="https://arxiv.org/abs/2503.00001">Data efficiency breakthroughs via high-fidelity labeling</a></li>
  <li><a href="https://deepmind.google/blog/genie3">Genie-style world models and simulation advances</a></li>
  <li><a href="https://blog.google/inside-google/infrastructure/ai-energy-use">Coverage of AI energy usage and efficiency trade-offs</a></li>
  <li><a href="https://www.therobotreport.com/waymo-reaches-100m-fully-autonomous-miles-across-all-deployments/">Waymo 100M Miles</a> - Autonomous driving quietly hitting massive milestones.</li>
</ul>

<p>The month surfaced real doubts about AI progress, set against a backdrop of launches that suggested the doubters were probably wrong.</p>

<h3 id="-september-nano-banana-and-scientific-breakthroughs"><a name="sep"></a> September: “Nano Banana” and Scientific Breakthroughs</h3>

<p>September was a huge month for Google, led by the viral success of the Nano Banana image editing model. But beyond the consumer hype, we saw incredible scientific work from DeepMind on fluid dynamics and a continued push for safer, more accessible AI.</p>

<p><img src="/blog/images/121-9.jpeg" /></p>

<ul>
  <li><a href="https://blog.google/products/gemini/updated-image-editing-model/">Google Nano Banana</a> - The viral image editing model that pushed the <a href="https://www.cnbc.com/2025/09/16/google-gemini-tops-apples-app-store-snagging-lead-spot-from-chatgpt.html">Gemini app to #1</a>.</li>
  <li>The discussion continues on AI’s impact on the job market: a <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5425555">Harvard study</a> suggests AI is “choking” entry-level hiring, while a <a href="https://digitaleconomy.stanford.edu/wp-content/uploads/2025/08/Canaries_BrynjolfssonChandarChen.pdf">Stanford study</a> makes similar claims.</li>
  <li>Meta is pushing further into hardware with its <a href="https://www.youtube.com/watch?v=gZ9IsB72nVk">AI-powered smart glasses</a></li>
  <li><a href="https://deepmind.google/discover/blog/discovering-new-solutions-to-century-old-problems-in-fluid-dynamics/">DeepMind Fluid Dynamics</a> - Using Physics-Informed Neural Networks to discover new mathematical edge cases in fluid motion.</li>
  <li><a href="https://venturebeat.com/ai/google-and-openais-coding-wins-at-university-competition-show-enterprise-ai">Coding Wins</a> - Google and OpenAI showing dominance at university coding competitions.</li>
  <li><a href="https://blog.google/products/chrome/new-ai-features-for-chrome/">Gemini in Chrome</a> - Embedding models directly into the browser for billions of users.</li>
  <li><a href="https://deepmind.google/discover/blog/strengthening-our-frontier-safety-framework/">Frontier Safety Framework</a> - Google DeepMind’s third iteration of their safety framework.</li>
  <li><a href="https://cloud.google.com/blog/products/ai-machine-learning/announcing-agents-to-payments-ap2-protocol">Agent Payments Protocol (AP2)</a> - An open protocol for secure agent-led payments.</li>
  <li><a href="https://time.com/collections/time100-ai-2025/">Time’s 2025 AI 100</a> - The annual list of influential AI leaders.</li>
  <li>Anthropic announced a massive <a href="https://www.anthropic.com/news/anthropic-raises-series-f-at-usd183b-post-money-valuation">new funding round ($13B at a $183B valuation)</a> as they scale to over 300k enterprise customers.</li>
  <li>xAI released <a href="https://x.ai/news/grok-4-fast">Grok 4 Fast</a> making a claim to be at the frontier of cost-efficient intelligence.</li>
  <li><a href="https://blog.replit.com/agent3">Replit Agent 3 and multi-hour autonomy</a></li>
</ul>

<p>September signaled a new baseline: autonomy and reliability were becoming table stakes.</p>

<h3 id="-october-economic-value-over-benchmarks"><a name="oct"></a> October: Economic Value over Benchmarks</h3>

<p>The focus in October was heavily on the new possibilities of AI video creation and proving the economic value of AI. The GDP-val paper was a standout, proposing that we measure frontier models not by academic exams, but by their ability to perform “economically valuable” tasks.</p>

<p><img src="/blog/images/121-10.jpeg" /></p>

<ul>
  <li><a href="https://openai.com/devday/">OpenAI Dev Day</a>- Announcing the Atlas browser, Instant Checkout, and <a href="https://openai.com/index/sora-2/">Sora 2</a>, their new video model and a dedicated app for short-form, AI-generated videos.</li>
  <li><a href="https://www.anthropic.com/news/claude-sonnet-4-5">Anthropic Sonnet 4.5</a> - And <a href="https://www.testingcatalog.com/anthropic-experiments-with-an-agent-for-gereating-ui-on-the-fly/">“Imagine”</a> for real-time UI generation.</li>
  <li><a href="https://www.testingcatalog.com/meta-introduces-vibes-feed-for-ai-generated-content/">Meta Vibes</a> - Feed for short-form AI video.</li>
  <li><a href="https://cdn.openai.com/pdf/d5eb7428-c4e9-4a33-bd86-86dd4bcf12ce/GDPval.pdf">GDP-val</a> - Finding frontier models rival humans on economically valuable tasks.</li>
  <li><a href="https://arxiv.org/abs/2510.12049">GenAI and Firm Productivity</a> - Measuring real-world impact in retail.</li>
  <li><a href="https://finance.yahoo.com/news/without-data-centers-gdp-growth-171546326.html">US economic dependence on data-center buildout</a></li>
  <li><a href="https://github.com/deepseek-ai/DeepSeek-V3.2-Exp/blob/main/DeepSeek_V3_2.pdf">DeepSeek Sparse Attention</a> - New research from the efficiency kings.</li>
  <li><a href="https://github.com/karpathy/nanochat">Karpathy’s Nanochat</a> - “The best ChatGPT that $100 can buy.”</li>
  <li><a href="https://thinkingmachines.ai/blog/announcing-tinker/">Tinker</a> - Thinking Machines’ tool for model fine-tuning.</li>
  <li><a href="https://cloud.google.com/blog/products/ai-machine-learning/introducing-gemini-enterprise">Gemini Enterprise</a> - Major steps for cloud customers.</li>
  <li><a href="https://arxiv.org/abs/2501.20012">Sparse attention and efficiency work from DeepSeek</a></li>
  <li><a href="">A Bloomberg feature detailed the web of circular deals</a> among AI companies</li>
  <li>Interesting discussions on the Dwarkesh podcast. While RL pioneer Richard Sutton argued that <a href="https://www.youtube.com/watch?v=21EYKqUsPfg">LLMs are a dead end</a>, Andrej Karpathy presented a <a href="https://www.youtube.com/watch?v=lXUZvyajciY">contrasting perspective</a></li>
</ul>

<p>Progress increasingly began to be evaluated economically, not just technically.</p>

<h3 id="-november-the-new-sota-battleground"><a name="nov"></a> November: The New SOTA Battleground</h3>

<p>November brought the year’s biggest shakeup: the launch of Gemini 3. Google’s latest model, accompanied by the new “Deep Think” reasoning mode and the “Google Antigravity” agentic platform, immediately topped the charts. Just days later, Anthropic countered with Claude Opus 4.5, marketed as the ultimate coding model with massive improvements in agentic workflows. The market share data reflects this shift—ChatGPT is no longer the default for everyone.</p>

<p><img src="/blog/images/121-11.jpeg" /></p>

<ul>
  <li><a href="https://deepmind.google/models/gemini/">Google Gemini 3</a> - A huge shakeup with enhanced reasoning and agentic capabilities.</li>
  <li><a href="https://blog.google/technology/ai/nano-banana-pro/">Nano Banana Pro</a> - Building on the viral success of the original, this version pushed image editing even further.</li>
  <li><a href="https://www.anthropic.com/news/claude-opus-4-5">Claude Opus 4.5</a> - A SOTA coding model that reportedly scores 74.5% on SWE-bench Verified.</li>
  <li><a href="https://x.com/Similarweb/status/1998343712791777751">ChatGPT losing market share</a> - Data from Similarweb showing a clear trend towards Gemini.</li>
  <li><a href="https://x.com/Similarweb/status/1995792272785310186">Similarweb Analysis</a> - Further confirmation of the changing landscape.
Pie</li>
  <li><a href="https://arxiv.org/abs/2512.04797">SIMA 2 embodied agent research</a></li>
</ul>

<p><img src="/blog/images/121-10b.png" /></p>

<p>November felt like a visible inflection point in both capability and market momentum. SOTA leadership, once measured in years, was now clearly measured in weeks.</p>

<h3 id="-december-code-red-and-the-grand-finale"><a name="dec"></a> December: Code Red and the Grand Finale</h3>

<p>The year ended with high drama. Feeling the heat from Gemini 3 and Opus 4.5, OpenAI declared a “Code Red,” reminiscent of Google’s own similar move back in 2022. This urgency birthed GPT-5.2, a rapid iteration designed to reclaim the throne, alongside new features like ChatGPT Images. Meanwhile, at NeurIPS 2025 in San Diego, the buzz was all about embodied agents, with DeepMind unveiling Sima 2, a generalist agent for 3D worlds that feels like a real step towards general purpose robotics.</p>

<p><img src="/blog/images/121-12.jpeg" /></p>

<ul>
  <li><a href="https://www.wsj.com/tech/ai/openais-altman-declares-code-red-to-improve-chatgpt-as-google-threatens-ai-lead-7faf5ea6">OpenAI Code Red</a> - Sam Altman rallying the troops as competition heats up.</li>
  <li><a href="https://openai.com/index/introducing-gpt-5-2/">GPT 5.2 Launch</a> - OpenAI’s rapid response to the shifting benchmarks.</li>
  <li><a href="https://openai.com/index/new-chatgpt-images-is-here/">ChatGPT Images</a> - Bringing native image capabilities to the forefront.</li>
  <li><a href="https://blog.google/products/gemini/gemini-3-flash/">Gemini 3 Flash</a> - A speed-optimized beast that still manages ~78% on SWE-bench Verified.</li>
  <li><a href="https://openai.com/index/shipping-sora-for-android-with-codex/">Shipping Sora for Android</a> - A fascinating look at how OpenAI used Codex to build their own app in just 28 days.</li>
  <li><a href="https://arxiv.org/abs/2512.04797">DeepMind Sima 2</a> - A generalist embodied agent for 3D worlds, unveiled just in time for the conference season.</li>
  <li><a href="https://neurips.cc">NeurIPS 2025 conference and proceedings</a>. Also, interestingly, several VCs are now giving good summaries of the conference. * * <a href="https://www.amplifypartners.com/blog-posts/neurips-2025-recap">Here</a> is the one by Amplify Partners and <a href="https://radical.vc/highlights-from-neurips-2025/">here</a> the one by Radical Ventures.</li>
</ul>

<p>The year closed with a clear signal: fast iteration now coexists with renewed investment in long‑horizon research.</p>

<h3 id="-what-i-got-wrong-in-2025"><a name="wrong"></a> What I Got Wrong in 2025</h3>

<p>Looking back at my own predictions (and anxieties) from the start of the year, a few things stand out:</p>

<ul>
  <li><strong>The Bubble That Wasn’t</strong>: In August, amidst the “Peak AI” narrative, I worried we were heading for a winter. I was wrong. The utility of these models in coding and enterprise workflows has created a floor for value that is much higher than I anticipated.</li>
  <li><strong>Agents are Harder than We Thought</strong>: I expected autonomous agents to be “solved” by mid-year. Instead, we found that reliability at scale is an immense challenge. The “Agentic Moore’s Law” is real, but the slope is shallower than I hoped.</li>
  <li><strong>The Persistence of Open Weights</strong>: I feared the gap between closed and open models would widen to a chasm. Instead, thanks to DeepSeek, Mistral, and Meta, the open ecosystem is arguably healthier than ever, keeping the giants honest on price.</li>
</ul>

<h3 id="-bonus-a-deeper-dive"><a name="bonus"></a> Bonus: A Deeper Dive</h3>

<p>Before closing out the year, I sat down for an in-depth, <a href="https://youtu.be/qgHLZuZ7mmM?si=r8MoCwK1q2Ygsoos">2-hour conversation with Jon Hernandez on his “Inteligencia Artificial” podcast</a>. We covered everything from my transition out of Google and the internal dynamics of big labs, to why I believe “AGI” is a distracting term and why we should focus on specialized agents instead. If you want the unfiltered, “director’s cut” version of my take on 2025 and beyond, this is it.</p>

<h3 id="-conclusion"><a name="conclusion"></a> Conclusion</h3>

<p>As I close out my recap of 2025, I am struck by how much the narrative has changed. We are no longer just awed by the fluency of LLMs; we are now demanding fidelity, reasoning, and autonomy. The battles of November and December proved that no lead is safe, and “SOTA” is a title you hold for weeks, not months.</p>

<p>Leaving Google this year has given me a fresh perspective on this ecosystem. The rate of change is dizzying, but it is also exhilarating. As we head into 2026, I am more convinced than ever that we are just scratching the surface of what is possible when we combine powerful reasoning models with verifiable rewards and agentic workflows. I look forward to exploring all of this in the incredible space of travel from my new position as Chief AI and Data Officer at Expedia Group.</p>

<p>Here’s to a 2026 full of verified rewards and fewer hallucinations. Happy New Year!</p>]]></content><author><name>Xavier</name></author><category term="Artificial Intelligence" /><category term="Personal" /><summary type="html"><![CDATA[A year of reasoning, agents, and compressed innovation cycles]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://amatria.in/blog/blog/images/121-0.jpeg" /><media:content medium="image" url="https://amatria.in/blog/blog/images/121-0.jpeg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">The post-training journey of modern LLMs revisited</title><link href="https://amatria.in/blog/postpretraining-revisted" rel="alternate" type="text/html" title="The post-training journey of modern LLMs revisited" /><published>2025-09-06T00:00:01+00:00</published><updated>2025-09-06T00:00:01+00:00</updated><id>https://amatria.in/blog/postpretraining-revisited</id><content type="html" xml:base="https://amatria.in/blog/postpretraining-revisted"><![CDATA[<p>(This blog post, as most of my recent ones, is written with AI assistance and augmentation. First time using nano banana for the infographics!)</p>

<p>In a previous post, we delved into <a href="https://amatria.in/blog/postpretraining">“Beyond Token Prediction: the post-Pretraining journey of modern LLMs”</a> to explore the multifaceted post-pretraining life of modern LLMs. A key takeaway from that discussion was that modern LLMs have evolved far beyond simple next-token prediction, a point that becomes even more critical as we venture into the realm of reasoning models. We touched upon how techniques like Reinforcement Learning from Human Feedback (RLHF) have been pivotal in aligning these models with human preferences. But the world of AI moves at a dizzying pace, and the conversation is already shifting. While RLHF has been a cornerstone, the new rave is all about <strong>Reinforcement Learning with Verifiable Rewards (RLVR)</strong>, a term introduced in <a href="https://arxiv.org/abs/2411.15124">this recent paper</a>. This isn’t just an incremental update; it’s a paradigm shift that could be the key to unlocking true reasoning in our models.</p>

<h3 id="from-human-preference-to-verifiable-truth">From Human Preference to Verifiable Truth</h3>

<p>So, what exactly is RLVR, and how does it differ from the RLHF we’ve grown accustomed to? RLHF, in essence, is about teaching a model to be more “human-like.” We show it two responses, a human indicates which one is “better,” and the model learns to produce outputs that are more likely to be preferred. It’s a powerful technique for improving style, tone, and safety. However, it’s also inherently subjective. What one person prefers, another might not. And more importantly, a preferred answer isn’t always the <em>correct</em> answer, especially when it comes to complex reasoning tasks.</p>

<p>This is where RLVR comes in. As the name suggests, RLVR is a flavor of reinforcement learning where the reward is based on a verifiable, objective metric. Instead of asking “which answer do you like more?”, we ask “is this answer demonstrably correct?”. The reward is no longer a matter of opinion but of fact. For example, if we’re training a model to solve a math problem, the reward can be based on whether the final answer is correct. If we’re teaching it to code, the reward can be tied to whether the code compiles and passes a set of unit tests.</p>

<p><img src="/blog/images/120-0.png" /></p>

<p>The folks at <a href="https://fireworks.ai/blog/reinforcement-learning-with-verifiable-reward">Fireworks.ai</a> have been doing some fantastic work in this area, and their blog post on the topic is a must-read. They highlight that RLVR is particularly well-suited for tasks where the “goodness” of an output can be programmatically determined. This shift from subjective human feedback to objective, verifiable rewards is a subtle but profound one. It’s the difference between a model that’s a good conversationalist and one that’s a reliable problem-solver.</p>

<h3 id="diving-deeper-reward-functions-learned-models-and-policy-optimization">Diving Deeper: Reward Functions, Learned Models, and Policy Optimization</h3>

<p>To appreciate the mechanics of RLVR, it helps to distinguish between a <em>programmatic reward function</em> and a <em>learned reward model</em>. RLVR’s power stems from its use of a programmatic reward function—essentially, a piece of code that deterministically scores an output. For example: <code class="language-plaintext highlighter-rouge">if unit_tests_pass(): return 1.0 else: return 0.0</code>. This is transparent, objective, and verifiable. In contrast, traditional RLHF uses a learned reward model, which is a separate neural network trained on human preference data to <em>predict</em> what score a human would give. This model is an approximation of human values and can have its own biases or be gamed.</p>
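
<p>To make this a bit more concrete, here is a minimal sketch of what a programmatic reward function could look like for a math task. This is illustrative only: the <code class="language-plaintext highlighter-rouge">extract_final_answer</code> helper is a hypothetical stand-in for whatever answer-parsing your task needs, and real implementations are usually far more robust. The key property is that the score is computed by code, not predicted by a model trained on preferences.</p>

<pre><code class="language-python">
# A minimal sketch of a programmatic (verifiable) reward function for a
# math task. `extract_final_answer` is a hypothetical helper that parses
# the model's completion; the reward is a deterministic check, not a
# learned preference model.

import re

def extract_final_answer(completion: str) -> str:
    """Hypothetical parser: grab the last number in the completion."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return matches[-1] if matches else ""

def math_reward(completion: str, gold_answer: str) -> float:
    """Return 1.0 if the parsed final answer matches the reference, else 0.0."""
    return 1.0 if extract_final_answer(completion) == gold_answer else 0.0

# Example: a correct completion earns the full reward.
print(math_reward("Adding 17 and 25 gives 42.", "42"))  # 1.0
</code></pre>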

<p>Once you have a method for scoring responses, you need an algorithm to update the LLM’s policy (the “actor”). The workhorse here is <strong>PPO (Proximal Policy Optimization)</strong>. It’s important to clarify a potential point of confusion: while RLVR avoids the <em>learned reward model</em> of RLHF, PPO itself, as an actor-critic method, often learns an auxiliary <strong>Value Network</strong> (the “critic”). This network doesn’t judge preference; instead, it estimates the expected future reward from a given state (i.e., the sequence of tokens generated so far). By comparing the actual reward to the critic’s prediction, PPO calculates the “Advantage”—a more stable signal for how good an action was. PPO uses this Advantage to update the actor in small, stable steps, maximizing the reward while ensuring the updated model doesn’t stray too far from its original state. This is a crucial safeguard against a common pitfall in RL known as <strong>reward hacking</strong>, where the model exploits an unforeseen loophole in the reward function to get a high score without achieving the intended goal. As detailed in a great <a href="https://semianalysis.com/2025/06/08/scaling-reinforcement-learning-environments-reward-hacking-agents-scaling-data/">SemiAnalysis article</a>, this is a major challenge. The constraints in PPO help prevent this kind of undesirable optimization, making it a foundational technique for both RLHF and RLVR.</p>

<p><img src="/blog/images/120-1.png" /></p>

<p>The field is moving fast, and simpler, more stable alternatives to PPO are emerging. A prominent example, brought to light by the team behind the <a href="https://arxiv.org/html/2501.12948">Deepseek R1</a> model (as detailed in <a href="https://huggingface.co/blog/NormalUhr/grpo">this Hugging Face post</a>), is <strong>GRPO (Group Relative Policy Optimization)</strong>. Unlike PPO, which evaluates responses individually, GRPO operates on a group of candidate responses for a given prompt. A key difference and advantage is its simplicity, as GRPO does not need to learn an auxiliary value function. For each prompt, it samples multiple outputs, calculates the average reward for this group, and then updates the policy based on the <em>relative</em> performance of each sample. The objective is to encourage responses that score above the group average and discourage those that fall below. This approach of using a group’s average performance as a dynamic baseline provides a more stable and robust training signal, reducing the variance that can make PPO tricky to tune. It’s particularly effective for complex reasoning tasks where a clear “winner” is less important than consistently moving towards better-than-average solutions. For those interested in a hands-on implementation, the team at Lightning AI has a great post on <a href="https://lightning.ai/lightning-purchase-test/studios/build-a-reasoning-llm-from-scratch-using-grpo?section=featured">building a reasoning LLM from scratch using GRPO</a>.</p>
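
<p>To see how this differs in code, here is a minimal sketch of the group-relative advantage computation that gives GRPO its name. It assumes each sampled completion has already been scored by a verifiable reward, and the exact normalization details vary across implementations, so read it as an illustration rather than a faithful reproduction of the DeepSeek recipe.</p>

<pre><code class="language-python">
# Minimal sketch of GRPO-style group-relative advantages: sample several
# completions for the same prompt, score each with a verifiable reward,
# and use the group statistics as the baseline (no learned value network).

from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the group's mean and standard deviation."""
    baseline = mean(rewards)
    spread = pstdev(rewards) + eps
    return [(r - baseline) / spread for r in rewards]

# Example: four sampled answers to one prompt, two of which pass the checker.
rewards = [1.0, 0.0, 1.0, 0.0]
print(group_relative_advantages(rewards))
# Above-average samples get positive advantages, below-average get negative.
</code></pre>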

<h3 id="the-key-to-unlocking-reasoning">The Key to Unlocking Reasoning</h3>

<p>Our ultimate goal is to teach models to <em>reason</em>, not just to mimic patterns of text that have been positively reinforced by humans. When a model is rewarded not just for the final answer, but for the steps it takes to get there, it learns a process. This idea was formalized into a technique known as <strong>Process Supervision</strong>, where a reward is provided for each correct step in a model’s reasoning chain, leading to the development of Process-based Reward Models (PRMs).</p>

<p>As detailed by <a href="https://openai.com/index/improving-mathematical-reasoning-with-process-supervision/">OpenAI in their work on improving mathematical reasoning</a>, instead of only rewarding the final answer (outcome supervision), Process Supervision rewards the model for each correct intermediate step. You can find more technical details in the <a href="https://arxiv.org/abs/2305.20050">original research paper</a>. This is where things get really exciting. Imagine a model that can show its work, and we can verify and reward each step of that work. This not only makes the model’s reasoning process more transparent but also allows us to pinpoint exactly where it goes wrong and provide targeted feedback. This is a much more powerful and scalable approach than simply telling the model “your answer is wrong” and hoping it figures out why.</p>
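
<p>A toy sketch may help contrast the two forms of supervision. The <code class="language-plaintext highlighter-rouge">verify_step</code> function below is purely a placeholder for a PRM or a programmatic step checker; the point is only that process supervision produces one signal per reasoning step instead of a single signal for the final answer.</p>

<pre><code class="language-python">
# Illustrative contrast between outcome supervision and process supervision.
# `verify_step` stands in for a process reward model (PRM) or programmatic
# step checker; here it is a toy placeholder.

def verify_step(step: str) -> float:
    """Toy step verifier: reward steps that are non-empty and end in a claim."""
    return 1.0 if step.strip().endswith(".") else 0.0

def outcome_reward(final_answer: str, gold: str) -> float:
    """Outcome supervision: a single reward for the final answer only."""
    return 1.0 if final_answer == gold else 0.0

def process_rewards(reasoning_steps):
    """Process supervision: one reward per intermediate reasoning step."""
    return [verify_step(step) for step in reasoning_steps]

steps = ["17 + 20 = 37.", "37 + 5 = 42.", "So the answer is 42."]
print(outcome_reward("42", "42"))   # one scalar signal for the whole chain
print(process_rewards(steps))       # feedback on every step
</code></pre>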

<p>However, it’s crucial to draw a distinction here. While PRM-based training looked like the primary path forward a year ago, many of the biggest recent wins have come from simpler forms of RLVR—often just binary, outcome-based checkers combined with massive sampling and efficient algorithms like GRPO (see Nathan’s comment in <a href="https://www.interconnects.ai/p/what-comes-next-with-reinforcement">Interconnects.ai blog post, “What Comes Next with Reinforcement,”</a> ). This doesn’t mean PRMs have disappeared; rather, their role has evolved. They are now shifting to two key areas: as a crucial <strong>verifier at inference time</strong> to check the model’s reasoning, and for <strong>targeted training</strong> where step-level fidelity and interpretability are paramount. So, while still highly relevant, PRMs are becoming a specialized tool within the broader and more versatile RLVR toolkit.</p>

<p>Beyond just improving reasoning on static problems, RL is also a foundational component for training autonomous <strong>agents</strong> that can interact with environments and learn from the consequences of their actions. The shift is so significant that, as detailed in a recent piece by <a href="https://semianalysis.com/2025/06/08/scaling-reinforcement-learning-environments-reward-hacking-agents-scaling-data/">SemiAnalysis</a>, the growing importance of RL is fundamentally changing the structure of AI research labs. As <a href="https://www.youtube.com/watch?v=JIsgyk0Paic">this insightful video on the future of AI agents explains</a>, this sets the stage for <strong>continuous learning</strong>, where models can adapt and improve over time without constant, large-scale retraining.</p>

<h3 id="test-time-compute-getting-more-from-models-at-inference">Test-Time Compute: Getting More from Models at Inference</h3>

<p>A model trained with RLVR provides a strong reasoning foundation, but its performance can be significantly amplified at inference through a strategy known as <strong>test-time compute</strong>. This refers to the computational effort a model expends <em>when actively working on a prompt</em> to arrive at a final answer. Instead of generating a single, immediate response, we can have the model engage in a more deliberative, multi-path reasoning process.</p>

<p>At inference time, we can scale up this compute by using techniques like <a href="https://arxiv.org/abs/2203.11171v4"><strong>Self-Consistency</strong></a> (sampling multiple reasoning paths and taking a majority vote on the final answer) or <a href="https://arxiv.org/abs/2305.10601v1"><strong>Tree-of-Thoughts</strong></a> (actively exploring a tree of possible reasoning steps). This allows the model to explore a wider solution space and self-correct. The final, crucial step is to use a <strong>verifier</strong>—which can be a simple programmatic check, a unit test, or even a PRM acting as a reranker—to select the best and most reliable answer from the many candidates generated. This purely inference-time strategy leverages the strong base model from RLVR training to achieve state-of-the-art accuracy and robustness on complex tasks.</p>
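
<p>As a heavily simplified sketch, self-consistency with a verifier boils down to: sample several candidates, keep the ones a checker accepts, and return the most common survivor. The <code class="language-plaintext highlighter-rouge">sample_answer</code> and <code class="language-plaintext highlighter-rouge">verify</code> callables below are placeholders for an actual model call and an actual programmatic check.</p>

<pre><code class="language-python">
# Heavily simplified sketch of self-consistency with a verifier at inference
# time: sample several candidate answers, keep the ones a verifier accepts,
# and return the most common survivor. `sample_answer` and `verify` are
# placeholders for a real model call and a real programmatic checker.

from collections import Counter
import random

def self_consistent_answer(sample_answer, verify, n_samples=8):
    candidates = [sample_answer() for _ in range(n_samples)]
    verified = [c for c in candidates if verify(c)] or candidates
    answer, _count = Counter(verified).most_common(1)[0]
    return answer

# Example with toy stand-ins: a noisy sampler and a trivial verifier.
toy_sampler = lambda: random.choice(["42", "42", "42", "41"])
toy_verifier = lambda ans: ans.isdigit()
print(self_consistent_answer(toy_sampler, toy_verifier))
</code></pre>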

<h3 id="karpathys-corner-the-bull-case-and-a-word-of-caution">Karpathy’s Corner: The Bull Case and a Word of Caution</h3>

<p>No discussion of the future of AI would be complete without mentioning Andrej Karpathy. His insights are always a valuable addition to the conversation. In a <a href="https://x.com/karpathy/status/1944435412489171119">recent tweet</a>, he laid out his bull case for Reinforcement Learning, and it’s easy to see why he’s optimistic. The ability to fine-tune models based on specific, measurable outcomes is a powerful tool, and RLVR is a prime example of this.</p>

<p>However, Karpathy is also a pragmatist. In <a href="https://x.com/karpathy/status/1960803117689397543">another tweet</a>, he raised a crucial question about the scalability of reward functions. He expressed some doubt that we can design reward functions that can scale all the way to AGI. And he has a point. While it’s relatively straightforward to design a verifiable reward for a math problem or a coding challenge, what’s the verifiable reward for writing a beautiful poem or a compelling story? How do we create a reward function for “common sense”?</p>

<p>This is the central challenge we face. As we push our models to tackle more complex and nuanced tasks, the line between verifiable and subjective rewards will inevitably blur. But that doesn’t mean we should abandon the pursuit. The progress we’re seeing with RLVR in domains like math, science, and coding is a testament to its potential. It might not be the silver bullet that gets us all the way to AGI, but it’s a massive step in the right direction. It’s a step towards models that don’t just talk the talk but can actually walk the walk of reason. And that, in itself, is a revolution.</p>

<h3 id="references">References</h3>

<ul>
  <li>Amatriain, X. (2024). <em>Beyond Token Prediction: the post-Pretraining journey of modern LLMs</em>. <a href="https://amatria.in/blog/postpretraining">https://amatria.in/blog/postpretraining</a></li>
  <li>Minaee, S., et al. (2024). <em>Large language models: A survey</em>. <a href="https://arxiv.org/abs/2402.06196">https://arxiv.org/abs/2402.06196</a></li>
  <li>Lambert, N., et al. (2024). <em>Tülu 3: Pushing Frontiers in Open Language Model Post-Training</em> (introducing Reinforcement Learning with Verifiable Rewards). <a href="https://arxiv.org/abs/2411.15124">https://arxiv.org/abs/2411.15124</a></li>
  <li>Fireworks.ai. <em>Reinforcement Learning with Verifiable Reward</em>. <a href="https://fireworks.ai/blog/reinforcement-learning-with-verifiable-reward">https://fireworks.ai/blog/reinforcement-learning-with-verifiable-reward</a></li>
  <li>SemiAnalysis. (2025). <em>Scaling Reinforcement Learning: Environments, Reward Hacking, Agents, &amp; Scaling Data</em>. <a href="https://semianalysis.com/2025/06/08/scaling-reinforcement-learning-environments-reward-hacking-agents-scaling-data/">https://semianalysis.com/2025/06/08/scaling-reinforcement-learning-environments-reward-hacking-agents-scaling-data/</a></li>
  <li>DeepSeek-AI. (2025). <em>DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning</em>. <a href="https://arxiv.org/html/2501.12948">https://arxiv.org/html/2501.12948</a></li>
  <li>Uhr, N. (2025). <em>Implementing Deepseek’s GRPO from scratch</em>. Hugging Face Blog. <a href="https://huggingface.co/blog/NormalUhr/grpo">https://huggingface.co/blog/NormalUhr/grpo</a></li>
  <li>Lightning AI. <em>Build a reasoning LLM from scratch using GRPO</em>. <a href="https://lightning.ai/lightning-purchase-test/studios/build-a-reasoning-llm-from-scratch-using-grpo?section=featured">https://lightning.ai/lightning-purchase-test/studios/build-a-reasoning-llm-from-scratch-using-grpo?section=featured</a></li>
  <li>OpenAI. (2023). <em>Improving mathematical reasoning with process supervision</em>. <a href="https://openai.com/index/improving-mathematical-reasoning-with-process-supervision/">https://openai.com/index/improving-mathematical-reasoning-with-process-supervision/</a></li>
  <li>Lightman, H., et al. (2023). <em>Let’s Verify Step by Step</em>. <a href="https://arxiv.org/abs/2305.20050">https://arxiv.org/abs/2305.20050</a></li>
  <li>Lambert, N. (2025). <em>What Comes Next with Reinforcement Learning</em>. Interconnects.ai. <a href="https://www.interconnects.ai/p/what-comes-next-with-reinforcement">https://www.interconnects.ai/p/what-comes-next-with-reinforcement</a></li>
  <li>The Future of AI Agents. (Video). <a href="https://www.youtube.com/watch?v=JIsgyk0Paic">https://www.youtube.com/watch?v=JIsgyk0Paic</a></li>
  <li>Wang, X., et al. (2022). <em>Self-Consistency Improves Chain of Thought Reasoning in Language Models</em>. <a href="https://arxiv.org/abs/2203.11171v4">https://arxiv.org/abs/2203.11171v4</a></li>
  <li>Yao, S., et al. (2023). <em>Tree of Thoughts: Deliberate Problem Solving with Large Language Models</em>. <a href="https://arxiv.org/abs/2305.10601v1">https://arxiv.org/abs/2305.10601v1</a></li>
  <li>Karpathy, A. (2025). Twitter Post on RL Bull Case. <a href="https://x.com/karpathy/status/1944435412489171119">https://x.com/karpathy/status/1944435412489171119</a></li>
  <li>Karpathy, A. (2025). Twitter Post on Reward Function Scalability. <a href="https://x.com/karpathy/status/1960803117689397543">https://x.com/karpathy/status/1960803117689397543</a></li>
</ul>]]></content><author><name>Xavier</name></author><category term="Artificial Intelligence" /><category term="LLMs" /><summary type="html"><![CDATA[(This blog post, as most of my recent ones, is written with AI assistance and augmentation. First time using nano banana for the infographics!)]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://amatria.in/blog/blog/images/120-0.png" /><media:content medium="image" url="https://amatria.in/blog/blog/images/120-0.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">The AI Co-Developer 18 Months Later</title><link href="https://amatria.in/blog/ai-code-refactor" rel="alternate" type="text/html" title="The AI Co-Developer 18 Months Later" /><published>2025-07-06T00:00:01+00:00</published><updated>2025-07-06T00:00:01+00:00</updated><id>https://amatria.in/blog/ai-code-refactor</id><content type="html" xml:base="https://amatria.in/blog/ai-code-refactor"><![CDATA[<p>Eighteen months ago, I wrote a post on how Large Language Models would change software development, sharing my early experiences with the technology. You can read those initial thoughts <a href="https://amatriain.net/blog/aidevelopment">here</a>. A lot of things have happened since then, and my early observations are by now way outdated. That is why I decided to spend some time during a recent break to work on my side project and see how much better things are now.</p>

<p><img src="/blog/images/119-0.png" /></p>

<p>This time, my primary tool was <a href="https://cursor.sh/">Cursor</a>, an AI-first code editor, which I used almost exclusively in its agentic mode. I experimented with various backend models, including Gemini 2.5 Pro and Claude 4 Sonnet, though I often found myself defaulting to the “Auto” setting. I also incorporated <a href="https://jules.google.com/">Jules</a>, Google’s own software development agent, into my workflow.</p>

<p>Instead of starting a project from scratch, I decided to undertake a full refactoring of my old project, <a href="https://github.com/xamat/Xavibot">Xavibot</a>. The objective was to test the mettle of these AI agents on a relatively complex codebase while attempting substantial changes—a scenario often trickier than a clean-slate build.</p>

<p>The initial scope of the refactoring included migrating the chatbot’s backend from Azure to Google Cloud and transitioning the AI model from OpenAI’s GPT to Google’s Gemini (something I had been thinking about doing since my transition from LinkedIn/MSFT to Google for obvious reasons). This move wasn’t a simple swap. My original implementation was using the <a href="https://platform.openai.com/docs/assistants/overview">OpenAI Assistants API</a>, which conveniently manages file uploading, vector databases, and conversation memory. For Gemini, I would need to implement this functionality myself.</p>
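
<p>For context, this is roughly the plumbing the Assistants API was abstracting away. The sketch below is a hypothetical, stripped-down version in Python (the actual project is a Node.js backend), with <code>embed</code> and <code>generate</code> as placeholders for whatever model client gets wired in:</p>

<pre><code class="language-python">
import numpy as np

class SimpleRagMemory:
    """Toy stand-in for what the Assistants API provided out of the box:
    a vector store over uploaded documents plus rolling conversation memory.
    `embed` and `generate` are placeholder callables, not real SDK methods."""

    def __init__(self, embed, generate, max_turns=20):
        self.embed, self.generate = embed, generate
        self.doc_texts, self.doc_vectors = [], []
        self.history = []          # rolling conversation memory
        self.max_turns = max_turns

    def add_document(self, text):
        self.doc_texts.append(text)
        self.doc_vectors.append(self.embed(text))

    def ask(self, question, k=3):
        # Retrieve the k most similar chunks by cosine similarity.
        q = self.embed(question)
        sims = [float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
                for v in self.doc_vectors]
        top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
        context = "\n".join(self.doc_texts[i] for i in top)

        prompt = f"Context:\n{context}\n\nHistory:\n{self.history}\n\nQuestion: {question}"
        answer = self.generate(prompt)
        self.history = (self.history + [(question, answer)])[-self.max_turns:]
        return answer
</code></pre>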

<p>After the initial migration, I decided to complicate things a bit more and keep both the Gemini and GPT models, giving users the ability to dynamically switch between them. This is a neat feature not commonly seen in user-facing chat applications, and I was curious to see how hard it would be to implement.</p>
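
<p>Mechanically, the switching mostly comes down to hiding both backends behind one small interface and routing each message based on the user’s current choice. Here is a hypothetical Python sketch (the real backend is Node.js, and the client calls are placeholders, not actual SDK methods):</p>

<pre><code class="language-python">
class ChatProvider:
    """Minimal interface that both model backends implement."""
    def reply(self, history, user_message):
        raise NotImplementedError

class GeminiProvider(ChatProvider):
    def __init__(self, client):
        self.client = client  # placeholder for a Gemini client object
    def reply(self, history, user_message):
        return self.client.generate(history, user_message)

class GptProvider(ChatProvider):
    def __init__(self, client):
        self.client = client  # placeholder for an OpenAI client object
    def reply(self, history, user_message):
        return self.client.complete(history, user_message)

def handle_message(history, user_message, model_choice, providers):
    # The user can flip model_choice mid-conversation: only the provider
    # changes, while the shared history goes to whichever model is active.
    reply = providers[model_choice].reply(history, user_message)
    history.append((user_message, reply))
    return reply
</code></pre>

<p>The nice property of this shape is that adding a third provider is one more class, not a rewrite.</p>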

<p>The entire refactoring process took about three days, totaling around 15-20 hours of focused work. As is often the case, the last 10% of the changes consumed 90% of the time. Deploying to Google Cloud and implementing the dynamic model switching proved to be the most time-intensive parts of the project.</p>

<h3 id="what-went-well">What Went Well</h3>

<p>The experience was, for the most part, very positive. I was particularly impressed by:</p>

<ul>
  <li><strong>Holistic Code Refactoring:</strong> The agents demonstrated a remarkable ability to propose and implement changes across numerous files simultaneously when prompted with a high-level refactoring goal.</li>
  <li><strong>Intelligent Debugging:</strong> Having the Cursor agent read through logs to identify errors and suggest fixes was immensely helpful. It streamlined a process that is often tedious and time-consuming.</li>
</ul>

<h3 id="what-didnt-go-so-well">What Didn’t Go So Well</h3>

<p>Despite the significant strides in AI-assisted development, there were several areas where the models still fell short:</p>

<ul>
  <li><strong>Knowledge Gaps:</strong> The models often lacked up-to-date information about API features. For instance, I had to explicitly inform the models that the Gemini API included an explicit caching option.</li>
  <li><strong>API Versioning Issues:</strong> The AI models have a hard time dealing with APIs that have different versions. On several occasions, I had to read the documentation myself and point the LLM to the correct URL. I anticipate that maintaining updated, machine-readable documentation for coding agents (maybe brokered through an MCP server) will be an important use case soon.</li>
  <li><strong>Context Loss:</strong> During longer refactoring sessions, the models would lose context, and I found myself repeatedly reminding them of previously discussed details.</li>
  <li><strong>Tendency to Overcomplicate:</strong> Models still default to overcomplicated solutions. Thanks to my experience, I was able to sidestep most of these proposals, but I can see how this could lead to convoluted and unnecessary code for less-seasoned developers. Even in my case, I had to prompt the models to clean up and delete unused code after we got to a working solution.</li>
</ul>

<h3 id="the-verdict">The Verdict</h3>

<p>All that said, this is a huge improvement in developer experience in only 18 months, and I was able to do a lot of work in just a few days. For my next iteration, I might refactor the Node.js backend to Python, since I have no reason to maintain the Node.js approach and Python APIs are usually updated sooner.</p>

<p>I invite you to try out the new and improved <a href="https://amatriain.net/Xavibot/">Xavibot</a> directly here in this blog or visiting the URL directly. Let me know your thoughts. Any other suggestions on what other features to add? Again, code is available <a href="https://github.com/xamat/Xavibot">here</a></p>]]></content><author><name>Xavier</name></author><category term="Artificial Intelligence" /><category term="LLMs" /><category term="software engineering" /><summary type="html"><![CDATA[Eighteen months ago, I wrote a post on how Large Language Models would change software development, sharing my early experiences with the technology. You can read those initial thoughts here. A lot of things have happened since then, and my early observations are by now way outdated. That is why I decided to spend some time during a recent break to work on my side project and see how much better things are now.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://amatria.in/blog/blog/images/119-0.png" /><media:content medium="image" url="https://amatria.in/blog/blog/images/119-0.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Being Human in the Age of AI: On Critical Thinking, Agency, and Scientific Discovery</title><link href="https://amatria.in/blog/2025human" rel="alternate" type="text/html" title="Being Human in the Age of AI: On Critical Thinking, Agency, and Scientific Discovery" /><published>2025-04-01T00:00:01+00:00</published><updated>2025-04-01T00:00:01+00:00</updated><id>https://amatria.in/blog/being-human</id><content type="html" xml:base="https://amatria.in/blog/2025human"><![CDATA[<p>We seem to be living through a pivotal moment, often dubbed the <a href="https://www.gatesnotes.com/the-age-of-ai-has-begun">“Age of AI”</a>. It feels like barely a day goes by without news of another AI breakthrough that promises to revolutionize how we live, work, and perhaps even think. This rapid advancement naturally leads to big questions: What does it mean to be human when machines become increasingly intelligent? Is intelligence really what defines us?</p>

<h1 id="intelligence-isnt-everything">Intelligence Isn’t Everything</h1>

<p>I’d argue that intelligence, while a significant human trait, isn’t the sole factor that makes us human. We are a complex mix: intelligence, yes, but also feelings, creativity, physical presence, social interactions, and so much more. While our particular brand of intelligence often differentiates us from animals, it doesn’t always make us objectively more “intelligent” in every context. A bird’s navigational intelligence, for instance, far surpasses that of most humans. (Read <a href="https://www.amazon.com/Nietzsche-Were-Narwhal-Intelligence-Stupidity/dp/0316388068">“If Nietzsche Were a Narwhal: What Animal Intelligence Reveals About Human Stupidity”</a> for more examples and details.)
Similarly, AI is rapidly becoming more “intelligent” than humans in specific, and increasingly general, domains. It can process vast amounts of data, identify patterns we miss, and even, as we’ll see, contribute to creative and scientific endeavors. The key, then, isn’t to compete with AI on raw intelligence but to leverage its capabilities to enhance our own intelligence while actively promoting and valuing those other uniquely human aspects – our empathy, our creativity, our ability to connect and feel.</p>

<h1 id="even-human-experts-arent-perfect">Even Human Experts Aren’t Perfect</h1>

<p>It’s also crucial to remember that even expert humans are fallible. We often place immense trust in experts, like medical doctors, but studies show their “intelligence” or accuracy has limits. For example, one fascinating study highlighted that in straightforward diagnostic cases, doctors’ average accuracy was just over 55%, dropping to less than 6% for more complex cases (see <a href="https://jamanetwork.com/journals/jamainternalmedicine/fullarticle/1731967">“Physicians’ Diagnostic Accuracy, Confidence, and Resource Requests”</a>). Perhaps even more surprisingly, their confidence level remained high regardless of the case’s difficulty (around 72% for easy cases vs. 64% for hard ones).
What does this tell us? Even the best human experts make mistakes, and often, they aren’t aware when they might be wrong. So, as AI systems become increasingly capable, potentially surpassing human experts in many areas, we must remember they too will make mistakes. Blind trust is unwise, whether placed in a human or an AI.</p>

<h1 id="agency-and-judgment-the-human-domain">Agency and Judgment: The Human Domain</h1>

<p>This brings us to a critical point: decisions are for humans. While AI can provide insights, predictions, and recommendations at a scale and speed we can’t match, it lacks true agency, judgment, and the capacity for responsibility. An AI can’t be held accountable for the consequences of a decision. This becomes even more critical as we increasingly hear about “AI agents” capable of performing complex tasks autonomously. While powerful, these agents still lack genuine responsibility and accountability for their actions or outcomes. We must resist the temptation to outsource our own agency to these systems; the ultimate responsibility must remain firmly in human hands.
Therefore, we must resist the temptation to let AI make important decisions for us. Use AI as a powerful tool, a co-pilot, or even a knowledgeable advisor. But never blindly trust its output. Just as you might seek a second opinion from another human expert, consider getting multiple perspectives, whether from different AIs or a combination of AI and human experts. Ultimately, you must own the decision. This requires critical thinking, evaluation, and applying our uniquely human blend of intelligence, values, and intuition.</p>

<p><img src="/blog/images/118-0.png" /></p>

<h1 id="ai-in-science-disruption-and-opportunity">AI in Science: Disruption and Opportunity</h1>

<p>The world of scientific discovery is already being profoundly impacted by AI. In a fascinating, and perhaps slightly unnerving, development, a scientific paper generated entirely by AI recently made its way into a top-tier peer-reviewed workshop. <a href="https://sakana.ai/ai-scientist-first-publication/">This experiment by Sakana AI</a> is likely just the beginning.</p>

<p>This raises questions about the future of scientific innovation. Some, like Thomas Wolf in his piece <a href="https://thomwolf.io/blog/scientific-ai.html">“The Einstein AI Model”</a>, argue that AI, lacking true understanding and creativity, will never be able to genuinely innovate in science. While his narrative is compelling and worth reading, I have to disagree. History, even recent AI history, suggests otherwise. Remember AlphaGo’s groundbreaking Move 37? It was a move born from patterns learned through deep reinforcement learning (RL) and self-play, a move that human Go masters considered genuinely novel and creative.</p>

<p><img src="/blog/images/118-1.png" /></p>

<p>Emerging research indicates that LLMs, trained using similar RL and self-play techniques, can generate scientific hypotheses evaluated as more novel than those produced by human scientists. An AI “co-scientist” could potentially accelerate discovery far beyond what adding another human expert to the team could achieve (see <a href="https://arxiv.org/abs/2502.18864">this paper</a> for details). This disruption is happening now. Scientific organizations need to take this seriously, not by fighting it, but by embracing it and figuring out how to work with AI to push the boundaries of knowledge faster and further than ever before.</p>

<p><img src="/blog/images/118-2.png" /></p>

<h1 id="critical-thinking-in-the-age-of-ai-are-we-getting-dumber">Critical Thinking in the Age of AI: Are We Getting Dumber?</h1>

<p>A common fear is that relying on AI will atrophy our own cognitive skills, particularly critical thinking. Headlines often scream that “GenAI makes you dumb”. But does it?</p>

<p>A recent study by Microsoft and CMU titled <a href="https://www.microsoft.com/en-us/research/publication/the-impact-of-generative-ai-on-critical-thinking-self-reported-reductions-in-cognitive-effort-and-confidence-effects-from-a-survey-of-knowledge-workers/">“The Impact of Generative AI on Critical Thinking”</a> explored how knowledge workers perceive their critical thinking when using GenAI. It’s crucial to note the study focuses on self-reported perceptions, not objective measures of critical thinking ability.</p>

<p>The findings? People report using more critical thinking when they trust the AI less or trust themselves more on a task. Interestingly, the type of task didn’t seem to affect the reported level of critical thinking. The study also suggests a shift in approach: humans move from executing tasks to verifying AI-generated outputs. Perhaps most relevant to the headlines, users perceived less effort was needed to engage critical thinking when using GenAI.</p>

<p>Does perceiving less effort mean critical thinking is actually reduced? The study doesn’t prove that. It’s meta-ironic to see humans exhibiting poor critical thinking by <a href="https://techcrunch.com/2025/02/10/is-ai-making-us-dumb/">misinterpreting</a> a study about AI and critical thinking. The paper itself offers nuanced insights and suggests that AI tools should be designed to encourage human agency and critical evaluation, aligning with points <a href="https://amatriain.net/blog/llmsdoctors">I’ve made previously</a>.</p>

<h1 id="ai-learning-and-the-effort-equation">AI, Learning, and the Effort Equation</h1>

<p>The use case of AI in learning warrants special attention. Intuitively, allowing AI to simply provide answers feels like it would bypass the necessary mental effort crucial for deep understanding and retention. Recent research confirms this intuition isn’t entirely off-base, revealing that how students use AI tools like LLMs significantly impacts learning outcomes (see <a href="https://arxiv.org/abs/2409.09047">“AI Meets the Classroom: When Do Large Language Models Harm Learning?”</a>). Using AI to substitute for learning activities (e.g., getting solutions directly) allows students to cover material faster (increasing breadth), but demonstrably reduces their depth of understanding. Conversely, using AI to complement learning (e.g., asking for explanations) improves understanding without necessarily speeding up coverage.</p>

<p>Disturbingly, studies suggest students gravitate towards the less effective substitution method, possibly because it requires less immediate effort. This preference, combined with findings that LLMs may widen the learning gap by benefiting high-knowledge students more than those with less prior knowledge, raises concerns. Furthermore, students using LLMs tend to overestimate how much they’ve actually learned compared to their tested performance.</p>

<p><img src="/blog/images/118-3.png" /></p>

<p>These findings highlight a critical challenge for education. While AI offers potential as a powerful complementary learning aid, its potential for misuse as an effort-substitute is significant. Thoughtful integration, clear guidelines encouraging complementary use, and perhaps even interface designs that discourage passive substitution are necessary to harness AI’s educational benefits without undermining the learning process itself, especially for vulnerable learners.</p>

<h1 id="embracing-our-humanity">Embracing Our Humanity</h1>

<p>The Age of AI isn’t about humans versus machines. It’s about defining what truly matters to us as humans and leveraging these incredible new tools to enhance those aspects. AI will undoubtedly handle more cognitive tasks, but it cannot replicate the richness of human experience – our emotions, ethical judgments, creativity born from lived experience, physical interactions, and deep social bonds. Leveraging these tools effectively requires not only individual awareness and agency but also a commitment from creators to design AI systems that encourage critical engagement, support deeper understanding, and ultimately empower, rather than merely automate, our uniquely human capabilities. Let’s use AI to free ourselves up to be more human, not less.</p>]]></content><author><name>Xavier</name></author><category term="ai" /><category term="research" /><category term="philosophy" /><summary type="html"><![CDATA[We seem to be living through a pivotal moment, often dubbed the “Age of AI”. It feels like barely a day goes by without news of another AI breakthrough that promises to revolutionize how we live, work, and perhaps even think. This rapid advancement naturally leads to big questions: What does it mean to be human when machines become increasingly intelligent? Is intelligence really what defines us?]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://amatria.in/blog/blog/images/118-0.png" /><media:content medium="image" url="https://amatria.in/blog/blog/images/118-0.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">2024: A Year in AI Research</title><link href="https://amatria.in/blog/2024research" rel="alternate" type="text/html" title="2024: A Year in AI Research" /><published>2024-12-31T00:00:01+00:00</published><updated>2024-12-31T00:00:01+00:00</updated><id>https://amatria.in/blog/year-in-ai-research</id><content type="html" xml:base="https://amatria.in/blog/2024research"><![CDATA[<p>2024 has been an intense year for AI. While some argue that we haven’t made much progress, I beg to differ. It is true that many of the research advances from 2023 have still 
not made it to mainstream applications. But, that doesn’t mean that research is not making progress all around!</p>

<p>Every month I send my team at Google a few paper recommendations. For this end-of-year blog post, I went through all my monthly emails, picked my favorite articles, and I 
grouped them into categories. In each category I kept them ordered by publication date, so you may get a sense of progress in each of them.</p>

<p>Of course, this is a highly curated and probably biased list. I hope you enjoy it, and please let me know which favorite paper of yours I missed!</p>

<p><img src="/blog/images/117-0.jpeg" />
<em>Imagen3. Prompt= “Futuristic cityscape under construction in 2024, representing rapid progress in AI research, many buildings under construction, robots, drones, holographic blueprints, but also trees and nature and happy people and children, each building a different research area, detailed, cinematic lighting, concept art, warm colors”</em></p>

<ul>
  <li><a href="#models">LLMs, surveys and new models</a></li>
  <li><a href="#techniques">New techniques</a></li>
  <li><a href="#rag">RAG and beyond</a></li>
  <li><a href="#apps">Domain-specific applications of LLMs</a></li>
  <li><a href="#security">AI Security and alignment</a></li>
  <li><a href="#agents">Agents</a></li>
  <li><a href="#eval">LLM Evaluation</a></li>
  <li><a href="#beyond">Beyond LLMs</a></li>
</ul>

<h1 id="llms-surveys-and-new-models"><a name="models"></a>LLMs, surveys and new models</h1>

<ol>
  <li><strong><a href="https://arxiv.org/abs/2402.06196">Large Language Models: A Survey</a></strong> - This comprehensive survey paper, which I had the privilege of co-authoring, provides a broad overview of the rapidly evolving landscape of Large Language Models (LLMs). It covers everything from model architectures and training techniques to applications and ethical considerations. I was particularly interested in contributing to the section on prompt engineering, agents, and post-attention LLMs. It’s been incredibly gratifying to see this paper cited nearly 500 times in less than a year, highlighting the immense interest in this field. <strong>A must-read for anyone looking to get up to speed on LLMs</strong>.</li>
  <li><strong><a href="https://arxiv.org/abs/2402.17764">The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits</a></strong> -  This paper introduces BitNet, a groundbreaking approach to training LLMs that drastically reduces the precision of model parameters to just 1.58 bits, effectively making each parameter ternary (-1, 0, or +1). While reducing precision this drastically might seem counterintuitive, the authors demonstrate that BitNet models achieve comparable performance to full-precision models while delivering significant improvements in latency, memory usage, throughput, and energy consumption. This could be a game-changer for deploying LLMs on resource-constrained devices and making them more environmentally sustainable. <strong>This work challenges the conventional wisdom that high precision is always necessary for LLM performance</strong>.</li>
  <li><strong><a href="https://arxiv.org/abs/2403.19887">Jamba: A Hybrid Transformer-Mamba Language Model</a></strong> -  Jamba represents an exciting new direction in LLM architecture, combining the strengths of Transformers with the efficiency of Structured State Space Models (SSMs), specifically the Mamba architecture. This hybrid approach allows Jamba to handle longer contexts more effectively than traditional Transformers, while also being more computationally efficient. <strong>As one of the first open-source models to successfully integrate these two architectures, Jamba is a significant step towards building more powerful and scalable LLMs</strong>.</li>
  <li><strong><a href="https://arxiv.org/abs/2407.07726v1">PaliGemma: A versatile 3B VLM for transfer</a></strong> - PaliGemma, a new open-source vision-language model from Google DeepMind, stands out for its compact size (3 billion parameters) and its focus on transfer learning. Unlike many larger models that are trained from scratch for each new task, PaliGemma is specifically designed to be fine-tuned efficiently for a wide range of downstream applications. This makes it a particularly valuable tool for researchers and developers with limited resources, and it underscores the growing importance of transfer learning in the field of vision-language AI.</li>
  <li><strong><a href="https://arxiv.org/abs/2408.00118">Gemma 2: Improving Open Language Models at a Practical Size</a></strong> - The Gemma 2 report offers a treasure trove of insights into the training process of state-of-the-art LLMs. The authors delve into the details of their architectural choices, including variations on attention mechanisms and the use of knowledge distillation, all while prioritizing a model size that is practical for real-world use. <strong>This report is highly recommended for anyone interested in the nitty-gritty of LLM development and the trade-offs involved in optimizing for performance, efficiency, and accessibility</strong>.</li>
  <li><strong><a href="https://ai.meta.com/research/publications/the-llama-3-herd-of-models/">The Llama 3 Herd of Models</a></strong> -  While the Llama 3 models from Meta don’t introduce radical architectural changes, they demonstrate the continued power of scaling up existing approaches. <strong>The key takeaway here is the sheer impact of using more data and larger models, combined with meticulous engineering</strong>. Llama 3 models achieve impressive results across a range of benchmarks, reaffirming that bigger (and with better data) often is better, at least for now. Still, it does make one wonder when diminishing returns will set in.</li>
  <li><strong><a href="https://arxiv.org/abs/2408.07009">Imagen3</a></strong> - Imagen 3, Google’s latest text-to-image model, pushes the boundaries of image generation quality and control. It outperforms all other models, including DALL-E 3 and Midjourney v6, on most benchmarks, demonstrating a remarkable ability to translate complex textual prompts into highly detailed and coherent images. While Midjourney v6 still holds a slight edge in subjective visual appeal, Imagen 3’s superior performance on benchmarks requiring precise prompt adherence highlights its strength in controllability. <strong>This model represents a significant leap forward in text-to-image generation, bringing us closer to AI that can truly understand and visualize our creative visions</strong>.</li>
  <li><strong><a href="https://openai.com/index/gpt-4o-system-card/">GPT-4o System Card</a></strong> - The GPT-4o System Card provides a crucial glimpse into the extensive safety work that went into OpenAI’s latest flagship model. It outlines the rigorous evaluations, including external red teaming and frontier risk assessments, that were conducted before its release. <strong>This document is a must-read for anyone interested in the ethical considerations surrounding advanced AI and the growing importance of proactive safety measures in the development of powerful language models</strong>.</li>
  <li><strong><a href="https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/">Llama 3.2: Revolutionizing edge AI and vision with open, customizable models</a></strong> - Meta’s release of Llama 3.2 models (both LLMs and VLMs) emphasizes the increasing demand for powerful yet efficient AI that can run on edge devices. These smaller models achieve state-of-the-art performance for their size, making them suitable for deployment on smartphones, wearables, and other resource-constrained hardware. This is a significant step towards democratizing access to advanced AI and enabling a new wave of on-device applications.</li>
  <li><strong><a href="https://ai.meta.com/research/publications/movie-gen-a-cast-of-media-foundation-models/">Movie Gen: A Cast of Media Foundation Models</a></strong> - Meta’s MovieGen showcases the impressive capabilities of foundation models in the realm of multimedia generation. This collection of models achieves state-of-the-art results on a wide range of tasks, including text-to-video, video-to-audio, and text-to-audio generation. <strong>This work highlights the growing power of AI to create and manipulate different forms of media, opening up exciting possibilities for creative expression and content generation</strong>.</li>
  <li><strong><a href="https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf">Deepseek v3 technical report</a></strong> - The DeepSeek-V3 technical report is a testament to the fact that you don’t need a massive budget to build a state-of-the-art LLM. This model has been making waves in the AI community due to its impressive performance, achieved with a surprisingly small training budget. <strong>This report provides valuable insights into how to efficiently train high-performing LLMs, making it a valuable resource for researchers and developers working with limited resources</strong>.</li>
</ol>

<p><img src="/blog/images/117-1.png" />
<em>From our “Large Language Models: A Survey” paper</em></p>

<h1 id="new-techniques"><a name="techniques">New techniques</a></h1>

<ol>
  <li><strong><a href="https://arxiv.org/abs/2403.03507">GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection</a></strong> -  GaLore introduces a novel training strategy that tackles the memory bottleneck of training large language models. By using gradient low-rank projection, it enables full-parameter learning while being significantly more memory-efficient than popular methods like LoRA. This could open up new possibilities for training larger and more complex models on existing hardware, pushing the boundaries of what’s possible in LLM research..</li>
  <li><strong><a href="https://arxiv.org/html/2404.07143v1">Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention</a></strong> - Infini-attention offers a clever solution to the context length limitations of Transformers. This new attention mechanism allows for efficient processing of arbitrarily long contexts, even on LLMs with a relatively small number of parameters. By improving both context length and inference efficiency, Infini-attention could enable LLMs to process and understand much larger documents and conversations, unlocking new applications in areas like document summarization, question answering, and dialogue systems.</li>
  <li><strong><a href="https://arxiv.org/abs/2406.17557">The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale</a></strong> - This paper from HuggingFace tackles the crucial issue of data quality in LLM training. The authors present a new dataset called FineWeb, which is carefully curated from web data using a rigorous filtering and cleaning process. <strong>This work highlights the importance of high-quality data for achieving optimal LLM performance and provides a valuable resource for researchers working on improving the quality of web-scale datasets</strong>.</li>
  <li><strong><a href="https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html">Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet</a></strong> - This fascinating work from Anthropic delves into the inner workings of LLMs, demonstrating a method for extracting interpretable features from the Claude 3 Sonnet model. Not only can they identify these features, but they also show how they can be manipulated to control the model’s output. <strong>This research provides valuable insights into the interpretability of LLMs and opens up exciting possibilities for steering their behavior in a more fine-grained way</strong>. I find this area to be one of the most important in the path to safe AGI.</li>
  <li><strong><a href="https://arxiv.org/abs/2405.21060">Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality</a></strong> - This paper makes the surprising and insightful claim that Transformers, the dominant architecture in modern LLMs, are actually a specific type of Structured State Space Model (SSM). This connection opens up new avenues for understanding and potentially improving both Transformers and SSMs. By revealing this underlying duality, the authors provide a unifying framework that could lead to the development of more generalized and efficient models for sequence processing.</li>
  <li><strong><a href="https://arxiv.org/abs/2406.02528">Scalable MatMul-free Language Modeling</a></strong> - This paper challenges the fundamental reliance on matrix multiplications (MatMul) in Transformer architectures. The authors demonstrate that it’s possible to build language models without any MatMul operations, potentially leading to significant improvements in computational efficiency. This is a radical departure from conventional approaches and could pave the way for new hardware architectures optimized for language modeling.</li>
  <li><strong><a href="https://arxiv.org/abs/2402.12354">LoRA+: Efficient Low Rank Adaptation of Large Models</a></strong> - LoRA+ builds upon the popular LoRA (Low-Rank Adaptation) technique, further enhancing its efficiency by dynamically adjusting the learning rate for different parameters during fine-tuning. <strong>This relatively simple yet effective modification can lead to faster convergence and improved performance when adapting large models to new tasks, making it a valuable tool for practitioners working with LLMs</strong>.</li>
  <li><strong><a href="https://arxiv.org/abs/2402.12354">Self-Play Preference Optimization for Language Model Alignment</a></strong> - This paper introduces a novel alternative to Reinforcement Learning from Human Feedback (RLHF) for aligning language models with human preferences. Instead of relying on external human feedback, it uses a self-play mechanism where the model competes against itself to improve its performance. <strong>This approach offers a potentially more scalable and efficient way to fine-tune LLMs for specific tasks or domains, and it raises interesting questions about the nature of learning and optimization in AI</strong>.</li>
  <li><strong><a href="https://arxiv.org/abs/2409.12917">Training Language Models to Self-Correct via Reinforcement Learning</a></strong> - This work from DeepMind explores the use of reinforcement learning to train language models to self-correct their mistakes. By incorporating a self-correction mechanism during training, the model learns to identify and rectify errors, leading to improved performance and robustness. The authors show that this approach can act as a form of regularization, preventing overfitting and improving generalization to unseen data.</li>
  <li><strong><a href="https://arxiv.org/abs/2411.17800">STAR: Synthesis of Tailored Architectures</a></strong> - The STAR paper introduces a novel approach to automatically designing neural network architectures using a gradient-free evolutionary algorithm. Instead of relying on manual design or gradient-based optimization, STAR explores a vast search space of architectures, evolving them over time to find optimal solutions. <strong>This work has the potential to automate the process of neural architecture search, leading to the discovery of new and potentially more efficient architectures for various task</strong>s.</li>
  <li><strong><a href="https://arxiv.org/abs/2406.15786">What Matters in Transformers? Not All Attention is Needed</a></strong> - This paper investigates the inner workings of the Transformer architecture, questioning the necessity of every single attention head. Their findings suggest that current Transformer architectures contain redundancies, and that similar performance can be achieved with fewer attention mechanisms. <strong>This research could lead to more efficient Transformer designs and a deeper understanding of what makes them so effective</strong>.</li>
  <li><strong><a href="https://arxiv.org/abs/2411.09009">Cut Your Losses in Large-Vocabulary Language Models</a></strong> - This paper from Apple demonstrates a clever optimization technique for large-vocabulary language models. By implementing a custom kernel for matrix multiplication and log-sum-exp operations, they are able to significantly reduce computational cost and improve efficiency. The authors use the Gemma model as an example, showcasing the practical benefits of their approach. <strong>This is another great example of how low-level optimizations can have a big impact on the performance of large models</strong>.</li>
  <li><strong><a href="https://arxiv.org/abs/2408.03314">Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters</a></strong> - This paper challenges the conventional wisdom that bigger models are always better. The authors, from DeepMind, argue that scaling up computation at test time can be a more effective strategy than simply increasing the number of model parameters. While this idea has been recently popularized by speculation around OpenAI’s “O1” model, this paper provides the original research underpinning this concept. This highlights the importance of considering not just model size but also computational efficiency during inference, and it opens up new avenues for optimizing LLM performance.</li>
  <li><strong><a href="https://ai.meta.com/research/publications/large-concept-models-language-modeling-in-a-sentence-representation-space/">Large Concept Models: Language Modeling in a Sentence Representation Space</a></strong> - Meta’s research on Large Concept Models (LCMs) explores a fascinating new direction for language modeling. Instead of predicting individual tokens, LCMs operate in a sentence representation space, predicting hierarchical concepts. This approach could potentially lead to more abstract and human-like understanding of language. <strong>While still early, this work suggests that there’s more to language modeling than just token prediction, and it opens up exciting new avenues for research into higher-level reasoning and understanding</strong>.</li>
  <li><strong><a href="https://deepmind.google/discover/blog/pushing-the-frontiers-of-audio-generation/">Pushing the frontiers of audio generation</a></strong> -  “Having worked on audio synthesis myself many years ago, I’ve been consistently amazed by the rapid progress in AI-driven audio generation. This blog post from DeepMind provides an overview of the research that has led to these breakthroughs, covering areas like neural audio codecs and diffusion models. <strong>This is a great read for anyone interested in the intersection of AI and audio, and it showcases the incredible potential of AI to generate realistic and creative soundscapes</strong>.</li>
</ol>

<h1 id="rag-and-beyond"><a name="rag">RAG and beyond</a></h1>

<ol>
  <li><strong><a href="https://arxiv.org/abs/2409.01666">In Defense of RAG in the Era of Long-Context Language Models</a></strong> - This paper makes a compelling case for the continued relevance of Retrieval-Augmented Generation (RAG) even as LLMs with increasingly long context windows become available. The authors argue that RAG is not simply a workaround for limited context length but offers distinct advantages in many scenarios, such as when dealing with rapidly evolving information or when needing to provide sources for generated text. <strong>This is a valuable read for anyone working with LLMs, as it challenges the assumption that longer context windows will necessarily replace retrieval-based methods</strong>.</li>
  <li><strong><a href="https://www.anthropic.com/news/contextual-retrieval">Introducing Contextual Retrieval</a></strong> -  “Anthropic’s work on contextual retrieval introduces a refined approach to RAG that takes into account the broader context of a query when retrieving relevant information. They propose two main techniques—contextual embeddings and contextual BM25—to improve the accuracy and relevance of retrieved passages. <strong>This research highlights the ongoing efforts to make RAG systems more sophisticated and context-aware, potentially leading to more accurate and nuanced responses from LLMs</strong>.</li>
  <li><strong><a href="https://arxiv.org/abs/2404.16130">From Local to Global: A Graph RAG Approach to Query-Focused Summarization</a></strong> - This paper from Microsoft explores a novel approach to query-focused summarization using a graph-based RAG system. By representing information as a graph, the model can better capture the relationships between different pieces of information and generate more coherent and comprehensive summaries. <strong>This work demonstrates the potential of graph-based methods for enhancing RAG systems and tackling complex information retrieval tasks</strong>.</li>
</ol>

<h1 id="domain-specific-applications-of-llms"><a name="apps">Domain-specific applications of LLMs</a></h1>

<ol>
  <li><strong><a href="https://arxiv.org/abs/2404.18416">Capabilities of Gemini Models in Medicine</a></strong> - This paper showcases the impressive capabilities of Google’s multimodal Gemini models in the medical domain. The results are quite striking: Gemini models outperform medical experts on a wide range of tasks, including medical image interpretation, diagnosis, and report generation. <strong>This research underscores the transformative potential of AI in healthcare, although it also raises important questions about the role of human expertise in the future of medicine</strong>.</li>
  <li><strong><a href="https://arxiv.org/abs/2404.14662">NExT: Teaching Large Language Models to Reason about Code Execution</a></strong> - The NExT paper presents a method for teaching LLMs to reason about the execution of code, a crucial step towards building more reliable and trustworthy AI for software development. The authors demonstrate that LLMs can learn to accurately predict the output of code snippets, even for complex programs. <strong>This research has significant implications for automating code analysis, debugging, and even code generation, potentially leading to substantial gains in programmer productivity</strong>.</li>
  <li><strong><a href="https://arxiv.org/abs/2404.11794">Automated Social Science: Language Models as Scientist and Subjects</a></strong> - This paper explores the intriguing possibility of using LLMs to automate social science research. By combining structural causal models with the reasoning capabilities of LLMs, the authors propose a framework for conducting automated experiments and generating hypotheses. <strong>While still in its early stages, this research opens up exciting new avenues for using AI to accelerate scientific discovery in the social sciences, potentially leading to a deeper understanding of human behavior and social phenomena</strong>.</li>
  <li><strong><a href="https://arxiv.org/abs/2410.10901">3DS: Decomposed Difficulty Data Selection’s Case Study on LLM Medical Domain Adaptation</a></strong> - This paper challenges the common practice of domain adaptation and fine-tuning LLMs for specific medical tasks. The authors argue that frontier models, such as GPT-4, are already good enough for many medical applications, even without specialized training. They introduce a method called 3DS that selects training data based on decomposed difficulty levels. <strong>This research questions some of the prevailing assumptions about the need for extensive domain adaptation in every case and suggests that the capabilities of general-purpose LLMs might be underestimated in specialized domains like medicine</strong>.</li>
</ol>

<h1 id="ai-security-and-alignment"><a name="security">AI Security and alignment</a></h1>

<ol>
  <li><strong><a href="https://arxiv.org/abs/2402.11753">ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs</a></strong> - This paper reveals a surprising vulnerability in aligned LLMs: they can be jailbroken using carefully crafted ASCII art prompts. This might seem whimsical at first, but it highlights the fragility of current alignment techniques and the creative ways in which adversaries might try to circumvent them. <strong>This research underscores the need for more robust and comprehensive methods for aligning LLMs with human values and preventing them from being misused</strong>.</li>
  <li><strong><a href="https://www.alignmentforum.org/posts/79BPxvSsjzBkiSyTq/agi-safety-and-alignment-at-google-deepmind-a-summary-of">AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work</a></strong> - This paper provides a valuable overview of Google DeepMind’s extensive research on AGI safety and alignment. It covers a wide range of topics, including scalable oversight, robustness, and system safety. <strong>This is a key read for anyone interested in the long-term safety of AI and the challenges of ensuring that increasingly powerful AI systems remain aligned with human goals</strong>.</li>
  <li><strong><a href="https://arxiv.org/abs/2404.09932">Foundational Challenges in Assuring Alignment and Safety of Large Language Models</a></strong> - This multi-institution paper delves into the fundamental challenges of ensuring the alignment and safety of LLMs. It provides a framework for understanding the different types of alignment failures and outlines key research directions for addressing them. <strong>This paper is a valuable contribution to the growing field of AI safety and highlights the need for collaborative efforts to tackle these complex issues</strong>.</li>
  <li><strong><a href="https://arxiv.org/abs/2406.04231">Quantifying Misalignment Between Agents: Towards a Sociotechnical Understanding of Alignment</a></strong> - This paper extends the discussion of alignment beyond individual models to the realm of multi-agent systems. The authors argue that misalignment between agents can be just as important, if not more so, than misalignment between a model and human values. They propose a framework for quantifying misalignment in multi-agent settings. <strong>This work highlights the need to consider the broader social and technical context in which AI systems operate and to develop methods for ensuring that agents can effectively cooperate and coordinate with each other</strong>.</li>
</ol>

<h1 id="agents"><a name="agents">Agents</a></h1>

<ol>
  <li><strong><a href="https://arxiv.org/abs/2402.06360">CoSearchAgent: A Lightweight Collaborative Search Agent with Large Language Models</a></strong> - The CoSearchAgent paper introduces a lightweight, collaborative search agent that leverages the power of multiple specialized LLMs. By dividing the search task among different agents, each with its own expertise, the system can achieve more comprehensive and accurate results. <strong>This research demonstrates the potential of multi-agent systems for tackling complex information retrieval tasks and highlights the benefits of a collaborative approach</strong>.</li>
  <li><strong><a href="https://arxiv.org/abs/2406.04692">Mixture-of-Agents Enhances Large Language Model Capabilities</a></strong> - This work from Together.ai provides further evidence for the power of multi-agent systems. They show that combining agents built using smaller, open-source models can outperform even the most advanced, monolithic LLMs on certain tasks. <strong>This research suggests that the future of AI might lie not in ever-larger single models but in well-designed systems of interacting, specialized agents</strong>.</li>
  <li><strong><a href="https://arxiv.org/abs/2406.06469">Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning</a></strong> - The Husky paper, with authors from Meta and other institutions, introduces a new open-source language agent designed for multi-step reasoning. This general-purpose agent can tackle a wide range of tasks that require planning and problem-solving. The release of Husky as an open-source tool is a significant contribution to the field, providing researchers and developers with a powerful platform for building and experimenting with language agents.</li>
  <li><strong><a href="https://www.langchain.com/stateofaiagents">LangChain State of AI Agents</a></strong> - This report from the LangChain team provides a comprehensive overview of the rapidly evolving field of AI agents. It covers the different types of agents, the key challenges in building them, and the most promising applications. <strong>This is an excellent resource for anyone looking to get up to speed on the current state of agent research and development</strong>.</li>
</ol>

<h1 id="llm-evaluation"><a name="eval">LLM Evaluation</a></h1>

<ol>
  <li><strong><a href="https://arxiv.org/abs/2404.12272">Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences</a></strong> - This paper tackles the crucial issue of evaluating LLMs, particularly when using other LLMs as evaluators. The authors introduce EvalGen, an interface that allows humans to grade LLMs on their ability to evaluate other LLMs, thus aligning the evaluation process with human preferences. <strong>This research highlights the importance of meta-evaluation in the development of LLMs and provides a practical tool for improving the reliability of LLM-assisted evaluations</strong>.</li>
  <li><strong><a href="https://arxiv.org/abs/2405.02287">Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models</a></strong> - The Vibe-Eval benchmark, introduced by researchers at Reka AI, offers a new and challenging way to evaluate multimodal LLMs. It focuses on tasks that require a deep understanding of both visual and textual information. The introduction of this benchmark will push the development of more robust and capable multimodal models, helping us measure true progress in this important area.</li>
</ol>

<h1 id="beyond-llms"><a name="beyond">Beyond LLMs</a></h1>

<ol>
  <li><strong><a href="https://www.nature.com/articles/s41586-024-07487-w">Accurate structure prediction of biomolecular interactions with AlphaFold 3</a></strong> - AlphaFold 3 represents a major breakthrough in the field of structural biology. This latest iteration of DeepMind’s protein folding model can now predict the structure and interactions of a wide range of biomolecules, including proteins, DNA, and RNA, with unprecedented accuracy. The release of the AlphaFold Server, a free tool for researchers, promises to dramatically accelerate research in drug discovery, materials science, and our fundamental understanding of biological processes. <strong>This is a landmark achievement that has the potential to revolutionize many areas of science</strong>.</li>
  <li><strong><a href="https://cacm.acm.org/opinion/new-computer-evaluation-metrics-for-a-changing-world/">New Computer Evaluation Metrics for a Changing World</a></strong> - This paper argues for the need to rethink traditional computer performance metrics in light of the changing landscape of computing. The authors, my colleagues at Google, propose new metrics that take into account factors like energy efficiency, sustainability, and the specific needs of modern workloads, such as AI and machine learning. <strong>This is a timely and important contribution that will be relevant to anyone involved in designing, evaluating, or using computer systems</strong>.</li>
  <li><strong><a href="https://arxiv.org/abs/2411.12090">Hardware Trends Impacting Floating-Point Computations In Scientific Applications</a></strong> - This paper explores the complex interplay between hardware trends and the precision requirements of scientific computing. The authors analyze how different floating-point formats and hardware architectures affect the accuracy and performance of scientific applications. <strong>This research provides valuable insights for developers of scientific software, helping them navigate the trade-offs between precision, performance, and efficiency in the context of evolving hardware</strong>.</li>
</ol>

<p><img src="/blog/images/117-2.jpeg" />
<em>Imagen3. Prompt = “2024, AI research depicted as a challenging mountain range, climbers making progress on each peak, representing breakthroughs and ongoing efforts, detailed, cinematic lighting, concept art”</em></p>]]></content><author><name>Xavier</name></author><category term="ai" /><category term="research" /><summary type="html"><![CDATA[2024 has been an intense year for AI. While some argue that we haven’t made much progress, I beg to differ. It is true that many of the research advances from 2023 have still not made it to mainstream applications. But, that doesn’t mean that research is not making progress all around!]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://amatria.in/blog/blog/images/117-0.jpeg" /><media:content medium="image" url="https://amatria.in/blog/blog/images/117-0.jpeg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">The end of the “Age of Data”? Enter the age of superhuman data and AI</title><link href="https://amatria.in/blog/ageofdata" rel="alternate" type="text/html" title="The end of the “Age of Data”? Enter the age of superhuman data and AI" /><published>2024-12-24T00:00:01+00:00</published><updated>2024-12-24T00:00:01+00:00</updated><id>https://amatria.in/blog/EndOfData</id><content type="html" xml:base="https://amatria.in/blog/ageofdata"><![CDATA[<h1 id="everything-ends-many-things-start-again">Everything ends, many things start again</h1>

<p>In the ever-shifting landscape of Artificial Intelligence, pronouncements of the ‘end of an era’ are surprisingly common. 
The latest such declaration comes from Ilya Sutskever, who recently suggested that the ‘age of data’, a period defined by the 
relentless pursuit of ever-larger datasets, <a href="https://www.youtube.com/watch?v=YD-9NG1Ke5Y">is drawing to a close</a>. “Peak data” is the term used.
But is he right? This post will argue that the age of data is far from over. Instead, it’s transforming into something even more powerful: an age of superhuman data.</p>

<p><img src="/blog/images/116-0.png" /></p>

<p>Now, to be fair, Ilya is known to be a believer in the power of big data. In fact, the 2014 Sequence to Sequence paper that he revisited 
at NeurIPS this year had the following conclusions:</p>

<p><img src="/blog/images/116-1.png" /></p>

<p>The “scaling hypothesis” that Ilya is referring to here was fully developed by several OpenAI researchers (including Dario Amodei, now CEO of Anthropic) 
in their 2020 paper <a href="https://arxiv.org/abs/2001.08361">“Scaling Laws for Neural Language Models”</a>, where they showed how the performance of a large neural network 
can be predicted from the number of parameters, the dataset size, and the compute budget.</p>
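<p>For concreteness, scaling laws of this kind are usually written as a parametric formula relating loss to model and dataset size. The sketch below uses a generic form with placeholder constants; the specific shape and numbers are illustrative (inspired by later follow-up work), not the fitted values from the 2020 paper:</p>

<pre><code class="language-python">def predicted_loss(n_params, n_tokens, E=1.7, A=400.0, B=410.0, alpha=0.34, beta=0.28):
    """Generic parametric scaling-law form: an irreducible loss plus terms that
    shrink as model size (n_params) and dataset size (n_tokens) grow.
    All constants here are placeholders for illustration, not fitted values."""
    return E + A / n_params ** alpha + B / n_tokens ** beta

# Example: a 7e9-parameter model trained on 2e12 tokens (illustrative numbers only).
print(predicted_loss(7e9, 2e12))
</code></pre>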

<p>So, here we are, just 12 years later, being told that the age of data is over. We don’t have any more data to pump into our models. We have run out of the ‘fossil fuel’ of AI. Or have we?</p>

<p><strong>(Historical digression</strong>:
One of the interesting things about having been around for a while, or having studied a bit of history, is that you realize that many scientific 
ideas or trends die and resurrect again and again. This is particularly true in AI, where a couple of AI winters declared the whole field mostly 
dead. As a reminder, the perceptron, which is the basic unit of Artificial Neural Networks, was famously dismissed by Minsky and Papert in their 1969 book 
<a href="https://en.wikipedia.org/wiki/Perceptrons_(book)">Perceptrons</a>.</p>

<p>It is also important to note that a simpler form of the scaling law, which states that all you need to improve model performance is more data 
and more parameters, was discovered years earlier. One of the earliest papers cited in that context is Banko and Brill’s 2001 
<a href="https://aclanthology.org/P01-1005.pdf">“Scaling to Very Very Large Corpora for Natural Language Disambiguation”</a>. Another popular, somewhat more recent, 
reference is <a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35179.pdf">“The Unreasonable Effectiveness of Data”</a> 
by Halevy, Norvig, and Pereira from Google. Both contributed to the beginning of the “Age of Data”, which is probably best illustrated by Chris Anderson’s 
<a href="https://www.wired.com/2008/06/pb-theory/">“The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”</a> in Wired, where he claimed that the world of 
discovery and science was going to be dominated by data, and that models and algorithms would become obsolete. I argued against this point of view in my 2012 post 
<a href="https://amatria.in/blog/more-data-or-better-models/">“More data or better models?”</a>.
<strong>)</strong></p>

<p><img src="/blog/images/116-2.png" /></p>

<h1 id="not-all-data-is-created-equal">Not all data is created equal</h1>

<p>It is important to underscore that, for data, size is not all that matters. You can have a huge dataset with very low informational value, or a small one with 
a high density of knowledge. In the extreme, you could take a single data point and replicate it one trillion times to get a pretty large dataset that contains only one 
piece of information. Yes, that is an extreme example, but most of the data on the Internet is very low value. In fact, some of it has negative value! When I worked on AI 
for healthcare, I used to half-jokingly point out that the internet had very little high-quality medical data and lots of very dubious Reddit threads with questionable medical 
advice.</p>

<p>Not surprisingly, advances in models such as <a href="https://ai.meta.com/research/publications/the-llama-3-herd-of-models/">Llama 3</a> list better data as the number one reason 
behind their improvement. An important detail of their approach, which is common nowadays, is the focus on having specific domains, such as coding and math, explicitly 
represented. They also pay special attention to having the right “data mix” that represents the different kinds of knowledge the model should learn. The final mix has 
“roughly 50% of tokens corresponding to general knowledge, 25% of mathematical and reasoning tokens, 17% code tokens, and 8% multilingual tokens”. Note that the category of 
“general knowledge” could be broken down into whatever domains are important or relevant.</p>
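<p>To make the idea of a “data mix” concrete, here is a minimal sketch of mix-weighted sampling. The source names, weights, and functions are hypothetical illustrations loosely based on the proportions quoted above, not Meta’s actual pipeline:</p>

<pre><code class="language-python">import random

# Hypothetical source names and weights, roughly matching the quoted proportions.
DATA_MIX = {
    "general_knowledge": 0.50,
    "math_reasoning": 0.25,
    "code": 0.17,
    "multilingual": 0.08,
}

def sample_source(mix, rng=random):
    """Pick a data source with probability proportional to its mix weight."""
    sources, weights = zip(*mix.items())
    return rng.choices(sources, weights=weights, k=1)[0]

def build_batch(corpora, mix=DATA_MIX, batch_size=8):
    """Assemble a training batch by drawing each document from a mix-weighted source.
    `corpora` maps a source name to an iterator over documents (strings)."""
    return [next(corpora[sample_source(mix)]) for _ in range(batch_size)]
</code></pre>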

<p>A highlight of this data-driven approach to LLM training is the <a href="https://arxiv.org/abs/2406.17557">“FineWeb Datasets”</a> by HuggingFace. These datasets, and the associated publication, are a clear example of how important curation has become when training LLMs on publicly available data.</p>
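<p>As an illustration of this kind of curation, the sketch below streams a web-scale dataset with the Hugging Face <code>datasets</code> library and applies a toy quality filter. The dataset id and the heuristic are assumptions for illustration; the real FineWeb pipeline is far more sophisticated:</p>

<pre><code class="language-python">from datasets import load_dataset  # pip install datasets

# Assumed dataset id; streaming avoids downloading the full corpus up front.
fineweb = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)

def looks_useful(doc, min_words=50):
    """Toy quality heuristic: enough words and not too many repeated lines."""
    text = doc["text"]
    lines = [line for line in text.splitlines() if line.strip()]
    unique_ratio = len(set(lines)) / max(len(lines), 1)
    return len(text.split()) >= min_words and unique_ratio >= 0.5

curated = (doc for doc in fineweb if looks_useful(doc))
for _, doc in zip(range(3), curated):
    print(doc["text"][:200])
</code></pre>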

<h1 id="beyond-the-web-the-value-of-proprietary-and-labeled-data">Beyond the Web: The Value of Proprietary and Labeled Data</h1>

<p>While the vast expanse of the public internet has been the primary training ground for large language models, it represents only a fraction of the data that exists and, 
more importantly, only a fraction of the data that is valuable for pushing the boundaries of AI. A crucial point that gets overlooked in discussions about the “end of the 
age of data” is the existence of vast troves of proprietary data, often highly specialized and meticulously curated, that are not publicly available on the internet.
Consider the healthcare industry, where patient records, clinical trial results, and medical imaging data hold immense potential for training specialized AI models. Similarly, 
companies like Waymo and Tesla possess massive, proprietary datasets of driving data that are far more detailed and comprehensive than anything available publicly. This kind of 
high-quality, domain-specific data is becoming increasingly important as AI moves beyond general-purpose tasks towards more specialized applications.</p>

<p>The value of this proprietary data is undeniable, even in the age of powerful pre-trained models. While some early proponents of the GPT era argued that pre-trained models 
would negate the need for specialized datasets through zero-shot learning, this has not proven to be entirely true. Instead, proprietary data is being leveraged in new ways, 
such as through techniques like fine-tuning, retrieval-augmented generation (RAG), or by strategically injecting it into the model’s context window. These methods allow models 
to specialize and adapt to specific domains and tasks, demonstrating that the value of data persists even if the mechanisms for extracting that value are evolving.</p>
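<p>To illustrate the “inject it into the context window” pattern, here is a minimal RAG-style sketch. The <code>embed</code> function, the retriever class, and the prompt format are hypothetical placeholders, not any particular vendor’s API:</p>

<pre><code class="language-python">import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class TinyRetriever:
    """In-memory retrieval over proprietary documents for context injection."""

    def __init__(self, embed):
        self.embed = embed            # hypothetical text-to-vector function
        self.docs, self.vectors = [], []

    def add(self, doc):
        self.docs.append(doc)
        self.vectors.append(self.embed(doc))

    def top_k(self, query, k=3):
        q = self.embed(query)
        ranked = sorted(zip(self.docs, self.vectors),
                        key=lambda pair: cosine(q, pair[1]), reverse=True)
        return [doc for doc, _ in ranked[:k]]

def build_prompt(retriever, question):
    """RAG-style prompt: retrieved proprietary snippets plus the user question."""
    context = "\n\n".join(retriever.top_k(question))
    return f"Use the following context to answer.\n\n{context}\n\nQuestion: {question}"
</code></pre>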

<p>Another critical aspect of valuable data is whether it is labeled or not. While LLMs are typically pre-trained on vast quantities of unlabeled data using self-supervision, 
the subsequent steps of fine-tuning and reinforcement learning often rely on high-quality, labeled datasets. As highlighted in <a href="https://amatria.in/blog/postpretraining">“Beyond Token Prediction: the post-Pretraining 
journey of modern LLMs”</a>, much of the recent progress in GenAI has come from advances in these post-training stages, which require 
carefully curated datasets with explicit labels. Creating these labeled datasets, particularly at scale and with high accuracy, remains a significant challenge and a major 
bottleneck in many AI applications. This is precisely why companies specializing in data labeling, like Scale.ai, are seeing continued growth and high valuations.</p>
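<p>To make the distinction concrete, here is a hedged sketch of what labeled post-training records often look like in practice: one supervised fine-tuning pair and one preference pair. The field names follow common conventions rather than any specific vendor’s schema:</p>

<pre><code class="language-python">import json

# A supervised fine-tuning (SFT) record: an explicit prompt/response pair.
sft_record = {
    "prompt": "Summarize the key risk factors in this clinical note: ...",
    "response": "The main risk factors are ...",
}

# A preference record: two candidate responses ranked by a human labeler,
# the kind of signal used to train reward models (RLHF) or for DPO-style tuning.
preference_record = {
    "prompt": "Explain the trade-offs between fine-tuning and RAG.",
    "chosen": "Fine-tuning bakes knowledge into the weights ...",
    "rejected": "RAG is always better than fine-tuning.",
}

# Post-training datasets are often distributed as plain JSONL files of such records.
with open("post_training.jsonl", "w") as f:
    for record in (sft_record, preference_record):
        f.write(json.dumps(record) + "\n")
</code></pre>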

<p>It is important to note that the need for labeled data, while crucial, does not directly contradict the idea that the “age of pretraining data” might be waning, as Ilya 
suggested. Pretraining still relies heavily on the vast, unlabeled data of the public web. However, the increasing importance of post-training techniques and the value of 
proprietary, labeled datasets underscore a shift towards a more nuanced understanding of data’s role in the future of AI. The age of data is not ending, but it is certainly 
evolving, with a growing emphasis on quality, specificity, and the strategic use of data beyond the initial pretraining phase.</p>

<p><img src="/blog/images/116-3.png" /></p>

<h1 id="human-synthetic-and-superhuman-data">Human, Synthetic, and Superhuman Data</h1>

<p>The vast majority of data currently used to train AI models is generated by humans – from the text on Wikipedia, Quora and Reddit to the images, videos, and code found across 
the internet. Humans also play a crucial role in creating the labeled datasets used for post-training. However, the question of whether this reliance on human-generated data 
can or should continue is a subject of intense debate. While it is true that much of the existing web data has been “used up” for pretraining purposes, the kind of data that 
future AI will need is very different.</p>

<p>One promising avenue, also highlighted by Ilya in his talk, is <strong>synthetic data</strong>, artificially generated data that can augment or even replace real-world data. The potential of synthetic data has been met with both excitement and skepticism. Some research, such as the paper 
<a href="https://www.nature.com/articles/s41586-024-07566-y">“AI models collapse when trained on recursively generated data”</a>, suggests that models trained solely on their own outputs 
can degrade. However, I believe these limitations can be overcome by generating synthetic data from diverse and more 
complex models, as argued in <a href="https://arxiv.org/abs/2410.15226">“On the Diversity of Synthetic Data and its Impact on Training Large Language Models”</a>. There is a rich and 
growing body of work on how to best generate and use synthetic data that I won’t review here (see e.g. 
<a href="https://arxiv.org/abs/2302.04062">“Machine Learning for Synthetic Data Generation: A Review”</a> and 
<a href="https://arxiv.org/abs/2403.04190">“Generative AI for Synthetic Data Generation: Methods, Challenges and the Future”</a> ).</p>

<p>I believe we are rapidly approaching a turning point: <strong>the end of the age of data as defined by human limitations</strong>. While human-generated data has been essential in 
bootstrapping AI, it is inherently constrained by our own perceptions, biases, and the inefficiencies of human communication. As <a href="https://www.noemamag.com/ai-and-the-limits-of-language/">Yann LeCun</a> 
has pointed out, human language is an imperfect tool for capturing the full complexity of reality. We are, in a sense, “lossy” encoders of the world around 
us.</p>

<p>The future of data lies in moving beyond these limitations. My hypothesis is that LLMs are not the end but the means to an end: they are good enough to build AI agents that will 
capture much better data. In fact, they will capture superhuman data. We are on the cusp of an era where AI agents, equipped with advanced sensors and sophisticated reasoning 
capabilities, will interact directly with the world, generating data that is richer, more accurate, and less filtered by human interpretation. Multimodal agents, such as Google 
Deepmind’s <a href="https://deepmind.google/technologies/project-astra/">Astra</a>, will be deployed in real environments to create better data and representations of the world, enabling new 
forms of scientific discovery, as previewed in <a href="https://www.nature.com/articles/s41586-023-06221-2">“Scientific discovery in the age of artificial intelligence”</a>. 
Consider autonomous vehicles like Waymo’s. While currently limited in scope, they represent a first step towards this future. These vehicles collect vast amounts of real-time 
data about the world, data that can be used to improve their own performance and, crucially, could be used to create richer datasets beyond the task of driving. In the near 
future, we can envision AI agents designed explicitly for data collection and generation, operating across a wide range of domains, from materials science to medical research. 
Some very recent interesting examples of this new kind of data include <a href="https://github.com/PolymathicAI/the_well">“The Well (15TB of Physics Simulations)”</a>, <a href="https://github.com/MultimodalUniverse/MultimodalUniverse">“Multimodal Universe (100TBs of Astronomical Scientific Data)”</a>, and <a href="https://deepmind.google/discover/blog/genie-2-a-large-scale-foundation-world-model/">“Genie-2”</a>, a large-scale foundation world model capable of generating an endless variety of action-controllable, playable 3D environments for training and evaluating embodied agents. It is important to note that this new kind of data is different not only in scale but also in its fundamental structure and complexity. Consequently, it will demand the development of novel algorithms and model architectures capable of fully harnessing its potential.</p>

<p>This transition to agent-generated data will not be without its challenges. <strong>Importantly, we must acknowledge the continued need for human-generated data in specific areas, 
particularly for aligning AI systems with human values and preferences.</strong> Data reflecting human desires, feedback, and ethical judgments will remain crucial for ensuring that 
AI remains beneficial and aligned with our goals. This includes data used for techniques like Reinforcement Learning from Human Feedback (RLHF) and other alignment methods. 
In addition, human data will continue to be invaluable for personalizing AI experiences, ensuring that systems are responsive to individual needs and preferences. While 
superhuman data can provide a more accurate and comprehensive understanding of the world, it is human data that provides the crucial link to what matters to us as humans. I should also point out that “superhuman” data or AI is not the same as AGI. In fact, I am known not to be a fan of the term AGI itself, and I believe that <strong>specialized, superhuman agents are likely to be both more achievable and more beneficial in addressing specific, complex challenges in alignment with our goals</strong> (see my
<a href="https://amatria.in/blog/multiagents">“Beyond Singular Intelligence: Exploring Multi-Agent Systems and Multi-LoRA in the Quest for AGI”</a>).</p>

<p>It is not an overstatement to say that data is not only not over; it is in fact about to get much bigger and better, thanks to AI.</p>

<h1 id="datas-next-chapter-from-human-to-superhuman">Data’s Next Chapter: From Human to Superhuman</h1>

<p>The era of relying solely on low-quality web data for AI training is coming to a close. The future of AI will be shaped by a new kind of data: superhuman data. This agent-generated data, unconstrained by human limitations and biases, will unlock new levels of AI capability across a wide range of domains. Imagine a world where specialized AI agents, equipped with advanced sensors, explore the depths of the ocean, analyze complex scientific data in real time, or even help us understand the intricacies of the human brain, generating data far beyond our current reach.</p>

<p>This is not to say that human input will become obsolete. In fact, ensuring that these powerful AI systems remain aligned with human values and goals will be more critical than ever. By carefully curating datasets that reflect our ethical principles and desired outcomes, and by developing robust methods for incorporating human feedback into the learning process, we can guide the development of agentic AI towards a beneficial future.</p>

<p>The age of superhuman data is not just a technological shift; it is a paradigm shift that has the potential to revolutionize science, medicine, and countless other fields. Of course, realizing this potential will require not only new data but also continued innovation in algorithms and model design; the two will work hand in hand to drive the next wave of AI breakthroughs. By embracing this future, and by thoughtfully navigating the challenges it presents, we can unlock a new era of discovery and progress, driven by the power of agentic AI.</p>]]></content><author><name>Xavier</name></author><category term="ai" /><category term="data" /><summary type="html"><![CDATA[Everything ends, many things start again]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://amatria.in/blog/blog/images/116-0.png" /><media:content medium="image" url="https://amatria.in/blog/blog/images/116-0.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>