<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://amatria.in/blog/feed.xml" rel="self" type="application/atom+xml" /><link href="https://amatria.in/blog/" rel="alternate" type="text/html" /><updated>2026-04-14T06:10:10+00:00</updated><id>https://amatria.in/blog/feed.xml</id><title type="html">AI, software, tech, and people. Not in that order. By X</title><subtitle>Musings on AI, software, technology, and people. Catalan in The Valley.</subtitle><author><name>Xavi Amatriain</name></author><entry><title type="html">Beyond the Bot: Building a Multi-Agent Recommender for Actionable Intelligence</title><link href="https://amatria.in/blog/agenticrecsys" rel="alternate" type="text/html" title="Beyond the Bot: Building a Multi-Agent Recommender for Actionable Intelligence" /><published>2026-03-15T00:00:01+00:00</published><updated>2026-03-15T00:00:01+00:00</updated><id>https://amatria.in/blog/agenticrecsys</id><content type="html" xml:base="https://amatria.in/blog/agenticrecsys"><![CDATA[<p><em>(This blog post, as with most of my recent ones, is written with AI assistance and augmentation. In this case, “We” in the text refers to myself and my local OpenClaw agent, which has been my primary co-developer throughout this project.)</em></p>

<p>Most AI demos today suffer from a “low-ceiling” problem: they stop at “look, it can answer a question.” I wanted to push toward the actual horizon of this technology—an assistant that doesn’t just predict the next token, but personalizes recommendations, reasons with deep context, and executes real-world tasks.</p>

<p>That vision became Recommend Flow: a multi-agent architecture built on OpenClaw. The project hinged on one key architectural decision: instead of generating recommendations in a vacuum, the orchestrator first consults <a href="https://github.com/xamat/Xavibot">Xavibot v0.1</a> —my original assistant—as a “preference proxy.” Note that Xavibot is implemented using a completely different technology stack, leveraging Google’s Gemini models and a custom Retrieval-Augmented Generation (RAG) pipeline to ensure its “intuition” is grounded in my actual history.</p>

<h1 id="the-we-and-the-machine">The “We” and the Machine</h1>

<p>A quick note on the terminology: when I say “we,” I am being quite literal. This system was deployed and refined in collaboration with a local OpenClaw agent running on a dedicated Linux machine in my home office. I communicate with the system primarily through WhatsApp, often using voice messages for convenience. The local agent handles the transcription, intent extraction, and orchestration.</p>

<p>One of the most pragmatic features of this setup is how it handles the “last mile” of execution. By utilizing browser navigation on my local machine, the bot doesn’t need to know my passwords or handle sensitive credentials. It simply leverages the active logins already present in my local browser sessions. While this requires the machine to be secure, it keeps the identity risk isolated and avoids the headache of managing a separate “vault” of API keys for every third-party service.</p>

<p>Furthermore, we’ve leaned into the philosophy of “Memory as Documentation.” All agent memories and complex workflows are stored locally as simple .md files. This approach offers several advantages:</p>

<ul>
  <li><strong>Maintenance &amp; Transparency</strong>: I can literally cat a memory file to see exactly what the agent “knows” or “remembers” about my tastes. If it learns a wrong preference, I fix the text file.</li>
  <li><strong>Portability</strong>: The entire intelligence of the system—the workflows, the personality priors, and the task histories—lives in a folder that can be moved across machines or version-controlled via Git.</li>
  <li><strong>Security Guardrails</strong>: Storing workflows in plain text allows for human-readable audit trails. I can verify the steps the agent intends to take before it ever touches a browser.</li>
</ul>
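
<p>To make the pattern concrete, here is a minimal Python sketch of what “memory as plain text” can look like. The folder layout and helper names are hypothetical illustrations, not the actual OpenClaw internals:</p>

<pre><code class="language-python">from datetime import date
from pathlib import Path

MEMORY_DIR = Path.home() / "agent" / "memories"   # hypothetical location for the .md files

def read_memory(topic: str) -> str:
    """Return the plain-text memory for a topic, e.g. 'dining-preferences'."""
    path = MEMORY_DIR / f"{topic}.md"
    return path.read_text() if path.exists() else ""

def append_memory(topic: str, note: str) -> None:
    """Append a dated, human-readable note; auditing or correcting it later is just editing text."""
    MEMORY_DIR.mkdir(parents=True, exist_ok=True)
    path = MEMORY_DIR / f"{topic}.md"
    with path.open("a") as f:
        f.write(f"- {date.today().isoformat()}: {note}\n")
</code></pre>

<p>Because every “memory” is an ordinary file, the same folder can be inspected with <code>cat</code>, versioned with Git, or copied to a new machine.</p>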

<p><img src="/images/125-0.png" /></p>

<h1 id="the-big-idea-separate-roles-compose-capabilities">The Big Idea: Separate Roles, Compose Capabilities</h1>

<p>In a <a href="https://amatria.in/blog/postpretraining">previous post</a>, I argued that modern LLMs are evolving into “reasoning engines.” In Recommend Flow, we intentionally split those reasoning responsibilities:</p>

<ul>
  <li><strong>The Orchestrator (OpenClaw session)</strong>: This agent serves as the central brain and primary interface. While it often communicates via WhatsApp, it also supports a local browser-based UX for more interactive sessions. Crucially, the orchestrator manages the Recommend Flow—the set of hard rules and constraints that define the recommendation process. However, it isn’t a rigid state machine; it has the autonomy to improvise and handle edge cases outside of the predefined flow when necessary.</li>
  <li><strong>The Preference Proxy (Xavibot v0.1)</strong>: This is not just another LLM endpoint. It runs RAG (Retrieval-Augmented Generation) over a broad corpus of my own material—blog posts, personal guides, and private documents. It acts as the “taste model,” carrying my content memory and style priors.</li>
  <li><strong>The Execution Layer (Browser Automation)</strong>: Completes the workflow (e.g., booking a table or adding an item to a cart) using the local browser.</li>
</ul>

<p>This separation matters. A single model trying to handle high-level preference reasoning and low-level DOM manipulation is often brittle. A role-specialized setup is easier to debug, tune, and—most importantly—trust.</p>
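
<p>One way to picture the separation is as three narrow interfaces, each of which can be tested and swapped independently. This is only an illustrative sketch; the names and signatures are mine, not OpenClaw’s or Xavibot’s:</p>

<pre><code class="language-python">from typing import Protocol

class PreferenceProxy(Protocol):
    """Taste model: RAG over my own writing (the role Xavibot v0.1 plays)."""
    def query(self, question: str) -> str: ...

class ExecutionLayer(Protocol):
    """Last mile: drives the local browser using existing logged-in sessions."""
    def run(self, task: str, details: dict) -> bool: ...

class Orchestrator(Protocol):
    """Central brain: owns the Recommend Flow rules and talks to the user."""
    def handle(self, message: str) -> str: ...
</code></pre>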

<h1 id="how-recommend-flow-works-end-to-end">How Recommend Flow Works (End-to-End)</h1>

<p>The interaction follows a disciplined Recommend → Decide → Do loop:</p>

<ol>
  <li>Voice/Text Intent: I send a WhatsApp message: “Hey, find me a place for dinner tonight that I haven’t been to but fits my usual vibe.”</li>
  <li>Preference Interrogation: The Orchestrator acknowledges the request and immediately queries Xavibot v0.1: “Based on Xavier’s past writing on food and his local guide, what are his core dining preferences?”</li>
  <li>Constraint Refinement: The system brings those signals back and asks for missing details (e.g., location or specific timing) one at a time.</li>
  <li>The Decision Set: It returns a compact “Top 1 + Backup 1” recommendation with explicit tradeoffs based on my retrieved preferences.</li>
  <li>Action Execution: Once I give the “Go”, the local Linux agent wakes up the browser, navigates to the reservation site, and executes the task using my existing login.</li>
</ol>
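
<p>Sketched as heavily simplified Python, the loop looks roughly like this. The helpers (<code>query_preference_proxy</code>, <code>ask_user</code>, <code>rank_candidates</code>, <code>run_browser_task</code>) are hypothetical stand-ins for the Xavibot RAG endpoint, the WhatsApp channel, the orchestrator’s ranking step, and the browser automation:</p>

<pre><code class="language-python">def recommend_flow(user_request: str) -> None:
    # 1-2. Intent arrives; the orchestrator asks the preference proxy for taste priors.
    preferences = query_preference_proxy(
        "Based on Xavier's past writing on food, what are his core dining preferences?"
    )

    # 3. Refine missing constraints one question at a time (location, timing, party size).
    constraints = {}
    for slot in ("location", "time", "party_size"):
        constraints[slot] = ask_user(f"What {slot} should I assume?")

    # 4. Compact decision set: one primary pick plus one backup, with explicit tradeoffs.
    top_pick, backup = rank_candidates(user_request, preferences, constraints)[:2]

    # 5. Act only after an explicit "Go" from the user.
    if ask_user(f"Book {top_pick}? (Backup: {backup})").strip().lower() == "go":
        run_browser_task(task="reserve_table", details={"venue": top_pick, **constraints})
</code></pre>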

<p>Watch the Recommend Flow restaurant reservation demo below, or open it directly <a href="/blog/images/RestaurantReservationDemo.webm">here</a>.</p>

<div style="margin-bottom: 20px;">
  <video controls="" preload="metadata" style="width: 100%; max-width: 100%; height: auto;">
    <source src="/blog/images/RestaurantReservationDemo.webm" type="video/webm" />
    Your browser does not support the video tag. You can open the demo directly <a href="/blog/images/RestaurantReservationDemo.webm">here</a>.
  </video>
</div>

<p><em>In this demo I chat with my OpenClaw instance (called Xavibot) using a local browser for easier recording (as mentioned, I usually interface through WhatsApp). Note that the browsing in the Chrome browser on the left is completely autonomous. In fact, at some point OpenClaw decided to Google for “best restaurants in Palo Alto on OpenTable”, which was a surprise.</em></p>

<h1 id="why-xavibot-v01-is-the-perfect-backbone">Why Xavibot v0.1 is the Perfect Backbone</h1>

<p>As I’ve explored in my <a href="https://amatria.in/blog/datagutdecisions">Data-Informed Gut Decision-Making framework</a>, good decisions require a mix of data and intuition. Xavibot v0.1 brings two properties that are hard to fake with prompting alone:</p>

<ul>
  <li>Grounded Memory: RAG over a vast collection of my writings, personal guides, and miscellaneous documents gives persistent, high-signal context. It has even surprised me by surfacing details I’d forgotten I documented—like my preference for avoiding spicy foods, an observation it pulled from deep within my records.</li>
  <li>Taste Continuity: It reflects my historical writing and constraints, acting as a “digital twin” of my preferences.</li>
</ul>

<h1 id="what-we-learned-and-what-broke">What We Learned (and What Broke)</h1>

<ol>
  <li>Multi-agent beats monolith: Role separation reduced prompt complexity and made behavior consistent.</li>
  <li>Personalization is a loop, not a profile: Static “user profile” fields are useful, but conversational updates (context, mood, live constraints) matter just as much.</li>
  <li>Action design needs strict guardrails: People forgive imperfect suggestions; they don’t forgive wrong actions. We learned to make recommendations “cheap” to generate, but transactions require explicit, high-confidence confirmation.</li>
  <li>Isolation is a security feature: Running the browser automation locally on my own machine, rather than in a cloud-hosted container, provided a natural security boundary that felt much safer for a personal project.</li>
</ol>

<h1 id="conclusion-the-future-of-agentic-execution">Conclusion: The Future of Agentic Execution</h1>

<p>The most interesting part of this project isn’t that an AI can recommend a restaurant; it’s the <strong>architectural pattern</strong>. This combination of preference-grounded reasoning, specialized agents, and tool-based execution is a blueprint that generalizes to travel planning, shopping, gifting, or even professional hiring workflows.</p>

<p>If the previous wave of AI was about assistants that could answer, this is the wave of assistants that can decide with you and then execute for you.</p>]]></content><author><name>Xavier</name></author><category term="Artificial Intelligence" /><category term="Agents" /><category term="Product Development" /><summary type="html"><![CDATA[(This blog post, as with most of my recent ones, is written with AI assistance and augmentation. In this case, “We” in the text refers to myself and my local OpenClaw agent, which has been my primary co-developer throughout this project.)]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://amatria.in/blog/images/125-0.png" /><media:content medium="image" url="https://amatria.in/blog/images/125-0.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Why I am not AGI-pilled (and you probably shouldn’t be either)</title><link href="https://amatria.in/blog/agi" rel="alternate" type="text/html" title="Why I am not AGI-pilled (and you probably shouldn’t be either)" /><published>2026-02-22T00:00:01+00:00</published><updated>2026-02-22T00:00:01+00:00</updated><id>https://amatria.in/blog/agi</id><content type="html" xml:base="https://amatria.in/blog/agi"><![CDATA[<p>If you have been following my journey for a while, you’re probably aware of my pragmatic approach to AI capabilities and my skepticism towards the surrounding hype. Not too long ago, during my time at Google, I found myself sitting next to someone at an event, and the conversation inevitably turned to AI. I tend to be pretty candid about my skepticism regarding Artificial General Intelligence (AGI), so I launched right into it. I laid out my entire thesis: why the term is a misnomer, why benchmarking against human cognition is a fallacy, and why the pursuit of a monolithic “God model” is bad engineering.</p>

<p>He listened thoughtfully, nodding along to my points. Eventually, I paused and asked, “By the way, what do you do at Google?”</p>

<p>He smiled politely and introduced himself. It was <a href="https://en.wikipedia.org/wiki/Shane_Legg">Shane Legg</a>.</p>

<p>For those who might not know, Shane is the Chief AGI Scientist at Google DeepMind. He is also the person who literally coined the term “AGI” nearly two decades ago. Pitching the case against AGI to the man whose life’s work is dedicated to building it is certainly one way to break the ice, and Shane did take it in good humor. But despite the irony of the moment, I stand firmly by the arguments I made that day. I am not AGI-pilled.</p>

<p>Before I dive into the technical details, let me clarify one crucial distinction: rejecting AGI does not mean rejecting AI. I am incredibly optimistic about the future of Artificial Intelligence and its potential to fundamentally transform software, science, and society. My skepticism is directed solely at the pursuit of AGI—the obsession with building a single, monolithic, human-like “God model.” Being anti-AGI makes me a pragmatist, not a pessimist. In fact, I believe letting go of the AGI myth is the very key to building better, more capable AI systems. Here is why.</p>

<h1 id="the-myth-of-the-g">The Myth of the “G”</h1>

<p>The foundational flaw in AGI is the “G”: General. The concept assumes that human intelligence is a universal baseline against which all synthetic intelligence should be measured. But human intelligence is not general at all; it is highly specialized.</p>

<p>I am not alone in this view. Yann LeCun, Meta’s Chief AI Scientist, has publicly called the concept of artificial general intelligence <a href="https://the-decoder.com/yann-lecun-calls-general-intelligence-complete-bs-and-deepmind-ceo-hassabis-fires-back-publicly/">“complete BS,”</a> precisely because human intelligence is inherently specialized. We are optimized for a very specific evolutionary niche. If you look at raw numerical computation, a pocket calculator from the 1980s is vastly superior to the human brain.</p>

<p>If you look at nature, the idea of human intellectual supremacy gets even blurrier. While our particular brand of intelligence often differentiates us from animals, it doesn’t always make us objectively more “intelligent” in every context. A bird’s navigational intelligence, for instance, far surpasses that of most humans. (Read <a href="https://inquisitivebiologist.com/2023/06/13/book-review-if-nietzsche-were-a-narwhal-what-animal-intelligence-reveals-about-human-stupidity/">“If Nietzsche Were a Narwhal: What Animal Intelligence Reveals About Human Stupidity”</a> for more examples and details).</p>

<p>Benchmarking an AI against human capability doesn’t make it “general.” It simply makes it an artificial mimic of our specific, localized evolutionary adaptations.</p>

<h1 id="striking-the-magic-from-intelligence">Striking the Magic from Intelligence</h1>

<p>If human cognition isn’t the gold standard, what actually is intelligence?</p>

<p>My former colleague Blaise Agüera y Arcas tackles this beautifully in his great book <a href="https://whatisintelligence.antikythera.org/">What Is Intelligence?</a>. He strips away the mystical, anthropocentric aura we tend to wrap around the mind. Agüera y Arcas frames biological organisms essentially as compositions of functions, reducing intelligence to the elegant mechanics of prediction and computation. In the book, he lists five properties of intelligence, all of which point to a compound rather than monolithic nature. Intelligence is (1) predictive, (2) social, (3) multifractal, (4) diverse, and (5) symbiotic. This leads to a very “non-AGI” definition of intelligence as <em>“the ability to model, predict, and influence one’s future; it can evolve in relation to other intelligences to create a larger symbiotic intelligence.”</em> (Also kudos for that Before Sunrise reference!).</p>

<p>When you define intelligence by its functional reality (the ability to model the world and predict outcomes) the AGI illusion starts to fade. Building complex systems that master prediction (like predicting the next token in a sequence) doesn’t magically summon a human-like mind into the machine.</p>

<h1 id="the-monolith-fallacy-and-the-intelligence-gap">The Monolith Fallacy and the “Intelligence Gap”</h1>

<p>This brings us to the most frustrating contradiction in modern AI research: the cognitive dissonance of the top labs. Even luminaries who recognize the flaws in human “generality” (like LeCun, or DeepMind CEO Demis Hassabis) still insist on cramming all capabilities into massive, monolithic foundation models.</p>

<p><a href="https://www.businessinsider.com/deepmind-ceo-demis-hassabis-agi-real-intelligence-gap-2026-2">Recently</a>, Hassabis pointed to Large Language Models making mistakes in basic math as evidence of a “real intelligence gap” on the road to AGI. But this completely misses the point. LLMs are probabilistic predictors. Expecting billions of frozen neural weights to perform flawless, symbolic arithmetic is an architectural mismatch; it is like using a hammer to turn a screw. The solution to an LLM failing at math isn’t to train a larger, more expensive monolith or to expect another breakthrough. The solution is simply to give the model access to that 1980 calculator.</p>

<p>Similarly, many experts frequently point to “continuous learning” as a major hurdle separating current AI from true AGI. But again, this is a limitation of a frozen, monolithic neural network, not a limitation of AI as a system. Agentic AI solves this elegantly. Agents like OpenClaw are already demonstrating continuous learning by actively managing persistent memory and reading/writing to files. In fact, if we look outside the LLM bubble, simpler specialized AI—like massive-scale recommender systems—have been successfully using online learning approaches to continuously update and learn from user interactions for decades. We don’t need a mystical AGI breakthrough for continuous learning; we just need better system design.</p>

<h1 id="the-power-of-composition-enter-compound-ai-systems">The Power of Composition: Enter Compound AI Systems</h1>

<p>This is why the current pursuit of AGI represents terrible engineering. The most robust, scalable, and safe AI architectures in production today do not rely on a single, omniscient model. They achieve broad capability through composition—what we now call <a href="https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/">Compound AI Systems</a>.</p>

<p>Instead of forcing a single neural network to do everything, you orchestrate specialized agents. You use an LLM not as a universal database, but as a semantic router and reasoning engine. If the system needs factual grounding, it queries a vector database (RAG). If it needs to execute logic, it writes code and hands it to a Python interpreter. If it needs to do exact arithmetic, it executes an API call to a deterministic calculator tool.</p>
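
<p>A toy version of that routing logic is easy to sketch. Everything below is illustrative: <code>route_with_llm</code>, <code>search_vector_db</code>, <code>run_in_sandbox</code>, and <code>compose_answer_with_llm</code> are placeholder names for components you would supply, not a real library:</p>

<pre><code class="language-python">TOOLS = {
    # Exact arithmetic goes to a deterministic evaluator, not to the LLM's weights.
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
    # Factual grounding goes to retrieval (RAG) over a vector database.
    "retriever": lambda query: search_vector_db(query),
    # Executable logic goes to a sandboxed interpreter.
    "python": lambda code: run_in_sandbox(code),
}

def answer(question: str) -> str:
    # The LLM acts as a semantic router: pick a tool and format its input.
    tool_name, tool_input = route_with_llm(question, options=list(TOOLS))
    evidence = TOOLS[tool_name](tool_input) if tool_name in TOOLS else None
    # A final LLM pass composes the answer, grounded in the tool's output.
    return compose_answer_with_llm(question, evidence)
</code></pre>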

<p>This multi-agent paradigm is not just theoretical; it is happening right now. Recent examples like <a href="https://www.moltbook.com/">Moltbook</a> and <a href="https://www.infoq.com/news/2026/02/kimi-k25-swarm/">Kimi agent swarm</a> have gathered far more attention, excitement, and practical traction than any of the recent monolithic model launches. When Anthropic’s CEO Dario Amodei talks about the future being a <a href="https://www.darioamodei.com/essay/the-adolescence-of-technology">“Country of Geniuses in a Datacenter,”</a> he is effectively describing a multi-agent reality. You do not need a central, monolithic AGI to make that possible. You simply need swarms of highly specialized agents collaborating at scale.</p>

<p>As I have noted before, specialized, superhuman agents are likely to be both more achievable and more beneficial in addressing specific, complex challenges in alignment with our goals (see my <a href="https://amatria.in/blog/multiagents">“Beyond Singular Intelligence: Exploring Multi-Agent Systems and Multi-LoRA in the Quest for AGI”</a>). Much like modern, massive-scale recommender systems, intelligence at scale is a pipeline, not a monolith.</p>

<p>This compound approach is inherently safer. When you rely on specialized tools, you maintain control. You can monitor the API calls, isolate hallucinations, and bake strict, programmatic guardrails into the boundaries between agents. A monolithic model, by contrast, is an opaque black box where capabilities and failure modes are dangerously entangled.</p>

<h1 id="the-winner-takes-all-hard-takeoff-scenario">The winner-takes-all hard takeoff scenario</h1>

<p>So why are AI labs so obsessed with pursuing a single, monolithic AGI instead of embracing specialized composition? It is largely driven by the science-fiction fantasy of the <a href="https://www.lesswrong.com/posts/tjH8XPxAnr6JRbh7k/hard-takeoff">“hard takeoff.”</a> This is the belief that once a single model crosses a certain intelligence threshold, it will recursively self-improve at an explosive rate. It’s an arms race fueled by the fear that the first company to build this “God model” takes the entire global economy.</p>

<p>Fueling an arms race to build a single, opaque, uncontrollable system just to win a hypothetical “winner-takes-all” scenario is not a sound technological strategy. It is reckless.</p>

<h1 id="conclusion">Conclusion</h1>

<p>Generality is not a magical spark waiting to be ignited inside a massive GPU cluster. Broad, robust capability is a system-level property, achieved through the careful, safe composition of specialized tools. We don’t need AGI to build highly capable systems, and we certainly shouldn’t be gambling our future on the hope that the first ones to reach the hard takeoff threshold will be whoever we consider to be “the good ones”.</p>]]></content><author><name>Xavier</name></author><category term="Artificial Intelligence" /><category term="AGI" /><category term="Machine Learning" /><category term="Philosophy" /><category term="LLMs" /><category term="Agents" /><summary type="html"><![CDATA[If you have been following my journey for a while, you’re probably aware of my pragmatic approach to AI capabilities and my skepticism towards the surrounding hype. Not too long ago, during my time at Google, I found myself sitting next to someone at an event, and the conversation inevitably turned to AI. I tend to be pretty candid about my skepticism regarding Artificial General Intelligence (AGI), so I launched right into it. I laid out my entire thesis: why the term is a misnomer, why benchmarking against human cognition is a fallacy, and why the pursuit of a monolithic “God model” is bad engineering.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://amatria.in/blog/blog/images/124-0.png" /><media:content medium="image" url="https://amatria.in/blog/blog/images/124-0.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Recommending in the Age of AI: How we got here and what comes next - My Recsys 2025 keynote</title><link href="https://amatria.in/blog/recsyskeynote" rel="alternate" type="text/html" title="Recommending in the Age of AI: How we got here and what comes next - My Recsys 2025 keynote" /><published>2026-01-24T00:00:01+00:00</published><updated>2026-01-24T00:00:01+00:00</updated><id>https://amatria.in/blog/recsyskeynote</id><content type="html" xml:base="https://amatria.in/blog/recsyskeynote"><![CDATA[<p>This blog post is a detailed summary of my recent keynote at ACM RecSys 2025 in Prague. You can watch the full video <a href="https://www.youtube.com/watch?v=TlR7douxQRM">here</a>.</p>

<p><img src="/blog/images/123-0.png" /></p>

<p>I’ve been involved with RecSys for a long time. This keynote was my 11th. I attended the first six, so I was there in the early days—and I’ve watched the field repeatedly reinvent itself.</p>

<p>One of my favorite personal “RecSys origin stories” is that when I transitioned from academia to industry, I found my job through this community. In Barcelona 2010, I started conversations that led me to Netflix, and eventually to many other things. I look around today and see people who have been interns with me and then followed similar paths. That’s part of what makes this community special: it’s a rare intersection of industry + academia + practitioners, with a shared obsession not only for algorithms, but for product, users, and psychology.</p>

<p>This talk had three parts:</p>
<ul>
  <li><strong>How we got here</strong> (history, with some personal bias)</li>
  <li><strong>Recommending in the age of GenAI</strong> (the present)</li>
  <li><strong>What’s coming next</strong> (where I think we’re heading)</li>
</ul>

<h1 id="part-i--how-we-got-here">Part I — How We Got Here</h1>

<h2 id="movielens-v0-1997-and-the-field-becomes-a-field">MovieLens v0, 1997, and the “field becomes a field”</h2>

<p>Early in the talk I put up a screenshot that not everyone recognized: MovieLens v0 (published around 1997). For me, that interface is more than nostalgia. It’s a marker that a set of ideas turned into a recognizable field—built by Joe Konstan, the late John Riedl, and the rest of the University of Minnesota team.</p>

<p>It’s also why the first RecSys conference was held in Minneapolis—and why going back there feels like a loop closing and reopening.</p>

<h3 id="ai-history-intertwined-with-recsys-history">AI history, intertwined with RecSys history</h3>

<p>I deliberately intertwined recommender systems history with the history of AI, because the two have been co-evolving for decades:</p>

<p><img src="/blog/images/123-1a.png" /></p>

<ul>
  <li><strong>1950s</strong>: “Artificial Intelligence” is coined at Dartmouth; Rosenblatt publishes the perceptron paper.</li>
  <li><strong>1969 → 1970s</strong>: Minsky’s critique leads to the first AI winter.</li>
  <li><strong>1980s</strong>: expert systems become fashionable again; then people rediscover their brittleness and scaling limits.</li>
  <li><strong>1987–1993</strong>: another AI winter.</li>
  <li><strong>1997</strong>: MovieLens, early RS papers.</li>
  <li><strong>2006–2009</strong>: Netflix Prize (we’ll spend time here).</li>
  <li><strong>2007</strong>: RecSys conference starts (on the heels of Netflix Prize energy).</li>
  <li><strong>2011–2016</strong>: deep learning momentum hits recommender systems (YouTube DL recommender paper is a major moment).</li>
  <li><strong>2017</strong>: Transformers (“Attention is All You Need”).</li>
</ul>

<p>This timeline matters because it shows a pattern: RS progresses when model capability, data availability, and product surfaces line up—and stalls (or misleads us) when we optimize the wrong abstractions.</p>

<h3 id="netflix-prize-a-turning-point-and-a-lesson-about-proxies">Netflix Prize: a turning point, and a lesson about proxies</h3>

<p>The Netflix Prize (2006–2009) was pre-Kaggle, pre-everything we now take for granted. It was a massive public experiment. The goal was framed as “better recommendations,” but the proxy objective was explicit: improve RMSE on rating prediction by 10%, win $1M.</p>

<p>The winning solution was instructive:</p>
<ul>
  <li>It was an ensemble (as usual).</li>
  <li>It combined 104 models using a neural network.</li>
  <li>The “main” approaches were a matrix factorization / SVD variant and restricted Boltzmann machines (a neural net).</li>
</ul>

<p>Then came the part I think many people remember less clearly: we took the work back to Netflix and asked, “Can we productionize it?” The answer was: not as-is.</p>
<ul>
  <li>104 models didn’t scale well: the ensemble was too slow and too complicated to productionize.</li>
  <li>More importantly: while we were doing that translation work, we realized something deeper: the objective itself (RMSE on ratings) was not the right question.</li>
</ul>

<p>We did productionize SVD and RBMs—they were the first ML algorithms that went into Netflix’s product. But the Netflix Prize still taught a durable lesson: You can win the benchmark and still lose the product. Or, more precisely: your offline proxy can be “correct” and still be wrong.</p>

<h3 id="from-algorithms-to-machine-learning-to-ai">From “algorithms” to “machine learning” to “AI”</h3>

<p>Back then, we didn’t even say “machine learning.” We said algorithms. My team at Netflix was literally called Algorithms Engineering. Over time, the naming shifted: algorithms → ML → AI. That branding shift wasn’t just marketing; it reflected real changes in how systems were built and what people expected of them.</p>

<p>I used a simple example to make this concrete: the most basic personalized recommender you can build—almost comically basic by today’s standards.</p>
<ul>
  <li>Two features: Popularity, Predicted rating</li>
  <li>Two parameters: w1 and w2</li>
  <li>A linear model</li>
</ul>

<p>The task: learn w1 and w2 from user behavior data. It’s a useful toy model because it captures a core truth: the system is mostly the same loop, regardless of complexity: choose features, choose model family, estimate parameters from data, measure whether it helped users.</p>
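
<p>In Python, that toy recommender fits in a few lines. The data here is hypothetical: assume a log of (popularity, predicted_rating, clicked) triples per impression:</p>

<pre><code class="language-python">import math

def train(examples, lr=0.1, epochs=200):
    """Learn w1 and w2 from behavior data; examples are (popularity, predicted_rating, clicked)."""
    w1, w2 = 0.0, 0.0
    for _ in range(epochs):
        for pop, pred, clicked in examples:
            p = 1.0 / (1.0 + math.exp(-(w1 * pop + w2 * pred)))  # predicted engagement probability
            err = p - clicked                                     # logistic-loss gradient
            w1 -= lr * err * pop
            w2 -= lr * err * pred
    return w1, w2

def rank(items, w1, w2):
    """Score items with the learned linear model: w1*popularity + w2*predicted_rating."""
    return sorted(items, key=lambda i: w1 * i["popularity"] + w2 * i["predicted_rating"], reverse=True)
</code></pre>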

<p><img src="/blog/images/123-1.png" /></p>

<p>We used to “advance ML” by adding more features and making models more complex. Feature engineering mattered and it required domain knowledge.</p>

<p>I gave a Quora example that still resonates with me: ranking answers for a question. It sounds obvious until you try to formalize it. We had to talk to editors and journalists about what “good” meant. They said: truthful, reusable, well-formatted, not too long, the right length. That became features. And then those features got learned.</p>

<p>That was “old-school” ML—though, honestly, we still do versions of it today.</p>

<h3 id="the-recommender-problem-evolved-rating--ranking--page--context">The recommender problem evolved: rating → ranking → page → context</h3>

<p>Another key arc: the problem definition evolved.</p>
<ul>
  <li><strong>Point-wise prediction</strong>: predict a rating (Netflix Prize era)</li>
  <li><strong>Ranking</strong>: learn to order items</li>
  <li><strong>Page optimization</strong>: optimize a full surface (rows, shelves, competing modules)</li>
  <li><strong>Context-aware</strong>: device, time of day, location, intent—more dimensions</li>
</ul>

<p><img src="/blog/images/123-2.png" /></p>

<p>This wasn’t an academic shift. It was driven by the reality that a product isn’t a “list.” It’s an environment. At this point in the talk I referenced two of my own prior contributions:</p>
<ul>
  <li>The Netflix work on “Beyond the five stars”, emphasizing why implicit feedback often beats explicit ratings for real-world optimization.</li>
  <li>The “Multiverse recommendation” work (published at RecSys in Barcelona 2010), which became my most cited RecSys paper—explicitly leaning into context-aware recommendation.</li>
</ul>

<h3 id="deep-learning-in-recommender-systems-two-tower-and-the-promise-of-representation-learning">Deep learning in recommender systems: two-tower (and the promise of representation learning)</h3>

<p>Then came deep learning’s major wave in RecSys—roughly 2011 onward—culminating in the “deep learning for YouTube recommendations” moment that hit this community hard.</p>

<p>To ground it, I showed the classic two-tower model:</p>
<ul>
  <li>user embedding tower</li>
  <li>item embedding tower</li>
  <li>dot product to score similarity / relevance</li>
</ul>

<p><img src="/blog/images/123-3.png" /></p>

<p>It’s not the best model, but it’s the right mental starting point. Even in 2014 we were already experimenting with distributed neural nets in production contexts. And by 2016–2017, I was explicitly framing this as “the recommender problem revisited,” because deep learning forced us to revisit assumptions about features, modeling capacity, and system architecture.</p>
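
<p>For readers who have never implemented one: a bare-bones two-tower scorer, stripped of everything that matters in production (training loop, feature pipelines, negative sampling), can be sketched in a few lines of NumPy. Layer sizes and feature vectors here are placeholders:</p>

<pre><code class="language-python">import numpy as np

def tower(features: np.ndarray, weights: list) -> np.ndarray:
    """A tiny MLP tower: stacked ReLU layers that map raw features to an embedding."""
    h = features
    for w in weights:
        h = np.maximum(h @ w, 0.0)
    return h / (np.linalg.norm(h) + 1e-8)  # normalize so the dot product behaves like cosine

def score(user_feats, item_feats, user_weights, item_weights) -> float:
    u = tower(user_feats, user_weights)    # user embedding tower
    v = tower(item_feats, item_weights)    # item embedding tower
    return float(u @ v)                    # dot product = relevance score
</code></pre>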

<p><strong>(Geeky rabbit hole — Deep learning “replaced feature engineering”… but RecSys had already been blending paradigms)</strong></p>

<p>Deep learning’s promise was: “stop hand-crafting features; the model learns representations.” But there’s a subtle connection to how recommenders already worked. Even matrix factorization is, in a sense, a hybrid of:</p>
<ul>
  <li>unsupervised structure learning (dimensionality reduction, latent factors)</li>
  <li>supervised signal (ratings, implicit feedback)</li>
</ul>

<p>We were already combining unsupervised and supervised approaches in clustering and latent-factor methods. Deep learning didn’t invent the idea; it industrialized it and scaled it—and then moved us more explicitly into self-supervision.</p>

<p>I tied this to the “multi-layer cake” framing of modern ML:</p>
<ul>
  <li>self-supervised pretraining</li>
  <li>supervised fine-tuning</li>
  <li>reinforcement learning / alignment as the “icing”</li>
</ul>

<p>This “layered training” view is something I’ve written about in the context of modern LLMs—especially the idea that “token prediction” alone undersells what post-pretraining adds.</p>

<h2 id="its-not-only-about-algorithms">It’s not only about algorithms</h2>

<p>At this point I paused and emphasized a point that’s easy to say and hard to operationalize: In recommender systems, the algorithm is rarely the whole system.</p>

<p>I summarized the “non-algorithm” pillars as:</p>
<ul>
  <li>UX / design</li>
  <li>Domain knowledge</li>
  <li>Evaluation metrics</li>
</ul>

<p><img src="/blog/images/123-4.png" /></p>

<p><strong>UX / design</strong>: I showed an early Netflix interface where the page was packed with explanations: predicted rating, “because you watched X”, actors, director, etc. Those explanations—and the way we presented choices—often mattered as much as the model. This also connects to why Netflix ultimately moved from stars to thumbs: the UX and the feedback mechanism are part of the learning loop.</p>

<p><strong>Domain knowledge</strong>: Even with deep learning, you still need domain knowledge—especially in constrained domains like healthcare. Constraints aren’t optional; they’re foundational.</p>

<p><strong>Evaluation metrics</strong>: You need offline and online evaluation. You must iterate fast with offline proxies, validate with online experiments, and connect short-term metrics to long-term satisfaction/retention. I cited a memorable result from a YouTube team study: long-term satisfaction was causally linked not merely to “more consumption,” but to diversity of content consumed. If you get people to consume a more diverse set of content, they tend to be more satisfied in the long run. That finding matters because it’s a reminder that “maximize clicks” is not the same thing as “maximize sustained satisfaction.”</p>

<h1 id="part-ii--recommending-in-the-age-of-genai">Part II — Recommending in the age of GenAI</h1>

<p>Bill Gates wrote, “the age of AI has begun.” I used that line to mark the present moment—because Transformers, LLMs, and GenAI changed both the research conversation and the product conversation.</p>

<h2 id="two-parameters--trillions-of-parameters">Two parameters → trillions of parameters</h2>

<p>I showed a plot: transformer research families over time, parameter counts rising from ~100M to beyond a trillion. It’s worth repeating the contrast because it captures the discontinuity: earlier I showed a recommender with two parameters (popularity weight and predicted-rating weight). Now we’re in a world where models have trillions of parameters. All of those parameters still get learned from data—just through a very different pipeline.</p>

<p><img src="/blog/images/123-4b.png" /></p>

<h2 id="even-research-impact-got-weird">Even “research impact” got weird</h2>

<p>I mentioned how my citations changed dramatically in 2024–2025 because I posted three arXiv works: an LLM survey, a prompt design/engineering publication, and my transformer catalog. They weren’t even peer-reviewed in the traditional sense—yet the field’s attention was so concentrated that the impact was immediate. That’s not a moral argument; it’s an observation about attention allocation in the current research ecosystem.</p>

<h2 id="how-genai-is-already-changing-recommender-systems">How GenAI is already changing recommender systems</h2>

<p>I gave three examples (largely from Google) to illustrate trends:</p>
<ul>
  <li>LLMs for understanding preferences</li>
  <li>Generative retrieval</li>
  <li>Transformers applied to content-heavy recommendation contexts (e.g., music)</li>
</ul>

<p>Then I returned again to the earlier triad (UX, domain knowledge, evaluation) and argued they still matter—but differently now:</p>
<ul>
  <li>UX and AI are now intertwined; sometimes the UX is the AI (chatbot-style discovery).</li>
  <li>Pretrained foundation models carry a lot of domain knowledge out of the box—but domain expertise still matters for constraints and evaluation.</li>
  <li>Evaluation is arguably more important now; measuring GenAI is hard, and feedback loops are subtle.</li>
</ul>

<p><strong>Demo 1: “basic LLM” recommendation from a handful of shows</strong></p>

<p>I demonstrated a simple prompt in Gemini: I gave it four Netflix shows I liked and asked for recommendations. What mattered wasn’t only the recommendation list—it was the “thinking trace” the model surfaced: identify attributes, extract themes, find common threads, craft categories, then produce options. And it worked. The model recommended:</p>
<ul>
  <li>Ozark (which I’d seen and liked)</li>
  <li>Mindhunter (which I hadn’t seen)</li>
  <li>Narcos (which I’d seen and liked)</li>
</ul>

<p>Also: it was different each time—both the blessing and the curse of generative systems.</p>

<p><strong>Demo 2: zero-history preference elicitation in five questions</strong></p>

<p>Then I made it more interesting: “assume you know nothing about me—ask me five yes/no questions and recommend five music artists I’ll like.” Again, the point wasn’t just the output. The model:</p>
<ul>
  <li>designed a question flow</li>
  <li>implicitly built a decision tree</li>
  <li>updated a “user profile” after each answer</li>
</ul>

<p>With five questions, it recommended: Tool, King Crimson, Animals as Leaders, Russian Circles, and Karnivool. I noted that some picks were a bit off for my taste—likely influenced by how I answered the “harsh vocals” question. But the more important observation remained: from a blank sheet, the system elicited preferences and produced plausible recommendations.</p>
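
<p>The underlying loop is simple enough to sketch. The two model calls (<code>ask_llm_for_next_question</code> and <code>recommend_from_profile</code>) are hypothetical placeholders for whatever LLM you would use; the point is the elicit-update-recommend structure:</p>

<pre><code class="language-python">def elicit_and_recommend(ask_user, n_questions=5, n_recs=5):
    profile = []  # running natural-language summary of what has been learned so far
    for _ in range(n_questions):
        question = ask_llm_for_next_question(profile)   # the model designs the next yes/no split
        answer = ask_user(question)                     # "yes" / "no"
        profile.append(f"{question} -> {answer}")       # implicit decision tree, one edge per answer
    return recommend_from_profile(profile, k=n_recs)    # e.g. five music artists
</code></pre>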

<h3 id="a-taxonomy-of-how-llms-enter-recsys">A taxonomy of how LLMs enter RecSys</h3>

<p>I referenced a diagram from the STAR paper that classifies approaches (see below):</p>
<ul>
  <li>pure prompting (like my demos)</li>
  <li>prompting with user history (user–item interactions)</li>
  <li>using LLMs to create semantic features/IDs/embeddings combined with collaborative signals</li>
  <li>two-LLM architectures: one for semantic features + CF, another LLM for final ranking</li>
</ul>

<p>The meta-point: “using LLMs” isn’t a single technique; it’s a design space.</p>

<p><img src="/blog/images/123-5.png" /></p>

<h1 id="part-iii--whats-next">Part III — What’s next</h1>

<p>I started this section with a screenshot from a startup called Fable: “Netflix for generative content.” The proposition is an extreme endpoint of personalization: not only recommending content but generating content, on the fly, personalized to each user. That’s “the last step” of personalization in one direction.</p>

<p>Then I returned again—intentionally—to the product triad:</p>
<ul>
  <li>UX design (especially in multimodal / agentic worlds)</li>
  <li>Domain knowledge + deep, continuous user knowledge</li>
  <li>Evaluation (now vastly harder)</li>
</ul>

<p><strong>Demo 3: a recommendation agent (Localify)</strong></p>

<p>I showed an agent built with Google AgentSpace. I called it “Localify.” Its job was simple: ask the user about tastes, search local events, and help find tickets. In the live demo, the agent didn’t ask the preference questions because it already “knew” my earlier answers (I had tested it). Based on what it remembered—rock, jazz, music, cinema—it recommended:</p>
<ul>
  <li>an indie rock concert</li>
  <li>a jazz evening</li>
  <li>an independent drama film</li>
</ul>

<p>Then it helped find a link for tickets. What I wanted to emphasize was how small the barrier has become:</p>
<ul>
  <li>the agent prompt was basic</li>
  <li>I used “help me write” and the LLM improved the prompt</li>
  <li>it took minutes, not weeks</li>
</ul>

<p>And if you want to make it more powerful, you can connect it to tools: calendar, email, enterprise systems, backend databases, …and even (dangerously) payment.</p>

<p><strong>(AI digression — the moment you add tools, you import responsibility)</strong></p>

<p>In one of my posts, I put it bluntly: AI is great for organizing/analyzing data, but it doesn’t have “gut,” intuition, or accountability—and that’s precisely why human judgment remains central. That maps directly to agents: the moment an agent can act, UX design and safety constraints stop being secondary concerns.</p>

<h2 id="agents-that-browse-the-web-and-recommend">Agents that browse the web and recommend</h2>

<p>I then showed a more advanced agent concept (Project Mariner): it can browse the web on your behalf—scroll, click, match opportunities to your resume, and execute a multi-step flow. The only additional capability (conceptually) is huge: delegated navigation in human UIs.</p>

<h2 id="world-models-genie-3-and-generated-reality">World models (Genie 3) and “generated reality”</h2>

<p>I showed a clip of “Genie 3,” positioning it as a frontier: not just generating text or images, but generating interactive worlds, with real-time reactivity, “world memory” (actions persist), and promptable events. This opens a window to a future where “personalized media” is not just personalized content—it’s personalized environments.</p>

<h2 id="deep-continuous-user-knowledge-the-personalization-paradox">Deep, continuous user knowledge: the personalization paradox</h2>

<p>LLMs have huge world knowledge; what’s still hard is injecting knowledge about you—accurately, safely, continuously. I showed a Gemini direction: more persistent memory so you don’t need to repeat “I like jazz and indie cinema” every time. That’s the personalization paradox: <strong>the model knows the world, it still struggles to know you (and to update that knowledge responsibly)</strong>.</p>

<h2 id="research-directions-i-highlighted">Research directions I highlighted</h2>

<p>I ended with a set of recent papers (three examples) illustrating trends:</p>
<ul>
  <li>aligning LLM-powered systems to user feedback (and novelty)</li>
  <li>serendipity / novelty with multimodal signals</li>
  <li>hybrid strategies that combine fine-tuning (infrequent) with RAG (more frequent) to keep user modeling fresh without constantly retraining</li>
</ul>

<h1 id="conclusion-a-journey-from-clicks-to-conversations">Conclusion: a journey from clicks to conversations</h1>

<p>I closed with a framing I’d encourage you to keep in mind as you design systems in 2026 and beyond: We’ve revisited the recommender problem multiple times. We started with predicting stars, then clicks. We’re shifting into conversations, and now agents—long-running, tool-using systems that discover on our behalf.</p>

<p>My current bets are:</p>
<ul>
  <li><strong>Agents are the future of discovery</strong>. They’ll search, filter, and propose options in the background, then surface novel things for us to engage with.</li>
  <li><strong>Personalization will remain the hard part</strong>. World knowledge scales. “User knowledge” is messy, dynamic, private, and consequential.</li>
  <li><strong>Evaluation is the new frontier</strong>. Especially for long-running, multi-step systems where value accrues over time and failure modes are subtle.</li>
  <li><strong>The ultimate prize might be “media of one.”</strong> Content not only discovered for you—but created for you, on the fly, personalized to what you want right now.</li>
</ul>

<p>And, because this is RecSys: karaoke remains a constant—and apparently so do 7am runs.</p>

<h1 id="qa-moments-worth-carrying-forward">Q&amp;A moments worth carrying forward</h1>

<p>A few audience questions surfaced important tensions:</p>

<p><strong>Recommend from catalog vs generate unique items?</strong> The cultural value of shared artifacts matters. If everyone gets a different show, what happens to shared conversation? My instinct: we’ll find hybrid dynamics—personalized creation plus social sharing (you can “send” your generated show).</p>

<p><strong>Will users really have long conversations vs passive feeds?</strong> Different modes will coexist. There are “brain-dead scroll” moments and “high-ROI search” moments (finding the next book vs watching a 30-second clip). The adoption of chat products is a strong counterexample to the idea that people never want conversational interfaces.</p>

<p><strong>If agents consume content, what incentives remain for creators?</strong> No clean answer yet. But historically, new creation tools tend to democratize creation rather than end it—and we should proactively design ecosystems that keep human creativity rewarded and visible.</p>

<h1 id="references">References</h1>

<ul>
  <li><strong>Amatriain, X.</strong> (2025). <a href="https://www.youtube.com/watch?v=TlR7douxQRM">Keynote at ACM RecSys 2025</a></li>
  <li><strong>Konstan, J. A., et al.</strong> (1997). <a href="https://dl.acm.org/doi/10.1145/245108.245126">GroupLens: Applying collaborative filtering to Usenet news</a></li>
  <li><strong>Koren, Y.</strong> (2009). <a href="https://www2.seas.gwu.edu/~simhaweb/champalg/cf/papers/KorenBellKor2009.pdf">The BellKor Solution to the Netflix Grand Prize</a></li>
  <li><strong>Karatzoglou, A., Amatriain, X., Baltrunas, L., &amp; Oliver, N.</strong> (2010). <a href="https://dl.acm.org/doi/10.1145/1864708.1864727">Multiverse Recommendation: N-dimensional Tensor Factorization for Context-aware Collaborative Filtering.</a></li>
  <li><strong>Amatriain, X. &amp; Basilico, J.</strong> (2012). <a href="https://netflixtechblog.com/netflix-recommendations-beyond-the-5-stars-part-1-55838468f429">Netflix Recommendations: Beyond the 5 stars.</a></li>
  <li><strong>Le, Q. V., et al.</strong> (2012). <a href="https://arxiv.org/abs/1112.6209">Building high-level features using large scale unsupervised learning (The “Cat” paper).</a></li>
  <li><strong>Covington, P., et al.</strong> (2016). <a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45530.pdf">Deep Neural Networks for YouTube Recommendations.</a></li>
  <li><strong>Vaswani, A., et al.</strong> (2017). <a href="https://arxiv.org/abs/1706.03762">Attention Is All You Need</a>.</li>
  <li><strong>Amatriain, X.</strong> (2024). <a href="https://arxiv.org/abs/2302.07730">Transformer models: an introduction and catalog.</a></li>
  <li><strong>Minaee, S., et al.</strong> (2024). <a href="https://arxiv.org/abs/2402.06196">Large Language Models: A Survey.</a></li>
  <li><strong>Amatriain, X.</strong> (2024). <a href="https://arxiv.org/abs/2401.14423">Prompt Design and Engineering: Introduction and Advanced Methods.</a></li>
  <li><strong>Lee, D., et al.</strong> (2024) <a href="https://arxiv.org/abs/2410.16458">STAR: A Simple Training-free Approach for Recommendations using Large Language Models</a></li>
  <li><strong>Wang, J., et al</strong> (2025) <a href="https://arxiv.org/abs/2504.05522">User Feedback Alignment for LLM-powered Exploration in Large-scale Recommendation Systems</a></li>
  <li><strong>Meng, C., et al.</strong> (2025) <a href="https://arxiv.org/abs/2510.20260">Balancing Fine-tuning and RAG: A Hybrid Strategy for Dynamic LLM Recommendation Updates</a></li>
</ul>]]></content><author><name>Xavier</name></author><category term="Recsys" /><category term="Artificial Intelligence" /><category term="Recommender Systems" /><category term="Machine Learning" /><summary type="html"><![CDATA[This blog post is a detailed summary of my recent keynote at ACM RecSys 2025 in Prague. You can watch the full video here.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://amatria.in/blog/blog/images/123-0.png" /><media:content medium="image" url="https://amatria.in/blog/blog/images/123-0.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Abundance, Anxiety, AI, and Algorithms: An ‘A-List’ of Books for Decoding the Modern World</title><link href="https://amatria.in/blog/recent-books-2026" rel="alternate" type="text/html" title="Abundance, Anxiety, AI, and Algorithms: An ‘A-List’ of Books for Decoding the Modern World" /><published>2026-01-02T00:00:01+00:00</published><updated>2026-01-02T00:00:01+00:00</updated><id>https://amatria.in/blog/books</id><content type="html" xml:base="https://amatria.in/blog/recent-books-2026"><![CDATA[<p>It has been quite a while since I last shared a reading list on this blog. In the fast-paced world of technology, it’s easy to get caught up in the stream of papers and newsletters, but I’ve always found that books provide the necessary depth and historical context to truly understand where we are heading.
As I sat down to synthesize my recent readings, I realized that the core themes converged into what I’ve started calling my “A-List” of recent books: Abundance, Anxiety, AI, and Algorithms. It’s a playful alliteration, but one that captures the profound tension between the potential for technological plenty and the societal costs we are only beginning to calculate. Over the past few months, I’ve been diving into a diverse set of titles that help decode these four forces, categorizing them into AI, Tech, Leadership, and “Other Important Ideas.”</p>

<p><img src="/blog/images/122-0.png" /></p>

<h3 id="ai-and-the-ai-revolution">AI and the AI Revolution</h3>

<p>The current revolution isn’t just about code; it’s about the fundamental nature of intelligence and the hardware that powers it.</p>

<p><img src="/blog/images/122-1.png" /></p>

<ul>
  <li><a href="https://whatisintelligence.antikythera.org/">“What is Intelligence: Lessons from AI about Evolution, Computing, and Mind” by Blaise Aguera y Arcas</a>: My former colleague at Google, Blaise, delivers an incredible deep dive into the meaning of life through the lens of computation. It’s a profound look at how AI helps us redefine what it means to be alive and intelligent.</li>
  <li><a href="https://www.amazon.com/Brief-History-Intelligence-Humans-Breakthroughs/dp/0063286343">“A Brief History of Intelligence:: Evolution, AI, and the Five Breakthroughs That Made Our Brains” by Max S. Bennett</a>: A great companion to Blaise’s book, focusing on the five breakthroughs that shaped our brains.</li>
  <li><a href="https://www.amazon.com/Chip-War-Worlds-Critical-Technology/dp/1982172002">“Chip War: The Fight for the World’s Most Critical Technology” by Chris Miller</a>: To understand the AI revolution, you must understand the silicon. Miller provides a fascinating look at how companies like TSMC, ASML, and NVIDIA reached their current dominance and how these dynamics are shaping global geopolitics.</li>
  <li><a href="https://www.amazon.com/Nvidia-Way-Jensen-Huang-Making/dp/1324086718">“The Nvidia Way: Jensen Huang and the Making of a Tech Giant”)</a>: For those of us living and breathing the AI revolution from the inside, this history of NVIDIA is essential reading. It serves as a perfect companion to Chris Miller’s Chip Wars; while Miller provides the macro-perspective of the silicon landscape, this book dives deep into the specific company culture and technical bets that allowed NVIDIA to dominate that landscape.</li>
  <li><a href="https://www.amazon.com/Optimist-Altman-OpenAI-Invent-Future/dp/1324075961">“The Optimist: Sam Altman, OpenAI, and the Race to Invent the Future”</a>: More than an authorized biography, this is a detailed history of the characters and moments that shaped OpenAI and modern Silicon Valley.</li>
  <li><a href="https://www.amazon.com/Singularity-Nearer-Ray-Kurzweil-ebook/dp/B08Y6FYJVY">“The Singularity is Nearer” by Ray Kurzweil</a>: A long-awaited update to his classic thesis on the merging of human and machine.</li>
  <li><a href="https://www.amazon.com/Worlds-See-Curiosity-Exploration-Discovery-ebook/dp/B0BPQSLVL6">“The Worlds I See” by Dr. Fei-Fei Li</a>: A beautiful memoir about curiosity and the dawn of modern AI from one of the field’s most important pioneers.</li>
  <li><a href="https://www.amazon.com/dp/059373422X">“Nexus” by Yuval Noah Harari</a>: Harari looks at information networks from the Stone Age to AI, providing his usual sweeping historical perspective.</li>
  <li><a href="https://www.amazon.com/Co-Intelligence-Living-Working-Ethan-Mollick/dp/059371671X">“Co-Intelligence” by Ethan Mollick</a>: One of the most practical guides out there for actually living and working alongside AI today.</li>
</ul>

<h3 id="tech-and-its-people">Tech and Its People</h3>

<p>Understanding tech often requires understanding the “DNA” of the institutions and individuals that built it.</p>

<p><img src="/blog/images/122-2.png" /></p>

<ul>
  <li><a href="https://www.amazon.com/Idea-Factory-Great-American-Innovation/dp/0143122797">“The Idea Factory” by Jon Gertner</a>: A look back at Bell Labs, the original powerhouse of American innovation.</li>
  <li><a href="https://www.amazon.com/Elon-Musk-Walter-Isaacson/dp/1982181281">“Elon Musk” by Walter Isaacson</a>: Regardless of your personal opinion of him, Isaacson’s biography explains a lot about his trajectory and the “demon mode” that drives his companies.</li>
  <li><a href="https://www.amazon.com/This-Everyone-Unfinished-Story-World/dp/0374612463">“This is For Everyone” by Tim Berners-Lee</a>: I stumbled upon this recently. It’s the story of the WWW told by the man who invented it—covering the past, present, and his vision for the future.</li>
  <li><a href="https://www.amazon.com/Plex-Google-Thinks-Works-Shapes/dp/1416596585">“In the Plex” by Steven Levy</a>: Even though it’s missing the last 15 years, it remains one of the best books for understanding Google’s foundational culture.</li>
  <li><a href="https://www.amazon.com/Careless-People-Cautionary-Power-Idealism/dp/1250391237">“Careless People: A Cautionary Tale of Power, Greed, and Lost Idealism by Sarah Wynn-Williams”</a>: A raw, personal account of Meta’s cultural stumbles from someone who had a front-row seat to the internal dynamics.</li>
  <li><a href="https://www.amazon.com/Source-Code-Beginnings-Bill-Gates/dp/059380158X">“Source Code: My Beginnings” by Bill Gates</a>: A fascinating look at the early life of Bill Gates and the birth of Microsoft. I learned quite a bit I didn’t know about his early years.</li>
  <li><a href="https://www.amazon.com/Pattern-Breakers-Start-Ups-Change-Future/dp/1541704355">“Pattern Breakers: Why Some Start-Ups Change the Future” by Mike Maples Jr. and Peter Ziebelman</a>: An insightful look at why some startups manage to change the future while most fail.</li>
</ul>

<h3 id="leadership-culture-and-human-nature">Leadership, Culture, and Human Nature</h3>

<p>As I discussed in my <a href="https://amatria.in/blog/challengeinspire">Challenge-Inspire model post</a>, leadership is about more than just task management; it’s about understanding the human element.</p>

<p><img src="/blog/images/122-3.png" /></p>

<ul>
  <li><a href="https://www.amazon.com/Reset-How-Change-Whats-Working/dp/1668062097">“Reset: How to Change What’s Not Working” by Dan Heath</a>: This is a very practical book on changing systems—like our teams—that aren’t working optimally. It focuses on finding leverage points to drive real change. Highly recommended.</li>
  <li><a href="https://www.amazon.com/Laws-Human-Nature-Robert-Greene/dp/0525428143">“The Laws of Human Nature” by Robert Greene</a>: A comprehensive guide to understanding behavior and communication. It’s dense with historical references and provides great advice on how to bring out the best in people.</li>
  <li><a href="https://www.amazon.com/Supercommunicators-Unlock-Secret-Language-Connection/dp/0593243919]">“Supercommunicators” by Charles Duhigg</a>: A recent bestseller that provides a great framework for connecting with others.</li>
  <li><a href="https://www.amazon.com/How-Know-Person-Seeing-Others/dp/059323006X">“How to Know a Person” by David Brooks</a>: A very personal guide on how to foster deeper connections. Brooks acknowledges his own challenges in this area, which makes his tips and advice feel very grounded and earned.</li>
  <li><a href="https://www.amazon.com/First-Break-All-Rules-Differently/dp/0684852861">“First, Break All the Rules” by Marcus Buckingham</a>: A classic that still holds up regarding what great managers do differently.</li>
  <li><a href="https://www.amazon.com/How-Decide-Simple-Making-Choices/dp/0593418484">“How to Decide” by Annie Duke</a>: More tools for the decision-making toolkit, which you know is a <a href="https://amatria.in/blog/datagutdecisions">favorite topic of mine</a>.</li>
  <li><a href="https://www.amazon.com/Start-Why-Leaders-Inspire-Everyone/dp/1591846447">“Start with Why” by Simon Sinek</a>: A foundational text on how great leaders inspire action.</li>
</ul>

<h3 id="other-important-ideas">Other Important Ideas</h3>

<p>Finally, a few books that have challenged my perspective on the broader world.</p>

<p><img src="/blog/images/122-4.png" /></p>

<ul>
  <li><a href="https://www.amazon.com/Capital-Twenty-Century-Thomas-Piketty/dp/067443000X">“Capital in the 21st Century” by Thomas Piketty</a>: This was proposed for a book club, and at 700 pages, it was daunting. However, I was pleasantly surprised. It’s a deeply researched work on macroeconomics and inequality that provides the essential historical background and context for more contemporary shifts.</li>
  <li><a href="https://www.amazon.com/Abundance-Progress-Takes-Ezra-Klein/dp/1668023482">“Abundance” by Ezra Klein</a>: With Piketty’s historical lens in place, Klein’s “Abundance” is a fascinating read. While it may appear techno-optimistic on the surface, its main lesson for me was how easily well-intentioned policies can falter when built on incorrect assumptions about the future—a critical takeaway as we navigate a world reshaped by AI and automation.</li>
  <li><a href="https://www.amazon.com/Anxious-Generation-Rewiring-Childhood-Epidemic/dp/0593655036">“The Anxious Generation” by Jonathan Haidt</a>: This is a critical look at how the “rewiring of childhood” via technology is impacting mental health. It serves as a stark counter-narrative to the techno-optimism of Ezra Klein; while Klein focuses on the potential for abundance, Haidt exposes the very real dangers and social costs of unbridled technological adoption. The book has become extremely influential, acting as a catalyst for new legislation and policy changes regarding smartphone and social media usage for minors around the world.</li>
  <li><a href="https://www.amazon.com/End-World-Just-Beginning-Globalization/dp/006323047X">“The End of the World is Just the Beginning” by Peter Zeihan</a>: A provocative mapping of the potential collapse of globalization. He argues that the era of global trade and secure transport is a historical outlier that is rapidly ending due to demographic shifts and changing US policy. It provides a broader, macro-strategic complement to Miller’s Chip Wars; while Miller focuses on the specific geopolitical struggle over silicon, Zeihan maps the decaying global order that makes that struggle so volatile.</li>
  <li><a href="https://en.wikipedia.org/wiki/The_Vital_Question">“The Vital Question” by Nick Lane</a>: A technical but rewarding explanation of how life came to be, focusing on energy constraints rather than just information. Lane argues that the leap from simple to complex life was a rare, energetic fluke, requiring a level of power that standard evolution struggled to achieve. The book has received immense critical acclaim, most notably from Bill Gates, who famously claimed it was the best book he’d read in years and that it would “help people understand that energy is as fundamental as information.”</li>
</ul>

<p>I hope you find something in this list that sparks your curiosity. As always, I’m curious to hear what you’ve been reading. Are there any books that have fundamentally shifted your perspective lately? Let me know in the comments!</p>]]></content><author><name>Xavier</name></author><category term="Books" /><category term="Artificial Intelligence" /><category term="Leadership" /><category term="Technology" /><summary type="html"><![CDATA[It has been quite a while since I last shared a reading list on this blog. In the fast-paced world of technology, it’s easy to get caught up in the stream of papers and newsletters, but I’ve always found that books provide the necessary depth and historical context to truly understand where we are heading. As I sat down to synthesize my recent readings, I realized that the core themes converged into what I’ve started calling my “A-List” of recent books: Abundance, Anxiety, AI, and Algorithms. It’s a playful alliteration, but one that captures the profound tension between the potential for technological plenty and the societal costs we are only beginning to calculate. Over the past few months, I’ve been diving into a diverse set of titles that help decode these four forces, categorizing them into AI, Tech, Leadership, and “Other Important Ideas.”]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://amatria.in/blog/blog/images/122-0.png" /><media:content medium="image" url="https://amatria.in/blog/blog/images/122-0.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">2025: When AI Started to Act (and ‘SOTA’ Lasted a Week)’</title><link href="https://amatria.in/blog/2025-review" rel="alternate" type="text/html" title="2025: When AI Started to Act (and ‘SOTA’ Lasted a Week)’" /><published>2025-12-21T00:00:01+00:00</published><updated>2025-12-21T00:00:01+00:00</updated><id>https://amatria.in/blog/2025-review</id><content type="html" xml:base="https://amatria.in/blog/2025-review"><![CDATA[<h2 id="a-year-of-reasoning-agents-and-compressed-innovation-cycles">A year of reasoning, agents, and compressed innovation cycles</h2>

<p>If 2024 was the year of the chatbot, 2025 has been the year AI started to think—or at least, the year we started debating what “thinking” really means. It has been a year of profound shifts: from simple instruction following to complex reasoning, from “vibes” to verifiable actions, and from general-purpose models to specialized agents. It has also been a year of significant change for me personally, as I moved on from Google to explore new challenges.</p>

<p>This recap is based on a set of monthly reports I wrote throughout 2025, originally as a way to keep track of the fast‑moving AI landscape for myself and for my team at Google. Over time, those notes became a disciplined way to separate signal from noise and to understand how individual launches and papers fit into a broader trajectory.</p>

<p>2025 turned out to be a year of both acceleration and recalibration. Breakthroughs in reasoning models, agents, multimodality, and efficiency continued at a remarkable pace, while questions around economics, safety, and real‑world impact became harder to ignore. Stepping back each month made it easier to see which themes truly mattered.</p>

<p><img src="/blog/images/121-0.jpeg" /></p>

<p>What follows is a month‑by‑month synthesis of those notes originally gathered for my internal newsletter at Google, curated and lightly expanded for this recap. The goal is not to be exhaustive, but to capture how the year actually unfolded. That said, let me know if I’m missing something important. I am also including direct links to each month in case you have a favorite one!</p>

<ul>
  <li><a href="#jan">January: The Efficiency Shock and the “Peak Data” Debate</a></li>
  <li><a href="#feb">February: The “DeepSeek Moment” and Grok’s Rise</a></li>
  <li><a href="#mar">March: “Agentic Moore’s Law”</a></li>
  <li><a href="#apr">April: Peering Inside the Box</a></li>
  <li><a href="#may">May: Ecosystem Moves and “Vibe Coding”</a></li>
  <li><a href="#jun">June: Superintelligence and “AI-nxiety”</a></li>
  <li><a href="#jul">July: Agents Getting Real and Practical</a></li>
  <li><a href="#aug">August: The “Bubble” Panic and GPT-5’s Arrival</a></li>
  <li><a href="#sep">September: “Nano Banana” and Scientific Breakthroughs</a></li>
  <li><a href="#oct">October: Economic Value over Benchmarks</a></li>
  <li><a href="#nov">November: The New SOTA Battleground</a></li>
  <li><a href="#dec">December: Code Red and the Grand Finale</a></li>
  <li><a href="#wrong">What I Got Wrong in 2025</a></li>
  <li><a href="#bonus">Bonus: A Deeper Dive</a></li>
  <li><a href="#conclusion">Conclusion</a></li>
</ul>

<h3 id="-january-the-efficiency-shock-and-the-peak-data-debate"><a name="jan"></a> January: The Efficiency Shock and the “Peak Data” Debate</h3>
<p>The year began with a wake-up call on efficiency. DeepSeek R1 made headlines not just for its performance, but for its remarkably low training cost, sparking a massive debate on whether we were entering a new era of “Post-Training” efficiency. While some celebrated this as a democratization moment, it’s important to read the fine print: techniques like distillation and fine-tuning are powerful, but they often rely on the existence of larger, more expensive frontier models. At the same time, we saw healthy skepticism emerge regarding agents, with papers arguing that for simple tasks, we might be over-engineering solutions.</p>

<p><img src="/blog/images/121-1.jpeg" /></p>

<ul>
  <li><a href="https://arxiv.org/abs/2501.12948">DeepSeek R1 reasoning model</a> - The paper that started the year’s obsession with reasoning efficiency.</li>
  <li><a href="https://novasky-ai.github.io/posts/sky-t1/">Sky-T1</a> - Claiming 01-like performance on a mere $450 budget (distillation is key here!).</li>
  <li><a href="https://mistral.ai/news/codestral/">Mistral Codestral</a> - Mistral continuing to push the open-weight coding frontier.</li>
  <li><a href="https://arxiv.org/abs/2407.01489">Agentless</a> - A provocative paper arguing that for simple software engineering tasks, you might not actually need complex agents.</li>
  <li><a href="https://home.mlops.community/public/collections/agents-in-production-2024-2024-11-15">Agents in Production Talks</a> - A good collection of talks on agents in production (yes, these exist!).</li>
  <li><a href="https://www.nvidia.com/en-us/ai-data-science/workstations/">Nvidia Digits</a> - Launching an AI desktop.</li>
  <li><a href="https://techweez.com/2024/11/15/ai-visionary-francois-chollet-exits-google-to-champion-next-gen-agi-challenges/">Francois Chollet Interview</a> - A must-watch discussion on AGI and how to measure it.</li>
  <li><a href="https://www.youtube.com/watch?v=9vM4p9NN0Ts">Stanford Lecture on LLMs</a> - Covering the often overlooked basics: tokenization, data, and evals.</li>
  <li><a href="https://www.deeplearning.ai/the-batch/ai-product-managers-will-be-in-demand/">Andrew Ng’s Newsletter</a> - Arguing that AI PMs are the future of software development teams.</li>
  <li><a href="https://www.ipsos.com/en-us/google-ipsos-multi-country-ai-survey-2025">Google Survey on State of AI</a> - An interesting multi-country pulse check.</li>
  <li>I compiled <a href="https://amatria.in/blog/2024research">my favorite 2024 AI papers</a> and shared <a href="https://amatria.in/blog/ageofdata">my view</a> on “Peak Data” and “Scaling Laws”.</li>
</ul>

<p>January set the stage for a year where data quality, reasoning reliability, and evaluation rigor would repeatedly resurface as central constraints rather than secondary concerns.</p>

<h3 id="-february-the-deepseek-moment-and-groks-rise"><a name="feb"></a> February: The “DeepSeek Moment” and Grok’s Rise</h3>

<p>February was dominated by the aftermath of DeepSeek and the surprise arrival of Grok 3. We spent weeks dissecting the DeepSeek papers, realizing that while it wasn’t necessarily a fundamental research breakthrough, it was a fantastic engineering feat that cleverly optimized known techniques. Meanwhile, Grok 3’s arrival—developed by a small team in just 18 months—shook up the leaderboards. We also began to realize that benchmarks are becoming increasingly “gameable” via test-time compute, making “Price vs. Performance” the new metric that matters.</p>

<p><img src="/blog/images/121-2.jpeg" /></p>

<ul>
  <li><a href="https://x.ai/blog/grok-3">Grok 3</a> - Developed in just 18 months by a small team, hitting Rank 1 on Chatbot Arena.</li>
  <li><a href="https://lmarena.ai/?leaderboard">Imarena Price Plot</a> - The new “Arena-Price Plot” became the most important chart for a few weeks.</li>
  <li><a href="https://semianalysis.com/2025/01/31/deepseek-debates/">Semianalysis: DeepSeek Debates</a> - A deep dive into Chinese leadership on cost and true training margins.</li>
  <li><a href="https://stratechery.com/2025/deepseek-faq/">Stratechery’s DeepSeek FAQ</a> - Ben Thompson’s breakdown of the situation.</li>
  <li><a href="https://openai.com/index/introducing-operator/">OpenAI Operator</a> and <a href="https://openai.com/index/introducing-deep-research/">Deep Research</a> - OpenAI’s response, pushing into agentic research.</li>
  <li><a href="https://www.perplexity.ai/hub/blog/introducing-perplexity-deep-research">Perplexity Deep Research</a> - Perplexity quickly following suit with their own implementation.</li>
  <li><a href="https://humanityslastexam.com/">Humanity’s Last Exam</a> - A benchmark attempting to evaluate true reasoning capabilities.</li>
  <li><a href="https://cerebras.ai/blog/mistral-le-chat">Mistral “Fastest Chatbot”</a> - Powered by Cerebras hardware.</li>
  <li><a href="https://developer.nvidia.com/blog/automating-gpu-kernel-generation-with-deepseek-r1-and-inference-time-scaling/">AI Assisted CUDA Kernels</a> - Nvidia and Sakana.ai showing how AI can write low-level code.</li>
  <li><a href="https://www.google.com/search?q=https://blog.google/technology/ai/google-gemini-next-generation-model-february-2025/%23gemini-2.0-flash">Google Gemini 2.0 Flash</a> - Competitive pressure driving efficiency.</li>
</ul>

<p><img src="/blog/images/121-2b.jpeg" /></p>

<p>February made clear that efficiency breakthroughs amplify—not reduce—the need for transparency, reproducibility, and robust evaluation.</p>

<h3 id="-march-agentic-moores-law"><a name="mar"></a> March: “Agentic Moore’s Law”</h3>

<p>By March, the conversation shifted heavily toward Agents. We started seeing real data suggesting an “Agentic Moore’s Law,” where the length of tasks agents can solve autonomously is doubling roughly every 7 months. This was also the month Andrej Karpathy dropped his “Deep Dive,” reminding us that despite the hype, LLMs still struggle with basic token-level tasks (like counting ‘r’s in “strawberry”) and that prompting is still very much an art form.</p>

<p><img src="/blog/images/121-3.jpeg" /></p>

<ul>
  <li><a href="https://www.youtube.com/watch?v=7xTGNNLPyMI">Karpathy’s Deep Dive into LLMs</a> - Explaining why LLMs can’t count “r’s” in “strawberry” and why RLHF isn’t “true RL”.</li>
  <li><a href="https://arxiv.org/pdf/2503.14499">The “Agentic Moore’s Law”</a> - Interesting data showing the length of tasks agents can solve is doubling every 7 months.</li>
  <li><a href="https://sakana.ai/ai-scientist-first-publication/">Sakana AI Scientist</a> - Generating the first peer-reviewed scientific publication entirely by AI.</li>
  <li><a href="https://www.youtube.com/watch?v=K27diMbCsuw">Manus: The General AI Agent</a> - Another contender in the general agent space.</li>
  <li><a href="https://arxiv.org/abs/2503.01935">MultiAgentBench</a> - Evaluating how agents collaborate and compete.</li>
  <li><a href="https://techcrunch.com/2025/03/06/a-quarter-of-startups-in-ycs-current-cohort-have-codebases-that-are-almost-entirely-ai-generated/">YC AI Codebases</a> - A quarter of YC startups now have codebases that are almost entirely AI-generated.</li>
  <li><a href="https://www.youtube.com/watch?v=5WEcsg5jpDw">Interview Coder</a> - The viral tool (and cheating concern) for coding interviews.</li>
  <li><a href="https://www.youtube.com/watch?v=7j_NE6Pjv-E">Model Context Protocol (MCP)</a> - Why standardizing context matters for tools.</li>
  <li><a href="https://arxiv.org/abs/2502.09992">Large Language Diffusion Models</a> - Exploring diffusion for text generation.</li>
  <li><a href="https://techcrunch.com/2025/02/24/perplexity-teases-a-web-browser-called-comet/">Perplexity Comet</a> - A browser designed specifically for agentic search.</li>
  <li><a href="https://deepmind.google/discover/blog/gemini-robotics-brings-ai-into-the-physical-world/">Pushing AI into the physical world with Gemini Robotics</a></li>
  <li><a href="https://blog.google/technology/developers/gemma-3/">Introducing Gemma 3, highly capable for single GPU/TPU</a></li>
  <li><a href="https://blog.google/products/search/ai-mode-search/">Enhancements to Google Search with AI Overviews and a new AI Mode</a></li>
</ul>

<p><img src="/blog/images/121-3b.png" /></p>

<p>March previewed a broader shift toward autonomy and structured reasoning that would accelerate throughout the year.</p>

<h3 id="april-peering-inside-the-box"><a name="apr"></a>April: Peering Inside the Box</h3>

<p>April was about peering inside the black box. Anthropic released fascinating research on tracing the internal “thoughts” of models, while Meta’s Llama 4 release highlighted a crucial finding: Reinforcement Learning is proving much more important than Supervised Fine-Tuning (SFT). In fact, the data suggested that too much SFT can actually hurt performance. This was also the month OpenAI released GPT-4.1, which felt like a minor iteration compared to the architectural shifts we were seeing elsewhere.</p>

<p><img src="/blog/images/121-4.jpeg" /></p>

<ul>
  <li><a href="https://www.anthropic.com/research/tracing-thoughts-language-model">Tracing the Thoughts of an LLM</a> - Anthropic’s fascinating research on internal activations.</li>
  <li><a href="https://ai.meta.com/blog/llama-4-multimodal-intelligence/">Introducing Llama 4</a> - Meta moving to Mixture of Experts (MoEs) and huge base models (“Behemoth”).</li>
  <li><a href="https://www.theverge.com/meta/645012/meta-llama-4-maverick-benchmarks-gaming">Llama 4 and the Benchmark Crisis</a> - The Verge’s take on how Llama 4 broke our evaluation metrics.</li>
  <li><a href="https://x.com/tobi/status/1909251946235437514">Shopify’s AI Mandate</a> - Tobi Lütke’s internal memo mandating AI usage for employees.</li>
  <li><a href="https://hai.stanford.edu/news/ai-index-2025-state-of-ai-in-10-charts?utm_source=tldrai">Stanford AI Index 2025</a> - The state of AI in 10 charts.</li>
  <li><a href="https://openai.com/index/gpt-4-1/">OpenAI GPT-4.1</a> - A minor improvement that noticeably lacked comparisons to non-GPT models.</li>
  <li><a href="https://goo.gle/3G4DNic">Project AMIE Nature Paper on diagnostic accuracy</a> and on <a href="https://goo.gle/3G1naUu">assisting clinicians</a> - Two publications on conversational medical AI.</li>
  <li><a href="https://blog.google/technology/ai/dolphingemma/">DolphinGemma</a> - Using AI to decode dolphin communication (yes, really).</li>
  <li><a href="https://cloud.google.com/blog/products/ai-machine-learning/google-cloud-next-2025-wrap-up">Cloud Next 2025 and infrastructure announcements</a></li>
</ul>

<p>April reinforced that evaluation quality and RL‑driven training were no longer optional—they were becoming core pillars of progress. At the same time, some serious questions came up about the validity of public benchmarks.</p>

<h3 id="-may-ecosystem-moves-and-vibe-coding"><a name="may"></a> May: Ecosystem Moves and “Vibe Coding”</h3>

<p>May felt like a consolidation month. OpenAI went on an acquisition spree (buying Jony Ive’s startup and Windsurf), signaling a push into hardware and broader ecosystems. Meanwhile, Mark Zuckerberg was on a podcast tour with a refreshing level of honesty, admitting that Llama is essentially a byproduct of Meta’s internal needs rather than a purely altruistic developer play. We also saw the rise of “Vibe Coding”—a development style that prioritizes speed and flow over rigorous syntax—gaining legitimacy. Coding, multimodality, and enterprise applications increasingly shared the same underlying capabilities, even as models became more specialized at the surface.</p>

<p><img src="/blog/images/121-5.jpeg" /></p>

<ul>
  <li><a href="https://www.theverge.com/news/671838/openai-jony-ive-ai-hardware-apple">OpenAI buys Jony Ive’s hardware startup</a></li>
  <li><a href="https://stratechery.com/2025/an-interview-with-meta-ceo-mark-zuckerberg-about-ai-and-the-evolution-of-social-media/">Zuckerberg on Llama</a> - His podcast tour explaining Llama as a byproduct of internal needs.</li>
  <li><a href="https://www.anthropic.com/news/claude-4">Anthropic Claude Opus 4 and Sonnet 4 </a>- Continued pressure at the high end.</li>
  <li><a href="">Google I/O 2025 announcements</a> https://blog.google/technology/ai/google-io-2025</li>
  <li><a href="https://news.microsoft.com/build-2025-book-of-news/">Microsoft Build 2025</a> - 50+ announcements for developers.</li>
  <li><a href="https://www.linkedin.com/pulse/state-vibe-coding-tools-may-2025-nufar-gaspar-x1znf/">Vibe Coding Podcast</a> - The AI Daily Brief on the state of “vibecoding.”</li>
  <li><a href="https://www.nature.com/articles/s41599-025-04787-y">The Effect of ChatGPT on Students</a> - A Nature article providing valuable data on AI in education.</li>
  <li><a href="https://open.substack.com/pub/robotic/p/brakes-on-an-intelligence-explosion?r=4pd1ap&amp;utm_campaign=post&amp;utm_medium=email">Brakes on Intelligence Explosion</a> - Nathan Lambert offering a counterpoint to the “AGI by 2027” hype.</li>
  <li><a href="https://techcrunch.com/2025/05/07/netflix-debuts-its-generative-ai-powered-search-tool/">Netflix AI Search</a> - Generative AI hitting mainstream consumer UI.</li>
  <li><a href="https://deepmind.google/discover/blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/">AlphaEvolve</a> - AI designing algorithms to save compute costs.</li>
</ul>

<h3 id="-june-superintelligence-and-ai-nxiety"><a name="jun"></a> June: Superintelligence and “AI-nxiety”</h3>

<p>In June, the race for “Superintelligence” became explicit, with Meta forming a dedicated lab and aggressively poaching talent (reportedly paying 7-to-9 figures). But alongside this race, we started seeing the human cost: “AI-nxiety.” Developers and users alike expressed exhaustion at the relentless pace of updates. We also saw Apple release their “Illusion of Thinking” paper, a controversial splash that argued current reasoning models might be shallower than we assume—a debate that is still ongoing.</p>

<p><img src="/blog/images/121-6.jpeg" /></p>

<ul>
  <li><a href="https://www.reuters.com/business/finance/meta-finalizes-investment-scale-ai-valuing-startup-29-billion-2025-06-13/">Meta’s Superintelligence Lab</a> - Poaching Scale AI’s CEO to lead the charge.</li>
  <li><a href="https://www.cnbc.com/2025/06/09/openai-hits-10-billion-in-annualized-revenue-fueled-by-chatgpt-growth.html?utm_source=tldrnewsletter">OpenAI hits $10B ARR </a>- The business of AI is booming.</li>
  <li><a href="https://www.businesstoday.in/technology/news/story/uae-makes-chatgpt-plus-subscription-free-for-all-residents-as-part-of-deal-with-openai-477948-2025-05-27">UAE Free ChatGPT</a> - A nation-state strategy to accelerate adoption.</li>
  <li><a href="https://www.youtube.com/watch?v=DrygcOI-kG8">LangChain Keynote</a> - Harrison Chase on the state of agents.</li>
  <li><a href="https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf">Apple’s “Illusion of Thinking”</a>- A controversial paper arguing models might not be reasoning as deeply as we think.</li>
  <li><a href="https://www.youtube.com/watch?v=0_DjDdfqtUE">Apple WWDC</a> - Apple continuing to play catch-up in the generative race.</li>
  <li><a href="https://semianalysis.com/2025/06/08/scaling-reinforcement-learning-environments-reward-hacking-agents-scaling-data/">Scaling Reinforcement Learning</a> - Semianalysis on why RL is the next frontier.</li>
  <li><a href="https://www.interconnects.ai/p/what-comes-next-with-reinforcement">What’s Next for RL</a> - Nathan Lambert’s take.</li>
  <li><a href="https://somehowmanage.com/2025/05/19/ai-is-awesome-but-its-fucking-exhausting/">AI-nxiety </a>- “AI is Awesome but It’s Fucking Exhausting.”</li>
  <li><a href="https://www.economist.com/technology/2025/06/ai-data-centers-energy">Infrastructure constraints and energy considerations for AI data centers</a></li>
</ul>

<p>June underscored a recurring theme: progress is increasingly gated by infrastructure, incentives, and workflows—not by model quality alone.</p>

<h3 id="-july-agents-getting-real-and-practical"><a name="jul"></a> July: Agents Getting Real and Practical</h3>

<p>July saw agents moving from cool demos to practical, integrated products. OpenAI merged their Operator and Deep Research teams, signaling that agents are the new search. We also saw smart implementation tactics from companies like Shopify, who are building innovative agents that access internal data via the Model Context Protocol (MCP). The economic argument also solidified this month: we began to see compelling evidence that LLM inference costs are dropping fast enough to potentially become cheaper than traditional search.</p>

<p><img src="/blog/images/121-7.jpeg" /></p>

<ul>
  <li><a href="https://www.cnbc.com/2025/06/30/mark-zuckerberg-creating-meta-superintelligence-labs-read-the-memo.html">Zuckerberg’s Superintelligence Memo</a> - Now public.</li>
  <li><a href="https://openai.com/index/introducing-chatgpt-agent/">OpenAI ChatGPT Agent</a> - Merging Operator and Deep Research into one system.</li>
  <li><a href="https://deepmind.google/discover/blog/advanced-version-of-gemini-with-deep-think-officially-achieves-gold-medal-standard-at-the-international-mathematical-olympiad/">Gemini Deep think formally achieves gold-medal at the International Mathematical Olympiad</a></li>
  <li><a href="https://x.com/Kimi_Moonshot/status/1945897926796185841">Moonshot Kimi K2</a> - Reaching #1 on open model spots.</li>
  <li><a href="https://x.com/xai/status/1943158495588815072">xAI releases Grok4</a></li>
  <li><a href="https://www.firstround.com/ai/shopify">Shopify’s AI Tactics</a> - Building innovative agents accessing internal data via MCPs.</li>
  <li><a href="https://www.elenaverna.com/p/the-rise-of-the-ai-native-employee">The AI-Native Employee</a> - Papers on how AI is changing the nature of work and teamwork.</li>
  <li><a href="https://www.youtube.com/watch?v=LCEmiRjPEtQ">Karpathy’s YC Lecture</a> - “Software in the Era of AI.”</li>
  <li><a href="https://asia.nikkei.com/Business/Technology/Artificial-intelligence/Positive-review-only-Researchers-hide-AI-prompts-in-papers">Prompt Attacks on Papers</a> - Researchers injecting prompts to prevent negative AI reviews.</li>
  <li><a href="https://www.snellman.net/blog/archive/2025-06-02-llms-are-cheap/">Inference vs. Search Costs</a> - Arguing that LLM inference is becoming cheaper than traditional search.</li>
  <li><a href="https://blog.google/technology/developers/introducing-gemini-cli-open-source-ai-agent/">CLI version of Gemini</a> gets over 50k stars in github in a few weeks</li>
</ul>

<p>July made clear that productivity gains are earned through structural change, not incremental tooling.</p>

<h3 id="-august-the-bubble-panic-and-gpt-5s-arrival"><a name="aug"></a> August: The “Bubble” Panic and GPT-5’s Arrival</h3>

<p>August was a rollercoaster of sentiment. We had a wave of “AI Bubble” articles from major publications asking if we had peaked, citing failed pilots and high costs. Then, almost on cue, GPT-5 launched. While the rollout was rocky and required adjustments, the sheer user numbers (700M weekly users) and valuation ($500B) largely silenced the “dead end” narrative. It was a reminder that while the hype might fluctuate, the utility is scaling.</p>

<p><img src="/blog/images/121-8.jpeg" /></p>

<ul>
  <li>OpenAI had a momentous month, launching <a href="https://openai.com/index/introducing-gpt-5/">GPT-5</a>, though the <a href="https://arstechnica.com/information-technology/2025/08/the-gpt-5-rollout-has-been-a-big-mess/">rollout was rocky</a> and required <a href="https://x.com/OpenAI/status/1956461718097494196">adjustments based on user feedback</a>.</li>
  <li>The AI Bubble Articles - <a href="https://fortune.com/2025/08/ai-bubble-mit-study">Fortune</a>, <a href="https://www.theatlantic.com/technology/archive/2025/08/ai-mass-delusion-event/683909/">The Atlantic</a>, and <a href="https://www.newyorker.com/culture/open-questions/what-if-ai-doesnt-get-much-better-than-this">New Yorker</a> all asking if we’ve peaked.</li>
  <li><a href="https://x.com/AnthropicAI/status/1952768432027431127">Anthropic Opus 4.1</a> - Focus on safety and interpretability (persona vectors).</li>
  <li><a href="https://www.meta.com/superintelligence/">Meta Personal Superintelligence</a> - Leaning into their vision.</li>
  <li><a href="https://github.blog/changelog/2025-07-23-github-spark-in-public-preview-for-copilot-pro-subscribers/">Microsoft Github Spark</a> - Their take on “vibecoding.”</li>
  <li><a href="https://x.com/figma/status/1948399170030620870">Figma Make</a> - Vibecoding comes to design.</li>
  <li><a href="https://semianalysis.com/2025/08/12/scaling-the-memory-wall-the-rise-and-roadmap-of-hbm/">Semianalysis: Scaling Memory</a> - The roadmap of HBM.</li>
  <li><a href="https://www.arxiv.org/abs/2508.10975">Synthetic Data for Pretraining</a> - Lessons from scaling synthetic data (arXiv).</li>
  <li><a href="https://arxiv.org/abs/2503.00001">Data efficiency breakthroughs via high-fidelity labeling</a></li>
  <li><a href="https://deepmind.google/blog/genie3">Genie-style world models and simulation advances</a></li>
  <li><a href="https://blog.google/inside-google/infrastructure/ai-energy-use">Coverage of AI energy usage and efficiency trade-offs</a></li>
  <li><a href="https://www.therobotreport.com/waymo-reaches-100m-fully-autonomous-miles-across-all-deployments/">Waymo 100M Miles</a> - Autonomous driving quietly hitting massive milestones.</li>
</ul>

<p>The month surfaced real doubts about AI progress, set against a backdrop of launches that suggested the doubters were probably wrong.</p>

<h3 id="-september-nano-banana-and-scientific-breakthroughs"><a name="sep"></a> September: “Nano Banana” and Scientific Breakthroughs</h3>

<p>September was a huge month for Google, led by the viral success of the Nano Banana image editing model. But beyond the consumer hype, we saw incredible scientific work from DeepMind on fluid dynamics and a continued push for safer, more accessible AI.</p>

<p><img src="/blog/images/121-9.jpeg" /></p>

<ul>
  <li><a href="https://blog.google/products/gemini/updated-image-editing-model/">Google Nano Banana</a> - The viral image editing model that pushed the <a href="https://www.cnbc.com/2025/09/16/google-gemini-tops-apples-app-store-snagging-lead-spot-from-chatgpt.html">Gemini app to #1</a>.</li>
  <li>The discussion continues on AI’s impact on the job market: a <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5425555">Harvard study</a> suggests AI is “choking” entry-level hiring, while a <a href="https://digitaleconomy.stanford.edu/wp-content/uploads/2025/08/Canaries_BrynjolfssonChandarChen.pdf">Stanford study</a> makes similar claims.</li>
  <li>Meta is pushing further into hardware with its <a href="https://www.youtube.com/watch?v=gZ9IsB72nVk">AI-powered smart glasses</a></li>
  <li><a href="https://deepmind.google/discover/blog/discovering-new-solutions-to-century-old-problems-in-fluid-dynamics/">DeepMind Fluid Dynamics</a> - Using Physics-Informed Neural Networks to discover new mathematical edge cases in fluid motion.</li>
  <li><a href="https://venturebeat.com/ai/google-and-openais-coding-wins-at-university-competition-show-enterprise-ai">Coding Wins</a> - Google and OpenAI showing dominance at university coding competitions.</li>
  <li><a href="https://blog.google/products/chrome/new-ai-features-for-chrome/">Gemini in Chrome</a> - Embedding models directly into the browser for billions of users.</li>
  <li><a href="https://deepmind.google/discover/blog/strengthening-our-frontier-safety-framework/">Frontier Safety Framework</a> - Google DeepMind’s third iteration of their safety framework.</li>
  <li><a href="https://cloud.google.com/blog/products/ai-machine-learning/announcing-agents-to-payments-ap2-protocol">Agent Payments Protocol (AP2)</a> - An open protocol for secure agent-led payments.</li>
  <li><a href="https://time.com/collections/time100-ai-2025/">Time’s 2025 AI 100</a> - The annual list of influential AI leaders.</li>
  <li>Anthropic announced a massive <a href="https://www.anthropic.com/news/anthropic-raises-series-f-at-usd183b-post-money-valuation">new funding round ($13B at a $183B valuation)</a> as they scale to over 300k enterprise customers.</li>
  <li>xAI released <a href="https://x.ai/news/grok-4-fast">Grok 4 Fast</a> making a claim to be at the frontier of cost-efficient intelligence.</li>
  <li><a href="https://blog.replit.com/agent3">Replit Agent 3 and multi-hour autonomy</a></li>
</ul>

<p>September signaled a new baseline: autonomy and reliability were becoming table stakes.</p>

<h3 id="-october-economic-value-over-benchmarks"><a name="oct"></a> October: Economic Value over Benchmarks</h3>

<p>The focus in October was heavily on the new possibilities of AI video creation and proving the economic value of AI. The GDP-val paper was a standout, proposing that we measure frontier models not by academic exams, but by their ability to perform “economically valuable” tasks.</p>

<p><img src="/blog/images/121-10.jpeg" /></p>

<ul>
  <li><a href="https://openai.com/devday/">OpenAI Dev Day</a>- Announcing the Atlas browser, Instant Checkout, and <a href="https://openai.com/index/sora-2/">Sora 2</a>, their new video model and a dedicated app for short-form, AI-generated videos.</li>
  <li><a href="https://www.anthropic.com/news/claude-sonnet-4-5">Anthropic Sonnet 4.5</a> - And <a href="https://www.testingcatalog.com/anthropic-experiments-with-an-agent-for-gereating-ui-on-the-fly/">“Imagine”</a> for real-time UI generation.</li>
  <li><a href="https://www.testingcatalog.com/meta-introduces-vibes-feed-for-ai-generated-content/">Meta Vibes</a> - Feed for short-form AI video.</li>
  <li><a href="https://cdn.openai.com/pdf/d5eb7428-c4e9-4a33-bd86-86dd4bcf12ce/GDPval.pdf">GDP-val</a> - Finding frontier models rival humans on economically valuable tasks.</li>
  <li><a href="https://arxiv.org/abs/2510.12049">GenAI and Firm Productivity</a> - Measuring real-world impact in retail.</li>
  <li><a href="https://finance.yahoo.com/news/without-data-centers-gdp-growth-171546326.html">US economic dependence on data-center buildout</a></li>
  <li><a href="https://github.com/deepseek-ai/DeepSeek-V3.2-Exp/blob/main/DeepSeek_V3_2.pdf">DeepSeek Sparse Attention</a> - New research from the efficiency kings.</li>
  <li><a href="https://github.com/karpathy/nanochat">Karpathy’s Nanochat</a> - “The best ChatGPT that $100 can buy.”</li>
  <li><a href="https://thinkingmachines.ai/blog/announcing-tinker/">Tinker</a> - Thinking Machines’ tool for model fine-tuning.</li>
  <li><a href="https://cloud.google.com/blog/products/ai-machine-learning/introducing-gemini-enterprise">Gemini Enterprise</a> - Major steps for cloud customers.</li>
  <li><a href="https://arxiv.org/abs/2501.20012">Sparse attention and efficiency work from DeepSeek</a></li>
  <li><a href="">A Bloomberg feature detailed the web of circular deals</a> among AI companies</li>
  <li>Interesting discussions on the Dwarkesh podcast. While RL pioneer Richard Sutton argued that <a href="https://www.youtube.com/watch?v=21EYKqUsPfg">LLMs are a dead end</a>, Andrej Karpathy presented a <a href="https://www.youtube.com/watch?v=lXUZvyajciY">contrasting perspective</a></li>
</ul>

<p>Progress increasingly began to be evaluated economically, not just technically.</p>

<h3 id="-november-the-new-sota-battleground"><a name="nov"></a> November: The New SOTA Battleground</h3>

<p>November brought the year’s biggest shakeup: the launch of Gemini 3. Google’s latest model, accompanied by the new “Deep Think” reasoning mode and the “Google Antigravity” agentic platform, immediately topped the charts. Just days later, Anthropic countered with Claude Opus 4.5, marketed as the ultimate coding model with massive improvements in agentic workflows. The market share data reflects this shift—ChatGPT is no longer the default for everyone.</p>

<p><img src="/blog/images/121-11.jpeg" /></p>

<ul>
  <li><a href="https://deepmind.google/models/gemini/">Google Gemini 3</a> - A huge shakeup with enhanced reasoning and agentic capabilities.</li>
  <li><a href="https://blog.google/technology/ai/nano-banana-pro/">Nano Banana Pro</a> - Building on the viral success of the original, this version pushed image editing even further.</li>
  <li><a href="https://www.anthropic.com/news/claude-opus-4-5">Claude Opus 4.5</a> - A SOTA coding model that reportedly scores 74.5% on SWE-bench Verified.</li>
  <li><a href="https://x.com/Similarweb/status/1998343712791777751">ChatGPT losing market share</a> - Data from Similarweb showing a clear trend towards Gemini.</li>
  <li><a href="https://x.com/Similarweb/status/1995792272785310186">Similarweb Analysis</a> - Further confirmation of the changing landscape.
Pie</li>
  <li><a href="https://arxiv.org/abs/2512.04797">SIMA 2 embodied agent research</a></li>
</ul>

<p><img src="/blog/images/121-10b.png" /></p>

<p>November felt like a visible inflection point in both capability and market momentum. SOTA leadership, once measured in years, was now clearly measured in weeks.</p>

<h3 id="-december-code-red-and-the-grand-finale"><a name="dec"></a> December: Code Red and the Grand Finale</h3>

<p>The year ended with high drama. Feeling the heat from Gemini 3 and Opus 4.5, OpenAI declared a “Code Red,” reminiscent of Google’s own similar move back in 2022. This urgency birthed GPT-5.2, a rapid iteration designed to reclaim the throne, alongside new features like ChatGPT Images. Meanwhile, at NeurIPS 2025 in San Diego, the buzz was all about embodied agents, with DeepMind unveiling Sima 2, a generalist agent for 3D worlds that feels like a real step towards general purpose robotics.</p>

<p><img src="/blog/images/121-12.jpeg" /></p>

<ul>
  <li><a href="https://www.wsj.com/tech/ai/openais-altman-declares-code-red-to-improve-chatgpt-as-google-threatens-ai-lead-7faf5ea6">OpenAI Code Red</a> - Sam Altman rallying the troops as competition heats up.</li>
  <li><a href="https://openai.com/index/introducing-gpt-5-2/">GPT 5.2 Launch</a> - OpenAI’s rapid response to the shifting benchmarks.</li>
  <li><a href="https://openai.com/index/new-chatgpt-images-is-here/">ChatGPT Images</a> - Bringing native image capabilities to the forefront.</li>
  <li><a href="https://blog.google/products/gemini/gemini-3-flash/">Gemini 3 Flash</a> - A speed-optimized beast that still manages ~78% on SWE-bench Verified.</li>
  <li><a href="https://openai.com/index/shipping-sora-for-android-with-codex/">Shipping Sora for Android</a> - A fascinating look at how OpenAI used Codex to build their own app in just 28 days.</li>
  <li><a href="https://arxiv.org/abs/2512.04797">DeepMind Sima 2</a> - A generalist embodied agent for 3D worlds, unveiled just in time for the conference season.</li>
  <li><a href="https://neurips.cc">NeurIPS 2025 conference and proceedings</a>. Also, interestingly, several VCs are now giving good summaries of the conference. * * <a href="https://www.amplifypartners.com/blog-posts/neurips-2025-recap">Here</a> is the one by Amplify Partners and <a href="https://radical.vc/highlights-from-neurips-2025/">here</a> the one by Radical Ventures.</li>
</ul>

<p>The year closed with a clear signal: fast iteration now coexists with renewed investment in long‑horizon research.</p>

<h3 id="-what-i-got-wrong-in-2025"><a name="wrong"></a> What I Got Wrong in 2025</h3>

<p>Looking back at my own predictions (and anxieties) from the start of the year, a few things stand out:</p>

<ul>
  <li><strong>The Bubble That Wasn’t</strong>: In August, amidst the “Peak AI” narrative, I worried we were heading for a winter. I was wrong. The utility of these models in coding and enterprise workflows has created a floor for value that is much higher than I anticipated.</li>
  <li><strong>Agents are Harder than We Thought</strong>: I expected autonomous agents to be “solved” by mid-year. Instead, we found that reliability at scale is an immense challenge. The “Agentic Moore’s Law” is real, but the slope is shallower than I hoped.</li>
  <li><strong>The Persistence of Open Weights</strong>: I feared the gap between closed and open models would widen to a chasm. Instead, thanks to DeepSeek, Mistral, and Meta, the open ecosystem is arguably healthier than ever, keeping the giants honest on price.</li>
</ul>

<h3 id="-bonus-a-deeper-dive"><a name="bonus"></a> Bonus: A Deeper Dive</h3>

<p>Before closing out the year, I sat down for an in-depth, <a href="https://youtu.be/qgHLZuZ7mmM?si=r8MoCwK1q2Ygsoos">2-hour conversation with Jon Hernandez on his “Inteligencia Artificial” podcast</a>. We covered everything from my transition out of Google and the internal dynamics of big labs, to why I believe “AGI” is a distracting term and why we should focus on specialized agents instead. If you want the unfiltered, “director’s cut” version of my take on 2025 and beyond, this is it.</p>

<h3 id="-conclusion"><a name="conclusion"></a> Conclusion</h3>

<p>As I close out my recap of 2025, I am struck by how much the narrative has changed. We are no longer just awed by the fluency of LLMs; we are now demanding fidelity, reasoning, and autonomy. The battles of November and December proved that no lead is safe, and “SOTA” is a title you hold for weeks, not months.</p>

<p>Leaving Google this year has given me a fresh perspective on this ecosystem. The rate of change is dizzying, but it is also exhilarating. As we head into 2026, I am more convinced than ever that we are just scratching the surface of what is possible when we combine powerful reasoning models with verifiable rewards and agentic workflows. I look forward to exploring all of this in the incredible space of travel from my new position as Chief AI and Data Officer at Expedia Group.</p>

<p>Here’s to a 2026 full of verified rewards and fewer hallucinations. Happy New Year!</p>]]></content><author><name>Xavier</name></author><category term="Artificial Intelligence" /><category term="Personal" /><summary type="html"><![CDATA[A year of reasoning, agents, and compressed innovation cycles]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://amatria.in/blog/blog/images/121-0.jpeg" /><media:content medium="image" url="https://amatria.in/blog/blog/images/121-0.jpeg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">The post-training journey of modern LLMs revisited</title><link href="https://amatria.in/blog/postpretraining-revisted" rel="alternate" type="text/html" title="The post-training journey of modern LLMs revisited" /><published>2025-09-06T00:00:01+00:00</published><updated>2025-09-06T00:00:01+00:00</updated><id>https://amatria.in/blog/postpretraining-revisited</id><content type="html" xml:base="https://amatria.in/blog/postpretraining-revisted"><![CDATA[<p>(This blog post, as most of my recent ones, is written with AI assistance and augmentation. First time using nano banana for the infographics!)</p>

<p>In a previous post, we delved into <a href="https://amatria.in/blog/postpretraining">“Beyond Token Prediction: the post-Pretraining journey of modern LLMs”</a> to explore the multifaceted post-pretraining life of modern LLMs. A key takeaway from that discussion was that modern LLMs have evolved far beyond simple next-token prediction, a point that becomes even more critical as we venture into the realm of reasoning models. We touched upon how techniques like Reinforcement Learning from Human Feedback (RLHF) have been pivotal in aligning these models with human preferences. But the world of AI moves at a dizzying pace, and the conversation is already shifting. While RLHF has been a cornerstone, the new rave is all about <strong>Reinforcement Learning with Verifiable Rewards (RLVR)</strong>, a term introduced in <a href="https://arxiv.org/abs/2411.15124">this recent paper</a>. This isn’t just an incremental update; it’s a paradigm shift that could be the key to unlocking true reasoning in our models.</p>

<h3 id="from-human-preference-to-verifiable-truth">From Human Preference to Verifiable Truth</h3>

<p>So, what exactly is RLVR, and how does it differ from the RLHF we’ve grown accustomed to? RLHF, in essence, is about teaching a model to be more “human-like.” We show it two responses, a human indicates which one is “better,” and the model learns to produce outputs that are more likely to be preferred. It’s a powerful technique for improving style, tone, and safety. However, it’s also inherently subjective. What one person prefers, another might not. And more importantly, a preferred answer isn’t always the <em>correct</em> answer, especially when it comes to complex reasoning tasks.</p>

<p>This is where RLVR comes in. As the name suggests, RLVR is a flavor of reinforcement learning where the reward is based on a verifiable, objective metric. Instead of asking “which answer do you like more?”, we ask “is this answer demonstrably correct?”. The reward is no longer a matter of opinion but of fact. For example, if we’re training a model to solve a math problem, the reward can be based on whether the final answer is correct. If we’re teaching it to code, the reward can be tied to whether the code compiles and passes a set of unit tests.</p>

<p><img src="/blog/images/120-0.png" /></p>

<p>The folks at <a href="https://fireworks.ai/blog/reinforcement-learning-with-verifiable-reward">Fireworks.ai</a> have been doing some fantastic work in this area, and their blog post on the topic is a must-read. They highlight that RLVR is particularly well-suited for tasks where the “goodness” of an output can be programmatically determined. This shift from subjective human feedback to objective, verifiable rewards is a subtle but profound one. It’s the difference between a model that’s a good conversationalist and one that’s a reliable problem-solver.</p>

<h3 id="diving-deeper-reward-functions-learned-models-and-policy-optimization">Diving Deeper: Reward Functions, Learned Models, and Policy Optimization</h3>

<p>To appreciate the mechanics of RLVR, it helps to distinguish between a <em>programmatic reward function</em> and a <em>learned reward model</em>. RLVR’s power stems from its use of a programmatic reward function—essentially, a piece of code that deterministically scores an output. For example: <code class="language-plaintext highlighter-rouge">if unit_tests_pass(): return 1.0 else: return 0.0</code>. This is transparent, objective, and verifiable. In contrast, traditional RLHF uses a learned reward model, which is a separate neural network trained on human preference data to <em>predict</em> what score a human would give. This model is an approximation of human values and can have its own biases or be gamed.</p>
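
<p>To make this a bit more concrete, here is a minimal sketch of what a programmatic reward function could look like for a math task. This is illustrative only: the <code class="language-plaintext highlighter-rouge">extract_final_answer</code> helper is a hypothetical stand-in for whatever answer-parsing your task needs, and real implementations are usually far more robust. The key property is that the score is computed by code, not predicted by a model trained on preferences.</p>

<pre><code class="language-python">
# A minimal sketch of a programmatic (verifiable) reward function for a
# math task. `extract_final_answer` is a hypothetical helper that parses
# the model's completion; the reward is a deterministic check, not a
# learned preference model.

import re

def extract_final_answer(completion: str) -> str:
    """Hypothetical parser: grab the last number in the completion."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return matches[-1] if matches else ""

def math_reward(completion: str, gold_answer: str) -> float:
    """Return 1.0 if the parsed final answer matches the reference, else 0.0."""
    return 1.0 if extract_final_answer(completion) == gold_answer else 0.0

# Example: a correct completion earns the full reward.
print(math_reward("Adding 17 and 25 gives 42.", "42"))  # 1.0
</code></pre>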

<p>Once you have a method for scoring responses, you need an algorithm to update the LLM’s policy (the “actor”). The workhorse here is <strong>PPO (Proximal Policy Optimization)</strong>. It’s important to clarify a potential point of confusion: while RLVR avoids the <em>learned reward model</em> of RLHF, PPO itself, as an actor-critic method, often learns an auxiliary <strong>Value Network</strong> (the “critic”). This network doesn’t judge preference; instead, it estimates the expected future reward from a given state (i.e., the sequence of tokens generated so far). By comparing the actual reward to the critic’s prediction, PPO calculates the “Advantage”—a more stable signal for how good an action was. PPO uses this Advantage to update the actor in small, stable steps, maximizing the reward while ensuring the updated model doesn’t stray too far from its original state. This is a crucial safeguard against a common pitfall in RL known as <strong>reward hacking</strong>, where the model exploits an unforeseen loophole in the reward function to get a high score without achieving the intended goal. As detailed in a great <a href="https://semianalysis.com/2025/06/08/scaling-reinforcement-learning-environments-reward-hacking-agents-scaling-data/">SemiAnalysis article</a>, this is a major challenge. The constraints in PPO help prevent this kind of undesirable optimization, making it a foundational technique for both RLHF and RLVR.</p>

<p><img src="/blog/images/120-1.png" /></p>

<p>The field is moving fast, and simpler, more stable alternatives to PPO are emerging. A prominent example, brought to light by the team behind the <a href="https://arxiv.org/html/2501.12948">Deepseek R1</a> model (as detailed in <a href="https://huggingface.co/blog/NormalUhr/grpo">this Hugging Face post</a>), is <strong>GRPO (Group Relative Policy Optimization)</strong>. Unlike PPO, which evaluates responses individually, GRPO operates on a group of candidate responses for a given prompt. A key difference and advantage is its simplicity, as GRPO does not need to learn an auxiliary value function. For each prompt, it samples multiple outputs, calculates the average reward for this group, and then updates the policy based on the <em>relative</em> performance of each sample. The objective is to encourage responses that score above the group average and discourage those that fall below. This approach of using a group’s average performance as a dynamic baseline provides a more stable and robust training signal, reducing the variance that can make PPO tricky to tune. It’s particularly effective for complex reasoning tasks where a clear “winner” is less important than consistently moving towards better-than-average solutions. For those interested in a hands-on implementation, the team at Lightning AI has a great post on <a href="https://lightning.ai/lightning-purchase-test/studios/build-a-reasoning-llm-from-scratch-using-grpo?section=featured">building a reasoning LLM from scratch using GRPO</a>.</p>
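
<p>To see how this differs in code, here is a minimal sketch of the group-relative advantage computation that gives GRPO its name. It assumes each sampled completion has already been scored by a verifiable reward, and the exact normalization details vary across implementations, so read it as an illustration rather than a faithful reproduction of the DeepSeek recipe.</p>

<pre><code class="language-python">
# Minimal sketch of GRPO-style group-relative advantages: sample several
# completions for the same prompt, score each with a verifiable reward,
# and use the group statistics as the baseline (no learned value network).

from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the group's mean and standard deviation."""
    baseline = mean(rewards)
    spread = pstdev(rewards) + eps
    return [(r - baseline) / spread for r in rewards]

# Example: four sampled answers to one prompt, two of which pass the checker.
rewards = [1.0, 0.0, 1.0, 0.0]
print(group_relative_advantages(rewards))
# Above-average samples get positive advantages, below-average get negative.
</code></pre>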

<h3 id="the-key-to-unlocking-reasoning">The Key to Unlocking Reasoning</h3>

<p>Our ultimate goal is to teach models to <em>reason</em>, not just to mimic patterns of text that have been positively reinforced by humans. When a model is rewarded not just for the final answer, but for the steps it takes to get there, it learns a process. This idea was formalized into a technique known as <strong>Process Supervision</strong>, where a reward is provided for each correct step in a model’s reasoning chain, leading to the development of Process-based Reward Models (PRMs).</p>

<p>As detailed by <a href="https://openai.com/index/improving-mathematical-reasoning-with-process-supervision/">OpenAI in their work on improving mathematical reasoning</a>, instead of only rewarding the final answer (outcome supervision), Process Supervision rewards the model for each correct intermediate step. You can find more technical details in the <a href="https://arxiv.org/abs/2305.20050">original research paper</a>. This is where things get really exciting. Imagine a model that can show its work, and we can verify and reward each step of that work. This not only makes the model’s reasoning process more transparent but also allows us to pinpoint exactly where it goes wrong and provide targeted feedback. This is a much more powerful and scalable approach than simply telling the model “your answer is wrong” and hoping it figures out why.</p>
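
<p>A toy sketch may help contrast the two forms of supervision. The <code class="language-plaintext highlighter-rouge">verify_step</code> function below is purely a placeholder for a PRM or a programmatic step checker; the point is only that process supervision produces one signal per reasoning step instead of a single signal for the final answer.</p>

<pre><code class="language-python">
# Illustrative contrast between outcome supervision and process supervision.
# `verify_step` stands in for a process reward model (PRM) or programmatic
# step checker; here it is a toy placeholder.

def verify_step(step: str) -> float:
    """Toy step verifier: reward steps that are non-empty and end in a claim."""
    return 1.0 if step.strip().endswith(".") else 0.0

def outcome_reward(final_answer: str, gold: str) -> float:
    """Outcome supervision: a single reward for the final answer only."""
    return 1.0 if final_answer == gold else 0.0

def process_rewards(reasoning_steps):
    """Process supervision: one reward per intermediate reasoning step."""
    return [verify_step(step) for step in reasoning_steps]

steps = ["17 + 20 = 37.", "37 + 5 = 42.", "So the answer is 42."]
print(outcome_reward("42", "42"))   # one scalar signal for the whole chain
print(process_rewards(steps))       # feedback on every step
</code></pre>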

<p>However, it’s crucial to draw a distinction here. While PRM-based training looked like the primary path forward a year ago, many of the biggest recent wins have come from simpler forms of RLVR—often just binary, outcome-based checkers combined with massive sampling and efficient algorithms like GRPO (see Nathan’s comment in <a href="https://www.interconnects.ai/p/what-comes-next-with-reinforcement">Interconnects.ai blog post, “What Comes Next with Reinforcement,”</a> ). This doesn’t mean PRMs have disappeared; rather, their role has evolved. They are now shifting to two key areas: as a crucial <strong>verifier at inference time</strong> to check the model’s reasoning, and for <strong>targeted training</strong> where step-level fidelity and interpretability are paramount. So, while still highly relevant, PRMs are becoming a specialized tool within the broader and more versatile RLVR toolkit.</p>

<p>Beyond just improving reasoning on static problems, RL is also a foundational component for training autonomous <strong>agents</strong> that can interact with environments and learn from the consequences of their actions. The shift is so significant that, as detailed in a recent piece by <a href="https://semianalysis.com/2025/06/08/scaling-reinforcement-learning-environments-reward-hacking-agents-scaling-data/">SemiAnalysis</a>, the growing importance of RL is fundamentally changing the structure of AI research labs. As <a href="https://www.youtube.com/watch?v=JIsgyk0Paic">this insightful video on the future of AI agents explains</a>, this sets the stage for <strong>continuous learning</strong>, where models can adapt and improve over time without constant, large-scale retraining.</p>

<h3 id="test-time-compute-getting-more-from-models-at-inference">Test-Time Compute: Getting More from Models at Inference</h3>

<p>A model trained with RLVR provides a strong reasoning foundation, but its performance can be significantly amplified at inference through a strategy known as <strong>test-time compute</strong>. This refers to the computational effort a model expends <em>when actively working on a prompt</em> to arrive at a final answer. Instead of generating a single, immediate response, we can have the model engage in a more deliberative, multi-path reasoning process.</p>

<p>At inference time, we can scale up this compute by using techniques like <a href="https://arxiv.org/abs/2203.11171v4"><strong>Self-Consistency</strong></a> (sampling multiple reasoning paths and taking a majority vote on the final answer) or <a href="https://arxiv.org/abs/2305.10601v1"><strong>Tree-of-Thoughts</strong></a> (actively exploring a tree of possible reasoning steps). This allows the model to explore a wider solution space and self-correct. The final, crucial step is to use a <strong>verifier</strong>—which can be a simple programmatic check, a unit test, or even a PRM acting as a reranker—to select the best and most reliable answer from the many candidates generated. This purely inference-time strategy leverages the strong base model from RLVR training to achieve state-of-the-art accuracy and robustness on complex tasks.</p>
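
<p>As a heavily simplified sketch, self-consistency with a verifier boils down to: sample several candidates, keep the ones a checker accepts, and return the most common survivor. The <code class="language-plaintext highlighter-rouge">sample_answer</code> and <code class="language-plaintext highlighter-rouge">verify</code> callables below are placeholders for an actual model call and an actual programmatic check.</p>

<pre><code class="language-python">
# Heavily simplified sketch of self-consistency with a verifier at inference
# time: sample several candidate answers, keep the ones a verifier accepts,
# and return the most common survivor. `sample_answer` and `verify` are
# placeholders for a real model call and a real programmatic checker.

from collections import Counter
import random

def self_consistent_answer(sample_answer, verify, n_samples=8):
    candidates = [sample_answer() for _ in range(n_samples)]
    verified = [c for c in candidates if verify(c)] or candidates
    answer, _count = Counter(verified).most_common(1)[0]
    return answer

# Example with toy stand-ins: a noisy sampler and a trivial verifier.
toy_sampler = lambda: random.choice(["42", "42", "42", "41"])
toy_verifier = lambda ans: ans.isdigit()
print(self_consistent_answer(toy_sampler, toy_verifier))
</code></pre>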

<h3 id="karpathys-corner-the-bull-case-and-a-word-of-caution">Karpathy’s Corner: The Bull Case and a Word of Caution</h3>

<p>No discussion of the future of AI would be complete without mentioning Andrej Karpathy. His insights are always a valuable addition to the conversation. In a <a href="https://x.com/karpathy/status/1944435412489171119">recent tweet</a>, he laid out his bull case for Reinforcement Learning, and it’s easy to see why he’s optimistic. The ability to fine-tune models based on specific, measurable outcomes is a powerful tool, and RLVR is a prime example of this.</p>

<p>However, Karpathy is also a pragmatist. In <a href="https://x.com/karpathy/status/1960803117689397543">another tweet</a>, he raised a crucial question about the scalability of reward functions. He expressed some doubt that we can design reward functions that can scale all the way to AGI. And he has a point. While it’s relatively straightforward to design a verifiable reward for a math problem or a coding challenge, what’s the verifiable reward for writing a beautiful poem or a compelling story? How do we create a reward function for “common sense”?</p>

<p>This is the central challenge we face. As we push our models to tackle more complex and nuanced tasks, the line between verifiable and subjective rewards will inevitably blur. But that doesn’t mean we should abandon the pursuit. The progress we’re seeing with RLVR in domains like math, science, and coding is a testament to its potential. It might not be the silver bullet that gets us all the way to AGI, but it’s a massive step in the right direction. It’s a step towards models that don’t just talk the talk but can actually walk the walk of reason. And that, in itself, is a revolution.</p>

<h3 id="references">References</h3>

<ul>
  <li>Amatriain, X. (2024). <em>Beyond Token Prediction: the post-Pretraining journey of modern LLMs</em>. <a href="https://amatria.in/blog/postpretraining">https://amatria.in/blog/postpretraining</a></li>
  <li>Minaee, S., et al. (2024). <em>Large language models: A survey</em>. <a href="https://arxiv.org/abs/2402.06196">https://arxiv.org/abs/2402.06196</a></li>
  <li>Lambert, N., et al. (2024). <em>Tülu 3: Pushing Frontiers in Open Language Model Post-Training</em> (introducing Reinforcement Learning with Verifiable Rewards). <a href="https://arxiv.org/abs/2411.15124">https://arxiv.org/abs/2411.15124</a></li>
  <li>Fireworks.ai. <em>Reinforcement Learning with Verifiable Reward</em>. <a href="https://fireworks.ai/blog/reinforcement-learning-with-verifiable-reward">https://fireworks.ai/blog/reinforcement-learning-with-verifiable-reward</a></li>
  <li>SemiAnalysis. (2025). <em>Scaling Reinforcement Learning: Environments, Reward Hacking, Agents, &amp; Scaling Data</em>. <a href="https://semianalysis.com/2025/06/08/scaling-reinforcement-learning-environments-reward-hacking-agents-scaling-data/">https://semianalysis.com/2025/06/08/scaling-reinforcement-learning-environments-reward-hacking-agents-scaling-data/</a></li>
  <li>DeepSeek-AI. (2025). <em>DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning</em>. <a href="https://arxiv.org/html/2501.12948">https://arxiv.org/html/2501.12948</a></li>
  <li>Uhr, N. (2025). <em>Implementing Deepseek’s GRPO from scratch</em>. Hugging Face Blog. <a href="https://huggingface.co/blog/NormalUhr/grpo">https://huggingface.co/blog/NormalUhr/grpo</a></li>
  <li>Lightning AI. <em>Build a reasoning LLM from scratch using GRPO</em>. <a href="https://lightning.ai/lightning-purchase-test/studios/build-a-reasoning-llm-from-scratch-using-grpo?section=featured">https://lightning.ai/lightning-purchase-test/studios/build-a-reasoning-llm-from-scratch-using-grpo?section=featured</a></li>
  <li>OpenAI. (2023). <em>Improving mathematical reasoning with process supervision</em>. <a href="https://openai.com/index/improving-mathematical-reasoning-with-process-supervision/">https://openai.com/index/improving-mathematical-reasoning-with-process-supervision/</a></li>
  <li>Lightman, H., et al. (2023). <em>Let’s Verify Step by Step</em>. <a href="https://arxiv.org/abs/2305.20050">https://arxiv.org/abs/2305.20050</a></li>
  <li>Lambert, N. (2025). <em>What Comes Next with Reinforcement Learning</em>. Interconnects.ai. <a href="https://www.interconnects.ai/p/what-comes-next-with-reinforcement">https://www.interconnects.ai/p/what-comes-next-with-reinforcement</a></li>
  <li>The Future of AI Agents. (Video). <a href="https://www.youtube.com/watch?v=JIsgyk0Paic">https://www.youtube.com/watch?v=JIsgyk0Paic</a></li>
  <li>Wang, X., et al. (2022). <em>Self-Consistency Improves Chain of Thought Reasoning in Language Models</em>. <a href="https://arxiv.org/abs/2203.11171v4">https://arxiv.org/abs/2203.11171v4</a></li>
  <li>Yao, S., et al. (2023). <em>Tree of Thoughts: Deliberate Problem Solving with Large Language Models</em>. <a href="https://arxiv.org/abs/2305.10601v1">https://arxiv.org/abs/2305.10601v1</a></li>
  <li>Karpathy, A. (2025). Twitter Post on RL Bull Case. <a href="https://x.com/karpathy/status/1944435412489171119">https://x.com/karpathy/status/1944435412489171119</a></li>
  <li>Karpathy, A. (2025). Twitter Post on Reward Function Scalability. <a href="https://x.com/karpathy/status/1960803117689397543">https://x.com/karpathy/status/1960803117689397543</a></li>
</ul>]]></content><author><name>Xavier</name></author><category term="Artificial Intelligence" /><category term="LLMs" /><summary type="html"><![CDATA[(This blog post, as most of my recent ones, is written with AI assistance and augmentation. First time using nano banana for the infographics!)]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://amatria.in/blog/blog/images/120-0.png" /><media:content medium="image" url="https://amatria.in/blog/blog/images/120-0.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">The AI Co-Developer 18 Months Later</title><link href="https://amatria.in/blog/ai-code-refactor" rel="alternate" type="text/html" title="The AI Co-Developer 18 Months Later" /><published>2025-07-06T00:00:01+00:00</published><updated>2025-07-06T00:00:01+00:00</updated><id>https://amatria.in/blog/ai-code-refactor</id><content type="html" xml:base="https://amatria.in/blog/ai-code-refactor"><![CDATA[<p>Eighteen months ago, I wrote a post on how Large Language Models would change software development, sharing my early experiences with the technology. You can read those initial thoughts <a href="https://amatriain.net/blog/aidevelopment">here</a>. A lot of things have happened since then, and my early observations are by now way outdated. That is why I decided to spend some time during a recent break to work on my side project and see how much better things are now.</p>

<p><img src="/blog/images/119-0.png" /></p>

<p>This time, my primary tool was <a href="https://cursor.sh/">Cursor</a>, an AI-first code editor, which I used almost exclusively in its agentic mode. I experimented with various backend models, including Gemini 2.5 Pro and Claude 4 Sonnet, though I often found myself defaulting to the “Auto” setting. I also incorporated <a href="https://jules.google.com/">Jules</a>, Google’s own software development agent, into my workflow.</p>

<p>Instead of starting a project from scratch, I decided to undertake a full refactoring of my old project, <a href="https://github.com/xamat/Xavibot">Xavibot</a>. The objective was to test the mettle of these AI agents on a relatively complex codebase while attempting substantial changes—a scenario often trickier than a clean-slate build.</p>

<p>The initial scope of the refactoring included migrating the chatbot’s backend from Azure to Google Cloud and transitioning the AI model from OpenAI’s GPT to Google’s Gemini (something I had been thinking about doing since my transition from LinkedIn/MSFT to Google for obvious reasons). This move wasn’t a simple swap. My original implementation was using the <a href="https://platform.openai.com/docs/assistants/overview">OpenAI Assistants API</a>, which conveniently manages file uploading, vector databases, and conversation memory. For Gemini, I would need to implement this functionality myself.</p>
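
<p>For context, this is roughly the plumbing the Assistants API was abstracting away. The sketch below is a hypothetical, stripped-down version in Python (the actual project is a Node.js backend), with <code>embed</code> and <code>generate</code> as placeholders for whatever model client gets wired in:</p>

<pre><code class="language-python">
import numpy as np

class SimpleRagMemory:
    """Toy stand-in for what the Assistants API provided out of the box:
    a vector store over uploaded documents plus rolling conversation memory.
    `embed` and `generate` are placeholder callables, not real SDK methods."""

    def __init__(self, embed, generate, max_turns=20):
        self.embed, self.generate = embed, generate
        self.doc_texts, self.doc_vectors = [], []
        self.history = []          # rolling conversation memory
        self.max_turns = max_turns

    def add_document(self, text):
        self.doc_texts.append(text)
        self.doc_vectors.append(self.embed(text))

    def ask(self, question, k=3):
        # Retrieve the k most similar chunks by cosine similarity.
        q = self.embed(question)
        sims = [float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
                for v in self.doc_vectors]
        top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
        context = "\n".join(self.doc_texts[i] for i in top)

        prompt = f"Context:\n{context}\n\nHistory:\n{self.history}\n\nQuestion: {question}"
        answer = self.generate(prompt)
        self.history = (self.history + [(question, answer)])[-self.max_turns:]
        return answer
</code></pre>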

<p>After the initial migration, I decided to complicate things a bit more and keep both the Gemini and GPT models, giving users the ability to dynamically switch between them. This is a neat feature not commonly seen in user-facing chat applications, and I was curious to see how hard it would be to implement.</p>
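
<p>Mechanically, the switching mostly comes down to hiding both backends behind one small interface and routing each message based on the user’s current choice. Here is a hypothetical Python sketch (the real backend is Node.js, and the client calls are placeholders, not actual SDK methods):</p>

<pre><code class="language-python">
class ChatProvider:
    """Minimal interface that both model backends implement."""
    def reply(self, history, user_message):
        raise NotImplementedError

class GeminiProvider(ChatProvider):
    def __init__(self, client):
        self.client = client  # placeholder for a Gemini client object
    def reply(self, history, user_message):
        return self.client.generate(history, user_message)

class GptProvider(ChatProvider):
    def __init__(self, client):
        self.client = client  # placeholder for an OpenAI client object
    def reply(self, history, user_message):
        return self.client.complete(history, user_message)

def handle_message(history, user_message, model_choice, providers):
    # The user can flip model_choice mid-conversation: only the provider
    # changes, while the shared history goes to whichever model is active.
    reply = providers[model_choice].reply(history, user_message)
    history.append((user_message, reply))
    return reply
</code></pre>

<p>The nice property of this shape is that adding a third provider is one more class, not a rewrite.</p>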

<p>The entire refactoring process took about three days, totaling around 15-20 hours of focused work. As is often the case, the last 10% of the changes consumed 90% of the time. Deploying to Google Cloud and implementing the dynamic model switching proved to be the most time-intensive parts of the project.</p>

<h3 id="what-went-well">What Went Well</h3>

<p>The experience was, for the most part, very positive. I was particularly impressed by:</p>

<ul>
  <li><strong>Holistic Code Refactoring:</strong> The agents demonstrated a remarkable ability to propose and implement changes across numerous files simultaneously when prompted with a high-level refactoring goal.</li>
  <li><strong>Intelligent Debugging:</strong> Having the Cursor agent read through logs to identify errors and suggest fixes was immensely helpful. It streamlined a process that is often tedious and time-consuming.</li>
</ul>

<h3 id="what-didnt-go-so-well">What Didn’t Go So Well</h3>

<p>Despite the significant strides in AI-assisted development, there were several areas where the models still fell short:</p>

<ul>
  <li><strong>Knowledge Gaps:</strong> The models often lacked up-to-date information about API features. For instance, I had to explicitly inform the models that the Gemini API included an explicit caching option.</li>
  <li><strong>API Versioning Issues:</strong> The AI models have a hard time dealing with APIs that have different versions. On several occasions, I had to read the documentation myself and point the LLM to the correct URL. I anticipate that maintaining updated, machine-readable documentation for coding agents (maybe brokered through an MCP server) will be an important use case soon.</li>
  <li><strong>Context Loss:</strong> During longer refactoring sessions, the models would lose context, and I found myself repeatedly reminding them of previously discussed details.</li>
  <li><strong>Tendency to Overcomplicate:</strong> Models still default to overcomplicated solutions. Thanks to my experience, I was able to sidestep most of these proposals, but I can see how this could lead to convoluted and unnecessary code for less-seasoned developers. Even in my case, I had to prompt the models to clean up and delete unused code after we got to a working solution.</li>
</ul>

<h3 id="the-verdict">The Verdict</h3>

<p>All that said, this is a huge improvement in developer experience in only 18 months, and I was able to do a lot of work in just a few days. For my next iteration, I might refactor the Node.js backend to Python, since I have no reason to maintain the Node.js approach and Python APIs are usually updated sooner.</p>

<p>I invite you to try out the new and improved <a href="https://amatriain.net/Xavibot/">Xavibot</a> directly here in this blog or visiting the URL directly. Let me know your thoughts. Any other suggestions on what other features to add? Again, code is available <a href="https://github.com/xamat/Xavibot">here</a></p>]]></content><author><name>Xavier</name></author><category term="Artificial Intelligence" /><category term="LLMs" /><category term="software engineering" /><summary type="html"><![CDATA[Eighteen months ago, I wrote a post on how Large Language Models would change software development, sharing my early experiences with the technology. You can read those initial thoughts here. A lot of things have happened since then, and my early observations are by now way outdated. That is why I decided to spend some time during a recent break to work on my side project and see how much better things are now.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://amatria.in/blog/blog/images/119-0.png" /><media:content medium="image" url="https://amatria.in/blog/blog/images/119-0.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Being Human in the Age of AI: On Critical Thinking, Agency, and Scientific Discovery</title><link href="https://amatria.in/blog/2025human" rel="alternate" type="text/html" title="Being Human in the Age of AI: On Critical Thinking, Agency, and Scientific Discovery" /><published>2025-04-01T00:00:01+00:00</published><updated>2025-04-01T00:00:01+00:00</updated><id>https://amatria.in/blog/being-human</id><content type="html" xml:base="https://amatria.in/blog/2025human"><![CDATA[<p>We seem to be living through a pivotal moment, often dubbed the <a href="https://www.gatesnotes.com/the-age-of-ai-has-begun">“Age of AI”</a>. It feels like barely a day goes by without news of another AI breakthrough that promises to revolutionize how we live, work, and perhaps even think. This rapid advancement naturally leads to big questions: What does it mean to be human when machines become increasingly intelligent? Is intelligence really what defines us?</p>

<h1 id="intelligence-isnt-everything">Intelligence Isn’t Everything</h1>

<p>I’d argue that intelligence, while a significant human trait, isn’t the sole factor that makes us human. We are a complex mix: intelligence, yes, but also feelings, creativity, physical presence, social interactions, and so much more. While our particular brand of intelligence often differentiates us from animals, it doesn’t always make us objectively more “intelligent” in every context. A bird’s navigational intelligence, for instance, far surpasses that of most humans. (Read <a href="https://www.amazon.com/Nietzsche-Were-Narwhal-Intelligence-Stupidity/dp/0316388068">“If Nietzsche Were a Narwhal: What Animal Intelligence Reveals About Human Stupidity”</a> for more examples and details.)
Similarly, AI is rapidly becoming more “intelligent” than humans in specific, and increasingly general, domains. It can process vast amounts of data, identify patterns we miss, and even, as we’ll see, contribute to creative and scientific endeavors. The key, then, isn’t to compete with AI on raw intelligence but to leverage its capabilities to enhance our own intelligence while actively promoting and valuing those other uniquely human aspects – our empathy, our creativity, our ability to connect and feel.</p>

<h1 id="even-human-experts-arent-perfect">Even Human Experts Aren’t Perfect</h1>

<p>It’s also crucial to remember that even expert humans are fallible. We often place immense trust in experts, like medical doctors, but studies show their “intelligence” or accuracy has limits. For example, one fascinating study highlighted that in straightforward diagnostic cases, doctors’ average accuracy was just over 55%, dropping to less than 6% for more complex cases (see <a href="https://jamanetwork.com/journals/jamainternalmedicine/fullarticle/1731967">“Physicians’ Diagnostic Accuracy, Confidence, and Resource Requests”</a>). Perhaps even more surprisingly, their confidence level remained high regardless of the case’s difficulty (around 72% for easy cases vs. 64% for hard ones).
What does this tell us? Even the best human experts make mistakes, and often, they aren’t aware when they might be wrong. So, as AI systems become increasingly capable, potentially surpassing human experts in many areas, we must remember they too will make mistakes. Blind trust is unwise, whether placed in a human or an AI.</p>

<h1 id="agency-and-judgment-the-human-domain">Agency and Judgment: The Human Domain</h1>

<p>This brings us to a critical point: decisions are for humans. While AI can provide insights, predictions, and recommendations at a scale and speed we can’t match, it lacks true agency, judgment, and the capacity for responsibility. An AI can’t be held accountable for the consequences of a decision. This becomes even more critical as we increasingly hear about “AI agents” capable of performing complex tasks autonomously. While powerful, these agents still lack genuine responsibility and accountability for their actions or outcomes. We must resist the temptation to outsource our own agency to these systems; the ultimate responsibility must remain firmly in human hands.
Therefore, we must resist the temptation to let AI make important decisions for us. Use AI as a powerful tool, a co-pilot, or even a knowledgeable advisor. But never blindly trust its output. Just as you might seek a second opinion from another human expert, consider getting multiple perspectives, whether from different AIs or a combination of AI and human experts. Ultimately, you must own the decision. This requires critical thinking, evaluation, and applying our uniquely human blend of intelligence, values, and intuition.</p>

<p><img src="/blog/images/118-0.png" /></p>

<h1 id="ai-in-science-disruption-and-opportunity">AI in Science: Disruption and Opportunity</h1>

<p>The world of scientific discovery is already being profoundly impacted by AI. In a fascinating, and perhaps slightly unnerving, development, a scientific paper generated entirely by AI recently made its way into a top-tier peer-reviewed workshop. <a href="https://sakana.ai/ai-scientist-first-publication/">This experiment by Sakana AI</a> is likely just the beginning.</p>

<p>This raises questions about the future of scientific innovation. Some, like Thomas Wolf in his piece <a href="https://thomwolf.io/blog/scientific-ai.html">“The Einstein AI Model”</a>, argue that AI, lacking true understanding and creativity, will never be able to genuinely innovate in science. While his narrative is compelling and worth reading, I have to disagree. History, even recent AI history, suggests otherwise. Remember AlphaGo’s groundbreaking Move 37? It was a move born from patterns learned through deep reinforcement learning (RL) and self-play, a move that human Go masters considered genuinely novel and creative.</p>

<p><img src="/blog/images/118-1.png" /></p>

<p>Emerging research indicates that LLMs, trained using similar RL and self-play techniques, can generate scientific hypotheses evaluated as more novel than those produced by human scientists. An AI “co-scientist” could potentially accelerate discovery far beyond what adding another human expert to the team could achieve (see <a href="https://arxiv.org/abs/2502.18864">this paper</a> for details). This disruption is happening now. Scientific organizations need to take this seriously, not by fighting it, but by embracing it and figuring out how to work with AI to push the boundaries of knowledge faster and further than ever before.</p>

<p><img src="/blog/images/118-2.png" /></p>

<h1 id="critical-thinking-in-the-age-of-ai-are-we-getting-dumber">Critical Thinking in the Age of AI: Are We Getting Dumber?</h1>

<p>A common fear is that relying on AI will atrophy our own cognitive skills, particularly critical thinking. Headlines often scream that “GenAI makes you dumb”. But does it?</p>

<p>A recent study by Microsoft and CMU titled <a href="https://www.microsoft.com/en-us/research/publication/the-impact-of-generative-ai-on-critical-thinking-self-reported-reductions-in-cognitive-effort-and-confidence-effects-from-a-survey-of-knowledge-workers/">“The Impact of Generative AI on Critical Thinking”</a> explored how knowledge workers perceive their critical thinking when using GenAI. It’s crucial to note the study focuses on self-reported perceptions, not objective measures of critical thinking ability.</p>

<p>The findings? People report using more critical thinking when they trust the AI less or trust themselves more on a task. Interestingly, the type of task didn’t seem to affect the reported level of critical thinking. The study also suggests a shift in approach: humans move from executing tasks to verifying AI-generated outputs. Perhaps most relevant to the headlines, users perceived less effort was needed to engage critical thinking when using GenAI.</p>

<p>Does perceiving less effort mean critical thinking is actually reduced? The study doesn’t prove that. It’s meta-ironic to see humans exhibiting poor critical thinking by <a href="https://techcrunch.com/2025/02/10/is-ai-making-us-dumb/">misinterpreting</a> a study about AI and critical thinking. The paper itself offers nuanced insights and suggests that AI tools should be designed to encourage human agency and critical evaluation, aligning with points <a href="https://amatriain.net/blog/llmsdoctors">I’ve made previously</a>.</p>

<h1 id="ai-learning-and-the-effort-equation">AI, Learning, and the Effort Equation</h1>

<p>The use case of AI in learning warrants special attention. Intuitively, allowing AI to simply provide answers feels like it would bypass the necessary mental effort crucial for deep understanding and retention. Recent research confirms this intuition isn’t entirely off-base, revealing that how students use AI tools like LLMs significantly impacts learning outcomes (see <a href="https://arxiv.org/abs/2409.09047">“AI Meets the Classroom: When Do Large Language Models Harm Learning?”</a>). Using AI to substitute for learning activities (e.g., getting solutions directly) allows students to cover material faster (increasing breadth), but demonstrably reduces their depth of understanding. Conversely, using AI to complement learning (e.g., asking for explanations) improves understanding without necessarily speeding up coverage.</p>

<p>Disturbingly, studies suggest students gravitate towards the less effective substitution method, possibly because it requires less immediate effort. This preference, combined with findings that LLMs may widen the learning gap by benefiting high-knowledge students more than those with less prior knowledge, raises concerns. Furthermore, students using LLMs tend to overestimate how much they’ve actually learned compared to their tested performance.</p>

<p><img src="/blog/images/118-3.png" /></p>

<p>These findings highlight a critical challenge for education. While AI offers potential as a powerful complementary learning aid, its potential for misuse as an effort-substitute is significant. Thoughtful integration, clear guidelines encouraging complementary use, and perhaps even interface designs that discourage passive substitution are necessary to harness AI’s educational benefits without undermining the learning process itself, especially for vulnerable learners.</p>

<h1 id="embracing-our-humanity">Embracing Our Humanity</h1>

<p>The Age of AI isn’t about humans versus machines. It’s about defining what truly matters to us as humans and leveraging these incredible new tools to enhance those aspects. AI will undoubtedly handle more cognitive tasks, but it cannot replicate the richness of human experience – our emotions, ethical judgments, creativity born from lived experience, physical interactions, and deep social bonds. Leveraging these tools effectively requires not only individual awareness and agency but also a commitment from creators to design AI systems that encourage critical engagement, support deeper understanding, and ultimately empower, rather than merely automate, our uniquely human capabilities. Let’s use AI to free ourselves up to be more human, not less.</p>]]></content><author><name>Xavier</name></author><category term="ai" /><category term="research" /><category term="philosophy" /><summary type="html"><![CDATA[We seem to be living through a pivotal moment, often dubbed the “Age of AI”. It feels like barely a day goes by without news of another AI breakthrough that promises to revolutionize how we live, work, and perhaps even think. This rapid advancement naturally leads to big questions: What does it mean to be human when machines become increasingly intelligent? Is intelligence really what defines us?]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://amatria.in/blog/blog/images/118-0.png" /><media:content medium="image" url="https://amatria.in/blog/blog/images/118-0.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">2024: A Year in AI Research</title><link href="https://amatria.in/blog/2024research" rel="alternate" type="text/html" title="2024: A Year in AI Research" /><published>2024-12-31T00:00:01+00:00</published><updated>2024-12-31T00:00:01+00:00</updated><id>https://amatria.in/blog/year-in-ai-research</id><content type="html" xml:base="https://amatria.in/blog/2024research"><![CDATA[<p>2024 has been an intense year for AI. While some argue that we haven’t made much progress, I beg to differ. It is true that many of the research advances from 2023 have still 
not made it to mainstream applications. But, that doesn’t mean that research is not making progress all around!</p>

<p>Every month I send my team at Google a few paper recommendations. For this end-of-year blog post, I went through all my monthly emails, picked my favorite articles, and I 
grouped them into categories. In each category I kept them ordered by publication date, so you may get a sense of progress in each of them.</p>

<p>Of course, this is a highly curated and probably biased list. I hope you enjoy it, and please let me know which favorite paper of yours I missed!</p>

<p><img src="/blog/images/117-0.jpeg" />
<em>Imagen3. Prompt= “Futuristic cityscape under construction in 2024, representing rapid progress in AI research, many buildings under construction, robots, drones, holographic blueprints, but also trees and nature and happy people and children, each building a different research area, detailed, cinematic lighting, concept art, warm colors”</em></p>

<ul>
  <li><a href="#models">LLMs, surveys and new models</a></li>
  <li><a href="#techniques">New techniques</a></li>
  <li><a href="#rag">RAG and beyond</a></li>
  <li><a href="#apps">Domain-specific applications of LLMs</a></li>
  <li><a href="#security">AI Security and alignment</a></li>
  <li><a href="#agents">Agents</a></li>
  <li><a href="#eval">LLM Evaluation</a></li>
  <li><a href="#beyond">Beyond LLMs</a></li>
</ul>

<h1 id="llms-surveys-and-new-models"><a name="models"></a>LLMs, surveys and new models</h1>

<ol>
  <li><strong><a href="https://arxiv.org/abs/2402.06196">Large Language Models: A Survey</a></strong> - This comprehensive survey paper, which I had the privilege of co-authoring, provides a broad overview of the rapidly evolving landscape of Large Language Models (LLMs). It covers everything from model architectures and training techniques to applications and ethical considerations. I was particularly interested in contributing to the section on prompt engineering, agents, and post-attention LLMs. It’s been incredibly gratifying to see this paper cited nearly 500 times in less than a year, highlighting the immense interest in this field. <strong>A must-read for anyone looking to get up to speed on LLMs</strong>.</li>
  <li><strong><a href="https://arxiv.org/abs/2402.17764">The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits</a></strong> -  This paper introduces BitNet, a groundbreaking approach to training LLMs that drastically reduces the precision of model parameters to just 1.58 bits, effectively making each parameter ternary (-1, 0, or +1). While reducing precision this drastically might seem counterintuitive, the authors demonstrate that BitNet models achieve comparable performance to full-precision models while delivering significant improvements in latency, memory usage, throughput, and energy consumption. This could be a game-changer for deploying LLMs on resource-constrained devices and making them more environmentally sustainable. <strong>This work challenges the conventional wisdom that high precision is always necessary for LLM performance</strong>.</li>
  <li><strong><a href="https://arxiv.org/abs/2403.19887">Jamba: A Hybrid Transformer-Mamba Language Model</a></strong> -  Jamba represents an exciting new direction in LLM architecture, combining the strengths of Transformers with the efficiency of Structured State Space Models (SSMs), specifically the Mamba architecture. This hybrid approach allows Jamba to handle longer contexts more effectively than traditional Transformers, while also being more computationally efficient. <strong>As one of the first open-source models to successfully integrate these two architectures, Jamba is a significant step towards building more powerful and scalable LLMs</strong>.</li>
  <li><strong><a href="https://arxiv.org/abs/2407.07726v1">PaliGemma: A versatile 3B VLM for transfer</a></strong> - PaliGemma, a new open-source vision-language model from Google DeepMind, stands out for its compact size (3 billion parameters) and its focus on transfer learning. Unlike many larger models that are trained from scratch for each new task, PaliGemma is specifically designed to be fine-tuned efficiently for a wide range of downstream applications. This makes it a particularly valuable tool for researchers and developers with limited resources, and it underscores the growing importance of transfer learning in the field of vision-language AI.</li>
  <li><strong><a href="https://arxiv.org/abs/2408.00118">Gemma 2: Improving Open Language Models at a Practical Size</a></strong> - The Gemma 2 report offers a treasure trove of insights into the training process of state-of-the-art LLMs. The authors delve into the details of their architectural choices, including variations on attention mechanisms and the use of knowledge distillation, all while prioritizing a model size that is practical for real-world use. <strong>This report is highly recommended for anyone interested in the nitty-gritty of LLM development and the trade-offs involved in optimizing for performance, efficiency, and accessibility</strong>.</li>
  <li><strong><a href="https://ai.meta.com/research/publications/the-llama-3-herd-of-models/">The Llama 3 Herd of Models</a></strong> -  While the Llama 3 models from Meta don’t introduce radical architectural changes, they demonstrate the continued power of scaling up existing approaches. <strong>The key takeaway here is the sheer impact of using more data and larger models, combined with meticulous engineering</strong>. Llama 3 models achieve impressive results across a range of benchmarks, reaffirming that bigger (and with better data) often is better, at least for now. Still, it does make one wonder when diminishing returns will set in.</li>
  <li><strong><a href="https://arxiv.org/abs/2408.07009">Imagen3</a></strong> - Imagen 3, Google’s latest text-to-image model, pushes the boundaries of image generation quality and control. It outperforms all other models, including DALL-E 3 and Midjourney v6, on most benchmarks, demonstrating a remarkable ability to translate complex textual prompts into highly detailed and coherent images. While Midjourney v6 still holds a slight edge in subjective visual appeal, Imagen 3’s superior performance on benchmarks requiring precise prompt adherence highlights its strength in controllability. <strong>This model represents a significant leap forward in text-to-image generation, bringing us closer to AI that can truly understand and visualize our creative visions</strong>.</li>
  <li><strong><a href="https://openai.com/index/gpt-4o-system-card/">GPT-4o System Card</a></strong> - The GPT-4o System Card provides a crucial glimpse into the extensive safety work that went into OpenAI’s latest flagship model. It outlines the rigorous evaluations, including external red teaming and frontier risk assessments, that were conducted before its release. <strong>This document is a must-read for anyone interested in the ethical considerations surrounding advanced AI and the growing importance of proactive safety measures in the development of powerful language models</strong>.</li>
  <li><strong><a href="https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/">Llama 3.2: Revolutionizing edge AI and vision with open, customizable models</a></strong> - Meta’s release of Llama 3.2 models (both LLMs and VLMs) emphasizes the increasing demand for powerful yet efficient AI that can run on edge devices. These smaller models achieve state-of-the-art performance for their size, making them suitable for deployment on smartphones, wearables, and other resource-constrained hardware. This is a significant step towards democratizing access to advanced AI and enabling a new wave of on-device applications.</li>
  <li><strong><a href="https://ai.meta.com/research/publications/movie-gen-a-cast-of-media-foundation-models/">Movie Gen: A Cast of Media Foundation Models</a></strong> - Meta’s MovieGen showcases the impressive capabilities of foundation models in the realm of multimedia generation. This collection of models achieves state-of-the-art results on a wide range of tasks, including text-to-video, video-to-audio, and text-to-audio generation. <strong>This work highlights the growing power of AI to create and manipulate different forms of media, opening up exciting possibilities for creative expression and content generation</strong>.</li>
  <li><strong><a href="https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf">Deepseek v3 technical report</a></strong> - The DeepSeek-V3 technical report is a testament to the fact that you don’t need a massive budget to build a state-of-the-art LLM. This model has been making waves in the AI community due to its impressive performance, achieved with a surprisingly small training budget. <strong>This report provides valuable insights into how to efficiently train high-performing LLMs, making it a valuable resource for researchers and developers working with limited resources</strong>.</li>
</ol>

<p><img src="/blog/images/117-1.png" />
<em>From our “Large Language Models: A Survey” paper</em></p>

<h1 id="new-techniques"><a name="techniques">New techniques</a></h1>

<ol>
  <li><strong><a href="https://arxiv.org/abs/2403.03507">GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection</a></strong> -  GaLore introduces a novel training strategy that tackles the memory bottleneck of training large language models. By using gradient low-rank projection, it enables full-parameter learning while being significantly more memory-efficient than popular methods like LoRA. This could open up new possibilities for training larger and more complex models on existing hardware, pushing the boundaries of what’s possible in LLM research..</li>
  <li><strong><a href="https://arxiv.org/html/2404.07143v1">Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention</a></strong> - Infini-attention offers a clever solution to the context length limitations of Transformers. This new attention mechanism allows for efficient processing of arbitrarily long contexts, even on LLMs with a relatively small number of parameters. By improving both context length and inference efficiency, Infini-attention could enable LLMs to process and understand much larger documents and conversations, unlocking new applications in areas like document summarization, question answering, and dialogue systems.</li>
  <li><strong><a href="https://arxiv.org/abs/2406.17557">The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale</a></strong> - This paper from HuggingFace tackles the crucial issue of data quality in LLM training. The authors present a new dataset called FineWeb, which is carefully curated from web data using a rigorous filtering and cleaning process. <strong>This work highlights the importance of high-quality data for achieving optimal LLM performance and provides a valuable resource for researchers working on improving the quality of web-scale datasets</strong>.</li>
  <li><strong><a href="https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html">Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet</a></strong> - This fascinating work from Anthropic delves into the inner workings of LLMs, demonstrating a method for extracting interpretable features from the Claude 3 Sonnet model. Not only can they identify these features, but they also show how they can be manipulated to control the model’s output. <strong>This research provides valuable insights into the interpretability of LLMs and opens up exciting possibilities for steering their behavior in a more fine-grained way</strong>. I find this area to be one of the most important in the path to safe AGI.</li>
  <li><strong><a href="https://arxiv.org/abs/2405.21060">Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality</a></strong> - This paper makes the surprising and insightful claim that Transformers, the dominant architecture in modern LLMs, are actually a specific type of Structured State Space Model (SSM). This connection opens up new avenues for understanding and potentially improving both Transformers and SSMs. By revealing this underlying duality, the authors provide a unifying framework that could lead to the development of more generalized and efficient models for sequence processing.</li>
  <li><strong><a href="https://arxiv.org/abs/2406.02528">Scalable MatMul-free Language Modeling</a></strong> - This paper challenges the fundamental reliance on matrix multiplications (MatMul) in Transformer architectures. The authors demonstrate that it’s possible to build language models without any MatMul operations, potentially leading to significant improvements in computational efficiency. This is a radical departure from conventional approaches and could pave the way for new hardware architectures optimized for language modeling.</li>
  <li><strong><a href="https://arxiv.org/abs/2402.12354">LoRA+: Efficient Low Rank Adaptation of Large Models</a></strong> - LoRA+ builds upon the popular LoRA (Low-Rank Adaptation) technique, further enhancing its efficiency by dynamically adjusting the learning rate for different parameters during fine-tuning. <strong>This relatively simple yet effective modification can lead to faster convergence and improved performance when adapting large models to new tasks, making it a valuable tool for practitioners working with LLMs</strong>.</li>
  <li><strong><a href="https://arxiv.org/abs/2402.12354">Self-Play Preference Optimization for Language Model Alignment</a></strong> - This paper introduces a novel alternative to Reinforcement Learning from Human Feedback (RLHF) for aligning language models with human preferences. Instead of relying on external human feedback, it uses a self-play mechanism where the model competes against itself to improve its performance. <strong>This approach offers a potentially more scalable and efficient way to fine-tune LLMs for specific tasks or domains, and it raises interesting questions about the nature of learning and optimization in AI</strong>.</li>
  <li><strong><a href="https://arxiv.org/abs/2409.12917">Training Language Models to Self-Correct via Reinforcement Learning</a></strong> - This work from DeepMind explores the use of reinforcement learning to train language models to self-correct their mistakes. By incorporating a self-correction mechanism during training, the model learns to identify and rectify errors, leading to improved performance and robustness. The authors show that this approach can act as a form of regularization, preventing overfitting and improving generalization to unseen data.</li>
  <li><strong><a href="https://arxiv.org/abs/2411.17800">STAR: Synthesis of Tailored Architectures</a></strong> - The STAR paper introduces a novel approach to automatically designing neural network architectures using a gradient-free evolutionary algorithm. Instead of relying on manual design or gradient-based optimization, STAR explores a vast search space of architectures, evolving them over time to find optimal solutions. <strong>This work has the potential to automate the process of neural architecture search, leading to the discovery of new and potentially more efficient architectures for various task</strong>s.</li>
  <li><strong><a href="https://arxiv.org/abs/2406.15786">What Matters in Transformers? Not All Attention is Needed</a></strong> - This paper investigates the inner workings of the Transformer architecture, questioning the necessity of every single attention head. Their findings suggest that current Transformer architectures contain redundancies, and that similar performance can be achieved with fewer attention mechanisms. <strong>This research could lead to more efficient Transformer designs and a deeper understanding of what makes them so effective</strong>.</li>
  <li><strong><a href="https://arxiv.org/abs/2411.09009">Cut Your Losses in Large-Vocabulary Language Models</a></strong> - This paper from Apple demonstrates a clever optimization technique for large-vocabulary language models. By implementing a custom kernel for matrix multiplication and log-sum-exp operations, they are able to significantly reduce computational cost and improve efficiency. The authors use the Gemma model as an example, showcasing the practical benefits of their approach. <strong>This is another great example of how low-level optimizations can have a big impact on the performance of large models</strong>.</li>
  <li><strong><a href="https://arxiv.org/abs/2408.03314">Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters</a></strong> - This paper challenges the conventional wisdom that bigger models are always better. The authors, from DeepMind, argue that scaling up computation at test time can be a more effective strategy than simply increasing the number of model parameters. While this idea has been recently popularized by speculation around OpenAI’s “O1” model, this paper provides the original research underpinning this concept. This highlights the importance of considering not just model size but also computational efficiency during inference, and it opens up new avenues for optimizing LLM performance.</li>
  <li><strong><a href="https://ai.meta.com/research/publications/large-concept-models-language-modeling-in-a-sentence-representation-space/">Large Concept Models: Language Modeling in a Sentence Representation Space</a></strong> - Meta’s research on Large Concept Models (LCMs) explores a fascinating new direction for language modeling. Instead of predicting individual tokens, LCMs operate in a sentence representation space, predicting hierarchical concepts. This approach could potentially lead to more abstract and human-like understanding of language. <strong>While still early, this work suggests that there’s more to language modeling than just token prediction, and it opens up exciting new avenues for research into higher-level reasoning and understanding</strong>.</li>
  <li><strong><a href="https://deepmind.google/discover/blog/pushing-the-frontiers-of-audio-generation/">Pushing the frontiers of audio generation</a></strong> -  “Having worked on audio synthesis myself many years ago, I’ve been consistently amazed by the rapid progress in AI-driven audio generation. This blog post from DeepMind provides an overview of the research that has led to these breakthroughs, covering areas like neural audio codecs and diffusion models. <strong>This is a great read for anyone interested in the intersection of AI and audio, and it showcases the incredible potential of AI to generate realistic and creative soundscapes</strong>.</li>
</ol>

<h1 id="rag-and-beyond"><a name="rag">RAG and beyond</a></h1>

<ol>
  <li><strong><a href="https://arxiv.org/abs/2409.01666">In Defense of RAG in the Era of Long-Context Language Models</a></strong> - This paper makes a compelling case for the continued relevance of Retrieval-Augmented Generation (RAG) even as LLMs with increasingly long context windows become available. The authors argue that RAG is not simply a workaround for limited context length but offers distinct advantages in many scenarios, such as when dealing with rapidly evolving information or when needing to provide sources for generated text. <strong>This is a valuable read for anyone working with LLMs, as it challenges the assumption that longer context windows will necessarily replace retrieval-based methods</strong>.</li>
  <li><strong><a href="https://www.anthropic.com/news/contextual-retrieval">Introducing Contextual Retrieval</a></strong> -  “Anthropic’s work on contextual retrieval introduces a refined approach to RAG that takes into account the broader context of a query when retrieving relevant information. They propose two main techniques—contextual embeddings and contextual BM25—to improve the accuracy and relevance of retrieved passages. <strong>This research highlights the ongoing efforts to make RAG systems more sophisticated and context-aware, potentially leading to more accurate and nuanced responses from LLMs</strong>.</li>
  <li><strong><a href="https://arxiv.org/abs/2404.16130">From Local to Global: A Graph RAG Approach to Query-Focused Summarization</a></strong> - This paper from Microsoft explores a novel approach to query-focused summarization using a graph-based RAG system. By representing information as a graph, the model can better capture the relationships between different pieces of information and generate more coherent and comprehensive summaries. <strong>This work demonstrates the potential of graph-based methods for enhancing RAG systems and tackling complex information retrieval tasks</strong>.</li>
</ol>

<h1 id="domain-specific-applications-of-llms"><a name="apps">Domain-specific applications of LLMs</a></h1>

<ol>
  <li><strong><a href="https://arxiv.org/abs/2404.18416">Capabilities of Gemini Models in Medicine</a></strong> - This paper showcases the impressive capabilities of Google’s multimodal Gemini models in the medical domain. The results are quite striking: Gemini models outperform medical experts on a wide range of tasks, including medical image interpretation, diagnosis, and report generation. <strong>This research underscores the transformative potential of AI in healthcare, although it also raises important questions about the role of human expertise in the future of medicine</strong>.</li>
  <li><strong><a href="https://arxiv.org/abs/2404.14662">NExT: Teaching Large Language Models to Reason about Code Execution</a></strong> - The NExT paper presents a method for teaching LLMs to reason about the execution of code, a crucial step towards building more reliable and trustworthy AI for software development. The authors demonstrate that LLMs can learn to accurately predict the output of code snippets, even for complex programs. <strong>This research has significant implications for automating code analysis, debugging, and even code generation, potentially leading to substantial gains in programmer productivity</strong>.</li>
  <li><strong><a href="https://arxiv.org/abs/2404.11794">Automated Social Science: Language Models as Scientist and Subjects</a></strong> - This paper explores the intriguing possibility of using LLMs to automate social science research. By combining structural causal models with the reasoning capabilities of LLMs, the authors propose a framework for conducting automated experiments and generating hypotheses. <strong>While still in its early stages, this research opens up exciting new avenues for using AI to accelerate scientific discovery in the social sciences, potentially leading to a deeper understanding of human behavior and social phenomena</strong>.</li>
  <li><strong><a href="https://arxiv.org/abs/2410.10901">3DS: Decomposed Difficulty Data Selection’s Case Study on LLM Medical Domain Adaptation</a></strong> - This paper challenges the common practice of domain adaptation and fine-tuning LLMs for specific medical tasks. The authors argue that frontier models, such as GPT-4, are already good enough for many medical applications, even without specialized training. They introduce a method called 3DS that selects training data based on decomposed difficulty levels. <strong>This research questions some of the prevailing assumptions about the need for extensive domain adaptation in every case and suggests that the capabilities of general-purpose LLMs might be underestimated in specialized domains like medicine</strong>.</li>
</ol>

<h1 id="ai-security-and-alignment"><a name="security">AI Security and alignment</a></h1>

<ol>
  <li><strong><a href="https://arxiv.org/abs/2402.11753">ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs</a></strong> - This paper reveals a surprising vulnerability in aligned LLMs: they can be jailbroken using carefully crafted ASCII art prompts. This might seem whimsical at first, but it highlights the fragility of current alignment techniques and the creative ways in which adversaries might try to circumvent them. <strong>This research underscores the need for more robust and comprehensive methods for aligning LLMs with human values and preventing them from being misused</strong>.</li>
  <li><strong><a href="https://www.alignmentforum.org/posts/79BPxvSsjzBkiSyTq/agi-safety-and-alignment-at-google-deepmind-a-summary-of">AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work</a></strong> - This paper provides a valuable overview of Google DeepMind’s extensive research on AGI safety and alignment. It covers a wide range of topics, including scalable oversight, robustness, and system safety. <strong>This is a key read for anyone interested in the long-term safety of AI and the challenges of ensuring that increasingly powerful AI systems remain aligned with human goals</strong>.</li>
  <li><strong><a href="https://arxiv.org/abs/2404.09932">Foundational Challenges in Assuring Alignment and Safety of Large Language Models</a></strong> - This multi-institution paper delves into the fundamental challenges of ensuring the alignment and safety of LLMs. It provides a framework for understanding the different types of alignment failures and outlines key research directions for addressing them. <strong>This paper is a valuable contribution to the growing field of AI safety and highlights the need for collaborative efforts to tackle these complex issues</strong>.</li>
  <li><strong><a href="https://arxiv.org/abs/2406.04231">Quantifying Misalignment Between Agents: Towards a Sociotechnical Understanding of Alignment</a></strong> - This paper extends the discussion of alignment beyond individual models to the realm of multi-agent systems. The authors argue that misalignment between agents can be just as important, if not more so, than misalignment between a model and human values. They propose a framework for quantifying misalignment in multi-agent settings. <strong>This work highlights the need to consider the broader social and technical context in which AI systems operate and to develop methods for ensuring that agents can effectively cooperate and coordinate with each other</strong>.</li>
</ol>

<h1 id="agents"><a name="agents">Agents</a></h1>

<ol>
  <li><strong><a href="https://arxiv.org/abs/2402.06360">CoSearchAgent: A Lightweight Collaborative Search Agent with Large Language Models</a></strong> - The CoSearchAgent paper introduces a lightweight, collaborative search agent that leverages the power of multiple specialized LLMs. By dividing the search task among different agents, each with its own expertise, the system can achieve more comprehensive and accurate results. <strong>This research demonstrates the potential of multi-agent systems for tackling complex information retrieval tasks and highlights the benefits of a collaborative approach</strong>.</li>
  <li><strong><a href="https://arxiv.org/abs/2406.04692">Mixture-of-Agents Enhances Large Language Model Capabilities</a></strong> - This work from Together.ai provides further evidence for the power of multi-agent systems. They show that combining agents built using smaller, open-source models can outperform even the most advanced, monolithic LLMs on certain tasks. <strong>This research suggests that the future of AI might lie not in ever-larger single models but in well-designed systems of interacting, specialized agents</strong>.</li>
  <li><strong><a href="https://arxiv.org/abs/2406.06469">Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning</a></strong> - The Husky paper, with authors from Meta and other institutions, introduces a new open-source language agent designed for multi-step reasoning. This general-purpose agent can tackle a wide range of tasks that require planning and problem-solving. The release of Husky as an open-source tool is a significant contribution to the field, providing researchers and developers with a powerful platform for building and experimenting with language agents.</li>
  <li><strong><a href="https://www.langchain.com/stateofaiagents">LangChain State of AI Agents</a></strong> - This report from the LangChain team provides a comprehensive overview of the rapidly evolving field of AI agents. It covers the different types of agents, the key challenges in building them, and the most promising applications. <strong>This is an excellent resource for anyone looking to get up to speed on the current state of agent research and development</strong>.</li>
</ol>

<h1 id="llm-evaluation"><a name="eval">LLM Evaluation</a></h1>

<ol>
  <li><strong><a href="https://arxiv.org/abs/2404.12272">Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences</a></strong> - This paper tackles the crucial issue of evaluating LLMs, particularly when using other LLMs as evaluators. The authors introduce EvalGen, an interface that allows humans to grade LLMs on their ability to evaluate other LLMs, thus aligning the evaluation process with human preferences. <strong>This research highlights the importance of meta-evaluation in the development of LLMs and provides a practical tool for improving the reliability of LLM-assisted evaluations</strong>.</li>
  <li><strong><a href="https://arxiv.org/abs/2405.02287">Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models</a></strong> - The Vibe-Eval benchmark, introduced by researchers at Reka AI, offers a new and challenging way to evaluate multimodal LLMs. It focuses on tasks that require a deep understanding of both visual and textual information. The introduction of this benchmark will push the development of more robust and capable multimodal models, helping us measure true progress in this important area.</li>
</ol>

<h1 id="beyond-llms"><a name="beyond">Beyond LLMs</a></h1>

<ol>
  <li><strong><a href="https://www.nature.com/articles/s41586-024-07487-w">Accurate structure prediction of biomolecular interactions with AlphaFold 3</a></strong> - AlphaFold 3 represents a major breakthrough in the field of structural biology. This latest iteration of DeepMind’s protein folding model can now predict the structure and interactions of a wide range of biomolecules, including proteins, DNA, and RNA, with unprecedented accuracy. The release of the AlphaFold Server, a free tool for researchers, promises to dramatically accelerate research in drug discovery, materials science, and our fundamental understanding of biological processes. <strong>This is a landmark achievement that has the potential to revolutionize many areas of science</strong>.</li>
  <li><strong><a href="https://cacm.acm.org/opinion/new-computer-evaluation-metrics-for-a-changing-world/">New Computer Evaluation Metrics for a Changing World</a></strong> - This paper argues for the need to rethink traditional computer performance metrics in light of the changing landscape of computing. The authors, my colleagues at Google, propose new metrics that take into account factors like energy efficiency, sustainability, and the specific needs of modern workloads, such as AI and machine learning. <strong>This is a timely and important contribution that will be relevant to anyone involved in designing, evaluating, or using computer systems</strong>.</li>
  <li><strong><a href="https://arxiv.org/abs/2411.12090">Hardware Trends Impacting Floating-Point Computations In Scientific Applications</a></strong> - This paper explores the complex interplay between hardware trends and the precision requirements of scientific computing. The authors analyze how different floating-point formats and hardware architectures affect the accuracy and performance of scientific applications. <strong>This research provides valuable insights for developers of scientific software, helping them navigate the trade-offs between precision, performance, and efficiency in the context of evolving hardware</strong>.</li>
</ol>

<p><img src="/blog/images/117-2.jpeg" />
<em>Imagen3. Prompt = “2024, AI research depicted as a challenging mountain range, climbers making progress on each peak, representing breakthroughs and ongoing efforts, detailed, cinematic lighting, concept art”</em></p>]]></content><author><name>Xavier</name></author><category term="ai" /><category term="research" /><summary type="html"><![CDATA[2024 has been an intense year for AI. While some argue that we haven’t made much progress, I beg to differ. It is true that many of the research advances from 2023 have still not made it to mainstream applications. But, that doesn’t mean that research is not making progress all around!]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://amatria.in/blog/blog/images/117-0.jpeg" /><media:content medium="image" url="https://amatria.in/blog/blog/images/117-0.jpeg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">The end of the “Age of Data”? Enter the age of superhuman data and AI</title><link href="https://amatria.in/blog/ageofdata" rel="alternate" type="text/html" title="The end of the “Age of Data”? Enter the age of superhuman data and AI" /><published>2024-12-24T00:00:01+00:00</published><updated>2024-12-24T00:00:01+00:00</updated><id>https://amatria.in/blog/EndOfData</id><content type="html" xml:base="https://amatria.in/blog/ageofdata"><![CDATA[<h1 id="everything-ends-many-things-start-again">Everything ends, many things start again</h1>

<p>In the ever-shifting landscape of Artificial Intelligence, pronouncements of the ‘end of an era’ are surprisingly common. 
The latest such declaration comes from Ilya Sutskever, who recently suggested that the ‘age of data’, a period defined by the 
relentless pursuit of ever-larger datasets, <a href="https://www.youtube.com/watch?v=YD-9NG1Ke5Y">is drawing to a close</a>. “Peak data” is the term used.
But is he right? This post will argue that the age of data is far from over. Instead, it’s transforming into something even more powerful: an age of superhuman data.</p>

<p><img src="/blog/images/116-0.png" /></p>

<p>Now, to be fair, Ilya is known to be a believer in the power of big data. In fact, the 2014 Sequence to Sequence paper that he revisited 
at NeurIPS this year had the following conclusions:</p>

<p><img src="/blog/images/116-1.png" /></p>

<p>The “scaling hypothesis” that Ilya is referring to here was fully developed by several OpenAI researchers (including Dario Amodei, now CEO of Anthropic) 
in their 2020 paper <a href="https://arxiv.org/abs/2001.08361">“Scaling Laws for Neural Language Models”</a>, where they showed how the performance of a large neural network 
can be predicted from the number of parameters, the dataset size, and the compute budget.</p>
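<p>For concreteness, scaling laws of this kind are usually written as a parametric formula relating loss to model and dataset size. The sketch below uses a generic form with placeholder constants; the specific shape and numbers are illustrative (inspired by later follow-up work), not the fitted values from the 2020 paper:</p>

<pre><code class="language-python">def predicted_loss(n_params, n_tokens, E=1.7, A=400.0, B=410.0, alpha=0.34, beta=0.28):
    """Generic parametric scaling-law form: an irreducible loss plus terms that
    shrink as model size (n_params) and dataset size (n_tokens) grow.
    All constants here are placeholders for illustration, not fitted values."""
    return E + A / n_params ** alpha + B / n_tokens ** beta

# Example: a 7e9-parameter model trained on 2e12 tokens (illustrative numbers only).
print(predicted_loss(7e9, 2e12))
</code></pre>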

<p>So, here we are, just 12 years later, being told that the age of data is over. We don’t have any more data to pump into our models. We have run out of the ‘fossil fuel’ of AI. Or have we?</p>

<p><strong>(Historical digression</strong>:
One of the interesting things about having been around for a while, or having studied a bit of history, is that you realize that many scientific 
ideas or trends die and resurrect again and again. This is particularly true in AI, where a couple of AI winters declared the whole field mostly 
dead. As a reminder, the perceptron, which is the basic unit of Artificial Neural Networks, was famously dismissed by Minsky and Papert in their 1969 book 
<a href="https://en.wikipedia.org/wiki/Perceptrons_(book)">Perceptrons</a>.</p>

<p>It is also important to note that a simpler form of the scaling law, which states that all you need to improve model performance is more data 
and more parameters, was discovered years earlier. One of the earliest papers cited in that context is Banko and Brill’s 2001 
<a href="https://aclanthology.org/P01-1005.pdf">“Scaling to Very Very Large Corpora for Natural Language Disambiguation”</a>. Another popular, somewhat more recent, 
reference is <a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35179.pdf">“The Unreasonable Effectiveness of Data”</a> 
by Halevy, Norvig, and Pereira from Google. Both contributed to the beginning of the “Age of Data”, which is probably best illustrated by Chris Anderson’s 
<a href="https://www.wired.com/2008/06/pb-theory/">“The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”</a> in Wired, where he claimed that the world of 
discovery and science was going to be dominated by data, and that models and algorithms would become obsolete. I argued against this point of view in my 2012 post 
<a href="https://amatria.in/blog/more-data-or-better-models/">“More data or better models?”</a>.
<strong>)</strong></p>

<p><img src="/blog/images/116-2.png" /></p>

<h1 id="not-all-data-is-created-equal">Not all data is created equal</h1>

<p>It is important to underscore that, for data, size is not all that matters. You can have a huge dataset with very low informational value, or a small one with 
a high density of knowledge. In the extreme, you could take a single data point and replicate it one trillion times to get a pretty large dataset that contains only one 
piece of information. Yes, that is an extreme example, but most of the data on the Internet is very low value. In fact, some of it has negative value! When I worked on AI 
for healthcare, I used to half-jokingly point out that the internet had very little high-quality medical data and lots of very dubious Reddit threads with questionable medical 
advice.</p>

<p>Not surprisingly, advances in models such as <a href="https://ai.meta.com/research/publications/the-llama-3-herd-of-models/">Llama 3</a> list better data as the number one reason 
behind their improvement. An important detail of their approach, which is common nowadays, is the focus on having specific domains, such as coding and math, explicitly 
represented. They also pay special attention to having the right “data mix” that represents the different kinds of knowledge the model should learn. The final mix has 
“roughly 50% of tokens corresponding to general knowledge, 25% of mathematical and reasoning tokens, 17% code tokens, and 8% multilingual tokens”. Note that the category of 
“general knowledge” could be broken down into whatever domains are important or relevant.</p>
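<p>To make the idea of a “data mix” concrete, here is a minimal sketch of mix-weighted sampling. The source names, weights, and functions are hypothetical illustrations loosely based on the proportions quoted above, not Meta’s actual pipeline:</p>

<pre><code class="language-python">import random

# Hypothetical source names and weights, roughly matching the quoted proportions.
DATA_MIX = {
    "general_knowledge": 0.50,
    "math_reasoning": 0.25,
    "code": 0.17,
    "multilingual": 0.08,
}

def sample_source(mix, rng=random):
    """Pick a data source with probability proportional to its mix weight."""
    sources, weights = zip(*mix.items())
    return rng.choices(sources, weights=weights, k=1)[0]

def build_batch(corpora, mix=DATA_MIX, batch_size=8):
    """Assemble a training batch by drawing each document from a mix-weighted source.
    `corpora` maps a source name to an iterator over documents (strings)."""
    return [next(corpora[sample_source(mix)]) for _ in range(batch_size)]
</code></pre>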

<p>A highlight of this data-driven approach to LLM training is the <a href="https://arxiv.org/abs/2406.17557">“FineWeb Datasets”</a> by HuggingFace. These datasets, and the associated publication, are a clear example of how important curation has become when training LLMs on publicly available data.</p>
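<p>As an illustration of this kind of curation, the sketch below streams a web-scale dataset with the Hugging Face <code>datasets</code> library and applies a toy quality filter. The dataset id and the heuristic are assumptions for illustration; the real FineWeb pipeline is far more sophisticated:</p>

<pre><code class="language-python">from datasets import load_dataset  # pip install datasets

# Assumed dataset id; streaming avoids downloading the full corpus up front.
fineweb = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)

def looks_useful(doc, min_words=50):
    """Toy quality heuristic: enough words and not too many repeated lines."""
    text = doc["text"]
    lines = [line for line in text.splitlines() if line.strip()]
    unique_ratio = len(set(lines)) / max(len(lines), 1)
    return len(text.split()) >= min_words and unique_ratio >= 0.5

curated = (doc for doc in fineweb if looks_useful(doc))
for _, doc in zip(range(3), curated):
    print(doc["text"][:200])
</code></pre>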

<h1 id="beyond-the-web-the-value-of-proprietary-and-labeled-data">Beyond the Web: The Value of Proprietary and Labeled Data</h1>

<p>While the vast expanse of the public internet has been the primary training ground for large language models, it represents only a fraction of the data that exists and, 
more importantly, only a fraction of the data that is valuable for pushing the boundaries of AI. A crucial point that gets overlooked in discussions about the “end of the 
age of data” is the existence of vast troves of proprietary data, often highly specialized and meticulously curated, that are not publicly available on the internet.
Consider the healthcare industry, where patient records, clinical trial results, and medical imaging data hold immense potential for training specialized AI models. Similarly, 
companies like Waymo and Tesla possess massive, proprietary datasets of driving data that are far more detailed and comprehensive than anything available publicly. This kind of 
high-quality, domain-specific data is becoming increasingly important as AI moves beyond general-purpose tasks towards more specialized applications.</p>

<p>The value of this proprietary data is undeniable, even in the age of powerful pre-trained models. While some early proponents of the GPT era argued that pre-trained models 
would negate the need for specialized datasets through zero-shot learning, this has not proven to be entirely true. Instead, proprietary data is being leveraged in new ways, 
such as through techniques like fine-tuning, retrieval-augmented generation (RAG), or by strategically injecting it into the model’s context window. These methods allow models 
to specialize and adapt to specific domains and tasks, demonstrating that the value of data persists even if the mechanisms for extracting that value are evolving.</p>
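<p>To illustrate the “inject it into the context window” pattern, here is a minimal RAG-style sketch. The <code>embed</code> function, the retriever class, and the prompt format are hypothetical placeholders, not any particular vendor’s API:</p>

<pre><code class="language-python">import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class TinyRetriever:
    """In-memory retrieval over proprietary documents for context injection."""

    def __init__(self, embed):
        self.embed = embed            # hypothetical text-to-vector function
        self.docs, self.vectors = [], []

    def add(self, doc):
        self.docs.append(doc)
        self.vectors.append(self.embed(doc))

    def top_k(self, query, k=3):
        q = self.embed(query)
        ranked = sorted(zip(self.docs, self.vectors),
                        key=lambda pair: cosine(q, pair[1]), reverse=True)
        return [doc for doc, _ in ranked[:k]]

def build_prompt(retriever, question):
    """RAG-style prompt: retrieved proprietary snippets plus the user question."""
    context = "\n\n".join(retriever.top_k(question))
    return f"Use the following context to answer.\n\n{context}\n\nQuestion: {question}"
</code></pre>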

<p>Another critical aspect of valuable data is whether it is labeled or not. While LLMs are typically pre-trained on vast quantities of unlabeled data using self-supervision, 
the subsequent steps of fine-tuning and reinforcement learning often rely on high-quality, labeled datasets. As highlighted in <a href="https://amatria.in/blog/postpretraining">“Beyond Token Prediction: the post-Pretraining 
journey of modern LLMs”</a>, much of the recent progress in GenAI has come from advances in these post-training stages, which require 
carefully curated datasets with explicit labels. Creating these labeled datasets, particularly at scale and with high accuracy, remains a significant challenge and a major 
bottleneck in many AI applications. This is precisely why companies specializing in data labeling, like Scale.ai, are seeing continued growth and high valuations.</p>
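<p>To make the distinction concrete, here is a hedged sketch of what labeled post-training records often look like in practice: one supervised fine-tuning pair and one preference pair. The field names follow common conventions rather than any specific vendor’s schema:</p>

<pre><code class="language-python">import json

# A supervised fine-tuning (SFT) record: an explicit prompt/response pair.
sft_record = {
    "prompt": "Summarize the key risk factors in this clinical note: ...",
    "response": "The main risk factors are ...",
}

# A preference record: two candidate responses ranked by a human labeler,
# the kind of signal used to train reward models (RLHF) or for DPO-style tuning.
preference_record = {
    "prompt": "Explain the trade-offs between fine-tuning and RAG.",
    "chosen": "Fine-tuning bakes knowledge into the weights ...",
    "rejected": "RAG is always better than fine-tuning.",
}

# Post-training datasets are often distributed as plain JSONL files of such records.
with open("post_training.jsonl", "w") as f:
    for record in (sft_record, preference_record):
        f.write(json.dumps(record) + "\n")
</code></pre>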

<p>It is important to note that the need for labeled data, while crucial, does not directly contradict the idea that the “age of pretraining data” might be waning, as Ilya 
suggested. Pretraining still relies heavily on the vast, unlabeled data of the public web. However, the increasing importance of post-training techniques and the value of 
proprietary, labeled datasets underscore a shift towards a more nuanced understanding of data’s role in the future of AI. The age of data is not ending, but it is certainly 
evolving, with a growing emphasis on quality, specificity, and the strategic use of data beyond the initial pretraining phase.</p>

<p><img src="/blog/images/116-3.png" /></p>

<h1 id="human-synthetic-and-superhuman-data">Human, Synthetic, and Superhuman Data</h1>

<p>The vast majority of data currently used to train AI models is generated by humans – from the text on Wikipedia, Quora and Reddit to the images, videos, and code found across 
the internet. Humans also play a crucial role in creating the labeled datasets used for post-training. However, the question of whether this reliance on human-generated data 
can or should continue is a subject of intense debate. While it is true that much of the existing web data has been “used up” for pretraining purposes, the kind of data that 
future AI will need is very different.</p>

<p>One promising avenue, also highlighted by Ilya in his talk, is <strong>synthetic data</strong>, artificially generated data that can augment or even replace real-world data. The potential of synthetic data has been met with both excitement and skepticism. Some research, such as the paper 
<a href="https://www.nature.com/articles/s41586-024-07566-y">“AI models collapse when trained on recursively generated data”</a>, suggests that models trained solely on their own outputs 
can degrade. However, I believe these limitations can be overcome by generating synthetic data from diverse and more 
complex models, as argued in <a href="https://arxiv.org/abs/2410.15226">“On the Diversity of Synthetic Data and its Impact on Training Large Language Models”</a>. There is a rich and 
growing body of work on how to best generate and use synthetic data that I won’t review here (see e.g. 
<a href="https://arxiv.org/abs/2302.04062">“Machine Learning for Synthetic Data Generation: A Review”</a> and 
<a href="https://arxiv.org/abs/2403.04190">“Generative AI for Synthetic Data Generation: Methods, Challenges and the Future”</a> ).</p>

<p>I believe we are rapidly approaching a turning point: <strong>the end of the age of data as defined by human limitations</strong>. While human-generated data has been essential in 
bootstrapping AI, it is inherently constrained by our own perceptions, biases, and the inefficiencies of human communication. As <a href="https://www.noemamag.com/ai-and-the-limits-of-language/">Yann LeCun</a> 
has pointed out, human language is an imperfect tool for capturing the full complexity of reality. We are, in a sense, “lossy” encoders of the world around 
us.</p>

<p>The future of data lies in moving beyond these limitations. My hypothesis is that LLMs are not the end but the means to an end: they are good enough to build AI agents that will 
capture much better data. In fact, they will capture superhuman data. We are on the cusp of an era where AI agents, equipped with advanced sensors and sophisticated reasoning 
capabilities, will interact directly with the world, generating data that is richer, more accurate, and less filtered by human interpretation. Multimodal agents, such as Google 
Deepmind’s <a href="https://deepmind.google/technologies/project-astra/">Astra</a>, will be deployed in real environments to create better data and representations of the world, enabling new 
forms of scientific discovery, as previewed in <a href="https://www.nature.com/articles/s41586-023-06221-2">“Scientific discovery in the age of artificial intelligence”</a>. 
Consider autonomous vehicles like Waymo’s. While currently limited in scope, they represent a first step towards this future. These vehicles collect vast amounts of real-time 
data about the world, data that can be used to improve their own performance and, crucially, could be used to create richer datasets beyond the task of driving. In the near 
future, we can envision AI agents designed explicitly for data collection and generation, operating across a wide range of domains, from materials science to medical research. 
Some very recent interesting examples of this new kind of data include <a href="https://github.com/PolymathicAI/the_well">“The Well (15TB of Physics Simulations)”</a>, <a href="https://github.com/MultimodalUniverse/MultimodalUniverse">“Multimodal Universe (100TBs of Astronomical Scientific Data)”</a>, and <a href="https://deepmind.google/discover/blog/genie-2-a-large-scale-foundation-world-model/">“Genie-2”</a>, a large-scale foundation world model capable of generating an endless variety of action-controllable, playable 3D environments for training and evaluating embodied agents. It is important to note that this new kind of data is different not only in scale but also in its fundamental structure and complexity. Consequently, it will demand the development of novel algorithms and model architectures capable of fully harnessing its potential.</p>

<p>This transition to agent-generated data will not be without its challenges. <strong>Importantly, we must acknowledge the continued need for human-generated data in specific areas, 
particularly for aligning AI systems with human values and preferences.</strong> Data reflecting human desires, feedback, and ethical judgments will remain crucial for ensuring that 
AI remains beneficial and aligned with our goals. This includes data used for techniques like Reinforcement Learning from Human Feedback (RLHF) and other alignment methods. 
In addition, human data will continue to be invaluable for personalizing AI experiences, ensuring that systems are responsive to individual needs and preferences. While 
superhuman data can provide a more accurate and comprehensive understanding of the world, it is human data that provides the crucial link to what matters to us as humans. I should also point out that “superhuman” data or AI is not the same as AGI. In fact, I am known not to be a fan of the term AGI itself, and I believe that <strong>specialized, superhuman agents are likely to be both more achievable and more beneficial in addressing specific, complex challenges in alignment with our goals</strong> (see my
<a href="https://amatria.in/blog/multiagents">“Beyond Singular Intelligence: Exploring Multi-Agent Systems and Multi-LoRA in the Quest for AGI”</a>).</p>

<p>It is not an overstatement to say that data is not only not over; it is in fact about to get much bigger and better, thanks to AI.</p>

<h1 id="datas-next-chapter-from-human-to-superhuman">Data’s Next Chapter: From Human to Superhuman</h1>

<p>The era of relying solely on low-quality web data for AI training is coming to a close. The future of AI will be shaped by a new kind of data: superhuman data. This agent-generated data, unconstrained by human limitations and biases, will unlock new levels of AI capability across a wide range of domains. Imagine a world where specialized AI agents, equipped with advanced sensors, explore the depths of the ocean, analyze complex scientific data in real time, or even help us understand the intricacies of the human brain, generating data far beyond our current reach.</p>

<p>This is not to say that human input will become obsolete. In fact, ensuring that these powerful AI systems remain aligned with human values and goals will be more critical than ever. By carefully curating datasets that reflect our ethical principles and desired outcomes, and by developing robust methods for incorporating human feedback into the learning process, we can guide the development of agentic AI towards a beneficial future.</p>

<p>The age of superhuman data is not just a technological shift; it is a paradigm shift that has the potential to revolutionize science, medicine, and countless other fields. Of course, realizing this potential will require not only new data but also continued innovation in algorithms and model design; the two will work hand in hand to drive the next wave of AI breakthroughs. By embracing this future, and by thoughtfully navigating the challenges it presents, we can unlock a new era of discovery and progress, driven by the power of agentic AI.</p>]]></content><author><name>Xavier</name></author><category term="ai" /><category term="data" /><summary type="html"><![CDATA[Everything ends, many things start again]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://amatria.in/blog/blog/images/116-0.png" /><media:content medium="image" url="https://amatria.in/blog/blog/images/116-0.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>