Bridging the AI Metrics Divide: From Models to Real-World Impact
What’s Targeted vs. What’s Needed: The Problem-Solution Disconnect
Let’s talk about AI metrics! I have explored AI metrics in depth for quite some time, in both academic and commercial contexts. In the former, I co-authored a comprehensive book, Evaluating Learning Algorithms (with Nathalie Japkowicz, published by Cambridge University Press). On the commercial side, I have led the development, release, and deployment of multiple commercial, revenue-generating, and even safety-critical products and services, resulting in hundreds of millions of dollars in verifiable business value, and I have advised on multiple AI initiatives and investments.
Thus, the following post comes from a vantage point of having witnessed the entire AI development lifecycle—from ideation to commercial deployment.
A couple of weeks ago, I was having a discussion with a C-suite leader who, like many others, is eager to drive strategic AI initiatives across the organization.
Here’s a sample of how the conversation began:
The Leader: We are excited about the massive value that GenAI is going to create. We are looking at a paradigm shift. We plan to adopt GenAI as quickly and as widely as possible.
Me: So, what would this paradigmatically shifted business look like?
The Leader: It would be like nothing we have seen before.
Me: Interesting. Can you tell me more? Maybe some examples?
The Leader: We will have GenAI running everything for us.
Me: Have you conducted due diligence regarding GenAI capabilities, use cases, applications, and priorities? What does this transformational GenAI strategy entail?
The Leader: We are actively and aggressively testing various GenAI models. We have top AI leaders driving the programs and reporting to me.
Me: So, what does the picture look like?
The Leader: We are very excited. I have personally tried creating images and exploring chatbots using GenAI. I have also created an app so quickly that I am amazed.
Me: Great. Are your personal experiences transferable to production environments? Do you foresee any challenges?
The Leader: Well, of course there are small issues like hallucinations. By the way, in your opinion, what levels of hallucinations are acceptable?
Me: Well, that depends on the application context. Have these been defined? Have you assessed the acceptable costs of the system providing incorrect responses, for instance?
The Leader: My AI leaders have told me that’s not really an issue. Hallucinations are practically a solved problem. We just plan to adopt some RAG solutions.
...
You can imagine how the rest of the discussion unfolded.
We delved deeper into the details. I was shown how their teams had evaluated various LLMs, their reasoning capabilities, and the recommendations tailored to their needs. When I inquired about the use cases and applications they were driving, or the organizational priorities regarding AI, their responses were so vague that they seemed lifted straight from a "consulting handbook." This assumption proved accurate (as you might have guessed from the sample interaction above). The lack of effort in aligning AI initiatives with even tactical organizational priorities was so evident that I was unsurprised when the strategic discussion turned out to be vacuous.
This marked yet another instance in a series of similar discussions I’ve had over the years, highlighting a fundamental gap in understanding—especially at the senior leadership level—of how to develop viable AI-powered products and services. This is particularly evident in the case of GenAI. Often, rudimentary internal projects serve as the basis for overhyped claims, without ensuring that the “findings” can be successfully translated to production. In some cases, these claims rely on isolated internal “domain-specific” benchmark datasets. In others, they stem from simple experimentation conducted by or within the team. One example of the latter approach is a “study” conducted by P&G and Harvard, purportedly “demonstrating” how GenAI enhanced the teams’ innovation capabilities for new idea generation. However, this experiment was based on a trivial, unscientific exercise, lacking any substantial empirical evidence. From the study (emphasis mine):
The experiment was conducted as a one-day virtual product development workshop, involving 811 participants from P&G’s Commercial and R&D functions.
Such internal exercises occur far too often in large organizations, conducted in the name of innovation but yielding little in terms of concrete progress. Even without scientific, verifiable, or empirical data to support their conclusions—let alone their applicability to real-world practice—these studies are often heralded as definitive evidence of the transformational potential of GenAI. This exemplifies not only the measurement of irrelevant metrics but also the establishment of unrealistic use cases to fabricate evidence. It further underscores the hype culture that permeates AI discourse, undermining genuine, nuanced, and evidence-based discussions about the potential and challenges of AI.
The passing question regarding acceptable hallucination levels, and the ensuing discussion, shed light on several critical aspects:
There appears to be a desperation to show results with GenAI—to turn promises into reality. However, the struggle to articulate a clear “to-be” state persists.
Many companies continue to engage in context-less discussions about AI.
Organizations frequently fail to hire and promote the right leaders—from managers to senior AI leadership—to conceptualize, lead, and drive strategic initiatives. Unsurprisingly, this leads to incomplete or irrelevant insights and recommendations. Moreover, initiatives backed by substantial investments often result in failures (ironically, these leaders tend to fail upwards).
The gap in understanding how to translate AI into tangible value is so wide that organizations risk getting bogged down in minutiae, with nothing meaningful to show for the AI hype they themselves created. Consequently, in many cases, GenAI ends up as a more elaborate (and expensive) form of automation that could have been achieved with conventional, non-generative AI at a significantly lower cost.
Companies continue to struggle with a basic understanding of AI metrics. This issue existed in the pre-GenAI era as well. However, the gap has been exacerbated by external rhetoric and the unrealistic promises made by CEOs. As a result, significant execution roadblocks arise, leading to wasted efforts, high sunk costs, and growing technical debt.
I have covered various aspects of AI value realization earlier, in GenAI and broader AI-transformation contexts. I have also emphasized the need to focus on real-world metrics for an honest assessment of GenAI initiatives’ success.
During the discussion, an even more fundamental issue stood out. The leader’s question, “What levels of hallucinations in the GenAI models are acceptable?” underscored how some singular assessments tend to anchor go/no-go decisions on AI adoption. Unfortunately, this is not an isolated case. I have observed teams fixating on such ad-hoc metrics with no application context—let alone a solid understanding of what systems and products should be measured against in real-world deployment scenarios.
For example, when evaluating accuracy metrics or metrics like hallucination levels, application context is paramount. Hallucinations might seem inconsequential in a toy project, but incorrect responses in large-scale, real-world deployments are a completely different challenge. Imagine a chat agent providing erroneous actions or recommendations to users: even a 1% hallucination rate can have serious business ramifications. A 1% error rate across 1 million customers means roughly 10,000 users receiving incorrect responses; compounded over repeated interactions, that could make the offering unsustainable.
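To make the arithmetic concrete, here is a minimal back-of-the-envelope sketch. Every number in it (interactions per user, cost per incorrect response, value per correct interaction) is a hypothetical assumption; it only illustrates how quickly a "small" error rate compounds at scale and how the tolerable rate depends on the cost of being wrong.

```python
# Back-of-the-envelope impact of a "small" error rate at deployment scale.
# All numbers below are hypothetical and only illustrate the reasoning.

customers = 1_000_000          # active users served by the chat agent
interactions_per_month = 5     # assumed average interactions per user
error_rate = 0.01              # 1% incorrect/hallucinated responses
cost_per_error = 25.0          # assumed blended cost (support, churn, remediation), in dollars
value_per_interaction = 0.40   # assumed value per correct interaction, in dollars

monthly_interactions = customers * interactions_per_month
monthly_errors = monthly_interactions * error_rate
monthly_error_cost = monthly_errors * cost_per_error
monthly_value = monthly_interactions * (1 - error_rate) * value_per_interaction

# Break-even error rate: the point at which expected error cost per interaction
# equals the value a correct interaction generates.
break_even_rate = value_per_interaction / (value_per_interaction + cost_per_error)

print(f"Errors per month:      {monthly_errors:,.0f}")
print(f"Cost of errors/month:  ${monthly_error_cost:,.0f}")
print(f"Value created/month:   ${monthly_value:,.0f}")
print(f"Break-even error rate: {break_even_rate:.2%}")
```

With these assumed figures, a 1% rate already sits near the break-even point; a higher cost per error pushes the tolerable rate well below what benchmark leaderboards typically celebrate.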
Safety-critical and mission-critical systems—such as those in defense or autonomous driving—are entirely unsuitable for deployment under such unreliable conditions. Despite the inherently weak validation paradigms for GenAI, the lack of focus on real needs (e.g., appropriate metrics) is exacerbating the difficulties in achieving successful GenAI-powered products and services. Basing crucial decisions on partial analyses or irrelevant metrics can prove disastrous.
Questions such as, "What levels of hallucinations are acceptable? How can we ensure those thresholds? Would the levels observed on benchmarks hold true in the real world?" may initially seem like nice-to-have considerations. However, for deployed offerings, these concerns become critical, and the answers are often subjective. This is where understanding the context becomes essential, and why focusing solely on metrics like hallucination rates in benchmark scenarios is utterly meaningless.
In the broader context of AI, there has been a plethora of commentary on focusing on specific aspects of AI performance to drive business impact. These suggestions include optimizing parameters on an operating curve (e.g., ROC), refining loss functions (such as balancing false positives), assigning costs to AI model decisions, establishing guardrails for model outcomes, managing hallucinations in GenAI algorithms, and more. While these are undoubtedly useful suggestions and well-intentioned efforts to achieve results aligned with the promise of AI, they do not, by themselves or in isolation, lead to real-world success.
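As one illustration of the "assign costs to model decisions" idea above, the sketch below chooses a classifier's operating threshold by minimizing expected misclassification cost rather than raw accuracy. The scores, labels, and cost figures are synthetic assumptions, not drawn from any real system.

```python
import numpy as np

# Illustrative only: choose a classifier's operating point by expected cost,
# not by accuracy. Scores and labels are synthetic; costs are assumed.
rng = np.random.default_rng(0)
n = 10_000
labels = rng.random(n) < 0.05                  # 5% positives (e.g., true process faults)
scores = np.where(labels,
                  rng.normal(0.7, 0.15, n),    # model scores for positives
                  rng.normal(0.4, 0.15, n))    # model scores for negatives

COST_FP = 1.0    # assumed cost of a false alarm
COST_FN = 20.0   # assumed cost of a missed fault

thresholds = np.linspace(0, 1, 101)
costs = []
for t in thresholds:
    preds = scores >= t
    fp = np.sum(preds & ~labels)
    fn = np.sum(~preds & labels)
    costs.append(fp * COST_FP + fn * COST_FN)

best = thresholds[int(np.argmin(costs))]
print(f"Cost-minimizing threshold: {best:.2f} (expected cost {min(costs):,.0f})")
```

Useful as this is, it remains a model-level exercise: the cost values plugged in have to come from the surrounding system and business context, which is exactly where most efforts fall short.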
To achieve meaningful outcomes, it is essential to understand the system-level metrics (for products or services) that align with business priorities.
This issue isn’t confined to the business world; efforts on the social front face similar challenges and suffer from similarly short-sighted approaches. The challenges GenAI (and AI more broadly) pose in societal contexts are well documented, spanning information distortion, crime, labor market disruptions, discrimination, consumer exploitation, and industry upheaval. However, the solutions proposed to address these challenges are often presented without proper context.
There is an abundance of fragmented solutions, such as deepfake identification, setting standards for AI model deployment, correcting biases, and promoting transparency. Yet, even when these solutions are framed within real-world challenges, they frequently lack the necessary context for achieving meaningful outcomes. I have previously delved into these aspects in detail.
Both business and social arguments share a common assumption: that fine-tuning AI model performance and measuring accepted metrics can adequately address the broader issues of AI integration across various domains. As a result, the proposed solutions often focus on simplified evaluation metrics—and it’s easy to see why. These metrics are not only simpler to measure but also, in isolation, easier to optimize.
However, determining the real-world utility of any AI-driven offering is far more complex. Success—and the promised high impact—within real-world use cases often requires intricate optimization processes that account for multiple, relevant, and potentially conflicting metrics.
The isolated measures mentioned above (such as hallucination levels on a benchmark) are undeniably useful in the practice of AI. These metrics for evaluating scientific progress and the performance of AI models (both GenAI and non-GenAI) serve as vital directional markers and are highly relevant tools in the evaluation repertoire. As I mentioned at the beginning, I have co-authored a comprehensive book on evaluating learning algorithms, delving deeply into these measures (an expanded edition of the book, authored by Nathalie Japkowicz and Zois Boukouvalas, has just been released).
That said, there are significant disparities between the AI metrics commonly used in academic or scientific settings and the real-world metrics that drive AI adoption and integration—often rendering the former less meaningful in isolation. To achieve meaningful outcomes, it is crucial to understand and address this gap between metrics for scientific or technical development and those required for real-world application and adoption.
Let me give two examples to illustrate this point, one from a business perspective and the other from a social-outcomes perspective:
Example 1: Manufacturing Optimization
In the first example, consider a manufacturing optimization use case. In any sophisticated manufacturing line, there are often hundreds or even thousands of processes involved. The more complex the product, the greater the number of processes, quality checks, and process calibrations. For instance, assembling an automotive part might involve tens to a few hundred steps, whereas manufacturing complex chips such as GPUs or control units can involve thousands (or even tens of thousands) of highly precise steps, each with minimal room for error.
AI models have been used for various manufacturing optimization tasks. One such use case involves detecting process divergence (deterioration) and pinpointing the root cause, which can span multiple previous processes. Having witnessed and advised on such use cases, I often see AI teams focusing on metrics like mean squared error (MSE) to train and evaluate AI models—for example, to model the underlying process or assess deviations from expected outcomes. However, while models may continue to improve, they often fail to deliver tangible business impact—whether in manufacturing optimization, cost reduction, or process reliability.
From a technical viewpoint, teams often define metrics that are technically precise but not practically meaningful. The business context demands a shift in perspective. Although I won’t delve into the details of how to build this perspective or make the right choices—topics I’ve covered elsewhere—it’s crucial to emphasize that achieving business outcomes requires focusing on what truly matters and designing systems to optimize those priorities.
Notice the shift from AI models to systems here: AI models are typically components within larger systems that support specific functionalities. Understanding the connection between the system and its underlying AI model(s) is vital to ensuring the right metrics are prioritized. For instance, when a process failure flag is raised, manufacturing units face several challenges: identifying the cause of the alarm, diagnosing the process failure, analyzing root causes, resolving issues, and minimizing yield impact during resolution.
In modern manufacturing processes, false alarms incur significant costs, while delays in resolution can lead to adverse yield impacts. Similarly, failing to catch divergence promptly risks material loss. This exemplifies a major gap between AI system design and real-world requirements.
AI efforts often focus on individual model metrics—such as accuracy in identifying the root cause or addressing other mitigation aspects—while neglecting critical cost pressures and, consequently, cost-saving opportunities. Metrics like reducing false process alarms, minimizing latency in system chains, and ensuring timely model engagement are often overlooked in system design. As a result, even if models optimize technical metrics like MSE extensively, they rarely address the actual needs of the business. This mismatch underscores the gap between metric selection and real-world priorities in AI adoption.
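To make the mismatch concrete, here is a hypothetical sketch in which the model with the better MSE turns out to be the worse choice once false-alarm cost, missed-divergence cost, and detection latency are priced in. All model names and figures are invented for illustration.

```python
# Hypothetical illustration: two divergence detectors for a process line.
# Model A has lower MSE on reconstructing the process signal; Model B is
# "worse" technically but raises fewer false alarms and flags divergence sooner.

ALARMS_PER_WEEK = {"model_A": 40, "model_B": 12}       # false process alarms raised
MISSED_DIVERGENCES = {"model_A": 1, "model_B": 2}      # real divergences missed per week
MEAN_HOURS_TO_FLAG = {"model_A": 6.0, "model_B": 2.5}  # detection latency on true events
MSE = {"model_A": 0.021, "model_B": 0.034}             # offline reconstruction error

COST_FALSE_ALARM = 3_000      # assumed cost of investigating a false alarm
COST_MISSED_EVENT = 80_000    # assumed scrap/yield cost of a missed divergence
COST_PER_HOUR_LATE = 5_000    # assumed yield loss per hour of detection delay

def weekly_system_cost(m):
    # System-level cost: what the manufacturing line actually pays for.
    return (ALARMS_PER_WEEK[m] * COST_FALSE_ALARM
            + MISSED_DIVERGENCES[m] * COST_MISSED_EVENT
            + MEAN_HOURS_TO_FLAG[m] * COST_PER_HOUR_LATE)

for m in MSE:
    print(f"{m}: MSE={MSE[m]:.3f}, weekly system cost=${weekly_system_cost(m):,.0f}")
```

Nothing about this cost function is exotic; the hard part is organizational: someone has to own the mapping from model behavior to line-level costs, and models have to be selected and monitored against it.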
Example 2: Fixing Social Media Recommendation Challenges
Let’s consider another example with social outcomes at stake: social media content recommendation. When designing content recommendation algorithms, teams have typically optimized ranking metrics (e.g., accuracy of the top-x% items used to build user feeds), weighted recommendations toward the most popular content, and similar measures. However, social media has brought challenges ranging from mental health issues in young adults to increased susceptibility to confirmation-reinforcing content, as well as heightened exposure to mis- and disinformation.
Various suggestions and approaches have been proposed by both academia and industry to address these issues—for instance, models to detect fake content, identify sensitive content, ascertain the veracity of sources, and more. However, there remains an inherent gap—or, more accurately, a misalignment—between what content recommendation systems prioritize in real-world operation and what these independent models are optimized for.
From a business perspective, metrics that drive revenue—such as user attention and engagement—ultimately define the “success” of a social media platform (where success is measured by the service’s monetization ability). This misalignment is a critical factor that influences the feasibility of achieving desired societal outcomes. It is essential to minimize the gap between the metrics used to optimize narrow-context AI models designed to address these challenges and the metrics that define the performance of the overall business system.
Without such alignment, even the best proposals will likely fail to achieve meaningful outcomes. Regardless of how effective an individual model may be—whether in detecting fake content or identifying sensitive material—if system-level metrics (such as engagement) override these objectives, the same challenges will persist.
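A toy sketch of this misalignment: even a perfect misinformation classifier changes little if the system-level ranking objective assigns its output negligible weight. The item names, engagement scores, risk scores, and weights below are all invented.

```python
# Hypothetical feed-ranking sketch: a misinformation classifier produces a risk
# score, but the system-level objective is predicted engagement. Unless the
# risk score carries real weight in the final ranking, it changes little.

items = [
    # (item_id, predicted_engagement, misinformation_risk)
    ("verified_news",   0.30, 0.02),
    ("friend_update",   0.45, 0.01),
    ("outrage_bait",    0.90, 0.75),
    ("conspiracy_clip", 0.85, 0.95),
]

def rank(items, risk_weight):
    # Final score = predicted engagement minus a penalty for flagged content.
    scored = [(eng - risk_weight * risk, item) for item, eng, risk in items]
    return [item for _, item in sorted(scored, reverse=True)]

print("risk_weight=0.0 :", rank(items, 0.0))   # pure engagement: risky items rank on top
print("risk_weight=1.0 :", rank(items, 1.0))   # risk actually priced into the ranking
```

The decisive question is not the classifier's precision but how much ranking weight (and therefore engagement and revenue) the platform is willing to trade away; that is a system-level and business-level decision, not a modeling one.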
—
The above two are not isolated examples.
Misalignment between individual algorithmic metrics and system-level metrics and priorities is a common occurrence.
Organizations often rely on initial metrics of algorithmic performance and extrapolate them to evaluate production performance. Additionally, conflating actual business needs with algorithmic performance assessments leads to the optimization of irrelevant metrics. Misalignment between technical and business metrics frequently results in AI project failures—this is one of the most common reasons why AI transformation initiatives fail to materialize.
Failing to prioritize business-relevant goals for products and services leads to offerings that lack maturity and readiness. The same holds true for AI investments. For example, increased adoption of AI tools within an organization is not necessarily an indicator of realizable business value.
It is essential for organizations to address this misalignment, avoid hype and distracting narratives, and account for various competing business priorities. Additionally, solution designs must consider the application context and incorporate practical, feasible choices that reflect actual priorities.