GenAI’s True Worth: Measuring What Really Matters
Benchmarks and anecdotal evaluations hold little significance when it comes to societal and business impact. It's time to shift toward relevant metrics that align with the claims being made.
Almost every week, a new GenAI model beats others on some benchmark. Meanwhile, across many tasks, model performance appears to be converging, while hallucination rates remain a challenge.
What were once measures for assessing the directional “progress” of scientific algorithms (and not definitive ones even then) have now become assumed proxies for the promise of transformational outcomes for society and businesses. Both the media and the industry have elevated excelling on these benchmarks to holy-grail status, treating it as an end in itself. This framing has been used to suggest, tout, and build hyperbolic narratives around the massive transformation these models are claimed to represent across the scientific, engineering, technological, and business aspects of GenAI.
For instance, on the scientific side, there are confounding narratives about the development (or even evolution) of some deep “intelligence.” Not only is there a rush to claim that this intelligence is comparable to human intelligence, but some even suggest it might surpass human capabilities. Terms such as AGI/ASI are thrown around with little regard for consensus definitions, let alone intellectual honesty or scientific integrity. In some cases it gets even worse: when a benchmark carries “AGI” in its name, the media and industry rush to claim AGI because a model excelled at it. I highlighted this disturbing hype culture in my earlier post:
There seem to be few checks and balances, or even rational pushback.
What we currently have is, unfortunately, far less representative of a healthy innovation and scientific ecosystem—one where honest inquiry and questioning are welcome and actively employed to foster informed progress. Instead, candid, well-meaning disagreements, informed questions, and validations—which are necessary elements of intelligent debate—are being disregarded and, on occasion, even penalized.
On the technology and productization front, models outperforming benchmarks are presented as natural precursors of capabilities that still await validation, yet, paradoxically, they are framed as “already achievable” technological breakthroughs. These claims are almost immediately followed by promises of business and societal transformation. For example, a model excelling on question-answering benchmarks is often assumed to signify a technological breakthrough, one that will lead to a paradigmatic shift in the capabilities it enables in real-world applications. Similarly, models demonstrating advances in protein-folding prediction are stretched to imply imminent products that will drive dramatically faster drug discovery.
I have highlighted broader evidence-free claims in my previous post, such as this, this, and this. Unfortunately, media coverage tends to amplify these claims, taking them at face value and challenging neither their basic premise nor the oversimplified consequences of such sweeping assertions.
Adding to this noise is an ever-increasing group of “influencers,” AI celebrities, industry participants, and a myriad of self-proclaimed experts voicing opinions on the state of the science or its impacts on business and society, often based on anecdotal examples. While observing a specific model’s behavior on a particular prompt or query may be intriguing, treating such anecdotes as evidence causes significant damage to the integrity of scientific discourse. Collectively, these are not harmless musings; they actively distract from principled evaluation, validation, and meaningful debate in relevant contexts. The unqualified, sensationalized, click-bait “TikTok-ification” of scientific discourse is undoubtedly one of the more regrettable travesties of our times.
To be fair, neither I nor any of the individuals questioning these trends are inherently AI critics or skeptics. We are not opposed to AI; I have worked in this field my entire life and firmly believe in its potential to help in numerous ways—it already is. The binary framing of AI proponents versus critics is a false dichotomy. What we truly need is a scientifically grounded discussion on the actual progress in AI, its various aspects, and how it can be meaningfully leveraged for economic and social impact. Such discussions don’t require hysteria or aggressive activism, nor do they necessitate questioning every single instance of AI development.
We have witnessed earlier instances of scientific and technological advancement. While there was notable euphoria surrounding them, the core discussions about their capabilities and limitations were relatively more grounded. Actual experts, armed with evidence and insights, collaborated with relevant stakeholders to put these technologies to practical use. Although these advancements caused disruptions, informed discussions, policies, and decisions helped manage them effectively, resulting in outcomes that can arguably be called successes.
However, I am less confident in our current ability—or willingness—to manage technological developments responsibly if we continue on our current trajectory.
That brings me to the real impact of GenAI.
It’s great to see model developers releasing new models with exceptional fanfare based on their benchmark performance. These are, without a doubt, non-trivial developments and should be recognized as such. (Here, non-triviality refers to the broader context of collective AI developments over the past decade, rather than to incremental model releases or performance improvements.) That said, it is about time we shift the discussion to the actual impact of these advancements in the real world. The first step:
Align the assessment metrics with the claims.
Our evaluation metrics, as reported in the coverage of technological developments, do not align with the real-world claims or projections these reports often make. Consequently, they are not necessarily meaningful assessments of the projected capabilities of models in real-world deployments. When paired with claims of business, economic, and social transformation “at a scale never seen before,” it becomes imperative to employ metrics that can provide sensible assessments—and evidence—of Generative AI's impact on these claims, rather than relying solely on benchmarks and anecdotes.
It is well known that benchmark performance holds limited significance in the context of real-world deployments. Creating meaningful business impact requires far more than algorithmic performance. It involves establishing use cases, planning costs and benefits, implementing guardrails for privacy and security, designing systems, integrating models into existing business processes and workflows, maintaining products and services, evaluating the ROI trade-offs of models’ stochasticity, and much more. I have covered some of these aspects with regard to AI and Generative AI in the past.
So far, on the business front, there has been little substantive discussion about the actual business impact or real-world evidence of meaningful products or services. While there are exceptions, they remain few and far between.
When the rubber meets the road, however, we are beginning to see some initial disappointments. For example, take these stories about Salesforce, Microsoft, and Klarna. Or consider these reports from The Information highlighting the challenges companies face in adopting reasoning models, as well as the pressure CIOs experience from AI sellers. These are not issues one would expect from offerings that are supposedly mature and claimed to be in “extremely high demand.”
Hence, it is crucial to decouple promise from impact, as the two often appear conflated in the reports mentioned above. Here are examples from other fields. User adoption is another potentially misleading metric. Unless it is correlated with actual impact—such as increased revenue, realized productivity gains, or verifiable cost reductions—it remains nothing more than a ‘feel-good’ indicator.
Enterprise adoption of GenAI is, unsurprisingly, challenging. For adoption to succeed, certain prerequisites must be met, one of which is the maturity and reliability of the technology. While progress is being made in several areas, GenAI may not yet be fully ready for all applications. It is no surprise that safety- or mission-critical systems are not using it. Even for non-critical applications, there are significant business, economic, and societal considerations for which GenAI’s capabilities have yet to be convincingly vetted.
In any case, beyond the deeper discussions around the impacts or consequences of adopting immature technologies—especially in sensitive areas of society—there remains a more fundamental concern.
It remains unclear where and by how much GenAI is truly moving the needle, or how business impact, adjusted for costs, is being measured. For example, companies’ 10-K reports often offer little insight, or paint a highly confusing picture of GenAI’s value to the bottom line, as highlighted in The Information’s report on Klarna mentioned earlier. This is not to say that GenAI hasn’t created value or that it lacks potential. What I am proposing is a shift toward reporting a clear, evidence-based picture for the benefit of all stakeholders.
Here are some recommendations:
STOP:
Using benchmarks outside of initial scientific assessments or as proxies for real-world impact. These are NOT meaningful metrics to extrapolate as measures of real-world production success.
Touting reasoning capabilities of GenAI models without application context. They are meaningless unless properly contextualized.
Making hyperbolic and nonsensical, evidence-free claims (of both success and failure).
Relying on anecdotal examples as evidentiary data. Anecdotes constitute evidence neither for nor against anything. No technology is proven or matured on the basis of anecdotal data within any given application context, nor should it be. The data-driven community, along with company executives, product and services leads, and decision-makers across the board, should know better.
For example, if someone quickly builds a simple application or even a “PhD-level” dissertation using tools like DeepResearch, or if a child is shown creating software code or completing assignments (as showcased in various social media posts), that does not indicate the underlying capabilities are ready for prime-time production, adoption, and integration in real-world, consequence-adjusted scenarios.
Media and journalists: Please take care to question, verify, and validate claims, particularly when vested or conflicting interests are at play. Rather than accepting these claims at face value, conduct due diligence, challenge spurious or outlandish assertions, involve experts to provide much-needed depth, and uphold journalistic rigor.
Suggestions for Decision Makers:
C-suite (especially CEOs, CFOs): Clearly report the bottom-line impact of GenAI. Filings and analyst reports can benefit from transparently highlighting these metrics, particularly if your companies are making significant investments in the area. Hype is neither a business model nor a strategy. In fact, there may not even be internal consensus on some of the claims being made externally.
Decouple GenAI and non-GenAI AI reporting: Non-GenAI AI approaches have been widely implemented across various areas and have proven their value. Combining them into a single "AI" category can be misleading.
Separate GenAI value creation from other measures, such as "cost savings," unless these savings are directly attributable to GenAI. Cost savings, for example, often obscure broader implications. In many recent cases, they appear to accompany significant layoffs. As it stands, workforce reductions cannot be meaningfully explained as a direct result of GenAI, as this does not seem to be reflected elsewhere in balance sheets or strategies.
Adjust GenAI impact numbers to account for costs, readiness, and any associated business risks (a rough sketch of one such adjustment follows this list).
Clarify maintenance and support costs of GenAI capabilities. These should be broken down and reported more transparently.
Report risk metrics. For example, how does the adoption of GenAI affect the organization’s risk profile (e.g., privacy, security, customer experience, reputation)?
Avoid unsubstantiated, unverifiable, or insupportable claims, whether they relate to GenAI-powered products and services, investments, business plans, internal adoption, venture funding, or overall strategy. Ensure any claims are accompanied by reasonable, evidence-based and realistic timelines.
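To make the adjustment bullet above concrete, here is a minimal, hypothetical sketch of how a cost- and risk-adjusted impact figure could be computed. Every field name, dollar amount, and discount factor below is an illustrative assumption, not a reporting standard or anyone's actual numbers.

```python
# Hypothetical illustration: risk- and cost-adjusted GenAI impact.
# All figures and field names are made up for this sketch.

from dataclasses import dataclass

@dataclass
class GenAIInitiative:
    name: str
    gross_value: float    # claimed revenue lift or cost reduction
    run_cost: float       # inference, licensing, and infrastructure costs
    build_cost: float     # integration, guardrails, and maintenance spend
    risk_discount: float  # 0..1 haircut for privacy/security/reputation exposure
    attribution: float    # 0..1 share of the value actually attributable to GenAI

def adjusted_impact(x: GenAIInitiative) -> float:
    """Net impact after attribution, costs, and a risk haircut."""
    attributable_value = x.gross_value * x.attribution
    net_value = attributable_value - (x.run_cost + x.build_cost)
    return net_value * (1.0 - x.risk_discount)

support_bot = GenAIInitiative(
    name="support-deflection-bot",
    gross_value=1_200_000,  # claimed annual savings
    run_cost=300_000,
    build_cost=250_000,
    risk_discount=0.15,     # e.g., churn or reputation exposure
    attribution=0.6,        # only part of the savings is due to GenAI
)

print(f"{support_bot.name}: adjusted impact ~ ${adjusted_impact(support_bot):,.0f}")
# The headline claim would be $1.2M; the adjusted figure here is roughly $145K.
```

The point of the exercise is not the specific formula but the discipline: stating the attribution share, the full cost base, and the risk haircut alongside the headline number.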
For domain-specific applications, it might be valuable to report progress on metrics that directly correspond to the claims of GenAI-powered transformation. Below are a few example areas and metrics that could help build greater confidence:
Drug Discovery:
Number of meaningful, validated drug candidates identified.
Number of drugs advanced to clinical trials.
Quality of outputs.
Opportunity costs saved.
Chatbot Integrations:
Demonstrable improvements in business metrics, e.g., chatbot integration resulting in measurable cost reductions, adjusted for risks (such as customer frustration, disengagement, and GenAI-related overheads); see the sketch after this list.
Impact on customer support metrics, including reductions in human interventions.
Customer satisfaction rates.
Sophistication of requests successfully handled.
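As a concrete illustration of the chatbot metrics above, here is a minimal sketch of how logged conversations could be turned into a deflection rate, a reopen rate, and a risk-adjusted savings figure. The log fields, per-ticket cost, and reopen penalty are assumptions for the example, not a real support platform's schema.

```python
# Hypothetical sketch: deriving chatbot metrics from conversation logs.
# Field names (resolved_without_human, csat, reopened) are assumptions.

from statistics import mean

conversations = [
    {"resolved_without_human": True,  "csat": 4, "reopened": False},
    {"resolved_without_human": False, "csat": 2, "reopened": False},
    {"resolved_without_human": True,  "csat": 5, "reopened": True},
    # ... in practice, thousands of logged conversations
]

COST_PER_HUMAN_CONTACT = 6.00  # assumed fully loaded cost of a human-handled ticket
REOPEN_PENALTY = 1.00          # assumed rework cost when a "deflected" ticket comes back

deflection_rate = mean(c["resolved_without_human"] for c in conversations)
reopen_rate = mean(c["reopened"] for c in conversations if c["resolved_without_human"])
avg_csat = mean(c["csat"] for c in conversations)

# Risk-adjusted savings: credit only deflections that stay resolved,
# and charge a penalty for the ones that bounce back.
true_deflections = sum(
    1 for c in conversations if c["resolved_without_human"] and not c["reopened"]
)
adjusted_savings = (
    true_deflections * COST_PER_HUMAN_CONTACT
    - sum(c["reopened"] for c in conversations) * REOPEN_PENALTY
)

print(f"deflection rate: {deflection_rate:.0%}, reopen rate: {reopen_rate:.0%}, "
      f"avg CSAT: {avg_csat:.1f}, adjusted savings: ${adjusted_savings:.2f}")
```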
GenAI for Coding:
Degrees of code completion with verifiable impact.
Reduction in human supervision needs.
Actual effort saved and sophistication of generated code.
Ability to create production-quality code.
Reduction in rework needs and impact on related dependencies (see the sketch after this list for one way to track such rates).
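Here is a similarly hedged sketch of how rework and review-effort rates for AI-assisted code might be tracked, assuming a team labels AI-assisted changes in its version-control or review system. The record fields and numbers are illustrative only.

```python
# Hypothetical sketch: comparing AI-assisted changes against a baseline.
# The record fields (ai_assisted, reverted, follow_up_fixes, review_rounds)
# are illustrative, not a real tool's schema.

changes = [
    {"ai_assisted": True,  "reverted": False, "follow_up_fixes": 1, "review_rounds": 2},
    {"ai_assisted": True,  "reverted": True,  "follow_up_fixes": 0, "review_rounds": 3},
    {"ai_assisted": False, "reverted": False, "follow_up_fixes": 0, "review_rounds": 1},
    # ... in practice, pulled from the version-control / code-review system
]

ai_changes = [c for c in changes if c["ai_assisted"]]
baseline = [c for c in changes if not c["ai_assisted"]]

def rework_rate(group):
    """Share of changes that were reverted or needed follow-up fixes."""
    if not group:
        return 0.0
    flagged = sum(1 for c in group if c["reverted"] or c["follow_up_fixes"] > 0)
    return flagged / len(group)

def avg_review_rounds(group):
    """Average number of review rounds, a proxy for supervision effort."""
    return sum(c["review_rounds"] for c in group) / len(group) if group else 0.0

print(f"AI-assisted rework rate:   {rework_rate(ai_changes):.0%} "
      f"(baseline {rework_rate(baseline):.0%})")
print(f"AI-assisted review rounds: {avg_review_rounds(ai_changes):.1f} "
      f"(baseline {avg_review_rounds(baseline):.1f})")
```

Reporting such rates against a non-AI baseline, rather than counting lines of generated code, is closer to the "verifiable impact" the list above calls for.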
Agentic AI:
Agents performing non-harmful and useful tasks.
Reliability metrics for tasks performed.
Sophistication of agents, such as their ability to:
Leverage pre-existing capabilities and integrate them cohesively.
Identify novel, verifiable information and insights.
For various domains, it is essential to focus on reliability metrics. Examples include repeatability, robustness, and risks associated with contaminating products or operational environments. This may involve concerns such as introducing bad data or hallucinated data (see this and this for such examples).
Finally, it is essential to include systematic and relevant metrics of failure in application settings—specifically highlighting where models do not perform as expected. These go beyond metrics like hallucination rates and instead focus on actual failures in deployed scenarios, categorized by their degrees of criticality. In essence, there is an urgent need for robust validation and verification processes.
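As one possible shape for such failure reporting, here is a minimal sketch that groups incidents from a deployed system by an assumed criticality scale and notes how many were caught by monitoring rather than by users. The severity tiers and incident fields are illustrative assumptions, not a standard taxonomy.

```python
# Hypothetical sketch: aggregating deployment failures by criticality.
# Severity tiers and incident fields are assumptions for illustration.

from collections import Counter

SEVERITY_ORDER = ["cosmetic", "degraded_experience", "wrong_action", "safety_or_compliance"]

incidents = [
    {"severity": "degraded_experience", "detected_by": "user"},
    {"severity": "wrong_action",        "detected_by": "monitoring"},
    {"severity": "cosmetic",            "detected_by": "monitoring"},
    # ... in practice, pulled from incident tracking for the deployed system
]

by_severity = Counter(i["severity"] for i in incidents)
caught_automatically = sum(1 for i in incidents if i["detected_by"] == "monitoring")

for sev in SEVERITY_ORDER:
    print(f"{sev:>22}: {by_severity.get(sev, 0)}")
print(f"caught by monitoring (vs. by users): {caught_automatically}/{len(incidents)}")
```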
Relying on benchmarks and anecdotal “evidence” of GenAI capabilities, and extrapolating them to support claims of transformational impacts on business and society, does a significant disservice. Unfortunately, this has become an acceptable and normalized practice. It is time to pivot away from hyperbolic claims—both positive and negative—about AI capabilities. Instead, we need to foster more nuanced, inclusive discussions on the real state of AI, its progress, implications, and challenges.
The starting point? Honest measurement and transparent reporting of success:
Relevant metrics for respective claims.