
Navigating the LLM Landscape: The Importance of Benchmarking in GenAI

10 min read
Karthik Iyer

Key takeaways

  • It's important for organizations to create custom and independent benchmarks for LLMs to address specific use cases.
  • Model makers' documentation has limitations.
  • There are significant benefits to creating a custom evaluation framework for LLMs.

(Editor's note: This post is the first in a three-part series that discusses why and how we created and implemented an evaluation model for LLMs that GoDaddy uses when creating GenAI applications. Check back later for parts two and three.)

Large Language Models (LLMs) have burst onto the scene of GenAI applications, offering a multitude of text processing capabilities that were previously not available with pre-GenAI natural language processing (NLP) models. In part one of this series, we discuss why it's important for organizations to benchmark and evaluate each model's individual capabilities and shortcomings for specific use cases, rather than rely only on an LLM model maker's published documentation. In parts two and three, we'll go into more detail on the specific criteria, processes, scoring, tools, and implementations we use to evaluate LLMs and ultimately incorporate them into applications. This effort to standardize evaluations of different LLMs based on their capabilities and shortcomings for specific use cases, together with the tooling required to support it, resulted in the creation of an "LLM evaluation workbench".

LLMs are capable of a huge variety of tasks such as summarization, reasoning, information extraction, sentiment analysis, and question answering. These capabilities are now available within a single, large pre-trained model and can be invoked with human-friendly natural language prompts. No special programming or ML skills are needed to leverage them!

This means that LLMs can now power scenarios that previously relied on less capable classical NLP models. In addition, LLMs can emulate workflows from natural language instructions that previously had to be coded imperatively by hand. Because LLMs handle a wide range of scenarios out of the box, they eliminate the need to train and manage bespoke ML models for task-specific capabilities. For some use cases, this has enabled more agile product development, leaner engineering efforts, and shorter time to market for desirable product features such as agentic chatbots for customer support, marketing, upsell, web content generation, shopping concierges, expert reasoning agents, and intelligent search.

Along with the promise come a few pitfalls of leveraging these powerful models. Prime among them are hallucination, the generation of biased or harmful content, non-deterministic reasoning, and vulnerability to jailbreaking. Additionally, behavioral challenges like slow response times can dampen customer interest. Finally, the cost of LLM use must be recognized as a risk that can upset ROI expectations.
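
As a back-of-envelope illustration of how usage costs can scale, the sketch below estimates a monthly bill from assumed per-token prices and traffic volumes. Every number here is a hypothetical placeholder, not an actual rate for any provider.

```python
# Rough cost model for LLM usage; all prices and volumes are hypothetical placeholders.
PRICE_PER_1K_INPUT_TOKENS = 0.003   # assumed rate, in USD
PRICE_PER_1K_OUTPUT_TOKENS = 0.015  # assumed rate, in USD

requests_per_day = 50_000           # assumed traffic
avg_input_tokens = 1_500            # prompt plus retrieved context
avg_output_tokens = 400             # model response

daily_cost = requests_per_day * (
    avg_input_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS
    + avg_output_tokens / 1000 * PRICE_PER_1K_OUTPUT_TOKENS
)
print(f"estimated monthly cost: ${daily_cost * 30:,.2f}")
```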

Adopting LLMs offers companies a huge upside, but it also carries a substantial risk profile: a poorly behaved model can damage a company's brand, erode customer trust, hurt customer satisfaction, and cause financial loss.

Why do we need benchmarks for LLMs?

Given this challenging landscape, we naturally need to collect more data on LLM capabilities. How capable is an LLM at summarization? How vulnerable is it to jailbreaking? Can it wander off topic and be abused by malevolent actors? These questions crystallize into a need for language model cards that surface metrics on a model's capabilities. This information is already available in the industry to some extent: model makers typically publish a model card for each model they release. An LLM model card is a structured document or metadata file, released by the model maker, that provides readers with a view of the LLM's capabilities. Details typically include the model architecture, intended use, training data, training process, and evaluation data. There is no normative standard for what information must be included in a model card, and for proprietary models, one or more of these sections may not be released to the general public. However, it is fair to assume that at least usage-related information such as evaluation metrics, intended use, and limitations will be covered.
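
For concreteness, here is a minimal, hypothetical sketch of the kind of information a model card conveys. The field names and values are our own illustration, not any model maker's actual schema.

```python
# Hypothetical sketch of the information a model card typically conveys.
# Field names and values are illustrative, not any model maker's actual format.
model_card = {
    "model": "example-llm-v1",                      # placeholder name
    "architecture": "decoder-only transformer",     # often described only at a high level
    "intended_use": ["summarization", "question answering"],
    "training_data": "proprietary mix of public and licensed sources",
    "training_cutoff": "2023-08",
    "evaluations": [
        {"benchmark": "MMLU", "metric": "accuracy", "score": 0.86},  # made-up number
    ],
    "limitations": ["may hallucinate", "may reflect biases in training data"],
}
```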

Model cards, authored by model makers, are great starter resources for getting an overview of an LLM's capabilities. In the following sections we'll review the Claude 3 model card to highlight some of these concerns by providing examples. We selected this model card for illustration purposes because:

  • The Claude 3 series is a well-known set of LLMs with good performance that is used in many GenAI applications.
  • Anthropic, the company that produced the Claude family of models, is well known for its proactive stance on making safe, trustworthy, and reliable LLMs. We chose the Claude 3 model card as representative of a high bar that all LLM makers should aspire to. Yet we argue that even this high bar falls short of addressing various concerns, and the gap is only wider for other LLM makers, making the need for an independent LLM benchmark even more urgent.

Unclear training provenance

Section 2.5 (Training Data) of the model card lists how Anthropic obtained its training data. However, the sources are described only at a very high level:

Claude 3 models are trained on a proprietary mix of publicly available information on the Internet as of August 2023, as well as non-public data from third parties, data provided by data labeling services and paid contractors, and data we generate internally.

There is no mention of what exactly the training data comprises, how it was cleaned, or whether checks were implemented to control for inherent bias, factual inaccuracies, and propaganda. There's also no way to verify the provenance of the training data or the training process to ensure the model hasn't picked up undesirable behavior, given the generalized descriptions in the model card.

Limited data on test results

Model makers publish limited test data, which may provide insufficient detail on a model's failure points. While a model may behave well for most well-known scenarios, it should also behave predictably for scenarios in the neighborhood of a tested scenario and resist erratic behavior when encountering a completely new one.
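
To make the idea of "neighborhood" behavior concrete, here is a minimal sketch that probes a model with paraphrased variants of a single tested question and checks that the answers stay consistent. The `call_llm` function is a hypothetical placeholder, mocked here; a real run would swap in a provider SDK call.

```python
# Illustrative sketch: probe stability by asking semantically equivalent variants
# of a question the model card claims the model handles well.
# call_llm is a hypothetical placeholder; wire it to your provider's SDK.

def call_llm(prompt: str) -> str:
    # Mocked response for the sketch; a real client call goes here.
    return "Paris"

variants = [
    "What is the capital of France?",
    "Name the capital city of France.",
    "France's capital is which city?",
]

answers = {call_llm(v).strip().lower() for v in variants}

# A well-behaved model should give one consistent answer across near-identical prompts.
print("consistent across paraphrases:", len(answers) == 1)
```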

For example, Table 1 of the Claude 3 model card indicates that the more powerful Opus performs better than the less powerful Sonnet on most benchmarks. Yet for the PubMedQA test suite (biomedical questions), Sonnet performs significantly better despite being a weaker model. Because only limited information is provided in the model card, there is no way to drill down into the specifics and verify the results. Similarly, Section 5.5 (Human Preferences on Expert Knowledge and Core Capabilities) highlights improvements in Claude's performance over previous versions based on human preference feedback. Only the performance of Claude 3 Sonnet is shown relative to Claude 2.1 and Claude 2.0; no data is made available for the other Claude 3 variants (Haiku and Opus). This means we have no ability to make a judgment call about the Haiku and Opus variants for the same scenarios.

Finally, model evaluations are not made available in a normalized, machine-readable format. Hence, we cannot use data visualization software to interactively slice and dice the evaluation data according to the needs of our ML scientists and GenAI app developers.
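
For illustration, the sketch below shows the kind of normalized, per-question record we would want: flat rows that dataframe tools or BI dashboards can group and filter. The field names and scores are hypothetical.

```python
# Hypothetical normalized evaluation records: one flat row per (model, task, question),
# the shape that dataframe tools and BI dashboards need for slicing and dicing.
import json
from collections import defaultdict

records = [
    {"model": "model-a", "task": "summarization", "question_id": "q-001", "score": 0.82},
    {"model": "model-a", "task": "question_answering", "question_id": "q-101", "score": 0.91},
    {"model": "model-b", "task": "summarization", "question_id": "q-001", "score": 0.74},
]

# Aggregate to a per-model, per-task view without any special tooling.
totals, counts = defaultdict(float), defaultdict(int)
for r in records:
    key = (r["model"], r["task"])
    totals[key] += r["score"]
    counts[key] += 1

for key in sorted(totals):
    print(key, round(totals[key] / counts[key], 3))

# Each record is trivially serializable, so it can feed custom or commercial viewers.
print(json.dumps(records[0]))
```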

Potential conflict of interest

There is a potential conflict of interest, since model makers may downplay the risks and drawbacks of their models and exaggerate the benefits. This is a general challenge in the industry because model makers release their own model cards, rather than having a central body evaluate each LLM objectively.

Lack of a standardized process to evaluate models

In Section 6.1 (Responsible Scaling Policy), Anthropic describes its class-leading framework for assessing and mitigating potential catastrophic risks from AI models. While this does a great job of assessing risk for the Claude family of models, it is not an industry standard, so comparable evaluations are not available for other public LLMs. This lack of standardized evaluation makes it hard to compare and contrast the capabilities of LLMs with each other.

Overreliance on generalized test benchmarks

Section 5.1 (Reasoning, Coding, and Question Answering) covers the evaluations of different capabilities. They utilize various well-known benchmarks already in the public domain, such as GPQA, MMLU, and PubMedQA, among others. Because these benchmarks are publicly available, they are prone to being inadvertently leaked into the training data during the model training phase. This may give false positives about how well the model performs at evaluation time, since the standard principle of segregating test and training data may be inadvertently violated. In addition, the generalized scenarios may not reflect the capabilities we seek when building GenAI apps, such as the ability to hold focused conversations in a particular domain or to reason in a domain-specific manner, for example about GoDaddy products when helping with customer support issues.

Benefits to creating a custom evaluation framework

GoDaddy is currently engaged in vigorous GenAI efforts. Our GenAI use cases focus on customer chatbots, intelligent search, web content generation, and product support. While these use cases overlap with publicly available test suites, those suites do not provide a high-fidelity signal for making well-informed judgment calls when picking the most appropriate model for our products and services. The previous section highlighted the shortcomings of publicly available model cards using the Claude 3 model card as an example.

To address this, GoDaddy has identified the need to evolve our own standardized evaluation framework, which offers the following benefits:

  • Test scenarios more closely aligned with our product North Star and GoDaddy GenAI projects rather than broad public domain tests. This ensures that the evaluation is of high quality and provides relevant data on a model's capabilities based on the problem domains we are focused on.
  • Since our tests will not be in the public domain, there is no risk of training data contamination where our test scenarios are inadvertently fed to the LLM during training. This ensures that when we run our evaluations we get a good test signal and false positives are mitigated.
  • We will be able to do an apples-to-apples comparison between LLMs from disparate model makers such as Anthropic, Amazon, Google, and OpenAI, because we will use the same test scenarios for all models and produce objective function scores that carry the same semantics (see the sketch after this list).
  • There is no conflict of interest or model-maker-attributed bias since the evaluation criteria are chosen with GoDaddy personas at heart. We have the incentive to be unbiased to every LLM we evaluate, since we seek to deploy the most appropriate model for our GenAI applications, agnostic of model-maker influence.
  • Since we own all the test data in a machine-readable format (JSON), we can plug it into custom viewers or commercial ones like AWS QuickSight. This gives different personas, such as ML scientists, product managers, and prompt engineers, the ability to drill down into results at various levels of abstraction, ranging from highly summarized single objective function scores to results for a single question.
  • The provenance of the test data is completely transparent. We will record the origin of each test scenario as well as the details of its contributors. In addition, we can track the evolution of our test criteria over time while having oversight from various actors such as ML scientists, product managers, and GenAI app developers.
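
As a rough sketch of what that apples-to-apples comparison could look like in practice, the snippet below runs the same hypothetical test scenarios against several placeholder models and records every score in one shared JSON shape. The model names, `call_model`, and `evaluate_one` are assumptions for illustration, not our actual implementation.

```python
# Illustrative sketch: run one shared test suite against multiple models and score
# every response with the same scoring function, so results are directly comparable.
# Model names, call_model, and evaluate_one are hypothetical placeholders.
import json

TEST_SUITE = [
    {"question_id": "gd-001", "task": "customer_support",
     "prompt": "How do I renew a domain?"},
    {"question_id": "gd-002", "task": "web_content",
     "prompt": "Write a tagline for a bakery website."},
]

MODELS = ["model-a", "model-b", "model-c"]  # placeholder identifiers

def call_model(model: str, prompt: str) -> str:
    # Mocked response; a real run would call the relevant provider SDK here.
    return f"[{model}] response to: {prompt}"

def evaluate_one(response: str, scenario: dict) -> float:
    # Placeholder score; a real framework would use rubric- or reference-based scoring.
    return 0.0

results = [
    {
        "model": model,
        "question_id": scenario["question_id"],
        "task": scenario["task"],
        "score": evaluate_one(call_model(model, scenario["prompt"]), scenario),
    }
    for model in MODELS
    for scenario in TEST_SUITE
]

# The same JSON schema for every model enables apples-to-apples comparison downstream.
print(json.dumps(results, indent=2))
```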

By framing the solution in terms of our own evaluation workbench, we empower our prompt engineers, product managers, and GenAI application engineers to make better-informed choices in a data-driven manner.

This post provided a quick overview of the opportunities LLMs offer and their potential pitfalls. It also covered why current model cards produced by model makers don't provide the signal that ML scientists and GenAI app developers need to make informed choices about LLMs. In the following parts of the series, we'll discuss the solution we've designed: an LLM evaluation workbench.