Evaluating LLM Applications (2/2)

A Deep Dive Case Study into Methodologies and the RAGAS Library

In our previous article, we explored why systematic evaluation of LLM applications is crucial for enterprise success and introduced RAGAS as a framework for addressing this challenge. Now, we’ll demonstrate how to put these concepts into practice through a detailed technical case study that showcases concrete implementation strategies and results.

While our first article focused on the strategic importance of evaluation for decision-makers, this deep dive is aimed at technical practitioners: developers, ML engineers, and data scientists looking for practical insights on implementing robust evaluation pipelines for their LLM applications.

Case Study: Evaluating an LLM Application for Document-Based Q&A

A Fortune 500 technology enterprise wanted to empower their investor relations team with an AI solution that could quickly extract insights from shareholder communications. The team regularly spent hours manually researching questions from analysts, investors, and internal stakeholders about competitor positioning and market strategy. With quarterly earnings calls approaching and increasing pressure to position their company against market leaders, they needed a solution that could reliably answer complex questions while ensuring factual accuracy.

The business objectives were clear:

  • Reduce research time from hours to minutes
  • Ensure responses reflect only factual information from official documents
  • Provide competitive intelligence insights with proper context and attribution
  • Scale the solution to handle hundreds of inquiries during peak periods

To demonstrate RAGAS in practical application, let’s examine this enterprise LLM application designed to answer questions based on shareholder letters from four major IT companies published in 2023. This case study illustrates how systematic evaluation transforms subjective assessments into quantifiable metrics, providing clear insights for optimization.

The tested system architecture consists of three key components working in concert (see the sketch below):

  • An LLM (gpt-4o-mini, which is fast and cost-effective) serving as the answer generation engine
  • A vector database storing embedded chunks of shareholder letters for information retrieval
  • A reranker component that prioritizes the most relevant content before passing it to the LLM
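
To make this concrete, here is a minimal sketch of how the three components can be wired together in Python. The helper interfaces (vector_store.search, reranker.rerank) and the OpenAI client call for gpt-4o-mini are illustrative assumptions rather than the exact implementation used in the case study; the optional arguments anticipate the three configurations described below.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer_question(question, vector_store=None, reranker=None, top_k=5, rerank_k=3):
    """Answer a question with optional retrieval and reranking.

    Assumed (hypothetical) interfaces:
      vector_store.search(query, k)      -> list[str] of text chunks
      reranker.rerank(query, chunks, k)  -> list[str], most relevant first
    """
    contexts = []
    if vector_store is not None:                      # retrieval-augmented configurations
        contexts = vector_store.search(question, k=top_k)
        if reranker is not None:                      # full configuration with reranking
            contexts = reranker.rerank(question, contexts, k=rerank_k)

    context_block = "\n\n".join(contexts) if contexts else "No context provided."
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Answer strictly from the provided shareholder-letter excerpts. "
                        "If the context does not contain the answer, say so."},
            {"role": "user",
             "content": f"Context:\n{context_block}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content, contexts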

Our evaluation approach tested three progressively enhanced configurations to measure the impact of each component (the snippet below shows how they map onto the pipeline sketch):

  • LLM Alone: The baseline configuration where the model generates responses using only its internal knowledge
  • LLM + Vector Search: Augmenting the LLM with relevant document retrieval
  • LLM + Vector Search + Reranker: The complete system with prioritized content selection
For each configuration, we presented the same 20 questions from a predefined test set. Our test questions ranged from specific data points to more complex analytical queries, all with corresponding ground truth answers extracted from the source documents. Here is an example question:

{
    "question": "What initiatives reduced Amazon's cost to serve per unit in 2023?",
    "reference_answer": "Regionalization and logistics optimization reduced U.S. costs by more than $0.45 per unit"
}

The evaluation pipeline collected the question, the retrieved context (or an empty context for the configuration without retrieval), the generated answer, and the ground truth answer to create a comprehensive evaluation dataset.

{
    "question": "What initiatives reduced Amazon's cost to serve per unit in 2023?",
    "reference_answer": "Regionalization and logistics optimization reduced U.S. costs by more than $0.45 per unit",
    "retrieved_contexts": ["...", "...", "...", "...", "..."],
    "llm_answer": "1. **Fulfillment Network Efficiencies**: Improvements in the fulfillment network helped offset increased costs. 2. **Regionalization**: Re-architecting the network to store items closer to customers improved delivery speeds and efficiency. 3. **Expansion of Same-Day Facilities**: Increasing the number of items delivered same day or overnight by nearly 70% YoY enhanced service efficiency. 4. **Improvements in Operations**: Enhancements to customer self-service features and operational improvements reduced customer service contacts per unit."
}
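
Records shaped like the example above can be converted into the tabular format RAGAS expects. Exact column names depend on the RAGAS release; this minimal sketch assumes the question / contexts / answer / ground_truth schema of the 0.1.x API and a list of dictionaries called records.

from datasets import Dataset

# records: list of dicts shaped like the example above (hypothetical variable)
eval_dataset = Dataset.from_dict({
    "question":     [r["question"] for r in records],
    "contexts":     [r["retrieved_contexts"] for r in records],  # [] for the LLM-only configuration
    "answer":       [r["llm_answer"] for r in records],
    "ground_truth": [r["reference_answer"] for r in records],
})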

This structured dataset was then analyzed using RAGAS, which follows an LLM-as-a-judge evaluation methodology. For objective assessment, we utilized GPT-4o as the judge model, a more powerful and comprehensive model than the one generating the answers. This architectural separation ensures unbiased evaluation across multiple performance dimensions, similar to how human evaluators with greater expertise might assess the work of junior analysts.

This methodology transforms LLM application evaluation from subjective analysis into a structured, data-driven process with clear numerical outputs. The evaluation pipeline automatically processes inputs and outputs through each system component, allowing for controlled comparison between configurations while maintaining consistent testing conditions across trials.

Evaluation Metrics and Results

RAGAS offers specialized metrics designed specifically for RAG system evaluation, focusing on key aspects of reliability and performance:

  • Context Precision: Measures how much of the retrieved information is directly relevant to the query
  • Context Recall: Assesses how comprehensively the system retrieves necessary information
  • Context Entity Recall: Evaluates how well the system captures specific named entities required for accurate responses
  • Faithfulness: Determines whether the generated answer aligns factually with the retrieved data
  • Answer Relevancy: Measures how directly the response addresses the user’s specific query
  • Noise Sensitivity: Quantifies how often the system generates errors due to irrelevant or misleading data

All metrics are normalized between 0 and 1. For most metrics, higher values indicate better performance (a Context Precision score of 0.85 means 85% of the retrieved information is relevant); the exception is Noise Sensitivity, where lower values are preferable (a score of 0.2 indicates the system is less prone to being misled by irrelevant information).
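
With the dataset assembled, a minimal evaluation call could look like the sketch below. Metric objects and the way the judge model is passed differ between RAGAS releases, so treat this as an illustration of the 0.1.x-style API rather than a drop-in snippet; the Noise Sensitivity metric ships with newer releases and can be appended to the metric list analogously.

from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    context_entity_recall,
    faithfulness,
    answer_relevancy,
)
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

# GPT-4o acts as the judge, separate from the gpt-4o-mini answer generator.
judge_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))

scores = evaluate(
    eval_dataset,
    metrics=[context_precision, context_recall, context_entity_recall,
             faithfulness, answer_relevancy],
    llm=judge_llm,
)
print(scores)            # aggregate score per metric
df = scores.to_pandas()  # per-question breakdown for error analysis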

Evaluation metrics visualization across the three configurations

The evaluation results reveal a clear progression in system performance across configurations, with broader patterns emerging at both the system and metric levels.

At the highest level, we observe substantial performance improvements as we move from the baseline LLM to the fully integrated system. The most significant gains appear when adding the reranker component, highlighting its critical role in filtering and prioritizing relevant information.

Looking more closely at specific metrics:

Context Quality Metrics (Recall, Precision, Entity Recall):

As expected, the LLM alone shows no context quality (0.0), since it lacks access to external knowledge. Adding vector search improves recall but with only moderate precision. The reranker significantly improves both precision (0.20 → 0.40) and recall (0.22 → 0.48).

Response Quality Metrics (Faithfulness, Answer Relevancy):

The baseline LLM struggles with faithfulness (0.0) and shows limited answer relevancy (0.15); vector search improves both metrics moderately. The reranker configuration achieves substantial gains in both faithfulness (0.60) and answer relevancy (0.69).

Error Susceptibility (Noise Sensitivity):

The baseline LLM shows low noise sensitivity (0.35), indicating that it tends to give no answer rather than hallucinate when the context is missing. Adding vector search raises noise sensitivity to around 0.60, since the partially relevant retrieved information makes the system more vulnerable to being misled. The reranker brings noise sensitivity back down (0.51), but room for improvement remains.

These results demonstrate that while retrieval augmentation provides a foundation for knowledge grounding, the reranker plays a crucial role in distinguishing between relevant and irrelevant information. By prioritizing the most pertinent context, the reranker helps the LLM focus on generating accurate, targeted responses while reducing the likelihood of incorporating misleading information.

Despite these impressive improvements, the evaluation also reveals opportunities for further optimization. Even in the best configuration, metrics remain below optimal levels, suggesting potential enhancements in several areas:

  • Chunking Strategies: Refining how documents are segmented before embedding could improve retrieval precision
  • Retrieval Parameters: Optimizing vector database configuration for more relevant results
  • System Prompts: Enhancing instructions to the LLM for better context utilization
  • Document Processing: Improving how structured information is parsed and prepared for embedding

This multitude of adjustable parameters presents a perfect opportunity for systematic hyperparameter tuning. By leveraging RAGAS metrics as objective functions, developers can implement grid search, Bayesian optimization, or other systematic search strategies to identify optimal configurations, similar to how traditional machine learning models are tuned. For example, one might optimize chunk size, retrieval k-value, and reranker threshold simultaneously to maximize faithfulness while maintaining minimum thresholds for context precision. This approach transforms LLM application development from intuition-driven experimentation to data-driven optimization.
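
As an illustration of what such tuning could look like, the loop below grid-searches three knobs (chunk size, retrieval k, reranker threshold) and uses the RAGAS faithfulness score as the objective while enforcing a minimum context precision. build_pipeline and run_evaluation are hypothetical project-specific helpers, not RAGAS functions, and the search ranges and the 0.35 precision floor are illustrative.

import itertools

chunk_sizes = [256, 512, 1024]
retrieval_ks = [3, 5, 10]
rerank_thresholds = [0.2, 0.5, 0.8]

best_config, best_faithfulness = None, -1.0
for chunk_size, k, threshold in itertools.product(chunk_sizes, retrieval_ks, rerank_thresholds):
    # Rebuild the RAG pipeline with this parameter combination and rerun the RAGAS evaluation.
    pipeline = build_pipeline(chunk_size=chunk_size, top_k=k, rerank_threshold=threshold)
    scores = run_evaluation(pipeline, test_set)   # returns aggregate RAGAS scores as a dict
    if scores["context_precision"] < 0.35:        # enforce a minimum precision constraint
        continue
    if scores["faithfulness"] > best_faithfulness:
        best_config = {"chunk_size": chunk_size, "top_k": k, "rerank_threshold": threshold}
        best_faithfulness = scores["faithfulness"]

print(best_config, best_faithfulness)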

This case study demonstrates how RAGAS provides quantifiable insights that can guide iterative system improvement, turning subjective observations into actionable development priorities.

Broader Implications and Future Directions

LLM application evaluation is not merely a technical necessity; it is a strategic advantage for organizations deploying these systems in production environments. By establishing continuous monitoring and iterative improvement cycles, organizations can maintain system quality even as usage patterns and requirements evolve.

Key areas for future innovation in LLM application evaluation include:

  • Automated test set creation: The RAGAS team is actively working on reducing the manual effort required to generate ground truth data, potentially using LLMs themselves to create diverse, representative test cases at scale.
  • Long-term performance monitoring: Integrating RAGAS metrics into MLOps workflows could enable detection of performance drift over time, alerting teams when system quality degrades due to data shifts or changing user behaviors.
  • Automated prompt optimization: When other system components remain static, RAGAS can evaluate how prompt modifications affect overall performance. This creates opportunities for automated systems that iteratively refine prompts based on evaluation feedback, continuing until performance meets predetermined quality thresholds (see the sketch below).
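
The prompt-refinement idea in the last bullet can be prototyped as a simple feedback loop. In this sketch, generate_answers, ragas_faithfulness, and revise_prompt are hypothetical helpers standing in for the generation step, a RAGAS evaluation run, and an LLM call that proposes a revised system prompt based on the weakest examples; the target score and iteration budget are arbitrary.

TARGET_FAITHFULNESS = 0.75
MAX_ITERATIONS = 5

prompt = "Answer strictly from the provided shareholder-letter excerpts."
for iteration in range(MAX_ITERATIONS):
    answers = generate_answers(prompt, test_set)          # retrieval and judge model held fixed
    score, worst_examples = ragas_faithfulness(answers)   # aggregate score plus lowest-scoring cases
    print(f"iteration {iteration}: faithfulness={score:.2f}")
    if score >= TARGET_FAITHFULNESS:
        break
    prompt = revise_prompt(prompt, worst_examples)        # LLM proposes an improved system prompt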

Conclusion: The Future of LLM Application Evaluation

As LLMs become embedded in mission-critical enterprise applications, robust evaluation frameworks will separate effective deployments from unreliable ones. The RAGAS library provides an indispensable toolset for both developers and decision-makers, offering a quantifiable, scalable approach to measuring LLM application performance.

Although challenges such as evaluation variability and language-specific biases remain, continuous advancements in LLM assessment will drive the next wave of AI reliability. Businesses that adopt systematic evaluation strategies today will be best positioned to unlock the full potential of LLM technology in the future.

For those looking to dive deeper, the RAGAS documentation and our sample notebook offer further insights into implementation best practices. Whether you’re optimizing retrieval pipelines or making strategic AI investments, understanding LLM application evaluation is key to building truly impactful AI applications.

Further sources: