# Tutorial: Evaluating RAG Pipelines

- **Level**: Intermediate
- **Time to complete**: 15 minutes
- **Components Used**: `InMemoryDocumentStore`, `InMemoryEmbeddingRetriever`, `PromptBuilder`, `OpenAIGenerator`, `DocumentMRREvaluator`, `FaithfulnessEvaluator`, `SASEvaluator`
- **Prerequisites**: You need an API key from an active OpenAI account, because this tutorial uses OpenAI's gpt-3.5-turbo model: https://platform.openai.com/api-keys
- **Goal**: After completing this tutorial, you'll have learned how to evaluate your RAG pipelines with both the model-based and the statistical metrics available in the Haystack evaluation offering. You'll also see which other evaluation frameworks are integrated with Haystack.

> This tutorial uses Haystack 2.0. To learn more, read the [Haystack 2.0 announcement](https://haystack.deepset.ai/blog/haystack-2-release) or visit the [Haystack 2.0 Documentation](https://docs.haystack.deepset.ai/docs/intro).

## Overview

In this tutorial, you will learn how to evaluate Haystack pipelines, in particular, Retrieval-Augmented Generation ([RAG](https://www.deepset.ai/blog/llms-retrieval-augmentation)) pipelines.
1. You will first build a pipeline that answers medical questions based on PubMed data.
2. You will build an evaluation pipeline that makes use of some metrics like Document MRR and Answer Faithfulness.
3. You will run your RAG pipeline and evaluate the output with your evaluation pipeline.

Haystack provides a wide range of [`Evaluators`](https://docs.haystack.deepset.ai/docs/evaluators) which can perform 2 types of evaluations:
- [Model-Based evaluation](https://docs.haystack.deepset.ai/docs/model-based-evaluation)
- [Statistical evaluation](https://docs.haystack.deepset.ai/docs/statistical-evaluation)

We will use some of these evaluation techniques in this tutorial to evaluate a RAG pipeline that is designed to answer questions on PubMed data.

> 🧑‍🍳 As well as Haystack's own evaluation metrics, you can also integrate with a number of evaluation frameworks. See the integrations and examples below 👇
> - [Evaluate with DeepEval](https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/rag_eval_deep_eval.ipynb)
> - [Evaluate with RAGAS](https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/rag_eval_ragas.ipynb)
> - [Evaluate with UpTrain](https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/rag_eval_uptrain.ipynb)

### Evaluating RAG Pipelines
RAG pipelines ultimately consist of at least 2 steps:
- Retrieval
- Generation

To evaluate a full RAG pipeline, we have to evaluate each of these steps in isolation, as well as the pipeline as a whole. While retrieval can in some cases be evaluated with statistical metrics that require labels, it's not straightforward to do the same for the generation step. Instead, we often rely on model-based metrics to evaluate the generation step, where an LLM is used as the 'evaluator'.

![Steps of RAG](https://raw.githubusercontent.com/deepset-ai/haystack-tutorials/main/tutorials/img/tutorial35_rag.png)

#### 📺 Code Along

<iframe width="560" height="315" src="https://www.youtube.com/embed/5PrzXaZ0-qk?si=lgBSfHatbV2i59J-" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>


## Preparing the Colab Environment

- [Enable GPU Runtime in Colab](https://docs.haystack.deepset.ai/docs/enabling-gpu-acceleration)
- [Set logging level to INFO](https://docs.haystack.deepset.ai/docs/setting-the-log-level)

## Installing Haystack

Install Haystack 2.0, [datasets](https://pypi.org/project/datasets/), and `sentence-transformers` with `pip`:

In [1]:
%%bash

pip install haystack-ai
pip install "datasets>=2.6.1"
pip install "sentence-transformers>=3.0.0"


### Enabling Telemetry

Knowing you're using this tutorial helps us decide where to invest our efforts to build a better product, but you can always opt out by commenting out the following line. See [Telemetry](https://docs.haystack.deepset.ai/docs/enabling-telemetry) for more details.

In [None]:
from haystack.telemetry import tutorial_running

tutorial_running(35)

## Create the RAG Pipeline to Evaluate

To evaluate a RAG pipeline, we need a RAG pipeline to start with. So, we will start by creating a question answering pipeline.

> 💡 For a complete tutorial on creating Retrieval-Augmented Generation pipelines, check out the [Creating Your First QA Pipeline with Retrieval-Augmentation Tutorial](https://haystack.deepset.ai/tutorials/27_first_rag_pipeline).

For this tutorial, we will be using [a labeled PubMed dataset](https://huggingface.co/datasets/vblagoje/PubMedQA_instruction/viewer/default/train?row=0) with questions, contexts and answers. This way, we can use the contexts as Documents, and we also have the required labeled data that we need for some of the evaluation metrics we will be using.

First, let's fetch the prepared dataset and extract `all_documents`, `all_questions` and `all_ground_truth_answers`:

> ℹ️ The dataset is quite large. We're using the first 1000 rows in this example, but you can increase this number if you want to.


In [2]:
from datasets import load_dataset
from haystack import Document

dataset = load_dataset("vblagoje/PubMedQA_instruction", split="train")
dataset = dataset.select(range(1000))
all_documents = [Document(content=doc["context"]) for doc in dataset]
all_questions = [doc["instruction"] for doc in dataset]
all_ground_truth_answers = [doc["response"] for doc in dataset]
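
The three lists built above are index-aligned: position `i` in each list comes from the same dataset row, which is what later lets a context serve as the ground truth document for its question. The following is a minimal sketch of that mapping. It assumes two made-up rows (`toy_rows`) and uses plain strings instead of Haystack `Document` objects so that it runs without any downloads.

```python
# Hypothetical rows that mimic the three dataset fields used in this tutorial.
toy_rows = [
    {"instruction": "Is A associated with B?", "context": "A study of A and B.", "response": "Yes."},
    {"instruction": "Does C reduce D?", "context": "A trial of C and D.", "response": "No."},
]

# Same extraction pattern as in the cell above, on the toy rows.
toy_documents = [row["context"] for row in toy_rows]
toy_questions = [row["instruction"] for row in toy_rows]
toy_answers = [row["response"] for row in toy_rows]

# All three lists are built in row order, so index i always refers to the same example.
for i, question in enumerate(toy_questions):
    print(i, question, "->", toy_answers[i], "| context:", toy_documents[i])
```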

Next, let's build a simple indexing pipeline and write `all_documents` into a DocumentStore. Here, we're using the `InMemoryDocumentStore`.

> `InMemoryDocumentStore` is the simplest DocumentStore to get started with. It requires no external dependencies and it's a good option for smaller projects and debugging. But it doesn't scale up so well to larger Document collections, so it's not a good choice for production systems. To learn more about the different types of external databases that Haystack supports, see [DocumentStore Integrations](https://haystack.deepset.ai/integrations?type=Document+Store).

In [3]:
from typing import List
from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy

document_store = InMemoryDocumentStore()

document_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
document_writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)

indexing = Pipeline()
indexing.add_component(instance=document_embedder, name="document_embedder")
indexing.add_component(instance=document_writer, name="document_writer")

indexing.connect("document_embedder.documents", "document_writer.documents")

indexing.run({"document_embedder": {"documents": all_documents}})

{'document_writer': {'documents_written': 1000}}

Now that we have our data ready, we can create a simple RAG pipeline.

In this example, we'll be using:
- [`InMemoryEmbeddingRetriever`](https://docs.haystack.deepset.ai/docs/inmemoryembeddingretriever) which will retrieve the documents relevant to the query.
- [`OpenAIGenerator`](https://docs.haystack.deepset.ai/docs/OpenAIGenerator) to generate answers to queries. You can replace `OpenAIGenerator` in your pipeline with another `Generator`. Check out the full list of generators [here](https://docs.haystack.deepset.ai/docs/generators).
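
To give an intuition for what an embedding retriever does, here is a conceptual sketch of top-k retrieval by cosine similarity in plain Python. It is not Haystack's implementation: the function names, the two-dimensional toy vectors, and the choice of cosine similarity are all assumptions made for illustration, and the real retriever uses whichever similarity function the document store is configured with.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    query_embedding: List[float], doc_embeddings: Dict[str, List[float]], top_k: int = 3
) -> List[Tuple[str, float]]:
    """Score every document against the query and keep the top_k best matches."""
    scored = [(doc_id, cosine_similarity(query_embedding, emb)) for doc_id, emb in doc_embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical 2-dimensional embeddings; real sentence embeddings have hundreds of dimensions.
toy_doc_embeddings = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.0, 1.0],
    "doc_c": [1.0, 1.0],
}

top = retrieve_top_k([1.0, 0.0], toy_doc_embeddings, top_k=2)
print(top)
```

In the pipeline below, `top_k=3` plays the same role as the `top_k` argument here: it caps how many documents reach the prompt.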

In [4]:
import os
from getpass import getpass
from haystack.components.builders import AnswerBuilder, PromptBuilder
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API key:")

template = """
        You have to answer the following question based on the given context information only.

        Context:
        {% for document in documents %}
            {{ document.content }}
        {% endfor %}

        Question: {{question}}
        Answer:
        """

rag_pipeline = Pipeline()
rag_pipeline.add_component(
    "query_embedder", SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
)
rag_pipeline.add_component("retriever", InMemoryEmbeddingRetriever(document_store, top_k=3))
rag_pipeline.add_component("prompt_builder", PromptBuilder(template=template))
rag_pipeline.add_component("generator", OpenAIGenerator(model="gpt-3.5-turbo"))
rag_pipeline.add_component("answer_builder", AnswerBuilder())

rag_pipeline.connect("query_embedder", "retriever.query_embedding")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "generator")
rag_pipeline.connect("generator.replies", "answer_builder.replies")
rag_pipeline.connect("generator.meta", "answer_builder.meta")
rag_pipeline.connect("retriever", "answer_builder.documents")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7b698ec37d60>
🚅 Components
  - query_embedder: SentenceTransformersTextEmbedder
  - retriever: InMemoryEmbeddingRetriever
  - prompt_builder: PromptBuilder
  - generator: OpenAIGenerator
  - answer_builder: AnswerBuilder
🛤️ Connections
  - query_embedder.embedding -> retriever.query_embedding (List[float])
  - retriever.documents -> prompt_builder.documents (List[Document])
  - retriever.documents -> answer_builder.documents (List[Document])
  - prompt_builder.prompt -> generator.prompt (str)
  - generator.replies -> answer_builder.replies (List[str])
  - generator.meta -> answer_builder.meta (List[Dict[str, Any]])

### Asking a Question

When asking a question, use the `run()` method of the pipeline. Make sure to provide the question to all components that require it as input. In this case these are the `query_embedder`, the `prompt_builder` and the `answer_builder`.
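
Because the same question has to reach three components, it can help to build the input dictionary in one place. The sketch below uses a hypothetical helper, `build_rag_inputs`, which is not part of Haystack. It only assembles the dictionary and does not call the pipeline.

```python
from typing import Any, Dict


def build_rag_inputs(question: str) -> Dict[str, Dict[str, Any]]:
    """Hypothetical helper: map one question to the inputs of each component that needs it."""
    return {
        "query_embedder": {"text": question},
        "prompt_builder": {"question": question},
        "answer_builder": {"query": question},
    }


inputs = build_rag_inputs("Does mental fatigue affect maximal anaerobic exercise performance?")
print(sorted(inputs))
```

With the pipeline from the previous section, the call would then be `rag_pipeline.run(build_rag_inputs(question))`.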

In [5]:
question = "Do high levels of procalcitonin in the early phase after pediatric liver transplantation indicate poor postoperative outcome?"

response = rag_pipeline.run(
    {
        "query_embedder": {"text": question},
        "prompt_builder": {"question": question},
        "answer_builder": {"query": question},
    }
)
print(response["answer_builder"]["answers"][0].data)

Yes, high levels of procalcitonin in the early phase after pediatric liver transplantation indicated a poor postoperative outcome. Patients with high procalcitonin levels on postoperative day 2 had higher International Normalized Ratio values, suffered more often from primary graft non-function, had longer stays in the pediatric intensive care unit and on mechanical ventilation. However, there was no correlation between procalcitonin elevation and systemic infection.


## Evaluate the Pipeline

For this tutorial, let's evaluate the pipeline with the following metrics:

- [Document Mean Reciprocal Rank](https://docs.haystack.deepset.ai/docs/documentmrrevaluator): Evaluates retrieved documents using ground truth labels. It checks at what rank ground truth documents appear in the list of retrieved documents.
- [Semantic Answer Similarity](https://docs.haystack.deepset.ai/docs/sasevaluator): Evaluates predicted answers using ground truth labels. It checks the semantic similarity of a predicted answer and the ground truth answer using a fine-tuned language model.
- [Faithfulness](https://docs.haystack.deepset.ai/docs/faithfulnessevaluator): Uses an LLM to evaluate whether a generated answer can be inferred from the provided contexts. Does not require ground truth labels.
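
To make the first metric concrete, here is a minimal sketch of Mean Reciprocal Rank in plain Python. It assumes documents are identified by simple string IDs and that a retrieved document counts as a hit when its ID is in the ground truth list. The function names are hypothetical, and `DocumentMRREvaluator` may apply a different matching rule.

```python
from typing import List


def reciprocal_rank(ground_truth: List[str], retrieved: List[str]) -> float:
    """Return 1 / rank of the first retrieved item that is a ground truth item, else 0.0."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in ground_truth:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(all_ground_truth: List[List[str]], all_retrieved: List[List[str]]) -> float:
    """Average the reciprocal rank over all questions."""
    scores = [reciprocal_rank(gt, ret) for gt, ret in zip(all_ground_truth, all_retrieved)]
    return sum(scores) / len(scores)


# Toy data: one ground truth document per question, three retrieved documents each.
toy_ground_truth = [["doc_a"], ["doc_b"], ["doc_c"]]
toy_retrieved = [
    ["doc_a", "doc_x", "doc_y"],  # hit at rank 1 -> 1.0
    ["doc_x", "doc_b", "doc_y"],  # hit at rank 2 -> 0.5
    ["doc_x", "doc_y", "doc_z"],  # no hit        -> 0.0
]

mrr = mean_reciprocal_rank(toy_ground_truth, toy_retrieved)
print(mrr)  # (1.0 + 0.5 + 0.0) / 3 = 0.5
```

A score of 1.0 would mean the ground truth document is always ranked first. Lower scores mean it appears further down the list or is missing.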


First, let's actually run our RAG pipeline with a set of questions, and make sure we have the ground truth labels (both answers and documents) for these questions. Let's start with 25 random questions and labels 👇

> 📝 **Some Notes:**
> 1. For a full list of available metrics, check out the [Haystack Evaluators](https://docs.haystack.deepset.ai/docs/evaluators).
> 2. In our dataset, for each example question, we have 1 ground truth document as labels. However, in some scenarios more than 1 ground truth document may be provided as labels. You will notice that this is why we provide a list of `ground_truth_documents` for each question.
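
The sampling step has to keep each question together with its own answer and document, and each ground truth document is later wrapped in a one-element list. The sketch below shows both steps on toy strings. The seeded `random.Random(42)` generator is an assumption added here so that the sample is reproducible. The tutorial cell uses the unseeded module-level `random.sample`.

```python
import random

# Toy stand-ins for all_questions, all_ground_truth_answers and all_documents.
toy_questions = ["q0", "q1", "q2", "q3", "q4"]
toy_answers = ["a0", "a1", "a2", "a3", "a4"]
toy_documents = ["d0", "d1", "d2", "d3", "d4"]

rng = random.Random(42)  # arbitrary seed, chosen only for reproducibility

# Zip first so that sampling picks whole (question, answer, document) triples.
triples = list(zip(toy_questions, toy_answers, toy_documents))
sampled = rng.sample(triples, 3)

# Unzip the sampled triples back into three aligned tuples.
sample_questions, sample_answers, sample_docs = zip(*sampled)

# One list of ground truth documents per question, even when there is only one document.
ground_truth_documents = [[doc] for doc in sample_docs]

print(sample_questions)
print(ground_truth_documents)
```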

In [6]:
import random

questions, ground_truth_answers, ground_truth_docs = zip(
    *random.sample(list(zip(all_questions, all_ground_truth_answers, all_documents)), 25)
)

Next, let's run our pipeline and make sure to track what our pipeline returns as answers, and which documents it retrieves:

In [7]:
rag_answers = []
retrieved_docs = []

for question in list(questions):
    response = rag_pipeline.run(
        {
            "query_embedder": {"text": question},
            "prompt_builder": {"question": question},
            "answer_builder": {"query": question},
        }
    )
    print(f"Question: {question}")
    print("Answer from pipeline:")
    print(response["answer_builder"]["answers"][0].data)
    print("\n-----------------------------------\n")

    rag_answers.append(response["answer_builder"]["answers"][0].data)
    retrieved_docs.append(response["answer_builder"]["answers"][0].documents)

Question: 's it only what you say , it 's also how you say it : communicating nipah virus prevention messages during an outbreak in Bangladesh?
Answer from pipeline:
During the Nipah virus outbreak in Bangladesh, it was not only important to convey prevention messages but also how they were communicated. Field anthropologists played a crucial role in bridging the gap between biomedical explanations and local beliefs about the outbreak. Through interactive sessions with residents and using photos to illustrate how the virus could be transmitted, they were able to successfully convey the message. Prior to this intervention, residents believed in supernatural causes and continued risky behaviors like consuming raw date palm sap. However, after the intervention, residents understood the importance of abstaining from such practices and adopting safer behaviors. This shows that the manner in which prevention messages are communicated can greatly impact their effectiveness during an outbreak.

Question: Does relieving dyspnoea by non-invasive ventilation decrease pain thresholds in amyotrophic lateral sclerosis?
Answer from pipeline:
Yes, relieving dyspnoea by non-invasive ventilation decreases pain thresholds in amyotrophic lateral sclerosis (ALS) patients. The pressure pain thresholds measured in the deltoid muscle during unassisted breathing decreased significantly by a median of 24.5%-33.0% of baseline during non-invasive ventilation at 30 and 60 minutes (NIV30 and NIV60) in ALS patients.

-----------------------------------



Question: Is patient satisfaction biased by renovations to the interior of a primary care office : a pretest-posttest assessment?
Answer from pipeline:
Based on the information provided, patient satisfaction is not biased by renovations to the interior of a primary care office. The study conducted a pretest-posttest assessment and found that patient satisfaction was higher for all domains after the office was renovated, with statistical significance. Additionally, the results did not change when potential confounders were included in the analysis. Therefore, it can be concluded that patient satisfaction was genuinely influenced by the interior redesign of the primary care office.

-----------------------------------



Question: Is cD30 expression a novel prognostic indicator in extranodal natural killer/T-cell lymphoma , nasal type?
Answer from pipeline:
Based on the provided context information, CD30 expression is not a novel prognostic indicator in extranodal natural killer/T-cell lymphoma, nasal type. The study found that CD30 expression was significantly correlated with certain clinical features, treatment response, and prognosis in ENKTL patients. CD30 positivity was associated with shorter 5-year overall survival and progression-free survival rates in specific patient groups. Additionally, CD30 expression was identified as an independent prognostic factor for overall survival and progression-free survival in a multivariate Cox regression model. Therefore, while CD30 expression is a significant factor in predicting the prognosis of ENKTL patients, it is not considered a novel prognostic indicator.

-----------------------------------



Question: Is obesity associated with increased postoperative complications after operative management of proximal humerus fractures?
Answer from pipeline:
Yes, according to the first context provided, obesity was associated with a substantial increase in local and systemic complications following operative management of proximal humerus fractures.

-----------------------------------



Question: Does deep Sequencing the microRNA profile in rhabdomyosarcoma reveal down-regulation of miR-378 family members?
Answer from pipeline:
Yes, deep sequencing of the microRNA profile in rhabdomyosarcoma (RMS) revealed the down-regulation of miR-378 family members in RMS tumour tissue and cell lines.

-----------------------------------



Question: Is dorsal plication without degloving safe and effective for correcting ventral penile deformities?
Answer from pipeline:
Based on the context information provided, dorsal plication without degloving was not specifically mentioned as a method for correcting ventral penile deformities. The study focused on comparing the safety and efficacy of patients undergoing dorsal penile plication, ventral plication, and lateral plication. Therefore, based on the information provided, it is not clear whether dorsal plication without degloving is safe and effective for correcting ventral penile deformities.

-----------------------------------



Question: Does mental fatigue affect maximal anaerobic exercise performance?
Answer from pipeline:
Based on the given context information, it can be concluded that mental fatigue does not affect maximal anaerobic exercise performance. The study mentioned in the context found no difference in any performance or physiological variable between participants who were mentally fatigued and those who were not. Therefore, mental fatigue does not seem to have a significant impact on maximal anaerobic exercise performance.

-----------------------------------



Question: Are women using bleach for home cleaning at increased risk of non-allergic asthma?
Answer from pipeline:
Yes, women using bleach for home cleaning are at an increased risk of non-allergic asthma. The study showed that bleach use was significantly associated with non-allergic asthma, particularly non-allergic adult-onset asthma. Women using bleach frequently were more likely to have current asthma compared to non-users, and there were positive associations found between bleach use and bronchial hyperresponsiveness, asthma-like symptoms, and chronic cough among women without allergic sensitization.

-----------------------------------



Question: Does trichostatin A inhibit Retinal Pigmented Epithelium Activation in an In Vitro Model of Proliferative Vitreoretinopathy?
Answer from pipeline:
Yes, trichostatin A inhibits Retinal Pigmented Epithelium Activation in an In Vitro Model of Proliferative Vitreoretinopathy as shown in the study where it was observed that cells treated with transforming growth factor beta 2 (TGFβ2) alone or in the presence of trichostatin A showed inhibited contraction and migration of RPE cells, indicating a role of acetylation in RPE activation and progression of Proliferative Vitreoretinopathy.

-----------------------------------



Question: Are vitamin D levels and bone turnover markers related to non-alcoholic fatty liver disease in severely obese patients?
Answer from pipeline:
Based on the first context provided, the study concluded that there was no association between liver histology and levels of vitamin D or bone turnover parameters in severely obese patients. Therefore, vitamin D levels and bone turnover markers were not found to be related to non-alcoholic fatty liver disease in this specific group of patients.

-----------------------------------



Question: Does alcohol disrupt levels and function of the cystic fibrosis transmembrane conductance regulator to promote development of pancreatitis?
Answer from pipeline:
Yes, alcohol disrupts levels and function of the cystic fibrosis transmembrane conductance regulator (CFTR) to promote the development of pancreatitis. Studies have shown that alcohol inhibits CFTR activity in pancreatic ductal epithelial cells, reduces CFTR expression and stability, and disrupts CFTR folding, leading to lower levels of CFTR in pancreatic tissues from patients with acute or chronic pancreatitis induced by alcohol. Additionally, CFTR knockout mice given ethanol developed more severe pancreatitis than mice not given ethanol.

-----------------------------------



Question: Do genome-wide ancestry patterns in Rapanui suggest pre-European admixture with Native Americans?
Answer from pipeline:
Yes, genome-wide ancestry patterns in Rapanui suggest pre-European admixture with Native Americans, as evidenced by statistical support for Native American admixture dating to AD 1280-1495.

-----------------------------------



Question: Is termination of Nociceptive Bahaviour at the End of Phase 2 of Formalin Test Attributable to Endogenous Inhibitory Mechanisms , but not by Opioid Receptors Activation?
Answer from pipeline:
Yes, termination of nociceptive behavior at the end of phase 2 of the Formalin test appears to be attributable to endogenous inhibitory mechanisms rather than opioid receptors activation. This is supported by the observation that naloxone, a non-selective antagonist of opioid receptors, decreased nociception in phase 2A but had no effect on the delayed termination of the Formalin test. Additionally, the study specifically investigated active inhibitory mechanisms that lead to termination of nociceptive response in phase II, suggesting that other mechanisms besides opioid receptors may be involved.

-----------------------------------



Question: Is real-time three-dimensional transesophageal echocardiography useful for percutaneous closure of multiple secundum atrial septal defects?
Answer from pipeline:
Yes, real-time three-dimensional transesophageal echocardiography (RT-3D-TEE) was found to be useful for percutaneous closure of multiple secundum atrial septal defects in the study described in the context information. It was used to clarify the diagnosis, determine the operation scheme, monitor and guide the operation during the procedure, and evaluate the result shortly after the procedure.

-----------------------------------



Question: Does thalidomide control adipose tissue inflammation associated with high-fat diet-induced obesity in mice?
Answer from pipeline:
Yes, thalidomide has been shown to control adipose tissue inflammation associated with high-fat diet-induced obesity in mice. Thalidomide administration in obese mice resulted in a reduction in adiposity, decreased production of pro-inflammatory adipokines such as tumor necrosis factor-α (TNF-α), leptin, and monocyte chemoattractant protein-1 (MCP-1) in adipose tissue, reduced macrophage infiltration, and inhibition of c-Jun N-terminal kinase (JNK) activation. Additionally, thalidomide treatment lowered TNF-α and leptin serum levels in obese mice and inhibited the release of TNF-α and MCP-1 in adipocytes.

-----------------------------------



Question: Does puerarin inhibit the inflammatory response in atherosclerosis via modulation of the NF-κB pathway in a rabbit model?
Answer from pipeline:
Yes, puerarin inhibits the inflammatory response in atherosclerosis via modulation of the NF-κB pathway in a rabbit model. The study found that puerarin reduced the protein and mRNA levels of adhesion molecules (AMs) in the rabbit model of atherosclerosis. It was also noted that the reduced AM levels were due to inhibition of the phosphorylation and degradation of inhibitor-κB (I-κB), resulting in reduced p65 NF-κB nuclear translocation. This indicates that puerarin has a modulatory effect on the NF-κB pathway, which plays a crucial role in the inflammatory response in atherosclerosis.

-----------------------------------



Question: Is serum free 1,25-dihydroxy-vitamin D more closely associated with fibroblast growth factor 23 than other vitamin D forms in chronic dialysis patients?
Answer from pipeline:
Yes, according to the information provided in the context, serum free 1,25-dihydroxy-vitamin D was found to outweigh all other vitamin D forms regarding its association with fibroblast growth factor 23 (FGF-23) in chronic dialysis patients, as indicated by a p-value of 0.03 in the regression analysis.

-----------------------------------



Question: Do a critical analysis of secondary overtriage to a Level I trauma center?
Answer from pipeline:
Secondary overtriage to a Level I trauma center refers to the transfer of trauma patients to a higher-level trauma center who do not require the specialized resources and care available at that level of facility. In the context provided, the study analyzed the incidence and pattern of secondary overtriage to a Level I trauma center by assessing trauma patients transferred and discharged within 24 hours of admission.

The study found that 24% of transferred trauma patients were discharged within 24 hours of admission, indicating a significant proportion of patients who may not have required the level of care provided at the Level I trauma center. The most common reasons for referral were extremity fractures, head injuries, and soft tissue injuries, which are conditions that may not always necessitate treatment at a higher-level trauma center.

Furthermore, the majority of patients 

Question: Is methylation of the FGFR2 gene associated with high birth weight centile in humans?
Answer from pipeline:
Yes, methylation of the FGFR2 gene is significantly associated with high birth weight centile in humans (p = 0.004-0.027).

-----------------------------------



Question: Do two decades of British newspaper coverage regarding attempt cardiopulmonary resuscitation decisions : Lessons for clinicians?
Answer from pipeline:
Yes, the two decades of British newspaper coverage regarding Do Not Attempt Cardiopulmonary Resuscitation (DNACPR) decisions provide important lessons for clinicians. The coverage highlights the need for adequate patient and family consultation when making DNACPR decisions, as well as the importance of avoiding ageism and discrimination against the disabled in these decisions. Additionally, the association of DNACPR decisions with euthanasia and patients receiving CPR against their wishes should be taken into consideration by clinicians. These lessons can help clinicians make more informed and ethical decisions regarding CPR in the future.

-----------------------------------



Question: Are phospholipase C epsilon 1 ( PLCE1 ) haplotypes associated with increased risk of gastric cancer in Kashmir Valley?
Answer from pipeline:
Yes, the PLCE1 haplotypes (A2274223C3765524T7922612, G2274223C3765524T7922612, and G2274223T3765524C7922612) were found to be associated with an increased risk of gastric cancer in patients from Kashmir Valley. The frequencies of these haplotypes were higher in patients compared to controls and conferred a high risk for gastric cancer.

-----------------------------------



Question: Are reclassification rates higher among African American men than Caucasians on active surveillance?
Answer from pipeline:
Yes, reclassification rates are higher among African American men than Caucasians on active surveillance. The study found that African American men on active surveillance were more likely to experience upgrading on serial biopsy compared to Caucasians (36% vs 16%). Adjusting for various factors, African American race was an independent predictor of biopsy reclassification.

-----------------------------------



Question: Does health indicators associated with fall among middle-aged and older women enrolled in an evidence-based program?
Answer from pipeline:
No, the context information provided focuses on older women participating in a fall prevention program, not middle-aged women. The study examines the relationship between older female participants' baseline health status and self-reported falls during the fall prevention interventions.

-----------------------------------




Question: Do maternal and childhood psychological factors predict chronic disabling fatigue at age 13 years?
Answer from pipeline:
Yes, maternal and childhood psychological factors do predict chronic disabling fatigue at age 13 years. Maternal anxiety, maternal depression, child psychological problems, and upsetting life events were all associated with chronic disabling fatigue in children at age 13 years in the Avon Longitudinal Study of Parents and Children birth cohort. Specifically, maternal anxiety and depression, as well as child psychological problems and upsetting events, were all found to be risk factors for chronic disabling fatigue at age 13 years.

-----------------------------------



While each evaluator is a component that can be run on its own in Haystack, evaluators can also be added to a pipeline. This way, we can construct an `eval_pipeline` that includes an evaluator for every metric we want to measure our RAG pipeline on.

In [14]:
from haystack.components.evaluators.document_mrr import DocumentMRREvaluator
from haystack.components.evaluators.faithfulness import FaithfulnessEvaluator
from haystack.components.evaluators.sas_evaluator import SASEvaluator

# Each evaluator is added as an independent component; none of them are connected to each other
eval_pipeline = Pipeline()
eval_pipeline.add_component("doc_mrr_evaluator", DocumentMRREvaluator())
eval_pipeline.add_component("faithfulness", FaithfulnessEvaluator())
eval_pipeline.add_component("sas_evaluator", SASEvaluator(model="sentence-transformers/all-MiniLM-L6-v2"))

# Inputs are keyed by component name; every list holds one entry per question
results = eval_pipeline.run(
    {
        "doc_mrr_evaluator": {
            # One list of ground-truth documents per question
            "ground_truth_documents": list([d] for d in ground_truth_docs),
            "retrieved_documents": retrieved_docs,
        },
        "faithfulness": {
            "questions": list(questions),
            # One list of context strings per question
            "contexts": list([d.content] for d in ground_truth_docs),
            "predicted_answers": rag_answers,
        },
        "sas_evaluator": {"predicted_answers": rag_answers, "ground_truth_answers": list(ground_truth_answers)},
    }
)
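
To make the Document MRR score easier to interpret, here is a minimal pure-Python sketch of mean reciprocal rank. It is an illustration of the metric, not Haystack's implementation: the helper names `reciprocal_rank` and `mean_reciprocal_rank` and the toy document strings are assumptions made up for this example, and documents are compared by their text content.

```python
from typing import List, Sequence


def reciprocal_rank(ground_truth: Sequence[str], retrieved: Sequence[str]) -> float:
    """Return 1/rank of the first retrieved item found in ground_truth, or 0.0 on a miss."""
    relevant = set(ground_truth)
    for rank, content in enumerate(retrieved, start=1):
        if content in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    ground_truths: Sequence[Sequence[str]], retrieved_lists: Sequence[Sequence[str]]
) -> float:
    """Average the reciprocal rank over all questions."""
    if len(ground_truths) != len(retrieved_lists):
        raise ValueError("Expected one retrieved list per ground-truth list.")
    scores: List[float] = [
        reciprocal_rank(gt, ret) for gt, ret in zip(ground_truths, retrieved_lists)
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: one ground-truth document per question
ground_truths = [["doc A"], ["doc B"], ["doc C"]]
retrieved_lists = [
    ["doc A", "doc X"],  # relevant document at rank 1 -> 1.0
    ["doc X", "doc B"],  # relevant document at rank 2 -> 0.5
    ["doc X", "doc Y"],  # relevant document not retrieved -> 0.0
]

mrr = mean_reciprocal_rank(ground_truths, retrieved_lists)
print(mrr)  # (1.0 + 0.5 + 0.0) / 3 = 0.5
```

Under this definition, a score of 1.0 means the ground-truth document was ranked first for every question.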

### Constructing an Evaluation Report

Once we've run our evaluation pipeline, we can also create a full evaluation report. Haystack provides an `EvaluationRunResult`, which we can use to display a `score_report` 👇

In [17]:
from haystack.evaluation.eval_run_result import EvaluationRunResult

inputs = {
    "question": list(questions),
    "contexts": list([d.content] for d in ground_truth_docs),
    "answer": list(ground_truth_answers),
    "predicted_answer": rag_answers,
}

evaluation_result = EvaluationRunResult(run_name="pubmed_rag_pipeline", inputs=inputs, results=results)
evaluation_result.score_report()

| | score |
|---|---|
| doc_mrr_evaluator | 1.0 |
| faithfulness | 1.0 |
| sas_evaluator | 0.718074 |
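
As a rough guide to reading the faithfulness score: the evaluator asks an LLM to split each predicted answer into statements and judge whether each one can be inferred from the contexts, and the answer's score is the proportion of supported statements. The sketch below only illustrates that aggregation step. The judgments are hard-coded, hypothetical values (no LLM is called), and the helper name `answer_faithfulness` is an assumption for this example.

```python
from typing import List, Sequence


def answer_faithfulness(statement_scores: Sequence[int]) -> float:
    """Proportion of statements judged as supported (1) rather than unsupported (0).

    Returning 0.0 for an answer with no statements is an assumption of this sketch.
    """
    if not statement_scores:
        return 0.0
    return sum(statement_scores) / len(statement_scores)


# Hypothetical judgments: one inner list per predicted answer,
# 1 = statement can be inferred from the contexts, 0 = it cannot
statement_scores_per_answer = [
    [1, 1, 1],     # fully supported answer -> 1.0
    [1, 0],        # half of the statements supported -> 0.5
    [1, 1, 0, 1],  # three of four supported -> 0.75
]

individual_scores: List[float] = [
    answer_faithfulness(scores) for scores in statement_scores_per_answer
]
overall_score = sum(individual_scores) / len(individual_scores)

print(individual_scores)  # [1.0, 0.5, 0.75]
print(overall_score)      # 0.75
```

Read this way, the `faithfulness` value of 1.0 in the report means that every extracted statement was judged to be supported by its contexts.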


#### Extra: Convert the Report into a Pandas DataFrame

In addition, you can display your evaluation results as a pandas DataFrame 👇

In [18]:
results_df = evaluation_result.to_pandas()
results_df

| | question | contexts | answer | predicted_answer | doc_mrr_evaluator | faithfulness | sas_evaluator |
|---|---|---|---|---|---|---|---|
| 0 | 's it only what you say , it 's also how you s... | [During a fatal Nipah virus (NiV) outbreak in ... | During outbreaks, one-way behaviour change com... | During the Nipah virus outbreak in Bangladesh,... | 1.0 | 1.0 | 0.688929 |
| 1 | Does relieving dyspnoea by non-invasive ventil... | [Dyspnoea is a threatening sensation of respir... | Relieving dyspnoea by NIV in patients with ALS... | Yes, relieving dyspnoea by non-invasive ventil... | 1.0 | 1.0 | 0.811266 |
| 2 | Is patient satisfaction biased by renovations ... | [Measuring quality of care is essential to imp... | Renovating the interior of a primary care offi... | Based on the information provided, patient sat... | 1.0 | 1.0 | 0.849888 |
| 3 | Is cD30 expression a novel prognostic indicato... | [Extranodal natural killer/T-cell lymphoma, na... | Our results showed that expression of CD30 was... | Based on the provided context information, CD3... | 1.0 | 1.0 | 0.775011 |
| 4 | Is obesity associated with increased postopera... | [Obesity has become a significant public healt... | Obesity and its resultant medical comorbiditie... | Yes, according to the first context provided, ... | 1.0 | 1.0 | 0.845495 |
| 5 | Does deep Sequencing the microRNA profile in r... | [Rhabdomyosarcoma (RMS) is a highly malignant ... | MiR-378a-3p may function as a tumour suppresso... | Yes, deep sequencing of the microRNA profile i... | 1.0 | 1.0 | 0.661563 |
| 6 | Is dorsal plication without degloving safe and... | [To compare the safety and efficacy of patient... | Penile plication is a safe and effective techn... | Based on the context information provided, dor... | 1.0 | 1.0 | 0.804615 |
| 7 | Does mental fatigue affect maximal anaerobic e... | [Mental fatigue can negatively impact on subma... | Near identical responses in performance and ph... | Based on the given context information, it can... | 1.0 | 1.0 | 0.849995 |
| 8 | Are women using bleach for home cleaning at in... | [Bleach is widely used for household cleaning.... | Frequent use of bleach for home-cleaning is as... | Yes, women using bleach for home cleaning are ... | 1.0 | 1.0 | 0.899928 |
| 9 | Does trichostatin A inhibit Retinal Pigmented ... | [Proliferative vitreoretinopathy (PVR) is a bl... | Our findings indicate a role of acetylation in... | Yes, trichostatin A inhibits Retinal Pigmented... | 1.0 | 1.0 | 0.466138 |


Having the evaluation results as a DataFrame makes further analysis easy. For example, we can use it to select the 3 highest and the 3 lowest scores for semantic answer similarity (`sas_evaluator`) 👇


In [19]:
import pandas as pd

top_3 = results_df.nlargest(3, "sas_evaluator")
bottom_3 = results_df.nsmallest(3, "sas_evaluator")
pd.concat([top_3, bottom_3])

| | question | contexts | answer | predicted_answer | doc_mrr_evaluator | faithfulness | sas_evaluator |
|---|---|---|---|---|---|---|---|
| 13 | Is termination of Nociceptive Bahaviour at the... | [Formalin injection induces nociceptive bahavi... | The results of this study suggest the existenc... | Yes, termination of nociceptive behavior at th... | 1.0 | 1.0 | 0.901174 |
| 8 | Are women using bleach for home cleaning at in... | [Bleach is widely used for household cleaning.... | Frequent use of bleach for home-cleaning is as... | Yes, women using bleach for home cleaning are ... | 1.0 | 1.0 | 0.899928 |
| 16 | Does puerarin inhibit the inflammatory respons... | [The isoflavone puerarin [7-hydroxy-3-(4-hydro... | This study indicates that the effect of puerar... | Yes, puerarin inhibits the inflammatory respon... | 1.0 | 1.0 | 0.894604 |
| 9 | Does trichostatin A inhibit Retinal Pigmented ... | [Proliferative vitreoretinopathy (PVR) is a bl... | Our findings indicate a role of acetylation in... | Yes, trichostatin A inhibits Retinal Pigmented... | 1.0 | 1.0 | 0.466138 |
| 19 | Is methylation of the FGFR2 gene associated wi... | [This study examined links between DNA methyla... | We identified a novel biologically plausible c... | Yes, methylation of the FGFR2 gene is signific... | 1.0 | 1.0 | 0.490618 |
| 12 | Do genome-wide ancestry patterns in Rapanui su... | [Rapa Nui (Easter Island), located in the east... | These genetic results can be explained by one ... | Yes, genome-wide ancestry patterns in Rapanui ... | 1.0 | 1.0 | 0.517162 |
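
To see why an answer with perfect faithfulness can still get a low semantic answer similarity score, it helps to look at what the score measures. With a bi-encoder model such as `sentence-transformers/all-MiniLM-L6-v2`, the predicted and ground-truth answers are each embedded as a vector and compared by cosine similarity. The sketch below shows only that comparison step. The 3-dimensional vectors are made-up stand-ins for real embeddings, and `cosine_similarity` is a helper written for this example rather than a Haystack function.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length.")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical embeddings (real ones have hundreds of dimensions)
ground_truth_vec = [1.0, 0.0, 0.0]
similar_answer_vec = [0.9, 0.1, 0.0]    # points in nearly the same direction
different_answer_vec = [0.5, 0.5, 0.7]  # correct content could still be worded very differently

similar_score = cosine_similarity(ground_truth_vec, similar_answer_vec)
different_score = cosine_similarity(ground_truth_vec, different_answer_vec)

print(f"similar wording:   {similar_score:.3f}")
print(f"different wording: {different_score:.3f}")
```

Because the score depends on how close the two texts are in embedding space, a long, detailed predicted answer can score lower against a short ground-truth answer even when both agree, so low-scoring rows are worth reading before drawing conclusions.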


## What's next

🎉 Congratulations! You've learned how to evaluate a RAG pipeline with both model-based and statistical evaluators in Haystack, and how to turn the results into an evaluation report.

If you liked this tutorial, you may also enjoy:
- [Serializing Haystack Pipelines](https://haystack.deepset.ai/tutorials/29_serializing_pipelines)
- [Creating Your First QA Pipeline with Retrieval-Augmentation](https://haystack.deepset.ai/tutorials/27_first_rag_pipeline)

To stay up to date on the latest Haystack developments, you can [sign up for our newsletter](https://landing.deepset.ai/haystack-community-updates). Thanks for reading!