From c851558186e8e20ada93e1261a4d5dbafb47cd6f Mon Sep 17 00:00:00 2001
From: Ayush Thakur
Date: Thu, 30 Nov 2023 20:34:00 +0530
Subject: [PATCH 1/2] add more info on evaluation

---
 README.md | 34 ++++++++++++++++++++++++++--------
 1 file changed, 26 insertions(+), 8 deletions(-)

diff --git a/README.md b/README.md
index faee7cb..b22b7a3 100644
--- a/README.md
+++ b/README.md
@@ -78,18 +78,36 @@ For more detailed instructions on installing and running the bot, please refer t
 Executing these commands will launch the API, Slackbot, and Discord bot applications, enabling you to interact with the bot and ask questions related to the Weights & Biases documentation.
 
-### Evaluation
+## Evaluation
 
-To evaluate the performance of the Q&A bot, the provided evaluation script (…) can be used. This script utilizes a separate dataset for evaluation, which can be stored as a W&B Artifact. The evaluation script calculates retrieval accuracy, average string distance, and chat model accuracy.
+We evaluated the performance of the Q&A bot manually and using auto eval strategies. The following W&B reports document the steps taken to evaluate the Q&A bot:
 
-The evaluation script downloads the evaluation dataset from the specified W&B Artifact, performs the evaluation using the Q&A bot, and then logs the results, such as retrieval accuracy, average string distance, and chat model accuracy, back to W&B. The logged results can be viewed on the W&B dashboard.
+- [How to evaluate an LLM Part 1: Building an Evaluation Dataset for our LLM System](http://wandb.me/wandbot-eval-part1): The report dive into the steps taken to build a gold-standard evaluation set.
+- [How to evaluate an LLM Part 2: Manual Evaluation of our LLM System](http://wandb.me/wandbot-eval-part2): The report talks about the throught process and steps taken to perform manual evaluation.
+- [How to evaluate an LLM Part 3: Auto-Evaluation; LLMs evaluating LLMs](http://wandb.me/wandbot-eval-part3): We used various LLM eval LLM startegies documented in this report.
 
-To run the evaluation script, use the following commands:
+### Evaluation Results
 
-```bash
-cd wandbot
-poetry run python -m eval
-```
+**Manual Evaluation**
+
+We manually evaluated the Q&A bot's responses to establish a baseline score.
+
+| Evaluation Metric | Comment | Score |
+|---|---|---|
+| Accuracy | measures the correctness of Q&A bot responses | 66.67 % |
+| URL Hallucination | measures the validity and relevancy of the links | 10.61 % |
+| Query Relevancy | measures if the query is relevant to W&B | 88.64 % |
+
+**Auto Evaluation (LLMs evaluating LLMs)**
+
+We employed a few auto evaluation strategies to speed up the iteration process of the bot's development.
+
+| Evaluation Metric | Comment | Score |
+|---|---|---|
+| Faithfulness Accuracy | measures if the response from a RAG pipeline matches any retrieved chunk | 53.78 % |
+| Relevancy Accuracy | measures if the generated response is in line with the context | 61.36 % |
+| Hit Rate | measures if the correct chunk is present in the retrieved chunks | 0.79 |
+| Mean Reciprocal Rank (MRR) | measures the quality of the retriever | 0.74 |
 
 ## Overview of the Implementation

From 5978330dd4b98bde825733fc6eedade4bd14f9e3 Mon Sep 17 00:00:00 2001
From: Bharat Ramanathan
Date: Thu, 30 Nov 2023 20:40:31 +0530
Subject: [PATCH 2/2] fix: small typo fixes

---
 README.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index b22b7a3..545bfaf 100644
--- a/README.md
+++ b/README.md
@@ -82,9 +82,9 @@ Executing these commands will launch the API, Slackbot, and Discord bot applicat
 
 We evaluated the performance of the Q&A bot manually and using auto eval strategies. The following W&B reports document the steps taken to evaluate the Q&A bot:
 
-- [How to evaluate an LLM Part 1: Building an Evaluation Dataset for our LLM System](http://wandb.me/wandbot-eval-part1): The report dive into the steps taken to build a gold-standard evaluation set.
-- [How to evaluate an LLM Part 2: Manual Evaluation of our LLM System](http://wandb.me/wandbot-eval-part2): The report talks about the throught process and steps taken to perform manual evaluation.
-- [How to evaluate an LLM Part 3: Auto-Evaluation; LLMs evaluating LLMs](http://wandb.me/wandbot-eval-part3): We used various LLM eval LLM startegies documented in this report.
+- [How to evaluate an LLM Part 1: Building an Evaluation Dataset for our LLM System](http://wandb.me/wandbot-eval-part1): The report dives into the steps taken to build a gold-standard evaluation set.
+- [How to evaluate an LLM Part 2: Manual Evaluation of our LLM System](http://wandb.me/wandbot-eval-part2): The report talks about the thought process and steps taken to perform manual evaluation.
+- [How to evaluate an LLM Part 3: Auto-Evaluation; LLMs evaluating LLMs](http://wandb.me/wandbot-eval-part3): Various LLM auto-eval strategies are documented in this report.
 
 ### Evaluation Results
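
The two retrieval metrics in the auto-evaluation table (Hit Rate and MRR) can be computed directly from ranked retrieval results. Below is a minimal sketch of how they might be calculated, assuming each evaluation example records the ID of its gold chunk together with the retriever's ranked list of chunk IDs; the names and data layout are illustrative, not wandbot's actual eval code.

```python
# Minimal sketch of Hit Rate and Mean Reciprocal Rank (MRR) for a retriever.
# Assumes each eval example is (gold_chunk_id, ranked list of retrieved chunk IDs).
# Illustrative only; not taken from wandbot's eval module.

def hit_rate(results: list[tuple[str, list[str]]]) -> float:
    """Fraction of queries whose gold chunk appears anywhere in the retrieved list."""
    hits = sum(1 for gold, retrieved in results if gold in retrieved)
    return hits / len(results)

def mean_reciprocal_rank(results: list[tuple[str, list[str]]]) -> float:
    """Average of 1/rank of the gold chunk, counting misses as 0."""
    total = 0.0
    for gold, retrieved in results:
        if gold in retrieved:
            total += 1.0 / (retrieved.index(gold) + 1)  # ranks are 1-based
    return total / len(results)

# Toy example: three queries, each with a ranked list of three retrieved chunks.
results = [
    ("c1", ["c1", "c7", "c3"]),  # hit at rank 1 -> reciprocal rank 1.0
    ("c2", ["c9", "c2", "c4"]),  # hit at rank 2 -> reciprocal rank 0.5
    ("c5", ["c8", "c6", "c3"]),  # miss          -> reciprocal rank 0.0
]
print(f"Hit Rate: {hit_rate(results):.2f}")              # 0.67
print(f"MRR:      {mean_reciprocal_rank(results):.2f}")  # 0.50
```

Read this way, a Hit Rate of 0.79 alongside an MRR of 0.74 means the gold chunk was retrieved for roughly 79% of queries and, when it was retrieved, it usually ranked at or near the top of the list.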