Evaluating a RAG System: Part 2 of 3

Experiments on Cosine Semantic Similarity

In the previous article, we discussed text-splitting strategies and how they affect the semantic relevance of chunks. Having seen how the splitters shape the chunks, it's time to assess how semantically similar those chunks are to the questions.

Semantic similarity is a metric defined over a set of documents or terms, in which the distance between items reflects the likeness of their meaning or semantic content rather than their lexical similarity. It is a mathematical tool for estimating the strength of the semantic relationship between language units, concepts, or instances, and its numerical value is obtained by comparing the information that supports their meaning or describes their nature. In this article we measure it as the cosine similarity between each question and each chunk, a value that ranges from -1 to 1.
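As a point of reference, here is a minimal sketch of cosine similarity, the measure named in the title. The three-dimensional vectors are toy values standing in for real embeddings, and the function name `cosine_similarity` is our own choice for illustration, not the API of any particular library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (assumption): real embeddings have hundreds of dimensions.
question = [1.0, 0.0, 0.0]
same_direction = [2.0, 0.0, 0.0]
orthogonal = [0.0, 3.0, 0.0]
opposite = [-1.0, 0.0, 0.0]

print(cosine_similarity(question, same_direction))  # 1.0
print(cosine_similarity(question, orthogonal))      # 0.0
print(cosine_similarity(question, opposite))        # -1.0
```

Note that the score depends only on the direction of the vectors, not on their length, which is why `same_direction` scores 1.0 despite being twice as long as `question`.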

In light of this, we conducted an experiment with the previous chunks, which were obtained in the first stage (after the splitters were applied). So, let's dive into it!

The experiment

A common practice to avoid delivering a fixed number of chunks is to set a similarity threshold, so that only those chunks exceeding this threshold are presented as context. To assess the usefulness of this approach, we have decided to analyze the similarity of the first 10 chunks (sorted by similarity) and their relationship with each of the generated questions.
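The thresholding idea can be sketched as follows. This is an illustration under our own assumptions, not the code used in the experiment: the helper `select_context`, its parameters, and the scores are all hypothetical.

```python
def select_context(scored_chunks, threshold, top_k=10):
    """Return up to top_k (text, score) pairs whose score exceeds threshold.

    scored_chunks: iterable of (chunk_text, similarity) pairs for one question.
    """
    # Rank by similarity, highest first, then cut at top_k.
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    # Keep only the chunks strictly above the threshold.
    return [(text, score) for text, score in ranked[:top_k] if score > threshold]


# Hypothetical similarity scores for a single question.
chunks = [
    ("chunk B", 0.14),
    ("chunk A", 0.62),
    ("chunk D", -0.27),
    ("chunk C", 0.03),
]

print(select_context(chunks, threshold=0.5))  # [('chunk A', 0.62)]
print(select_context(chunks, threshold=0.1))  # [('chunk A', 0.62), ('chunk B', 0.14)]
print(select_context(chunks, threshold=0.7))  # []
```

The last call shows the risk examined in the rest of the article: when no chunk clears the threshold, the question is left with no context at all.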

Similarity of each chunk according to each question, by splitter.

We observe that similarity drops sharply after the first chunk, which suggests that the subsequent chunks are much less relevant from a semantic point of view. In fact, the average similarity approaches zero from the fifth chunk onwards, regardless of the splitter used. We also notice that in some cases the similarity is extremely low and can even become negative. A negative value means that the chunk's embedding points away from the question's embedding; in simple terms, the model sees the chunk as unrelated, or even contrary, to the question. Why does this happen? Two reasons stand out:

1. The user's question is overly ambiguous, leading to inaccuracies and misconceptions. For instance, if I inquire about a 'plant', am I referring to a botanical organism, the act of placing something, a piece of equipment, or a section of a factory?
2. The content of a chunk can be so diverse that, despite containing the correct answer, the overall semantic meaning of the chunk does not resemble the question, resulting in incomplete responses.
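The second reason can be illustrated with a toy calculation. The sketch below assumes, purely for illustration, that a chunk's embedding behaves like the average of the embeddings of its parts; real embedding models are more complex than this, and all vectors and helper names here are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pool(vectors):
    """Component-wise average of equal-length vectors (a simplifying assumption)."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


# Toy embeddings (assumption): axis 0 is the topic of the question.
question = [1.0, 0.0, 0.0]
answer_sentence = [0.9, 0.1, 0.0]   # closely matches the question
off_topic_1 = [0.0, 1.0, 0.0]       # e.g. a list of holidays
off_topic_2 = [0.0, 0.0, 1.0]       # e.g. a passage about sport

focused = cosine_similarity(question, answer_sentence)
diluted = cosine_similarity(
    question, mean_pool([answer_sentence, off_topic_1, off_topic_2])
)

print(f"answer sentence alone: {focused:.2f}")
print(f"answer inside a mixed chunk: {diluted:.2f}")
```

Under this assumption, the chunk that mixes the answer with unrelated material scores noticeably lower than the answer sentence on its own, even though it still contains the answer.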

Let's consider, for instance, the case with index 1. The part of the context needed to answer the question is its final sentence, which begins with 'Article 18 of the convention'. The questions were originally in Spanish, as was the context; for easier reading, we translated both into English.

Question: What is the function of the intergovernmental committee according 
to Article 18 of the convention on intangible cultural heritage?
Actual Context: Eight more elements from the country were added to this new
list: the scissors dance, the huaconada, ritual dance of Mito, eshuva, sung
prayers of the Huachipaeri ethnic group, the pilgrimage to the sanctuary of
the Lord of Quyllurit'i, the knowledge, techniques and rituals linked to the
annual renewal of the Q'eswachaka bridge, the festival in honor of the Virgin
of Candelaria, the wititi dance of the Colca valley, the traditional system 
of Water Judges of Corongo and the 'Hatajo de Negritos' and 'Las Pallitas', 
dances from the south of the central coast of Peru.449 Article 18 of the 
convention stipulates that the intergovernmental committee selects programs,
projects and activities for the safeguarding of intangible cultural heritage 
that best reflect the principles and objectives of the convention.

Let's look at the chunks retrieved for index 1:

Chunk 2 Langchain splitter (cos_sim = 0.14):
...rituals linked to the annual renewal of the Q'eswachaka bridge, the festival
in honor of the Virgin of Candelaria, the wititi dance of the Colca valley, 
the traditional system of Water Judges of Corongo and the 'Hatajo de Negritos'
and 'Las Pallitas', dances from the south of the central coast of Peru.449 
Article 18 of the convention stipulates that the intergovernmental committee 
selects programs, projects and activities for the safeguarding of intangible
cultural heritage that best reflect the principles and...

Chunk 2 Llama-Index splitter (cos_sim = -0.27):
...of Negritos' and 'Las Pallitas', dances from the south of the central coast 
of Peru.449 Article 18 of the convention stipulates that the intergovernmental
committee selects programs, projects and activities for the safeguarding of 
intangible cultural heritage that best reflect the principles and objectives 
of the convention. One of these projects corresponds to the country under the
title of safeguarding the intangible cultural heritage of the Aymara communities 
of Bolivia, Chile and Peru, selected in 2009.450 Festivities See also: 
Holidays in Peru Date January 1 February 2 March/April May 1 2nd Sunday of May 
June 7 3rd Sunday of June June 23 and 24 Holiday New Year's Day Feast of the 
Virgin of Candelaria Holy Week Labor Day Mother's Day Flag Day Father's Day 
Feast of San Juan Sport June 24 June 29 July 28 and 29 August 30 October 1 to 
November 1 October 8 October 31 November 1 November 2 November 3 December 8 
December 9 December 24 December 25 Peasant's Day Saints Peter and Paul National
Holidays Santa Rosa de Lima Lord of Miracles Naval Combat of Angamos Day of 
the Creole Song All Saints' Day Day of the Faithful Departed San Martin de 
Porres Immaculate Conception Battle of Ayacucho Christmas Eve Christmas See 
also: Peruvian Olympic Committee The practice of sport in the Peruvian 
territory dates back to Ancient Peru. With the arrival of the Spaniards in 
this territory, the practice of sport radically changed. Later, it was 
influenced by the American ideology of physical education linked to 
commercialization. Sport in the country is divided into several federations...

Chunk 1 Semantic splitter (cos_sim = 0.03):
To this new list, eight more elements from the country were added: the scissors
dance, the huaconada, ritual dance of Mito, eshuva, sung prayers of the Huachipaeri 
ethnicity, the pilgrimage to the sanctuary of the Lord of Quyllurit'i, the knowledge, 
techniques and rituals linked to the annual renewal of the Q'eswachaka bridge, 
the festival in honor of the Virgin of Candelaria, the wititi dance of the Colca valley,
the traditional system of Water Judges of Corongo and the 'Hatajo de Negritos' 
and 'Las Pallitas', dances from the south of the central coast of Peru.449 
Article 18 of the convention stipulates that the intergovernmental committee 
selects programs, projects and activities for the safeguarding of intangible
cultural heritage that best reflect the principles and objectives of the 
convention. One of these projects corresponds to the country under the title
of safeguarding the intangible cultural heritage of the Aymara communities 
of Bolivia, Chile and Peru, selected in 2009.450

The chunks from all three splitters have very low similarity to the question, and yet the answer is found in each of them. It's interesting to see that for the semantic splitter the chunk with the answer ranked first, while for the other two splitters it ranked second.

If we wanted to be cautious about how many chunks are passed as context, we might set the threshold at a seemingly meaningful value such as 0.5. In our case, however, this would leave 76% of the questions without any context for the Langchain splitter, 100% for the Llama-Index splitter, and 81% for the semantic splitter. If we look at the distribution of chunk similarities, we see that very few values exceed 0.5, and that the mean of the distribution is in fact very close to 0.

Distribution of the semantic similarity of the chunks in relation to each question, according to each splitter.

Let's now see how a similarity threshold affects the number of questions that would be left without context:

Ratio of total questions left without context given a threshold in similarity, according to each splitter.
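A curve like this one can be computed with a few lines of code. The following is a minimal sketch, assuming we hold one list of chunk similarities per question; the function name `ratio_without_context` and the scores are hypothetical and are not the data from our experiment.

```python
def ratio_without_context(similarities_per_question, threshold):
    """Fraction of questions for which no chunk's similarity exceeds threshold."""
    if not similarities_per_question:
        raise ValueError("need at least one question")
    missing = sum(
        1
        for sims in similarities_per_question
        if not any(s > threshold for s in sims)
    )
    return missing / len(similarities_per_question)


# Hypothetical scores: one list of chunk similarities per question.
scores = [
    [0.62, 0.10, 0.02],
    [0.14, 0.03, -0.05],
    [0.08, -0.27, -0.30],
    [0.31, 0.12, 0.00],
]

for threshold in (0.0, 0.1, 0.3, 0.5):
    print(threshold, ratio_without_context(scores, threshold))
# 0.0 0.0
# 0.1 0.25
# 0.3 0.5
# 0.5 0.75
```

Sweeping the threshold over a finer grid and plotting the result per splitter would give one line per splitter, which is the shape of comparison shown in the graph.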

Based on the graph, we could sensibly choose a threshold like 0.1, where we would only lose around 10% of the questions. However, this would largely defeat the purpose of using a similarity threshold, because accepting such low similarities also means letting in content that has little or nothing to do with the question. So, how can we improve a RAG system? This is where reranking models come into play.

In the next article, we will discuss evaluation using a reranker. Don't miss it!