Text Similarity Measures in NLP

Introduction

Text similarity allows us to quantify how much two or more pieces of text resemble each other. Text similarity measures play a pivotal role in a multitude of applications, from information retrieval and recommendation systems to plagiarism detection and text summarization. In this blog, we will explore various text similarity measures and how they work.

Bag of Words

The Bag of Words model is a simple and powerful representation of text data. It treats a document as a “bag” (collection) of words, ignoring the order in which words appear. The basic idea is to create a vocabulary of unique words from the entire corpus (collection of documents) and then represent each document as a vector, where each dimension corresponds to a unique word in the vocabulary. The values in this vector represent the frequency of each word’s occurrence in the document.

from sklearn.feature_extraction.text import CountVectorizer

documents = ["I love NLP", "I love ML", "The dog barks loudly."]

# Build the vocabulary from the corpus and count each word's occurrences per document
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)

vocabulary = vectorizer.get_feature_names_out()
bow_array = bow_matrix.toarray()

print("Vocabulary:", vocabulary)
print("Bag of Words Matrix:")
print(bow_array)
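Note that CountVectorizer lowercases the text and, with its default tokenizer, drops single-character tokens, so "I" never makes it into the vocabulary. The vocabulary here is barks, dog, loudly, love, ml, nlp, and the, and each row of the matrix counts those words for one document.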

Common Text Similarity Measures

Jaccard Similarity

Jaccard similarity measures the similarity between two sets as the size of their intersection divided by the size of their union. For text, each document is treated as a set of tokens. Scores range from 0 (no shared tokens) to 1 (identical sets); the smaller the value, the less similar the texts are.

Formula

Jaccard Similarity = |A ∩ B| / |A ∪ B|

# Tokenize by splitting on whitespace (punctuation stays attached to tokens)
set1 = set("I love NLP.".split())
set2 = set("I love ML.".split())

# |A ∩ B| / |A ∪ B|
jaccard_similarity = len(set1.intersection(set2)) / len(set1.union(set2))
print(f"Jaccard Similarity: {jaccard_similarity}")
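With plain whitespace tokenization the shared tokens are "I" and "love", so the score is 2 / 4 = 0.5. Note that the trailing periods stay attached to the tokens ("NLP." rather than "NLP"); in practice you would usually lowercase the text and strip punctuation before comparing sets.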

Cosine Similarity

Cosine similarity measures the cosine of the angle between two vectors representing text documents. The smaller the angle, the more similar the texts; for non-negative representations such as tf-idf, values range from 0 (no shared terms) to 1 (identical direction). It is often applied in information retrieval and recommendation systems.

Formula

Cosine Similarity = (A · B) / (||A|| * ||B||)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = ["This is the first document.", "This is the second document."]

# Convert the documents to tf-idf vectors
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# Pairwise cosine similarity between all document vectors
# (renamed so the variable does not shadow the imported function)
similarity_matrix = cosine_similarity(tfidf_matrix)
print(similarity_matrix)
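The result is a 2×2 matrix: the diagonal entries are 1.0, since each document is identical to itself, and the off-diagonal entries give the similarity between the two documents.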

Levenshtein Distance

Levenshtein distance, also known as edit distance, is a metric used to measure the similarity between two strings by counting the minimum number of single-character edits (insertions, deletions, or substitutions) required to transform one string into the other. It plays a significant role in various fields, including natural language processing, spell-checking, and bioinformatics.

Insertion: Adding a character to one of the strings.

Deletion: Removing a character from one of the strings.

Substitution: Replacing one character with another.

The smaller the distance, the more similar the strings are.

def levenshtein_distance(str1, str2):
    m, n = len(str1), len(str2)
    # dp[i][j] holds the edit distance between the first i characters
    # of str1 and the first j characters of str2
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0:
                # str1 is empty: insert all j characters of str2
                dp[i][j] = j
            elif j == 0:
                # str2 is empty: delete all i characters of str1
                dp[i][j] = i
            elif str1[i - 1] == str2[j - 1]:
                # Characters match: no edit needed
                dp[i][j] = dp[i - 1][j - 1]
            else:
                # 1 + the cheapest of deletion, insertion, substitution
                dp[i][j] = 1 + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])

    return dp[m][n]


str1 = "king"
str2 = "ring"
distance = levenshtein_distance(str1, str2)
print(f"Levenshtein Distance: {distance}")

Word2Vec Similarity

Word2Vec is a widely used technique for capturing word similarity. It represents words as vectors in a continuous vector space where similar words are located closer to each other.

from gensim.test.utils import common_texts
from gensim.models import Word2Vec

# Train a small Word2Vec model on gensim's built-in toy corpus
model = Word2Vec(sentences=common_texts, vector_size=100, window=5, min_count=1, workers=4)

# Cosine similarity between the learned vectors for the two words
similarity = model.wv.similarity("computer", "human")

print(f"Word2Vec Similarity: {similarity}")

Word Embedding Averaging

Another approach to document similarity is to represent each document as the average of the embeddings of the words it contains, and then compare those document vectors with cosine similarity.
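As a minimal sketch of this idea, the snippet below reuses the toy Word2Vec model from the previous section, averages the vectors of each document's in-vocabulary words, and compares the resulting document vectors with cosine similarity. The document_vector helper and the example documents are illustrative choices, not a standard API:

import numpy as np
from gensim.test.utils import common_texts
from gensim.models import Word2Vec

# Toy model trained on gensim's built-in corpus, as in the previous section
model = Word2Vec(sentences=common_texts, vector_size=100, window=5, min_count=1, workers=4)

def document_vector(model, tokens):
    # Average the embeddings of the words the model knows about
    vectors = [model.wv[word] for word in tokens if word in model.wv]
    if not vectors:
        # No in-vocabulary words: fall back to a zero vector
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

def cosine(u, v):
    # Cosine similarity between two dense vectors
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Example documents built from words in the toy vocabulary
doc1 = "human computer interface".split()
doc2 = "user computer system".split()

vec1 = document_vector(model, doc1)
vec2 = document_vector(model, doc2)
print(f"Averaged-embedding similarity: {cosine(vec1, vec2)}")

Averaging is crude (it ignores word order and weights every word equally), but it is often a reasonable baseline for short documents.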

Resources

https://pytechie.com/category/data-science/nlp/

https://mariogarcia.github.io/blog/2021/04/nlp_text_similarity.html
