Introduction
Text similarity allows us to quantify how much two or more pieces of text resemble each other. Text similarity measures play a pivotal role in a multitude of applications, from information retrieval and recommendation systems to plagiarism detection and text summarization. In this blog, we will explore various text similarity measures and how they work.
Bag of Words
The Bag of Words model is a simple and powerful representation of text data. It treats a document as a “bag” (collection) of words, ignoring the order in which words appear. The basic idea is to create a vocabulary of unique words from the entire corpus (collection of documents) and then represent each document as a vector, where each dimension corresponds to a unique word in the vocabulary. The values in this vector represent the frequency of each word’s occurrence in the document.
from sklearn.feature_extraction.text import CountVectorizer

documents = ["I love NLP", "I love ML", "The dog barks loudly."]

# Build the vocabulary and the document-term count matrix
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)

vocabulary = vectorizer.get_feature_names_out()
bow_array = bow_matrix.toarray()

print("Vocabulary:", vocabulary)
print("Bag of Words Matrix:")
print(bow_array)
Common Text Similarity Measures
Jaccard Similarity
Jaccard similarity measures the similarity between two sets by dividing the size of their intersection by the size of their union. For text, each document is typically treated as a set of words (tokens). The score ranges from 0 (no words in common) to 1 (identical sets); the smaller the value, the less similar the texts are.
Formula
Jaccard Similarity = |A ∩ B| / |A ∪ B|
set1 = set("I love NLP.".split())
set2 = set("I love ML.".split())

jaccard_similarity = len(set1.intersection(set2)) / len(set1.union(set2))
print(jaccard_similarity)
Cosine Similarity
Cosine similarity measures the cosine of the angle between two vectors representing text documents (for example, TF-IDF or bag-of-words vectors). The smaller the angle, the closer the cosine is to 1 and the more similar the texts are; the smaller the value, the less similar they are. It is often applied in information retrieval and recommendation systems.
Formula
Cosine Similarity = (A · B) / (||A|| * ||B||)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = ["This is the first document.", "This is the second document."]

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# Pairwise cosine similarity between the TF-IDF vectors
# (renamed so the result does not shadow the imported function)
similarity_matrix = cosine_similarity(tfidf_matrix)
print(similarity_matrix)
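To see the formula itself at work, here is a minimal sketch that applies it directly with NumPy to two made-up count vectors; the vectors are purely illustrative, not output from the example above.

import numpy as np

# Two illustrative term-count vectors (as CountVectorizer might produce)
A = np.array([1, 1, 1, 0])
B = np.array([1, 1, 0, 1])

# Cosine Similarity = (A · B) / (||A|| * ||B||)
cos_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(cos_sim)  # 2 / 3 ≈ 0.667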
Levenshtein Distance
Levenshtein distance, also known as edit distance, is a metric used to measure the similarity between two strings by counting the minimum number of single-character edits (insertions, deletions, or substitutions) required to transform one string into the other. It plays a significant role in various fields, including natural language processing, spell-checking, and bioinformatics.
Insertion: Adding a character to one of the strings.
Deletion: Removing a character from one of the strings.
Substitution: Replacing one character with another.
The smaller the distance, the more similar the strings are.
def levenshtein_distance(str1, str2):
    m, n = len(str1), len(str2)
    # dp[i][j] holds the edit distance between str1[:i] and str2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0:
                dp[i][j] = j
            elif j == 0:
                dp[i][j] = i
            elif str1[i - 1] == str2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(dp[i - 1][j],       # deletion
                                   dp[i][j - 1],       # insertion
                                   dp[i - 1][j - 1])   # substitution
    return dp[m][n]

str1 = "king"
str2 = "ring"
distance = levenshtein_distance(str1, str2)
print(f"Levenshtein Distance: {distance}")
Word2Vec Similarity
Word2Vec is a widely used technique for capturing word similarity. It represents words as vectors in a continuous vector space where similar words are located closer to each other.
from gensim.test.utils import common_texts
from gensim.models import Word2Vec

# Train a small Word2Vec model on gensim's toy corpus
model = Word2Vec(sentences=common_texts, window=5, min_count=1, workers=4)

similarity = model.wv.similarity("computer", "human")
print(f"Word2Vec Similarity: {similarity}")
Word Embedding Averaging
Another approach to document similarity is to represent each document as the average of the word embeddings of its words, and then compare the resulting document vectors, for example with cosine similarity, as sketched below.
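Here is a minimal sketch of this idea, assuming a Word2Vec model trained on gensim's toy common_texts corpus; the example documents, the vector_size value, and the document_vector helper are illustrative choices.

import numpy as np
from gensim.test.utils import common_texts
from gensim.models import Word2Vec

# Illustrative model; vector_size and other parameters are assumptions
model = Word2Vec(sentences=common_texts, vector_size=50, window=5, min_count=1, workers=4)

def document_vector(tokens, model):
    # Average the embeddings of tokens that exist in the model's vocabulary
    vectors = [model.wv[token] for token in tokens if token in model.wv]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

# Illustrative documents drawn from the toy corpus vocabulary
doc1 = ["human", "computer", "interface"]
doc2 = ["user", "computer", "system"]

vec1 = document_vector(doc1, model)
vec2 = document_vector(doc2, model)

# Compare the averaged document vectors with cosine similarity
similarity = np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
print(f"Document similarity: {similarity:.4f}")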
Resources
https://pytechie.com/category/data-science/nlp/
https://mariogarcia.github.io/blog/2021/04/nlp_text_similarity.html