Similarity search methods have been researched since the 1970s in the field of NLP.
Jaccard similarity and w-shingling are two traditional methods for text similarity search.
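To make these two classical methods concrete, here is a minimal sketch (not taken from the article's own code) that builds word-level w-shingles and compares two sentences with Jaccard similarity:

```python
def shingles(text, w=2):
    """Build the set of w-shingles (contiguous runs of w words) of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

s1 = shingles("the quick brown fox jumps over the lazy dog")
s2 = shingles("the quick brown fox leaps over the lazy dog")
print(jaccard(s1, s2))  # 6 shared shingles out of 10 total -> 0.6
```

Because the comparison happens over sets of shingles rather than individual words, the score rewards shared phrases and word order, not just shared vocabulary.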
TF-IDF, BM25, and Sentence BERT are popular vector-based methods for similarity search.
TF-IDF scores how important each word is to a particular document relative to a large collection of documents: words frequent in one document but rare across the collection score highest.
BM25 is a refinement of TF-IDF with tunable hyperparameters (commonly k1 and b) that control term-frequency saturation and document-length normalization, which improves ranking results.
Sentence BERT uses BERT to generate fixed-size vector embeddings of whole sentences, which makes it well suited to semantic search.
All of these methods can be applied to real problems using libraries like FAISS or vector databases like Postgres pgvector, Qdrant, ChromaDB, and Pinecone.
Cosine similarity compares two sentence embeddings by measuring the angle between them, producing a score between -1 and 1, where 1 means the vectors point in the same direction.
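The cosine similarity computation itself is short; here is a self-contained sketch (illustrative toy vectors, not real sentence embeddings):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1, 0, 1], [1, 1, 0]))  # partial overlap, close to 0.5
print(cosine_similarity([1, 2], [-1, -2]))      # opposite directions, close to -1
```

Because the score depends only on direction and not magnitude, two embeddings of different lengths can still score 1.0 if they point the same way, which is why cosine similarity is the usual choice for comparing sentence embeddings.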
This article walked through each of these similarity search methods and how they relate to one another.
The field of NLP has come a long way in text similarity search since the 1970s.