Introduction
Searching for songs using incomplete or paraphrased lyrics is challenging for traditional keyword-based search engines. This project evaluates different ranking models for lyrics-based retrieval, focusing on their effectiveness rather than developing a full search engine.
Using a dataset of 50,000 songs, we compared statistical ranking models such as BM25 and TF-IDF with deep learning-based models. Performance was measured using Mean Average Precision (MAP) and Normalized Discounted Cumulative Gain (NDCG).
Dataset & Preprocessing
The Genius Lyrics Dataset provided 50,000 songs with metadata such as artists, albums, and release dates. We manually labeled 1,000 query-document pairs with relevance scores (1-5) to evaluate ranking models.
Preprocessing steps included:
- Tokenization: Processed lyrics with a regex-based tokenizer
- Stopword Removal: Filtered non-informative words
- Metadata Structuring: Extracted artist, album, and genre details
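The tokenization and stopword steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the regex pattern, the tiny `STOPWORDS` set, and the function names `tokenize` and `preprocess` are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is", "it", "i", "you", "me", "my"}

# Assumed pattern: runs of letters/digits, optionally with one internal apostrophe ("don't").
TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)?")

def tokenize(text):
    """Lowercase the text and split it into word tokens with a regex."""
    return TOKEN_RE.findall(text.lower())

def preprocess(text):
    """Tokenize, then drop tokens that appear in the stopword list."""
    return [tok for tok in tokenize(text) if tok not in STOPWORDS]
```

For example, `preprocess("The sound of silence")` keeps only `sound` and `silence`, since `the` and `of` are in the assumed stopword list.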
Ranking Models Evaluated
Traditional Models:
- BM25 – A term frequency-based ranking model that balances term importance and document length.
- TF-IDF – Weighs word importance in a document relative to the whole dataset.
- Pivoted Normalization – Adjusts term weighting by normalizing document length.
- WordCountCosineSimilarity – Measures similarity between query and lyrics using vector-based term overlap.
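To make the interplay of term frequency, term rarity, and document length concrete, here is a rough sketch of BM25 scoring over pre-tokenized lyrics. The parameter values `k1=1.5` and `b=0.75` and the smoothed IDF formula are common defaults assumed for illustration; the write-up does not say which variant or settings the project used.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every document (a list of tokens) against the query tokens."""
    n_docs = len(docs_tokens)
    if n_docs == 0:
        return []
    # Average document length; fall back to 1 to avoid dividing by zero.
    avg_len = (sum(len(doc) for doc in docs_tokens) / n_docs) or 1.0

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for doc in docs_tokens:
        df.update(set(doc))

    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        # Longer-than-average documents are penalized, controlled by b.
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Smoothed IDF (assumed variant): rarer terms contribute more.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates as it grows, controlled by k1.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores
```

A document that shares no term with the query scores exactly zero here, which is why purely lexical rankers have trouble with paraphrased queries.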
Deep Learning Models:
- Siamese BERT – A transformer-based model that captures semantic relationships, improving retrieval for paraphrased lyrics.
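- Latent Semantic Analysis (LSA) – Reduces text to a lower-dimensional space to capture conceptual similarities. Strictly speaking, LSA is a matrix-factorization technique (singular value decomposition) rather than a neural network; it is grouped here because it also matches on meaning rather than exact terms.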
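Both models in this group rank songs by comparing vectors rather than exact words. The sketch below shows only that ranking step, under stated assumptions: `encode` is a hypothetical stand-in for whatever produces the vectors (a Siamese BERT encoder or an LSA projection), and `rank_by_embedding` is an illustrative name, not a function from the project.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_by_embedding(query, songs, encode):
    """Rank songs by similarity of their lyric vectors to the query vector.

    songs:  dict mapping title -> lyrics text
    encode: hypothetical callable mapping text -> list of floats
    """
    query_vec = encode(query)
    scored = [(title, cosine(query_vec, encode(lyrics))) for title, lyrics in songs.items()]
    # Highest similarity first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Because the query and each song are encoded independently, song vectors can be computed once and reused across queries; only the query needs encoding at search time.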
Evaluation & Results
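Ranking models were compared using MAP (the mean, over all queries, of average precision, which rewards placing relevant songs early in the ranking) and NDCG (which uses the graded relevance labels and gives higher weight to top-ranked relevant results).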
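As a rough sketch of how these two metrics can be computed from graded labels listed in ranked order: MAP needs a binary notion of relevance, so the example assumes that a label of 4 or higher on the 1-5 scale counts as relevant. That cutoff, the linear gain in `dcg`, and the function names are assumptions for illustration; the write-up does not state the exact definitions used.

```python
import math

def average_precision(relevances, threshold=4):
    """Average precision for one query.

    relevances: graded labels of the returned songs, in ranked order.
    threshold:  assumed cutoff; labels >= threshold count as relevant.
    Assumes the ranking contains every judged song for the query.
    """
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(relevances, start=1):
        if rel >= threshold:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / hits if hits else 0.0

def mean_average_precision(runs, threshold=4):
    """Mean of average precision over a list of per-query rankings."""
    return sum(average_precision(r, threshold) for r in runs) / len(runs)

def dcg(relevances):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0
```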
| Ranker | MAP Score | NDCG Score |
|---|---|---|
| BM25 | 0.2566 | 0.0269 |
| TF-IDF | 0.4268 | 0.0488 |
| WordCountCosineSimilarity | 0.3883 | 0.0357 |
| Siamese BERT | 0.4972 | 0.0704 |
| LSA | 0.3457 | 0.0488 |
Key Findings:
- Siamese BERT achieved the highest scores on both metrics (MAP 0.4972, NDCG 0.0704), suggesting that semantic matching is more effective than lexical matching for lyrics retrieval on this dataset.
- TF-IDF outperformed BM25 (MAP 0.4268 vs. 0.2566), which may reflect differences in how the two models handle stopwords and term weighting.
- BM25 had the lowest scores (MAP 0.2566, NDCG 0.0269). Because it relies on exact term matches, it struggled with paraphrased lyrics, limiting retrieval effectiveness.