Introduction

Searching for songs using incomplete or paraphrased lyrics is challenging for traditional keyword-based search engines. This project evaluates different ranking models for lyrics-based retrieval, focusing on their effectiveness rather than developing a full search engine.

Using a dataset of 50,000 songs, we compared statistical ranking models such as BM25 and TF-IDF with deep learning-based models. Performance was measured using Mean Average Precision (MAP) and Normalized Discounted Cumulative Gain (NDCG).

Dataset & Preprocessing

The Genius Lyrics Dataset provided 50,000 songs with metadata such as artists, albums, and release dates. To evaluate the ranking models, we manually labeled 1,000 query-document pairs with relevance scores on a 1-5 scale.

Preprocessing steps included:
  • Tokenization: Split lyrics into tokens with a regex-based tokenizer
  • Stopword Removal: Filtered out common, non-informative words
  • Metadata Structuring: Extracted artist, album, and genre details
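
The tokenization and stopword-removal steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the regular expression, the stopword list, and the name preprocess are all assumptions, since the source does not give them.

```python
import re

# Hypothetical stopword list for illustration only; the project's real list is not given.
STOPWORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "on",
             "is", "it", "i", "you", "me", "my"}

# Assumed token pattern: runs of letters/digits, optionally with one
# internal apostrophe so that contractions like "don't" stay together.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)?")


def preprocess(lyrics: str) -> list[str]:
    """Lowercase the lyrics, tokenize with the regex, and drop stopwords."""
    tokens = TOKEN_RE.findall(lyrics.lower())
    return [t for t in tokens if t not in STOPWORDS]


if __name__ == "__main__":
    print(preprocess("Don't stop the music, I love it!"))
```

The same function would be applied to both queries and lyrics so that the two sides share one vocabulary.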

Ranking Models Evaluated

Traditional Models:

  • BM25 – A probabilistic ranking function that combines saturating term frequency, inverse document frequency, and document-length normalization.
  • TF-IDF – Weights a word by its frequency in a document relative to how common it is across the whole dataset.
  • Pivoted Normalization – Adjusts term weights by normalizing document length relative to the average (pivot) length.
  • WordCountCosineSimilarity – Computes the cosine similarity between the term-count vectors of the query and the lyrics.
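
To make the BM25 description concrete, here is a small sketch of the standard Okapi BM25 scoring formula over pre-tokenized lyrics. The function name bm25_scores and the parameter values k1=1.5 and b=0.75 are assumptions (common defaults); the source does not state which settings or implementation were used.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25.

    k1 controls term-frequency saturation and b controls length normalization;
    both values here are assumed defaults, not the project's settings.
    Assumes docs is a non-empty list of token lists.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # The "+1" inside the log keeps the IDF non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


if __name__ == "__main__":
    docs = [["hello", "darkness", "old", "friend"],
            ["hello", "world"],
            ["dancing", "queen"]]
    print(bm25_scores(["darkness", "friend"], docs))
```

Because the score only accumulates over terms that appear verbatim in the lyrics, a paraphrased query that shares no tokens with a song scores zero for that song.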

Deep Learning Models:

  • Siamese BERT – A transformer-based model that captures semantic relationships, improving retrieval for paraphrased lyrics.
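  • Latent Semantic Analysis (LSA) – Projects text into a lower-dimensional space to capture conceptual similarities. Strictly speaking, LSA is a matrix-factorization technique rather than a neural network; it is listed here because it also matches on meaning rather than exact terms.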
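
Both models above represent the query and each song as a dense vector and rank songs by vector similarity. The sketch below shows only that ranking step, assuming cosine similarity as the scoring function (the source does not specify it). The embeddings are hypothetical placeholders; producing real ones would require a trained encoder or an LSA projection, which is outside this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_embedding(query_vec, song_vecs):
    """Return (song_id, similarity) pairs sorted from most to least similar."""
    scored = [(song_id, cosine(query_vec, vec)) for song_id, vec in song_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical 2-dimensional embeddings, for illustration only.
    song_vecs = {"song_a": [1.0, 0.0], "song_b": [0.0, 1.0], "song_c": [1.0, 1.0]}
    for song_id, sim in rank_by_embedding([1.0, 0.1], song_vecs):
        print(song_id, round(sim, 3))
```

In a Siamese setup the same encoder embeds the query and the lyrics, so song vectors can be computed once and reused across queries.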

Evaluation & Results

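Ranking models were compared using MAP (the mean, over all queries, of average precision, which rewards placing relevant songs early in the ranking) and NDCG (which uses graded relevance and gives higher weight to relevant results near the top).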
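
As a rough illustration of how these two metrics are computed from 1-5 relevance labels, consider the sketch below. Several details are assumptions not stated in the source: labels of 4 or higher count as "relevant" for MAP, NDCG uses the raw label as gain with a log2 discount, the cutoff is k=10, and each ranked list is assumed to contain all relevant songs for its query.

```python
import math


def average_precision(ranked_rels, threshold=4):
    """Average precision for one query.

    ranked_rels holds the relevance labels of the results in ranked order.
    Labels >= threshold are treated as relevant (the threshold is an assumption).
    """
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(ranked_rels, start=1):
        if rel >= threshold:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant position
    return precision_sum / hits if hits else 0.0


def mean_average_precision(runs, threshold=4):
    """Mean of average precision over a non-empty list of per-query label lists."""
    return sum(average_precision(r, threshold) for r in runs) / len(runs)


def dcg(rels):
    """Discounted cumulative gain with linear gain and log2 position discount."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(rels, start=1))


def ndcg(ranked_rels, k=10):
    """DCG of the top k results divided by the DCG of the ideal ordering."""
    ideal_dcg = dcg(sorted(ranked_rels, reverse=True)[:k])
    return dcg(ranked_rels[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0


if __name__ == "__main__":
    runs = [[5, 1, 4, 2], [1, 1]]
    print(mean_average_precision(runs))
    print(ndcg([1, 5]))
```

Different gain functions, cutoffs, and relevance thresholds can change the absolute values considerably, so scores are only comparable between rankers evaluated under the same settings.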

Ranker                      MAP Score   NDCG Score
BM25                        0.2566      0.0269
TF-IDF                      0.4268      0.0488
WordCountCosineSimilarity   0.3883      0.0357
Siamese BERT                0.4972      0.0704
LSA                         0.3457      0.0488

Key Findings:

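  • Siamese BERT achieved the highest scores on both metrics (MAP 0.4972, NDCG 0.0704), suggesting that transformer-based semantic matching is more effective than term matching for lyrics retrieval on this dataset.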
  • TF-IDF outperformed BM25 (MAP 0.4268 vs. 0.2566), which may reflect differences in how the two models handle stopwords and term weighting.
  • BM25 scored lowest on both metrics; because it relies on exact term matches, it struggled with paraphrased lyrics.