Project Overview
This project involved designing and implementing an end-to-end search engine from scratch, evolving across five phases to improve retrieval accuracy and ranking effectiveness. The search engine supports various ranking models, incorporates machine learning for ranking optimization, integrates deep learning for semantic retrieval, and enhances user experience with personalized search.
Phases of Development
1. Indexing & Retrieval
The first step in building the search engine was implementing an inverted index, the core data structure for efficient retrieval. This phase involved:
- Tokenizing documents and queries to break text into searchable units.
- Removing stopwords and applying stemming to normalize terms and improve retrieval accuracy.
- Building a Basic Inverted Index to store term frequencies and document frequency statistics.
- Developing a Positional Inverted Index, storing word positions within documents to enable phrase searching and proximity-based retrieval.
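The positional index described above can be sketched as follows. This is a minimal illustration, not the project's actual implementation; the function names and the naive whitespace tokenizer are assumptions, and a real pipeline would also apply the stopword removal and stemming mentioned earlier.

```python
from collections import defaultdict

def build_positional_index(docs):
    """docs: {doc_id: text}. Maps each term to {doc_id: [positions]},
    which enables phrase and proximity queries."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Naive tokenization: lowercase whitespace split (an assumption;
        # the real tokenizer would be more sophisticated).
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def phrase_match(index, phrase):
    """Return doc_ids containing the exact phrase, using stored positions."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return set()
    hits = set()
    for doc_id, starts in index[terms[0]].items():
        for p in starts:
            # The phrase matches if each subsequent term appears at the
            # next consecutive position in the same document.
            if all(doc_id in index[t] and p + i in index[t][doc_id]
                   for i, t in enumerate(terms[1:], start=1)):
                hits.add(doc_id)
                break
    return hits

docs = {1: "search engines build inverted indexes",
        2: "inverted indexes enable phrase search"}
idx = build_positional_index(docs)
```

Storing positions roughly doubles index size versus a basic inverted index, which is the trade-off for supporting phrase and proximity queries.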
2. Ranking Models
Once indexing was complete, we implemented traditional ranking models to retrieve and score documents based on query relevance. This included:
- BM25 – A probabilistic ranking function balancing term frequency and document length.
- TF-IDF – A statistical measure that weighs a term's importance in a document against its frequency across the corpus.
- Pivoted Normalization – A ranking model adjusting term weighting based on document length.
- Dirichlet Prior Smoothing – A language modeling approach that smooths term probabilities with corpus statistics, so unseen query terms do not zero out a document's score.
- WordCountCosineSimilarity – A vector space model measuring word overlap between query and document.
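BM25, the first ranker listed above, can be sketched over a toy corpus as follows. The `k1` and `b` defaults here are the common textbook values, not necessarily the project's tuning, and the IDF variant adds 1 inside the log (Lucene-style) to keep it non-negative.

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query; corpus is a list of
    tokenized documents used for IDF and average-length statistics."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)  # document frequency
        if df == 0:
            continue
        # Lucene-style IDF: +1 inside the log keeps it non-negative.
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        tf = doc_terms.count(t)
        # Term frequency saturates via k1; b normalizes for doc length.
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

corpus = [["fast", "search", "engine"],
          ["inverted", "index", "search"],
          ["deep", "learning", "models"]]
scores = [bm25_score(["search", "index"], d, corpus) for d in corpus]
```

The saturation via `k1` is what distinguishes BM25 from raw TF-IDF: repeating a term many times yields diminishing returns rather than a linearly growing score.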
3. Learning-to-Rank (L2R)
Traditional ranking models provided a baseline, but machine learning-based ranking helped improve retrieval relevance. We trained a LambdaMART (LightGBM-based) model using labeled query-document relevance scores, allowing the system to learn optimal ranking functions. This phase involved:
- Extracting features like document length, TF-IDF scores, and BM25 scores for supervised learning.
- Training a Learning-to-Rank (L2R) model to optimize ranking effectiveness.
- Evaluating ranking performance by comparing L2R outputs with traditional rankers.
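The feature extraction step above can be sketched in pure Python. The feature names here are illustrative assumptions; in the actual system, vectors like these would be fed to a LambdaMART ranker (e.g. LightGBM's LGBMRanker with a lambdarank objective) along with relevance labels.

```python
import math

def extract_features(query_terms, doc_terms, corpus):
    """Assemble a per-(query, document) feature dict for an L2R model.
    Feature names are illustrative, not the project's actual schema."""
    N = len(corpus)
    feats = {
        "doc_len": len(doc_terms),
        "query_len": len(query_terms),
        # Count of query terms that appear in the document.
        "overlap": sum(1 for t in query_terms if t in doc_terms),
    }
    # Summed TF-IDF over query terms as one ranking feature.
    tfidf = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)
        if df:
            tfidf += doc_terms.count(t) * math.log(N / df)
    feats["tfidf_sum"] = tfidf
    return feats

corpus = [["search", "ranking"], ["deep", "models"]]
f = extract_features(["search"], corpus[0], corpus)
```

In training, each row pairs such a feature vector with a graded relevance label, grouped by query so the ranker optimizes a list-wise metric like NDCG rather than per-document error.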
4. Deep Learning for IR
To move beyond keyword matching, we incorporated neural ranking models for improved semantic retrieval. This phase introduced:
- Bi-Encoders – Used pre-trained transformer models to encode query and document vectors for similarity comparison.
- Cross-Encoders – Applied deep learning models for query-document pair scoring.
- Sentence Transformers – Enhanced document retrieval by capturing contextual meaning beyond exact word matches.
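The bi-encoder pattern above can be illustrated with a toy encoder: query and documents are embedded independently, then ranked by cosine similarity. Here a bag-of-words "encoder" stands in for the pre-trained transformer (e.g. a sentence-transformers model) the project actually used; the structure of the retrieval step is the same.

```python
import math
from collections import Counter

def encode(text):
    """Stand-in encoder: bag-of-words counts instead of a transformer
    embedding (an assumption for illustration only)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v.get(t, 0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["neural ranking models",
        "classic keyword search",
        "semantic search with transformers"]
# Documents are encoded once, offline -- the key efficiency property of
# bi-encoders versus cross-encoders, which must score every pair jointly.
doc_vecs = [encode(d) for d in docs]
q = encode("semantic transformers")
ranked = sorted(range(len(docs)), key=lambda i: cosine(q, doc_vecs[i]), reverse=True)
```

A common deployment pattern combines both: the bi-encoder retrieves a candidate set cheaply, and the more expensive cross-encoder re-ranks only those candidates.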
5. Feedback & Personalization
The final phase aimed to make search results more user-adaptive and personalized. This included:
- Implementing query expansion to refine user searches dynamically.
- Incorporating user interaction data (click-through rates, past queries) for personalized ranking.
- Adjusting ranking weights based on historical user preferences.
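The query expansion step above can be sketched as pseudo-relevance feedback: the most frequent non-query terms from the top-ranked documents are appended to the query. The parameter choices here are illustrative assumptions, not the project's actual configuration.

```python
from collections import Counter

def expand_query(query_terms, top_docs, n_terms=2):
    """top_docs: tokenized top-k retrieved documents (assumed relevant).
    Returns the query extended with the n_terms most frequent new terms."""
    counts = Counter()
    for doc in top_docs:
        # Count only terms not already in the query.
        counts.update(t for t in doc if t not in query_terms)
    return list(query_terms) + [t for t, _ in counts.most_common(n_terms)]

top_docs = [["search", "engine", "ranking"],
            ["engine", "index", "ranking"]]
expanded = expand_query(["search"], top_docs)
```

The same scaffolding extends naturally to personalization: instead of drawing expansion terms from the top-ranked documents, they can be drawn from the user's clicked documents or past queries, weighting terms by historical interaction counts.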
Key Insights
- Deep learning models significantly improved ranking by capturing query intent and semantic similarity rather than relying on exact word matches.
- Learning-to-Rank allowed data-driven ranking optimization, surpassing traditional methods.
- Personalization enhanced user engagement by adapting rankings based on search behavior.