# Learning to Rank in Hybrid Search: Why LTR Matched RRF
## Context
I built an on-device hybrid search engine that combines BM25 and vector retrieval with Reciprocal Rank Fusion. Reranking metrics suggested a learned linear fusion model would outperform RRF, but end-to-end evaluation showed otherwise. This article explains why the model matched baseline behavior and what to improve next.
## The Idea
Hybrid search fusion is a ranking problem. You have two scored lists, one from semantic search and one from keyword search, and you need to combine them. RRF uses a fixed formula: score each document as 1/(k + rank) in each result list, sum the contributions, sort by combined score, done. Not bad for a formula with one parameter. A convex combination uses a single alpha weight: α · semantic + (1 − α) · keyword. Both are static: they don't adapt to the corpus.
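As a concrete sketch of the RRF side (1-based ranks to match the 1/(k + rank) formula; the helper names here are illustrative, not the qrst implementation):

```rust
use std::collections::HashMap;

/// RRF contribution of a single 1-based rank position.
fn rrf_score(rank: usize, k: f64) -> f64 {
    1.0 / (k + rank as f64)
}

/// Fuse two ranked lists of document ids by summing per-list RRF contributions.
fn rrf_fuse(semantic: &[&str], keyword: &[&str], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in [semantic, keyword] {
        for (i, doc) in list.iter().enumerate() {
            *scores.entry((*doc).to_string()).or_insert(0.0) += rrf_score(i + 1, k);
        }
    }
    let mut fused: Vec<_> = scores.into_iter().collect();
    // Highest combined score first.
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}

fn main() {
    // "b" appears in both lists, so it wins even though it tops neither.
    let fused = rrf_fuse(&["a", "b"], &["b", "c"], 60.0);
    println!("{:?}", fused);
}
```

Note how a document appearing in both lists accumulates two contributions, which is the behavior the `in_both` feature later captures explicitly.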
A learning-to-rank (LTR) model replaces the fixed formula with a linear function over multiple features, trained on human relevance judgments. Even a simple model should be able to outperform a single static knob by adapting to corpus characteristics.
I designed 7 features:
| # | Feature | Range | Rationale |
|---|---|---|---|
| 1 | semantic_score | [0,1] | Raw cosine similarity from vector search |
| 2 | semantic_rank_norm | [0,1] | Normalized rank position in semantic results |
| 3 | keyword_rank_norm | [0,1] | Normalized rank position in keyword results |
| 4 | in_both | {0,1} | 1 if the document appears in both lists |
| 5 | rrf_score | (0,~0.03) | Standard RRF score (so the model can replicate RRF) |
| 6 | path_depth_norm | [0,1] | File path depth as a document-level prior |
| 7 | content_length_norm | [0,1] | Content length as a document-level prior |
The scoring function is a dot product: score(doc) = bias + Σ(wᵢ · fᵢ). Eight floats.
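A minimal sketch of that scorer (`ltr_score` and the example values are illustrative, not the qrst implementation; feature indices follow the table above):

```rust
// Linear LTR scorer: a bias plus a 7-feature dot product.
fn ltr_score(bias: f64, weights: &[f64; 7], features: &[f64; 7]) -> f64 {
    bias + weights.iter().zip(features).map(|(w, f)| w * f).sum::<f64>()
}

fn main() {
    // Zeroing every weight except rrf_score (index 4 in the feature table)
    // makes the model reproduce RRF's ranking exactly.
    let mut weights = [0.0f64; 7];
    weights[4] = 1.0;
    let features = [0.91, 0.80, 0.10, 1.0, 0.0301, 0.50, 0.40];
    println!("{}", ltr_score(0.0, &weights, &features)); // the document's rrf_score
}
```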
A key design choice: including rrf_score as a feature means the model can replicate RRF exactly by zeroing all other weights. It can only improve, never regress below the baseline. Or so I thought.
## Training
The training pipeline lives in qrst-bench, a separate benchmark crate. For each of the 42 evaluation queries:
- Run `qrst vsearch` (semantic) and `qrst search` (keyword) as subprocesses
- Collect scored results from both
- Extract the 7 features for every candidate document
- Look up the human relevance grade (0-3) from the judgment file
This produced 460 training samples (101 relevant documents). I trained using pairwise hinge loss with SGD: for every pair of documents where one has a higher relevance grade, push the model to score it higher.
```
loss = max(0, margin - (score_better - score_worse))
margin = grade_difference × 0.1
```
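One SGD step under this hinge loss looks roughly like the following sketch (the parameter layout, with `params[0]` as the bias and the rest as the 7 feature weights, and the function name are assumptions, not the qrst-bench code):

```rust
// One pairwise SGD step under the hinge loss above.
fn hinge_update(
    params: &mut [f64; 8],
    better: &[f64; 7], // features of the higher-graded document
    worse: &[f64; 7],  // features of the lower-graded document
    grade_diff: f64,   // difference in relevance grades (> 0)
    lr: f64,
) {
    let score = |f: &[f64; 7], p: &[f64; 8]| {
        p[0] + p[1..].iter().zip(f).map(|(w, x)| w * x).sum::<f64>()
    };
    let margin = grade_diff * 0.1;
    // Hinge loss: update only when the pair is not yet separated by the margin.
    if margin - (score(better, params) - score(worse, params)) > 0.0 {
        for i in 0..7 {
            // Subgradient step: move weights toward the feature difference.
            params[i + 1] += lr * (better[i] - worse[i]);
        }
        // The bias cancels in the pairwise score difference, so it is not updated.
    }
}

fn main() {
    let mut params = [0.0f64; 8];
    let better = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0];
    let worse = [0.0f64; 7];
    hinge_update(&mut params, &better, &worse, 2.0, 0.1);
    println!("updated weights: {:?}", &params[1..]);
}
```

Because only score differences matter in pairwise training, the bias never receives a gradient; it only shifts all scores uniformly at inference time.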
Leave-one-out cross-validation across all 42 queries. Train on 41, evaluate on the held-out query, repeat.
## Initial Metrics
After 100 epochs with lr=0.001 and L2 regularization:
| Setup | nDCG@5 |
|---|---|
| RRF baseline | 0.794 |
| LTR (training set) | 0.853 |
| LTR (LOO cross-validation) | 0.848 |
A +0.054 improvement over RRF, with minimal overfitting. The learned weights were:
| Feature | Weight |
|---|---|
| semantic_rank_norm | +0.532 |
| rrf_score | +0.390 |
| in_both | +0.313 |
| semantic_score | +0.250 |
| keyword_rank_norm | -0.162 |
| path_depth_norm | +0.133 |
| content_length_norm | 0.000 |
The model learned that semantic rank order is the strongest signal, that documents appearing in both lists are reliably relevant, and that keyword-only rank is a negative indicator, meaning a document matched surface terms but lacked semantic relevance.
Intuitively this makes sense. On this corpus, keyword search has many false positives (nDCG@5 = 0.431). The model correctly identifies keyword-only results as noise.
At that point, the model looked ready to ship.
## End-to-End Evaluation
Then I plugged the trained weights into the actual search pipeline and ran the end-to-end evaluation.
| Strategy | nDCG@5 | P@3 | MRR |
|---|---|---|---|
| BM25 (keyword only) | 0.431 | 0.246 | 0.534 |
| LTR (trained) | 0.788 | 0.476 | 0.892 |
| RRF (k=60) | 0.794 | 0.476 | 0.903 |
| Semantic only | 0.827 | 0.500 | 0.880 |
The LTR model scored 0.788, below the RRF baseline it was supposed to beat.
## What Went Wrong
The reranking evaluation and the pipeline evaluation measure different things.
Reranking (nDCG@5 = 0.848): "Here are 460 documents already retrieved from both search methods. Sort them." The model sees all candidates, including relevant ones, and only needs to order them correctly.
End-to-end pipeline (nDCG@5 = 0.788): "Run semantic search, run keyword search, fuse the two result lists, return the top results." The fusion strategy also controls which documents survive the cutoff.
The negative keyword_rank_norm weight (-0.162) was the culprit. In reranking, it correctly identifies keyword-only false positives. But in the pipeline, it actively pushes down every document that only appears in keyword results, including the ones that happen to be relevant. Those documents score below the retrieval cutoff and vanish from the final results entirely.
The model learned the right thing for the wrong task.
## The Fix
The intervention was simple: constrain the rank-based feature weights to be non-negative during training. The model can still ignore keyword rank (weight -> 0), but it cannot penalize it.
```rust
let non_negative: [bool; NUM_FEATURES] = [
    false, // bias
    false, // semantic_score
    true,  // semantic_rank_norm
    true,  // keyword_rank_norm
    true,  // in_both
    true,  // rrf_score
    false, // path_depth_norm
    false, // content_length_norm
];
```
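One way to enforce this is projected SGD: after each gradient step, clamp constrained weights back to zero. This is an assumed mechanism for illustration, not necessarily how `train-ltr` applies the constraint:

```rust
// bias + 7 features, matching the constraint array above.
const NUM_FEATURES: usize = 8;

/// Project parameters back onto the feasible region after a gradient step.
fn project(params: &mut [f64; NUM_FEATURES], non_negative: &[bool; NUM_FEATURES]) {
    for (w, &constrained) in params.iter_mut().zip(non_negative) {
        if constrained && *w < 0.0 {
            *w = 0.0; // constrained weight may reach zero but never go negative
        }
    }
}

fn main() {
    let non_negative = [false, false, true, true, true, true, false, false];
    // keyword_rank_norm (index 3) has drifted negative during training.
    let mut params = [0.1, 0.25, 0.5, -0.16, 0.3, 0.4, 0.13, 0.0];
    project(&mut params, &non_negative);
    println!("{:?}", params);
}
```

Projection after each step keeps training simple: the gradient is unchanged, and the model can still drive a constrained weight to exactly zero, which is what happened to keyword_rank_norm.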
After retraining with the constraint:
| Strategy | nDCG@5 (e2e) | LOO-CV (reranking) |
|---|---|---|
| RRF baseline | 0.794 | n/a |
| LTR v1 (unconstrained) | 0.788 | 0.848 |
| LTR v2 (non-negative) | 0.794 | 0.844 |
The constrained model recovered the full pipeline performance. The keyword_rank_norm weight went from -0.162 to +0.007, effectively zero. The model learned to ignore keyword rank rather than penalize it.
But it did not beat RRF. It matched it exactly.
## Why the Model Converges to RRF
Looking at the final weights:
| Feature | Unconstrained | Constrained |
|---|---|---|
| semantic_rank_norm | +0.532 | +0.534 |
| rrf_score | +0.390 | +0.387 |
| in_both | +0.313 | +0.185 |
| semantic_score | +0.250 | +0.218 |
| keyword_rank_norm | -0.162 | +0.007 |
The dominant features, semantic_rank_norm and rrf_score, are highly correlated with RRF's own scoring. The semantic_rank_norm tracks the semantic component of RRF, and rrf_score is the RRF score. The model is learning a slightly reweighted version of RRF, which on 42 queries produces the same ranking.
With only 42 queries and 460 candidate documents, the 7-feature linear model doesn't have enough signal to find patterns that RRF misses. The features that could differentiate (path_depth_norm, content_length_norm) have near-zero weights. On this small, well-curated corpus, RRF is already near-optimal for linear fusion.
## Lessons
Reranking metrics can overstate pipeline gains. This is well documented in learning-to-rank research, but the effect is easy to underestimate. The +0.054 reranking improvement disappeared in end-to-end evaluation. If you evaluate a reranker, always measure end-to-end.
Constraints can encode domain knowledge. The non-negative constraint on rank features captures a practical pipeline rule: the ranker should not discard candidates by penalizing one channel.
Simple baselines are hard to beat with simple models. RRF is robust: it does not penalize documents for appearing in only one list, and its 1/(k + rank) formula is a useful nonlinear rank transform. A 7-feature linear model can approximate RRF, but not reliably exceed it without interaction features.
Small training sets favor conservative behavior. Forty-two queries were enough to control overfitting (LOO-CV confirmed this), but not enough to learn stable corpus-specific patterns beyond the baseline.
## What Could Beat RRF
Based on this experiment, improving beyond RRF at pipeline level likely requires:
- More training data. Roughly 150 or more judged queries would give the model room to learn beyond baseline behavior.
- Query-dependent features. Per-query signals (query length, term rarity, semantic-keyword score divergence) could adapt fusion per query instead of fixed global weights.
- A non-linear model. Interactions such as "trust keyword rank more when semantic score is low" are not representable in a linear model.
- Listwise loss. Optimizing nDCG directly (for example LambdaRank) would align training with the final metric better than pairwise hinge loss.
The infrastructure is in place: the LtrFusion strategy, the training CLI, the feature extraction pipeline. The linear model just needs richer signal to work with.
## Implementation
The full implementation is in qrst:
- Core: `LtrFusion` in `crates/qrst/src/storage/fusion.rs`, 7-feature extraction, dot product scoring, `LtrWeights` with TOML serialization
- Training: `train-ltr` command in `crates/qrst-bench/src/commands/train_ltr.rs`, pairwise SGD, LOO-CV, non-negative constraints
- Config: `fusion_strategy = "ltr"` in `config.toml`, weights auto-loaded from `{data_dir}/ltr_weights.toml`
- Results: `bench/results/ltr-fusion-results.md`
Inference cost is 8 multiply-adds per candidate document, sub-microsecond. The weights are 8 floats in a human-readable TOML file. No new dependencies beyond what qrst already uses (serde + toml).