# Learning to Rank in Hybrid Search: Why LTR Matched RRF
## Context
I built an on-device hybrid search engine that combines BM25 and vector retrieval with Reciprocal Rank Fusion. Reranking metrics suggested a learned linear fusion model would outperform RRF, but end-to-end evaluation showed otherwise. This article explains why the model matched baseline behavior and what to improve next.
## The Idea
Hybrid search fusion is a ranking problem. You have two scored lists, one from semantic search and one from keyword search, and you need to combine them. RRF uses a fixed formula: score each document as 1/(k + rank) in each result list, sum the contributions, sort by combined score, done. Not bad for a formula with one parameter. A convex combination uses a single alpha weight: α · semantic + (1 − α) · keyword. Both are static: they don't adapt to the corpus.
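As a concrete sketch of the RRF side (1-based ranks to match the 1/(k + rank) formula; the helper names here are illustrative, not the qrst implementation):

```rust
use std::collections::HashMap;

/// RRF contribution of a single 1-based rank position.
fn rrf_score(rank: usize, k: f64) -> f64 {
    1.0 / (k + rank as f64)
}

/// Fuse two ranked lists of document ids by summing per-list RRF contributions.
fn rrf_fuse(semantic: &[&str], keyword: &[&str], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in [semantic, keyword] {
        for (i, doc) in list.iter().enumerate() {
            *scores.entry((*doc).to_string()).or_insert(0.0) += rrf_score(i + 1, k);
        }
    }
    let mut fused: Vec<_> = scores.into_iter().collect();
    // Highest combined score first.
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}

fn main() {
    // "b" appears in both lists, so it wins even though it tops neither.
    let fused = rrf_fuse(&["a", "b"], &["b", "c"], 60.0);
    println!("{:?}", fused);
}
```

Note how a document appearing in both lists accumulates two contributions, which is the behavior the `in_both` feature later captures explicitly.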
A learning-to-rank (LTR) model replaces the fixed formula with a linear function over multiple features, trained on human relevance judgments. Even a simple model should be able to outperform a single static knob by adapting to corpus characteristics.
I designed 7 features:
| # | Feature | Range | Rationale |
|---|---|---|---|
| 1 | semantic_score | [0,1] | Raw cosine similarity from vector search |
| 2 | semantic_rank_norm | [0,1] | Normalized rank position in semantic results |
| 3 | keyword_rank_norm | [0,1] | Normalized rank position in keyword results |
| 4 | in_both | {0,1} | 1 if the document appears in both lists |
| 5 | rrf_score | (0,~0.03) | Standard RRF score (so the model can replicate RRF) |
| 6 | path_depth_norm | [0,1] | File path depth as a document-level prior |
| 7 | content_length_norm | [0,1] | Content length as a document-level prior |
The scoring function is a dot product: score(doc) = bias + Σ(wᵢ · fᵢ). Eight floats.
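A minimal sketch of that scorer (`ltr_score` and the example values are illustrative, not the qrst implementation; feature indices follow the table above):

```rust
// Linear LTR scorer: a bias plus a 7-feature dot product.
fn ltr_score(bias: f64, weights: &[f64; 7], features: &[f64; 7]) -> f64 {
    bias + weights.iter().zip(features).map(|(w, f)| w * f).sum::<f64>()
}

fn main() {
    // Zeroing every weight except rrf_score (index 4 in the feature table)
    // makes the model reproduce RRF's ranking exactly.
    let mut weights = [0.0f64; 7];
    weights[4] = 1.0;
    let features = [0.91, 0.80, 0.10, 1.0, 0.0301, 0.50, 0.40];
    println!("{}", ltr_score(0.0, &weights, &features)); // the document's rrf_score
}
```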
A key design choice: including rrf_score as a feature means the model can replicate RRF exactly by zeroing all other weights. It can only improve, never regress below the baseline. Or so I thought.
## Training
The training pipeline lives in qrst-bench, a separate benchmark crate. For each of the 42 evaluation queries:
- Run `qrst vsearch` (semantic) and `qrst search` (keyword) as subprocesses
- Collect scored results from both
- Extract the 7 features for every candidate document
- Look up the human relevance grade (0-3) from the judgment file
This produced 460 training samples (101 relevant documents). I trained using pairwise hinge loss with SGD: for every pair of documents where one has a higher relevance grade, push the model to score it higher.
```
loss = max(0, margin - (score_better - score_worse))
margin = grade_difference × 0.1
```
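One SGD step under this hinge loss looks roughly like the following sketch (the parameter layout, with `params[0]` as the bias and the rest as the 7 feature weights, and the function name are assumptions, not the qrst-bench code):

```rust
// One pairwise SGD step under the hinge loss above.
fn hinge_update(
    params: &mut [f64; 8],
    better: &[f64; 7], // features of the higher-graded document
    worse: &[f64; 7],  // features of the lower-graded document
    grade_diff: f64,   // difference in relevance grades (> 0)
    lr: f64,
) {
    let score = |f: &[f64; 7], p: &[f64; 8]| {
        p[0] + p[1..].iter().zip(f).map(|(w, x)| w * x).sum::<f64>()
    };
    let margin = grade_diff * 0.1;
    // Hinge loss: update only when the pair is not yet separated by the margin.
    if margin - (score(better, params) - score(worse, params)) > 0.0 {
        for i in 0..7 {
            // Subgradient step: move weights toward the feature difference.
            params[i + 1] += lr * (better[i] - worse[i]);
        }
        // The bias cancels in the pairwise score difference, so it is not updated.
    }
}

fn main() {
    let mut params = [0.0f64; 8];
    let better = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0];
    let worse = [0.0f64; 7];
    hinge_update(&mut params, &better, &worse, 2.0, 0.1);
    println!("updated weights: {:?}", &params[1..]);
}
```

Because only score differences matter in pairwise training, the bias never receives a gradient; it only shifts all scores uniformly at inference time.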
Leave-one-out cross-validation across all 42 queries. Train on 41, evaluate on the held-out query, repeat.
## Initial Metrics
After 100 epochs with lr=0.001 and L2 regularization:
| Setup | nDCG@5 |
|---|---|
| RRF baseline | 0.794 |
| LTR (training set) | 0.853 |
| LTR (LOO cross-validation) | 0.848 |
A +0.054 improvement over RRF, with minimal overfitting. The learned weights were:
| Feature | Weight |
|---|---|
| semantic_rank_norm | +0.532 |
| rrf_score | +0.390 |
| in_both | +0.313 |
| semantic_score | +0.250 |
| keyword_rank_norm | -0.162 |
| path_depth_norm | +0.133 |
| content_length_norm | 0.000 |
The model learned that semantic rank order is the strongest signal, that documents appearing in both lists are reliably relevant, and that keyword-only rank is a negative indicator, meaning a document matched surface terms but lacked semantic relevance.
Intuitively this makes sense. On this corpus, keyword search has many false positives (nDCG@5 = 0.431). The model correctly identifies keyword-only results as noise.
At that point, the model looked ready to ship.
## End-to-End Evaluation
Then I plugged the trained weights into the actual search pipeline and ran the end-to-end evaluation.
| Strategy | nDCG@5 | P@3 | MRR |
|---|---|---|---|
| BM25 (keyword only) | 0.431 | 0.246 | 0.534 |
| LTR (trained) | 0.788 | 0.476 | 0.892 |
| RRF (k=60) | 0.794 | 0.476 | 0.903 |
| Semantic only | 0.827 | 0.500 | 0.880 |
The LTR model scored 0.788, below the RRF baseline it was supposed to beat.
## What Went Wrong
The reranking evaluation and the pipeline evaluation measure different things.
Reranking (nDCG@5 = 0.848): "Here are 460 documents already retrieved from both search methods. Sort them." The model sees all candidates, including relevant ones, and only needs to order them correctly.
End-to-end pipeline (nDCG@5 = 0.788): "Run semantic search, run keyword search, fuse the two result lists, return the top results." The fusion strategy also controls which documents survive the cutoff.
The negative keyword_rank_norm weight (-0.162) was the culprit. In reranking, it correctly identifies keyword-only false positives. But in the pipeline, it actively pushes down every document that only appears in keyword results, including the ones that happen to be relevant. Those documents score below the retrieval cutoff and vanish from the final results entirely.
The model learned the right thing for the wrong task.
## The Fix
The intervention was simple: constrain the rank-based feature weights to be non-negative during training. The model can still ignore keyword rank (weight -> 0), but it cannot penalize it.
```rust
let non_negative: [bool; NUM_FEATURES] = [
    false, // bias
    false, // semantic_score
    true,  // semantic_rank_norm
    true,  // keyword_rank_norm
    true,  // in_both
    true,  // rrf_score
    false, // path_depth_norm
    false, // content_length_norm
];
```
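One way to enforce this is projected SGD: after each gradient step, clamp constrained weights back to zero. This is an assumed mechanism for illustration, not necessarily how `train-ltr` applies the constraint:

```rust
// bias + 7 features, matching the constraint array above.
const NUM_FEATURES: usize = 8;

/// Project parameters back onto the feasible region after a gradient step.
fn project(params: &mut [f64; NUM_FEATURES], non_negative: &[bool; NUM_FEATURES]) {
    for (w, &constrained) in params.iter_mut().zip(non_negative) {
        if constrained && *w < 0.0 {
            *w = 0.0; // constrained weight may reach zero but never go negative
        }
    }
}

fn main() {
    let non_negative = [false, false, true, true, true, true, false, false];
    // keyword_rank_norm (index 3) has drifted negative during training.
    let mut params = [0.1, 0.25, 0.5, -0.16, 0.3, 0.4, 0.13, 0.0];
    project(&mut params, &non_negative);
    println!("{:?}", params);
}
```

Projection after each step keeps training simple: the gradient is unchanged, and the model can still drive a constrained weight to exactly zero, which is what happened to keyword_rank_norm.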
After retraining with the constraint:
| Strategy | nDCG@5 (e2e) | LOO-CV (reranking) |
|---|---|---|
| RRF baseline | 0.794 | n/a |
| LTR v1 (unconstrained) | 0.788 | 0.848 |
| LTR v2 (non-negative) | 0.794 | 0.844 |
The constrained model recovered the full pipeline performance. The keyword_rank_norm weight went from -0.162 to +0.007, effectively zero. The model learned to ignore keyword rank rather than penalize it.
But it did not beat RRF. It matched it exactly.
## Why the Model Converges to RRF
Looking at the final weights:
| Feature | Unconstrained | Constrained |
|---|---|---|
| semantic_rank_norm | +0.532 | +0.534 |
| rrf_score | +0.390 | +0.387 |
| in_both | +0.313 | +0.185 |
| semantic_score | +0.250 | +0.218 |
| keyword_rank_norm | -0.162 | +0.007 |
The dominant features, semantic_rank_norm and rrf_score, are highly correlated with RRF's own scoring. The semantic_rank_norm tracks the semantic component of RRF, and rrf_score is the RRF score. The model is learning a slightly reweighted version of RRF, which on 42 queries produces the same ranking.
With only 42 queries and 460 candidate documents, the 7-feature linear model doesn't have enough signal to find patterns that RRF misses. The features that could differentiate (path_depth_norm, content_length_norm) have near-zero weights. On this small, well-curated corpus, RRF is already near-optimal for linear fusion.
## Lessons
Reranking metrics can overstate pipeline gains. This is well documented in learning-to-rank research, but the effect is easy to underestimate. The +0.054 reranking improvement disappeared in end-to-end evaluation. If you evaluate a reranker, always measure end-to-end.
Constraints can encode domain knowledge. The non-negative constraint on rank features captures a practical pipeline rule: the ranker should not discard candidates by penalizing one channel.
Simple baselines are hard to beat with simple models. RRF is robust: it does not penalize documents for appearing in only one list, and its 1/(k + rank) formula is a useful nonlinear rank transform. A 7-feature linear model can approximate RRF, but not reliably exceed it without interaction features.
Small training sets favor conservative behavior. Forty-two queries were enough to control overfitting (LOO-CV confirmed this), but not enough to learn stable corpus-specific patterns beyond the baseline.
## What Could Beat RRF
Based on this experiment, improving beyond RRF at pipeline level likely requires:
- More training data. Roughly 150 or more judged queries would give the model room to learn beyond baseline behavior.
- Query-dependent features. Per-query signals (query length, term rarity, semantic-keyword score divergence) could adapt fusion per query instead of fixed global weights.
- A non-linear model. Interactions such as "trust keyword rank more when semantic score is low" are not representable in a linear model.
- Listwise loss. Optimizing nDCG directly (for example LambdaRank) would align training with the final metric better than pairwise hinge loss.
The infrastructure is in place: the LtrFusion strategy, the training CLI, the feature extraction pipeline. The linear model just needs richer signal to work with.
## Implementation
The full implementation is in qrst:
- Core: `LtrFusion` in `crates/qrst/src/storage/fusion.rs`, 7-feature extraction, dot product scoring, `LtrWeights` with TOML serialization
- Training: `train-ltr` command in `crates/qrst-bench/src/commands/train_ltr.rs`, pairwise SGD, LOO-CV, non-negative constraints
- Config: `fusion_strategy = "ltr"` in `config.toml`, weights auto-loaded from `{data_dir}/ltr_weights.toml`
- Results: `bench/results/ltr-fusion-results.md`
Inference cost is 8 multiply-adds per candidate document, sub-microsecond. The weights are 8 floats in a human-readable TOML file. No new dependencies beyond what qrst already uses (serde + toml).