Context

I built an on-device hybrid search engine that combines BM25 and vector retrieval with Reciprocal Rank Fusion. Reranking metrics suggested a learned linear fusion model would outperform RRF, but end-to-end evaluation showed otherwise. This article explains why the model matched baseline behavior and what to improve next.

How This Started

Before diving in, it’s worth noting that there are many excellent projects in this space - most notably qmd (big thanks to Tobi for the inspiration). While I could have used an existing tool, I wanted to build this myself as a way to dive deeper into Rust and the mechanics of modern search.

All I wanted was a simple semantic search over my own code and documentation. Something local, no cloud APIs, just "find the file that explains how the session store works" without remembering the exact filename or grep pattern. I started with a basic vector search prototype, realized keyword search still catches things embeddings miss, bolted on BM25, and needed a way to combine the two result lists. That led me to RRF, which led me to wondering whether a learned model could do better, which led me into the information retrieval literature: BM25's probabilistic foundations, rank fusion theory, pairwise learning-to-rank, evaluation metrics. One rabbit hole later, I had read the papers, implemented the algorithms, built a benchmark harness, and written qrst - a full hybrid search engine with BM25, vector retrieval, multiple fusion strategies, and a learning-to-rank training pipeline. This article is what I learned along the way.

The Idea

Hybrid search fusion is a ranking problem. You have two scored lists, one from semantic search and one from keyword search, and you need to combine them.

Before fusion, there are the base retrievers. For keyword search, BM25 is the standard. It’s a probabilistic model that scores documents based on term frequency (tf) and inverse document frequency (idf), but with two important safeguards: k1 controls term frequency saturation (preventing a document with 100 mentions of 'rust' from infinitely outscoring one with 10), and b handles document length normalization. These parameters define the 'shape' of keyword relevance.

[Figure: BM25 scoring components — TF saturation curves for k₁ ∈ {10, 50, 100}, and document length normalization curves for b ∈ {0, 0.5, 1}]

BM25(q,d) = Σ IDF(t) · (k₁+1)·tf / (k₁·(1−b+b·|d|/avgdl) + tf)
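The per-term score is compact enough to sketch directly. This is not qrst's actual scorer (which also handles tokenization and IDF estimation), and the parameter values used below, k₁ = 1.2 and b = 0.75, are the common textbook defaults rather than qrst's tuned settings:

```rust
/// Per-term BM25 contribution, following the formula above.
/// Sketch only — qrst's real scorer handles tokenization, IDF, and tuning.
fn bm25_term_score(tf: f64, doc_len: f64, avg_len: f64, idf: f64, k1: f64, b: f64) -> f64 {
    // Length normalization: longer documents need more occurrences to score the same.
    let norm = k1 * (1.0 - b + b * doc_len / avg_len);
    // TF saturation: the term's contribution is bounded above by idf * (k1 + 1).
    idf * tf * (k1 + 1.0) / (norm + tf)
}
```

Doubling tf raises the score less and less as tf grows; that diminishing return is exactly the saturation behavior k₁ controls.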

Reciprocal Rank Fusion (RRF) then merges these scores into a single list. RRF uses a fixed formula: score each document as 1/(k + rank) across both result lists, sort by combined score, done. It is scale-invariant: it doesn't care about the raw scores, only the ranks. The k parameter acts as a smoothing constant that determines how much weight to give to the top-ranked items versus the long tail. As k increases, the score gap between rank 1 and rank 10 shrinks, making the fusion more robust to noise in any single retriever.

[Figure: RRF score vs. rank for k ∈ {10, 30, 60}]

RRF(d) = Σ 1/(k + rankᵢ(d))
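The whole algorithm fits in a few lines. A minimal sketch (qrst's fusion code is structured differently, but the arithmetic is the same):

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion: score(d) = Σ 1/(k + rank_i(d)), ranks 1-based.
/// `lists` holds document ids in rank order, one Vec per retriever.
fn rrf_fuse(lists: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in lists {
        for (i, doc) in list.iter().enumerate() {
            // Each list contributes 1/(k + rank); documents in both lists add up.
            *scores.entry(doc.to_string()).or_insert(0.0) += 1.0 / (k + (i + 1) as f64);
        }
    }
    let mut out: Vec<(String, f64)> = scores.into_iter().collect();
    // Sort by fused score, descending.
    out.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    out
}
```

Note how a document appearing in both lists accumulates two reciprocal-rank terms, which is why RRF naturally boosts cross-retriever agreement.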

A learning-to-rank (LTR) model replaces the fixed formula with a linear function over multiple features, trained on human relevance judgments. Even a simple model should be able to outperform a single static knob by adapting to the corpus characteristics.

I designed 7 features:

| # | Feature | Range | Rationale |
|---|---------|-------|-----------|
| 1 | semantic_score | [0,1] | Raw cosine similarity from vector search |
| 2 | semantic_rank_norm | [0,1] | Normalized rank position in semantic results |
| 3 | keyword_rank_norm | [0,1] | Normalized rank position in keyword results |
| 4 | in_both | {0,1} | 1 if the document appears in both lists |
| 5 | rrf_score | (0,~0.03) | Standard RRF score (so the model can replicate RRF) |
| 6 | path_depth_norm | [0,1] | File path depth as a document-level prior |
| 7 | content_length_norm | [0,1] | Content length as a document-level prior |

The scoring function is a dot product: score(doc) = bias + Σ(wᵢ · fᵢ). Eight floats.
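In Rust, that scoring function is a one-liner over fixed-size arrays. A sketch with the seven features in table order (the layout in qrst itself may differ):

```rust
const NUM_FEATURES: usize = 7;

/// Linear LTR score: bias + Σ wᵢ·fᵢ. Together with the bias, eight floats.
/// Feature order follows the table above; values here are illustrative.
fn ltr_score(bias: f64, weights: &[f64; NUM_FEATURES], features: &[f64; NUM_FEATURES]) -> f64 {
    bias + weights.iter().zip(features.iter()).map(|(w, f)| w * f).sum::<f64>()
}
```

With the rrf_score weight set to 1 and everything else zeroed, the model's output is exactly the RRF score — the replication property discussed below.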

A key design choice: including rrf_score as a feature means the model can replicate RRF exactly by zeroing all other weights. It can only improve, never regress below the baseline. Or so I thought.

Training

The training pipeline lives in qrst-bench, a separate benchmark crate. For each of the 42 evaluation queries:

  1. Run qrst vsearch (semantic) and qrst search (keyword) as subprocesses
  2. Collect scored results from both
  3. Extract the 7 features for every candidate document
  4. Look up the human relevance grade (0-3) from the judgment file

This produced 460 training samples (101 relevant documents). The relevance judgments were created manually: I ran each query, reviewed the top results, and assigned grades from 0 (irrelevant) to 3 (highly relevant). On a small corpus this takes an afternoon. On a production corpus it becomes the bottleneck, which is why most production LTR systems rely on implicit signals like click-through rates, dwell time, or query reformulations rather than manual labels.

I trained using pairwise hinge loss with SGD: for every pair of documents where one has a higher relevance grade, push the model to score it higher.

loss = max(0, margin - (score_better - score_worse))
margin = grade_difference × 0.1
[Figure: pairwise hinge loss vs. Δscore = s_better − s_worse, for margins 0.1, 0.2, 0.3 — loss = max(0, margin − Δscore)]
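A single pairwise SGD step can be sketched as follows. This is the training rule described above, not qrst-bench's verbatim code; note that the bias cancels in the pairwise score difference, so it is omitted here:

```rust
fn dot(w: &[f64], f: &[f64]) -> f64 {
    w.iter().zip(f).map(|(a, b)| a * b).sum()
}

/// One pairwise hinge-loss SGD step: if the better-graded document does not
/// outscore the worse one by at least `margin`, nudge the weights toward it.
/// Returns the hinge loss before the update.
fn hinge_step(w: &mut [f64], better: &[f64], worse: &[f64], margin: f64, lr: f64) -> f64 {
    let delta = dot(w, better) - dot(w, worse);
    let loss = (margin - delta).max(0.0);
    if loss > 0.0 {
        for i in 0..w.len() {
            // Gradient of max(0, margin − (s_b − s_w)) pushes w toward the
            // feature difference of the better document.
            w[i] += lr * (better[i] - worse[i]);
        }
    }
    loss
}
```

Pairs whose margin is already satisfied contribute zero loss and zero gradient, which is what makes hinge loss cheap on mostly-correct orderings.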

I used leave-one-out cross-validation across all 42 queries: train on 41, evaluate on the held-out query, repeat.

Initial Metrics

After 100 epochs with lr=0.001 and L2 regularization:

| Strategy | nDCG@5 |
|----------|--------|
| RRF baseline | 0.794 |
| LTR (training set) | 0.853 |
| LTR (LOO cross-validation) | 0.848 |

A +0.054 improvement over RRF, with minimal overfitting. The learned weights were:

| Feature | Weight |
|---------|--------|
| semantic_rank_norm | +0.532 |
| rrf_score | +0.390 |
| in_both | +0.313 |
| semantic_score | +0.250 |
| keyword_rank_norm | -0.162 |
| path_depth_norm | +0.133 |
| content_length_norm | 0.000 |

The model learned that semantic rank order is the strongest signal, that documents appearing in both lists are reliably relevant, and that keyword-only rank is a negative indicator, meaning a document matched surface terms but lacked semantic relevance.

Intuitively this makes sense. On this corpus, keyword search has many false positives (nDCG@5 = 0.431). The model correctly identifies keyword-only results as noise.

At that point, the model looked ready to ship.

End-to-End Evaluation

Then I plugged the trained weights into the actual search pipeline and ran the end-to-end evaluation.

Three standard ranking metrics: nDCG@5 (normalized discounted cumulative gain, measures graded relevance with position discount), P@3 (precision of the top 3 results), and MRR (mean reciprocal rank, how early the first relevant result appears).
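The graded metrics can be sketched in a few lines; this follows the standard definitions (nDCG with 2^grade − 1 gain and log₂ position discount), not necessarily qrst-bench's exact implementation:

```rust
/// nDCG@k over grades listed in ranked order: DCG with log2 discount,
/// normalized by the DCG of the ideal (sorted) ordering.
fn ndcg_at_k(grades_in_rank_order: &[f64], k: usize) -> f64 {
    let dcg = |g: &[f64]| -> f64 {
        g.iter().take(k).enumerate()
            .map(|(i, rel)| (2f64.powf(*rel) - 1.0) / (i as f64 + 2.0).log2())
            .sum()
    };
    let mut ideal = grades_in_rank_order.to_vec();
    ideal.sort_by(|a, b| b.partial_cmp(a).unwrap());
    let idcg = dcg(&ideal);
    if idcg == 0.0 { 0.0 } else { dcg(grades_in_rank_order) / idcg }
}

/// Reciprocal rank of the first relevant (grade > 0) result; MRR is its
/// mean over queries.
fn reciprocal_rank(grades_in_rank_order: &[f64]) -> f64 {
    grades_in_rank_order.iter()
        .position(|g| *g > 0.0)
        .map_or(0.0, |i| 1.0 / (i as f64 + 1.0))
}
```

A perfectly ordered result list scores nDCG = 1.0; pushing the only relevant document from rank 1 to rank 3 drops MRR from 1.0 to 1/3.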

| Strategy | nDCG@5 | P@3 | MRR |
|----------|--------|-----|-----|
| BM25 (keyword only) | 0.431 | 0.246 | 0.534 |
| LTR (trained) | 0.788 | 0.476 | 0.892 |
| RRF (k=60) | 0.794 | 0.476 | 0.903 |
| Semantic only | 0.827 | 0.500 | 0.880 |

The LTR model scored 0.788, below the RRF baseline it was supposed to beat.

What Went Wrong

The reranking evaluation and the pipeline evaluation measure different things.

Reranking (nDCG@5 = 0.848): "Here are 460 documents already retrieved from both search methods. Sort them." The model sees all candidates, including relevant ones, and only needs to order them correctly.

End-to-end pipeline (nDCG@5 = 0.788): "Run semantic search, run keyword search, fuse the two result lists, return the top results." The fusion strategy also controls which documents survive the cutoff.

The negative keyword_rank_norm weight (-0.162) was the culprit. In reranking, it correctly identifies keyword-only false positives. But in the pipeline, it actively pushes down every document that only appears in keyword results, including the ones that happen to be relevant. Those documents score below the retrieval cutoff and vanish from the final results entirely.

The model learned the right thing for the wrong task.

This is an instance of a general principle in retrieval systems: a reranker can only reorder documents that retrieval surfaces. It cannot recover what was never retrieved. Recall is the ceiling, and ranking quality can only work within it. The reranking evaluation hid this because it presented all candidates at once, removing the retrieval bottleneck entirely.

The Fix

The intervention was simple: constrain the rank-based feature weights to be non-negative during training. The model can still ignore keyword rank (weight -> 0), but it cannot penalize it.

```rust
let non_negative: [bool; NUM_FEATURES] = [
    false, // bias
    false, // semantic_score
    true,  // semantic_rank_norm
    true,  // keyword_rank_norm
    true,  // in_both
    true,  // rrf_score
    false, // path_depth_norm
    false, // content_length_norm
];
```

After retraining with the constraint:

| Strategy | nDCG@5 (e2e) | LOO-CV (reranking) |
|----------|--------------|--------------------|
| RRF baseline | 0.794 | n/a |
| LTR v1 (unconstrained) | 0.788 | 0.848 |
| LTR v2 (non-negative) | 0.794 | 0.844 |

The constrained model recovered the full pipeline performance. The keyword_rank_norm weight went from -0.162 to +0.007, effectively zero. The model learned to ignore keyword rank rather than penalize it.

But it did not beat RRF. It matched it exactly.

Why the Model Converges to RRF

Looking at the final weights:

| Feature | Unconstrained | Constrained |
|---------|---------------|-------------|
| semantic_rank_norm | +0.532 | +0.534 |
| rrf_score | +0.390 | +0.387 |
| in_both | +0.313 | +0.185 |
| semantic_score | +0.250 | +0.218 |
| keyword_rank_norm | -0.162 | +0.007 |

[Figure: feature weights, unconstrained vs. constrained, for all seven features]

The dominant features, semantic_rank_norm and rrf_score, are highly correlated with RRF's own scoring. The semantic_rank_norm tracks the semantic component of RRF, and rrf_score is the RRF score. This creates strong multicollinearity: several feature combinations can produce nearly the same ordering.

In that regime, individual coefficients are not very identifiable. A different random seed or slightly different sample can shift the learned weights while preserving almost identical rankings. So "the model just reweighted RRF" is less a model failure than a consequence of correlated features and limited independent signal.

With only 42 queries and 460 candidate documents, there isn't enough signal to reliably learn behavior beyond the baseline. The features that could differentiate (path_depth_norm, content_length_norm) have near-zero weights. On this small, well-curated corpus, RRF is already near-optimal for this feature set.

Lessons

Reranking metrics can overstate pipeline gains. This is well documented in learning-to-rank research, but the effect is easy to underestimate. The +0.054 reranking improvement disappeared in end-to-end evaluation. If you evaluate a reranker, always measure end-to-end.

Constraints can encode domain knowledge. The non-negative constraint on rank features captures a practical pipeline rule: the ranker should not discard candidates by penalizing one channel.

Simple baselines are hard to beat with simple models. RRF is robust: it does not penalize documents for appearing in only one list, and its 1/(k + rank) formula is a useful nonlinear rank transform. A 7-feature linear model can approximate RRF, but not reliably exceed it without interaction features.

Small training sets favor conservative behavior. Forty-two queries were enough to control overfitting (LOO-CV confirmed this), but not enough to learn stable corpus-specific patterns beyond the baseline.

A slightly higher training metric than validation metric does not, by itself, prove problematic overfitting. It can also reflect mild train/validation distribution differences in a small evaluation set.

Feature design is at least as important as data volume. Manual relevance judgments are accurate but expensive; production systems usually rely on weaker but abundant implicit feedback (clicks, dwell time, reformulations). In practice, model quality is bounded by both data quality/quantity and whether the features capture the real retrieval process.

What Could Beat RRF

Based on this experiment, improving beyond RRF at the pipeline level likely requires:

  • Better query-aware features. Per-query signals (query length, term rarity, semantic-keyword score divergence) could adapt fusion behavior beyond fixed global weights.
  • Interaction features, even in a linear model. Terms like keyword_rank_norm × rare_term_ratio or semantic_score × query_length let a linear model represent conditional behavior.
  • Query-dependent weighting. A single global weight vector may match one corpus, but robust gains often require query-level adaptation.
  • More judged data. About 150+ judged queries would give the model room to learn beyond baseline behavior, especially once richer features are available.
  • Potentially a non-linear model. If linear features saturate, a non-linear model can capture higher-order interactions directly.
  • Listwise loss. Optimizing nDCG directly (for example LambdaRank) would align training with the final metric better than pairwise hinge loss.
  • Neural reranking. Instead of scoring documents independently with a linear model, a neural reranker jointly considers the query and document. Cross-encoders like BERT or monoT5 concatenate query and document into a single input and run a full transformer forward pass per candidate, capturing deep query-document interactions but at high latency - practical only as a second-stage reranker over a small candidate set. Late-interaction models like ColBERT take a different approach: they encode query and document independently into per-token embeddings, then compute fine-grained token-level similarity via MaxSim. This makes ColBERT usable both as a retriever (via ANN search over precomputed token embeddings) and as a reranker, offering a middle ground between bi-encoder speed and cross-encoder quality.

The infrastructure is in place: the LtrFusion strategy, the training CLI, the feature extraction pipeline. The linear model just needs richer signal to work with.

Embedding Model

The goal is on-device search with no cloud API dependency. That rules out hosted embedding services and means inference has to run on whatever hardware is available - typically a laptop CPU, no GPU. The model choice follows from this constraint: we need something small enough to run fast on CPU, accurate enough to produce useful semantic search results, and available in a local runtime format (typically ONNX).

qrst primarily uses ONNX Runtime for inference, loaded with the default execution provider (CPU). Most presets run without GPU acceleration, ensuring compatibility across hardware. However, some models like nomic-embed-text-v1.5 are implemented using the Candle framework, which provides Metal acceleration on macOS. For ONNX-based models, the session is created once via Session::builder()?.commit_from_file(&model_path), and inference then runs one forward pass per batch of 8 chunks. ONNX Runtime's CPU backend is well-optimized for small models: quantized attention, SIMD vectorization, and minimal memory overhead. On an M-series Mac, embedding a 6,600-chunk corpus takes about a minute with EmbeddingGemma 300M.

The system supports five model presets:

| Model | Dimensions | Max Tokens | Notes |
|-------|------------|------------|-------|
| all-MiniLM-L6-v2 | 384 | 128 | Fastest, good baseline |
| nomic-embed-text-v1.5 | 768 | 8192 | Matryoshka embeddings, long context |
| EmbeddingGemma 300M | 768 | 2048 | Best accuracy/speed tradeoff |
| e5-base-v2 | 768 | 512 | Balanced, instruction-tuned |
| bge-base-en | 768 | 512 | Balanced, English-focused |

EmbeddingGemma 300M is the default for benchmarks and the model behind all results in this article. At 300M parameters it is small enough for real-time CPU inference but large enough to capture semantic nuance that the 33M-parameter MiniLM misses. The SciFact results (nDCG@10 = 0.762 semantic-only) confirm it performs well on domain-specific scientific text without fine-tuning.

For ONNX-based presets, model dimensions are auto-detected from ONNX metadata at load time. Each model preset also defines whether to normalize embeddings and what query/document prefixes to prepend (e.g., EmbeddingGemma uses "task: search result | query: " for queries).

The vector index uses USearch, an HNSW implementation created by Ash Vardanian. USearch is a single-file, dependency-free library for approximate nearest neighbor search that compiles to native code on every major platform. It supports multiple scalar types (F32, F16, I8) for the stored vectors, so you can trade precision for memory: F16 halves memory usage with negligible recall loss, I8 quarters it at some accuracy cost. qrst uses F32 by default but the quantization is configurable. USearch also handles index persistence - the HNSW graph is loaded from disk; while USearch supports memory-mapping via view(), qrst currently uses the load() path which reads the index into memory. For a local search engine that needs to start fast and stay light, these properties matter more than marginal recall differences between ANN libraries.

Why not GPU acceleration on Apple Silicon? ONNX Runtime has no Metal Performance Shaders execution provider. The available path is the CoreML EP, which can target the Apple GPU and Neural Engine (ANE), but for transformer models it is currently impractical. Standard transformer operations - Erf for GELU, ReduceMean for LayerNorm, LayerNormalization - are supported in current ONNX Runtime versions, and dynamic shapes are permitted. However, performance can still be slower due to graph partitioning: the model graph gets split into dozens of fragments, each boundary incurring CPU↔CoreML data transfer overhead. In practice this makes CoreML inference slower than pure CPU for models with partial operator coverage. The Rust ort crate does expose a CoreML EP, but its prebuilt binaries do not include it - you would need to compile ONNX Runtime from source.

Apple's own research on deploying transformers on the Neural Engine shows that ANE acceleration requires restructuring the model: replacing nn.Linear with nn.Conv2d, switching to channels-first layout, and chunking multi-head attention into single-head operations. With these changes, Apple demonstrated a 10x speedup on DistilBERT - but this is manual model surgery, not something you get by flipping an execution provider flag. For embedding models available as standard ONNX exports, the M-series CPU is the fastest path. Its high memory bandwidth and AMX/NEON units already deliver sub-second inference for a 300M-parameter model.

The deliberate choice here is pragmatic: we do not need a 7B-parameter model or the absolute best score on MTEB. We need a model that runs in under a second per query on CPU, fits in memory alongside the rest of the application, and produces embeddings good enough that the ranking pipeline - BM25, fusion, and chunking - can do its job. A 300M-parameter model on ONNX/CPU meets all three requirements.

Chunking

Everything described above operates on chunks, not documents. A 500-line markdown file or a Rust module with twenty functions does not get indexed as a single unit. It gets split into pieces that each fit within the embedding model's effective context, and each chunk becomes its own entry in both the keyword and vector indexes. The chunking strategy directly affects retrieval quality: chunks that are too large dilute the semantic signal, chunks that are too small lose context.

qrst uses a pluggable chunking system with three strategies, selected by file extension.

Markdown chunker. Splits on heading boundaries. When a # line appears, the accumulated content is flushed as a chunk. If a section exceeds the budget (80% of model context), it is split again when the next line would exceed the limit. Each chunk carries its heading as a title, which becomes searchable metadata. The minimum chunk size is 10 tokens, filtering out headings-only fragments.
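The markdown strategy reduces to a short greedy loop. A simplified sketch — it counts whitespace-split words as a stand-in for the real tokenizer, and takes the budget and minimum as parameters rather than qrst's configured defaults:

```rust
/// Split markdown into heading-delimited chunks, flushing the buffer on
/// each `#` heading line or when the next line would exceed `budget` tokens.
/// Chunks below `min_tokens` (e.g. heading-only fragments) are discarded.
fn chunk_markdown(src: &str, budget: usize, min_tokens: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut buf = String::new();
    let flush = |buf: &mut String, chunks: &mut Vec<String>| {
        if buf.split_whitespace().count() >= min_tokens {
            chunks.push(buf.trim().to_string());
        }
        buf.clear();
    };
    for line in src.lines() {
        let would_overflow =
            buf.split_whitespace().count() + line.split_whitespace().count() > budget;
        if line.starts_with('#') || would_overflow {
            flush(&mut buf, &mut chunks);
        }
        buf.push_str(line);
        buf.push('\n');
    }
    flush(&mut buf, &mut chunks);
    chunks
}
```

Each emitted chunk begins with its heading line, which is where the searchable title metadata comes from.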

Code chunker. The code chunker implements cAST-style recursive AST merging. Most RAG pipelines inherit line-based chunking from natural-language retrieval, which breaks semantic structure: a function split at line 50 produces two chunks that are each incomplete. The cAST approach instead uses the parse tree to align chunk boundaries with syntactic units.

The algorithm works in three phases. First, tree-sitter parses the source file into an AST. Second, the chunker walks the AST top-down, maintaining a buffer of pending nodes and a token budget (80% of model context by default). At each child node, it applies three rules in order:

  1. Boundary check. If the child is a boundary node (function, struct, impl, class, interface, trait, enum, module), flush any pending buffer as a chunk. The boundary node then starts a new accumulation.
  2. Size check. If the child's token count exceeds the budget, recurse into its children with a fresh buffer. If the child is a leaf that is still too large, fall back to line-based splitting.
  3. Budget check. If adding the child to the buffer would exceed the budget, flush the buffer first, then add the child.

Otherwise, the child is appended to the buffer. When all children are processed, the remaining buffer is flushed.
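The three rules can be sketched over a toy node type. This stands in for tree-sitter's real AST (token counts and boundary flags would come from the parser), and it concatenates node texts where the real chunker slices source bytes:

```rust
/// Minimal stand-in for a tree-sitter node, just enough to sketch the walk.
struct Node {
    text: String,
    tokens: usize,
    boundary: bool,
    children: Vec<Node>,
}

fn node(text: &str, tokens: usize, boundary: bool, children: Vec<Node>) -> Node {
    Node { text: text.to_string(), tokens, boundary, children }
}

fn flush(buf: &mut Vec<String>, buf_tokens: &mut usize, out: &mut Vec<String>) {
    if !buf.is_empty() {
        out.push(buf.join("\n"));
        buf.clear();
        *buf_tokens = 0;
    }
}

fn walk(n: &Node, budget: usize, buf: &mut Vec<String>, buf_tokens: &mut usize, out: &mut Vec<String>) {
    for child in &n.children {
        // Rule 1: a boundary node flushes the pending buffer.
        if child.boundary {
            flush(buf, buf_tokens, out);
        }
        // Rule 2: an oversized node is recursed into with a fresh buffer.
        if child.tokens > budget {
            flush(buf, buf_tokens, out);
            if child.children.is_empty() {
                // Oversized leaf: the real chunker falls back to line splitting.
                out.push(child.text.clone());
            } else {
                walk(child, budget, buf, buf_tokens, out);
            }
            continue;
        }
        // Rule 3: a budget overflow flushes before appending.
        if *buf_tokens + child.tokens > budget {
            flush(buf, buf_tokens, out);
        }
        buf.push(child.text.clone());
        *buf_tokens += child.tokens;
    }
}

fn chunk_ast(root: &Node, budget: usize) -> Vec<String> {
    let (mut buf, mut buf_tokens, mut out) = (Vec::new(), 0usize, Vec::new());
    walk(root, budget, &mut buf, &mut buf_tokens, &mut out);
    flush(&mut buf, &mut buf_tokens, &mut out); // flush the remaining buffer
    out
}
```

Running this over a tree shaped like the example in the text (two small use declarations, a struct, and an oversized impl whose fns are each boundary nodes) produces one chunk for the merged use declarations and one per boundary unit.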

[Figure: cAST recursive AST merging with a token budget. 1. Parse the AST with tree-sitter: source_file → use_declaration 40t, use_declaration 30t, struct_item 120t, impl_item 800t (recursed, over budget) → fn new 140t, fn add 200t, fn search 400t, mod tests 300t. 2. Walk top-down with a budget of 80% of model context: boundary children flush the buffer, oversized children are recursed into, and a budget overflow flushes before appending. 3. Output chunks: Chunk 1 (70t, both use_declarations), Chunk 2 (120t, struct Store), Chunk 3 (140t, fn new), Chunk 4 (200t, fn add), Chunk 5 (400t, fn search), Chunk 6 (300t, mod tests). Boundary nodes start new chunks; oversized nodes are recursed into.]

The diagram illustrates how the algorithm walks the top-level children. The two use_declaration nodes (40t + 30t) are not boundary nodes, so they accumulate in the buffer and flush together as Chunk 1 when the boundary node struct_item is encountered. The struct_item (120t), itself a boundary node, starts a new accumulation, and the oversized impl_item then triggers a flush-before-recurse. Inside the recursive walk, each fn is a boundary node, so each new boundary flushes the previously buffered node (fn new -> Chunk 3, fn add -> Chunk 4, and so on). This ensures that major syntactic units like functions and structs remain isolated unless they are small enough to be merged without crossing boundaries. Non-boundary nodes are merged greedily until a boundary or budget overflow forces a flush.

A key implementation detail: merged chunks preserve inter-node whitespace by slicing source[first.start_byte..last.end_byte] rather than concatenating extracted node texts. This means a chunk reads exactly like the original source, including blank lines between functions, which matters for both readability and keyword search.

Composite chunker. Handles multi-zone files like Astro, Vue, and Svelte components. The file is first split into zones by text boundaries: frontmatter (--- fences), <script>, <style>, and template regions. Each zone is then delegated to the appropriate sub-chunker: TypeScript for script zones, CSS for style zones, HTML for template zones. Zone labels are prefixed to chunk titles ([script] const handler, [style] .container) so search results indicate which part of the component matched.

The registry dispatches by extension; by default it covers .md, .rs, .js, .jsx, .ts, .tsx, .html, .css, .astro, .vue, and .svelte. Files without a matching chunker are skipped.

The indexer walks the directory tree (respecting .gitignore), dispatches each file to its chunker, and feeds the resulting chunks through the embedding model in batches of 8. Each chunk is stored with its content, file path, title, source line range, and embedding vector. Incremental updates use blake3 content hashing to detect changed files: unchanged files are skipped, changed files have their old chunks deleted and new chunks inserted.

The token bounds (defaulting to 80% of model context for maximum and 10 for minimum) are configurable in config.toml but the defaults work well in practice. For prose-heavy content, 80% of context maps to roughly 2000–2500 characters; for code and mixed syntax, the character count is lower because punctuation, operators, and camelCase identifiers each consume separate tokens. Either way, the result fits within the embedding model's context window (typically 512 or 2048 tokens) while providing enough context for meaningful semantic similarity. The 10-token minimum filters out empty sections and standalone headings without discarding short but relevant code snippets.

Implementation

The full implementation is in qrst.

Inference cost for the linear model is negligible - just 8 multiply-adds per candidate document, completing in well under a microsecond per item on modern hardware. The weights are 8 floats in a human-readable TOML file. No new dependencies beyond what qrst already uses (serde + toml).

External Validation on BEIR/SciFact

The panzerotti corpus (my private documentation dataset) is a set where both BM25 and semantic search contribute meaningfully (BM25 nDCG@5 = 0.431). To test whether the findings generalize, I ran the same fusion strategies on SciFact, a public benchmark of 300 scientific claim–evidence pairs, using EmbeddingGemma 300M.

| Strategy | nDCG@10 | nDCG@5 | P@3 | MRR | 95% CI (nDCG@10) |
|----------|---------|--------|-----|-----|------------------|
| BM25 only | 0.047 | 0.047 | 0.017 | 0.050 | n/a |
| Semantic only | 0.762 | 0.744 | 0.282 | 0.723 | n/a |
| RRF (k=60) | 0.761 | 0.745 | 0.282 | 0.726 | [0.724, 0.797] |
| Convex (α=0.95) | 0.762 | 0.745 | 0.282 | 0.726 | [0.724, 0.799] |
| LTR (panzerotti weights) | 0.765 | 0.748 | 0.283 | 0.730 | [0.726, 0.801] |

BM25 is near-zero on SciFact. Beyond the specialized vocabulary, there is a structural confound: qrst indexes at the document level (the full abstract), whereas BEIR baselines often use passage-level indexing. On SciFact's short, dense abstracts, this mismatch significantly penalizes keyword-based retrieval.

When one retriever is broken, fusion approximately collapses to the working retriever. RRF and convex both produce near-identical results to semantic-only because BM25 contributes mostly noise. In expectation, RRF with one random-quality list adds uncorrelated perturbations to the signal-carrying list; with enough documents, the semantic ranking dominates, though individual queries can still see minor rank swaps. The per-system bootstrap confidence intervals overlap heavily - [0.724, 0.797] vs [0.724, 0.799] vs [0.726, 0.801] - which is suggestive but not a formal significance test. A paired bootstrap or permutation test would be needed to make a rigorous claim; the point estimates and CI overlap are consistent with no meaningful difference.

The LTR model, trained on panzerotti weights, transfers to SciFact without degradation but also without improvement (0.765 nDCG@10, within every other strategy's CI). The panzerotti-trained linear model has converged to RRF-equivalent behavior, and that equivalence holds even on an out-of-domain corpus.

This is the complementary case to the panzerotti experiment. On panzerotti, both retrievers contribute and LTR converges to RRF. On SciFact, only one retriever contributes and all fusion strategies converge to semantic-only. Neither case gives LTR room to differentiate. The bottleneck is retriever quality, not the combination method.

Conclusion

This experiment set out to beat RRF with a learned model and ended up explaining why RRF works so well. A 7-feature linear model trained on 42 queries converged to a reweighted version of the very baseline it was supposed to surpass. Importantly, this is not because linear models are inherently weak; with strongly correlated features, the model had little room to learn distinct behavior.

The negative keyword rank weight that looked like a genuine insight in reranking evaluation turned out to be a pipeline-breaking artifact: optimizing for reranking quality is not the same as optimizing for end-to-end retrieval.

The key takeaway is not that learning-to-rank fails for hybrid search. It is that a linear model on a small corpus with limited training signal cannot find structure beyond what a well-chosen heuristic already captures. RRF's 1/(k + rank) formula encodes a reasonable prior: every retrieval channel contributes, no document is penalized for appearing in only one list, and rank is compressed through a monotonic transform. Replicating those properties is easy. Exceeding them requires richer features, more training data, or a model class that can represent interactions.

The more useful outcome is methodological. Reranking metrics and pipeline metrics measure different things. A reranker operates on a fixed candidate set; a fusion strategy also determines which candidates survive. Any learned ranker that can suppress documents below a retrieval cutoff will show a gap between these two evaluations. Measuring end-to-end from the start would have caught the negative-weight problem before it looked like an improvement.

For practitioners building hybrid search: start with RRF. It is fast, parameter-light, and robust. If you have enough labeled data and the right features to justify a learned model, evaluate it end-to-end against RRF before shipping. The bar is higher than reranking metrics suggest.

Appendix

Acronyms

| Acronym | Meaning |
|---------|---------|
| AMX | Apple Matrix Co-processor (hardware accelerator in M-series chips) |
| ANE | Apple Neural Engine (on-chip neural network accelerator) |
| ANN | Approximate Nearest Neighbor |
| AST | Abstract Syntax Tree |
| BEIR | Benchmarking IR (standardized information retrieval benchmark suite) |
| BM25 | Best Match 25 (probabilistic term-scoring function) |
| cAST | Code AST (recursive AST-based chunking strategy) |
| CI | Confidence Interval |
| CoreML | Apple's machine learning framework for on-device inference |
| EP | Execution Provider (ONNX Runtime backend for hardware acceleration) |
| HNSW | Hierarchical Navigable Small World (graph-based ANN algorithm) |
| IDF | Inverse Document Frequency |
| LOO-CV | Leave-One-Out Cross-Validation |
| LTR | Learning to Rank |
| MPS | Metal Performance Shaders (Apple GPU compute framework) |
| MRR | Mean Reciprocal Rank |
| MTEB | Massive Text Embedding Benchmark |
| nDCG | Normalized Discounted Cumulative Gain |
| ONNX | Open Neural Network Exchange (portable model format) |
| P@k | Precision at rank k |
| RAG | Retrieval-Augmented Generation |
| RRF | Reciprocal Rank Fusion |
| SGD | Stochastic Gradient Descent |
| SIMD | Single Instruction, Multiple Data |
| TF | Term Frequency |
| TOML | Tom's Obvious Minimal Language (configuration format) |

References

| Title | Summary | Link |
|-------|---------|------|
| A General Theory of Relevance (BM25 Review) | Foundational review of the BM25 scoring function covering TF saturation and document length normalization | PDF |
| Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods | Introduces RRF as a simple, parameter-light method for combining multiple ranked lists that outperforms more complex approaches | ACM |
| Pairwise Hinge Loss (MSR-TR-2010-82) | Describes the pairwise hinge loss objective for training ranking models by pushing relevant documents above irrelevant ones by a margin | PDF |
| Cumulative Gain-Based Evaluation of IR Techniques (nDCG) | Introduces nDCG as a graded relevance metric with position-based discounting for evaluating ranked retrieval | ACM |
| LambdaRank: Learning to Rank with Nonsmooth Cost Functions | Proposes listwise ranking optimization that directly optimizes nDCG through lambda gradients | PDF |
| cAST: Code AST-Based Chunking for Retrieval | Presents recursive AST merging for code chunking, aligning chunk boundaries with syntactic units instead of line counts | arXiv |
| Deploying Transformers on the Apple Neural Engine | Apple's guide to restructuring transformer architectures for ANE acceleration, achieving 10x speedup on DistilBERT | Apple ML |