All I wanted was a simple code search; I ended up in a ranking theory rabbit hole
Context
I built an on-device hybrid search engine that combines BM25 and vector retrieval with Reciprocal Rank Fusion. Reranking metrics suggested a learned linear fusion model would outperform RRF, but end-to-end evaluation showed otherwise. This article explains why the model matched baseline behavior and what to improve next.
How This Started
Before diving in, it’s worth noting that there are many excellent projects in this space - most notably qmd (big thanks to Tobi for the inspiration). While I could have used an existing tool, I wanted to build this myself as a way to dive deeper into Rust and the mechanics of modern search.
All I wanted was a simple semantic search over my own code and documentation. Something local, no cloud APIs, just "find the file that explains how the session store works" without remembering the exact filename or grep pattern. I started with a basic vector search prototype, realized keyword search still catches things embeddings miss, bolted on BM25, and needed a way to combine the two result lists. That led me to RRF, which led me to wondering whether a learned model could do better, which led me into the information retrieval literature: BM25's probabilistic foundations, rank fusion theory, pairwise learning-to-rank, evaluation metrics. One rabbit hole later, I had read the papers, implemented the algorithms, built a benchmark harness, and written qrst - a full hybrid search engine with BM25, vector retrieval, multiple fusion strategies, and a learning-to-rank training pipeline. This article is what I learned along the way.
The Idea
Hybrid search fusion is a ranking problem. You have two scored lists, one from semantic search and one from keyword search, and you need to combine them.
Before fusion, there are the base retrievers. For keyword search, BM25 is the standard. It’s a probabilistic model that scores documents based on term frequency (tf) and inverse document frequency (idf), but with two important safeguards: k1 controls term frequency saturation (preventing a document with 100 mentions of 'rust' from infinitely outscoring one with 10), and b handles document length normalization. These parameters define the 'shape' of keyword relevance.
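As a sketch of how those two parameters interact, here is the standard Okapi BM25 term score (illustrative, not necessarily qrst's exact variant):

```rust
/// Single-term BM25 sketch (standard Okapi form; not qrst's exact code).
/// idf dampens common terms; k1 saturates term frequency; b normalizes length.
fn bm25_term_score(
    tf: f64,          // occurrences of the term in this document
    doc_len: f64,     // document length in tokens
    avg_doc_len: f64, // mean document length in the corpus
    n_docs: f64,      // total documents in the corpus
    doc_freq: f64,    // documents containing the term
    k1: f64,          // TF saturation, typically 1.2-2.0
    b: f64,           // length normalization, typically 0.75
) -> f64 {
    // Smoothed IDF: rare terms score high, ubiquitous terms near zero.
    let idf = ((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0).ln();
    // Length-adjusted saturation denominator.
    let norm = k1 * (1.0 - b + b * doc_len / avg_doc_len);
    idf * (tf * (k1 + 1.0)) / (tf + norm)
}
```

With k1 = 1.2 and an average-length document, a term occurring 100 times scores only about 10% higher than one occurring 10 times - that is the saturation behavior in action.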
Reciprocal Rank Fusion (RRF) then merges these scores into a single list. RRF uses a fixed formula: score each document as 1/(k + rank) across both result lists, sort by combined score, done. It is scale-invariant: it doesn't care about the raw scores, only the ranks. The k parameter acts as a smoothing constant that determines how much weight to give to the top-ranked items versus the long tail. As k increases, the score gap between rank 1 and rank 10 shrinks, making the fusion more robust to noise in any single retriever.
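The whole algorithm fits in a few lines. A minimal sketch, assuming ranks start at 1:

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion sketch: each document scores sum of 1/(k + rank)
/// over every list it appears in. Only ranks matter, never raw scores.
fn rrf_fuse(lists: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in lists {
        for (i, doc) in list.iter().enumerate() {
            // rank = i + 1, so the top hit contributes 1/(k + 1)
            *scores.entry(doc.to_string()).or_insert(0.0) += 1.0 / (k + (i + 1) as f64);
        }
    }
    let mut fused: Vec<_> = scores.into_iter().collect();
    // Highest combined score first.
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}
```

Note how a document at rank 2 in both lists outranks one at rank 1 in a single list: appearing in both channels is rewarded automatically.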
A learning-to-rank (LTR) model replaces the fixed formula with a linear function over multiple features, trained on human relevance judgments. Even a simple model should be able to outperform a single static knob by adapting to the corpus characteristics.
I designed 7 features:
| # | Feature | Range | Rationale |
|---|---|---|---|
| 1 | semantic_score | [0,1] | Raw cosine similarity from vector search |
| 2 | semantic_rank_norm | [0,1] | Normalized rank position in semantic results |
| 3 | keyword_rank_norm | [0,1] | Normalized rank position in keyword results |
| 4 | in_both | {0,1} | 1 if the document appears in both lists |
| 5 | rrf_score | (0,~0.03) | Standard RRF score (so the model can replicate RRF) |
| 6 | path_depth_norm | [0,1] | File path depth as a document-level prior |
| 7 | content_length_norm | [0,1] | Content length as a document-level prior |
The scoring function is a dot product: score(doc) = bias + Σ(wi * fi). Eight floats.
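A sketch of the scorer, with the feature order following the table above:

```rust
/// Number of features in the linear model (bias stored separately).
const NUM_FEATURES: usize = 7;

/// Linear LTR scoring sketch: bias + dot(weights, features).
/// The bias plus seven weights are the "eight floats".
fn ltr_score(bias: f64, weights: &[f64; NUM_FEATURES], features: &[f64; NUM_FEATURES]) -> f64 {
    bias + weights.iter().zip(features).map(|(w, f)| w * f).sum::<f64>()
}
```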
A key design choice: including rrf_score as a feature means the model can replicate RRF exactly by zeroing all other weights. It can only improve, never regress below the baseline. Or so I thought.
Training
The training pipeline lives in qrst-bench, a separate benchmark crate. For each of the 42 evaluation queries:
- Run `qrst vsearch` (semantic) and `qrst search` (keyword) as subprocesses
- Collect scored results from both
- Extract the 7 features for every candidate document
- Look up the human relevance grade (0-3) from the judgment file
This produced 460 training samples (101 relevant documents). The relevance judgments were created manually: I ran each query, reviewed the top results, and assigned grades from 0 (irrelevant) to 3 (highly relevant). On a small corpus this takes an afternoon. On a production corpus it becomes the bottleneck, which is why most production LTR systems rely on implicit signals like click-through rates, dwell time, or query reformulations rather than manual labels.
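The feature-extraction step can be sketched for a single candidate document as follows. The exact normalizations here are assumptions for illustration, not qrst's actual formulas:

```rust
/// Sketch of per-candidate feature extraction (feature order as in the table
/// above; the normalizations are assumptions, not qrst's exact code).
fn extract_features(
    sem_rank: Option<usize>, // 0-based rank in semantic results, if present
    sem_score: f64,          // raw cosine similarity
    kw_rank: Option<usize>,  // 0-based rank in keyword results, if present
    list_len: usize,         // retrieval depth of each list
    k: f64,                  // RRF smoothing constant
) -> [f64; 7] {
    // Higher = earlier in the list; 0.0 if absent from that retriever.
    let rank_norm = |r: Option<usize>| r.map_or(0.0, |r| 1.0 - r as f64 / list_len as f64);
    let rrf = |r: Option<usize>| r.map_or(0.0, |r| 1.0 / (k + (r + 1) as f64));
    [
        sem_score,
        rank_norm(sem_rank),
        rank_norm(kw_rank),
        (sem_rank.is_some() && kw_rank.is_some()) as u8 as f64, // in_both
        rrf(sem_rank) + rrf(kw_rank),                           // rrf_score
        0.0, // path_depth_norm: document prior, omitted in this sketch
        0.0, // content_length_norm: document prior, omitted in this sketch
    ]
}
```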
I trained using pairwise hinge loss with SGD: for every pair of documents where one has a higher relevance grade, push the model to score it higher.
```
loss = max(0, margin - (score_better - score_worse))
margin = grade_difference × 0.1
```
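One SGD step on a single pair can be sketched as a subgradient update (hypothetical code, not qrst's exact trainer; regularization omitted):

```rust
/// Pairwise hinge-loss SGD step sketch. If the better-graded document does
/// not outscore the worse one by the margin, nudge weights toward the
/// better document's features. Returns the loss before the update.
fn hinge_update(
    weights: &mut [f64],
    better: &[f64],    // features of the higher-graded document
    worse: &[f64],     // features of the lower-graded document
    grade_diff: f64,   // difference in relevance grades (> 0)
    lr: f64,           // learning rate
) -> f64 {
    let score = |w: &[f64], f: &[f64]| w.iter().zip(f).map(|(w, f)| w * f).sum::<f64>();
    let margin = grade_diff * 0.1;
    let loss = (margin - (score(weights, better) - score(weights, worse))).max(0.0);
    if loss > 0.0 {
        // Subgradient of the hinge: move along (better - worse).
        for ((w, b), wo) in weights.iter_mut().zip(better).zip(worse) {
            *w += lr * (b - wo);
        }
    }
    loss
}
```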
Leave-one-out cross-validation across all 42 queries. Train on 41, evaluate on the held-out query, repeat.
Initial Metrics
After 100 epochs with lr=0.001 and L2 regularization:
| Strategy | nDCG@5 |
|---|---|
| RRF baseline | 0.794 |
| LTR (training set) | 0.853 |
| LTR (LOO cross-validation) | 0.848 |
A +0.054 improvement over RRF, with minimal overfitting. The learned weights were:
| Feature | Weight |
|---|---|
| semantic_rank_norm | +0.532 |
| rrf_score | +0.390 |
| in_both | +0.313 |
| semantic_score | +0.250 |
| keyword_rank_norm | -0.162 |
| path_depth_norm | +0.133 |
| content_length_norm | 0.000 |
The model learned that semantic rank order is the strongest signal, that documents appearing in both lists are reliably relevant, and that keyword-only rank is a negative indicator, meaning a document matched surface terms but lacked semantic relevance.
Intuitively this makes sense. On this corpus, keyword search has many false positives (nDCG@5 = 0.431). The model correctly identifies keyword-only results as noise.
At that point, the model looked ready to ship.
End-to-End Evaluation
Then I plugged the trained weights into the actual search pipeline and ran the end-to-end evaluation.
Three standard ranking metrics: nDCG@5 (normalized discounted cumulative gain, measures graded relevance with position discount), P@3 (precision of the top 3 results), and MRR (mean reciprocal rank, how early the first relevant result appears).
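For reference, nDCG@k with the common 2^grade - 1 gain and log2 position discount can be sketched as follows (qrst's exact gain function is an assumption here):

```rust
/// nDCG@k sketch: DCG with graded gains 2^grade - 1, discounted by
/// log2(rank + 1), normalized by the DCG of the ideal (sorted) ordering.
fn ndcg_at_k(grades: &[u32], k: usize) -> f64 {
    let dcg = |g: &[u32]| -> f64 {
        g.iter()
            .take(k)
            .enumerate()
            // rank = i + 1, so the discount is log2(i + 2)
            .map(|(i, &grade)| ((1u64 << grade) as f64 - 1.0) / (i as f64 + 2.0).log2())
            .sum()
    };
    let mut ideal = grades.to_vec();
    ideal.sort_unstable_by(|a, b| b.cmp(a)); // best grades first
    let idcg = dcg(&ideal);
    if idcg == 0.0 { 0.0 } else { dcg(grades) / idcg }
}
```

A perfect ordering scores 1.0; burying a grade-3 document at rank 2 behind an irrelevant one drops the score to about 0.63.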
| Strategy | nDCG@5 | P@3 | MRR |
|---|---|---|---|
| BM25 (keyword only) | 0.431 | 0.246 | 0.534 |
| LTR (trained) | 0.788 | 0.476 | 0.892 |
| RRF (k=60) | 0.794 | 0.476 | 0.903 |
| Semantic only | 0.827 | 0.500 | 0.880 |
The LTR model scored 0.788, below the RRF baseline it was supposed to beat.
What Went Wrong
The reranking evaluation and the pipeline evaluation measure different things.
Reranking (nDCG@5 = 0.848): "Here are 460 documents already retrieved from both search methods. Sort them." The model sees all candidates, including relevant ones, and only needs to order them correctly.
End-to-end pipeline (nDCG@5 = 0.788): "Run semantic search, run keyword search, fuse the two result lists, return the top results." The fusion strategy also controls which documents survive the cutoff.
The negative keyword_rank_norm weight (-0.162) was the culprit. In reranking, it correctly identifies keyword-only false positives. But in the pipeline, it actively pushes down every document that only appears in keyword results, including the ones that happen to be relevant. Those documents score below the retrieval cutoff and vanish from the final results entirely.
The model learned the right thing for the wrong task.
This is an instance of a general principle in retrieval systems: a reranker can only reorder documents that retrieval surfaces. It cannot recover what was never retrieved. Recall is the ceiling, and ranking quality can only work within it. The reranking evaluation hid this because it presented all candidates at once, removing the retrieval bottleneck entirely.
The Fix
The intervention was simple: constrain the rank-based feature weights to be non-negative during training. The model can still ignore keyword rank (weight -> 0), but it cannot penalize it.
```rust
let non_negative: [bool; NUM_FEATURES] = [
    false, // bias
    false, // semantic_score
    true,  // semantic_rank_norm
    true,  // keyword_rank_norm
    true,  // in_both
    true,  // rrf_score
    false, // path_depth_norm
    false, // content_length_norm
];
```
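One way to enforce this during training is projected SGD: take the usual gradient step, then clamp constrained weights back into the non-negative half-space. A minimal sketch:

```rust
/// Projection step sketch: after each SGD update, clamp constrained
/// weights at zero. Unconstrained weights may stay negative.
fn project_non_negative(weights: &mut [f64], non_negative: &[bool]) {
    for (w, &constrained) in weights.iter_mut().zip(non_negative) {
        if constrained && *w < 0.0 {
            *w = 0.0;
        }
    }
}
```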
After retraining with the constraint:
| Strategy | nDCG@5 (e2e) | LOO-CV (reranking) |
|---|---|---|
| RRF baseline | 0.794 | n/a |
| LTR v1 (unconstrained) | 0.788 | 0.848 |
| LTR v2 (non-negative) | 0.794 | 0.844 |
The constrained model recovered the full pipeline performance. The keyword_rank_norm weight went from -0.162 to +0.007, effectively zero. The model learned to ignore keyword rank rather than penalize it.
But it did not beat RRF. It matched it exactly.
Why the Model Converges to RRF
Looking at the final weights:
| Feature | Unconstrained | Constrained |
|---|---|---|
| semantic_rank_norm | +0.532 | +0.534 |
| rrf_score | +0.390 | +0.387 |
| in_both | +0.313 | +0.185 |
| semantic_score | +0.250 | +0.218 |
| keyword_rank_norm | -0.162 | +0.007 |
The dominant features, semantic_rank_norm and rrf_score, are highly correlated with RRF's own scoring. The semantic_rank_norm tracks the semantic component of RRF, and rrf_score is the RRF score. This creates strong multicollinearity: several feature combinations can produce nearly the same ordering.
In that regime, individual coefficients are not very identifiable. A different random seed or slightly different sample can shift the learned weights while preserving almost identical rankings. So "the model just reweighted RRF" is less a model failure than a consequence of correlated features and limited independent signal.
With only 42 queries and 460 candidate documents, there isn't enough signal to reliably learn behavior beyond the baseline. The features that could differentiate (path_depth_norm, content_length_norm) have near-zero weights. On this small, well-curated corpus, RRF is already near-optimal for this feature set.
Lessons
Reranking metrics can overstate pipeline gains. This is well documented in learning-to-rank research, but the effect is easy to underestimate. The +0.054 reranking improvement disappeared in end-to-end evaluation. If you evaluate a reranker, always measure end-to-end.
Constraints can encode domain knowledge. The non-negative constraint on rank features captures a practical pipeline rule: the ranker should not discard candidates by penalizing one channel.
Simple baselines are hard to beat with simple models. RRF is robust: it does not penalize documents for appearing in only one list, and its 1/(k + rank) formula is a useful nonlinear rank transform. A 7-feature linear model can approximate RRF, but not reliably exceed it without interaction features.
Small training sets favor conservative behavior. Forty-two queries were enough to control overfitting (LOO-CV confirmed this), but not enough to learn stable corpus-specific patterns beyond the baseline.
A slightly higher training metric than validation metric does not, by itself, prove problematic overfitting. It can also reflect mild train/validation distribution differences in a small evaluation set.
Feature design is at least as important as data volume. Manual relevance judgments are accurate but expensive; production systems usually rely on weaker but abundant implicit feedback (clicks, dwell time, reformulations). In practice, model quality is bounded by both data quality/quantity and whether the features capture the real retrieval process.
What Could Beat RRF
Based on this experiment, improving beyond RRF at pipeline level likely requires:
- Better query-aware features. Per-query signals (query length, term rarity, semantic-keyword score divergence) could adapt fusion behavior beyond fixed global weights.
- Interaction features, even in a linear model. Terms like `keyword_rank_norm × rare_term_ratio` or `semantic_score × query_length` let a linear model represent conditional behavior.
- Query-dependent weighting. A single global weight vector may match one corpus, but robust gains often require query-level adaptation.
- More judged data. About 150+ judged queries would give the model room to learn beyond baseline behavior, especially once richer features are available.
- Potentially a non-linear model. If linear features saturate, a non-linear model can capture higher-order interactions directly.
- Listwise loss. Optimizing nDCG directly (for example LambdaRank) would align training with the final metric better than pairwise hinge loss.
- Neural reranking. Instead of scoring documents independently with a linear model, a neural reranker jointly considers the query and document. Cross-encoders like BERT or monoT5 concatenate query and document into a single input and run a full transformer forward pass per candidate, capturing deep query-document interactions but at high latency - practical only as a second-stage reranker over a small candidate set. Late-interaction models like ColBERT take a different approach: they encode query and document independently into per-token embeddings, then compute fine-grained token-level similarity via MaxSim. This makes ColBERT usable both as a retriever (via ANN search over precomputed token embeddings) and as a reranker, offering a middle ground between bi-encoder speed and cross-encoder quality.
The infrastructure is in place: the LtrFusion strategy, the training CLI, the feature extraction pipeline. The linear model just needs richer signal to work with.
Embedding Model
The goal is on-device search with no cloud API dependency. That rules out hosted embedding services and means inference has to run on whatever hardware is available - typically a laptop CPU, no GPU. The model choice follows from this constraint: we need something small enough to run fast on CPU, accurate enough to produce useful semantic search results, and available in a local runtime format (typically ONNX).
qrst primarily uses ONNX Runtime for inference, loaded with the default execution provider (CPU). Most presets run without GPU acceleration, ensuring compatibility across hardware. However, some models like nomic-embed-text-v1.5 are implemented using the Candle framework, which provides Metal acceleration on macOS. For ONNX-based models, a session created via `Session::builder()?.commit_from_file(&model_path)` runs a forward pass per batch of 8 chunks. ONNX Runtime's CPU backend is well-optimized for small models: quantized attention, SIMD vectorization, and minimal memory overhead. On an M-series Mac, embedding a 6,600-chunk corpus takes about a minute with EmbeddingGemma 300M.
The system supports five model presets:
| Model | Dimensions | Max Tokens | Notes |
|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | 128 | Fastest, good baseline |
| nomic-embed-text-v1.5 | 768 | 8192 | Matryoshka embeddings, long context |
| EmbeddingGemma 300M | 768 | 2048 | Best accuracy/speed tradeoff |
| e5-base-v2 | 768 | 512 | Balanced, instruction-tuned |
| bge-base-en | 768 | 512 | Balanced, English-focused |
EmbeddingGemma 300M is the default for benchmarks and the model behind all results in this article. At 300M parameters it is small enough for real-time CPU inference but large enough to capture semantic nuance that the 33M-parameter MiniLM misses. The SciFact results (nDCG@10 = 0.762 semantic-only) confirm it performs well on domain-specific scientific text without fine-tuning.
For ONNX-based presets, model dimensions are auto-detected from ONNX metadata at load time. Each model preset also defines whether to normalize embeddings and what query/document prefixes to prepend (e.g., EmbeddingGemma uses "task: search result | query: " for queries).
The vector index uses USearch, an HNSW implementation created by Ash Vardanian. USearch is a single-file, dependency-free library for approximate nearest neighbor search that compiles to native code on every major platform. It supports multiple scalar types (F32, F16, I8) for the stored vectors, so you can trade precision for memory: F16 halves memory usage with negligible recall loss, I8 quarters it at some accuracy cost. qrst uses F32 by default but the quantization is configurable. USearch also handles index persistence - the HNSW graph is loaded from disk; while USearch supports memory-mapping via view(), qrst currently uses the load() path which reads the index into memory. For a local search engine that needs to start fast and stay light, these properties matter more than marginal recall differences between ANN libraries.
Why not GPU acceleration on Apple Silicon? ONNX Runtime has no Metal Performance Shaders execution provider. The available path is the CoreML EP, which can target the Apple GPU and Neural Engine (ANE), but for transformer models it is currently impractical. Standard transformer operations - Erf for GELU, ReduceMean for LayerNorm, LayerNormalization - are supported in current ONNX Runtime versions, and dynamic shapes are permitted. However, performance can still be slower due to graph partitioning: the model graph gets split into dozens of fragments, each boundary incurring CPU↔CoreML data transfer overhead. In practice this makes CoreML inference slower than pure CPU for models with partial operator coverage. The Rust ort crate does expose a CoreML EP, but its prebuilt binaries do not include it - you would need to compile ONNX Runtime from source.
Apple's own research on deploying transformers on the Neural Engine shows that ANE acceleration requires restructuring the model: replacing nn.Linear with nn.Conv2d, switching to channels-first layout, and chunking multi-head attention into single-head operations. With these changes, Apple demonstrated a 10x speedup on DistilBERT - but this is manual model surgery, not something you get by flipping an execution provider flag. For embedding models available as standard ONNX exports, the M-series CPU is the fastest path. Its high memory bandwidth and AMX/NEON units already deliver sub-second inference for a 300M-parameter model.
The deliberate choice here is pragmatic: we do not need a 7B-parameter model or the absolute best score on MTEB. We need a model that runs in under a second per query on CPU, fits in memory alongside the rest of the application, and produces embeddings good enough that the ranking pipeline - BM25, fusion, and chunking - can do its job. A 300M-parameter model on ONNX/CPU meets all three requirements.
Chunking
Everything described above operates on chunks, not documents. A 500-line markdown file or a Rust module with twenty functions does not get indexed as a single unit. It gets split into pieces that each fit within the embedding model's effective context, and each chunk becomes its own entry in both the keyword and vector indexes. The chunking strategy directly affects retrieval quality: chunks that are too large dilute the semantic signal, chunks that are too small lose context.
qrst uses a pluggable chunking system with three strategies, selected by file extension.
Markdown chunker. Splits on heading boundaries. When a # line appears, the accumulated content is flushed as a chunk. If a section exceeds the budget (80% of model context), it is split again when the next line would exceed the limit. Each chunk carries its heading as a title, which becomes searchable metadata. The minimum chunk size is 10 tokens, filtering out headings-only fragments.
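The core heading-split logic fits in a short function. An illustrative sketch - qrst's actual chunker also tracks section titles, the token budget, and the minimum chunk size:

```rust
/// Markdown chunking sketch: accumulate lines, flush the buffer whenever
/// a new heading line starts. Budget and title handling omitted.
fn chunk_markdown(src: &str) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    for line in src.lines() {
        // A heading starts a new chunk if anything substantive is pending.
        if line.starts_with('#') && !current.trim().is_empty() {
            chunks.push(current.trim_end().to_string());
            current.clear();
        }
        current.push_str(line);
        current.push('\n');
    }
    if !current.trim().is_empty() {
        chunks.push(current.trim_end().to_string());
    }
    chunks
}
```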
Code chunker. The code chunker implements cAST-style recursive AST merging. Most RAG pipelines inherit line-based chunking from natural-language retrieval, which breaks semantic structure: a function split at line 50 produces two chunks that are each incomplete. The cAST approach instead uses the parse tree to align chunk boundaries with syntactic units.
The algorithm works in three phases. First, tree-sitter parses the source file into an AST. Second, the chunker walks the AST top-down, maintaining a buffer of pending nodes and a token budget (80% of model context by default). At each child node, it applies three rules in order:
- Boundary check. If the child is a boundary node (function, struct, impl, class, interface, trait, enum, module), flush any pending buffer as a chunk. The boundary node then starts a new accumulation.
- Size check. If the child's token count exceeds the budget, recurse into its children with a fresh buffer. If the child is a leaf that is still too large, fall back to line-based splitting.
- Budget check. If adding the child to the buffer would exceed the budget, flush the buffer first, then add the child.
Otherwise, the child is appended to the buffer. When all children are processed, the remaining buffer is flushed.
The diagram illustrates how the algorithm walks the top-level children. The two use_declaration nodes (40t + 30t) are not boundary nodes, so they accumulate in the buffer and flush together as Chunk 1 when the boundary node struct_item is encountered. The struct_item (120t) starts a new accumulation, is itself a boundary node, and then the oversized impl_item triggers a flush-before-recurse. Inside the recursive walk, each fn is a boundary node, so each new boundary flushes the previous buffered node (fn new -> Chunk 3, fn add -> Chunk 4, etc.). This ensures that major syntactic units like functions and structs remain isolated unless they are small enough to be merged without crossing boundaries. Non-boundary nodes are merged greedily until a boundary or budget overflow forces a flush.
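The walkthrough above can be reproduced over a toy AST. This is an illustrative sketch: the real chunker walks tree-sitter nodes and slices the original source bytes, whereas here a node is just a kind, a token count, and children:

```rust
/// Toy node standing in for a tree-sitter AST node.
struct Node {
    kind: &'static str,
    tokens: usize,
    children: Vec<Node>,
}

/// Node kinds that force a flush before accumulation (subset for brevity).
const BOUNDARIES: &[&str] = &["function_item", "struct_item", "impl_item", "enum_item"];

/// Emit the buffered node kinds as one chunk, if any are pending.
fn flush(buf: &mut Vec<(&'static str, usize)>, chunks: &mut Vec<Vec<&'static str>>) {
    if !buf.is_empty() {
        chunks.push(buf.drain(..).map(|(kind, _)| kind).collect());
    }
}

fn walk(
    node: &Node,
    budget: usize,
    buf: &mut Vec<(&'static str, usize)>,
    chunks: &mut Vec<Vec<&'static str>>,
) {
    for child in &node.children {
        // Rule 1: a boundary node flushes whatever is pending.
        if BOUNDARIES.contains(&child.kind) {
            flush(buf, chunks);
        }
        // Rule 2: an oversized node is recursed into with a fresh buffer.
        if child.tokens > budget {
            flush(buf, chunks);
            if child.children.is_empty() {
                chunks.push(vec![child.kind]); // leaf: real impl line-splits here
            } else {
                walk(child, budget, buf, chunks);
                flush(buf, chunks);
            }
            continue;
        }
        // Rule 3: flush first if appending would overflow the budget.
        let used: usize = buf.iter().map(|(_, t)| t).sum();
        if used + child.tokens > budget {
            flush(buf, chunks);
        }
        buf.push((child.kind, child.tokens));
    }
}
```

Feeding this walk the diagram's shape - two use declarations (40t, 30t), a 120t struct, and an oversized impl with three functions - under a 200-token budget yields the same five chunks as the walkthrough.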
A key implementation detail: merged chunks preserve inter-node whitespace by slicing source[first.start_byte..last.end_byte] rather than concatenating extracted node texts. This means a chunk reads exactly like the original source, including blank lines between functions, which matters for both readability and keyword search.
Composite chunker. Handles multi-zone files like Astro, Vue, and Svelte components. The file is first split into zones by text boundaries: frontmatter (--- fences), <script>, <style>, and template regions. Each zone is then delegated to the appropriate sub-chunker: TypeScript for script zones, CSS for style zones, HTML for template zones. Zone labels are prefixed to chunk titles ([script] const handler, [style] .container) so search results indicate which part of the component matched.
The registry dispatches by extension and defaults to covering .md, .rs, .js, .jsx, .ts, .tsx, .html, .css, .astro, .vue, and .svelte. Files without a matching chunker are skipped.
The indexer walks the directory tree (respecting .gitignore), dispatches each file to its chunker, and feeds the resulting chunks through the embedding model in batches of 8. Each chunk is stored with its content, file path, title, source line range, and embedding vector. Incremental updates use blake3 content hashing to detect changed files: unchanged files are skipped, changed files have their old chunks deleted and new chunks inserted.
The token bounds (defaulting to 80% of model context for maximum and 10 for minimum) are configurable in config.toml but the defaults work well in practice. For prose-heavy content, 80% of context maps to roughly 2000–2500 characters; for code and mixed syntax, the character count is lower because punctuation, operators, and camelCase identifiers each consume separate tokens. Either way, the result fits perfectly within the embedding model's context window (typically 512 or 2048 tokens) while providing enough context for meaningful semantic similarity. The 10-token minimum filters out empty sections and standalone headings without discarding short but relevant code snippets.
Implementation
The full implementation is in qrst:
- Core: `LtrFusion` in `crates/qrst/src/storage/fusion.rs`, 7-feature extraction, dot product scoring, `LtrWeights` with TOML serialization
- Training: `train-ltr` command in `crates/qrst-bench/src/commands/train_ltr.rs`, pairwise SGD, LOO-CV, non-negative constraints
- Config: `fusion_strategy = "ltr"` in config.toml, weights auto-loaded from `{data_dir}/ltr_weights.toml`
- Results: `bench/results/ltr-fusion-results.md`
Inference cost for the linear model is negligible - just 8 multiply-adds per candidate document, completing in well under a microsecond per item on modern hardware. The weights are 8 floats in a human-readable TOML file. No new dependencies beyond what qrst already uses (serde + toml).
External Validation on BEIR/SciFact
The panzerotti corpus (my private documentation dataset) is one where both BM25 and semantic search contribute meaningfully (BM25 nDCG@5 = 0.431). To test whether the findings generalize, I ran the same fusion strategies on SciFact, a public benchmark of 300 scientific claim–evidence pairs, using EmbeddingGemma 300M.
| Strategy | nDCG@10 | nDCG@5 | P@3 | MRR | 95% CI nDCG@10 |
|---|---|---|---|---|---|
| BM25 only | 0.047 | 0.047 | 0.017 | 0.050 | n/a |
| Semantic only | 0.762 | 0.744 | 0.282 | 0.723 | n/a |
| RRF (k=60) | 0.761 | 0.745 | 0.282 | 0.726 | [0.724, 0.797] |
| Convex (α=0.95) | 0.762 | 0.745 | 0.282 | 0.726 | [0.724, 0.799] |
| LTR (panzerotti weights) | 0.765 | 0.748 | 0.283 | 0.730 | [0.726, 0.801] |
BM25 is near-zero on SciFact. Beyond the specialized vocabulary, there is a structural confound: qrst indexes at the document level (the full abstract), whereas BEIR baselines often use passage-level indexing. On SciFact's short, dense abstracts, this mismatch significantly penalizes keyword-based retrieval.
When one retriever is broken, fusion approximately collapses to the working retriever. RRF and convex both produce near-identical results to semantic-only because BM25 contributes mostly noise. In expectation, RRF with one random-quality list adds uncorrelated perturbations to the signal-carrying list; with enough documents, the semantic ranking dominates, though individual queries can still see minor rank swaps. The per-system bootstrap confidence intervals overlap heavily - [0.724, 0.797] vs [0.724, 0.799] vs [0.726, 0.801] - which is suggestive but not a formal significance test. A paired bootstrap or permutation test would be needed to make a rigorous claim; the point estimates and CI overlap are consistent with no meaningful difference.
The LTR model, trained on panzerotti weights, transfers to SciFact without degradation but also without improvement (0.765 nDCG@10, within every other strategy's CI). The panzerotti-trained linear model has converged to RRF-equivalent behavior, and that equivalence holds even on an out-of-domain corpus.
This is the complementary case to the panzerotti experiment. On panzerotti, both retrievers contribute and LTR converges to RRF. On SciFact, only one retriever contributes and all fusion strategies converge to semantic-only. Neither case gives LTR room to differentiate. The bottleneck is retriever quality, not the combination method.
Conclusion
This experiment set out to beat RRF with a learned model and ended up explaining why RRF works so well. A 7-feature linear model trained on 42 queries converged to a reweighted version of the very baseline it was supposed to surpass. Importantly, this is not because linear models are inherently weak; with strongly correlated features, the model had little room to learn distinct behavior.
The negative keyword rank weight that looked like a genuine insight in reranking evaluation turned out to be a pipeline-breaking artifact: optimizing for reranking quality is not the same as optimizing for end-to-end retrieval.
The key takeaway is not that learning-to-rank fails for hybrid search. It is that a linear model on a small corpus with limited training signal cannot find structure beyond what a well-chosen heuristic already captures. RRF's 1/(k + rank) formula encodes a reasonable prior: every retrieval channel contributes, no document is penalized for appearing in only one list, and rank is compressed through a monotonic transform. Replicating those properties is easy. Exceeding them requires richer features, more training data, or a model class that can represent interactions.
The more useful outcome is methodological. Reranking metrics and pipeline metrics measure different things. A reranker operates on a fixed candidate set; a fusion strategy also determines which candidates survive. Any learned ranker that can suppress documents below a retrieval cutoff will show a gap between these two evaluations. Measuring end-to-end from the start would have caught the negative-weight problem before it looked like an improvement.
For practitioners building hybrid search: start with RRF. It is fast, parameter-light, and robust. If you have enough labeled data and the right features to justify a learned model, evaluate it end-to-end against RRF before shipping. The bar is higher than reranking metrics suggest.
Appendix
Acronyms
| Acronym | Meaning |
|---|---|
| AMX | Apple Matrix Co-processor (hardware accelerator in M-series chips) |
| ANE | Apple Neural Engine (on-chip neural network accelerator) |
| ANN | Approximate Nearest Neighbor |
| AST | Abstract Syntax Tree |
| BEIR | Benchmarking IR (standardized information retrieval benchmark suite) |
| BM25 | Best Match 25 (probabilistic term-scoring function) |
| cAST | Code AST (recursive AST-based chunking strategy) |
| CI | Confidence Interval |
| CoreML | Apple's machine learning framework for on-device inference |
| EP | Execution Provider (ONNX Runtime backend for hardware acceleration) |
| HNSW | Hierarchical Navigable Small World (graph-based ANN algorithm) |
| IDF | Inverse Document Frequency |
| LOO-CV | Leave-One-Out Cross-Validation |
| LTR | Learning to Rank |
| MPS | Metal Performance Shaders (Apple GPU compute framework) |
| MRR | Mean Reciprocal Rank |
| MTEB | Massive Text Embedding Benchmark |
| nDCG | Normalized Discounted Cumulative Gain |
| ONNX | Open Neural Network Exchange (portable model format) |
| P@k | Precision at rank k |
| RAG | Retrieval-Augmented Generation |
| RRF | Reciprocal Rank Fusion |
| SGD | Stochastic Gradient Descent |
| SIMD | Single Instruction, Multiple Data |
| TF | Term Frequency |
| TOML | Tom's Obvious Minimal Language (configuration format) |
References
| Title | Summary | Link |
|---|---|---|
| A General Theory of Relevance (BM25 Review) | Foundational review of the BM25 scoring function covering TF saturation and document length normalization | |
| Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods | Introduces RRF as a simple, parameter-light method for combining multiple ranked lists that outperforms more complex approaches | ACM |
| Pairwise Hinge Loss (MSR-TR-2010-82) | Describes the pairwise hinge loss objective for training ranking models by pushing relevant documents above irrelevant ones by a margin | |
| Cumulative Gain-Based Evaluation of IR Techniques (nDCG) | Introduces nDCG as a graded relevance metric with position-based discounting for evaluating ranked retrieval | ACM |
| LambdaRank: Learning to Rank with Nonsmooth Cost Functions | Proposes listwise ranking optimization that directly optimizes nDCG through lambda gradients | |
| cAST: Code AST-Based Chunking for Retrieval | Presents recursive AST merging for code chunking, aligning chunk boundaries with syntactic units instead of line counts | arXiv |
| Deploying Transformers on the Apple Neural Engine | Apple's guide to restructuring transformer architectures for ANE acceleration, achieving 10x speedup on DistilBERT | Apple ML |