<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>vectorian</title>
    <link>https://www.vectorian.be</link>
    <description>Engineering for the age of inference</description>
    <language>en</language>
    <managingEditor>Istvan</managingEditor>
    <atom:link href="https://www.vectorian.be/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Thu, 05 Mar 2026 12:00:00 +0100</pubDate>
    <lastBuildDate>Tue, 17 Mar 2026 10:27:06 +0000</lastBuildDate>
    <item>
      <title>All I wanted was a simple code search, I ended up in a ranking theory rabbit hole</title>
      <link>https://www.vectorian.be/articles/2026-03-05/all-i-wanted-was-a-simple-code-search/</link>
      <guid isPermaLink="true">https://www.vectorian.be/articles/2026-03-05/all-i-wanted-was-a-simple-code-search/</guid>
      <description>&lt;p&gt;I built an on-device hybrid search engine that combines &lt;a href=&quot;https://www.staff.city.ac.uk/~sbrp622/papers/foundations_bm25_review.pdf&quot;&gt;BM25&lt;/a&gt; and vector retrieval with &lt;a href=&quot;https://dl.acm.org/doi/10.1145/1571941.1572114&quot;&gt;Reciprocal Rank Fusion&lt;/a&gt;. Reranking metrics suggested a learned linear fusion model would outperform RRF, but end-to-end evaluation showed otherwise. This article explains why the model matched baseline behavior and what to improve next.&lt;/p&gt;</description>
      <content:encoded><![CDATA[<h2 id="context"><a href="#context">Context</a></h2>
<p>I built an on-device hybrid search engine that combines <a href="https://www.staff.city.ac.uk/~sbrp622/papers/foundations_bm25_review.pdf">BM25</a> and vector retrieval with <a href="https://dl.acm.org/doi/10.1145/1571941.1572114">Reciprocal Rank Fusion</a>. Reranking metrics suggested a learned linear fusion model would outperform RRF, but end-to-end evaluation showed otherwise. This article explains why the model matched baseline behavior and what to improve next.</p>
<h2 id="key-takeaways"><a href="#key-takeaways">Key Takeaways</a></h2>
<ul>
<li>A learned linear fusion model trained on 42 queries did not outperform Reciprocal Rank Fusion (RRF). It converged to a reweighted version of RRF rather than learning a distinct ranking strategy</li>
<li>Reranking metrics overstated fusion model gains by +0.054 nDCG. These gains disappeared in end-to-end pipeline evaluation, which means reranker-level metrics alone are unreliable for hybrid search tuning</li>
<li>For practitioners combining BM25 keyword search with vector retrieval: start with RRF. It is fast, has minimal parameters, and is difficult to beat without richer features, more training data, or cross-attention rerankers</li>
</ul>
<h2 id="how-this-started"><a href="#how-this-started">How This Started</a></h2>
<p>Before diving in, it’s worth noting that there are many excellent projects in this space - most notably <a href="https://github.com/tobi/qmd">qmd</a> (big thanks to Tobi for the inspiration). While I could have used an existing tool, I wanted to build this myself as a way to dive deeper into Rust and the mechanics of modern search.</p>
<p>All I wanted was a simple semantic search over my own code and documentation. Something local, no cloud APIs, just &quot;find the file that explains how the session store works&quot; without remembering the exact filename or grep pattern. I started with a basic vector search prototype, realized keyword search still catches things embeddings miss, bolted on BM25, and needed a way to combine the two result lists. That led me to RRF, which led me to wonder whether a learned model could do better, which led me into the information retrieval literature: BM25's probabilistic foundations, rank fusion theory, pairwise learning-to-rank, evaluation metrics. One rabbit hole later, I had read the papers, implemented the algorithms, built a benchmark harness, and written <a href="https://github.com/l1x/qrst">qrst</a> - a full hybrid search engine with BM25, vector retrieval, multiple fusion strategies, and a learning-to-rank training pipeline. This article is what I learned along the way.</p>
<h2 id="the-idea"><a href="#the-idea">The Idea</a></h2>
<p>Hybrid search fusion is a ranking problem. You have two scored lists, one from semantic search and one from keyword search, and you need to combine them.</p>
<p>Before fusion, there are the base retrievers. For keyword search, BM25 is the standard. It’s a probabilistic model that scores documents based on term frequency (<code>tf</code>) and inverse document frequency (<code>idf</code>), but with two important safeguards: <code>k1</code> controls term frequency saturation (preventing a document with 100 mentions of 'rust' from infinitely outscoring one with 10), and <code>b</code> handles document length normalization. These parameters define the 'shape' of keyword relevance.</p>
<svg id="bm25-chart" viewBox="0 0 720 460" xmlns="http://www.w3.org/2000/svg" style="width:100%;max-width:720px;">
  <style>
    #bm25-chart .axis { stroke: var(--c-brd); stroke-width: 1; }
    #bm25-chart .grid { stroke: var(--c-brd); stroke-width: 0.5; opacity: 0.3; }
    #bm25-chart .tick-label { font: 9px "JetBrains Mono", monospace; fill: var(--c-muted); }
    #bm25-chart .axis-label { font: 11px "Source Serif 4", serif; fill: var(--c-sec); }
    #bm25-chart .chart-title { font: 500 14px "Playfair Display", serif; fill: var(--c-pri); }
    #bm25-chart .plot-title { font: 500 12px "Playfair Display", serif; fill: var(--c-pri); }
    #bm25-chart .formula { font: 11px "JetBrains Mono", monospace; fill: var(--c-sec); }
    #bm25-chart .legend-label { font: 9px "JetBrains Mono", monospace; fill: var(--c-muted); }
    #bm25-chart .s1 { stroke: var(--c-acc); fill: none; stroke-width: 2; }
    #bm25-chart .s2 { stroke: var(--c-acc2); fill: none; stroke-width: 2; stroke-dasharray: 8 4; }
    #bm25-chart .s3 { stroke: var(--c-muted); fill: none; stroke-width: 2; stroke-dasharray: 3 3; }
  </style>
  <text x="360" y="28" text-anchor="middle" class="chart-title">BM25 Scoring Components</text>
  <text x="211.25" y="40" text-anchor="middle" class="plot-title">TF Saturation</text>
  <line x1="60" y1="360.0" x2="362.5" y2="360.0" class="grid"/>
  <line x1="60" y1="298.0" x2="362.5" y2="298.0" class="grid"/>
  <line x1="60" y1="236.0" x2="362.5" y2="236.0" class="grid"/>
  <line x1="60" y1="174.0" x2="362.5" y2="174.0" class="grid"/>
  <line x1="60" y1="112.0" x2="362.5" y2="112.0" class="grid"/>
  <line x1="60" y1="50.0" x2="362.5" y2="50.0" class="grid"/>
  <line x1="60.0" y1="50" x2="60.0" y2="360" class="grid"/>
  <line x1="120.5" y1="50" x2="120.5" y2="360" class="grid"/>
  <line x1="181.0" y1="50" x2="181.0" y2="360" class="grid"/>
  <line x1="241.5" y1="50" x2="241.5" y2="360" class="grid"/>
  <line x1="302.0" y1="50" x2="302.0" y2="360" class="grid"/>
  <line x1="362.5" y1="50" x2="362.5" y2="360" class="grid"/>
  <line x1="60" y1="360" x2="362.5" y2="360" class="axis"/>
  <line x1="60" y1="50" x2="60" y2="360" class="axis"/>
  <text x="60.0" y="376" text-anchor="middle" class="tick-label">0</text>
  <text x="120.5" y="376" text-anchor="middle" class="tick-label">50</text>
  <text x="181.0" y="376" text-anchor="middle" class="tick-label">100</text>
  <text x="241.5" y="376" text-anchor="middle" class="tick-label">150</text>
  <text x="302.0" y="376" text-anchor="middle" class="tick-label">200</text>
  <text x="362.5" y="376" text-anchor="middle" class="tick-label">250</text>
  <text x="54" y="363.0" text-anchor="end" class="tick-label">0.0</text>
  <text x="54" y="301.0" text-anchor="end" class="tick-label">0.2</text>
  <text x="54" y="239.0" text-anchor="end" class="tick-label">0.4</text>
  <text x="54" y="177.0" text-anchor="end" class="tick-label">0.6</text>
  <text x="54" y="115.0" text-anchor="end" class="tick-label">0.8</text>
  <text x="54" y="53.0" text-anchor="end" class="tick-label">1.0</text>
  <text x="211.25" y="395" text-anchor="middle" class="axis-label">Term Frequency (tf)</text>
  <polyline points="60.0,360.0 61.2,331.8 62.4,308.3 63.6,288.5 64.8,271.4 66.0,256.7 67.3,243.8 68.5,232.4 69.7,222.2 70.9,213.2 72.1,205.0 73.3,197.6 74.5,190.9 75.7,184.8 76.9,179.2 78.2,174.0 79.4,169.2 80.6,164.8 81.8,160.7 83.0,156.9 84.2,153.3 85.4,150.0 86.6,146.9 87.8,143.9 89.0,141.2 90.3,138.6 91.5,136.1 92.7,133.8 93.9,131.6 95.1,129.5 96.3,127.5 97.5,125.6 98.7,123.8 99.9,122.1 101.1,120.5 102.3,118.9 103.6,117.4 104.8,116.0 106.0,114.6 107.2,113.3 108.4,112.0 109.6,110.8 110.8,109.6 112.0,108.5 113.2,107.4 114.4,106.4 115.7,105.4 116.9,104.4 118.1,103.4 119.3,102.5 120.5,101.7 121.7,100.8 122.9,100.0 124.1,99.2 125.3,98.4 126.5,97.7 127.8,97.0 129.0,96.3 130.2,95.6 131.4,94.9 132.6,94.3 133.8,93.7 135.0,93.1 136.2,92.5 137.4,91.9 138.7,91.3 139.9,90.8 141.1,90.3 142.3,89.7 143.5,89.2 144.7,88.8 145.9,88.3 147.1,87.8 148.3,87.3 149.5,86.9 150.8,86.5 152.0,86.0 153.2,85.6 154.4,85.2 155.6,84.8 156.8,84.4 158.0,84.1 159.2,83.7 160.4,83.3 161.6,83.0 162.9,82.6 164.1,82.3 165.3,82.0 166.5,81.6 167.7,81.3 168.9,81.0 170.1,80.7 171.3,80.4 172.5,80.1 173.7,79.8 174.9,79.5 176.2,79.2 177.4,79.0 178.6,78.7 179.8,78.4 181.0,78.2 182.2,77.9 183.4,77.7 184.6,77.4 185.8,77.2 187.1,77.0 188.3,76.7 189.5,76.5 190.7,76.3 191.9,76.1 193.1,75.8 194.3,75.6 195.5,75.4 196.7,75.2 197.9,75.0 199.2,74.8 200.4,74.6 201.6,74.4 202.8,74.2 204.0,74.0 205.2,73.8 206.4,73.7 207.6,73.5 208.8,73.3 210.0,73.1 211.3,73.0 212.5,72.8 213.7,72.6 214.9,72.5 216.1,72.3 217.3,72.1 218.5,72.0 219.7,71.8 220.9,71.7 222.1,71.5 223.4,71.4 224.6,71.2 225.8,71.1 227.0,70.9 228.2,70.8 229.4,70.7 230.6,70.5 231.8,70.4 233.0,70.3 234.2,70.1 235.4,70.0 236.7,69.9 237.9,69.7 239.1,69.6 240.3,69.5 241.5,69.4 242.7,69.3 243.9,69.1 245.1,69.0 246.3,68.9 247.6,68.8 248.8,68.7 250.0,68.6 251.2,68.5 252.4,68.3 253.6,68.2 254.8,68.1 256.0,68.0 257.2,67.9 258.4,67.8 259.6,67.7 260.9,67.6 262.1,67.5 263.3,67.4 264.5,67.3 265.7,67.2 266.9,67.1 268.1,67.0 269.3,66.9 270.5,66.8 271.8,66.8 273.0,66.7 274.2,66.6 
275.4,66.5 276.6,66.4 277.8,66.3 279.0,66.2 280.2,66.1 281.4,66.1 282.6,66.0 283.9,65.9 285.1,65.8 286.3,65.7 287.5,65.7 288.7,65.6 289.9,65.5 291.1,65.4 292.3,65.3 293.5,65.3 294.7,65.2 296.0,65.1 297.2,65.0 298.4,65.0 299.6,64.9 300.8,64.8 302.0,64.8 303.2,64.7 304.4,64.6 305.6,64.6 306.8,64.5 308.0,64.4 309.3,64.4 310.5,64.3 311.7,64.2 312.9,64.2 314.1,64.1 315.3,64.0 316.5,64.0 317.7,63.9 318.9,63.8 320.1,63.8 321.4,63.7 322.6,63.7 323.8,63.6 325.0,63.5 326.2,63.5 327.4,63.4 328.6,63.4 329.8,63.3 331.0,63.2 332.3,63.2 333.5,63.1 334.7,63.1 335.9,63.0 337.1,63.0 338.3,62.9 339.5,62.9 340.7,62.8 341.9,62.8 343.1,62.7 344.3,62.7 345.6,62.6 346.8,62.6 348.0,62.5 349.2,62.4 350.4,62.4 351.6,62.4 352.8,62.3 354.0,62.3 355.2,62.2 356.4,62.2 357.7,62.1 358.9,62.1 360.1,62.0 361.3,62.0 362.5,61.9" class="s1"/>
  <polyline points="60.0,360.0 61.2,353.9 62.4,348.1 63.6,342.5 64.8,337.0 66.0,331.8 67.3,326.8 68.5,321.9 69.7,317.2 70.9,312.7 72.1,308.3 73.3,304.1 74.5,300.0 75.7,296.0 76.9,292.2 78.2,288.5 79.4,284.8 80.6,281.3 81.8,277.9 83.0,274.6 84.2,271.4 85.4,268.3 86.6,265.3 87.8,262.3 89.0,259.5 90.3,256.7 91.5,253.9 92.7,251.3 93.9,248.7 95.1,246.2 96.3,243.8 97.5,241.4 98.7,239.0 99.9,236.7 101.1,234.5 102.3,232.4 103.6,230.2 104.8,228.2 106.0,226.1 107.2,224.2 108.4,222.2 109.6,220.3 110.8,218.5 112.0,216.7 113.2,214.9 114.4,213.2 115.7,211.5 116.9,209.8 118.1,208.2 119.3,206.6 120.5,205.0 121.7,203.5 122.9,202.0 124.1,200.5 125.3,199.0 126.5,197.6 127.8,196.2 129.0,194.9 130.2,193.5 131.4,192.2 132.6,190.9 133.8,189.6 135.0,188.4 136.2,187.2 137.4,186.0 138.7,184.8 139.9,183.6 141.1,182.5 142.3,181.4 143.5,180.3 144.7,179.2 145.9,178.1 147.1,177.0 148.3,176.0 149.5,175.0 150.8,174.0 152.0,173.0 153.2,172.0 154.4,171.1 155.6,170.2 156.8,169.2 158.0,168.3 159.2,167.4 160.4,166.5 161.6,165.7 162.9,164.8 164.1,164.0 165.3,163.1 166.5,162.3 167.7,161.5 168.9,160.7 170.1,159.9 171.3,159.2 172.5,158.4 173.7,157.6 174.9,156.9 176.2,156.2 177.4,155.4 178.6,154.7 179.8,154.0 181.0,153.3 182.2,152.6 183.4,152.0 184.6,151.3 185.8,150.6 187.1,150.0 188.3,149.4 189.5,148.7 190.7,148.1 191.9,147.5 193.1,146.9 194.3,146.3 195.5,145.7 196.7,145.1 197.9,144.5 199.2,143.9 200.4,143.4 201.6,142.8 202.8,142.3 204.0,141.7 205.2,141.2 206.4,140.6 207.6,140.1 208.8,139.6 210.0,139.1 211.3,138.6 212.5,138.1 213.7,137.6 214.9,137.1 216.1,136.6 217.3,136.1 218.5,135.6 219.7,135.2 220.9,134.7 222.1,134.2 223.4,133.8 224.6,133.3 225.8,132.9 227.0,132.4 228.2,132.0 229.4,131.6 230.6,131.2 231.8,130.7 233.0,130.3 234.2,129.9 235.4,129.5 236.7,129.1 237.9,128.7 239.1,128.3 240.3,127.9 241.5,127.5 242.7,127.1 243.9,126.7 245.1,126.4 246.3,126.0 247.6,125.6 248.8,125.2 250.0,124.9 251.2,124.5 252.4,124.2 253.6,123.8 254.8,123.5 256.0,123.1 257.2,122.8 258.4,122.4 259.6,122.1 260.9,121.8 
262.1,121.4 263.3,121.1 264.5,120.8 265.7,120.5 266.9,120.1 268.1,119.8 269.3,119.5 270.5,119.2 271.8,118.9 273.0,118.6 274.2,118.3 275.4,118.0 276.6,117.7 277.8,117.4 279.0,117.1 280.2,116.8 281.4,116.5 282.6,116.2 283.9,116.0 285.1,115.7 286.3,115.4 287.5,115.1 288.7,114.9 289.9,114.6 291.1,114.3 292.3,114.0 293.5,113.8 294.7,113.5 296.0,113.3 297.2,113.0 298.4,112.8 299.6,112.5 300.8,112.2 302.0,112.0 303.2,111.8 304.4,111.5 305.6,111.3 306.8,111.0 308.0,110.8 309.3,110.5 310.5,110.3 311.7,110.1 312.9,109.8 314.1,109.6 315.3,109.4 316.5,109.2 317.7,108.9 318.9,108.7 320.1,108.5 321.4,108.3 322.6,108.1 323.8,107.8 325.0,107.6 326.2,107.4 327.4,107.2 328.6,107.0 329.8,106.8 331.0,106.6 332.3,106.4 333.5,106.2 334.7,106.0 335.9,105.8 337.1,105.6 338.3,105.4 339.5,105.2 340.7,105.0 341.9,104.8 343.1,104.6 344.3,104.4 345.6,104.2 346.8,104.0 348.0,103.8 349.2,103.6 350.4,103.4 351.6,103.3 352.8,103.1 354.0,102.9 355.2,102.7 356.4,102.5 357.7,102.4 358.9,102.2 360.1,102.0 361.3,101.8 362.5,101.7" class="s2"/>
  <polyline points="60.0,360.0 61.2,356.9 62.4,353.9 63.6,351.0 64.8,348.1 66.0,345.2 67.3,342.5 68.5,339.7 69.7,337.0 70.9,334.4 72.1,331.8 73.3,329.3 74.5,326.8 75.7,324.3 76.9,321.9 78.2,319.6 79.4,317.2 80.6,315.0 81.8,312.7 83.0,310.5 84.2,308.3 85.4,306.2 86.6,304.1 87.8,302.0 89.0,300.0 90.3,298.0 91.5,296.0 92.7,294.1 93.9,292.2 95.1,290.3 96.3,288.5 97.5,286.6 98.7,284.8 99.9,283.1 101.1,281.3 102.3,279.6 103.6,277.9 104.8,276.3 106.0,274.6 107.2,273.0 108.4,271.4 109.6,269.9 110.8,268.3 112.0,266.8 113.2,265.3 114.4,263.8 115.7,262.3 116.9,260.9 118.1,259.5 119.3,258.1 120.5,256.7 121.7,255.3 122.9,253.9 124.1,252.6 125.3,251.3 126.5,250.0 127.8,248.7 129.0,247.5 130.2,246.2 131.4,245.0 132.6,243.8 133.8,242.5 135.0,241.4 136.2,240.2 137.4,239.0 138.7,237.9 139.9,236.7 141.1,235.6 142.3,234.5 143.5,233.4 144.7,232.4 145.9,231.3 147.1,230.2 148.3,229.2 149.5,228.2 150.8,227.1 152.0,226.1 153.2,225.1 154.4,224.2 155.6,223.2 156.8,222.2 158.0,221.3 159.2,220.3 160.4,219.4 161.6,218.5 162.9,217.6 164.1,216.7 165.3,215.8 166.5,214.9 167.7,214.0 168.9,213.2 170.1,212.3 171.3,211.5 172.5,210.6 173.7,209.8 174.9,209.0 176.2,208.2 177.4,207.4 178.6,206.6 179.8,205.8 181.0,205.0 182.2,204.2 183.4,203.5 184.6,202.7 185.8,202.0 187.1,201.2 188.3,200.5 189.5,199.8 190.7,199.0 191.9,198.3 193.1,197.6 194.3,196.9 195.5,196.2 196.7,195.5 197.9,194.9 199.2,194.2 200.4,193.5 201.6,192.9 202.8,192.2 204.0,191.6 205.2,190.9 206.4,190.3 207.6,189.6 208.8,189.0 210.0,188.4 211.3,187.8 212.5,187.2 213.7,186.6 214.9,186.0 216.1,185.4 217.3,184.8 218.5,184.2 219.7,183.6 220.9,183.0 222.1,182.5 223.4,181.9 224.6,181.4 225.8,180.8 227.0,180.3 228.2,179.7 229.4,179.2 230.6,178.6 231.8,178.1 233.0,177.6 234.2,177.0 235.4,176.5 236.7,176.0 237.9,175.5 239.1,175.0 240.3,174.5 241.5,174.0 242.7,173.5 243.9,173.0 245.1,172.5 246.3,172.0 247.6,171.6 248.8,171.1 250.0,170.6 251.2,170.2 252.4,169.7 253.6,169.2 254.8,168.8 256.0,168.3 257.2,167.9 258.4,167.4 259.6,167.0 260.9,166.5 
262.1,166.1 263.3,165.7 264.5,165.2 265.7,164.8 266.9,164.4 268.1,164.0 269.3,163.6 270.5,163.1 271.8,162.7 273.0,162.3 274.2,161.9 275.4,161.5 276.6,161.1 277.8,160.7 279.0,160.3 280.2,159.9 281.4,159.5 282.6,159.2 283.9,158.8 285.1,158.4 286.3,158.0 287.5,157.6 288.7,157.3 289.9,156.9 291.1,156.5 292.3,156.2 293.5,155.8 294.7,155.4 296.0,155.1 297.2,154.7 298.4,154.4 299.6,154.0 300.8,153.7 302.0,153.3 303.2,153.0 304.4,152.6 305.6,152.3 306.8,152.0 308.0,151.6 309.3,151.3 310.5,151.0 311.7,150.6 312.9,150.3 314.1,150.0 315.3,149.7 316.5,149.4 317.7,149.0 318.9,148.7 320.1,148.4 321.4,148.1 322.6,147.8 323.8,147.5 325.0,147.2 326.2,146.9 327.4,146.6 328.6,146.3 329.8,146.0 331.0,145.7 332.3,145.4 333.5,145.1 334.7,144.8 335.9,144.5 337.1,144.2 338.3,143.9 339.5,143.7 340.7,143.4 341.9,143.1 343.1,142.8 344.3,142.5 345.6,142.3 346.8,142.0 348.0,141.7 349.2,141.4 350.4,141.2 351.6,140.9 352.8,140.6 354.0,140.4 355.2,140.1 356.4,139.9 357.7,139.6 358.9,139.3 360.1,139.1 361.3,138.8 362.5,138.6" class="s3"/>
  <line x1="70" y1="70" x2="90" y2="70" class="s1"/>
  <text x="95" y="73" class="legend-label">k₁ = 10</text>
  <line x1="70" y1="86" x2="90" y2="86" class="s2"/>
  <text x="95" y="89" class="legend-label">k₁ = 50</text>
  <line x1="70" y1="102" x2="90" y2="102" class="s3"/>
  <text x="95" y="105" class="legend-label">k₁ = 100</text>
  <text x="553.75" y="40" text-anchor="middle" class="plot-title">Document Length Normalization</text>
  <line x1="402.5" y1="360.0" x2="705" y2="360.0" class="grid"/>
  <line x1="402.5" y1="298.0" x2="705" y2="298.0" class="grid"/>
  <line x1="402.5" y1="236.0" x2="705" y2="236.0" class="grid"/>
  <line x1="402.5" y1="174.0" x2="705" y2="174.0" class="grid"/>
  <line x1="402.5" y1="112.0" x2="705" y2="112.0" class="grid"/>
  <line x1="402.5" y1="50.0" x2="705" y2="50.0" class="grid"/>
  <line x1="421.4" y1="50" x2="421.4" y2="360" class="grid"/>
  <line x1="452.9" y1="50" x2="452.9" y2="360" class="grid"/>
  <line x1="515.9" y1="50" x2="515.9" y2="360" class="grid"/>
  <line x1="579.0" y1="50" x2="579.0" y2="360" class="grid"/>
  <line x1="642.0" y1="50" x2="642.0" y2="360" class="grid"/>
  <line x1="705.0" y1="50" x2="705.0" y2="360" class="grid"/>
  <line x1="402.5" y1="360" x2="705" y2="360" class="axis"/>
  <line x1="402.5" y1="50" x2="402.5" y2="360" class="axis"/>
  <text x="421.4" y="376" text-anchor="middle" class="tick-label">0.5</text>
  <text x="452.9" y="376" text-anchor="middle" class="tick-label">1.0</text>
  <text x="515.9" y="376" text-anchor="middle" class="tick-label">2.0</text>
  <text x="579.0" y="376" text-anchor="middle" class="tick-label">3.0</text>
  <text x="642.0" y="376" text-anchor="middle" class="tick-label">4.0</text>
  <text x="705.0" y="376" text-anchor="middle" class="tick-label">5.0</text>
  <text x="396.5" y="363.0" text-anchor="end" class="tick-label">0.0</text>
  <text x="396.5" y="301.0" text-anchor="end" class="tick-label">0.5</text>
  <text x="396.5" y="239.0" text-anchor="end" class="tick-label">1.0</text>
  <text x="396.5" y="177.0" text-anchor="end" class="tick-label">1.5</text>
  <text x="396.5" y="115.0" text-anchor="end" class="tick-label">2.0</text>
  <text x="396.5" y="53.0" text-anchor="end" class="tick-label">2.5</text>
  <text x="553.75" y="395" text-anchor="middle" class="axis-label">doc_length / avg_length</text>
  <polyline points="402.5,236.0 404.0,236.0 405.5,236.0 407.0,236.0 408.6,236.0 410.1,236.0 411.6,236.0 413.1,236.0 414.6,236.0 416.1,236.0 417.6,236.0 419.1,236.0 420.6,236.0 422.2,236.0 423.7,236.0 425.2,236.0 426.7,236.0 428.2,236.0 429.7,236.0 431.2,236.0 432.8,236.0 434.3,236.0 435.8,236.0 437.3,236.0 438.8,236.0 440.3,236.0 441.8,236.0 443.3,236.0 444.9,236.0 446.4,236.0 447.9,236.0 449.4,236.0 450.9,236.0 452.4,236.0 453.9,236.0 455.4,236.0 456.9,236.0 458.5,236.0 460.0,236.0 461.5,236.0 463.0,236.0 464.5,236.0 466.0,236.0 467.5,236.0 469.1,236.0 470.6,236.0 472.1,236.0 473.6,236.0 475.1,236.0 476.6,236.0 478.1,236.0 479.6,236.0 481.1,236.0 482.7,236.0 484.2,236.0 485.7,236.0 487.2,236.0 488.7,236.0 490.2,236.0 491.7,236.0 493.3,236.0 494.8,236.0 496.3,236.0 497.8,236.0 499.3,236.0 500.8,236.0 502.3,236.0 503.8,236.0 505.4,236.0 506.9,236.0 508.4,236.0 509.9,236.0 511.4,236.0 512.9,236.0 514.4,236.0 515.9,236.0 517.5,236.0 519.0,236.0 520.5,236.0 522.0,236.0 523.5,236.0 525.0,236.0 526.5,236.0 528.0,236.0 529.5,236.0 531.1,236.0 532.6,236.0 534.1,236.0 535.6,236.0 537.1,236.0 538.6,236.0 540.1,236.0 541.7,236.0 543.2,236.0 544.7,236.0 546.2,236.0 547.7,236.0 549.2,236.0 550.7,236.0 552.2,236.0 553.8,236.0 555.3,236.0 556.8,236.0 558.3,236.0 559.8,236.0 561.3,236.0 562.8,236.0 564.3,236.0 565.9,236.0 567.4,236.0 568.9,236.0 570.4,236.0 571.9,236.0 573.4,236.0 574.9,236.0 576.4,236.0 578.0,236.0 579.5,236.0 581.0,236.0 582.5,236.0 584.0,236.0 585.5,236.0 587.0,236.0 588.5,236.0 590.0,236.0 591.6,236.0 593.1,236.0 594.6,236.0 596.1,236.0 597.6,236.0 599.1,236.0 600.6,236.0 602.1,236.0 603.7,236.0 605.2,236.0 606.7,236.0 608.2,236.0 609.7,236.0 611.2,236.0 612.7,236.0 614.3,236.0 615.8,236.0 617.3,236.0 618.8,236.0 620.3,236.0 621.8,236.0 623.3,236.0 624.8,236.0 626.4,236.0 627.9,236.0 629.4,236.0 630.9,236.0 632.4,236.0 633.9,236.0 635.4,236.0 636.9,236.0 638.5,236.0 640.0,236.0 641.5,236.0 643.0,236.0 644.5,236.0 646.0,236.0 647.5,236.0 649.0,236.0 650.5,236.0 
652.1,236.0 653.6,236.0 655.1,236.0 656.6,236.0 658.1,236.0 659.6,236.0 661.1,236.0 662.7,236.0 664.2,236.0 665.7,236.0 667.2,236.0 668.7,236.0 670.2,236.0 671.7,236.0 673.2,236.0 674.8,236.0 676.3,236.0 677.8,236.0 679.3,236.0 680.8,236.0 682.3,236.0 683.8,236.0 685.3,236.0 686.8,236.0 688.4,236.0 689.9,236.0 691.4,236.0 692.9,236.0 694.4,236.0 695.9,236.0 697.4,236.0 699.0,236.0 700.5,236.0 702.0,236.0 703.5,236.0 705.0,236.0" class="s3"/>
  <polyline points="402.5,153.3 404.0,157.4 405.5,161.3 407.0,165.0 408.6,168.6 410.1,172.1 411.6,175.5 413.1,178.7 414.6,181.8 416.1,184.9 417.6,187.8 419.1,190.6 420.6,193.3 422.2,196.0 423.7,198.5 425.2,201.0 426.7,203.4 428.2,205.8 429.7,208.0 431.2,210.2 432.8,212.4 434.3,214.5 435.8,216.5 437.3,218.4 438.8,220.4 440.3,222.2 441.8,224.0 443.3,225.8 444.9,227.5 446.4,229.2 447.9,230.8 449.4,232.4 450.9,234.0 452.4,235.5 453.9,237.0 455.4,238.4 456.9,239.8 458.5,241.2 460.0,242.6 461.5,243.9 463.0,245.2 464.5,246.4 466.0,247.7 467.5,248.9 469.1,250.1 470.6,251.2 472.1,252.4 473.6,253.5 475.1,254.6 476.6,255.6 478.1,256.7 479.6,257.7 481.1,258.7 482.7,259.7 484.2,260.6 485.7,261.6 487.2,262.5 488.7,263.4 490.2,264.3 491.7,265.2 493.3,266.1 494.8,266.9 496.3,267.7 497.8,268.6 499.3,269.4 500.8,270.1 502.3,270.9 503.8,271.7 505.4,272.4 506.9,273.2 508.4,273.9 509.9,274.6 511.4,275.3 512.9,276.0 514.4,276.7 515.9,277.3 517.5,278.0 519.0,278.6 520.5,279.3 522.0,279.9 523.5,280.5 525.0,281.1 526.5,281.7 528.0,282.3 529.5,282.9 531.1,283.5 532.6,284.0 534.1,284.6 535.6,285.1 537.1,285.7 538.6,286.2 540.1,286.7 541.7,287.2 543.2,287.7 544.7,288.2 546.2,288.7 547.7,289.2 549.2,289.7 550.7,290.2 552.2,290.6 553.8,291.1 555.3,291.6 556.8,292.0 558.3,292.5 559.8,292.9 561.3,293.3 562.8,293.8 564.3,294.2 565.9,294.6 567.4,295.0 568.9,295.4 570.4,295.8 571.9,296.2 573.4,296.6 574.9,297.0 576.4,297.4 578.0,297.8 579.5,298.1 581.0,298.5 582.5,298.9 584.0,299.2 585.5,299.6 587.0,299.9 588.5,300.3 590.0,300.6 591.6,301.0 593.1,301.3 594.6,301.6 596.1,301.9 597.6,302.3 599.1,302.6 600.6,302.9 602.1,303.2 603.7,303.5 605.2,303.8 606.7,304.1 608.2,304.4 609.7,304.7 611.2,305.0 612.7,305.3 614.3,305.6 615.8,305.9 617.3,306.2 618.8,306.5 620.3,306.7 621.8,307.0 623.3,307.3 624.8,307.5 626.4,307.8 627.9,308.1 629.4,308.3 630.9,308.6 632.4,308.8 633.9,309.1 635.4,309.3 636.9,309.6 638.5,309.8 640.0,310.1 641.5,310.3 643.0,310.6 644.5,310.8 646.0,311.0 647.5,311.3 649.0,311.5 650.5,311.7 
652.1,311.9 653.6,312.2 655.1,312.4 656.6,312.6 658.1,312.8 659.6,313.0 661.1,313.2 662.7,313.5 664.2,313.7 665.7,313.9 667.2,314.1 668.7,314.3 670.2,314.5 671.7,314.7 673.2,314.9 674.8,315.1 676.3,315.3 677.8,315.5 679.3,315.7 680.8,315.8 682.3,316.0 683.8,316.2 685.3,316.4 686.8,316.6 688.4,316.8 689.9,316.9 691.4,317.1 692.9,317.3 694.4,317.5 695.9,317.7 697.4,317.8 699.0,318.0 700.5,318.2 702.0,318.3 703.5,318.5 705.0,318.7" class="s2"/>
  <polyline points="402.5,50.0 404.0,50.0 405.5,50.0 407.0,50.0 408.6,50.0 410.1,50.0 411.6,50.0 413.1,50.0 414.6,50.0 416.1,61.9 417.6,78.2 419.1,92.8 420.6,105.9 422.2,117.8 423.7,128.7 425.2,138.6 426.7,147.7 428.2,156.1 429.7,163.8 431.2,171.0 432.8,177.6 434.3,183.9 435.8,189.7 437.3,195.1 438.8,200.2 440.3,205.0 441.8,209.5 443.3,213.8 444.9,217.8 446.4,221.6 447.9,225.2 449.4,228.6 450.9,231.9 452.4,235.0 453.9,238.0 455.4,240.8 456.9,243.5 458.5,246.0 460.0,248.5 461.5,250.8 463.0,253.1 464.5,255.3 466.0,257.4 467.5,259.4 469.1,261.3 470.6,263.1 472.1,264.9 473.6,266.6 475.1,268.3 476.6,269.9 478.1,271.4 479.6,272.9 481.1,274.4 482.7,275.8 484.2,277.1 485.7,278.4 487.2,279.7 488.7,280.9 490.2,282.1 491.7,283.3 493.3,284.4 494.8,285.5 496.3,286.5 497.8,287.6 499.3,288.6 500.8,289.5 502.3,290.5 503.8,291.4 505.4,292.3 506.9,293.2 508.4,294.0 509.9,294.9 511.4,295.7 512.9,296.5 514.4,297.2 515.9,298.0 517.5,298.7 519.0,299.5 520.5,300.2 522.0,300.8 523.5,301.5 525.0,302.2 526.5,302.8 528.0,303.4 529.5,304.0 531.1,304.6 532.6,305.2 534.1,305.8 535.6,306.4 537.1,306.9 538.6,307.5 540.1,308.0 541.7,308.5 543.2,309.0 544.7,309.5 546.2,310.0 547.7,310.5 549.2,310.9 550.7,311.4 552.2,311.9 553.8,312.3 555.3,312.7 556.8,313.2 558.3,313.6 559.8,314.0 561.3,314.4 562.8,314.8 564.3,315.2 565.9,315.6 567.4,316.0 568.9,316.3 570.4,316.7 571.9,317.1 573.4,317.4 574.9,317.8 576.4,318.1 578.0,318.4 579.5,318.8 581.0,319.1 582.5,319.4 584.0,319.7 585.5,320.1 587.0,320.4 588.5,320.7 590.0,321.0 591.6,321.3 593.1,321.5 594.6,321.8 596.1,322.1 597.6,322.4 599.1,322.7 600.6,322.9 602.1,323.2 603.7,323.4 605.2,323.7 606.7,324.0 608.2,324.2 609.7,324.4 611.2,324.7 612.7,324.9 614.3,325.2 615.8,325.4 617.3,325.6 618.8,325.9 620.3,326.1 621.8,326.3 623.3,326.5 624.8,326.7 626.4,327.0 627.9,327.2 629.4,327.4 630.9,327.6 632.4,327.8 633.9,328.0 635.4,328.2 636.9,328.4 638.5,328.6 640.0,328.8 641.5,328.9 643.0,329.1 644.5,329.3 646.0,329.5 647.5,329.7 649.0,329.8 650.5,330.0 652.1,330.2 
653.6,330.4 655.1,330.5 656.6,330.7 658.1,330.9 659.6,331.0 661.1,331.2 662.7,331.3 664.2,331.5 665.7,331.7 667.2,331.8 668.7,332.0 670.2,332.1 671.7,332.3 673.2,332.4 674.8,332.6 676.3,332.7 677.8,332.9 679.3,333.0 680.8,333.1 682.3,333.3 683.8,333.4 685.3,333.5 686.8,333.7 688.4,333.8 689.9,333.9 691.4,334.1 692.9,334.2 694.4,334.3 695.9,334.5 697.4,334.6 699.0,334.7 700.5,334.8 702.0,335.0 703.5,335.1 705.0,335.2" class="s1"/>
  <line x1="625" y1="70" x2="645" y2="70" class="s3"/>
  <text x="650" y="73" class="legend-label">b = 0</text>
  <line x1="625" y1="86" x2="645" y2="86" class="s2"/>
  <text x="650" y="89" class="legend-label">b = 0.5</text>
  <line x1="625" y1="102" x2="645" y2="102" class="s1"/>
  <text x="650" y="105" class="legend-label">b = 1</text>
  <text x="360" y="445" text-anchor="middle" class="formula">BM25(q,d) = Σ IDF(t) · (k₁+1)·tf / (k₁·(1−b+b·|d|/avgdl) + tf)</text>
</svg>
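<p>To make the saturation and length-normalization terms concrete, here is a minimal, hypothetical sketch of the per-term BM25 score in Rust. This is illustrative code, not qrst's actual implementation, and the parameter values are the common defaults rather than anything tuned:</p>

```rust
// Per-term BM25 contribution: idf * (k1+1)*tf / (tf + k1*(1 - b + b*|d|/avgdl)).
// Hypothetical helper for illustration; not taken from qrst.
fn bm25_term_score(tf: f64, idf: f64, doc_len: f64, avg_len: f64, k1: f64, b: f64) -> f64 {
    // Length normalization: b=0 ignores document length, b=1 fully scales by it.
    let norm = k1 * (1.0 - b + b * doc_len / avg_len);
    idf * (tf * (k1 + 1.0)) / (tf + norm)
}

fn main() {
    let idf = 1.0;
    // Saturation: going from tf=10 to tf=100 gains far less than 10x.
    let s10 = bm25_term_score(10.0, idf, 100.0, 100.0, 1.2, 0.75);
    let s100 = bm25_term_score(100.0, idf, 100.0, 100.0, 1.2, 0.75);
    println!("tf=10 -> {s10:.3}, tf=100 -> {s100:.3}");

    // Length penalty: with b=1 a doc twice the average length scores lower;
    // with b=0 its length is ignored entirely.
    let long_b1 = bm25_term_score(10.0, idf, 200.0, 100.0, 1.2, 1.0);
    let long_b0 = bm25_term_score(10.0, idf, 200.0, 100.0, 1.2, 0.0);
    println!("b=1 long doc -> {long_b1:.3}, b=0 -> {long_b0:.3}");
}
```

<p>With the common defaults (<code>k1 ≈ 1.2</code>, <code>b ≈ 0.75</code>), raising <code>tf</code> from 10 to 100 adds only about ten percent to the term score, which is exactly the saturation the left panel of the chart shows.</p>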
<p>Reciprocal Rank Fusion (RRF) then merges these scores into a single list. RRF uses a fixed formula: score each document as <code>1/(k + rank)</code> across both result lists, sort by combined score, done. It is scale-invariant: it doesn't care about the raw scores, only the ranks. The <code>k</code> parameter acts as a smoothing constant that determines how much weight to give to the top-ranked items versus the long tail. As <code>k</code> increases, the score gap between rank 1 and rank 10 shrinks, making the fusion more robust to noise in any single retriever.</p>
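<p>RRF is small enough to sketch in full. The following is an illustrative Rust version, not qrst's actual code, assuming 1-based ranks and the conventional <code>k = 60</code> from the original paper; the filenames are made up:</p>

```rust
use std::collections::HashMap;

// Reciprocal Rank Fusion: each ranked list contributes 1/(k + rank)
// per document, using 1-based ranks; raw retriever scores are ignored.
fn rrf_fuse(lists: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in lists {
        for (i, doc) in list.iter().enumerate() {
            *scores.entry((*doc).to_string()).or_insert(0.0) += 1.0 / (k + (i + 1) as f64);
        }
    }
    // Sort by fused score, highest first.
    let mut ranked: Vec<(String, f64)> = scores.into_iter().collect();
    ranked.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    ranked
}

fn main() {
    // Hypothetical result lists from the two retrievers.
    let bm25 = vec!["session.rs", "store.rs", "auth.rs"];
    let vectors = vec!["store.rs", "cache.rs", "session.rs"];
    // Documents near the top of BOTH lists outrank documents
    // that only one retriever found.
    for (doc, score) in rrf_fuse(&[bm25, vectors], 60.0) {
        println!("{doc}: {score:.4}");
    }
}
```

<p>Because only ranks enter the formula, BM25's unbounded scores and cosine similarities in <code>[-1, 1]</code> need no calibration against each other, which is a large part of why RRF is so hard to beat in practice.</p>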
<svg id="rrf-chart" viewBox="0 0 720 380" xmlns="http://www.w3.org/2000/svg" style="width:100%;max-width:720px;">
  <style>
    #rrf-chart .axis { stroke: var(--c-brd); stroke-width: 1; }
    #rrf-chart .grid { stroke: var(--c-brd); stroke-width: 0.5; opacity: 0.3; }
    #rrf-chart .tick-label { font: 10px "JetBrains Mono", monospace; fill: var(--c-muted); }
    #rrf-chart .axis-label { font: 12px "Source Serif 4", serif; fill: var(--c-sec); }
    #rrf-chart .chart-title { font: 500 14px "Playfair Display", serif; fill: var(--c-pri); }
    #rrf-chart .formula { font: 11px "JetBrains Mono", monospace; fill: var(--c-sec); }
    #rrf-chart .caption { font: italic 11px "Source Serif 4", serif; fill: var(--c-muted); }
    #rrf-chart .legend-label { font: 10px "JetBrains Mono", monospace; fill: var(--c-muted); }
    #rrf-chart .s1 { stroke: var(--c-acc); fill: none; stroke-width: 2; }
    #rrf-chart .s2 { stroke: var(--c-acc2); fill: none; stroke-width: 2; stroke-dasharray: 8 4; }
    #rrf-chart .s3 { stroke: var(--c-muted); fill: none; stroke-width: 2; stroke-dasharray: 3 3; }
    #rrf-chart .bar-s1 { fill: var(--c-acc); }
    #rrf-chart .bar-s2 { fill: var(--c-acc2); }
    #rrf-chart .bar-s3 { fill: var(--c-muted); }
    #rrf-chart .bar-s4 { fill: var(--c-sec); }
    #rrf-chart .bar-label { font: 9px "JetBrains Mono", monospace; fill: var(--c-pri); }
  </style>
  <text x="360" y="28" text-anchor="middle" class="chart-title">Reciprocal Rank Fusion: Score vs. Rank</text>
  <line x1="70" y1="310.0" x2="700" y2="310.0" class="grid"/>
  <line x1="70" y1="258.0" x2="700" y2="258.0" class="grid"/>
  <line x1="70" y1="206.0" x2="700" y2="206.0" class="grid"/>
  <line x1="70" y1="154.0" x2="700" y2="154.0" class="grid"/>
  <line x1="70" y1="102.0" x2="700" y2="102.0" class="grid"/>
  <line x1="70" y1="50.0" x2="700" y2="50.0" class="grid"/>
  <line x1="70.0" y1="50" x2="70.0" y2="310" class="grid"/>
  <line x1="127.3" y1="50" x2="127.3" y2="310" class="grid"/>
  <line x1="190.9" y1="50" x2="190.9" y2="310" class="grid"/>
  <line x1="254.5" y1="50" x2="254.5" y2="310" class="grid"/>
  <line x1="318.2" y1="50" x2="318.2" y2="310" class="grid"/>
  <line x1="381.8" y1="50" x2="381.8" y2="310" class="grid"/>
  <line x1="445.5" y1="50" x2="445.5" y2="310" class="grid"/>
  <line x1="509.1" y1="50" x2="509.1" y2="310" class="grid"/>
  <line x1="572.7" y1="50" x2="572.7" y2="310" class="grid"/>
  <line x1="636.4" y1="50" x2="636.4" y2="310" class="grid"/>
  <line x1="700.0" y1="50" x2="700.0" y2="310" class="grid"/>
  <line x1="70" y1="310" x2="700" y2="310" class="axis"/>
  <line x1="70" y1="50" x2="70" y2="310" class="axis"/>
  <text x="70.0" y="328" text-anchor="middle" class="tick-label">1</text>
  <text x="127.3" y="328" text-anchor="middle" class="tick-label">10</text>
  <text x="190.9" y="328" text-anchor="middle" class="tick-label">20</text>
  <text x="254.5" y="328" text-anchor="middle" class="tick-label">30</text>
  <text x="318.2" y="328" text-anchor="middle" class="tick-label">40</text>
  <text x="381.8" y="328" text-anchor="middle" class="tick-label">50</text>
  <text x="445.5" y="328" text-anchor="middle" class="tick-label">60</text>
  <text x="509.1" y="328" text-anchor="middle" class="tick-label">70</text>
  <text x="572.7" y="328" text-anchor="middle" class="tick-label">80</text>
  <text x="636.4" y="328" text-anchor="middle" class="tick-label">90</text>
  <text x="700.0" y="328" text-anchor="middle" class="tick-label">100</text>
  <text x="62" y="314.0" text-anchor="end" class="tick-label">0.00</text>
  <text x="62" y="262.0" text-anchor="end" class="tick-label">0.02</text>
  <text x="62" y="210.0" text-anchor="end" class="tick-label">0.04</text>
  <text x="62" y="158.0" text-anchor="end" class="tick-label">0.06</text>
  <text x="62" y="106.0" text-anchor="end" class="tick-label">0.08</text>
  <text x="62" y="54.0" text-anchor="end" class="tick-label">0.10</text>
  <text x="385" y="355" text-anchor="middle" class="axis-label">Rank</text>
  <text x="16" y="180" text-anchor="middle" class="axis-label" transform="rotate(-90, 16, 180)">RRF Score</text>
  <polyline points="70.0,73.6 73.2,83.8 76.3,93.2 79.5,101.8 82.6,109.7 85.8,117.1 88.9,123.9 92.0,130.3 95.2,136.2 98.3,141.8 101.5,147.0 104.7,151.9 107.8,156.5 111.0,160.9 114.1,165.0 117.3,168.9 120.4,172.6 123.6,176.1 126.7,179.4 129.8,182.6 133.0,185.6 136.1,188.5 139.3,191.2 142.4,193.9 145.6,196.4 148.8,198.8 151.9,201.1 155.1,203.3 158.2,205.4 161.3,207.5 164.5,209.4 167.7,211.3 170.8,213.1 173.9,214.9 177.1,216.6 180.3,218.2 183.4,219.8 186.6,221.3 189.7,222.8 192.9,224.2 196.0,225.6 199.2,226.9 202.3,228.2 205.4,229.5 208.6,230.7 211.8,231.9 214.9,233.0 218.0,234.1 221.2,235.2 224.3,236.3 227.5,237.3 230.7,238.3 233.8,239.2 237.0,240.2 240.1,241.1 243.3,242.0 246.4,242.9 249.5,243.7 252.7,244.5 255.8,245.3 259.0,246.1 262.1,246.9 265.3,247.6 268.4,248.4 271.6,249.1 274.8,249.8 277.9,250.5 281.0,251.1 284.2,251.8 287.3,252.4 290.5,253.0 293.6,253.7 296.8,254.3 299.9,254.8 303.1,255.4 306.3,256.0 309.4,256.5 312.6,257.1 315.7,257.6 318.9,258.1 322.0,258.6 325.2,259.1 328.3,259.6 331.5,260.1 334.6,260.6 337.7,261.0 340.9,261.5 344.1,261.9 347.2,262.3 350.4,262.8 353.5,263.2 356.7,263.6 359.8,264.0 362.9,264.4 366.1,264.8 369.3,265.2 372.4,265.6 375.6,265.9 378.7,266.3 381.9,266.7 385.0,267.0 388.1,267.4 391.3,267.7 394.4,268.1 397.6,268.4 400.8,268.7 403.9,269.0 407.1,269.4 410.2,269.7 413.4,270.0 416.5,270.3 419.7,270.6 422.8,270.9 425.9,271.2 429.1,271.4 432.3,271.7 435.4,272.0 438.5,272.3 441.7,272.5 444.8,272.8 448.0,273.1 451.1,273.3 454.3,273.6 457.4,273.8 460.6,274.1 463.8,274.3 466.9,274.6 470.1,274.8 473.2,275.0 476.4,275.3 479.5,275.5 482.7,275.7 485.8,275.9 489.0,276.2 492.1,276.4 495.3,276.6 498.4,276.8 501.6,277.0 504.7,277.2 507.8,277.4 511.0,277.6 514.2,277.8 517.3,278.0 520.5,278.2 523.6,278.4 526.8,278.6 529.9,278.8 533.0,279.0 536.2,279.1 539.4,279.3 542.5,279.5 545.6,279.7 548.8,279.9 552.0,280.0 555.1,280.2 558.3,280.4 561.4,280.5 564.5,280.7 567.7,280.9 570.8,281.0 574.0,281.2 577.2,281.3 580.3,281.5 583.4,281.6 586.6,281.8 589.8,281.9 
592.9,282.1 596.0,282.2 599.2,282.4 602.4,282.5 605.5,282.7 608.6,282.8 611.8,283.0 615.0,283.1 618.1,283.2 621.3,283.4 624.4,283.5 627.5,283.6 630.7,283.8 633.9,283.9 637.0,284.0 640.1,284.2 643.3,284.3 646.5,284.4 649.6,284.5 652.8,284.7 655.9,284.8 659.1,284.9 662.2,285.0 665.4,285.1 668.5,285.2 671.7,285.4 674.8,285.5 677.9,285.6 681.1,285.7 684.2,285.8 687.4,285.9 690.5,286.0 693.7,286.1 696.9,286.3 700.0,286.4" class="s1"/>
  <polyline points="70.0,226.1 73.2,227.4 76.3,228.7 79.5,230.0 82.6,231.2 85.8,232.3 88.9,233.5 92.0,234.6 95.2,235.6 98.3,236.7 101.5,237.7 104.7,238.7 107.8,239.6 111.0,240.5 114.1,241.5 117.3,242.3 120.4,243.2 123.6,244.0 126.7,244.9 129.8,245.7 133.0,246.4 136.1,247.2 139.3,247.9 142.4,248.7 145.6,249.4 148.8,250.1 151.9,250.7 155.1,251.4 158.2,252.0 161.3,252.7 164.5,253.3 167.7,253.9 170.8,254.5 173.9,255.1 177.1,255.6 180.3,256.2 183.4,256.7 186.6,257.3 189.7,257.8 192.9,258.3 196.0,258.8 199.2,259.3 202.3,259.8 205.4,260.3 208.6,260.7 211.8,261.2 214.9,261.6 218.0,262.1 221.2,262.5 224.3,262.9 227.5,263.4 230.7,263.8 233.8,264.2 237.0,264.6 240.1,265.0 243.3,265.3 246.4,265.7 249.5,266.1 252.7,266.5 255.8,266.8 259.0,267.2 262.1,267.5 265.3,267.9 268.4,268.2 271.6,268.5 274.8,268.8 277.9,269.2 281.0,269.5 284.2,269.8 287.3,270.1 290.5,270.4 293.6,270.7 296.8,271.0 299.9,271.3 303.1,271.6 306.3,271.8 309.4,272.1 312.6,272.4 315.7,272.6 318.9,272.9 322.0,273.2 325.2,273.4 328.3,273.7 331.5,273.9 334.6,274.2 337.7,274.4 340.9,274.7 344.1,274.9 347.2,275.1 350.4,275.4 353.5,275.6 356.7,275.8 359.8,276.0 362.9,276.2 366.1,276.5 369.3,276.7 372.4,276.9 375.6,277.1 378.7,277.3 381.9,277.5 385.0,277.7 388.1,277.9 391.3,278.1 394.4,278.3 397.6,278.5 400.8,278.7 403.9,278.9 407.1,279.0 410.2,279.2 413.4,279.4 416.5,279.6 419.7,279.7 422.8,279.9 425.9,280.1 429.1,280.3 432.3,280.4 435.4,280.6 438.5,280.8 441.7,280.9 444.8,281.1 448.0,281.2 451.1,281.4 454.3,281.6 457.4,281.7 460.6,281.9 463.8,282.0 466.9,282.2 470.1,282.3 473.2,282.4 476.4,282.6 479.5,282.7 482.7,282.9 485.8,283.0 489.0,283.2 492.1,283.3 495.3,283.4 498.4,283.6 501.6,283.7 504.7,283.8 507.8,283.9 511.0,284.1 514.2,284.2 517.3,284.3 520.5,284.5 523.6,284.6 526.8,284.7 529.9,284.8 533.0,284.9 536.2,285.1 539.4,285.2 542.5,285.3 545.6,285.4 548.8,285.5 552.0,285.6 555.1,285.8 558.3,285.9 561.4,286.0 564.5,286.1 567.7,286.2 570.8,286.3 574.0,286.4 577.2,286.5 580.3,286.6 583.4,286.7 586.6,286.8 
589.8,286.9 592.9,287.0 596.0,287.1 599.2,287.2 602.4,287.3 605.5,287.4 608.6,287.5 611.8,287.6 615.0,287.7 618.1,287.8 621.3,287.9 624.4,288.0 627.5,288.1 630.7,288.2 633.9,288.3 637.0,288.4 640.1,288.4 643.3,288.5 646.5,288.6 649.6,288.7 652.8,288.8 655.9,288.9 659.1,289.0 662.2,289.0 665.4,289.1 668.5,289.2 671.7,289.3 674.8,289.4 677.9,289.5 681.1,289.5 684.2,289.6 687.4,289.7 690.5,289.8 693.7,289.8 696.9,289.9 700.0,290.0" class="s2"/>
  <polyline points="70.0,267.4 73.2,267.7 76.3,268.1 79.5,268.4 82.6,268.7 85.8,269.0 88.9,269.4 92.0,269.7 95.2,270.0 98.3,270.3 101.5,270.6 104.7,270.9 107.8,271.2 111.0,271.4 114.1,271.7 117.3,272.0 120.4,272.3 123.6,272.5 126.7,272.8 129.8,273.1 133.0,273.3 136.1,273.6 139.3,273.8 142.4,274.1 145.6,274.3 148.8,274.6 151.9,274.8 155.1,275.0 158.2,275.3 161.3,275.5 164.5,275.7 167.7,275.9 170.8,276.2 173.9,276.4 177.1,276.6 180.3,276.8 183.4,277.0 186.6,277.2 189.7,277.4 192.9,277.6 196.0,277.8 199.2,278.0 202.3,278.2 205.4,278.4 208.6,278.6 211.8,278.8 214.9,279.0 218.0,279.1 221.2,279.3 224.3,279.5 227.5,279.7 230.7,279.9 233.8,280.0 237.0,280.2 240.1,280.4 243.3,280.5 246.4,280.7 249.5,280.9 252.7,281.0 255.8,281.2 259.0,281.3 262.1,281.5 265.3,281.6 268.4,281.8 271.6,281.9 274.8,282.1 277.9,282.2 281.0,282.4 284.2,282.5 287.3,282.7 290.5,282.8 293.6,283.0 296.8,283.1 299.9,283.2 303.1,283.4 306.3,283.5 309.4,283.6 312.6,283.8 315.7,283.9 318.9,284.0 322.0,284.2 325.2,284.3 328.3,284.4 331.5,284.5 334.6,284.7 337.7,284.8 340.9,284.9 344.1,285.0 347.2,285.1 350.4,285.3 353.5,285.4 356.7,285.5 359.8,285.6 362.9,285.7 366.1,285.8 369.3,285.9 372.4,286.0 375.6,286.2 378.7,286.3 381.9,286.4 385.0,286.5 388.1,286.6 391.3,286.7 394.4,286.8 397.6,286.9 400.8,287.0 403.9,287.1 407.1,287.2 410.2,287.3 413.4,287.4 416.5,287.5 419.7,287.6 422.8,287.7 425.9,287.8 429.1,287.9 432.3,288.0 435.4,288.0 438.5,288.1 441.7,288.2 444.8,288.3 448.0,288.4 451.1,288.5 454.3,288.6 457.4,288.7 460.6,288.8 463.8,288.8 466.9,288.9 470.1,289.0 473.2,289.1 476.4,289.2 479.5,289.3 482.7,289.3 485.8,289.4 489.0,289.5 492.1,289.6 495.3,289.7 498.4,289.7 501.6,289.8 504.7,289.9 507.8,290.0 511.0,290.0 514.2,290.1 517.3,290.2 520.5,290.3 523.6,290.3 526.8,290.4 529.9,290.5 533.0,290.6 536.2,290.6 539.4,290.7 542.5,290.8 545.6,290.8 548.8,290.9 552.0,291.0 555.1,291.1 558.3,291.1 561.4,291.2 564.5,291.3 567.7,291.3 570.8,291.4 574.0,291.5 577.2,291.5 580.3,291.6 583.4,291.6 586.6,291.7 
589.8,291.8 592.9,291.8 596.0,291.9 599.2,292.0 602.4,292.0 605.5,292.1 608.6,292.1 611.8,292.2 615.0,292.3 618.1,292.3 621.3,292.4 624.4,292.4 627.5,292.5 630.7,292.6 633.9,292.6 637.0,292.7 640.1,292.7 643.3,292.8 646.5,292.8 649.6,292.9 652.8,293.0 655.9,293.0 659.1,293.1 662.2,293.1 665.4,293.2 668.5,293.2 671.7,293.3 674.8,293.3 677.9,293.4 681.1,293.4 684.2,293.5 687.4,293.5 690.5,293.6 693.7,293.6 696.9,293.7 700.0,293.8" class="s3"/>
  <line x1="560" y1="70" x2="584" y2="70" class="s1"/>
  <text x="590" y="74" class="legend-label">k = 10</text>
  <line x1="560" y1="88" x2="584" y2="88" class="s2"/>
  <text x="590" y="92" class="legend-label">k = 30</text>
  <line x1="560" y1="106" x2="584" y2="106" class="s3"/>
  <text x="590" y="110" class="legend-label">k = 60</text>
  <text x="360" y="368" text-anchor="middle" class="formula">RRF(d) = Σ 1 / (k + rank_i(d))</text>
</svg>
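<p>To make the formula concrete, here is a toy sketch of the fusion step in Rust. The name <code>rrf_fuse</code> and the shape of the inputs are mine for illustration, not the qrst implementation, but the scoring matches the formula above (1-based ranks, one term per list a document appears in):</p>

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion over two ranked lists (ranks are 1-based):
/// RRF(d) = Σ 1 / (k + rank_i(d)), summed over the lists containing d.
fn rrf_fuse(semantic: &[&str], keyword: &[&str], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in [semantic, keyword] {
        for (rank0, doc) in list.iter().enumerate() {
            *scores.entry((*doc).to_string()).or_insert(0.0) +=
                1.0 / (k + (rank0 + 1) as f64);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Highest fused score first.
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}

fn main() {
    // "apple" sits near the top of both lists, so it wins the fusion;
    // "gamma" appears in both and outranks the single-list documents.
    let fused = rrf_fuse(&["apple", "beta", "gamma"], &["apple", "gamma", "delta"], 60.0);
    println!("{:?}", fused);
}
```

<p>Note how strongly the formula rewards appearing in both lists: with k = 60 the rank differences within a list are small, but a second term roughly doubles a document's score.</p>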
<p>A learning-to-rank (LTR) model replaces the fixed formula with a linear function over multiple features, trained on human relevance judgments. Even a simple model should be able to outperform a single static knob by adapting to the characteristics of the corpus.</p>
<p>I designed 7 features:</p>
<table>
<thead>
<tr>
<th>#</th>
<th>Feature</th>
<th>Range</th>
<th>Rationale</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td><code>semantic_score</code></td>
<td>[0,1]</td>
<td>Raw cosine similarity from vector search</td>
</tr>
<tr>
<td>2</td>
<td><code>semantic_rank_norm</code></td>
<td>[0,1]</td>
<td>Normalized rank position in semantic results</td>
</tr>
<tr>
<td>3</td>
<td><code>keyword_rank_norm</code></td>
<td>[0,1]</td>
<td>Normalized rank position in keyword results</td>
</tr>
<tr>
<td>4</td>
<td><code>in_both</code></td>
<td>{0,1}</td>
<td>1 if the document appears in both lists</td>
</tr>
<tr>
<td>5</td>
<td><code>rrf_score</code></td>
<td>(0,~0.03)</td>
<td>Standard RRF score (so the model can replicate RRF)</td>
</tr>
<tr>
<td>6</td>
<td><code>path_depth_norm</code></td>
<td>[0,1]</td>
<td>File path depth as a document-level prior</td>
</tr>
<tr>
<td>7</td>
<td><code>content_length_norm</code></td>
<td>[0,1]</td>
<td>Content length as a document-level prior</td>
</tr>
</tbody>
</table>
<p>The scoring function is a dot product: <code>score(doc) = bias + Σ(w_i * f_i)</code>. Eight floats in total: seven weights plus a bias.</p>
<p>A key design choice: including <code>rrf_score</code> as a feature means the model can replicate RRF exactly by zeroing all other weights. It can only improve, never regress below the baseline. Or so I thought.</p>
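<p>A minimal sketch of such a scorer, with illustrative names rather than the actual qrst types, makes the replication argument tangible:</p>

```rust
/// Seven features, in the order of the table above.
const N_FEATURES: usize = 7;

struct LinearModel {
    bias: f64,
    weights: [f64; N_FEATURES],
}

impl LinearModel {
    /// score(doc) = bias + Σ(w_i * f_i)
    fn score(&self, features: &[f64; N_FEATURES]) -> f64 {
        self.bias
            + self.weights.iter().zip(features).map(|(w, f)| w * f).sum::<f64>()
    }
}

fn main() {
    // Zero every weight except rrf_score (feature index 4) and the model
    // degenerates to plain RRF: its output *is* the rrf_score feature.
    let mut model = LinearModel { bias: 0.0, weights: [0.0; N_FEATURES] };
    model.weights[4] = 1.0;
    let features = [0.8, 0.9, 0.1, 1.0, 0.0161, 0.5, 0.4];
    println!("{}", model.score(&features));
}
```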
<h2 id="training"><a href="#training">Training</a></h2>
<p>The training pipeline lives in <code>qrst-bench</code>, a separate benchmark crate. For each of the 42 evaluation queries:</p>
<ol>
<li>Run <code>qrst vsearch</code> (semantic) and <code>qrst search</code> (keyword) as subprocesses</li>
<li>Collect scored results from both</li>
<li>Extract the 7 features for every candidate document</li>
<li>Look up the human relevance grade (0-3) from the judgment file</li>
</ol>
<p>This produced 460 training samples (101 relevant documents). The relevance judgments were created manually: I ran each query, reviewed the top results, and assigned grades from 0 (irrelevant) to 3 (highly relevant). On a small corpus this takes an afternoon. On a production corpus it becomes the bottleneck, which is why most production LTR systems rely on implicit signals like click-through rates, dwell time, or query reformulations rather than manual labels.</p>
<p>I trained using <a href="https://www.microsoft.com/en-us/research/uploads/prod/2016/02/MSR-TR-2010-82.pdf">pairwise hinge loss</a> with SGD: for every pair of documents where one has a higher relevance grade, push the model to score it higher.</p>
<pre class="athl code-block" style="color: #e6edf3; background-color: #30363d;"><code class="language-plaintext" translate="no" tabindex="0"><div class="line" data-line="1">loss = max(0, margin - (score_better - score_worse))
</div><div class="line" data-line="2">margin = grade_difference × 0.1
</div></code></pre>
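<p>Written out as code (the function name is mine, not from <code>qrst-bench</code>), the loss is a few lines. A pair that is already separated by at least the margin contributes zero gradient; a pair that is ordered correctly but too close still gets pushed apart:</p>

```rust
/// Pairwise hinge loss for one (better, worse) document pair,
/// with the margin scaled by the relevance-grade gap as described above.
fn hinge_loss(score_better: f64, score_worse: f64, grade_diff: u8) -> f64 {
    let margin = f64::from(grade_diff) * 0.1;
    (margin - (score_better - score_worse)).max(0.0)
}

fn main() {
    // Separated by more than the margin: zero loss.
    println!("{}", hinge_loss(0.9, 0.5, 2));
    // Ordered correctly, but not by the required margin: positive loss.
    println!("{}", hinge_loss(0.50, 0.45, 3));
}
```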
<svg id="hinge-chart" viewBox="0 0 720 380" xmlns="http://www.w3.org/2000/svg" style="width:100%;max-width:720px;">
  <style>
    #hinge-chart .axis { stroke: var(--c-brd); stroke-width: 1; }
    #hinge-chart .grid { stroke: var(--c-brd); stroke-width: 0.5; opacity: 0.3; }
    #hinge-chart .tick-label { font: 10px "JetBrains Mono", monospace; fill: var(--c-muted); }
    #hinge-chart .axis-label { font: 12px "Source Serif 4", serif; fill: var(--c-sec); }
    #hinge-chart .chart-title { font: 500 14px "Playfair Display", serif; fill: var(--c-pri); }
    #hinge-chart .formula { font: 11px "JetBrains Mono", monospace; fill: var(--c-sec); }
    #hinge-chart .caption { font: italic 11px "Source Serif 4", serif; fill: var(--c-muted); }
    #hinge-chart .legend-label { font: 10px "JetBrains Mono", monospace; fill: var(--c-muted); }
    #hinge-chart .s1 { stroke: var(--c-acc); fill: none; stroke-width: 2; }
    #hinge-chart .s2 { stroke: var(--c-acc2); fill: none; stroke-width: 2; stroke-dasharray: 8 4; }
    #hinge-chart .s3 { stroke: var(--c-muted); fill: none; stroke-width: 2; stroke-dasharray: 3 3; }
    #hinge-chart .bar-s1 { fill: var(--c-acc); }
    #hinge-chart .bar-s2 { fill: var(--c-acc2); }
    #hinge-chart .bar-s3 { fill: var(--c-muted); }
    #hinge-chart .bar-s4 { fill: var(--c-sec); }
    #hinge-chart .bar-label { font: 9px "JetBrains Mono", monospace; fill: var(--c-pri); }
  </style>
  <text x="360" y="28" text-anchor="middle" class="chart-title">Pairwise Hinge Loss</text>
  <line x1="70" y1="310.0" x2="700" y2="310.0" class="grid"/>
  <line x1="70" y1="245.0" x2="700" y2="245.0" class="grid"/>
  <line x1="70" y1="180.0" x2="700" y2="180.0" class="grid"/>
  <line x1="70" y1="115.0" x2="700" y2="115.0" class="grid"/>
  <line x1="70" y1="50.0" x2="700" y2="50.0" class="grid"/>
  <line x1="70.0" y1="50" x2="70.0" y2="310" class="grid"/>
  <line x1="175.0" y1="50" x2="175.0" y2="310" class="grid"/>
  <line x1="280.0" y1="50" x2="280.0" y2="310" class="grid"/>
  <line x1="385.0" y1="50" x2="385.0" y2="310" class="grid"/>
  <line x1="490.0" y1="50" x2="490.0" y2="310" class="grid"/>
  <line x1="595.0" y1="50" x2="595.0" y2="310" class="grid"/>
  <line x1="700.0" y1="50" x2="700.0" y2="310" class="grid"/>
  <line x1="70" y1="310" x2="700" y2="310" class="axis"/>
  <line x1="70" y1="50" x2="70" y2="310" class="axis"/>
  <text x="70.0" y="328" text-anchor="middle" class="tick-label">-0.50</text>
  <text x="175.0" y="328" text-anchor="middle" class="tick-label">-0.25</text>
  <text x="280.0" y="328" text-anchor="middle" class="tick-label">0.00</text>
  <text x="385.0" y="328" text-anchor="middle" class="tick-label">0.25</text>
  <text x="490.0" y="328" text-anchor="middle" class="tick-label">0.50</text>
  <text x="595.0" y="328" text-anchor="middle" class="tick-label">0.75</text>
  <text x="700.0" y="328" text-anchor="middle" class="tick-label">1.00</text>
  <text x="62" y="314.0" text-anchor="end" class="tick-label">0.0</text>
  <text x="62" y="249.0" text-anchor="end" class="tick-label">0.2</text>
  <text x="62" y="184.0" text-anchor="end" class="tick-label">0.4</text>
  <text x="62" y="119.0" text-anchor="end" class="tick-label">0.6</text>
  <text x="62" y="54.0" text-anchor="end" class="tick-label">0.8</text>
  <text x="385" y="355" text-anchor="middle" class="axis-label">Δscore (s_better − s_worse)</text>
  <text x="16" y="180" text-anchor="middle" class="axis-label" transform="rotate(-90, 16, 180)">Loss</text>
  <line x1="280.0" y1="50" x2="280.0" y2="310" stroke="var(--c-brd)" stroke-width="0.5" stroke-dasharray="4 2" opacity="0.6"/>
  <polyline points="70.0,50.0 73.2,52.4 76.3,54.9 79.5,57.3 82.6,59.8 85.7,62.2 88.9,64.6 92.0,67.1 95.2,69.5 98.4,71.9 101.5,74.4 104.7,76.8 107.8,79.3 111.0,81.7 114.1,84.1 117.3,86.6 120.4,89.0 123.6,91.4 126.7,93.9 129.9,96.3 133.0,98.8 136.1,101.2 139.3,103.6 142.4,106.1 145.6,108.5 148.8,110.9 151.9,113.4 155.1,115.8 158.2,118.3 161.3,120.7 164.5,123.1 167.7,125.6 170.8,128.0 173.9,130.4 177.1,132.9 180.2,135.3 183.4,137.8 186.5,140.2 189.7,142.6 192.8,145.1 196.0,147.5 199.2,149.9 202.3,152.4 205.4,154.8 208.6,157.3 211.8,159.7 214.9,162.1 218.0,164.6 221.2,167.0 224.3,169.4 227.5,171.9 230.7,174.3 233.8,176.8 237.0,179.2 240.1,181.6 243.3,184.1 246.4,186.5 249.5,188.9 252.7,191.4 255.8,193.8 259.0,196.3 262.1,198.7 265.3,201.1 268.4,203.6 271.6,206.0 274.8,208.4 277.9,210.9 281.1,213.3 284.2,215.8 287.4,218.2 290.5,220.6 293.6,223.1 296.8,225.5 299.9,227.9 303.1,230.4 306.3,232.8 309.4,235.3 312.6,237.7 315.7,240.1 318.9,242.6 322.0,245.0 325.1,247.4 328.3,249.9 331.4,252.3 334.6,254.8 337.8,257.2 340.9,259.6 344.1,262.1 347.2,264.5 350.4,266.9 353.5,269.4 356.7,271.8 359.8,274.3 362.9,276.7 366.1,279.1 369.2,281.6 372.4,284.0 375.6,286.4 378.7,288.9 381.8,291.3 385.0,293.8 388.1,296.2 391.3,298.6 394.4,301.1 397.6,303.5 400.8,305.9 403.9,308.4 407.1,310.0 410.2,310.0 413.4,310.0 416.5,310.0 419.7,310.0 422.8,310.0 425.9,310.0 429.1,310.0 432.3,310.0 435.4,310.0 438.5,310.0 441.7,310.0 444.8,310.0 448.0,310.0 451.1,310.0 454.3,310.0 457.4,310.0 460.6,310.0 463.8,310.0 466.9,310.0 470.1,310.0 473.2,310.0 476.4,310.0 479.5,310.0 482.7,310.0 485.8,310.0 489.0,310.0 492.1,310.0 495.3,310.0 498.4,310.0 501.6,310.0 504.7,310.0 507.8,310.0 511.0,310.0 514.1,310.0 517.3,310.0 520.5,310.0 523.6,310.0 526.8,310.0 529.9,310.0 533.0,310.0 536.2,310.0 539.4,310.0 542.5,310.0 545.6,310.0 548.8,310.0 552.0,310.0 555.1,310.0 558.3,310.0 561.4,310.0 564.5,310.0 567.7,310.0 570.9,310.0 574.0,310.0 577.2,310.0 580.3,310.0 583.4,310.0 586.6,310.0 589.7,310.0 592.9,310.0 
596.0,310.0 599.2,310.0 602.4,310.0 605.5,310.0 608.6,310.0 611.8,310.0 614.9,310.0 618.1,310.0 621.3,310.0 624.4,310.0 627.6,310.0 630.7,310.0 633.9,310.0 637.0,310.0 640.1,310.0 643.3,310.0 646.5,310.0 649.6,310.0 652.8,310.0 655.9,310.0 659.1,310.0 662.2,310.0 665.4,310.0 668.5,310.0 671.6,310.0 674.8,310.0 677.9,310.0 681.1,310.0 684.3,310.0 687.4,310.0 690.5,310.0 693.7,310.0 696.9,310.0 700.0,310.0" class="s1"/>
  <polyline points="70.0,82.5 73.2,84.9 76.3,87.4 79.5,89.8 82.6,92.3 85.7,94.7 88.9,97.1 92.0,99.6 95.2,102.0 98.4,104.4 101.5,106.9 104.7,109.3 107.8,111.7 111.0,114.2 114.1,116.6 117.3,119.1 120.4,121.5 123.6,123.9 126.7,126.4 129.9,128.8 133.0,131.3 136.1,133.7 139.3,136.1 142.4,138.6 145.6,141.0 148.8,143.4 151.9,145.9 155.1,148.3 158.2,150.8 161.3,153.2 164.5,155.6 167.7,158.1 170.8,160.5 173.9,162.9 177.1,165.4 180.2,167.8 183.4,170.3 186.5,172.7 189.7,175.1 192.8,177.6 196.0,180.0 199.2,182.4 202.3,184.9 205.4,187.3 208.6,189.8 211.8,192.2 214.9,194.6 218.0,197.1 221.2,199.5 224.3,201.9 227.5,204.4 230.7,206.8 233.8,209.3 237.0,211.7 240.1,214.1 243.3,216.6 246.4,219.0 249.5,221.4 252.7,223.9 255.8,226.3 259.0,228.8 262.1,231.2 265.3,233.6 268.4,236.1 271.6,238.5 274.8,240.9 277.9,243.4 281.1,245.8 284.2,248.3 287.4,250.7 290.5,253.1 293.6,255.6 296.8,258.0 299.9,260.4 303.1,262.9 306.3,265.3 309.4,267.8 312.6,270.2 315.7,272.6 318.9,275.1 322.0,277.5 325.1,279.9 328.3,282.4 331.4,284.8 334.6,287.3 337.8,289.7 340.9,292.1 344.1,294.6 347.2,297.0 350.4,299.4 353.5,301.9 356.7,304.3 359.8,306.8 362.9,309.2 366.1,310.0 369.2,310.0 372.4,310.0 375.6,310.0 378.7,310.0 381.8,310.0 385.0,310.0 388.1,310.0 391.3,310.0 394.4,310.0 397.6,310.0 400.8,310.0 403.9,310.0 407.1,310.0 410.2,310.0 413.4,310.0 416.5,310.0 419.7,310.0 422.8,310.0 425.9,310.0 429.1,310.0 432.3,310.0 435.4,310.0 438.5,310.0 441.7,310.0 444.8,310.0 448.0,310.0 451.1,310.0 454.3,310.0 457.4,310.0 460.6,310.0 463.8,310.0 466.9,310.0 470.1,310.0 473.2,310.0 476.4,310.0 479.5,310.0 482.7,310.0 485.8,310.0 489.0,310.0 492.1,310.0 495.3,310.0 498.4,310.0 501.6,310.0 504.7,310.0 507.8,310.0 511.0,310.0 514.1,310.0 517.3,310.0 520.5,310.0 523.6,310.0 526.8,310.0 529.9,310.0 533.0,310.0 536.2,310.0 539.4,310.0 542.5,310.0 545.6,310.0 548.8,310.0 552.0,310.0 555.1,310.0 558.3,310.0 561.4,310.0 564.5,310.0 567.7,310.0 570.9,310.0 574.0,310.0 577.2,310.0 580.3,310.0 583.4,310.0 586.6,310.0 589.7,310.0 
592.9,310.0 596.0,310.0 599.2,310.0 602.4,310.0 605.5,310.0 608.6,310.0 611.8,310.0 614.9,310.0 618.1,310.0 621.3,310.0 624.4,310.0 627.6,310.0 630.7,310.0 633.9,310.0 637.0,310.0 640.1,310.0 643.3,310.0 646.5,310.0 649.6,310.0 652.8,310.0 655.9,310.0 659.1,310.0 662.2,310.0 665.4,310.0 668.5,310.0 671.6,310.0 674.8,310.0 677.9,310.0 681.1,310.0 684.3,310.0 687.4,310.0 690.5,310.0 693.7,310.0 696.9,310.0 700.0,310.0" class="s2"/>
  <polyline points="70.0,115.0 73.2,117.4 76.3,119.9 79.5,122.3 82.6,124.8 85.7,127.2 88.9,129.6 92.0,132.1 95.2,134.5 98.4,136.9 101.5,139.4 104.7,141.8 107.8,144.3 111.0,146.7 114.1,149.1 117.3,151.6 120.4,154.0 123.6,156.4 126.7,158.9 129.9,161.3 133.0,163.8 136.1,166.2 139.3,168.6 142.4,171.1 145.6,173.5 148.8,175.9 151.9,178.4 155.1,180.8 158.2,183.3 161.3,185.7 164.5,188.1 167.7,190.6 170.8,193.0 173.9,195.4 177.1,197.9 180.2,200.3 183.4,202.8 186.5,205.2 189.7,207.6 192.8,210.1 196.0,212.5 199.2,214.9 202.3,217.4 205.4,219.8 208.6,222.3 211.8,224.7 214.9,227.1 218.0,229.6 221.2,232.0 224.3,234.4 227.5,236.9 230.7,239.3 233.8,241.8 237.0,244.2 240.1,246.6 243.3,249.1 246.4,251.5 249.5,253.9 252.7,256.4 255.8,258.8 259.0,261.3 262.1,263.7 265.3,266.1 268.4,268.6 271.6,271.0 274.8,273.4 277.9,275.9 281.1,278.3 284.2,280.8 287.4,283.2 290.5,285.6 293.6,288.1 296.8,290.5 299.9,292.9 303.1,295.4 306.3,297.8 309.4,300.3 312.6,302.7 315.7,305.1 318.9,307.6 322.0,310.0 325.1,310.0 328.3,310.0 331.4,310.0 334.6,310.0 337.8,310.0 340.9,310.0 344.1,310.0 347.2,310.0 350.4,310.0 353.5,310.0 356.7,310.0 359.8,310.0 362.9,310.0 366.1,310.0 369.2,310.0 372.4,310.0 375.6,310.0 378.7,310.0 381.8,310.0 385.0,310.0 388.1,310.0 391.3,310.0 394.4,310.0 397.6,310.0 400.8,310.0 403.9,310.0 407.1,310.0 410.2,310.0 413.4,310.0 416.5,310.0 419.7,310.0 422.8,310.0 425.9,310.0 429.1,310.0 432.3,310.0 435.4,310.0 438.5,310.0 441.7,310.0 444.8,310.0 448.0,310.0 451.1,310.0 454.3,310.0 457.4,310.0 460.6,310.0 463.8,310.0 466.9,310.0 470.1,310.0 473.2,310.0 476.4,310.0 479.5,310.0 482.7,310.0 485.8,310.0 489.0,310.0 492.1,310.0 495.3,310.0 498.4,310.0 501.6,310.0 504.7,310.0 507.8,310.0 511.0,310.0 514.1,310.0 517.3,310.0 520.5,310.0 523.6,310.0 526.8,310.0 529.9,310.0 533.0,310.0 536.2,310.0 539.4,310.0 542.5,310.0 545.6,310.0 548.8,310.0 552.0,310.0 555.1,310.0 558.3,310.0 561.4,310.0 564.5,310.0 567.7,310.0 570.9,310.0 574.0,310.0 577.2,310.0 580.3,310.0 583.4,310.0 586.6,310.0 
589.7,310.0 592.9,310.0 596.0,310.0 599.2,310.0 602.4,310.0 605.5,310.0 608.6,310.0 611.8,310.0 614.9,310.0 618.1,310.0 621.3,310.0 624.4,310.0 627.6,310.0 630.7,310.0 633.9,310.0 637.0,310.0 640.1,310.0 643.3,310.0 646.5,310.0 649.6,310.0 652.8,310.0 655.9,310.0 659.1,310.0 662.2,310.0 665.4,310.0 668.5,310.0 671.6,310.0 674.8,310.0 677.9,310.0 681.1,310.0 684.3,310.0 687.4,310.0 690.5,310.0 693.7,310.0 696.9,310.0 700.0,310.0" class="s3"/>
  <line x1="550" y1="70" x2="574" y2="70" class="s1"/>
  <text x="580" y="74" class="legend-label">margin = 0.3</text>
  <line x1="550" y1="88" x2="574" y2="88" class="s2"/>
  <text x="580" y="92" class="legend-label">margin = 0.2</text>
  <line x1="550" y1="106" x2="574" y2="106" class="s3"/>
  <text x="580" y="110" class="legend-label">margin = 0.1</text>
  <text x="360" y="368" text-anchor="middle" class="formula">loss = max(0, margin − (s_better − s_worse))</text>
</svg>
<p>I used leave-one-out cross-validation across all 42 queries: train on 41, evaluate on the held-out query, repeat.</p>
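<p>The evaluation skeleton looks like this. It is a generic sketch, not the benchmark crate's code; the toy "model" in <code>main</code> is just the mean of the training values, with absolute error as the held-out metric:</p>

```rust
/// Leave-one-out cross-validation: hold out each query in turn,
/// train on the rest, and average the held-out metric.
fn leave_one_out<Q: Clone, M>(
    queries: &[Q],
    train: impl Fn(&[Q]) -> M,
    evaluate: impl Fn(&M, &Q) -> f64,
) -> f64 {
    let n = queries.len();
    let mut total = 0.0;
    for i in 0..n {
        // Training set = all queries except the i-th.
        let mut fold: Vec<Q> = queries.to_vec();
        let held_out = fold.remove(i);
        let model = train(&fold);
        total += evaluate(&model, &held_out);
    }
    total / n as f64
}

fn main() {
    // Toy stand-in: "train" computes the mean, "evaluate" the abs error.
    let queries = [1.0_f64, 2.0, 3.0];
    let avg_err = leave_one_out(
        &queries,
        |fold| fold.iter().sum::<f64>() / fold.len() as f64,
        |m, q| (m - q).abs(),
    );
    println!("{}", avg_err);
}
```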
<h2 id="initial-metrics"><a href="#initial-metrics">Initial Metrics</a></h2>
<p>After 100 epochs with lr=0.001 and L2 regularization:</p>
<table>
<thead>
<tr>
<th>Metric</th>
<th>nDCG@5</th>
</tr>
</thead>
<tbody>
<tr>
<td>RRF baseline</td>
<td>0.794</td>
</tr>
<tr>
<td>LTR (training set)</td>
<td>0.853</td>
</tr>
<tr>
<td>LTR (LOO cross-validation)</td>
<td>0.848</td>
</tr>
</tbody>
</table>
<p>A +0.054 improvement over RRF, with minimal overfitting. The learned weights were:</p>
<table>
<thead>
<tr>
<th>Feature</th>
<th>Weight</th>
</tr>
</thead>
<tbody>
<tr>
<td>semantic_rank_norm</td>
<td><strong>+0.532</strong></td>
</tr>
<tr>
<td>rrf_score</td>
<td>+0.390</td>
</tr>
<tr>
<td>in_both</td>
<td>+0.313</td>
</tr>
<tr>
<td>semantic_score</td>
<td>+0.250</td>
</tr>
<tr>
<td>keyword_rank_norm</td>
<td><strong>-0.162</strong></td>
</tr>
<tr>
<td>path_depth_norm</td>
<td>+0.133</td>
</tr>
<tr>
<td>content_length_norm</td>
<td>0.000</td>
</tr>
</tbody>
</table>
<p>The model learned that semantic rank order is the strongest signal, that documents appearing in both lists are reliably relevant, and that keyword-only rank is a negative indicator, meaning a document matched surface terms but lacked semantic relevance.</p>
<p>Intuitively this makes sense. On this corpus, keyword search has many false positives (nDCG@5 = 0.431). The model correctly identifies keyword-only results as noise.</p>
<p>At that point, the model looked ready to ship.</p>
<h2 id="end-to-end-evaluation"><a href="#end-to-end-evaluation">End-to-End Evaluation</a></h2>
<p>Then I plugged the trained weights into the actual search pipeline and ran the end-to-end evaluation.</p>
<p>Three standard ranking metrics: <a href="https://dl.acm.org/doi/10.1145/582415.582418">nDCG@5</a> (normalized discounted cumulative gain, measures graded relevance with position discount), P@3 (precision of the top 3 results), and MRR (mean reciprocal rank, how early the first relevant result appears).</p>
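<p>For readers unfamiliar with nDCG, here is a sketch of the computation (my own illustrative code, not the benchmark crate's), using the common <code>2^grade − 1</code> gain and a log2 position discount. The input is the list of relevance grades in the order the system ranked the documents:</p>

```rust
/// nDCG@k over graded relevance labels (0-3): DCG of the actual
/// ordering divided by the DCG of the ideal (grade-sorted) ordering.
fn ndcg_at_k(grades: &[u32], k: usize) -> f64 {
    fn dcg(gs: &[u32], k: usize) -> f64 {
        gs.iter()
            .take(k)
            .enumerate()
            // Gain 2^g - 1, discounted by log2(position + 1), 1-based position.
            .map(|(i, &g)| (2f64.powi(g as i32) - 1.0) / (i as f64 + 2.0).log2())
            .sum()
    }
    let mut ideal = grades.to_vec();
    ideal.sort_unstable_by(|a, b| b.cmp(a)); // best grades first
    let idcg = dcg(&ideal, k);
    if idcg == 0.0 { 0.0 } else { dcg(grades, k) / idcg }
}

fn main() {
    // A perfect ordering scores 1.0; demoting the grade-3 document lowers it.
    println!("{}", ndcg_at_k(&[3, 2, 1, 0], 5));
    println!("{}", ndcg_at_k(&[0, 2, 1, 3], 5));
}
```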
<table>
<thead>
<tr>
<th>Strategy</th>
<th>nDCG@5</th>
<th>P@3</th>
<th>MRR</th>
</tr>
</thead>
<tbody>
<tr>
<td>BM25 (keyword only)</td>
<td>0.431</td>
<td>0.246</td>
<td>0.534</td>
</tr>
<tr>
<td>LTR (trained)</td>
<td>0.788</td>
<td>0.476</td>
<td>0.892</td>
</tr>
<tr>
<td>RRF (k=60)</td>
<td>0.794</td>
<td>0.476</td>
<td>0.903</td>
</tr>
<tr>
<td>Semantic only</td>
<td>0.827</td>
<td>0.500</td>
<td>0.880</td>
</tr>
</tbody>
</table>
<svg id="eval-chart" viewBox="0 0 720 400" xmlns="http://www.w3.org/2000/svg" style="width:100%;max-width:720px;">
  <style>
    #eval-chart .axis { stroke: var(--c-brd); stroke-width: 1; }
    #eval-chart .grid { stroke: var(--c-brd); stroke-width: 0.5; opacity: 0.3; }
    #eval-chart .tick-label { font: 10px "JetBrains Mono", monospace; fill: var(--c-muted); }
    #eval-chart .axis-label { font: 12px "Source Serif 4", serif; fill: var(--c-sec); }
    #eval-chart .chart-title { font: 500 14px "Playfair Display", serif; fill: var(--c-pri); }
    #eval-chart .formula { font: 11px "JetBrains Mono", monospace; fill: var(--c-sec); }
    #eval-chart .caption { font: italic 11px "Source Serif 4", serif; fill: var(--c-muted); }
    #eval-chart .legend-label { font: 10px "JetBrains Mono", monospace; fill: var(--c-muted); }
    #eval-chart .s1 { stroke: var(--c-acc); fill: none; stroke-width: 2; }
    #eval-chart .s2 { stroke: var(--c-acc2); fill: none; stroke-width: 2; stroke-dasharray: 8 4; }
    #eval-chart .s3 { stroke: var(--c-muted); fill: none; stroke-width: 2; stroke-dasharray: 3 3; }
    #eval-chart .bar-s1 { fill: var(--c-acc); }
    #eval-chart .bar-s2 { fill: var(--c-acc2); }
    #eval-chart .bar-s3 { fill: var(--c-muted); }
    #eval-chart .bar-s4 { fill: var(--c-sec); }
    #eval-chart .bar-label { font: 9px "JetBrains Mono", monospace; fill: var(--c-pri); }
  </style>
  <text x="360" y="28" text-anchor="middle" class="chart-title">End-to-End Evaluation: Ranking Metrics</text>
  <line x1="70" y1="320.0" x2="700" y2="320.0" class="grid"/>
  <line x1="70" y1="266.0" x2="700" y2="266.0" class="grid"/>
  <line x1="70" y1="212.0" x2="700" y2="212.0" class="grid"/>
  <line x1="70" y1="158.0" x2="700" y2="158.0" class="grid"/>
  <line x1="70" y1="104.0" x2="700" y2="104.0" class="grid"/>
  <line x1="70" y1="50.0" x2="700" y2="50.0" class="grid"/>
  <line x1="70" y1="320" x2="700" y2="320" class="axis"/>
  <line x1="70" y1="50" x2="70" y2="320" class="axis"/>
  <text x="62" y="324.0" text-anchor="end" class="tick-label">0.0</text>
  <text x="62" y="270.0" text-anchor="end" class="tick-label">0.2</text>
  <text x="62" y="216.0" text-anchor="end" class="tick-label">0.4</text>
  <text x="62" y="162.0" text-anchor="end" class="tick-label">0.6</text>
  <text x="62" y="108.0" text-anchor="end" class="tick-label">0.8</text>
  <text x="62" y="54.0" text-anchor="end" class="tick-label">1.0</text>
  <text x="16" y="185" text-anchor="middle" class="axis-label" transform="rotate(-90, 16, 185)">Score</text>
  <rect x="101.5" y="203.6" width="29.5" height="116.4" class="bar-s1" rx="1"/>
  <text x="116.3" y="199.6" text-anchor="middle" class="bar-label">0.431</text>
  <rect x="133.0" y="253.6" width="29.5" height="66.4" class="bar-s2" rx="1"/>
  <text x="147.8" y="249.6" text-anchor="middle" class="bar-label">0.246</text>
  <rect x="164.5" y="175.8" width="29.5" height="144.2" class="bar-s3" rx="1"/>
  <text x="179.3" y="171.8" text-anchor="middle" class="bar-label">0.534</text>
  <text x="148.8" y="338" text-anchor="middle" class="tick-label">BM25</text>
  <rect x="259.0" y="107.2" width="29.5" height="212.8" class="bar-s1" rx="1"/>
  <text x="273.8" y="103.2" text-anchor="middle" class="bar-label">0.788</text>
  <rect x="290.5" y="191.5" width="29.5" height="128.5" class="bar-s2" rx="1"/>
  <text x="305.3" y="187.5" text-anchor="middle" class="bar-label">0.476</text>
  <rect x="322.0" y="79.2" width="29.5" height="240.8" class="bar-s3" rx="1"/>
  <text x="336.8" y="75.2" text-anchor="middle" class="bar-label">0.892</text>
  <text x="306.3" y="338" text-anchor="middle" class="tick-label">LTR</text>
  <rect x="416.5" y="105.6" width="29.5" height="214.4" class="bar-s1" rx="1"/>
  <text x="431.3" y="101.6" text-anchor="middle" class="bar-label">0.794</text>
  <rect x="448.0" y="191.5" width="29.5" height="128.5" class="bar-s2" rx="1"/>
  <text x="462.8" y="187.5" text-anchor="middle" class="bar-label">0.476</text>
  <rect x="479.5" y="76.2" width="29.5" height="243.8" class="bar-s3" rx="1"/>
  <text x="494.3" y="72.2" text-anchor="middle" class="bar-label">0.903</text>
  <text x="463.8" y="338" text-anchor="middle" class="tick-label">RRF</text>
  <rect x="574.0" y="96.7" width="29.5" height="223.3" class="bar-s1" rx="1"/>
  <text x="588.8" y="92.7" text-anchor="middle" class="bar-label">0.827</text>
  <rect x="605.5" y="185.0" width="29.5" height="135.0" class="bar-s2" rx="1"/>
  <text x="620.3" y="181.0" text-anchor="middle" class="bar-label">0.500</text>
  <rect x="637.0" y="82.4" width="29.5" height="237.6" class="bar-s3" rx="1"/>
  <text x="651.8" y="78.4" text-anchor="middle" class="bar-label">0.880</text>
  <text x="621.3" y="338" text-anchor="middle" class="tick-label">Semantic</text>
  <rect x="80" y="54" width="12" height="12" class="bar-s1" rx="1"/>
  <text x="96" y="64" class="legend-label">nDCG@5</text>
  <rect x="180" y="54" width="12" height="12" class="bar-s2" rx="1"/>
  <text x="196" y="64" class="legend-label">P@3</text>
  <rect x="280" y="54" width="12" height="12" class="bar-s3" rx="1"/>
  <text x="296" y="64" class="legend-label">MRR</text>
</svg>
<p>The LTR model scored 0.788, below the RRF baseline it was supposed to beat.</p>
<h2 id="what-went-wrong"><a href="#what-went-wrong">What Went Wrong</a></h2>
<p>The reranking evaluation and the pipeline evaluation measure different things.</p>
<p>Reranking (nDCG@5 = 0.848): &quot;Here are 460 documents already retrieved from both search methods. Sort them.&quot; The model sees all candidates, including relevant ones, and only needs to order them correctly.</p>
<p>End-to-end pipeline (nDCG@5 = 0.788): &quot;Run semantic search, run keyword search, fuse the two result lists, return the top results.&quot; The fusion strategy also controls which documents survive the cutoff.</p>
<p>The negative <code>keyword_rank_norm</code> weight (-0.162) was the culprit. In reranking, it correctly identifies keyword-only false positives. But in the pipeline, it actively pushes down every document that only appears in keyword results, including the ones that happen to be relevant. Those documents score below the retrieval cutoff and vanish from the final results entirely.</p>
<p>The model learned the right thing for the wrong task.</p>
<p>This is an instance of a general principle in retrieval systems: a reranker can only reorder documents that retrieval surfaces. It cannot recover what was never retrieved. Recall is the ceiling, and ranking quality can only work within it. The reranking evaluation hid this because it presented all candidates at once, removing the retrieval bottleneck entirely.</p>
<h2 id="the-fix"><a href="#the-fix">The Fix</a></h2>
<p>The intervention was simple: constrain the rank-based feature weights to be non-negative during training. The model can still ignore keyword rank (weight -&gt; 0), but it cannot penalize it.</p>
<pre class="athl code-block" style="color: #e6edf3; background-color: #30363d;"><code class="language-rust" translate="no" tabindex="0"><div class="line" data-line="1"><span style="color: #ff7b72;">let</span> <span style="color: #e6edf3;">non_negative</span><span style="color: #e6edf3;">:</span> <span style="color: #e6edf3;">[</span><span style="color: #ff7b72;">bool</span><span style="color: #e6edf3;">;</span> <span style="color: #79c0ff;">NUM_FEATURES</span><span style="color: #e6edf3;">]</span> <span style="color: #79c0ff;">=</span> <span style="color: #e6edf3;">[</span>
</div><div class="line" data-line="2">    <span style="color: #79c0ff;">false</span><span style="color: #e6edf3;">,</span> <span style="color: #8b949e;">// bias</span>
</div><div class="line" data-line="3">    <span style="color: #79c0ff;">false</span><span style="color: #e6edf3;">,</span> <span style="color: #8b949e;">// semantic_score</span>
</div><div class="line" data-line="4">    <span style="color: #79c0ff;">true</span><span style="color: #e6edf3;">,</span>  <span style="color: #8b949e;">// semantic_rank_norm</span>
</div><div class="line" data-line="5">    <span style="color: #79c0ff;">true</span><span style="color: #e6edf3;">,</span>  <span style="color: #8b949e;">// keyword_rank_norm</span>
</div><div class="line" data-line="6">    <span style="color: #79c0ff;">true</span><span style="color: #e6edf3;">,</span>  <span style="color: #8b949e;">// in_both</span>
</div><div class="line" data-line="7">    <span style="color: #79c0ff;">true</span><span style="color: #e6edf3;">,</span>  <span style="color: #8b949e;">// rrf_score</span>
</div><div class="line" data-line="8">    <span style="color: #79c0ff;">false</span><span style="color: #e6edf3;">,</span> <span style="color: #8b949e;">// path_depth_norm</span>
</div><div class="line" data-line="9">    <span style="color: #79c0ff;">false</span><span style="color: #e6edf3;">,</span> <span style="color: #8b949e;">// content_length_norm</span>
</div><div class="line" data-line="10"><span style="color: #e6edf3;">]</span><span style="color: #e6edf3;">;</span>
</div></code></pre>
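<p>One standard way to enforce such a constraint is projected gradient descent: take the usual SGD step, then clamp any flagged weight back to zero if it went negative. This is a sketch of that projection step under that assumption, not necessarily the exact <code>qrst-bench</code> code:</p>

```rust
const NUM_WEIGHTS: usize = 8; // bias + 7 features, matching the mask above

/// Projection step after an SGD update: weights flagged non-negative
/// are clamped at zero from below; the rest are left untouched.
fn project(weights: &mut [f64; NUM_WEIGHTS], non_negative: &[bool; NUM_WEIGHTS]) {
    for (w, &nn) in weights.iter_mut().zip(non_negative.iter()) {
        if nn && *w < 0.0 {
            *w = 0.0;
        }
    }
}

fn main() {
    // Same mask order as above: bias, semantic_score, semantic_rank_norm,
    // keyword_rank_norm, in_both, rrf_score, path_depth_norm, content_length_norm.
    let non_negative = [false, false, true, true, true, true, false, false];
    let mut weights = [-0.05, 0.25, 0.53, -0.16, 0.31, 0.39, 0.13, 0.0];
    project(&mut weights, &non_negative);
    // keyword_rank_norm (index 3) is clamped to 0.0; the bias may stay negative.
    println!("{:?}", weights);
}
```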
<p>After retraining with the constraint:</p>
<table>
<thead>
<tr>
<th>Strategy</th>
<th>nDCG@5 (e2e)</th>
<th>LOO-CV (reranking)</th>
</tr>
</thead>
<tbody>
<tr>
<td>RRF baseline</td>
<td>0.794</td>
<td>n/a</td>
</tr>
<tr>
<td>LTR v1 (unconstrained)</td>
<td>0.788</td>
<td>0.848</td>
</tr>
<tr>
<td>LTR v2 (non-negative)</td>
<td>0.794</td>
<td>0.844</td>
</tr>
</tbody>
</table>
<p>The constrained model recovered the full pipeline performance. The <code>keyword_rank_norm</code> weight went from -0.162 to +0.007, effectively zero. The model learned to ignore keyword rank rather than penalize it.</p>
<p>But it did not beat RRF. It matched it exactly.</p>
<h2 id="why-the-model-converges-to-rrf"><a href="#why-the-model-converges-to-rrf">Why the Model Converges to RRF</a></h2>
<p>Looking at the final weights:</p>
<table>
<thead>
<tr>
<th>Feature</th>
<th>Unconstrained</th>
<th>Constrained</th>
</tr>
</thead>
<tbody>
<tr>
<td>semantic_rank_norm</td>
<td>+0.532</td>
<td>+0.534</td>
</tr>
<tr>
<td>rrf_score</td>
<td>+0.390</td>
<td>+0.387</td>
</tr>
<tr>
<td>in_both</td>
<td>+0.313</td>
<td>+0.185</td>
</tr>
<tr>
<td>semantic_score</td>
<td>+0.250</td>
<td>+0.218</td>
</tr>
<tr>
<td>keyword_rank_norm</td>
<td>-0.162</td>
<td>+0.007</td>
</tr>
</tbody>
</table>
<svg id="weights-chart" viewBox="0 0 720 440" xmlns="http://www.w3.org/2000/svg" style="width:100%;max-width:720px;">
  <style>
    #weights-chart .axis { stroke: var(--c-brd); stroke-width: 1; }
    #weights-chart .grid { stroke: var(--c-brd); stroke-width: 0.5; opacity: 0.3; }
    #weights-chart .tick-label { font: 10px "JetBrains Mono", monospace; fill: var(--c-muted); }
    #weights-chart .axis-label { font: 12px "Source Serif 4", serif; fill: var(--c-sec); }
    #weights-chart .chart-title { font: 500 14px "Playfair Display", serif; fill: var(--c-pri); }
    #weights-chart .feature-label { font: 11px "JetBrains Mono", monospace; fill: var(--c-pri); }
    #weights-chart .feature-highlight { font: 11px "JetBrains Mono", monospace; fill: var(--c-acc); font-weight: 600; }
    #weights-chart .legend-label { font: 10px "JetBrains Mono", monospace; fill: var(--c-muted); }
    #weights-chart .bar-unc { fill: var(--c-acc); opacity: 0.85; }
    #weights-chart .bar-con { fill: var(--c-acc2); opacity: 0.85; }
    #weights-chart .bar-label { font: 9px "JetBrains Mono", monospace; fill: var(--c-pri); }
  </style>
  <text x="360" y="24" text-anchor="middle" class="chart-title">Feature Weights: Unconstrained vs. Constrained</text>
  <rect x="170" y="40" width="12" height="12" class="bar-unc" rx="1"/>
  <text x="186" y="50" class="legend-label">Unconstrained</text>
  <rect x="290" y="40" width="12" height="12" class="bar-con" rx="1"/>
  <text x="306" y="50" class="legend-label">Constrained</text>
  <line x1="198.2" y1="72" x2="198.2" y2="390" class="grid"/>
  <line x1="254.7" y1="72" x2="254.7" y2="390" class="grid"/>
  <line x1="311.2" y1="72" x2="311.2" y2="390" class="grid"/>
  <line x1="367.6" y1="72" x2="367.6" y2="390" class="grid"/>
  <line x1="424.1" y1="72" x2="424.1" y2="390" class="grid"/>
  <line x1="480.6" y1="72" x2="480.6" y2="390" class="grid"/>
  <line x1="537.1" y1="72" x2="537.1" y2="390" class="grid"/>
  <line x1="593.5" y1="72" x2="593.5" y2="390" class="grid"/>
  <line x1="650.0" y1="72" x2="650.0" y2="390" class="grid"/>
  <line x1="311.2" y1="72" x2="311.2" y2="390" stroke="var(--c-brd)" stroke-width="1.5" opacity="0.7"/>
  <line x1="170" y1="390" x2="650" y2="390" class="axis"/>
  <text x="198.2" y="408" text-anchor="middle" class="tick-label">-0.2</text>
  <text x="254.7" y="408" text-anchor="middle" class="tick-label">-0.1</text>
  <text x="311.2" y="408" text-anchor="middle" class="tick-label">0.0</text>
  <text x="367.6" y="408" text-anchor="middle" class="tick-label">0.1</text>
  <text x="424.1" y="408" text-anchor="middle" class="tick-label">0.2</text>
  <text x="480.6" y="408" text-anchor="middle" class="tick-label">0.3</text>
  <text x="537.1" y="408" text-anchor="middle" class="tick-label">0.4</text>
  <text x="593.5" y="408" text-anchor="middle" class="tick-label">0.5</text>
  <text x="650.0" y="408" text-anchor="middle" class="tick-label">0.6</text>
  <text x="162" y="98.7" text-anchor="end" class="feature-label">semantic_rank_norm</text>
  <rect x="311.2" y="83.2" width="300.4" height="10" class="bar-unc" rx="1"/>
  <text x="616.6" y="92.2" text-anchor="start" class="bar-label">+0.532</text>
  <rect x="311.2" y="96.2" width="301.6" height="10" class="bar-con" rx="1"/>
  <text x="617.7" y="105.2" text-anchor="start" class="bar-label">+0.534</text>
  <text x="162" y="144.1" text-anchor="end" class="feature-label">rrf_score</text>
  <rect x="311.2" y="128.6" width="220.2" height="10" class="bar-unc" rx="1"/>
  <text x="536.4" y="137.6" text-anchor="start" class="bar-label">+0.390</text>
  <rect x="311.2" y="141.6" width="218.5" height="10" class="bar-con" rx="1"/>
  <text x="534.7" y="150.6" text-anchor="start" class="bar-label">+0.387</text>
  <text x="162" y="189.6" text-anchor="end" class="feature-label">in_both</text>
  <rect x="311.2" y="174.1" width="176.8" height="10" class="bar-unc" rx="1"/>
  <text x="492.9" y="183.1" text-anchor="start" class="bar-label">+0.313</text>
  <rect x="311.2" y="187.1" width="104.5" height="10" class="bar-con" rx="1"/>
  <text x="420.6" y="196.1" text-anchor="start" class="bar-label">+0.185</text>
  <text x="162" y="235.0" text-anchor="end" class="feature-label">semantic_score</text>
  <rect x="311.2" y="219.5" width="141.2" height="10" class="bar-unc" rx="1"/>
  <text x="457.4" y="228.5" text-anchor="start" class="bar-label">+0.250</text>
  <rect x="311.2" y="232.5" width="123.1" height="10" class="bar-con" rx="1"/>
  <text x="439.3" y="241.5" text-anchor="start" class="bar-label">+0.218</text>
  <rect x="170" y="253.7" width="480" height="45.4" fill="var(--c-acc)" opacity="0.06" rx="2"/>
  <text x="162" y="280.4" text-anchor="end" class="feature-highlight">keyword_rank_norm</text>
  <rect x="219.7" y="264.9" width="91.5" height="10" class="bar-unc" rx="1"/>
  <text x="214.7" y="273.9" text-anchor="end" class="bar-label">-0.162</text>
  <rect x="311.2" y="277.9" width="4.0" height="10" class="bar-con" rx="1"/>
  <text x="320.1" y="286.9" text-anchor="start" class="bar-label">+0.007</text>
  <text x="162" y="325.9" text-anchor="end" class="feature-label">path_depth_norm</text>
  <rect x="311.2" y="310.4" width="75.1" height="10" class="bar-unc" rx="1"/>
  <text x="391.3" y="319.4" text-anchor="start" class="bar-label">+0.133</text>
  <text x="162" y="371.3" text-anchor="end" class="feature-label">content_length_norm</text>
</svg>
<p>The dominant features, <code>semantic_rank_norm</code> and <code>rrf_score</code>, are highly correlated with RRF's own scoring: <code>semantic_rank_norm</code> tracks the semantic component of RRF, and <code>rrf_score</code> is the RRF score itself. This creates strong multicollinearity: several feature combinations can produce nearly the same ordering.</p>
<p>In that regime, individual coefficients are not very identifiable. A different random seed or slightly different sample can shift the learned weights while preserving almost identical rankings. So &quot;the model just reweighted RRF&quot; is less a model failure than a consequence of correlated features and limited independent signal.</p>
<p>With only 42 queries and 460 candidate documents, there isn't enough signal to reliably learn behavior beyond the baseline. The features that could differentiate (<code>path_depth_norm</code>, <code>content_length_norm</code>) have near-zero weights. On this small, well-curated corpus, RRF is already near-optimal for this feature set.</p>
<h2 id="lessons"><a href="#lessons">Lessons</a></h2>
<p>Reranking metrics can overstate pipeline gains. This is well documented in learning-to-rank research, but the effect is easy to underestimate. The +0.054 reranking improvement disappeared in end-to-end evaluation. If you evaluate a reranker, always measure end-to-end.</p>
<p>Constraints can encode domain knowledge. The non-negative constraint on rank features captures a practical pipeline rule: the ranker should not discard candidates by penalizing one channel.</p>
<p>Simple baselines are hard to beat with simple models. RRF is robust: it does not penalize documents for appearing in only one list, and its <code>1/(k + rank)</code> formula is a useful nonlinear rank transform. A 7-feature linear model can approximate RRF, but not reliably exceed it without interaction features.</p>
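<p>For reference, the <code>1/(k + rank)</code> transform and the whole fusion fit in a few lines (k = 60 is the conventional constant; this is a generic sketch, not qrst's implementation):</p>

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion over several ranked lists (best first).
fn rrf(lists: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in lists {
        for (rank, doc) in list.iter().enumerate() {
            // canonical formula uses 1-based ranks: 1 / (k + rank)
            *scores.entry(doc.to_string()).or_insert(0.0) += 1.0 / (k + rank as f64 + 1.0);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}
```

Note that a document missing from one list simply contributes nothing from that list; it is never penalized, which is the robustness property discussed above.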
<p>Small training sets favor conservative behavior. Forty-two queries were enough to control overfitting (LOO-CV confirmed this), but not enough to learn stable corpus-specific patterns beyond the baseline.</p>
<p>A slightly higher training metric than validation metric does not, by itself, prove problematic overfitting. It can also reflect mild train/validation distribution differences in a small evaluation set.</p>
<p>Feature design is at least as important as data volume. Manual relevance judgments are accurate but expensive; production systems usually rely on weaker but abundant implicit feedback (clicks, dwell time, reformulations). In practice, model quality is bounded by both data quality/quantity and whether the features capture the real retrieval process.</p>
<h2 id="what-could-beat-rrf"><a href="#what-could-beat-rrf">What Could Beat RRF</a></h2>
<p>Based on this experiment, improving beyond RRF at pipeline level likely requires:</p>
<ul>
<li>Better query-aware features. Per-query signals (query length, term rarity, semantic-keyword score divergence) could adapt fusion behavior beyond fixed global weights.</li>
<li>Interaction features, even in a linear model. Terms like <code>keyword_rank_norm × rare_term_ratio</code> or <code>semantic_score × query_length</code> let a linear model represent conditional behavior.</li>
<li>Query-dependent weighting. A single global weight vector may match one corpus, but robust gains often require query-level adaptation.</li>
<li>More judged data. About 150+ judged queries would give the model room to learn beyond baseline behavior, especially once richer features are available.</li>
<li>Potentially a non-linear model. If linear features saturate, a non-linear model can capture higher-order interactions directly.</li>
<li>Listwise loss. Optimizing nDCG directly (for example <a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/lambdarank.pdf">LambdaRank</a>) would align training with the final metric better than pairwise hinge loss.</li>
<li>Neural reranking. Instead of scoring documents independently with a linear model, a neural reranker jointly considers the query and document. Cross-encoders like BERT or monoT5 concatenate query and document into a single input and run a full transformer forward pass per candidate, capturing deep query-document interactions but at high latency - practical only as a second-stage reranker over a small candidate set. Late-interaction models like ColBERT take a different approach: they encode query and document independently into per-token embeddings, then compute fine-grained token-level similarity via MaxSim. This makes ColBERT usable both as a retriever (via ANN search over precomputed token embeddings) and as a reranker, offering a middle ground between bi-encoder speed and cross-encoder quality.</li>
</ul>
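<p>Of the options above, interaction features are the cheapest to prototype: append product terms to the existing vector and retrain. In the sketch below, <code>rare_term_ratio</code> and <code>query_len_norm</code> are hypothetical per-query signals, not features qrst currently extracts:</p>

```rust
/// Indices follow the feature order used earlier in the article:
/// 0 = bias, 1 = semantic_score, 3 = keyword_rank_norm.
fn with_interactions(base: &[f64; 8], rare_term_ratio: f64, query_len_norm: f64) -> Vec<f64> {
    let mut f = base.to_vec();
    // keyword rank should matter more when the query contains rare terms
    f.push(base[3] * rare_term_ratio);
    // semantic confidence scaled by query length
    f.push(base[1] * query_len_norm);
    f
}
```

The model stays linear in its weights, but the score is now conditional on query properties, which is exactly the behavior a fixed global weight vector cannot express.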
<p>The infrastructure is in place: the <code>LtrFusion</code> strategy, the training CLI, the feature extraction pipeline. The linear model just needs richer signal to work with.</p>
<h2 id="embedding-model"><a href="#embedding-model">Embedding Model</a></h2>
<p>The goal is on-device search with no cloud API dependency. That rules out hosted embedding services and means inference has to run on whatever hardware is available - typically a laptop CPU, no GPU. The model choice follows from this constraint: we need something small enough to run fast on CPU, accurate enough to produce useful semantic search results, and available in a local runtime format (typically ONNX).</p>
<p>qrst primarily uses ONNX Runtime for inference, with the default execution provider (CPU). Most presets run without GPU acceleration, ensuring compatibility across hardware. However, some models like <code>nomic-embed-text-v1.5</code> are implemented using the <a href="https://github.com/huggingface/candle">Candle</a> framework, which provides Metal acceleration on macOS. For ONNX-based models, a session created with <code>Session::builder()?.commit_from_file(&amp;model_path)</code> runs a forward pass per batch of 8 chunks. ONNX Runtime's CPU backend is well-optimized for small models: quantized attention, SIMD vectorization, and minimal memory overhead. On an M-series Mac, embedding a 6,600-chunk corpus takes about a minute with EmbeddingGemma 300M.</p>
<p>The system supports five model presets:</p>
<table>
<thead>
<tr>
<th>Model</th>
<th>Dimensions</th>
<th>Max Tokens</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>all-MiniLM-L6-v2</td>
<td>384</td>
<td>128</td>
<td>Fastest, good baseline</td>
</tr>
<tr>
<td>nomic-embed-text-v1.5</td>
<td>768</td>
<td>8192</td>
<td>Matryoshka embeddings, long context</td>
</tr>
<tr>
<td>EmbeddingGemma 300M</td>
<td>768</td>
<td>2048</td>
<td>Best accuracy/speed tradeoff</td>
</tr>
<tr>
<td>e5-base-v2</td>
<td>768</td>
<td>512</td>
<td>Balanced, instruction-tuned</td>
</tr>
<tr>
<td>bge-base-en</td>
<td>768</td>
<td>512</td>
<td>Balanced, English-focused</td>
</tr>
</tbody>
</table>
<p>EmbeddingGemma 300M is the default for benchmarks and the model behind all results in this article. At 300M parameters it is small enough for real-time CPU inference but large enough to capture semantic nuance that the 33M-parameter MiniLM misses. The SciFact results (nDCG@10 = 0.762 semantic-only) confirm it performs well on domain-specific scientific text without fine-tuning.</p>
<p>For ONNX-based presets, model dimensions are auto-detected from ONNX metadata at load time. Each model preset also defines whether to normalize embeddings and what query/document prefixes to prepend (e.g., EmbeddingGemma uses <code>&quot;task: search result | query: &quot;</code> for queries).</p>
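<p>The prefixing itself is just string concatenation before tokenization. The query prefix below is the one quoted above; the document prefix and field names are assumptions for illustration:</p>

```rust
/// Per-preset embedding configuration (sketch; names are assumptions).
struct Preset {
    query_prefix: &'static str,
    document_prefix: &'static str,
}

/// Prepend the task-specific prefix the model was trained with.
fn embedding_input(preset: &Preset, text: &str, is_query: bool) -> String {
    let prefix = if is_query {
        preset.query_prefix
    } else {
        preset.document_prefix
    };
    format!("{prefix}{text}")
}
```

Getting these prefixes wrong silently degrades retrieval quality, because the model embeds queries and documents into mismatched regions of the vector space.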
<p>The vector index uses <a href="https://github.com/unum-cloud/usearch">USearch</a>, an HNSW implementation created by <a href="https://ashvardanian.com/">Ash Vardanian</a>. USearch is a single-file, dependency-free library for approximate nearest neighbor search that compiles to native code on every major platform. It supports multiple scalar types (F32, F16, I8) for the stored vectors, so you can trade precision for memory: F16 halves memory usage with negligible recall loss, I8 quarters it at some accuracy cost. qrst uses F32 by default but the quantization is configurable. USearch also handles index persistence - the HNSW graph is loaded from disk; while USearch supports memory-mapping via <code>view()</code>, qrst currently uses the <code>load()</code> path which reads the index into memory. For a local search engine that needs to start fast and stay light, these properties matter more than marginal recall differences between ANN libraries.</p>
<p><strong>Why not GPU acceleration on Apple Silicon?</strong> ONNX Runtime has no Metal Performance Shaders execution provider. The available path is the CoreML EP, which can target the Apple GPU and Neural Engine (ANE), but for transformer models it is currently impractical. Standard transformer operations - <code>Erf</code> for GELU, <code>ReduceMean</code> for LayerNorm, <code>LayerNormalization</code> - are supported in current ONNX Runtime versions, and dynamic shapes are permitted. However, performance can still be slower due to graph partitioning: the model graph gets split into dozens of fragments, each boundary incurring CPU↔CoreML data transfer overhead. In practice this makes CoreML inference <a href="https://github.com/microsoft/onnxruntime/issues/19887">slower than pure CPU for models with partial operator coverage</a>. The Rust <code>ort</code> crate does expose a CoreML EP, but its prebuilt binaries do not include it - you would need to compile ONNX Runtime from source.</p>
<p>Apple's own research on <a href="https://machinelearning.apple.com/research/neural-engine-transformers">deploying transformers on the Neural Engine</a> shows that ANE acceleration requires restructuring the model: replacing <code>nn.Linear</code> with <code>nn.Conv2d</code>, switching to channels-first layout, and chunking multi-head attention into single-head operations. With these changes, Apple demonstrated a 10x speedup on DistilBERT - but this is manual model surgery, not something you get by flipping an execution provider flag. For embedding models available as standard ONNX exports, the M-series CPU is the fastest path. Its high memory bandwidth and AMX/NEON units already deliver sub-second inference for a 300M-parameter model.</p>
<p>The deliberate choice here is pragmatic: we do not need a 7B-parameter model or the absolute best score on MTEB. We need a model that runs in under a second per query on CPU, fits in memory alongside the rest of the application, and produces embeddings good enough that the ranking pipeline - BM25, fusion, and chunking - can do its job. A 300M-parameter model on ONNX/CPU meets all three requirements.</p>
<h2 id="chunking"><a href="#chunking">Chunking</a></h2>
<p>Everything described above operates on chunks, not documents. A 500-line markdown file or a Rust module with twenty functions does not get indexed as a single unit. It gets split into pieces that each fit within the embedding model's effective context, and each chunk becomes its own entry in both the keyword and vector indexes. The chunking strategy directly affects retrieval quality: chunks that are too large dilute the semantic signal, chunks that are too small lose context.</p>
<p>qrst uses a pluggable chunking system with three strategies, selected by file extension.</p>
<p><strong>Markdown chunker.</strong> Splits on heading boundaries. When a <code>#</code> line appears, the accumulated content is flushed as a chunk. If a section exceeds the budget (80% of model context), it is split again when the next line would exceed the limit. Each chunk carries its heading as a title, which becomes searchable metadata. The minimum chunk size is 10 tokens, filtering out headings-only fragments.</p>
<p><strong>Code chunker.</strong> The code chunker implements <a href="https://arxiv.org/abs/2506.15655">cAST-style</a> recursive AST merging. Most RAG pipelines inherit line-based chunking from natural-language retrieval, which breaks semantic structure: a function split at line 50 produces two chunks that are each incomplete. The cAST approach instead uses the parse tree to align chunk boundaries with syntactic units.</p>
<p>The algorithm works in three phases. First, tree-sitter parses the source file into an AST. Second, the chunker walks the AST top-down, maintaining a buffer of pending nodes and a token budget (80% of model context by default). At each child node, it applies three rules in order:</p>
<ol>
<li><strong>Boundary check.</strong> If the child is a boundary node (function, struct, impl, class, interface, trait, enum, module), flush any pending buffer as a chunk. The boundary node then starts a new accumulation.</li>
<li><strong>Size check.</strong> If the child's token count exceeds the budget, recurse into its children with a fresh buffer. If the child is a leaf that is still too large, fall back to line-based splitting.</li>
<li><strong>Budget check.</strong> If adding the child to the buffer would exceed the budget, flush the buffer first, then add the child.</li>
</ol>
<p>Otherwise, the child is appended to the buffer. When all children are processed, the remaining buffer is flushed.</p>
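<p>The three rules can be sketched against a toy AST. Real qrst walks tree-sitter nodes, and the line-based fallback for oversized leaves is omitted here:</p>

```rust
/// Toy AST node standing in for a tree-sitter node.
struct Node {
    text: String,
    tokens: usize,
    is_boundary: bool, // fn, struct, impl, trait, mod, ...
    children: Vec<Node>,
}

fn walk(node: &Node, budget: usize, buf: &mut Vec<String>, used: &mut usize, out: &mut Vec<String>) {
    for child in &node.children {
        // Rule 1: a boundary node flushes whatever is pending.
        if child.is_boundary && !buf.is_empty() {
            out.push(buf.join("\n"));
            buf.clear();
            *used = 0;
        }
        // Rule 2: an oversized node is recursed into with a fresh buffer.
        if child.tokens > budget {
            let (mut inner, mut inner_used) = (Vec::new(), 0);
            walk(child, budget, &mut inner, &mut inner_used, out);
            if !inner.is_empty() {
                out.push(inner.join("\n"));
            }
            continue;
        }
        // Rule 3: a budget overflow flushes before appending.
        if *used + child.tokens > budget && !buf.is_empty() {
            out.push(buf.join("\n"));
            buf.clear();
            *used = 0;
        }
        buf.push(child.text.clone());
        *used += child.tokens;
    }
}

fn chunk_file(root: &Node, budget: usize) -> Vec<String> {
    let (mut out, mut buf, mut used) = (Vec::new(), Vec::new(), 0);
    walk(root, budget, &mut buf, &mut used, &mut out);
    if !buf.is_empty() {
        out.push(buf.join("\n")); // flush the remaining buffer
    }
    out
}
```

Run against a tree shaped like the diagram below (two small use declarations, a struct, an oversized impl with three functions, a test module), this produces the same six chunks.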
<svg id="ast-chunk" viewBox="0 0 720 520" xmlns="http://www.w3.org/2000/svg" style="width:100%;max-width:720px;">
  <style>
    #ast-chunk .node-box { stroke-width: 1.5; rx: 4; }
    #ast-chunk .chunk-a { fill: color-mix(in srgb, var(--c-acc) 12%, transparent); stroke: var(--c-acc); }
    #ast-chunk .chunk-b { fill: color-mix(in srgb, var(--c-acc2) 12%, transparent); stroke: var(--c-acc2); }
    #ast-chunk .chunk-c { fill: color-mix(in srgb, var(--c-muted) 10%, transparent); stroke: var(--c-muted); }
    #ast-chunk .node-label { font: 600 10px "JetBrains Mono", monospace; fill: var(--c-pri); }
    #ast-chunk .node-size { font: 9px "JetBrains Mono", monospace; fill: var(--c-muted); }
    #ast-chunk .edge { stroke: var(--c-brd); stroke-width: 1; fill: none; }
    #ast-chunk .chart-title { font: 500 14px "Playfair Display", serif; fill: var(--c-pri); }
    #ast-chunk .phase-label { font: 600 11px "Source Serif 4", serif; fill: var(--c-sec); }
    #ast-chunk .annotation { font: italic 10px "Source Serif 4", serif; fill: var(--c-muted); }
    #ast-chunk .chunk-label { font: 600 10px "JetBrains Mono", monospace; }
    #ast-chunk .chunk-label-a { fill: var(--c-acc); }
    #ast-chunk .chunk-label-b { fill: var(--c-acc2); }
    #ast-chunk .chunk-label-c { fill: var(--c-muted); }
    #ast-chunk .boundary-marker { font: 9px "JetBrains Mono", monospace; fill: var(--c-acc); }
    #ast-chunk .budget-text { font: 9px "JetBrains Mono", monospace; fill: var(--c-sec); }
    #ast-chunk .arrow { fill: var(--c-brd); }
    #ast-chunk .brace { stroke: var(--c-brd); stroke-width: 1; fill: none; }
    #ast-chunk .dim { opacity: 0.4; }
  </style>
  <text x="360" y="24" text-anchor="middle" class="chart-title">cAST: Recursive AST Merging with Token Budget</text>
  <text x="30" y="52" class="phase-label">1. Parse AST (tree-sitter)</text>
  <rect x="40" y="68" width="130" height="26" class="node-box chunk-c"/>
  <text x="46" y="84" class="node-label">source_file</text>
  <path d="M48,94 L48,113 L64,113" class="edge"/>
  <rect x="64" y="100" width="150" height="26" class="node-box chunk-a"/>
  <text x="70" y="116" class="node-label">use_declaration</text>
  <text x="210" y="116" text-anchor="end" class="node-size">40t</text>
  <path d="M48,94 L48,145 L64,145" class="edge"/>
  <rect x="64" y="132" width="150" height="26" class="node-box chunk-a"/>
  <text x="70" y="148" class="node-label">use_declaration</text>
  <text x="210" y="148" text-anchor="end" class="node-size">30t</text>
  <path d="M48,94 L48,177 L64,177" class="edge"/>
  <rect x="64" y="164" width="150" height="26" class="node-box chunk-b"/>
  <text x="70" y="180" class="node-label">struct_item</text>
  <text x="210" y="180" text-anchor="end" class="node-size">120t</text>
  <text x="218" y="180" class="boundary-marker">●</text>
  <path d="M48,94 L48,209 L64,209" class="edge"/>
  <rect x="64" y="196" width="150" height="26" class="node-box chunk-c dim"/>
  <text x="70" y="212" class="node-label">impl_item</text>
  <text x="210" y="212" text-anchor="end" class="node-size">800t</text>
  <text x="218" y="212" class="boundary-marker">●</text>
  <text x="230" y="212" class="annotation">↓ recurse (>budget)</text>
  <path d="M72,222 L72,241 L88,241" class="edge"/>
  <rect x="88" y="228" width="130" height="26" class="node-box chunk-b"/>
  <text x="94" y="244" class="node-label">fn new</text>
  <text x="214" y="244" text-anchor="end" class="node-size">140t</text>
  <text x="222" y="244" class="boundary-marker">●</text>
  <path d="M72,222 L72,273 L88,273" class="edge"/>
  <rect x="88" y="260" width="130" height="26" class="node-box chunk-b"/>
  <text x="94" y="276" class="node-label">fn add</text>
  <text x="214" y="276" text-anchor="end" class="node-size">200t</text>
  <text x="222" y="276" class="boundary-marker">●</text>
  <path d="M72,222 L72,305 L88,305" class="edge"/>
  <rect x="88" y="292" width="130" height="26" class="node-box chunk-c"/>
  <text x="94" y="308" class="node-label">fn search</text>
  <text x="214" y="308" text-anchor="end" class="node-size">400t</text>
  <text x="222" y="308" class="boundary-marker">●</text>
  <path d="M48,94 L48,337 L64,337" class="edge"/>
  <rect x="64" y="324" width="150" height="26" class="node-box chunk-c"/>
  <text x="70" y="340" class="node-label">mod tests</text>
  <text x="210" y="340" text-anchor="end" class="node-size">300t</text>
  <text x="218" y="340" class="boundary-marker">●</text>
  <text x="390" y="52" class="phase-label">2. Walk top-down, merge with budget</text>
  <text x="396" y="70" class="node-size">for child in node.children:</text>
  <text x="396" y="86" class="node-size">  if child is boundary → flush buf</text>
  <text x="396" y="102" class="node-size">  if child > budget  → recurse</text>
  <text x="396" y="118" class="node-size">  if buf + child > budget → flush</text>
  <text x="396" y="134" class="node-size">  buf.push(child)</text>
  <text x="396" y="150" class="node-size">flush remaining buf</text>
  <text x="570" y="182" class="budget-text">budget = 80% of model context</text>
  <text x="390" y="208" class="phase-label">3. Output chunks</text>
  <rect x="390" y="224" width="310" height="36" class="node-box chunk-a" rx="4"/>
  <text x="398" y="236" class="chunk-label chunk-label-a">Chunk 1</text>
  <text x="694" y="236" text-anchor="end" class="node-size">70t</text>
  <text x="398" y="250" class="node-label">source_file + use_declaration ×2</text>
  <rect x="390" y="266" width="310" height="36" class="node-box chunk-b" rx="4"/>
  <text x="398" y="278" class="chunk-label chunk-label-b">Chunk 2</text>
  <text x="694" y="278" text-anchor="end" class="node-size">120t</text>
  <text x="398" y="292" class="node-label">struct Store</text>
  <rect x="390" y="308" width="310" height="36" class="node-box chunk-b" rx="4"/>
  <text x="398" y="320" class="chunk-label chunk-label-b">Chunk 3</text>
  <text x="694" y="320" text-anchor="end" class="node-size">140t</text>
  <text x="398" y="334" class="node-label">fn new</text>
  <rect x="390" y="350" width="310" height="36" class="node-box chunk-b" rx="4"/>
  <text x="398" y="362" class="chunk-label chunk-label-b">Chunk 4</text>
  <text x="694" y="362" text-anchor="end" class="node-size">200t</text>
  <text x="398" y="376" class="node-label">fn add</text>
  <rect x="390" y="392" width="310" height="36" class="node-box chunk-c" rx="4"/>
  <text x="398" y="404" class="chunk-label chunk-label-c">Chunk 5</text>
  <text x="694" y="404" text-anchor="end" class="node-size">400t</text>
  <text x="398" y="418" class="node-label">fn search</text>
  <rect x="390" y="434" width="310" height="36" class="node-box chunk-c" rx="4"/>
  <text x="398" y="446" class="chunk-label chunk-label-c">Chunk 6</text>
  <text x="694" y="446" text-anchor="end" class="node-size">300t</text>
  <text x="398" y="460" class="node-label">mod tests</text>
  <path d="M214,129 Q370,129 390,242" class="edge" stroke-dasharray="4 2" marker-end="url(#ast-chunk-arrowhead)"/>
  <path d="M218,177 Q370,177 390,284" class="edge" stroke-dasharray="4 2" marker-end="url(#ast-chunk-arrowhead)"/>
  <path d="M222,244 Q370,244 390,326" class="edge" stroke-dasharray="4 2" marker-end="url(#ast-chunk-arrowhead)"/>
  <path d="M222,276 Q370,276 390,368" class="edge" stroke-dasharray="4 2" marker-end="url(#ast-chunk-arrowhead)"/>
  <path d="M222,308 Q370,308 390,410" class="edge" stroke-dasharray="4 2" marker-end="url(#ast-chunk-arrowhead)"/>
  <path d="M218,340 Q370,340 390,452" class="edge" stroke-dasharray="4 2" marker-end="url(#ast-chunk-arrowhead)"/>
  <defs>
    <marker id="ast-chunk-arrowhead" markerWidth="8" markerHeight="8" refX="7" refY="3" orient="auto" markerUnits="strokeWidth">
      <path d="M0,0 L0,6 L8,3 z" class="arrow"/>
    </marker>
  </defs>
  <text x="30" y="496" class="boundary-marker">●</text>
  <text x="42" y="496" class="annotation">= boundary node (starts new chunk)</text>
  <text x="30" y="510" class="annotation">Oversized nodes (>budget) are recursed into; boundary nodes flush the buffer to start a fresh context.</text>
</svg>
<p>The diagram illustrates how the algorithm walks the top-level children. The two <code>use_declaration</code> nodes (40t + 30t) are not boundary nodes, so they accumulate in the buffer and flush together as Chunk 1 when the boundary node <code>struct_item</code> is encountered. The <code>struct_item</code> (120t) is itself a boundary node and starts a new accumulation; the oversized <code>impl_item</code> then triggers a flush-before-recurse. Inside the recursive walk, each <code>fn</code> is a boundary node, so each new boundary flushes the previous buffered node (<code>fn new</code> -&gt; Chunk 3, <code>fn add</code> -&gt; Chunk 4, etc.). This ensures that major syntactic units like functions and structs remain isolated unless they are small enough to be merged without crossing boundaries. Non-boundary nodes are merged greedily until a boundary or budget overflow forces a flush.</p>
<p>A key implementation detail: merged chunks preserve inter-node whitespace by slicing <code>source[first.start_byte..last.end_byte]</code> rather than concatenating extracted node texts. This means a chunk reads exactly like the original source, including blank lines between functions, which matters for both readability and keyword search.</p>
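<p>A minimal illustration of why slicing beats concatenation (the byte offsets below are computed from the example string, standing in for tree-sitter's <code>start_byte</code>/<code>end_byte</code>):</p>

```rust
/// Merge adjacent nodes by slicing the original source, so the
/// whitespace between them survives verbatim.
fn merge_by_slice(source: &str, first_start: usize, last_end: usize) -> &str {
    &source[first_start..last_end]
}
```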
<p><strong>Composite chunker.</strong> Handles multi-zone files like Astro, Vue, and Svelte components. The file is first split into zones by text boundaries: frontmatter (<code>---</code> fences), <code>&lt;script&gt;</code>, <code>&lt;style&gt;</code>, and template regions. Each zone is then delegated to the appropriate sub-chunker: TypeScript for script zones, CSS for style zones, HTML for template zones. Zone labels are prefixed to chunk titles (<code>[script] const handler</code>, <code>[style] .container</code>) so search results indicate which part of the component matched.</p>
<p>The registry dispatches by extension and by default covers <code>.md</code>, <code>.rs</code>, <code>.js</code>, <code>.jsx</code>, <code>.ts</code>, <code>.tsx</code>, <code>.html</code>, <code>.css</code>, <code>.astro</code>, <code>.vue</code>, and <code>.svelte</code>. Files without a matching chunker are skipped.</p>
<p>The indexer walks the directory tree (respecting <code>.gitignore</code>), dispatches each file to its chunker, and feeds the resulting chunks through the embedding model in batches of 8. Each chunk is stored with its content, file path, title, source line range, and embedding vector. Incremental updates use blake3 content hashing to detect changed files: unchanged files are skipped, changed files have their old chunks deleted and new chunks inserted.</p>
<p>The token bounds (defaulting to 80% of model context for maximum and 10 for minimum) are configurable in <code>config.toml</code> but the defaults work well in practice. For prose-heavy content, 80% of context maps to roughly 2000–2500 characters; for code and mixed syntax, the character count is lower because punctuation, operators, and camelCase identifiers each consume separate tokens. Either way, the result fits comfortably within the embedding model's context window (typically 512 or 2048 tokens) while providing enough context for meaningful semantic similarity. The 10-token minimum filters out empty sections and standalone headings without discarding short but relevant code snippets.</p>
<h2 id="implementation"><a href="#implementation">Implementation</a></h2>
<p>The full implementation is in <a href="https://github.com/l1x/qrst">qrst</a>:</p>
<ul>
<li><strong>Core:</strong> <code>LtrFusion</code> in <a href="https://github.com/l1x/qrst/blob/main/crates/qrst/src/storage/fusion.rs"><code>crates/qrst/src/storage/fusion.rs</code></a>, 7-feature extraction, dot product scoring, <code>LtrWeights</code> with TOML serialization</li>
<li><strong>Training:</strong> <code>train-ltr</code> command in <a href="https://github.com/l1x/qrst/blob/main/crates/qrst-bench/src/commands/train_ltr.rs"><code>crates/qrst-bench/src/commands/train_ltr.rs</code></a>, pairwise SGD, LOO-CV, non-negative constraints</li>
<li><strong>Config:</strong> <code>fusion_strategy = &quot;ltr&quot;</code> in config.toml, weights auto-loaded from <code>{data_dir}/ltr_weights.toml</code></li>
<li><strong>Results:</strong> <a href="https://github.com/l1x/qrst/blob/main/bench/results/ltr-fusion-results.md"><code>bench/results/ltr-fusion-results.md</code></a></li>
</ul>
<p>Inference cost for the linear model is negligible - just 8 multiply-adds per candidate document, completing in well under a microsecond per item on modern hardware. The weights are 8 floats in a human-readable TOML file. No new dependencies beyond what qrst already uses (serde + toml).</p>
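<p>The per-candidate cost is literally one small dot product. A minimal sketch, assuming 7 features plus a bias term; the layout is illustrative, not <code>LtrFusion</code>'s actual code:</p>

```rust
// Illustrative sketch of linear fusion scoring: one dot product of the
// 7 extracted features with the learned weights, plus a bias term --
// a handful of multiply-adds per candidate document.
fn ltr_score(features: &[f32; 7], weights: &[f32; 7], bias: f32) -> f32 {
    features
        .iter()
        .zip(weights.iter())
        .map(|(f, w)| f * w)
        .sum::<f32>()
        + bias
}
```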
<h2 id="external-validation-on-beirscifact"><a href="#external-validation-on-beirscifact">External Validation on BEIR/SciFact</a></h2>
<p>The panzerotti corpus (my private documentation dataset) is one where both BM25 and semantic search contribute meaningfully (BM25 nDCG@5 = 0.431). To test whether the findings generalize, I ran the same fusion strategies on <a href="https://github.com/allenai/scifact">SciFact</a>, a public benchmark of 300 scientific claim–evidence pairs, using EmbeddingGemma 300M.</p>
<table>
<thead>
<tr>
<th>Strategy</th>
<th>nDCG@10</th>
<th>nDCG@5</th>
<th>P@3</th>
<th>MRR</th>
<th>95% CI nDCG@10</th>
</tr>
</thead>
<tbody>
<tr>
<td>BM25 only</td>
<td>0.047</td>
<td>0.047</td>
<td>0.017</td>
<td>0.050</td>
<td>n/a</td>
</tr>
<tr>
<td>Semantic only</td>
<td>0.762</td>
<td>0.744</td>
<td>0.282</td>
<td>0.723</td>
<td>n/a</td>
</tr>
<tr>
<td>RRF (k=60)</td>
<td>0.761</td>
<td>0.745</td>
<td>0.282</td>
<td>0.726</td>
<td>[0.724, 0.797]</td>
</tr>
<tr>
<td>Convex (α=0.95)</td>
<td>0.762</td>
<td>0.745</td>
<td>0.282</td>
<td>0.726</td>
<td>[0.724, 0.799]</td>
</tr>
<tr>
<td>LTR (panzerotti weights)</td>
<td>0.765</td>
<td>0.748</td>
<td>0.283</td>
<td>0.730</td>
<td>[0.726, 0.801]</td>
</tr>
</tbody>
</table>
<p>BM25 is near-zero on SciFact. Beyond the specialized vocabulary, there is a structural confound: qrst indexes at the document level (the full abstract), whereas BEIR baselines often use passage-level indexing. On SciFact's short, dense abstracts, this mismatch significantly penalizes keyword-based retrieval.</p>
<p>When one retriever is broken, fusion approximately collapses to the working retriever. RRF and convex both produce near-identical results to semantic-only because BM25 contributes mostly noise. In expectation, RRF with one random-quality list adds uncorrelated perturbations to the signal-carrying list; with enough documents, the semantic ranking dominates, though individual queries can still see minor rank swaps. The per-system bootstrap confidence intervals overlap heavily - [0.724, 0.797] vs [0.724, 0.799] vs [0.726, 0.801] - which is suggestive but not a formal significance test. A paired bootstrap or permutation test would be needed to make a rigorous claim; the point estimates and CI overlap are consistent with no meaningful difference.</p>
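<p>The paired test mentioned above is cheap to run. A minimal sketch of a sign-permutation test on per-query metric deltas (e.g. nDCG@10 of system A minus system B per query); the tiny LCG stands in for a real RNG crate:</p>

```rust
// Paired sign-permutation test sketch: randomly flip the sign of each
// per-query delta, and count how often the permuted |sum| is at least as
// extreme as the observed |sum|. Small p suggests a real difference.
fn paired_permutation_p(deltas: &[f64], n_perms: u32, seed: u64) -> f64 {
    let observed = deltas.iter().sum::<f64>().abs();
    let mut state = seed;
    let mut extreme = 0u32;
    for _ in 0..n_perms {
        let mut sum = 0.0;
        for &d in deltas {
            // LCG step; top bit decides the random sign flip
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            if (state >> 63) & 1 == 1 { sum += d } else { sum -= d }
        }
        if sum.abs() >= observed {
            extreme += 1;
        }
    }
    // add-one smoothing keeps p > 0
    (extreme as f64 + 1.0) / (n_perms as f64 + 1.0)
}
```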
<p>The LTR model, trained on panzerotti weights, transfers to SciFact without degradation but also without improvement (0.765 nDCG@10, within every other strategy's CI). The panzerotti-trained linear model has converged to RRF-equivalent behavior, and that equivalence holds even on an out-of-domain corpus.</p>
<p>This is the complementary case to the panzerotti experiment. On panzerotti, both retrievers contribute and LTR converges to RRF. On SciFact, only one retriever contributes and all fusion strategies converge to semantic-only. Neither case gives LTR room to differentiate. The bottleneck is retriever quality, not the combination method.</p>
<h2 id="conclusion"><a href="#conclusion">Conclusion</a></h2>
<p>This experiment set out to beat RRF with a learned model and ended up explaining why RRF works so well. A 7-feature linear model trained on 42 queries converged to a reweighted version of the very baseline it was supposed to surpass. Importantly, this is not because linear models are inherently weak; with strongly correlated features, the model had little room to learn distinct behavior.</p>
<p>The negative keyword rank weight that looked like a genuine insight in reranking evaluation turned out to be a pipeline-breaking artifact: optimizing for reranking quality is not the same as optimizing for end-to-end retrieval.</p>
<p>The key takeaway is not that learning-to-rank fails for hybrid search. It is that a linear model on a small corpus with limited training signal cannot find structure beyond what a well-chosen heuristic already captures. RRF's <code>1/(k + rank)</code> formula encodes a reasonable prior: every retrieval channel contributes, no document is penalized for appearing in only one list, and rank is compressed through a monotonic transform. Replicating those properties is easy. Exceeding them requires richer features, more training data, or a model class that can represent interactions.</p>
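<p>For reference, the RRF combination described above is only a few lines (a sketch, not qrst's implementation):</p>

```rust
use std::collections::HashMap;

// Minimal Reciprocal Rank Fusion: each ranked list contributes
// 1/(k + rank) credit per document; a document absent from a list is
// simply not penalized, and rank is compressed through a monotonic transform.
fn rrf_fuse(lists: &[Vec<&str>], k: f64) -> HashMap<String, f64> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in lists {
        for (i, doc) in list.iter().enumerate() {
            let rank = (i + 1) as f64; // 1-based rank
            *scores.entry((*doc).to_string()).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    scores
}
```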
<p>The more useful outcome is methodological. Reranking metrics and pipeline metrics measure different things. A reranker operates on a fixed candidate set; a fusion strategy also determines which candidates survive. Any learned ranker that can suppress documents below a retrieval cutoff will show a gap between these two evaluations. Measuring end-to-end from the start would have caught the negative-weight problem before it looked like an improvement.</p>
<p>For practitioners building hybrid search: start with RRF. It is fast, parameter-light, and robust. If you have enough labeled data and the right features to justify a learned model, evaluate it end-to-end against RRF before shipping. The bar is higher than reranking metrics suggest.</p>
<h2 id="appendix"><a href="#appendix">Appendix</a></h2>
<h3 id="acronyms"><a href="#acronyms">Acronyms</a></h3>
<table>
<thead>
<tr>
<th>Acronym</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td>AMX</td>
<td>Apple Matrix Co-processor (hardware accelerator in M-series chips)</td>
</tr>
<tr>
<td>ANE</td>
<td>Apple Neural Engine (on-chip neural network accelerator)</td>
</tr>
<tr>
<td>ANN</td>
<td>Approximate Nearest Neighbor</td>
</tr>
<tr>
<td>AST</td>
<td>Abstract Syntax Tree</td>
</tr>
<tr>
<td>BEIR</td>
<td>Benchmarking IR (standardized information retrieval benchmark suite)</td>
</tr>
<tr>
<td>BM25</td>
<td>Best Match 25 (probabilistic term-scoring function)</td>
</tr>
<tr>
<td>cAST</td>
<td>Code AST (recursive AST-based chunking strategy)</td>
</tr>
<tr>
<td>CI</td>
<td>Confidence Interval</td>
</tr>
<tr>
<td>CoreML</td>
<td>Apple's machine learning framework for on-device inference</td>
</tr>
<tr>
<td>EP</td>
<td>Execution Provider (ONNX Runtime backend for hardware acceleration)</td>
</tr>
<tr>
<td>HNSW</td>
<td>Hierarchical Navigable Small World (graph-based ANN algorithm)</td>
</tr>
<tr>
<td>IDF</td>
<td>Inverse Document Frequency</td>
</tr>
<tr>
<td>LOO-CV</td>
<td>Leave-One-Out Cross-Validation</td>
</tr>
<tr>
<td>LTR</td>
<td>Learning to Rank</td>
</tr>
<tr>
<td>MPS</td>
<td>Metal Performance Shaders (Apple GPU compute framework)</td>
</tr>
<tr>
<td>MRR</td>
<td>Mean Reciprocal Rank</td>
</tr>
<tr>
<td>MTEB</td>
<td>Massive Text Embedding Benchmark</td>
</tr>
<tr>
<td>nDCG</td>
<td>Normalized Discounted Cumulative Gain</td>
</tr>
<tr>
<td>ONNX</td>
<td>Open Neural Network Exchange (portable model format)</td>
</tr>
<tr>
<td>P@k</td>
<td>Precision at rank k</td>
</tr>
<tr>
<td>RAG</td>
<td>Retrieval-Augmented Generation</td>
</tr>
<tr>
<td>RRF</td>
<td>Reciprocal Rank Fusion</td>
</tr>
<tr>
<td>SGD</td>
<td>Stochastic Gradient Descent</td>
</tr>
<tr>
<td>SIMD</td>
<td>Single Instruction, Multiple Data</td>
</tr>
<tr>
<td>TF</td>
<td>Term Frequency</td>
</tr>
<tr>
<td>TOML</td>
<td>Tom's Obvious Minimal Language (configuration format)</td>
</tr>
</tbody>
</table>
<h3 id="references"><a href="#references">References</a></h3>
<table>
<thead>
<tr>
<th>Title</th>
<th>Summary</th>
<th>Link</th>
</tr>
</thead>
<tbody>
<tr>
<td>A General Theory of Relevance (BM25 Review)</td>
<td>Foundational review of the BM25 scoring function covering TF saturation and document length normalization</td>
<td><a href="https://www.staff.city.ac.uk/~sbrp622/papers/foundations_bm25_review.pdf">PDF</a></td>
</tr>
<tr>
<td>Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods</td>
<td>Introduces RRF as a simple, parameter-light method for combining multiple ranked lists that outperforms more complex approaches</td>
<td><a href="https://dl.acm.org/doi/10.1145/1571941.1572114">ACM</a></td>
</tr>
<tr>
<td>Pairwise Hinge Loss (MSR-TR-2010-82)</td>
<td>Describes the pairwise hinge loss objective for training ranking models by pushing relevant documents above irrelevant ones by a margin</td>
<td><a href="https://www.microsoft.com/en-us/research/uploads/prod/2016/02/MSR-TR-2010-82.pdf">PDF</a></td>
</tr>
<tr>
<td>Cumulative Gain-Based Evaluation of IR Techniques (nDCG)</td>
<td>Introduces nDCG as a graded relevance metric with position-based discounting for evaluating ranked retrieval</td>
<td><a href="https://dl.acm.org/doi/10.1145/582415.582418">ACM</a></td>
</tr>
<tr>
<td>LambdaRank: Learning to Rank with Nonsmooth Cost Functions</td>
<td>Proposes listwise ranking optimization that directly optimizes nDCG through lambda gradients</td>
<td><a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/lambdarank.pdf">PDF</a></td>
</tr>
<tr>
<td>cAST: Code AST-Based Chunking for Retrieval</td>
<td>Presents recursive AST merging for code chunking, aligning chunk boundaries with syntactic units instead of line counts</td>
<td><a href="https://arxiv.org/abs/2506.15655">arXiv</a></td>
</tr>
<tr>
<td>Deploying Transformers on the Apple Neural Engine</td>
<td>Apple's guide to restructuring transformer architectures for ANE acceleration, achieving 10x speedup on DistilBERT</td>
<td><a href="https://machinelearning.apple.com/research/neural-engine-transformers">Apple ML</a></td>
</tr>
</tbody>
</table>
]]></content:encoded>
      <author>Istvan</author>
      <pubDate>Thu, 05 Mar 2026 12:00:00 +0100</pubDate>
    </item>
    <item>
      <title>harrow: Macro-Free HTTP Framework</title>
      <link>https://www.vectorian.be/projects/harrow/</link>
      <guid isPermaLink="true">https://www.vectorian.be/projects/harrow/</guid>
      <description>&lt;p&gt;harrow is a thin HTTP framework built directly on Hyper 1.0. The goal is explicit, macro-free routing with built-in observability: tracing spans, Prometheus metrics, and structured request IDs from the start, not bolted on after the fact.&lt;/p&gt;</description>
      <content:encoded><![CDATA[<h2 id="context"><a href="#context">Context</a></h2>
<p>harrow is a thin HTTP framework built directly on Hyper 1.0. The goal is explicit, macro-free routing with built-in observability: tracing spans, Prometheus metrics, and structured request IDs from the start, not bolted on after the fact.</p>
<h2 id="design"><a href="#design">Design</a></h2>
<p>Every route is plain Rust. No proc macros, no magic. The route table is a concrete data structure, inspectable at runtime. This makes it straightforward to enumerate endpoints for health checks, generate OpenAPI-like documentation, or debug routing issues in production.</p>
<p>The framework is split into focused crates:</p>
<ul>
<li><strong>harrow-core</strong>: Routing, request/response types, middleware chain</li>
<li><strong>harrow-o11y</strong>: Tracing integration, Prometheus metrics export, request ID propagation</li>
<li><strong>harrow-server</strong>: Hyper binding, connection handling, graceful shutdown, optional TLS via rustls</li>
<li><strong>harrow-bench</strong>: Micro-benchmarks, TCP-level latency tests, and Axum comparison suite</li>
</ul>
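<p>A sketch of what "route table as concrete data" means in practice; the types are illustrative, not harrow's actual API:</p>

```rust
// Illustrative sketch: routes as plain data, so endpoints can be
// enumerated at runtime for health checks or documentation.
struct Route {
    method: &'static str,
    path: &'static str,
}

struct RouteTable {
    routes: Vec<Route>,
}

impl RouteTable {
    fn new() -> Self {
        Self { routes: Vec::new() }
    }

    fn add(mut self, method: &'static str, path: &'static str) -> Self {
        self.routes.push(Route { method, path });
        self
    }

    // No proc macros anywhere: the table is inspectable like any Vec.
    fn endpoints(&self) -> Vec<String> {
        self.routes
            .iter()
            .map(|r| format!("{} {}", r.method, r.path))
            .collect()
    }
}
```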
<h2 id="performance"><a href="#performance">Performance</a></h2>
<p>Target: less than 1 microsecond of added latency over raw Hyper. Current benchmarks on a c7g.xlarge (Graviton3):</p>
<table>
<thead>
<tr>
<th>Benchmark</th>
<th>Latency</th>
</tr>
</thead>
<tbody>
<tr>
<td>Path matching (1 param)</td>
<td>79.6 ns</td>
</tr>
<tr>
<td>Route table lookup (100 routes)</td>
<td>1.19 us</td>
</tr>
<tr>
<td>TCP round-trip (echo)</td>
<td>24.4 us</td>
</tr>
<tr>
<td>Full stack (3 middleware + JSON + params)</td>
<td>31.5 us</td>
</tr>
<tr>
<td>Axum equivalent</td>
<td>24.8 us</td>
</tr>
</tbody>
</table>
<p>Continuous flamegraph profiling catches performance regressions before they ship.</p>
<h2 id="observability"><a href="#observability">Observability</a></h2>
<p>Every request gets a tracing span with structured fields: method, path, status, duration. RED metrics (rate, errors, duration) are emitted automatically. Prometheus scrape endpoint is built in. For AWS deployments, CloudFront request IDs are extracted and propagated as deterministic trace IDs via BLAKE3.</p>
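<p>The property that matters for the trace IDs is determinism: the same CloudFront request ID must always map to the same trace ID. harrow derives the ID with BLAKE3; in the sketch below, std's <code>DefaultHasher</code> stands in purely to keep the example dependency-free:</p>

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Sketch of deterministic trace-ID derivation: hashing the request ID
// yields a stable trace ID, so retries and log lines correlate.
// harrow uses BLAKE3; DefaultHasher is a dependency-free stand-in here.
fn trace_id_from_request_id(request_id: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    request_id.hash(&mut hasher);
    hasher.finish()
}
```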
<h2 id="links"><a href="#links">Links</a></h2>
<ul>
<li><strong>Source:</strong> <a href="https://github.com/l1x/harrow">github.com/l1x/harrow</a></li>
<li><strong>License:</strong> MIT / Apache-2.0</li>
</ul>
]]></content:encoded>
      <author>istvan</author>
      <pubDate>Fri, 20 Feb 2026 12:00:00 +0100</pubDate>
    </item>
    <item>
      <title>Supervising Coding Agents: Notes from a Live Debugging Session</title>
      <link>https://www.vectorian.be/articles/2026-01-30/supervising-agents-live-session/</link>
      <guid isPermaLink="true">https://www.vectorian.be/articles/2026-01-30/supervising-agents-live-session/</guid>
      <description>&lt;p&gt;A short bugfix session became a practical example of agent supervision. The agent produced valid code quickly, but it focused on symptoms first. Progress came from brief human questions that redirected work toward root cause, accurate impact framing, and a cleaner final production change set for users.&lt;/p&gt;</description>
      <content:encoded><![CDATA[<h2 id="context"><a href="#context">Context</a></h2>
<p>A short bugfix session became a practical example of agent supervision. The agent produced valid code quickly, but it focused on symptoms first. Progress came from brief human questions that redirected work toward root cause, accurate impact framing, and a cleaner final production change set for users.</p>
<h2 id="key-takeaways"><a href="#key-takeaways">Key Takeaways</a></h2>
<ul>
<li>Coding agents optimize for task completion, not business-aware correctness. In this session, five short human interventions redirected the agent from suppressing a log warning to fixing the root cause in user registration</li>
<li>The agent did not independently ask &quot;why&quot; before solving &quot;how.&quot; It also overstated business impact until a human requested a one-sentence summary with accurate scope</li>
<li>Supervising a coding agent is not overhead. The time previously spent writing code is reallocated to asking clarifying questions, redirecting toward root causes, and verifying that changes match business intent</li>
</ul>
<h2 id="the-promise"><a href="#the-promise">The promise</a></h2>
<p>The pitch is straightforward: let an agent write most of the code and move faster. In practice, without clear boundaries and active review, quality drops quickly. This article presents an unedited live session that shows where human intervention still matters.</p>
<p>The session starts with a small bug: a React component logs a warning about unexpected metadata shape. The transcript shows how the first solution addressed the symptom, while the root cause appeared only after follow-up questions.</p>
<h2 id="the-live-session"><a href="#the-live-session">The Live Session</a></h2>
<p>The transcript below is from a real session.</p>
<h3 id="step-1-symptom-fix"><a href="#step-1-symptom-fix">Step 1: Symptom Fix</a></h3>
<p>The agent finds the warning in the logs:</p>
<pre class="athl code-block" style="color: #e6edf3; background-color: #30363d;"><code class="language-plaintext" translate="no" tabindex="0"><div class="line" data-line="1">reviews metadata parsed but unexpected shape - using empty &lbrace;
</div><div class="line" data-line="2">  raw: &quot;0&quot;,
</div><div class="line" data-line="3">  parsed: 0,
</div><div class="line" data-line="4">  type: &quot;number&quot;,
</div><div class="line" data-line="5">&rbrace;
</div></code></pre>
<p>The first change is:</p>
<pre class="athl code-block" style="color: #e6edf3; background-color: #30363d;"><code class="language-plaintext" translate="no" tabindex="0"><div class="line" data-line="1">-     &rbrace; else &lbrace;
</div><div class="line" data-line="2">-       console.warn(&quot;reviews metadata parsed but unexpected shape — using empty&quot;, &lbrace;
</div><div class="line" data-line="3">-         raw: reviewsValueRaw,
</div><div class="line" data-line="4">-         parsed,
</div><div class="line" data-line="5">-         type: typeof parsed,
</div><div class="line" data-line="6">-       &rbrace;);
</div><div class="line" data-line="7">+     // else: primitive value (number, string, null) - silently use empty
</div><div class="line" data-line="8">+   &rbrace; catch &lbrace;
</div><div class="line" data-line="9">+     // invalid JSON - silently use empty
</div><div class="line" data-line="10">    &rbrace;
</div></code></pre>
<p>This removes logging and suppresses the warning, but it does not explain why the data is malformed.</p>
<h3 id="step-2-pushback"><a href="#step-2-pushback">Step 2: Pushback</a></h3>
<p>A single line from the human:</p>
<blockquote>
<p><code>// invalid JSON - silently use empty</code>
?</p>
</blockquote>
<p>The agent reverses course:</p>
<blockquote>
<p>You're right - silent failures are bad. Let me add proper logging.</p>
</blockquote>
<pre class="athl code-block" style="color: #e6edf3; background-color: #30363d;"><code class="language-plaintext" translate="no" tabindex="0"><div class="line" data-line="1">-   if (reviewsValueRaw) &lbrace;
</div><div class="line" data-line="2">+   if (reviewsValueRaw &amp;&amp; reviewsValueRaw !== &quot;0&quot;) &lbrace;
</div></code></pre>
<p>Now it hardcodes a bypass for a known bad value. The warning is gone, and the agent marks the task as done.</p>
<h3 id="step-3-root-cause-question"><a href="#step-3-root-cause-question">Step 3: Root Cause Question</a></h3>
<p>The human asks:</p>
<blockquote>
<p>Why was this happening in the first place? Aren't we fixing the symptoms and not the root cause?</p>
</blockquote>
<p>The agent searches and finds:</p>
<pre class="athl code-block" style="color: #e6edf3; background-color: #30363d;"><code class="language-typescript" translate="no" tabindex="0"><div class="line" data-line="1"><span style="color: #8b949e;">// RegisterAction.ts line 94</span>
</div><div class="line" data-line="2"><span style="color: #e6edf3;">&lbrace;</span> <span style="color: #79c0ff;">key</span><span style="color: #e6edf3;">:</span> <span style="color: #a5d6ff;">&quot;reviews&quot;</span><span style="color: #e6edf3;">,</span> <span style="color: #79c0ff;">value</span><span style="color: #e6edf3;">:</span> <span style="color: #a5d6ff;">&quot;0&quot;</span><span style="color: #e6edf3;">&rbrace;</span><span style="color: #e6edf3;">,</span>
</div></code></pre>
<p>New users were initialized with invalid metadata during registration. Every new user got <code>reviews: &quot;0&quot;</code> instead of valid JSON. The symptom fix would have hidden this source bug rather than resolving it.</p>
<h3 id="step-4-executive-summary"><a href="#step-4-executive-summary">Step 4: Executive Summary</a></h3>
<p>The agent provides a long explanation. The human asks:</p>
<blockquote>
<p>Could you shorten it to one sentence?</p>
</blockquote>
<p>The agent responds:</p>
<blockquote>
<p>New users no longer get invalid <code>reviews: &quot;0&quot;</code> on registration; the reviews metadata is now created only when they actually submit their first review.</p>
</blockquote>
<h3 id="step-5-business-impact"><a href="#step-5-business-impact">Step 5: Business Impact</a></h3>
<p>The human asks:</p>
<blockquote>
<p>What is the business impact?</p>
</blockquote>
<p>The agent's analysis:</p>
<blockquote>
<p>Minimal. Users always saw &quot;0 reviews&quot; correctly - the fallback worked. The fix eliminates noisy server logs and prevents potential edge-case crashes.</p>
</blockquote>
<p>The human's response:</p>
<blockquote>
<p>No impact.</p>
</blockquote>
<p>The agent:</p>
<blockquote>
<p>Correct. Log noise only.</p>
</blockquote>
<h2 id="what-this-session-reveals"><a href="#what-this-session-reveals">What This Session Reveals</a></h2>
<table>
<thead>
<tr>
<th>Moment</th>
<th>Agent Behavior</th>
<th>Required Human Input</th>
</tr>
</thead>
<tbody>
<tr>
<td>Initial fix</td>
<td>Silences warning</td>
<td>None (let it run)</td>
</tr>
<tr>
<td>Silent failure</td>
<td>Removes logging entirely</td>
<td>Single &quot;?&quot; to question</td>
</tr>
<tr>
<td>Hardcoded bypass</td>
<td>Adds exception for known bad value</td>
<td>None (acceptable)</td>
</tr>
<tr>
<td>Symptom vs cause</td>
<td>Declares done after symptom fix</td>
<td>&quot;Why was this happening?&quot;</td>
</tr>
<tr>
<td>Technical summary</td>
<td>Multi-paragraph explanation</td>
<td>&quot;Shorten it&quot;</td>
</tr>
<tr>
<td>Business framing</td>
<td>Overclaims impact</td>
<td>Reality check</td>
</tr>
</tbody>
</table>
<p>The agent was technically competent at each step. It found the warning, understood the code, made valid changes, and committed cleanly. Without human intervention:</p>
<ul>
<li>It would have silently swallowed errors</li>
<li>It would not have found the root cause</li>
<li>It would have left invalid initialization logic in production</li>
<li>Its summary would have overstated the impact</li>
</ul>
<h2 id="the-cost-of-supervision"><a href="#the-cost-of-supervision">The Cost of Supervision</a></h2>
<p>Each intervention in this session took only seconds. Across many sessions, those checks add up. The time is not removed; it is reallocated.</p>
<p>The old work: write code, debug, test.
The new work: supervise, question, redirect, verify.</p>
<p>The agent does not reliably ask &quot;why&quot; before optimizing &quot;how.&quot; It does not naturally separate symptom handling from root cause analysis. It also does not calibrate summaries for different audiences without prompting. Those parts still depend on human judgment.</p>
<h2 id="implications-for-leaders"><a href="#implications-for-leaders">Implications for Leaders</a></h2>
<p>If you are evaluating AI coding tools for your organization:</p>
<ol>
<li>Don't measure speed in isolation. Faster commits with frequent rework can still reduce overall throughput.</li>
<li>Staff for supervision, not replacement. The human in the loop needs enough experience to catch silent failures and framing errors.</li>
<li>Build feedback loops. Repeated misses should feed into better prompts, workflows, and guardrails.</li>
<li>Watch for automation bias. Higher output volume makes thorough review harder.</li>
</ol>
<h2 id="conclusion"><a href="#conclusion">Conclusion</a></h2>
<p>AI agents are useful engineering tools. They write correct syntax and they move through large codebases quickly.</p>
<p>They still optimize for task completion, not business-aware correctness. In this session, five short human interventions changed the result from symptom suppression to root-cause resolution with accurate impact framing.</p>
<p>That is the current operating model for agent-assisted development: faster execution, with supervision still required for quality.</p>
]]></content:encoded>
      <author>Istvan</author>
      <pubDate>Fri, 30 Jan 2026 12:00:00 +0100</pubDate>
    </item>
    <item>
      <title>ro11y: Lightweight Observability</title>
      <link>https://www.vectorian.be/projects/ro11y/</link>
      <guid isPermaLink="true">https://www.vectorian.be/projects/ro11y/</guid>
      <description>&lt;p&gt;ro11y is a Rust observability library that implements OTLP protobuf export over HTTP without pulling in the full OpenTelemetry SDK. Seven direct dependencies instead of a hundred and twenty. It builds on the tracing crate for structured logging and distributed tracing.&lt;/p&gt;</description>
      <content:encoded><![CDATA[<h2 id="context"><a href="#context">Context</a></h2>
<p>ro11y is a Rust observability library that implements OTLP protobuf export over HTTP without pulling in the full OpenTelemetry SDK. Seven direct dependencies instead of a hundred and twenty. It builds on the tracing crate for structured logging and distributed tracing.</p>
<h2 id="why-not-the-opentelemetry-sdk"><a href="#why-not-the-opentelemetry-sdk">Why Not the OpenTelemetry SDK</a></h2>
<p>The official Rust SDK requires version lock-step across multiple <code>opentelemetry-*</code> crates, pulls in ~120 transitive dependencies (3+ minute compile times), and has a shutdown footgun where <code>drop()</code> doesn't flush pending spans. The gRPC transport via tonic/prost adds further bloat for services that only need HTTP export.</p>
<p>ro11y hand-rolls the protobuf encoding in ~200 lines of Rust, leveraging the fact that the protobuf wire format hasn't changed since 2008. The result is a library that compiles fast, exports standard OTLP, and never blocks your service.</p>
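<p>The core primitive of such a hand-rolled encoder is the protobuf varint. A sketch of that piece (ro11y's real encoder also handles field tags and length-delimited records):</p>

```rust
// Protobuf base-128 varint encoding: 7 payload bits per byte,
// continuation bit set on every byte except the last.
fn encode_varint(mut value: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte); // final byte: continuation bit clear
            break;
        }
        out.push(byte | 0x80); // more bytes follow
    }
}
```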
<h2 id="architecture"><a href="#architecture">Architecture</a></h2>
<p>The core is a <code>tracing::Layer</code> that captures spans and events, encodes them as OTLP protobuf (<code>ExportTraceServiceRequest</code>, <code>ExportLogsServiceRequest</code>), and ships them over HTTP POST to any OTLP-compatible collector, including Vector, Grafana Alloy, and the OpenTelemetry Collector.</p>
<p>Dual output mode supports OTLP HTTP for production and JSON stderr for local development or CloudWatch fallback. A background exporter with 3-retry exponential backoff ensures telemetry never blocks service operations.</p>
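<p>The retry schedule can be sketched as follows; the base delay is illustrative, and ro11y's actual constants may differ:</p>

```rust
use std::time::Duration;

// Exponential backoff for the background exporter's 3 retries:
// attempt 0 waits base, attempt 1 waits 2x base, attempt 2 waits 4x base.
fn backoff_delay(attempt: u32, base_ms: u64) -> Duration {
    Duration::from_millis(base_ms << attempt)
}
```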
<h2 id="signals"><a href="#signals">Signals</a></h2>
<ul>
<li><strong>Traces</strong>: Standard OTLP trace export with W3C traceparent propagation</li>
<li><strong>Logs</strong>: OTLP log export from tracing events</li>
<li><strong>Metrics</strong>: RED metrics (rate, errors, duration) emitted as structured log events, designed for Vector's <code>log_to_metric</code> transform</li>
<li><strong>Process metrics</strong>: CPU and memory via <code>/proc</code> polling on Linux</li>
</ul>
<p>Optional Tower middleware integrates with Axum for automatic request instrumentation, CloudFront request ID extraction, and BLAKE3-based deterministic trace IDs.</p>
<h2 id="links"><a href="#links">Links</a></h2>
<ul>
<li><strong>Source:</strong> <a href="https://github.com/l1x/ro11y">github.com/l1x/ro11y</a></li>
<li><strong>License:</strong> MIT / Apache-2.0</li>
</ul>
]]></content:encoded>
      <author>istvan</author>
      <pubDate>Thu, 15 Jan 2026 12:00:00 +0100</pubDate>
    </item>
    <item>
      <title>Diagnosing Network Latency on MikroTik: Seven Layers, Seven Fixes</title>
      <link>https://www.vectorian.be/articles/2026-01-08/debugging-network-latency/</link>
      <guid isPermaLink="true">https://www.vectorian.be/articles/2026-01-08/debugging-network-latency/</guid>
      <description>&lt;p&gt;&lt;code&gt;aws s3 sync&lt;/code&gt; was running at 60 KB/s on a link that should reach around 80 Mbps. The issue was not one bug but several: DFS channel pauses, Apple roaming behavior, LTE bufferbloat, and smaller configuration problems. This article walks through the complete fixes on RouterOS v7.&lt;/p&gt;</description>
      <content:encoded><![CDATA[<h2 id="context"><a href="#context">Context</a></h2>
<p><code>aws s3 sync</code> was running at 60 KB/s on a link that should reach around 80 Mbps. The issue was not one bug but several: DFS channel pauses, Apple roaming behavior, LTE bufferbloat, and smaller configuration problems. This article walks through the complete fixes on RouterOS v7.</p>
<h2 id="key-takeaways"><a href="#key-takeaways">Key Takeaways</a></h2>
<ul>
<li>Seven distinct issues across Wi-Fi radio, protocol, traffic shaping, and macOS layers combined to degrade an 80 Mbps link to 60 KB/s during AWS S3 sync operations</li>
<li>The three highest-impact fixes on MikroTik RouterOS v7 were switching from a DFS channel to channel 36, disabling 802.11r FT-over-DS to stop Apple device roaming stalls, and adding CAKE queue management to eliminate LTE bufferbloat</li>
<li>Speed tests can report full bandwidth while interactive traffic feels broken. Debug network latency layer by layer starting from the physical radio, because bufferbloat and protocol-level issues are invisible to throughput-only measurements</li>
</ul>
<h2 id="symptoms"><a href="#symptoms">Symptoms</a></h2>
<p>Three things were happening at once:</p>
<ol>
<li><code>s3 sync</code> would stall for seconds, then resume, then stall again.</li>
<li>Speed tests showed 80 Mbps down, but interactive use felt sluggish.</li>
<li>The Mac would occasionally drop off Wi-Fi and reconnect seconds later.</li>
</ol>
<p>The tools: <code>networkQuality</code> on macOS for latency and throughput measurement, and the MikroTik CLI for radio and queue diagnostics.</p>
<h2 id="layer-1-dfs-radar-on-the-5-ghz-radio"><a href="#layer-1-dfs-radar-on-the-5-ghz-radio">Layer 1: DFS radar on the 5 GHz radio</a></h2>
<p>5 GHz Wi-Fi in Europe overlaps with weather radar spectrum. Channels in the DFS (Dynamic Frequency Selection) range require the router to listen for radar pulses. When it detects one, it must go silent for 60 seconds and switch channels.</p>
<p>My router was on channel 100 (5500 MHz), right in the DFS zone:</p>
<pre class="athl code-block" style="color: #e6edf3; background-color: #30363d;"><code class="language-bash" translate="no" tabindex="0"><div class="line" data-line="1"><span style="color: #d2a8ff;">/interface</span> <span style="color: #e6edf3;">wifi</span> <span style="color: #e6edf3;">monitor</span> <span style="color: #e6edf3;">[</span><span style="color: #e6edf3;">find</span> <span style="color: #e6edf3;">name=mrt6</span><span style="color: #e6edf3;">]</span>
</div><div class="line" data-line="2"><span style="color: #8b949e;"># channel: 5500/ax/Ceee/D (DFS)</span>
</div></code></pre>
<p>The logs showed frequent &quot;radar detected&quot; events. Each one meant a full minute of radio silence.</p>
<p>The fix: move to channel 36 (5180 MHz), which is outside the DFS range. No radar checks, no forced channel switches.</p>
<pre class="athl code-block" style="color: #e6edf3; background-color: #30363d;"><code class="language-bash" translate="no" tabindex="0"><div class="line" data-line="1"><span style="color: #d2a8ff;">/interface</span> <span style="color: #e6edf3;">wifi</span> <span style="color: #e6edf3;">set</span> <span style="color: #e6edf3;">[</span><span style="color: #e6edf3;">find</span> <span style="color: #e6edf3;">name=mrt6</span><span style="color: #e6edf3;">]</span> <span style="color: #e6edf3;">channel.frequency=5180</span>
</div></code></pre>
<h2 id="layer-2-80211r-roaming-and-apple-devices"><a href="#layer-2-80211r-roaming-and-apple-devices">Layer 2: 802.11r roaming and Apple devices</a></h2>
<p>With the radio stable, the Mac still had micro-stutters. The Wi-Fi logs showed a pattern: disconnect at -92 dBm, reconnect at -52 dBm. The Mac was roaming aggressively, even with only one access point in range.</p>
<p>The issue was 802.11r Fast Transition, specifically the &quot;over-the-DS&quot; (Distribution System) variant. This negotiates roaming through the wired backbone. Apple devices prefer &quot;over-the-air&quot; roaming and stall during the DS handshake.</p>
<p>Disabling FT-over-DS and enforcing Protected Management Frames (required for WPA3) resolved the disconnects:</p>
<pre class="athl code-block" style="color: #e6edf3; background-color: #30363d;"><code class="language-bash" translate="no" tabindex="0"><div class="line" data-line="1"><span style="color: #d2a8ff;">/interface</span> <span style="color: #e6edf3;">wifi</span> <span style="color: #e6edf3;">set</span> <span style="color: #e6edf3;">[</span><span style="color: #e6edf3;">find</span> <span style="color: #e6edf3;">name=mrt6</span><span style="color: #e6edf3;">]</span> <span style="color: #e6edf3;">security.ft-over-ds=no</span> <span style="color: #e6edf3;">security.pmf=required</span>
</div></code></pre>
<h2 id="layer-3-lte-bufferbloat"><a href="#layer-3-lte-bufferbloat">Layer 3: LTE bufferbloat</a></h2>
<p>The Wi-Fi was now solid. No drops, no radar pauses. But <code>networkQuality</code> showed:</p>
<pre class="athl code-block" style="color: #e6edf3; background-color: #30363d;"><code class="language-plaintext" translate="no" tabindex="0"><div class="line" data-line="1">Responsiveness: Low (727 milliseconds)
</div><div class="line" data-line="2">Idle Latency: 440 milliseconds
</div></code></pre>
<p>700 ms of latency on a connection with 28 ms idle round-trip time. This is bufferbloat.</p>
<p>The ISP (LTE) has large buffers on the tower side. During an upload like <code>s3 sync</code>, the router sends packets faster than the tower can drain them. The tower queues them. Eventually every new packet, including DNS lookups and TCP ACKs, waits behind hundreds of queued data packets.</p>
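<p>A back-of-the-envelope calculation shows the scale of the queue. At roughly 45 Mbit/s upstream, 700 ms of added delay means almost 4 MB of data sitting in the tower's buffer:</p>

```shell
# Implied queue depth: bytes queued = sending rate x queueing delay
# 45 Mbit/s upstream, 700 ms of added delay
awk 'BEGIN { printf "%.1f MB\n", 45e6 / 8 * 0.7 / 1e6 }'
# → 3.9 MB
```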
<p>The fix is Smart Queue Management using the CAKE algorithm. CAKE shapes traffic on the router side, keeping the outbound rate just below the link capacity so the ISP's buffer never fills.</p>
<pre class="athl code-block" style="color: #e6edf3; background-color: #30363d;"><code class="language-bash" translate="no" tabindex="0"><div class="line" data-line="1"><span style="color: #d2a8ff;">/queue</span> <span style="color: #e6edf3;">simple</span> <span style="color: #e6edf3;">add</span> <span style="color: #e6edf3;">name=</span><span style="color: #a5d6ff;">&quot;queue-lte&quot;</span> <span style="color: #e6edf3;">target=</span><span style="color: #a5d6ff;">&quot;bridge&quot;</span> \
</div><div class="line" data-line="2">    <span style="color: #e6edf3;">max-limit=45M/80M</span> <span style="color: #e6edf3;">queue=cake-default/cake-default</span>
</div></code></pre>
<p>The limits (45 Mbps up, 80 Mbps down) are set to 90-95% of measured capacity. Setting them at 100% defeats the purpose since the ISP's buffer would still fill.</p>
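<p>Deriving the limits is straightforward arithmetic. A sketch, assuming a speed test measured 50 Mbit/s up and 90 Mbit/s down (substitute your own measurements):</p>

```shell
# Shape at 90% of measured capacity so the ISP-side buffer never fills
up_measured=50     # Mbit/s from a speed test (assumed)
down_measured=90   # Mbit/s (assumed)
echo "max-limit=$(( up_measured * 90 / 100 ))M/$(( down_measured * 90 / 100 ))M"
# → max-limit=45M/81M
```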
<h2 id="layer-4-transmit-power"><a href="#layer-4-transmit-power">Layer 4: Transmit power</a></h2>
<p>Wi-Fi is bidirectional. The router might transmit at 23 dBm, but a laptop radio typically maxes out around 15-17 dBm. If the router's power is too high, it can hear the client but the client struggles to send frames back. This asymmetry causes retransmissions.</p>
<p>Reducing transmit power to 17-20 dBm brought it closer to the client's capability:</p>
<pre class="athl code-block" style="color: #e6edf3; background-color: #30363d;"><code class="language-bash" translate="no" tabindex="0"><div class="line" data-line="1"><span style="color: #d2a8ff;">/interface</span> <span style="color: #e6edf3;">wifi</span> <span style="color: #e6edf3;">set</span> <span style="color: #e6edf3;">[</span><span style="color: #e6edf3;">find</span> <span style="color: #e6edf3;">name=mrt6</span><span style="color: #e6edf3;">]</span> <span style="color: #e6edf3;">channel.tx-power=17</span>
</div></code></pre>
<h2 id="layer-5-channel-width"><a href="#layer-5-channel-width">Layer 5: Channel width</a></h2>
<p>Wider channels (80 MHz, 160 MHz) offer higher throughput but collect more noise. In a residential environment with neighboring access points, a 40 MHz channel is often more reliable than an 80 MHz one. The throughput ceiling drops, but the error rate drops further.</p>
<pre class="athl code-block" style="color: #e6edf3; background-color: #30363d;"><code class="language-bash" translate="no" tabindex="0"><div class="line" data-line="1"><span style="color: #d2a8ff;">/interface</span> <span style="color: #e6edf3;">wifi</span> <span style="color: #e6edf3;">set</span> <span style="color: #e6edf3;">[</span><span style="color: #e6edf3;">find</span> <span style="color: #e6edf3;">name=mrt6</span><span style="color: #e6edf3;">]</span> <span style="color: #e6edf3;">channel.width=20/40mhz</span>
</div></code></pre>
<h2 id="layer-6-macos-awdl"><a href="#layer-6-macos-awdl">Layer 6: macOS AWDL</a></h2>
<p>Apple Wireless Direct Link (awdl0) is the interface macOS uses for AirDrop and AirPlay discovery. It periodically forces the Wi-Fi radio to hop to a social channel for peer discovery, causing latency spikes of 50-200 ms even when AirDrop is not in active use.</p>
<p>Disabling it removes the spikes:</p>
<pre class="athl code-block" style="color: #e6edf3; background-color: #30363d;"><code class="language-bash" translate="no" tabindex="0"><div class="line" data-line="1"><span style="color: #d2a8ff;">sudo</span> <span style="color: #e6edf3;">ifconfig</span> <span style="color: #e6edf3;">awdl0</span> <span style="color: #e6edf3;">down</span>
</div></code></pre>
<p>The trade-off is that AirDrop stops working while the interface is down. The setting does not persist; awdl0 comes back up on reboot.</p>
<h2 id="layer-7-lte-mtu-and-mss-clamping"><a href="#layer-7-lte-mtu-and-mss-clamping">Layer 7: LTE MTU and MSS clamping</a></h2>
<p>LTE links often have a lower MTU than the standard 1500 bytes due to GTP tunneling overhead. When a TCP segment exceeds the path MTU, it gets fragmented or dropped. If the Don't Fragment bit is set (common with modern TCP), the packet is dropped and the sender must retransmit at a smaller size after receiving an ICMP &quot;Fragmentation Needed&quot; message. Some networks filter ICMP, breaking this discovery process entirely.</p>
<p>MSS clamping rewrites the TCP Maximum Segment Size during the handshake so segments are sized correctly from the start:</p>
<pre class="athl code-block" style="color: #e6edf3; background-color: #30363d;"><code class="language-bash" translate="no" tabindex="0"><div class="line" data-line="1"><span style="color: #d2a8ff;">/ip</span> <span style="color: #e6edf3;">firewall</span> <span style="color: #e6edf3;">mangle</span> <span style="color: #e6edf3;">add</span> <span style="color: #e6edf3;">chain=forward</span> <span style="color: #e6edf3;">action=change-mss</span> \
</div><div class="line" data-line="2">    <span style="color: #e6edf3;">new-mss=clamp-to-pmtu</span> <span style="color: #e6edf3;">passthrough=yes</span> <span style="color: #e6edf3;">protocol=tcp</span> <span style="color: #e6edf3;">tcp-flags=syn</span>
</div></code></pre>
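<p>The clamped value is simply the path MTU minus the IP and TCP headers. A sketch with an assumed LTE path MTU of 1420 bytes:</p>

```shell
# MSS that fits an assumed 1420-byte LTE path MTU
mtu=1420                   # path MTU after GTP tunneling overhead (assumed)
mss=$(( mtu - 20 - 20 ))   # minus 20-byte IPv4 header and 20-byte TCP header
echo "mss=$mss"
# → mss=1380
```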
<h2 id="summary"><a href="#summary">Summary</a></h2>
<p>After all seven layers:</p>
<pre class="athl code-block" style="color: #e6edf3; background-color: #30363d;"><code class="language-plaintext" translate="no" tabindex="0"><div class="line" data-line="1">Responsiveness: High (420 RPM)
</div><div class="line" data-line="2">Idle Latency: 28 ms
</div></code></pre>
<table>
<thead>
<tr>
<th>Layer</th>
<th>Problem</th>
<th>Fix</th>
</tr>
</thead>
<tbody>
<tr>
<td>1. Physical</td>
<td>DFS radar pauses</td>
<td>Channel 36 (non-DFS)</td>
</tr>
<tr>
<td>2. Protocol</td>
<td>Apple FT-over-DS stalls</td>
<td>Disable FT-over-DS, enforce PMF</td>
</tr>
<tr>
<td>3. Traffic</td>
<td>LTE bufferbloat (700 ms)</td>
<td>CAKE queue at 90-95% capacity</td>
</tr>
<tr>
<td>4. Power</td>
<td>Tx/Rx asymmetry</td>
<td>Reduce to 17-20 dBm</td>
</tr>
<tr>
<td>5. Width</td>
<td>Noise on wide channels</td>
<td>40 MHz instead of 80/160</td>
</tr>
<tr>
<td>6. macOS</td>
<td>AWDL channel hopping</td>
<td>Disable awdl0</td>
</tr>
<tr>
<td>7. WAN</td>
<td>MTU/fragmentation</td>
<td>MSS clamping</td>
</tr>
</tbody>
</table>
<p>The first three layers had the largest impact. Layers 4-7 were incremental improvements that reduced tail latency and retransmissions. The total debugging took an afternoon; most of it was spent on layer 3, understanding why a fast connection felt slow.</p>
]]></content:encoded>
      <author>Istvan</author>
      <pubDate>Thu, 08 Jan 2026 23:00:00 +0100</pubDate>
    </item>
    <item>
      <title>marie-ssg: Static Site Generator</title>
      <link>https://www.vectorian.be/projects/marie-ssg/</link>
      <guid isPermaLink="true">https://www.vectorian.be/projects/marie-ssg/</guid>
      <description>&lt;p&gt;marie-ssg is a static site generator built to do one thing well: turn markdown files paired with TOML metadata into HTML pages. It powers this site. The design prioritizes build speed, minimal configuration, and staying out of the way.&lt;/p&gt;</description>
      <content:encoded><![CDATA[<h2 id="context"><a href="#context">Context</a></h2>
<p>marie-ssg is a static site generator built to do one thing well: turn markdown files paired with TOML metadata into HTML pages. It powers this site. The design prioritizes build speed, minimal configuration, and staying out of the way.</p>
<h2 id="how-it-works"><a href="#how-it-works">How It Works</a></h2>
<p>Each content file is a pair: <code>article.md</code> for the body and <code>article.meta.toml</code> for metadata (title, date, tags, cover image). Templates use Jinja-style syntax via Minijinja. Content loading runs in parallel using Rayon, and syntax highlighting covers 10 languages through the Autumnus library.</p>
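<p>A metadata sidecar might look like this (a sketch; the exact key names are illustrative, not marie-ssg's documented schema):</p>

```toml
# article.meta.toml -- sidecar for article.md (key names illustrative)
title = "All I wanted was a simple code search"
date = 2026-03-05
tags = ["search", "ranking"]
cover = "cover.webp"
draft = false
```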
<p>The build pipeline is a single pass: load content, render templates, copy static assets, generate sitemap and RSS feed. Watch mode on macOS uses FSEvents for automatic rebuilds during development.</p>
<h2 id="features"><a href="#features">Features</a></h2>
<ul>
<li><strong>Flexible URL patterns</strong>: <code>{date}/{stem}</code>, <code>{year}/{month}/{stem}</code>, or just <code>{stem}</code></li>
<li><strong>Content-based asset hashing</strong>: Cache busting via BLAKE3 hashes appended to CSS/JS filenames</li>
<li><strong>Draft support</strong>: Mark content as <code>draft = true</code> to exclude from production builds</li>
<li><strong>Clean URLs</strong>: <code>/articles/my-post/</code> instead of <code>/articles/my-post.html</code></li>
<li><strong>Syntax highlighting</strong>: 10 languages with configurable themes</li>
<li><strong>RSS and sitemap</strong>: Generated automatically from content metadata</li>
<li><strong>Redirects</strong>: Declarative URL redirects for content migrations</li>
<li><strong>Flamechart profiling</strong>: Built-in performance profiling for build pipeline optimization</li>
</ul>
<h2 id="performance"><a href="#performance">Performance</a></h2>
<p>The binary is size-optimized: LTO, symbol stripping, and a migration from chrono to the time crate reduced the release build from 80 MB to 9 MB. Build times for a typical site are under 200 ms.</p>
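<p>The size optimizations correspond to a release profile along these lines (typical settings for LTO and symbol stripping, not necessarily the project's exact manifest):</p>

```toml
# Cargo.toml -- size-focused release profile (illustrative)
[profile.release]
lto = true     # link-time optimization across crate boundaries
strip = true   # remove debug symbols from the binary
```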
<h2 id="links"><a href="#links">Links</a></h2>
<ul>
<li><strong>Source:</strong> <a href="https://github.com/l1x/marie-ssg">github.com/l1x/marie-ssg</a></li>
<li><strong>License:</strong> MIT / Apache-2.0</li>
</ul>
]]></content:encoded>
      <author>Istvan</author>
      <pubDate>Tue, 30 Dec 2025 12:00:00 +0100</pubDate>
    </item>
    <item>
      <title>Agentic Project Management: Why Vibe Coding Fails and How to Fix It</title>
      <link>https://www.vectorian.be/articles/2025-12-29/agentic-project-management/</link>
      <guid isPermaLink="true">https://www.vectorian.be/articles/2025-12-29/agentic-project-management/</guid>
      <description>&lt;p&gt;Intuitive prompt engineering - often called vibe coding - promises a flow state for software engineers, but the reality is often a repetitive loop of review and correction. We’ve traded writing code for supervising agents that write code. The question is how to make that trade worthwhile.&lt;/p&gt;</description>
      <content:encoded><![CDATA[<h2 id="context"><a href="#context">Context</a></h2>
<p>Intuitive prompt engineering - often called vibe coding - promises a flow state for software engineers, but the reality is often a repetitive loop of review and correction. We’ve traded writing code for supervising agents that write code. The question is how to make that trade worthwhile.</p>
<h2 id="key-takeaways"><a href="#key-takeaways">Key Takeaways</a></h2>
<ul>
<li>A three-tier workflow of specification, execution, and verification reduces coding agent regressions by catching intent misalignment before any code is written</li>
<li>Coding agents perform best when treated as fast interns with no short-term memory: restrict their context to relevant files, store project state in persistent docs, and break tasks into atomic units</li>
<li>Detailed task specifications produce correct agent output on roughly 90% of first attempts. When output is wrong, updating the specification and re-running is more effective than manually editing the generated code</li>
</ul>
<h2 id="the-illusion-of-effortless-ai-coding"><a href="#the-illusion-of-effortless-ai-coding">The Illusion of Effortless AI Coding</a></h2>
<p>You've probably seen the demos. An agent scaffolds an entire feature in minutes, commits cleanly, and moves on to the next task. Then you try it on your own codebase.</p>
<p>The first task goes well. The second introduces a subtle regression because the agent &quot;forgot&quot; the component hierarchy it just modified. By the third, you're back to reading diffs line-by-line, except now there are three times as many lines and half of them are hallucinated imports.</p>
<p>AI companies are racing towards AGI, and models are being released at an unprecedented rate, claiming higher and higher scores on benchmarks like <a href="https://www.swebench.com/">SWE-bench</a>. The goal is to score in the top bracket of these tests, which is largely orthogonal to the code quality we have to produce at work: repeatedly, reliably, and in a controlled manner.</p>
<p>I have been vibe coding for the last couple of months. While the pace of development surprised me, so did the failure scenarios. We’ve all heard the horror stories: deleted databases, hallucinated file paths, and uncommitted work vanishing into the ether. But there are also subtle failures, like an agent centering a div vertically but forgetting the horizontal alignment because it drifted from the design system.</p>
<p>Based on my research and experience, I’ve categorized the primary challenges most engineers face with coding agents:</p>
<table>
<thead>
<tr>
<th>Failure Mode</th>
<th>Cause</th>
<th>Symptom</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Attention Drift</strong></td>
<td>Large context windows</td>
<td>Agent &quot;forgets&quot; earlier instructions, hallucinates file paths</td>
</tr>
<tr>
<td><strong>Scope Creep</strong></td>
<td>No explicit boundaries</td>
<td>Simple bug fix becomes architectural refactor</td>
</tr>
<tr>
<td><strong>Context Exhaustion</strong></td>
<td>Dumping entire codebase</td>
<td>Token limits hit mid-task</td>
</tr>
<tr>
<td><strong>Coordination Issues</strong></td>
<td>No persistent state</td>
<td>Multi-session work loses continuity</td>
</tr>
<tr>
<td><strong>Reviewer Fatigue</strong></td>
<td>Too lengthy reviews</td>
<td>Humans stop checking the AI's work thoroughly</td>
</tr>
</tbody>
</table>
<p>Each of these failures stems from a single root cause: the mismatch between the agent's almost infinite stamina and its finite understanding of intent. This is exactly why we need to direct its attention deliberately, so that we don't waste tokens on subpar implementations.</p>
<p>My goal was to stop micro-managing and start directing. I needed a system to maximize token efficiency and enforce architectural boundaries. I call this framework Agentic Project Management.</p>
<h2 id="the-intern-model"><a href="#the-intern-model">The Intern Model</a></h2>
<p>The core philosophy of agentic project management is to treat the agent not as a senior engineer who knows everything and can judge when to ignore details or expand on requirements, but as a brilliant, fast intern with zero short-term memory who is easily distracted.</p>
<p>To make this work, we need three pillars of guardrails:</p>
<ul>
<li>
<p>Context Curation: Manually restricting the &quot;surface area&quot; of the code exposed to the agent to reduce hallucination and attention drift.</p>
</li>
<li>
<p>Externalized Memory: Moving project rules, architectural decisions, and current status (including project management tickets) into persistent files that live in the repo.</p>
</li>
<li>
<p>Atomic Scoping: Breaking complex features down into tasks so small that &quot;Context Exhaustion&quot; becomes very rare.</p>
</li>
</ul>
<p>With these in mind, we can have a workflow that implements these pillars.</p>
<h2 id="definition-before-execution"><a href="#definition-before-execution">Definition Before Execution</a></h2>
<p>If we treat the agent as an intern with no short-term memory, we cannot expect it to infer requirements as it works. The instructions need to be complete before execution begins. This means separating the workflow into distinct phases: first define what needs to be done, then execute, then verify.</p>
<p>This structure forces clarity. A vague idea must become a concrete specification before any code is written. The agent receives bounded context rather than an open-ended chat, which reduces both hallucination and scope creep.</p>
<p>The workflow operates in three tiers.</p>
<h3 id="tier-1-definition"><a href="#tier-1-definition">Tier 1: Definition</a></h3>
<ul>
<li>
<p><a href="https://github.com/l1x/agent-prompts/blob/main/do/create-prd.md">Create PRD</a>: The PRD captures high-level intent and constraints. I use a prompt like: &quot;Given @create-prd.md, create a PRD for a simple todo list app.&quot; For larger projects, each major feature gets its own PRD.</p>
</li>
<li>
<p><a href="https://github.com/l1x/agent-prompts/blob/main/do/generate-tasks.md">Task Breakdown</a>: This prompt takes the PRD and decomposes it into epics and individual tasks. Each task should be small enough that an agent can complete it without exhausting its context window.</p>
</li>
<li>
<p><a href="https://github.com/l1x/agent-prompts/blob/main/do/prepare-task.md">Context Preparation</a>: Before execution, each task gets a structured context block (typically XML) that lists only the files, types, and dependencies the agent needs. This prevents the agent from reading irrelevant code and losing focus.</p>
</li>
</ul>
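<p>A prepared context block might look like this (the task, file paths, and element names are all hypothetical; there is no fixed schema):</p>

```xml
<!-- Everything below is illustrative -->
<task id="TASK-12">
  <goal>Add a --dry-run flag to the sync command</goal>
  <files>
    <file path="src/cli.rs">argument parsing lives here</file>
    <file path="src/sync.rs">function to guard behind the flag</file>
  </files>
  <constraints>
    <rule>Do not modify files outside the list above</rule>
    <rule>Do not add new dependencies</rule>
  </constraints>
</task>
```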
<h3 id="tier-2-execution"><a href="#tier-2-execution">Tier 2: Execution</a></h3>
<ul>
<li>
<p><a href="https://github.com/l1x/agent-prompts/blob/main/do/execute-task.md">Execute Task</a>: The agent receives a task ID, reads the prepared context, and implements the change. It commits to a branch and opens a pull request. The prompt constrains the agent to the defined scope.</p>
</li>
<li>
<p><a href="https://github.com/l1x/agent-prompts/blob/main/do/execute-epic.md">Execute Epic</a>: For larger units of work, an orchestrator agent can coordinate multiple task executions. This is still experimental; the goal is to spin up isolated containers, each handling a single task in parallel.</p>
</li>
</ul>
<h3 id="tier-3-verification"><a href="#tier-3-verification">Tier 3: Verification</a></h3>
<ul>
<li><a href="https://github.com/l1x/agent-prompts/blob/main/do/review-pr.md">Code Review</a>: A separate agent reviews the changes against the original PRD and task specification. This agent starts with fresh context, free from the execution agent's accumulated assumptions. It checks for scope creep, regressions, hallucinated imports, and architectural violations before human review.</li>
</ul>
<p>The following diagram shows how these tiers connect.</p>
<svg id="diagram" viewBox="0 0 850 500" xmlns="http://www.w3.org/2000/svg" style="width: 110%; max-width: 850px; margin-left: -5%;">
  <defs>
    <marker id="arrow" markerWidth="10" markerHeight="10" refX="9" refY="3" orient="auto" markerUnits="strokeWidth">
      <path d="M0,0 L0,6 L9,3 z" fill="var(--c-muted)" opacity="0.6" />
    </marker>
    <filter id="shadow" x="-20%" y="-20%" width="140%" height="140%">
      <feDropShadow dx="1" dy="1" stdDeviation="2" flood-color="var(--c-pri)" flood-opacity="0.15" />
    </filter>
  </defs>
  <style>
    #diagram { --human: var(--c-acc); --exec: var(--c-acc2); --decision: var(--c-muted); --repo: var(--c-sec); }
    #diagram .text-title { font-family: system-ui, -apple-system, sans-serif; font-size: 13px; font-weight: 600; fill: var(--c-pri); }
    #diagram .text-desc { font-family: system-ui, -apple-system, sans-serif; font-size: 10px; fill: var(--c-muted); }
    #diagram .text-tier { font-family: system-ui, -apple-system, sans-serif; font-size: 11px; font-weight: 600; letter-spacing: 0.05em; fill: var(--c-sec); }
    #diagram .path-line { fill: none; stroke: var(--c-brd); stroke-width: 1.5; opacity: 0.8; marker-end: url(#arrow); }
    #diagram .box { filter: url(#shadow); stroke-width: 1.5px; }
    #diagram .human-box { stroke: var(--human); fill: color-mix(in srgb, var(--human) 15%, transparent); }
    #diagram .exec-box { stroke: var(--exec); fill: color-mix(in srgb, var(--exec) 12%, transparent); }
    #diagram .exec-wip { stroke: var(--exec); stroke-dasharray: 4 2; fill: color-mix(in srgb, var(--exec) 10%, transparent); }
    #diagram .repo-box { stroke: var(--repo); fill: color-mix(in srgb, var(--repo) 12%, transparent); }
    #diagram .decision { fill: color-mix(in srgb, var(--decision) 8%, transparent); stroke: var(--decision); stroke-width: 1.5; }
    #diagram .theme-label { font-family: var(--f-mono, monospace); font-size: 9px; letter-spacing: 0.1em; text-transform: uppercase; fill: var(--c-muted); opacity: 0.6; }
    #diagram .theme-light { display: block; }
    #diagram .theme-dark { display: none; }
    [data-theme="dark"] #diagram .theme-light { display: none; }
    [data-theme="dark"] #diagram .theme-dark { display: block; }
  </style>
  <text x="115" y="490" text-anchor="start" class="theme-label">Fig: Workflow</text>
  <text x="740" y="490" text-anchor="end" class="theme-label theme-light">LIGHT</text>
  <text x="740" y="490" text-anchor="end" class="theme-label theme-dark">DARK</text>
  <g transform="translate(90, 0)">
  <text x="20" y="30" class="text-tier">TIER 1: DEFINITION</text>
  <rect x="20" y="50" width="120" height="55" rx="6" class="box human-box" />
  <text x="80" y="82" text-anchor="middle" class="text-title">Human Intent</text>
  <path d="M140 77 L180 77" class="path-line" />
  <rect x="180" y="50" width="130" height="55" rx="6" class="box exec-box" />
  <text x="245" y="72" text-anchor="middle" class="text-title">Generate PRD</text>
  <text x="245" y="88" text-anchor="middle" class="text-desc">create-prd.md</text>
  <path d="M310 77 L350 77" class="path-line" />
  <rect x="350" y="50" width="130" height="55" rx="6" class="box exec-box" />
  <text x="415" y="72" text-anchor="middle" class="text-title">Task Breakdown</text>
  <text x="415" y="88" text-anchor="middle" class="text-desc">generate-tasks.md</text>
  <path d="M480 77 L520 77" class="path-line" />
  <rect x="520" y="50" width="130" height="55" rx="6" class="box exec-box" />
  <text x="585" y="72" text-anchor="middle" class="text-title">Context (XML)</text>
  <text x="585" y="88" text-anchor="middle" class="text-desc">prepare-task.md</text>
  <path d="M585 105 L585 145 L335 145 L335 175" class="path-line" />
  <text x="20" y="185" class="text-tier">TIER 2: EXECUTION</text>
  <polygon points="335,175 375,200 335,225 295,200" class="decision" filter="url(#shadow)" />
  <text x="335" y="205" text-anchor="middle" class="text-title" font-size="10">Mode</text>
  <path d="M295 200 L220 200 L220 240" class="path-line" />
  <rect x="150" y="240" width="140" height="55" rx="6" class="box exec-box" />
  <text x="220" y="262" text-anchor="middle" class="text-title">Coding Agent</text>
  <text x="220" y="278" text-anchor="middle" class="text-desc">execute-task.md</text>
  <path d="M375 200 L450 200 L450 240" class="path-line" />
  <rect x="380" y="240" width="140" height="55" rx="6" class="box exec-wip" />
  <text x="450" y="262" text-anchor="middle" class="text-title">Orchestrator</text>
  <text x="450" y="278" text-anchor="middle" class="text-desc">execute-epic.md</text>
  <path d="M220 295 L220 325 L300 325" class="path-line" />
  <path d="M450 295 L450 325 L370 325" class="path-line" />
  <rect x="300" y="310" width="70" height="40" rx="6" class="box repo-box" />
  <text x="335" y="335" text-anchor="middle" class="text-title">Repo</text>
  <text x="20" y="375" class="text-tier">TIER 3: VERIFICATION</text>
  <path d="M335 350 L335 390 L335 405" class="path-line" />
  <rect x="265" y="405" width="140" height="55" rx="6" class="box exec-box" />
  <text x="335" y="427" text-anchor="middle" class="text-title">Code Review</text>
  <text x="335" y="443" text-anchor="middle" class="text-desc">review-pr.md</text>
  </g>
</svg>
<h2 id="shifting-quality-left"><a href="#shifting-quality-left">Shifting Quality Left</a></h2>
<p>With this workflow, I was able to reduce the amount of rollback I had to do and increase the efficiency of token use. This is only anecdotal, since I did not think to measure it from the get-go; it would be a fun project to prove it with more data points. The approach is similar to &quot;shift-left&quot; in security engineering.</p>
<p>By moving quality checks earlier in the pipeline, defects are caught earlier in the development loop. The same principle applies to coding agents. Default vibe coding puts all quality control on the right side: generate code, review, find problems, correct, repeat.</p>
<p>Shifting left means investing in task definition before execution begins. The PRD catches intent misalignment before any code is written. The task breakdown enforces scope boundaries. The XML context preparation prevents attention drift before the agent opens files it should not touch or loses focus.</p>
<p>This mirrors how Rust's toolchain works: the borrow checker and clippy catch issues at compile time rather than runtime. You pay upfront, fighting the compiler and satisfying the type system, but you rarely pay the much higher cost of debugging a mysterious null pointer in production.</p>
<p>Crafting detailed task specifications feels like overhead when you could just let the agent start coding. The payoff only becomes obvious after you have experienced enough late-stage failures, enough agent sessions that spiraled into incoherence, enough diffs so large you stopped reading carefully. Once you have experienced that pain, putting in the work stops feeling like overhead and starts feeling like the way forward.</p>
<h2 id="parallelism-without-chaos"><a href="#parallelism-without-chaos">Parallelism Without Chaos</a></h2>
<p>There are several ways to create a multi-agent environment. Locally, git <a href="https://git-scm.com/docs/git-worktree">worktrees</a> allow multiple working directories within the same repo, each checked out to a different branch.</p>
<p>From the documentation:</p>
<blockquote>
<p>Manage multiple working trees attached to the same repository.
A git repository can support multiple working trees, allowing you to check out more than one branch at a time. With git worktree add a new working tree is associated with the repository, along with additional metadata that differentiates that working tree from others in the same repository. The working tree, along with this metadata, is called a &quot;worktree&quot;.</p>
</blockquote>
<p>It works very well with multiple agents running on the same node.</p>
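<p>Setting this up takes one command per agent. A minimal demo against a throwaway repository:</p>

```shell
# One worktree (and branch) per agent, from a scratch repo in /tmp
set -e
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "init"
git -C "$repo" worktree add -b agent-a "$repo-agent-a"
git -C "$repo" worktree add -b agent-b "$repo-agent-b"
git -C "$repo" worktree list   # main tree plus two agent trees
```

<p>Each agent then runs inside its own directory, on its own branch, with no chance of clobbering another agent's uncommitted work.</p>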
<p>For maximum separation and to reduce the chance of unintended consequences further, I have also implemented a simple Docker-based workflow.</p>
<h2 id="supporting-toolchain"><a href="#supporting-toolchain">Supporting Toolchain</a></h2>
<p>I use the following toolset for getting the most out of agentic work:</p>
<ul>
<li>
<p><a href="https://mise.jdx.dev/">Mise</a> is a polyglot tool version manager with environment variable support and a task runner. It supports multiple configuration locations, and therefore multiple strategies for capturing tool requirements (programming languages, executables, and so on). Mise also supports multiple <a href="https://mise.jdx.dev/dev-tools/backends/">backends</a> for installing software, including custom implementations.</p>
</li>
<li>
<p><a href="https://github.com/steveyegge/beads">Beads</a> is a Git-persisted, dependency-aware graph of tasks. By structuring tasks logically, it prevents context drift, allowing coding agents to execute multi-step engineering workflows reliably. Instead of giving the agent every single file in the repo, we can list the files we need in the ticket description together with all the other details. It has an agent-aware CLI that makes task management for agents easy.</p>
</li>
<li>
<p><a href="https://platform.claude.com/docs/en/build-with-claude/prompt-engineering/use-xml-tags">XML</a>: Finally, after 27 years of waiting, we figured out what XML is good for. It turns out agents love it.</p>
</li>
</ul>
<h2 id="what-changed-in-practice"><a href="#what-changed-in-practice">What Changed in Practice</a></h2>
<p>Before this framework, I spent a significant portion of my time reviewing code and fixing subtle regressions, trying to get the agent back on track. With the three-tier workflow, that dynamic no longer exists.</p>
<p>Now, roughly 90% of agent output is correct on the first pass. When something is wrong, I do not fix the code directly. I adjust the task description and run the agent again. The feedback loop has shifted from debugging implementation to refining specification. This is a more sustainable way to work: specifications are reusable, and improving them benefits future tasks.</p>
<p>Babysitting sessions are gone. I observe outcomes, verify against the PRD, and move on.</p>
<h2 id="where-this-goes-next"><a href="#where-this-goes-next">Where This Goes Next</a></h2>
<p>The immediate next step is true parallelism. I am experimenting with using the Orchestrator to spin up disposable containers, each assigned a specific task ID and context. This would allow an &quot;epic&quot; to be implemented by five &quot;interns&quot; simultaneously.</p>
<p>In the end, agentic productivity gains do not come from letting go of control. They come from deciding where control actually matters.</p>
]]></content:encoded>
      <author>Istvan</author>
      <pubDate>Mon, 29 Dec 2025 19:48:00 +0100</pubDate>
    </item>
    <item>
      <title>qrst: On-Device Hybrid Search</title>
      <link>https://www.vectorian.be/projects/qrst/</link>
      <guid isPermaLink="true">https://www.vectorian.be/projects/qrst/</guid>
      <description>&lt;p&gt;qrst is a hybrid search engine that runs entirely on-device. No API calls, no cloud, just a Rust binary and an embedding model. It combines BM25 keyword search (SQLite FTS5) with vector semantic search (ONNX embeddings + HNSW) and fuses results using configurable strategies including Reciprocal Rank Fusion and learned-to-rank models.&lt;/p&gt;</description>
      <content:encoded><![CDATA[<h2 id="context"><a href="#context">Context</a></h2>
<p>qrst is a hybrid search engine that runs entirely on-device. No API calls, no cloud, just a Rust binary and an embedding model. It combines BM25 keyword search (SQLite FTS5) with vector semantic search (ONNX embeddings + HNSW) and fuses results using configurable strategies including Reciprocal Rank Fusion and learned-to-rank models.</p>
<h2 id="architecture"><a href="#architecture">Architecture</a></h2>
<p>The engine indexes local files into two parallel stores: a full-text index backed by SQLite FTS5 for keyword search, and an HNSW vector index (usearch) for semantic nearest-neighbor lookup. At query time, both stores return ranked results that are combined through a pluggable fusion layer.</p>
<p>File parsing uses tree-sitter grammars for language-aware chunking across multiple file types, including Rust, JavaScript, TypeScript, HTML, CSS, and markdown. Each chunk is embedded using an ONNX model (currently EmbeddingGemma 300M, 768 dimensions) via the ort runtime.</p>
<p>The trait-based architecture keeps components swappable: <code>Embed</code> for the embedding backend, <code>VectorStore</code> for the ANN index, and <code>FusionStrategy</code> for result fusion. Adding a new embedding model or fusion method means implementing a single trait.</p>
<h2 id="search-modes"><a href="#search-modes">Search Modes</a></h2>
<ul>
<li><strong>Keyword</strong>: BM25 ranking via SQLite FTS5, fast and precise for exact term matches</li>
<li><strong>Semantic</strong>: Cosine similarity over dense embeddings, handles synonyms and paraphrases</li>
<li><strong>Hybrid</strong>: Fused ranking combining both signals, configurable via ConvexFusion (weighted blend), RRF, or a learned linear model</li>
</ul>
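<p>Reciprocal Rank Fusion is worth spelling out: each ranked list contributes 1/(k + rank) per document, and the contributions are summed. For a document ranked 1st by BM25 and 3rd by the vector index, with the usual k=60:</p>

```shell
# RRF: score(d) = sum over lists of 1 / (k + rank_in_list(d)), k = 60
awk 'BEGIN { k = 60; printf "%.4f\n", 1/(k+1) + 1/(k+3) }'
# → 0.0323
```

<p>The constant k damps the influence of top ranks, so a document that shows up at moderate ranks in both lists can outscore one that tops a single list.</p>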
<h2 id="evaluation"><a href="#evaluation">Evaluation</a></h2>
<p>On a 42-query evaluation corpus with graded relevance judgments:</p>
<table>
<thead>
<tr>
<th>Strategy</th>
<th>nDCG@5</th>
<th>P@3</th>
<th>MRR</th>
</tr>
</thead>
<tbody>
<tr>
<td>BM25 only</td>
<td>0.431</td>
<td>0.246</td>
<td>0.534</td>
</tr>
<tr>
<td>RRF (k=60)</td>
<td>0.794</td>
<td>0.476</td>
<td>0.903</td>
</tr>
<tr>
<td>Semantic only</td>
<td>0.827</td>
<td>0.500</td>
<td>0.880</td>
</tr>
</tbody>
</table>
<p>The benchmark suite (<code>qrst-bench</code>) supports bootstrap confidence intervals, inter-annotator agreement metrics, and automated SVG report generation.</p>
<h2 id="links"><a href="#links">Links</a></h2>
<ul>
<li><strong>Source:</strong> <a href="https://github.com/l1x/qrst">github.com/l1x/qrst</a></li>
<li><strong>License:</strong> MIT</li>
</ul>
]]></content:encoded>
      <author>Istvan</author>
      <pubDate>Sat, 01 Nov 2025 12:00:00 +0100</pubDate>
    </item>
  </channel>
</rss>
