# Building a Calm Copilot: Streaming with SSE + Hybrid Retrieval

I wanted a Copilot that feels **instant**, **honest**, and **small enough to maintain**. This post documents the first stable version that now powers Bambicim’s chat.

**Guiding principles**
- **Calm UX:** no spinner purgatory—stream tokens immediately.
- **Transparency:** show sources and keep answers grounded.
- **Tiny ops:** one database, one web app, predictable caching.

---

## Experience goals

- **p50 first token < 800ms**, **p95 < 1.8s** (from request to first streamed byte).
- Always show **citations**; answers without grounding are discouraged.
- **Graceful failure:** time out generation, stream partials, never freeze the UI.

---

## High-level architecture



```
User ──> /copilot/chat?q=... ──> Retrieve (BM25 + Vector)
              └─> Build prompt with cites
                    └─> LLM stream (SSE) ──> UI appends tokens
```


- **Hybrid retrieval** gives resilient recall:
  - **BM25** wins on proper nouns and exact phrases.
  - **Vectors** win on fuzzy/semantic matches.
- A lightweight **re-rank** nudges the final set.

---

## Retrieval and indexing

**Chunking**
- ~800 tokens per chunk with a 150-token overlap; each chunk stores `url`, `title`, `section`.
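
A rough sketch of that chunker; it uses whitespace "tokens" as a stand-in for a real tokenizer, and the helper name is illustrative:

```python
# Illustrative chunker; swap the whitespace split for your real tokenizer.
def chunk_document(text: str, url: str, title: str, section: str,
                   size: int = 800, overlap: int = 150) -> list[dict]:
    tokens = text.split()
    chunks, start = [], 0
    while start < len(tokens):
        window = tokens[start:start + size]
        chunks.append({
            "content": " ".join(window),
            "url": url,
            "title": title,
            "section": section,
        })
        start += size - overlap  # slide forward, keeping ~150 tokens of overlap
    return chunks
```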

**Storage**
- Postgres with `pgvector` (or FAISS if you prefer), plus normal text search for BM25-like scoring.
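
If you go the `pgvector` route, the table can look roughly like this; the table/column names and the 1536-dim embedding size are illustrative, and the raw SQL can live in a Django migration:

```python
# Illustrative migration: pgvector column for embeddings plus a tsvector for
# BM25-like text search. Names and dimensions are examples, not the real schema.
from django.db import migrations

class Migration(migrations.Migration):
    dependencies = []  # hook this up to the app's previous migration

    operations = [
        migrations.RunSQL([
            "CREATE EXTENSION IF NOT EXISTS vector;",
            """
            CREATE TABLE IF NOT EXISTS copilot_chunks (
                id        bigserial PRIMARY KEY,
                url       text NOT NULL,
                title     text NOT NULL,
                section   text,
                content   text NOT NULL,
                embedding vector(1536),
                tsv       tsvector GENERATED ALWAYS AS (to_tsvector('english', content)) STORED
            );
            """,
            "CREATE INDEX IF NOT EXISTS copilot_chunks_tsv_idx ON copilot_chunks USING gin (tsv);",
        ]),
    ]
```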

**Hybrid score**


```
score = 0.6 * cosine(emb_query, emb_chunk)
      + 0.4 * bm25(query, chunk)
```

I bump the BM25 weight if the query contains quotes or many capitals (likely exact phrases).
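
In code, the blend plus the weight bump looks roughly like this; the exact-phrase heuristic and the bumped weights are illustrative, and it assumes you already have a cosine similarity and a BM25 score per chunk:

```python
# Illustrative hybrid scorer; the threshold and bumped weights are examples, not tuned values.
def hybrid_score(query: str, cosine_sim: float, bm25: float) -> float:
    w_vec, w_bm25 = 0.6, 0.4
    capitalized = sum(1 for w in query.split() if w[:1].isupper())
    if '"' in query or capitalized >= 3:
        w_vec, w_bm25 = 0.45, 0.55  # lean on BM25 for quoted phrases and proper nouns
    return w_vec * cosine_sim + w_bm25 * bm25
```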

**Indexer**
- Daily scheduled job (and on-write reindex).
- Markdown → HTML → plain text normalization so we don’t embed noise like code fences or nav.
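
One way to do that normalization, sketched with the `markdown` and `beautifulsoup4` packages; the exact tag list to strip is illustrative:

```python
# Illustrative normalization: Markdown -> HTML -> plain text, dropping code blocks and nav.
import markdown
from bs4 import BeautifulSoup

def to_plain_text(md_source: str) -> str:
    html = markdown.markdown(md_source, extensions=["fenced_code"])
    soup = BeautifulSoup(html, "html.parser")
    for noisy in soup(["pre", "code", "nav", "script", "style"]):
        noisy.decompose()
    return " ".join(soup.get_text(separator=" ").split())
```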

---

## Streaming with Server-Sent Events

I chose **SSE** over WebSockets because the traffic is **one-way** (server → client), proxies are friendlier, and the code is tiny.

**Django view (trimmed):**
```python
# views.py
import json

from django.http import StreamingHttpResponse

from .retrieval import retrieve
from .llm import stream_tokens


def copilot_chat(request):
    q = request.GET.get("q", "").strip()
    top = retrieve(q)  # [{"content": "...", "url": "...", "title": "..."}]

    def events():
        # Send citations first so the UI can render them before any tokens arrive.
        yield f"event: context\ndata: {json.dumps(top[:3])}\n\n"
        for token in stream_tokens(q, top):
            yield f"data: {token}\n\n"
        yield "event: done\ndata: {}\n\n"

    resp = StreamingHttpResponse(events(), content_type="text/event-stream")
    resp["Cache-Control"] = "no-cache"
    return resp
```

**Client:**

```js
const es = new EventSource(`/copilot/chat?q=${encodeURIComponent(input.value)}`);
es.addEventListener("context", (e) => showCitations(JSON.parse(e.data)));
es.onmessage = (e) => appendToken(e.data);
es.addEventListener("done", () => es.close());
```


SSE keeps the UI predictable: the user sees context first, then tokens; if something fails, they still have the citations and partial text.

---

## Prompt building (honest by default)

- The system prompt is short: describe the assistant, how to cite, and tone.
- I pass the top chunks as `source[n]` blocks with `url` and `title`.
- The model is asked to quote minimally and include inline cite tags like `[1]`, `[2]`, which I map back to URLs in the UI.
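
Roughly what the assembly looks like; the system wording and the `build_prompt` helper here are illustrative, not the production prompt:

```python
# Illustrative prompt assembly; the system text and helper name are examples.
SYSTEM = (
    "You are Bambicim's site copilot. Answer only from the numbered sources. "
    "Quote minimally and cite with [n]. If the sources don't cover it, say so."
)

def build_prompt(question: str, chunks: list[dict]) -> str:
    sources = "\n\n".join(
        f"source[{i}] ({c['title']}, {c['url']}):\n{c['content']}"
        for i, c in enumerate(chunks, start=1)
    )
    return f"{SYSTEM}\n\n{sources}\n\nQuestion: {question}\nAnswer with [n] citations."
```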

**Guardrails**

- Truncate combined sources to a token budget.
- If retrieval is empty, respond with a friendly “I don’t have a grounded answer yet” and suggest site areas to search.
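
A minimal sketch of both guardrails, assuming a crude four-characters-per-token estimate; the budget number, helper names, and fallback wording are placeholders:

```python
# Illustrative guardrails; the budget and the 4-chars-per-token estimate are rough placeholders.
SOURCE_TOKEN_BUDGET = 2400

def truncate_sources(chunks: list[dict], budget: int = SOURCE_TOKEN_BUDGET) -> list[dict]:
    kept, used = [], 0
    for c in chunks:
        cost = len(c["content"]) // 4  # crude token estimate
        if used + cost > budget:
            break
        kept.append(c)
        used += cost
    return kept

FALLBACK = (
    "I don't have a grounded answer yet. "
    "Try the blog archive or the project pages and ask again."
)
```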

---

## Caching, timeouts, and cost

- Retrieval results cached by `(normalized_query)` for 2–10 minutes (sketch below).
- Generation cut at 60–90s; the stream ends with a `done` event either way.
- I log `tokens_out` and tokens/sec to keep an eye on costs and perceived speed.
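
The retrieval cache is a thin wrapper around Django's cache framework; the key prefix and TTL below are examples:

```python
# Illustrative retrieval cache; key prefix and TTL are examples, not the real settings.
import hashlib
import re

from django.core.cache import cache

from .retrieval import retrieve  # same retrieve() used by the view

def normalize_query(q: str) -> str:
    return re.sub(r"\s+", " ", q.strip().lower())

def cached_retrieve(q: str, ttl: int = 300):
    key = "copilot:retrieve:" + hashlib.sha1(normalize_query(q).encode()).hexdigest()
    hit = cache.get(key)
    if hit is None:
        hit = retrieve(q)
        cache.set(key, hit, ttl)  # 2–10 minutes in practice
    return hit
```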

---

## Observability (just enough)

- **Metrics:** p50/p95 first-byte, total latency, error rate (by cause), cache hit rate.
- **Logs:** one record per chat with query, top sources, model, duration, bytes (sketch below).
- **Feature toggles** let me adjust the hybrid weights or disable re-ranking without redeploy.
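
The per-chat record is one structured log line; the field names here are illustrative:

```python
# Illustrative per-chat log record; field names are not the exact production schema.
import json
import logging

logger = logging.getLogger("copilot.chat")

def log_chat(query, sources, model, duration_s, bytes_out, tokens_out):
    logger.info(json.dumps({
        "query": query,
        "sources": [s["url"] for s in sources[:3]],
        "model": model,
        "duration_s": round(duration_s, 3),
        "bytes": bytes_out,
        "tokens_out": tokens_out,
        "tokens_per_s": round(tokens_out / duration_s, 1) if duration_s else None,
    }))
```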

---

## What went wrong (and got fixed)

**Media vs. static confusion** broke blog images used as citations in early tests. Root cause: missing `/media/` route and incorrect `MEDIA_URL`. Fixed by adding a media handler in `urls.py` and double-checking storage backends.
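
The media handler itself is the stock Django pattern, sketched here for a dev setup (production serves media from the storage backend; the `core.urls` include is illustrative):

```python
# urls.py: standard pattern for serving media files in development (sketch).
from django.conf import settings
from django.conf.urls.static import static
from django.urls import include, path

urlpatterns = [
    path("", include("core.urls")),  # illustrative include
]

if settings.DEBUG:
    urlpatterns += static(settings.MEDIA_URL, document_root=settings.MEDIA_ROOT)
```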

**Manifest 500s:** `{% static 'images/png/og-default.png' %}` wasn’t in the source dir, so `CompressedManifestStaticFilesStorage` 500’d the homepage. I added the file under `core/static/...` and recollected. Now I run a static assets smoke test after deploy.
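
The smoke test is deliberately boring: after `collectstatic`, assert that a short list of must-have assets exists. The paths and function name here are examples:

```python
# Illustrative post-deploy smoke test: fail loudly if a must-have asset
# didn't make it through collectstatic. Paths are examples.
from django.contrib.staticfiles.storage import staticfiles_storage

MUST_EXIST = [
    "images/png/og-default.png",
    "css/main.css",
]

def check_static_assets() -> None:
    missing = [p for p in MUST_EXIST if not staticfiles_storage.exists(p)]
    if missing:
        raise SystemExit(f"Missing static assets after collectstatic: {missing}")
```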

---

## Roadmap

- **Why-this-source tooltips** that highlight the matching sentence.
- **Follow-up question generator** that helps users refine queries when recall looks weak.
- **Eval harness** with a tiny set of prompts to track quality across model versions.
- **Team mode** later: private docs per project, same pipeline.

---

## Try it & tell me what hurts

You can open the Copilot from the site header. If it confuses you, answers slowly, or cites the wrong thing, please send me the query. This project gets better with real feedback—and I’m building it out in the open for that reason.

Thanks for reading. Onward to a calmer, faster helper.