How RAG, agentic orchestration, and bilingual alignment work together under the hood
Before diving into the architecture, here are the building blocks. Every piece of ProustGPT relies on these ideas.
**Embeddings.** A way to turn text into a list of numbers (a vector) that captures its meaning. "The madeleine triggered a flood of memory" and "A small cake unlocked forgotten recollections" produce similar vectors, even though the words are different. We use Cohere embed-v4.0 — a multilingual model, so French and English text about the same thing land near each other in vector space.
**Vector database.** A database optimized for storing and searching vectors. Instead of SQL queries (`WHERE title = ...`), you give it a vector and ask "find the 20 closest ones." We use Pinecone — it holds all ~12,900 Proust passages as 1536-dimensional vectors and returns the most semantically similar ones in milliseconds.
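That "find the closest ones" operation is just nearest-neighbor search under cosine similarity. Here is a toy, self-contained sketch of the idea with made-up 3-dimensional vectors (Pinecone does the same thing at scale, over 1536 dimensions, with an approximate index — none of this is ProustGPT code):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 = same direction, near 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "index": three passages with hand-made embeddings
index = {
    "the madeleine and memory": [0.9, 0.1, 0.2],
    "swann's jealousy of odette": [0.1, 0.9, 0.3],
    "a small cake unlocks recollection": [0.8, 0.2, 0.3],
}

def top_k(query_vec: list[float], k: int = 2) -> list[str]:
    """Return the k passages whose vectors are closest to the query."""
    ranked = sorted(index, key=lambda p: cosine(query_vec, index[p]), reverse=True)
    return ranked[:k]

# A query vector "about memory" lands near both memory passages,
# even though they share no words
print(top_k([0.85, 0.15, 0.25]))
# → ['the madeleine and memory', 'a small cake unlocks recollection']
```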
**LangChain.** A Python framework that provides standard interfaces for LLMs, embeddings, and vector stores. Instead of writing raw API calls to Cohere, Groq, and Pinecone, we use LangChain's abstractions — `CohereEmbeddings`, `ChatGroq`, `CohereRerank`. This makes it easy to swap providers without rewriting logic.
**LangGraph.** Built on top of LangChain, LangGraph lets you build stateful agents — LLMs that can call tools in a loop, observe results, and decide what to do next. Unlike a simple prompt-in/response-out flow, a LangGraph agent can make multiple tool calls before generating its final answer. We use it for the ReAct agent.
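The loop itself is conceptually simple. Here is a toy hand-rolled version (a scripted fake "LLM" standing in for the model, not LangGraph internals): each turn, the model either requests a tool or emits a final answer, and tool results are appended to the transcript it sees on the next turn.

```python
def fake_llm(transcript: list[str]) -> dict:
    """Stand-in for the model: asks for one search, then answers."""
    if not any(line.startswith("TOOL RESULT") for line in transcript):
        return {"tool": "search", "args": "madeleine scene"}
    return {"answer": "The madeleine scene appears in Swann's Way."}

def search(query: str) -> str:
    """Stand-in retrieval tool."""
    return f"Found 3 passages about: {query}"

TOOLS = {"search": search}

def react_loop(question: str, max_steps: int = 5) -> str:
    transcript = [f"QUESTION: {question}"]
    for _ in range(max_steps):
        step = fake_llm(transcript)
        if "answer" in step:                         # reason: model is done
            return step["answer"]
        result = TOOLS[step["tool"]](step["args"])   # act: call the tool
        transcript.append(f"TOOL RESULT: {result}")  # observe: feed result back
    return "Gave up after max_steps."

print(react_loop("Where does the madeleine scene appear?"))
# → The madeleine scene appears in Swann's Way.
```

LangGraph's value is handling the real version of this loop: message formats, tool-call parsing, state, and streaming.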
Every passage from Proust's seven volumes is embedded once, at indexing time, and stored in a Pinecone index as a searchable vector. Here's the actual code that queries that index — the `namespace` parameter selects which language's text to return:
```python
def _pinecone_query(
    query: str,
    top_k: int,
    lang: str = "en",
    metadata_filter: dict | None = None,
) -> list[Document]:
    embeddings = get_embeddings()
    query_vector = embeddings.embed_query(query)  # text → 1536-dim vector

    index = get_pinecone_index()
    results = index.query(
        vector=query_vector,
        top_k=top_k,            # fetch 20 candidates
        include_metadata=True,
        namespace=lang,         # "en" or "fr" namespace
    )

    # Build LangChain Document objects from Pinecone results
    docs = []
    for match in results.matches:
        metadata = match.metadata or {}
        text_en = metadata.pop("text", "")
        text_fr = metadata.pop("text_fr", "")
        # Use French text when requested and available; fall back to English
        page_content = text_fr if (lang == "fr" and text_fr) else text_en
        docs.append(Document(page_content=page_content, metadata=metadata))
    return docs
```
RAG (Retrieval-Augmented Generation) means: before the LLM answers, we first find relevant passages from Proust's text, then give those passages to the LLM as context. This grounds the response in the actual text instead of relying on the LLM's training data.
Retrieval runs in three stages: fetch `top_k` candidate passages from Pinecone, rerank them down to the top 5, and stitch adjacent passages together for continuity. The full pipeline in code — note how compact it is thanks to LangChain abstractions:
```python
def retrieve_passages(query: str, lang: str = "en") -> list[Document]:
    # Stage 1: Semantic search (20 candidates)
    candidates = _pinecone_query(query, top_k=config.RETRIEVAL_CANDIDATES, lang=lang)

    # Stage 2: Rerank to top 5
    reranker = get_reranker()
    reranked = list(reranker.compress_documents(candidates, query))

    # Stage 3: Stitch adjacent passages for context continuity
    return _stitch_context(reranked, lang=lang)


def stream_rag_response(query: str, lang: str = "en") -> Generator:
    docs = retrieve_passages(query, lang=lang)

    # Build context string with citation markers
    context = "\n\n---\n\n".join(
        f"[{i+1}] {doc.page_content}" for i, doc in enumerate(docs)
    )
    prompt = rag_template.format(context=context, question=query)

    # Stream LLM response token by token
    llm = get_llm()
    for chunk in llm.stream(prompt):
        yield {"type": "token", "token": chunk.content}

    yield {"type": "sources", "passages": _format_passages(docs)}
    yield {"type": "done", "done": True}
```
Simple RAG works for direct questions. But what about: "How does Swann's jealousy compare to the narrator's jealousy of Albertine?" That requires searching twice (once for each character), reading context around the results, and synthesizing across volumes. A single retrieval step can't do this.
This is where the ReAct agent comes in. ReAct stands for Reason + Act — the LLM thinks about what it needs, calls a tool, observes the result, thinks again, and repeats until it has enough information to answer.
Creating the agent is remarkably concise — LangGraph's create_react_agent handles the entire loop:
```python
from langgraph.prebuilt import create_react_agent

# Define the 6 tools the agent can use
_TOOLS = [
    search_passages,          # Semantic search (API call)
    search_by_volume,         # Volume-filtered search (API call)
    get_adjacent_passages,    # Read nearby passages (FREE)
    get_chapter_overview,     # Chapter start (FREE)
    find_character_mentions,  # Text search by name (FREE)
    get_toc,                  # Table of contents (FREE)
]

def create_proust_agent(lang: str = "en"):
    llm = get_llm()  # Groq / Kimi K2
    agent = create_react_agent(llm, _TOOLS, prompt=AGENT_SYSTEM_PROMPT)
    return agent  # Ready to .stream() or .invoke()
```
Each tool is a Python function decorated with @tool. The LLM sees the docstrings and decides which to call. Four of the six tools are free — they search the in-memory corpus without making any API calls:
| Tool | Cost | What It Does | When the Agent Uses It |
|---|---|---|---|
| `search_passages` | API | Semantic vector search across full corpus | General questions about themes, scenes, quotes |
| `search_by_volume` | API | Search filtered to one volume | Comparing across specific volumes |
| `get_adjacent_passages` | FREE | Read passages before/after a known index | "What happens next?" or expanding context |
| `get_chapter_overview` | FREE | First 5 passages of a chapter | Understanding chapter scope before searching |
| `find_character_mentions` | FREE | In-memory text search for character names | Character analysis, tracking appearances |
| `get_toc` | FREE | Full table of contents for all 7 volumes | Understanding novel structure, finding chapters |
Here's what a tool definition looks like — the docstring is what the LLM reads to decide whether to call it:
```python
@tool
def get_adjacent_passages(
    passage_index: int, before: int = 2, after: int = 2, lang: str = "en"
) -> str:
    """Get passages immediately before and/or after a known passage.

    Use this to read what happens next or before a scene you've already found.
    This is a FREE operation (no API calls) — prefer it over new searches when
    you already know the passage location.
    """
    results = []
    for offset in range(-before, after + 1):
        idx = passage_index + offset
        p = get_passage_text(idx, lang=lang)
        if p:
            results.append(
                f"[passage {idx}] ({p['book']}, {p['chapter']})\n{p['text'][:500]}"
            )
    return "\n\n---\n\n".join(results)
```
Not every query needs a multi-step agent. The complexity router examines the query and conversation history to decide whether to use fast single-step RAG or the full agent. This keeps simple questions fast and cheap.
The routing logic is a simple function — no ML model needed, just regex patterns and conversation-length heuristics:
```python
import re

# Patterns that indicate a complex query
_COMPLEX_PATTERNS = re.compile(
    r"\b("
    r"compare|comparison|vs\.?|versus|differ|difference|contrast"
    r"|evolve|evolution|develop|change over time|across volumes"
    r"|what happens (after|next|before|then)"
    r"|how does .+ change"
    r"|trace|track|arc|journey"
    r"|relationship between"
    r")\b",
    re.IGNORECASE,
)

def needs_agent(query: str, history: list | None = None) -> bool:
    if not config.AGENT_ENABLED:
        return False

    # 2+ prior user messages → likely needs conversational context
    if history:
        user_msgs = [m for m in history if m["role"] == "user"]
        if len(user_msgs) >= 2:
            return True

    # Query matches complexity patterns
    if _COMPLEX_PATTERNS.search(query):
        return True

    # Short follow-up (e.g. "tell me more", "and Albertine?")
    if history and len(query.split()) <= 6:
        return True

    return False
```
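To see the pattern heuristic in action, here is a self-contained demo that copies the regex above into a standalone check (no `config` flag or history handling, so it covers only the pattern branch):

```python
import re

# Copy of the complexity pattern, for a self-contained demo
COMPLEX = re.compile(
    r"\b("
    r"compare|comparison|vs\.?|versus|differ|difference|contrast"
    r"|evolve|evolution|develop|change over time|across volumes"
    r"|what happens (after|next|before|then)"
    r"|how does .+ change"
    r"|trace|track|arc|journey"
    r"|relationship between"
    r")\b",
    re.IGNORECASE,
)

def looks_complex(query: str) -> bool:
    """Pattern branch only: does the query match a complexity keyword?"""
    return bool(COMPLEX.search(query))

print(looks_complex("How does Swann's jealousy compare to the narrator's?"))  # → True
print(looks_complex("Who is Albertine?"))                                     # → False
print(looks_complex("Trace Odette's arc across volumes"))                     # → True
```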
And in the server, the routing is a single if statement:
```python
@app.post("/api/explore_lost_time/stream")
async def explore_lost_time_stream(body: QueryRequest):
    query = body.query or body.message or ""
    lang = body.lang or "en"
    history = [m.model_dump() for m in body.history] if body.history else None

    if needs_agent(query, history):
        # Complex → multi-step agent with tool calling
        return StreamingResponse(
            async_sse_generator(stream_agent_response, query, history, lang),
            media_type="text/event-stream",
        )

    # Simple → fast single-shot RAG
    return StreamingResponse(
        async_sse_generator(stream_rag_response, query, lang),
        media_type="text/event-stream",
    )
```
Both the fast RAG and agent paths stream their output as Server-Sent Events (SSE). This means the user sees tokens appear in real-time and gets status updates as the agent calls tools — instead of waiting 10 seconds for a complete response.
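SSE itself is a plain-text protocol: each event is a `data:` line followed by a blank line. The real `async_sse_generator`'s signature isn't shown here, but the core of any such wrapper looks roughly like this sketch (an assumption about its shape, not the project's code):

```python
import json
from collections.abc import Iterator

def to_sse(events: Iterator[dict]) -> Iterator[str]:
    """Serialize each event dict as one SSE frame: 'data: <json>\\n\\n'."""
    for event in events:
        yield f"data: {json.dumps(event)}\n\n"

# Example: the kinds of events stream_rag_response yields
frames = list(to_sse(iter([
    {"type": "token", "token": "Swann"},
    {"type": "done", "done": True},
])))
print(frames[0])  # a 'data: {...}' line, then the blank line that ends the frame
```

The browser's `EventSource` (or a manual `fetch` reader, as here) splits the stream on those blank lines and hands each JSON payload to the frontend.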
The frontend React hook processes each event type:
```typescript
// SSE event types the frontend handles
interface StreamEvent {
  type: 'token' | 'sources' | 'status' | 'done' | 'error';
  token?: string;
  passages?: Passage[];
  status?: string;
}

// Inside the streaming loop:
switch (event.type) {
  case 'status':
    setStatus(event.status); // "Searching for passages about..."
    break;
  case 'token':
    setStatus(null); // Clear status once tokens arrive
    setResponse(prev => prev + event.token);
    break;
  case 'sources':
    setPassages(event.passages); // Populate the ReaderPanel
    break;
}
```
Proust's English and French texts don't have matching paragraph structures. Translators split, merge, and restructure paragraphs differently. To show parallel bilingual text, we need to align them semantically.
Instead of aligning paragraphs (which have different granularity in each language), we split into individual sentences, embed them with Cohere, and use dynamic programming to find the optimal monotonic alignment:
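The dynamic programming step is essentially sequence alignment (Needleman–Wunsch style) over a sentence-similarity matrix. Here is a toy illustration of one standard formulation, with a hand-made similarity matrix standing in for real Cohere embeddings (the project's exact scoring may differ): each cell holds the similarity of English sentence i and French sentence j, and the DP finds the highest-scoring monotonic path, paying a small penalty to skip an unmatched sentence on either side.

```python
def align(sim: list[list[float]], skip_penalty: float = -0.2) -> list[tuple[int, int]]:
    """Monotonic 1-1 alignment maximizing total similarity.

    sim[i][j] = similarity of English sentence i and French sentence j.
    Returns aligned (i, j) index pairs, in order.
    """
    n, m = len(sim), len(sim[0])
    NEG = float("-inf")
    # score[i][j] = best score aligning first i English / first j French sentences
    score = [[NEG] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0.0
    back: dict[tuple[int, int], tuple[int, int]] = {}
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0 and score[i-1][j-1] + sim[i-1][j-1] > score[i][j]:
                score[i][j] = score[i-1][j-1] + sim[i-1][j-1]
                back[(i, j)] = (i - 1, j - 1)   # match sentence i-1 with j-1
            if i > 0 and score[i-1][j] + skip_penalty > score[i][j]:
                score[i][j] = score[i-1][j] + skip_penalty
                back[(i, j)] = (i - 1, j)       # skip an English sentence
            if j > 0 and score[i][j-1] + skip_penalty > score[i][j]:
                score[i][j] = score[i][j-1] + skip_penalty
                back[(i, j)] = (i, j - 1)       # skip a French sentence
    # Trace back the best path, collecting matched pairs
    pairs, cell = [], (n, m)
    while cell != (0, 0):
        prev = back[cell]
        if prev == (cell[0] - 1, cell[1] - 1):
            pairs.append(prev)
        cell = prev
    return pairs[::-1]

# 3 English vs 4 French sentences; French sentence 1 has no English match
sim = [
    [0.9, 0.1, 0.2, 0.1],
    [0.2, 0.1, 0.8, 0.2],
    [0.1, 0.2, 0.1, 0.9],
]
print(align(sim))  # → [(0, 0), (1, 2), (2, 3)]
```

Monotonicity (the path only moves forward in both texts) is what makes this safe for Proust: translators reorder within sentences, but the sentence order of the narrative itself is preserved.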