ProustGPT Architecture

How RAG, agentic orchestration, and bilingual alignment work together under the hood

Foundational Concepts

Before diving into the architecture, here are the building blocks. Every piece of ProustGPT relies on these ideas.

Core Idea

Embeddings

A way to turn text into a list of numbers (a vector) that captures its meaning. "The madeleine triggered a flood of memory" and "A small cake unlocked forgotten recollections" produce similar vectors, even though the words are different. We use Cohere embed-v4.0 — a multilingual model, so French and English text about the same thing land near each other in vector space.
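The "similar meaning, nearby vectors" idea boils down to cosine similarity. Here is a minimal, self-contained sketch; the 4-dimensional toy vectors are made up for illustration (real embed-v4.0 vectors have 1536 dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dim "embeddings" (hypothetical values, not real Cohere output)
madeleine = [0.9, 0.1, 0.3, 0.0]   # "The madeleine triggered a flood of memory"
small_cake = [0.8, 0.2, 0.4, 0.1]  # "A small cake unlocked forgotten recollections"
weather = [0.0, 0.9, 0.0, 0.8]     # "It rained heavily in Paris today"

print(cosine_similarity(madeleine, small_cake))  # high: similar meaning
print(cosine_similarity(madeleine, weather))     # low: unrelated
```

The two memory-themed sentences score far higher against each other than against the unrelated one, which is exactly the property the retrieval pipeline depends on.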

Core Idea

Vector Database

A database optimized for storing and searching vectors. Instead of SQL queries (WHERE title = ...), you give it a vector and ask "find the 20 closest ones." We use Pinecone — it holds all ~12,900 Proust passages as 1536-dimensional vectors and returns the most semantically similar ones in milliseconds.
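What Pinecone does at scale can be sketched in miniature: score every stored vector against the query and keep the k closest. This toy index uses a brute-force dot product over a hypothetical three-passage corpus; Pinecone adds approximate indexing, metadata filtering, and namespaces on top:

```python
def top_k(query: list[float], index: dict[str, list[float]], k: int) -> list[str]:
    """Brute-force nearest neighbors by dot product (vectors assumed normalized)."""
    scored = [
        (sum(q * v for q, v in zip(query, vec)), pid)
        for pid, vec in index.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [pid for _, pid in scored[:k]]

# Tiny toy "index": passage id -> vector (the real index is ~12,900 x 1536 dims)
index = {
    "madeleine": [0.9, 0.1, 0.2],
    "jealousy":  [0.1, 0.9, 0.1],
    "sleep":     [0.7, 0.2, 0.5],
}
print(top_k([1.0, 0.0, 0.1], index, k=2))  # → ['madeleine', 'sleep']
```

This is the whole conceptual model behind "find the 20 closest ones"; everything else is engineering to make it fast at scale.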

Framework

LangChain

A Python framework that provides standard interfaces for LLMs, embeddings, and vector stores. Instead of writing raw API calls to Cohere, Groq, and Pinecone, we use LangChain's abstractions — CohereEmbeddings, ChatGroq, CohereRerank. This makes it easy to swap providers without rewriting logic.

Framework

LangGraph

Built on top of LangChain, LangGraph lets you build stateful agents — LLMs that can call tools in a loop, observe results, and decide what to do next. Unlike a simple prompt-in/response-out flow, a LangGraph agent can make multiple tool calls before generating its final answer. We use it for the ReAct agent.

Why not just use LangChain for everything? LangChain is great for linear pipelines (embed → retrieve → generate). But when you need an LLM to decide which tools to call, how many times, and in what order — that's where LangGraph comes in. It adds the control flow that makes an agent possible.

Vector Database: How Passages Are Stored

Every passage from Proust's 7 volumes is embedded and stored in a Pinecone index. Here's how a passage goes from text to searchable vector:

Proust passage ("For a long time I used to go to bed early...", vol: 1, ch: Overture) → embed with Cohere embed-v4.0 (1536 dimensions) → vector [0.021, -0.183, 0.447, ... ×1536] → store in Pinecone (12,900 vectors + metadata: text, text_fr, volume, chapter).
Key insight: Multilingual search for free. Because Cohere embed-v4.0 is multilingual, a French query like "la jalousie de Swann" produces a vector that's close to English passages about Swann's jealousy. We only embed the English text — French queries just work without needing a separate French index.

Here's the actual code that queries Pinecone. The namespace parameter selects which language's text to return:

backend/retrieval.py

```python
def _pinecone_query(
    query: str,
    top_k: int,
    lang: str = "en",
    metadata_filter: dict | None = None,
) -> list[Document]:
    embeddings = get_embeddings()
    query_vector = embeddings.embed_query(query)  # text → 1536-dim vector

    index = get_pinecone_index()
    results = index.query(
        vector=query_vector,
        top_k=top_k,           # fetch 20 candidates
        include_metadata=True,
        namespace=lang,        # "en" or "fr" namespace
    )

    # Build LangChain Document objects from Pinecone results
    docs = []
    for match in results.matches:
        metadata = match.metadata or {}
        text_en = metadata.pop("text", "")
        text_fr = metadata.pop("text_fr", "")
        # Use French text when requested
        page_content = text_fr if (lang == "fr" and text_fr) else text_en
        docs.append(Document(page_content=page_content, metadata=metadata))
    return docs
```

RAG Pipeline: From Question to Answer

RAG (Retrieval-Augmented Generation) means: before the LLM answers, we first find relevant passages from Proust's text, then give those passages to the LLM as context. This grounds the response in the actual text instead of relying on the LLM's training data.

"What is involuntary memory?" → 1. embed (Cohere embed-v4.0) → 2. search, k=20 (Pinecone, 20 candidates) → 3. rerank to top 5 (Cohere rerank-v3.5, 5 best passages) → 4. generate (Groq / Kimi K2, 131K context). A context-stitching step fetches adjacent passages if a retrieved passage starts or ends mid-sentence.
Why rerank? Embedding search is fast but approximate — it finds 20 "close enough" passages. The Cohere reranker then reads the actual text of all 20 candidates against the query and picks the 5 that are genuinely most relevant. This two-stage approach gives better results than just increasing top_k.
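The overfetch-then-rerank pattern, stripped to its skeleton. The scoring functions here are stand-ins: cheap_score plays the role of vector search and exact_score the role of the Cohere reranker reading full text (in this toy both are simple word overlap, which is purely illustrative):

```python
from typing import Callable

def two_stage_retrieve(
    query: str,
    corpus: list[str],
    cheap_score: Callable[[str, str], float],  # fast, approximate (vector search)
    exact_score: Callable[[str, str], float],  # slow, accurate (reranker)
    candidates: int = 20,
    final: int = 5,
) -> list[str]:
    # Stage 1: overfetch a candidate pool with the cheap scorer
    pool = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)[:candidates]
    # Stage 2: rerank only the small pool with the expensive scorer
    return sorted(pool, key=lambda d: exact_score(query, d), reverse=True)[:final]

# Toy scorer: shared-word count between query and document
def overlap(q: str, d: str) -> float:
    return len(set(q.lower().split()) & set(d.lower().split()))

docs = [
    "Swann felt a pang of jealousy",
    "the sea at Balbec",
    "jealousy consumed Swann nightly",
]
print(two_stage_retrieve("Swann jealousy", docs, overlap, overlap, candidates=3, final=2))
```

The key economics: the expensive scorer only ever sees `candidates` documents, never the whole corpus, so you can afford a much more accurate model in stage 2.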

The full pipeline in code — note how compact it is thanks to LangChain abstractions:

backend/retrieval.py

```python
def retrieve_passages(query: str, lang: str = "en") -> list[Document]:
    # Stage 1: Semantic search (20 candidates)
    candidates = _pinecone_query(query, top_k=config.RETRIEVAL_CANDIDATES, lang=lang)

    # Stage 2: Rerank to top 5
    reranker = get_reranker()
    reranked = list(reranker.compress_documents(candidates, query))

    # Stage 3: Stitch adjacent passages for context continuity
    return _stitch_context(reranked, lang=lang)


def stream_rag_response(query: str, lang: str = "en") -> Generator:
    docs = retrieve_passages(query, lang=lang)

    # Build context string with citation markers
    context = "\n\n---\n\n".join(
        f"[{i+1}] {doc.page_content}" for i, doc in enumerate(docs)
    )
    prompt = rag_template.format(context=context, question=query)

    # Stream LLM response token by token
    llm = get_llm()
    for chunk in llm.stream(prompt):
        yield {"type": "token", "token": chunk.content}

    yield {"type": "sources", "passages": _format_passages(docs)}
    yield {"type": "done", "done": True}
```

Agentic System: When RAG Isn't Enough

Simple RAG works for direct questions. But what about: "How does Swann's jealousy compare to the narrator's jealousy of Albertine?" That requires searching twice (once for each character), reading context around the results, and synthesizing across volumes. A single retrieval step can't do this.

This is where the ReAct agent comes in. ReAct stands for Reason + Act — the LLM thinks about what it needs, calls a tool, observes the result, thinks again, and repeats until it has enough information to answer.

ReAct agent loop: THINK ("I need to find passages about Swann's jealousy") → ACT (search_passages("Swann jealousy")) → OBSERVE (5 passages about Swann's torment returned) → loop up to 4×; once there is enough info → RESPOND (synthesize findings into flowing prose with [1], [2] citations). Example multi-step run: 1. search_passages("Swann") → 2. search_passages("Albertine") → 3. get_adjacent(#4521) → 4. final synthesis.
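The think/act/observe loop can be sketched without any framework: call the model, execute whatever tool it requests, append the observation to the transcript, and repeat up to a step cap. Everything below is hypothetical scaffolding; fake_llm stands in for the real model deciding which tool to call:

```python
def react_loop(question: str, llm, tools: dict, max_steps: int = 4) -> str:
    """Minimal ReAct skeleton: think -> act -> observe, until a final answer."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        decision = llm(transcript)  # THINK: model returns a tool call or a final answer
        if decision["type"] == "final":
            return decision["answer"]
        observation = tools[decision["tool"]](decision["arg"])        # ACT
        transcript.append(f"{decision['tool']} -> {observation}")      # OBSERVE
    return "Ran out of steps before reaching an answer."

# Hypothetical stand-ins for the real LLM and search tool
def fake_llm(transcript):
    if len(transcript) == 1:  # nothing observed yet, so search first
        return {"type": "tool", "tool": "search_passages", "arg": "Swann jealousy"}
    return {"type": "final", "answer": "Swann's jealousy is a consuming obsession [1]."}

tools = {"search_passages": lambda q: f"5 passages about {q}"}
print(react_loop("Describe Swann's jealousy", fake_llm, tools))
```

LangGraph's create_react_agent implements this same loop, but with real tool schemas, message state, and streaming built in.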

Creating the agent is remarkably concise — LangGraph's create_react_agent handles the entire loop:

backend/agent.py

```python
from langgraph.prebuilt import create_react_agent

# Define the 6 tools the agent can use
_TOOLS = [
    search_passages,          # Semantic search (API call)
    search_by_volume,         # Volume-filtered search (API call)
    get_adjacent_passages,    # Read nearby passages (FREE)
    get_chapter_overview,     # Chapter start (FREE)
    find_character_mentions,  # Text search by name (FREE)
    get_toc,                  # Table of contents (FREE)
]

def create_proust_agent(lang: str = "en"):
    llm = get_llm()  # Groq / Kimi K2
    agent = create_react_agent(llm, _TOOLS, prompt=AGENT_SYSTEM_PROMPT)
    return agent  # Ready to .stream() or .invoke()
```

Agent Tools

Each tool is a Python function decorated with @tool. The LLM sees the docstrings and decides which to call. Four of the six tools are free — they search the in-memory corpus without making any API calls:

| Tool | Cost | What It Does | When the Agent Uses It |
| --- | --- | --- | --- |
| search_passages | API | Semantic vector search across the full corpus | General questions about themes, scenes, quotes |
| search_by_volume | API | Search filtered to one volume | Comparing across specific volumes |
| get_adjacent_passages | FREE | Read passages before/after a known index | "What happens next?" or expanding context |
| get_chapter_overview | FREE | First 5 passages of a chapter | Understanding chapter scope before searching |
| find_character_mentions | FREE | In-memory text search for character names | Character analysis, tracking appearances |
| get_toc | FREE | Full table of contents for all 7 volumes | Understanding novel structure, finding chapters |
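What makes a tool "FREE" is that it scans the in-memory corpus with plain string matching, no embeddings or API calls. A sketch of that idea, over a hypothetical corpus structure (a list of passage dicts, not the project's actual data model):

```python
def find_character_mentions(name: str, corpus: list[dict], limit: int = 3) -> list[int]:
    """Case-insensitive in-memory search; returns matching passage indices."""
    needle = name.lower()
    hits = [p["index"] for p in corpus if needle in p["text"].lower()]
    return hits[:limit]

# Hypothetical in-memory corpus (the real one holds ~12,900 passages)
corpus = [
    {"index": 0, "text": "Swann arrived at the Verdurins' salon."},
    {"index": 1, "text": "The sea at Balbec was grey that morning."},
    {"index": 2, "text": "Odette's absence tormented Swann."},
]
print(find_character_mentions("Swann", corpus))  # → [0, 2]
```

A linear scan over ~12,900 short passages completes in milliseconds, which is why the agent is encouraged to prefer these tools over fresh API-backed searches.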

Here's what a tool definition looks like — the docstring is what the LLM reads to decide whether to call it:

backend/agent.py

```python
@tool
def get_adjacent_passages(
    passage_index: int, before: int = 2, after: int = 2, lang: str = "en"
) -> str:
    """Get passages immediately before and/or after a known passage.

    Use this to read what happens next or before a scene you've already
    found. This is a FREE operation (no API calls) — prefer it over new
    searches when you already know the passage location.
    """
    results = []
    for offset in range(-before, after + 1):
        idx = passage_index + offset
        p = get_passage_text(idx, lang=lang)
        if p:
            results.append(
                f"[passage {idx}] ({p['book']}, {p['chapter']})\n{p['text'][:500]}"
            )
    return "\n\n---\n\n".join(results)
```

Two Agent Modes

Explore Mode: literary analysis, with all 6 tools available (search_passages and search_by_volume cost API calls; get_adjacent_passages, get_chapter_overview, find_character_mentions, and get_toc are free). Reflect Mode: introspective, with 3 tools (search_passages, get_adjacent_passages, find_character_mentions) used sparingly; the default is pure conversation, reaching for the text only when the user's experience calls for it.

Complexity Routing: Choosing the Right Path

Not every query needs a multi-step agent. The complexity router examines the query and conversation history to decide whether to use fast single-step RAG or the full agent. This keeps simple questions fast and cheap.

Incoming query + history → needs_agent() complexity check → False: fast RAG (~2-3 seconds); True: ReAct agent (~5-15 seconds).

Triggers fast RAG: "What is the madeleine scene about?", "Tell me about Combray", "Quote about time and memory" (simple, self-contained questions). Triggers the agent: "Compare Swann's jealousy to the narrator's", "How does the theme of memory evolve?", "What happens after the Guermantes party?" (comparative, sequential, or follow-up queries).

The routing logic is a simple function — no ML model needed, just regex patterns and conversation-length heuristics:

backend/agent.py

```python
# Patterns that indicate a complex query
_COMPLEX_PATTERNS = re.compile(
    r"\b("
    r"compare|comparison|vs\.?|versus|differ|difference|contrast"
    r"|evolve|evolution|develop|change over time|across volumes"
    r"|what happens (after|next|before|then)"
    r"|how does .+ change"
    r"|trace|track|arc|journey"
    r"|relationship between"
    r")\b",
    re.IGNORECASE,
)

def needs_agent(query: str, history: list | None = None) -> bool:
    if not config.AGENT_ENABLED:
        return False

    # 2+ prior messages → likely needs conversational context
    if history:
        user_msgs = [m for m in history if m["role"] == "user"]
        if len(user_msgs) >= 2:
            return True

    # Query matches complexity patterns
    if _COMPLEX_PATTERNS.search(query):
        return True

    # Short follow-up (e.g. "tell me more", "and Albertine?")
    if history and len(query.split()) <= 6:
        return True

    return False
```

And in the server, the routing is a single if statement:

backend/server.py

```python
@app.post("/api/explore_lost_time/stream")
async def explore_lost_time_stream(body: QueryRequest):
    query = body.query or body.message or ""
    lang = body.lang or "en"
    history = [m.model_dump() for m in body.history] if body.history else None

    if needs_agent(query, history):
        # Complex → multi-step agent with tool calling
        return StreamingResponse(
            async_sse_generator(stream_agent_response, query, history, lang),
            media_type="text/event-stream",
        )

    # Simple → fast single-shot RAG
    return StreamingResponse(
        async_sse_generator(stream_rag_response, query, lang),
        media_type="text/event-stream",
    )
```

Streaming: Backend to Frontend

Both the fast RAG and agent paths stream their output as Server-Sent Events (SSE). This means the user sees tokens appear in real-time and gets status updates as the agent calls tools — instead of waiting 10 seconds for a complete response.
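On the wire, each event is a JSON payload framed in SSE's `data: ...` format, with a blank line terminating each message. A minimal serializer for the event dicts the backend generators yield (a hypothetical helper, not the project's actual code):

```python
import json

def sse_format(event: dict) -> str:
    """Frame one event dict as a Server-Sent Events message."""
    return f"data: {json.dumps(event)}\n\n"

print(sse_format({"type": "token", "token": "Swann's"}), end="")
```

The double newline is what lets the browser's EventSource (or a manual fetch-stream parser) split the stream back into discrete events.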

The backend (Python) emits a sequence of JSON events over SSE (text/event-stream):

{"type":"status", "status":"Searching..."}
{"type":"token", "token":"Swann's"}
{"type":"token", "token":" jealousy"}
{"type":"sources", "passages":[...]}
{"type":"done", "done":true}

The frontend (React) renders them in order: the status event shows "Searching for passages...", each token event is appended to the growing response ("Swann's jealousy manifests as a consuming..."), and the sources event populates the citation list ([1] Swann in Love, [2] The Guermantes Way, [3] The Captive).

The frontend React hook processes each event type:

src/hooks/useStreamingQuery.ts

```typescript
// SSE event types the frontend handles
interface StreamEvent {
  type: 'token' | 'sources' | 'status' | 'done' | 'error';
  token?: string;
  passages?: Passage[];
  status?: string;
}

// Inside the streaming loop:
switch (event.type) {
  case 'status':
    setStatus(event.status);  // "Searching for passages about..."
    break;
  case 'token':
    setStatus(null);  // Clear status once tokens arrive
    setResponse(prev => prev + event.token);
    break;
  case 'sources':
    setPassages(event.passages);  // Populate the ReaderPanel
    break;
}
```

Bilingual Alignment: Matching EN and FR Text

Proust's English and French texts don't have matching paragraph structures. Translators split, merge, and restructure paragraphs differently. To show parallel bilingual text, we need to align them semantically.

The Problem

Why position-based alignment fails: the English text has 872 paragraphs, the French only 417. Translators merge and split freely; for example, the single FR ¶2 ("Parfois, comme Eve...") combines the content of EN ¶2 ("Sometimes, when I had put..."), ¶3 ("the candle was still lit..."), and ¶4 ("I had not ceased while..."). Matching paragraphs by position therefore drifts out of sync almost immediately.

The Solution: Sentence-Level DP Alignment

Instead of aligning paragraphs (which have different granularity in each language), we split into individual sentences, embed them with Cohere, and use dynamic programming to find the optimal monotonic alignment:

Sentence-level alignment with DP: each EN sentence ("For a long time I used to go...", "Sometimes, when I had put out...", "my eyes would close so quickly...", ...) is scored against candidate FR sentences ("Longtemps, je me suis couché...", "Parfois, à peine ma bougie...", "mes yeux se fermaient si vite...", ...) by cosine similarity (e.g. cos=0.92 for the opening pair). The dynamic programming constraint: EN sentence i must map to an FR index >= the index that EN sentence i-1 mapped to. This preserves reading order while maximizing total similarity.
Result: ~12,900 bilingual passage pairs with 99.98% French coverage. The sentence-level approach fixed a ~18% misalignment rate that the previous paragraph-level method produced.
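The DP step can be sketched over a precomputed similarity matrix: for each EN sentence, choose the FR sentence that maximizes total similarity, subject to the mapping never moving backwards. The matrix below is hypothetical (real scores would be cosine similarities of Cohere sentence embeddings), and the O(n·m²) formulation favors clarity over speed:

```python
def align_monotonic(sim: list[list[float]]) -> list[int]:
    """For each EN sentence i, pick FR index a[i] with a[i] >= a[i-1],
    maximizing the total similarity sum."""
    n, m = len(sim), len(sim[0])
    NEG = float("-inf")
    # best[i][j] = max total score aligning EN 0..i with a[i] == j
    best = [[NEG] * m for _ in range(n)]
    back = [[0] * m for _ in range(n)]
    for j in range(m):
        best[0][j] = sim[0][j]
    for i in range(1, n):
        for j in range(m):
            # previous EN sentence must map to some FR index k <= j
            k = max(range(j + 1), key=lambda k: best[i - 1][k])
            best[i][j] = best[i - 1][k] + sim[i][j]
            back[i][j] = k
    # Recover the alignment by backtracking from the best final column
    j = max(range(m), key=lambda j: best[n - 1][j])
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    return path[::-1]

# Hypothetical 3 EN x 3 FR similarity matrix
sim = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.4],
    [0.1, 0.3, 0.9],
]
print(align_monotonic(sim))  # → [0, 1, 2]
```

Because the constraint is >= rather than >, several EN sentences may map to the same FR sentence, which is exactly how the 3-EN-to-1-FR paragraph merges get handled.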