How RAG, agentic orchestration, and bilingual alignment work together under the hood
Before diving into the architecture, here are the building blocks. Every piece of ProustGPT relies on these ideas.
**Embeddings.** A way to turn text into a list of numbers (a vector) that captures its meaning. "The madeleine triggered a flood of memory" and "A small cake unlocked forgotten recollections" produce similar vectors, even though the words are different. We use Cohere embed-v4.0 — a multilingual model, so French and English text about the same thing land near each other in vector space.
**Vector database.** A database optimized for storing and searching vectors. Instead of SQL queries (`WHERE title = ...`), you give it a vector and ask "find the 20 closest ones." We use Pinecone — it holds all ~12,900 Proust passages as 1536-dimensional vectors and returns the most semantically similar ones in milliseconds.
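That "find the closest ones" operation is just nearest-neighbor search under cosine similarity. Here is a toy, self-contained sketch of the idea with made-up 3-dimensional vectors (Pinecone does the same thing at scale, over 1536 dimensions, with an approximate index — none of this is ProustGPT code):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 = same direction, near 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "index": three passages with hand-made embeddings
index = {
    "the madeleine and memory": [0.9, 0.1, 0.2],
    "swann's jealousy of odette": [0.1, 0.9, 0.3],
    "a small cake unlocks recollection": [0.8, 0.2, 0.3],
}

def top_k(query_vec: list[float], k: int = 2) -> list[str]:
    """Return the k passages whose vectors are closest to the query."""
    ranked = sorted(index, key=lambda p: cosine(query_vec, index[p]), reverse=True)
    return ranked[:k]

# A query vector "about memory" lands near both memory passages,
# even though they share no words
print(top_k([0.85, 0.15, 0.25]))
# → ['the madeleine and memory', 'a small cake unlocks recollection']
```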
**LangChain.** A Python framework that provides standard interfaces for LLMs, embeddings, and vector stores. Instead of writing raw API calls to Cohere, Groq, and Pinecone, we use LangChain's abstractions — `CohereEmbeddings`, `ChatGroq`, `CohereRerank`. This makes it easy to swap providers without rewriting logic.
**LangGraph.** Built on top of LangChain, LangGraph lets you build stateful agents — LLMs that can call tools in a loop, observe results, and decide what to do next. Unlike a simple prompt-in/response-out flow, a LangGraph agent can make multiple tool calls before generating its final answer. We use it for the ReAct agent.
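The loop itself is conceptually simple. Here is a toy hand-rolled version (a scripted fake "LLM" standing in for the model, not LangGraph internals): each turn, the model either requests a tool or emits a final answer, and tool results are appended to the transcript it sees on the next turn.

```python
def fake_llm(transcript: list[str]) -> dict:
    """Stand-in for the model: asks for one search, then answers."""
    if not any(line.startswith("TOOL RESULT") for line in transcript):
        return {"tool": "search", "args": "madeleine scene"}
    return {"answer": "The madeleine scene appears in Swann's Way."}

def search(query: str) -> str:
    """Stand-in retrieval tool."""
    return f"Found 3 passages about: {query}"

TOOLS = {"search": search}

def react_loop(question: str, max_steps: int = 5) -> str:
    transcript = [f"QUESTION: {question}"]
    for _ in range(max_steps):
        step = fake_llm(transcript)
        if "answer" in step:                         # reason: model is done
            return step["answer"]
        result = TOOLS[step["tool"]](step["args"])   # act: call the tool
        transcript.append(f"TOOL RESULT: {result}")  # observe: feed result back
    return "Gave up after max_steps."

print(react_loop("Where does the madeleine scene appear?"))
# → The madeleine scene appears in Swann's Way.
```

LangGraph's value is handling the real version of this loop: message formats, tool-call parsing, state, and streaming.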
Every passage from Proust's seven volumes is embedded once, at indexing time, and stored in a Pinecone index as a searchable vector. Here's the actual code that queries that index — the `namespace` parameter selects which language's text to return:
```python
def _pinecone_query(
    query: str,
    top_k: int,
    lang: str = "en",
    metadata_filter: dict | None = None,
) -> list[Document]:
    embeddings = get_embeddings()
    query_vector = embeddings.embed_query(query)  # text → 1536-dim vector

    index = get_pinecone_index()
    results = index.query(
        vector=query_vector,
        top_k=top_k,            # fetch 20 candidates
        include_metadata=True,
        namespace=lang,         # "en" or "fr" namespace
    )

    # Build LangChain Document objects from Pinecone results
    docs = []
    for match in results.matches:
        metadata = match.metadata or {}
        text_en = metadata.pop("text", "")
        text_fr = metadata.pop("text_fr", "")
        # Use French text when requested and available; fall back to English
        page_content = text_fr if (lang == "fr" and text_fr) else text_en
        docs.append(Document(page_content=page_content, metadata=metadata))
    return docs
```
RAG (Retrieval-Augmented Generation) means: before the LLM answers, we first find relevant passages from Proust's text, then give those passages to the LLM as context. This grounds the response in the actual text instead of relying on the LLM's training data.
Retrieval runs in three stages: fetch `top_k` candidate passages from Pinecone, rerank them down to the top 5, and stitch adjacent passages together for continuity. The full pipeline in code — note how compact it is thanks to LangChain abstractions:
```python
def retrieve_passages(query: str, lang: str = "en") -> list[Document]:
    # Stage 1: Semantic search (20 candidates)
    candidates = _pinecone_query(query, top_k=config.RETRIEVAL_CANDIDATES, lang=lang)

    # Stage 2: Rerank to top 5
    reranker = get_reranker()
    reranked = list(reranker.compress_documents(candidates, query))

    # Stage 3: Stitch adjacent passages for context continuity
    return _stitch_context(reranked, lang=lang)


def stream_rag_response(query: str, lang: str = "en") -> Generator:
    docs = retrieve_passages(query, lang=lang)

    # Build context string with citation markers
    context = "\n\n---\n\n".join(
        f"[{i+1}] {doc.page_content}" for i, doc in enumerate(docs)
    )
    prompt = rag_template.format(context=context, question=query)

    # Stream LLM response token by token
    llm = get_llm()
    for chunk in llm.stream(prompt):
        yield {"type": "token", "token": chunk.content}

    yield {"type": "sources", "passages": _format_passages(docs)}
    yield {"type": "done", "done": True}
```
Simple RAG works for direct questions. But what about: "How does Swann's jealousy compare to the narrator's jealousy of Albertine?" That requires searching twice (once for each character), reading context around the results, and synthesizing across volumes. A single retrieval step can't do this.
This is where the ReAct agent comes in. ReAct stands for Reason + Act — the LLM thinks about what it needs, calls a tool, observes the result, thinks again, and repeats until it has enough information to answer.
Creating the agent is remarkably concise — LangGraph's create_react_agent handles the entire loop:
```python
from langgraph.prebuilt import create_react_agent

# Define the 6 tools the agent can use
_TOOLS = [
    search_passages,          # Semantic search (API call)
    search_by_volume,         # Volume-filtered search (API call)
    get_adjacent_passages,    # Read nearby passages (FREE)
    get_chapter_overview,     # Chapter start (FREE)
    find_character_mentions,  # Text search by name (FREE)
    get_toc,                  # Table of contents (FREE)
]

def create_proust_agent(lang: str = "en"):
    llm = get_llm()  # Groq / Kimi K2
    agent = create_react_agent(llm, _TOOLS, prompt=AGENT_SYSTEM_PROMPT)
    return agent  # Ready to .stream() or .invoke()
```
Each tool is a Python function decorated with @tool. The LLM sees the docstrings and decides which to call. Four of the six tools are free — they search the in-memory corpus without making any API calls:
| Tool | Cost | What It Does | When the Agent Uses It |
|---|---|---|---|
| `search_passages` | API | Semantic vector search across full corpus | General questions about themes, scenes, quotes |
| `search_by_volume` | API | Search filtered to one volume | Comparing across specific volumes |
| `get_adjacent_passages` | FREE | Read passages before/after a known index | "What happens next?" or expanding context |
| `get_chapter_overview` | FREE | First 5 passages of a chapter | Understanding chapter scope before searching |
| `find_character_mentions` | FREE | In-memory text search for character names | Character analysis, tracking appearances |
| `get_toc` | FREE | Full table of contents for all 7 volumes | Understanding novel structure, finding chapters |
Here's what a tool definition looks like — the docstring is what the LLM reads to decide whether to call it:
```python
@tool
def get_adjacent_passages(
    passage_index: int, before: int = 2, after: int = 2, lang: str = "en"
) -> str:
    """Get passages immediately before and/or after a known passage.

    Use this to read what happens next or before a scene you've already found.
    This is a FREE operation (no API calls) — prefer it over new searches when
    you already know the passage location.
    """
    results = []
    for offset in range(-before, after + 1):
        idx = passage_index + offset
        p = get_passage_text(idx, lang=lang)
        if p:
            results.append(
                f"[passage {idx}] ({p['book']}, {p['chapter']})\n{p['text'][:500]}"
            )
    return "\n\n---\n\n".join(results)
```
Not every query needs a multi-step agent. The complexity router examines the query and conversation history to decide whether to use fast single-step RAG or the full agent. This keeps simple questions fast and cheap.
The routing logic is a simple function — no ML model needed, just regex patterns and conversation-length heuristics:
```python
import re

# Patterns that indicate a complex query
_COMPLEX_PATTERNS = re.compile(
    r"\b("
    r"compare|comparison|vs\.?|versus|differ|difference|contrast"
    r"|evolve|evolution|develop|change over time|across volumes"
    r"|what happens (after|next|before|then)"
    r"|how does .+ change"
    r"|trace|track|arc|journey"
    r"|relationship between"
    r")\b",
    re.IGNORECASE,
)

def needs_agent(query: str, history: list | None = None) -> bool:
    if not config.AGENT_ENABLED:
        return False

    # 2+ prior user messages → likely needs conversational context
    if history:
        user_msgs = [m for m in history if m["role"] == "user"]
        if len(user_msgs) >= 2:
            return True

    # Query matches complexity patterns
    if _COMPLEX_PATTERNS.search(query):
        return True

    # Short follow-up (e.g. "tell me more", "and Albertine?")
    if history and len(query.split()) <= 6:
        return True

    return False
```
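To see the pattern heuristic in action, here is a self-contained demo that copies the regex above into a standalone check (no `config` flag or history handling, so it covers only the pattern branch):

```python
import re

# Copy of the complexity pattern, for a self-contained demo
COMPLEX = re.compile(
    r"\b("
    r"compare|comparison|vs\.?|versus|differ|difference|contrast"
    r"|evolve|evolution|develop|change over time|across volumes"
    r"|what happens (after|next|before|then)"
    r"|how does .+ change"
    r"|trace|track|arc|journey"
    r"|relationship between"
    r")\b",
    re.IGNORECASE,
)

def looks_complex(query: str) -> bool:
    """Pattern branch only: does the query match a complexity keyword?"""
    return bool(COMPLEX.search(query))

print(looks_complex("How does Swann's jealousy compare to the narrator's?"))  # → True
print(looks_complex("Who is Albertine?"))                                     # → False
print(looks_complex("Trace Odette's arc across volumes"))                     # → True
```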
And in the server, the routing is a single if statement:
```python
@app.post("/api/explore_lost_time/stream")
async def explore_lost_time_stream(body: QueryRequest):
    query = body.query or body.message or ""
    lang = body.lang or "en"
    history = [m.model_dump() for m in body.history] if body.history else None

    if needs_agent(query, history):
        # Complex → multi-step agent with tool calling
        return StreamingResponse(
            async_sse_generator(stream_agent_response, query, history, lang),
            media_type="text/event-stream",
        )

    # Simple → fast single-shot RAG
    return StreamingResponse(
        async_sse_generator(stream_rag_response, query, lang),
        media_type="text/event-stream",
    )
```
Both the fast RAG and agent paths stream their output as Server-Sent Events (SSE). This means the user sees tokens appear in real-time and gets status updates as the agent calls tools — instead of waiting 10 seconds for a complete response.
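SSE itself is a plain-text protocol: each event is a `data:` line followed by a blank line. The real `async_sse_generator`'s signature isn't shown here, but the core of any such wrapper looks roughly like this sketch (an assumption about its shape, not the project's code):

```python
import json
from collections.abc import Iterator

def to_sse(events: Iterator[dict]) -> Iterator[str]:
    """Serialize each event dict as one SSE frame: 'data: <json>\\n\\n'."""
    for event in events:
        yield f"data: {json.dumps(event)}\n\n"

# Example: the kinds of events stream_rag_response yields
frames = list(to_sse(iter([
    {"type": "token", "token": "Swann"},
    {"type": "done", "done": True},
])))
print(frames[0])  # a 'data: {...}' line, then the blank line that ends the frame
```

The browser's `EventSource` (or a manual `fetch` reader, as here) splits the stream on those blank lines and hands each JSON payload to the frontend.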
The frontend React hook processes each event type:
```typescript
// SSE event types the frontend handles
interface StreamEvent {
  type: 'token' | 'sources' | 'status' | 'done' | 'error';
  token?: string;
  passages?: Passage[];
  status?: string;
}

// Inside the streaming loop:
switch (event.type) {
  case 'status':
    setStatus(event.status); // "Searching for passages about..."
    break;
  case 'token':
    setStatus(null); // Clear status once tokens arrive
    setResponse(prev => prev + event.token);
    break;
  case 'sources':
    setPassages(event.passages); // Populate the ReaderPanel
    break;
}
```
Proust's English and French texts don't have matching paragraph structures. Translators split, merge, and restructure paragraphs differently. To show parallel bilingual text, we need to align them semantically.
Instead of aligning paragraphs (which have different granularity in each language), we split into individual sentences, embed them with Cohere, and use dynamic programming to find the optimal monotonic alignment:
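The dynamic programming step is essentially sequence alignment (Needleman–Wunsch style) over a sentence-similarity matrix. Here is a toy illustration of one standard formulation, with a hand-made similarity matrix standing in for real Cohere embeddings (the project's exact scoring may differ): each cell holds the similarity of English sentence i and French sentence j, and the DP finds the highest-scoring monotonic path, paying a small penalty to skip an unmatched sentence on either side.

```python
def align(sim: list[list[float]], skip_penalty: float = -0.2) -> list[tuple[int, int]]:
    """Monotonic 1-1 alignment maximizing total similarity.

    sim[i][j] = similarity of English sentence i and French sentence j.
    Returns aligned (i, j) index pairs, in order.
    """
    n, m = len(sim), len(sim[0])
    NEG = float("-inf")
    # score[i][j] = best score aligning first i English / first j French sentences
    score = [[NEG] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0.0
    back: dict[tuple[int, int], tuple[int, int]] = {}
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0 and score[i-1][j-1] + sim[i-1][j-1] > score[i][j]:
                score[i][j] = score[i-1][j-1] + sim[i-1][j-1]
                back[(i, j)] = (i - 1, j - 1)   # match sentence i-1 with j-1
            if i > 0 and score[i-1][j] + skip_penalty > score[i][j]:
                score[i][j] = score[i-1][j] + skip_penalty
                back[(i, j)] = (i - 1, j)       # skip an English sentence
            if j > 0 and score[i][j-1] + skip_penalty > score[i][j]:
                score[i][j] = score[i][j-1] + skip_penalty
                back[(i, j)] = (i, j - 1)       # skip a French sentence
    # Trace back the best path, collecting matched pairs
    pairs, cell = [], (n, m)
    while cell != (0, 0):
        prev = back[cell]
        if prev == (cell[0] - 1, cell[1] - 1):
            pairs.append(prev)
        cell = prev
    return pairs[::-1]

# 3 English vs 4 French sentences; French sentence 1 has no English match
sim = [
    [0.9, 0.1, 0.2, 0.1],
    [0.2, 0.1, 0.8, 0.2],
    [0.1, 0.2, 0.1, 0.9],
]
print(align(sim))  # → [(0, 0), (1, 2), (2, 3)]
```

Monotonicity (the path only moves forward in both texts) is what makes this safe for Proust: translators reorder within sentences, but the sentence order of the narrative itself is preserved.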