LLM Wiki by Andrej Karpathy - A Smarter Way to Build a Knowledge Base (With a Mini Project)

If you've ever uploaded a PDF to an AI tool, had a great conversation, and then come back the next day only to upload the same file again to pick up where you left off, Karpathy's idea is aimed at exactly that problem.

Most knowledge tools are stateless. Every session starts from zero. The model retrieves, answers, and forgets. If you ask the same question tomorrow, it rebuilds the answer from scratch, with no memory of the synthesis it already did for you yesterday. That is expensive, repetitive, and doesn't compound.

Now you might ask me, "How is this different from adding memory to RAG?"

Most people reach for "add memory" as the fix. Memory helps: the system recalls your past queries and what it told you. But the underlying documents are still raw, still unchanged, still retrieved as disconnected chunks on every query.

The synthesis evaporates. The connections between documents are never made explicit. The contradiction between your 2023 methodology doc and your 2024 update never gets flagged. Memory remembers your conversations. It doesn't compile your knowledge.

In April 2026, Andrej Karpathy published a GitHub Gist called LLM Wiki that reframes this entirely. Not a library, not a framework. A pattern.

Let me walk you through what it is, how it works, and then we'll build a working version around something I actually use: a Databricks Lakehouse Wiki.

The Compilation:

  • Karpathy frames this as "compilation". You don't execute the source code every time you want to run a program. You compile it once into an optimised binary and then run that efficiently on demand. The compilation step is expensive, but the subsequent runs are fast and cheap.

    Instead of pointing your LLM at a pile of raw PDFs every single time you ask a question, you have the LLM build a persistent wiki from these sources. After one compilation pass, every future query hits the compiled artefact (structured, cross-referenced, already synthesised) and not the raw documents.
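    One way to picture "compile once, run many" is a staleness check like a build tool uses: fingerprint the raw sources and only rerun the expensive LLM pass when that fingerprint changes. This is a minimal sketch under my own assumptions, not code from the gist; the function names and the manifest file are hypothetical.

```python
import hashlib
from pathlib import Path


def fingerprint(sources_dir: Path) -> str:
    """Hash every file under sources_dir (relative path + bytes) into one digest."""
    digest = hashlib.sha256()
    for path in sorted(p for p in sources_dir.rglob("*") if p.is_file()):
        digest.update(path.relative_to(sources_dir).as_posix().encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()


def needs_compile(sources_dir: Path, manifest: Path) -> bool:
    """True when the sources changed since the last recorded compilation."""
    # Keep the manifest outside sources_dir, or it would change the fingerprint.
    return not manifest.exists() or manifest.read_text() != fingerprint(sources_dir)


def mark_compiled(sources_dir: Path, manifest: Path) -> None:
    """Record the fingerprint after a successful (expensive) compile pass."""
    manifest.write_text(fingerprint(sources_dir))
```

    Queries never touch this check. It only decides whether the wiki has to be rebuilt or updated before you start asking questions again.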

The Architecture:

    This has three layers:

    • Raw Sources: These are all your PDFs, articles, documentation, blog posts, and so on. They live in the sources/ directory and are treated as immutable.

    • The Wiki itself: A directory of plain markdown files that the LLM owns completely. One file per concept: entity pages, topic pages, comparison pages, and an index. You read it. The LLM writes and maintains it.

    • Schema: This is the single Claude.md or agents.md file that acts as a guide for the agent. It tells the agent what the wiki is about, what conventions to follow, how pages should be structured, and how conflicts should be handled.
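The three layers above map onto a very small folder layout. Here is a rough scaffold, assuming the sources/ and wiki/ folder names used in this post; the `scaffold` helper, the `index.md` file name, and the stub contents are my own placeholders, not something the pattern prescribes.

```python
from pathlib import Path


def scaffold(root: Path, schema_name: str = "Claude.md") -> dict[str, Path]:
    """Create the three layers if missing; never overwrites existing files."""
    layout = {
        "sources": root / "sources",   # raw inputs, treated as immutable
        "wiki": root / "wiki",         # markdown pages owned by the LLM
        "schema": root / schema_name,  # instructions for the agent
    }
    layout["sources"].mkdir(parents=True, exist_ok=True)
    layout["wiki"].mkdir(parents=True, exist_ok=True)
    index = layout["wiki"] / "index.md"  # hypothetical index file name
    for path, stub in ((index, "# Index\n"), (layout["schema"], "# Wiki schema\n")):
        if not path.exists():
            path.write_text(stub)
    return layout
```

Calling `scaffold(Path("my-wiki"))` twice is safe: the second call leaves any pages or schema text you already have untouched.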

What happens at the Ingest:

When you drop a new source into the folder and trigger an ingest, the agent doesn't just index it. It reads the new source and the existing wiki, updates or creates the entity pages that the new source touches, notes where the new information contradicts existing content, adds cross-references where concepts connect, and updates the index.
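The ingest step described above can be sketched as a single function. In practice an agent such as Claude Code does this work itself, so treat this as an illustration only: the `ingest` function, the `llm` callable, and the JSON reply format are all hypothetical, and the source is assumed to be plain text rather than a PDF.

```python
import json
from pathlib import Path
from typing import Callable


def ingest(source: Path, wiki_dir: Path, schema: str,
           llm: Callable[[str], str]) -> list[str]:
    """Run one ingest pass and return the names of the pages written."""
    # The model sees the schema, the whole current wiki, and the new source.
    existing = {p.name: p.read_text() for p in sorted(wiki_dir.glob("*.md"))}
    prompt = "\n\n".join([
        schema,
        "EXISTING WIKI PAGES (JSON):\n" + json.dumps(existing),
        "NEW SOURCE (" + source.name + "):\n" + source.read_text(),
        "Return a JSON object mapping page filename to full updated markdown. "
        "Update or create pages the source touches, flag contradictions, "
        "add cross-references, and update index.md.",
    ])
    pages = json.loads(llm(prompt))  # assumed reply: {"page.md": "markdown"}
    written = []
    for name, body in pages.items():
        target = wiki_dir / Path(name).name  # file name only: stay inside wiki_dir
        target.write_text(body)
        written.append(target.name)
    return sorted(written)
```

Passing the model in as a plain callable keeps the sketch independent of any particular SDK, and it means you can dry-run the flow with a fake function before spending tokens.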

The knowledge compounds with every source you add. Over time, the wiki becomes a densely interlinked knowledge graph written in natural language, not a pile of document chunks. You can browse it in an app called Obsidian.
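To make the "interlinked graph" idea concrete, here is a small sketch that reads the cross-references back out of the markdown. It assumes the pages link to each other with Obsidian-style `[[Page Name]]` wikilinks, which is my assumption about the wiki's conventions; the `link_graph` helper is hypothetical.

```python
import re
from pathlib import Path

# Captures the target of [[Target]], [[Target|alias]] and [[Target#Heading]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def link_graph(wiki_dir: Path) -> dict[str, set[str]]:
    """Map each page (file name without .md) to the set of pages it links to."""
    graph: dict[str, set[str]] = {}
    for page in sorted(wiki_dir.glob("*.md")):
        targets = {match.strip() for match in WIKILINK.findall(page.read_text())}
        graph[page.stem] = targets
    return graph
```

A page whose set comes back empty is a candidate for a missing cross-reference on the next ingest.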

The Mini Project: A Databricks Lakehouse Wiki:

I used Tableau's Superstore dataset, the one every Tableau user has opened at some point, and reverse engineered it into a Databricks Lakehouse. My goal was to be able to talk to the resulting wiki.

Here is what I dropped into sources/:

  • Superstore.twb - which contains every calculated field in XML

  • Three Databricks notebooks - Bronze DDL, Silver and Gold DDLs that mirror Tableau's dashboard views exactly
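If you want to see what the agent has to work with in the workbook, the calculated fields can be pulled out of the .twb XML directly. This is a sketch based on my understanding of the file layout (a `column` element with a nested `calculation` element carrying a `formula` attribute); it is not an official Tableau API, so check it against your own workbook. The `calculated_fields` helper is hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def calculated_fields(twb_path: Path) -> dict[str, str]:
    """Return {field caption: Tableau formula} for columns that carry a formula."""
    root = ET.parse(str(twb_path)).getroot()
    fields: dict[str, str] = {}
    for column in root.iter("column"):
        calc = column.find("calculation")
        if calc is None or not calc.get("formula"):
            continue  # plain column, not a calculated field
        # Prefer the human-readable caption; fall back to the internal name.
        name = column.get("caption") or column.get("name", "")
        fields[name] = calc.get("formula")  # repeated captions keep the last formula
    return fields
```

The resulting dictionary is a handy checklist: every formula in it should show up on some page of the compiled wiki.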

I then wrote a Claude.md schema telling the agent what the wiki was about, what page types to create, how to handle the Tableau-to-Databricks translation, and what to do when it found conflicts.

Then I ran one compilation prompt in Claude Code.

A few minutes later, my wiki/ folder was full of .md files. Each file was a page of its own, with a summary, the Tableau formula that was used, and so on.

Data dictionary generated from LLM Wiki

The Wiki compiled it by reading both sources and reconciling them.

I created the wiki yesterday, and today I can still refer to it to ask any question or build a dashboard on top of it.

Where the Wiki Pattern Works and Where It Doesn't:

As I understand it, the pattern fits knowledge bases such as personal research, data dictionaries, or architecture decisions well. It would not be able to handle petabyte-scale knowledge graphs with compliance requirements. The research around it is still ongoing.

You can also view the wiki as a knowledge graph using Obsidian.

Wrapping Up:

The wiki pattern is not about better memory. It is about not having to think the same thought twice. You compile once, and every question after that lands on knowledge that is already connected, already reconciled, already yours.

This pattern quietly solves that problem, not with infrastructure, not with a new tool, just with a folder of markdown that gets smarter every time you add something to it. Thank you so much for taking the time to read this! See you in the next one. Until then, #HappyLearning
