A business lead asks a one-line question: "how many ChatGPT Pro users in Italy?" At most companies, a data scientist would answer it by clicking through a few dashboards and writing a quick query. At OpenAI, ad-hoc questions like this arrive dozens of times a day, and the underlying data lives in more than 600 petabytes of internal storage. The company's answer, according to a recent engineering talk by OpenAI software engineer Bonnie Xu, is an internal AI data analyst it calls Kepler.
Kepler is not a product. It is a tool the OpenAI Data Productivity team built for its own use, and Xu's 44-minute InfoQ presentation is the most detailed public look at how it works. The talk is best read not as a feature tour but as a pattern audit: a list of engineering moves other teams building agentic data systems could borrow, and a set of tradeoffs the speaker does not address.
Four techniques carry the load. The first is the Model Context Protocol, a standard for letting a model call external tools and pull in structured context without filling its context window with raw data. The second is automated code crawling paired with retrieval-augmented generation: instead of indexing table schemas, the system indexes the actual analysis code OpenAI engineers have already written, so the agent reuses a working query rather than inventing a new one. The third is scoped semantic memory, which gives the agent a per-user or per-task memory of past questions and answers, letting it learn from prior interactions without contaminating other users' sessions. The fourth is AST-based LLM grading, where the evaluation pipeline parses the agent's generated code into an abstract syntax tree and checks structural correctness before scoring the answer, a way to catch regressions that pure output-comparison evals would miss.
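To make the fourth technique concrete, here is a minimal sketch of a structural pre-check of the kind described above. It assumes the agent's output is Python analysis code and uses the standard-library `ast` module. The talk does not specify the language or the features Kepler's grader compares, so the function names `structural_features` and `structural_score`, and the choice of call names as the fingerprint, are illustrative assumptions. The LLM scoring step that would follow is not shown.

```python
import ast


def structural_features(source: str) -> set[str]:
    """Collect the names of called functions and methods as a coarse fingerprint.

    Hypothetical helper: real graders may compare far richer structure.
    """
    tree = ast.parse(source)  # raises SyntaxError if the code is malformed
    features = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Attribute):   # e.g. df.groupby(...)
                features.add(func.attr)
            elif isinstance(func, ast.Name):      # e.g. len(...)
                features.add(func.id)
    return features


def structural_score(candidate: str, reference: str) -> float:
    """Jaccard overlap of call names; 0.0 if the candidate does not parse."""
    try:
        cand = structural_features(candidate)
    except SyntaxError:
        return 0.0
    ref = structural_features(reference)
    if not cand and not ref:
        return 1.0
    return len(cand & ref) / len(cand | ref)


reference = "df[df.country == 'IT'].groupby('plan').size()"
print(structural_score("df[df.country == 'FR'].groupby('plan').size()", reference))
print(structural_score("df.groupby('plan').size(", reference))
```

In this sketch the first candidate scores the same as the reference even though it filters on a different country, while the second scores zero because it fails to parse. A check like this flags broken or structurally divergent code cheaply, but it cannot tell whether the numbers that come back are right.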
Each of these patterns is genuinely portable. MCP is a protocol that any team can adopt, and the talk treats it as the entry point. Code-as-retrieval is a quietly powerful idea: most internal data tools index metadata and leave humans to write the queries, while Kepler indexes the queries themselves and treats them as a knowledge base. Scoped semantic memory is a sensible default for any multi-user agent. AST-based grading is a useful addition to the standard eval toolkit, especially for code-generating systems.
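The code-as-retrieval idea can be sketched in a few lines. The example below is an assumption-heavy toy: the three queries in `CORPUS` and their table names are invented for illustration, and plain token-overlap cosine similarity stands in for the embedding-based retrieval a production RAG system would more likely use. The talk does not describe Kepler's index at this level of detail.

```python
import re
from collections import Counter
from math import sqrt

# Hypothetical corpus of previously written analysis queries.
CORPUS = [
    "SELECT country, COUNT(*) FROM subscriptions WHERE plan = 'pro' GROUP BY country",
    "SELECT DATE(created_at), COUNT(*) FROM signups GROUP BY 1",
    "SELECT model, AVG(latency_ms) FROM requests GROUP BY model",
]


def tokens(text: str) -> Counter:
    """Lowercase bag of identifier-like tokens."""
    return Counter(re.findall(r"[a-z_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, corpus: list[str], k: int = 1) -> list[str]:
    """Return the k stored queries most similar to the question."""
    q = tokens(question)
    ranked = sorted(corpus, key=lambda snippet: cosine(q, tokens(snippet)), reverse=True)
    return ranked[:k]


print(retrieve("how many pro subscriptions per country", CORPUS)[0])
```

The retrieved query would then be handed to the model as context, so it adapts something that already ran rather than composing a query from schema names alone.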
The tradeoffs are quieter but real. Xu's 600-petabyte scale figure is OpenAI's own, not externally audited, so any team that expects similar bottlenecks should benchmark its own workload first. Scoped semantic memory can drift: the agent's notion of "what the user wanted last time" is only as good as the memory hygiene, and Xu does not detail how stale or contradictory memories are reconciled. AST grading catches structural defects but says nothing about whether the answer is actually correct, so the eval pipeline still needs outcome checks downstream. And code-as-retrieval depends on a critical mass of high-quality prior queries; a team that has not yet written thousands of analyses does not have a knowledge base to crawl.
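One way to picture the memory-hygiene problem is a store that keys every entry by scope and refuses to return entries past a maximum age. The sketch below is a hypothetical policy, not Kepler's: the `ScopedMemory` class, the `max_age_s` cutoff, and the last-write-wins rule for contradictions are all assumptions, and exact key lookup stands in for the semantic matching a real system would use.

```python
from dataclasses import dataclass, field


@dataclass
class ScopedMemory:
    """Toy per-scope memory with an age cutoff (hypothetical policy)."""

    max_age_s: float = 30 * 24 * 3600  # assumed cutoff: 30 days
    _store: dict = field(default_factory=dict)

    def remember(self, scope: str, key: str, value: str, now: float) -> None:
        # A later write to the same scope and key overwrites the earlier one.
        self._store.setdefault(scope, {})[key] = (value, now)

    def recall(self, scope: str, key: str, now: float):
        entry = self._store.get(scope, {}).get(key)
        if entry is None:
            return None  # nothing stored in this scope
        value, written_at = entry
        if now - written_at > self.max_age_s:
            return None  # treat stale memories as absent
        return value


memory = ScopedMemory(max_age_s=100)
memory.remember("user:alice", "default_region", "Italy", now=0)
print(memory.recall("user:alice", "default_region", now=50))
print(memory.recall("user:bob", "default_region", now=50))
print(memory.recall("user:alice", "default_region", now=101))
```

Timestamps are passed in explicitly so the expiry behavior is easy to test. Whether to expire, overwrite, or keep conflicting memories side by side is exactly the kind of decision a team adopting this pattern would have to make for itself.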
There is also the question of what Kepler does not attempt. It does not generate dashboards. It does not explain causality. It is built to answer narrow, well-formed questions about internal data, and the presentation makes clear that the harder problems, open-ended exploration and ambiguous business questions, are still routed to human analysts.
For engineering teams building their own agentic data systems, the practical takeaway is to read the talk as a checklist. The four patterns are worth piloting, in roughly the order Xu presents them. The constraints are worth flagging in any internal pitch, because some of what looks like best practice is really an artifact of OpenAI's specific scale and data maturity. A team with ten analysts and ten terabytes is not OpenAI, and adopting Kepler's full architecture would be over-engineering. The question worth asking is which of the four moves survive the simplification.