The Git Commit as Scientific Discovery: How Autoresearch Turns Version Control into a Research Lab
In traditional software development, a git commit means “this code works.” In Karpathy’s autoresearch, a git commit means something different: “this change made the model measurably better.”
Every commit is a small scientific discovery. Every git reset is a hypothesis that didn’t pan out. The git log becomes a research journal, automatically written by an AI agent.
This is version control reimagined as a research tool.
The Binary Decision
Autoresearch’s use of git is elegantly simple:
- Agent modifies train.py
- Training runs for 5 minutes
- Validation loss is measured
- If improved: git commit (the change is a keeper)
- If not improved: git reset (the change never happened)
No pull requests. No code review. No merge conflicts. Just a binary decision: did this change make things better or not?
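The loop above can be sketched in a few lines of Python. This is illustrative only: the function name and the dry_run flag are inventions for demonstration, not autoresearch's actual code, though the git commands are the real ones.

```python
import subprocess

def accept_or_revert(baseline_loss: float, new_loss: float, dry_run: bool = False) -> str:
    """The binary decision: commit if validation loss improved, hard-reset otherwise."""
    if new_loss < baseline_loss:
        # Improvement: lock the change in as a commit.
        cmd = ["git", "commit", "-am", f"val loss {baseline_loss:.4f} -> {new_loss:.4f}"]
    else:
        # No improvement: the change never happened.
        cmd = ["git", "reset", "--hard", "HEAD"]
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd[1]  # "commit" or "reset"
```

Note that ties count as failures: a change that doesn't measurably improve the loss is discarded.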
This creates a clean, linear history of improvements. Each commit in the log represents a validated step forward. There’s no noise --- no “WIP” commits, no “fix typo” commits, no “revert revert” chains. Just a sequence of changes that each made the model measurably better.
The Git Log as Research Journal
After an overnight autoresearch session, the git log reads like a research notebook.
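A hypothetical excerpt (commit messages and loss numbers invented for illustration) might look like:

```text
a3f9c12 Scale attention logits by 1/sqrt(head_dim); val loss 3.012 -> 2.998
8be4d07 Increase weight decay to 0.1; val loss 2.998 -> 2.991
5c21f9e Warm up learning rate over 200 steps; val loss 2.991 -> 2.984
```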
Each commit message (written by the AI agent) describes what was changed and what effect it had. The diff shows exactly what code was modified. The improvement in validation loss is recorded.
This is radically more auditable than traditional ML research. Instead of a researcher’s notes saying “tried adjusting learning rate, seemed to help,” you have an exact diff, an exact measurement, and a reproducible result.
Memory Across Sessions
Git gives autoresearch something AI agents desperately need: persistent memory.
When you start a new autoresearch session, the agent can read the git history to understand what’s been tried before. It can see which directions produced improvements and which didn’t. This prevents the agent from re-trying failed experiments and helps it build on what worked.
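As a sketch, the agent could recover this memory with a single git call. The function name is hypothetical, but the git invocation is standard (%s is the commit subject line):

```python
import subprocess

def prior_experiments(repo_path: str = ".") -> list[str]:
    """Return the subject line of every commit: the record of what's been tried."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", "--pretty=%s"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.splitlines()
```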
This is Markdown plus git working together: the program.md file provides strategic direction (what to try), and the git history provides tactical context (what’s been tried).
The Compounding Effect
Because each successful commit becomes the new baseline, improvements compound. The agent doesn’t start from scratch each night --- it starts from the best result achieved so far.
In Karpathy’s two-day run, around 20 improvements accumulated. Each one was small, but together they reduced the GPT-2 training time by 11%. The agent found optimizations in attention scaling, regularization, and hyperparameters that built on each other.
This is the power of the git-based approach: it naturally creates a ratchet. Progress is locked in as commits. Failures are discarded. The codebase only moves forward.
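A toy simulation makes the ratchet concrete (the numbers are invented; only improvements advance the baseline):

```python
def run_session(baseline: float, trial_losses: list[float]) -> tuple[float, list[float]]:
    """Simulate a night of experiments: improvements are 'committed' and become
    the new baseline; everything else is 'reset' and discarded."""
    kept = []
    for loss in trial_losses:
        if loss < baseline:   # commit: progress is locked in
            baseline = loss
            kept.append(loss)
        # else: reset -- the codebase never moves backward
    return baseline, kept
```

The compounding is easy to check by hand: if each of roughly 20 kept changes trims about 0.6% off training time, (1 - 0.006)^20 ≈ 0.886, an overall reduction of about 11%.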
What Gets Reverted
The failed experiments --- the git reset operations --- are just as interesting as the successes. In a typical overnight run, about 70-80% of experiments are reverted.
These reverted experiments aren’t wasted. They’re negative results that inform the agent’s future decisions. With cross-agent memory and shared git history, a distributed autoresearch system can learn from failures across the entire swarm.
Git as the Experiment Database
Traditional ML research uses experiment tracking tools --- MLflow, Weights & Biases, Neptune --- to log hyperparameters, metrics, and artifacts.
Autoresearch replaces all of this with git. The commit history IS the experiment log. The diffs ARE the hyperparameter changes. The commit messages ARE the experiment descriptions.
This simplification is powerful. There’s no separate experiment database to maintain. No dashboard to configure. No schema to define. Just git, which every developer already knows.
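Under that framing, "querying the experiment database" is just parsing git log output. A sketch, assuming a tab-separated pretty-format (the record layout is an invention, not autoresearch's actual schema):

```python
import subprocess

def experiment_records(repo: str = ".") -> list[dict]:
    """Treat each commit as one experiment record: id, date, and description."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "--pretty=%h%x09%ad%x09%s", "--date=short"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [
        dict(zip(("sha", "date", "message"), line.split("\t", 2)))
        for line in out.splitlines()
    ]
```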
The Broader Pattern
The git-as-research-journal pattern works beyond ML training:
- Code optimization: Each commit represents a change that made the code faster
- Test coverage: Each commit represents a change that improved test coverage
- Bug fixing: Each commit represents a fix that resolved a failing test
- Content optimization: Each commit represents a change that improved a measurable metric
Any domain where you can automatically measure “better” and “worse” can use git as an experiment tracker.
The Human’s Role: Reading the Log
In agentic engineering, the human’s morning routine after an overnight autoresearch session is reading the git log.
This is a different skill than writing code. You’re evaluating a series of AI-generated changes, understanding why each one worked, and deciding whether the overall direction is correct. Based on this review, you update your program.md to steer the next session.
The git log is the communication channel between human and agent. The agent communicates through commits. The human communicates through the program.md updates. Markdown flows in both directions.
Building Git-Friendly Knowledge
Writing effective program.md files --- the kind that produce clean, meaningful git histories --- requires understanding both the domain and the tools. The best agent instructions come from people who’ve studied the problem space deeply.
Saving reference material as clean Markdown creates a knowledge base you can draw from when writing agent instructions. Documentation, research papers, and best practices all live in the format that flows naturally into a program.md and, ultimately, into a git history of discoveries.
Save converts any webpage to clean Markdown --- building the knowledge library that powers effective AI agent instructions and autonomous research. Try Save free.