Shopify's 19% Improvement: How Companies Are Already Using the Autoresearch Pattern
When Karpathy released autoresearch on March 7, 2026, it took days --- not weeks, not months --- for companies to start running it on their own problems.
The most notable early adopter: Shopify CEO Tobi Lütke, who adapted the autoresearch framework for an internal project. The result? A 0.8-billion-parameter model trained overnight outperformed a previous 1.6-billion-parameter model by 19% after just 37 experiments in 8 hours.
Smaller model. Better results. Zero human intervention overnight.
The Autoresearch Pattern in Business
What Shopify demonstrated isn’t just a cute ML experiment. It’s a proof of concept for a new way companies do R&D.
The traditional approach: hire ML engineers, have them run experiments manually, review results in meetings, decide next steps, repeat slowly. A good team might run 30 focused experiments per month.
The autoresearch approach: write a program.md defining your goals, let an AI agent run experiments overnight, review the results in the morning. One engineer, one GPU, 100+ experiments per night.
The math is overwhelming. Manual research produces roughly one experiment per researcher per day. Autoresearch produces ~12 per hour --- about 100 over a single night. That's a 100x increase in experimental throughput.
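The loop behind those numbers is easy to sketch. Here's a minimal, hypothetical skeleton of the commit-or-revert cycle --- `propose` and `evaluate` stand in for the agent's code changes and the automated metric; none of these names come from the actual autoresearch release:

```python
def autoresearch_loop(baseline_score, propose, evaluate, n_experiments=100):
    """Keep a change only if it beats the best score so far (the 'commit');
    otherwise discard it (the 'revert')."""
    best_score = baseline_score
    log = []
    for _ in range(n_experiments):
        change = propose()               # agent suggests a code/config change
        score = evaluate(change)         # automated metric after the run
        kept = score > best_score        # commit on improvement, revert otherwise
        if kept:
            best_score = score
        log.append((change, score, kept))
    return best_score, log

# Toy usage: random search over a single hyperparameter.
import random
rng = random.Random(0)
best, log = autoresearch_loop(
    baseline_score=0.50,
    propose=lambda: rng.uniform(0, 1),      # candidate learning rate, say
    evaluate=lambda lr: 1 - (lr - 0.3)**2,  # pretend metric that peaks at lr=0.3
    n_experiments=20,
)
```

The point of the skeleton: the human never appears inside the loop. Everything human goes into `propose`'s search space and `evaluate`'s metric before the run starts --- which is exactly what a program.md encodes.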
Beyond ML: The 36,500-Experiment Year
The pattern extends beyond model training. Marketing teams typically run about 30 experiments per year --- A/B tests, copy variations, audience targeting changes. It’s slow because each experiment requires human setup, monitoring, and analysis.
Early adopters are already imagining a world where autonomous agents run 100 marketing experiments per day, measuring conversion rates, adjusting copy, and iterating on targeting --- all guided by a program.md that defines the brand’s goals and constraints.
That's 36,500+ experiments per year versus 30. The companies that adopt this pattern first will build a compounding advantage that's nearly impossible for competitors to close.
What Made Shopify’s Results Possible
Shopify’s 19% improvement wasn’t luck. Several factors made it work:
Clear metrics. They had a well-defined evaluation metric that the agent could measure automatically after each experiment. Without automated measurement, the loop breaks.
Constrained scope. Like Karpathy’s 630-line train.py, Shopify kept the modifiable codebase small enough for the LLM to understand completely. You don’t throw a million-line codebase at an agent and hope for the best.
Good initial instructions. The program.md that directed the agent was informed by the team’s domain knowledge. The agent wasn’t searching randomly --- it was exploring directions the team identified as promising.
Trust in the process. They let it run overnight without intervening. The temptation to check and adjust every hour defeats the purpose of autonomous experimentation.
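Put together, those four factors are what a program.md has to encode. A purely hypothetical sketch --- not Shopify's actual file --- might look like this:

```markdown
# Goal
Improve validation loss on the ranking model.
Baseline: 0.412. A change is kept only if it beats the current best.

# Constraints
- Only modify train.py and config.yaml; do not touch the data pipeline.
- Each run must finish in under 10 minutes on one GPU.
- Revert any change that fails the eval script or degrades the metric.

# Promising directions
- Learning-rate schedule variants (cosine, warmup length).
- Regularization sweeps: dropout rate, weight decay.
- Smaller embedding dimensions with longer training.
```

Note how each section maps to a factor above: the Goal section is the clear metric, the Constraints section is the constrained scope, and Promising directions is the team's domain knowledge pointing the search.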
The Overnight Run Pattern
The typical autoresearch adoption follows a pattern:
Day 1: Set up the environment, write your first program.md, run a few experiments manually to verify the loop works.
Night 1: Start the agent before leaving. Set it to run indefinitely, committing improvements and reverting failures.
Day 2 morning: Review the git log. See what the agent tried, what worked, and what didn’t. Update your program.md based on what you learned.
Night 2: Run again with improved instructions. The agent starts from where Night 1’s best result left off.
Within a week: You have a refined program.md and dozens of validated improvements that would have taken a human team months to discover.
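The Day 2 morning review is essentially git archaeology. A helper for it could be as simple as the sketch below, assuming the agent made one commit per kept experiment --- the function name and the 12-hour window are illustrative, and `git` must be on your PATH:

```python
import subprocess

def overnight_summary(since="12 hours ago", repo="."):
    """Return the subject line of each commit the agent made overnight."""
    out = subprocess.run(
        ["git", "-C", repo, "log", f"--since={since}", "--pretty=%s"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in out.stdout.splitlines() if line]
```

Run inside the experiment repo, it gives you the night's experiments at a glance --- the raw material for updating your program.md before Night 2.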
Industries Ripe for This Pattern
Any field that involves systematic experimentation can adopt the autoresearch loop:
Machine learning --- the original use case. Hyperparameter tuning, architecture search, regularization experiments.
Software optimization --- performance tuning, bundle size reduction, query optimization. Anywhere you have a measurable metric and modifiable code.
Drug discovery --- molecular simulations with measurable binding affinity. The experiment is computational, the metric is numerical, the loop is automatable.
Financial modeling --- backtesting trading strategies against historical data. Clear metrics, fast feedback, huge search space.
Content optimization --- A/B testing headlines, layouts, and copy with conversion rate as the metric.
The Markdown Advantage
In every case, the human’s contribution is the same: a Markdown file that defines what to optimize, what constraints to respect, and what strategies to try.
This is why Markdown literacy is becoming a competitive advantage. The companies writing the best program.md files are the ones getting the best results from autonomous agents. And writing good program.md files requires deep domain knowledge organized in a format AI can consume.
Companies building reference libraries --- saving documentation, competitive analysis, research papers, and best practices as clean Markdown --- have a head start. When it’s time to write the program.md that directs an overnight experiment, they can pull from a curated knowledge base instead of starting from scratch.
Save converts any webpage to clean Markdown --- building the knowledge library that companies need to write effective AI agent instructions. Try Save free.