Papers & Reports
All publications

report · 25 min read
Could decentralized training solve AI’s power problem?
We illustrate a decentralized 10 GW training run across a dozen sites spanning thousands of kilometers. Developers are likely to scale datacenters to multi-gigawatt levels before adopting decentralized training.

report · 23 min read
Evaluating Gemini 2.5 Deep Think's math capabilities
Improved use of knowledge and precision, helpful for research, and more conceptual in geometry, but with limited creativity and citation issues.

report · 8 min read
What will AI look like in 2030?
If scaling persists to 2030, AI investments will reach hundreds of billions of dollars and require gigawatts of power. Benchmarks suggest AI could improve productivity in valuable areas such as scientific R&D.

paper · 4 min read
How much power will frontier AI training demand in 2030?
The power required to train the largest frontier models is growing by more than 2x per year, and is on track to reach multiple gigawatts by 2030.

report · 31 min read
Evaluating Grok 4’s math capabilities
It's good at involved computations, improving at proofs, and useful for literature search. It still favors low-level grinds and leans on background knowledge.

paper · 4 min read
Inference economics of language models
We investigate how speed trades off against cost in language model inference. We find that inference latency scales with the square root of model size and the cube root of memory bandwidth, among other results.

report · 11 min read
What skills does SWE-bench Verified evaluate?
We take a deep dive into SWE-bench Verified, a prominent agentic coding benchmark. While one of the best public tests of AI coding agents, it is limited by its focus on simple bug fixes in familiar open-source repositories.

report · 35 min read
How many AI models will exceed compute thresholds?
We project how many notable AI models will exceed training compute thresholds. Model counts grow rapidly, from 10 models above 1e26 FLOP by 2026 to over 200 by 2030.

paper · 4 min read
Trends in AI supercomputers
AI supercomputers double in performance every 9 months, cost billions of dollars, and require as much power as mid-sized cities. Companies now own 80% of all AI supercomputers, while governments’ share has declined.

paper · 5 min read
GATE: Modeling the trajectory of AI and automation
We introduce a compute-centric model of AI automation and its economic effects, illustrating key dynamics of AI development. The model suggests large AI investments and subsequent economic growth.

report · 9 min read
Train once, deploy many: AI and increasing returns
AI's “train-once-deploy-many” advantage yields increasing returns: doubling compute more than doubles output by increasing models' inference efficiency and enabling more deployed inference instances.

report · 7 min read
What is the future of AI in mathematics? Interviews with leading mathematicians
How will AI transform mathematics? Fields Medalists and other leading mathematicians discuss whether they expect AI to automate advanced math research.

report · 15 min read
Hardware failures won’t limit AI scaling
Hardware failures won't limit AI training scale: GPU memory checkpointing enables training with millions of GPUs despite failures.

report · 37 min read
How far behind are open models?
Analysis of open vs. closed AI models reveals the best open model today matches closed models in performance and training compute, but with a one-year lag.

paper · 14 min read
Data movement bottlenecks to large-scale model training: Scaling past 1e28 FLOP
Data movement bottlenecks limit LLM scaling beyond 2e28 FLOP, with a "latency wall" at 2e31 FLOP. We may hit these in ~3 years. Aggressive batch size scaling could potentially overcome these limits.

report · 10 min read
Interviewing AI researchers on automation of AI R&D
AI could speed up AI R&D, especially in coding and debugging. We explore predictions on automation and researchers' suggestions for AI R&D evaluations.

report · 83 min read
Can AI scaling continue through 2030?
We investigate four constraints to scaling AI training: power, chip manufacturing, data, and latency. We predict 2e29 FLOP runs will be feasible by 2030.

paper · 6 min read
Will we run out of data? Limits of LLM scaling based on human-generated data
If trends continue, language models will fully utilize the stock of human-generated public text between 2026 and 2032.

paper · 4 min read
How much does it cost to train frontier AI models?
The cost of training top AI models has grown 2-3x annually for the past eight years. By 2027, the largest models could cost over a billion dollars.

report · 20 min read
Training compute of frontier AI models grows by 4-5x per year
Our expanded AI model database shows that training compute grew 4-5x/year from 2010 to 2024, with similar trends in frontier and large language models.

paper · 10 min read
Do the returns to software R&D point towards a singularity?
Returns to R&D are key in growth dynamics and AI development. Our paper introduces new empirical techniques to estimate this vital parameter.

paper · 4 min read
Chinchilla scaling: A replication attempt
We replicate Hoffmann et al.’s parametric scaling law estimates, finding issues and providing better-fitting estimates that align with their other methods.

report · 16 min read
Tracking large-scale AI models
We present a dataset of 81 large-scale models, from AlphaGo to Gemini, developed across 18 countries, at the leading edge of scale and capabilities.

report · 9 min read
Optimally allocating compute between inference and training
AI labs should spend comparable resources on training and inference, assuming they can flexibly balance compute between the two to maintain performance.

paper · 3 min read
Algorithmic progress in language models
Progress in pretrained language model performance outpaces what compute scaling alone would predict, occurring at a pace equivalent to doubling computational power every 5 to 14 months.

report · 23 min read
Biological sequence models in the context of the AI directives
Our expanded database now includes biological sequence models, highlighting potential regulatory gaps and the growth of training compute in these models.

paper · 3 min read
How predictable is language model benchmark performance?
We investigate large language model performance, finding that compute-focused extrapolations are a promising way to forecast AI capabilities.

paper · 4 min read
Limits to the energy efficiency of CMOS microprocessors
How far can the energy efficiency of CMOS microprocessors be pushed before hitting physical limits? We find room for a further 50 to 1000x improvement.

paper · 2 min read
AI capabilities can be significantly improved without expensive retraining
While scaling compute is key to improving LLMs, post-training enhancements can offer gains equivalent to 5-20x more compute at less than 1% of the cost.

paper · 3 min read
Who is leading in AI? An analysis of industry AI research
Industry has emerged as a driving force in AI. We compare top companies on research impact, training runs, and contributions to algorithmic innovations.

report · 31 min read
Challenges in predicting AI automation
Economists propose various approaches to predicting AI's automation of valuable tasks, but disagreements persist, with no consensus on the best method.

report · 27 min read
Trends in machine learning hardware
FLOP/s performance in 47 ML hardware accelerators doubled every 2.3 years. Switching from FP32 to tensor-FP16 led to a further 10x performance increase.

paper · 11 min read
Explosive growth from AI: A review of the arguments
Our new article explores whether deployment of advanced AI systems could lead to growth rates ten times higher than those of today’s frontier economies.

report · 27 min read
Trading off compute in training and inference
We characterize techniques that induce a tradeoff between spending resources on training and inference, outlining their implications for AI governance.

report · 10 min read
The limited benefit of recycling foundation models
Reusing pretrained models can save on training costs, but it's unlikely to significantly boost AI capabilities beyond modest improvements.

report · 14 min read
Direct Approach interactive model
When could transformative AI be achieved? We present a simple, user-adjustable model of key inputs that forecasts the date TAI could be deployed.

report · 10 min read
The Direct Approach
We propose a method using neural scaling laws to estimate the compute needed to train AI models to reach human-level performance on various tasks.

paper · 2 min read
Power laws in speedrunning and machine learning
Our model suggests ML benchmarks aren’t near saturation. While large improvements are rare, we find 1 OOM gains happen roughly once in every 50 instances.

report · 66 min read
Trends in the dollar training cost of machine learning systems
How much does it cost to train AI models? Looking at 124 ML systems from 2009 to 2022, we find the cost has grown by approximately 0.5 OOM/year.

report · 6 min read
Scaling laws literature review
I have collected a database of scaling laws for different tasks and architectures, and reviewed dozens of papers in the scaling law literature.

report · 16 min read
Literature review of transformative artificial intelligence timelines
We summarize and compare several models and forecasts predicting when transformative AI will be developed.

paper · 2 min read
Revisiting algorithmic progress
Examining over 100 computer vision models, we find that every 9 months, better algorithms contribute the equivalent of a doubling of compute budgets.

paper · 3 min read
Will we run out of ML data? Evidence from projecting dataset size trends
We project dataset growth in language and vision domains, estimating future limits to training by evaluating the availability of unlabeled data over time.

report · 12 min read
The longest training run
Training runs of large ML systems will likely last less than 14-15 months, as shorter runs starting later use better hardware and algorithms.

report · 22 min read
A time-invariant version of Laplace’s rule
We discuss estimating event probabilities with past data, addressing issues with Laplace’s rule and proposing a modification to improve accuracy.

paper · 2 min read
Machine learning model sizes and the parameter gap
Since 2018, the size of ML models has been growing 10 times faster than before. Around 2020, model sizes saw a significant jump, increasing by 1 OOM.

report · 14 min read
Trends in GPU price-performance
Improvements in hardware are central to AI progress. Using data on 470 GPUs from 2006 to 2021, we find that FLOP/s per dollar doubles every ~2.5 years.

paper · 7 min read
Compute trends across three eras of machine learning
We’ve compiled a comprehensive dataset of the training compute of AI models, providing key insights into AI development.

report · 24 min read
Estimating training compute of deep learning models
We describe two approaches for estimating the training compute of deep learning systems: counting operations and looking at GPU time.

report · 8 min read
What’s the backward-forward FLOP ratio for neural networks?
Determining the backward-forward FLOP ratio for neural networks to help calculate their total training compute.

report · 9 min read
How to measure FLOP for neural networks empirically?
Computing the utilization rate for multiple neural network architectures.