Pairwise comparison of automatability measures across individual O*NET occupations. Higher scores indicate higher automatability. Measures are standardized to employment-weighted z-scores at the occupation level. Trend lines show linear regressions between measures, with bootstrapped 95% confidence intervals and explained variance (R²).

Introduction

The automation of tasks by AI systems has the potential to generate tremendous economic value (Erdil and Besiroglu, 2023). The prospect of capturing this value can incentivize greater investment in developing AI capabilities. Accurately predicting when tasks are likely to be automated by AI could therefore help forecast the trajectory of AI investment and AI development. Understanding the impact of automation on the economy and labor force is also important for policymakers; governments may need to implement policies to help workers transition and to ensure the benefits of automation are broadly shared.

This review examines the literature on predicting AI automation, focusing on the economics literature on AI-driven automation of occupational tasks. We also review the nascent literature on empirical validation of these predictions, examining whether we should put more trust in some predictions than others. We hope this review will help researchers engage with this important problem. We also hope that clarifying the challenges faced by existing predictions will surface promising directions for future work.

In the literature, there are three broad ways people have predicted automation:

  • Task feature analysis to measure how susceptible tasks are to automation, typically informed by AI researchers’ opinions on AI capabilities, which are then linked to features of work tasks.
  • Task-patent mapping, matching task descriptions to similar AI patents.
  • Automation forecasting surveys directly asking AI experts when tasks or occupations will become automatable.

There are several challenges to existing methodologies for predicting automation. Most approaches focus on “in-principle-automatability” - scoring occupational tasks on how soon AI will have the technical capacity to automate them, while neglecting questions about costs and incentives, workflow reorganization, societal decisions, and other important factors. Interpretability is another important limitation. Many approaches do not distinguish between full and partial automation, making their results difficult to translate into well-defined predictions. Moreover, these methodologies, with the exception of forecasting surveys, do not provide concrete timelines - they instead provide ratings of relative automatability. Finally, some approaches rely on assessments of AI capabilities that have become outdated by subsequent progress, as discussed by Eloundou et al. (2023) and Felten et al. (2023). These limitations can be severe, and we discuss them further in Comparison of prediction methodologies.

Nevertheless, the field has made progress on the above challenges. There is even empirical evidence exploring how predictions compare to real world data. Several prediction methodologies for software and robotics automation have been partially validated by economy-wide occupational changes in hiring and wages. Focusing more narrowly on AI, there is early evidence that predictions correlate with firm-level changes in hiring. There are also AI-specific case studies showing productivity improvements from AI systems in domains such as customer support, translation, software engineering and management consulting. Taken together, these growing bodies of evidence may provide clues about how to predict automation from AI - and how to treat existing predictions.

Before reviewing prediction approaches in detail, we discuss the task-based framework used to model automation, which underpins them (How does automation happen?). Subsequently, we examine the purpose of automatability predictions, and the properties that we desire from prediction methodologies (What do we want from automation predictions?). We proceed to review the literature on automatability predictions, with a focus on their methodologies (An overview of automatability prediction methodologies). We then review the literature on empirical validation of these approaches, and examine the differences between them in a side-by-side comparison (Empirical evidence and comparison). Finally, we discuss open questions for future research, relating these to recent progress in both AI and the automation literature (Discussion).

How does automation happen?

In labor economics, it is common to study automation at the level of tasks associated with workers in given occupations or sectors, for example as described by Autor (2013). Task-level analyses can provide a more accurate and realistic model of automation than occupation-level analyses. For example, the work tasks of a secretary or software engineer have changed dramatically due to technology, but these occupations have not been automated in their entirety. The task-based approach builds upon the older “canonical model” that divided workers into low- and high-skilled groups without examining their individual tasks.

The concept of a task is defined very generally, for example in Autor (2013): “a unit of work activity that produces output”. Concrete examples offered are “moving an object, performing a calculation, communicating a piece of information, or resolving a discrepancy”. Much research on automatability has used the Occupational Information Network (O*NET) database of occupational information as a detailed source of occupations, tasks, and skills. The O*NET database describes work tasks at a range of granularities: a task is the most fine-grained description of work, often specific to a single occupation. Work activities are higher-level descriptions of more general work, often associated with many occupations.

There are broadly three ways of thinking about automation in a task-based framework:

  • Full automation of tasks within jobs, i.e. technology being able to perform the task.
  • Partial automation of tasks within jobs, i.e. improving workers’ productivity.
  • Deskilling of tasks within jobs, i.e. reducing the skill requirement to perform a task.

Much of the literature on measuring automatability does not carefully distinguish between these three. This is in contrast to the literature on modeling automation’s effects, such as Acemoglu and Autor (2011), which considers these differences to be important. All three can also see further technological progress, resulting in deepening automation: improvements in productivity on already-automated tasks, as discussed in Acemoglu and Restrepo (2019). Relatedly, the practical cost of automation may vary with time: there is an important difference between a task that can be automated with large upfront and ongoing capital investment, and a task that can be automated with near-zero investment (Arntz et al., 2016).

An important feature of task-based analyses is that tasks can be adapted, destroyed or created due to technological and social change (Arntz et al., 2016; Acemoglu and Restrepo, 2019). This may happen naturally as a consequence of automation: when tasks can be automated cost-effectively, there are incentives to automate them. This often results in rearrangement of existing workflows around these tasks. Labor is reallocated to non-automated tasks, with demand shifting correspondingly: if automation increases productivity in a task, complementary non-automatable tasks will assume greater importance. This poses a significant challenge for recording data, modeling automation, and predicting automation trends. Nevertheless, this challenge is fundamental to a realistic view of automation.

What do we want from automation predictions?

Ideally, we would want to predict when particular occupational tasks will see a certain level of automation. We want to answer questions such as “When will AI improve productivity by X% on task T?” In practice, no existing methodology can yet answer such questions. Existing approaches instead tend to provide a general rating of tasks’ susceptibility to automation.

There are several properties we would want from automation predictions:

  1. Interpretable predictions.

    An automatability rating should be clearly related to the type and extent of automation, the timeframe or required AI capabilities, and the extent of task adaptation or workflow rearrangement needed to enable AI automation. This is important because these factors can substantially change the economic and labor market effects of automation. Existing approaches vary in which of these factors they model, and in what detail (if any) they are predicted.

  2. Engagement with deployment practicalities: bottlenecks, incentives, and regulatory issues.

    These can be significant drivers of real-world automation progress. Measuring “in-principle-automatability” while ignoring these considerations, as some approaches do, can produce predictions that diverge vastly from real-world automation. Predictions that neglect deployment can also be ill-defined and hard to falsify. The strongest signal that a task can be automated is that it has been successfully automated. Ideally, a prediction methodology should incorporate these details, or at least provide a clear way to model them separately.

  3. Inputs have a strong theoretical and/or empirical justification for predicting automation.

    Structured prediction methodologies use inputs such as ratings of task suitability for AI, task descriptions, skill ratings, patents, and AI capabilities benchmarks. Ideally, these should have the strongest possible justification for why they predict automation. For example, Eloundou et al. (2023) discusses how rapid changes in AI capabilities may challenge older ratings of task suitability for AI.

  4. Predictions align with evidence on automation so far.

    We are only now beginning to see empirical evidence of AI-driven automation, as opposed to automation from older technologies such as software or industrial robotics. In the labor economics literature, Acemoglu et al. (2020) offers evidence of firm-level hiring changes in response to AI exposure. There are also compelling case studies showing how AI automation can interact with occupational tasks in customer support, translation, software engineering and consulting. Predictions should be consistent with this evidence - or provide a clear reason why not.

An overview of automatability prediction methodologies

Figure 1: The academic literature on predicting task automatability falls into three categories: task feature analysis, task-patent mapping, and automation forecasting surveys. O*NET features, discussed under Task feature analysis, are from the O*NET database of occupational information, characterizing tasks in terms of required skills, abilities, and other details.

There are broadly three approaches for predicting task automatability. First, task feature analysis to measure tasks’ susceptibility to automation, typically informed by AI researchers’ opinions on the capabilities of present and near-future AI. Second, task-patent mapping, which correlates keywords between task descriptions and documents describing recent innovations, such as patents. Finally, there are automation forecasting surveys asking when tasks or activities can be (or will be) automated. Only the surveys include explicit forecasts of when automation will happen; the other approaches provide relative measurements of automatability.

In this section, we review the literature on automatability prediction, focusing on methodologies - how the different methods operate. We briefly discuss broad differences in Overview of predictions, and provide example outputs in Appendix: example outputs from different methods, but they are not the focus of this review.

Methodology | Study | Overview | Inputs | Reproducibility
Task feature analysis | Autor et al. (2003); pre-AI automation, adapted to O*NET variables by Acemoglu and Autor (2011) | Classify tasks as routine / non-routine, cognitive / physical. | Transformed variables from the Dictionary of Occupational Titles (predecessor of O*NET). | Categorisation published in paper.
 | Frey and Osborne (2013) | Label occupation automatability with current AI; extrapolate to all occupations by regression on 9 O*NET “bottlenecks” for AI. | AI researchers surveyed about full occupation automatability. | High-level categorisation published in paper.
 | Arntz et al. (2016) | Regress individual task automatability from inter-job variation in Frey and Osborne (2013). | Worker survey on tasks; Frey and Osborne occupation-level automatability scores. | Occupation-level ratings reported in paper.
 | Manyika et al. (2017), followed up later by Chui et al. (2023) | Rate tasks / activities / skills’ automatability. | Proprietary automatability ratings. | Proprietary data, not reported at occupation level.
 | Duckworth et al. (2019) | Rate task automatability; regression on 120 O*NET features (skills, knowledge, abilities) to extrapolate. | Online survey of 156 “experts in machine learning, robotics and intelligent systems”. | Dataset and code available.
 | Felten et al. (2021), building on Felten et al. (2018) | Relate AI benchmarks to O*NET abilities. | EFF AI Progress Measurement benchmarks; crowdsourced ratings of benchmark-ability linkage. | Dataset available, including benchmark-ability linkage and results at the level of occupations, industries and US counties.
 | Brynjolfsson et al. (2018) | Rate activities on a 23-item rubric of Suitability for Machine Learning (SML). | Crowdsourced survey. | Dataset and code available.
 | Eloundou et al. (2023) | Rate tasks according to whether they can be sped up >2x by ChatGPT or similar systems. | Ratings from GPT-experienced annotation workers and from GPT-4. | Rubric included in appendix; data not released.
Task-patent mapping | Webb (2019) | Identify task descriptions’ overlap with patents; aggregate to occupations. | Patents (separate sets for AI, software, industrial robotics). | Dataset and code available.
Automation forecasting surveys | Gruetzemacher et al. (2020) | Forecast fractions of tasks automatable. | 203 attendees at three ML conferences in 2018; no definitions given for “automatable” or the percentages. | Summary data available in paper.
 | Stein-Perlman et al. (2022) | Forecast automatable-in-principle year for all tasks of a job. | 738 authors of NeurIPS or ICML papers in 2022; focus on in-principle automatability. | Dataset available.

Table 1: Overview of the main prediction methodologies, including a brief description of how they work, the inputs they use for rating task automatability, and reproducibility.

Task feature analysis

Background

Attempts to measure task automatability in the economics literature arguably began with Autor et al. (2003). They broke down occupational tasks into routine and non-routine labor, routine meaning “a limited and well-defined set of cognitive and manual activities, those that can be accomplished by following explicit rules [as opposed to] problem-solving and complex communication activities”. They scored task routineness using pre-existing ratings in areas such as “Direction, Control and Planning”. Earlier work had examined the diffusion of computers and industrial robotics and their effects, and work from other disciplines had identified routineness as a key predictor of automation. Autor et al. (2003) applied this categorisation to tasks in a production function, developing a model in which automation replaced routine labor and complemented non-routine labor. This model’s predictions matched empirical evidence on labor demand.
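As a stylized illustration of this type of model (not the exact specification in Autor et al. (2003)), output can be written as a Cobb-Douglas function of routine and non-routine task inputs, with computer capital acting as a perfect substitute for routine labor:

$$ Q = (L_R + C)^{1-\beta} \, L_N^{\beta}, \qquad 0 < \beta < 1, $$

where $L_R$ is routine labor, $L_N$ is non-routine labor, and $C$ is computer capital. As the price of computing falls, firms substitute $C$ for $L_R$ while the marginal product of non-routine labor rises, so automation displaces routine labor and complements non-routine labor.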

Developments in AI later began to challenge the Autor et al. (2003) model of automation. Increasingly, tasks that had been rated as non-routine became more amenable to automation (Susskind, 2019). The first authors to create general metrics for AI automatability across the economy were Frey and Osborne (2013). They asked AI researchers whether 70 occupations could be fully automated in the near future, focusing on the technical capability to automate them rather than whether they would be automated. They then used these survey results to fit a regression to estimate automatability from nine manually identified O*NET skill variables related to AI bottlenecks. These were broadly in the areas of Perception and Manipulation, Creative Intelligence, and Social Intelligence. Their work contained a much-quoted finding that ~50% of total US employment was at high risk of automation (>70% on their metric). Several authors subsequently performed similar analyses for different countries’ labor markets, but without significant changes in methodology.
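To make the extrapolation step concrete, the sketch below fits a simple probabilistic classifier (a logistic regression stand-in for the paper’s classifier) on the expert-labelled occupations and applies it to all occupations. File and column names are hypothetical, and the real O*NET feature construction is more involved.

```python
# Minimal sketch of the Frey & Osborne (2013) extrapolation step (hypothetical
# file/column names; logistic regression used as a stand-in classifier).
import pandas as pd
from sklearn.linear_model import LogisticRegression

# One row per occupation: the nine O*NET "bottleneck" variables, plus an expert
# 0/1 label ("fully automatable in the near future?") for ~70 occupations.
BOTTLENECKS = [
    "finger_dexterity", "manual_dexterity", "cramped_work_space",   # perception & manipulation
    "originality", "fine_arts",                                     # creative intelligence
    "social_perceptiveness", "negotiation", "persuasion",           # social intelligence
    "assisting_and_caring",
]

occs = pd.read_csv("occupations.csv")
labelled = occs.dropna(subset=["expert_label"])

clf = LogisticRegression(max_iter=1000).fit(
    labelled[BOTTLENECKS], labelled["expert_label"].astype(int)
)

# Probability of technical automatability for every occupation, and the paper's
# "high risk" cutoff of 0.7.
occs["p_automatable"] = clf.predict_proba(occs[BOTTLENECKS])[:, 1]
occs["high_risk"] = occs["p_automatable"] > 0.7
```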

Task-focused analyses

Several researchers responded to Frey and Osborne (2013) by implementing task-focused analyses - arguing that automation usually affects tasks within occupations, rather than entire occupations (Arntz et al., 2016; Manyika et al., 2017; Brandes and Wattenhofer, 2016; Nedelkoska et al., 2018; Duckworth et al., 2019). These works typically found that considering individual tasks significantly reduced the “high risk” share of employment, for example to 9% across OECD countries in Arntz et al. (2016). These methods continued to focus on researchers’ guesses of what could, in theory, be fully automated at the time of their surveys.

Of particular note, due to its transparency and replicability, Duckworth et al. (2019) expanded on Frey and Osborne (2013) by producing their own task-based formulation and open-source dataset. They focused on tasks that were technically automatable at the time of investigation, and worked at a more fine-grained level of detail, i.e. individual O*NET tasks. Their approach aggregated tasks into work activities, weighted by task importance and occupation requirements (skills, knowledge, and abilities). They then regressed automatability from occupation requirements for all O*NET work activities. This gave results more similar to Arntz et al. (2016) and Manyika et al. (2017) than to Frey and Osborne (2013), i.e. few occupations entirely exposed to automation.
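A rough sketch of this kind of task-to-activity pipeline is below, with hypothetical file and column names; the published pipeline differs in its details.

```python
# Sketch of a Duckworth et al. (2019)-style pipeline: importance-weighted
# aggregation of surveyed task ratings up to work activities, then regression
# on occupation requirements to extrapolate (hypothetical file/column names).
import pandas as pd
from sklearn.linear_model import Ridge

task_ratings = pd.read_csv("task_ratings.csv")          # task_id, automatability
task_map = pd.read_csv("task_to_activity.csv")          # task_id, activity_id, occupation, importance
occ_features = pd.read_csv("occupation_features.csv")   # occupation + skill/knowledge/ability scores

# 1. Importance-weighted aggregation of rated tasks up to work activities.
merged = task_map.merge(task_ratings, on="task_id")
activity_scores = (
    merged.assign(weighted=merged["automatability"] * merged["importance"])
    .groupby(["occupation", "activity_id"])
    .apply(lambda g: g["weighted"].sum() / g["importance"].sum())
    .rename("activity_automatability")
    .reset_index()
)

# 2. Regress activity automatability on occupation requirements, so scores can
#    be extrapolated to activities/occupations without survey coverage.
train = activity_scores.merge(occ_features, on="occupation")
feature_cols = [c for c in occ_features.columns if c != "occupation"]
model = Ridge(alpha=1.0).fit(train[feature_cols], train["activity_automatability"])

occ_features["predicted_automatability"] = model.predict(occ_features[feature_cols])
```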

Brynjolfsson and Mitchell (2017) created a structured rubric to measure task automatability in terms of Suitability for Machine Learning (SML). This was potentially an improvement on preceding work by making assessments that are more legible and future-facing than regressing expert predictions via O*NET variables. Brynjolfsson et al. (2018) assessed O*NET work activities in this rubric using crowdsourcing, finding broadly similar proportions of exposed work to Arntz et al. (2016) and Duckworth et al. (2019).

Felten et al. (2018) developed an alternative metric for automatability based on AI benchmarks. They attempted to link AI benchmarks to abilities required by occupations, as categorized by O*NET. They focused on analysis at the occupation level, although O*NET also couples abilities to work activities, so the same dataset could be disaggregated. At first this was done in a backwards-looking way, with abilities linked to benchmarks by the authors with advice from computer science PhD students. Subsequently, this was expanded to include a crowdsourced survey for a forward-looking linkage in Felten et al. (2021), and inspired several similar approaches in subsequent work: Lassebie and Quintini (2022); Josten and Lordan (2020); Tolan et al. (2021).
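The core computation in this family of methods is a weighted aggregation of ability-level AI exposure up to occupations. The sketch below illustrates the idea with hypothetical file and column names; it is not the exact construction in Felten et al. (2021).

```python
# Sketch of a Felten et al.-style occupational AI exposure score
# (hypothetical file/column names).
import pandas as pd

# ability_exposure: one row per O*NET ability, with a score summarizing how much
# measured AI progress (e.g. benchmark categories) relates to that ability.
ability_exposure = pd.read_csv("ability_ai_exposure.csv")   # ability, ai_score
# occ_abilities: O*NET importance of each ability for each occupation.
occ_abilities = pd.read_csv("occupation_abilities.csv")     # occupation, ability, importance

merged = occ_abilities.merge(ability_exposure, on="ability")
exposure = (
    merged.assign(weighted=merged["ai_score"] * merged["importance"])
    .groupby("occupation")
    .apply(lambda g: g["weighted"].sum() / g["importance"].sum())
    .rename("ai_exposure")
    .reset_index()
)
```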

Response to generative AI and LLMs

A handful of publications have been inspired by recent advances in generative AI, revisiting previous analyses and accounting for rapid progress in this area. Felten et al. (2023) revisited their automatability analyses, focusing on the tasks and occupations most exposed to language modeling and image generation. Chui et al. (2023) released a report focused on generative AI, broadly following the methodology of Manyika et al. (2017) - with estimates of roll-out updated to be significantly faster, based on recent advances.

Eloundou et al. (2023) devised a prediction methodology built specifically around large language models (LLMs). They categorized tasks for automation potential from GPT-like systems. This had three strengths: (i) assessing automation potential based on breakthrough AI capabilities demonstrated by LLMs such as GPT-4; (ii) defining a rubric focused on 2x task speed-up rather than full automation; (iii) rating O*NET tasks at the most fine-grained level, providing the most thorough automatability measures to date. Strikingly, the study used a combination of GPT-experienced human raters and GPT-4 itself to rate tasks - showing that there was high agreement between them.
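The sketch below illustrates rubric-based rating with an LLM in the spirit of Eloundou et al. (2023); the rubric text and label set are paraphrased placeholders rather than the paper’s exact rubric, and call_llm stands in for any chat-completion client.

```python
# Minimal sketch of rubric-based task rating with an LLM (paraphrased rubric,
# placeholder labels; `call_llm` is any function mapping a prompt to a string).
RUBRIC = """You will be shown an O*NET task. Decide whether access to a
ChatGPT-like system (optionally with additional tooling) could reduce the time
needed to complete the task by at least half, with equivalent quality.
Answer with exactly one label:
  E0 - no meaningful speed-up
  E1 - >=2x speed-up from the LLM alone
  E2 - >=2x speed-up only with additional software built on the LLM
"""

def rate_task(task_description: str, call_llm) -> str:
    """Return the exposure label for one task; call_llm(prompt) -> str."""
    prompt = f"{RUBRIC}\n\nTask: {task_description}\nLabel:"
    label = call_llm(prompt).strip().split()[0]
    if label not in {"E0", "E1", "E2"}:
        raise ValueError(f"Unexpected label: {label!r}")
    return label

# The same rubric can be given to human annotators; Eloundou et al. report high
# agreement between GPT-4 and experienced human raters on their rubric.
```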

Task-patent mapping

An alternative, data-driven approach to measuring automatability is to relate AI patents to occupations whose activities they may automate. This approach was pioneered by Webb (2019), who extracted verb-noun pairs from patents and related them to fine-grained O*NET tasks and occupation descriptions. This allowed him to measure technological exposure to AI – or to older technologies such as industrial robotics and software – depending on the selection of patents. Meindl et al. (2021) did similar work, expanding the patent selection to cover a broad range of “Fourth Industrial Revolution” topics. Zarifhonarvar (2023) has recently explored a similar text-mining approach for measuring occupational exposure to generative language models. Webb (2019)’s data release allowed for a broad comparison against metrics such as Felten et al. (2018) or Brynjolfsson et al. (2018) - discussed further in Empirical evidence and comparison.
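A toy version of the verb-noun matching step might look like the following; the original work uses a full dependency-parsing pipeline over a large patent corpus, and spaCy is used here purely as an illustrative substitute.

```python
# Toy sketch of Webb (2019)-style verb-object matching between patent text and
# O*NET task statements (illustrative only).
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def verb_object_pairs(text: str) -> set[tuple[str, str]]:
    """Extract (verb lemma, direct-object lemma) pairs from a piece of text."""
    doc = nlp(text)
    return {
        (tok.head.lemma_.lower(), tok.lemma_.lower())
        for tok in doc
        if tok.dep_ == "dobj" and tok.head.pos_ == "VERB"
    }

def task_exposure(task_text: str, patent_pair_counts: Counter) -> float:
    """Share of the task's verb-object pairs that also appear in AI patents."""
    pairs = verb_object_pairs(task_text)
    if not pairs:
        return 0.0
    return sum(1 for p in pairs if patent_pair_counts[p] > 0) / len(pairs)

# patent_pair_counts would be built by running verb_object_pairs over the titles
# and abstracts of a selected set of AI (or software/robotics) patents.
```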

Automation forecasting surveys

The main literature source giving explicit predictions measured in years is Stein-Perlman et al. (2022). This survey of AI researchers asked when all tasks in four example occupations (truck driver, surgeon, retail salesperson, AI researcher) would be technically automatable by AI. They also asked for examples of occupations that would be among the most difficult to automate. Although inter-respondent variation was large, median forecasts were that automatability may be achieved within 10 years for all tasks performed by truck drivers or retail salespersons. However, median forecasts also suggested that the last occupations to become automatable would not be so until much later (80 years). There were many different suggestions for the last occupation to become automatable. These tended to be occupations that (i) benefit from interpersonal physical interaction (nurse, therapist); (ii) involve high social status and charisma (politician, CEO); or (iii) require a lot of originality, creative thought and analytic rigor (philosopher, AI researcher).

Another source of forecasting surveys is Gruetzemacher et al. (2020), which elicited predictions on the overall fraction of automatable tasks versus year, surveying attendees at three machine learning conferences in 2018. While not focusing on individual tasks, this provides some insight into researchers’ overall beliefs. The surveys showed that researchers believed a large fraction of tasks are already automatable (~20%) and that this would increase dramatically in the next decade (~60%). Predictions had high variance, however, and predictions of the tasks automatable in 10 years ranged from 10% to 100%.

Overview of predictions

Comparison of existing methodologies’ predictions is fairly sparse in the literature, with a few exceptions such as Acemoglu et al. (2020). In this section we compare predictions at the level of broad occupational categories and individual occupations. We show there is remarkably little agreement between the different predictions, although a common feature of all AI-focused methodologies is that they rate Managers and Professionals as more exposed to automation than traditional automatability measures do.

Figure 2 compares automatability measures for twelve broad occupation categories, which have often been used for studying automation. Although aggregating to broad occupation categories glosses over task details, these results provide an accessible snapshot of each methodology. Figure 2 only covers task-level methodologies with data releases. We provide a comparison of example results from other methodologies in Appendix: example outputs from different methods. The routineness measure of Acemoglu and Autor (2011) is also included in Figure 2 to allow comparison with pre-AI automation.

Figure 2: Average automatability measures for twelve broad occupation categories used in Acemoglu et al. (2020) and other sources. Higher scores indicate higher automatability. Occupations are ordered from highest to lowest median wage - broadly following traditional ratings of skill level. Measures have been standardized to employment-weighted z-scores at occupation level before aggregation to broad occupation categories. Data is taken from respective publications or the comparison in Acemoglu et al. (2020). In this figure the Acemoglu and Autor (2011) measure uses the sum of cognitive and physical net routineness scores, for simplicity.
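For reference, a minimal sketch of the employment-weighted standardization used in Figures 2 and 3 (hypothetical column names):

```python
# Employment-weighted z-score standardization of occupation-level measures.
import numpy as np
import pandas as pd

def weighted_zscore(scores: pd.Series, employment: pd.Series) -> pd.Series:
    """Standardize occupation-level scores using employment weights."""
    w = employment / employment.sum()
    mean = np.sum(w * scores)
    var = np.sum(w * (scores - mean) ** 2)
    return (scores - mean) / np.sqrt(var)

df = pd.read_csv("occupation_measures.csv")   # occupation, employment, measure_x, ...
df["measure_x_z"] = weighted_zscore(df["measure_x"], df["employment"])
```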

Notably, most AI methodologies predict that higher-wage occupations such as Managers & Professionals have middling-to-high automatability. Duckworth et al. (2019) does not follow this pattern, and is generally closer to older pre-AI predictions - consistent with its broader scope and survey focused on currently-automatable tasks. Unfortunately, Eloundou et al. (2023) does not have a data release, preventing its inclusion in Figure 2. However, one should expect higher ratings for occupations involving more cognitive tasks, fewer manual tasks, and higher educational requirements - i.e. following the same pattern.

In the AI-focused methodologies, there is relatively little agreement between measures. Brynjolfsson et al. (2018) and Felten et al. (2018) are somewhat similar, showing a broad pattern where the occupations traditionally seen as high-skill are more automatable. However, the individual occupational categories do not agree closely. Webb (2019), conversely, predicts middling automatability across most occupations, with notable exceptions in Sales and Office/Admin (less automatable), and Farm & Mining (more automatable). This intuitively matches the method’s focus on patents before generative AI, typically involving “systematic relationships between inputs and decisions, such as sorting agricultural products” (Webb, 2019).

The low agreement between different automatability measures is explored in more detail in Figure 3. Figure 3 shows pairwise comparison of automatability measures, plotting their respective ratings for individual O*NET occupations. Duckworth et al. (2019) shows fairly close agreement with Acemoglu and Autor (2011), further supporting the interpretation that it is dominated by traditional pre-AI ideas of automatability. However, AI-focused measures have little correlation with each other: the strongest agreement is between Felten et al. (2018) and Brynjolfsson et al. (2018), where one measure can explain only 8% of the other’s variance. Other AI measures are wholly uncorrelated. This emphasizes their significantly different predictions, although without validation data it is unclear whether this implies some predictions are more accurate, or that different methods “capture different components of AI exposure” (Acemoglu et al., 2020).
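A sketch of the pairwise comparison underlying Figure 3, computing an OLS fit between two standardized measures together with a bootstrapped 95% confidence interval on the slope and the explained variance R², is given below (hypothetical column and measure names):

```python
# Pairwise comparison of two standardized automatability measures across
# O*NET occupations: OLS slope, bootstrapped 95% CI, and R^2.
import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_csv("occupation_measures_z.csv")   # one row per O*NET occupation

def compare(measure_a: str, measure_b: str, n_boot: int = 2000, seed: int = 0) -> dict:
    """Fit measure_b ~ measure_a and bootstrap the slope over occupations."""
    x, y = df[measure_a].to_numpy(), df[measure_b].to_numpy()
    fit = stats.linregress(x, y)
    rng = np.random.default_rng(seed)
    slopes = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(x), len(x))   # resample occupations with replacement
        slopes.append(stats.linregress(x[idx], y[idx]).slope)
    lo, hi = np.percentile(slopes, [2.5, 97.5])
    return {"slope": fit.slope, "slope_ci_95": (lo, hi), "r_squared": fit.rvalue ** 2}

# e.g. the Felten (2018) vs Brynjolfsson (2018) comparison discussed in the text
print(compare("felten_2018", "brynjolfsson_2018"))
```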