44 Jobs OpenAI Uses to Measure AI Capability

How Well AI Performs Against Humans

TL;DR Summary

  • GDPval is OpenAI’s evaluation designed to measure how well AI models perform on economically valuable, real-world deliverables—not exam questions. (OpenAI)

  • It spans 44 predominantly knowledge-work occupations across 9 U.S. GDP-leading sectors, chosen via BLS wage data and O*NET task analysis, using a 60% “digital/knowledge work” threshold. (OpenAI)

  • The full set includes 1,320 tasks (about 30 per occupation), plus a 220-task open “gold” subset. (OpenAI)

  • Grading is primarily blind, head-to-head expert comparison (humans judging AI outputs vs expert deliverables), with an experimental automated grader released for research. (OpenAI)

  • Early results suggest frontier models are approaching expert quality on many tasks; OpenAI also reports models can be ~100× faster and ~100× cheaper on inference costs (not counting oversight and integration). (OpenAI)

  • “What next” is clearly signposted: more occupations, more interactive workflows, more ambiguity, richer context, and better measurement of real workplace iteration. (OpenAI)

What exactly is GDPval in plain English?

GDPval (short for "GDP value") is a new kind of test, or "benchmark," created by OpenAI to measure how well AI models perform on real-world, economically valuable tasks that professionals do every day.

In plain English, it moves beyond typical academic exams or quiz questions to see if an AI can produce the actual work deliverables that people are paid for in jobs contributing significantly to the economy.

Put another way, GDPval is an AI evaluation that asks a more practical question than most benchmarks:

Can a model produce work products that professionals would actually accept in real jobs?

Instead of multiple-choice questions or short-form reasoning puzzles, GDPval tests deliverables—the sorts of outputs that sit at the end of an actual work request: a legal brief, a nursing care plan, a customer support conversation, an engineering-style document, or a slide deck. OpenAI positions GDPval as the next step after academic benchmarks (like MMLU) and domain benchmarks (like SWE-Bench), to close the gap between “lab intelligence” and “workplace usefulness.” (OpenAI)

Key Features of GDPval

  • Real Work, Not Trivia: The tasks are based on actual work products like writing a legal brief, creating an engineering blueprint, preparing a financial analysis, or developing a nursing care plan.

  • Expert-Designed and Graded: The tasks were created by experienced professionals (averaging 14 years of experience) in 44 different occupations. The models' outputs are then blindly compared and graded by other human experts, who decide if the AI's work is "better," "as good as," or "worse than" a human-produced version.

  • Focus on Deliverables: Unlike tests that just require a text answer, GDPval tasks often involve multiple file types and formats, such as spreadsheets, presentations, diagrams, and multimedia, reflecting the multimodal nature of real knowledge work.

  • Economic Context: The name "GDPval" comes from using Gross Domestic Product (GDP) as an indicator of economic importance. The evaluation focuses on jobs within the top industries that contribute most to U.S. GDP, allowing researchers to gauge the potential economic impact of AI capabilities.

From an SEO / AEO / GEO lens, GDPval matters because it shifts the narrative from "model scores" to work outcomes, which is exactly how decision-makers think when they allocate budgets: What can this system reliably produce, at what quality, cost, and risk? (arXiv)

Essentially, GDPval is a practical "performance review" for AI, designed to bridge the gap between academic capabilities and actual workplace utility. More details are available in OpenAI's official blog post about the evaluation framework. 

Why does OpenAI evaluate AI against those 44 jobs?

1) Because “AI impact” is mostly debated without task-level evidence

The paper explicitly frames GDPval as a way to measure capabilities ahead of adoption curves. Traditional economic indicators (usage patterns, productivity stats, GDP growth attribution) are lagging—they show impact after years of organisational change, tooling, regulation, training, and process redesign. (arXiv)

GDPval tries to answer the leading indicator question:

  • What can models do today that maps to paid work?

  • Where are they close, and where are they still far? (arXiv)

2) Because jobs are bundles of tasks, and AI tends to land on “task slices” first

OpenAI is careful to state (in effect) that most occupations aren’t instantly “automated”; rather, AI often takes on repeatable, well-specified subtasks, freeing humans for judgment-heavy work. (OpenAI)

That framing is strategically important:

  • It avoids the simplistic “AI replaces jobs” headline.

  • It supports a more realistic “AI reconfigures workflows” model—which is where most enterprise value is created.

3) Because selecting “GDP-relevant” sectors makes the benchmark economically legible

GDPval’s initial scope is anchored to the top 9 sectors contributing over 5% to U.S. GDP, using Federal Reserve Bank of St. Louis (FRED) industry value-added data as the basis for sector selection. (OpenAI)

That choice matters because it makes the evaluation:

  • easier to interpret for policy and business stakeholders,

  • more comparable across time,

  • and more aligned with “where productivity gains would move the needle.”

How OpenAI chose the 44 human jobs

OpenAI’s selection logic is intentionally “top-down” and defensible:

  1. Pick 9 sectors contributing >5% of U.S. GDP. (OpenAI)

  2. Within each sector, select up to 5 occupations that contribute most to wages/compensation, using May 2024 BLS occupational employment and wage data. (OpenAI)

  3. Filter to “predominantly knowledge work / digital work” using O*NET task data, where an occupation qualifies if ≥60% of its tasks are classified as not involving manual/physical work (OpenAI blog) / as “digital” (paper’s methodology). (OpenAI)

This yields 44 occupations—and notably, the paper says those occupations collectively earn about $3T annually, which is a way of signalling “this isn’t a niche benchmark.” (arXiv)
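The three-step selection logic above can be sketched as a small filtering pipeline. The data structures and figures below are hypothetical stand-ins; the real pipeline draws on FRED value-added data, May 2024 BLS wage data, and O*NET task classifications.

```python
def select_occupations(sectors, occupations, tasks):
    """Sketch of GDPval's occupation-selection logic (illustrative only).

    sectors:     {sector_name: share_of_gdp}
    occupations: {occupation_name: (sector_name, total_wages)}
    tasks:       {occupation_name: [bool flags, True = digital/knowledge task]}
    """
    # Step 1: keep sectors contributing more than 5% of U.S. GDP
    big_sectors = {s for s, share in sectors.items() if share > 0.05}

    selected = []
    for sector in big_sectors:
        in_sector = [(occ, wages) for occ, (sec, wages) in occupations.items()
                     if sec == sector]
        # Step 2: rank by total wages, take up to 5 occupations per sector
        for occ, _ in sorted(in_sector, key=lambda pair: -pair[1])[:5]:
            flags = tasks[occ]
            # Step 3: keep only predominantly digital work (>= 60% of tasks)
            if sum(flags) / len(flags) >= 0.60:
                selected.append(occ)
    return selected
```

Applied across the nine qualifying sectors, this kind of filter is what produces the 44-occupation list below.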

The 44 jobs in OpenAI's GDPval: the full list

Below is the complete occupation list published by OpenAI, grouped by sector. (OpenAI)

Real estate and rental and leasing

  • Concierges

  • Property, real estate, and community association managers

  • Real estate sales agents

  • Real estate brokers

  • Counter and rental clerks

Government

  • Recreation workers

  • Compliance officers

  • First-line supervisors of police and detectives

  • Administrative services managers

  • Child, family, and school social workers

Manufacturing

  • Mechanical engineers

  • Industrial engineers

  • Buyers and purchasing agents

  • Shipping, receiving, and inventory clerks

  • First-line supervisors of production and operating workers

Professional, scientific, and technical services

  • Software developers

  • Lawyers

  • Accountants and auditors

  • Computer and information systems managers

  • Project management specialists

Health care and social assistance

  • Registered nurses

  • Nurse practitioners

  • Medical and health services managers

  • First-line supervisors of office and administrative support workers

  • Medical secretaries and administrative assistants

Finance and insurance

  • Customer service representatives

  • Financial and investment analysts

  • Financial managers

  • Personal financial advisors

  • Securities, commodities, and financial services sales agents

Retail trade

  • Pharmacists

  • First-line supervisors of retail sales workers

  • General and operations managers

  • Private detectives and investigators

Wholesale trade

  • Sales managers

  • Order clerks

  • First-line supervisors of non-retail sales workers

  • Sales representatives, wholesale and manufacturing, except technical and scientific products

  • Sales representatives, wholesale and manufacturing, technical and scientific products

Information

  • Audio and video technicians

  • Producers and directors

  • News analysts, reporters, and journalists

  • Film and video editors

  • Editors

How GDPval AI tasks are built

GDPval is constructed to be hard in the ways work is hard:

Tasks are written by experienced professionals, not benchmark designers

For each occupation, OpenAI worked with professionals averaging ~14 years of experience, and tasks went through multiple rounds of review to ensure they were representative, feasible, and gradeable. (OpenAI)

The full dataset is sizable, and the “gold subset” is open

  • Full set: 1,320 tasks (about 30 tasks per occupation) (OpenAI)

  • Gold subset (open): 220 tasks (about 5 tasks per occupation) (OpenAI)

This matters for research and GEO/AEO content strategy because “open tasks + public grading service” accelerates replication, third-party critique, and competitive benchmarking across model providers. (arXiv)

Multi-modality and file-heavy context are a core design choice

The paper highlights that tasks require working with formats like:

  • CAD files, images, video/audio,

  • diagrams, slide decks, spreadsheets,

  • and customer conversations—often with many reference files (up to 17 in the gold subset and 38 in the full set). (arXiv)

This is crucial: most “AI is amazing” demos are short prompts with no messy inputs. GDPval is deliberately pushing toward workplace constraints.

Long-horizon difficulty: tasks are time-expensive for humans

GDPval tasks require an average of ~7 hours for an expert to complete, and some can span weeks. (arXiv)

That one detail explains why GDPval is strategically different: it’s measuring something closer to project work, not “chat.”

How GDPval grades AI model performance

The primary metric: blind, head-to-head expert preference

OpenAI uses experienced professionals (“graders”) from the relevant occupations to blindly compare AI outputs vs human expert deliverables, and judge whether the AI output is better, as good as, or worse. (OpenAI)

Why preference-based grading?
Because for many real deliverables, “correctness” is not binary. Experts care about:

  • structure, clarity, formatting, relevance,

  • professional tone and completeness,

  • and whether it would survive real stakeholder scrutiny. (arXiv)
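The head-to-head scheme reduces to a simple tally over blind verdicts. A minimal sketch (the verdict labels and the "wins plus ties" metric are illustrative assumptions, not OpenAI's exact scoring code):

```python
from collections import Counter

def win_rate(verdicts):
    """verdicts: one expert judgment per task for the AI output
    vs the human expert deliverable, each one of
    'better', 'as_good', or 'worse'.

    Returns the share of tasks where the AI output was rated
    as good as or better than the expert's work."""
    counts = Counter(verdicts)
    favorable = counts["better"] + counts["as_good"]
    return favorable / len(verdicts)

# e.g. 3 of 4 blind comparisons favor or tie the AI output
print(win_rate(["better", "as_good", "worse", "better"]))  # 0.75
```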

The “automated grader” exists, but is positioned as experimental

OpenAI also built an “automated grader” intended to estimate how humans would judge outputs, and released it as a public research service—but states it’s not yet reliable enough to replace experts. (OpenAI)

What GDPval’s early results say, and what they do not say

What OpenAI reports

OpenAI reports that frontier models are “already approaching” expert quality on many tasks, based on blind comparisons across the 220 gold tasks. (OpenAI)

They also describe model-by-model tendencies:

  • Claude Opus 4.1 performed best overall in their gold-set run and was noted as strong on aesthetics/formatting; OpenAI states it was rated “as good as or better than humans” in just under half of tasks. (OpenAI)

  • GPT-5 is described as excelling more on accuracy / domain-specific knowledge. (OpenAI)

OpenAI additionally reports a very business-relevant cost/time claim: frontier models can complete GDPval tasks roughly 100× faster and 100× cheaper than experts—while explicitly caveating that this reflects inference time and API billing, not the human oversight and integration steps needed in practice. (OpenAI)
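To see the order of magnitude, here is a back-of-envelope comparison. All figures below are hypothetical (including the hourly rate); only the ~7-hour task average and the ~100x ratios come from the source, and the model side counts inference only.

```python
# Hypothetical back-of-envelope numbers; not OpenAI's actual billing data.
expert_hours = 7.0                        # average GDPval task time for an expert
expert_rate = 100.0                       # assumed expert hourly rate, USD
expert_cost = expert_hours * expert_rate  # 700.00 USD

model_minutes = expert_hours * 60 / 100   # ~100x faster: ~4.2 minutes
model_cost = expert_cost / 100            # ~100x cheaper: 7.00 USD

print(f"expert: {expert_hours:.1f} h, ${expert_cost:.0f}")
print(f"model:  {model_minutes:.1f} min, ${model_cost:.0f} (inference only)")
```

The caveat in the claim is exactly what this sketch omits: review, correction, and integration time sit on top of the model's line.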

What this does not mean

GDPval does not imply:

  • whole occupations are “solved,”

  • organisations can deploy models with zero operational risk,

  • or that one-shot outputs are equivalent to iterative stakeholder work.

OpenAI and the paper emphasise limitations: GDPval is currently one-shot, focused on self-contained digital deliverables, and doesn’t fully capture ambiguity, tacit knowledge, interpersonal coordination, or iterative revision cycles. (OpenAI)

The most interesting “hidden” insight: scaffolding and prompting can move the needle

One of the most operationally important findings is that performance improves with:

  • increased reasoning effort,

  • more context,

  • and better scaffolding. (arXiv)

A concrete example from the paper: prompt/scaffolding changes eliminated a PDF artifact issue and reduced major PowerPoint formatting errors, while improving human preference win rates by ~5 percentage points in their experiment. (arXiv)

In practical terms: GDPval is not only measuring “model IQ.” It’s measuring a stack:

  • model + instructions,

  • model + tool use,

  • model + checking/rendering outputs,

  • model + workflow design.

That is exactly where most enterprises will compete.
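The "model + checking outputs" layer of that stack can be sketched as a generate-validate-retry loop. `generate` and `passes_checks` below are hypothetical stand-ins (e.g., rendering a slide deck and linting it for formatting artifacts), not GDPval's actual harness.

```python
def scaffolded_generate(task, generate, passes_checks, max_attempts=3):
    """Generate a deliverable, validate it, and retry with feedback.

    generate(task, feedback) -> output
    passes_checks(output)    -> (ok: bool, feedback: str)
    """
    feedback = None
    output = None
    for _ in range(max_attempts):
        output = generate(task, feedback)
        # e.g. render the PPTX/PDF and check for formatting errors
        ok, feedback = passes_checks(output)
        if ok:
            return output
    return output  # best effort after exhausting retries
```

Scaffolding of this shape is the kind of change the paper credits with removing a PDF artifact issue and lifting human-preference win rates by roughly 5 percentage points.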

What next: where GDPval is heading

OpenAI explicitly signals several next moves:

1) Move beyond one-shot into interactive AI workflows

OpenAI states future versions should better represent work that requires:

  • building context,

  • multiple drafts,

  • incorporating feedback,

  • navigating ambiguity. (OpenAI)

2) Expand AI breadth: more occupations, more task types

Both the blog and paper describe GDPval as an early version and indicate expansion beyond the current 44-occupation scope. (OpenAI)

3) Better automated grading (without pretending it’s perfect)

They’re open-sourcing a gold subset and offering an automated grader to make evaluation more accessible, while acknowledging grader limitations and cost trade-offs. (arXiv)

4) The deeper strategic shift: “economic capability curves”

The paper reports frontier model performance is improving roughly linearly over time on GDPval, and the OpenAI post highlights strong progress year over year. (OpenAI)

If that holds, the competitive game becomes:

  • Which provider improves fastest on your job-relevant tasks?

  • Which provider integrates best into tools, governance, compliance, and user adoption?

  • Which provider reduces oversight cost (the hidden line item)?

Related AI FAQs

1) What is GDPval?

GDPval is a benchmark evaluating AI model performance on real-world, economically valuable tasks, using deliverables from 44 occupations across 9 major U.S. GDP sectors. (OpenAI)

2) Why did OpenAI choose 44 occupations?

They selected occupations that are economically significant (high total wages) and predominantly knowledge/digital work, filtered using O*NET tasks and a 60% threshold. (OpenAI)

3) Are these “the 44 jobs AI will replace”?

No. GDPval measures performance on specific tasks/deliverables in those occupations, and OpenAI notes most jobs include ambiguity, iteration, and human coordination that the current evaluation doesn’t fully capture. (OpenAI)

4) How big is the GDPval dataset?

The full set includes 1,320 tasks (about 30 per occupation). The open gold subset includes 220 tasks (about 5 per occupation). (OpenAI)

5) How are GDPval tasks graded?

Primarily via blind, head-to-head comparison by expert graders who evaluate AI outputs against expert human deliverables. (OpenAI)

6) Does GDPval include multimodal work (files, slides, spreadsheets)?

Yes. Tasks involve multiple formats (e.g., CAD, slides, spreadsheets, multimedia) and can require parsing many reference files (up to 17 in the gold set and 38 in the full set). (arXiv)

7) How long do tasks take humans to complete?

The paper reports tasks average about 7 hours for an expert, with some spanning multiple weeks. (arXiv)

8) What AI models were compared in OpenAI’s early results?

OpenAI reports blind comparisons including models such as GPT-4o, GPT-5, Claude Opus 4.1, Gemini 2.5 Pro, and Grok 4, among others. (OpenAI)

9) What’s the “100× faster and cheaper” claim?

OpenAI reports that frontier models can complete GDPval tasks roughly 100× faster and 100× cheaper than experts based on inference time and API billing—but this excludes human oversight and integration costs. (OpenAI)

10) What is OpenAI planning to do next with GDPval?

They plan to expand GDPval with more occupations and more realistic workflows—especially interactive, context-rich, and ambiguous tasks beyond one-shot prompts. (OpenAI)

References

  • OpenAI: “Measuring the performance of our models on real-world tasks” (GDPval overview and occupation list). (OpenAI)

  • Patwardhan et al. (OpenAI), “GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks” (arXiv:2510.04374v1). (arXiv)
