44 Jobs OpenAI Uses to Measure AI Capability
How Well AI Performs Against Humans
TL;DR Summary
GDPval is OpenAI’s evaluation designed to measure how well AI models perform on economically valuable, real-world deliverables—not exam questions. (OpenAI)
It spans 44 predominantly knowledge-work occupations across 9 U.S. GDP-leading sectors, chosen via BLS wage data and O*NET task analysis, using a 60% “digital/knowledge work” threshold. (OpenAI)
The full set includes 1,320 tasks (about 30 per occupation), plus a 220-task open “gold” subset. (OpenAI)
Grading is primarily blind, head-to-head expert comparison (humans judging AI outputs vs expert deliverables), with an experimental automated grader released for research. (OpenAI)
Early results suggest frontier models are approaching expert quality on many tasks; OpenAI also reports models can be ~100× faster and ~100× cheaper on inference costs (not counting oversight and integration). (OpenAI)
“What next” is clearly signposted: more occupations, more interactive workflows, more ambiguity, richer context, and better measurement of real workplace iteration. (OpenAI)
What exactly is GDPval in plain English?
GDPval (short for “GDP value”) is a new kind of test, or “benchmark,” created by OpenAI to measure how well AI models perform on the real-world, economically valuable tasks that professionals do every day.
In plain English, it moves beyond typical academic exams and quiz questions to see whether an AI can produce the actual work deliverables that people are paid for in jobs that contribute significantly to the economy.
It asks a more practical question than most benchmarks:
Can a model produce work products that professionals would actually accept in real jobs?
Instead of multiple-choice questions or short-form reasoning puzzles, GDPval tests deliverables—the sorts of outputs that sit at the end of an actual work request: a legal brief, a nursing care plan, a customer support conversation, an engineering-style document, or a slide deck. OpenAI positions GDPval as the next step after academic benchmarks (like MMLU) and domain benchmarks (like SWE-Bench), to close the gap between “lab intelligence” and “workplace usefulness.” (OpenAI)
Key Features of GDPval
Real Work, Not Trivia: The tasks are based on actual work products like writing a legal brief, creating an engineering blueprint, preparing a financial analysis, or developing a nursing care plan.
Expert-Designed and Graded: The tasks were created by experienced professionals (averaging 14 years of experience) in 44 different occupations. The models' outputs are then blindly compared and graded by other human experts, who decide if the AI's work is "better," "as good as," or "worse than" a human-produced version.
Focus on Deliverables: Unlike tests that just require a text answer, GDPval tasks often involve multiple file types and formats, such as spreadsheets, presentations, diagrams, and multimedia, reflecting the multimodal nature of real knowledge work.
Economic Context: The name "GDPval" comes from using Gross Domestic Product (GDP) as an indicator of economic importance. The evaluation focuses on jobs within the top industries that contribute most to U.S. GDP, allowing researchers to gauge the potential economic impact of AI capabilities.
From an SEO / AEO / GEO lens, GDPval matters because it shifts the narrative from “model scores” to work outcomes, which is exactly how decision-makers think when they allocate budgets: what can this system reliably produce, at what quality, cost, and risk? (arXiv)
Essentially, GDPval is a practical "performance review" for AI, designed to bridge the gap between academic capabilities and actual workplace utility. More details are available in OpenAI's official blog post about the evaluation framework.
Why does OpenAI evaluate AI against those 44 jobs?
1) Because “AI impact” is mostly debated without task-level evidence
The paper explicitly frames GDPval as a way to measure capabilities ahead of adoption curves. Traditional economic indicators (usage patterns, productivity stats, GDP growth attribution) are lagging—they show impact after years of organisational change, tooling, regulation, training, and process redesign. (arXiv)
GDPval tries to answer the leading indicator question:
What can models do today that maps to paid work?
Where are they close, and where are they still far? (arXiv)
2) Because jobs are bundles of tasks, and AI tends to land on “task slices” first
OpenAI is careful to state (in effect) that most occupations aren’t instantly “automated”; rather, AI often takes on repeatable, well-specified subtasks, freeing humans for judgment-heavy work. (OpenAI)
That framing is strategically important:
It avoids the simplistic “AI replaces jobs” headline.
It supports a more realistic “AI reconfigures workflows” model—which is where most enterprise value is created.
3) Because selecting “GDP-relevant” sectors makes the benchmark economically legible
GDPval’s initial scope is anchored to the top 9 sectors contributing over 5% to U.S. GDP, using Federal Reserve Bank of St. Louis (FRED) industry value-added data as the basis for sector selection. (OpenAI)
That choice matters because it makes the evaluation:
easier to interpret for policy and business stakeholders,
more comparable across time,
and more aligned with “where productivity gains would move the needle.”
How OpenAI chose the 44 human jobs
OpenAI’s selection logic is intentionally “top-down” and defensible:
Pick 9 sectors contributing >5% of U.S. GDP. (OpenAI)
Within each sector, select up to 5 occupations that contribute most to wages/compensation, using May 2024 BLS occupational employment and wage data. (OpenAI)
Filter to “predominantly knowledge work / digital work” using O*NET task data: an occupation qualifies if at least 60% of its tasks are classified as digital knowledge work, i.e., not involving manual or physical labour (the blog and the paper phrase this threshold slightly differently). (OpenAI)
This yields 44 occupations—and notably, the paper says those occupations collectively earn about $3T annually, which is a way of signalling “this isn’t a niche benchmark.” (arXiv)
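The three-step selection pipeline above can be sketched as a simple filter chain. Note that the sector shares, wage figures, and digital-task ratios below are hypothetical placeholders for illustration, not OpenAI's actual FRED, BLS, or O*NET data:

```python
# Sketch of GDPval's occupation-selection logic with made-up inputs.
# Real inputs: FRED sector value-added, May 2024 BLS wage data, O*NET tasks.

SECTORS = {  # sector -> share of U.S. GDP (hypothetical numbers)
    "Manufacturing": 0.10,
    "Information": 0.055,
    "Construction": 0.04,  # below the 5% threshold, dropped
}

OCCUPATIONS = [  # (sector, occupation, total wages $B, digital task share) - hypothetical
    ("Manufacturing", "Mechanical engineers", 25.0, 0.80),
    ("Manufacturing", "Assemblers", 30.0, 0.10),  # fails the 60% digital filter
    ("Information", "Editors", 8.0, 0.95),
    ("Information", "Film and video editors", 5.0, 0.90),
]

def select_occupations(sectors, occupations,
                       gdp_cut=0.05, digital_cut=0.60, per_sector=5):
    # Step 1: keep sectors contributing more than 5% of GDP
    eligible = {s for s, share in sectors.items() if share > gdp_cut}
    chosen = []
    for sector in eligible:
        # Step 3 (filter): keep predominantly digital/knowledge-work occupations
        candidates = [o for o in occupations
                      if o[0] == sector and o[3] >= digital_cut]
        # Step 2 (rank): up to 5 occupations with the highest total wages
        candidates.sort(key=lambda o: o[2], reverse=True)
        chosen.extend(candidates[:per_sector])
    return [(sector, name) for sector, name, _, _ in chosen]
```

Running `select_occupations(SECTORS, OCCUPATIONS)` on this toy data keeps the two engineering/editing occupations and drops both the sub-threshold sector and the manual occupation, mirroring how the real pipeline narrows thousands of BLS occupations down to 44.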
The 44 jobs in OpenAI’s GDPval: the full list
Below is the complete occupation list published by OpenAI, grouped by sector. (OpenAI)
Real estate and rental and leasing
Concierges
Property, real estate, and community association managers
Real estate sales agents
Real estate brokers
Counter and rental clerks
Government
Recreation workers
Compliance officers
First-line supervisors of police and detectives
Administrative services managers
Child, family, and school social workers
Manufacturing
Mechanical engineers
Industrial engineers
Buyers and purchasing agents
Shipping, receiving, and inventory clerks
First-line supervisors of production and operating workers
Professional, scientific, and technical services
Software developers
Lawyers
Accountants and auditors
Computer and information systems managers
Project management specialists
Health care and social assistance
Registered nurses
Nurse practitioners
Medical and health services managers
First-line supervisors of office and administrative support workers
Medical secretaries and administrative assistants
Finance and insurance
Customer service representatives
Financial and investment analysts
Financial managers
Personal financial advisors
Securities, commodities, and financial services sales agents
Retail trade
Pharmacists
First-line supervisors of retail sales workers
General and operations managers
Private detectives and investigators
Wholesale trade
Sales managers
Order clerks
First-line supervisors of non-retail sales workers
Sales representatives, wholesale and manufacturing, except technical and scientific products
Sales representatives, wholesale and manufacturing, technical and scientific products
Information
Audio and video technicians
Producers and directors
News analysts, reporters, and journalists
Film and video editors
Editors
How GDPval AI tasks are built
GDPval is constructed to be hard in the ways work is hard:
Tasks are written by experienced professionals, not benchmark designers
For each occupation, OpenAI worked with professionals averaging ~14 years of experience, and tasks went through multiple rounds of review to ensure they were representative, feasible, and gradeable. (OpenAI)
The full dataset is sizable, and the “gold subset” is open
Full set: 1,320 tasks (about 30 tasks per occupation) (OpenAI)
Gold subset (open): 220 tasks (about 5 tasks per occupation) (OpenAI)
This matters for research and GEO/AEO content strategy because “open tasks + public grading service” accelerates replication, third-party critique, and competitive benchmarking across model providers. (arXiv)
Multi-modality and file-heavy context are a core design choice
The paper highlights that tasks require working with formats like:
CAD files, images, video/audio,
diagrams, slide decks, spreadsheets,
and customer conversations—often with many reference files (up to 17 in the gold subset and 38 in the full set). (arXiv)
This is crucial: most “AI is amazing” demos are short prompts with no messy inputs. GDPval is deliberately pushing toward workplace constraints.
Long-horizon difficulty: tasks are time-expensive for humans
GDPval tasks require an average of ~7 hours for an expert to complete, and some can span weeks. (arXiv)
That one detail explains why GDPval is strategically different: it’s measuring something closer to project work, not “chat.”
How GDPval grades AI model performance
The primary metric: blind, head-to-head expert preference
OpenAI uses experienced professionals (“graders”) from the relevant occupations to blindly compare AI outputs vs human expert deliverables, and judge whether the AI output is better, as good as, or worse. (OpenAI)
Why preference-based grading?
Because for many real deliverables, “correctness” is not binary. Experts care about:
structure, clarity, formatting, relevance,
professional tone and completeness,
and whether it would survive real stakeholder scrutiny. (arXiv)
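The preference metric boils down to a win rate over pairwise judgments. The aggregation below is a common convention for such comparisons (counting ties either as wins or as half-wins), not OpenAI's published scoring code, and the example tallies are hypothetical:

```python
from collections import Counter

def summarize_preferences(judgments):
    """Aggregate blind pairwise judgments of AI vs. expert deliverables.

    judgments: list of strings, each "better", "as_good", or "worse",
    describing the AI output relative to the expert's. Returns the
    "as good as or better" rate, plus a stricter rate that counts
    ties as half a win.
    """
    counts = Counter(judgments)
    n = len(judgments)
    win_or_tie = (counts["better"] + counts["as_good"]) / n
    ties_as_half = (counts["better"] + 0.5 * counts["as_good"]) / n
    return {"win_or_tie_rate": win_or_tie, "win_rate_ties_half": ties_as_half}

# Hypothetical example: 220 gold tasks, one judgment each
example = ["better"] * 40 + ["as_good"] * 60 + ["worse"] * 120
rates = summarize_preferences(example)  # win_or_tie_rate = 100/220, just under half
```

Reporting both rates matters: “as good as or better in just under half of tasks” is a win-or-tie figure, and conflating it with a strict win rate would overstate the result.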
The “automated grader” exists, but is positioned as experimental
OpenAI also built an “automated grader” intended to estimate how humans would judge outputs, and released it as a public research service—but states it’s not yet reliable enough to replace experts. (OpenAI)
What GDPval’s early results say, and what they do not say
What OpenAI reports
OpenAI reports that frontier models are “already approaching” expert quality on many tasks, based on blind comparisons across the 220 gold tasks. (OpenAI)
They also describe model-by-model tendencies:
Claude Opus 4.1 performed best overall in their gold-set run and was noted as strong on aesthetics/formatting; OpenAI states it was rated “as good as or better than humans” in just under half of tasks. (OpenAI)
GPT-5 is described as excelling more on accuracy / domain-specific knowledge. (OpenAI)
OpenAI additionally reports a very business-relevant cost/time claim: frontier models can complete GDPval tasks roughly 100× faster and 100× cheaper than experts—while explicitly caveating that this reflects inference time and API billing, not the human oversight and integration steps needed in practice. (OpenAI)
What this does not mean
GDPval does not imply:
whole occupations are “solved,”
organisations can deploy models with zero operational risk,
or that one-shot outputs are equivalent to iterative stakeholder work.
OpenAI and the paper emphasise limitations: GDPval is currently one-shot, focused on self-contained digital deliverables, and doesn’t fully capture ambiguity, tacit knowledge, interpersonal coordination, or iterative revision cycles. (OpenAI)
The most interesting “hidden” insight: scaffolding and prompting can move the needle
One of the most operationally important findings is that performance improves with:
increased reasoning effort,
more context,
and better scaffolding. (arXiv)
A concrete example from the paper: prompt/scaffolding changes eliminated a PDF artifact issue and reduced major PowerPoint formatting errors, while improving human preference win rates by ~5 percentage points in their experiment. (arXiv)
In practical terms: GDPval is not only measuring “model IQ.” It’s measuring a stack:
model + instructions,
model + tool use,
model + checking/rendering outputs,
model + workflow design.
That is exactly where most enterprises will compete.
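A minimal version of the “model + checking/rendering” layer of that stack is a generate-validate-retry loop. Everything here is a hypothetical stand-in: `generate` would wrap a real model API call, and `check_renders` would actually render the PDF/PPTX and scan for artifacts like the ones the paper's scaffolding fix eliminated:

```python
def generate(prompt, feedback=None):
    """Hypothetical stand-in for a model call; a real scaffold would
    invoke an LLM API here and return a draft deliverable. This stub
    simply produces a clean render once it has received feedback."""
    return {"text": "draft deliverable", "renders_cleanly": feedback is not None}

def check_renders(draft):
    # In a real scaffold: render the output file and look for blank
    # pages, clipped text boxes, or broken embedded images.
    return draft["renders_cleanly"]

def scaffolded_generate(prompt, max_retries=3):
    """Generate, validate the rendered output, and retry with feedback."""
    feedback = None
    for _ in range(max_retries):
        draft = generate(prompt, feedback)
        if check_renders(draft):
            return draft
        feedback = "Output failed rendering checks; fix the formatting."
    return draft  # best effort after exhausting retries
```

The point of the sketch is the shape, not the stubs: the ~5-percentage-point win-rate gain came from wrapping the model in exactly this kind of workflow design rather than from a new model.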
What next: where GDPval is heading
OpenAI explicitly signals several next moves:
1) Move beyond one-shot into interactive AI workflows
OpenAI states future versions should better represent work that requires:
building context,
multiple drafts,
incorporating feedback,
navigating ambiguity. (OpenAI)
2) Expand breadth: more occupations, more task types
Both the blog and paper describe GDPval as an early version and indicate expansion beyond the current 44-occupation scope. (OpenAI)
3) Better automated grading (without pretending it’s perfect)
They’re open-sourcing a gold subset and offering an automated grader to make evaluation more accessible, while acknowledging grader limitations and cost trade-offs. (arXiv)
4) The deeper strategic shift: “economic capability curves”
The paper reports frontier model performance is improving roughly linearly over time on GDPval, and the OpenAI post highlights strong progress year over year. (OpenAI)
If that holds, the competitive game becomes:
Which provider improves fastest on your job-relevant tasks?
Which provider integrates best into tools, governance, compliance, and user adoption?
Which provider reduces oversight cost (the hidden line item)?
Related AI FAQs
1) What is GDPval?
GDPval is a benchmark evaluating AI model performance on real-world, economically valuable tasks, using deliverables from 44 occupations across 9 major U.S. GDP sectors. (OpenAI)
2) Why did OpenAI choose 44 occupations?
They selected occupations that are economically significant (high total wages) and predominantly knowledge/digital work, filtered using O*NET tasks and a 60% threshold. (OpenAI)
3) Are these “the 44 jobs AI will replace”?
No. GDPval measures performance on specific tasks/deliverables in those occupations, and OpenAI notes most jobs include ambiguity, iteration, and human coordination that the current evaluation doesn’t fully capture. (OpenAI)
4) How big is the GDPval dataset?
The full set includes 1,320 tasks (about 30 per occupation). The open gold subset includes 220 tasks (about 5 per occupation). (OpenAI)
5) How are GDPval tasks graded?
Primarily via blind, head-to-head comparison by expert graders who evaluate AI outputs against expert human deliverables. (OpenAI)
6) Does GDPval include multimodal work (files, slides, spreadsheets)?
Yes. Tasks involve multiple formats (e.g., CAD, slides, spreadsheets, multimedia) and can require parsing many reference files (up to 17 in the gold set and 38 in the full set). (arXiv)
7) How long do tasks take humans to complete?
The paper reports tasks average about 7 hours for an expert, with some spanning multiple weeks. (arXiv)
8) What AI models were compared in OpenAI’s early results?
OpenAI reports blind comparisons including models such as GPT-4o, GPT-5, Claude Opus 4.1, Gemini 2.5 Pro, and Grok 4, among others. (OpenAI)
9) What’s the “100× faster and cheaper” claim?
OpenAI reports that frontier models can complete GDPval tasks roughly 100× faster and 100× cheaper than experts based on inference time and API billing—but this excludes human oversight and integration costs. (OpenAI)
10) What is OpenAI planning to do next with GDPval?
They plan to expand GDPval with more occupations and more realistic workflows—especially interactive, context-rich, and ambiguous tasks beyond one-shot prompts. (OpenAI)
