44 Jobs OpenAI Uses to Measure AI Capability

How Well AI Performs Against Humans

TL;DR Summary

  • GDPval is OpenAI’s evaluation designed to measure how well AI models perform on economically valuable, real-world deliverables—not exam questions. (OpenAI)

  • It spans 44 predominantly knowledge-work occupations across 9 U.S. GDP-leading sectors, chosen via BLS wage data and O*NET task analysis, using a 60% “digital/knowledge work” threshold. (OpenAI)

  • The full set includes 1,320 tasks (about 30 per occupation), plus a 220-task open “gold” subset. (OpenAI)

  • Grading is primarily blind, head-to-head expert comparison (humans judging AI outputs vs expert deliverables), with an experimental automated grader released for research. (OpenAI)

  • Early results suggest frontier models are approaching expert quality on many tasks; OpenAI also reports models can be ~100× faster and ~100× cheaper on inference costs (not counting oversight and integration). (OpenAI)

  • “What next” is clearly signposted: more occupations, more interactive workflows, more ambiguity, richer context, and better measurement of real workplace iteration. (OpenAI)

What exactly is GDPval in plain English?

GDPval (short for "GDP value") is a new kind of test, or "benchmark," created by OpenAI to measure how well AI models perform on real-world, economically valuable tasks that professionals do every day.

In plain English, it moves beyond typical academic exams or quiz questions to see if an AI can produce the actual work deliverables that people are paid for in jobs contributing significantly to the economy.

Put another way, GDPval is an AI evaluation that asks a more practical question than most benchmarks:

Can a model produce work products that professionals would actually accept in real jobs?

Instead of multiple-choice questions or short-form reasoning puzzles, GDPval tests deliverables—the sorts of outputs that sit at the end of an actual work request: a legal brief, a nursing care plan, a customer support conversation, an engineering-style document, or a slide deck. OpenAI positions GDPval as the next step after academic benchmarks (like MMLU) and domain benchmarks (like SWE-Bench), to close the gap between “lab intelligence” and “workplace usefulness.” (OpenAI)

Key Features of GDPval

  • Real Work, Not Trivia: The tasks are based on actual work products like writing a legal brief, creating an engineering blueprint, preparing a financial analysis, or developing a nursing care plan.

  • Expert-Designed and Graded: The tasks were created by experienced professionals (averaging 14 years of experience) in 44 different occupations. The models' outputs are then blindly compared and graded by other human experts, who decide if the AI's work is "better," "as good as," or "worse than" a human-produced version.

  • Focus on Deliverables: Unlike tests that just require a text answer, GDPval tasks often involve multiple file types and formats, such as spreadsheets, presentations, diagrams, and multimedia, reflecting the multimodal nature of real knowledge work.

  • Economic Context: The name "GDPval" comes from using Gross Domestic Product (GDP) as an indicator of economic importance. The evaluation focuses on jobs within the top industries that contribute most to U.S. GDP, allowing researchers to gauge the potential economic impact of AI capabilities.

From an SEO / AEO / GEO lens, GDPval matters because it shifts the narrative from "model scores" to work outcomes, which is exactly how decision-makers think when they allocate budgets: What can this system reliably produce, at what quality, cost, and risk? (arXiv)

Essentially, GDPval is a practical "performance review" for AI, designed to bridge the gap between academic capabilities and actual workplace utility. More details are available in OpenAI's official blog post about the evaluation framework. 

Why does OpenAI evaluate AI against those 44 jobs?

1) Because “AI impact” is mostly debated without task-level evidence

The paper explicitly frames GDPval as a way to measure capabilities ahead of adoption curves. Traditional economic indicators (usage patterns, productivity stats, GDP growth attribution) are lagging—they show impact after years of organisational change, tooling, regulation, training, and process redesign. (arXiv)

GDPval tries to answer the leading indicator question:

  • What can models do today that maps to paid work?

  • Where are they close, and where are they still far? (arXiv)

2) Because jobs are bundles of tasks, and AI tends to land on “task slices” first

OpenAI is careful to state (in effect) that most occupations aren’t instantly “automated”; rather, AI often takes on repeatable, well-specified subtasks, freeing humans for judgment-heavy work. (OpenAI)

That framing is strategically important:

  • It avoids the simplistic “AI replaces jobs” headline.

  • It supports a more realistic “AI reconfigures workflows” model—which is where most enterprise value is created.

3) Because selecting “GDP-relevant” sectors makes the benchmark economically legible

GDPval’s initial scope is anchored to the top 9 sectors contributing over 5% to U.S. GDP, using Federal Reserve Bank of St. Louis (FRED) industry value-added data as the basis for sector selection. (OpenAI)

That choice matters because it makes the evaluation:

  • easier to interpret for policy and business stakeholders,

  • more comparable across time,

  • and more aligned with “where productivity gains would move the needle.”

How OpenAI chose the 44 human jobs

OpenAI’s selection logic is intentionally “top-down” and defensible:

  1. Pick 9 sectors contributing >5% of U.S. GDP. (OpenAI)

  2. Within each sector, select up to 5 occupations that contribute most to wages/compensation, using May 2024 BLS occupational employment and wage data. (OpenAI)

  3. Filter to “predominantly knowledge work / digital work” using O*NET task data, where an occupation qualifies if ≥60% of its tasks are classified as not involving manual/physical work (OpenAI blog) / as “digital” (paper’s methodology). (OpenAI)

This yields 44 occupations—and notably, the paper says those occupations collectively earn about $3T annually, which is a way of signalling “this isn’t a niche benchmark.” (arXiv)
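The three-step selection logic above can be sketched as a small filtering pipeline. The data structures and figures below are hypothetical stand-ins; the real pipeline draws on FRED value-added data, May 2024 BLS wage data, and O*NET task classifications.

```python
def select_occupations(sectors, occupations, tasks):
    """Sketch of GDPval's occupation-selection logic (illustrative only).

    sectors:     {sector_name: share_of_gdp}
    occupations: {occupation_name: (sector_name, total_wages)}
    tasks:       {occupation_name: [bool flags, True = digital/knowledge task]}
    """
    # Step 1: keep sectors contributing more than 5% of U.S. GDP
    big_sectors = {s for s, share in sectors.items() if share > 0.05}

    selected = []
    for sector in big_sectors:
        in_sector = [(occ, wages) for occ, (sec, wages) in occupations.items()
                     if sec == sector]
        # Step 2: rank by total wages, take up to 5 occupations per sector
        for occ, _ in sorted(in_sector, key=lambda pair: -pair[1])[:5]:
            flags = tasks[occ]
            # Step 3: keep only predominantly digital work (>= 60% of tasks)
            if sum(flags) / len(flags) >= 0.60:
                selected.append(occ)
    return selected
```

Applied across the nine qualifying sectors, this kind of filter is what produces the 44-occupation list below.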

The 44 jobs in OpenAI's GDPval: the full list

Below is the complete occupation list published by OpenAI, grouped by sector. (OpenAI)

Real estate and rental and leasing

  • Concierges

  • Property, real estate, and community association managers

  • Real estate sales agents

  • Real estate brokers

  • Counter and rental clerks

Government

  • Recreation workers

  • Compliance officers

  • First-line supervisors of police and detectives

  • Administrative services managers

  • Child, family, and school social workers

Manufacturing

  • Mechanical engineers

  • Industrial engineers

  • Buyers and purchasing agents

  • Shipping, receiving, and inventory clerks

  • First-line supervisors of production and operating workers

Professional, scientific, and technical services

  • Software developers

  • Lawyers

  • Accountants and auditors

  • Computer and information systems managers

  • Project management specialists

Health care and social assistance

  • Registered nurses

  • Nurse practitioners

  • Medical and health services managers

  • First-line supervisors of office and administrative support workers

  • Medical secretaries and administrative assistants

Finance and insurance

  • Customer service representatives

  • Financial and investment analysts

  • Financial managers

  • Personal financial advisors

  • Securities, commodities, and financial services sales agents

Retail trade

  • Pharmacists

  • First-line supervisors of retail sales workers

  • General and operations managers

  • Private detectives and investigators

Wholesale trade

  • Sales managers

  • Order clerks

  • First-line supervisors of non-retail sales workers

  • Sales representatives, wholesale and manufacturing, except technical and scientific products

  • Sales representatives, wholesale and manufacturing, technical and scientific products

Information

  • Audio and video technicians

  • Producers and directors

  • News analysts, reporters, and journalists

  • Film and video editors

  • Editors

How GDPval AI tasks are built

GDPval is constructed to be hard in the ways work is hard:

Tasks are written by experienced professionals, not benchmark designers

For each occupation, OpenAI worked with professionals averaging ~14 years of experience, and tasks went through multiple rounds of review to ensure they were representative, feasible, and gradeable. (OpenAI)

The full dataset is sizable, and the “gold subset” is open

  • Full set: 1,320 tasks (about 30 tasks per occupation) (OpenAI)

  • Gold subset (open): 220 tasks (about 5 tasks per occupation) (OpenAI)

This matters for research and GEO/AEO content strategy because “open tasks + public grading service” accelerates replication, third-party critique, and competitive benchmarking across model providers. (arXiv)

Multi-modality and file-heavy context are a core design choice

The paper highlights that tasks require working with formats like:

  • CAD files, images, video/audio,

  • diagrams, slide decks, spreadsheets,

  • and customer conversations—often with many reference files (up to 17 in the gold subset and 38 in the full set). (arXiv)

This is crucial: most “AI is amazing” demos are short prompts with no messy inputs. GDPval is deliberately pushing toward workplace constraints.

Long-horizon difficulty: tasks are time-expensive for humans

GDPval tasks require an average of ~7 hours for an expert to complete, and some can span weeks. (arXiv)

That one detail explains why GDPval is strategically different: it’s measuring something closer to project work, not “chat.”

How GDPval grades AI model performance

The primary metric: blind, head-to-head expert preference

OpenAI uses experienced professionals (“graders”) from the relevant occupations to blindly compare AI outputs vs human expert deliverables, and judge whether the AI output is better, as good as, or worse. (OpenAI)

Why preference-based grading?
Because for many real deliverables, “correctness” is not binary. Experts care about:

  • structure, clarity, formatting, relevance,

  • professional tone and completeness,

  • and whether it would survive real stakeholder scrutiny. (arXiv)
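The head-to-head scheme reduces to a simple tally over blind verdicts. A minimal sketch (the verdict labels and the "wins plus ties" metric are illustrative assumptions, not OpenAI's exact scoring code):

```python
from collections import Counter

def win_rate(verdicts):
    """verdicts: one expert judgment per task for the AI output
    vs the human expert deliverable, each one of
    'better', 'as_good', or 'worse'.

    Returns the share of tasks where the AI output was rated
    as good as or better than the expert's work."""
    counts = Counter(verdicts)
    favorable = counts["better"] + counts["as_good"]
    return favorable / len(verdicts)

# e.g. 3 of 4 blind comparisons favor or tie the AI output
print(win_rate(["better", "as_good", "worse", "better"]))  # 0.75
```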

The “automated grader” exists, but is positioned as experimental

OpenAI also built an “automated grader” intended to estimate how humans would judge outputs, and released it as a public research service—but states it’s not yet reliable enough to replace experts. (OpenAI)

What GDPval’s early results say, and what they do not say

What OpenAI reports

OpenAI reports that frontier models are “already approaching” expert quality on many tasks, based on blind comparisons across the 220 gold tasks. (OpenAI)

They also describe model-by-model tendencies:

  • Claude Opus 4.1 performed best overall in their gold-set run and was noted as strong on aesthetics/formatting; OpenAI states it was rated “as good as or better than humans” in just under half of tasks. (OpenAI)

  • GPT-5 is described as excelling more on accuracy / domain-specific knowledge. (OpenAI)

OpenAI additionally reports a very business-relevant cost/time claim: frontier models can complete GDPval tasks roughly 100× faster and 100× cheaper than experts—while explicitly caveating that this reflects inference time and API billing, not the human oversight and integration steps needed in practice. (OpenAI)
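To see the order of magnitude, here is a back-of-envelope comparison. All figures below are hypothetical (including the hourly rate); only the ~7-hour task average and the ~100x ratios come from the source, and the model side counts inference only.

```python
# Hypothetical back-of-envelope numbers; not OpenAI's actual billing data.
expert_hours = 7.0                        # average GDPval task time for an expert
expert_rate = 100.0                       # assumed expert hourly rate, USD
expert_cost = expert_hours * expert_rate  # 700.00 USD

model_minutes = expert_hours * 60 / 100   # ~100x faster: ~4.2 minutes
model_cost = expert_cost / 100            # ~100x cheaper: 7.00 USD

print(f"expert: {expert_hours:.1f} h, ${expert_cost:.0f}")
print(f"model:  {model_minutes:.1f} min, ${model_cost:.0f} (inference only)")
```

The caveat in the claim is exactly what this sketch omits: review, correction, and integration time sit on top of the model's line.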

What this does not mean

GDPval does not imply:

  • whole occupations are “solved,”

  • organisations can deploy models with zero operational risk,

  • or that one-shot outputs are equivalent to iterative stakeholder work.

OpenAI and the paper emphasise limitations: GDPval is currently one-shot, focused on self-contained digital deliverables, and doesn’t fully capture ambiguity, tacit knowledge, interpersonal coordination, or iterative revision cycles. (OpenAI)

The most interesting “hidden” insight: scaffolding and prompting can move the needle

One of the most operationally important findings is that performance improves with:

  • increased reasoning effort,

  • more context,

  • and better scaffolding. (arXiv)

A concrete example from the paper: prompt/scaffolding changes eliminated a PDF artifact issue and reduced major PowerPoint formatting errors, while improving human preference win rates by ~5 percentage points in their experiment. (arXiv)

In practical terms: GDPval is not only measuring “model IQ.” It’s measuring a stack:

  • model + instructions,

  • model + tool use,

  • model + checking/rendering outputs,

  • model + workflow design.

That is exactly where most enterprises will compete.
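The "model + checking outputs" layer of that stack can be sketched as a generate-validate-retry loop. `generate` and `passes_checks` below are hypothetical stand-ins (e.g., rendering a slide deck and linting it for formatting artifacts), not GDPval's actual harness.

```python
def scaffolded_generate(task, generate, passes_checks, max_attempts=3):
    """Generate a deliverable, validate it, and retry with feedback.

    generate(task, feedback) -> output
    passes_checks(output)    -> (ok: bool, feedback: str)
    """
    feedback = None
    output = None
    for _ in range(max_attempts):
        output = generate(task, feedback)
        # e.g. render the PPTX/PDF and check for formatting errors
        ok, feedback = passes_checks(output)
        if ok:
            return output
    return output  # best effort after exhausting retries
```

Scaffolding of this shape is the kind of change the paper credits with removing a PDF artifact issue and lifting human-preference win rates by roughly 5 percentage points.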

What next: where GDPval is heading

OpenAI explicitly signals several next moves:

1) Move beyond one-shot into interactive AI workflows

OpenAI states future versions should better represent work that requires:

  • building context,

  • multiple drafts,

  • incorporating feedback,

  • navigating ambiguity. (OpenAI)

2) Expand AI breadth: more occupations, more task types

Both the blog and paper describe GDPval as an early version and indicate expansion beyond the current 44-occupation scope. (OpenAI)

3) Better automated grading (without pretending it’s perfect)

They’re open-sourcing a gold subset and offering an automated grader to make evaluation more accessible, while acknowledging grader limitations and cost trade-offs. (arXiv)

4) The deeper strategic shift: “economic capability curves”

The paper reports frontier model performance is improving roughly linearly over time on GDPval, and the OpenAI post highlights strong progress year over year. (OpenAI)

If that holds, the competitive game becomes:

  • Which provider improves fastest on your job-relevant tasks?

  • Which provider integrates best into tools, governance, compliance, and user adoption?

  • Which provider reduces oversight cost (the hidden line item)?

Related AI FAQs

1) What is GDPval?

GDPval is a benchmark evaluating AI model performance on real-world, economically valuable tasks, using deliverables from 44 occupations across 9 major U.S. GDP sectors. (OpenAI)

2) Why did OpenAI choose 44 occupations?

They selected occupations that are economically significant (high total wages) and predominantly knowledge/digital work, filtered using O*NET tasks and a 60% threshold. (OpenAI)

3) Are these “the 44 jobs AI will replace”?

No. GDPval measures performance on specific tasks/deliverables in those occupations, and OpenAI notes most jobs include ambiguity, iteration, and human coordination that the current evaluation doesn’t fully capture. (OpenAI)

4) How big is the GDPval dataset?

The full set includes 1,320 tasks (about 30 per occupation). The open gold subset includes 220 tasks (about 5 per occupation). (OpenAI)

5) How are GDPval tasks graded?

Primarily via blind, head-to-head comparison by expert graders who evaluate AI outputs against expert human deliverables. (OpenAI)

6) Does GDPval include multimodal work (files, slides, spreadsheets)?

Yes. Tasks involve multiple formats (e.g., CAD, slides, spreadsheets, multimedia) and can require parsing many reference files (up to 17 in the gold set and 38 in the full set). (arXiv)

7) How long do tasks take humans to complete?

The paper reports tasks average about 7 hours for an expert, with some spanning multiple weeks. (arXiv)

8) What AI models were compared in OpenAI’s early results?

OpenAI reports blind comparisons including models such as GPT-4o, GPT-5, Claude Opus 4.1, Gemini 2.5 Pro, and Grok 4, among others. (OpenAI)

9) What’s the “100× faster and cheaper” claim?

OpenAI reports that frontier models can complete GDPval tasks roughly 100× faster and 100× cheaper than experts based on inference time and API billing—but this excludes human oversight and integration costs. (OpenAI)

10) What is OpenAI planning to do next with GDPval?

They plan to expand GDPval with more occupations and more realistic workflows—especially interactive, context-rich, and ambiguous tasks beyond one-shot prompts. (OpenAI)

References

  • OpenAI: “Measuring the performance of our models on real-world tasks” (GDPval overview and occupation list). (OpenAI)

  • Patwardhan et al. (OpenAI), “GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks” (arXiv:2510.04374v1). (arXiv)
