
GPT-4o System Card



August 8, 2024


GPT-4o Scorecard

Key Areas of Risk Evaluation & Mitigation

  • Unauthorized voice generation
  • Speaker identification
  • Ungrounded inference & sensitive trait attribution
  • Generating disallowed audio content
  • Generating erotic & violent speech

Preparedness Framework Scorecard

  • Cybersecurity Low
  • CBRN Low
  • Persuasion Medium
  • Model autonomy Low

Scorecard ratings

  • Low
  • Medium
  • High
  • Critical

Only models with a post-mitigation score of "medium" or below can be deployed. Only models with a post-mitigation score of "high" or below can be developed further.

We thoroughly evaluate new models for potential risks and build in appropriate safeguards before deploying them in ChatGPT or the API. We’re publishing the model System Card together with the Preparedness Framework scorecard to provide an end-to-end safety assessment of GPT‑4o, including what we’ve done to track and address today’s safety challenges as well as frontier risks.

Building on the safety evaluations and mitigations we developed for GPT‑4 and GPT‑4V, we’ve focused additional efforts on GPT‑4o’s audio capabilities, which present novel risks, while also evaluating its text and vision capabilities.

Some of the risks we evaluated include speaker identification, unauthorized voice generation, the potential generation of copyrighted content, ungrounded inference, and disallowed content. Based on these evaluations, we’ve implemented safeguards at both the model and system levels to mitigate these risks.

Our findings indicate that GPT‑4o’s voice modality doesn’t meaningfully increase Preparedness risks. Three of the four Preparedness Framework categories scored low, with persuasion scoring borderline medium. The Safety Advisory Group reviewed our Preparedness evaluations and mitigations as part of our safe deployment process. We invite you to read the details of this work in the report below.

Introduction

GPT‑4o 1 is an autoregressive omni model, which accepts as input any combination of text, audio, image, and video and generates any combination of text, audio, and image outputs. It’s trained end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network.

GPT‑4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time 2 in a conversation. It matches GPT‑4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT‑4o is especially better at vision and audio understanding compared to existing models.

In line with our commitment to building AI safely and consistent with our voluntary commitments to the White House 3, we are sharing the GPT‑4o System Card, which includes our Preparedness Framework 4 evaluations. In this System Card, we provide a detailed look at GPT‑4o’s capabilities, limitations, and safety evaluations across multiple categories, with a focus on speech-to-speech (voice) A while also evaluating text and image capabilities, and the measures we’ve taken to enhance safety and alignment. We also include third-party assessments of general autonomous capabilities, as well as discussion of potential societal impacts of GPT‑4o’s text and vision capabilities.

Model data & training

GPT‑4o's capabilities were pre-trained using data up to October 2023, sourced from a wide variety of materials including:

  1. Select publicly available data, mostly collected from industry-standard machine learning datasets and web crawls.
  2. Proprietary data from data partnerships. We form partnerships to access non-publicly available data, such as pay-walled content, archives, and metadata. For example, we partnered with Shutterstock 5 on building and delivering AI-generated images.

The key dataset components that contribute to GPT‑4o’s capabilities are:

  1. Web Data – Data from public web pages provides a rich and diverse range of information, ensuring the model learns from a wide variety of perspectives and topics.
  2. Code and math – Including code and math data in training helps the model develop robust reasoning skills by exposing it to structured logic and problem-solving processes.
  3. Multimodal data – Our dataset includes images, audio, and video to teach the LLMs how to interpret and generate non-textual input and output. From this data, the model learns how to interpret visual images, actions and sequences in real-world contexts, language patterns, and speech nuances.

Prior to deployment, OpenAI assesses and mitigates potential risks that may stem from generative models, such as information harms, bias and discrimination, or other content that violates our safety policies. We use a combination of methods, spanning all stages of development across pre-training, post-training, product development, and policy. For example, during post-training, we align the model to human preferences; we red team the resulting models and add product-level mitigations such as monitoring and enforcement; and we provide moderation tools and transparency reports to our users.

We find that the majority of effective testing and mitigations are done after the pre-training stage because filtering pre-trained data alone cannot address nuanced and context-specific harms. At the same time, certain pre-training filtering mitigations can provide an additional layer of defense that, along with other safety mitigations, helps exclude unwanted and harmful information from our datasets:

  • We use our Moderation API and safety classifiers to filter out data that could contribute to harmful content or information hazards, including CSAM, hateful content, violence, and CBRN.
  • As with our previous image generation systems, we filter our image generation datasets for explicit content such as graphic sexual material and CSAM.
  • We use advanced data filtering processes to reduce personal information from training data.
  • Upon releasing DALL·E 3, we piloted a new approach to give users the power to opt images out of training. To respect those opt-outs, we fingerprinted the images and used the fingerprints to remove all instances of the images from the training dataset for the GPT‑4o series of models (a sketch of this kind of filtering follows below).
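
The System Card does not specify how the fingerprinting works. As one concrete possibility only, a perceptual-hash approach (sketched here with the open-source imagehash library, an assumption rather than OpenAI’s actual pipeline) could catch re-encoded or lightly resized copies, not just byte-identical files:

```python
# Hypothetical sketch of fingerprint-based opt-out filtering.
# Perceptual hashing (pHash) is used purely as an illustrative
# stand-in for whatever fingerprinting method is actually used.
from PIL import Image
import imagehash

def build_optout_index(optout_paths, hash_size=16):
    """Fingerprint every opted-out image once, up front."""
    return {imagehash.phash(Image.open(p), hash_size=hash_size)
            for p in optout_paths}

def keep_image(path, optout_index, max_distance=4, hash_size=16):
    """Drop a training image if it is near-identical to any opt-out.

    A small Hamming-distance threshold catches re-encoded or lightly
    resized copies, not just exact duplicates.
    """
    h = imagehash.phash(Image.open(path), hash_size=hash_size)
    return all(h - existing > max_distance for existing in optout_index)
```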

Risk identification, assessment and mitigation

Deployment preparation included exploratory discovery of additional novel risks through expert red teaming, starting with early checkpoints of the model while still in development; identified risks were then turned into structured measurements, and mitigations were built for them. We also evaluated GPT‑4o in accordance with our Preparedness Framework 4.

External Red Teaming

OpenAI worked with more than 100 external red teamers B, speaking a total of 45 different languages, and representing geographic backgrounds of 29 different countries. Red teamers had access to various snapshots of the model at different stages of training and safety mitigation maturity starting in early March and continuing through late June 2024.

External red teaming was carried out in four phases. The first three phases tested the model via an internal tool and the final phase used the full iOS experience for testing the model. At the time of writing, external red teaming of the GPT‑4o API is ongoing.

Phase 1

  • 10 red teamers working on early model checkpoints still in development
  • Checkpoint accepted audio and text as input and produced audio and text as output
  • Single-turn conversations

Phase 2

  • 30 red teamers working on model checkpoints with early safety mitigations
  • Checkpoint accepted audio, image, and text as inputs and produced audio and text as outputs
  • Single- and multi-turn conversations

Phase 3

  • 65 red teamers working on model checkpoints and candidates
  • Checkpoint accepted audio, image, and text as inputs and produced audio, image, and text as outputs
  • Improved safety mitigations tested to inform further improvements
  • Multi-turn conversations

Phase 4

  • 65 red teamers working on final model candidates and assessing comparative performance
  • Model access via Advanced Voice Mode within the iOS app for a real user experience; conversations reviewed and tagged via an internal tool
  • Checkpoint accepted audio and video prompts and produced audio generations
  • Multi-turn conversations in real time

Red teamers were asked to carry out exploratory capability discovery, assess novel potential risks posed by the model, and stress test mitigations as they were developed and improved, specifically those introduced by audio input and generation (speech-to-speech capabilities). This red teaming effort builds upon prior work, including as described in the GPT‑4 System Card 6 and the GPT‑4(V) System Card 7.

Red teamers covered categories that spanned violative and disallowed content (illegal erotic content, violence, self-harm, etc.), mis/disinformation, bias, ungrounded inferences, sensitive trait attribution, private information, geolocation, person identification, emotional perception and anthropomorphism risks, fraudulent behavior and impersonation, copyright, natural science capabilities, and multilingual observations.

The data generated by red teamers motivated the creation of several quantitative evaluations that are described in the Observed Safety Challenges, Evaluations and Mitigations section. In some cases, insights from red teaming were used to do targeted synthetic data generation. Models were evaluated using both autograders and manual labeling in accordance with set criteria (e.g., violation of policy or not, refused or not). In addition, we sometimes re-purposed C the red teaming data to run targeted assessments on a variety of voices and examples to test the robustness of various mitigations.

Evaluation methodology

In addition to the data from red teaming, a range of existing evaluation datasets were converted to evaluations for speech-to-speech models using text-to-speech (TTS) systems such as Voice Engine. We converted text-based evaluation tasks to audio-based evaluation tasks by converting the text inputs to audio. This allowed us to reuse existing datasets and tooling around measuring model capability, safety behavior, and monitoring of model outputs, greatly expanding our set of usable evaluations.

We used Voice Engine to convert text inputs to audio, feed them to GPT‑4o, and score the model’s outputs. We always score only the textual content of the model output, except in cases where the audio needs to be evaluated directly (see Voice Generation).
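
As a rough sketch of that pipeline (the callables tts_synthesize, speech_model, and grade are placeholders for the TTS system, model endpoint, and autograder, not actual OpenAI APIs):

```python
# Illustrative sketch of converting a text eval into a speech-to-speech
# eval. All three callables are placeholder assumptions.
def run_audio_eval(text_examples, tts_synthesize, speech_model, grade):
    scores = []
    for ex in text_examples:
        audio_prompt = tts_synthesize(ex["prompt"])           # text -> audio
        audio_reply, transcript = speech_model(audio_prompt)  # model answers in audio
        # Per the methodology above, only the text transcript is scored,
        # unless the audio itself must be judged (see Voice Generation).
        scores.append(grade(transcript, ex["target"]))
    return sum(scores) / len(scores)
```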


Limitations of the evaluation methodology

First, the validity of this evaluation format depends on the capability and reliability of the TTS model. Certain text inputs are unsuitable or awkward to convert to audio; for instance, mathematical equations or code. Additionally, we expect TTS to be lossy for certain text inputs, such as text that makes heavy use of white space or symbols for visual formatting. Since we expect that such inputs are also unlikely to be provided by the user over Advanced Voice Mode, we either avoid evaluating the speech-to-speech model on such tasks or pre-process examples with such inputs. Nevertheless, we highlight that any mistakes identified in our evaluations may arise either from model capability or from the failure of the TTS model to accurately translate text inputs to audio.
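
One way to implement the "avoid or pre-process" choice is a simple suitability filter over candidate inputs. The patterns and threshold below are invented for illustration, not the filter actually used:

```python
import re

# Heuristic pre-filter: skip eval examples whose text would not survive
# TTS conversion faithfully. Patterns and threshold are illustrative guesses.
def tts_suitable(text: str, max_symbol_ratio: float = 0.15) -> bool:
    # Code fences and LaTeX-style math rarely read aloud sensibly.
    if re.search(r"```|\\begin\{|\\frac|[=<>^_|]{2,}", text):
        return False
    # Texts dominated by symbols or layout whitespace are likely lossy.
    symbols = sum(not (c.isalnum() or c.isspace() or c in ".,;:!?'\"-")
                  for c in text)
    return symbols / max(len(text), 1) <= max_symbol_ratio
```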

A second concern may be whether the TTS inputs are representative of the distribution of audio inputs that users are likely to provide in actual usage. We evaluate the robustness of GPT‑4o on audio inputs across a range of regional accents in Disparate Performance on Voice Inputs. However, there remain many other dimensions that may not be captured in a TTS-based evaluation, such as different voice intonations and valence, background noise, or cross-talk, that could lead to different model behavior in practical usage.

Lastly, there may be artifacts or properties in the model’s generated audio that are not captured in text; for example, background noises and sound effects, or responding with an out-of-distribution voice. In the Voice Generation section, we illustrate the use of auxiliary classifiers to identify undesirable audio generations, which can be used in conjunction with scoring transcripts.

Observed safety challenges, evaluations & mitigations

Potential risks with the model were mitigated using a combination of methods. We trained the model to adhere to behavior that would reduce risk via post-training methods and also integrated classifiers for blocking specific generations as a part of the deployed system.

For the observed safety challenges outlined below, we provide a description of the risk, the mitigations applied, and results of relevant evaluations where applicable. The risks outlined below are illustrative and non-exhaustive, and are focused on the experience in the ChatGPT interface. In this section, we focus on the risks that are introduced by speech-to-speech capabilities and how they may interact with pre-existing modalities (text, image) D.

Risk: Unauthorized voice generation
Mitigations: In all of our post-training audio data, we supervise ideal completions using the voice sample in the system message as the base voice. We only allow the model to use certain pre-selected voices and use an output classifier to detect if the model deviates from that.

Risk: Speaker identification
Mitigations: We post-trained GPT-4o to refuse to comply with requests to identify someone based on a voice in an audio input, while still complying with requests to identify people associated with famous quotes.

Risk: Generating copyrighted content
Mitigations: We trained GPT-4o to refuse requests for copyrighted content, including audio, consistent with our broader practices. To account for GPT-4o’s audio modality, we also updated certain text-based filters to work on audio conversations, built filters to detect and block outputs containing music, and, for our limited alpha of ChatGPT’s Advanced Voice Mode, instructed the model not to sing at all.

Risk: Ungrounded inference / sensitive trait attribution
Mitigations: We post-trained GPT-4o to refuse requests for ungrounded inference, such as “how intelligent is this speaker?”. We also post-trained GPT-4o to safely comply with requests for sensitive trait attribution by hedging answers, such as “what is this speaker’s accent?” → “Based on the audio, they sound like they have a British accent.”

Risk: Disallowed content in audio output
Mitigations: We run our existing moderation classifier over text transcriptions of audio prompts and generations, and block the output for certain high-severity categories.

Risk: Erotic and violent speech output
Mitigations: We run our existing moderation classifier over text transcriptions of audio prompts, and block the output if the prompt contains erotic or violent language.

Unauthorized voice generation

Risk Description: Voice generation is the capability to create audio with a human-sounding synthetic voice, and includes generating voices based on a short input clip.

In adversarial situations, this capability could facilitate harms such as an increase in fraud due to impersonation, and may be harnessed to spread false information 9, 10 (for example, if we allowed users to upload an audio clip of a given speaker and ask GPT‑4o to produce a speech in that speaker’s voice). These are very similar to the risks we identified with Voice Engine 8.

Voice generation can also occur in non-adversarial situations, such as our use of that ability to generate voices for ChatGPT’s Advanced Voice Mode. During testing, we also observed rare instances where the model would unintentionally generate an output emulating the user’s voice E.

Risk Mitigation: We addressed voice generation-related risks by allowing only the preset voices we created in collaboration with voice actors 11 to be used. We did this by including the selected voices as ideal completions while post-training the audio model. Additionally, we built a standalone output classifier to detect if the GPT‑4o output is using a voice that’s different from our approved list. We run this in a streaming fashion during audio generation and block the output if the speaker doesn’t match the chosen preset voice.
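
A minimal sketch of what such a streaming check could look like. The generator, classifier, and threshold below are assumptions for illustration, not the production system:

```python
# Sketch of a streaming voice guard: classify each audio chunk as it is
# generated and stop the stream on a mismatch with the preset voice.
# generate_audio_chunks and voice_classifier are placeholder callables.
def stream_with_voice_guard(generate_audio_chunks, voice_classifier,
                            preset_voice, threshold=0.5):
    for chunk in generate_audio_chunks():
        match_score = voice_classifier(chunk, target=preset_voice)
        if match_score < threshold:
            # Speaker deviated from the approved preset: block the rest
            # of the output and end the turn.
            break
        yield chunk  # safe to play this chunk
```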

Evaluation: We find that the residual risk of unauthorized voice generation is minimal. Our system currently catches 100% of meaningful deviations from the system voice F based on our internal evaluations, which include samples generated by other system voices, clips during which the model used a voice from the prompt as part of its completion, and an assortment of human samples.

While unintentional voice generation still exists as a weakness of the model, we use the secondary classifiers to ensure the conversation is discontinued if this occurs, making the risk of unintentional voice generation minimal. Finally, our moderation behavior may result in over-refusals when the conversation is not in English, which is an active area of improvement G.

Our voice output classifier performance over a conversation, by language H:

Language        Precision   Recall
English         0.96        1.0
Non-English     0.95        1.0
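
For reference, the precision and recall above are the standard classifier metrics over flagged deviations; a toy computation with scikit-learn (the labels here are invented, not the evaluation data):

```python
from sklearn.metrics import precision_score, recall_score

# y_true marks real deviations from the system voice; y_pred marks the
# classifier's flags. Toy labels purely to show the metric definitions.
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 1]
print(precision_score(y_true, y_pred))  # share of flags that were real deviations
print(recall_score(y_true, y_pred))     # share of real deviations that were flagged
```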

Speaker identification

Risk Description: Speaker identification is the ability to identify a speaker based on input audio. This presents a potential privacy risk, particularly for private individuals as well as for obscure audio of public individuals, along with potential surveillance risks.

Risk Mitigation: We post-trained GPT‑4o to refuse to comply with requests to identify someone based on a voice in an audio input. GPT‑4o still complies with requests to identify famous quotes. For example, a request to identify a random person saying “four score and seven years ago” should identify the speaker as Abraham Lincoln, while a request to identify a celebrity saying a random sentence should be refused.

Evaluations: Compared to our initial model, we saw a 14-point improvement on cases where the model should refuse to identify a voice in an audio input, and a 12-point improvement on cases where it should comply with that request.

The former means the model will almost always correctly refuse to identify a speaker based on their voice, mitigating the potential privacy issue. The latter means there may be situations in which the model incorrectly refuses to identify the speaker of a famous quote.

                 GPT-4o-early   GPT-4o-deployed
should_refuse    0.83           0.98
should_comply    0.70           0.83

Disparate performance on voice inputs

Risk Description: Models may perform differently for users speaking with different accents. Disparate performance can lead to a difference in quality of service for different users of the model.

Risk Mitigation: We post-trained GPT‑4o with a diverse set of input voices so that model performance and behavior are invariant across different user voices.

Evaluations: We run evaluations on GPT‑4o Advanced Voice Mode using a fixed assistant voice (“shimmer”) and Voice Engine to generate user inputs across a range of voice samples I. We use two sets of voice samples for TTS:

  • Official system voices (3 different voices)
  • A diverse set of voices collected from two data campaigns. This comprises 27 different English voice samples from speakers from a wide range of countries, and a mix of genders.

We evaluate on two sets of tasks: Capabilities and Safety Behavior.

Capabilities: We evaluate on four J tasks: TriviaQA, a subset of MMLU K, HellaSwag, and LAMBADA. TriviaQA and MMLU are knowledge-centric tasks, while HellaSwag and LAMBADA are common-sense or text-continuation tasks. Overall, we find that the model performs marginally, but not significantly, worse on the diverse set of human voices than on system voices across all four tasks.
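
In outline, this comparison crosses each task example with each user voice and averages accuracy within each voice group. The helpers below (render_audio, model_answer, is_correct) are hypothetical, not the actual harness:

```python
# Sketch of the disparate-performance comparison: answer every task
# example with every user voice, then average within each voice group.
def accuracy_by_voice_group(examples, voice_groups, render_audio,
                            model_answer, is_correct):
    results = {}
    for group, voices in voice_groups.items():  # e.g. "system" vs "human"
        hits, total = 0, 0
        for ex in examples:
            for voice in voices:
                audio_prompt = render_audio(ex["question"], voice)  # TTS user turn
                hits += is_correct(model_answer(audio_prompt), ex["answer"])
                total += 1
        results[group] = hits / total
    return results
```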

Figure: A grouped bar chart comparing capability evaluation scores across system and human voices, indicating similar performance levels regardless of voice type.

Safety Behavior: We evaluate on an internal dataset of conversations, measuring the consistency of the model’s adherence and refusal behavior across different user voices. Overall, we do not find that the model behavior varies across different voices.

Figure: A grouped bar chart comparing safety evaluation scores across system and human voices, showing consistent ratings across different voice types.

Ungrounded Inference / Sensitive Trait Attribution

Risk Description: Audio input can lead to the model making potentially biased inferences about speakers. We define two categories:

  • Ungrounded inference (UGI): making inferences about a speaker that couldn’t be determined solely from audio content. This includes inferences about things such as a speaker’s race, socio-economic status/occupation, religious beliefs, personality traits, political attributes, intelligence, appearance (e.g., eye color, attractiveness), gender identity, sexual preference, or criminal history. This can lead to both allocative and representational harms 13, 15 depending on how such behavior manifests.
  • Sensitive trait attribution (STA): making inferences about a speaker that could plausibly be determined solely from audio content. This includes inferences about things such as a speaker’s accent or nationality. Potential harms from STA include an increase in risks from surveillance 16 and a difference in quality of service for speakers with different voice attributes 12, 13, 14.

Risk Mitigation: We post-trained GPT‑4o to refuse to comply with UGI requests, while hedging answers to STA questions. For example, a question to identify a speaker’s level of intelligence will be refused, while a question to identify a speaker’s accent will be met with an answer such as “Based on the audio, they sound like they have a British accent.”

Evaluation: Compared to our initial model, we saw a 24 point improvement in the model correctly responding to requests to identify sensitive traits (i.e., refusing UGI and safely complying with STA).

            GPT-4o-early   GPT-4o-deployed
Accuracy    0.60           0.84

Violative & disallowed content

Risk Description: GPT‑4o may be prompted to output harmful content through audio that would be disallowed through text, such as audio speech output that gives instructions on how to carry out an illegal activity.

Risk Mitigation: We found high text-to-audio transference of refusals for previously disallowed content. This means that the post-training we’ve done to reduce the potential for harm in GPT‑4o’s text output successfully carried over to audio output.

Additionally, we run our existing moderation model over a text transcription of both the audio input and the audio output to detect if either contains potentially harmful language, and will block a generation if so L.
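
A simplified sketch of that transcript-level guard. Here transcribe and moderate are placeholder callables, and HIGH_SEVERITY is an illustrative set of category names, not the exact production list:

```python
# Sketch of moderating an audio turn via its transcripts.
HIGH_SEVERITY = {"sexual/minors", "violence/graphic", "self-harm/instructions"}

def audio_turn_allowed(audio_in, audio_out, transcribe, moderate):
    for audio in (audio_in, audio_out):
        flags = moderate(transcribe(audio))  # maps category -> bool
        if any(flags.get(category, False) for category in HIGH_SEVERITY):
            return False  # block this generation
    return True
```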

Evaluation: We used TTS to convert existing text safety evaluations to audio. We then evaluate the text transcript of the audio output with the standard text rule-based classifier. Our evaluations show strong text-audio transfer for refusals on pre-existing content policy areas. Further evaluations can be found in Appendix A.

                  Text   Audio
Not unsafe        0.99   1.0
Not over-refuse


Source: OpenAI News (openai.com)