OpenAI, the developer of ChatGPT, has quietly unveiled Sora, a text-to-video model. Sora can create videos of up to 60 seconds featuring highly detailed scenes, complex camera motion, and multiple characters expressing vibrant emotions.
OpenAI stated that Sora is still undergoing red-teaming to ensure it doesn’t generate inappropriate or harmful content. Additionally, the company has granted access to select “visual artists, designers, and filmmakers” to gather feedback on making the model as helpful as possible for creative professionals.
Can generate complex scenes
OpenAI’s text-to-video model can generate complex scenes with multiple characters, specific types of motion, and accurate details of the subject and background. The model understands what the user has asked for in the prompt and how those things exist in the physical world.
Deep understanding of language
The model has a deep understanding of language, enabling it to accurately interpret prompts and generate compelling characters that express vibrant emotions. Sora can also create multiple shots within a single generated video that accurately portray the characters and visual style.
It’s not 100% perfect though
OpenAI reports that Sora may struggle to accurately simulate complex physics and understand specific cause-and-effect scenarios.
For example, a person might take a bite out of a cookie, but afterward, the cookie may not have a bite mark.
Moreover, the model may confuse the spatial details of a prompt, such as mixing up left and right, and may struggle to follow precise descriptions of events that unfold over time, such as a specific camera trajectory.
OpenAI partnering with red teamers
OpenAI is taking several safety measures before making Sora available in its products. The company has partnered with red teamers – domain experts in misinformation, hateful content, and bias – who are adversarially testing the model to ensure that Sora is safe and reliable for use in OpenAI’s products.
The company is also building tools to detect misleading content, including a detection classifier that can recognise videos generated by Sora. If the model is implemented in an OpenAI product in the future, they plan to include C2PA metadata.
The team is using the existing safety methods built for products that use DALL·E 3 to prepare for Sora’s deployment. Their text and image classifiers reject content that violates usage policies, such as extreme violence, sexual content, hateful imagery, and celebrity likeness.
The company plans to engage with policymakers, educators, and artists around the world to understand their concerns and identify positive use cases for their new technology.
Despite extensive research and testing, the company acknowledges that it is impossible to predict how people will use and misuse the technology. Therefore, they believe that learning from real-world usage is a critical component in creating and releasing increasingly safe AI systems over time.
Research techniques
Sora is a diffusion model that generates a video by “starting off with one that looks like static noise and gradually transforming it by removing the noise over many steps.”
OpenAI says Sora can generate entire videos all at once or extend generated videos to make them longer.
“By giving the model foresight of many frames at a time, we’ve solved a challenging problem of making sure a subject stays the same even when it goes out of view temporarily,” says the company.
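To make the mechanics concrete, here is a minimal sketch of that denoising loop in Python. Everything in it (the `denoiser` callable, the step count, the uniform noise schedule) is a hypothetical simplification for illustration, not Sora’s actual implementation.

```python
import numpy as np

def generate_video(denoiser, steps=50, shape=(16, 64, 64, 3)):
    """Toy reverse-diffusion loop: start from pure static noise and
    strip a little of it away at each step. `denoiser` stands in for
    the trained model that predicts the noise left in the sample."""
    x = np.random.randn(*shape)           # (frames, height, width, channels) of noise
    for t in reversed(range(steps)):      # walk the noise schedule backwards
        predicted_noise = denoiser(x, t)  # model's estimate of the remaining noise
        x = x - predicted_noise / steps   # remove a fraction of that noise
    return x                              # the denoised sample: a (toy) video

# A dummy denoiser just to show the call shape; a real one is a neural network.
video = generate_video(lambda x, t: 0.1 * x)
```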
Like GPT models, Sora utilises a transformer architecture, enabling superior scalability.
Videos and images are represented as collections of smaller data units — patches, similar to tokens in GPT.
This unified data representation facilitates the training of diffusion transformers on visual data, including varying durations, resolutions, and aspect ratios.
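As a rough illustration of this patch-based representation (the shapes and patch sizes below are arbitrary assumptions, not OpenAI’s actual values), a video tensor can be cut into fixed-size spacetime patches and flattened into a token-like sequence:

```python
import numpy as np

def video_to_patches(video, pt=2, ph=8, pw=8):
    """Cut a video of shape (T, H, W, C) into spacetime patches of
    shape (pt, ph, pw, C) and flatten each one into a vector, giving
    a token-like sequence a transformer can consume."""
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    patches = (video
               .reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)  # split each axis
               .transpose(0, 2, 4, 1, 3, 5, 6)                    # group grid dims first
               .reshape(-1, pt * ph * pw * C))                     # flatten each patch
    return patches  # (num_patches, patch_dim)

video = np.random.randn(16, 64, 64, 3)  # 16 frames of 64x64 RGB
print(video_to_patches(video).shape)    # (512, 384)
```

Because each patch is just a vector in a sequence, videos of different durations, resolutions, and aspect ratios simply produce sequences of different lengths, which is what makes the unified representation convenient for training.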
Sora is an AI model that builds on past research in DALL·E and GPT models. It incorporates the recaptioning technique from DALL·E 3, which generates highly descriptive captions for visual training data. This technique allows Sora to create videos that more accurately follow the user’s text instructions.
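A hedged sketch of what such a recaptioning pass might look like; the `captioner` model and the dataset format here are hypothetical stand-ins, not the actual DALL·E 3 pipeline:

```python
def recaption_dataset(dataset, captioner):
    """Replace each clip's terse human label with a richly descriptive
    machine-written caption before training, in the spirit of DALL·E 3's
    recaptioning technique. `captioner` is a hypothetical model that
    turns a video into a detailed text description."""
    return [(video, captioner(video)) for video, _label in dataset]

# Example: swap the terse label "dog" for a verbose, prompt-like caption.
pairs = recaption_dataset([("clip_001", "dog")],
                          lambda v: f"A detailed, descriptive caption for {v}.")
```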
In addition to being able to generate a video solely from text instructions, the model can also create a video from a static image by animating its contents with great precision and attention to detail.
The model can also take an existing video and extend it or fill in missing frames.
OpenAI believes Sora can serve as a foundation for models that understand and simulate the real world, a capability it considers an important milestone toward AGI.