Hey Everyone,

OpenAI's announcement of Sora, a text-to-video model, has surprised a lot of people. Seeing is believing in this case:

Read Technical Report

Sora is a text-to-video model for generating video with generative AI. It appears to perform well ahead of Runway or Pika Labs.

Creating Video from Text

OpenAI says that Sora is an AI model that can create realistic and imaginative scenes from text instructions.

“a red panda and a toucan are best friends taking a stroll through santorini during the blue hour”

Video generated by Sora.

I need more time to evaluate Sora, and the commentary around it, before giving it full coverage.

Our European writer has some interesting thoughts on the future of Sora and text-to-video in the post below.

Sora: One Step Away from the Matrix? By Tobias (Futuristic Lawyer)

View more videos on the Twitter/X profile of one of its creators.

More Articles from the Author

Tobias (Futuristic Lawyer) thinks about some of the biggest questions of the internet and digital age:

A Sensible Approach to Regulating AI in Copyright Law – The Input Phase 

OpenAI’s Mounting Legal & Ethical Problems

Why Amazon Was Hit with a GDPR Fine of €32 Million by the French Data Authority, CNIL

The NYT Lawsuit Against Microsoft and OpenAI Could Signal GenAI’s Napster Moment

Biggest Global Risk in 2024 is Misinformation

If you enjoy these articles, subscribe to their work.

For access to our archives and more deep dives, consider going premium.

Subscribe now

By Tobias (Futuristic Lawyer), February 2024.

Sora Unlocks the Dystopian Allure of OpenAI’s Mission of AGI

I. 

OpenAI is a very unusual company, building towards the quasi-religious ideal of Artificial General Intelligence (AGI) with great commercial success and legal and ethical challenges along its path. If anything, OpenAI understands how to capture the imaginations of its tens of millions of users with products and narratives that are equally impressive and terrifying. Several rumors and news stories have circulated in February.

On February 7, The Information reported that OpenAI was working on a form of agent software to automate complex tasks by effectively taking over the customer’s device (perhaps a bit similar to Rabbit’s R1).   

On February 8, the Wall Street Journal reported that OpenAI chief Sam Altman was seeking up to $7 trillion from investors, including the United Arab Emirates government, for a new project in the semiconductor industry that would boost capacity for AI-chip making.

On February 13, OpenAI announced that it would roll out a new memory feature for ChatGPT to a small group of test users. The feature enables ChatGPT to remember conversations across chats and lets the user explicitly tell ChatGPT to remember, forget, or recall something. Naturally, it raises new privacy and safety considerations as well.

On February 14, The Information reported that OpenAI is building a new product to challenge Google’s dominance in search and compete with other AI-driven search engines such as Perplexity.

Then, on February 15, OpenAI announced the release of a new text-to-video model, Sora, which can create realistic-looking synthetic videos of up to 60 seconds. Besides text, Sora can also be prompted with pre-existing images or videos. For now, Sora is released in a closed beta to red teamers for safety testing and to a selected group of visual artists, designers, and filmmakers for feedback on its creative abilities. There is currently no waitlist or information about if and when Sora will be publicly released.

However, that didn’t stop Sora from starting a hype train on social media, completely stealing the show from Google’s new Gemini 1.5, which was released a few hours prior. Gemini 1.5 is otherwise a breakthrough model with GPT-4-equivalent capabilities and a context window of up to 1 million tokens.

The Sora hype was carried by shockingly realistic video examples of Sora’s state-of-the-art capabilities in video generation, as seen in the official teaser on YouTube below.

Sam Altman performed a good deal of marketing wizardry by inviting users on X to suggest prompts from which Sora would generate videos. For example, one user requested “A instructional cooking session for homemade gnocchi hosted by a grandmother social media influencer set in a rustic Tuscan country kitchen with cinematic lighting”.  Here’s what Sora came up with.  

In line with previous marketing efforts, OpenAI discreetly claims in the last paragraph of Sora’s announcement post that the new model is a foundation for “models that can understand and simulate the real world” and an important milestone towards AGI.

These lines might be easy to overlook, but they are in my view significant. Particularly, the word “understand” is worth discussing.

II.

If someone asks me if AI will be more intelligent than humans soon, or if AI will take over the majority of human jobs, my go-to answer is “no, because AIs have no real understanding of the world.” GPT-4 and Gemini Pro do not understand the world, not any more than a water tap understands water or a PlayStation understands FIFA. Understanding the world presumes an underlying consciousness that is actually capable of understanding. But I could be wrong.

Jim Fan, Senior Research Scientist at NVIDIA, argues that Sora is not just a creative tool like OpenAI’s image model DALL-E, but a “data-driven physics engine”. From Fan’s follow-up comment on LinkedIn:

Sora is an end-to-end, diffusion transformer model. It inputs text/image and outputs video pixels directly. Sora learns a physics engine implicitly in the neural parameters by gradient descent through massive amounts of videos. Sora is a learnable simulator, or “world model”.

OpenAI would concur. The sub-title of Sora’s technical report describes video generation models as “world simulators”.

OpenAI claims that Sora demonstrates some emergent capabilities, meaning capabilities that are not built into the model but arise spontaneously:

3D consistency: “Sora can generate videos with dynamic camera motion. As the camera shifts and rotates, people and scene elements move consistently through three-dimensional space.”

Long-range coherence and object permanence: “(..) our model can persist people, animals and objects even when they are occluded or leave the frame. Likewise, it can generate multiple shots of the same character in a single sample, maintaining their appearance throughout the video.”

Interacting with the world: “Sora can sometimes simulate actions that affect the state of the world in simple ways. For example, a painter can leave new strokes along a canvas that persist over time, or a man can eat a burger and leave bite marks.”

Simulating digital worlds: “Sora can simultaneously control the player in Minecraft with a basic policy while also rendering the world and its dynamics in high fidelity. These capabilities can be elicited zero-shot by prompting Sora with captions mentioning ‘Minecraft.’”

By looking at a gigantic corpus of text, LLMs learn to see connections and patterns in text and thus learn to understand language. A diffusion transformer model such as Sora learns to understand physics and how objects move in the physical world by looking at a huge corpus of video footage. Although OpenAI, in keeping with tradition, has not disclosed anything about Sora’s training data, Sora’s proposed status as a “data-driven physics engine” or a “general purpose world simulator” derives primarily from the massive amount of video footage it was trained on.
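To make that idea concrete, here is a minimal, hypothetical sketch in PyTorch of the diffusion-transformer recipe Fan describes: project noisy spacetime patches of video into a transformer, condition on a text embedding, and train the model to predict the noise. Nothing below is OpenAI’s actual code; every module name and dimension is an illustrative assumption, and a real system would add details such as timestep conditioning and a latent video codec.

```python
# A minimal, hypothetical sketch of a text-conditioned video diffusion
# transformer -- NOT OpenAI's code; all names and sizes are illustrative.
import torch
import torch.nn as nn

class VideoDiffusionTransformer(nn.Module):
    def __init__(self, patch_dim=1024, embed_dim=1024, depth=12,
                 heads=16, text_dim=768):
        super().__init__()
        # Flattened spacetime patches (small t x h x w crops of the video)
        # are projected into the transformer's embedding space.
        self.patch_proj = nn.Linear(patch_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=heads,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        # The model is trained to predict the noise added to each patch.
        self.noise_head = nn.Linear(embed_dim, patch_dim)

    def forward(self, noisy_patches, text_emb):
        # noisy_patches: (batch, num_patches, patch_dim) -- noised video patches
        # text_emb:      (batch, num_tokens, text_dim)   -- caption embedding
        x = torch.cat([self.text_proj(text_emb),
                       self.patch_proj(noisy_patches)], dim=1)
        x = self.blocks(x)
        # Keep only the video-patch positions and predict their noise.
        return self.noise_head(x[:, text_emb.shape[1]:, :])

# Schematic training step: corrupt real video patches with noise, ask the
# model to predict that noise, and update the weights by gradient descent.
model = VideoDiffusionTransformer()
patches = torch.randn(2, 256, 1024)   # stand-in for real video patches
noise = torch.randn_like(patches)
text = torch.randn(2, 16, 768)        # stand-in for a caption embedding
pred = model(patches + noise, text)
loss = nn.functional.mse_loss(pred, noise)
loss.backward()
```

The point of the sketch is the last three lines: whatever “physics” such a model ends up encoding is whatever helps it minimize that denoising loss across massive amounts of video, which is what Fan means by a physics engine learned implicitly in the neural parameters by gradient descent.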

Again, we are back to the meaning of the word “understanding”. Sora’s ability to “understand” spatial dimensions is flawed, as we can see from the clip below, which was generated with the prompt: “A beautiful homemade video showing the people of Lagos, Nigeria in the year 2056. Shot with a mobile phone camera.”

Gary Marcus has pointed out that Sora exhibits surreal physics and does not seem to understand, for example, the structure of a chess board, or the fact that ants have six legs.

On the other hand, OpenAI’s AGI rhetoric makes a lot of sense. Because, and I am sorry to ask, if Sora is not a step closer to some greater artificial intelligence, then what is the point, even? To empower artists? Do we need more synthetic content flooding the web? Do we really need to speed up AI’s generative capabilities further?

As of now, it’s hard enough to deal with the many distasteful impersonation scams, misinformation campaigns, and plain spam that are accumulating and poisoning the web. Additionally, after the NY Times lawsuit, more people are starting to question the industry practice of appropriating private data and copyrighted content to train foundation models. The $100 billion valuation that OpenAI is eyeing would probably be hard to maintain without the religious AGI undertones of its mission.

Editor’s Note

“a scuba diver discovers a hidden futuristic shipwreck, with cybernetic marine life and advanced alien technology”

Video generated by Sora.

OpenAI is going head-to-head for hype with Google DeepMind, and winning

It appears OpenAI’s announcement of Sora was timed to obscure the hype for Gemma, Google Gemini’s open-source SLM. The biggest talking points of the last three days were Sora, Google Gemini’s ahistorical non-white depictions of American history (which became a meme on X), and, to a much lesser extent, the attributes of Gemma, which, along with Mistral’s models and Meta’s Llama SLMs, really opens the doors to more innovation.

Field Notes

Sora can generate 1080p movie-like scenes with multiple characters, different types of motion and background details, OpenAI claims.

“Sora has a deep understanding of language, enabling it to accurately interpret prompts and generate compelling characters that express vibrant emotions,” OpenAI writes in a blog post.

The model can also generate a video based on a still image, as well as fill in missing frames on an existing video or extend it.

Sora is currently only available to “red teamers” who are assessing the model for potential harms and risks.

Sora can be a bit hazy on cause and effect: “a person might take a bite out of a cookie, but afterward, the cookie may not have a bite mark”.

As with its other AI products, OpenAI will have to contend with the consequences of fake, photorealistic AI videos being mistaken for the real thing. It has been confirmed that OpenAI’s Sora will come to Microsoft’s Copilot suite as well.

It’s not clear how OpenAI will address Sora’s deepfake risks at scale, which could rather easily pollute channels like TikTok, YouTube, and YouTube Shorts.

OpenAI has also now invited a select number of visual artists, designers, and filmmakers to test the video generation capabilities and provide feedback.

Thanks for reading!

If you want to support our work and get access to our best deep dives, consider subscribing.

Subscribe now

Read More in AI Supremacy