An expert in AI video generation discusses the technology's rapid advances and its current limitations.
This presidential cycle has already seen several high-profile examples of people using deepfakes to try to influence voters. Deepfakes are images, audio recordings, or videos generated or modified using artificial intelligence (AI) models to depict real or fictional people. Recent deepfake examples include manipulated audio of Joe Biden urging voters to stay home during primaries and fabricated images of Taylor Swift endorsing Donald Trump.
It appears generative artificial intelligence is an increasingly prominent tool in the misinformation toolbox. Should voters be concerned about being bombarded with phony videos of politicians created with generative AI? An expert in computer vision and deep learning at the University of Rochester says that while the technology is rapidly advancing, deepfake video generation remains harder for bad actors to leverage due to its complex nature.
While OpenAI's products, including ChatGPT for text generation and DALL-E 3 for image generation, are taking off in popularity, the company has yet to release an equivalent for video generation. According to Chenliang Xu, an associate professor of computer science at Rochester, the company has released previews of its Sora video generation software, but the product is still undergoing testing and refinement.
"Generating video using AI is still an ongoing research topic and a hard problem because it's what we call multimodal content," says Xu. "Generating moving videos along with corresponding audio are difficult problems on their own, and aligning them is even harder."
Xu says that his research group was among the first to use artificial neural networks to generate multimodal video in 2017. They started with tasks like providing an image of a violin player and audio of a violin to generate a moving video of a violin player. From there, they moved on to problems like generating lip movements, and then to creating full talking faces complete with head gestures from a single image.
"Now, we can generate real-time, fully drivable heads and even turn the heads into various styles specified by language descriptions," says Xu.
Challenges with deepfake detection technology
Xu's team has also developed technology for deepfake detection. He calls it an area that needs extensive further research, noting that it's easier to build technology to generate deepfakes than to detect them because of the training data needed to build generalized deepfake detection models.
"If you want to build a technology that's able to detect deepfakes, you need to create a database that identifies what are fake images and what are real images," says Xu. "That labeling requires an additional layer of human involvement that generation does not."
Another concern, he adds, is making a detector that is generalizable to different types of deepfake generators. "You can make a model that performs well against the techniques you know about, but if someone uses a different model, your detection algorithm will have a hard time capturing that," he says.
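The generalization problem Xu describes can be illustrated with a toy sketch in Python. It is not Xu's method or any real detector: the single "smoothness score" feature, the numbers, and the names `fit_threshold`, `fakes_known`, and `fakes_unseen` are all hypothetical, chosen only to show how a detector fit to one generator's labeled output can miss fakes from another.

```python
# Toy illustration only. The "smoothness score" feature, all values, and the
# two generators are hypothetical; real detectors learn far richer features.

def fit_threshold(samples):
    """Choose the cut-off that classifies the most labeled samples correctly.

    samples: list of (feature_value, is_fake) pairs, labeled by people.
    A sample is predicted fake when feature_value >= threshold.
    """
    candidates = sorted(value for value, _ in samples)
    best_threshold, best_correct = candidates[0], -1
    for threshold in candidates:
        correct = sum((value >= threshold) == is_fake for value, is_fake in samples)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def accuracy(samples, threshold):
    """Fraction of samples whose predicted label matches the human label."""
    correct = sum((value >= threshold) == is_fake for value, is_fake in samples)
    return correct / len(samples)


# Human-labeled examples: (smoothness score, is_fake).
real = [(0.20, False), (0.25, False), (0.30, False), (0.35, False)]
fakes_known = [(0.75, True), (0.80, True), (0.85, True), (0.90, True)]   # generator seen in training
fakes_unseen = [(0.22, True), (0.28, True), (0.31, True), (0.33, True)]  # different generator, less smooth

# Fit the detector only on real footage plus fakes from the known generator.
threshold = fit_threshold(real + fakes_known)

acc_known = accuracy(real + fakes_known, threshold)
acc_unseen = accuracy(real + fakes_unseen, threshold)

print(f"threshold={threshold}, known generator: {acc_known:.2f}, unseen generator: {acc_unseen:.2f}")
```

In this made-up data, the cut-off learned from the known generator separates its fakes from real footage perfectly, yet every fake from the unseen generator falls below it and passes as real, so accuracy drops to the level of a coin flip.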
The easiest targets for video deepfakes
Having access to good training data is crucial for creating effective generative AI models. As a result, Xu says politicians and celebrities will be the earliest and easiest targets when video generators become widely available.
"Politicians and celebrities are easier to generate than normal people because there is simply more data about them," says Xu. "Because so much video of them already exists, these models can use it to learn the expressions they show in different situations, along with their voices, their hair, movements, and emotions."
But he expects that, at least initially, the training data behind "celeb deepfakes" in particular may make them easier to spot.
"If you used only high-quality photographs to train a model, it will produce similar results," says Xu. "It may result in an overly smooth style that you can pick out as a cue to tell it's a deepfake."
Other cues can include how natural a person's reaction seems, whether they can move their head, and even the number of teeth shown. But image generators have overcome similar early tells, such as creating hands with six fingers, and Xu says enough training data can mitigate these limitations.
He calls on the research community to invest more effort into developing deepfake detection strategies and grappling with the ethical concerns surrounding the development of these technologies.
"Generative models are a tool that in the hands of good people can do good things, but in the hands of bad people can do bad things," says Xu. "The technology itself isn't good or bad, but we need to discuss how to prevent these powerful tools from ending up in the wrong hands and used maliciously."