Generating videos from whole cloth isn't anything new for neural networks, the layers of mathematical functions modeled loosely after biological neurons; just last week, researchers described a machine learning system capable of hallucinating clips from start and end frames alone. But because of the inherent randomness, complexity, and information density of video, modeling realistic clips at scale remains something of an AI grand challenge.
A team of scientists at Google Research says it has made progress, though, with novel networks that are able to produce "diverse" and "surprisingly realistic" frames from open-source video data sets at scale. The team describes its method in a newly published paper on the preprint server arXiv.org ("Scaling Autoregressive Video Models") and on a webpage containing selected samples of the models' outputs.
"[We] find that our [AI] models are capable of producing diverse and surprisingly realistic continuations on a subset of videos from Kinetics, a large scale action recognition data set of … videos exhibiting phenomena such as camera movement, complex object interactions and diverse human movement," wrote the coauthors. "To our knowledge, this is the first promising application of video-generation models to videos of this complexity."

Above: Videos generated by models trained on the Kinetics data set.
Image Credit: Google
The researchers' systems are autoregressive, meaning that they generate videos pixel by pixel, and they're built upon a generalization of Transformers, a type of neural architecture introduced in a 2017 paper ("Attention Is All You Need") coauthored by scientists at Google Brain, Google's AI research division. As with all deep neural networks, Transformers contain neurons (functions) that transmit "signals" from input data and slowly adjust the synaptic strength, or weights, of each connection. (That's how the models extract features and learn to make predictions.)
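To make "autoregressive, pixel by pixel" concrete, here is a minimal sketch of the general idea: each pixel value is sampled from a distribution conditioned on everything generated before it. The `model.predict_distribution` interface and the raster-scan ordering are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def generate_video(model, num_frames, height, width, num_values=256):
    """Fill a video tensor one pixel at a time, in raster-scan order."""
    video = np.zeros((num_frames, height, width), dtype=np.int64)
    for t in range(num_frames):
        for y in range(height):
            for x in range(width):
                # `model.predict_distribution` is a hypothetical stand-in for a
                # forward pass that returns a categorical distribution over the
                # possible intensities, conditioned on all pixels generated so far.
                probs = model.predict_distribution(video, position=(t, y, x))
                video[t, y, x] = np.random.choice(num_values, p=probs)
    return video
```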
Uniquely, Transformers feature an attention mechanism: every output element is connected to every input element, and the weightings between them are calculated dynamically from the data. It's this property that enables the video-generating systems to efficiently model clips as 3D volumes, as opposed to sequences of still frames, and that allows representations of the videos' pixels to interact directly across spatial and temporal dimensions.
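For readers unfamiliar with the mechanism, the sketch below shows standard scaled dot-product attention (as defined in "Attention Is All You Need") in NumPy; the video models apply a generalized, more memory-efficient variant of this idea across space and time rather than across words.

```python
import numpy as np

def attention(queries, keys, values):
    """Scaled dot-product attention over a flattened set of positions.

    queries, keys: arrays of shape (num_positions, d_k); values: (num_positions, d_v).
    """
    d_k = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)        # dynamic pairwise weightings
    scores -= scores.max(axis=-1, keepdims=True)    # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over all input positions
    return weights @ values                         # every output mixes every input
```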
To maintain a manageable memory footprint and to create an architecture well suited to tensor processing units (TPUs), Google's custom-designed AI accelerator chips, the researchers combined the Transformer-derived architecture with approaches that generate images as sequences of smaller, sub-scaled image slices. Their models produce "slices" (sub-sampled, lower-resolution videos) by processing partially masked video input with an encoder, the output of which is used as conditioning for decoding the current video slice. After a slice is generated, the respective padding in the video is replaced with the generated output and the process is repeated for the next slice.
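The slice-by-slice procedure can be summarized roughly as follows. This is a simplified sketch under assumptions: `encoder` and `decoder` are hypothetical callables standing in for the paper's networks, and a simple strided sub-sampling scheme is used in place of the paper's exact slice ordering.

```python
import numpy as np

def generate_by_slices(encoder, decoder, video_shape, strides):
    """Generate a video as a sequence of sub-sampled, lower-resolution slices."""
    video = np.zeros(video_shape)             # padded placeholder for the full video
    mask = np.zeros(video_shape, dtype=bool)  # True wherever pixels have been generated
    st, sh, sw = strides                      # sub-sampling strides in time, height, width
    # Each offset (ot, oh, ow) picks out one interleaved lower-resolution slice.
    for ot in range(st):
        for oh in range(sh):
            for ow in range(sw):
                # Encode the partially masked video as conditioning for this slice.
                conditioning = encoder(np.where(mask, video, 0.0))
                # Decode the current slice given that conditioning.
                current_slice = decoder(conditioning, offset=(ot, oh, ow))
                # Replace the respective padding with the generated output, then repeat.
                video[ot::st, oh::sh, ow::sw] = current_slice
                mask[ot::st, oh::sh, ow::sw] = True
    return video
```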

Above: Videos generated by models trained on the BAIR data set.
Image Credit: Google
In experiments, the team modeled slices of four frames by first feeding their AI systems video from the BAIR Robot Pushing data set, which consists of roughly 40,000 training and 256 test videos showing a robotic arm pushing and grasping objects in a box. Next, they applied the models to down-sampled videos from the Kinetics-600 data set, a large-scale action-recognition corpus containing about 400,000 YouTube videos across 600 action classes.
Smaller models were trained for 300,000 steps, while larger ones were trained for one million steps.
The qualitative results were quite good: the team reports "highly encouraging" generated videos for limited subsets such as cooking videos, which they note feature camera movement and complex object interactions (like steam and fire) and cover diverse subjects. "This marks a departure from the often very narrow domains discussed in the video generation literature to date, such as artificially generated videos of moving digits or shapes," wrote the researchers, "or videos depicting natural, yet highly constrained environments, such as robot arms interacting with a small number of different objects with a fixed camera angle and background."
They concede that the models struggle with nuanced elements like the motion of human fingers and faces, and they point out a number of failure modes in the outputs, ranging from frozen movement and object distortions to continuations that break down after a few frames. But those shortcomings aside, they assert that their systems achieve state-of-the-art results in video generation and demonstrate an aptitude for modeling clips of "unprecedented" complexity.