Technology

Meta's Movie Gen model produces realistic video with sound, so we can finally have infinite Moo Deng

No one really knows yet what generative video models are good for, but that hasn't stopped companies like Runway, OpenAI, and Meta from investing heavily in their development. Meta's latest is called Movie Gen, and true to its name, it turns text prompts into relatively realistic video with sound… though thankfully no voice just yet. And, wisely, they are not releasing it publicly.

Movie Gen is actually a collection (or "cast," as they call it) of foundation models, the biggest of which is the text-to-video piece. Meta claims it outperforms the likes of Runway Gen-3, the latest LumaLabs release, and Kling 1.5, though as always, this sort of comparison shows more that they are all playing the same game than that Movie Gen is winning. The technical specs can be found in Meta's paper describing all the components.

Audio is generated to match the content of the video, adding, for example, engine noises that correspond to a car's movements, the sound of a waterfall in the background, or a crack of thunder partway through the video when called for. It will even add music if it seems relevant.

It was trained on "a combination of licensed and publicly available datasets" that Meta called "proprietary/commercially sensitive," with no further details provided. We can only guess, but that likely means plenty of Instagram and Facebook videos, plus some partner material and much more that isn't properly shielded from scrapers – i.e., "publicly available."

But Meta is clearly aiming here not just to claim the "state of the art" crown for a month or two, but for a practical, soup-to-nuts approach in which a solid final product can be produced from something very simple: a natural-language prompt. Things like "imagine me as a baker baking a shiny hippopotamus-shaped cake during a storm."

For example, one of the sticking points with these video generators is how difficult they tend to be to edit. Ask for a video of a person crossing the street, then realize you want them walking from right to left instead of left to right, and there's a good chance the whole shot will look different when you repeat the prompt with that extra instruction. Meta adds a simple, text-based editing method where you can just say "change the background to a busy intersection" or "change her clothes to a red dress" and it will attempt to make that change, and only that change.

Image credits: Meta

Camera movements are also generally understood, with things like "tracking shot" and "pan left" taken into account when generating the video. It's still quite clunky compared to real camera controls, but it's a lot better than nothing.

The model's limitations are a little odd. It generates video 768 pixels wide, a dimension familiar to most from the famous but outdated 1024×768 resolution, but which is also three times 256, so it plays well with other HD formats. The Movie Gen system upscales this to 1080p, which is the source of the claim that it produces that resolution. Not entirely true, but we'll give them a pass because upscaling is surprisingly effective.

Oddly enough, it generates up to 16 seconds of video… at 16 frames per second, a frame rate no one in history has ever wanted or asked for. You can, however, also get 10 seconds of video at 24 frames per second. Lead with that!
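For a quick back-of-the-envelope look at those specs, here is a small Python sketch (our own arithmetic on the reported numbers, not anything from Meta's code; the frame-budget reading at the end is just a guess):

# Rough arithmetic on Movie Gen's reported output specs.
# The numbers come from the article above; the interpretation is speculative.

native_width = 768      # native generation width, per Meta's paper
upscaled_height = 1080  # the "1080p" output comes from upscaling

print(native_width / 256)              # 3.0 -> 768 is exactly three times 256
print(upscaled_height / native_width)  # ~1.41 -> the stretch needed to reach 1080p

# The two generation modes land at nearly the same total frame count,
# which may hint at a fixed frame budget (our guess, not Meta's claim):
print(16 * 16)  # 256 frames: 16 seconds at 16 fps
print(10 * 24)  # 240 frames: 10 seconds at 24 fps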

As for why it doesn't do voices… well, there are probably two reasons. First, it's very hard. Generating speech is easy now, but matching it to lip movements, and those lip movements to faces, is a far more complicated proposition. I don't blame them for leaving it until later, since it would have been a failure case a minute. Someone could say, "generate a clown delivering the Gettysburg Address while riding around on a little bicycle" – nightmare fuel primed to go viral.

The second reason is likely political: releasing what amounts to a deepfake generator a month before a major election is… not the best look. A practical precaution is to slightly limit its capabilities, so that if malicious actors want to misuse it, it would take real work on their part. You could certainly combine this generative model with a speech generator and a lip-sync generator, but you can't just generate a candidate making wild claims.

“Movie Gen is currently purely an AI research concept, and even at this early stage, security is a top priority, as it is with all of our generative AI technologies,” a Meta representative said in response to TechCrunch’s questions.

Unlike, say, the Llama series of large language models, Movie Gen will not be publicly available. You can replicate its techniques to some extent by following the research paper, but the code will not be published, apart from the "baseline evaluation prompt dataset" – a record of the prompts used to generate the test videos.

This article was originally published on techcrunch.com
