Diffusion models have been widely used with remarkable success for text-to-image generation in recent years, leading to significant improvements in image quality, inference performance, and the range of creative possibilities. However, effective generation control remains a challenge, especially under conditions that are hard to describe in words.
MediaPipe diffusion plugins, developed by Google researchers, make it possible to run controllable text-to-image generation on device. In this work, we extend our previous work on on-device GPU inference for large generative models and present low-cost solutions for controllable text-to-image generation that can be plugged into existing diffusion models and their Low-Rank Adaptation (LoRA) variants.
Diffusion models treat image generation as iterative denoising. Starting from an image corrupted with noise, each iteration of the diffusion model removes some of the noise until an image of the target concept emerges. Language understanding through text prompts has greatly improved the image generation process.
For text-to-image generation, the text embedding is connected to the model through cross-attention layers.
However, some details, such as the position and pose of an object, are harder to convey through text prompts. To address this, researchers add extra models that inject control information from a condition image into the diffusion process.
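To make the iterative process concrete, the sketch below shows a minimal text-conditioned denoising loop. The `unet`, `text_emb`, and `alphas_cumprod` names are hypothetical placeholders rather than any real library's API, and a deterministic DDIM-style update is assumed; this is an illustration of the general idea, not the actual Stable Diffusion or MediaPipe code.

```python
import torch

# Hypothetical components (placeholders, not a real library API):
#   unet(latents, t, text_emb) -> predicted noise; the text embedding is fed
#     to the U-Net through cross-attention layers.
#   alphas_cumprod: cumulative noise schedule, indexed by timestep.

@torch.no_grad()
def generate(unet, text_emb, alphas_cumprod, timesteps, shape):
    # Start from pure Gaussian noise.
    x = torch.randn(shape)
    for i, t in enumerate(timesteps):                       # e.g. [999, 979, ..., 0]
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[timesteps[i + 1]] if i + 1 < len(timesteps) \
                 else torch.tensor(1.0)
        # Predict the noise in the current latent, conditioned on the text.
        eps = unet(x, t, text_emb)
        # Estimate the clean image implied by that noise prediction.
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        # Deterministic DDIM-style step toward the next, less noisy latent.
        x = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps
    return x
```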
Image: Or Manor exploring MediaPipe capabilities at Google I/O Extended in Amsterdam.
Various methods, such as Plug-and-Play, ControlNet, and T2I Adapter, are commonly used to produce controlled text-to-image output.
Plug-and-Play uses a copy of the diffusion model (860M parameters for Stable Diffusion 1.5) together with denoising diffusion implicit model (DDIM) inversion to encode the state of an input image.
Plug-and-Play then extracts spatial features with self-attention from the copied diffusion model and injects them into the text-to-image diffusion.
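As an illustration of the DDIM inversion step mentioned above, the sketch below runs the deterministic DDIM update in the forward (noising) direction to recover a latent that reconstructs the input image. The `unet`, `text_emb`, and `alphas_cumprod` names are the same hypothetical placeholders as before, not the actual Plug-and-Play code.

```python
import torch

@torch.no_grad()
def ddim_invert(unet, x0, text_emb, alphas_cumprod, timesteps):
    """Map a clean latent x0 back to a noisy latent by running the deterministic
    DDIM update in the noising direction (a sketch of the inversion idea used by
    Plug-and-Play, not its actual implementation)."""
    x = x0
    for i, t in enumerate(timesteps):                   # e.g. [0, 20, ..., 999]
        t_next = timesteps[i + 1] if i + 1 < len(timesteps) else t
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
        eps = unet(x, t, text_emb)                      # predicted noise at step t
        # Clean-image estimate implied by the current latent and noise prediction.
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        # Step *forward* in noise level, reusing the same noise prediction.
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
    return x                                            # inverted (noisy) latent
```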
ControlNet creates a trainable copy of the encoder of a diffusion model, which is connected to the decoder layers through convolution layers with zero-initialized parameters to encode the conditioning information.
Unfortunately, this approach brings a significant size increase of about 450M parameters for Stable Diffusion 1.5, half the size of the diffusion model itself. T2I Adapter, in contrast, delivers comparable results in controlled generation with a much smaller network (77M parameters).
T2I Adapter takes only the condition image as input, and its output is shared across all diffusion iterations. However, this type of adapter is not designed for mobile devices.
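The zero-initialized connection described above for ControlNet can be sketched as follows: a trainable condition branch whose output convolution starts at zero, so the frozen base model's behavior is unchanged at the start of training. Module names and the block interface are illustrative assumptions, not the actual ControlNet code.

```python
import torch
from torch import nn

def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution whose weights and bias start at zero, so the branch
    initially contributes nothing to the frozen base model."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ConditionBranch(nn.Module):
    """Illustrative trainable copy of an encoder block plus a zero conv
    (a sketch of the ControlNet idea, not its actual architecture)."""
    def __init__(self, encoder_block_copy: nn.Module, channels: int):
        super().__init__()
        self.block = encoder_block_copy   # trainable copy of a base encoder block
        self.zero = zero_conv(channels)   # zero-initialized output projection

    def forward(self, base_features, condition_features):
        # The residual is exactly zero at initialization, so training starts
        # from the unchanged base model and gradually learns the control signal.
        return base_features + self.zero(self.block(condition_features))
```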
The MediaPipe diffusion plugin is a standalone network that Google developed to make conditioned generation efficient, flexible, and scalable. It is:
Pluggable: it can be easily connected to a pre-trained base model.
Trained from scratch: it does not use pre-trained weights from the base model.
Portable: it can be run outside the base model on mobile devices at almost no additional cost.
The plugin is its own network, whose output can be plugged into an existing text-to-image model. The features extracted by the plugin are fed into the corresponding downsampling layers of the diffusion model.
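A rough sketch of how such plugin features could be added to the encoder is shown below. The `down_blocks` interface and the per-level injection are assumptions for illustration only; the actual internals of the MediaPipe plugin are not spelled out here.

```python
import torch
from torch import nn

def encode_with_plugin(down_blocks, plugin_features, x, t_emb, text_emb):
    """Run the diffusion encoder, adding the plugin's multiscale features to the
    matching downsampling levels (an illustrative sketch, not the actual
    MediaPipe implementation).

    down_blocks:      list of U-Net downsampling blocks (hypothetical interface).
    plugin_features:  list of tensors from the plugin, one per downsampling level,
                      computed once from the condition image and reused each step.
    """
    skips = []
    for block, cond in zip(down_blocks, plugin_features):
        x = block(x, t_emb, text_emb)   # base model's own downsampling block
        x = x + cond                    # inject the condition at the matching scale
        skips.append(x)                 # skip connections for the decoder
    return x, skips
```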
The MediaPipe diffusion plugin, available as a free download, is a portable on-device model for text-to-image generation. It takes a condition image and uses multiscale feature extraction to add features at the appropriate scales to the encoder of a diffusion model.
When coupled with a text-to-image diffusion model, the plugin adds a conditioning signal to image generation. The plugin network is a relatively simple model with only 6M parameters. For fast inference on mobile devices, it uses depth-wise convolutions and inverted bottlenecks from MobileNetV2.
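The block below sketches what a MobileNetV2-style inverted bottleneck with a depth-wise convolution looks like, and how a small stack of them could produce multiscale condition features. Layer counts and channel widths are illustrative guesses, not the real 6M-parameter plugin.

```python
import torch
from torch import nn

class InvertedBottleneck(nn.Module):
    """MobileNetV2-style block: 1x1 expand -> 3x3 depth-wise -> 1x1 project."""
    def __init__(self, c_in, c_out, expand=4, stride=1):
        super().__init__()
        hidden = c_in * expand
        self.block = nn.Sequential(
            nn.Conv2d(c_in, hidden, 1, bias=False),            # expand
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                      groups=hidden, bias=False),              # depth-wise conv
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, c_out, 1, bias=False),           # project
            nn.BatchNorm2d(c_out),
        )
        self.residual = (stride == 1 and c_in == c_out)

    def forward(self, x):
        out = self.block(x)
        return x + out if self.residual else out

class ConditionPlugin(nn.Module):
    """Toy multiscale feature extractor for a condition image (a sketch in the
    spirit of the plugin, not the real MediaPipe network)."""
    def __init__(self, channels=(32, 64, 128, 256)):
        super().__init__()
        self.stem = nn.Conv2d(3, channels[0], 3, stride=2, padding=1)
        self.stages = nn.ModuleList(
            InvertedBottleneck(c_in, c_out, stride=2)
            for c_in, c_out in zip(channels[:-1], channels[1:])
        )

    def forward(self, condition_image):
        feats = []
        x = self.stem(condition_image)
        for stage in self.stages:
            x = stage(x)
            feats.append(x)        # one feature map per scale
        return feats               # fed to the matching encoder layers
```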
MediaPipe offers easy-to-understand abstractions for self-serve machine learning: a low-code API or a no-code studio can be used to customize, test, prototype, and ship an application.
It provides innovative machine learning (ML) approaches to common problems, developed with Google's ML expertise.
It is fully optimized, including hardware acceleration, while remaining small and efficient enough to run smoothly on battery-powered smartphones.
Check out the project page and the Google blog for more details.