Computer Vision: Image Generator Networks

I am a Machine Learning Engineer by trade and a Computer Vision scientist by education, so any use case where computers try to learn about the wonderful world outside (machine learning), primarily from visual information (computer vision), gives me a professional warm fuzzy.

At New Math Data we work on a range of machine learning and computer vision projects that help our customers get the most from their data.

Some Core Concepts

Neural networks are computational models inspired by the human brain. In a nutshell, a neural network is a graph of interconnected layers: there is always at least one input layer and at least one output layer, with any number of layers in between. Layers are made up of individual neurons, which model nerve cells in the brain. Each neuron applies a simple activation function (ReLU, Sigmoid, Tanh) to a weighted sum of its inputs, and stacking many such simple units is what lets the model recognize complex patterns.

Figure 1: Simplified Picture of a Neural Network
Figure 2: Diagram of a Neuron
Figure 3: Activation Function Flow
Figure 4: Activation Functions and their Graphs
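
To make these concepts concrete, here is a minimal sketch of a tiny fully connected network with ReLU activations, written in PyTorch (which the references also use). The layer sizes are arbitrary and purely for illustration.

    import torch
    import torch.nn as nn

    # A tiny feed-forward network: input layer -> hidden layer -> output layer.
    # Each Linear layer is a set of neurons; ReLU is the activation function.
    model = nn.Sequential(
        nn.Linear(4, 8),   # 4 input features, 8 hidden neurons
        nn.ReLU(),         # non-linearity lets the model learn complex patterns
        nn.Linear(8, 2),   # 2 output neurons
    )

    x = torch.randn(1, 4)   # one example with 4 features
    y = model(x)            # pass the input through the layers
    print(y.shape)          # torch.Size([1, 2])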

Convolutional Neural Networks (CNNs) are a special kind of neural network that is very popular in Computer Vision. Their base units use convolution operations to process visual information, and they are loosely inspired by the layers of retinal cells at the back of the eye.

Convolutions are basically weighted sums over a region of an image (or any other vector input, for that matter). The convolution weights are also known as kernels. Convolutions may be computed over multiple channels, such as RGB or feature planes [5].

Figure 5: Computing convolution
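
As a quick illustration (in the spirit of the PyTorch convolution tutorial in [5]), here is a 3×3 convolution applied to a 3-channel (RGB) image; the sizes are arbitrary.

    import torch
    import torch.nn.functional as F

    image = torch.randn(1, 3, 32, 32)    # batch of 1 RGB image, 32x32 pixels
    kernel = torch.randn(8, 3, 3, 3)     # 8 output channels, 3 input channels, 3x3 kernel

    # Each output value is a weighted sum (the kernel weights) over a 3x3
    # neighborhood of the input, computed across all 3 input channels.
    features = F.conv2d(image, kernel, padding=1)
    print(features.shape)                # torch.Size([1, 8, 32, 32])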

Loss functions. Mathematically, neural networks implement multivariate vector functions. They are also quite configurable: each neuron has a set of weights associated with it. How do we configure those weights as well as possible? That's what loss functions are for: they measure the error between the neural network's output and the desired outcome. We then apply training (see below) to configure the weights so that the loss is kept to a minimum.
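
For example, the mean squared error loss simply averages the squared differences between the network's output and the target; a tiny sketch:

    import torch

    prediction = torch.tensor([2.0, 1.0])   # network output
    target = torch.tensor([3.0, 1.0])       # desired outcome

    # Mean squared error: average of the squared differences.
    loss = ((prediction - target) ** 2).mean()
    print(loss.item())                      # 0.5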

Training. We want to make sure the neural network parameters (also called weights) are configured in the best way possible; in other words, we want to find the set of parameters that minimizes the loss function. This process is called training and involves forward propagation and backpropagation.

Forward propagation. Remember we mentioned that neural networks are, basically, functions? Forward propagation is simply the function evaluation step. It involves passing input through the layers, starting from the input layer(s) and going all the way to the output layer(s). It usually includes evaluating the loss function as well, to see if we are making progress minimizing the loss.

Backpropagation. If there is anything worth remembering from Calculus 1, it's that in order to optimize a function we need to compute its gradient. That's what the backpropagation step is all about: we compute the gradient of the loss function starting from the output layers and working backwards, applying the chain rule (another thing worth remembering from Calculus 1). Next, we compute a new set of weights by applying gradient descent.

Gradient descent involves repeatedly taking very small steps in the direction opposite to the gradient. We just have to remember that we are taking those steps in the parameter (weight) space of the neural network, and that each step brings the network's weights a bit closer to the optimal set.
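
Putting forward propagation, backpropagation, and gradient descent together, one training step in PyTorch looks roughly like this; the model, data, and learning rate below are placeholders, not a recipe.

    import torch

    model = torch.nn.Linear(4, 2)             # stand-in for any neural network
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # lr = step size
    loss_fn = torch.nn.MSELoss()

    x = torch.randn(16, 4)                    # a batch of placeholder inputs
    target = torch.randn(16, 2)               # desired outcomes

    optimizer.zero_grad()                     # clear gradients from the previous step
    loss = loss_fn(model(x), target)          # forward propagation + loss evaluation
    loss.backward()                           # backpropagation (chain rule)
    optimizer.step()                          # one small gradient descent step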

LLMs (Large Language Models) are Machine Learning models used in Natural Language Processing tasks (think chatbots, automated tech support agents, automatic translators, AI trainers, and a multitude of other applications). This article covers yet another LLM application: we "teach" Convolutional Neural Networks (used in Computer Vision) and Large Language Models (used in Natural Language Processing) to speak the same language, so that we can generate images from text and text from images.

Tokens. LLMs break text into tokens in order to make it easier to learn the language in a meaningful way. Please note that tokens are not necessarily full words; they can be sub-word fragments, punctuation marks, or even special markers.
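
As a quick illustration, here is how a GPT-2 style tokenizer splits a sentence. This sketch assumes the Hugging Face transformers library is available; other tokenizers will split text differently.

    from transformers import AutoTokenizer

    # Load the GPT-2 tokenizer (a byte-pair-encoding tokenizer).
    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    # Prints a list of sub-word tokens, not necessarily whole words.
    print(tokenizer.tokenize("Backpropagation is surprisingly simple."))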

Embeddings are lower-dimensional representations of sparse data. They let models learn complex patterns and are a foundational concept for AI, since they are how images and text get converted into numerical representations. Large Language Models, for example, map tokens to embedding vectors, which transformer layers then refine based on context.
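
In code, an embedding layer is just a lookup table that maps each token id to a dense vector; a minimal PyTorch sketch (the vocabulary size and dimension are arbitrary):

    import torch
    import torch.nn as nn

    vocab_size, embedding_dim = 50_000, 256
    embedding = nn.Embedding(vocab_size, embedding_dim)

    token_ids = torch.tensor([101, 2054, 2003])   # three arbitrary token ids
    vectors = embedding(token_ids)                # each id becomes a 256-d vector
    print(vectors.shape)                          # torch.Size([3, 256])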

Attention mechanism (also known as the QKV scheme). The attention mechanism allows neural networks to automatically focus on the parts of the input that truly matter in a given context. Consider, for example, two sentences: "This is a novel idea" and "Gone with the Wind is a fascinating novel to read". In order to properly interpret the word "novel", the attention mechanism will assign more weight to the word "idea" in the first sentence and more weight to "read" in the second. The whole process takes only a few steps: the query Q represents the current state of the model, the keys K and values V are obtained from the input, and similarity scores are computed between Q and K using a similarity function (think dot product). The similarity scores are then applied to the corresponding elements of V and the results are added together to obtain the embedding of interest:
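
Written out in the notation of [9], the result is Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k)·V, where d_k is the dimension of the keys. A minimal PyTorch sketch of this scaled dot-product attention (the shapes below are arbitrary):

    import torch
    import torch.nn.functional as F

    def attention(Q, K, V):
        # Similarity scores between each query and every key (scaled dot product).
        d_k = Q.size(-1)
        scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
        # Softmax turns scores into weights; the weights are applied to the values.
        weights = F.softmax(scores, dim=-1)
        return weights @ V

    Q = torch.randn(1, 5, 64)   # 5 query positions, 64-dimensional
    K = torch.randn(1, 5, 64)
    V = torch.randn(1, 5, 64)
    print(attention(Q, K, V).shape)   # torch.Size([1, 5, 64])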

Attention mechanisms are described in more detail in [9].

Autoencoders are neural networks of a special kind. They are trained to create low-dimensional embeddings of the input and then reconstruct the input from those embeddings. Autoencoders have an encoder part, which creates the embeddings, and a decoder part, which reconstructs the original data from them. The beauty of autoencoders is that they are self-supervised: they just need input data for training, with no manual labelling required.

Figure 6: Autoencoder. Please note that the whole architecture is sometimes called an "hourglass" because the "bottleneck" layers containing the encoded data are usually of lower dimension, so the whole picture resembles an hourglass lying on its side (or maybe a butterfly).
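
A bare-bones autoencoder sketch in PyTorch, showing the hourglass shape (wide input, narrow bottleneck, wide output); the layer sizes are arbitrary.

    import torch.nn as nn

    class Autoencoder(nn.Module):
        def __init__(self):
            super().__init__()
            # Encoder: squeeze the input down to a small embedding (the bottleneck).
            self.encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
            # Decoder: reconstruct the original input from the embedding.
            self.decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))

        def forward(self, x):
            return self.decoder(self.encoder(x))

    # Self-supervised training: the reconstruction is compared with the input itself,
    # so no manual labels are needed.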

Variational Autoencoders

In this article we are going to discuss variational autoencoders (VAEs) [11]. VAEs differ from other autoencoder varieties in that they don't create a single "snapshot" encoding of the input; instead, they encode a probability distribution over the latent space. We can later sample from this distribution in the decoder to generate new examples. Please note that in the context of this article VAEs are convolutional neural networks: they use convolution as their base unit. We are going to use VAEs to encode and accurately reconstruct images and text data.
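
The key difference in a VAE is that the encoder outputs a mean and a (log) variance rather than a single point, and a latent vector is sampled from that distribution via the reparameterization trick. A minimal sketch of just the sampling step (sizes are arbitrary, and the mean and variance below are placeholders for real encoder outputs):

    import torch

    def encode_and_sample(mu, log_var):
        # The encoder predicts a distribution (mean and log-variance) per input.
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)      # random noise
        return mu + eps * std            # a latent sample the decoder can turn into a new image

    mu = torch.zeros(1, 32)              # placeholder encoder outputs
    log_var = torch.zeros(1, 32)
    z = encode_and_sample(mu, log_var)   # latent vector fed to the decoder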

Let's say we want to generate an image based on a few inherently vague sentences. "Impossible!" I would have exclaimed emphatically just a few years ago. "We are not quite there yet!" But with the recent advances in Generative AI, image-generating networks have come well within our grasp. The term "Generative AI" applies not only to LLMs but also to image generators. I see LLMs as statistical AI parrots, imitating human speech and memorizing a lot of potentially relevant information along the way. The trick with image generation is to blend the two, making sure the word embeddings are properly interpreted by the image generator. The entire discipline is called T2I (text-to-image). Historically, the approaches fall into three categories:

Generative Adversarial Networks (GANs)

GAN-based solutions were the first to produce reasonable output, and they generate some of the crispest, most vivid, high-fidelity images.
GANs include two parts: a generator and a discriminator. Please note that both are convolutional neural networks (CNNs), as they both use convolution as their base unit. The generator creates realistic-looking images, and the discriminator learns to distinguish real images from fake ones (see [8] for more details). Conditional GANs [10] go one step further and generate output conditioned on some prior information. This information may include an image description in a sentence or two, encoded with a transformer-type LLM, and that's how GANs generate relevant images. The paper on StackGANs [7] explains the process in detail.
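
Conceptually, a conditional GAN's generator takes random noise plus a conditioning vector (for example, a text embedding) and produces an image, while the discriminator scores how real that image looks. A highly simplified sketch follows; it uses fully connected layers instead of convolutions purely for brevity, and the text embedding is a placeholder.

    import torch
    import torch.nn as nn

    # Generator: noise + text embedding -> flattened fake image.
    generator = nn.Sequential(
        nn.Linear(100 + 256, 512), nn.ReLU(), nn.Linear(512, 64 * 64 * 3), nn.Tanh())

    # Discriminator: flattened image -> probability that the image is real.
    discriminator = nn.Sequential(
        nn.Linear(64 * 64 * 3, 512), nn.LeakyReLU(0.2), nn.Linear(512, 1), nn.Sigmoid())

    noise = torch.randn(1, 100)
    text_embedding = torch.randn(1, 256)        # placeholder for an LLM-encoded caption
    fake_image = generator(torch.cat([noise, text_embedding], dim=1))
    realism_score = discriminator(fake_image)   # the generator tries to push this toward 1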

Conditional GANs, as great as they are, have a few inherent problems when it comes to image generation. For example, they experience mode collapse. In statistics, the mode is the most frequent value in a data set. So, in the context of GANs, mode collapse happens when the generator repeatedly produces a very small set of "modes" (the most frequent kinds of images in the data set) no matter what the input is. That is why the next generation of T2I tools moved to "hourglass" architectures such as VAEs and Stable Diffusion models.

Variational Autoencoders (VAE)

VAE-based solutions include, for example, the initial version of DALL-E (pronounced "Dali", in honor of Salvador Dalí). They are based on hourglass-style neural network architectures (see figure 6 above), U-Net being the prime example. The layers in the center of the hourglass represent the embeddings (both image and text). The whole idea of the autoencoder is to encode image/text pairs into embeddings and then learn how to restore those embeddings back into images. Here are the secret ingredients worth mentioning:

  • Each image is compressed into a 32×32 grid of image tokens
  • Each image token is "snapped" to an underlying LLM transformer embedding to make sure we are using the same "codebook" (a set of token embeddings) as the language model. This step ensures that the LLM and the VAE use the same embedding vectors, i.e. "speak" the same language. For example, vector e53 in the embedding space matches label 53 at the bottom of the grid, and it also matches vector e53 of the decoder (figure 7 below). That's not a coincidence: they all "speak" the same vector language!
  • Replacing embeddings with indexes technically breaks the backpropagation step, so the corresponding gradients are simply copied unchanged from the decoder to the encoder, as if the quantization step were not there (see the sketch after this list)
  • We concatenate image and text embeddings together
  • And then, finally, we train a VAE autoencoder to accurately reconstruct image/text pairs.
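
Here is a minimal sketch of that "snapping" step: nearest-neighbor vector quantization against a shared codebook, with the straight-through gradient copy mentioned above. This is an illustration of the general technique, not DALL-E's actual implementation.

    import torch

    def quantize(z_e, codebook):
        # z_e:      (batch, num_tokens, dim) continuous encoder outputs
        # codebook: (codebook_size, dim) shared table of token embeddings
        dists = torch.cdist(z_e, codebook.unsqueeze(0).expand(z_e.size(0), -1, -1))
        indexes = dists.argmin(dim=-1)     # label of the nearest codeword per token
        z_q = codebook[indexes]            # "snapped" embedding vectors

        # Straight-through estimator: the forward pass uses z_q, but gradients
        # flow back to z_e as if the quantization step were not there.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, indexes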

So, let's say you are done training the model. How does inference work? In other words: how do you generate images given a text prompt? The conceptual answer is relatively straightforward: if you supply the text tokens and prompt for image tokens, the decoder will generate the rest of the image one token at a time. You just need to keep generating new tokens until you have filled the entire 32×32 grid. The implementation details are more complex, and I highly encourage curious readers to have a look at [1] for more information.
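
Schematically, the generation loop looks something like the sketch below, where `model` is a hypothetical placeholder for the trained decoder (it is assumed to return next-token logits for a sequence), not an actual DALL-E API.

    import torch

    def generate_image_tokens(model, text_tokens, grid_size=32):
        tokens = list(text_tokens)                  # start from the encoded text prompt
        for _ in range(grid_size * grid_size):      # 32 x 32 = 1024 image tokens
            logits = model(torch.tensor([tokens]))  # (1, seq_len, vocab_size) logits
            probs = torch.softmax(logits[0, -1], dim=-1)
            tokens.append(torch.multinomial(probs, 1).item())   # sample the next token
        return tokens[len(text_tokens):]            # image tokens, handed to the decoder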

Figure 7: DALL-E v.1 fundamentals. See [6] for more details

Stable Diffusion Models

The DALL-E model was a huge leap for mankind. People all over the world started playing with it as soon as it hit the open-source community (and then social media soon afterward). They soon discovered limitations: images would occasionally come out distorted, pixelated and/or blocky, and they would often contain bizarre artifacts: six-fingered hands, three-legged creatures and so on, as in figure 8 below.

Figure 8: AI artifacts

Later versions of DALL-E, Midjourney and other vision models use the same family of diffusion algorithms, and just like that, Stable Diffusion-style models became all the rage for T2I. I'll briefly explain their inner workings. Imagine a painting slowly appearing from a blurry void. Or, as Michelangelo put it, every block of stone has a statue inside it, and it is the task of the sculptor to discover it; the same applies to images. This process of something slowly emerging and filling the void is called diffusion in physics, and there are math equations describing it, which comes in handy when we need to put together a loss function. Stable diffusion models seize on this. In a nutshell, we do the following:

  • First-generation diffusion models work with actual RGB images and figure out how to restore them from the blur.
  • Instead of working with the full RGB image, we use autoencoders to create a low-dimensional embedding of the image (see figure 9)
  • Interestingly, this autoencoder has an adversarial GAN component to make the encoded image look more realistic. So, the whole GAN-based research direction was very useful after all.
  • The stable diffusion process is applied to the bottleneck (low-dimensional embedding) layer of the autoencoder.
  • Here's another way to look at it: we use autoencoders to perform image compression. So the purpose of the autoencoder is similar to the 32×32 grid of DALL-E v.1, but it takes a much more laissez-faire approach.
  • These improvements allow us to generate way better quality high-resolution images while avoiding blockiness and artifacts.
  • The U-Net model at the bottom of figure 9 models the diffusion process, in which an image embedding is gradually turned into noise and the process is then reversed (see the training-step sketch after this list).
  • Note the QKV blocks in the decoder portion. Do they look familiar? If so, it's because they implement an attention mechanism, and yes, it is similar to the now famous attention mechanism of LLMs. Their main purpose is to make the whole system use the codebook of the underlying LLM in its visual representation. Remember how DALL-E v.1 would "snap" each embedding to the closest codeword? These blocks serve the same purpose, albeit in a much more subtle and flexible way.
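
At its core, training a diffusion model boils down to adding noise to a (latent) image and teaching a network to predict that noise. Below is a stripped-down sketch of one training step in that style; `denoiser` is a placeholder for the U-Net, and `alpha_bar` is assumed to hold the cumulative noise schedule.

    import torch

    def diffusion_training_step(denoiser, latents, alpha_bar):
        # latents:   (batch, channels, height, width) autoencoder embeddings
        # alpha_bar: 1-D tensor with the cumulative noise schedule (values in (0, 1])
        t = torch.randint(0, len(alpha_bar), (latents.size(0),))  # random timestep per sample
        noise = torch.randn_like(latents)
        a = alpha_bar[t].view(-1, 1, 1, 1)

        # Forward (noising) process: blend each clean latent with pure noise.
        noisy_latents = a.sqrt() * latents + (1 - a).sqrt() * noise

        # The U-Net ("denoiser") learns to predict the injected noise, so the
        # process can be run in reverse at generation time.
        predicted_noise = denoiser(noisy_latents, t)
        return ((predicted_noise - noise) ** 2).mean()             # simple MSE loss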

For the math aficionados, [2] describes the process in all its glory.

Figure 9: Stable Diffusion Models

But let's say you want to do just the opposite: you have an image and your task is to generate a description, which is a very common use case. It turns out the models above can be used for generating image captions as well. All you need to do is supply an image prompt instead of a text prompt and keep sampling text tokens from the joint distribution.

And finally, I just wanted to mention that a stable diffusion-based drawing tool is available online: https://stablediffusionweb.com/app/image-generator We asked it to draw an illustration for J.R.R. Tolkien's poetry from Lord of the Rings:

Roads go ever ever on,

Under cloud and under star.

Yet feet that wandering have gone

Turn at last to home afar.

Eyes that fire and sword have seen,

And horror in the halls of stone

Look at last on meadows green,

And trees and hills they long have known.

Here’s what it came up with, and I think it’s on the right track (no pun intended):

References:

1. Zero-Shot Text-to-Image Generation. Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, Ilya Sutskever.

2. High-Resolution Image Synthesis with Latent Diffusion Models. Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer.

3. Forward Propagation. https://www.geeksforgeeks.org/what-is-forward-propagation-in-neural-networks/

4. Backpropagation. https://en.wikipedia.org/wiki/Backpropagation

5. Convolution. https://www.geeksforgeeks.org/apply-a-2d-convolution-operation-in-pytorch/

6. Unlocking the Power of Discrete Latent Spaces. Isaak Kargar. https://medium.com/pythons-gurus/unlocking-the-power-of-discrete-latent-spaces-a-deep-dive-into-vaes-vq-vaes-and-generative-9fed49ed0f86

7. StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks. Han Zhang et al.

8. Generative adversarial networks explained. https://en.wikipedia.org/wiki/Generative_adversarial_network

9. Attention Is All You Need. Ashish Vaswani et al.

10. Conditional Generative Adversarial Nets. Mehdi Mirza et al., 2014.

11. Variational autoencoders. https://www.geeksforgeeks.org/variational-autoencoders/
