I was pretty interested to see what the model could produce given the same prompt, particularly across a large number of images. I used the Hugging Face diffusers library to set up the generation and just let it run. My original plan of 1 million hot-dogs in a day quickly unraveled once I realized the required 695 hot-dogs/minute was unattainable, so I settled for 13 hot-dogs/minute and ~93,000 images (only 7,000 short).
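The generation code itself isn't shown here, but a minimal sketch of the setup with diffusers looks roughly like this; the checkpoint, prompt wording, and output paths are illustrative assumptions rather than details from the actual run.

```python
# A minimal sketch of the generation loop using Hugging Face diffusers.
# The checkpoint, prompt, and output directory are illustrative.
import os
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

prompt = "a photo of a hot dog"
os.makedirs("hot_dogs", exist_ok=True)

for i in range(100_000):  # the run was stopped at ~93,000 images
    image = pipe(prompt).images[0]
    image.save(f"hot_dogs/{i:06d}.jpg")
```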
Here is a randomly selected sample of 100 images. There is a pretty wide variety in the hot dogs that are generated, along with some very interesting interpretations of what constitutes a hot dog.
To dig a bit deeper into the hot-dog-100k dataset, we can index the data using Marqo. This allows us to easily search the images and do some additional labeling and classification of the images in the dataset. After downloading the dataset we start up Marqo:
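The startup itself is a single Docker command plus the Python client; a typical setup, assuming the standard Marqo Docker image and default port, looks like:

```python
# Start the Marqo server with Docker first, for example:
#   docker run --name marqo -it -p 8882:8882 marqoai/marqo:latest
# Then connect with the Python client (pip install marqo).
import marqo

mq = marqo.Client(url="http://localhost:8882")
```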
Once Marqo is up and running we can get the files ready for indexing:
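A sketch of this step, assuming the images were downloaded to a local hot_dogs/ folder; the folder and field names are illustrative.

```python
import glob
import os

# Wrap each downloaded image in a "document" for Marqo. The "image" field
# holds a pointer (path or URL) that Marqo can fetch and embed.
image_paths = sorted(glob.glob("hot_dogs/*.jpg"))

documents = [
    {
        "_id": os.path.basename(path),
        "image": path,
        "filename": os.path.basename(path),
    }
    for path in image_paths
]
```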
Now we can start indexing:
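Indexing is a matter of creating an image-capable index and adding the documents in batches. This is a sketch assuming a CLIP-style model and the image-pointer setting; exact argument names vary a little between Marqo versions.

```python
index_name = "hot-dogs-100k"  # illustrative index name

# Create an image-capable index backed by a CLIP model.
mq.create_index(
    index_name,
    treat_urls_and_pointers_as_images=True,
    model="ViT-L/14",
)

# Index in batches; tensor_fields tells Marqo which fields to embed.
mq.index(index_name).add_documents(
    documents,
    tensor_fields=["image"],
    client_batch_size=64,
)
```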
Check we have our images in the index:
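Something like the following works as a sanity check; the test query is arbitrary.

```python
# Document count plus a quick test search.
print(mq.index(index_name).get_stats())  # e.g. {'numberOfDocuments': ...}

results = mq.index(index_name).search("a hot dog", limit=5)
for hit in results["hits"]:
    print(hit["_id"], hit["_score"])
```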
One noticeable thing is the presence of some black images. These come from the built-in filter that blanks out images it deems NSFW. I couldn't be bothered to remove the filter, so some of these images ended up in the dataset. We can easily remove them using Marqo, though. Since I am lazy, I will just search for a blank image using a natural language description: "a black image".
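Something along these lines:

```python
# Find a representative black image with a natural language query.
results = mq.index(index_name).search("a black image", limit=1)

black_image = results["hits"][0]
print(black_image["_id"], black_image["_score"])
```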
Now we have a black image, we can search using that and find all the other duplicate black images and remove them from the dataset.
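A sketch of the cleanup; the similarity threshold and result limit here are arbitrary choices, not values from the original run.

```python
# Use the black image itself as the query (the index treats pointers as
# images), then delete every hit that is essentially identical to it.
results = mq.index(index_name).search(black_image["image"], limit=400)

ids_to_remove = [hit["_id"] for hit in results["hits"] if hit["_score"] > 0.99]

mq.index(index_name).delete_documents(ids=ids_to_remove)
```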
Now our dataset should be free from images that do not contain hot dogs.
Given the variety of hot dogs generated from a single prompt, I was keen to understand more. A quick search for "two hot dogs" yielded something of interest, and the queries "a hamburger" and "a face" turned up more.
Armed with this survey, I created a new index with the following four documents: "one hot dog", "two hot dogs", "a hamburger" and "a face". We can use our dataset images as queries against these labels and get back scores for each image:label pair. This is effectively zero-shot learning: it provides a score for each category, which could be thresholded to assign classification labels to each image.
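A sketch of this reverse search, with an illustrative index name and the same model as before so image queries land in the same embedding space:

```python
labels = ["one hot dog", "two hot dogs", "a hamburger", "a face"]
label_index = "hot-dog-labels"  # illustrative name

mq.create_index(
    label_index,
    treat_urls_and_pointers_as_images=True,
    model="ViT-L/14",
)
mq.index(label_index).add_documents(
    [{"_id": label, "label": label} for label in labels],
    tensor_fields=["label"],
)

# Query the label index with each image; the scores of the returned label
# "documents" act as zero-shot scores for each image:label pair.
image_scores = {}
for doc in documents:
    res = mq.index(label_index).search(doc["image"], limit=len(labels))
    image_scores[doc["_id"]] = {hit["_id"]: hit["_score"] for hit in res["hits"]}
```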
Now, for each image, we have the computed scores against each category, each category being just a "document" in our small index.
We have now calculated scores for each of the categories described above. The next step is to update our indexed data with these scores. If we re-index a document, it is updated with the new information. We can drop the image field from the documents, as it has already been indexed with the model.
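A sketch of the update step, following the approach described above; the score field names are illustrative, and depending on the Marqo version the tensor_fields / non_tensor_fields arguments to add_documents may need to be spelled differently.

```python
# Build new documents carrying the scores. Following the approach above,
# re-adding a document with the same _id updates it, and the image field is
# left out since its embedding is already stored.
score_docs = []
for doc in documents:
    scores = image_scores[doc["_id"]]
    score_docs.append(
        {
            "_id": doc["_id"],
            "filename": doc["filename"],
            "score_one_hot_dog": scores.get("one hot dog", 0.0),
            "score_two_hot_dogs": scores.get("two hot dogs", 0.0),
            "score_hamburger": scores.get("a hamburger", 0.0),
            "score_face": scores.get("a face", 0.0),
        }
    )

mq.index(index_name).add_documents(score_docs, tensor_fields=[])
```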
After all this work we can now animate the hot-dog-100k. We will do this by effectively "sorting" the vectors and creating a movie from the images that accompany them. Before we animate, we can make things more interesting by restricting the images we search over, pre-filtering on some of the scores we calculated. This effectively restricts the space used for the animation based on the filtering criteria.
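For example, a pre-filtered search over one of the stored scores might look like this; the field name, threshold, and result limit are illustrative, and the range-filter syntax shown is the Lucene-style form Marqo's filter strings use (check your version's docs).

```python
# Restrict the animation pool to images that scored highly for "a face".
filtered = mq.index(index_name).search(
    "a photo of a smiling face",
    filter_string="score_face:[0.6 TO 1.0]",
    limit=1000,
)
candidate_ids = [hit["_id"] for hit in filtered["hits"]]
```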
To animate the images based on their sorted vectors, we can take an image as a starting point (based on the query "a photo of a smiling face") and find the next closest one (as seen above). Repeat the process until no more images are left and you have walked across the latent space. We are effectively solving a variant of the traveling salesman problem, albeit with an approximate algorithm.
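A sketch of the greedy walk, assuming the image embeddings have already been pulled into a NumPy array with row i corresponding to image i; how you fetch the vectors out of Marqo is left out here.

```python
import numpy as np

def greedy_walk(embeddings: np.ndarray) -> list:
    """Order images by repeatedly hopping to the nearest unvisited neighbour.

    This is a greedy, approximate take on the travelling salesman ordering
    described above. Fine for a filtered subset; for the full dataset you
    would want an approximate nearest-neighbour index instead of brute force.
    """
    # Normalise so that the dot product equals cosine similarity.
    vecs = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

    current = 0  # start index, e.g. the top hit for "a photo of a smiling face"
    unvisited = set(range(len(vecs))) - {current}
    order = [current]

    while unvisited:
        remaining = np.fromiter(unvisited, dtype=int)
        sims = vecs[remaining] @ vecs[current]
        current = int(remaining[np.argmax(sims)])
        unvisited.remove(current)
        order.append(current)

    return order

# order = greedy_walk(embeddings)  # visiting order for the animation
```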
After we have done this walk we have the “sorted” list of images. These can then be animated in the order they appear in the list.
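A minimal way to do the animation with Pillow, assuming `order` from the walk above and the `image_paths` list from the indexing step are in scope; the frame timing is arbitrary, and in practice you would do this on a filtered subset rather than all ~93,000 frames.

```python
from PIL import Image

# Stitch the ordered images into a GIF.
ordered_paths = [image_paths[i] for i in order]
frames = [Image.open(p).convert("RGB") for p in ordered_paths]

frames[0].save(
    "hot_dog_walk.gif",
    save_all=True,
    append_images=frames[1:],
    duration=80,  # milliseconds per frame
    loop=0,
)
```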
It's amazing to watch the progress in image generation and the multi-modality of it all. The fidelity of the images is incredible, and the ability to manipulate images via prompt engineering is very interesting. Of particular interest is how this will impact search, given the models they share. If you are interested in this, check out Marqo for yourself and sign up for the cloud beta. Feel free to request features or add any comments as well.