Combining Stable Diffusion with semantic search: generating and categorising 100k hot dogs

February 26, 2024
There has been incredible progress in image generation recently, and Stable Diffusion is one of the latest examples. You provide a natural language prompt, such as “A photo of a hot dog”, and out comes a high-fidelity image matching the caption. Given this great technological development, I decided to generate 100,000 images of hot dogs so you don’t have to.

The hot-dog 100k dataset

I was curious to see what the model would produce from the same prompt, particularly across a large number of images. I used the Hugging Face diffusers library to set up the generation and just let it run. My original plan of 1 million hot dogs in a day quickly unraveled once I realized that the required 695 hot dogs/minute was unattainable, so I settled for 13 hot dogs/minute and ~93,000 images (only 7,000 short of the 100k target).
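The rates quoted above follow from simple arithmetic, sketched below. The function names are made up for illustration, and the "about five days" figure is only what the quoted numbers imply if 13 images/minute held steady; it is not a measured runtime.

```python
import math

MINUTES_PER_DAY = 24 * 60  # 1,440 minutes

def required_rate(total_images, days=1.0):
    """Images per minute needed to produce total_images within `days` days."""
    return total_images / (days * MINUTES_PER_DAY)

def days_needed(total_images, rate_per_minute):
    """Days needed to produce total_images at a steady rate per minute."""
    return total_images / (rate_per_minute * MINUTES_PER_DAY)

# One million images in a single day needs roughly 694.4 images per minute,
# which rounds up to the 695 quoted in the text.
target_rate = math.ceil(required_rate(1_000_000))

# At 13 images per minute, ~93,000 images would take close to five days
# (an inference from the quoted numbers, not a reported runtime).
days_at_13 = days_needed(93_000, 13)
```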

Here is a sample of 100 randomly selected images. The model generates a pretty wide variety of hot dogs, along with some very interesting interpretations of what constitutes a hot dog.

Indexing the hot-dog 100k dataset

To dig a bit deeper into the hot-dog 100k dataset, we can index the data using Marqo. This lets us easily search the images and do some additional labeling and classification of the images in the dataset. After downloading the dataset we start up Marqo:

Once Marqo is up and running we can get the files ready for indexing:
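As a minimal sketch of this preparation step, the snippet below turns a list of local image paths into one document per image. The directory name, the served URL and the `image` field name are assumptions made for illustration, and the sketch assumes the images are served over HTTP so that the indexing service can fetch them. In practice the path list would come from something like `glob.glob`.

```python
import os

def build_documents(file_paths, served_url):
    """Build one document per image: a stable _id plus a URL to fetch it from."""
    base = served_url.rstrip("/")
    documents = []
    for path in sorted(file_paths):
        name = os.path.basename(path)
        documents.append({
            "_id": name,                 # the filename doubles as the document id
            "image": base + "/" + name,  # hypothetical field name for the image URL
        })
    return documents

# Hypothetical paths and URL, for illustration only.
example_files = ["hot-dog-100k/0002.jpg", "hot-dog-100k/0001.jpg"]
docs = build_documents(example_files, "http://localhost:8000/")
```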

Now we can start indexing:
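With close to 100k documents it is sensible to send them in batches rather than all at once. The sketch below shows only the batching logic. `add_batch` is a placeholder for whatever call the client library provides for adding documents, not a real Marqo function, and the batch size is arbitrary.

```python
def index_in_batches(documents, add_batch, batch_size=64):
    """Pass documents to add_batch in fixed-size chunks; return how many were sent."""
    sent = 0
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        add_batch(batch)  # placeholder for the real "add documents" call
        sent += len(batch)
    return sent

# Stand-in for a client: collect the batches in a list instead of indexing them.
received = []
total = index_in_batches([{"_id": str(i)} for i in range(10)], received.append, batch_size=4)
```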

Check we have our images in the index:

Cleaning the hot-dog 100k dataset

One noticeable thing is the presence of some black images. These are caused by the built-in filtering that suppresses images which may be deemed NSFW. I couldn’t be bothered to remove the filter, so some of these images ended up in the dataset. They are easy to remove using Marqo, though. Since I am lazy, I will just search for a blank image using a natural language description: “a black image”.

Top 3 results for “a black image”

Now that we have a black image, we can use it as the query to find all the other duplicate black images and remove them from the dataset.
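The idea can be sketched without any search engine: compare every image vector to the vector of the known black image and flag anything that is almost identical. The vectors, ids and the 0.99 threshold below are toy values chosen for illustration. In the real workflow the similarity scores would come back from the search itself.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def near_duplicates(reference, vectors, threshold=0.99):
    """Return ids whose vector is at least `threshold` similar to the reference."""
    return [doc_id for doc_id, vec in vectors.items()
            if cosine(reference, vec) >= threshold]

# Toy embeddings: two near-identical "black" images and one ordinary image.
vectors = {
    "black_1.jpg": [0.0, 0.0, 1.0],
    "black_2.jpg": [0.0, 0.01, 1.0],
    "hot_dog.jpg": [1.0, 0.2, 0.1],
}
to_remove = near_duplicates(vectors["black_1.jpg"], vectors)
kept = {k: v for k, v in vectors.items() if k not in to_remove}
```

A high threshold keeps the removal conservative: only images that are nearly indistinguishable from the reference get flagged.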

Now our dataset should be free from images that do not contain hot dogs.

Labeling the hot-dog 100k dataset

Given the variety of hot dogs generated from a single prompt, I was keen to understand more. A quick search for “two hot dogs” turned up something of interest, and more followed with “a hamburger” and “a face”.

The top two images (up/down) for the queries “two hot dogs”, “a hamburger” and “a face” (left/right).

Armed with this survey, I created a new index with the following four documents: “one hot dog”, “two hot dogs”, “a hamburger” and “a face”. We can use our dataset images as queries against these labels and get back a score for each image:label pair. This is effectively zero-shot classification: every image receives a score for each category, and those scores can be thresholded to assign class labels.
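The scoring step can be sketched as follows, assuming images and label texts are embedded into the same vector space. The four-dimensional vectors and the 0.85 threshold are invented for illustration; real embeddings would come from the model behind the index.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def score_image(image_vec, label_vecs):
    """Score one image against every label (one score per image:label pair)."""
    return {label: cosine(image_vec, vec) for label, vec in label_vecs.items()}

def classify(scores, threshold):
    """Keep the labels whose score clears the threshold."""
    return [label for label, score in scores.items() if score >= threshold]

# Toy label embeddings (made up); each stands in for a text document in the label index.
label_vecs = {
    "one hot dog": [1.0, 0.0, 0.0, 0.0],
    "two hot dogs": [0.8, 0.6, 0.0, 0.0],
    "a hamburger": [0.0, 0.0, 1.0, 0.0],
    "a face": [0.0, 0.0, 0.0, 1.0],
}
image_vec = [0.9, 0.1, 0.4, 0.0]  # toy image embedding

scores = score_image(image_vec, label_vecs)
labels = classify(scores, threshold=0.85)
```

Because each label is thresholded independently, an image can end up with several labels or none, which suits categories that are not mutually exclusive.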

Now for each image we have the computed scores against each category which were just “documents” in our small index.

Updating the hot-dog 100k dataset

We have now calculated scores for the different categories described previously. The next thing to do is update our indexed data to have the scores. If we re-index a document it will update it with any new information. We can remove our image field from the documents as it has already been indexed with a model previously.
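A minimal sketch of preparing the updated documents is shown below. The score field names and the `image` field are hypothetical, and the re-indexing call itself is left out because it depends on the client library.

```python
def with_scores(document, scores, drop_fields=("image",)):
    """Return a copy of the document with score fields added and some fields dropped."""
    updated = {k: v for k, v in document.items() if k not in drop_fields}
    updated.update(scores)
    return updated

# Hypothetical document and scores, for illustration only.
doc = {"_id": "0001.jpg", "image": "http://localhost:8000/0001.jpg"}
doc_scores = {"one_hot_dog": 0.91, "a_face": 0.12}

updated = with_scores(doc, doc_scores)
```

Keeping the same `_id` is what ties the update to the existing entry; the original dictionary is left untouched.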

Animating the hot-dog 100k dataset

After all this work we can now animate the hot-dog 100k dataset. We will do this by effectively “sorting” the vectors and creating a movie from the images that accompany them. Before animating, we can make things more interesting by pre-filtering on some of the scores we calculated. This restricts the images, and so the region of the space, that the animation can draw on.
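As a rough sketch, pre-filtering amounts to keeping only the documents whose stored score for some category clears a minimum. The field name and cutoff below are hypothetical, and in practice the filter would be applied by the search engine at query time rather than in Python.

```python
def prefilter(documents, field, minimum):
    """Keep documents whose score in `field` is at least `minimum`."""
    return [d for d in documents if d.get(field, 0.0) >= minimum]

# Toy documents carrying a hypothetical "a_face" score.
documents = [
    {"_id": "0001.jpg", "a_face": 0.31},
    {"_id": "0002.jpg", "a_face": 0.08},
    {"_id": "0003.jpg"},  # no score stored, treated as 0.0
]
candidates = prefilter(documents, "a_face", 0.25)
```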

To animate the images based on their sorted vectors, we take an image as a starting point (based on the query “a photo of a smiling face”) and find the next closest one (as seen above). We repeat the process until no images are left, at which point we have walked across the latent space. We are effectively solving a variant of the traveling salesman problem, albeit with an approximate algorithm.
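This walk is a greedy nearest-neighbour heuristic, sketched below on toy two-dimensional vectors. The ids and vectors are made up, and the brute-force comparison stands in for the similarity search that would find the next image in practice.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def greedy_walk(vectors, start_id):
    """Order ids by repeatedly hopping to the most similar unvisited vector."""
    remaining = dict(vectors)            # copy so the input is not modified
    order = [start_id]
    current_vec = remaining.pop(start_id)
    while remaining:
        next_id = max(remaining, key=lambda k: cosine(current_vec, remaining[k]))
        order.append(next_id)
        current_vec = remaining.pop(next_id)
    return order

# Toy embeddings; "a" plays the role of the starting image.
vectors = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [0.9, 0.1],
    "d": [0.5, 0.5],
}
order = greedy_walk(vectors, "a")
```

Each step looks only at the current image, so the route is not guaranteed to be the shortest overall, which is why the result is an approximation. This brute-force version is quadratic in the number of images.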

After we have done this walk we have the “sorted” list of images. These can then be animated in the order they appear in the list.

Closing thoughts

It’s amazing to watch the progress in image generation and the multi-modality of it all. The fidelity of the images is incredible and the ability to manipulate images via prompt engineering is very interesting. Of particular interest is how this will impact search, given the models the two share. If you are interested in this, check out Marqo for yourself and sign up for the cloud beta. Feel free to request features or add any comments as well.

Jesse Clark