Image similarity search based on embeddings and sentence_transformers

A vector is a mathematical object that has both a direction and a magnitude, typically represented as an array of numbers. At the most basic level, a vector is just a list of numbers (its coordinates).

In the context of data, vectors can be used to represent various types of information in a numerical format. When we talk about embeddings, we’re referring to the process of transforming complex data, like words or images, into these numerical vectors.

Image embedding, specifically, involves converting an image into a vector that captures important features and patterns within the image. This allows machines to understand and analyze images more effectively, making it easier to perform tasks like image recognition and classification in machine learning.

Similar images = similar image embeddings

Images that share similar features end up with embedding vectors that are close to each other. If we visualize this in 3D (though embeddings usually have hundreds of dimensions), we can see that similar images sit close together:

Image embeddings are produced so that similar images are close to each other

As illustrated, Eva Mendes’s images are close to each other but far from Megan Fox’s. Eva’s image embeddings (vectors) are green, while Megan’s are blue. Similar images will have vectors that are close to each other (more on vector similarity search). This is the property we’re going to use to find similar images: build embeddings and find the nearest ones.
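To make this concrete, here’s a minimal sketch with two made-up 3-dimensional vectors (real embeddings have hundreds of dimensions); it just shows that similar vectors have a small Euclidean distance, the same metric we’ll query with later:

import numpy as np

# toy 3-dimensional "embeddings" (invented numbers, purely for illustration)
eva_1 = np.array([0.9, 0.1, 0.2])
eva_2 = np.array([0.8, 0.2, 0.1])   # a similar image -> a nearby vector
megan = np.array([0.1, 0.9, 0.7])   # a different image -> a distant vector

print(np.linalg.norm(eva_1 - eva_2))   # ~0.17, small distance
print(np.linalg.norm(eva_1 - megan))   # ~1.24, much larger distance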

1. Converting images to vectors (embeddings)

We’re going to use the sentence_transformers library, which makes it easy to use embedding models. First, let’s install this module:

pip install sentence_transformers

To generate an embedding for an image, we only need to pick a model and load the image. We’re going to use the clip-ViT-B-32 model, which has a shared text and image embedding space (meaning we can directly compare text with images, but that’s a separate article). Let’s code it now:

from sentence_transformers import SentenceTransformer
from PIL import Image

# load the CLIP model and encode a single image into a vector
model = SentenceTransformer("clip-ViT-B-32")
image = Image.open('image.jpg')
vector = model.encode(image)

print(vector)
[-2.79104441e-01  2.57993788e-01 -7.66323954e-02  2.35199034e-01
...
  6.64310992e-01 -7.79335350e-02  4.23714638e-01  2.20056295e-01]

Our resulting vector has 512 dimensions (512 floating point numbers). Note that generating an embedding on a CPU can be slow, since it is done by a neural network model, so consider running on a GPU instead.
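If you have many images to embed, it is usually faster to encode them in batches and, if a GPU is available, pass a device to the model. A minimal sketch, assuming an images/ folder and a CUDA device (adjust both to your setup):

from pathlib import Path
from PIL import Image
from sentence_transformers import SentenceTransformer

# "cuda" assumes an NVIDIA GPU is available; use "cpu" otherwise
model = SentenceTransformer("clip-ViT-B-32", device="cuda")

paths = sorted(Path("images").glob("*.jpg"))   # assumed folder with images
images = [Image.open(p) for p in paths]

# encode all images in batches; returns a numpy array of shape (len(images), 512)
vectors = model.encode(images, batch_size=32, show_progress_bar=True)
print(vectors.shape)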

2. Storing embeddings

Now we have to generate vectors for all of our images and store them somewhere. In our example, we use the ClickHouse database, which has built-in capabilities for working with vectors:

CREATE TABLE embeddings
(
    `image` String,
    `vector` Array(Float32)
)
ENGINE = MergeTree ORDER BY image
Query id: 8c76b807-02a2-4157-8d43-f5fbc901ad74

Ok.

0 rows in set. Elapsed: 0.018 sec. 

After embeddings are generated, we just insert them into ClickHouse:

INSERT INTO embeddings VALUES('image.jpg', [-2.79104441e-01, ..., 2.20056295e-01])

Note how we pass the vector column value as an array literal, which ClickHouse understands natively.
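In practice you will likely insert vectors from Python rather than paste numbers into a SQL client. Here’s a minimal sketch using the clickhouse_connect driver (the host and the two example file names are assumptions; adjust them to your setup):

import clickhouse_connect
from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-B-32")
client = clickhouse_connect.get_client(host="localhost")   # assumed local ClickHouse

rows = []
for name in ["1.jpg", "2.jpg"]:                            # example file names
    vector = model.encode(Image.open(name))
    rows.append([name, vector.tolist()])                   # Array(Float32) accepts a plain list

client.insert("embeddings", rows, column_names=["image", "vector"])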

3. Querying similar images based on embeddings

Let’s suppose we have the following image (top) and want to find similar ones in our database (the two bottom images):

Query similar images from database

We’ll use the L2Distance function to calculate the distance between each stored embedding and the query embedding:

SELECT
    image,
    L2Distance(vector, [...]) AS distance
FROM embeddings
ORDER BY distance
LIMIT 2
┌─image─┬──distance─┐
│ 1.jpg │  3.895906 │
│ 2.jpg │  12.46083 │
└───────┴───────────┘

The first image’s distance is roughly a third of the second one’s, which means it is much closer to the query image.

Let’s illustrate the result we’ve got:

Similar images were found

So Godzilla is similar to Godzilla but different from a beautiful girl on the street. Looks good. That’s how we search for similar images.
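The same search can be run from Python by encoding the query image and interpolating its vector into the query. A sketch using the clickhouse_connect driver again (the query image name is an assumption):

import clickhouse_connect
from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-B-32")
client = clickhouse_connect.get_client(host="localhost")     # assumed local ClickHouse

# a Python list of floats renders as a valid ClickHouse array literal
query_vector = model.encode(Image.open("godzilla.jpg")).tolist()

sql = f"""
    SELECT image, L2Distance(vector, {query_vector}) AS distance
    FROM embeddings
    ORDER BY distance
    LIMIT 2
"""
for image, distance in client.query(sql).result_rows:
    print(image, distance)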

We could also add a threshold to filter out completely irrelevant images:

SELECT
    image,
    L2Distance(vector, [...]) AS distance
FROM embeddings
WHERE distance < 5
ORDER BY distance
LIMIT 2

Experiment with the threshold (5 in the example above) for your data to pick the right value without losing relevant results.
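One simple way to pick that value is to measure the distances of a few pairs you know are similar and a few you know are not, then choose a threshold in between. A rough sketch (the file names are placeholders):

import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-B-32")

def distance(a, b):
    # same metric as ClickHouse's L2Distance
    return float(np.linalg.norm(model.encode(Image.open(a)) - model.encode(Image.open(b))))

print(distance("godzilla_1.jpg", "godzilla_2.jpg"))    # a pair you consider similar
print(distance("godzilla_1.jpg", "street_photo.jpg"))  # a pair you consider different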


Published half a year ago in #machinelearning about #embeddings, #sentence_transformers, #clip, #vector search, #python and #clickhouse by Denys Golotiuk

Denys Golotiuk · golotyuk@gmail.com · my github