Clustering similar images with pHash

Published on Fri Oct 30 2020Kruttika Nadig

Image hashing is a technique for generating distinct "fingerprints" of images which can be used to identify and group together similar images. "phash" is one of the most popular and effective hashing algorithms. We tried it on 10k images from our archive and had promising results.

This blog is a walkthrough of how we constructed the phashes with the Imagehash library, created easily navigable clusters (groups) of images whose fingerprints (hashes) are identical, and found images that are similar to a query image. An elegant feature of phashes is that similar images will have similar hashes. To know how the hashing algorithm works, check out this other blog

The code implementation of this can be found in a jupyter notebook here. The executed version of this notebook has been archived by the wayback back machine, which can be found here.

Text and illustrations on the website is licensed under Creative Commons 4.0 License. The code is licensed under GPL. For data, please look at respective licenses.