Clustering similar images with pHash

Published on Fri Oct 30 2020 by Kruttika Nadig

Image hashing is a technique for generating distinct "fingerprints" of images, which can be used to identify and group together similar images. pHash (perceptual hash) is one of the most popular and effective hashing algorithms. We tried it on 10k images from our archive and got promising results.

This post is a walkthrough of how we constructed the pHashes with the ImageHash library, created easily navigable clusters (groups) of images whose fingerprints (hashes) are identical, and found images that are similar to a query image. An elegant feature of pHashes is that similar images have similar hashes, so the number of bits in which two hashes differ indicates how different the images are. To learn how the hashing algorithm works, check out this other blog post.

The code for this walkthrough can be found in a Jupyter notebook here. An executed version of the notebook has been archived by the Wayback Machine and can be found here.
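For readers who want a feel for the steps before opening the notebook, here is a minimal sketch, not the notebook's actual code. It assumes hashes are stored as hex strings, which is what `str(imagehash.phash(img))` produces. The helper names (`phash_hex`, `hamming`, `cluster_identical`, `find_similar`), the example fingerprints and the `max_distance` threshold are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def phash_hex(path):
    """Return the pHash of the image at `path` as a hex string.

    Pillow and ImageHash are imported here so that the rest of the
    sketch can be run without them.
    """
    from PIL import Image
    import imagehash

    with Image.open(path) as img:
        # Default hash size is 8x8 = 64 bits, i.e. 16 hex characters.
        return str(imagehash.phash(img))


def hamming(hex_a, hex_b):
    """Count the bits that differ between two equal-length hex hashes."""
    if len(hex_a) != len(hex_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


def cluster_identical(hashes):
    """Group image names whose hashes are exactly equal.

    `hashes` maps image name -> hex hash. Only groups with more than
    one member are returned.
    """
    groups = defaultdict(list)
    for name, h in hashes.items():
        groups[h].append(name)
    return {h: sorted(names) for h, names in groups.items() if len(names) > 1}


def find_similar(query_hash, hashes, max_distance=10):
    """Return (name, distance) pairs within `max_distance` bits, nearest first.

    The default threshold is an assumption; tune it on your own data.
    """
    hits = [(name, hamming(query_hash, h)) for name, h in hashes.items()]
    close = [hit for hit in hits if hit[1] <= max_distance]
    return sorted(close, key=lambda hit: (hit[1], hit[0]))


# Made-up fingerprints standing in for real phash_hex() output.
example_hashes = {
    "a.jpg": "ffd8e0c0a0908880",
    "a_copy.jpg": "ffd8e0c0a0908880",
    "a_resized.jpg": "ffd8e0c0a0908881",
    "b.jpg": "0027f1f3f6f7f77f",
}

clusters = cluster_identical(example_hashes)
matches = find_similar("ffd8e0c0a0908880", example_hashes, max_distance=4)
```

With real images, `example_hashes` would be built by calling `phash_hex` on each file. Grouping by exact hash is a cheap dictionary lookup, while the query step here is a linear scan over all hashes, which is fine at the 10k-image scale but would likely need an index for much larger archives.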

Text and illustrations on the website are licensed under the Creative Commons 4.0 License. The code is licensed under GPL. For data, please refer to the respective licenses.