FYI: The hamming distance calc I use is on the file's phash, which is basically a highly compressed DCT of the image that represents image shape. Hamming is useful on this because similar phashes mean images with similar shape. Check GeneratePerceptualHash in ClientImageHandling.py for more details.
The actual hamming distance calculator is this:
def GetHammingDistance( phash1, phash2 ):
    distance = 0
    phash1 = bytearray( phash1 )
    phash2 = bytearray( phash2 )
    for i in range( len( phash1 ) ):
        # xor the bytes; the set bits mark exactly where they differ
        xor = phash1[i] ^ phash2[i]
        # each 'xor &= xor - 1' clears the lowest set bit, so this loop counts the 1s
        while xor > 0:
            distance += 1
            xor &= xor - 1
    return distance
It is ugly; I expect there is a better way of doing it, likely not in python! It compares each byte in turn, XORing them and then literally counting the 1s. The 'xor &= xor - 1' part, which removes the lowest set bit on every pass, is neat. I'm terrible at byte arithmetic, so I must have found it on Stack or something (I believe it is known as Kernighan's bit-counting trick).
EDIT: I just searched again and found numpy can do it pretty quick. Note that numpy.count_nonzero( a != b ) counts differing bytes rather than differing bits, so it is not quite a drop-in replacement; for a true bit-level hamming distance the XORed bytes have to be unpacked to bits first. I'll test this myself and probably switch over.
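For reference, here is a minimal sketch of what a bit-accurate numpy version might look like. The function name hamming_distance_numpy is my own, not anything from hydrus; it assumes the two phashes are equal-length byte strings.

```python
import numpy as np

def hamming_distance_numpy( phash1, phash2 ):
    # interpret the raw phash bytes as uint8 arrays
    a = np.frombuffer( phash1, dtype = np.uint8 )
    b = np.frombuffer( phash2, dtype = np.uint8 )
    # xor, unpack each byte into its 8 bits, then count the set bits
    return int( np.count_nonzero( np.unpackbits( a ^ b ) ) )
```

This vectorises the XOR across all bytes at once, so it should beat the per-byte python loop on longer hashes, though for 8-byte phashes the numpy call overhead may eat most of the gain.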
At the moment, I do a linear scan against every file in the db, asking 'find all files whose phash is within hamming distance x of this one', which takes a huge amount of time, as the hamming distance has to be recalculated for every file every time. After I'm done with the suggested tags control, I will revisit this to speed it up. I hope to create a VP tree inside the db or something, but I'll have to apply some brainpower.
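To illustrate the VP tree idea, here is a rough in-memory sketch (my own hypothetical code, nothing to do with the actual hydrus db): pick a vantage point, split the rest by the median distance to it, and at query time use the triangle inequality to skip whole branches that cannot contain a match.

```python
def hamming( p, q ):
    # bit-level hamming distance between two equal-length byte strings
    return bin( int.from_bytes( p, 'big' ) ^ int.from_bytes( q, 'big' ) ).count( '1' )

class VPNode:
    def __init__( self, point, radius, inside, outside ):
        self.point = point
        self.radius = radius
        self.inside = inside
        self.outside = outside

def build( points ):
    if not points:
        return None
    vantage = points[0] # picking randomly would give better balance
    rest = points[1:]
    if not rest:
        return VPNode( vantage, 0, None, None )
    dists = [ hamming( vantage, p ) for p in rest ]
    radius = sorted( dists )[ len( dists ) // 2 ] # median distance to the vantage
    inside = [ p for ( p, d ) in zip( rest, dists ) if d <= radius ]
    outside = [ p for ( p, d ) in zip( rest, dists ) if d > radius ]
    return VPNode( vantage, radius, build( inside ), build( outside ) )

def search( node, query, max_distance, results ):
    if node is None:
        return
    d = hamming( node.point, query )
    if d <= max_distance:
        results.append( node.point )
    # triangle inequality: only descend into branches that could hold a match
    if d - max_distance <= node.radius:
        search( node.inside, query, max_distance, results )
    if d + max_distance > node.radius:
        search( node.outside, query, max_distance, results )
```

With a reasonably balanced tree this prunes most of the database per query instead of touching every row, which is the whole point; doing it inside sqlite rather than in memory is the part that needs the brainpower.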
Let me know if you would like any more explanation!