FYI: The hamming distance calc I use is on the file's phash, which is basically a highly compressed DCT of the image that represents image shape. Hamming is useful on this because similar phashes mean images with similar shape. Check GeneratePerceptualHash in ClientImageHandling.py for more details.
The actual hamming distance calculator is this:
def GetHammingDistance( phash1, phash2 ):
    distance = 0
    phash1 = bytearray( phash1 )
    phash2 = bytearray( phash2 )
    for i in range( len( phash1 ) ):
        # xor the bytes; the set bits mark exactly where they differ
        xor = phash1[i] ^ phash2[i]
        # each 'xor &= xor - 1' clears the lowest set bit, so this loop counts the 1s
        while xor > 0:
            distance += 1
            xor &= xor - 1
    return distance
It is ugly; I expect there is a better way of doing it, likely not in python! It compares each byte in turn, XORing them and then literally counting the 1s. The 'xor &= xor - 1' part, which removes the lowest set bit on every pass, is neat. I'm terrible at byte arithmetic, so I must have found it on Stack or something (I believe it is known as Kernighan's bit-counting trick).
EDIT: I just searched again and found numpy can do it pretty quick. Note that numpy.count_nonzero( a != b ) counts differing bytes rather than differing bits, so it is not quite a drop-in replacement; for a true bit-level hamming distance the XORed bytes have to be unpacked to bits first. I'll test this myself and probably switch over.
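For reference, here is a minimal sketch of what a bit-accurate numpy version might look like. The function name hamming_distance_numpy is my own, not anything from hydrus; it assumes the two phashes are equal-length byte strings.

```python
import numpy as np

def hamming_distance_numpy( phash1, phash2 ):
    # interpret the raw phash bytes as uint8 arrays
    a = np.frombuffer( phash1, dtype = np.uint8 )
    b = np.frombuffer( phash2, dtype = np.uint8 )
    # xor, unpack each byte into its 8 bits, then count the set bits
    return int( np.count_nonzero( np.unpackbits( a ^ b ) ) )
```

This vectorises the XOR across all bytes at once, so it should beat the per-byte python loop on longer hashes, though for 8-byte phashes the numpy call overhead may eat most of the gain.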
At the moment, I do a linear scan against every file in the db, asking 'find all files whose phash is within hamming distance x of this one', which takes a huge amount of time, as the hamming distance has to be recalculated for every file every time. After I'm done with the suggested tags control, I will revisit this to speed it up. I hope to create a VP tree inside the db or something, but I'll have to apply some brainpower.
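To illustrate the VP tree idea, here is a rough in-memory sketch (my own hypothetical code, nothing to do with the actual hydrus db): pick a vantage point, split the rest by the median distance to it, and at query time use the triangle inequality to skip whole branches that cannot contain a match.

```python
def hamming( p, q ):
    # bit-level hamming distance between two equal-length byte strings
    return bin( int.from_bytes( p, 'big' ) ^ int.from_bytes( q, 'big' ) ).count( '1' )

class VPNode:
    def __init__( self, point, radius, inside, outside ):
        self.point = point
        self.radius = radius
        self.inside = inside
        self.outside = outside

def build( points ):
    if not points:
        return None
    vantage = points[0] # picking randomly would give better balance
    rest = points[1:]
    if not rest:
        return VPNode( vantage, 0, None, None )
    dists = [ hamming( vantage, p ) for p in rest ]
    radius = sorted( dists )[ len( dists ) // 2 ] # median distance to the vantage
    inside = [ p for ( p, d ) in zip( rest, dists ) if d <= radius ]
    outside = [ p for ( p, d ) in zip( rest, dists ) if d > radius ]
    return VPNode( vantage, radius, build( inside ), build( outside ) )

def search( node, query, max_distance, results ):
    if node is None:
        return
    d = hamming( node.point, query )
    if d <= max_distance:
        results.append( node.point )
    # triangle inequality: only descend into branches that could hold a match
    if d - max_distance <= node.radius:
        search( node.inside, query, max_distance, results )
    if d + max_distance > node.radius:
        search( node.outside, query, max_distance, results )
```

With a reasonably balanced tree this prunes most of the database per query instead of touching every row, which is the whole point; doing it inside sqlite rather than in memory is the part that needs the brainpower.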
Let me know if you would like any more explanation!