/hydrus/ - Hydrus Network

Bug reports, feature requests, and other discussion for the hydrus network.



New user? Start here ---> http://hydrusnetwork.github.io/hydrus/

Current to-do list has: 714 items

Current big job: finishing off duplicate search/filtering workflow


File: 335d2b5e76e728e⋯.png (993.99 KB, 700x990, 70:99, 335d2b5e76e728e72ae53f6d87….png)

3feeee No.4149

Hey, hydrus admin.

Could you explain your max hamming distance implementation? I'm trying to check whether an image's hash is close enough to a list of hashes (unrelated to hydrus).

21d41e No.4155

File: 314dbc6d5af08bc⋯.jpg (81.19 KB, 589x632, 589:632, 314dbc6d5af08bcf6cc78f52d7….jpg)

FYI: The hamming distance calc I use is on the file's phash, which is basically a highly compressed DCT of the image that represents image shape. Hamming is useful on this because similar phashes mean images with similar shape. Check GeneratePerceptualHash in ClientImageHandling.py for more details.
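
For anyone who wants to play with the idea without digging through ClientImageHandling.py, here is a minimal sketch of a generic DCT phash. This is not hydrus's actual GeneratePerceptualHash, and the 32x32 input / 8x8 low-frequency block are just the common choices: it assumes you have already resized the image to a 32x32 grayscale array, takes a 2-D DCT, keeps the top-left low-frequency corner (the overall shape information), and sets each bit by comparing each coefficient to the block's median.

```python
import numpy as np

def dct_matrix( n ):
    
    # orthonormal DCT-II basis matrix, so M @ image @ M.T is a 2-D DCT
    k = np.arange( n )
    
    M = np.cos( np.pi * np.outer( k, 2 * k + 1 ) / ( 2 * n ) ) * np.sqrt( 2.0 / n )
    M[0] /= np.sqrt( 2 )
    
    return M

def SimplePerceptualHash( gray_32x32 ):
    
    # gray_32x32: a 32x32 float array of grayscale pixel values
    M = dct_matrix( 32 )
    
    coeffs = M @ gray_32x32 @ M.T
    
    # keep the low-frequency 8x8 corner, which captures coarse image shape
    low = coeffs[ :8, :8 ].flatten()
    
    # one bit per coefficient: above or below the median
    bits = low > np.median( low )
    
    return np.packbits( bits ).tobytes() # 64 bits -> 8 bytes
```

Because the bits only record each coefficient's position relative to the median, the hash is unchanged by uniform brightness scaling, which is part of why phashes survive re-encodes and resizes.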

The actual hamming distance calculator is this:

def GetHammingDistance( phash1, phash2 ):
    
    distance = 0
    
    phash1 = bytearray( phash1 )
    phash2 = bytearray( phash2 )
    
    for i in range( len( phash1 ) ):
        
        xor = phash1[i] ^ phash2[i]
        
        while xor > 0:
            
            distance += 1
            xor &= xor - 1
    
    return distance

It is ugly, and I expect there is a better way of doing it, likely not in python! It compares each byte in turn, XORing them and then literally counting the 1s (the 'xor &= xor - 1' part to remove the rightmost 1 is neat; I'm terrible at byte arithmetic, so I must have found it on Stack or something).
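
To see the trick in isolation: x &= x - 1 clears the lowest set bit (Kernighan's trick), so the loop runs once per 1 bit rather than once per bit position.

```python
def popcount( x ):
    
    # count set bits by repeatedly clearing the lowest one
    count = 0
    
    while x > 0:
        
        x &= x - 1 # e.g. 0b1100 -> 0b1000 -> 0b0000
        count += 1
    
    return count

popcount( 0b10110000 ) # -> 3
```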

EDIT: I just searched again and found numpy can do it bretty quick with numpy.count_nonzero( a != b ). I'll test this myself and probably switch over.
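
One caveat worth checking before switching: numpy.count_nonzero( a != b ) counts differing bytes, not differing bits, so it only matches the function above if the arrays hold one bit per element. A bit-level numpy equivalent (a sketch, assuming the phashes are equal-length byte strings) would be:

```python
import numpy as np

def GetHammingDistanceNumPy( phash1, phash2 ):
    
    a = np.frombuffer( phash1, dtype = np.uint8 )
    b = np.frombuffer( phash2, dtype = np.uint8 )
    
    # XOR leaves a 1 wherever the bits differ; unpackbits expands to one bit per element
    return int( np.unpackbits( a ^ b ).sum() )

GetHammingDistanceNumPy( b'\x00\xff', b'\x00\x0f' ) # -> 4
```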

At the moment, I do a linear scan against every file in the db, saying 'find all files that have hamming distance less than x with this phash', which takes a huge amount of time as the hamming distance has to be recalculated for every file every time. After I'm done with the suggested tags control, I will revisit this to speed it up. I hope to create a VP tree inside the db or something, but I'll have to apply some brainpower.

Let me know if you would like any more explanation!



