windows
zip: https://github.com/hydrusnetwork/hydrus/releases/download/v241/Hydrus.Network.241.-.Windows.-.Extract.only.zip
exe: https://github.com/hydrusnetwork/hydrus/releases/download/v241/Hydrus.Network.241.-.Windows.-.Installer.exe
os x
app: https://github.com/hydrusnetwork/hydrus/releases/download/v241/Hydrus.Network.241.-.OS.X.-.App.dmg
tar.gz: https://github.com/hydrusnetwork/hydrus/releases/download/v241/Hydrus.Network.241.-.OS.X.-.Extract.only.tar.gz
linux
tar.gz: https://github.com/hydrusnetwork/hydrus/releases/download/v241/Hydrus.Network.241.-.Linux.-.Executable.tar.gz
source
tar.gz: https://github.com/hydrusnetwork/hydrus/archive/v241.tar.gz
I had a good week. I fixed things and moved the duplicate search stuff way forward.
fixes and a note on cv
I've fixed the stupid 'add' subscription bug I accidentally introduced last week. I apologise again–I have added a specific weekly test to make sure it doesn't happen again.
With the help of some users, I've also updated the clientside pixiv login for their new login system. It seems to work ok for now, but if they alter their system any more I'll have to go back to it. Ideally, I'd like to write a whole login engine for the client that can log in to any site, which would make pixiv and everything else work with less duct tape and be easier to maintain.
For Windows users, I've updated the client's main image library (OpenCV) this week, and this new version looks to be more stable (it loads some files that crashed the old version). If you are on Windows and have 'load images with PIL' checked under options->media, I recommend you now turn it off–if you have a decent graphics card, your images will load about twice as fast.
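For the curious, what that option toggles is presumably something like the following sketch; the load_image function and its fallback logic are illustrative assumptions here, not the client's actual internals:

import cv2
import numpy
from PIL import Image

def load_image(path, use_pil=False):
    if not use_pil:
        # IMREAD_COLOR always yields a 3-channel BGR array
        image = cv2.imread(path, cv2.IMREAD_COLOR)
        if image is not None:
            # OpenCV decodes to BGR, so flip the channels to RGB
            return cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    # PIL is slower but copes with some malformed files that can trip up OpenCV
    return numpy.asarray(Image.open(path).convert('RGB'))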
duplicate files are now findable
I have written code to auto-find duplicate pairs and activated the buttons on the new duplicates page (which is still at pages->new search page->duplicates for now).
Dupe file display and filtering are not here yet, so if you are waiting for something more fun than some numbers slowly getting bigger, please hang in there for a little longer! But if you are interested in this stuff, please check it out and let me know how you get on.
The idea of this page is to:
1) Prepare the database to search for duplicate pairs.
2) Search for duplicate pairs at different confidence levels (and cache the results).
3) Show those pairs one at a time and judge what kind of dupe they are.
Parts 1 and 2 now work. If you are interested, I would appreciate you putting some time into them and giving me some numbers so I can design part 3 well.
Since originally introducing duplicate search, I have updated the 'phash' algorithm (which summarises how an image 'looks' for quick comparison) several times. I improved it significantly again this week and am now pleased with it, so I do not expect to change it further. As all existing phashes were generated with older, lower-quality versions of the algorithm, I have scheduled every single eligible file (jpgs and pngs) for phash regeneration. This is a big job–for me, it means about 250k files that need to be completely read again and have some CPU thrown at them. I'm getting about 1-2 thousand per minute, so I expect to be at it for something like three hours. This only has to be done once, and only for your old files–new files will get correct phashes as they are imported.
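For those who like to see how this works, here is a minimal sketch of the classic DCT-style phash, assuming a 64-bit hash built from a 32x32 greyscale thumbnail; the generate_phash function and its exact sizes are illustrative rather than the client's real code, but it shows where the median-vs-mean choice comes in:

import cv2
import numpy

def generate_phash(path):
    # load as greyscale and shrink to discard high-frequency detail
    image = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    small = cv2.resize(image, (32, 32), interpolation=cv2.INTER_AREA)
    # the discrete cosine transform concentrates the broad 'look' of the
    # image into the top-left low-frequency coefficients
    dct = cv2.dct(numpy.float32(small))
    low_freq = dct[:8, :8].flatten()
    # thresholding against the median rather than the mean is less skewed
    # by a few extreme coefficients
    bits = low_freq > numpy.median(low_freq)
    # pack the 64 booleans into a 64-bit integer phash
    return sum(1 << i for i, bit in enumerate(bits) if bit)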
To save redundant tree rebalancing, I recommend you set the time aside and regenerate them all in one go. The db will be locked while it runs. The maintenance code here is still ugly and may hang your gui. If it does hang, just leave it running–it'll get there in the end.
Then, once the 'preparation' panel is happy, run some searches at different distances–you don't have to search everything, but maybe do a few thousand and write down the rough number of files searched and duplicate pairs discovered.
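To make 'distance' concrete: the search compares two phashes by how many of their 64 bits differ (the hamming distance), with 'exact match' meaning zero differing bits and the looser settings permitting a few more. A naive sketch, with hamming_distance and find_potential_pairs as illustrative names, might look like this; the actual client walks the search tree mentioned above rather than doing this brute-force n-squared loop:

def hamming_distance(phash_a, phash_b):
    # xor leaves a 1 bit wherever the two hashes disagree
    return bin(phash_a ^ phash_b).count('1')

def find_potential_pairs(phashes, max_distance):
    # compare every pair of phashes; 'exact match' would be max_distance 0
    pairs = []
    for i in range(len(phashes)):
        for j in range(i + 1, len(phashes)):
            if hamming_distance(phashes[i], phashes[j]) <= max_distance:
                pairs.append((i, j))
    return pairs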
I am very interested to know:
- How inconvenient was it doing the regen in real time? Approximately how fast did it run?
- At 'exact match' search distance, roughly how many potential duplicate pairs per thousand files does it find? What about 'very similar' and (if it isn't too slow) 'similar'?
- How much of this heavy CPU/HDD work would you like to run in the background on the normal idle routines?
- Did anything go wrong?
I'm still regenerating files as I write this, but I will update with my own numbers once I can. Thanks!
full list
- fixed the 'setnondupename' problem that was affecting 'add' actions on manage subscriptions, scripts, and import/export folders
- added some more tests to catch this problem automatically in future
- cleaned up some similar files phash regeneration logic
- cleaned up similar files maintenance code to deal with the new duplicates page
- wrote a similar files duplicate pair search maintenance routine
- activated file phash regen button on the new duplicates page
- activated branch rebalancing button on the new duplicates page
- activated duplicate search button on the new duplicates page
- search distance on the new duplicates page is now remembered between sessions
- improved the phash algorithm to use median instead of mean–it now gives fewer apparent false positives and negatives, but I think it may also be stricter in general
- the duplicate system now discards phashes for blank, flat colour images (this will be more useful when I reintroduce dupe checking for animations, which often start with a black frame)
- misc phash code cleanup
- all local jpegs and pngs will be scheduled for phash regeneration on update as their current phashes are legacies of several older versions of the algorithm
- debuted a cog menu button on the new duplicates page to refresh the page and reset found potential duplicate pairs–this cog should be making appearances elsewhere to add settings and reduce excess buttons
- improved some search logic that was refreshing too much info on an 'include current/pending tags' button press
- fixed pixiv login–for now!
- system:dimensions now catches an enter key event and passes it to the correct ok button, rather than always num_pixels
- fixed some bad http->https conversion when uploading files to file repo
- folder deletion will try to deal better with read-only nested files
- tag parent uploads will now go one at a time (rather than up to 100 as before) to reduce commit lag
- updated to python 2.7.13 for windows
- updated to OpenCV 3.2 for windows–this new version does not crash with the same files that 3.1 does, so I recommend windows users turn off 'load images with pil' under options->media if they have it set
- I think I improved some unicode error handling
- added LICENSE_PATH and harmonised various instances of default db dir creation to DEFAULT_DB_DIR, both in HydrusConstants
- misc code cleanup and bitmap button cleanup
next week
I'm going to collect my different thoughts on how to filter duplicate pairs into a reasonable and pragmatic plan and finally get this show on the road. I do not think I will have a working workflow done in one week, but I'd like to have something to show off–maybe displaying pairs at the least, so we can see how well the whole system is working at different distances.