>>9151
>>9148
Thanks for this info. This looks like great stuff–if that is all right with you, I'll point people at your repos here in the v311 release post in case they would like to experiment. I am interested in hearing more about where these different systems work well and where they fall down.
For the one-image-at-a-time issue, I have long thought about converting the file-lookup system into an automated 'figure out any tags for these 10,000 files in idle time, please' maintenance routine, particularly now that the new bandwidth system is in, which will help us not DDoS the services we are pulling from. Your use case here is another vote for doing something like this, and for generalising the way(s) hydrus can pull tag recommendations from other services and scripts.
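To sketch out what I mean (all the names here are hypothetical, not real hydrus internals), the routine would basically be a big queue the client chews through in idle time, with the bandwidth rules deciding when it is polite to make another request:

[code]
from collections import deque

class IdleTagLookupQueue:
    """Hypothetical idle-time maintenance routine: work through a backlog of
    file hashes, asking a remote lookup service for tag suggestions, but only
    while the client is idle and the bandwidth rules allow another request."""

    def __init__(self, lookup_service, bandwidth_manager, client_is_idle):
        self._queue = deque()
        self._lookup_service = lookup_service        # e.g. an iqdb-style tag source
        self._bandwidth_manager = bandwidth_manager  # says when another request is polite
        self._client_is_idle = client_is_idle        # callable: is the user away right now?

    def add_files(self, file_hashes):
        self._queue.extend(file_hashes)

    def do_idle_work(self, max_files_per_batch=100):
        suggestions = {}
        for _ in range(max_files_per_batch):
            if not self._queue or not self._client_is_idle():
                break
            if not self._bandwidth_manager.can_start_request():
                break  # allowance used up for now, come back later
            file_hash = self._queue.popleft()
            suggestions[file_hash] = self._lookup_service.get_tags(file_hash)
            self._bandwidth_manager.report_request_done()
        return suggestions  # these would go to a low-trust pending store, not straight to 'current'
[/code]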
I'll reiterate that I can't put any time into this atm, but I am interested in having an ongoing conversation about how I can make these workflows better for you. I am keen to get hydrus and The Imageboard Community into the ML game in the coming years.
>>9152
Yes. In making hydrus, I have come to make a point of putting 'human eyes' in front of certain decisions, especially for large automated systems. Figuring out a good workflow is often as difficult as getting the technical side working. It is easy to fuck up a script, and if that script touches 100,000 files, it can be a huge pain to fix. Tag siblings are a good example of this–they seemed simple when I first got into them, but for multiple reasons–human preference, the complexity of language and translation, simple human mistakes, technical complexity at the data and gui levels, and low CPU availability at certain critical moments in the sibling processing pipeline–they turned out to be much more complicated.
If we gain the ability to harvest millions of new tag mappings from ML systems or elsewhere, I would imagine putting them in a separate low-trust cache beside the 'current' mappings–probably just in their own service in the hydrus context–until you approved them. The decision to apply these tags to files would at first go through human approval, with human eyes on every decision, until some critical threshold of success was reached, at which point we could start trusting it.
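Roughly, and just to give the idea some shape (hypothetical names again, nothing like this exists in the client yet), the side cache would track per-tag approve/deny counts and only graduate a tag to auto-apply once it had earned enough trust:

[code]
class LowTrustMappingCache:
    """Hypothetical side-cache for machine-suggested tag mappings. Nothing is
    committed to the 'current' mappings until a human approves it, and a tag
    is only trusted to auto-apply once its approval rate is high enough."""

    TRUST_THRESHOLD = 0.998  # e.g. 99.8% of decisions so far were approvals
    MIN_DECISIONS = 500      # do not trust a tag on a tiny sample

    def __init__(self):
        self._pending = []    # (file_hash, tag) suggestions awaiting review
        self._decisions = {}  # tag -> [approved_count, denied_count]

    def add_suggestion(self, file_hash, tag):
        self._pending.append((file_hash, tag))

    def record_decision(self, tag, approved):
        counts = self._decisions.setdefault(tag, [0, 0])
        counts[0 if approved else 1] += 1

    def tag_is_trusted(self, tag):
        approved, denied = self._decisions.get(tag, (0, 0))
        total = approved + denied
        if total < self.MIN_DECISIONS:
            return False
        return approved / total >= self.TRUST_THRESHOLD
[/code]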
I am keen to train our own models, so I can imagine us starting with just a handful of new tags, like, say, 'character:pepe'. The model could inherit its opinion of what is and isn't pepe from the existing mappings in your db, and then it could ask you about the cases it was not sure about. As it refined itself, it could start suggesting pepe for new files. Again, you say yes, yes, no, yes, no, until you are saying yes enough times in a row that it is 99.8% or whatever confident about pepe, at which point you can say, 'ok, you are ready to pepe anything'. You can then move on to the next tag to train on.
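In very rough pseudocode, that per-tag training loop might look like this (a sketch only–the classifier interface here is made up, and the real thing would need a lot more care):

[code]
def train_single_tag(classifier, tag, tagged_files, untagged_files, ask_user,
                     confidence_goal=0.998, streak_needed=50):
    """Hypothetical active-learning loop for one tag, e.g. 'character:pepe'.
    Start from the existing mappings, then keep asking the user about the
    files the model is least sure of until it earns our trust."""

    # bootstrap its opinion from what the db already knows
    classifier.fit(positives=tagged_files, negatives=untagged_files)

    correct_streak = 0
    while correct_streak < streak_needed:
        # pick the file the model is currently least confident about
        file_hash, predicted_yes, confidence = classifier.most_uncertain(untagged_files)
        actually_yes = ask_user(file_hash, tag)  # human eyes on every decision
        classifier.update(file_hash, actually_yes)

        if predicted_yes == actually_yes and confidence >= confidence_goal:
            correct_streak += 1
        else:
            correct_streak = 0  # any surprise resets our trust

    return classifier  # now 'ready to pepe anything', safe to suggest on new files
[/code]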
I'll reiterate that I am not read up on this stuff yet, but if it is possible to separate or even quantize these decisions and how they affect the model, I could write an ML repository where this info could be shared, or even make it exportable to pngs or something, so Anons could share recognition models for different tags and thus share the workload. "I spent ten hours teaching my client to recognise the top ten 2hus so you don't have to." "Here is an official model that determines the difference between 'medium breasts' and 'large breasts'." "Here is a 'feminine penis' detector." And so on, depending on how all this data shakes out, which kinds of tags it works well and badly for, and how people want to use and share it. The public tag repository has been a great success in a bunch of ways, but I think the future involves a lot more sharing of metadata about what tags are, and less sharing of 'this file has this tag'. I am certain these systems will be complicated and will take multiple iterations of work before they are broadly useful.
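The png part, at least, is mechanically simple–the serialised model could be base64-encoded into a text chunk inside an ordinary png, similar in spirit to how the client can already export some objects to png. A rough sketch with Pillow, with made-up chunk names:

[code]
import base64
from PIL import Image
from PIL.PngImagePlugin import PngInfo

def export_model_to_png(model_bytes, cover_image_path, output_path, tag):
    """Stuff a serialised recognition model into a png text chunk so Anons can
    share 'here is my 2hu detector' as an ordinary image. Hypothetical format."""
    info = PngInfo()
    info.add_text('hydrus_ml_tag', tag)
    info.add_text('hydrus_ml_model', base64.b64encode(model_bytes).decode('ascii'))
    Image.open(cover_image_path).save(output_path, pnginfo=info)

def import_model_from_png(png_path):
    """Pull the shared model back out of the png."""
    text_chunks = Image.open(png_path).text
    tag = text_chunks['hydrus_ml_tag']
    model_bytes = base64.b64decode(text_chunks['hydrus_ml_model'])
    return tag, model_bytes
[/code]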