/hydrus/ - Hydrus Network

Bug reports, feature requests, and other discussion for the hydrus network.

New user? Start here ---> http://hydrusnetwork.github.io/hydrus/

Current to-do list has: 714 items

Current big job: finishing off duplicate search/filtering workflow


File: 1423691761008.jpg (41.93 KB, 500x119, 500:119, 2014_09_wintersnowolf.jpg)

0b4765 No.195

I've made a hash/tag scraper for furaffinity: http://pastebin.com/v7VCZ0W4
The database it creates is valid, but I get some strange errors when loading it into Hydrus.

I would love to run it myself, but what with furaffinity's 16*10^6 submissions and my terrible internet connection, it would take way too long.

82208f No.198

The main problem I'm seeing is that your DB does not use the Hydrus Tag Archive creation framework ( https://github.com/hydrusnetwork/hydrus/blob/master/include/HydrusTagArchive.py). This is probably why you get errors when putting this into Hydrus.

The second problem is that Furaffinity's tags suck large amounts of ass and you wouldn't want to dump them in Hydrus without some cleanup.

If you can edit this script to produce a CSV of hash, tag1, tag2, etc, newline, hash and so on, I will take a look at its output and see if I can create a companion script that cleans it up and runs it through the Hydrus framework, because I'm very interested in getting this data.

The third problem is that Dragoneer will probably ban anybody who runs this a lot, maybe by IP. If I can figure a way around that I'd be down for running it though.

82208f No.199

>>198

Also, am I reading this correctly or does it get hashes without actually saving the image? Because that's fantastic if so, given that nobody has the damn space for a full image rip of FA.

0b4765 No.201

>>199
Yeah, it just opens a download connection to the server as a stream, then pumps the data into a hash generator. It never stores the image on disk, nor does it use much memory.
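A minimal sketch of that streaming approach, assuming requests and hashlib (the URL and chunk size are just placeholders):

import hashlib
import requests

def hash_remote_file(url, chunk_size=64 * 1024):
    # Stream the file and hash it chunk by chunk; nothing is ever written to disk.
    hasher = hashlib.sha256()
    with requests.get(url, stream=True, timeout=30) as response:
        response.raise_for_status()
        for chunk in response.iter_content(chunk_size=chunk_size):
            hasher.update(chunk)
    return hasher.hexdigest()

# e.g. hash_remote_file("http://d.facdn.net/art/someartist/12345.png")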

>>198
Oh, did not know that was a thing. I will reconfigure the script later tonight to employ the framework.

Regarding the third problem: I have been on a zeromq kick in the last week, and do believe I could do a little distributed computing 'magic'. I'll get back to you on this one.

0b4765 No.203

>>201
Having some small difficulties right now. It appears that some people like to use unicode characters in urls, which the requests library does not like.

0b4765 No.204

>>203
There we go: http://pastebin.com/BM9GjGk2
No more errors when importing into hydrus.

82208f No.207

File: 1423816580776.jpg (325.26 KB, 631x643, 631:643, 6b7ddb081a83e45703b3247fe6….jpg)

>>201

Oh neat, I will play around with this this weekend. I want to see if I can get an idea of what it would be "proper" to sanitize in this data (that is, what is uncontroversial and relatively easy to automate).

If you can come up with a solution to avoid IP bans, so much the awesomer. Another possibility is a throwaway VPS, but that's obviously not ideal.

e6f4cf No.246

While this is a fantastic bit of work, I can't help agreeing with >>198. Most FA users don't understand or respect tags: they use them poorly, inconsistently, or barely at all.

I suppose that with a VERY aggressive tag sibling setup to catch and unify the million different ways of tagging a given thing, plus some kind of filter to dump common words (so that #when #some #smartass #tagsallthewords #inasentence, they don't end up in the database), it might be useful, but even then it seems like a bit of a longshot.

8e5541 No.255

>>246

Proper sanitization guy here. My thought is that 95% of the usefulness of this will be in artist, rating, species, and maybe a few hundred very common tags. That's still huge, though - regexing artists out of filenames is a great hack, but it's not 100% reliable and it only works if you saved from FA in the first place. Rating is also very reliable on FA.

What I'm thinking about doing is figuring a way to take the output of this and remove all tags that aren't artists AND have a hash count of under whatever threshold gives the best signal to noise ratio (rough guess: <500 is probably trash, but no way to tell til we have the whole database ripped). I'm not at all sure how I'm going to do this, but it'll be a fun challenge - I'm just waiting till I have the time and energy to really pound on it for a couple days straight.

OP, does this grab gender at all? If not it should. "Multiple characters" is a shit convention, but getting out "gender:male" or whatever for all the solo pics would be great. Same thing for species, though I may want to write a parser to break out "feline - tiger" to "species:feline" and "species:tiger", et cetera.
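A rough sketch of the kind of species split being suggested, assuming FA joins the values with " - " as in "feline - tiger" (not taken from the actual scraper):

def split_species_field(raw_species):
    # 'feline - tiger' -> ['species:feline', 'species:tiger']
    parts = [part.strip().lower() for part in raw_species.split('-')]
    return ['species:' + part for part in parts if part and part != 'unspecified / any']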

8e5541 No.256

>>246
>>255

Also, re tag sibling setup, check out derpicleaner.py in the Derpibooru tag rip I did. I'm thinking of something like that, only way more hard core, to fix FA as much as it can be fixed.

Doing this sort of shit at the database level, before importing into Hydrus, is IMO always better for reasons of elegance, simplicity, flexibility, and ease of modification.

9e9256 No.257

>>255
>>204

I haven't really been working on this in the last couple of days because of school, though I got a little done on it yesterday.
The status right now: I have a networked way of doing this working. There is a server with a database, and a bunch of clients/workers who actually access furaffinity and send the hashes and tags to the database. This eases the burden of a single computer doing this alone, so we might actually accomplish this in a somewhat short amount of time. I have once again run into problems with the hashes not importing properly, though I expect to find a solution for that later today. Afterwards I need to implement some sort of authentication to avoid rogue workers sending crap (this might or might not be necessary, I am not entirely sure, but I do not want to risk corrupting the database; it'll probably involve some sort of username/password pair).
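For the curious, a bare-bones sketch of the kind of ZeroMQ request/reply loop a worker might run (pyzmq; the server address, message fields and the scrape_submission helper are placeholders, not the real protocol):

import zmq

def run_worker(server_addr="tcp://127.0.0.1:5555", username="worker", password="secret"):
    context = zmq.Context()
    socket = context.socket(zmq.REQ)
    socket.connect(server_addr)
    while True:
        # Ask the server for a block of submission IDs to work on.
        socket.send_json({"user": username, "password": password, "request": "get_block"})
        block = socket.recv_json()
        if not block.get("submission_ids"):
            break  # nothing left to do
        # scrape_submission is a stand-in for the actual furaffinity scraping code.
        results = [scrape_submission(sub_id) for sub_id in block["submission_ids"]]
        # Send the scraped hash/tag data back and wait for the acknowledgement.
        socket.send_json({"user": username, "password": password, "results": results})
        socket.recv_json()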

Right now I do not import the gender or any of the other info from that little infobox, as it is structured terribly. It is just a flat structure with a few bold tags, but no actual divs or anything to make the individual fields easy to get at. I'll try to do something about it later today, along with a more extensive page scraper interface, though to be honest the gender field is so badly used that the artists I follow don't even tag solo pictures with the proper gender.

I was wondering if running a spellcheck pass over the database during sanitization would help. I am a little scared of ending up "correcting" some random character name. Though I guess, as hydrus_dev stated earlier, towards a perfect tag database lies madness.

I also got to thinking about commissions that both parties upload to their galleries. What would be a good way to handle this, short of doing a detailed analysis of the descriptions, which would be pretty unfeasible?

82208f No.262

>>257

Spell check might help, yeah. Assuming it isn't very fuzzy - don't want autocorrect changing Meesh to Mesh or Marsh or Me Each, just as an example.

I think "UL'd by the artist and the commissioner" should be handled by adding both creator tags, if the hash is the same from both. You can do the same with title - I have no real opinion on that because I'll be unchecking that namespace when I import.

You'll probably find that the hashes match less often than you think - frequently the commissioner uploads a higher res version. I also don't know if changing the filename, which FA does for those, will change the hash.

e6f4cf No.263

>>255
Fair enough. Artist, rating and species should at least be fairly reasonable. And hey, if you can salvage a bit beyond that, great.

Also, FA allows music and stories as well, so filtering those out might be a good idea, at least for a first run.

0b4765 No.265

>>262
The file name shouldn't affect anything, though I am going to need to use another type of hash to detect similar images.
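For reference, "another type of hash" here means a perceptual hash; a minimal difference-hash sketch using Pillow, purely illustrative and not part of the scraper:

from PIL import Image

def dhash(image_path, hash_size=8):
    # Similar-looking images produce hashes with a small Hamming distance.
    image = Image.open(image_path).convert("L").resize((hash_size + 1, hash_size))
    pixels = list(image.getdata())
    bits = []
    for row in range(hash_size):
        for col in range(hash_size):
            left = pixels[row * (hash_size + 1) + col]
            right = pixels[row * (hash_size + 1) + col + 1]
            bits.append(1 if left > right else 0)
    return sum(bit << i for i, bit in enumerate(bits))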

>>263
Did not know music and stories went through the same ID system. I'll think up something to either skip them or include them. If I remember correctly, flashes are within that system too.

82208f No.277

>>265

Yep, everything is through the same ID system because Dragoneer is a moron and the site hasn't been substantively updated since like 2008.

e67601 No.288

>>277
I've been using the site since like 2006 or so, and I don't think it's ever had a "substantive" update. There was the time when they added the header and the time when they made thumbnails bigger (but only for art submissions, not for any other kind).

0b4765 No.322

File: 1424570239695.jpg (1.23 MB, 1311x2000, 1311:2000, Nymph1.jpg)

>>195
And here we have it, the first version of Furaffinity Hash&Tag Scraper Network-Thing (woo! Bullshit long preposterous names!)
https://bitbucket.org/anonymph/furaffinity-scraper-network
It is way too late where I am to actually set a server up right now, the code is not as mature as I want it to be, and it is a little poorly documented, BUT I do believe it is mature enough for public viewing. If I have the time I'll get a server up and running tomorrow.

If any of you professional code critics are in here and have the time, I could use a little of your criticism magic, especially regarding the errors when loading the db into hydrus.

And now, if you will excuse me, I'll go lie down.

0b4765 No.327

File: 1424728945830.jpg (102.5 KB, 650x416, 25:16, Nymph2.jpg)

>>322
I feel that the code is stable enough for actual usage now. I'll get a server set up soon.

Just a quick summary of how this works:
There will be a server delegating which submissions each worker should work on.
Each worker will first connect to furaffinity with login credentials given to it by its maintainer (this information is only transmitted between the worker and furaffinity), then connect to the hash/tag server and request some work. When it gets a reply it will begin scraping whatever information it can from furaffinity for the submissions it has been assigned.
All workers will need to be whitelisted, with the maintainers emailing me the hash/tag server username/password combo they wish to use.

I'm not sure if this is too complex a design for a website scraper, but then again, it is furaffinity we are dealing with here.
I am going to be accepting username/password combos at anonymph@8chan.co, and setting up the server in the next day or two.
NOTE: The username/password combo should just be some random jumble of characters. Don't send me usernames or passwords you actually use.

82208f No.329

>>327

…this sounds a lot like Bitcoin pool mining. Interested. Any chance of IP bans for workers? Also, will the data be made public in raw form as it is being collected?

0b4765 No.330

File: 1424762504854.jpg (86.6 KB, 453x650, 453:650, Nymph_with_morning_glory_f….jpg)

>>329
If anybody sends in some bad stuff (every incoming message to the server is logged), I'll ban that user, though to be honest I don't have a lot of security in place for that sort of thing. I really hope that people will have some goodwill.

Regarding data publishing, I think I'll upload the database in this thread every week.

82208f No.332

>>330

I mean IP bans from FA.

0b4765 No.333

>>332
Oh, not sure. It is quite possible, but that really depends upon how hard you are pushing it.
Yesterday while testing, they dropped a request I was making, but didn't do any more than that.
In their terms of service I think they specify that scraping the site at the cost of performance is a violation. Not sure if that means it is fine to scrape below that level, though I hope it is.

c55632 No.346

>>333

Hmm. I'm irrationally paranoid. Think I could run this on Amazon EC2.micro instances? Cause if so I will throw a few bucks a month at compute power for it.

0b4765 No.347

>>346
I totally understand that, I'm not gonna run this on a local machine either.

I haven't used any of the Amazon services, but the script should run as long as the machine can run Python and has an internet connection.

82208f No.361

>>327

Sending you a login combo request, for whenever you get a server rolling. I'm gonna try to set up a free trial EC2 environment now.

2d976e No.362

>>361

Well this was easy. Posting from a free tier EC2 Windows Server 2012 environment.

Now to install Python and dependencies.

eeab4d No.363

>>362

Alright, I'm ready to connect and start working whenever the server is live.

Question: regarding the custom useragent, should I set it to something meaningful like "Furaffinity tag scraper v0.95", or should I emulate a real browser's useragent, or should I do something ridiculous like "Hi Dragoneer I'm behind 7 proxies :P"?

9d6a70 No.364

>>363
Use a fairly generic, "real"-looking user agent. FA techs have had a profound hateboner for scrapers of any kind for quite some time now; they'll probably ban you quickly if you use something too obvious for the UA.
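Setting that in the requests library is a one-liner; the UA string below is just an example of a generic desktop-browser string, use whatever your own browser reports:

import requests

session = requests.Session()
# Present a browser-like User-Agent instead of the default python-requests/x.y.z.
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36"
})
response = session.get("http://www.furaffinity.net/view/12345/")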

eeab4d No.365

>>364

Check, thanks. I set it to the same UA as the copy of Chrome on this environment.

0b4765 No.366

>>363
Great to hear. I got the server up and running, and added you as a user. The IP is 188.166.47.98. You should be able to connect without trouble.
I am going to be pushing a new version of the scraper later today, with the ability to limit the number of requests per day the script can make to furaffinity (to lessen the risk of IP banning), plus some more documentation.

eeab4d No.367

File: 1425155499286.gif (137.4 KB, 300x149, 300:149, goddamnright.gif)

>>366

Logged in, running. You should be seeing results now. Worker DST0001 will remain at 100% uptime barring any weird billing issues; I'll evaluate performance and see if I think adding another worker on the same environment makes sense at some point.

It'd be cool to have a "% of FA completed" ticker on your page, if that's not insanely difficult to do.

Also, are we going oldest to newest here? Like, is Block 0 the oldest stuff on FA? Doing that would make it easier to update this when/if we catch up. And how many submissions are in a block?

For any other workers who may join us: Recommend you use throwaway FA accounts, created from your remote machines with the same credentials you use for server login. This way Neer can't backtrack you to your real account/IP. Do remember to turn on mature content for these accounts, though!

Server anon, once you're sure this is reliable maybe you should make an announce post at >>/furry/ ?

eeab4d No.368

File: 1425155644568.png (15.94 KB, 645x308, 645:308, error sub 422.png)

>>367

Error'd out on block 4. Pic related. Does that mean it skipped the rest of blocks 4 and 5?

0b4765 No.369

File: 1425156552951.png (10.59 KB, 640x288, 20:9, ServerView.png)

>>367
>>368
I can see you! I also noticed it crashed on 422. Turns out the way I was removing unnecessary tag characters did not work well when the species field is empty. I'm working on a fix for that.
Restarting the worker will result in it receiving new blocks. The way I am doing it means those blocks (4, 5) are 'blocked' for an hour, after which they return to the 'todo' block list.

You guessed right on the order: yes, we are going from oldest to newest, though this is handled by the server and is going to be configurable. Currently a block is 100 submissions.
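A toy sketch of that lease-and-requeue behaviour (not the actual server code; the block count and timeout are placeholders):

import time

BLOCK_TIMEOUT = 60 * 60            # leased blocks return to 'todo' after an hour
todo_blocks = list(range(160000))  # 16e6 submissions / 100 submissions per block
leased_blocks = {}                 # block_id -> time it was handed out

def get_block():
    # Re-queue any block whose lease expired, then hand out the oldest todo block.
    now = time.time()
    for block_id, leased_at in list(leased_blocks.items()):
        if now - leased_at > BLOCK_TIMEOUT:
            del leased_blocks[block_id]
            todo_blocks.append(block_id)
    block_id = todo_blocks.pop(0)
    leased_blocks[block_id] = now
    return block_id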

I'll think about making a /furry/ announcement. Though to be honest, this is a really niche project, what with grabbing only the hashes and tags and ignoring the actual images.

eeab4d No.370

>>369

Happened again on subs 1642 and 1895. Will keep restarting it for the weekend or until your fix is out - but for unattended usage, errors like this should drop the blocks and add them to the re-do list without terminating the process. That way it can't, say, die right as I'm going to work and sit idle for 11+ hours.
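Something like this on the worker side would do it; get_block, process_block and report_failure are stand-ins for whatever the real calls are:

import traceback

def work_forever():
    # Keep going on errors: hand the failed block back for a retry instead of dying.
    while True:
        block = get_block()
        if block is None:
            break
        try:
            process_block(block)
        except Exception:
            traceback.print_exc()
            report_failure(block)  # server puts it back on the re-do list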

That's why I mentioned it - I know we have a sizable contingent of Hydrus users on /furry/. Yeah, this is only of interest to Hydrus users, but we could use the help of the ones who use it but don't religiously check this board.

82208f No.371

>>369

Out of curiosity, does the "51/100" or whatever mean that of the 100 subs in block 1, only 51 of them still exist, with the other 49 having been deleted since they were originally posted?

0b4765 No.373

>>370
Good idea, I will add an option for that.
Got the fix implemented and uploaded. If you cloned it through git, just pull; otherwise download it and paste it into the directory.
You may want to add the option "maxSubmissionsPrHour" to the config.json file, though it's not necessary.

>>371
Spot on.

eeab4d No.374

>>373

Updated. Now getting ('page', 502) etc written to stdout for each submission. Is this by design?

0b4765 No.375

>>374
Oh, that is just some debugging text I forgot to remove. It's kinda by design, I guess. I removed it just now.

eeab4d No.376

>>375

Threw you a request for a couple more worker logins - I don't think I'm fully utilizing this machine.

eeab4d No.379

>>376

About 24 hours in and we're at 60,000 submissions scanned. At this rate it will take us 266 days to scan the entire site, though our actual speed now should be somewhat faster due to more workers.

Still, an excellent start.

0b4765 No.381

>>379
We currently have 7 workers running, each scanning a little under 1000 submission pages an hour. Should take us 3 months, if we don't add more workers, which I expect I'll do.
If anyone is interested in the output this produces, the archive so far is located here:
https://bitbucket.org/anonymph/furaffinity-hydrus-network-database

82208f No.387

File: 1425260826002.png (174.06 KB, 1657x791, 1657:791, too many tags.PNG)

>>381

Okay, I've been digging around in the rough output you posted, and something wacky is going on.

Pic related is the hash and tags for Hash_ID 924 in this database. It's got six zillion titles and creators (I checked, there are 635 mappings for this Hash_ID).

There's no way that can be right, is there?

82208f No.388

>>387

Also, I love that I knew exactly what image this was just from the artist and that one tag:

264116c0b5c9c5cd953f06cdf30d3a566f32fd1a703667bd5915dffb3603e257, species:unspecified / any, title:i hate linksys., age-rating:adult, gender:any, creator:jazzwolf

82208f No.389

>>387
>>388

So if anyone's interested, here's a script that dumps a Hydrus tag archive to a CSV file in the form hash, tags (each tag being one value):

https://gist.github.com/disposablecat/bda7b2ae9b4143b4fa1a

I have found this sort of thing to be useful in writing tag sanitizer scripts because it allows immediate output inspection, without having to inspect a .db file somehow. The opening and iterating routines may also be useful for writing an actual cleaner script, which is what I'm doing next for FA.
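For a rough idea of what such a dump involves, here is a sketch that reads the archive with sqlite3 directly; the table and column names are my assumptions about the Hydrus Tag Archive layout, so check HydrusTagArchive.py or the linked gist for the real code:

import csv
import sqlite3

def dump_archive_to_csv(archive_path, csv_path):
    # Assumed schema: hashes(hash_id, hash), tags(tag_id, tag), mappings(hash_id, tag_id).
    db = sqlite3.connect(archive_path)
    rows = db.execute(
        "SELECT hashes.hash, tags.tag FROM mappings "
        "NATURAL JOIN hashes NATURAL JOIN tags ORDER BY hashes.hash_id"
    )
    grouped = {}
    for hash_bytes, tag in rows:
        grouped.setdefault(hash_bytes.hex(), []).append(tag)
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for hash_hex, tags in grouped.items():
            writer.writerow([hash_hex] + tags)  # one row per hash: hash, tag1, tag2, ...
    db.close()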

82208f No.390

>>389

And here's the output of that script over the initial results Anonymph posted, so y'all can see what we're dealing with:

http://a.pomf.se/ngtprt.csv

859e99 No.391

File: 1425269263542.png (53.53 KB, 193x168, 193:168, 1420591240632.png)

>>388
Your a big guy
If you forget to spoiler that will you die?

82208f No.392


82208f No.393

>>389

At a glance: I'm going to have to drop all "unspecified/any/multiple characters" tags, as they're worse than useless. I'm also going to have to parse the entire species list into something actually useful, because it's coded for shit on FA itself.

0b4765 No.394

File: 1425286596373.jpg (693.03 KB, 1473x1098, 491:366, FantinLatour_Naiade_hermit….jpg)

>>387
I investigated why that was happening. Turns out furaffinity serves the same "image not found" image for every submission it cannot find an image for. That placeholder always generates the exact same hash, and this is what happens when it starts getting mappings assigned to it.
I am gonna build a check for that into the worker.
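The check itself only needs a known-bad hash to compare against; the constant below is a placeholder, not FA's actual "image not found" hash:

# Placeholder - substitute the real hash of FA's "image not found" image.
IMAGE_NOT_FOUND_SHA256 = "0000000000000000000000000000000000000000000000000000000000000000"

def is_placeholder(file_hash_hex):
    # True if the downloaded file is FA's generic 'image not found' image.
    return file_hash_hex == IMAGE_NOT_FOUND_SHA256

# In the worker: skip writing any mappings when is_placeholder(...) is True.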

>>393
Definitely agree. "Western" really doesn't say much about the creature until you look at the list and see that what furaffinity really means is "western dragon".

82208f No.395

>>394

…why in the nine hells would mappings be assigned to "image not found"'s hash? Did some joker upload it?

I'm almost done with a preliminary cleaner/sanitizer that drops all the useless "any" stuff and also fixes species. Just working out some Unicode bullcrap now.

0b4765 No.396

File: 1425287848880.jpg (3 KB, 120x120, 1:1, image_not_found.jpg)

>>395
Maybe, but probably not: any request to "http://d.facdn.net/art/IDSTUFFHERE", where IDSTUFFHERE is not a correct image path, will return the image-not-found image. Some of the images must have been lost to server migrations and such at a later point.
For example, the first submission, "DN_WildThing", appears to be a pretty naughty image going by the description and the comment section, but for some reason it can't be found on the image server. Unless of course those guys are huge trolls, which might very well be the case.

82208f No.397

>>394
>>395

Alright, here's the first version of a sanitizer for FA's dataset:

https://gist.github.com/disposablecat/46c19217d04f42edb9a5

This thing does about the bare minimum required to make FA's metadata workable (IMO, anyway). I'll be iterating on this and adding things for it to fix as we get data.

82208f No.398

>>396

…that makes more sense. Still, wonderful website architecture, FA.

0b4765 No.399

>>396
>>397
Sanitizer anon and I decided to shut down the server while I work on some improvements, which I hope to have implemented by the end of the day.
In the meantime, have a look at the database. It almost doubled in size overnight:
https://bitbucket.org/anonymph/furaffinity-hydrus-network-database

0b4765 No.402

>>399
I got the improvements implemented. The script will capture a lot more information now, which will be useful when parsing it.
The information we are getting for each submission:
title, creator, hash, upload time, FA id, species, gender, "theme", "category", age rating, users tagged in description with links.
Does anybody see anything we should add beyond this?

82208f No.404

>>402

Sanitizer has also been updated. It now properly treats tags as sets, since not doing that caused problems.

New features: it splits tags that were in the form "tag1,tag2" and strips trailing commas from tags.

https://gist.github.com/disposablecat/46c19217d04f42edb9a5
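That splitting and comma handling amounts to something like this; a simplified sketch, not the gist code itself:

def clean_tags(raw_tags):
    # Split comma-joined tags, strip stray commas/whitespace, and deduplicate.
    cleaned = set()
    for raw in raw_tags:
        for piece in raw.split(','):
            piece = piece.strip().strip(',')
            if piece:
                cleaned.add(piece)
    return cleaned

# clean_tags(['wolf,male', 'canine,', 'male']) -> {'wolf', 'male', 'canine'}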

82208f No.408

>>404

Next feature to do: Automatic conversion of tags from the following datasets to species namespace, with appropriate parent species:

-all pokemon
-all digimon
-all registered dog breeds
-all registered cat breeds

If anyone can think of another set of animals that would be available and likely to be useful, let me know. Also, wish me luck, this is gonna be "fun".
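Structurally, that conversion is just a lookup table mapping a name to a namespaced tag plus its parent species; a tiny sketch with made-up entries, the real tables obviously being far larger:

# Illustrative entries only.
SPECIES_DATA = {
    "pikachu": ("species:pikachu", "species:pokemon"),
    "renamon": ("species:renamon", "species:digimon"),
    "samoyed": ("species:samoyed", "species:dog"),
    "maine coon": ("species:maine coon", "species:cat"),
}

def namespace_species(tags):
    # Replace known species names with namespaced tags and add the parent species.
    out = set(tags)
    for tag in tags:
        if tag in SPECIES_DATA:
            child, parent = SPECIES_DATA[tag]
            out.discard(tag)
            out.update({child, parent})
    return out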

0b4765 No.409

File: 1425502615477.png (12.15 KB, 638x382, 319:191, Console2.png)

>>408
And we are rolling again. With the exception of some technical difficulties when copying the database, things have been moving along quite nicely this time.
Another version of the database is up on the git, already with twice the number of scraped submissions as the last.
As the server and worker code is mostly complete, my part from here on out will mostly involve maintenance of the server and database. I'm thinking of trying to scrape hashes off that TOR backup of FA that was found some days back.

82208f No.411

>>408

Sanitizer updated.

New features:
-automatic conversion as above, plus Monster Hunter monsters
-converts any word or set of words used as a species tag on e621 as of november 2014 into a species tag

With this update, species should be pretty well covered. Next: Spelling corrections and a few light implications.

https://gist.github.com/disposablecat/46c19217d04f42edb9a5

82208f No.412

>>411

Haven't tested this with the new DB yet. Will tomorrow.

82208f No.415

>>409

So, something odd is going on with upload times. Specifically, all the data in this set seems to have POSIX timestamps of around 2007 (11/22/2007, specifically), even though viewing the individual submission pages (which I can now do by ID, thanks!) shows upload dates back in 2005.

82208f No.416

>>409
>>415

Examples:
sub correct reported diff
917 1115357400 1195738137 80380737
441 1107596520 1195738191 88141671
75551 1143003840 1195743785 52739945
153487 1152943080 1195757296 42814216

(ignore the bad art if you look these up, I just picked random ones)

If there's a consistently calculable offset I can correct it in the sanitizer, but it doesn't look like there is one.

0b4765 No.417

>>416
I'm pulling the timestamps directly off the Last-Modified headers of the files. A server move or something similar might be causing this. I don't see it as that large an issue.

ddd571 No.418

>>417

It isn't huge, no. I would just like to correct for it in the sanitizer, if it's predictably correctable.

Reading your code, I thought you were beautifulsouping it off the submission page. If you're getting it off the file that makes much more sense.

ddd571 No.419

>>411
>>416

Now working on an option to prettify fa_ type namespaces - in particular, human readable dates.

I also realized I can do what I did for species to other e6 namespaces (series, character), so that's going to be a thing that happens.

0b4765 No.420

>>419
That is actually quite clever, I like it.
>>418
Yeah, I have two methods to scrape it: the first is from the header info, the other is from the submission info. The problem with the submission info date is that it doesn't play nicely with the Python time-string parser, and I have no particular desire to write my own parser. (Though it might be an interesting side project if I've got nothing else to do on a lazy day.)
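For what it's worth, dates like that can usually be coaxed into strptime by stripping the ordinal suffix first; a sketch assuming FA displays something like "May 12th, 2016 04:02 PM" (check the actual page source for the real format):

import re
import time
from datetime import datetime

def parse_fa_date(text):
    # 'May 12th, 2016 04:02 PM' -> POSIX timestamp; the format string is a guess.
    cleaned = re.sub(r'(\d+)(st|nd|rd|th)', r'\1', text.strip())
    parsed = datetime.strptime(cleaned, '%b %d, %Y %I:%M %p')
    return int(time.mktime(parsed.timetuple()))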

0b4765 No.421

>>420
You know, thinking about it, I made the wrong decision here. I have modified the code to use the submission information instead; I will just need to go through and parse it later on.

ddd571 No.422

>>421

That works - if you need me to run some kind of script that just pulls that data from subs up til the change let me know.

Is the raw tag still going to be a POSIX stamp, or should I change my code?

ddd571 No.423

>>422

Cause if the tag is in some crazy native format I can totally parse that in the sanitizer.

0b4765 No.424

>>423
Crazy native format.
>>422
I'll figure something out.

eeab4d No.425

Workers restarted on most recent updated code.

82208f No.426

>>419

Sanitizer has been significantly updated. Highlights:

-prettification of f_ namespaces (optional, selected via dialog box). Date isn't handled for now since the format is changing.
-data is now in a separate module for readability (posted in same Gist)
-now requires multi_key_dict module
-now has a spellcheck engine, which will allow me to build a custom spellchecking dictionary based on errors people actually make in the dataset (so that we avoid bad autocorrections). This happens after format unification but before namespacing, which means that if someone misspells a species it will still get namespaced after it is corrected.
-I have written a quick and dirty e621 tag list scraper (also on my Gist account) that will allow me to comprehensively convert character and series e621 namespaces via the same dataset and code as species once the scrape is complete. I'm probably going to skip e6-izing creator, since I can't think of a way in which that would be useful, but if anyone can justify doing it I'll put it in the dataset.

Next steps: I am going to evaluate whether or not it will be logical to import implications and aliases from e621. I'd also like to find a good list of common words and drop them if they appear as orphan tags (that is, after namespacing has already occurred). Once I've built in as much automatic work as possible I will begin manually populating the spellcheck list.

If anyone has any thoughts on any of this - desirability, edge cases, failure modes, whatever - I am interested to hear them.

https://gist.github.com/disposablecat/46c19217d04f42edb9a5

82208f No.427

>>426

Actually, I've just had another thought. What if the sanitizer removed any bare (non-namespaced) tags that do not have equivalents on e621? Correcting for phrases, of course, since for example "call center" on e6 would be "call" and "center" on FA (I already do this for species).

This would get rid of common words, misspellings I don't catch, things that don't make sense as tags ("across"? "damage"? "hip"?), etc etc.

Can anyone think of a good reason not to do that? Or to make it optional?

At the very minimum, I'm going to correct phrases as above for bare tags now that I've thought of this.

82208f No.428

>>427

…hilariously, I just checked and both "damage" and "hip" are valid (though very low count) on e621. So's "he".

Maybe drop anything that has a count below like 25 or 50 on e6. I'll run some more scrapes and see what I think.

82208f No.433

Getting a lot of attribute errors in recent blocks. Did FA change something after ~660000?

82208f No.434

Also, killed 2 workers to try to stabilize CPU credit usage.

0b4765 No.437

>>433
Yeah I noticed that. I implemented some new traceback capabilities to allow me to understand what is going on.

0b4765 No.443

YouTube embed.
Got a proper time parser implemented, I think. Python doesn't have good ways to deal with time, unfortunately. I feel this video is very related.
Anyway, this change means two things:
One: Hopefully the time standard will be constant beyond the first million submissions.
Two: The first million submissions have a changing standard.
I would like to hear if anybody here cares enough to want us to redo the times for the first million, because I don't.

82208f No.445

>>443

Sanitizer anon gives no fucks about redoing time. I'll just write a parser that handles both.

Actually, now that I think of it: do we really need more data than "year" here?

82208f No.446

>>445

Like, I kind of thought month/day would be useful, but really it's not at all. If someone really wants that shit they can parse the raw data themselves; it wouldn't be hard to modify my code.

0b4765 No.448

>>446
First million submissions scraped!
I need to host the file on mega, as git is not a great platform for these kinds of things.
Link here:
https://mega.co.nz/#!VVJWnBLS!OMBDYVauYgymVg8l_060J8EVfyTONoihHaQ2_QKNvVI

8f9467 No.449

>>448

I will finish (for certain values of finish) the sanitizer this weekend and rehost a cleaned version.

8f9467 No.451

>>448

Also, this would 7zip down to about a tenth the size.

0b4765 No.453

File: 1426439529331.png (115.21 KB, 942x851, 942:851, ConsoleStuff.png)

>>451
Yeah, just tested, and it got down to 93 MB. I'll upload them as 7zip from now on. 7zip of the first part here:
https://mega.co.nz/#!oA5RXKrA!p9He54yBs8YLw-pKt_qjCE-uGjUMGVyrtX6a3JmPPso

Would it be a good idea to have a torrent with these files? Are updating torrents a thing, and would they be applicable here?

Also, would you be so kind as to update to the latest version? I am getting some very persistent errors on the server side.

82208f No.454

>>453

Whoops, sorry - updated and restarted. I was out drinking last night and didn't get your email.

Re torrents: Probably once we're caught up, a torrent set of the whole thing would be neat. Updating torrents is not a thing - you have to post a new torrent every time you change the files in it.

82208f No.461

>>449

Okay, sanitizer was significantly improved this weekend but needs a bit more work - gotta get implications in and sanity check them, then drop common words (I'm not doing the non-e621 cutoff, it seems too lossy) and maybe get some misspellings fixed.

First million sanitized v1 will be up later this week and I will xpost to /furry/ at that time. Sorry for the delay.

4dd64b No.467

File: 1426654532597.png (21.51 KB, 1680x516, 140:43, 2015-03-18-005211_1680x516….png)

Just making sure I'm not throwing a bunch of garbage data at the server. Is the worker supposed to have all these errors? I checked the IDs of these posts and they all seem private, but it seems like it throws them on every block.

82208f No.468

>>467

I can see these IDs; they are mature/adult. You most likely do not have your worker's FA account set to be able to see mature and adult content; you should change this under My FA -> Account Management -> Account Settings -> Content Maturity Filter. I believe Anonymph marks blocks that error for retry, but he would have to answer for sure.

>>461

Sanitizer is updated again, but honestly, we're probably looking at maybe this Sunday for a version of the first 1m that I feel comfortable posting.

4dd64b No.472

>>468
Thanks, that fixed it!
Hopefully the blocks that errored are marked.

0b4765 No.473

I needed to restart the server to test some code, and an error resulted in all workers crashing. Oops, sorry.

>>468
Yeah, they are marked for retry. They will be added to the standard block 'queue'.

>>467
Holy shit. Your worker reported around 25000 blocks as 'bad'. Before the crash the workers were processing blocks in the 40 thousands.

The server has been set to a state in which it should be working, with the bad blocks having been moved back into the queue. I'll look at the mature/adult content bug after I've eaten.
I'm thinking about adding the ability for the script to change account settings, but that feels a little weird to be doing.

82208f No.474

>>473

On the one hand yes that does feel weird, but on the other hand a) people should not be running workers on their actual personal account and b) the project needs that setting a certain way so why not force it?

82208f No.475

>>473

Restarted my workers. Getting blocks in the sub-10k range - didn't we finish those? Or are those part of the bad blocks?

It occurs to me I don't know how your system chunks DBs. Like, is it going to add the tags from these to a new "first million" db that you'll have to reupload, or are they going to roll into the second million?

0b4765 No.476

>>475
Bad blocks apparently.

They'll fall into the second part. The entire system has been pretty messed up due to the errors, and part two will probably include data from various blocks between 7000 and 43000.
I think I'll merge whatever I currently have into one large database and make a script to export smaller parts of it. It should be quite easy now that f_id is implemented.
When that script is ready, I'll upload a new version of the first million.

4dd64b No.480

>>473
Shit, sorry about that!

I'd say make the worker have the ability to at least check account settings and refuse to run if mature isn't set, that way no one else makes my mistake.

0b4765 No.483

>>480
Heh, no problem, it happened to me too when I added my own worker.
The script is now able to change settings for the user, and will change them to something that should work properly. In addition, the scraper library will now return an error when it cannot access a submission because of the maturity filter.

I really recommend you update your workers. I fixed a lot of errors.

82208f No.485

>>483

DST workers are updated.

0b4765 No.486

File: 1426874543430.gif (73.11 KB, 500x281, 500:281, 1362687834124.gif)

>>485
Getting real irritated at these errors by now.
Bla bla bla, fixed errors, did stuff, new version, should fix all of our problems, plz update.

No, seriously, there is a new version, it fixes a crash, which has laid waste to all the workers.

5c7a48 No.488

>>486

DST updated and restarted.

82208f No.489

>>486

Killing DST1 and 2 to try to get back into positive CPU credits. Fuck, these things are spiky as hell on usage. Think I might pay for another instance this weekend to even it out and get capacity back.

How's our progress rate, overall?

82208f No.490

>>489

Okay actually duplicating a server is pretty cake. All six workers are now up and running, 3 per instance.

82208f No.491


0b4765 No.493

File: 1426936928737.png (104.65 KB, 1026x836, 27:22, Console.png)

>>489
We are, at the time of writing, at around 2.5 million submissions. I am holding out on publishing another db part until I've figured out a good way to do it.

The image is our progress over ~12 minutes. In that time we processed 42 blocks (4,200 submission page hits). At that speed we check 21,000 submissions each hour, and 504,000 per day.
At this rate we are expected to be at submission 16e6 in the middle of April.

>>491
I do not really have an opinion on this. It's Neer's site, he can do whatever he wants with it. All I do on a regular basis on FA is to check my new submissions page.
The only problem I see with this currently is if they change the entire site layout before we are done.

82208f No.503

>>493

Oh, hey, so we are actually making excellent progress. Good to know. How many total workers is that with, if you don't mind me asking?

The FA buy thing worries me technically, as far as layout changes go, but it also worries me existentially. I don't trust IMVU long term not to censor, paywall or otherwise fuck the site, and as an art and metadata collector I also worry about the fragmentation of the userbase, both from the announcement itself and from whatever happens if they do fuck it up.

82208f No.510

A thought: We might want to do something about people using commas in titles. I have no real way to check how prevalent it is and also no way to correct for it, but currently those are being split as two separate tags in the raw dataset.

I would suggest making the scraper check titles before it adds them to the DB, and remove any commas.

0b4765 No.511

File: 1427061367090.png (27.63 KB, 381x701, 381:701, commasInFuraffinityDB.png)

>>510
You have any examples?
This is of course a problem with CSV-formatted data. Luckily, none of the code involved in the worker or the server uses the CSV format.
Unless HydrusTagArchive is badly implemented, which it is not, there is really no reason for titles being split in the DB. It may look like there would be a problem with the sqlite library's replacement function, but luckily the library is smarter than that and has no problem with commas.

It's quite easy to look for these: use pretty much any sqlite browser/viewer. Image related is the aptly named SqliteBrowser.

82208f No.512

>>511

Huh, I had not figured out how to wildcard in a DB browser.

I'm a moron. I checked the cleaned DB and these are coming across fine; I had been using a CSV dump of the databases for convenience of ctrl-f. False alarm, this isn't a problem at all.

Thanks for the explanation.

82208f No.526

>>461

Still finishing up sanitizer. Hopefully this weekend I can call it good.

0b4765 No.539

Server is down again. One of the workers was returning mass errors. Not sure why, but it resulted in the skipping of 5 mil submissions. I'm currently working on checking where we left off. I'm also using the "opportunity" to do some database administration. Hopefully I can then begin pumping out parts of the database, which is currently at 1.8 GB.
Reset your workers, and I'll restart the server in the morning.

82208f No.542

>>461

Workers shut off for the time being. Will turn them back on at your notice, updating if need be.

4dd64b No.553

File: 1427600226094.png (8 KB, 776x396, 194:99, 2015-03-28_23-34-23.png)

>>539
God fucking damn it.
It was my worker again, just checked it today to see that it had errored like hell.

Turns out my scraper got IP banned. Oh well.
Good luck to the rest of y'all.

82208f No.554

>>553

If I restart my instances, I get a new IP assigned to me by Amazon. Is that not the same on your environment?

Started my workers back up. They're idling for whenever the server comes back.

4dd64b No.556

>>554
I'm using DigitalOcean.

The instance the scraper is on has other things on it, so I would need to pay for two servers at once if I were to spin up another one. Not too sure I want to do that.

82208f No.557

>>556

Ah, I see. This is why I'm using separate environments just for this - we anticipated the possibility of IP bans at the start, since FA is notorious for IP banning scraping users.

There is always the Amazon EC2 free tier, if you haven't used your 12 months yet. A single EC2.micro instance can be run 24/7 for free, and in my experience supports ~3 scraper workers.

0b4765 No.559

The server is up now.
Sorry for the delay, fixing the database took longer than I had anticipated.

>>553
Funny that it's only you who has gotten IP banned so far. I should probably add a way to check for IP bans. Could you send me a copy/image of that page's source?

>>556
If you are using DigitalOcean just take a snapshot, destroy the droplet, and create another one from that snapshot. You will be assigned another IP address without losing anything, except a little time.

82208f No.560

File: 1427658492378.png (33.49 KB, 1139x330, 1139:330, errors.PNG)

Getting a few of these errors. Related to this:

http://www.fileformat.info/info/unicode/char/2022/index.htm

82208f No.563

>>559

Also, regarding the database, the Part1.db that you posted does not seem to have any tags beginning with letters past R, except for title: and species: namespaces. What could be going on with that?

82208f No.564

>>563

This is true of the three versions on the Bitbucket repo, too.

You've got this line in Furaffinity.py, under GetKeywords. Related?:

for a in self.soup.find_all( "a", href=re.compile(r"/search/@keywords [a-r]+") ):

82208f No.565

>>564

Yeah, thinking about it that's definitely it. You've also excluded tags beginning with numbers or non-alphanumeric characters with this; I'm not sure how common they are, but I feel like we should be getting them if they exist.
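For reference, the character class is the only thing that needs to change; something along these lines, though the exact pattern Anonymph committed may differ:

import re

# Old (only matches keywords made entirely of the letters a-r):
#   re.compile(r"/search/@keywords [a-r]+")
# Broader pattern that accepts any keyword text after the prefix:
keyword_link_pattern = re.compile(r"/search/@keywords .+")

# for a in self.soup.find_all("a", href=keyword_link_pattern):
#     ...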

0b4765 No.569

>>564
Oh shit. Fuck. I'm sorry, that is such a stupid mistake. Not sure how that happened, but I should have been using a better regex from the beginning. Thanks for pointing it out. It has been fixed, along with an untested check for IP bans. If you still have the server >>553, I would appreciate it if you tried it out and reported what it outputs. (Please stop it if it begins to send errors instead of shutting down.)

>>565
Panic control is as follows: we continue as we are currently doing until we reach 16e6. Then we do another round of scraping, this time exclusively over the submission pages, focusing on the missing metadata. That round should go a lot faster, as we ignore the image files.

Sounds good?

0b4765 No.570

>>569
Forgot to mention, the fix is available on git now.

82208f No.572

>>569

I've updated the DST workers.

That solution works. I guess we'll distribute the entire dataset at once, when we're done, then, instead of in chunks. And hey, that means that in passing we'll get the right dates on stuff :P

I've also updated the sanitizer again. At this point it's *almost* ready, I just have to finish manually cross-checking implications and adding spellcheck entries that unify non-namespaced tags. So I'll keep working on that.

0b4765 No.601

Everything is running smoothly. It's starting to become a little boring to not have stuff to fix every few days.

I was pleasantly surprised earlier today when I noticed one image with 'f_' namespaces.

Not much else to report though.

82208f No.651

What's our progress looking like? Curious how close to done we are.


0b4765 No.653

>>651

Hard to properly estimate, as the workers are currently fixing up a lot of early missing blocks. After these missing submissions, of which there are around a million (I'm not sure), we'll continue from submission 12500000.

Seems like I was a little optimistic with my estimate previously.

My hosting provider is going to be doing a software upgrade between the 28th and 29th. They have said this might cause some network issues; it should hopefully not be that large a problem, but I'm going to be observing the network activity to the best of my ability in that period.

I'm also currently looking into getting the AWS trial to add some speed to the network, but this is low priority right now.


82208f No.655

>>653

Sounds good. Currently I am thinking that since we're doing the whole mess at once, I'm going to run one massive "non-ns tag unification process" in the sanitizer once we've got the whole dataset in hand. Instead of doing it all manually I'm going to sort out about the top 10% of erroneous tags and correct for them; the rest will have so few counts as to be functionally irrelevant.
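Finding that top slice is straightforward once the whole dataset is in hand; roughly:

from collections import Counter

def top_bare_tags(all_tags, fraction=0.10):
    # Count non-namespaced tags and return the most common slice worth fixing by hand.
    counts = Counter(tag for tag in all_tags if ':' not in tag)
    cutoff = max(1, int(len(counts) * fraction))
    return counts.most_common(cutoff)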


0b4765 No.677

Shit, they made a change to their default layout out of fucking nowhere. I am not in a position to fix this right now, so I shut down the server.

I also need to investigate a better way to save the state of the server. It has consistently been failing to save its progress when shutting down.

I'll look into both tomorrow.


82208f No.681

>>677

Thankfully, they have reverted the layout change after mass outrage, so we can probably start up again.


0b4765 No.692

>>681

Sorry, 'real life' has been really hectic this last week; I've barely had time to work on stuff.

Turns out that when they changed the sidebar they also made some arbitrary change in the html for the title and creator display. This was the reason I didn't just restart the server when I noticed that they changed it back.

I fixed it by simply changing the code to extract it from the page title. Simpler and hopefully a little more future proof.


0b4765 No.747

File: 1432282222892.png (170.52 KB, 1300x1091, 1300:1091, JollyCooperation.png)

I am proud to announce that we are now finished, sort of. I do not currently have the time or money to maintain the project, and am thus shutting it down, at least for now. There are currently around 9 million images/submissions contained within the database, which may seem too few compared to the IDs of the most recently created submissions. This is due to deleted submissions and such; not much to do about that.

There is still data missing, but I estimate 75% coverage of the available data. Most notably, any keywords containing letters other than a through r are excluded, at least for the first half of the images.

This thing may be pushing the limits of what Hydrus can handle; I'm quite excited to see what happens when people try it out.

The entire (raw) database (9.6GB) can be found as an archive(2.5GB) here:

https://mega.co.nz/#!ooRRXLyL!v8IT4uHWsmx-c0gVK7XGx0Xjjij0b4dXbA-Wl_uFXLs

A processed/'noise reduced' database is being worked on, but that might take some time.

I'd like to thank the wonderful people who've put up workers for the project; it could never have been completed without them.


2f8e25 No.753

File: 1432505069706.gif (906.52 KB, 150x113, 150:113, 1c328e23899667080bf20012c8….gif)

>>747

I am really impressed with what you have done here. I cannot imagine there are many people who have ever created any sort of metadata store this large.

Please let me know where hydrus or the tag archive code has trouble with this and I will try to fix it!


d14d0d No.763

File: 1432752458459.jpg (35.29 KB, 319x400, 319:400, fe1d12603f42207c94ae86faf0….jpg)

>>747

Echoing Hydrus Dev here, this is a tremendous undertaking and I'm glad to have been a part of it.

I will probably attempt to complete coverage up to the 16mil mark on my own, after which I will finish my sanitizer script (as described above) and filter the whole thing at once, then post here. I don't know when I'll have time to do this, but "by the end of summer" would be far too late IMO so we shall have to see.

Following that I'm going to try to write a background updater and keep pace, releasing sanitized update HTAs every time FA hits another 500k or so.


d5425d No.2538

Has this been sanitized yet?

It would be awesome to have at least an FA-based artist: tag database.


fa6754 No.2541

Oh! I forgot about this project. I'm guessing no news for a year is a bad sign? A database like this would be so damn useful.


fde218 No.2567

File: 1462271313528.jpg (9.54 KB, 225x225, 1:1, 1462229137296.jpg)

>>747

Does this database include thumbnails of deleted images? I fucking hate it when I find something cool, return to my browser session, and find out it's been deleted by the uploader. Same thing on dA and pixiv…


d1f99c No.2616

File: 1462502241445.png (2.57 KB, 379x87, 379:87, asdf.png)

Is there any way to make the tagging process go faster? If I am doing this math correctly, it will take my pretty beefy gaming computer ~70 days to run through this db.

Is this accurate?

It almost seems like it would be faster to tag this stuff myself at this point. Also, hydrus is frozen all the time while it's working, and I can't even pause the tagging anymore.

Pic related, I've been running the db for about 26 hrs now, or around that at least.


6b9755 No.2621

File: 1462550956927.jpg (363.4 KB, 1280x1006, 640:503, 0d558b492a90531789922bb04d….jpg)

>>2616

I would say:

Stop the sync.

Run database->maintenance->analyze->full. (might take up to 20 mins)

Retry the sync.

This should load a lot of the db into memory, which will accelerate things a bit, although my experience with that has mostly been in Win10, which I think has a better disk cache than 8.1, although I am not sure. Also, if you have been running for a day, I'd expect most of it is loaded into memory anyway.

Having said that, unless you have special plans to search the 'all known files' domain, I am not sure you want to sync entirely with that database. 9m files' worth of tags is probably going to make a huge db when you don't need most of those files. If you wait a week, I'll add an option to v205 to only add local files' tags on an archive sync/import, which should speed things up massively and keep your db lean.


d1f99c No.2629

>>2621

Cool, thanks for the reply.

I think I'll wait for the update.

What you are doing with this program is really cool, and I hope one day I can do something of similar scale.

Learning Java right now.


c701ea No.2631

>>2567

This site hosts backups of FA.

https://vj5pbopejlhcbz4n.onion.link


69f852 No.2632

>>2621

Can I make a suggestion?

It's something that's been bugging me for a while - I really REALLY like the idea of a global tag base, but I despise how there are so many bullshit tags.

Would it be possible to set it up in such a way that the public repo only contains tags that get a lot of hits? For example, show tags - obviously useful.

I've put in a lot of requests to this in an effort to help clean it up, but it's amazing to me how many tags get added to the repo that are nothing but garbage. (Tags containing just a character happen a lot.)

I think the purpose of a general tag repo should be to apply general, but useful tags. It certainly kind of sounds like a pain in the ass, but I honestly think it might go a lot farther in just ensuring that it keeps the repo clean.

I'm just curious as to how many single tags out there exist. Certainly plenty of tags with different names that aren't in the same group. Just something I've been tossing around for a while.

PS - I let the whole thing sync in its entirety, so whenever I go to add a tag, it takes a moment to search through everything. It's a slight pause and I've gotten used to it. Is this normal, or is it a bad symptom of letting the whole database sync?


6b9755 No.2642

File: 1462645736634.jpg (771.67 KB, 2048x1122, 1024:561, 4b8e9cb24c92497a4fbb4d65c8….jpg)

>>2629

Go for it! This whole thing has been a lot slower and more difficult than I ever expected, but it has also been very rewarding. The memes flow just a little faster because of the thing I made.

>>2632

I think I agree. I'm really pleased with the tag corpus on the ptr, but it is a mess. Again and again while making hydrus, I've learned that gathering and propagating is much easier than managing and optimising.

These numbers might be a bit wrong, but I just queried the all known files ac cache on my real world db, and of 870,000 entries, 513,000 have a count of 1. There might be some namespace stuff making those numbers unusual, but I think that means 500k/870k tags have only a single entry. I assume many of those are title tags. 750k have a count less than 10. The top ten tags have 360k-780k mappings. (top ten are something like [breasts, solo, female, 1girl, blush, long hair, nipples, mammal, penis, nude], jej)

My current vague plan is to add serverside tag filtering, so admins will be able to set up rules like 'this tag repo is only for tags with series: namespace', or 'creator:unknown is disallowed' or 'only these five tags are permitted'. The client will be informed about those rules, so garbage can be stopped at the source. Having multiple specialised tag repos will also make for ultimately leaner databases.

I would also like to spend some part of the next year or two researching the current state of neural net tagging. This seems to be increasingly possible. I would much rather distribute packets of 'things that are shape xxxx should get tag yyyy with confidence zzzz' than specific file-tag pairs. I'm confident this would be a much more flexible and powerful system than we have currently, pushing the human cognitive load to 'is my guess here accurate yes/no' rather than 'please type twenty good tags for every new file forever', although I absolutely need to read up on it. We can use our current messy pile of 57m mappings (and how many million will we have in two years?) to create the new system - if the technical problems can be overcome.

The new ac cache is working hugely better than the old system, but it sometimes spikes with lag, I think mostly due to disk latency. If you like, please run some profiles while you type in tags as described here:

http://hydrusnetwork.github.io/hydrus/help/reducing_lag.html

And send them in!


8d6f27 No.2679

>>2538

>>2541

I was the guy doing the sanitizer for this, and no, I haven't cleaned the final DB because life got in the way, among other reasons (there are some data integrity issues resulting from the process we used). Having become a much better coder in the intervening time, I recently scratchbuilt an e621 tag ripper and posted the cleaned result DB in the big database thread; many of the patterns I used for that can be applied to FA, and I just need to find the time to sit down and write the code for it, at which point I'll begin re-ripping the entire site's metadata.

>>2567

Unfortunately no - there is no method by which we could include a thumbnail in an HTA, and even if we could, there's no way for us to get thumbnails from deleted subs, because FA deletes them when the sub is deleted (as far as I know).

>>2631 is your solution here; it's slow and painful, but you can in fact do mass-downloading from it given the appropriate TOR-friendly tools.

I have also run into the problem of "artists delete stuff I probably would have liked"; my solution to this was to write a Python script that runs daily and automatically downloads all new submissions directly to my computer, then nukes them. This way an artist has to delete something within 24 hours, or I will have already grabbed it. Of course, this further aggravates my "I have too many images to sort through" problem, since I'm watching close to 1600 artists which leads to ~300 new files a day…

Code for that will eventually be posted somewhere, probably along with the tag ripper code once it's written.


8d6f27 No.2684

>>2679

Okay, replying to this got my coder juices flowing, so I hammered out a parser for FA pages and will start scraping data; once I have enough data I'll re-attack the sanitization problem.

Request for public comment:

Below, as an example, is all the metadata I can think to extract from http://www.furaffinity.net/view/19948852/ :

{'artist': u'feralise',
'category': u'artwork (digital)',
'faId': 19948852,
'gender': u'male',
'keywords': [u'feralise',
u'adult',
u'sketchem',
u'pyro29',
u'pyro',
u'vi',
u'navalia',
u'arctic',
u'white',
u'wolf',
u'smirk',
u'glance',
u'erection',
u'boxer',
u'briefs',
u'chill',
u'temptation',
u'muscle',
u'pose',
u'sexy',
u'male',
u'canine'],
'postTimestamp': 1463018547,
'rating': u'adult',
'sha1Hash': '6ad603c975d20600bb565f9619ac5c51647feffd',
'species': u'canid - wolf',
'taggedUsers': [u'pyro29'],
'theme': u'general furry art',
'title': u'chill'}

Keeping in mind that all of this is completely raw, since processing and sanitization happens at a later stage (also, ignore the "u'" bits, that's just a Python marker for "unicode string"), can anyone think of any additional data that is available from FA that I am not gathering?


8d6f27 No.2709

did my posts get deleted or is 8chan being garbage


8d6f27 No.2710

>>2684

>>2709

Oh yeah just cache garbage.


8d6f27 No.2784

>>2710

Godsfuckingdamnit there's a captcha on the FA login page now, for *every* login.

It will take me a not insignificant amount of time to figure a way around this.


fa6754 No.2787

>>2784

Ah hells. I was just about to mention that personally I would need just the artist/species for tags while everything else would be a bonus, but now that seems pointless. That whole leakage deal has been a pain.


8d6f27 No.2825

I've actually figured it out, I think - mechanize can use the session key from my browser login to avoid having to log in itself. Project is back on once I have time.
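The same trick works with plain requests, for anyone following along: copy the session cookies out of a logged-in browser and attach them to the scraper's session. The cookie names below are guesses at FA's session cookies, so check your own browser's cookie storage:

import requests

session = requests.Session()
# Values copied from a browser that is already logged in (and past the captcha).
session.cookies.set('a', 'your-session-cookie-value', domain='.furaffinity.net')
session.cookies.set('b', 'your-other-cookie-value', domain='.furaffinity.net')

page = session.get('http://www.furaffinity.net/view/19948852/')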


d3af7c No.3390

It's been a couple of months since there was an update on this; is anyone still working on getting an FA database going?

If this is dead, perhaps a new project could be started for https://beta.furrynetwork.com/ ? Seems to be the "new" Furry art site and will probably be hosting more and more of FA's content as artists move over to it.



