Alright guys. I want to make a full copy of all the footage currently available on the website of the INA (the French National Audiovisual Institute): http://www.ina.fr
First, keep in mind that every document has a URL whose serial begins with either AFE85 or AFE86.
Here's an example: https://www.ina.fr/video/AFE85000955
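You can probe a single serial by hand with curl and look only at the status code (that 200-or-not answer is what the script below keys on):

curl --silent --output /dev/null --write-out '%{http_code}\n' "https://www.ina.fr/video/AFE85000955"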
A low-tech way to find them would be to brute-force every possible serial, so I wrote a small Bash script:
#!/bin/bash
url="https://www.ina.fr/video/AFE"
found_log="${HOME}/INA/found.log"    # one line per downloaded video, shared between all the background jobs
mkdir -p "${HOME}/INA"
: > "${found_log}"    # start the count from zero on each run

get_http_code(){
    # download the webpage, with the HTTP status code appended on its own line
    readarray -t http_get < <( curl --write-out '\n%{http_code}' --silent -S "${1}" )
    http_code=$( printf '%s\n' "${http_get[@]}" | tail -n 1 )    # last line is the status code
    [[ -z ${http_code} || ${http_code} == "000" ]] && return 1    # no HTTP response at all: let the caller retry
    video_date=$( printf '%s\n' "${http_get[@]}" | grep "broadcast" | grep -o "194[0-5]" )    # only set if the broadcast date is between 1940 and 1945
    echo "${i} - ${http_code} - ${video_date}"    # serial being tried, HTTP status code, date
}

get_video(){
    # video_name=$( printf '%s\n' "${http_get[@]}" | grep '"h2--title"' | sed 's/<[^>]*>//g;s/^[ \t]*//' )    # not used right now
    video_author=$( printf '%s\n' "${http_get[@]}" | grep -A1 '"h3--title"' | sed 's/<[^>]*>//g;s/^[ \t]*//;/^$/d' )    # author of the document
    mkdir -p "${HOME}/INA/${video_author}"    # create the author directory
    youtube-dl -o "${HOME}/INA/${video_author}/%(title)s.%(ext)s" "${1}"    # download the video into it
}

loop_function(){
    get_http_code "${url}${i}" || return 1
    if [[ ${http_code} -eq 200 && -n ${video_date} ]]; then
        hits=$(( $(wc -l < "${found_log}") + 1 ))
        echo "${url}${i}" >> "${found_log}"    # a plain counter would be lost when this subshell exits, hence the shared log file
        if (( hits % 10 == 0 )); then
            sleep 5    # every 10th hit waits a little before downloading, to avoid hammering the website
        fi
        get_video "${url}${i}"
    fi
}

for i in {85000000..86999999}; do
    sleep 0.25    # start a new background inquiry every 250 ms
    ( loop_function || { sleep $((RANDOM % 60)); loop_function; } ) &    # if the inquiry failed (server overloaded), retry once after a random delay
done
wait    # let the remaining background jobs finish
echo "$(wc -l < "${found_log}") videos were found."
But that's an extremely inefficient way to do it.
The Institute's website is quite slow and fragile, so to avoid overloading it I need to leave 250 ms between queries.
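If I stick with the brute-force route, I suppose something like xargs -P would at least put a hard cap on the number of simultaneous requests instead of firing everything into the background. A rough sketch, assuming the per-serial check above were moved into a standalone script that takes one serial as its argument (check_one.sh is just a name I made up):

# run at most 4 checks in parallel, one serial per invocation
seq 85000000 86999999 | xargs -P 4 -n 1 ./check_one.sh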
Another, smarter way to do it would be to use the website's AJAX search API: https://www.ina.fr/layout/set/ajax/recherche/result?
But I have next to zero knowledge of AJAX, and the client requests seem to be hashed/encoded in some way before being sent.
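For what it's worth, replaying the request itself is easy enough once it has been captured: in the browser's DevTools (Network tab) you can right-click the XHR call, pick "Copy as cURL", and script around that. A minimal sketch, with the query string left as a placeholder because I don't know what the endpoint actually expects:

# paste the query string captured from the browser in place of the placeholder
curl --silent 'https://www.ina.fr/layout/set/ajax/recherche/result?<captured-query-string>' \
     -H 'X-Requested-With: XMLHttpRequest'    # header browsers commonly send with AJAX calls

The part I can't figure out is how those request parameters get built/encoded in the first place.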
Would you know a way to do it?