Alright guys. I want to make a full copy of all the footage currently available on the website of the INA (the French National Audiovisual Institute): http://www.ina.fr
First, keep in mind that every document has a URL whose serial begins with either AFE85 or AFE86.
Here's an example: https://www.ina.fr/video/AFE85000955
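You can probe a single serial by hand with curl and look only at the status code (that 200-or-not answer is what the script below keys on):

curl --silent --output /dev/null --write-out '%{http_code}\n' "https://www.ina.fr/video/AFE85000955"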
A low-tech way to find them would be to brute-force every possible serial, so I wrote a small Bash script:
#!/bin/bash
url="https://www.ina.fr/video/AFE"
found_log="${HOME}/INA/found.log"    # one line per downloaded video, shared between all the background jobs
mkdir -p "${HOME}/INA"
: > "${found_log}"    # start the count from zero on each run

get_http_code(){
    # download the webpage, with the HTTP status code appended on its own line
    readarray -t http_get < <( curl --write-out '\n%{http_code}' --silent -S "${1}" )
    http_code=$( printf '%s\n' "${http_get[@]}" | tail -n 1 )    # last line is the status code
    [[ -z ${http_code} || ${http_code} == "000" ]] && return 1    # no HTTP response at all: let the caller retry
    video_date=$( printf '%s\n' "${http_get[@]}" | grep "broadcast" | grep -o "194[0-5]" )    # only set if the broadcast date is between 1940 and 1945
    echo "${i} - ${http_code} - ${video_date}"    # serial being tried, HTTP status code, date
}

get_video(){
    # video_name=$( printf '%s\n' "${http_get[@]}" | grep '"h2--title"' | sed 's/<[^>]*>//g;s/^[ \t]*//' )    # not used right now
    video_author=$( printf '%s\n' "${http_get[@]}" | grep -A1 '"h3--title"' | sed 's/<[^>]*>//g;s/^[ \t]*//;/^$/d' )    # author of the document
    mkdir -p "${HOME}/INA/${video_author}"    # create the author directory
    youtube-dl -o "${HOME}/INA/${video_author}/%(title)s.%(ext)s" "${1}"    # download the video into it
}

loop_function(){
    get_http_code "${url}${i}" || return 1
    if [[ ${http_code} -eq 200 && -n ${video_date} ]]; then
        hits=$(( $(wc -l < "${found_log}") + 1 ))
        echo "${url}${i}" >> "${found_log}"    # a plain counter would be lost when this subshell exits, hence the shared log file
        if (( hits % 10 == 0 )); then
            sleep 5    # every 10th hit waits a little before downloading, to avoid hammering the website
        fi
        get_video "${url}${i}"
    fi
}

for i in {85000000..86999999}; do
    sleep 0.25    # start a new background inquiry every 250 ms
    ( loop_function || { sleep $((RANDOM % 60)); loop_function; } ) &    # if the inquiry failed (server overloaded), retry once after a random delay
done
wait    # let the remaining background jobs finish
echo "$(wc -l < "${found_log}") videos were found."
But that's an extremely inefficient way to do it.
The Institute's website is quite slow and fragile, so to avoid overloading it I need to leave 250 ms between queries.
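If I stick with the brute-force route, I suppose something like xargs -P would at least put a hard cap on the number of simultaneous requests instead of firing everything into the background. A rough sketch, assuming the per-serial check above were moved into a standalone script that takes one serial as its argument (check_one.sh is just a name I made up):

# run at most 4 checks in parallel, one serial per invocation
seq 85000000 86999999 | xargs -P 4 -n 1 ./check_one.sh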
Another, smarter way to do it would be to use the website's AJAX search API: https://www.ina.fr/layout/set/ajax/recherche/result?
But I have next to zero knowledge of AJAX, and the client requests seem to be hashed/encoded in some way before being sent.
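For what it's worth, replaying the request itself is easy enough once it has been captured: in the browser's DevTools (Network tab) you can right-click the XHR call, pick "Copy as cURL", and script around that. A minimal sketch, with the query string left as a placeholder because I don't know what the endpoint actually expects:

# paste the query string captured from the browser in place of the placeholder
curl --silent 'https://www.ina.fr/layout/set/ajax/recherche/result?<captured-query-string>' \
     -H 'X-Requested-With: XMLHttpRequest'    # header browsers commonly send with AJAX calls

The part I can't figure out is how those request parameters get built/encoded in the first place.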
Would you know a way to do it?