
/tech/ - Technology

Freedom Isn't Free

No.1084506

Alright guys. I want to make a full copy of all the footage currently available on the website of the INA (the French National Audiovisual Institute): http://www.ina.fr

Now first, keep in mind that every document has a URL whose identifier begins with either AFE85 or AFE86.

Here's an example: https://www.ina.fr/video/AFE85000955

A low-tech way to find said footage would be to brute-force each URL, so I made a small Bash script:


#!/bin/bash
url="https://www.ina.fr/video/AFE"
# background subshells can't update the parent shell's variables,
# so record hits in a temp file instead of a counter
hits_log=$(mktemp)

get_http_code(){
    # fetch the page; curl appends the HTTP status code on its own line
    readarray -t http_get < <( curl --silent --show-error --write-out '\n%{http_code}' "$1" )
    http_code=$( printf '%s\n' "${http_get[@]}" | tail -n 1 )
    [[ -n ${http_code} ]] || return 1
    # keep the broadcast date only when it falls between 1940 and 1945
    video_date=$( printf '%s\n' "${http_get[@]}" | grep "broadcast" | grep -o "194[0-5]" )
    echo "${i} - ${http_code} - ${video_date}" # serial being tried and its HTTP code
}

get_video(){
    # video_name=$( printf '%s\n' "${http_get[@]}" | grep '"h2--title"' | sed 's/<[^>]*>//g;s/^[ \t]*//' ) # not used right now
    video_author=$( printf '%s\n' "${http_get[@]}" | grep -A1 '"h3--title"' | sed 's/<[^>]*>//g;s/^[ \t]*//;/^$/d' ) # author of the document
    mkdir -p "${HOME}/INA/${video_author}" # one directory per author
    youtube-dl -o "${HOME}/INA/${video_author}/%(title)s.%(ext)s" "$1" # download video into author directory
}

loop_function(){
    get_http_code "${url}${i}" || return 1
    if [[ ${http_code} -eq 200 ]] && [[ -n ${video_date} ]]; then
        echo "${url}${i}" >> "${hits_log}"
        get_video "${url}${i}"
    fi
}

for i in {85000000..86999999}; do
    sleep 0.25 # start a new query in the background every 250 ms
    # allow no more than 10 concurrent jobs, to avoid crashing the website
    while (( $(jobs -rp | wc -l) >= 10 )); do sleep 1; done
    # if the query failed because the server was overloaded, retry once after a random pause
    ( loop_function || { sleep $((RANDOM % 60)); loop_function; } ) &
done

wait
echo "$(wc -l < "${hits_log}") videos were found."

But that's an extremely inefficient way to do it.

The Institute's website is quite slow and fragile, so to avoid overloading it I need to leave 250 ms between queries.
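One way to keep the rate limit without the per-job sleep bookkeeping might be a fixed-size worker pool via `xargs -P`. This is only a sketch: the pool size of 4 is an arbitrary polite guess, and `make_url`/`probe` are helper names I made up, not anything from the site.

```shell
#!/bin/bash
# Sketch: bound concurrency with a worker pool instead of spawning one
# background subshell per serial. xargs -P caps the number of simultaneous
# curl processes, so no manual counters are needed.
make_url() { echo "https://www.ina.fr/video/AFE$1"; }

probe() {
    local url code
    url=$(make_url "$1")
    # fetch only the HTTP status code, discarding the body
    code=$(curl --silent --output /dev/null --write-out '%{http_code}' "$url")
    [[ "$code" == "200" ]] && echo "$url"
}
export -f make_url probe

# Set RUN_PROBE=1 to actually hit the server (shown here on a tiny sample range).
if [[ "${RUN_PROBE:-0}" == "1" ]]; then
    printf '%s\n' {85000000..85000099} | xargs -P 4 -n 1 bash -c 'probe "$1"' _
fi
```

With 4 workers and ~250 ms per request this stays around the same request rate as the sleep-based loop, but never piles up more than 4 connections at once.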

Another, smarter way to do it would be to use the website's AJAX API: https://www.ina.fr/layout/set/ajax/recherche/result?

But I have next to zero knowledge of AJAX, and client requests seem to be somehow hashed/encoded before being sent.
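A common way to reverse-engineer such an endpoint is to run a search in the browser with the network tab open, then use "Copy as cURL" on the XHR request and replay it from a script. The sketch below assumes that works here; the parameter names `q` and `b` are pure placeholders — the real ones (and any required headers or cookies) have to be read from the captured request.

```shell
#!/bin/bash
# Sketch: replay the search XHR captured from the browser's dev tools.
# Parameter names (q = query, b = offset) are guesses, not a documented API.
search_page() {
    curl --silent \
         --header 'X-Requested-With: XMLHttpRequest' \
         "https://www.ina.fr/layout/set/ajax/recherche/result?q=$1&b=$2"
}

# The response is presumably an HTML fragment, so notice IDs can be pulled
# out of it by grepping for the AFE85/AFE86 serial pattern:
extract_ids() {
    grep -o 'AFE8[56][0-9]\{6\}' | sort -u
}

# Example (only run against the live site once the real parameters are known):
# search_page "AFE85" 0 | extract_ids
```

If that works, paging through search results with the offset parameter would enumerate only existing documents, instead of probing two million candidate serials.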

Would you know a way to do it?


