[lug] Bulk wget from archive.org

Jed S. Baer blug at jbaer.cotse.net
Thu Jul 20 12:40:19 MDT 2017


Found out yesterday that there are 355 issues of the old sci-fi pulp
magazine "Galaxy" available at archive.org, so I set about trying to do a
bulk download. So far, no luck. Here's some reading on that:

http://blog.archive.org/2012/04/26/downloading-in-bulk-using-wget/
https://gareth.halfacree.co.uk/2013/04/bulk-downloading-collections-from-archive-org

In the second of those, scrolling down a bit in the comments, someone writes:
"I get the expected results with this command:

wget -r -H -nc -np -nH --cut-dirs=2 -e robots=off -i ../acorn \
  -B 'http://archive.org/download/' -A .pdf"

(where ../acorn is a file listing one item identifier per line)
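
For reference (from the wget man page), what those flags do:

  -r             recurse into links
  -H             span hosts (needed when the server redirects elsewhere)
  -nc            no-clobber: skip files that already exist locally
  -np            never ascend to the parent directory
  -nH            don't create per-host directories
  --cut-dirs=2   drop the first two path components when saving
  -e robots=off  ignore robots.txt
  -i ../acorn    read start URLs from the file ../acorn
  -B URL         treat each line of ../acorn as relative to URL
  -A .pdf        keep only files whose names end in .pdf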

Which I guess is dated advice, because it looks as if archive.org has
changed things up a bit: the above gets me a bunch of 404 errors.
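
One thing I could check is what a bare download URL returns these days.
A diagnostic guess, using an identifier from the list below; with -S and
--max-redirect=0, wget prints the response headers and stops at the first
redirect instead of following it:

wget -S --max-redirect=0 -O /dev/null \
  'https://archive.org/download/Galaxy_v38n03_1977-05'

If that 302s off to some ia*.us.archive.org datanode, the -H flag is doing
real work; if it 404s outright, the URL form itself has changed.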

I changed it around a bit, to this:
wget -r -l 1 -nc -np -nd -e robots=off -i shortlist.txt \
  -B 'https://archive.org/details/' -A .pdf

Because when you go here:
https://archive.org/details/galaxymagazine&tab=collection and click on a
specific issue, it lands under 'details', not 'download', and I figured
wget would find the PDF links on the page and snag the files, because of
'-r -l 1'. However, wget deletes the downloaded page before it spiders it
for PDF links, because it isn't a PDF. I looked for a 'defer delete'
option or something like it, but no dice. Also, the downloaded page
doesn't have an 'html' or 'htm' suffix, so I'm not sure how to tell wget
to keep it via the -A option without also grabbing a bunch of stuff I
don't want (yeah, I care about data usage, for some reason -- habit, I
guess).
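
One variant I haven't tried yet: point wget back at the 'download'
listing pages instead of 'details', keep -H so it can follow the redirect
to the datanode, and fence the recursion in with -D. A sketch, untested
against the current site:

wget -r -l 1 -H -nd -nc -e robots=off -D archive.org \
  -A .pdf -i shortlist.txt -B 'https://archive.org/download/'

The listing should come back with a text/html content type, so wget ought
to parse it for links even without an .html suffix, keep the .pdf
matches, and discard the listing itself afterward.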

Any suggestions? Here's a short list of item identifiers, if anyone wants
to play around (there's a loop sketch after the list).

Galaxy_v38n03_1977-05
Galaxy_v38n04_1977-06
Galaxy_v38n05_1977-07
Galaxy_v38n06_1977-08
Galaxy_v38n07_1977-09
Galaxy_v38n08_1977-10
Galaxy_v38n09_1977-11
Galaxy_v39n01_1978-01
Galaxy_v39n02_1978-02
Galaxy_v39n03_1978-03
Galaxy_v39n04_1978-04
Galaxy_v39n06_1978-06
Galaxy_v39n07_1978-09
Galaxy_v39n08_1978-12
Galaxy_v39n09_1979-03
Galaxy_v39n10_1979-07
Galaxy_v39n11_1979-10
Galaxy_v40n01_1980-07

Also, noting the Python app mentioned, it _requires_ a login at
archive.org, even for operations where archive.org itself doesn't require
one. So I'm pondering a little Python coding exercise here, but I'm still
curious whether there's a way to do this with wget, other than first
downloading all the pages for all the issues and then telling wget to use
those as input. That's inelegant.
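
(If it comes to scripting: archive.org exposes an unauthenticated
metadata endpoint, https://archive.org/metadata/<identifier>, which
returns JSON including a "files" array. A rough shell sketch, assuming
each file's "name" field appends cleanly to the download URL and the
names contain no spaces:

for id in $(cat shortlist.txt); do
  # grab the item's file list and keep just the PDF names
  wget -q -O - "https://archive.org/metadata/${id}" \
    | grep -o '"name": *"[^"]*\.pdf"' \
    | sed 's/.*"name": *"//; s/"$//' \
    | while read -r f; do
        wget -nc "https://archive.org/download/${id}/${f}"
      done
done

Doing the same thing in Python with the standard-library json module
would be the cleaner version of that coding exercise, and needs no
login.)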

