[lug] Bulk wget from archive.org

Jed S. Baer blug at jbaer.cotse.net
Thu Jul 20 12:40:19 MDT 2017


Found out yesterday that there are 355 issues of the old sci-fi pulp
magazine "Galaxy" available at archive.org, so I set about trying to do a
bulk download. So far, no luck. Here's some reading on that:

http://blog.archive.org/2012/04/26/downloading-in-bulk-using-wget/
https://gareth.halfacree.co.uk/2013/04/bulk-downloading-collections-from-archive-org

In the second of those, scrolling down a bit in the comments, someone writes:
"I get the expected results with this command:

wget -r -H -nc -np -nH --cut-dirs=2 -e robots=off -i ../acorn \
  -B 'http://archive.org/download/' -A .pdf"

(where ../acorn is a file listing one item identifier per line)
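
For reference (from the wget man page), what those flags do:

  -r             recurse into links
  -H             span hosts (needed when the server redirects elsewhere)
  -nc            no-clobber: skip files that already exist locally
  -np            never ascend to the parent directory
  -nH            don't create per-host directories
  --cut-dirs=2   drop the first two path components when saving
  -e robots=off  ignore robots.txt
  -i ../acorn    read start URLs from the file ../acorn
  -B URL         treat each line of ../acorn as relative to URL
  -A .pdf        keep only files whose names end in .pdf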

Which I guess is dated advice, because it looks as if archive.org has
changed things up a bit: the above gets me a bunch of 404 errors.
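
One thing I could check is what a bare download URL returns these days.
A diagnostic guess, using an identifier from the list below; with -S and
--max-redirect=0, wget prints the response headers and stops at the first
redirect instead of following it:

wget -S --max-redirect=0 -O /dev/null \
  'https://archive.org/download/Galaxy_v38n03_1977-05'

If that 302s off to some ia*.us.archive.org datanode, the -H flag is doing
real work; if it 404s outright, the URL form itself has changed.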

I changed it around a bit, to this:
wget -r -l 1 -nc -np -nd -e robots=off -i shortlist.txt \
  -B 'https://archive.org/details/' -A .pdf

Because when you go here:
https://archive.org/details/galaxymagazine&tab=collection and click on a
specific issue, it lands under 'details', not 'download', and I figured
wget would find the PDF links on the page and snag the files, because of
'-r -l 1'. However, wget deletes the downloaded page before it spiders it
for PDF links, because it isn't a PDF. I looked for a 'defer delete'
option or something like it, but no dice. Also, the downloaded page
doesn't have an 'html' or 'htm' suffix, so I'm not sure how to tell wget
to keep it via the -A option without also grabbing a bunch of stuff I
don't want (yeah, I care about data usage, for some reason -- habit, I
guess).
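
One variant I haven't tried yet: point wget back at the 'download'
listing pages instead of 'details', keep -H so it can follow the redirect
to the datanode, and fence the recursion in with -D. A sketch, untested
against the current site:

wget -r -l 1 -H -nd -nc -e robots=off -D archive.org \
  -A .pdf -i shortlist.txt -B 'https://archive.org/download/'

The listing should come back with a text/html content type, so wget ought
to parse it for links even without an .html suffix, keep the .pdf
matches, and discard the listing itself afterward.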

Any suggestions? Here's a short list of item identifiers, if anyone wants
to play around (there's a loop sketch after the list).

Galaxy_v38n03_1977-05
Galaxy_v38n04_1977-06
Galaxy_v38n05_1977-07
Galaxy_v38n06_1977-08
Galaxy_v38n07_1977-09
Galaxy_v38n08_1977-10
Galaxy_v38n09_1977-11
Galaxy_v39n01_1978-01
Galaxy_v39n02_1978-02
Galaxy_v39n03_1978-03
Galaxy_v39n04_1978-04
Galaxy_v39n06_1978-06
Galaxy_v39n07_1978-09
Galaxy_v39n08_1978-12
Galaxy_v39n09_1979-03
Galaxy_v39n10_1979-07
Galaxy_v39n11_1979-10
Galaxy_v40n01_1980-07

Also, noting the Python app mentioned, it _requires_ a login at
archive.org, even for operations where archive.org itself doesn't require
one. So I'm pondering a little Python coding exercise here, but I'm still
curious whether there's a way to do this with wget, other than first
downloading all the pages for all the issues and then telling wget to use
those as input. That's inelegant.
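
(If it comes to scripting: archive.org exposes an unauthenticated
metadata endpoint, https://archive.org/metadata/<identifier>, which
returns JSON including a "files" array. A rough shell sketch, assuming
each file's "name" field appends cleanly to the download URL and the
names contain no spaces:

for id in $(cat shortlist.txt); do
  # grab the item's file list and keep just the PDF names
  wget -q -O - "https://archive.org/metadata/${id}" \
    | grep -o '"name": *"[^"]*\.pdf"' \
    | sed 's/.*"name": *"//; s/"$//' \
    | while read -r f; do
        wget -nc "https://archive.org/download/${id}/${f}"
      done
done

Doing the same thing in Python with the standard-library json module
would be the cleaner version of that coding exercise, and needs no
login.)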

