[lug] Bulk wget from archive.org

Tyler Cipriani tyler at tylercipriani.com
Thu Jul 20 13:40:49 MDT 2017


On 17-07-20 12:40:19, Jed S. Baer wrote:
>Found out yesterday that there are 355 issues of the old Sci-Fi pulp
>magazine "Galaxy" available at archive.org. So I went about trying to do
>a bulk download. So far, no luck. Here's some stuff on that.
>
>http://blog.archive.org/2012/04/26/downloading-in-bulk-using-wget/
>https://gareth.halfacree.co.uk/2013/04/bulk-downloading-collections-from-archive-org
>
>On the 2nd one of those, scrolling down a bit in comments:
>"I get the expected results with this command:
>
>wget -r -H -nc -np -nH --cut-dirs=2 -e robots=off -i ../acorn -B
>http://archive.org/download/’ -A .pdf"
>
>(where ../acorn is an item list)
>
>Which I guess is old, because it looks as if archive.org has changed
>things up a bit. The above gets me a bunch of 404 errors.

I think I ran into this before when downloading some of the fantastic 
David W. Niven Collection of early jazz tapes[0].

I made a script to do these bulk downloads and (as is my wont) saved it 
in my dotfiles; it may yet be useful in this situation[1].

It takes as input a CSV of identifiers, which you get by following the 
instructions on the _Downloading in bulk using wget_[2] page, and outputs 
a list of URLs for the ogg files in those items. A minor tweak to a few 
lines[3] should let you target whatever extension you want (changing 
`.ogg` to `.pdf` on line 52 looks like what you want). The script 
requires pyquery and requests; both are packaged for Debian (as 
python-pyquery and python-requests).

Usage is:

    ./archive-org-m3u.py < search.csv > 'open_goldberg_variations.m3u'

This script is also about 20 lines of code, so it should be simple to 
reimplement in your language of choice :)
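
A rough sketch of such a reimplementation in Python might look like the 
following. It is not the script linked at [1]: instead of scraping pages 
with pyquery and requests, it assumes the archive.org metadata endpoint 
(`https://archive.org/metadata/<identifier>`) returns JSON with a `files` 
list, and it uses only the standard library. The function names and the 
extension argument are made up for illustration, and the CSV is assumed 
to hold a single `identifier` column with a header row.

```python
#!/usr/bin/env python3
"""Print download URLs for files with a given extension in archive.org items."""
import csv
import json
import sys
import urllib.request
from urllib.parse import quote

# Assumed URL layouts for the metadata endpoint and for file downloads.
METADATA_URL = "https://archive.org/metadata/{}"
DOWNLOAD_URL = "https://archive.org/download/{}/{}"


def read_identifiers(stream):
    """Yield item identifiers from the search CSV, skipping the header row."""
    for row in csv.reader(stream):
        if row and row[0] and row[0] != "identifier":
            yield row[0]


def matching_urls(identifier, files, extension):
    """Build download URLs for files whose names end with `extension`."""
    return [
        DOWNLOAD_URL.format(identifier, quote(f["name"]))
        for f in files
        if f.get("name", "").lower().endswith(extension)
    ]


def fetch_files(identifier):
    """Fetch an item's file list from the metadata endpoint (needs network)."""
    with urllib.request.urlopen(METADATA_URL.format(identifier)) as resp:
        return json.load(resp).get("files", [])


def main(extension):
    """Read identifiers on stdin and print one matching URL per line."""
    for identifier in read_identifiers(sys.stdin):
        for url in matching_urls(identifier, fetch_files(identifier), extension):
            print(url)


# Only runs when an extension is given, e.g.: ./sketch.py .pdf < search.csv
if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

Saved under a hypothetical name such as sketch.py, it would be run as 
`./sketch.py .pdf < search.csv > urls.txt`, and the resulting list could 
then be handed to `wget -i urls.txt`.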

-- Tyler

[0]. <https://archive.org/details/davidwnivenjazz&tab=about>
[1]. <https://github.com/thcipriani/dotfiles/blob/master/bin/archive-m3u.py>
[2]. <https://blog.archive.org/2012/04/26/downloading-in-bulk-using-wget/>
[3]. <https://github.com/thcipriani/dotfiles/blob/master/bin/archive-m3u.py#L52-L53>