[lug] Bulk wget from archive.org
Tyler Cipriani
tyler at tylercipriani.com
Thu Jul 20 13:40:49 MDT 2017
On 17-07-20 12:40:19, Jed S. Baer wrote:
>Found out yesterday that there are 355 issues of the old Sci-Fi pulp
>magazine "Galaxy" available at archive.org. So I went about trying to do
>a bulk download. So far, no luck. Here's some stuff on that.
>
>http://blog.archive.org/2012/04/26/downloading-in-bulk-using-wget/
>https://gareth.halfacree.co.uk/2013/04/bulk-downloading-collections-from-archive-org
>
>On the 2nd one of those, scrolling down a bit in comments:
>"I get the expected results with this command:
>
>wget -r -H -nc -np -nH --cut-dirs=2 -e robots=off -i ../acorn -B
>'http://archive.org/download/' -A .pdf"
>
>(where ../acorn is an item list)
>
>Which I guess is old, because it looks as if archive.org has changed
>things up a bit. The above gets me a bunch of 404 errors.
I think I ran into this before when downloading some of the fantastic
David W. Niven Collection of early jazz tapes[0].
I made a script to do these bulk downloads and (as is my wont) saved it
in my dotfiles; it may yet be useful in this situation[1].
It takes as input a csv of identifiers, which you get by following the
instructions on the _Downloading in bulk using wget_[2] page, and outputs a
list of urls to download the ogg files therein. A minor tweak to a few
lines[3] should enable you to download whatever extension you want to
target (changing `.ogg` to `.pdf` on line 52 looks like what you
want). The script requires pyquery and requests; both are packaged for
Debian (as python-pyquery and python-requests).
Usage is:
./archive-m3u.py < search.csv > 'open_goldberg_variations.m3u'
This script is also about 20 lines of code, so it should be simple to
reimplement in your language of choice :)
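For what it's worth, a reimplementation might look roughly like the
sketch below. This is not the script linked at [1]: rather than scraping
pages with pyquery and requests, it assumes the JSON metadata endpoint at
archive.org/metadata/<identifier> (which lists each item's files) and
uses only the Python 3 standard library. The function names and the
`.pdf` default are my own choices for illustration, and I'm assuming the
csv has one identifier per row with an optional "identifier" header line.

```python
#!/usr/bin/env python3
"""Print download URLs for files of one extension in archive.org items."""
import csv
import json
import sys
from urllib.parse import quote
from urllib.request import urlopen

# Assumed endpoints: item metadata as JSON, and the per-file download path.
METADATA_URL = "https://archive.org/metadata/{}"
DOWNLOAD_URL = "https://archive.org/download/{}/{}"


def read_identifiers(lines):
    """Yield identifiers from csv lines, skipping blanks and the header row."""
    for row in csv.reader(lines):
        if row and row[0].strip() and row[0].strip() != "identifier":
            yield row[0].strip()


def file_urls(identifier, files, ext=".pdf"):
    """Build download URLs for entries in `files` whose name ends in `ext`."""
    return [
        DOWNLOAD_URL.format(identifier, quote(f["name"]))
        for f in files
        if f.get("name", "").lower().endswith(ext)
    ]


def fetch_files(identifier):
    """Fetch the file list for one item from the metadata endpoint."""
    with urlopen(METADATA_URL.format(identifier)) as resp:
        return json.load(resp).get("files", [])


def main():
    # Identifiers come in on stdin; URLs go out on stdout, one per line.
    for identifier in read_identifiers(sys.stdin):
        for url in file_urls(identifier, fetch_files(identifier)):
            print(url)


if __name__ == "__main__":
    main()
```

Saved as, say, galaxy-urls.py (a made-up name), it would be run as
`./galaxy-urls.py < search.csv > urls.txt`, and the resulting list can be
handed to `wget -i urls.txt`.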
-- Tyler
[0]. <https://archive.org/details/davidwnivenjazz&tab=about>
[1]. <https://github.com/thcipriani/dotfiles/blob/master/bin/archive-m3u.py>
[2]. <https://blog.archive.org/2012/04/26/downloading-in-bulk-using-wget/>
[3]. <https://github.com/thcipriani/dotfiles/blob/master/bin/archive-m3u.py#L52-L53>