Is there an app to download all PDFs from a page?

I found a site that has a lot of my favorite magazines in PDF format. Seeing as they are a lot, doing it manually would take ages. Anyone know of an app that will allow me to download them all at once?

Perhaps this Firefox add-on?

Question 1. How to use this extension?

Answer: Say, for example, you have opened a web page with many PDF resources. The most basic workflow is to open the extension popup (via the browser toolbar icon) and click the Load page links button there. This usually fills the resources list with all sorts of links, not only the desired PDFs. So the next step is to filter this list so that only the desired PDFs remain: write pdf in the extension’s text field, and now only PDF-type resources show in the list. If the list is still too large (you want only a smaller subset), you can continue filtering by entering some relevant terms in the next text filter field. Finally, check the desired items individually or all at once and start downloading immediately using the corresponding button (in the bottom right corner).
The downloading items are then visible in the downloads tab of the popup; you can manage this list in the usual manner (pause, resume, open or remove individual items, etc.).

3 Likes

Thanks, I’ll try it now!

1 Like

It’s downloading HTML links but not the PDFs.

1 Like

This is the page, if it helps:
https://archive.org/details/ZX-computing-magazine?&sort=date

I use wget for that purpose. I just copy and paste manually all the links into a file and then run wget on every line in a script.
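
Something along those lines, assuming the links are saved one per line in a file (links.txt is just a placeholder name):

# download every URL listed in links.txt, one per line
while read -r url; do
    wget "$url"
done < links.txt

# or simply let wget read the file itself
wget -i links.txt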

That would take a very long time for these magazines if I were to do that.

How many files are we talking about? Can you give a link to the website?

Are they all linked on the same page? If so, just copy the HTML and extract all the PDF URLs from it using regular expressions. Once you have a list of URLs to download, you can wget them easily and automatically.
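
A rough sketch of what I mean (page.html and pdf-urls.txt are placeholder names, and the pattern assumes absolute links ending in .pdf):

# pull every link ending in .pdf out of the saved HTML, deduplicated
grep -Eo 'https?://[^"]+\.pdf' page.html | sort -u > pdf-urls.txt
# then fetch them all
wget -i pdf-urls.txt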

He already did…

I was just looking at it; they don’t appear to be. You need to drill into each item’s page, and the PDF link is on that detail page.

2 Likes

It looks like they all follow a naming convention. You could probably write some kind of script that extracts the links from the page and then tries to download the files from the download server by modifying the URLs according to a formula.

I am not sure that would be less work than downloading them manually though.

Yeah, it looks like I’m in for manual downloading, which will take hours.
Thanks.

Yeah, that’s going to be a bit more difficult.

It seems you can generate a list of files using

https://archive.org/advancedsearch.php

I haven’t figured out how, yet.

If we can get it to a list of URLs, then it’s easy.

3 Likes

I think I figured it out, just testing a bit…

You should have a solution in a few minutes, I hope.

EDIT:

Here it is.

This command downloads one issue:

wget --no-directories --content-disposition -e robots=off -A.pdf -r http://archive.org/download/ZX-computing-1982-08

Here is a list of all commands for all issues. You can run them one by one, or, if you feel brave, just save them as a script and run them all :slight_smile:

wget --no-directories --content-disposition -e robots=off -A.pdf -r http://archive.org/download/ZX-computing-1982-08
wget --no-directories --content-disposition -e robots=off -A.pdf -r http://archive.org/download/ZX-computing-1987-01
wget --no-directories --content-disposition -e robots=off -A.pdf -r http://archive.org/download/ZX-computing-1985-04
wget --no-directories --content-disposition -e robots=off -A.pdf -r http://archive.org/download/ZX-computing-1984-06
wget --no-directories --content-disposition -e robots=off -A.pdf -r http://archive.org/download/ZX-computing-1986-08
wget --no-directories --content-disposition -e robots=off -A.pdf -r http://archive.org/download/ZX-computing-1984-08
wget --no-directories --content-disposition -e robots=off -A.pdf -r http://archive.org/download/ZX-computing-1986-11
wget --no-directories --content-disposition -e robots=off -A.pdf -r http://archive.org/download/ZX-computing-1982-Sum
wget --no-directories --content-disposition -e robots=off -A.pdf -r http://archive.org/download/ZX-computing-1985-12
wget --no-directories --content-disposition -e robots=off -A.pdf -r http://archive.org/download/ZX-computing-1986-07
wget --no-directories --content-disposition -e robots=off -A.pdf -r http://archive.org/download/ZX-computing-1984-04
wget --no-directories --content-disposition -e robots=off -A.pdf -r http://archive.org/download/ZX-computing-1983-02
wget --no-directories --content-disposition -e robots=off -A.pdf -r http://archive.org/download/ZX-computing-1983-04
wget --no-directories --content-disposition -e robots=off -A.pdf -r http://archive.org/download/ZX-computing-1986-02
wget --no-directories --content-disposition -e robots=off -A.pdf -r http://archive.org/download/ZX-computing-1983-12
wget --no-directories --content-disposition -e robots=off -A.pdf -r http://archive.org/download/ZX-computing-1987-03
wget --no-directories --content-disposition -e robots=off -A.pdf -r http://archive.org/download/ZX-computing-1985-02
wget --no-directories --content-disposition -e robots=off -A.pdf -r http://archive.org/download/ZX-computing-1986-06
wget --no-directories --content-disposition -e robots=off -A.pdf -r http://archive.org/download/ZX-computing-1986-10
wget --no-directories --content-disposition -e robots=off -A.pdf -r http://archive.org/download/ZX-computing-1987-02
wget --no-directories --content-disposition -e robots=off -A.pdf -r http://archive.org/download/ZX-computing-1985-06
wget --no-directories --content-disposition -e robots=off -A.pdf -r http://archive.org/download/ZX-computing-1985-10
wget --no-directories --content-disposition -e robots=off -A.pdf -r http://archive.org/download/ZX-computing-1987-04
wget --no-directories --content-disposition -e robots=off -A.pdf -r http://archive.org/download/ZX-computing-1985-08
wget --no-directories --content-disposition -e robots=off -A.pdf -r http://archive.org/download/ZX-computing-1987-06
wget --no-directories --content-disposition -e robots=off -A.pdf -r http://archive.org/download/ZX-computing-1984-10
wget --no-directories --content-disposition -e robots=off -A.pdf -r http://archive.org/download/ZX-computing-1987-05
wget --no-directories --content-disposition -e robots=off -A.pdf -r http://archive.org/download/ZX-computing-1986-05
wget --no-directories --content-disposition -e robots=off -A.pdf -r http://archive.org/download/ZX-computing-1986-09
wget --no-directories --content-disposition -e robots=off -A.pdf -r http://archive.org/download/ZX-computing-1983-06
wget --no-directories --content-disposition -e robots=off -A.pdf -r http://archive.org/download/ZX-computing-1983-08
wget --no-directories --content-disposition -e robots=off -A.pdf -r http://archive.org/download/ZX-computing-1984-02
wget --no-directories --content-disposition -e robots=off -A.pdf -r http://archive.org/download/ZX-computing-1983-10
wget --no-directories --content-disposition -e robots=off -A.pdf -r http://archive.org/download/ZX-computing-1986-04
wget --no-directories --content-disposition -e robots=off -A.pdf -r http://archive.org/download/ZX-computing-1982-10
wget --no-directories --content-disposition -e robots=off -A.pdf -r http://archive.org/download/ZX-computing-1986-12
wget --no-directories --content-disposition -e robots=off -A.pdf -r http://archive.org/download/ZX-computing-1984-12

There are only 37 files, so it shouldn’t take too long.

How I generated the list...

I went to https://archive.org/advancedsearch.php, and under “Advanced Search returning JSON, XML, and more” in the “Query” field I entered:

collection:ZX-computing-magazine

I increased the number of results per page to 500 (in retrospect, that was unnecessary) and selected CSV as the output format. The resulting file contained:

"identifier"
"ZX-computing-1982-08"
"ZX-computing-1987-01"
"ZX-computing-1984-08"
"ZX-computing-1986-11"
"ZX-computing-1982-Sum"
"ZX-computing-1985-12"
"ZX-computing-1986-07"
"ZX-computing-1985-04"
"ZX-computing-1984-06"
"ZX-computing-1986-08"
"ZX-computing-1984-04"
"ZX-computing-1983-02"
"ZX-computing-1983-04"
"ZX-computing-1986-02"
"ZX-computing-1986-10"
"ZX-computing-1983-12"
"ZX-computing-1987-03"
"ZX-computing-1985-02"
"ZX-computing-1986-06"
"ZX-computing-1987-02"
"ZX-computing-1983-06"
"ZX-computing-1983-08"
"ZX-computing-1985-06"
"ZX-computing-1985-10"
"ZX-computing-1987-04"
"ZX-computing-1985-08"
"ZX-computing-1987-06"
"ZX-computing-1984-10"
"ZX-computing-1987-05"
"ZX-computing-1986-05"
"ZX-computing-1986-09"
"ZX-computing-1986-12"
"ZX-computing-1984-12"
"ZX-computing-1984-02"
"ZX-computing-1983-10"
"ZX-computing-1986-04"
"ZX-computing-1982-10"

Then, with some text-editing magic, I removed the first line and the quotes, and prepended the wget command to each line. A very useful feature for that is block selection in Kate.
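
If you’d rather script that part too, here’s a rough sketch (search.csv is just whatever name the exported file has):

# skip the header line, strip the quotes, and run wget for each identifier
tail -n +2 search.csv | tr -d '"\r' | while read -r id; do
    wget --no-directories --content-disposition -e robots=off -A.pdf -r "http://archive.org/download/$id"
done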

7 Likes

This is most interesting, thank you. :slight_smile: :smiley:

1 Like

That is ingenious, thanks.

1 Like

Although, when downloading, the second one returns this:

bash: n: command not found

The second what?

The second command in the list?

This one?

wget --no-directories --content-disposition -e robots=off -A.pdf -r http://archive.org/download/ZX-computing-1987-01

It works for me…

This should not happen. If you look closely, you’ll see that I am not using the command n anywhere in that script. It’s just wget 37 times, each with a different URL at the end. :man_shrugging:t3:

1 Like

It’s working fine; it was a bad copy-and-paste due to a slight fever.

1 Like

Get well soon. Stay hydrated! :frog:

3 Likes