#1 2012-01-19 04:29:47

cfr
Member
From: Cymru
Registered: 2011-11-27
Posts: 7,132

perl-libwww: possible to replace lwp-rget?

I use a perl script (available from https://github.com/jgm/sep-offprint) to download nicely formatted entries from an online encyclopaedia. (Actually, I use a very slightly modified version of this.)

I therefore tried to test this in my new Arch install and have run into problems. Initially, I thought the problem had to do with stuff not being installed but it turns out that the relevant module is installed through pacman (perl-libwww). However, lwp-rget, which used to be provided by this module, has been dropped from more recent versions. Since sep-offprint hasn't been updated since 2008, I am not expecting it to be made compatible any time soon.

I'm therefore wondering if there is any relatively straightforward way of replacing lwp-rget in the script. I had a look at the commands which are still provided by the module (lwp-download and lwp-request looked the most promising), but they don't offer the command-line options that sep-offprint uses with lwp-rget, so I certainly cannot use one as a drop-in replacement.

My knowledge of perl is minimal, to say the least, so this is the point at which I start to think it might be easier to just write the articles out with a pencil. If it were a shell script I might stand some chance...

The relevant part of the script is:

# download all the HTML files
print STDERR "Retrieving files...\n";
$downloadedFiles = `lwp-rget --limit=200 $source/index.html 2>&1`;
(-e "index.html") or die "Could not retrieve files from $source\nAre you sure you have the right entry name?\n"; 

If there is an easyish work around, I would be really grateful if somebody could point me to it.

The relevant part of the code _looks_ as if it is just a shell command, but I'm not sure how shell commands integrate into perl scripts or whether replacing it with something non-perl would be a possibility. (If I can turn it into a shell scripting problem, I might be able to have a go at replacing it myself. I'm guessing the --limit option limited downloads in some way - to 200 files? 200 kB?)


CLI Paste | How To Ask Questions

Arch Linux | x86_64 | GPT | EFI boot | refind | stub loader | systemd | LVM2 on LUKS
Lenovo x270 | Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz | Intel Wireless 8265/8275 | US keyboard w/ Euro | 512G NVMe INTEL SSDPEKKF512G7L


#2 2012-01-19 20:20:17

juster
Forum Fellow
Registered: 2008-10-07
Posts: 195

Re: perl-libwww: possible to replace lwp-rget?

The backticks (`) work just like in shell programming: the output of the lwp-rget command is stored in $downloadedFiles. You could replace lwp-rget with any other command that mirrors a web page and echoes the files it downloads. wget, for instance, mirrors web pages easily. lwp-rget's --limit seems to cap the download at 200 documents. That could be handy, so you don't mirror the entire encyclopedia...
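
To illustrate the point about backticks, here is a minimal shell sketch of what gets captured. fake_fetch is a made-up stand-in for a download tool (it is not part of sep-offprint or libwww); the only thing it is meant to show is that backticks collect stdout, and that the 2>&1 in the script's command pulls stderr into the captured text as well.

```shell
#!/bin/sh
# Hypothetical stand-in for a download tool:
# one line on stdout, one line on stderr.
fake_fetch() {
    echo "index.html"
    echo "warning: something went wrong" >&2
}

# Backticks capture stdout only; stderr is thrown away here.
stdout_only=`fake_fetch 2>/dev/null`

# With 2>&1 inside the backticks, stderr is captured too.
both=`fake_fetch 2>&1`

printf 'stdout only: %s\n' "$stdout_only"
printf 'with 2>&1:\n%s\n' "$both"
```

So whatever the replacement command writes to either stream ends up in $downloadedFiles as long as the 2>&1 stays in place.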


#3 2012-01-20 02:25:28

cfr
Member
From: Cymru
Registered: 2011-11-27
Posts: 7,132

Re: perl-libwww: possible to replace lwp-rget?

That's very helpful - thanks very much. I had a read of the wget man page and could not find anything quite like the option for lwp-rget so I experimented a little and came up with this:

$downloadedFiles = `wget -Q2m $source/index.html 2>&1`;

Initially, I tried:

$downloadedFiles = `wget -Q2m -E -k -p $source/index.html 2>&1`;

but that seemed to give me errors.

Actually, I'm still getting errors but I am also getting an article downloaded and formatted in a reasonable fashion. So I'm not sure whether I need to worry about the errors or not. I've tried a couple of different articles and one form of error seems to concern html2ps:

Retrieving files...
Creating offprint...
Use of assignment to $[ is deprecated at /usr/bin/html2ps line 3408.
Warning: cannot open the global resource file: /it/sw/share/www/lib/html2ps/html2psrc
*** Error opening ../../symbols/sepman-icon.jpg
*** Error opening ../../symbols/inpho.png
*** Error opening ../../symbols/pp.gif
*** Error opening --2012-01-20
*** Error opening 02:07:23--

This seems to be partly a complaint about html2ps itself, in which case I'm just going to hope it gets updated at some point. The errors about not finding the images are, I think, because I had problems trying to use the -p option with wget, which I'd expected to address that issue. Might be important for logic articles, I guess, but probably just for prettiness otherwise.
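
The last two "Error opening" lines (--2012-01-20 and 02:07:23--) look like pieces of a wget log timestamp rather than file names, which would fit with every word of the captured output being treated as a file. That is a guess; how sep-offprint actually uses $downloadedFiles has not been checked here. A rough shell sketch of one way around it, assuming all that is needed is a clean list of file names: download quietly, then list what actually landed in the directory. fetch_entry and list_downloaded are hypothetical names, and the wget flags are a suggestion rather than a tested replacement for lwp-rget --limit=200.

```shell
#!/bin/sh
# fetch_entry URL: hypothetical helper. Quietly (-q) follows links one
# level deep (-r -l 1), stays below the starting directory (-np) and
# saves everything flat into the current directory (-nd).
# The -Q quota is documented as never affecting a single-file download,
# so it is paired with -r here.
fetch_entry() {
    wget -q -r -l 1 -np -nd -Q 2m "$1"
}

# list_downloaded DIR: print the name of every regular file directly
# inside DIR, one per line, so no log text is mixed in with the names.
list_downloaded() {
    find "$1" -maxdepth 1 -type f | sed 's|.*/||' | sort
}
```

Something like `fetch_entry "$source/index.html"; list_downloaded .` would then print only file names, with nothing from the wget log for html2ps to trip over.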

The second error I see after the pdf version of the article is created:

cannot remove path when cwd is /tmp/phonLCKPUg for /tmp/phonLCKPUg:  at /usr/share/perl5/core_perl/File/Temp.pm line 902

I've followed a good many Google threads on this one and, as I understand it, the solution is to include a line which tells perl to change directories before trying to clean up the temporary directory. Google also suggested a fix:

chdir($ENV{HOME}) || chdir('/');

I wasn't sure where to put this but putting it as the last non-comment line of the file seems to work - I don't see errors about not being able to clean up.
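
For anyone who reads shell more easily than perl, the same "step out of the directory, then clean it up" ordering can be sketched as below. with_tmpdir is a hypothetical helper, not something from sep-offprint, and a shell rm may well not complain the way File::Temp does, so this only illustrates the order of the steps.

```shell
#!/bin/sh
# with_tmpdir: hypothetical helper. Works inside a fresh temporary
# directory, then leaves it before deleting it.
with_tmpdir() {
    tmp=$(mktemp -d) || return 1
    cd "$tmp" || return 1
    echo "scratch" > work.txt             # stand-in for the real work
    cd "${HOME:-/}" 2>/dev/null || cd /   # step out first
    rm -rf "$tmp"                         # then remove the directory
    printf '%s\n' "$tmp"                  # report which directory was used
}
```

Calling it as `used=$(with_tmpdir)` keeps the cd inside a subshell, so the caller's working directory is left alone.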

I always worry about hacking things I don't understand (so pretty much everything), but this does seem an improvement, and I think the -Q2m option to wget should stop it trying to download the entire encyclopaedia and filling my entire disk. And the perl code I copied looks straightforward enough - I take it that it is pretty much like:

cd ${HOME} || cd /

Thanks very much for telling me I could stick something non-perl in the lwp-rget-shaped hole!