#1 2012-01-20 19:52:52

Llama
Banned
From: St.-Petersburg, Russia
Registered: 2008-03-03
Posts: 1,379

lynx -dump

Hi,

While trying to convert HTML to plain text, I got a result that shook me considerably.

$ lynx -dump http://lib.rus.ec/b/79105/read > Blue.txt

This is a real URL, so you can easily try it yourself. I expected a decent rendering of this fairly plain source in Blue.txt, and I got that all right. I was afraid of getting garbage instead of the Russian characters, which happens now and again, but I got something I had never seen before: correctly latinized Cyrillic text. That's way too clever. Is there a way to ask lynx to keep the original characters?

#2 2012-01-20 20:04:19

karol
Archivist
Registered: 2009-05-06
Posts: 25,440

Re: lynx -dump

curl http://lib.rus.ec/b/79105/read | html2text > Blue.txt

seems to get you some Cyrillic and some garbage.

Edit: No need for curl ;P

html2text -o Blue.txt http://lib.rus.ec/b/79105/read

Edit 2: Using https://aur.archlinux.org/packages.php?ID=37064 with the '-utf8' switch

html2text -utf8 -o Blue.txt http://lib.rus.ec/b/79105/read

gives slightly better results.

Last edited by karol (2012-01-20 20:16:38)

#3 2012-01-20 20:55:35

bohoomil
Banned
Registered: 2010-09-04
Posts: 2,377

Re: lynx -dump

w3m -dump -T text/html < read.htm > blue.txt

#4 2012-01-21 05:28:48

Llama
Banned
From: St.-Petersburg, Russia
Registered: 2008-03-03
Posts: 1,379

Re: lynx -dump

How did I forget about pandoc?

$ pandoc blue.html -o blue.txt

I'm still curious about the lynx potential.
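
One direction that may be worth trying with lynx itself, untested against this particular page: lynx has a -display_charset option for the output charset and an -assume_charset option for pages that do not declare their own. The latinized output might simply mean the display charset in use cannot represent Cyrillic. A minimal sketch, where the dump_utf8 wrapper name is made up for illustration and UTF-8 as the fallback input charset is only an assumption:

```shell
#!/bin/sh
# Sketch: ask lynx to emit UTF-8 rather than transliterating Cyrillic.
# dump_utf8 is a hypothetical helper name, not part of lynx.
dump_utf8() {
    # $1: URL or local HTML file
    # -display_charset: charset used for the dumped output
    # -assume_charset:  fallback input charset when the page declares none
    #                   (UTF-8 here is an assumption about the source page)
    lynx -dump -display_charset=utf-8 -assume_charset=utf-8 "$1"
}
```

It would be used as dump_utf8 http://lib.rus.ec/b/79105/read > Blue.txt, which needs lynx installed and network access. If the result still comes out latinized, the character set configured in lynx.cfg or the terminal locale would be the next things to check.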
