Pages: 1
Hi,
Trying to convert HTML to plain text, I got a result that shook me considerably.
$ lynx -dump http://lib.rus.ec/b/79105/read > Blue.txt
This is a real URL, so you can try it easily. I expected a decent rendering of this fairly plain source in Blue.txt, and I got one. I was afraid of getting garbage instead of the Russian characters, which happens now and again, but I got something I had never seen before: correctly latinized Cyrillic text. That's way too clever. Is there a way to ask lynx to keep the original characters?
Offline
$ curl http://lib.rus.ec/b/79105/read | html2text > Blue.txt
seems to get you some Cyrillic, some garbage.
Edit: No need for curl ;P
$ html2text -o Blue.txt http://lib.rus.ec/b/79105/read
Edit 2: Using https://aur.archlinux.org/packages.php?ID=37064 with the '-utf8' switch
$ html2text -utf8 -o Blue.txt http://lib.rus.ec/b/79105/read
gives slightly better results.
Last edited by karol (2012-01-20 20:16:38)
Offline
$ cat read.htm | w3m -dump -T text/html > blue.txt
Offline
How did I forget about pandoc?
$ pandoc blue.html -o blue.txt
I'm still curious about the lynx potential.
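On the lynx side, the latinized output looks like lynx approximating characters that its display character set cannot represent, so telling it to display UTF-8 may be enough. Below is a minimal sketch, not a confirmed fix: it assumes the installed lynx accepts the -display_charset option and that UTF-8 output is what is wanted. The function names dump_utf8 and has_cyrillic are made up for illustration, and has_cyrillic is only a byte-level heuristic that assumes the file is UTF-8.

```sh
#!/bin/sh
# Sketch only: dump_utf8 and has_cyrillic are hypothetical helper names,
# not part of lynx.

# Dump a page as plain text, asking lynx to emit UTF-8 instead of
# approximating characters its display charset cannot show.
# $1 = URL or local HTML file, $2 = output file
dump_utf8() {
    lynx -dump -display_charset=utf-8 "$1" > "$2"
}

# Succeed if the file contains byte 0xD0 or 0xD1, the UTF-8 lead bytes
# of the basic Cyrillic range U+0400 to U+047F. Heuristic: it assumes
# the file really is UTF-8.
has_cyrillic() {
    LC_ALL=C grep -q "$(printf '[\320\321]')" "$1"
}

# Usage (commented out so sourcing this file has no side effects):
# dump_utf8 http://lib.rus.ec/b/79105/read Blue.txt
# has_cyrillic Blue.txt && echo "Cyrillic kept" || echo "no Cyrillic found"
```

If the page does not declare its own encoding, lynx also has an -assume_charset option that may be worth trying alongside this.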
Offline