This is probably first-year IT stuff, but well beyond me. I have a website to translate and need to know how many words and keystrokes it has. Obviously wc is the way forward.
Unfortunately the site has a bunch of subdirectories and doing it manually would take yonks! If this is too stupid then I apologize in advance. But if some kind soul can point me in the right direction I'd be more than grateful. I can do VERY simple bash stuff and even simpler python, but this calls for arrays or other such weird creatures...
Last edited by toad (2009-07-14 22:39:46)
never trust a toad...
::Grateful ArchDonor::
::Grateful Wikipedia Donor::
wget -r?
Last edited by Wintervenom (2009-07-14 20:17:56)
yep, did that to download the entire site - now I just need to count aforesaid words and keystrokes in all the files across all the subdirectories...
html2text will work on multiple html files concatenated
wget -r -O - 'http://.....' | html2text | wc -w
Or if you already have it all
find -iname '*.html' -exec cat '{}' \; | html2text | wc -w
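For anyone trying this at home, here is a minimal self-contained sketch of the same recursive pipeline. The file names and contents are invented for illustration, and a sed expression stands in for html2text as a crude tag stripper so the example runs with POSIX tools alone; swap html2text back in for real pages.

```shell
#!/bin/sh
# Build a throwaway site tree, then count words across all subdirectories.
tmp=$(mktemp -d)
mkdir -p "$tmp/sub"
printf '<p>one two three</p>\n' > "$tmp/index.html"
printf '<p>four five</p>\n' > "$tmp/sub/page.html"

# find descends into subdirectories on its own -- no array needed.
# sed strips the tags here; use html2text instead for real HTML.
count=$(find "$tmp" -iname '*.html' -exec cat '{}' + | sed 's/<[^>]*>//g' | wc -w)
echo "$count"

rm -rf "$tmp"
```

Note the `-exec cat '{}' +` form, which passes many files to one cat invocation instead of spawning cat once per file.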
html2text - hm... Unfortunately the latter command gives me only a count of 87 so I suppose it does not go into subdirectories. Will keep on trying and thanks for the tip!
You could perhaps use an array for that. For example:
find . \( -iname '*.html' -o -ipath '*/*.html' \) -exec cat '{}' \; | html2text | wc -w
I haven't tried that (so I don't know if it works), but it should.
Nice one... The website in question is www.forggensee-yachtschule.de
I just deleted it 'cos I thought I'd try Procyon's solution - but no joy. Am downloading it again to check smartboyathome's suggestion (but I've got a slow connection; it takes about an hour...).
They are all php files, not html. With wget -r you are also downloading all the jpg files, although html2text doesn't mind that. So to save bandwidth, set up a filter like wget -r -R JPG,jpg
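Spelled out as a full command (a sketch, assuming GNU wget; the extra image extensions and the --no-parent flag are my additions, not from the thread):

```shell
# Recursive fetch, rejecting image files to save bandwidth, and
# staying below the start URL (-np = --no-parent).
wget -r -np -R 'JPG,jpg,jpeg,png,gif' 'http://www.forggensee-yachtschule.de/'
```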
I downloaded it and got:
$ html2text $(find) | wc -w
26644
EDIT: I looked at the output and some images were included, and those added 1000+ bogus words
So probably use: find -iname '*.php*' -exec cat '{}' \; | html2text | wc -w
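Since the original question asked for keystrokes as well as words, it's worth noting that wc can report characters too (wc -m alongside wc -w). A self-contained sketch with an invented sample string, again using sed as a stand-in for html2text:

```shell
#!/bin/sh
# Count words and characters in the text extracted from some markup.
text=$(printf '<b>Hallo</b> <i>Welt</i>\n' | sed 's/<[^>]*>//g')
words=$(printf '%s\n' "$text" | wc -w)
chars=$(printf '%s\n' "$text" | wc -m)   # characters, trailing newline included
echo "words=$words chars=$chars"
```

On a real download you would feed the whole `find ... | html2text` output through `wc -w` and `wc -m` the same way.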
Last edited by Procyon (2009-07-14 21:54:44)
Procyon - I got
toad@archtop 965\73 /home/toad/www.forggensee-yachtschule.de > html2text $(find) | wc -w
26150
and
toad@archtop 966\74 /home/toad/www.forggensee-yachtschule.de > find -iname '*.php*' -exec cat '{}' \; | html2text | wc -w
25710
Whatever - I am most grateful for your and everybody else's help! You lot saved me hours! Many, many thanks