This is probably first-year IT stuff, but well beyond me. I have a website to translate and need to know how many words and keystrokes it has. Obviously wc is the way forward.
Unfortunately the site has a bunch of subdirectories and doing it manually would take yonks! If this is too stupid then I apologize in advance. But if some kind soul can point me in the right direction I'd be more than grateful. I can do VERY simple bash stuff and even simpler python, but this calls for arrays or other such weird creatures...
Last edited by toad (2009-07-14 22:39:46)
never trust a toad...
::Grateful ArchDonor::
::Grateful Wikipedia Donor::
wget -r?
Last edited by Wintervenom (2009-07-14 20:17:56)
yep, did that to download the entire site - now I just need to count aforesaid words and keystrokes in all the files across all the subdirectories...
html2text will work on multiple html files concatenated
wget -r -O - 'http://.....' | html2text | wc -w
Or if you already have it all
find -iname '*.html' -exec cat '{}' \; | html2text | wc -w
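For anyone trying this at home, here is a minimal self-contained sketch of the same recursive pipeline. The file names and contents are invented for illustration, and a sed expression stands in for html2text as a crude tag stripper so the example runs with POSIX tools alone; swap html2text back in for real pages.

```shell
#!/bin/sh
# Build a throwaway site tree, then count words across all subdirectories.
tmp=$(mktemp -d)
mkdir -p "$tmp/sub"
printf '<p>one two three</p>\n' > "$tmp/index.html"
printf '<p>four five</p>\n' > "$tmp/sub/page.html"

# find descends into subdirectories on its own -- no array needed.
# sed strips the tags here; use html2text instead for real HTML.
count=$(find "$tmp" -iname '*.html' -exec cat '{}' + | sed 's/<[^>]*>//g' | wc -w)
echo "$count"

rm -rf "$tmp"
```

Note the `-exec cat '{}' +` form, which passes many files to one cat invocation instead of spawning cat once per file.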
html2text - hm... Unfortunately the latter command gives me only a count of 87 so I suppose it does not go into subdirectories. Will keep on trying and thanks for the tip!
You could perhaps use an array for that. For example:
find . \( -iname '*.html' -o -ipath '*/*.html' \) -exec cat '{}' \; | html2text | wc -w
I haven't tried that (so I don't know if it works), but it should.
Nice one... The website in question is www.forggensee-yachtschule.de
I just deleted it 'cos I thought I'd try Procyon's solution - but no joy. Am downloading it again to check smartboyathome's suggestion (but I've got a slow connection; it takes about an hour...).
They are all php files, not html. With wget -r you are also downloading all the jpg files, although html2text doesn't mind that. So to save bandwidth, set up a filter like wget -r -R JPG,jpg
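Spelled out as a full command (a sketch, assuming GNU wget; the extra image extensions and the --no-parent flag are my additions, not from the thread):

```shell
# Recursive fetch, rejecting image files to save bandwidth, and
# staying below the start URL (-np = --no-parent).
wget -r -np -R 'JPG,jpg,jpeg,png,gif' 'http://www.forggensee-yachtschule.de/'
```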
I downloaded it and got:
$ html2text $(find) | wc -w
26644
EDIT: I looked at the output and some images were included, and those added 1000+ bogus words
So probably use: find -iname '*.php*' -exec cat '{}' \; | html2text | wc -w
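Since the original question asked for keystrokes as well as words, it's worth noting that wc can report characters too (wc -m alongside wc -w). A self-contained sketch with an invented sample string, again using sed as a stand-in for html2text:

```shell
#!/bin/sh
# Count words and characters in the text extracted from some markup.
text=$(printf '<b>Hallo</b> <i>Welt</i>\n' | sed 's/<[^>]*>//g')
words=$(printf '%s\n' "$text" | wc -w)
chars=$(printf '%s\n' "$text" | wc -m)   # characters, trailing newline included
echo "words=$words chars=$chars"
```

On a real download you would feed the whole `find ... | html2text` output through `wc -w` and `wc -m` the same way.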
Last edited by Procyon (2009-07-14 21:54:44)
Procyon - I got
toad@archtop 965\73 /home/toad/www.forggensee-yachtschule.de > html2text $(find) | wc -w
26150
and
toad@archtop 966\74 /home/toad/www.forggensee-yachtschule.de > find -iname '*.php*' -exec cat '{}' \; | html2text | wc -w
25710
Whatever - I am most grateful for your and everybody else's help! You lot saved me hours! Many, many thanks