#1 2009-07-14 20:13:28

toad
Member
From: if only I knew
Registered: 2008-12-22
Posts: 1,775

[SOLVED] - counting words on a website

This is probably first-year IT stuff, but it's well beyond me. I have a website to translate and need to know how many words and keystrokes it has. Obviously wc is the way forward.

Unfortunately the site has a bunch of subdirectories, and doing it manually would take yonks! If this is too stupid, then I apologize in advance. But if some kind soul can point me in the right direction, I'd be more than grateful. I can do VERY simple bash stuff and even simpler Python, but this calls for arrays or other such weird creatures...

Last edited by toad (2009-07-14 22:39:46)


never trust a toad...
::Grateful ArchDonor::
::Grateful Wikipedia Donor::

#2 2009-07-14 20:17:20

Wintervenom
Member
Registered: 2008-08-20
Posts: 1,011

wget -r?

Last edited by Wintervenom (2009-07-14 20:17:56)

#3 2009-07-14 20:21:54

toad
Member
From: if only I knew
Registered: 2008-12-22
Posts: 1,775

Yep, did that to download the entire site. Now I just need to count the aforesaid words and keystrokes in all the files across all the subdirectories...


#4 2009-07-14 20:30:58

Procyon
Member
Registered: 2008-05-07
Posts: 1,819

html2text will work on multiple HTML files concatenated:

wget -r -O - 'http://.....' | html2text | wc -w

Or, if you already have it all:

find -iname '*.html' -exec cat '{}' \; | html2text | wc -w
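
Since the original question asked for keystrokes as well as words, here is a rough sketch that counts both in one pass. count_site and the STRIP variable are made-up names for illustration; STRIP defaults to html2text as used above, and -iname assumes GNU or BSD find.

```shell
# Sketch only: count words and characters ("keystrokes") under a directory.
# count_site and STRIP are hypothetical names, not part of any tool.
# STRIP names the filter that removes markup; it defaults to html2text.
count_site() {
    dir=${1:-.}
    pattern=${2:-*.html}
    # find descends into every subdirectory of $dir on its own;
    # "-exec cat {} +" concatenates all matching files onto stdout.
    find "$dir" -type f -iname "$pattern" -exec cat '{}' + |
        ${STRIP:-html2text} |
        wc -w -m
}
```

You would call it as count_site ./mirror '*.html', with ./mirror standing in for wherever wget put the site. wc prints the word count first, then the character count.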

#5 2009-07-14 20:45:07

toad
Member
From: if only I knew
Registered: 2008-12-22
Posts: 1,775

html2text - hm... Unfortunately the latter command gives me only a count of 87 so I suppose it does not go into subdirectories. Will keep on trying and thanks for the tip!


#6 2009-07-14 21:06:21

smartboyathome
Member
From: $HOME
Registered: 2007-12-23
Posts: 334

toad wrote:

html2text - hm... Unfortunately the latter command gives me only a count of 87 so I suppose it does not go into subdirectories. Will keep on trying and thanks for the tip!

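find already descends into subdirectories, so you don't need an array or a brace list for that (and -iname only looks at the file name, so a pattern like '*/*.html' would never match). If you want to match more than one pattern, say .htm as well as .html, join the tests with -o:

find . \( -iname '*.html' -o -iname '*.htm' \) -exec cat '{}' \; | html2text | wc -w

I haven't tried that on your site (so I don't know what it will find), but it should run.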

#7 2009-07-14 21:14:38

toad
Member
From: if only I knew
Registered: 2008-12-22
Posts: 1,775

Nice one... The website in question is www.forggensee-yachtschule.de

I just deleted it 'cos I thought I'd try Procyon's solution - but no joy. I'm downloading it again to check smartboyathome's suggestion (but I've got a slow connection, so it takes about an hour...).


#8 2009-07-14 21:47:49

Procyon
Member
Registered: 2008-05-07
Posts: 1,819

They are all PHP files, not HTML. With wget -r you are also downloading all the JPG files, although html2text doesn't mind that. So to save bandwidth, set up a filter like wget -r -R JPG,jpg
I downloaded it and got:
$ html2text $(find) | wc -w
26644

EDIT: I looked at the output, and some images had been included; those added 1000+ words.
So probably use: find -iname '*.php*' -exec cat '{}' \; | html2text | wc -w
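
A quick way to see which file types a mirror really contains (the .php versus .html surprise above) is to tally the files by extension. This is only a sketch: ext_tally is a made-up helper name, and it assumes file names without embedded newlines.

```shell
# Sketch only: tally the files under a directory by the text after the last dot.
# ext_tally is a hypothetical helper name, not a standard tool.
ext_tally() {
    # List regular files, keep what follows the last dot of each file name
    # (files without a dot in their name are skipped), then count the
    # duplicates and show the most common extension first.
    find "${1:-.}" -type f |
        sed -n 's/.*\.\([^./]*\)$/\1/p' |
        sort |
        uniq -c |
        sort -rn
}
```

Pages that wget saved with a query string show up with the query still attached to the extension, which is presumably why the pattern above is '*.php*' rather than '*.php'.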

Last edited by Procyon (2009-07-14 21:54:44)

#9 2009-07-14 22:39:20

toad
Member
From: if only I knew
Registered: 2008-12-22
Posts: 1,775

Procyon - I got

toad@archtop 965\73 /home/toad/www.forggensee-yachtschule.de > html2text $(find) | wc -w
26150 

and

toad@archtop 966\74 /home/toad/www.forggensee-yachtschule.de > find -iname '*.php*' -exec cat '{}' \; | html2text | wc -w
25710

Whatever - I am most grateful for your and everybody else's help! You lot saved me hours! Many, many thanks.

