Taco Bell programming is one of the steps on the path to Unix Zen. This is a path that I am personally just beginning, but it's already starting to pay dividends. To really get into it, you need to throw away a lot of your ideas about how systems are designed: I made most of a SOAP server using static files and Apache's mod_rewrite. I could have done the whole thing Taco Bell style if I had only manned up and broken out sed, but I pussied out and wrote some Python.
The whole thing is well worth the read: http://teddziuba.com/2010/10/taco-bell-programming.html
Sounds very suckless-ish. So how would this be applied when you need a database involved... just use plain ol' SQL with no other language (PHP, Python, etc.) to manage it? Or am I missing something? Probably
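One Taco Bell-ish way the database part could go: keep the SQL itself and drive it straight from the shell with the sqlite3 (or psql) command-line client, with no PHP/Python layer in between. A minimal sketch, assuming sqlite3 is installed; the file and table names are made up:
#!/bin/sh
# Keep the "database layer" in the shell: plain SQL fed to the sqlite3 CLI.
DB=crawl.db

sqlite3 "$DB" 'CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, fetched_at TEXT);'

# Insert straight from a script; INSERT OR IGNORE skips duplicate URLs.
sqlite3 "$DB" "INSERT OR IGNORE INTO pages VALUES ('http://example.i2p/', datetime('now'));"

# And query it like any other pipeline stage.
sqlite3 "$DB" 'SELECT url, fetched_at FROM pages ORDER BY fetched_at DESC LIMIT 10;'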
Scott
Or am I missing something? Probably
The path to Unix Zen? I get the feeling that Chuck Norris is somewhere along that path...
Hehe...
Moving on, once you have these millions of pages (or even tens of millions), how do you process them? Surely, Hadoop MapReduce is necessary, after all, that's what Google uses to parse the web, right?
Pfft, fuck that noise:
find crawl_dir/ -type f -print0 | xargs -n1 -0 -P32 ./process
Yet this very short article pretty much sums up my view on most everything:
I have far more faith in xargs than I do in Hadoop. Hell, I trust xargs more than I trust myself to write a simple multithreaded processor. I trust syslog to handle asynchronous message recording far more than I trust a message queue service.
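To make the quoted one-liner concrete, here is a minimal sketch of what a ./process worker could look like (the script name, the <title> extraction and the logger tag are invented for illustration): it handles one file per invocation, exactly as xargs -n1 hands them out, and reports progress through syslog with logger instead of a message queue.
#!/bin/sh
# Hypothetical ./process worker: xargs -n1 passes it one crawled file per call.
FILE="$1"

# Do the per-page work here, e.g. pull the page title out of the HTML.
TITLE=$(grep -o '<[Tt][Ii][Tt][Ll][Ee]>[^<]*' "$FILE" | head -n1 | sed 's/<[Tt][Ii][Tt][Ll][Ee]>//')

# Record progress through syslog rather than a message queue.
logger -t process "done $FILE title: $TITLE"
Driven by the quoted pipeline, find crawl_dir/ -type f -print0 | xargs -n1 -0 -P32 ./process, that's 32 workers in parallel with the kernel doing the scheduling.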
The guy is right about not overengineering things, but come on, nothing new there. He just slapped another name on it.
However, he's totally wrong about devops. Devops is about improving development/operations practices and has nothing to do with over-engineering, so he's just trolling about that.
< Daenyth> and he works prolifically
4 8 15 16 23 42
The Taco Bell answer? xargs and wget. In the rare case that you saturate the network connection, add some split and rsync. A "distributed crawler" is really only like 10 lines of shell script.
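For scale, a sketch of roughly what those ten lines could look like. This is not the article's actual script: urls.txt, the worker host names and the 8-way parallelism per host are invented, and it assumes GNU split plus passwordless ssh/rsync to the workers.
#!/bin/sh
# Sketch of a "distributed crawler": split the URL list into one shard per
# worker, ship each shard over with rsync, then let xargs + wget fetch in parallel.
split -n l/4 urls.txt shard.   # GNU split: 4 shards, whole lines only
set -- shard.*
for HOST in worker1 worker2 worker3 worker4
do
    rsync -a "$1" "$HOST":urls.txt
    ssh "$HOST" 'mkdir -p crawl_dir && xargs -n1 -P8 wget -q -P crawl_dir < urls.txt' &
    shift
done
wait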
I had something oddly similar to this happen at work the other day. I suggested a short shell script at a meeting, but they brushed my suggestion aside and went ahead with designing a few hundred lines of .NET C# because it "seemed like a more solid solution".
Last edited by Cyrusm (2010-10-27 12:20:43)
Hofstadter's Law:
It always takes longer than you expect, even when you take into account Hofstadter's Law.
Much as I agree, I have seen it go the other way. At a company I know, a developer (or similar) spent months trying to get a bash/awk/sed etc. system running. Eventually someone knocked up a solution using a proprietary piece of software (I think it was DB2, but it was some time ago) in about two weeks. The first attempt might eventually have worked, but he was writing stuff in awk (say) for which there were ready-made API calls in the proprietary product. It's worth pointing out that they already had many DB2 systems installed and running, so no licenses had to be bought. As in most cases, you need to know your aims and your tools.
"...one cannot be angry when one looks at a penguin." - John Ruskin
"Life in general is a bit shit, and so too is the internet. And that's all there is." - scepticisle
Tacos and Arch... I pretty much agree, at least for my purposes. When I gots a task to do, I'm going to put it together the simplest way with the least coding effort, using things like sed, xargs, grep, and so on. I guess one could argue that if you put together a new tool in Lua or something, you could get a faster program than if you pipe stuff around... but for what I do, it's faster to use a quickly made, slightly inefficient script than to make an uber-optimized brand-new program. The time I'd save by making sure everything is optimized and shiny doesn't make up for the time it would take me to make the shiny stuff in the first place.
ps. tacos
Nai haryuvalyë melwa rë
The simplest way is not always the best:
I tried to put together a simple search engine for i2p...
crawler.sh
#!/bin/sh
is_reachable ()
{
http_proxy=http://localhost:1337 wget -q --spider "$1"
return $?
}
add_index ()
{
J=$(echo "$1" | sed "s!http://!!g")
cp -R /opt/i2p/.i2p/eepsite/docroot/indexorig/"$J" /opt/i2p/.i2p/eepsite/docroot/index/
}
de_htmlify ()
{
J=$(echo "$1" | sed "s!http://!!g")
find /opt/i2p/.i2p/eepsite/docroot/index/"$J" ! -type d ! -iname "*.txt" | while read K
do
mv "$K" /tmp/to_dehtmlify.txt
elinks -force-html --dump -dump-width 300 /tmp/to_dehtmlify.txt | sed 's/ \+/ /g' | sed "/^\[ \]$/d" > /tmp/dehtmlified.txt
rm /tmp/to_dehtmlify.txt
mv /tmp/dehtmlified.txt "$K"
done
}
search_add_new ()
{
echo "new URLS: "
J=$(echo "$1" | sed "s!http://!!g")
# long hashes
egrep "([a-zA-Z0-9_-]+\.)+i2p=[a-zA-Z0-9~-]+AAAA" -h -R -o /opt/i2p/.i2p/eepsite/docroot/indexorig/"$J" >> /opt/i2p/.i2p/eepsite/docroot/found_longhashes.txt
cat /opt/i2p/.i2p/eepsite/docroot/found_longhashes.txt | sort | uniq > /tmp/found_longhashes.txt
mv /tmp/found_longhashes.txt /opt/i2p/.i2p/eepsite/docroot/found_longhashes.txt
# urls and short hashes
egrep "http://([a-zA-Z0-9_-]+\.)+i2p" -h -R -o /opt/i2p/.i2p/eepsite/docroot/indexorig/"$J" | sort | uniq > /tmp/newurls.txt
for K in $(cat /tmp/newurls.txt)
do
echo " checking, if" "$K" "is reachable..."
is_reachable "$K"
if [ "$?" -eq 0 ]
then
echo " " "$K" "is reachable, adding..."
echo "$K" >> /opt/i2p/.i2p/eepsite/docroot/list.txt
cat /opt/i2p/.i2p/eepsite/docroot/list.txt | sort | uniq > /tmp/list.txt
mv /tmp/list.txt /opt/i2p/.i2p/eepsite/docroot/list.txt
else
echo "..." "$K" "is not reachable"
fi
done
}
remove_from_list ()
{
J=$(echo "$1" | sed "s!http://!!g")
sed -i "/$J/d" /opt/i2p/.i2p/eepsite/docroot/list.txt
}
# main loop: walk the URL list forever, (re)crawling every reachable site
while :
do
for I in $(cat /opt/i2p/.i2p/eepsite/docroot/list.txt)
do
echo "processing:" "$I"
is_reachable "$I"
if [ "$?" -eq 0 ]
then
echo "$I" "is reachable"
echo "downloading..."
http_proxy=http://localhost:1337 wget --timeout=45 --tries=2 -q --exclude-directories="*doc*" --convert-links --user-agent="Complain in #i2p if this download is bothering you" -N -P /opt/i2p/.i2p/eepsite/docroot/indexorig/ -r -l 4 -A htm,html,cgi,txt "$I" || echo "Download failed"
echo "Reading new URLs and adding to list..."
search_add_new "$I"
echo "Adding" "$I" "to the searchindex:"
add_index "$I"
echo "de-htmlify" "$I" "..."
de_htmlify "$I"
echo
else
echo "$I" "is NOT reachable"
# for stricter settings; but we are so slow that the site will probably be up again by the time we get to it next time
#echo "removing" "$I" "from list"
#remove_from_list "$I"
echo
fi
done
echo "sleeping 10 sec..."
sleep 10
done
search.sh, called by gatling via CGI
#!/bin/sh
# pull the search= parameter out of the CGI query string and decode spaces
SEARCHTERMS=$(echo "$QUERY_STRING" | sed -n 's/^.*search=\([^&]*\).*$/\1/p' | sed "s/%20/ /g" | sed "s/+/ /g")
echo "Content-type: text/html"
echo
echo "<HTML><HEAD>"
echo "<TITLE>Search Results</TITLE></HEAD>"
echo "<p>Multiple matches on one site are indicated by - - - > another_find</p>"
echo "<body><h1>Search Results</h1>"
echo '<div>Search again:
<form action="search.sh" method="GET">
<input type="text" name="search" value="'
echo "$SEARCHTERMS"
echo '"><br>
<input type="submit" value="Search!">
</form></div><br>'
echo "<br>"
echo "Searchterm:"
echo "$SEARCHTERMS"
echo "<br>"
if [ $(echo "$SEARCHTERMS" | wc -c ) -gt 3 ]
then
grep -R -i --exclude=hosts.txt --exclude-dir=hikki.i2p -F "$SEARCHTERMS" /opt/i2p/.i2p/eepsite/docroot/index/ | cut --complement -f 1-7 -d "/" > /tmp/search"$SEARCHTERMS"
if [ "$(cat /tmp/search"""$SEARCHTERMS""")" != "" ]
then
echo "$(wc -l < /tmp/search""$SEARCHTERMS"")" "matches"
echo '<div style="width:99%; overflow:auto; border:dotted gray 2px;">'
echo "<div><p>"
cat /tmp/search"$SEARCHTERMS" | while read I
do
URL=$(echo "$I" | cut -f 1 -d ":")
FILENAME=$(echo "$URL" | cut --complement -f 1 -d "/")
URL=$(echo "$URL" | cut -f 1 -d "/")
CONTENT=$(echo "$I" | cut --complement -f 1 -d ":")
if [ "$URL" == "$LASTURL" ]
then
RESULT="$CONTENT"
if [ "$FILENAME" != "$LASTFILENAME" ]
then
echo '<a href="http://'"$URL"/"$FILENAME"'">'"$FILENAME"'</a>'":<br>"
else
echo "- - - > "
fi
echo "$RESULT""<br>"
else
echo "<hr>"
RESULT='<a href="http://'"$URL"/"$FILENAME"'">'http://"$URL"/"$FILENAME"'</a><br>'"$CONTENT:""<br>"
echo "$RESULT"
fi
LASTFILENAME="$FILENAME"
LASTURL="$URL"
done
echo "</p></div>"
else
echo "No results<br>"
fi
rm /tmp/search"$SEARCHTERMS"
else
echo "Searchterms need to be at least 3 characters long, otherwise you'd get REALLY many results. Your browser doesn't want to render that.<br>"
fi
echo "</div>"
echo "</div></body></html>"
Actually it works better than I expected, but I suspect it has the same problem as "find crawl_dir/ -type f -print0 | xargs -n1 -0 -P32 ./process".
$ du -sh index
82M index
The filesystem is a BAD database: as soon as the data is not in the cache, it is slow as hell.
QUERY_STRING="search=foo+bar" ./search.sh 9,45s user 1,34s system 41% cpu 26,036 total
So I will redo the whole thing (it's not very well written anyway, because I started without any plan) in C or Java + PostgreSQL. I like the bash + filesystem version better, but it just doesn't work out in the end.
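A rough sketch of what that PostgreSQL version could look like while still keeping the shell plumbing; the table and column names are invented, and it assumes a local PostgreSQL that psql can reach:
#!/bin/sh
# Sketch: load the de-htmlified pages into PostgreSQL and let a full-text
# index do the searching.  Table/column names are made up for the example.
DOCROOT=/opt/i2p/.i2p/eepsite/docroot/index

psql -q <<'SQL'
CREATE TABLE pages (url text PRIMARY KEY, body text);
CREATE INDEX pages_fts ON pages USING gin (to_tsvector('simple', body));
SQL

# One row per crawled file: url <TAB> body (tabs, newlines and backslashes
# squashed so COPY's text format stays happy).
find "$DOCROOT" -type f | while read -r F
do
    printf '%s\t%s\n' "${F#$DOCROOT/}" "$(tr '\t\n\\' '   ' < "$F")"
done > /tmp/pages.tsv
psql -q -c "\copy pages FROM '/tmp/pages.tsv'"

# The search side then becomes one indexed query instead of a recursive grep:
psql -c "SELECT url FROM pages WHERE to_tsvector('simple', body) @@ plainto_tsquery('simple', 'foo bar');"
The crawler and the CGI script stay as they are; only the grep over the docroot gets replaced by the indexed query.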
฿ 18PRsqbZCrwPUrVnJe1BZvza7bwSDbpxZz