Taco Bell programming is one of the steps on the path to Unix Zen. This is a path that I am personally just beginning, but it's already starting to pay dividends. To really get into it, you need to throw away a lot of your ideas about how systems are designed: I made most of a SOAP server using static files and Apache's mod_rewrite. I could have done the whole thing Taco Bell style if I had only manned up and broken out sed, but I pussied out and wrote some Python.
The whole thing is well worth the read: http://teddziuba.com/2010/10/taco-bell-programming.html
Sounds very suckless-ish. So how would this be applied when you need a database involved... just use plain ol' SQL with no other language (PHP, Python, etc.) to manage it? Or am I missing something? Probably
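One Taco Bell-ish way the database part could go: keep the SQL itself and drive it straight from the shell with the sqlite3 (or psql) command-line client, with no PHP/Python layer in between. A minimal sketch, assuming sqlite3 is installed; the file and table names are made up:
#!/bin/sh
# Keep the "database layer" in the shell: plain SQL fed to the sqlite3 CLI.
DB=crawl.db

sqlite3 "$DB" 'CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, fetched_at TEXT);'

# Insert straight from a script; INSERT OR IGNORE skips duplicate URLs.
sqlite3 "$DB" "INSERT OR IGNORE INTO pages VALUES ('http://example.i2p/', datetime('now'));"

# And query it like any other pipeline stage.
sqlite3 "$DB" 'SELECT url, fetched_at FROM pages ORDER BY fetched_at DESC LIMIT 10;'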
Scott
Or am I missing something? Probably
The path to Unix Zen? I get the feeling that Chuck Norris is somewhere along that path...
Hehe...
Moving on, once you have these millions of pages (or even tens of millions), how do you process them? Surely, Hadoop MapReduce is necessary, after all, that's what Google uses to parse the web, right?
Pfft, fuck that noise:
find crawl_dir/ -type f -print0 | xargs -n1 -0 -P32 ./process
Yet this very short article pretty much sums up my view on most everything:
I have far more faith in xargs than I do in Hadoop. Hell, I trust xargs more than I trust myself to write a simple multithreaded processor. I trust syslog to handle asynchronous message recording far more than I trust a message queue service.
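To make the quoted one-liner concrete, here is a minimal sketch of what a ./process worker could look like (the script name, the <title> extraction and the logger tag are invented for illustration): it handles one file per invocation, exactly as xargs -n1 hands them out, and reports progress through syslog with logger instead of a message queue.
#!/bin/sh
# Hypothetical ./process worker: xargs -n1 passes it one crawled file per call.
FILE="$1"

# Do the per-page work here, e.g. pull the page title out of the HTML.
TITLE=$(grep -o '<[Tt][Ii][Tt][Ll][Ee]>[^<]*' "$FILE" | head -n1 | sed 's/<[Tt][Ii][Tt][Ll][Ee]>//')

# Record progress through syslog rather than a message queue.
logger -t process "done $FILE title: $TITLE"
Driven by the quoted pipeline, find crawl_dir/ -type f -print0 | xargs -n1 -0 -P32 ./process, that's 32 workers in parallel with the kernel doing the scheduling.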
The guy is right about not overengineering things, but come on, nothing new there. He just slapped another name on it.
However, he's totally wrong about devops. Devops is about improving development/operations practices and has nothing to do with over-engineering, so he's just trolling about that.
< Daenyth> and he works prolifically
4 8 15 16 23 42
The Taco Bell answer? xargs and wget. In the rare case that you saturate the network connection, add some split and rsync. A "distributed crawler" is really only like 10 lines of shell script.
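For scale, a sketch of roughly what those ten lines could look like. This is not the article's actual script: urls.txt, the worker host names and the 8-way parallelism per host are invented, and it assumes GNU split plus passwordless ssh/rsync to the workers.
#!/bin/sh
# Sketch of a "distributed crawler": split the URL list into one shard per
# worker, ship each shard over with rsync, then let xargs + wget fetch in parallel.
split -n l/4 urls.txt shard.   # GNU split: 4 shards, whole lines only
set -- shard.*
for HOST in worker1 worker2 worker3 worker4
do
    rsync -a "$1" "$HOST":urls.txt
    ssh "$HOST" 'mkdir -p crawl_dir && xargs -n1 -P8 wget -q -P crawl_dir < urls.txt' &
    shift
done
wait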
I had something oddly similar to this happen at work the other day. I suggested a short shell script at a meeting, but they brushed my suggestion aside and went ahead with designing a few hundred lines of .NET C# because it "seemed like a more solid solution".
Last edited by Cyrusm (2010-10-27 12:20:43)
Hofstadter's Law:
It always takes longer than you expect, even when you take into account Hofstadter's Law.
Much as I agree, I have seen it go the other way. At a company I know, a developer (or similar) spent months trying to get a bash/awk/sed etc. system running. Eventually someone knocked up a solution using a proprietary piece of software (I think it was DB2, but it was some time ago) in about two weeks. The first attempt might eventually have worked, but he was writing stuff in awk (say) for which there were ready-made API calls in the proprietary product. It's worth pointing out that they already had many DB2 systems installed and running, so no licenses had to be bought. As in most cases, you need to know your aims and your tools.
"...one cannot be angry when one looks at a penguin." - John Ruskin
"Life in general is a bit shit, and so too is the internet. And that's all there is." - scepticisle
Tacos and Arch... I pretty much agree, at least for my purposes. When I gots a task to do, I'm going to put it together the simplest way with the least coding effort, using things like sed, xargs, grep, and so on. I guess one could argue that if you put together a new tool in Lua or something, you could get a faster program than if you pipe stuff around... but for what I do, it's faster to use a quickly made, slightly inefficient script than to make an uber-optimized brand-new program. The time I'd save by making sure everything is optimized and shiny doesn't make up for the time it would take me to make the shiny stuff in the first place.
ps. tacos
Nai haryuvalyë melwa rë
The simplest way is not always the best:
I tried to put together a simple search engine for i2p...
crawler.sh
#!/bin/sh
is_reachable ()
{
http_proxy=http://localhost:1337 wget -q --spider "$1"
return $?
}
add_index ()
{
J=$(echo "$1" | sed "s!http://!!g")
cp -R /opt/i2p/.i2p/eepsite/docroot/indexorig/"$J" /opt/i2p/.i2p/eepsite/docroot/index/
}
de_htmlify ()
{
J=$(echo "$1" | sed "s!http://!!g")
find /opt/i2p/.i2p/eepsite/docroot/index/"$J" ! -type d ! -iname "*.txt" | while read K
do
mv "$K" /tmp/to_dehtmlify.txt
elinks -force-html --dump -dump-width 300 /tmp/to_dehtmlify.txt | sed 's/ \+/ /g' | sed "/^\[ \]$/d" > /tmp/dehtmlified.txt
rm /tmp/to_dehtmlify.txt
mv /tmp/dehtmlified.txt "$K"
done
}
search_add_new ()
{
echo "new URLS: "
J=$(echo "$1" | sed "s!http://!!g")
# long hashes
egrep "([a-zA-Z0-9_-]+\.)+i2p=[a-zA-Z0-9~-]+AAAA" -h -R -o /opt/i2p/.i2p/eepsite/docroot/indexorig/"$J" >> /opt/i2p/.i2p/eepsite/docroot/found_longhashes.txt
cat /opt/i2p/.i2p/eepsite/docroot/found_longhashes.txt | sort | uniq > /tmp/found_longhashes.txt
mv /tmp/found_longhashes.txt /opt/i2p/.i2p/eepsite/docroot/found_longhashes.txt
# urls and short hashes
egrep "http://([a-zA-Z0-9_-]+\.)+i2p" -h -R -o /opt/i2p/.i2p/eepsite/docroot/indexorig/"$J" | sort | uniq > /tmp/newurls.txt
for K in $(cat /tmp/newurls.txt)
do
echo " checking, if" "$K" "is reachable..."
is_reachable "$K"
if [ "$?" -eq 0 ]
then
echo " " "$K" "is reachable, adding..."
echo "$K" >> /opt/i2p/.i2p/eepsite/docroot/list.txt
cat /opt/i2p/.i2p/eepsite/docroot/list.txt | sort | uniq > /tmp/list.txt
mv /tmp/list.txt /opt/i2p/.i2p/eepsite/docroot/list.txt
else
echo "..." "$K" "is not reachable"
fi
done
}
remove_from_list ()
{
J=$(echo "$1" | sed "s!http://!!g")
sed -i "/$J/d" /opt/i2p/.i2p/eepsite/docroot/list.txt
}
# main loop: walk the URL list forever, (re)crawling every reachable site
while :
do
for I in $(cat /opt/i2p/.i2p/eepsite/docroot/list.txt)
do
echo "processing:" "$I"
is_reachable "$I"
if [ "$?" -eq 0 ]
then
echo "$I" "is reachable"
echo "downloading..."
http_proxy=http://localhost:1337 wget --timeout=45 --tries=2 -q --exclude-directories="*doc*" --convert-links --user-agent="Complain in #i2p if this download is bothering you" -N -P /opt/i2p/.i2p/eepsite/docroot/indexorig/ -r -l 4 -A htm,html,cgi,txt "$I" || echo "Download failed"
echo "Reading new URLs and adding to list..."
search_add_new "$I"
echo "Adding" "$I" "to the searchindex:"
add_index "$I"
echo "de-htmlify" "$I" "..."
de_htmlify "$I"
echo
else
echo "$I" "is NOT reachable"
# for stricter settings; but we are so slow that the site will probably be up again by the time we get to it next time
#echo "removing" "$I" "from list"
#remove_from_list "$I"
echo
fi
done
echo "sleeping 10 sec..."
sleep 10
done
search.sh, called by gatling via CGI
#!/bin/sh
# pull the search= parameter out of the CGI query string and decode spaces
SEARCHTERMS=$(echo "$QUERY_STRING" | sed -n 's/^.*search=\([^&]*\).*$/\1/p' | sed "s/%20/ /g" | sed "s/+/ /g")
echo "Content-type: text/html"
echo
echo "<HTML><HEAD>"
echo "<TITLE>Search Results</TITLE></HEAD>"
echo "<p>Multiple matches on one site are indicated by - - - > another_find</p>"
echo "<body><h1>Search Results</h1>"
echo '<div>Search again:
<form action="search.sh" method="GET">
<input type="text" name="search" value="'
echo "$SEARCHTERMS"
echo '"><br>
<input type="submit" value="Search!">
</form></div><br>'
echo "<br>"
echo "Searchterm:"
echo "$SEARCHTERMS"
echo "<br>"
if [ $(echo "$SEARCHTERMS" | wc -c ) -gt 3 ]
then
grep -R -i --exclude=hosts.txt --exclude-dir=hikki.i2p -F "$SEARCHTERMS" /opt/i2p/.i2p/eepsite/docroot/index/ | cut --complement -f 1-7 -d "/" > /tmp/search"$SEARCHTERMS"
if [ "$(cat /tmp/search"""$SEARCHTERMS""")" != "" ]
then
echo "$(wc -l < /tmp/search""$SEARCHTERMS"")" "matches"
echo '<div style="width:99%; overflow:auto; border:dotted gray 2px;">'
echo "<div><p>"
cat /tmp/search"$SEARCHTERMS" | while read I
do
URL=$(echo "$I" | cut -f 1 -d ":")
FILENAME=$(echo "$URL" | cut --complement -f 1 -d "/")
URL=$(echo "$URL" | cut -f 1 -d "/")
CONTENT=$(echo "$I" | cut --complement -f 1 -d ":")
if [ "$URL" == "$LASTURL" ]
then
RESULT="$CONTENT"
if [ "$FILENAME" != "$LASTFILENAME" ]
then
echo '<a href="http://'"$URL"/"$FILENAME"'">'"$FILENAME"'</a>'":<br>"
else
echo "- - - > "
fi
echo "$RESULT""<br>"
else
echo "<hr>"
RESULT='<a href="http://'"$URL"/"$FILENAME"'">'http://"$URL"/"$FILENAME"'</a><br>'"$CONTENT:""<br>"
echo "$RESULT"
fi
LASTFILENAME="$FILENAME"
LASTURL="$URL"
done
echo "</p></div>"
else
echo "No results<br>"
fi
rm /tmp/search"$SEARCHTERMS"
else
echo "Searchterms need to be at least 3 characters long, otherwise you'd get REALLY many results. Your browser doesn't want to render that.<br>"
fi
echo "</div>"
echo "</div></body></html>"
Actually it works better than I expected, but I suspect it has the same problem as "find crawl_dir/ -type f -print0 | xargs -n1 -0 -P32 ./process".
$ du -sh index
82M index
The filesystem is a BAD database: as soon as the data is not in the cache, it is slow as hell.
QUERY_STRING="search=foo+bar" ./search.sh 9,45s user 1,34s system 41% cpu 26,036 total
So I will redo the whole thing (it's not very well written anyway, because I started without any plan) in C or Java + PostgreSQL. I like the bash + filesystem version better, but it just doesn't work out in the end.
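A rough sketch of what that PostgreSQL version could look like while still keeping the shell plumbing; the table and column names are invented, and it assumes a local PostgreSQL that psql can reach:
#!/bin/sh
# Sketch: load the de-htmlified pages into PostgreSQL and let a full-text
# index do the searching.  Table/column names are made up for the example.
DOCROOT=/opt/i2p/.i2p/eepsite/docroot/index

psql -q <<'SQL'
CREATE TABLE pages (url text PRIMARY KEY, body text);
CREATE INDEX pages_fts ON pages USING gin (to_tsvector('simple', body));
SQL

# One row per crawled file: url <TAB> body (tabs, newlines and backslashes
# squashed so COPY's text format stays happy).
find "$DOCROOT" -type f | while read -r F
do
    printf '%s\t%s\n' "${F#$DOCROOT/}" "$(tr '\t\n\\' '   ' < "$F")"
done > /tmp/pages.tsv
psql -q -c "\copy pages FROM '/tmp/pages.tsv'"

# The search side then becomes one indexed query instead of a recursive grep:
psql -c "SELECT url FROM pages WHERE to_tsvector('simple', body) @@ plainto_tsquery('simple', 'foo bar');"
The crawler and the CGI script stay as they are; only the grep over the docroot gets replaced by the indexed query.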
฿ 18PRsqbZCrwPUrVnJe1BZvza7bwSDbpxZz