
#1 2007-09-13 07:27:22

freigeist
Member
From: Cologne, Germany
Registered: 2006-07-14
Posts: 191

Ruby Performance / Help needed

I wrote a little Ruby program to parse the log files of our MS ISA proxy server and flag anomalies, such as sites visited very often by only a few users or sites to which a lot of data has been uploaded. Our log files are very big (~600 MB per day), so the script has to parse about 18 GB of data a month. It then builds a hierarchical structure like this:

Domain
|
|-> Requests
|-> BytesIn
|-> BytesOut
|-> Anomaly
 -> Unique Users
      |
       -> Bytes In, Bytes Out, Requests

Where Domain and Unique Users are hashtables, to quickly find if the user or domain already exists.
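A minimal sketch of the structure described above, for readers who cannot open the linked script. The names `UserStats`, `DomainEntry` and `new_domain_table` are hypothetical and not taken from the original code, and the anomaly field is left out because its inputs are not described here.

```ruby
# Hypothetical names; the original script's classes may look different.
UserStats = Struct.new(:bytes_in, :bytes_out, :requests)

class DomainEntry
  attr_reader :requests, :bytes_in, :bytes_out, :users

  def initialize
    @requests = 0
    @bytes_in = 0
    @bytes_out = 0
    # Unknown users get a zeroed record on first access.
    @users = Hash.new { |hash, user| hash[user] = UserStats.new(0, 0, 0) }
  end

  # Record one request for the domain and for the user who made it.
  def add(user, bytes_in, bytes_out)
    @requests += 1
    @bytes_in += bytes_in
    @bytes_out += bytes_out
    stats = @users[user]
    stats.bytes_in += bytes_in
    stats.bytes_out += bytes_out
    stats.requests += 1
  end
end

# Domains are created on first lookup, so no explicit existence check is needed.
def new_domain_table
  Hash.new { |hash, domain| hash[domain] = DomainEntry.new }
end

domains = new_domain_table
domains["example.org"].add("alice", 1_200, 300)
domains["example.org"].add("bob", 800, 50)
domains["example.org"].add("alice", 400, 10)
puts domains["example.org"].users.size
```

Using a default block on both hashes keeps the "does this key exist yet?" check inside the hash lookup itself, which is one fewer method call per log line.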

The script takes a bunch of filenames on the command line and parses them line by line. Everything works fine, but compared to a similar Python script I wrote, the Ruby version is much slower. I really don't know where the problem is: maybe file I/O is slower, maybe garbage collection is the problem, or maybe I made a big mistake. Could someone take a look at the script and give me a hint?

Here is the link to the script


Elfenbeinturm.cc
a metaphysical space of solitude and sanctity: http://www.elfenbeinturm.cc

Offline

#2 2007-09-13 14:27:53

cactus
Taco Eater
From: t͈̫̹ͨa͖͕͎̱͈ͨ͆ć̥̖̝o̫̫̼s͈̭̱̞͍̃!̰
Registered: 2004-05-25
Posts: 4,622
Website

Re: Ruby Performance / Help needed

heh. i like finding funny bits of code.

site = @h[url]

hurl. lol

You might try breaking out the nested classes.
Also, Ruby is sometimes just slower than Python. I recommend running the profiler on your code, though, to see where it is spending most of its time.
my long ago post on ruby profiler output

Also... you are doing things like.. @anomaly = @rpu*@tpu*@upu*@upu / (1024*1024*1024*1024*1024*1024*1024*1024*1024)

I don't know if the Ruby interpreter is smart enough to replace that 1024*1024*1024*1024*1024*1024*1024*1024*1024 section with a single integer, but you can always do that yourself.
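Doing it yourself could look roughly like this. `ANOMALY_DIVISOR` and the `anomaly` helper are hypothetical names, and the parameters stand in for the `@rpu`, `@tpu` and `@upu` instance variables quoted above.

```ruby
# 1024 multiplied by itself nine times, computed once when the file is loaded.
ANOMALY_DIVISOR = 1024**9

# Hypothetical helper: rpu, tpu and upu stand in for @rpu, @tpu and @upu.
def anomaly(rpu, tpu, upu)
  rpu * tpu * upu * upu / ANOMALY_DIVISOR
end

puts anomaly(1024**3, 1024**3, 1024**2)
```

Whether this saves much is a separate question, since the calculation only runs once per domain rather than once per log line.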

You also have some slightly ambiguous divisions, like in getbytesin and getbytesout.
Just clean up some of the code and run the profiler I guess... see where the time is being spent.
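The bodies of getbytesin and getbytesout are not shown in the thread, so the following is only a guess at what "ambiguous" could mean: integer division that silently truncates, and chained divisions without parentheses. All method names here are hypothetical.

```ruby
# Integer / Integer truncates toward negative infinity in Ruby.
def kib_truncated(bytes)
  bytes / 1024
end

# Integer#fdiv always returns a Float, so nothing is lost.
def kib_exact(bytes)
  bytes.fdiv(1024)
end

# Parentheses and an explicit to_f make the intended grouping and type obvious.
def average_kib(total_bytes, requests)
  (total_bytes / requests.to_f) / 1024
end

puts kib_truncated(1536)
puts kib_exact(1536)
puts average_kib(4096, 2)
```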


"Be conservative in what you send; be liberal in what you accept." -- Postel's Law
"tacos" -- Cactus' Law
"t̥͍͎̪̪͗a̴̻̩͈͚ͨc̠o̩̙͈ͫͅs͙͎̙͊ ͔͇̫̜t͎̳̀a̜̞̗ͩc̗͍͚o̲̯̿s̖̣̤̙͌ ̖̜̈ț̰̫͓ạ̪͖̳c̲͎͕̰̯̃̈o͉ͅs̪ͪ ̜̻̖̜͕" -- -̖͚̫̙̓-̺̠͇ͤ̃ ̜̪̜ͯZ͔̗̭̞ͪA̝͈̙͖̩L͉̠̺͓G̙̞̦͖O̳̗͍

Offline

#3 2007-09-13 14:28:01

Jessehk
Member
From: Toronto, Ontario, Canada
Registered: 2007-01-16
Posts: 152

Re: Ruby Performance / Help needed

The reality is that Ruby's current implementation is very slow. YARV, the new interpreter planned for Ruby 1.9, is considerably faster (about 6X to 10X in my own informal tests), but it is still under development.

Offline

#4 2007-09-14 11:42:21

freigeist
Member
From: Cologne, Germany
Registered: 2006-07-14
Posts: 191

Re: Ruby Performance / Help needed

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 39.67  2439.49   2439.49        1 2439490.00 6046910.00  IO#each
 13.75  3284.91    845.42  1660302     0.51     1.06  Websites::Entry#addUser
  9.82  3888.60    603.69  1671769     0.36     1.49  Websites#add_entry
  6.92  4314.17    425.57  1611081     0.26     0.37  Websites::Entry::UniqueUser#addRequests
  6.72  4727.22    413.05 11589502     0.04     0.04  Fixnum#+
  5.85  5087.17    359.95 10212851     0.04     0.04  Array#[]
  4.88  5387.36    300.19  3458565     0.09     0.09  String#split
  3.04  5574.52    187.16  4966080     0.04     0.04  Hash#[]
  2.01  5697.97    123.45  3458568     0.04     0.04  Kernel.==
  1.98  5819.52    121.55  3343538     0.04     0.04  String#to_i
  1.08  5885.64     66.12  1729284     0.04     0.04  Fixnum#%
  1.00  5947.33     61.69  1729484     0.04     0.04  Fixnum#==
  0.95  6005.85     58.52  1671769     0.04     0.04  String#==
  0.60  6042.84     36.99    22930     1.61     2.54  Websites::Entry#calcAnomaly
  0.30  6061.35     18.51   172939     0.11     0.15  Kernel.printf
  0.18  6072.43     11.08       53   209.06  1573.58  Hash#each
  0.18  6083.36     10.93        2  5465.00  9360.00  Hash#sort
  0.12  6091.02      7.66   179760     0.04     0.04  IO#write
  0.11  6097.70      6.68   162374     0.04     0.04  Bignum#*
  0.10  6104.10      6.40    72152     0.09     0.25  Class#new
  0.08  6109.11      5.01   116806     0.04     0.05  Fixnum#*
  0.08  6114.08      4.97    60686     0.08     0.08  Websites::Entry::UniqueUser#initialize
  0.06  6117.95      3.87    68990     0.06     0.06  Hash#size
  0.06  6121.56      3.61    23428     0.15     0.38  Fixnum#>
  0.06  6125.02      3.46    11465     0.30     0.59  Websites::Entry#initialize
  0.06  6128.43      3.41    94213     0.04     0.04  Fixnum#/
  0.05  6131.51      3.08    72151     0.04     0.04  Hash#[]=
  0.05  6134.37      2.86    19538     0.15     0.21  Comparable.>
  0.04  6136.95      2.58    60686     0.04     0.04  Hash#default
  0.04  6139.39      2.44    35884     0.07     0.07  Websites::Entry#getAnomaly
  0.03  6141.03      1.64    24754     0.07     0.08  Websites::Entry#getRequests
  0.03  6142.61      1.58    28820     0.05     0.05  Bignum#/
  0.02  6143.97      1.36    23549     0.06     0.06  Bignum#coerce
  0.02  6145.24      1.27    30168     0.04     0.04  Fixnum#<=>
  0.02  6146.50      1.26    19545     0.06     0.06  Bignum#<=>
  0.02  6147.66      1.16     5674     0.20     0.29  IO#printf
  0.01  6148.20      0.54     1393     0.39     0.42  Websites::Entry::UniqueUser#getbytesout
  0.01  6148.61      0.41     1393     0.29     0.40  Websites::Entry::UniqueUser#getbytesin
  0.01  6148.97      0.36     5591     0.06     0.06  Fixnum#to_s
  0.00  6149.16      0.19     1141     0.17     0.19  IO#print
  0.00  6149.30      0.14      200     0.70     0.70  Websites::Entry#getBytesOut
  0.00  6149.44      0.14       51     2.75     4.90  Range#each
  0.00  6149.53      0.09        3    30.00 2017256.67  Array#each
  0.00  6149.56      0.03     1393     0.02     0.02  Websites::Entry::UniqueUser#getrequests
  0.00  6149.58      0.02     1069     0.02     0.02  String#+
  0.00  6149.60      0.02      200     0.10     0.10  Websites::Entry#getUserCount
  0.00  6149.61      0.01      553     0.02     0.02  String#length
  0.00  6149.62      0.01      200     0.05     0.05  Websites::Entry#getBytesIn
  0.00  6149.63      0.01        1    10.00 53630.00  Websites#results_anomaly
  0.00  6149.63      0.00       49     0.00     0.00  Kernel.sprintf
  0.00  6149.63      0.00        2     0.00     0.00  Kernel.puts
  0.00  6149.63      0.00        2     0.00   250.00  Websites#calc_str_length
  0.00  6149.63      0.00        2     0.00     0.00  IO#puts
  0.00  6149.63      0.00       52     0.00     0.00  File#initialize
  0.00  6149.63      0.00       49     0.00    87.76  Websites::Entry#printUniqueUsers
  0.00  6149.63      0.00        1     0.00 49090.00  Websites#results_all
  0.00  6149.63      0.00        1     0.00     0.00  Fixnum#<
  0.00  6149.63      0.00        1     0.00     0.00  Websites#initialize
  0.00  6149.63      0.00       52     0.00     0.00  IO#close
  0.00  6149.63      0.00        4     0.00     0.00  String#to_s
  0.00  6149.63      0.00        2     0.00     0.00  Bignum#+
  0.00  6149.63      0.00        3     0.00     0.00  Class#inherited
  0.00  6149.63      0.00       51     0.00     0.00  IO#new
  0.00  6149.63      0.00        1     0.00     0.00  IO#open
  0.00  6149.63      0.00        2     0.00     0.00  Array#length
  0.00  6149.63      0.00       21     0.00     0.00  Module#method_added
  0.00  6149.64      0.00        1     0.00 6149640.00  #toplevel

Seems to be an I/O problem: IO#each alone accounts for almost 40% of the self time.
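One way to double-check that reading is the bottleneck, sketched here with a generated sample file instead of the real logs: time a bare line-by-line read, then the same read with a split per line. The tab separator, the field layout and all helper names are assumptions for illustration only.

```ruby
require "benchmark"
require "tempfile"

# Time a bare line-by-line read of the file, with no per-line work.
def time_read_only(path)
  Benchmark.realtime { File.foreach(path) { |_line| } }
end

# Time the same read plus one split per line (tab separator is an assumption).
def time_read_and_split(path)
  Benchmark.realtime { File.foreach(path) { |line| line.split("\t") } }
end

# Build a small throwaway log so the sketch runs without real data.
def with_sample_log(lines = 50_000)
  Tempfile.create("sample_log") do |file|
    lines.times do |i|
      file.puts("user#{i % 100}\texample#{i % 500}.org\t#{i}\t#{i * 2}")
    end
    file.flush
    yield file.path
  end
end

with_sample_log do |path|
  printf("read only:      %.3f s\n", time_read_only(path))
  printf("read and split: %.3f s\n", time_read_and_split(path))
end
```

If the first number is already close to the total runtime on a real log file, the time really is going into reading; if not, the per-line work deserves a closer look.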



Offline

#5 2007-09-19 17:33:14

Bison
Member
From: Jacksonville, FL
Registered: 2006-04-12
Posts: 158
Website

Re: Ruby Performance / Help needed

Actually, Python's garbage collection is much slower. Python uses reference counting, while Ruby uses mark and sweep.

Ruby itself is slower partly because the current interpreter walks the syntax tree directly, whereas Python compiles to bytecode. YARV uses bytecode as well.
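On an interpreter that includes YARV, the bytecode can be inspected directly. This sketch assumes the standard C implementation of Ruby, 1.9 or later, where `RubyVM::InstructionSequence` is available; it will not work on the tree-walking 1.8 interpreter or on other implementations. `bytecode_listing` is a hypothetical helper name.

```ruby
# Compile a source string and return the human-readable instruction listing.
def bytecode_listing(source)
  RubyVM::InstructionSequence.compile(source).disasm
end

puts bytecode_listing("a = 1; b = 2; a + b")
```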

Offline
