I wrote a little Ruby program to parse the log files of our MS ISA proxy server and to find anomalies, such as sites visited very often by only a few users, or sites to which a lot of data has been uploaded. Our log files are very big (~600 MB per day), so the script has to parse about 18 GB of data a month. It then builds a hierarchical structure like this:
Domain
|
|-> Requests
|-> BytesIn
|-> BytesOut
|-> Anomaly
-> Unique Users
|
-> Bytes In, Bytes Out, Requests
Domain and Unique Users are hash tables, so that it is quick to check whether a user or domain already exists.
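For readers without access to the script, the structure described above could be sketched roughly as follows. This is only an illustration under assumptions: `DomainStats`, `UserStats` and `new_domain_table` are hypothetical names, not the ones used in the original script, and the anomaly field is left out.

```ruby
# Hypothetical names throughout; the original script's classes are not shown here.
UserStats = Struct.new(:requests, :bytes_in, :bytes_out)

class DomainStats
  attr_reader :requests, :bytes_in, :bytes_out, :users

  def initialize
    @requests = 0
    @bytes_in = 0
    @bytes_out = 0
    # Hash keyed by user name; the default block creates a zeroed record on first access.
    @users = Hash.new { |hash, user| hash[user] = UserStats.new(0, 0, 0) }
  end

  # Records one request for this domain and for the given user.
  def add(user, bytes_in, bytes_out)
    @requests += 1
    @bytes_in += bytes_in
    @bytes_out += bytes_out
    stats = @users[user]
    stats.requests += 1
    stats.bytes_in += bytes_in
    stats.bytes_out += bytes_out
  end
end

# Top-level hash keyed by domain name; unknown domains get a fresh DomainStats.
def new_domain_table
  Hash.new { |hash, domain| hash[domain] = DomainStats.new }
end

domains = new_domain_table
domains["example.com"].add("alice", 1200, 300)
domains["example.com"].add("bob", 800, 50)
domains["example.com"].add("alice", 400, 100)
puts domains["example.com"].users.size  # prints 2
```

Using a default block on both hashes means each log line costs one lookup per level, with no separate "does this key exist" check.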
The script takes a list of filenames on the command line and parses them line by line. Everything works fine, but compared to a similar Python script I wrote, the Ruby version is much slower. I really don't know where the problem is: maybe file I/O is slower, maybe garbage collection is the problem, or maybe I made a big mistake. Could someone take a look at the script and give me a hint?
Here is the link to the script
Elfenbeinturm.cc
a metaphysical space of solitude and sanctity: http://www.elfenbeinturm.cc
heh. i like finding funny bits of code.
site = @h[url]
hurl. lol
You might try breaking out the nested classes.
Also, Ruby is sometimes just slower than Python. I'd recommend running the profiler on your code, though, to see where it is spending most of its time.
my long ago post on ruby profiler output
Also... you are doing things like.. @anomaly = @rpu*@tpu*@upu*@upu / (1024*1024*1024*1024*1024*1024*1024*1024*1024)
I don't know whether the Ruby interpreter is smart enough to fold that 1024*1024*1024*1024*1024*1024*1024*1024*1024 product into a single integer, but you can always do that yourself.
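Doing it yourself could look something like this. It is a minimal sketch: `ANOMALY_DIVISOR` and the `anomaly` method are hypothetical names, and the formula is simply the one quoted above with the product hoisted into a constant.

```ruby
# Hypothetical constant name; 1024**9 equals the nine-factor product quoted above.
# It is computed once at load time instead of on every call.
ANOMALY_DIVISOR = 1024**9

# rpu, tpu and upu stand in for the script's instance variables of the same names.
def anomaly(rpu, tpu, upu)
  rpu * tpu * upu * upu / ANOMALY_DIVISOR
end
```

The divisor is far larger than a machine word, so building it on every call means repeated big-integer multiplications; a constant avoids those.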
You also have some slightly ambiguous divisions, like in getbytesin and getbytesout.
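If "ambiguous" here refers to integer versus floating-point division (an assumption, since the two methods are not shown in this thread), the difference looks like this. The helper names are hypothetical.

```ruby
# Hypothetical helpers; the real getbytesin/getbytesout are not shown in this thread.

# Integer / Integer rounds down, so the fractional part is lost.
def kilobytes_truncated(bytes)
  bytes / 1024
end

# fdiv always returns a Float, whatever the operand types.
def kilobytes_exact(bytes)
  bytes.fdiv(1024)
end

puts kilobytes_truncated(1536)  # prints 1
puts kilobytes_exact(1536)      # prints 1.5
```

Writing the intended form explicitly makes it obvious to the next reader which result is wanted.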
Just clean up some of the code and run the profiler, I guess, and see where the time is being spent.
"Be conservative in what you send; be liberal in what you accept." -- Postel's Law
"tacos" -- Cactus' Law
"t̥͍͎̪̪͗a̴̻̩͈͚ͨc̠o̩̙͈ͫͅs͙͎̙͊ ͔͇̫̜t͎̳̀a̜̞̗ͩc̗͍͚o̲̯̿s̖̣̤̙͌ ̖̜̈ț̰̫͓ạ̪͖̳c̲͎͕̰̯̃̈o͉ͅs̪ͪ ̜̻̖̜͕" -- -̖͚̫̙̓-̺̠͇ͤ̃ ̜̪̜ͯZ͔̗̭̞ͪA̝͈̙͖̩L͉̠̺͓G̙̞̦͖O̳̗͍
The reality is that Ruby's implementation is very slow. The new Ruby 2.0 interpreter, YARV, is considerably faster (about 6X to 10X in my own informal tests) but it is still under development.
Offline
% time | cumulative seconds | self seconds | calls | self ms/call | total ms/call | name
39.67 2439.49 2439.49 1 2439490.00 6046910.00 IO#each
13.75 3284.91 845.42 1660302 0.51 1.06 Websites::Entry#addUser
9.82 3888.60 603.69 1671769 0.36 1.49 Websites#add_entry
6.92 4314.17 425.57 1611081 0.26 0.37 Websites::Entry::UniqueUser#addRequests
6.72 4727.22 413.05 11589502 0.04 0.04 Fixnum#+
5.85 5087.17 359.95 10212851 0.04 0.04 Array#[]
4.88 5387.36 300.19 3458565 0.09 0.09 String#split
3.04 5574.52 187.16 4966080 0.04 0.04 Hash#[]
2.01 5697.97 123.45 3458568 0.04 0.04 Kernel.==
1.98 5819.52 121.55 3343538 0.04 0.04 String#to_i
1.08 5885.64 66.12 1729284 0.04 0.04 Fixnum#%
1.00 5947.33 61.69 1729484 0.04 0.04 Fixnum#==
0.95 6005.85 58.52 1671769 0.04 0.04 String#==
0.60 6042.84 36.99 22930 1.61 2.54 Websites::Entry#calcAnomaly
0.30 6061.35 18.51 172939 0.11 0.15 Kernel.printf
0.18 6072.43 11.08 53 209.06 1573.58 Hash#each
0.18 6083.36 10.93 2 5465.00 9360.00 Hash#sort
0.12 6091.02 7.66 179760 0.04 0.04 IO#write
0.11 6097.70 6.68 162374 0.04 0.04 Bignum#*
0.10 6104.10 6.40 72152 0.09 0.25 Class#new
0.08 6109.11 5.01 116806 0.04 0.05 Fixnum#*
0.08 6114.08 4.97 60686 0.08 0.08 Websites::Entry::UniqueUser#initialize
0.06 6117.95 3.87 68990 0.06 0.06 Hash#size
0.06 6121.56 3.61 23428 0.15 0.38 Fixnum#>
0.06 6125.02 3.46 11465 0.30 0.59 Websites::Entry#initialize
0.06 6128.43 3.41 94213 0.04 0.04 Fixnum#/
0.05 6131.51 3.08 72151 0.04 0.04 Hash#[]=
0.05 6134.37 2.86 19538 0.15 0.21 Comparable.>
0.04 6136.95 2.58 60686 0.04 0.04 Hash#default
0.04 6139.39 2.44 35884 0.07 0.07 Websites::Entry#getAnomaly
0.03 6141.03 1.64 24754 0.07 0.08 Websites::Entry#getRequests
0.03 6142.61 1.58 28820 0.05 0.05 Bignum#/
0.02 6143.97 1.36 23549 0.06 0.06 Bignum#coerce
0.02 6145.24 1.27 30168 0.04 0.04 Fixnum#<=>
0.02 6146.50 1.26 19545 0.06 0.06 Bignum#<=>
0.02 6147.66 1.16 5674 0.20 0.29 IO#printf
0.01 6148.20 0.54 1393 0.39 0.42 Websites::Entry::UniqueUser#getbytesout
0.01 6148.61 0.41 1393 0.29 0.40 Websites::Entry::UniqueUser#getbytesin
0.01 6148.97 0.36 5591 0.06 0.06 Fixnum#to_s
0.00 6149.16 0.19 1141 0.17 0.19 IO#print
0.00 6149.30 0.14 200 0.70 0.70 Websites::Entry#getBytesOut
0.00 6149.44 0.14 51 2.75 4.90 Range#each
0.00 6149.53 0.09 3 30.00 2017256.67 Array#each
0.00 6149.56 0.03 1393 0.02 0.02 Websites::Entry::UniqueUser#getrequests
0.00 6149.58 0.02 1069 0.02 0.02 String#+
0.00 6149.60 0.02 200 0.10 0.10 Websites::Entry#getUserCount
0.00 6149.61 0.01 553 0.02 0.02 String#length
0.00 6149.62 0.01 200 0.05 0.05 Websites::Entry#getBytesIn
0.00 6149.63 0.01 1 10.00 53630.00 Websites#results_anomaly
0.00 6149.63 0.00 49 0.00 0.00 Kernel.sprintf
0.00 6149.63 0.00 2 0.00 0.00 Kernel.puts
0.00 6149.63 0.00 2 0.00 250.00 Websites#calc_str_length
0.00 6149.63 0.00 2 0.00 0.00 IO#puts
0.00 6149.63 0.00 52 0.00 0.00 File#initialize
0.00 6149.63 0.00 49 0.00 87.76 Websites::Entry#printUniqueUsers
0.00 6149.63 0.00 1 0.00 49090.00 Websites#results_all
0.00 6149.63 0.00 1 0.00 0.00 Fixnum#<
0.00 6149.63 0.00 1 0.00 0.00 Websites#initialize
0.00 6149.63 0.00 52 0.00 0.00 IO#close
0.00 6149.63 0.00 4 0.00 0.00 String#to_s
0.00 6149.63 0.00 2 0.00 0.00 Bignum#+
0.00 6149.63 0.00 3 0.00 0.00 Class#inherited
0.00 6149.63 0.00 51 0.00 0.00 IO#new
0.00 6149.63 0.00 1 0.00 0.00 IO#open
0.00 6149.63 0.00 2 0.00 0.00 Array#length
0.00 6149.63 0.00 21 0.00 0.00 Module#method_added
0.00 6149.64 0.00 1 0.00 6149640.00 #toplevel
Seems to be an IO problem...
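One caveat when reading the table this way: the self time of IO#each likely also includes work done directly in the block passed to it, since the block body is not a separate method. A rough way to check whether raw file reading really is the bottleneck is to time a loop that only reads lines against one that also splits them. This is a minimal sketch with hypothetical helper names, and the tab separator is an assumption about the log format.

```ruby
# Hypothetical helpers for comparing raw line reading against reading plus splitting.

# Runs the block and returns [result, elapsed_seconds] using a monotonic clock.
def timed
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  [result, Process.clock_gettime(Process::CLOCK_MONOTONIC) - start]
end

# Reads every line and does nothing else: approximates the pure I/O cost.
def count_lines(path)
  count = 0
  File.foreach(path) { count += 1 }
  count
end

# Reads every line and splits it once; the tab separator is an assumption.
def count_fields(path)
  fields = 0
  File.foreach(path) { |line| fields += line.chomp.split("\t").size }
  fields
end

# Prints both timings for each file so they can be compared side by side.
def report(paths)
  paths.each do |path|
    lines, read_time = timed { count_lines(path) }
    _fields, parse_time = timed { count_fields(path) }
    printf("%s: %d lines, read %.2fs, read+split %.2fs\n", path, lines, read_time, parse_time)
  end
end

# Usage (not run here): report(ARGV)
```

If the read-only pass is fast and the read+split pass is not, the time is going into per-line work rather than into file I/O itself.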
Elfenbeinturm.cc
a metaphysical space of solitude and sanctity: http://www.elfenbeinturm.cc
Actually, Python's garbage collection is much slower. Python uses reference counting and Ruby uses mark and sweep.
Ruby itself is a slower implementation partly because it interprets a program by walking its syntax tree, whereas Python compiles to bytecode. YARV uses bytecode as well.