[SOLVED] Faster script?

srikanthradix · 2011-09-17 16:04:00

I am extracting the Id and Destination substrings from a log file using awk.

20110911 12:30:33 [seq=123444][src=sample1][Id=12345][Destination=CME][SourceSystem<5177>=RAINIER]

temp1.log contains a million lines in the same format as the one above, and I need to extract the Id and Destination from each.

1st attempt:

[srikanth@hana ~]$ time awk '{match($0,/Id=([0-9]*)/,a); print a[1]; match($0,/Destination=([A-Za-z_]*)/,a); print a[1];}' temp1.log > ids.log

real 0m14.748s
user 0m14.656s
sys 0m0.077s

2nd attempt:

[srikanth@hana ~]$ time awk '
BEGIN {
    destlen = length("Destination=") + 1
    idlen = length("Id=") + 1
}
{
    split($0, a, "[")
    j = 0; len = length(a)
    for (i = 4; i <= len; i++) {
        alen = length(a[i])
        if (a[i] ~ "Id") {
            print substr(a[i], idlen, alen - idlen)
        }
        if (a[i] ~ "Dest") {
            print substr(a[i], destlen, alen - destlen); i = len
        }
    }
}' temp1.log > ids.log

real 0m11.576s
user 0m11.463s
sys 0m0.103s

I would like to keep using awk for this.

Can I do any better than this?

Last edited by srikanthradix (2011-09-22 21:55:08)

juster · 2011-09-17 18:26:41

Here is a simple example based on your latest approach. For the sake of speed, it avoids both splitting and regexps:

{
    i = index($0, "[Id=")                                                      
    id = substr($0, i + 4)
    id = substr(id, 1, index(id, "]") - 1)

    i = index($0, "[Destination=")
    dest = substr($0, i + 13)
    dest = substr(dest, 1, index(dest, "]") - 1)

    print id, dest
}

If there are no spaces in your data between the brackets, then it is slightly faster to use $3 instead of $0.
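The $3 remark can be checked with a small sketch. It assumes awk's default whitespace field splitting; the helper name third_field and the second sample line (with a space inside a value) are made up for illustration.

```shell
# Hypothetical helper: print only the third whitespace-separated field.
third_field() {
    awk '{ print $3 }' "$@"
}

# Sample line from the first post: no spaces inside the brackets,
# so $3 holds the entire bracketed part of the line.
printf '%s\n' '20110911 12:30:33 [seq=123444][src=sample1][Id=12345][Destination=CME][SourceSystem<5177>=RAINIER]' | third_field

# Made-up line with a space inside a value: $3 ends at that space,
# so "[Id=" and "[Destination=" are no longer inside $3.
printf '%s\n' '20110911 12:30:33 [seq=1][src=sample one][Id=12345][Destination=CME]' | third_field
```

If the second case can occur in the real data, searching $0 is the safer choice.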

The next example changes the field separator (FS) to split each line where one or more brackets occur, instead of on whitespace. This also assumes that id and destination are always the third and fourth key/value pair enclosed in brackets. When splitting on brackets, $1 is the date and time string, so the id is $4 and the destination is $5.

BEGIN {
    FS="[[\\]]+" # \\ are converted to single \ in string                      
                 # \] in regexp escapes the ] inside [...]                     
}
{ print substr($4, index($4, "=") + 1), substr($5, index($5, "=") + 1) }
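To see how the bracket-based FS numbers the fields, a rough sketch like the following prints each field with its index for the sample line from the first post. The helper name show_fields is made up for illustration; the FS value is the one used above.

```shell
# Hypothetical helper: list every field produced by the bracket-based FS.
show_fields() {
    awk '
    BEGIN { FS = "[[\\]]+" }   # split wherever one or more [ or ] occur
    { for (i = 1; i <= NF; i++) printf "%d: %s\n", i, $i }
    ' "$@"
}

printf '%s\n' '20110911 12:30:33 [seq=123444][src=sample1][Id=12345][Destination=CME][SourceSystem<5177>=RAINIER]' | show_fields
```

For this line, field 4 should come out as Id=12345 and field 5 as Destination=CME, which is why the script above takes everything after the "=" in $4 and $5.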

Procyon · 2011-09-17 18:46:10

@Juster:

I went for getting rid of the regex too. I got it 20% faster by putting the reduced strings back in $0 and putting the first call to index inside the substr.

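time awk '{
    # Cut $0 down to the text after "[Id="; the id then runs up to the next "]".
    $0 = substr($0, index($0,"[Id=") + 4)
    id = substr($0, 1, index($0, "]") - 1)

    # Search only the remainder, so this assumes Destination comes after Id.
    $0 = substr($0, index($0, "[Destination=") + 13)
    dest = substr($0, 1, index($0, "]") - 1)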

    print id, dest
}' temp1.log > /dev/null

Last edited by Procyon (2011-09-17 19:00:50)

srikanthradix · 2011-09-17 19:32:40

time awk '{
    i = index($3, "[Id=")    
    id = substr($3, i + 4)
    id = substr(id, 1, index(id, "]") - 1)

    i = index($3, "[Destination=")
    dest = substr($3, i + 13)
    dest = substr(dest, 1, index(dest, "]") - 1)

    print id, dest
}' temp1.log > ids.log

real 0m4.219s
user 0m4.146s
sys 0m0.067s

The Id and Destination can be in any position after $3; I should have mentioned that (mea culpa). Hence, I can't assign the result of substr back to $0.

But this works for me. It is definitely a much faster script than mine.

Thanks to you both.
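The ordering point can be illustrated with a minimal sketch that runs both styles on a line where Destination comes before Id. The function names and the reordered sample line are made up for illustration; the awk bodies follow the two scripts posted in this thread.

```shell
# Truncating style: each step shortens $0, so the second search
# only sees what comes after the Id.
extract_truncating() {
    awk '{
        $0 = substr($0, index($0, "[Id=") + 4)
        id = substr($0, 1, index($0, "]") - 1)
        $0 = substr($0, index($0, "[Destination=") + 13)
        dest = substr($0, 1, index($0, "]") - 1)
        print id, dest
    }' "$@"
}

# Non-truncating style: both searches look at the full, unmodified line.
extract_any_order() {
    awk '{
        id = substr($0, index($0, "[Id=") + 4)
        id = substr(id, 1, index(id, "]") - 1)
        dest = substr($0, index($0, "[Destination=") + 13)
        dest = substr(dest, 1, index(dest, "]") - 1)
        print id, dest
    }' "$@"
}

# Made-up line with Destination placed before Id.
reordered='20110911 12:30:33 [seq=1][Destination=CME][Id=12345]'
printf '%s\n' "$reordered" | extract_truncating   # destination is lost
printf '%s\n' "$reordered" | extract_any_order    # finds both values
```

Neither variant guards against a line where one of the keys is missing; index then returns 0 and the substr offsets no longer point at a value.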

Last edited by srikanthradix (2011-09-18 01:15:57)

Arch Linux
