Obtain substring using python.

srikanthradix · 2011-09-22 21:10:30

20110920 19:59:59.752441 [ORDER PENDINGACCEPT        ] [seq=178122233][Major=O][Minor=a][src=osf][Id=11579995][ref=53839822][SourceSystemTimeStamp=2011-09-20 23:59:59.750][OrderID<25100>=11579995][Shares<38>=400][OrderType<40>=2][Side<54>=5][Destination<25104>=router][Tif<59>=5][Capacity<528>=A][RefId<25105>=1150414889][ParentID<25101>=-1]

I need to extract the Destination value "router" and the Id value "11579995" from this line using Python. How can I do that? Many tags have been omitted for brevity. The tags are not at fixed positions, but they all come after the seq tag.

In my previous post (https://bbs.archlinux.org/viewtopic.php?id=126660), awk seemed pretty fast.

Also, how do I close it as SOLVED once the answer/solution is obtained?

Last edited by srikanthradix (2011-09-22 21:17:16)

karol · 2011-09-22 21:48:11

srikanthradix wrote:

Also, how do I close it as SOLVED once the answer/solution is obtained?

Edit your first post and add '[solved]' to the topic line.
https://wiki.archlinux.org/index.php/Fo … ow_to_Post

Last edited by karol (2011-09-22 21:49:56)

Mr.Elendig · 2011-09-23 02:22:28

re module perhaps?
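
A minimal sketch of what the re suggestion might look like, assuming Python 3 and a shortened copy of the sample line from the first post. The names LINE, ID_RE, DEST_RE and extract are made up for illustration. Anchoring on the opening bracket keeps "[Id=" from matching other tags such as OrderID or RefId.

```python
import re

# Sample line shortened from the first post; only a few tags are kept.
LINE = ("20110920 19:59:59.752441 [ORDER PENDINGACCEPT        ] "
        "[seq=178122233][Major=O][src=osf][Id=11579995][ref=53839822]"
        "[OrderID<25100>=11579995][Destination<25104>=router][ParentID<25101>=-1]")

# [^\]]* captures everything up to (not including) the next closing bracket.
ID_RE = re.compile(r"\[Id=([^\]]*)\]")
DEST_RE = re.compile(r"\[Destination<25104>=([^\]]*)\]")

def extract(line):
    """Return (id, destination); either is None if its tag is missing."""
    id_match = ID_RE.search(line)
    dest_match = DEST_RE.search(line)
    return (id_match.group(1) if id_match else None,
            dest_match.group(1) if dest_match else None)

print(extract(LINE))  # ('11579995', 'router')
```

Whether this is faster or slower than plain string searching has to be measured on the real log.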

srikanthradix · 2011-09-23 14:31:27

for line in open("temp.log"):
        found = line.find("Id=")
        if found > -1:
                next=line.find("]",found)
                subs=line[found+3:next]
                print subs

Is there a better way to do it? or is this it? I mean performance wise.
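
One possible tidy-up of the find() approach, written for Python 3 as a sketch rather than a performance claim. The helper name tag_value is hypothetical. It derives the offset from the prefix length instead of hard-coding 3, avoids using next as a variable name (which shadows the builtin), and returns None when the closing bracket is missing.

```python
def tag_value(line, prefix, closer="]"):
    """Return the text between prefix and the next closer, or None."""
    start = line.find(prefix)
    if start == -1:
        return None
    start += len(prefix)            # skip past the prefix itself
    end = line.find(closer, start)
    if end == -1:                   # no closing bracket: treat as not found
        return None
    return line[start:end]

sample = "[seq=178122233][Id=11579995][Destination<25104>=router]"
print(tag_value(sample, "Id="))        # 11579995
print(tag_value(sample, "<25104>="))   # router
```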

srikanthradix · 2011-09-23 15:01:40

bash-3.2$ time python temp.py > ids1.log

real 0m0.300s
user 0m0.286s
sys 0m0.012s

bash-3.2$ cat temp.py
for line in open("temp.log"):
        start_idx = line.find("Id=")
        if start_idx > -1:
                end_idx=line.find("]",start_idx)
                subs=line[start_idx+3:end_idx]
                print subs
        start_idx = line.find("<25104>=")
        if start_idx > -1:
                end_idx=line.find("]",start_idx)
                subs=line[start_idx+8:end_idx]
                print subs

whereas when I do it with old awk:

bash-3.2$ time awk '{
    i = index($0, "Id=")
    if(i > 0) {
    id = substr($0, i + 3)
    id = substr(id, 1, index(id, "]") - 1)
    print id
    }
    i = index($0, "<25104>=")
    if(i > 0) {
    dest = substr($0, i + 8)
    dest = substr(dest, 1, index(dest, "]") - 1)
    print dest
    }
}' temp.log > ids.log

real 0m0.189s
user 0m0.177s
sys 0m0.012s

Last edited by srikanthradix (2011-09-23 15:02:56)

Mr.Elendig · 2011-09-23 15:50:12

Time using the re module too.

marxav · 2011-09-23 16:02:53

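import re

# [^\]]+ stops at the first closing bracket instead of running to the last one
pattern = re.compile(r'\[Destination<\d+>=([^\]]+)\]')
for line in open("temp.log"):
    out = pattern.search(line)
    if out:
        print(out.group(1))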

srikanthradix · 2011-09-23 16:07:56

<<Execution>>

bash-3.2$ python temp.py > ids1.log

<<Output>>

bash-3.2$ tail -1 ids1.log

0.277968883514

<<temp.py>>

bash-3.2$ cat temp.py
from time import time as clock

start = clock()

for line in open("temp.log"):
        start_idx = line.find("Id=")
        if start_idx > -1:
                end_idx=line.find("]",start_idx)
                subs=line[start_idx+3:end_idx]
                print subs
        start_idx = line.find("<25104>=")
        if start_idx > -1:
                end_idx=line.find("]",start_idx)
                subs=line[start_idx+8:end_idx]
                print subs

diff = (clock() - start)
print diff

srikanthradix · 2011-09-23 18:29:22

The regular expressions are wreaking havoc on the timing:

<<Execute>>

bash-3.2$ python temp2.py > ids1.log

<<Output>>

bash-3.2$ tail -1 ids1.log

0.770469903946

<<Code>>

bash-3.2$ cat temp2.py
import re
file = open("temp.log")
from time import time as clock
start = clock()
while 1:
        lines = file.readlines(10000)
        if not lines:
                break
        for line in lines:
                out=re.search(r"(<25104>)\=(?P<dest>\w+)", line)
                if out is None:
                        pass
                else:
                        print(out.group('dest'))

                out=re.search(r"(Id)\=(?P<id>\w+)", line)
                if out is None:
                        pass
                else:
                        print(out.group('id'))

diff = (clock() - start)
print diff
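
A hedged variation on the regex version, assuming Python 3: compile one pattern up front and scan each line once with finditer(), instead of calling re.search() twice per line. The module-level re functions do cache compiled patterns, so this mainly saves the per-call cache lookup and the second scan. TAG_RE and values are illustrative names, and no speed-up is promised without measuring.

```python
import re

# One alternation covers both tags; the group captures the value up to "]".
TAG_RE = re.compile(r"\[(?:Id|Destination<25104>)=([^\]]*)\]")

def values(lines):
    """Yield every Id and Destination value, in the order they appear."""
    for line in lines:
        for match in TAG_RE.finditer(line):
            yield match.group(1)

sample = ["[seq=1][Id=11579995][OrderID<25100>=11579995][Destination<25104>=router]\n"]
print(list(values(sample)))  # ['11579995', 'router']
```

Note that the values come out in line order (Id first here), whereas temp2.py prints the destination before the id.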

kachelaqa · 2011-09-24 00:43:00

srikanthradix wrote:

<<Execution>>

bash-3.2$ python temp.py > ids1.log

<<Output>>

bash-3.2$ tail -1 ids1.log

0.277968883514

<<temp.py>>

bash-3.2$ cat temp.py
from time import time as clock

start = clock()

for line in open("temp.log"):
        start_idx = line.find("Id=")
        if start_idx > -1:
                end_idx=line.find("]",start_idx)
                subs=line[start_idx+3:end_idx]
                print subs
        start_idx = line.find("<25104>=")
        if start_idx > -1:
                end_idx=line.find("]",start_idx)
                subs=line[start_idx+8:end_idx]
                print subs

diff = (clock() - start)
print diff

Your timing method is probably giving you bogus results.

Firstly: never use time.time() for benchmarking code. It will almost always give inaccurate results (see here for why). Use the timeit module instead.

Secondly: don't include print statements in the code you're testing, because the I/O will mask the real performance of your algorithm.

Try running your code like this:

from timeit import timeit

def func():
    output = []
    for line in open("temp.log"):
        start_idx = line.find("Id=")
        if start_idx > -1:
            end_idx=line.find("]",start_idx)
            subs=line[start_idx+3:end_idx]
            output.append(subs)
        start_idx = line.find("<25104>=")
        if start_idx > -1:
            end_idx=line.find("]",start_idx)
            subs=line[start_idx+8:end_idx]
            output.append(subs)
    return '\n'.join(output)

time = timeit('func()', 'from __main__ import func', number=3)
print 'func: %.8f sec/pass' % (time / 3)
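
The same idea can be extended to compare both approaches side by side. This is only a sketch for Python 3: LINES is synthetic data standing in for temp.log (an assumption, since the real file is not available), and by_find and by_regex are illustrative names. It uses timeit.repeat() and takes the minimum of several runs.

```python
import re
from timeit import repeat

# Synthetic stand-in for temp.log (assumption: the real log is much larger).
LINES = ["[seq=%d][Id=%d][Destination<25104>=router][ParentID<25101>=-1]\n" % (i, i)
         for i in range(1000)]

TAG_RE = re.compile(r"\[(?:Id|Destination<25104>)=([^\]]*)\]")

def by_find(lines):
    """Collect values with str.find(), as in temp.py."""
    out = []
    for line in lines:
        for prefix in ("[Id=", "<25104>="):
            start = line.find(prefix)
            if start > -1:
                start += len(prefix)
                out.append(line[start:line.find("]", start)])
    return out

def by_regex(lines):
    """Collect the same values with one precompiled pattern."""
    return [m.group(1) for line in lines for m in TAG_RE.finditer(line)]

if __name__ == "__main__":
    for func in (by_find, by_regex):
        # 3 repeats of 5 passes each; report the best per-pass time.
        best = min(repeat(lambda: func(LINES), repeat=3, number=5)) / 5
        print("%s: %.8f sec/pass" % (func.__name__, best))
```

Both functions work on lines already in memory, so file I/O and printing are kept out of the measurement.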

Nisstyre56 · 2011-09-24 19:38:37

# line holds one line of the log file
print [item for item in line.split("][") if "Destination" in item or "ID" in item]

output:

['OrderID<25100>=11579995', 'Destination<25104>=router', 'ParentID<25101>=-1]']

performance is O(2n), which is still linear (str.split() runs over the whole line before the list comprehension starts, since python doesn't come with a lazy version of split for strings)
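
A sketch of taking the split idea one step further, assuming Python 3: split once and build a dict, so any tag can be looked up by its exact name. The "ID" test above is case-sensitive, so it picks up OrderID and ParentID but not the plain Id tag; exact-name lookup avoids that. parse_tags is a hypothetical helper, and it assumes all tags follow the seq tag (as stated in the first post) and that no value contains "][".

```python
LINE = ("20110920 19:59:59.752441 [ORDER PENDINGACCEPT        ] "
        "[seq=178122233][Major=O][src=osf][Id=11579995][ref=53839822]"
        "[OrderID<25100>=11579995][Destination<25104>=router][ParentID<25101>=-1]")

def parse_tags(line):
    """Map each tag name to its value; returns {} if there is no seq tag."""
    start = line.find("[seq=")
    if start == -1:
        return {}
    # Drop the trailing newline, then the outermost brackets.
    body = line[start:].rstrip().strip("[]")
    tags = {}
    for item in body.split("]["):
        name, sep, value = item.partition("=")  # split on the first "=" only
        if sep:
            tags[name] = value
    return tags

tags = parse_tags(LINE)
print(tags["Id"], tags["Destination<25104>"])  # 11579995 router
```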

Last edited by Nisstyre56 (2011-09-24 19:44:49)
