#1 2007-11-23 07:20:54

ironbug
Member
Registered: 2007-11-04
Posts: 19

python: are files left open when using open().readlines together?

hello archers
i wrote this silly little script to add line numbers to a file...

#!/usr/bin/env python
import sys

if len(sys.argv) != 3:
  print "Correct syntax: addlinenums [input_file] [output_file]"
  exit()

try:
  buffer = open(sys.argv[1], 'rU').readlines()
except:
  print "Error reading input file - you sure it exists?"
  exit()  

try:
  outfile = open(sys.argv[2], 'w')
  for num, line in enumerate(buffer):
    print >> outfile, str(num+1) + "  " + line,
  outfile.close()
except:
  print "Error writing output file - you sure you have permission?"

exit()

...and i was wondering how to properly close the file i opened for the buffer. since i never explicitly instantiated a file object, does python close the file for me after i read the lines into the buffer?

(by the way, is there already some other general utility to add line numbers to a file that i don't know about? in other words, am i reinventing the wheel?)

#2 2007-11-23 09:07:18

Husio
Member
From: Europe
Registered: 2005-12-04
Posts: 359

Re: python: are files left open when using open().readlines together?

You can't.

#!/usr/bin/env python
import sys

if len(sys.argv) != 3:
  print "Correct syntax: %s [input_file] [output_file]" % sys.argv[0]
  exit()

try:
  buffer = open(sys.argv[1], 'rU')
except IOError:
  print "Error reading input file - you sure it exists?"
  exit()  

try:
  outfile = open(sys.argv[2], 'w')
  for (num, line) in enumerate(buffer.readlines()):
    print >> outfile, str(num+1) + "  " + line,
  outfile.close()
except IOError:
  print "Error writing output file - you sure you have permission?"

Don't use a bare try/except; name the exceptions you want to catch (here, IOError).
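
A minimal sketch of both points, assuming a Python version that has the `with` statement (2.6 or later, or 3.x). `read_lines` is a hypothetical helper name, not part of the scripts in this thread. The `with` block closes the input file as soon as the lines have been read, and only I/O errors are caught:

```python
import sys


def read_lines(path):
    """Return the lines of `path`, or None if it cannot be read."""
    try:
        # The file is closed when the with-block exits, even on error.
        with open(path) as infile:
            return infile.readlines()
    except (IOError, OSError):
        # Only I/O failures are handled; bugs such as NameError still surface.
        sys.stderr.write("Error reading %s\n" % path)
        return None
```

Returning None on failure is just one possible design; the scripts above print a message and exit instead.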

btw:

$ cat -n file_from >> file_to

Last edited by Husio (2007-11-23 09:12:50)

#3 2007-11-23 12:13:54

MrWeatherbee
Member
Registered: 2007-08-01
Posts: 277

Re: python: are files left open when using open().readlines together?

ironbug wrote:

...and i was wondering how to properly close the file i opened for the buffer. since i never explicitly instantiated a file object, does python close the file for me after i read the lines into the buffer?

Husio wrote:

You can't.

@ ironbug
I'm not sure I understand why 'outfile' is closed explicitly in your code, yet you ask about closing 'buffer'. In both cases a file object is created with 'open()', so both can and should be closed in the same way. A couple of print statements placed at the end of the script (see Mod1 code below) should show this:

- when 'buffer.close()' is commented out:

Open File Test 1 - buffer At Time Of Exit: <open file 'buffer.txt', mode 'rU' at 0xb7c68d58>
Open File Test 2 - outfile At Time Of Exit: <closed file 'outfile.txt', mode 'w' at 0xb7c6cd58>

- when 'buffer.close()' is uncommented:

Open File Test 1 - buffer At Time Of Exit: <closed file 'buffer.txt', mode 'rU' at 0xb7befd58>
Open File Test 2 - outfile At Time Of Exit: <closed file 'outfile.txt', mode 'w' at 0xb7bf3d58>

But, maybe I misunderstood the point of your question and Husio's subsequent answer.
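
The same check can be made without printing the file object, by looking at its `closed` attribute (an attribute, whereas `close()` is the method). A small sketch; the temporary file and the variable names are hypothetical and only there to make it runnable on its own:

```python
import os
import tempfile

# Create a throwaway file so the sketch is self-contained.
fd, path = tempfile.mkstemp()
os.close(fd)

infile = open(path)
lines = infile.readlines()
state_before = infile.closed   # False: reading to the end does not close it
infile.close()
state_after = infile.closed    # True once close() has been called

os.remove(path)
```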

Also, your code will print unpadded line numbers like this:

1  a
2  b
3  c
4  d
...
10  e
...
100  f
...
1000  g

which gets messy. I added some code to modify that behavior if you are interested.

Mod1 code is a simple modification with a hard-coded pad width of four digits, which has to be changed by hand (so not very elegant). With Mod1 the output becomes:

0001  a
0002  b
0003  c
0004  d
...
0010  e
...
0100  f
...
1000  g

Mod1 code - less radical changes:

#!/usr/bin/env python

import sys
from itertools import count, izip                                   # new code

if len(sys.argv) != 3:
  print "Correct syntax: %s [input_file] [output_file]" % sys.argv[0]
  exit()

try:
  buffer = open(sys.argv[1], 'rU')
except IOError:
  print "Error reading input file - you sure it exists?"
  exit()  

try:
  outfile = open(sys.argv[2], 'w')
  #for num, line in enumerate(buffer.readlines()):                  # orig. code
  for (num, line) in izip(count(1), buffer.readlines()):            # new code
    #print >> outfile, str(num+1) + "  " + line,                    # orig. code
    outfile.write("%.04d  %s" % (num, line))                        # new code
  outfile.close()
  #buffer.close()                                                    # new code
except IOError:
  print "Error writing output file - you sure you have permission?"

print "Open File Test 1 - buffer At Time Of Exit: %s" % buffer      # test 1
print "Open File Test 2 - outfile At Time Of Exit: %s" % outfile    # test 2
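
The fixed width in Mod1 could also be passed in as a parameter. A rough sketch, assuming the formatting is moved into a hypothetical helper called `number_lines` (not part of the script above) and written so that it runs on Python 3, where `zip` replaces `izip`:

```python
from itertools import count


def number_lines(lines, width=4):
    """Prefix each line with a zero-padded, 1-based line number."""
    numbered = []
    for num, line in zip(count(1), lines):
        # "%0*d" takes the pad width from the argument tuple.
        numbered.append("%0*d  %s" % (width, num, line))
    return numbered
```

Called as `number_lines(lines, width=6)` it would pad to six digits instead of four.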

Mod 2 code is a more extensive modification that pads the line numbers with spaces, with the width chosen from the number of lines being written to file. Notice that I had to create a list ('lines') of the lines read from the file object 'buffer' so that I could operate on the information more than once (first to count the lines, then to write them). This probably gets to the heart of your original question. Reading from the file object consumes it: once 'readlines()' has run, the read position is at the end of the file and further reads return nothing (but the file is still open, thus it can be / should be closed). Mod2 code will produce this:

1  a
2  b
3  c
4  d

or this:

 1  a
 2  b
 3  c
 4  d
...
10  e

or this (etc):

  1  a
  2  b
  3  c
  4  d
...
 10  e
...
100  f

Mod 2 code:

#!/usr/bin/env python

import sys
from itertools import count, izip                                   # new code

if len(sys.argv) != 3:
  print "Correct syntax: %s [input_file] [output_file]" % sys.argv[0]
  exit()

try:
  buffer = open(sys.argv[1], 'rU', 0)
except IOError:
  print "Error reading input file - you sure it exists?"
  exit()  

lines = buffer.readlines()                                          # new code
buffer.close()                                                      # new code
lenpad = len(str(len(lines)))                                       # new code

try:
  outfile = open(sys.argv[2], 'w')
  #for num, line in enumerate(buffer.readlines()):                  # orig. code
  for (num, line) in izip(count(1), lines):                         # new code
    #print >> outfile, str(num+1) + "  " + line,                    # orig. code
    outfile.write("%s  %s" % ((str(num).rjust(lenpad)), line))      # new code
  outfile.close()
except IOError:
  print "Error writing output file - you sure you have permission?"
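
Why the list is needed can be seen in a short sketch. It assumes a throwaway temporary file and Python 3 syntax; the names `first`, `second` and `third` are only for illustration:

```python
import os
import tempfile

# Throwaway input file so the sketch runs on its own.
fd, path = tempfile.mkstemp()
os.write(fd, b"a\nb\nc\n")
os.close(fd)

with open(path) as infile:
    first = infile.readlines()    # reads everything; position is now at EOF
    second = infile.readlines()   # nothing left to read: an empty list
    infile.seek(0)                # rewind to the start of the file
    third = infile.readlines()    # the same three lines again

os.remove(path)
```

Keeping the lines in a list, as Mod 2 does, avoids the rewind and lets the file be closed right after reading.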

And as Husio pointed out, maybe other 'non-python' methods are easier to achieve similar results.

Last edited by MrWeatherbee (2007-11-23 14:45:13)

#4 2007-11-25 03:51:03

ironbug
Member
Registered: 2007-11-04
Posts: 19

Re: python: are files left open when using open().readlines together?

MrWeatherbee wrote:

Notice that I had to create a list ('lines') of the lines read from the file object 'buffer' so that I could operate on the information more than once. This probably gets to the heart of your original question. Reading from the file object consumes it: once 'readlines()' has run, the read position is at the end of the file and further reads return nothing (but the file is still open, thus it can be / should be closed).

that's exactly what i was asking. thank you very much.
i admit i was being lazy with the actual numbering in the script, but now that you've beefed up my original code i'll tuck your version away somewhere

also, i think it's really funny that you love itertools so much (i noticed you used it also in another recent python thread)... any reason?

#5 2007-11-25 15:19:32

MrWeatherbee
Member
Registered: 2007-08-01
Posts: 277

Re: python: are files left open when using open().readlines together?

ironbug wrote:

also, i think it's really funny that you love itertools so much (i noticed you used it also in another recent python thread)... any reason?

"Love" is a strong (not to mention odd) emotion to have for something like a Python generator. :)

For now (things may change as I learn more), I'll just call it a preference in cases where the count needs to begin at something other than zero (0). Simply put, using 'izip' combined with 'count' is pretty much 'enumerate' with an optional 'start' argument. That definitely comes in handy.

To beat a dead horse, using enumerate in cases where you need a non-zero starting point is like doing this:

cnt = 0
for i in li:
    print "%s %s" % (cnt + 1, i)
    cnt += 1

In the above example, why set 'cnt' to zero, and then immediately increment it by one in the 'print' statement? Why not just initialize 'cnt' to 1?
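
For comparison, a small sketch of the two spellings side by side. It assumes Python 2.6 or later, where `enumerate` accepts a start value, and uses Python 3's `zip` in place of `izip`; the sample list is made up:

```python
from itertools import count

lines = ["a\n", "b\n", "c\n"]

# zip() pairs a counter starting at 1 with each line (izip in Python 2).
with_count = list(zip(count(1), lines))

# enumerate() with an explicit start value gives the same pairs.
with_enumerate = list(enumerate(lines, 1))
```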

As far as performance goes, they seem comparable to me. I cProfiled the Mod 1 script, first with the enumerate code and then with the izip + count code (I also removed the itertools import when profiling the 'enumerate' code). The results of 5 trials, run against a file containing 1,000,000 lines, and their averages were as follows:

Values are CPU seconds

         enumerate               itertools

1          3.495                    3.570
2          3.522                    3.520
3          3.473                    3.412
4          3.450                    3.450
5          3.530                    3.397

Avg.       3.494                    3.470
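
A measurement of this kind could be set up roughly as follows. This is only a sketch in Python 3 and not necessarily how the numbers above were produced: `number_lines` is a hypothetical stand-in for the loop in the script, and the sample is smaller than the 1,000,000-line test file:

```python
import cProfile
import io
import pstats
from itertools import count


def number_lines(lines):
    """Hypothetical stand-in for the numbering loop in the script."""
    return ["%04d  %s" % (num, line) for num, line in zip(count(1), lines)]


sample = ["line\n"] * 100000

profiler = cProfile.Profile()
profiler.enable()
result = number_lines(sample)
profiler.disable()

# Collect the report in a string instead of writing it straight to stdout.
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("stdname").print_stats()   # "Ordered by: standard name"
print(report.getvalue())
```

Timings from a sketch like this vary from run to run, which is why several trials are averaged.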

Sample cProfile output for 'enumerate' code:

         1000010 function calls in 3.450 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    3.450    3.450 <string>:1(<module>)
        1    2.738    2.738    3.449    3.449 addlinenums_alpha.py:5(<module>)
        1    0.000    0.000    3.450    3.450 {execfile}
        1    0.000    0.000    0.000    0.000 {len}
        2    0.000    0.000    0.000    0.000 {method 'close' of 'file' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.138    0.138    0.138    0.138 {method 'readlines' of 'file' objects}
  1000000    0.569    0.000    0.569    0.000 {method 'write' of 'file' objects}
        2    0.004    0.002    0.004    0.002 {open}

Sample cProfile output for itertools code:

         1000010 function calls in 3.397 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    3.397    3.397 <string>:1(<module>)
        1    2.666    2.666    3.396    3.396 addlinenums_alpha.py:5(<module>)
        1    0.000    0.000    3.397    3.397 {execfile}
        1    0.000    0.000    0.000    0.000 {len}
        2    0.000    0.000    0.000    0.000 {method 'close' of 'file' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.141    0.141    0.141    0.141 {method 'readlines' of 'file' objects}
  1000000    0.575    0.000    0.575    0.000 {method 'write' of 'file' objects}
        2    0.014    0.007    0.014    0.007 {open}

And ... just to finish up this line of thought, here are the results of cProfiling the Mod 2 code. The trials were run against the code as I posted it originally and then with an optimization. The optimization took the method call:

str(num).rjust(lenpad)

and bound the method itself to a name outside the loop:

rjust = str.rjust

thus the code inside the loop is converted to this:

rjust(str(num), lenpad)
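
That the two spellings give the same result can be checked with a short sketch (Python 3 syntax; the line count of 10 and the names `plain` and `bound` are made up for illustration):

```python
total = 10
lenpad = len(str(total))   # 2 characters are enough for the number 10

# Method looked up on each str object inside the loop.
plain = [str(num).rjust(lenpad) for num in range(1, total + 1)]

# Method looked up once, then called as a plain function.
rjust = str.rjust
bound = [rjust(str(num), lenpad) for num in range(1, total + 1)]
```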

Here is the cProfile before the optimization:

         2000012 function calls in 4.719 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    4.719    4.719 <string>:1(<module>)
        1    3.498    3.498    4.718    4.718 addlinenums.py:3(<module>)
        1    0.000    0.000    4.719    4.719 {execfile}
        3    0.000    0.000    0.000    0.000 {len}
        2    0.000    0.000    0.000    0.000 {method 'close' of 'file' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.144    0.144    0.144    0.144 {method 'readlines' of 'file' objects}
  1000000    0.520    0.000    0.520    0.000 {method 'rjust' of 'str' objects}
  1000000    0.552    0.000    0.552    0.000 {method 'write' of 'file' objects}
        2    0.005    0.003    0.005    0.003 {open}

and after the optimization:

         1000012 function calls in 4.330 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    4.330    4.330 <string>:1(<module>)
        1    3.616    3.616    4.329    4.329 addlinenums.py:3(<module>)
        1    0.000    0.000    4.330    4.330 {execfile}
        3    0.000    0.000    0.000    0.000 {len}
        2    0.000    0.000    0.000    0.000 {method 'close' of 'file' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.135    0.135    0.135    0.135 {method 'readlines' of 'file' objects}
  1000000    0.574    0.000    0.574    0.000 {method 'write' of 'file' objects}
        2    0.005    0.002    0.005    0.002 {open}
