[Solved] (Python3) Ignoring lines with control characters

alphaniner · 2014-06-13 15:42:55

Long story short, I log the output of an in-house program with tee. The program displays a progress meter. When I read the file into a list, I end up with lines like:

>>> temp_file[18]
'Write:   1% [>                                                  ] 1 MB\x1b[2K\n'

I can filter these out with:

for line in temp_file:
  if '\x1b' not in line:
    do_something_with(line)

This leaves me with exactly what I want, but maybe someone knows of a better way? Thanks.
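
For reference, the filter above can be wrapped in a small generator. This is only a sketch: keep_clean_lines and the sample list are made-up names and data for illustration, not part of the in-house program or its real output.

```python
ESC = '\x1b'  # the escape character that starts the terminal control sequence

def keep_clean_lines(lines):
    """Yield only the lines that contain no escape character."""
    for line in lines:
        if ESC not in line:
            yield line

# Hypothetical sample standing in for the tee'd log file.
temp_file = [
    'Starting write\n',
    'Write:   1% [>   ] 1 MB\x1b[2K\n',
    'Done\n',
]

clean = list(keep_clean_lines(temp_file))
print(clean)
```

Because it is a generator, it also works directly on an open file object without reading the whole log into a list first.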

Last edited by alphaniner (2014-06-17 16:51:05)

firecat53 · 2014-06-13 15:51:53

Perhaps look at string.printable

Scott

Edit: fixed link

Last edited by firecat53 (2014-06-13 16:28:22)

alphaniner · 2014-06-13 16:18:49

Something tells me you didn't test that link.

firecat53 · 2014-06-13 16:23:24

Oops! Posting from phone :-P
https://docs.python.org/3.4/library/string.html
Sorry!

Edit:

From here:

filtered_string = ''.join(filter(lambda x: x in string.printable, myStr))  # filter() returns an iterator in Python 3

Last edited by firecat53 (2014-06-13 16:31:53)

alphaniner · 2014-06-13 16:45:12

I figured that's where you were going. But I don't want to filter unwanted stuff out of strings, I want to filter out entire strings if they contain unwanted stuff.

IOW, it should go ding when there's stuff.

firecat53 · 2014-06-13 17:17:30

Here's a one-liner. Might be able to use itertools somehow, but I have to leave now. I guess you could try regex character classes as well. Not sure how this would handle unicode.

[line for line in temp_file if not any([i for i in line if i not in string.printable])]

Trent · 2014-06-15 13:38:05

TBH, I think your original solution is clearer and to the point -- in a word, Pythonic. If it works just the way you want, leave it.

If you want to filter out all nonprinting characters, I might combine the approaches (untested):

for line in temp_file:
    if any(c not in string.printable for c in line):
        continue
    do_something_with(line)

Just be sure to leave a comment explaining why you're doing such a thing.

N.B. firecat's solution walks through the string creating a list of all the nonprinting characters, then passes that list to any(). Since I left off the [], mine stops the first time it sees a nonprinting character. You might even accelerate the process more by doing "for c in reversed(line)" which will tend to find the nonprinting characters faster when they're near the end of the string.
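
A rough way to see the effect of the reversed() idea is to count how many characters any() actually inspects before it stops. This is a sketch under assumptions: count_checks is a hypothetical helper written for this illustration, and the sample line is modelled on the progress-meter line from the first post.

```python
import string

PRINTABLE = set(string.printable)

def count_checks(chars):
    """Return how many characters any() inspects before it can stop."""
    seen = 0

    def is_nonprinting(c):
        nonlocal seen
        seen += 1
        return c not in PRINTABLE

    # Generator expression (no square brackets), so any() stops at the first hit.
    any(is_nonprinting(c) for c in chars)
    return seen

line = 'Write:   1% [>' + ' ' * 50 + '] 1 MB\x1b[2K\n'

forward = count_checks(line)             # walks almost the whole line
backward = count_checks(reversed(line))  # finds the escape after a few characters
print(forward, backward)
```

For a line with no nonprinting characters both directions inspect every character, so the gain only shows up on the lines that get rejected.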

Last edited by Trent (2014-06-15 21:46:41)

firecat53 · 2014-06-15 16:14:34

@Trent thanks! I knew there was a way to stop iteration when the first non-printable character is found. I apparently didn't stare at it long enough to figure it out. I only moved to string.printable instead of alphaniner's original solution because it seems more flexible...assuming that at some point there might be other non-printable characters introduced into the input.

Scott

Last edited by firecat53 (2014-06-15 16:14:57)

progandy · 2014-06-15 19:53:34

I believe writing it this way might be cleaner:

printset = set(string.printable)
for line in temp_file:
    if printset.issuperset(line):
        do_something_with(line)

Trent · 2014-06-15 20:23:03

You could do it that way, but I think it begins to fall afoul of "Readability counts".

Speaking of which, Guido's time machine pulls through:

for line in temp_file:
    if line.isprintable():
        do_something_with(line)

progandy · 2014-06-15 20:43:34

Trent wrote:

You could do it that way, but I think it begins to fall afoul of "Readability counts".

In my book printableset.issuperset is more readable than an any(for c in line), but that might be my mathematical knowledge about sets.

Speaking of which, Guido's time machine pulls through:
for line in temp_file:
    if line.isprintable():
        do_something_with(line)

That is right. Python 3 finally has this function. I am still partially thinking in Python 2.7.
Edit: Take care, since str.isprintable interprets tabs and newlines as non-printable. Maybe do str.expandtabs(1).isprintable().
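
The difference is easy to check. In the sketch below, printable_by_table and printable_by_method are hypothetical helper names. Note that expandtabs(1) only deals with tabs, so the trailing newline has to be stripped separately (an assumption here is that each line ends in a single '\n'). The non-ASCII sample also shows that str.isprintable() accepts characters that string.printable, being ASCII-only, rejects.

```python
import string

def printable_by_table(line):
    # ASCII-only test; string.printable includes tab and newline.
    return all(c in string.printable for c in line)

def printable_by_method(line):
    # str.isprintable() rejects tab and newline, so drop the line ending
    # and turn each tab into a space before asking.
    return line.rstrip('\n').expandtabs(1).isprintable()

samples = ['plain\n', 'tab\tseparated\n', 'caf\u00e9\n', 'meter\x1b[2K\n']

for s in samples:
    print(repr(s), s.isprintable(), printable_by_table(s), printable_by_method(s))
```

Every sample fails the bare isprintable() call because of the newline, which is why the raw method rejects all lines read from a file unless the line ending is removed first.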

Last edited by progandy (2014-06-15 21:38:19)

firecat53 · 2014-06-15 21:32:08

I love Python! So many ways to skin the proverbial cat! This is a great learning thread...

Scott

Trent · 2014-06-15 21:46:56

progandy wrote:

In my book printableset.issuperset is more readable than an any(for c in line), but that might be my mathematical knowledge about sets.

To each his own. A string is not a set and I would never ask a question like "Is the set of characters in this string a subset of the set of characters in string.printable?" On the other hand, "Is it true, for any character c in this string, that c is not in string.printable?" seems relatively straightforward. But that's a subjective measure.

Edit: Take care, since str.isprintable interprets tabs and newlines as non-printable. Maybe do str.expandtabs(1).isprintable().

Hrm. Yuck. No, in that case I'll fall back to my original version. Disclaimer removed.

Actually, on second thought, I don't know why I went for any() in the first place, since its complement/counterpart is also provided:

for line in temp_file:
    if all(c in string.printable for c in line):
        do_something_with(line)
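
As a sanity check, the any(), all() and set versions discussed so far should accept and reject exactly the same lines. A minimal sketch, with by_any, by_all and by_set as made-up names and a few invented sample lines:

```python
import string

PRINTSET = set(string.printable)

def by_any(line):
    # Keep the line unless some character is not printable.
    return not any(c not in string.printable for c in line)

def by_all(line):
    # Keep the line if every character is printable.
    return all(c in string.printable for c in line)

def by_set(line):
    # issuperset() accepts any iterable, so the string can be passed directly.
    return PRINTSET.issuperset(line)

samples = ['ok\n', 'tab\there\n', 'meter\x1b[2K\n', '']
results = [(by_any(s), by_all(s), by_set(s)) for s in samples]
print(results)
```

All three agree on the empty string as well, which counts as printable in each version.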

Last edited by Trent (2014-06-15 21:51:05)

firecat53 · 2014-06-15 22:56:50

Trent wrote:

Actually, on second thought, I don't know why I went for any() in the first place, since its complement/counterpart is also provided:
for line in temp_file:
    if all(c in string.printable for c in line):
        do_something_with(line)

Didn't you go for it because the for loop would terminate after the first non-matching character? Possibly speeding it up a little? Or does any have to wait for the entire result from the for loop first?

Scott

Last edited by firecat53 (2014-06-15 22:57:44)

Trent · 2014-06-16 02:18:19

That's true for all() as well. If the first character in line is nonprinting, all() won't continue to request more from the iterator because it already knows the final result.

>>> i = iter('hello, world')
>>> all(c in 'ehlo' for c in i)
False
>>> ''.join(i)
' world'

You could still do the reversed(line) trick, but on the whole it probably buys you very little.

Edit -- I wasn't very clear earlier. The difference between my solution and your earlier one is the lack of square brackets around the comprehension expression. If you were to write instead of the above

all([c in 'ehlo' for c in i])

then Python would construct a list [True, True, True, True, True, False, False, False, True, False, True, False] and hand the entire thing to all(). Without the brackets, Python creates a generator, which is lazy -- it doesn't calculate whether the next value is True or False until it's needed.

Last edited by Trent (2014-06-16 02:27:39)

alphaniner · 2014-06-16 14:50:36

Thanks for all the suggestions everyone! Regarding reversal, I had thought of that as well but in a different context:

for line in temp_file:
  if line.rfind('\x1b', -5) < 0:
    do_something_with(line)

I guess that's rather unpythonic, but only ~20 of ~1000 will pass the test and I'm an efficiency fiend.

Everything else is new to me so I'll have to do some research and mull things over.

Trent · 2014-06-16 23:18:37

Slicing is quicker (in this case) and more pythonic:

if '\x1b' in line[-5:]:

But then perhaps you might as well just

if line[-5] == '\x1b':

Personally, I think that's fine, if it works and you're not expecting the in-house program to change its output format. Don't do more work to no advantage. But if you think you might need it to be more flexible, by all means go for one of the other suggestions.

progandy · 2014-06-16 23:37:03

Trent wrote:

Personally, I think that's fine, if it works and you're not expecting the in-house program to change its output format. Don't do more work to no advantage. But if you think you might need it to be more flexible, by all means go for one of the other suggestions.

You have the right mindset. Just document that stuff, otherwise you won't understand your code if you ever have to change it later on.

alphaniner · 2014-06-17 16:50:05

Trent wrote:

Slicing is quicker (in this case) and more pythonic:
if '\x1b' in line[-5:]:
But then perhaps you might as well just
if line[-5] == '\x1b':
Personally, I think that's fine, if it works and you're not expecting the in-house program to change its output format. Don't do more work to no advantage. But if you think you might need it to be more flexible, by all means go for one of the other suggestions.

Doh, I didn't even consider slicing. Definitely clearer and more pythonic, though I'll have to take your word about quicker. If I use the index it will be necessary to check the length:

if len(line) < 5 or line[-5] != '\x1b':

For now I'll just stick with the slice to be safe.

Thanks again to everyone.

Trent · 2014-06-18 00:32:30

alphaniner wrote:

Doh, I didn't even consider slicing. Definitely clearer and more pythonic, though I'll have to take your word about quicker.

I know this thread is approaching EOL, but you don't have to take my word for it.

% python -m timeit -s "s='hello, world'" "'\x1b' in s[-5:]"
10000000 loops, best of 3: 0.194 usec per loop
% python -m timeit -s "s='hello, world'" "s.rfind('\x1b', -5)"
1000000 loops, best of 3: 0.304 usec per loop

(using timeit)
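
The same comparison can be run from inside Python with the timeit module instead of the command line. This is a sketch only: the loop count of 100,000 is an arbitrary choice, and the absolute numbers depend on the machine and Python version.

```python
import timeit

NUMBER = 100_000  # loops per measurement; arbitrary, kept small so it runs quickly

setup = "s = 'hello, world'"
statements = {
    'slice': r"'\x1b' in s[-5:]",
    'rfind': r"s.rfind('\x1b', -5)",
}

# Take the best of three runs for each statement, as the command-line tool does.
timings = {
    name: min(timeit.repeat(stmt, setup=setup, repeat=3, number=NUMBER))
    for name, stmt in statements.items()
}

for name, seconds in timings.items():
    print(f'{name}: {seconds / NUMBER * 1e6:.3f} usec per loop')
```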

If I use the index it will be necessary to check the length

Oh, true. Hadn't thought of that.

progandy · 2014-06-18 01:01:02

Trent wrote:

If I use the index it will be necessary to check the length
Oh, true. Hadn't thought of that.

Then slice a single character:

$ python -m timeit -s "s='hello, world'" "'\x1b' in s[-5:]"
1000000 loops, best of 3: 0.601 usec per loop
$ python -m timeit -s "s='hello, world'" "s.rfind('\x1b', -5)"
1000000 loops, best of 3: 0.817 usec per loop
$ python -m timeit -s "s='hello, world'" "s[-5:-4] == '\x1b'"
1000000 loops, best of 3: 0.546 usec per loop
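
How the three variants behave on short lines can be checked directly. In this sketch the function names and sample strings are invented for illustration. The point is that slices never raise on short input, while a bare index does, and that the five-character slice and the one-character slice are not quite the same test.

```python
ESC = '\x1b'

def esc_in_tail(line):
    # True if an escape appears anywhere in the last five characters.
    return ESC in line[-5:]

def esc_at_minus5_slice(line):
    # True only if the fifth character from the end is an escape;
    # the slice is simply empty when the line is shorter than that.
    return line[-5:-4] == ESC

def esc_at_minus5_index(line):
    # Same test with plain indexing, which needs an explicit length check.
    return len(line) >= 5 and line[-5] == ESC

samples = ['Write: 1%\x1b[2K\n', 'ok\n', '', '\x1b[K\n']
table = [(esc_in_tail(s), esc_at_minus5_slice(s), esc_at_minus5_index(s))
         for s in samples]
print(table)

# Without the length check, indexing a short line fails.
try:
    'ok\n'[-5]
    raised = False
except IndexError:
    raised = True
```

The last sample has its escape only four characters from the end, so the wider slice catches it while the two position-specific tests do not.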

Arch Linux
