[SOLVED] Conky running a Python script, encoding problem

orderlyDisorganized · 2017-07-08 17:23:45

Greetings,

I made a Python script to fetch some news from the Internet. I wanted to translate HTML entities into the symbols they stand for, for example "&rsaquo;" to "›". At first I got some UnicodeEncodeErrors, but after reading up on encodings and how Python handles them I fixed those issues, as far as I know. Here is a simplified version of the part of my main script that converts HTML entities to symbols:

#-*- coding: utf-8 -*-

import re
# HTMLParser does the mapping between HTML entities and symbols
from HTMLParser import HTMLParser

h = HTMLParser()
# Convert the str to unicode so that combining it with the unicode returned by unescape() never triggers an implicit ASCII conversion
# Also, decode as UTF-8 in case the source contains non-ASCII characters
string = unicode("Some stuff&rsaquo;&rsaquo;", "utf-8")
# Get an iterator yielding matches of HTML entities
matches = re.finditer("&[^;]+;", string)
# Put the (start, end) indexes of the matches in a list
match_spans = [match.span() for match in matches]

# Go through the spans in reverse so that earlier indexes stay valid while the unicode string is modified
for start, end in reversed(match_spans):
    c = h.unescape(string[start:end])
    # Replace only this span; string.replace() would replace every occurrence of the entity at once
    string = string[:start] + c + string[end:]
    
print string
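
For anyone reading later, here is a minimal sketch of the same entity replacement with the span bookkeeping left to re.sub(). The names ENTITY and replace_entities are made up for illustration, and the try/except import is only an assumption about wanting the snippet to run on both Python 2 and Python 3.

```python
import re

try:
    from html import unescape  # Python 3.4+
except ImportError:
    # Python 2 fallback: same unescape() method the script above uses
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape

# Same pattern as in the script: "&", anything but ";", then ";"
ENTITY = re.compile(r"&[^;]+;")


def replace_entities(text):
    """Replace each HTML entity in text with the symbol it stands for."""
    # re.sub() calls the function once per match and splices the result in,
    # so no manual index handling is needed.
    return ENTITY.sub(lambda match: unescape(match.group(0)), text)


result = replace_entities(u"Some stuff&rsaquo;&rsaquo;")
```

unescape() also accepts the whole string at once, so the regular expression is only needed if the matches have to be handled one by one.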

The output is fine in Eclipse and in a terminal. Unfortunately, for reasons I don't understand, running the script from Conky brings back the UnicodeEncodeErrors I had got rid of in Eclipse and in the terminal. I don't understand why Conky affects the script; I thought it would simply display whatever Python prints on my desktop.

I've tried the line "override_utf8_locale = true," in my conky.conf without success.

Does anyone know what I'm missing here?
Thanks!

Edit: And I'm using Python 2.7.

Last edited by orderlyDisorganized (2017-07-08 18:11:45)

rtoijala · 2017-07-08 17:41:53

The problem is not with conky, since you can reproduce the problem by simply piping to tee, as in

python2 script.py | tee
Traceback (most recent call last):
  File "./a.py", line 21, in <module>
    print string 
UnicodeEncodeError: 'ascii' codec can't encode characters in position 10-11: ordinal not in range(128)

The difference is probably that the output is piped to a process instead of going to the terminal, which makes Python for some reason assume that you want ASCII output.
I'm not familiar enough with Python 2 to solve the problem there, but running with Python 3 instead (parentheses for "print", importing from "html.parser" instead of "HTMLParser", and using the literal as-is instead of wrapping it in "unicode()") works:

python3 modified-script.py | tee
Some stuff››

Edit:
The answer is here: https://stackoverflow.com/questions/454 … ng-to-file
tl;dr: use PYTHONIOENCODING="UTF-8"
Though I still recommend switching to Python 3, which has much saner semantics regarding Unicode.
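
A minimal sketch of what the variable does, driven from Python so it can be tried anywhere: it starts a child interpreter whose stdout is a pipe (as with "| tee") and sets PYTHONIOENCODING for it. The child one-liner and the variable names are made up for illustration.

```python
import os
import subprocess
import sys

# Child program: print a single non-ASCII character (U+203A).
child_code = 'print(u"\\u203a")'

# Copy the current environment and add the variable from the tl;dr.
env = dict(os.environ)
env["PYTHONIOENCODING"] = "UTF-8"

# stdout=PIPE connects the child's stdout to a pipe instead of a terminal.
proc = subprocess.Popen(
    [sys.executable, "-c", child_code],
    stdout=subprocess.PIPE,
    env=env,
)
output, _ = proc.communicate()

# The raw bytes the child wrote; with the variable set they are UTF-8.
print(repr(output))
decoded = output.decode("utf-8").strip()
```

In a shell the variable is usually set as a prefix to the command, for example PYTHONIOENCODING=UTF-8 python2 script.py.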

Last edited by rtoijala (2017-07-08 17:54:22)

orderlyDisorganized · 2017-07-08 18:10:59

rtoijala wrote:

Edit:
The answer is here: https://stackoverflow.com/questions/454 … ng-to-file
tl;dr: use PYTHONIOENCODING="UTF-8"
Though I still recommend switching to Python 3, which has much saner semantics regarding Unicode.

Thanks! Now it's working as I intended.

I agree about switching to Python 3, but the nice thing about using Python 2 for this script was learning about ASCII and Unicode.

Edit: Some additional testing and explanation, in case someone is interested. I recommend reading the first answer in rtoijala's link.

>>> c = "›"
>>> print len(c)
3
>>> print type(c)
<type 'str'>
>>> c
'\xe2\x80\xba'
>>> print c
›

The length of three might seem odd at first: in Python 2 a plain string literal is a byte string, and "›" takes three bytes in UTF-8. I guess Python knows to use UTF-8 because the terminal uses it (which gets the encoding from the system or from its own settings(?)). I suppose the same applies to an IDE: Python knows the encoding the IDE is set to.
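
The byte count can be checked separately from the terminal. This is a small sketch with made-up variable names; it spells out the three bytes from the session above and decodes them, and it runs the same way on Python 2 and Python 3.

```python
# The three bytes Python 2 showed for the literal "›" in the session above
raw = b"\xe2\x80\xba"

# Decoding them as UTF-8 gives a single code point, U+203A
text = raw.decode("utf-8")

print(len(raw))            # 3 (bytes)
print(len(text))           # 1 (character)
print(text == u"\u203a")   # True
```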

When the output is piped, it goes to some other program and Python has no way of knowing which encoding that program expects. Hence Python 2 falls back to its default encoding, which is ASCII. Python 3 instead uses the locale's encoding, typically UTF-8, which is why the script works fine in Python 3.
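
An alternative to relying on the default is to encode explicitly before writing. The following is only a sketch of that idea, not something from the linked answer: encode_for, emit and PipeLike are hypothetical names, and choosing UTF-8 as the fallback is an assumption.

```python
import sys


def encode_for(stream, text, fallback="utf-8"):
    """Encode text with the stream's encoding, or with fallback if it reports none."""
    encoding = getattr(stream, "encoding", None) or fallback
    return text.encode(encoding)


def emit(text):
    """Write text plus a newline to stdout as explicitly encoded bytes."""
    data = encode_for(sys.stdout, text + u"\n")
    # Python 3 exposes the byte stream as sys.stdout.buffer;
    # in Python 2 sys.stdout itself accepts bytes.
    getattr(sys.stdout, "buffer", sys.stdout).write(data)


class PipeLike(object):
    """Stand-in for a piped stdout in Python 2, which reports no encoding."""
    encoding = None


piped_bytes = encode_for(PipeLike(), u"\u203a")
```

With a stream that reports no encoding, the text is encoded with the fallback instead of ASCII, so the character survives the pipe as long as the receiving program expects UTF-8.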

example.py

#-*- coding: utf-8 -*-
# The coding line above is needed only because this source file contains the literal "›"; it has nothing to do directly with the output encoding discussed here

c = unicode("›", "utf-8")
print c

[user@host ~]$ python2 example.py
›
[user@host ~]$ python2 example.py | tee
Traceback (most recent call last):
  File "example.py", line 5, in <module>
    print c
UnicodeEncodeError: 'ascii' codec can't encode character u'\u203a' in position 0: ordinal not in range(128)

Reproducing the problem without piping:

[user@host ~]$ export LC_CTYPE=trash
bash: warning: setlocale: LC_CTYPE: cannot change locale (trash): No such file or directory
[user@host ~]$ python2
Python 2.7.13 (default, Jul  2 2017, 22:24:59) 
[GCC 7.1.1 20170621] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> print sys.stdout.encoding
ANSI_X3.4-1968
>>> c = unichr(128)
>>> print c
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\x80' in position 0: ordinal not in range(128)
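
The same error can be triggered without touching the locale or a pipe by encoding to ASCII explicitly, which is what happens implicitly when the output encoding is ASCII. A small sketch, with ascii_error as a made-up helper name:

```python
def ascii_error(text):
    """Return the UnicodeEncodeError message for text, or None if it is pure ASCII."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError as error:
        return str(error)
    return None


# Code point 128 is the first one outside the ASCII range (0-127).
message = ascii_error(u"\x80")
print(message)
print(ascii_error(u"abc"))  # None
```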

Python 2's print uses the encoding of my terminal, which it takes from the locale (LC_CTYPE). If the output is piped, it defaults to ASCII.

Last edited by orderlyDisorganized (2017-07-10 16:54:46)

Arch Linux
