Python and UTF-8

Ilya · 2007-10-15 22:22:26

Can't understand...

My locale: ru_RU.UTF-8 (Russian)

script.py (executable)

#! /usr/bin/env python
# -*- coding: UTF-8 -*-

hello = "ПРЕВЕД"
print hello
print hello[-1:]
print hello[-2:]
print hello[-3:]
print hello[-4:]

The output should be (as I understand it):

$ ./script.py
ПРЕВЕД
Д
ЕД
ВЕД
ЕВЕД

But the actual output is:

$ ./script.py
ПРЕВЕД

Д
Д
ЕД

It works when hello is defined as u"ПРЕВЕД", but I assumed that the second line of the script makes all strings UTF-8 by default (without the "u" prefix)...

Last edited by Ilya (2007-10-16 03:14:49)
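
Editor's sketch of why the slices come out this way. It is written for Python 3, where a bytes object stands in for the Python 2 byte string from the script above; the helper name tail is made up for illustration. In UTF-8 every Cyrillic letter occupies two bytes, so slicing by byte index can cut a letter in half.

```python
# Python 3 sketch: bytes plays the role of the Python 2 byte string "ПРЕВЕД".
hello_bytes = "ПРЕВЕД".encode("utf-8")

# Six letters, two bytes each in UTF-8.
byte_count = len(hello_bytes)


def tail(data, n):
    """Decode the last n bytes as UTF-8, marking broken sequences with U+FFFD."""
    return data[-n:].decode("utf-8", errors="replace")


# Same slices as in script.py: [-1:], [-2:], [-3:], [-4:]
tails = [tail(hello_bytes, n) for n in range(1, 5)]
for t in tails:
    print(t)
```

The odd-length slices begin in the middle of a letter, so they start with a broken byte that a terminal typically shows as nothing or as a replacement mark. That matches the empty-looking line and the repeated "Д" in the actual output.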

kumico · 2007-10-15 23:28:21

Add

# -*- coding: utf-16 -*-

to the first or second line of the source file.

Ilya · 2007-10-16 03:11:09

kumico wrote:

Add
# -*- coding: utf-16 -*-
to the first or second line of the source file.

It doesn't work.

$ ./script.py 
  File "./script.py", line 3
SyntaxError: UTF-16 stream does not start with BOM

Anyway, my script.py already has an encoding line that matches my locale (UTF-8). So why UTF-16?

P.S. With a BOM (Byte Order Mark) or without it, I get the same (wrong) result.

Last edited by Ilya (2007-10-16 03:12:50)
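
Editor's sketch, in Python 3, of what the BOM message refers to. Text encoded with the generic "utf-16" codec starts with a two-byte byte order mark, while the same text saved as UTF-8 (as script.py is, assuming the editor follows the locale) does not, so declaring utf-16 for a UTF-8 file cannot work.

```python
import codecs

text = "ПРЕВЕД"

# The generic utf-16 codec writes a byte order mark first.
utf16_bytes = text.encode("utf-16")
utf16_has_bom = utf16_bytes[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# The same text as UTF-8 starts with the first letter's bytes, not a UTF-16 BOM.
utf8_bytes = text.encode("utf-8")
utf8_has_utf16_bom = utf8_bytes[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

print(utf16_has_bom, utf8_has_utf16_bom)
```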

stonecrest · 2007-10-16 04:01:18

Ilya wrote:

It works when hello is defined as u"ПРЕВЕД", but I assumed that the second line of the script makes all strings UTF-8 by default (without the "u" prefix)...

I believe that line simply allows you to put Unicode characters into your script; it doesn't change what type your string literals are, so it's not making all strings Unicode by default. If you took that line out of your script, you wouldn't be able to use the line hello = "ПРЕВЕД" at all.
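
Editor's sketch of what the coding line does: it tells Python how to decode the bytes of the source file. The example uses Python 3, where plain literals are already text (unlike the Python 2 byte strings discussed in this thread), and the helper name literal_from_source is invented for illustration. The same two bytes, D0 BB, are compiled under two different declarations.

```python
def literal_from_source(encoding_name):
    """Compile a tiny module whose string literal contains the raw bytes D0 BB."""
    source = (
        b"# -*- coding: " + encoding_name.encode("ascii") + b" -*-\n"
        b"s = '\xd0\xbb'\n"
    )
    namespace = {}
    # compile() accepts bytes and honours the coding declaration on line 1.
    exec(compile(source, "<sketch>", "exec"), namespace)
    return namespace["s"]


as_utf8 = literal_from_source("utf-8")      # the two bytes form one letter
as_latin1 = literal_from_source("latin-1")  # each byte becomes its own character
print(len(as_utf8), len(as_latin1))
```

Under the utf-8 declaration the literal is the single Cyrillic letter "л"; under latin-1 the same bytes are read as two separate characters.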

Ilya · 2007-10-16 05:28:11

stonecrest wrote:

Ilya wrote:
It works when hello is defined as u"ПРЕВЕД", but I assumed that the second line of the script makes all strings UTF-8 by default (without the "u" prefix)...
I believe that line simply allows you to put Unicode characters into your script; it doesn't change what type your string literals are, so it's not making all strings Unicode by default. If you took that line out of your script, you wouldn't be able to use the line hello = "ПРЕВЕД" at all.

So there is no other easy way to use Unicode strings except this?
I'm a newbie in Python... I was expecting more comfortable Unicode support.

#! /usr/bin/env python
# -*- coding: UTF-8 -*-
...
hello = u"ПРЕВЕД"

Last edited by Ilya (2007-10-16 05:41:12)

skymt · 2007-10-16 07:25:16

Read this article on Unicode in Python.

Also, Python 3.0 will combine the str and unicode types.

stonecrest · 2007-10-17 02:28:26

skymt wrote:

Also, Python 3.0 will combine the str and unicode types.

Can you expand on this? I've also had headaches dealing with Unicode in Python; it's quite painful. Is Python going to handle everything implicitly instead of forcing the dev to deal with this nonsense?

skymt · 2007-10-17 03:48:08

stonecrest wrote:

Can you expand on this? I've also had headaches dealing with Unicode in Python; it's quite painful. Is Python going to handle everything implicitly instead of forcing the dev to deal with this nonsense?

Mostly, yeah. Previously, the str type was used to handle both ASCII strings and binary data, since each is a simple array of bytes. Unicode is much more complicated, so it needed a separate data type.

In Python 3, a string is always Unicode; there is no separate type. Raw binary data is stored with the bytes type, and you can convert between the two with str.encode() and bytes.decode(). You will probably have to do some manual conversions, depending on where you got the data. See Guido's Python 3000 status update (under "Unicode, Codecs and I/O"), and PEPs 358, 3112, and 3120.
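
Editor's sketch of the Python 3 model described above, using the string from the original script. Slicing a str works on characters, and str.encode() / bytes.decode() convert to and from raw bytes; choosing UTF-8 here is an assumption that matches the poster's locale.

```python
text = "ПРЕВЕД"  # a Python 3 str is always Unicode text

# Slicing counts characters, not bytes.
tails = [text[-n:] for n in range(1, 5)]

# Convert to raw bytes and back explicitly.
raw = text.encode("utf-8")
restored = raw.decode("utf-8")

for t in tails:
    print(t)
print(type(raw).__name__, len(raw), restored == text)
```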

Ilya · 2007-10-18 20:40:49

skymt wrote:

Read this article on Unicode in Python.
Also, Python 3.0 will combine the str and unicode types.

Thanks, this article was very useful!

Arch Linux
