
#1 2007-10-15 22:22:26

Ilya
Member
From: Russia, Saint-Petersburg
Registered: 2007-08-16
Posts: 98

Python and UTF-8

I can't understand this behaviour...

My locale: ru_RU.UTF-8 (Russian)

script.py (executable)

#! /usr/bin/env python
# -*- coding: UTF-8 -*-

hello = "ПРЕВЕД"
print hello
print hello[-1:]
print hello[-2:]
print hello[-3:]
print hello[-4:]

The output should be (as I understand it):

$ ./script.py
ПРЕВЕД
Д
ЕД
ВЕД
ЕВЕД

But the actual output is:

$ ./script.py
ПРЕВЕД

Д
Д
ЕД

It works when hello is defined as u"ПРЕВЕД", but I supposed that the second line of the script makes all strings UTF-8 by default (without the "u" prefix)...

Last edited by Ilya (2007-10-16 03:14:49)


#2 2007-10-15 23:28:21

kumico
Member
Registered: 2007-09-28
Posts: 224

Re: Python and UTF-8

Add

# -*- coding: utf-16 -*-

as the first or second line of the source file.


#3 2007-10-16 03:11:09

Ilya
Member
From: Russia, Saint-Petersburg
Registered: 2007-08-16
Posts: 98

Re: Python and UTF-8

kumico wrote:

Add

# -*- coding: utf-16 -*-

as the first or second line of the source file.

It doesn't work.

$ ./script.py 
  File "./script.py", line 3
SyntaxError: UTF-16 stream does not start with BOM

Anyway, my script.py already has an encoding line that matches my locale (UTF-8). So why UTF-16?


P.S. With a BOM (Byte Order Mark) or without it, I get the same (wrong) result.
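
For what it's worth, the error message can be reproduced in miniature. This is only a sketch in Python 3 syntax, using the standard codecs module; the sample string and variable names are arbitrary. It shows that text encoded as UTF-16 by Python starts with a byte order mark, while the same text encoded as UTF-8 does not, which is presumably why a file saved as UTF-8 is rejected when it is declared as utf-16:

```python
import codecs

sample = "hello = 1"  # arbitrary sample text, not the real script

utf8_bytes = sample.encode("utf-8")    # no BOM is added by the "utf-8" codec
utf16_bytes = sample.encode("utf-16")  # the "utf-16" codec prepends a BOM

# The first two bytes of the UTF-16 data are one of the two UTF-16 BOMs,
# depending on the byte order of the machine.
has_utf16_bom = utf16_bytes[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
has_utf8_bom = utf8_bytes.startswith(codecs.BOM_UTF8)

print("UTF-16 data starts with a BOM:", has_utf16_bom)
print("UTF-8 data starts with a BOM:", has_utf8_bom)
```

So the declared encoding has to match how the file was actually saved; changing only the coding line does not convert the file.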

Last edited by Ilya (2007-10-16 03:12:50)


#4 2007-10-16 04:01:18

stonecrest
Member
From: Boulder
Registered: 2005-01-22
Posts: 1,190

Re: Python and UTF-8

Ilya wrote:

It works when hello is defined as u"ПРЕВЕД", but I supposed that the second line of the script makes all strings UTF-8 by default (without the "u" prefix)...

I believe that line only tells Python how the source file itself is encoded, which is what lets you put non-ASCII characters into your script. It doesn't change the type of a plain string literal, so it's not making all strings Unicode by default: without the u prefix you still get a byte string. If you took that line away from your script, you wouldn't even be able to use the line hello = "ПРЕВЕД" at all.
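
A minimal sketch of why the slices come out wrong, written in Python 3 syntax as an assumption (the thread uses Python 2, where a plain "..." literal holds the same UTF-8 bytes that text.encode("utf-8") produces here). The names text and raw are illustrative only:

```python
text = "ПРЕВЕД"             # six characters
raw = text.encode("utf-8")  # the bytes a Python 2 plain string literal would hold

print(len(text))  # number of characters
print(len(raw))   # number of bytes: each of these Cyrillic letters takes two

# Slicing text counts characters, so this gives the expected result.
print(text[-4:])

# Slicing bytes counts bytes, so an odd-length slice cuts a letter in half.
# errors="ignore" drops the broken half, roughly like a terminal that
# cannot display it.
byte_slices = [raw[-n:].decode("utf-8", errors="ignore") for n in (1, 2, 3, 4)]
for n, shown in zip((1, 2, 3, 4), byte_slices):
    print(n, shown)
```

The four byte slices come out as an empty string, Д, Д and ЕД, which matches the output of the original script.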


I am a gated community.


#5 2007-10-16 05:28:11

Ilya
Member
From: Russia, Saint-Petersburg
Registered: 2007-08-16
Posts: 98

Re: Python and UTF-8

stonecrest wrote:
Ilya wrote:

It works when hello is defined as u"ПРЕВЕД", but I supposed that the second line of the script makes all strings UTF-8 by default (without the "u" prefix)...

I believe that line only tells Python how the source file itself is encoded, which is what lets you put non-ASCII characters into your script. It doesn't change the type of a plain string literal, so it's not making all strings Unicode by default: without the u prefix you still get a byte string. If you took that line away from your script, you wouldn't even be able to use the line hello = "ПРЕВЕД" at all.

So there is no other easy way to use Unicode strings except this?
I'm a newbie in Python... I was expecting more comfortable Unicode support, hmm.

#! /usr/bin/env python
# -*- coding: UTF-8 -*-
...
hello = u"ПРЕВЕД"

Last edited by Ilya (2007-10-16 05:41:12)


#6 2007-10-16 07:25:16

skymt
Member
Registered: 2006-11-27
Posts: 443

Re: Python and UTF-8

Read this article on Unicode in Python.

Also, Python 3.0 will combine the str and unicode types.


#7 2007-10-17 02:28:26

stonecrest
Member
From: Boulder
Registered: 2005-01-22
Posts: 1,190

Re: Python and UTF-8

skymt wrote:

Also, Python 3.0 will combine the str and unicode types.

Can you expand on this? I've also had headaches dealing with Unicode in Python; it's quite painful. Is Python going to handle everything implicitly instead of forcing the dev to deal with this nonsense?


I am a gated community.


#8 2007-10-17 03:48:08

skymt
Member
Registered: 2006-11-27
Posts: 443

Re: Python and UTF-8

stonecrest wrote:

Can you expand on this? I've also had headaches dealing with Unicode in Python; it's quite painful. Is Python going to handle everything implicitly instead of forcing the dev to deal with this nonsense?

Mostly, yeah. Previously, the str type was used to handle both ASCII strings and binary data, since each is a simple array of bytes. Unicode is much more complicated, so it needed a separate data type.

In Python 3, a string is always Unicode; there is no separate type. Raw binary data is stored with the bytes type, and you can convert between the two with str.encode() and bytes.decode(). You will probably have to do some manual conversions, depending on where you got the data. See Guido's Python 3000 status update (under "Unicode, Codecs and I/O"), and PEPs 358, 3112, and 3120.
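
A short sketch of that conversion, assuming a Python 3 interpreter and UTF-8 as the encoding (the variable names are only for illustration):

```python
text = "ПРЕВЕД"                  # str: always Unicode in Python 3
data = text.encode("utf-8")      # str -> bytes
restored = data.decode("utf-8")  # bytes -> str

print(type(text).__name__, len(text))
print(type(data).__name__, len(data))
print(restored == text)

# The two types do not mix silently: combining them raises TypeError.
try:
    text + data
    mixed_ok = True
except TypeError as exc:
    mixed_ok = False
    print("TypeError:", exc)
```

The manual step is the encode/decode call at the boundary, for example when the data comes from a socket or a file opened in binary mode.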


#9 2007-10-18 20:40:49

Ilya
Member
From: Russia, Saint-Petersburg
Registered: 2007-08-16
Posts: 98

Re: Python and UTF-8

skymt wrote:

Read this article on Unicode in Python.

Also, Python 3.0 will combine the str and unicode types.

Thanks, this article was very useful!

