#1 2012-07-09 18:50:30

barthel
Member
From: Saint Charles, MO
Registered: 2011-07-27
Posts: 10

Want a value for LC_COLLATE that respects UTF8 and punctuation...

I'm trying to find a setting for LC_COLLATE that is:

  1. Case insensitive

  2. Accent/diacritical insensitive

  3. Punctuation sensitive

If I use the default based on my LANG setting, I can achieve both 1 & 2. But I can only get 3 if I use LC_COLLATE=C (which negates both 1 & 2).

Here's a silly example which illustrates the issue for me. Let's assume I have files named for people which should be sorted first by honorific, then last name:

$ touch mr.richards Mr.Rogers Mr.Stevens mrs.robinson Mrs.Robinson Mrs.Stephens Mrs.Renee Mrs.Renée Mrs.Reno
$ LC_COLLATE=C ls -x
Mr.Rogers
Mr.Stevens
Mrs.Renee
Mrs.Reno
Mrs.Renée
Mrs.Robinson
Mrs.Stephens
mr.richards
mrs.robinson
$ LC_COLLATE=en_US.UTF-8 ls -x
mr.richards
Mr.Rogers
Mrs.Renee
Mrs.Renée
Mrs.Reno
mrs.robinson
Mrs.Robinson
Mrs.Stephens
Mr.Stevens

What I want in this example is:
mr.richards
Mr.Rogers
Mr.Stevens
Mrs.Renee
Mrs.Renée
Mrs.Reno
mrs.robinson
Mrs.Robinson
Mrs.Stephens
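
As a stopgap outside of LC_COLLATE, the desired ordering can be approximated by sorting on a derived key. This is only a minimal sketch: sort_names is a hypothetical helper, the case folding is ASCII-only, and the sed accent list covers just the "é" from this example, so it would need extending for real data. It assumes a sort that supports -s (stable), as GNU sort does, and names that contain no tabs or newlines.

```shell
#!/bin/sh
# sort_names (hypothetical helper): read names on stdin, one per line, and
# print them sorted case- and accent-insensitively with punctuation kept.
sort_names() {
    while IFS= read -r name; do
        # Build the key: fold ASCII case, then strip accents.
        # The sed list is illustrative and far from complete.
        key=$(printf '%s\n' "$name" \
            | LC_ALL=C tr 'A-Z' 'a-z' \
            | LC_ALL=C sed -e 's/é/e/g' -e 's/É/e/g')
        # Decorate each line as: key <TAB> original name
        printf '%s\t%s\n' "$key" "$name"
    done \
        | LC_ALL=C sort -s -t "$(printf '\t')" -k1,1 \
        | cut -f2-    # undecorate: keep only the original name
}

# Usage: ls -1 | sort_names
```

Because the key is compared byte-wise in the C locale, "." (0x2E) sorts before "s", so every "mr." entry precedes every "mrs." entry. Names whose keys are equal (such as mrs.robinson and Mrs.Robinson) keep their input order because of the stable sort.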

This old PostgreSQL post accurately summarizes what I've discovered via Google and man pages: http://archives.postgresql.org/pgsql-sq … g00078.php

Currently, the relevant collation source is /usr/share/i18n/locales/iso14651_t1_common (which is sourced by /usr/share/i18n/locales/iso14651_t1).

Any suggestions? Is there an additional locale I could enable to give me the features I want? Or am I looking at creating a new locale based on en_US.utf8 that doesn't ignore the characters in the UNDEFINED block of /usr/share/i18n/locales/iso14651_t1_common?

For now, at least, I guess I'm going back to LC_COLLATE=C--if only because I'm accustomed to its quirks after all these years.

Thanks for looking!
Barthel

For the record, here is my current setup:

$ locale -a

C
en_US
en_US.iso88591
en_US.utf8
POSIX

/etc/rc.conf snippet

# LOCALIZATION
# ------------
HARDWARECLOCK="UTC"
TIMEZONE="America/Chicago"
KEYMAP="us"
CONSOLEFONT=
CONSOLEMAP=
LOCALE=
DAEMON_LOCALE="no"
USECOLOR="yes"

$ locale (this is also the contents of the new /etc/locale.conf)

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Any technology distinguishable from magic is insufficiently advanced.
    - Cleon, _Foundation's Fear_
