sed and perl not replacing a letter in a file

sant527 · 2013-10-19 06:22:00

I have a file, 1.htm, in which I want to replace the letter ṣ (s with dot below) with a plain s. I tried both sed and perl, and neither replaces it.

sed -i 's/ṣ/s/g' "1.htm"
perl -i -pe 's/ṣ/s/g' "1.htm"

Can anyone suggest what to do?

1.html

WorMzy · 2013-10-19 06:33:47

There doesn't appear to be any in that file. Are you sure it didn't work?

jasonwryan · 2013-10-19 06:37:29

Line 81.

# edit: just do it by hand...

Or, use the devanagari locale: that may be the only way to have it picked up...

WorMzy · 2013-10-19 10:04:04

It'd be easier to do it by hand. I managed to replace it, but only after treating it as two 'characters': an 's' followed by a bash escape sequence (or something) for applying that modification to the previous character. Unfortunately I didn't have time to investigate further, but it might help put you on the right track.

sant527 · 2013-10-19 10:55:50

The command below worked. I found it on Stack Overflow.

perl -C -i -pe 's/s\x{0323}/s/g' "1.htm"

Can anyone tell me what my file's encoding is? Also, how do I find out the corresponding number, 0323?

It would also help if someone could tell me how to do this with sed. I have 170 files with many other characters to replace. sed has the useful y command, and I want to use it something like this:

find . -name "*.html" -exec sed -i.bak 'y/ŚśḌḍḤḥḶḷṀṁṄṅṆṇṚṛṜṝṢṣṬṭl̐/SsDdHhLlMmNnNnRrRrSsTtl/' {} \;

That should then replace all of the respective letters in all the files.

sant527 · 2013-10-19 12:49:22

Also, I found that the command

sed -i 's/ṣ/s/g' "filename"

strangely does the replacement in one file but not in the other. Here are links to the two files:

replaceble.html
unreplaceble.html

Trent · 2013-10-19 13:20:46

Perl also has that operator, but it usually goes by its alias tr (for transliteration). y is provided if you want to use that instead, but I don't think it will help; keep reading to find out why.

\x{0323} is a special Perl-specific syntax for Unicode character U+0323 (COMBINING DOT BELOW). Since 323 is its Unicode code point, not a byte value, that won't tell you what encoding you're using. (In UTF-8, that character becomes the 2-byte sequence cca3, which is what you'll see if you examine the file with a dumb viewer like xxd.) Fortunately, since it comes through correctly, you know Perl is properly set up and isn't trying to use some weird encoding to read your file, or interpreting it as just bytes. This makes your job in principle much easier.

However, you'll notice that Perl doesn't recognize ṣ as a single character. That's because the way your file is encoded, it isn't one character: it's an s followed by the Unicode character U+0323. This is a valid way to represent the glyph ṣ using Unicode, but it doesn't treat that glyph as a single character (in, for instance, tr/foo/bar/). Since tr can only do one-to-one character transliteration, you can't use it to transform ṣ (2 characters) into s (one character) while doing other similar transformations in one pass.
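
To see the two representations side by side outside of Perl, here is a minimal Python sketch using the standard unicodedata module. Python isn't used anywhere else in this thread, so treat it purely as an illustration; the variable names are made up for the example.

```python
import unicodedata

precomposed = "\u1e63"   # U+1E63 LATIN SMALL LETTER S WITH DOT BELOW
decomposed = "s\u0323"   # 's' followed by U+0323 COMBINING DOT BELOW

# Both render as the same glyph, but they are different code point sequences,
# so a pattern written with one form will not match text stored in the other.
print(precomposed == decomposed)          # False
print(len(precomposed), len(decomposed))  # 1 2

# Unicode normalization converts between the two forms.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```

This is also why a one-to-one transliteration (tr or y) cannot handle the decomposed form: it sees two characters where the screen shows one.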

One thing you might do is to simply strip out all the U+0323 characters:

s/\x{0323}//g
# or as I would write it,
s/\N{COMBINING DOT BELOW}//g

which would leave you with just the ASCII remaining for those glyphs that use it. This works for ṛ, ṣ, and ṇ (for instance), but it will not work on ā, and you can't use a similar tactic there because ā in your file is not made by combining the ASCII 'a' with U+0304 COMBINING MACRON, but is in fact a single Unicode character, U+0101 (LATIN SMALL LETTER A WITH MACRON).
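
One way around the precomposed/decomposed split, sketched here in Python rather than Perl, is to decompose everything first (normalization form NFD) and only then drop the combining marks. The helper name strip_marks is hypothetical, and this assumes you really do want every combining mark removed, not just the dot below.

```python
import unicodedata

def strip_marks(text):
    """Decompose to NFD, then drop every nonspacing combining mark (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_marks("s\u0323"))  # decomposed s with dot below  -> s
print(strip_marks("\u1e63"))   # precomposed s with dot below -> s
print(strip_marks("\u0101"))   # precomposed a with macron    -> a
```

Because the decomposition happens first, it no longer matters which form the file happens to use for a given glyph.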

To make a long story even longer, there is a Unicode character for ṣ (U+1E63 LATIN SMALL LETTER S WITH DOT BELOW), which does not appear in your 1.htm file, but is the character you will get if you copy and paste from this forum (at least, it works that way for me). So if you typed or pasted "s/ṣ/s/g" into your perl or sed command, whether or not it would have the desired effect depends on which form of the ṣ glyph your input method gives you.

If this is a one-off, quick script, I'd write something like the following: (incomplete!)

#!/usr/bin/perl -p -i.bak
tr/ā/a/;  # here add all multibyte characters that you want to transliterate
s/
    \N{COMBINING DOT BELOW} |   # this should take care of ḌḍḤḥḶḷṆṇṚṛṢṣṬṭ
    \N{COMBINING CANDRABINDU}   # this should take care of l̐
                                # add other combining characters that you want to get rid of here
//gx;

I added ā to the tr/// as an example; if you don't actually want to replace ā in the file, you'll want to remove that part. Without seeing the other files or knowing how they were created (different encoders produce the same glyphs in different ways), I couldn't say with certainty which of the glyphs you want to replace are represented by a single Unicode code point and which ones are represented by a letter + a combining diacritic or other mark, so you'll have to examine your files (or the encoder that generated them) to see which ones you have to deal with in which way.

You can use this tool to look up a UTF-8 byte sequence like "cc a3" to get its Unicode code point and name. Again, you'll have to examine the file with something like xxd to see the bare bytes, since any editor will probably present them as characters.
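
If a lookup tool isn't handy, a few lines of Python can print the same information: the code point, its UTF-8 bytes, and its Unicode name. This is only a sketch; describe is a made-up helper name, and bytes.hex with a separator needs Python 3.8 or newer.

```python
import unicodedata

def describe(text):
    """Return one line per code point: U+XXXX, UTF-8 bytes in hex, Unicode name."""
    lines = []
    for ch in text:
        utf8 = ch.encode("utf-8").hex(" ")
        name = unicodedata.name(ch, "<unnamed>")
        lines.append(f"U+{ord(ch):04X}  {utf8:<11}  {name}")
    return lines

# Decomposed form: two code points, the second encoded as bytes cc a3.
for line in describe("s\u0323"):
    print(line)

# Precomposed form: a single code point, encoded as bytes e1 b9 a3.
for line in describe("\u1e63"):
    print(line)
```

Pasting a suspicious character from a file into describe() shows at once whether it is one code point or a letter plus a combining mark.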

You may also wish to investigate the Text::Unidecode module on CPAN; I can't say for sure whether it will do what you want, but maybe it's close enough.

sant527 · 2013-10-19 13:36:41

Thank you for the reply.

I found that

ṣ (U+1E63 LATIN SMALL LETTER S WITH DOT BELOW) and s followed by the Unicode character U+0323 are different.

Is there any text editor that shows the two distinctly?

Yes you are right. I have posted links to two files replaceble.html and unreplaceble.htm

When I open replaceble.htm in Firefox, ṣ is shown as &#7779
When I open unreplaceble.htm in Firefox, ṣ is shown as s&#803

That's why sed replaces it in replaceble.htm and not in unreplaceble.htm.

But can you tell me of any tool that shows this difference in text files? (In plain text both look the same.)

sant527 · 2013-10-19 13:49:08

Thank you for that tool. It shows the difference distinctly.

Is it possible to convert every s + U+0323 to U+1E63? I am only removing the dot because my Nokia mobile browser renders the dot below out of place.

Also, can we change all other characters with a dot below to single characters?

Trent · 2013-10-19 13:56:20

Yeah, that's possible (assuming Unicode actually has the code points to map them to). I don't know of an automated tool that does that, even though it probably already exists. In your case I'd consider just making a script that individually maps each character followed by U+0323 to its own code point. I wrote the following script that works for S:

#!/usr/bin/perl -i.bak
use warnings;
use strict;
my %dotchars = (
    's' => "\N{LATIN SMALL LETTER S WITH DOT BELOW}",
    'S' => "\N{LATIN CAPITAL LETTER S WITH DOT BELOW}",
    # (etc.)
);
while (<>) {
    s/(.)\xcc\xa3/$dotchars{$1}?$dotchars{$1}:$1/age;
    print;
}

The idea is to replace (any character plus a combining dot) with the corresponding string in the %dotchars hash, or with just the character (minus the dot) if it can't be found in %dotchars. You can make it work for all the dotted characters by expanding %dotchars with more mappings.

It's not a pretty solution, but good enough for a one-off. Maybe someone else knows of an automated tool that can do this kind of mapping in a more intelligent way.
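
For what it's worth, Unicode normalization form NFC does this kind of composition in general. Here is a minimal Python sketch (not the Perl approach above, just an illustration with the standard unicodedata module; compose is a made-up name) that merges a base letter and a following combining mark wherever Unicode defines a single code point for the pair, and leaves the pair alone otherwise.

```python
import unicodedata

def compose(text):
    """Merge base letters with following combining marks where a precomposed code point exists."""
    return unicodedata.normalize("NFC", text)

# s, S and r each have a precomposed "with dot below" code point; q does not.
sample = "s\u0323 S\u0323 r\u0323 q\u0323"
composed = compose(sample)
print([f"U+{ord(c):04X}" for c in composed])
```

In the printed list the first three letters come out as single code points (U+1E63, U+1E62, U+1E5B), while q is still followed by a separate U+0323.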

Last edited by Trent (2013-10-19 14:34:10)

sant527 · 2013-10-19 15:09:09

Thank you.

I found a command that merges all such occurrences

at this link: http://stackoverflow.com/questions/1946 … -in-a-file

Another possibility is to normalize all the combining characters using Unicode::Normalize, e.g.:

perl -C -MUnicode::Normalize -Mutf8 -i -pe '$_=NFKC($_); s/ṣ/s/g' "1.htm"

But this would also normalize all the other characters in the input file, which may or may not be OK for you.

I didn't understand the command very well. Do you understand it? Is it safe to run?

Last edited by sant527 (2013-10-19 15:10:32)

Trent · 2013-10-19 16:34:54

Yeah, it's safe, but I'm not 100% sure it will do exactly what you want. I'd try the following, if I've understood correctly what you want:

perl -C -MUnicode::Normalize -i.bak -pe '$_=NFKC($_)' "1.htm"

If translating "s\x{0323}" to "ṣ" is sufficient to make it render correctly for you, you don't want to further replace ṣ, so I removed that part. I also removed -Mutf8 because that isn't necessary when the program source code doesn't include Unicode. Finally, I changed -i to -i.bak so that the original file won't be lost (it gets saved as 1.htm.bak); you can leave it as -i if you have a backup copy.

This is a good place to start; even if it doesn't do exactly what you want, you may then be able to successfully run the find/sed command you mentioned in post #5, since it should (if I understand Unicode::Normalize right) remove all the combining diacritics.
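
For comparison, here is roughly the same "normalize in place, keep a backup" idea as a small Python sketch. The helper name normalize_file and the directory name in the usage comment are hypothetical. It defaults to NFC, which only composes canonical pairs such as s + U+0323; passing "NFKC" additionally rewrites compatibility characters (a ligature like U+FB01 becomes the two letters "fi"), which is the "normalizes all the other characters" caveat from the quoted answer.

```python
import shutil
import unicodedata
from pathlib import Path

def normalize_file(path, form="NFC"):
    """Rewrite a UTF-8 file in the given normalization form, keeping a .bak copy."""
    path = Path(path)
    backup = path.with_name(path.name + ".bak")
    shutil.copy2(path, backup)  # same idea as perl's -i.bak
    # Work on bytes so that line endings are left exactly as they were.
    text = path.read_bytes().decode("utf-8")
    path.write_bytes(unicodedata.normalize(form, text).encode("utf-8"))
    return backup

# Hypothetical usage over a directory tree of HTML files:
# for html in Path("pages").rglob("*.html"):
#     normalize_file(html)
```

This assumes every file really is UTF-8; a file in another encoding would raise UnicodeDecodeError before anything is overwritten, although the .bak copy will already have been made.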
