
#1 2008-06-07 16:27:18

wuischke
Member
From: Suisse Romande
Registered: 2007-01-06
Posts: 630

multi-line regexp

What at first seemed like a two-minute job turned into a three-hour Google session and a shattered ego, because I failed to find a solution.

The task looks pretty simple: remove all empty (i.e. untranslated) messages from a po file. A quick grep -v '\#:.*msgstr ""\n\n' file.po should have done the job... if only grep did not work on a per-line basis. The same applies to sed and awk. I failed to make perl do the job, and none of the hacks involving awk or sed worked the way I wanted. One sed line came really close but was too greedy: it went for the longest possible match and killed 99% of the file instead of removing each individual occurrence of the regexp.

Do you have any idea on how to do it properly? Here's an example file content: (first two are empty, the last one is translated)

#: src/libs/ec/cpp/RemoteConnect.cpp:91
msgid "Invalid password, not a MD5 hash!"
msgstr ""

#: src/libs/ec/cpp/RemoteConnect.cpp:136
msgid "Connection failure"
msgstr ""

#: src/libs/ec/cpp/RemoteConnect.cpp:194
msgid "EC Connection Failed. Empty reply."
msgstr "EC connection failed: empty reply."


#2 2008-06-07 16:39:15

Zepp
Member
From: Ontario, Canada
Registered: 2006-03-25
Posts: 334

Re: multi-line regexp


#3 2008-06-07 18:10:53

Procyon
Member
Registered: 2008-05-07
Posts: 1,819

Re: multi-line regexp

So lines that are msgstr "" need to be removed? Why did grep -v fail?

Wait I understand it more now (also the two lines before). Hold on.

Yeah grep -v -B 2 'msgstr ""' doesn't do it.

Last edited by Procyon (2008-06-07 18:58:38)


#4 2008-06-07 19:17:23

Procyon
Member
Registered: 2008-05-07
Posts: 1,819

Re: multi-line regexp

Does this do it?

sed '/^#:/ {:get_msgid;N;s/"$/&/;T get_msgid;:get_msgstr;N;s/"$/&/;T get_msgstr;s/msgstr ""/&/;T nodelete;d;:nodelete}' file.txt

It leaves some extra blank lines (N seems to misbehave when it hits EOF, or something like that), but piping the output through "cat -s" squeezes them out, except for the first one.


#5 2008-06-07 19:26:28

wuischke
Member
From: Suisse Romande
Registered: 2007-01-06
Posts: 630

Re: multi-line regexp

Another problem is that the entries do not have a fixed length. Some of these strings span multiple lines, so I cannot simply say: drop the previous 2 lines whenever you encounter an empty string.

Zepp's link looks very promising, I was close to writing myself a simple C program to do the job. (I have to improve my Python or learn some Perl...seriously...)

Edit: Procyon: Your second attempt looks very promising, thanks a lot! I just have to add a whitespace remover and my problem is gone! I'll have to understand the command later. ;)

Edit2: Unfortunately it fails to remove strings similar to the following:

#: src/amule.cpp:971
#, c-format
msgid ""
"Port %u is not available!\n"
"\n"
"This means that you will be LOWID.\n"
"\n"
"Check your network to make sure the port is open for output and input."
msgstr ""

Last edited by wuischke (2008-06-07 19:31:15)


#6 2008-06-07 19:42:46

Procyon
Member
Registered: 2008-05-07
Posts: 1,819

Re: multi-line regexp

Oh, so that's what multi-line looks like. I thought it would be
msgid "foo
bar"

Additional comment lines confuse it too. Maybe it should be paragraph-based.
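
A minimal sketch of what "paragraph-based" could mean here, assuming entries are separated by blank lines: awk's paragraph mode (RS set to the empty string) hands over each entry as one record, however many lines it spans. The function name entry_sizes is made up for illustration.

```shell
# Hypothetical helper: show how many lines each blank-line-separated entry has.
# RS="" switches awk to paragraph mode; FS="\n" makes every line one field,
# so NF is the number of lines in the current entry.
entry_sizes() {
    awk 'BEGIN { RS = ""; FS = "\n" } { print NR ": " NF " lines" }' "$@"
}
```

Called as entry_sizes file.po, it prints one line per entry, which makes it easy to see that a three-line entry and a nine-line entry are each treated as a single unit.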


#7 2008-06-07 19:56:26

peets
Member
From: Montreal
Registered: 2007-01-11
Posts: 936

Re: multi-line regexp

I gave up on sed and friends a while ago; I mostly use Perl now. Here's what I've got:

#!/usr/bin/env perl
use strict;
use warnings;

my @lines;
while(my $line = <STDIN>) {
    chomp $line;
    # if message is empty, forget about previous lines in same group
    if($line =~ /msgstr ""/) {
        @lines = ();
    } elsif($line eq '') {
        # a blank line ends the group: print it (plus a separator line),
        # then empty the buffer so the group is not printed again later
        @lines and print join("\n", @lines) . "\n\n";
        @lines = ();
    } else {
        # otherwise, add line to group buffer
        push @lines, $line;
    }
}
# EOF reached, but there might still be stuff to be printed
@lines and print join("\n", @lines) . "\n";


#8 2008-06-07 20:09:59

Procyon
Member
Registered: 2008-05-07
Posts: 1,819

Re: multi-line regexp

Ok, how about this one:

sed -ne ':get_paragraph;H;n;s/^$//;T get_paragraph;x;s/msgstr ".\+"/&/p' file.txt

It collects a paragraph and prints it only if there is something in the msgstr.
It ignores the last paragraph because of EOF, so append a blank line to the file first: echo >> file.txt (that's two >'s, not a single > like I just typed).

Last edited by Procyon (2008-06-07 20:10:55)


#9 2008-06-07 20:11:18

carlocci
Member
From: Padova - Italy
Registered: 2008-02-12
Posts: 368

Re: multi-line regexp

bash script: invoke as ./script | tac

#!/bin/bash

skipline=0
tac "file.po" |
while read -r i; do
        if [[ "$i" =~ ^msgstr\ \"\"$ ]]; then
                skipline=1
        fi
        if [[ "$i" =~ ^msgstr\ \".+\" ]]; then
                skipline=0
        fi
        if [ $skipline -eq 1 ]; then
                continue
        fi
        /bin/echo -E "$i"
done

I spent a while on awk but couldn't find a decent way to do it, even with obscure RS and FS fiddling.


edit: maybe I just had an idea for awk


#10 2008-06-07 20:20:27

carlocci
Member
From: Padova - Italy
Registered: 2008-02-12
Posts: 368

Re: multi-line regexp

carlocci wrote:

edit: maybe I just had an idea for awk

nay, I lost it


#11 2008-06-08 10:12:05

briest
Member
From: Katowice, PL
Registered: 2006-05-04
Posts: 468

Re: multi-line regexp

awk processing is record-based, not line-based. You only have to assign the proper value to the record separator...

BEGIN{RS=""; ORS="\n\n"}/msgstr ""/{next}{print}
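
For reference, here is a hedged sketch of how the program above might be wrapped for everyday use. It is a variation, not the exact one-liner: the pattern is anchored to the end of the record, on the assumption that an untranslated entry always ends with a bare msgstr "" line. That way a translation that starts with msgstr "" and continues on the following lines is kept. The function name strip_untranslated is made up, and plural forms (msgstr[0] and so on) as well as files with CRLF line endings are not handled.

```shell
# Hypothetical helper: drop entries whose last line is exactly  msgstr ""
# RS=""      -> paragraph mode, one record per blank-line-separated entry
# ORS="\n\n" -> put the blank separator line back on output
# In awk, $ anchors at the end of the whole record, not at each line end,
# so the pattern only matches when  msgstr ""  is the final line of the entry.
strip_untranslated() {
    awk 'BEGIN { RS = ""; ORS = "\n\n" }
         /\nmsgstr ""$/ { next }
         { print }' "$@"
}
```

Usage would be along the lines of strip_untranslated file.po > cleaned.po; with no file argument it reads standard input.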


#12 2008-06-08 10:23:10

wuischke
Member
From: Suisse Romande
Registered: 2007-01-06
Posts: 630

Re: multi-line regexp

briest: Argh, I'm feeling very stupid now. Thanks a lot for this information!

I'll walk through the other solutions, too, because there's a lot I can learn. >>I know regexp<< is not always enough...

