
#1 2010-09-11 05:38:49

stryder
Member
Registered: 2009-02-28
Posts: 500

[solved] Help me with Perl and Regex

I manage a WordPress site and often receive articles by email. When I copy text from a formatted email into Bluefish, I usually get extra blank lines between paragraphs. Sometimes these blank lines contain spaces, sometimes they are empty. I like to clean up the text before I post the article and make sure that paragraphs are separated by a single blank line. Let me confess that I really don't know Perl or regex; what I have done came from reading, testing and adapting what I found.

Anyway, I found that a single blank line (with or without other whitespace characters) is matched by "^\s*". I can substitute that with anything, BUT I cannot (or don't know how to) substitute 2 or more blank lines to create a single blank line. So what I did was to break the process into 2 steps:

    $s =~ s-^\s*-qzq-g;
    $s =~ s-(qzq)+-\n-g;

It works, but the substitutions have to be run in two passes, since the second is intended to act on the result of the first. I adapted a Perl program I found to do this:

#! /usr/bin/perl

    #   Open input and output files

    $if = STDIN;
    $of = STDOUT;
    $ifname = "(stdin)";
    $tf = "tempfile";


    $iline = 0;
    $oline = 0;

    while ($l = <$if>) {
        $iline++;

        $l1 = &gmail1($l);
        &printFile($l1);
    }

    close($if);
    close($of);



sub gmail1 {
    local($s) = @_;
    local($i, $c);
    $s =~ s-^\s*-qzq-g;
    $s;
}

sub gmail2 {
    local($s) = @_;
    local($i, $c);
    $s =~ s-(qzq)+-\n-g;
    $s;
}

sub printFile {
    local($s) = @_;
    print($of "$s");
}

Basically, I created 2 Perl scripts, the first running gmail1 and the second gmail2, and set up a filter in Bluefish that runs |gmail1.pl|gmail2.pl| (the output of gmail1.pl is piped into gmail2.pl), as I cannot figure out how to achieve this in a single script.

It works. I can copy and paste the email text into Bluefish, run the filter and get my paragraphs cleanly separated by single blank lines.

I'm just wondering if there is a cleaner, more correct/elegant way of doing this. Perl is just what I happened to find. Bluefish works with PHP as well, AFAIK.

Last edited by stryder (2010-09-16 18:46:04)


#2 2010-09-11 10:43:21

Procyon
Member
Registered: 2008-05-07
Posts: 1,819

Re: [solved] Help me with Perl and Regex

Perl regex has a multiline m modifier

$s =~ s/\n\s*\n/\n\n/gm


#3 2010-09-11 15:22:02

stryder
Member
Registered: 2009-02-28
Posts: 500

Re: [solved] Help me with Perl and Regex

Thanks Procyon. I put your regex line into the little script above and nothing happened. Then I had an "aha" moment: if I understand the script, it processes STDIN line by line, so multiline matches cannot happen. It clicked because you used the word "multiline". It looks like my little workaround is the only way to use that script. Now I have to google around to find a script that processes STDIN as a whole and then outputs to STDOUT.
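
A minimal sketch of the point above, assuming a made-up sample string and an in-memory filehandle standing in for STDIN: the same substitution does nothing when applied to one line at a time, but works once the text is held in a single string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample: two paragraphs separated by three blank lines.
my $text = "first paragraph\n\n\n\nsecond paragraph\n";

# Line by line: each $line ends at its first "\n", so a pattern that
# needs two newlines can never match.
open my $fh, '<', \$text or die "cannot open in-memory handle: $!";
my $by_line = '';
while (my $line = <$fh>) {
    $line =~ s/\n\s*\n/\n\n/g;
    $by_line .= $line;
}
close $fh;

# Whole string: the same substitution now sees the run of newlines.
(my $whole = $text) =~ s/\n\s*\n/\n\n/g;

print "line by line changed the text: ", ($by_line eq $text ? "no" : "yes"), "\n";
print "whole string changed the text: ", ($whole   eq $text ? "no" : "yes"), "\n";
```

Note that the pattern contains no ^ or $, so the m modifier makes no difference to it; what matters is that the whole text is in one string.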

Last edited by stryder (2010-09-11 15:26:36)


#4 2010-09-14 00:13:07

spctrl
Member
Registered: 2010-06-20
Posts: 32

Re: [solved] Help me with Perl and Regex

What you can do is to just append all lines that appear in STDIN to a text variable, and then process that as a whole.

If you would rather not do that, I have modified your code a little bit. I am not sure exactly what you want, but I assumed you just wanted to get rid of consecutive blank lines, so presto:

#! /usr/bin/perl

$if = STDIN;
$of = STDOUT;
$text = "";           # a text buffer; everything we want to output is appended to this.
$blankLines = 0;      # number of consecutive blank lines we've encountered

while ($l = <$if>) {
        stripDoubleNewlines($l);
}
printFile($text);

close($if);
close($of);

sub stripDoubleNewlines {
        local($s) = @_;
        if($s =~ m/^$/) {
                $blankLines++; # we found a blank line
                if($blankLines < 2) {
                        $text .= $s; # if it was the first blank line, we still want it.
                }
        } else {
                $text .= $s;  # we found a line which was not blank, we want it.
                $blankLines = 0;  # make sure we reset the blank line counter.
        }
}

sub printFile {
        local($s) = @_;
        print($of "$s");
}


#5 2010-09-15 04:53:49

stryder
Member
Registered: 2009-02-28
Posts: 500

Re: [solved] Help me with Perl and Regex

Spctrl, thanks, I really appreciate your idea. Unfortunately the code does not work: nothing happens when I run it through Bluefish. I think there's something wrong, as changing the if condition to match something simple also does nothing. But I don't understand Perl well enough to figure out what is wrong. Your logic is clear though. The only thing I wasn't sure about was whether m/^$/ matches a blank line.

If you can show me how to append all lines in STDIN to $if, that would be great, as I can then do everything I want to it. I read somewhere that you need to use "undef $/;" but I don't quite know how to do it.
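
On the m/^$/ question, here is a small sketch comparing it with /^\s*$/ on a few made-up lines. The helper names is_empty_line and is_blank_line are hypothetical. Whether this explains the behaviour seen in Bluefish is only a guess, but it shows that m/^$/ matches a truly empty line and not one that holds spaces or a carriage return.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# True only for a completely empty line (optionally ending in "\n").
sub is_empty_line { my ($line) = @_; return $line =~ /^$/ ? 1 : 0; }

# True for a line holding nothing but whitespace (spaces, tabs, "\r").
sub is_blank_line { my ($line) = @_; return $line =~ /^\s*$/ ? 1 : 0; }

# Hypothetical sample lines, labelled for the printout.
my @samples = (
    [ 'empty',           "\n"     ],
    [ 'spaces only',     "   \n"  ],
    [ 'carriage return', "\r\n"   ],
    [ 'text',            "text\n" ],
);

for my $sample (@samples) {
    my ($label, $line) = @$sample;
    printf "%-16s ^\$ matches: %d   ^\\s*\$ matches: %d\n",
        $label, is_empty_line($line), is_blank_line($line);
}
```

If pasted email text carries trailing spaces or Windows line endings, the looser pattern is the one that would treat those lines as blank.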

However looking through your code I realize that I can make use of $text .= $s to create a new variable that holds all the text. So for now I have come up with this, which works:

#! /usr/bin/perl

    #   Open input and output files

    $if = STDIN;
    $of = STDOUT;
    $holdall = "";

    while ($l = <$if>) {
        $l1 = &marklinebreaks($l);
        $holdall .= $l1;
    }

    $holdall =~ s-(XZX)+-\n-g; # change 1 or more line breaks back to a single line break

    printFile($holdall);

    close($if);
    close($of);


sub marklinebreaks {
    local($s) = @_;
    local($i, $c);
    $s =~ s-^\s*-XZX-g;
    $s;
}

sub printFile {
    local($s) = @_;
    print($of "$s");
}

I have yet to try out multiline substitutions on this.

Last edited by stryder (2010-09-15 05:03:25)


#6 2010-09-15 23:04:15

korpenkraxar
Member
Registered: 2006-04-02
Posts: 123

Re: [solved] Help me with Perl and Regex

One simple way of doing multiline matches is to first tweak the special $/ variable (the input record separator), which defines the string that is interpreted as the line break when reading input. You can put any string into $/ and that string will be used to split input into separate chunks of text. Beware, though, that the original line break character(s) will still be present inside the text. Have a look here: http://perldoc.perl.org/perlvar.html
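
As one illustration of changing $/, here is a sketch that uses paragraph mode ($/ set to the empty string), which perlvar documents as treating two or more consecutive empty lines as a single separator. The sample text and the in-memory filehandle are assumptions for the demo; a real filter would read STDIN instead. Lines that contain only spaces do not count as empty in this mode, so they would need to be cleaned up separately.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample with runs of empty lines between paragraphs.
my $text = "one\n\n\n\ntwo\n\n\nthree\n";

open my $fh, '<', \$text or die "cannot open in-memory handle: $!";
my @paragraphs;
{
    local $/ = "";            # paragraph mode: records end at runs of empty lines
    while (my $para = <$fh>) {
        chomp $para;          # in paragraph mode chomp strips all trailing newlines
        push @paragraphs, $para;
    }
}
close $fh;

# Rejoin the paragraphs with exactly one blank line between them.
my $output = join("\n\n", @paragraphs) . "\n";
print $output;
```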


#7 2010-09-16 03:59:24

stryder
Member
Registered: 2009-02-28
Posts: 500

Re: [solved] Help me with Perl and Regex

Thanks korpenkraxar. I knew about $/ (see my last post) but didn't know how to use it. Your link was helpful and made me experiment further. So now I am down to this:

#! /usr/bin/perl

#    local $linebreak = $/;
    local $/; # enable localized slurp mode
    local $if = <STDIN>;
#    $/ = $linebreak; # redefine linebreak to original

    $if =~ s/(\n){2,}/\n/g;
    $if =~ s/ +/ /g;
    
    print($if);
    close($if);

This works, and I can run further substitutions, in this case making sure only single spaces occur. Restoring the line-break character doesn't seem important. Since everything is now one long string, anchors such as beginning of line no longer apply line by line.


#8 2010-09-17 16:44:27

juster
Forum Fellow
Registered: 2008-10-07
Posts: 195

Re: [solved] Help me with Perl and Regex

stryder wrote:
#! /usr/bin/perl

#    local $linebreak = $/;
    local $/; # enable localized slurp mode
    local $if = <STDIN>;
#    $/ = $linebreak; # redefine linebreak to original

    $if =~ s/(\n){2,}/\n/g;
    $if =~ s/ +/ /g;
    
    print($if);
    close($if);

When you use local, the variable will automatically revert to its old value, so you don't need to redefine it to the original. When the local goes out of scope, it gets set back automatically. Don't use local for variables where you don't need this, like your $if variable. Use my instead. Below, the do block keeps the local $/ inside as small a block as possible and returns the result of <STDIN>. This is often seen when "slurping" files. You might also try the File::Slurp module for an easier-to-use solution.

You are calling close on your $if variable which is not a file handle... it is a string. This is bad. You don't really need to close STDIN and STDOUT, these file handles are special and close automatically. You only need to close them in special circumstances.

You don't need the parentheses around (\n) inside your s/// substitution. The parentheses create a capture group, which is not helpful if you don't use it.

I just wanted to clear that up about local and noticed these other problems.

my $input = do { local $/; <STDIN> };
$input =~ s/\n{2,})/\n/g;
$input =~ s/ {2,}/ /g;
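
A small sketch of the scoping behaviour described in this post, using a made-up string and an in-memory filehandle in place of STDIN so that it runs on its own: $/ is undefined only inside the do block and is back to its previous value afterwards.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample input, read through an in-memory filehandle.
my $text = "line one\nline two\n";
open my $fh, '<', \$text or die "cannot open in-memory handle: $!";

# $/ is undef only inside the do block, so <$fh> returns everything at once.
my $input = do { local $/; <$fh> };
close $fh;

# Outside the block $/ is back to its previous value (normally "\n").
my $restored = (defined $/ && $/ eq "\n") ? 1 : 0;

print "read ", length($input), " characters; \$/ restored: ",
    ($restored ? "yes" : "no"), "\n";
```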


#9 2010-09-18 02:35:42

stryder
Member
Registered: 2009-02-28
Posts: 500

Re: [solved] Help me with Perl and Regex

@juster: Thanks a lot! It really helps when someone points out my mistakes, especially the finer points, because these are things I don't learn by just googling and reading. I don't know Perl at all and am really getting by on "what works", but I am certainly happy to learn. Looking at the huge mess I started with, the code is now down to 4 lines. Wonderful!

Just for the record in case someone else uses the code, Juster's first substitution should be "$input =~ s/\n{2,}/\n/g;" - he inadvertently left in a stray ")". Also you do need to print($input) at the end to get the changes back to bluefish.
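
For anyone wanting a complete filter, here is one possible sketch that pulls the thread together. It is a variant, not the exact code agreed on above: it swaps in Procyon's pattern so that paragraphs stay separated by one blank line (the goal in the first post) instead of collapsing to a single newline, and the helper name tidy_text is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collapse runs of empty or whitespace-only lines to one blank line,
# then squeeze repeated spaces. tidy_text is a hypothetical helper name.
sub tidy_text {
    my ($text) = @_;
    $text =~ s/\n\s*\n/\n\n/g;
    $text =~ s/ {2,}/ /g;
    return $text;
}

# Slurp all of STDIN, clean it, and write the result to STDOUT.
my $input = do { local $/; <STDIN> };
print tidy_text($input) if defined $input;
```

Saved as a script and made executable, it can be used as a single filter in place of the two piped scripts from the first post.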
