perl substitution with multiple newlines, using possibly a one-liner?

milomouse · 2011-03-03 06:19:55

OK, so Perl is not my language. I've been reading a few regular expression tutorials and I'm pretty familiar with sed, so I'm managing somewhat all right with this; I've been able to figure out all my other filters. But I still need help with removing [multiple] newlines.

I'm using Privoxy, which can filter web content using Perl-style substitution with one "job" per line (one-liners, if you will). If I can't figure this out in one line, I guess I might be able to replace each removed line with a unique word like PRIVOXY999 and then substitute PRIVOXY999 in the next filter line, etc., but I'm trying to avoid that if at all possible. I know this is related to Privoxy, but it applies directly to the Perl language, so I thought this forum section would be appropriate; if not, feel free to move it. Thanks.

So, this is the snippet of code I'm trying to remove completely.

<div  name="gmi-GZone" id="gmi-GZone" data-gmiclass="GZone" gmi-name="ad_zone"><div style="position:relative;zoom:1" class="modframe-12258633"  name="gmi-GMFrame_Gruser" id="gmi-GMFrame_Gruser" data-gmiclass="GMFrame_Gruser" gmi-gruser_id="5954501" gmi-view="generic" gmi-id="12258633" gmi-typeid="50" gmi-icon="51" gmi-position="1" gmi-version="1" gmon-behavior="{&quot;log&quot;:true,&quot;edit_view&quot;:&quot;config&quot;}">    <div id="12258633" class="gr-box gr-genericbox  gr-adcast"> 
        <i class="gr1"><i></i></i><i class="gr2"><i></i></i><i class="gr3"><i></i></i> 
                <div class="gr-top"><i class="tri"></i> 
            <div class="gr"> 
                <h2><i class="icon i51"></i> AdCast - Ads from the Community</h2> 
            </div> 
        </div> 
                        <div class="gr-body"> 
            <div class="gr"> 
                                    <div class="pp c"> 
                <a class="subbyCloseX"
                href="https://www.deviantart.com/checkout/?mx=premium&utm_source=RedX&utm_medium=Button&utm_campaign=Subscription_RedX"
                onmouseover="if (window.Subby)Subby.warning(this, 'Turn off', 'Ads')" 
                onmouseout="if (window.Subby)Subby.out(this)" 
        >[x]</a> 
                <iframe id="a7bc0ff3" name="a7bc0ff3" src="http://adsrv.deviantart.com/delivery/afr.php?n=a7bc0ff3&amp;zoneid=33" framespacing="0" frameborder="no" scrolling="no" width="300" height="250"><a href="http://adsrv.deviantart.com/delivery/ck.php?n=a885e60f" target="_blank"><img src="http://adsrv.deviantart.com/delivery/avw.php?zoneid=33&amp;n=a885e60f" border="0" alt=""></a></iframe> 
<script type="text/javascript" src="http://adsrv.deviantart.com/delivery/ag.php"></script> 
    </div> 
                </div> 
        </div> 
                <i class="gr3 gb"></i><i class="gr2 gb"></i><i class="gr1 gb gb1"></i> 
    </div>
</div></div>

I can remove the first line with no problem, but that leaves a gap in the <div..>'s that causes the next <div> to drop to the bottom of the page, instead of sitting on the right of the page like it did before I removed that first line. So I need to remove the rest of this code entirely to fix it.

Here's the line I use to remove the first line (up to first newline):

s|<div  name="gmi-GZone".*adcast.*\n|<!--privoxy--><div name="name" style="display:none">|

I tried adding another \n or .*\n but cannot match multiple newlines. I added <!--privoxy--> so I could view the source code and verify that the line was removed, and it was.
As it stands now, I can repeat the substitution twenty-something times to remove the whole block:

s|<div  name="gmi-GZone".*adcast.*\n|<!--REMOVEME-->|
s|<!--REMOVEME-->.*\n|<!--REMOVEME-->|
s|<!--REMOVEME-->.*\n|<!--REMOVEME-->|
s|<!--REMOVEME-->.*\n|<!--REMOVEME-->|
... etc
s|<!--REMOVEME-->.*\n||

I realize there must be a way to do this but am not yet familiar enough (still reading tutorials) to figure it out on my own. Any ideas?

Thanks for reading.

Last edited by milomouse (2011-03-03 06:54:35)

ber_t · 2011-03-03 08:47:58

The problem is that you can't rely on the div you want to remove always spanning the same number of lines. If you filter by counting newlines, you will have to rewrite the regex every time the site changes.
So, I don't have an exact answer to your problem, but I know for sure that it's a bad idea to filter xml/html using a regex. It's also in the Perl FAQ.
If you still want to use a regex, consider using the 's' flag, which lets '.' match newlines. There are also lots of other people with the same problem seeking help at Stack Overflow etc., e.g. this one.
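
A minimal sketch of the 's'-flag idea in plain Perl. The sample markup is a made-up, shortened stand-in for the ad block (not the real page source), and the closing pattern `</div>\n</div>\n` is an assumption about where the block ends; on a real page you would need an end marker that only occurs at the end of the ad block. Privoxy passes filters through its own substitution layer, so check the filter-file manual for which flags it accepts before relying on this there.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical, shortened stand-in for the ad block; not the real page source.
my $html = join "\n",
    '<p>before</p>',
    '<div  name="gmi-GZone" id="gmi-GZone">',
    '  <div class="gr-adcast">ad goes here</div>',
    '</div>',
    '<p>after</p>',
    '';

# Without /s, "." never matches "\n", so the pattern cannot get past the
# first line of the block and the substitution does nothing.
(my $one_line = $html) =~ s|<div  name="gmi-GZone".*?</div>\n</div>\n||;

# With /s, "." also matches "\n". The lazy ".*?" stops at the first place
# where the closing pattern fits, so all lines go in one substitution.
(my $multi_line = $html) =~ s|<div  name="gmi-GZone".*?</div>\n</div>\n||s;

print $multi_line;
```

Here `$one_line` comes out identical to `$html`, while `$multi_line` keeps only the two `<p>` lines. The weak spot is the end marker: if the same closing sequence shows up earlier inside the block, the lazy match stops too soon, which is the usual reason regex filtering of HTML breaks when the site changes.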

milomouse · 2011-03-03 09:09:58

@ber_t:
yes, unfortunately i've come to realize this as i continue experimenting. each page seems different, if only by one line. currently i've used the repeat-and-replace method until every page seems to work without error (also testing their script links, etc.), but i've had to rewrite my substitution at least 10 different times (applying certain rules to handfuls of certain pages), and i realize that sites like this will certainly be ever-changing, with some pages possibly even dynamic. at least it works for now, but it is certainly no long-term solution: i noticed that on some pages they put a div you <need> in between the ad headers and footers, with a snippet of javascript for desired functionality.

i appreciate the links and advice, really. somehow i always forget stackoverflow, but it usually proves to be the most fruitful. as mentioned, this is for Privoxy, and even though i have the ads themselves blocked, there are still large div windows blocking up the pages. even though these rules are working now (removing the divs, i mean), i may end up setting only a couple of regular expressions for the most obtrusive pages and tweaking them as they evolve, learning to live with the others for the sake of functionality.

juster · 2011-03-05 04:51:59

As of Perl 5.10 it is possible to write a truly recursive regexp. It doesn't fit on one line (although that might be possible), but it seems from the Privoxy manual that you don't have to limit a filter to one line: put a backslash (\) at the end of each line, like in bash. That is what is done in the manual here: http://www.privoxy.org/user-manual/filter-file.html

The regexp resembles a grammar definition. Using regexps for HTML still sucks -- using a real parser would be great if you could -- but apparently it sucks a lot less than it used to. Thanks for giving me a way to try the new regexps. I hope this works for you too.

Here is a gist link because I don't really like the code display on the bbs: https://gist.github.com/856124

edit: I just realized that this probably won't work if you just copy and paste it. Privoxy uses the PCRE library rather than Perl itself, so your solution would be slightly different. I still think you could use named sub-patterns like this to do what you want.

#!/usr/bin/env perl

use warnings;
use strict;

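my $adjustme = do { local $/; <DATA> }; # slurp the whole DATA section into one string

$adjustme =~ s{
    (?&AdBlock)    # match one complete ad block, nested divs included

    (?(DEFINE)     # named sub-patterns: only defined here, never matched in place
        (?<AdBlock>
            <div [ ] name="gmi-GZone" [^>]+>
            (?: .*? (?&DivBlock)? )+
            </div>)
        (?<DivBlock>    # any div; calls itself to swallow nested divs
            <div [^>]* >
            (?: .*? (?&DivBlock)? )+
            </div>))
}{<!-- Adjustered! -->}gxs;    # g: every block, x: free spacing, s: "." matches newlines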

print $adjustme;

__DATA__
  <div id=important>No wait! I have important information I swear!</div>
  <div name="gmi-GZone" id="gmi-GZone" data-gmiclass="GZone" gmi-name="ad_zone">
    <div style="position:relative;zoom:1" class="modframe-12258633" name=
    "gmi-GMFrame_Gruser" id="gmi-GMFrame_Gruser" data-gmiclass="GMFrame_Gruser"
    gmi-gruser_id="5954501" gmi-view="generic" gmi-id="12258633" gmi-typeid="50"
    gmi-icon="51" gmi-position="1" gmi-version="1" gmon-behavior=
    "{&quot;log&quot;:true,&quot;edit_view&quot;:&quot;config&quot;}">
      <div id="12258633" class="gr-box gr-genericbox gr-adcast">
        <div class="gr-top">
          <div class="gr">
            <h2>AdCast - Ads from the Community</h2>
          </div>
        </div>

        <div class="gr-body">
          <div class="gr">
            <div class="pp c">
              <a class="subbyCloseX" href=
              "https://www.deviantart.com/checkout/?mx=premium&amp;utm_source=RedX&amp;utm_medium=Button&amp;utm_campaign=Subscription_RedX"
              onmouseover="if (window.Subby)Subby.warning(this, 'Turn off', 'Ads')"
              onmouseout="if (window.Subby)Subby.out(this)">[x]</a> <iframe id="a7bc0ff3"
              name="a7bc0ff3" src=
              "http://adsrv.deviantart.com/delivery/afr.php?n=a7bc0ff3&amp;zoneid=33"
              framespacing="0" frameborder="no" scrolling="no" width="300" height=
              "250"><a href="http://adsrv.deviantart.com/delivery/ck.php?n=a885e60f"
              target="_blank"><img src=
              "http://adsrv.deviantart.com/delivery/avw.php?zoneid=33&amp;n=a885e60f"
              border="0" alt="" /></a></iframe> <script type="text/javascript" src=
              "http://adsrv.deviantart.com/delivery/ag.php">
</script>
            </div>
          </div>
        </div>
      </div>
    </div>
  </div>
  <div>You are teh sexy!</div>

Last edited by juster (2011-03-05 05:10:48)
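
For comparison, here is a rough sketch of a different technique: instead of a recursive regexp, it walks the div tags and counts nesting depth until the opening tag is balanced. It is plain Perl for trying things out outside Privoxy, not a Privoxy filter. `remove_div_block` is a hypothetical helper name, and the sample page is made up. It assumes no attribute value contains `>` and that no script or comment contains text that looks like a div tag.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# remove_div_block() is a hypothetical helper, not part of Perl or Privoxy.
# It cuts the first <div> whose opening tag matches $start_re, together with
# everything up to the </div> that balances it.
sub remove_div_block {
    my ($html, $start_re, $replacement) = @_;

    return $html unless $html =~ /$start_re/g;   # locate the opening tag
    my $start = $-[0];                           # offset where the block begins
    my $depth = 1;                               # one <div> is open already

    # Walk every later <div ...> or </div>, tracking the nesting depth.
    while ($html =~ m{<(/?)div\b[^>]*>}gi) {
        $depth += $1 ? -1 : 1;
        if ($depth == 0) {
            substr($html, $start, pos($html) - $start) = $replacement;
            return $html;
        }
    }
    return $html;   # unbalanced input: leave the page untouched
}

# Made-up sample page, loosely modelled on the snippet in this thread.
my $page = join "\n",
    '<div id="important">keep me</div>',
    '<div  name="gmi-GZone" id="gmi-GZone">',
    '  <div class="gr-top"><div class="gr">AdCast</div></div>',
    '  <div class="gr-body">ad</div>',
    '</div>',
    '<div>keep me too</div>',
    '';

my $cleaned = remove_div_block(
    $page,
    qr{<div\s+name="gmi-GZone"[^>]*>},
    '<!-- ad block removed -->',
);
print $cleaned;
```

The start pattern uses `\s+` between `div` and `name`, so it matches both the double space in the live page and the single space in the tidied sample. Because the end of the block is found by counting, the number of lines or inner divs can change without touching the pattern.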

Mr.Elendig · 2011-03-06 14:35:22

Regex is really the wrong tool for the job. You should use a lib like lxml instead for parsing/manipulating xml/html/xhtml.

milomouse · 2011-03-06 17:51:11

@juster: awesome news. i'll see what i can do with that information now. and hey, no need to thank me, haha, i appreciate the time you put into your own efforts (much better than anything i've come up with so far).

@Mr.Elendig: you're right, and i think we all agree at this point that regex is a dying star in the world of the MLs, but ATM it's all i'm able to use in conjunction with Privoxy, unless i can somehow utilize lxml (great-looking library, btw) outside of Privoxy and still have it filter through. i guess it's time for me to start reading that documentation. and onwards to python..

Last edited by milomouse (2011-03-06 17:51:46)
