Text parsing tool - need suggestion

daneel971 · 2009-03-10 10:23:50

I need to do the following on a text file, tab separated.
The lines I'm interested in all begin with 4 digits; other lines must remain untouched.
The structure of a line is:

4-digit code
start
end
descriptive string

If the "end" of line X is equal to the "start" of line X+1, the "end" of line X must be replaced by a placeholder (for instance ---). I need to compare line 1 with line 2, then line 2 with line 3, then line 3 with line 4, and so on. If a line doesn't begin with 4 digits, it must be printed literally and the comparison must be made with the next line that does.
Something like:
line 1 begins with 4 digits, so compare it with the following line
line 2 begins with text: print it and skip to the following line for the comparison with line 1
line 3 begins with 4 digits: use it for the comparison with line 1 and then compare it with the following line
line 4 begins with 4 digits: use it for the comparison with the previous line and then compare it with the following line

### example input ###

Some garbage text
0123    001    003    dfghdfhg
4569    002    003    fdsygiusdg
Some other garbage text
5001    003    005    fguoiauhg
4786    005    007    fdhgsdhg
5641    007    008    kgjhlks
4455    009    010    dlkhsjsdf

### example desired output ###

Some garbage text
0123    001    003    dfghdfhg
4569    002    ---     fdsygiusdg
Some other garbage text
5001    003    ---    fguoiauhg
4786    005    ---    fdhgsdhg
5641    007    008    kgjhlks
4455    009    010    dlkhsjsdf

Can someone suggest the right tool for this and give some hints?

Xyne · 2009-03-10 11:04:31

The best tool for this is a simple script.

The following seems to do the trick:

#!/usr/bin/perl
use strict;
use warnings;

my @last_line = ();
my $stack = '';
while (defined(my $line = <STDIN>))
{
  if ($line =~ m/^(\d{4}\s+)(\d+)(\s+)(\d+)(.+\n)$/)
  {
    my (@current_line) = ($1,$2,$3,$4,$5);
    if (@last_line && $last_line[3] eq $current_line[1])
    {
      $last_line[3] =~ s/./-/g;
    }
    print join ('',@last_line), $stack;
    @last_line=@current_line;
    $stack = '';
  }
  else
  {
    $stack .= $line;
  }
}
print join ('',@last_line), $stack;

Save it as foo.pl (change "foo" to whatever you want), make it executable (chmod 700 ./foo.pl) and then invoke thus:

cat /path/to/file | ./foo.pl

Note that it requires Perl.

*edit*
I posted a working script because I thought that if you could already code, you would have written it already. As your main purpose seems to be to get a job done, I didn't see any purpose in making you jump through hoops to figure it out.

Last edited by Xyne (2009-03-10 11:06:26)

daneel971 · 2009-03-10 11:39:22

I really appreciate your help - I'm not able to write something similar... yet.
I'm just a perl beginner and your solution will be a "case study" for me.
Thanks again.

Last edited by daneel971 (2009-03-10 11:39:35)

Xyne · 2009-03-10 11:59:11

Just ask if you want me to explain anything in the script.

daneel971 · 2009-03-10 16:29:16

Well, I do have some questions...
I edited the script a bit and it works fine - I inserted some comments in it, to see if I understood it right

#!/usr/bin/perl
use strict;
use warnings;

# initialize empty array
my @last_line = ();

# initialize empty variable
my $stack = '';

# while receiving input, define $line as the current input line
while (defined(my $line = <STDIN>))
# and if it matches the regex
{
  if ($line =~ m/^(\d{4})(\t)(.{11})(\t)(.{11})(\t)(.+\n)$/) ## <-- Why do we need a "m" here?
                                                             ## Isn't "/" the default separator?
  {
# assign the regex groups to the "current line array"
    my (@current_line) = ($1,$2,$3,$4,$5,$6,$7);            ## in the 6th line there are no "()" around the
                                                            ## array declaration: are they needed here?
                                                            
# and compare the 5th element of the "previous line array" with
# the 3rd element of "current line array";
# if they match, place a "-" in place of the 5th element of the "previous line array"
    if (@last_line && $last_line[4] eq $current_line[2])
    {
      $last_line[4] =~ s/^.+$/-/;
    }
# print the content of the whole "previous line array"
# and of the $stack variable
    print join ('',@last_line), $stack;
# place the content of "current line array" into the "previous line array"
    @last_line=@current_line;
# set $stack to null
    $stack = '';
  }
  else
# that is, if $line doesn't match regexp at #15
  {
# concatenate $stack with $line
    $stack .= $line;                           ## why ".=" and not only "="? What if $stack isn't empty?
  }
}
# print the content of the whole "previous line array"
# and of the $stack variable
print join ('',@last_line), $stack;

Let's suppose that the first line is a match for the regexp: @last_line and $stack (at this point, both empty) are printed (#31), then @current_line is placed into @last_line (#33) and $stack is '' (#35); at the end of the script, @last_line (filled) and $stack (empty) are printed again (#46).
Now, line 2 is a match for the regexp too: @last_line (filled with the previous line, that was already printed) and $stack (empty) are printed (#31) - if needed, $last_line[4] has been "fixed" (#27).
I should now have duplicate lines (optionally, the second line has been corrected...).
I tested the script and it works "as it should": no duplicate lines at all. Can you clarify this black magic a bit?

Xyne · 2009-03-10 20:33:28

I've added my own comments to the code:

#!/usr/bin/perl
use strict;
use warnings;

# initialize an array to hold the captured groups of the last line that matched our regular expression
# we declare it here, outside the loop, so that it keeps its value from one iteration to the next
my @last_line = ();

# the stack will temporarily hold all the lines that don't match the regular expression
# we initialize it here too so that we can use it in the loop
my $stack = '';

# here's the main loop... it gets run once for each line of input
# the "my $line = <STDIN>" sets "$line" to the next line on STDIN
# the "defined()" test stops the loop only at end of input (undef), not on
# a line that perl would treat as false, such as a final "0" with no newline
# perl applies this test implicitly to "while (my $line = <STDIN>)", so
# writing "defined()" is optional here; it just makes the intent explicit
while (defined(my $line = <STDIN>))
{
  # first we need to decide when to print the lines
  # the tricky part is that before we print one matching line,
  # we need to check the NEXT matching line to see if its "start"
  # equals this line's "end"
  # we also need to keep the junk text in the right order,
  # which means that we can't print the last matching line, or
  # any of the junk lines between it and the next matching line,
  # until we've made that check, so we collect the junk lines in
  # a stack that we print right after the last matching line


  # here we check for lines that begin with 4 digits,
  # followed by spaces, digits, spaces, more digits,
  # then whatever's left on the line
  # we capture the first 4 digits and the following spaces in the first group
  # the "start" in the second group
  # the spaces between the "start" and the "end" in the third group
  # the "end" in the fourth group
  # and then the rest of the line in the fifth group
  
  # the "m" at the beginning is not necessary as it defaults
  # to matching, but I prefer to be explicit when sharing code
  # to make it clearer... you could skip it
  if ($line =~ m/^(\d{4}\s+)(\d+)(\s+)(\d+)(.+\n)$/)
  {
    # now we assign the matches to an array
    # remember that arrays are 0-indexed, so
    # group 1 will be at [0], etc
    # (the parentheses around @current_line are optional for an array)
    my (@current_line) = ($1,$2,$3,$4,$5);
    
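    # check if the "start" of the current line
    # (group 2, index 1) matches the "end" of the
    # last line (group 4, index 3)
    # the @last_line test skips the comparison for the first matching
    # line, when there is no previous line yet (this avoids a warning)
    if (@last_line && $last_line[3] eq $current_line[1])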
    {
      # if it does, replace each character in
      # the "end" of the last line with "-"
      $last_line[3] =~ s/./-/g;
    }
    
    # print the last line by joining the 5 groups
    # together in a string without spaces then
    # print all the junk lines that came after it
    # before the current matching line
    print join ('',@last_line), $stack;
    
    # make the current line the "last line" for
    # the comparison with the next matching line
    @last_line=@current_line;
    
    # clear the stack because we've printed those
    # lines that were in it already
    $stack = '';
  }
  # otherwise, the line didn't match our regular expression
  else
  {
    # so drop it on the stack for later
    # if we printed it now, it would be printed before
    # the last matching line
    $stack .= $line;
  }
}

# now we've run through all the input but we haven't
# printed the last matching line because we always wait
# for the next matching line to compare its "start" with
# the held line's "end"
# now there's nothing more to compare so we can print
# the last line and the stack
print join ('',@last_line), $stack;

Let's look at a modified version of your example input (with line numbers) and what happens when the script is run:

1 Some garbage text
2 0123    001    003    dfghdfhg
3 4569    002    003    fdsygiusdg
4 Some other garbage text
5 Even more garbage text
6 5001    003    005    fguoiauhg
7 4786    005    007    fdhgsdhg
8 5641    007    008    kgjhlks
9 4455    009    010    dlkhsjsdf

@last_line is initialized (empty) and so is $stack
line 1 doesn't match the regex so it gets dropped on the stack
line 2 matches the regex
line 2's "start" (001) doesn't match @last_line's "end" (which is empty) so no changes are made to @last_line's "end"
print @last_line (""), followed by the stack (line 1) and clear the stack
@last_line = line 2
line 3 matches the regex
line 3's "start" (002) doesn't match @last_line's "end" (003) so no changes are made
print @last_line (line 2) followed by the stack ("")
@last_line = line 3
line 4 doesn't match the regex so it's dropped on the stack
same for line 5 so the stack is now line 4 + line 5
line 6 matches the regex
line 6's "start" (003) matches @last_line's (line 3's) "end" (003) so replace "003" with "---"
print @last_line (line 3) followed by the stack (line 4 + line 5)
etc

Without the stack, lines 4 and 5 would have been printed before line 3 because we were holding on to it for comparison with line 6.
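
The walkthrough above can be reproduced with a short sketch. It wraps the same hold-back-and-compare logic in a sub that takes a list of lines, so the 9-line example can be fed in directly. The name mark_ends is made up for this illustration, and the extra @last_line test is an assumption added here only to avoid an uninitialized-value warning on the first matching line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# mark_ends is a hypothetical name used only for this sketch.
# It takes a list of lines and returns the processed text as one string.
sub mark_ends
{
  my @last_line = ();  # captured groups of the matching line being held back
  my $stack     = '';  # junk lines seen after the held-back line
  my $out       = '';

  foreach my $line (@_)
  {
    if ($line =~ m/^(\d{4}\s+)(\d+)(\s+)(\d+)(.+\n)$/)
    {
      my @current_line = ($1, $2, $3, $4, $5);

      # compare only once a previous matching line exists
      if (@last_line && $last_line[3] eq $current_line[1])
      {
        $last_line[3] =~ s/./-/g;
      }

      # flush the held-back line first, then the junk that followed it
      $out .= join('', @last_line) . $stack;
      @last_line = @current_line;
      $stack = '';
    }
    else
    {
      $stack .= $line;
    }
  }

  # nothing left to compare against, so flush whatever is still held
  return $out . join('', @last_line) . $stack;
}

# the 9-line example from above, tab separated
my @input = (
  "Some garbage text\n",
  "0123\t001\t003\tdfghdfhg\n",
  "4569\t002\t003\tfdsygiusdg\n",
  "Some other garbage text\n",
  "Even more garbage text\n",
  "5001\t003\t005\tfguoiauhg\n",
  "4786\t005\t007\tfdhgsdhg\n",
  "5641\t007\t008\tkgjhlks\n",
  "4455\t009\t010\tdlkhsjsdf\n",
);
print mark_ends(@input);
```

Lines 4 and 5 come out after the corrected line 3, because the held-back line is always flushed before the stack.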

(I sense another one of those "you have too much time on your hands" comments in this thread... I can't help it though, I like to help, plus it's good exercise to explain how you've programmed something and why)

Last edited by Xyne (2009-03-10 20:34:35)

Daenyth · 2009-03-10 22:20:26

Xyne wrote:

cat /path/to/file | ./foo.pl

BAD XYNE, BAD!

./foo.pl < /path/to/file

Or better yet, write the script to take a filename as $ARGV[0]
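
A minimal sketch of the $ARGV[0] variant, assuming the script should simply die if the file cannot be opened. The helper name read_lines is hypothetical, and the loop body is only a placeholder for the matching logic from the script earlier in the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# read_lines is a hypothetical helper name used only for this sketch.
# It opens the given path for reading and returns its lines.
sub read_lines
{
  my ($path) = @_;
  open(my $fh, '<', $path) or die "Cannot open $path: $!\n";
  my @lines = <$fh>;
  close($fh);
  return @lines;
}

# run only when a filename was supplied, e.g. ./foo.pl /path/to/file
if (@ARGV)
{
  foreach my $line (read_lines($ARGV[0]))
  {
    print $line;  # the matching and buffering logic would go here
  }
}
```

The three-argument form of open keeps odd characters in the filename from being interpreted as an open mode.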

Allan · 2009-03-10 22:24:08

Just because I like asking... is this homework?

Procyon · 2009-03-10 22:44:32

I tried it too with a different strategy using tac. The tabs and such may be a bit inconsistent and it may not work exactly right (I think it is very prone to bugs)

$ tac example.txt | awk 'BEGIN { OFS="\t";previous=-1 } { if ( $1 !~ "^[0-9][0-9][0-9][0-9]" ) { print} else { if ( $3 == previous ) { $3 = "---" } previous=$2; print } }' | tac
### example input ###

Some garbage text
0123    001    003    dfghdfhg
4569    002     ---     fdsygiusdg
Some other garbage text
5001    003     ---     fguoiauhg
4786    005     ---     fdhgsdhg
5641    007    008    kgjhlks
4455    009    010    dlkhsjsdf
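
For comparison, here is a rough Perl sketch of the same reverse-order idea. Walking the lines from bottom to top means the "start" of the following matching line is already known when a line is reached, so no stack is needed. The name mark_ends_reversed is made up for this illustration, and the regex is borrowed from Xyne's script; unlike the awk version it rebuilds each line from its captured groups, so the original whitespace should be left as it was.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# mark_ends_reversed is a hypothetical name used only for this sketch.
# It takes a list of lines and returns the processed text as one string.
sub mark_ends_reversed
{
  my @lines = @_;
  my $next_start;  # "start" of the nearest matching line further down

  # walk from the last line up to the first
  for (my $i = $#lines; $i >= 0; $i--)
  {
    if ($lines[$i] =~ m/^(\d{4}\s+)(\d+)(\s+)(\d+)(.+\n)$/)
    {
      my ($code, $start, $gap, $end, $rest) = ($1, $2, $3, $4, $5);

      # blank out this line's "end" if the line below starts there
      if (defined $next_start && $end eq $next_start)
      {
        $end =~ s/./-/g;
      }

      $lines[$i] = $code . $start . $gap . $end . $rest;
      $next_start = $start;
    }
  }

  return join('', @lines);
}

print mark_ends_reversed(
  "0123\t001\t003\tdfghdfhg\n",
  "Some garbage text\n",
  "4569\t003\t004\tfdsygiusdg\n",
);
```

The trade-off is that the whole input has to be held in memory before anything is printed, which is also the case for the tac pipeline.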

daneel971 · 2009-03-11 06:50:05

Allan wrote:

Just because I like asking... is this homework?

Lol.
It isn't, I swear. And thanks to all for your help

tomk · 2009-03-11 07:11:45

Moved to General Programming.

Xyne wrote:

I posted a working script because I thought that if you could already code, you would have written it already. As your main purpose seems to be to get a job done, I didn't see any purpose in making you jump through hoops to figure it out.

Noted Xyne. However, this still qualifies as spoonfeeding, which is not recommended around here. Arch users are expected to anticipate a certain amount of hoop-jumping, and even to enjoy it.

Xyne · 2009-03-11 09:51:05

tomk wrote:

Noted Xyne. However, this still qualifies as spoonfeeding, which is not recommended around here. Arch users are expected to anticipate a certain amount of hoop-jumping, and even to enjoy it.

I disagree. There's a very big difference between "I'm writing a script but I can't get this to work" and "I need something that can do this but I don't know how to script".

If you want to get a hotel room and buy a few things in town in a country the language of which you don't speak, having a friend who speaks the language give you some phrases to help you get by is nice. If you have some basic awareness of the language and would like to get a better idea of what the words you're using mean, there's nothing wrong with having the phrases broken down clearly. If your main intention is to get a hotel room and not to learn the ins and outs of the conjugations and inflections, being told to find a grammar book and figure it out yourself is unnecessary. The context just doesn't require it and the requested explanation is due to curiosity coincidental to the main objective of getting a room. It would be a completely different matter if he were sat there deciphering a text while learning that language. When unable to recognize a form of a word, rather than say "that's the plural genitive of ...", the better approach would be to say "well, do you recognize the root of the word?" and thus point him towards the answer. Doing so for the phrase would be inappropriate.

Expecting everyone that walks into your hotel to speak your native language well enough to get by on their own would be unreasonable. So is expecting everyone who uses Arch to be a scripter/programmer. With that attitude, one may as well call the Beginners Guide "spoonfeeding", but then one might have to admit to being an elitist. There's no reason to make people jump through hoops just for the sake of it.

For the record though, I think my posts on this forum show quite clearly that I endorse the "help yourself" attitude of Arch... I just don't think that applies in this case, based on the way the question was presented.

As for Allan's homework question, I thought about that too, but if that were the case I don't think he would have asked for some "tool" to do it. He would have asked about the specifics of doing it with some language. The general problem seems to be an incidental one.

Dusty · 2009-03-11 13:45:51

Xyne wrote:

Expecting everyone that walks into your hotel to speak your native language well enough to get by on their own would be unreasonable.

Not in the USA.

/me wanders off to renew his long-defunct title of official thread hijacker

juster · 2009-03-11 16:27:33

You can both read from STDIN and/or take the filename(s) as an argument if you use

while (<>) {
    my $line = $_;
    # ... same loop body as before ...
}

which is pretty handy for these quick scripts.
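
A small sketch of how <> picks its input, wrapped in a sub so that it can be tried on chosen files. The name read_argv_lines is hypothetical, and localizing @ARGV inside the sub is just one way to steer <> for the demonstration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# read_argv_lines is a hypothetical name used only for this sketch.
# <> reads the files named in @ARGV one after another, or STDIN when
# @ARGV is empty; localizing @ARGV lets the sub choose the inputs.
sub read_argv_lines
{
  local @ARGV = @_;
  my @lines;
  while (my $line = <>)
  {
    push @lines, $line;
  }
  return @lines;
}

# ./foo.pl /path/to/file   reads the named file
print read_argv_lines(@ARGV) if @ARGV;
```

The "if @ARGV" guard is only there to keep the sketch from waiting on the terminal. Without it, running the script with no arguments would read STDIN, so both ./foo.pl /path/to/file and ./foo.pl < /path/to/file would work.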

BTW, cool awk Procyon!

Xyne · 2009-03-11 19:50:24

Dusty wrote:

Xyne wrote:
Expecting everyone that walks into your hotel to speak your native language well enough to get by on their own would be unreasonable.
Not in the USA.
/me wanders off to renew his long-defunct title of official thread hijacker

The USA is also the home of fast food restaurants for pets, 15 lb cheeseburgers, commercials every 5 minutes, the Westboro Baptist Church and a whole plethora of other things that beat back the limits of what's "unreasonable", so your point is null and void.

@juster
I didn't know that you could use <> to accept both STDIN and a file path. Nice.

Last edited by Xyne (2009-03-11 20:00:17)

Arch Linux
