Text file - replace a pattern with part of the previous line? -SOLVED

darkbeanies · 2013-09-01 17:34:52

Hello, I'm stuck with sed/awk/grep...

So I have a file with lines like this:

Nice bunch of words <STUFF> <STUFF> <IMPORTANT_DELIMITER_TYPE_1>  <STUFF> <IMPORTANT_DELIMITER_TYPE_2> <STUFF>
Even Nicer bunch of Words <STUFF> <IMPORTANT_DELIMITER_TYPE_1> <STUFF> <STUFF> <STUFF>
Wonderful bunch of Words <STUFF> <STUFF> <STUFF><IMPORTANT_DELIMITER_TYPE_1><STUFF>

Then, I want to move the "important delimiters" to new lines (might be better not to do this in fact...)

Nice bunch of words <STUFF> <STUFF> 
<IMPORTANT_DELIMITER_TYPE_1>  <STUFF> 
<IMPORTANT_DELIMITER_TYPE_2> <STUFF>
Even Nicer bunch of Words <STUFF> 
<IMPORTANT_DELIMITER_TYPE_1> <STUFF> <STUFF> <STUFF>
Wonderful bunch of Words <STUFF> <STUFF> <STUFF>
<IMPORTANT_DELIMITER_TYPE_1><STUFF>

And finally, I want to replace the important delimiters with the content of the line they came from originally, up to the first angle bracket:

Nice bunch of words <STUFF> <STUFF> 
Nice bunch of words  <STUFF> 
Nice bunch of words <STUFF>
Even Nicer bunch of Words <STUFF> 
Even Nicer bunch of Words <STUFF> <STUFF> <STUFF>
Wonderful bunch of Words <STUFF> <STUFF> <STUFF>
Wonderful bunch of Words<STUFF>

How can I accomplish this using absolutely anything at all that doesn't involve too much manual effort? (The file is about 30,000 lines of this stuff.)

Thanks !

Last edited by darkbeanies (2013-09-01 19:18:29)

Xyne · 2013-09-01 17:49:35

Without knowing what <STUFF>, <IMPORTANT_DELIMITER_TYPE_*>, etc. are, or where the first angle bracket is (or whether it is left or right), there isn't enough information to propose a solution. Please post a concrete example.

That said, I suspect it would be trivial to load the entire file in a script (e.g. Python, Perl) and use some simple replacements to achieve what you want (unless those 30k lines are ridiculously long). I don't doubt that this can be done with awk and/or sed, but I prefer scripts for anything beyond simple one-liners. Of course, a wild awk/sed wizard may appear and regale us with the hidden beauty of his arcane knowledge.

darkbeanies · 2013-09-01 18:03:26

Okay, basically important delimiter type 1,2,etc... are KNOWN. They are called <I> <II> <III> <IV> etc. so I can just search for that...

<STUFF> is irrelevant for this purpose I think.

All lines start with some bunch of words, which are ALWAYS followed by a < angle bracket. So I can just grab ALL words, before the FIRST < angle bracket.

So, I basically want to replace all <I>, <II>, <III> etc. with a new line, AND any and all words that occur on the original line, before that first angle bracket.

Does that make sense yet?

Also, the first code block pretty much IS a concrete example, if you just replace the <important delimiters> with roman numerals. I thought that would help clarify things...

Last edited by darkbeanies (2013-09-01 18:12:30)
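
The rule described in the post above (split each line at the known <I>, <II>, <III> ... tags and start every new line with the words before the first angle bracket) can be sketched in Python. This is only a minimal sketch under assumptions not confirmed in the thread: the delimiters are taken to be tags made up solely of Roman-numeral letters, the whitespace around each piece is normalized to a single space, and the names DELIMITER and expand_line are made up for illustration.

```python
import re

# Assumption: a delimiter is a tag containing only Roman-numeral letters,
# e.g. <I>, <II>, <IV>. Any other tag is left untouched.
DELIMITER = re.compile(r'<[IVXLCDM]+>')


def expand_line(line):
    """Split one line at the delimiters and repeat the leading words."""
    # Everything before the first '<' is the text to repeat on new lines.
    prefix = line.split('<', 1)[0].strip()
    pieces = DELIMITER.split(line)
    # The first piece already starts with the prefix.
    result = [pieces[0].strip()]
    for piece in pieces[1:]:
        # Assumption: surrounding whitespace is collapsed to one space.
        result.append(f'{prefix} {piece.strip()}')
    return result


if __name__ == '__main__':
    sample = 'Even Nicer bunch of Words <STUFF> <I> <STUFF> <STUFF> <STUFF>'
    for out in expand_line(sample):
        print(out)
```

Note that a non-delimiter tag which happens to consist only of those letters would also be treated as a delimiter, so the pattern may need tightening for real data.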

Xyne · 2013-09-01 18:26:14

"<STUFF>" is not irrelevant. In your original example

Nice bunch of words <STUFF> <STUFF> <IMPORTANT_DELIMITER_TYPE_1>  <STUFF> <IMPORTANT_DELIMITER_TYPE_2> <STUFF>
Even Nicer bunch of Words <STUFF> <IMPORTANT_DELIMITER_TYPE_1> <STUFF> <STUFF> <STUFF>
Wonderful bunch of Words <STUFF> <STUFF> <STUFF><IMPORTANT_DELIMITER_TYPE_1><STUFF>

you want to break the lines along the delimiters:

Nice bunch of words <STUFF> <STUFF> 
<IMPORTANT_DELIMITER_TYPE_1>  <STUFF> 
<IMPORTANT_DELIMITER_TYPE_2> <STUFF>
Even Nicer bunch of Words <STUFF> 
<IMPORTANT_DELIMITER_TYPE_1> <STUFF> <STUFF> <STUFF>
Wonderful bunch of Words <STUFF> <STUFF> <STUFF>
<IMPORTANT_DELIMITER_TYPE_1><STUFF>

and then you say that you want to replace the delimiters with the contents before the first angle bracket:

Nice bunch of words <STUFF> <STUFF> 
Nice bunch of words  <STUFF> 
Nice bunch of words <STUFF>
Even Nicer bunch of Words <STUFF> 
Even Nicer bunch of Words <STUFF> <STUFF> <STUFF>
Wonderful bunch of Words <STUFF> <STUFF> <STUFF>
Wonderful bunch of Words<STUFF>

but you have clearly made a distinction between the "nice bunch of words" and "<STUFF>", otherwise the output would have been

Nice bunch of words <STUFF> <STUFF> 
Nice bunch of words <STUFF> <STUFF>  <STUFF> 
...

So, do you want everything up to the first delimiter, or do you want everything up to <STUFF> in the replacement? If you only want the "nicer words" then you need some way to distinguish between them and "stuff" programmatically.

edit
Here's a trivial script that will split the lines along the delimiters and replace them with the contents of the line before the first delimiter:

#!/usr/bin/env python3

import re
import sys

def main(args=None):
  for line in sys.stdin:
    # Trim trailing newline.
    line = line.rstrip('\n')
    # Split by delimiters.
    parts = re.split(r'<[^>]+>', line)
    print(parts[0])
    for p in parts[1:]:
      print(parts[0] + p)

if __name__ == '__main__':
  try:
    main()
  except (KeyboardInterrupt, BrokenPipeError):
    pass

Usage

path/to/script < /path/to/input_file

Last edited by Xyne (2013-09-01 18:34:55)

2ManyDogs · 2013-09-01 18:32:39

Do you have a basic understanding of any scripting language (bash, python, perl, etc.) or of awk or sed? Are you asking us to write a script for you, or just point you in the right direction? If you'd like pointers, it would help if we knew what tools you already understand, even if only at a basic level.

darkbeanies · 2013-09-01 18:39:19

Let's try again with a simpler example...

WORDS <CRAP> <DELIMITER> <MORE CRAP>

desired output:

WORDS <CRAP> 
WORDS <MORE CRAP>

Any good?

Xyne · 2013-09-01 18:49:44

darkbeanies wrote:

Let's try again with a simpler example...
WORDS <CRAP> <DELIMITER> <MORE CRAP>
desired output:
WORDS <CRAP> 
WORDS <MORE CRAP>
Any good?

No. The script can't magically tell the difference between WORDS and <CRAP>. What defines a sequence of bytes as "<CRAP>"?

darkbeanies · 2013-09-01 18:52:03

@Xyne, I tried your script, but it made my mouse cursor turn into crosshairs and bash said:

./script.sh: line 7: syntax error near unexpected token `('
./script.sh: line 7: `def main(args=None):'

darkbeanies · 2013-09-01 18:53:44

Xyne wrote:

What defines a sequence of bytes as "<CRAP>"?

It is not before the first < angle bracket, and it is not called <DELIMITER>

???

zorro · 2013-09-01 19:03:06

If you can replace your Roman numeral delimiters with a single-character delimiter (maybe using search/replace in an editor), your input could look like this:

Nice bunch of words <STUFF1>!<STUFF2>!<STUFF3>!<STUFF4>
Even Nicer bunch of Words <STUFF5>!<STUFF6>!<STUFF7>!<STUFF8>
Wonderful bunch of Words <STUFF9>!<STUFF10>!<STUFF11>!<STUFF12>

I have used the '!' char; you will need to choose one that doesn't exist in the input file.
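
The search/replace step could also be scripted. The following is a rough sketch, not something posted in the thread: it assumes the delimiters are tags made only of Roman-numeral letters, and the names ROMAN_TAG, pick_unused_char and mark_delimiters, as well as the list of candidate characters, are invented for illustration.

```python
import re

# Assumption: delimiters look like <I>, <II>, <IV>, ... Spaces and tabs
# around them are swallowed, but line breaks are left alone.
ROMAN_TAG = re.compile(r'[ \t]*<[IVXLCDM]+>[ \t]*')


def pick_unused_char(text, candidates='!|~@%'):
    """Return the first candidate character that does not occur in text."""
    for char in candidates:
        if char not in text:
            return char
    raise ValueError('every candidate character already occurs in the text')


def mark_delimiters(text):
    """Replace each Roman-numeral tag with a single unused character."""
    marker = pick_unused_char(text)
    return marker, ROMAN_TAG.sub(marker, text)


if __name__ == '__main__':
    marker, marked = mark_delimiters('WORDS <CRAP> <II> <MORE CRAP>\n')
    print(repr(marker), repr(marked))
```

If a character other than '!' is chosen, the '!' in the sed expression has to be changed to match.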

Running the following sed script

sed -r ':loop; s/([^<]*)<([^!]*)!/\1<\2\n\1/;t loop' < input.txt

Generates

Nice bunch of words <STUFF1>
Nice bunch of words <STUFF2>
Nice bunch of words <STUFF3>
Nice bunch of words <STUFF4>
Even Nicer bunch of Words <STUFF5>
Even Nicer bunch of Words <STUFF6>
Even Nicer bunch of Words <STUFF7>
Even Nicer bunch of Words <STUFF8>
Wonderful bunch of Words <STUFF9>
Wonderful bunch of Words <STUFF10>
Wonderful bunch of Words <STUFF11>
Wonderful bunch of Words <STUFF12>

Last edited by zorro (2013-09-01 19:03:58)

darkbeanies · 2013-09-01 19:07:54

You the man ZORRO!!!! YOU the ****ING MAAAAAAAAAAN!!!!!!!!!!!!!!!!!!!!!!!!!!

Xyne · 2013-09-01 19:13:38

darkbeanies wrote:

@Xyne, i tried your script, but it made my mouse cursor turn to crosshairs and the bash said:
./script.sh: line 7: syntax error near unexpected token `('
./script.sh: line 7: `def main(args=None):'

Did you omit "#!/usr/bin/env python3" from the top of the file?
Do you have the python package installed?
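
One possible explanation, not confirmed in the thread: bash reports "def main" on line 7, while it sits on line 6 of the script as posted, which hints at an extra line above the shebang. A shebang only takes effect when "#!" are the very first two bytes of the file; otherwise the shell interprets the file itself, and "import" then runs ImageMagick's screenshot tool of that name (if installed), which would account for the crosshair cursor. A small sketch to check this follows; the helper name read_shebang is hypothetical.

```python
def read_shebang(path):
    """Return the interpreter line if the file starts with '#!', else None."""
    with open(path, 'rb') as handle:
        first_line = handle.readline()
    # The kernel only honours a shebang at byte offset 0 of the file.
    if not first_line.startswith(b'#!'):
        return None
    return first_line[2:].decode('utf-8', errors='replace').strip()
```

Calling read_shebang('script.sh') on the saved file should return the interpreter line; a result of None would mean something (a blank line, for instance) precedes the shebang.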

darkbeanies · 2013-09-01 19:17:50

Yeah, I have python, and I didn't omit anything. Possibly python could be out of date/out of sync I suppose. Anyway, that sed one-liner works great. Thanks a lot, I've been puzzling/googling for hours now...

Arch Linux
