[SOLVED] sed search result identical to searching for its opposite

porphyry5 · 2014-11-30 22:16:32

In sed, I want to detect lines that are NOT terminated by one of the sentence terminators '.?!', optionally followed by a quote mark, and to join each such line with the line that follows it.

Here is a file of data, y

man, why not Great-grandfather?"
	I've never

And here are the results of two opposite sed commands run on that data. Why are the results identical?

~ $ sed '/[^.?!]["]*$/{N;s/\n\t/ /}' y
man, why not Great-grandfather?" I've never
~ $ sed '/[.?!]["]*$/{N;s/\n\t/ /}' y
man, why not Great-grandfather?" I've never
~ $

I use a similar regex, [^.?!], in a substitute command, where it performs as expected. Why should it behave differently in a search?

Last edited by porphyry5 (2014-11-30 23:56:25)

Trilby · 2014-11-30 23:05:08

...

EDIT: okay, I've edited this a few times. I had it working, but I couldn't figure out why. Now I think I have it:

The positive pattern matches correctly, but in the negated case the quote is still optional, and the bracket expression matches *any* character other than the punctuation, which includes the quote. So the line matches in both cases. Here's what you probably want:

$ sed '/[.?!]"*$/{N;s/\n\t/ /}' y
man, why not Great-grandfather?" I've never 

$ sed '/[^.?!"]$/{N;s/\n\t/ /}' y
man, why not Great-grandfather?"
    I've never
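To see why both of the original commands joined the lines, it can help to print the text each address actually matches. This is a minimal sketch, assuming a `grep` that supports `-o` (GNU and BSD grep do); the variable names are only for illustration:

```shell
# First line of the sample file y.
line='man, why not Great-grandfather?"'

# -o prints only the part of the line that the pattern matched.
neg=$(printf '%s\n' "$line" | grep -o '[^.?!]["]*$')
pos=$(printf '%s\n' "$line" | grep -o '[.?!]["]*$')

printf 'negated class matched:    %s\n' "$neg"
printf 'terminator class matched: %s\n' "$pos"
```

The negated class is satisfied by the closing quote alone, so the first pattern matches without ever looking at the `?`.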

Actually, the negative case should be more like this (so that it also matches lines ending in a quote that is not preceded by that punctuation):

sed '/\([^.?!"]\|[^.?!]"\)$/{N;s/\n\t/ /}' y

but that is ~~untested~~. I just tested it: it works, but it seems to hit only the first match in a file.
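The thread does not say why only the first match was hit. One situation where a single `N` stops short is a run of several unterminated lines in a row: the joined line is printed without being tested again. The following is a sketch of a looping variant, not something posted in the thread, assuming GNU sed (for `\|` and `\t`) and a made-up four-line sample:

```shell
# Hypothetical sample: an unterminated first line followed by tab-indented lines.
input=$(printf 'one,\n\ttwo\n\tthree.\n\tfour')

# Single N: joins one following line per match, then prints without re-testing.
once=$(printf '%s\n' "$input" | sed '/\([^.?!"]\|[^.?!]"\)$/{N;s/\n\t/ /}')

# Loop: after a successful join, branch back to :a and test the joined line again.
# $! skips N on the last line, where there is nothing left to append.
looped=$(printf '%s\n' "$input" | sed '
:a
/\([^.?!"]\|[^.?!]"\)$/{
  $!{
    N
    s/\n\t/ /
    ta
  }
}')

printf '%s\n---\n%s\n' "$once" "$looped"
```

With the single `N`, `three.` stays on its own line; with the loop it is pulled up onto the first line, and joining stops once the line ends in a terminator.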

Last edited by Trilby (2014-11-30 23:25:29)

porphyry5 · 2014-11-30 23:52:47

Trilby wrote:

$ sed '/[.?!]"*..$/{N;s/\n\t/ /}' y
man, why not Great-grandfather?" I've never
$ sed '/[^.?!]"*../{N;s/\n\t/ /}' y
man, why not Great-grandfather?"
	I've never
EDIT: okay, I've edited this a few times. I had it working, but I couldn't figure out why. Now I have: the line matching apparently needs ~~the newline and tab~~ two unknown characters before the $.
D'Oh, I think I just figured it out: It matches correctly, but in the non-matching case the quote is still optional and the expression matches *any* character other than the punctuation, which includes the quote. So the line matches both cases. Here's what you probably want:
$ sed '/[.?!]"*$/{N;s/\n\t/ /}' y
man, why not Great-grandfather?" I've never 

$ sed '/[^.?!"]$/{N;s/\n\t/ /}' y
man, why not Great-grandfather?"
    I've never

OK, I just tried it thus

sed '/[^.?!]["]*.$/{N;s/\n\t/ /}' y

which works for me, but I don't understand what is going on. (I'm keeping the ["] because the real version, inside a script, uses ["'], but I can't get a ' accepted in this command-line test.) The ? is there and the " is there, so what is the . immediately before the $ matching?
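One way to probe what the trailing `.` does is to list which line endings the address selects. In the sample line the `.` has to consume the closing quote, which leaves `[^.?!]` facing the `?`, so the match fails. A minimal sketch with four made-up endings, assuming any sed with `-n` and `p`:

```shell
# Four hypothetical endings: plain, quote only, terminator only, terminator plus quote.
endings=$(printf 'word\nword"\nword?\nword?"')

# -n with p prints only the lines the address selects (the ones that would be joined).
selected=$(printf '%s\n' "$endings" | sed -n '/[^.?!]["]*.$/p')

printf '%s\n' "$selected"
```

Here `word?` is selected even though it is terminated, because the `.` swallows the `?` itself. That may be one of the cases where this version does not behave as intended.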

This works too

sed '/[^.?!"]\+$/{N;s/\n\t/ /}' y

, which I think I will go with, as it conforms to the actual data patterns it's likely to find. The form without the \+ is just matching the " and not the ?. I prefer to have it match both if they are there, and the sentence terminator must always be there. I haven't used sed very much before; it certainly seems to have its own slant on things. Thank you very much for your help.
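As a line test, the `\+` form appears to select the same lines as the form without it, since the pattern is anchored at `$` and only the last character decides the outcome. A quick check on four made-up endings, assuming GNU sed for `\+`:

```shell
# Four hypothetical endings: plain, quote only, terminator only, terminator plus quote.
endings=$(printf 'word\nword"\nword?\nword?"')

# Print only the lines each address selects.
with_plus=$(printf '%s\n' "$endings" | sed -n '/[^.?!"]\+$/p')
without=$(printf '%s\n' "$endings" | sed -n '/[^.?!"]$/p')

printf 'with \\+:    %s\nwithout \\+: %s\n' "$with_plus" "$without"
```

Both addresses select only `word`. Note that a line ending in a bare quote (`word"`) is not selected by either form.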

Trilby · 2014-12-01 00:00:52

Ha, you caught my post in progress. Please see the edit with a much more robust and sane version. I made files with all four possible endings on the first line: question mark or not, and quote or not, and the updated version in my post worked as you claimed you desired. The preliminary version you've quoted did not seem to work right in all cases.

porphyry5 · 2014-12-02 17:44:38

Trilby wrote:

Ha, you caught my post in progress. Please see the edit with a much more robust and sane version. I made files with all four possible endings on the first line: question mark or not, and quote or not, and the updated version in my post worked as you claimed you desired. The preliminary version you've quoted did not seem to work right in all cases.

I have actually now moved to this

sed '/[a-z][^.?!a-z]\+$/{N;s/\n\t/ /}' y

which states (I hope): after the last lower case letter on the line, if a sentence terminator (.?!) is not found, join this line with the next one.

I think I should have stated the purpose of this script earlier. OCR-ed texts available for download from sites such as books.google.com often contain so many errors that reading them is a trying experience. I am exploring how far one can get with fully automated 'enhancement' of such texts, and how much of a chore the remaining semi-automated correction is then likely to be. I rather mindlessly started out thinking of 'how a sentence should properly be terminated', rather than 'what is the minimum information that indicates the line ended with a properly terminated sentence, regardless of any garbage characters that may be there too'. My apologies for any confusion this may have caused.
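One detail of the final pattern is worth checking: `\+` requires at least one non-letter character after the last lower case letter, so a line that ends directly in a letter is not selected. If such lines should be joined too (an assumption about the intent, not something confirmed in the thread), `*` may be closer. A sketch with made-up lines, assuming GNU sed for `\+`:

```shell
# Hypothetical lines: ends in a letter, ends in a comma, terminated with a quote,
# terminated with a garbage character after the terminator.
lines=$(printf 'I have never\nhe said,\nGreat-grandfather?"\nthe end.~')

# Print only the lines each address selects (the ones that would be joined).
plus=$(printf '%s\n' "$lines" | sed -n '/[a-z][^.?!a-z]\+$/p')
star=$(printf '%s\n' "$lines" | sed -n '/[a-z][^.?!a-z]*$/p')

printf 'with \\+:\n%s\nwith *:\n%s\n' "$plus" "$star"
```

With `\+` only `he said,` is selected; with `*` both `I have never` and `he said,` are. In both forms the two terminated lines are left alone, including the one with trailing garbage.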

Arch Linux
