How to filter characters and/or words & phrases from a text file.

WeeDram · 2019-02-02 01:33:30

I would like some help deciding what program / scripting to start learning in order to do some filtering of text files. I have several text exports of data from a database that contains English words, kanji, hiragana, katakana and a few ‘junk’ characters.
I want to be able to do any one of the following depending on my specific need:
simply remove sets of strings of characters (such as “<div>” or “&nbsp” or “<br>” and several others)
OR replace such strings with a space
AND/OR remove ALL (or some of) the hiragana and/or the katakana and/or the English
AND perhaps count the occurrences of certain characters

I sort of envision somehow using three files: the original text file, a file with a list of stuff to remove (or count) and an output file. Maybe that way I can have several different files listing what needs to be removed or replaced.
I started searching on how to do this and quickly realized there seem to be various ways to do some of these, but I’d like to limit my search & learning to something that will actually be able to do the things I need.

Trilby · 2019-02-02 01:35:29

http://www.grymoire.com/Unix/Sed.html

WeeDram · 2019-02-02 02:04:32

Wonderful! I tried using Sed and it looked as though it might do what I wanted but seemed a bit cryptic (to me, anyways). Didn't want to spend time with it only to find out that I couldn't accomplish what I wanted with it.
Thank you.

Trilby · 2019-02-02 02:28:49

Cryptic is fairly accurate until you spend some time with it. But I love it for the efficiency (both in running, and in expression of the language).

ondoho · 2019-02-02 11:33:16

WeeDram wrote:

simply remove sets of strings of characters (such as “<div>” or “&nbsp” or “<br>” and several others)
OR replace such strings with a space

looks like sed is the right tool for this.
except - if the file is some sort of XML file you might want to use something like xmlstarlet or xmllint to easily extract actual content.
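For the plain remove/replace case quoted above, a minimal sketch could look like the following. The sample line and the temporary file name are made up for illustration; the real export will differ, and the list of strings would need extending.

```shell
# Hypothetical sample input standing in for one line of the export.
tmp=$(mktemp -d)
printf 'foo<div>bar&nbsp;baz<br>qux\n' > "$tmp/sample.txt"

# Delete <div>; replace &nbsp (with or without the trailing semicolon)
# and <br> with a single space. Each -e adds one substitution.
cleaned=$(sed -e 's/<div>//g' \
              -e 's/&nbsp;\{0,1\}/ /g' \
              -e 's/<br>/ /g' "$tmp/sample.txt")

printf '%s\n' "$cleaned"   # prints: foobar baz qux
```

Swapping the empty replacement for a space (or the other way round) is the only difference between "remove" and "replace with a space".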

WeeDram wrote:

I sort of envision somehow using three files: the original text file, a file with a list of stuff to remove (or count) and an output file.

whatever solution you come up with, you probably need to wrap it in a shell script that does this.

Trilby · 2019-02-02 13:42:22

ondoho wrote:

.. if the file is some sort of XML file you might want to use something like xmlstarlet or xmllint to easily extract actual content.

I second this. If you want to remove/change those html tags, sed will do it well. But if you need the content (what's in between <div> and </div> for example) sed would only accomplish this under a very narrow set of assumptions, and even then it may not be particularly convenient.

ondoho wrote:

whatever solution you come up with, you probably need to wrap it in a shell script that does this.

Maybe. Sed is a Turing-complete scripting language in and of itself, though. So definitely do not use a shell script to wrap a bunch of separate sed invocations - just write the sed script properly in the first place.
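As a rough sketch of what that could look like for the three-file idea (original file, list of things to remove, output file): the list can itself be a sed script passed with -f. The file names input.txt, rules.sed and output.txt are placeholders, and the sample line is invented.

```shell
tmp=$(mktemp -d)

# Hypothetical one-line input with a tab between two columns.
printf 'a<div>b</div>\tc&nbsp;d<br>e\n' > "$tmp/input.txt"

# rules.sed: one sed command per line; lines starting with # are comments.
cat > "$tmp/rules.sed" <<'EOF'
# delete opening and closing div tags
s|</\{0,1\}div>||g
# turn these into a single space
s/&nbsp;\{0,1\}/ /g
s/<br>/ /g
EOF

# One sed invocation applies every rule in the file.
sed -f "$tmp/rules.sed" "$tmp/input.txt" > "$tmp/output.txt"
cat "$tmp/output.txt"
```

Keeping several such rule files around (one per cleanup job) gives the "different lists of what to remove" without any wrapper script.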

The one goal listed that sed would not directly achieve is counting the occurrences of a specific character. But sed can easily eliminate everything but that character (including newlines) to produce a new file. The count of that character is then just the file size in bytes minus one, the one being the trailing newline that sed prints (for a non-ASCII/multi-byte character, subtract one from the file size and then divide by the bytes per character).

For example, to get the number of capital X's in a file:

sed -n 'H;${x;s/[^X]*//gp;}' filename > tmpfile
ls -l tmpfile

# Or perhaps just the following (still minus one)
sed -n 'H;${x;s/[^X]*//gp;}' filename | wc -m
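A worked version of the pipeline above with the "minus one" done by the shell, as a sketch only; the sample file and the variable names chars and count are invented for the example.

```shell
tmp=$(mktemp -d)

# Hypothetical two-line sample containing three capital X's.
printf 'aXbX\nXc\n' > "$tmp/sample.txt"

# sed collects all lines in the hold space, strips everything that is
# not an X, and prints the result followed by one newline.
chars=$(sed -n 'H;${x;s/[^X]*//gp;}' "$tmp/sample.txt" | wc -m)

# Subtract the trailing newline to get the actual count.
count=$((chars - 1))
echo "$count"   # prints: 3
```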

Last edited by Trilby (2019-02-02 16:08:20)

ewaller · 2019-02-02 15:24:54

Trilby wrote:

Sed is a turing complete scripting language in and of itself, though.

Is it? I need to go back and look at it again.
I would have reached for awk in this case and am a bit surprised it has not been mentioned.

Well, I'll be: https://catonmat.net/proof-that-sed-is-turing-complete
On the other hand, a Turing engine is Turing complete; that does not mean it is the best tool. I would probably still go with awk.

Note: Trilby, I fsked up your post. I hit edit, not respond. See if I put it back correctly.

Edit:

ewaller@turing/home/ewaller % sed -f Downloads/turing.sed increment.tm 
(0) | |10010111
(0) |1|0010111
(1) 1|0|010111
(1) 10|0|10111
(1) 100|1|0111
(1) 1001|0|111
(1) 10010|1|11
(1) 100101|1|1
(1) 1001011|1|
(1) 10010111| |
(2) 1001011|1|
(2) 100101|1|0
(2) 10010|1|00
(2) 1001|0|000
(3) 10011|0|00
(3) 100110|0|0
(3) 1001100|0|
(3) 10011000| |
(F) 10011000 | |
Final state F reached... end of processing
ewaller@turing/home/ewaller %

Well, there you go.

Last edited by ewaller (2019-02-02 15:34:57)

Trilby · 2019-02-02 15:49:43

awk is column-focused. There's nothing in the OP's criteria that suggests there are columns or reliable column delimiters. In such cases awk can of course do the job, but it is particularly clunky as you have to use functions like sub and gsub (and those must be used within a brace-enclosed action preceded by a pattern) all to do what sed does quite naturally.

For example, the best way to count 'X's in awk that I could come up with is this:

awk '// { gsub(/[^X]/,"",$0); str=str$0; } END { print length(str); }'  filename

This has the advantage of directly outputting the number and not counting the newline, but other than that it feels (admittedly subjectively) quite clunky to me. I like that the article above likens sed to an assembly language - perhaps that's why I like it. I'd prefer assembly to one of these syntactically-diabetic languages like C# ... but I suppose not everyone would have the same tastes.
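One possible variation on the awk command, not part of the original suggestion: gsub returns the number of substitutions it made, so the matches can be summed directly instead of building up a string. The sample file and the variable n are made up for the sketch.

```shell
tmp=$(mktemp -d)

# Hypothetical two-line sample containing three capital X's.
printf 'aXbX\nXc\n' > "$tmp/sample.txt"

# Replace each X with itself and add up how many replacements happened.
# "n + 0" forces a numeric 0 if the file has no X at all.
count=$(awk '{ n += gsub(/X/, "X") } END { print n + 0 }' "$tmp/sample.txt")
echo "$count"   # prints: 3
```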

If the goal/problem were to work with a subset of columns of input data, then I would agree awk would most likely be best.

But perhaps then for the OP: for the specific situations you listed in your first post, I do believe sed would be best, but if/when you face related but not identical challenges, do not assume sed will still be the right tool. Learn sed, awk, grep, cut, comm, and other unix tools so you can always quickly find good solutions to new problems rather than trying to shoe-horn new problems into the solution that worked for an old problem.

Or in other words, the most important tool is your own grey matter.

Last edited by Trilby (2019-02-02 15:59:04)

WeeDram · 2019-02-03 19:48:48

Thank you for the ongoing suggestions and discussion. I am beginning to make progress on this.

The files are not XML but I have wished occasionally that I could get at some of the content of these a little more easily.

There ARE tab-delimited columns in the files, but I want to keep these and not operate on them. Does their presence in the file influence the search and replace? It appears, from one of my learning attempts, that commands can remove or replace some but not all of the tabs. I'm still not sure what happened, but I have a bad habit of driving ahead looking for the 'right' way instead of looking closely at the 'wrong' way and its problematic output, so I can't replicate this.
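On the tab question, a sed substitution only touches text its pattern matches, so tabs survive unless the pattern can match them (for instance `.`, `[^X]*` or `\s`). If an edit should be confined to one column, that is where awk's column handling helps. A minimal sketch, assuming three tab-separated columns and that only the second one should be cleaned; the sample data and file name are invented.

```shell
tmp=$(mktemp -d)

# Hypothetical row: three tab-separated columns, each containing <br>.
printf 'id<br>1\tfoo<br>bar\tkeep<br>me\n' > "$tmp/cols.tsv"

# FS and OFS are both set to a tab, so the columns are split on tabs
# and joined with tabs again. Only field 2 is edited.
result=$(awk 'BEGIN { FS = OFS = "\t" }
              { gsub(/<br>/, " ", $2); print }' "$tmp/cols.tsv")

printf '%s\n' "$result"
```

Columns one and three keep their `<br>` and every tab stays in place.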

Trilby, your advice about tools reminds me of the old saying "To a person with a hammer, every problem looks like a nail" (or something like that). As for grey matter, well perhaps I'll post in a different Arch forum on how memory problems are slowing down my computing.

Arch Linux
