I am looking for a way to recursively go through a set of directories, removing XML style comments from all files.
Here are some examples of removing XML comments from <file>
For starters:
awk 'in_comment&&/-->/{sub(/([^-]|-[^-])*--+>/,"");in_comment=0}
in_comment{next}
{gsub(/<!--+([^-]|-[^-])*--+>/,"");
in_comment=sub(/<!--+.*/,"");
print}'
Or:
xmlstarlet c14n --without-comments original.file > new.file
The latter example isn't in and of itself desirable as it would clutter the directory with dupes.
I need to put this into a script that I can pull the trigger on once and have it strip the comments and save the changes, as there are sub-directories with multiple files.
Many thanks for any help with this.
Offline
Thinking of something like:
find ./ -type f -print0 | awk 'in_comment&&/-->/{sub(/([^-]|-[^-])*--+>/,"");in_comment=0}
in_comment{next}
{gsub(/<!--+([^-]|-[^-])*--+>/,"");
in_comment=sub(/<!--+.*/,"");
print}'
Or:
find . -type f -exec sed -i 's/<!--/\x0<!--/g;s/-->/-->\x0/g' {} \;
I went with the latter, so this is solved as far as I am concerned.
For those who want to use it: beware, it is completely indiscriminate and will strip comments from ALL files from where the shell script is executed and down.
Last edited by opt1mus (2020-06-02 16:08:45)
Offline
Or:
xmlstarlet c14n --without-comments original.file > new.file
The latter example isn't in and of itself desirable as it would clutter the directory with dupes.
Follow up on this by moving the temp file over on top of the original if the previous command succeeded. You're already using a script, so this shouldn't be a problem.
Make a function to process a single file, then invoke that function with something like:
find . -name "*.xml" -print0 | while IFS= read -r -d '' line; do
process_one_file "$line"
done
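A minimal sketch of what such a function might look like, assuming xmlstarlet is installed. The helper name rewrite_in_place is hypothetical (it is not from this thread); it runs a command with the file name appended as the last argument, captures stdout in a temp file next to the original, and only moves the temp file over the original if the command succeeded. Note that mktemp creates the temp file with restrictive permissions, so the replaced file may not keep its original mode.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run "<command> <args...> <file>" and replace <file>
# with the command's stdout, but only if the command exits successfully.
rewrite_in_place() {
    local file=$1 tmp
    shift
    # Create the temp file beside the original so mv stays on one filesystem.
    tmp=$(mktemp "${file}.XXXXXX") || return 1
    if "$@" "$file" > "$tmp"; then
        mv -- "$tmp" "$file"
    else
        # Leave the original untouched and clean up on failure.
        rm -f -- "$tmp"
        return 1
    fi
}

# process_one_file as named in the loop above; uses the xmlstarlet
# invocation quoted earlier in the thread.
process_one_file() {
    rewrite_in_place "$1" xmlstarlet c14n --without-comments
}
```

Because the command is passed in as arguments, the same helper could wrap a different filter if xmlstarlet turns out not to suit the files in question.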
Managing AUR repos The Right Way -- aurpublish (now a standalone tool)
Offline
Thanks eschwartz.
I ultimately went with this:
find . -type f -exec sed -i 's/<!--/\x0<!--/g;s/-->/-->\x0/g' {} \;
I'll repeat the warning I gave in editing my previous post: this one-liner will strip comments out of EVERY file from where the script is executed and down, so beware if you wish to copy this and use it in anger.
--edit below, this is better; more discriminate--
find . -name '*.xml' -exec sed -i 's/<!--/\x0<!--/g;s/-->/-->\x0/g' {} \;
Last edited by opt1mus (2020-06-02 16:20:24)
Offline
Sed is not an xml parser.
Why are you *adding* null bytes to the document and considering this to answer the question???
Offline
I'm not parsing xml files specifically; I'm stripping xml comments from <anything>. As for the null bytes, I lifted this command from elsewhere and missed an important chunk off the end. The line had appeared to work for me, though I may have gone off in the holster.
What I inadvertently did was miss out important steps from someone else's solution:
sed 's/<!--/\x0<!--/g;s/-->/-->\x0/g' | grep -zv '^<!--' | tr -d '\0' | grep -v "^\s*$"
To mark and hose things in the middle.
I accidentally hit report against your post, meaning to click reply, apologies.
Last edited by opt1mus (2020-06-02 17:03:02)
Offline
Well then, there you have it. Marking ranges with null bytes and then deleting those ranges *could* be used to implement a multiline comment removal filter, which is less terminally broken than just sed'ing out lines with <!--. However, it requires more than just using sed.
You're not going to be able to do this without saving grep | tr | grep results to a temp file, and you cannot do this with find -exec either way. So at this point I'd strongly recommend not using homebrew xml parsers, and just using xmlstarlet.
Is there a reason you think you need to remove xml comments from files which are not xml files? I'm curious what that tangent was meant in defense of. o_0
Offline
Sed certainly can deal with multiline comments without that null byte and pipeline gymnastics. While I definitely agree you should use a proper xml parser, if you must use sed, learn to use it properly rather than just using the s// command. This sed script will remove comments, whether they span multiple lines, sit within a line, or are the only item on the line. It will fail on nested comments though:
/<!--/ {
    :loop
    /-->/ {
        /^<!--.*-->$/ {
            N
            s/\n//
        }
        s/<!--.*-->//
        b done
    }
    N
    b loop
    :done
}
EDIT: actually, the extra step there to remove the newline if the comment is the only thing on the line doesn't make sense. While it's aesthetically pleasing, and I'm leaving it as someone may want to do that, if you literally remove just what's between the <!-- and --> then the blank line with a newline should remain. So the simpler version:
/<!--/ {
    :loop
    /-->/ {
        s/<!--.*-->//
        b done
    }
    N
    b loop
    :done
}
This can be uglified/compressed to a one-liner:
sed '/<!--/{:a;/-->/{s/<!--.*-->//;bb;};N;ba;:b;}'
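For the recursive case asked about at the top of the thread, one possible way to apply that one-liner in place is sketched below. It assumes GNU sed (for -i and for semicolons after labels) and restricts itself to *.xml files; the function name strip_comments_under is hypothetical. One caveat: because .* is greedy, a line holding two comments loses everything from the first <!-- to the last -->, including any content between them.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: strip XML-style comments, in place, from every *.xml
# file under the given directory (default: current directory).
# Assumes GNU sed; the sed program is the one-liner shown above.
strip_comments_under() {
    find "${1:-.}" -type f -name '*.xml' \
        -exec sed -i '/<!--/{:a;/-->/{s/<!--.*-->//;bb;};N;ba;:b;}' {} +
}
```

Calling strip_comments_under with no argument works from the current directory down, so trying it on a copy of the tree first seems prudent.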
Last edited by Trilby (2020-06-02 18:28:19)
"UNIX is simple and coherent" - Dennis Ritchie; "GNU's Not Unix" - Richard Stallman
Offline
Thank you both for your help so far.
The reason I need the comments removed: they don't play nicely with an HMI runtime. They had been inserted into hundreds of custom assets from a development platform and were likely presumed harmless.
Offline