Possibly incorrect output from 'wc' [findings]

NoGuiLinux · 2016-08-31 01:19:00

I was writing a piece of C++ test code for one of my scripts that dumps the file size and the line count of a given file. To test the C++ code, I needed a set of files ranging from bytes to gigabytes in size. To create them I used the command below.

 SIZE_FACTOR=M ; dd if=/dev/zero of=test bs=1$SIZE_FACTOR count=1
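
If dd is not at hand, a file with the same content can be produced from Python. This is only a minimal sketch: make_zero_file and the temporary directory are hypothetical names chosen for illustration, and the size matches the bs=1M count=1 case above.

```python
import os
import tempfile


def make_zero_file(path: str, size: int) -> None:
    """Write `size` NUL bytes to `path`, similar to dd reading /dev/zero."""
    with open(path, "wb") as f:
        f.write(b"\0" * size)


tmp_dir = tempfile.mkdtemp()           # scratch directory for the test file
path = os.path.join(tmp_dir, "test")
make_zero_file(path, 1024 * 1024)      # 1M, as with bs=1M count=1
size_on_disk = os.path.getsize(path)
print(size_on_disk)
```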

I ran my C++ program, called linereader as:

 ./linereader test --no-print

Receiving an output of:

LINE_COUNT : 1
FILE_SIZE_BYTES : 1048576
ROUNDED_SIZE : 1.04858Mb

Now when I run the command below:

 wc -l test

I receive the following result:

 
0 test

Now, I know that the 'dd' command copied zeros from /dev/zero and wrote them to the file test, resulting in a 1 MiB (1048576 byte) file.
So there is data in the file. If nothing delimits that data into multiple lines, or even one line, I can see why 'wc' would not consider the file to contain any line. But if there is data in the file without a delimiter, it would also stand to reason that the file is one line long. It would be better to say that the file consists of a single 1 MiB line with no delimiter or end-of-file terminator. So for 'wc' to say that the file 'test' consists of zero lines would be incorrect.

Now, if I were to use the command below to generate an absolutely empty file:

 touch Test

and run the following command, knowing that 'touch' creates an empty file.

 wc -l Test

I would receive an accurate result, as below.

0 Test

And if I were to use 'linereader':

./linereader Test --no-print

I would also receive:

LINE_COUNT : 0
FILE_SIZE_BYTES : 0
ROUNDED_SIZE : 0b

That output makes sense because the file contains nothing. But the file produced by 'dd' does contain data, so it would stand to reason that 'test' has one line, not the zero lines that 'wc -l' reported.

To take this further, I created another file called test, using the same 'dd' method, and split the file using 'split'.

split test -n 2

resulting in the files 'xaa' and 'xab'.

I then created a python3 script called 'tester.py' to read both files.

#! /usr/bin/python3
## I prefer to specify which python, so that when I use my scripts on distros that still link to python2.7, I won't have as many issues
for i in open("./xaa","r"):
  print(i)
for i in open("./xab","r"):
  print(i)

Running the code with:

python tester.py

Results in the displayed result:

[carl@sparknohss proff]$ python tester.py 


[carl@sparknohss proff]$

What this shows is that there is indeed one line in each file, if a line is defined as data with or without a line terminator.
Again, let's take this one step further, and run the command displayed below.

python tester.py > tw

After doing so, run the command:

wc -l tw

you should now see the result as:

2 tw

What happened is this: when the Python interpreter finished reading and printing the first file, print() placed a newline at the end of the data written to stdout and continued on to the next file. Effectively, this added a '\n' character to the end of each printed data set. Now 'wc -l' considers the file to have two lines, because 'wc -l' relies on the terminator to decide whether a line exists, so data without a terminator is not counted.
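
The effect of print() can be reproduced without touching the disk. This is a minimal sketch that uses in-memory stand-ins for 'xaa' and 'xab'; the 524288 byte size assumes the 1M file was split into two equal halves, and the names half and out are only for illustration.

```python
import io

# Stand-ins for 'xaa' and 'xab': NUL characters only, no newline anywhere.
half = "\0" * 524288

out = io.StringIO()
for chunk in (half, half):
    # Iterating a text stream yields "lines"; with no '\n' in the data,
    # the whole content comes back as a single item.
    for line in io.StringIO(chunk):
        print(line, file=out)  # print() appends '\n' after each item

data = out.getvalue()
newlines_in_input = half.count("\n")    # newlines present in each input
newlines_in_output = data.count("\n")   # newlines added by print()
print(newlines_in_input, newlines_in_output)
```

The inputs contain no newline at all; the two newlines in the output come purely from print().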

So, out of curiosity, as this is not really that important: would it not be better if 'wc' had an option to choose the line-terminating character, so that other kinds of lines could be counted, or am I barking up the wrong tree (no pun intended)?

Trilby · 2016-08-31 01:27:48

wc doesn't count lines.

man wc wrote:

NAME
wc - print newline, word, and byte counts for each file
...
-l, --lines
print the newline counts

`grep -c "." filename` can count lines.
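
To make the distinction concrete, here is a small sketch in Python. count_newlines mirrors what the man page describes for -l, while count_lines is a hypothetical helper that also counts trailing data without a terminator as a line. That is one possible definition of "line", not something wc implements.

```python
def count_newlines(data: bytes) -> int:
    """Count newline bytes, as the man page describes for 'wc -l'."""
    return data.count(b"\n")


def count_lines(data: bytes) -> int:
    """Count newline-terminated lines plus a final unterminated one, if any."""
    n = data.count(b"\n")
    if data and not data.endswith(b"\n"):
        n += 1  # trailing data without a terminator counts as a line here
    return n


zeros = b"\0" * 1048576  # same content as the dd-generated file
print(count_newlines(zeros), count_lines(zeros))    # dd-style file
print(count_newlines(b""), count_lines(b""))        # empty file, like touch
print(count_newlines(b"a\nb"), count_lines(b"a\nb"))
```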

NoGuiLinux · 2016-08-31 04:32:22

Missed that one, and now I feel like an idiot. Thanks.

NoGuiLinux · 2016-08-31 04:56:46

grep does not work either.

[carl@sparknohss linereader]$ cat test | grep -c "."
0
[carl@sparknohss linereader]$ grep -c . test
0
[carl@sparknohss linereader]$ grep -v . test
Binary file test matches
[carl@sparknohss linereader]$ grep -vc . test <- equivalent to `wc -c test`
1048576
[carl@sparknohss linereader]$ grep -cz '.' test
0
[carl@sparknohss linereader]$ grep -c '.*' test 
1048576

From the grep man page:

       -v, --invert-match
              Invert the sense of matching, to  select  non-matching
              lines.
       -c, --count
              Suppress  normal  output;  instead  print  a  count of
              matching lines for each  input  file.   With  the  -v,
              --invert-match  option (see below), count non-matching
              lines.
       -z, --null-data
              Treat  the input as a set of lines, each terminated by
              a zero byte (the ASCII NUL  character)  instead  of  a
              newline.   Like  the  -Z or --null option, this option
              can be used with commands  like  sort  -z  to  process
              arbitrary file names.
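
As a rough illustration of what the -z description means for this file: if every NUL byte terminates a record, a file of 1048576 NUL bytes is 1048576 empty records, and '.' needs at least one character to match. The Python sketch below only models that description. It is not how grep is implemented, and split_records is a made-up helper name.

```python
import re


def split_records(data: bytes, terminator: bytes = b"\0") -> list:
    """Split data into records that each end with the given terminator."""
    records = data.split(terminator)
    if records and records[-1] == b"":
        records.pop()  # drop the empty piece after the final terminator
    return records


zeros = b"\0" * 1048576          # same content as the dd-generated file
records = split_records(zeros)   # every record is empty
matching = sum(1 for r in records if re.search(rb".", r))
print(len(records), matching)
```

Under that model no record matches '.', which would be consistent with the 0 printed by `grep -cz '.' test` above.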

Do you have any other suggestions?

bullet · 2016-08-31 05:10:10

I'm confused, what do you want to do? A file only containing null bytes is not a text file, so the notion of applying text file metrics on it makes no sense at all.

So we're clear here - it doesn't contain "0" as in the character, it contains \0 bytes, bytes with the value 0. How is that supposed to be a line?
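
A small Python sketch of that difference, not tied to any of the tools above (the variable names are only for illustration):

```python
nul_byte = b"\0"    # a byte with the value 0, which is what /dev/zero produces
zero_char = b"0"    # the printable character '0'

# Indexing a bytes object gives the numeric value of the byte.
print(nul_byte[0], zero_char[0])
print(nul_byte == zero_char)
print(zero_char.decode("ascii").isprintable(),
      nul_byte.decode("ascii").isprintable())
```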

Last edited by bullet (2016-08-31 05:10:17)

Arch Linux
