Pivot table with awk or something to sum replicate values [solved]

graysky · 2022-08-15 20:32:57

I want to take a file structured in two columns and for every unique value in the first column, sum the corresponding value in the 2nd column.

% cat example.txt
10 20
10 20
20 24
20 40
30 40
30 20

The desired output is:

10 40
20 44
30 60

Googling for this isn't turning up anything I can use. Suggestions are welcomed, awk, perl, doesn't matter. My goal is to put it into a shell script at the end of the day.

Last edited by graysky (2022-08-16 09:55:19)

jasonwryan · 2022-08-15 20:51:02

I get different ouptput to you, because the 20 total is 64 according to your requirements (as I read them)

 awk '{x[$1]+=$2;}END{for(i in x)print i" "x[i];}' example.txt                                                                                                            
10 40
20 64
30 60

graysky · 2022-08-15 21:52:06

Nice, thanks! Yes, typo there, 44 != 64.

Alad · 2022-08-16 09:00:01

datamash makes this a bit easier:

datamash groupby 1 sum 2

graysky · 2022-08-16 09:54:45

Thanks Alad, useful util!

Last edited by graysky (2022-08-16 09:55:00)

CarbonChauvinist · 2022-08-17 02:28:09

So besides the awk and *interesting* datamash options already shared, this post intrigued me because it's an obvious place for SQL.

Some quick google-fu shows that there are a number of applications that will allow you to run SQL queries directly on text files and a couple of these are already packaged in the AUR. (Assuming you can work with csv/tsv for this, though some do allow for json and other formats, quite informative write up about the options here)

I tried textql first which is based on SQLite and could read from a .txt extension file without issue. It also handled spaces in the csv file which caused other of the programs to treat the second column as a text.

~ $ cat example.txt
a,b
10, 20
10, 20
20, 24
20, 40
30, 40
30, 20

~ $ textql -header -sql "SELECT a, sum(b), count(*) from example GROUP BY a" example.txt
10,40,2
20,64,2
30,60,2

I also tried octosql also packaged in aur which allows for more file formats (via plugins, by default only csv and JSON) and also supports joins! But it didn't seamlessly handle the spaces in the second column like textql (I didn't look further into casting than a simple try that also failed).

~ $ cat example.csv
a,b
10,20
10,20
20,24
20,40
30,40
30,20

~ $ octosql "SELECT a, SUM(b), COUNT(*) FROM example.csv GROUP BY a"
+----+-------+-------+
| a  | sum_b | count |
+----+-------+-------+
| 10 |    40 |     2 |
| 20 |    64 |     2 |
| 30 |    60 |     2 |
+----+-------+-------+

~ $ cat example_spaces.csv
a,b
10, 20
10, 20
20, 24
20, 40
30, 40
30, 20

~ $ octosql "SELECT a, SUM(b), COUNT(*) FROM example_spaces.csv GROUP BY a"
...
Error: typecheck error: unknown aggregate: sum(String)

Octosql uses a custom engine, not SQLite which may explain some of the differences, oh and there's also this tidbit when first run,

OctoSQL sends anonymous usage statistics to help us guide the development of OctoSQL.
You can view the most recently sent usage events in the ~/.octosql/telemetry/recent directory.
You can turn telemetry off by setting the environment variable OCTOSQL_NO_TELEMETRY to 1.
Please don't though, as it helps us a great deal.

Last edited by CarbonChauvinist (2022-08-17 02:32:17)

Trilby · 2022-08-17 02:42:57

I'm not sure what textql really does here as sqlite will import flat files just fine:

# Assuming example.txt has added headers "A B" on the first line
printf '.import example.txt d\nselect A, sum(B) from d group by A;' | sqlite3

CarbonChauvinist · 2022-08-17 02:51:25

@Trilby, interesting, is that importing the data though and not just querying the flat file directly (i.e what does the 'import' mean there)? Either way, I really need to play more with SQLite.

Trilby · 2022-08-17 03:00:58

There is no difference betwee "importing the data" and "querying the flat file directly" ... or at least the latter doesn't exist. All the tools you refer to import the flat file into a in-memory sqlite database; that's the only way any sql queries could run. The tools you refer to just wrap sqlite3 and do the import for you. You may as well use sqlite3 directly where you have more direct control of what is happening (and you don't need to build additional packages from the AUR).

Last edited by Trilby (2022-08-17 03:02:21)

Arch Linux

#1 2022-08-15 20:32:57

Pivot table with awk or something to sum replicate values [solved]

#2 2022-08-15 20:51:02

Re: Pivot table with awk or something to sum replicate values [solved]

#3 2022-08-15 21:52:06

Re: Pivot table with awk or something to sum replicate values [solved]

#4 2022-08-16 09:00:01

Re: Pivot table with awk or something to sum replicate values [solved]

#5 2022-08-16 09:54:45

Re: Pivot table with awk or something to sum replicate values [solved]

#6 2022-08-17 02:28:09

Re: Pivot table with awk or something to sum replicate values [solved]

#7 2022-08-17 02:42:57

Re: Pivot table with awk or something to sum replicate values [solved]

#8 2022-08-17 02:51:25

Re: Pivot table with awk or something to sum replicate values [solved]

#9 2022-08-17 03:00:58

Re: Pivot table with awk or something to sum replicate values [solved]

Board footer