[solved] split or manipulate file based on matching entries in column

oliver · 2014-05-15 15:16:59

I have a file in the following format:

500 +18588581234 +13035551212 0001 0001-0001
400 +18588581235 +13035551212 0001 0001-0001
100 +18588581236 +13035551212 0001 0001-0001
300 +16196191234 +14045551212 0001 0001-0001
200 +16196191235 +14045551212 0001 0001-0001

I'd like to end up with the following output where the entries in the first column are added if the 3rd column matches (I don't really care about the 2nd column)
In other words, 500 + 400 + 100 because the third column are all +13035551212

1000 +13035551212 0001 0001-0001
500  +14045551212 0001 0001-0001

I've been playing around with trying to split the file based on the 3rd column and then walking through it line by line but I thought there must be a graceful one-liner to do this.

I have any Arch tools (repo or AUR) available to me and don't really mind the method as long as it's sane.

The file is always sorted reverse-numerically on the first column and then numerically on the third but I can do whatever it needs to get it working if required

Last edited by oliver (2014-05-16 13:04:56)

Trilby · 2014-05-15 15:22:26

I don't know about a *one* liner, but this seems like a perfect task for awk in which it would be a few-liner.

EDIT: I was going to provide an example - and I'll probably get around to it later, but I have to run at the moment.

Last edited by Trilby (2014-05-15 15:23:24)

Saint0fCloud · 2014-05-15 15:42:29

I don't know about a one-liner either, but here's a zsh solution.

#!/bin/zsh
# To run: `./<script> [filename]`

zmodload zsh/mapfile

[[ ! -f "$1" ]] && { print "Invalid file, $1"; exit 1 }
FILE=(${(f@)mapfile[$1]})
typeset -A SUMS

for i in $FILE[@]; do
    SUMS[${i[(ws. .)3,-1]}]=$((${i[(ws. .)1]} + ${SUMS[${i[(ws. .)3,-1]}]:-0}))
done

for i in ${(ok)SUMS}; print "$SUMS[$i] $i"

Output:

1000 +13035551212 0001 0001-0001
500 +14045551212 0001 0001-0001

Last edited by Saint0fCloud (2014-05-15 15:44:41)

oliver · 2014-05-15 15:54:04

Thank you both Trilby and Saint0fCloud... I found a similar example online using awk and modified it a little:

awk '{array[$3]+=$1} END { for (f in array) {print array[f], f, $4, $5}}' $DATAFILE

It's getting closer but not quite there (and if I modify my test data it's clearer why)

datafile

500 +18588581234 +13035551212 0003 0003-0003
400 +18588581235 +13035551212 0003 0003-0003
100 +18588581236 +13035551212 0003 0003-0003
300 +16196191234 +14045551212 0004 0004-0004
200 +16196191235 +14045551212 0004 0004-0004

$ ./testfile 
500 +14045551212 0004 0004-0004
1000 +13035551212 0004 0004-0004

If I can't get it going in awk, I'll switch over to the zsh method

edit
Not sure this is the most graceful but it seems to work

awk '{array[$3," " $4," " $5]+=$1} END { for (f in array) {print array[f], f}}' $DATAFILE

1000 +13035551212 0003 0003-0003
500 +14045551212 0004 0004-0004

If anyone sees anything horribly wrong with the logic, feel free to point it out.

Last edited by oliver (2014-05-15 17:10:45)

jakobcreutzfeldt · 2014-05-15 21:33:40

Check out compute.

$ compute --group 3 sum 1 <data.csv 
+13035551212 1000
+14045551212 500

Last edited by jakobcreutzfeldt (2014-05-15 21:34:24)

Trilby · 2014-05-15 21:45:31

Whoa, that looks handy. I can't find it in the repos or AUR though. Am I missing an AUR package for it and/or do you use it in Arch?

jakobcreutzfeldt · 2014-05-15 21:48:55

I just found out about it and I haven't had a chance to really play around with it or make a PKGBUILD. But, I saw this thread and thought, "oh that's what that's for".

Sure, you can do way more with awk or R, but sometimes, just like it's easier to do a quick 'cut' or 'join', it would be easier to do some quick number-crunching on a csv file with 'compute'. Frankly, I'm surprised something like it hasn't been written before.

Last edited by jakobcreutzfeldt (2014-05-15 21:52:11)

Trilby · 2014-05-15 21:52:46

Yup, I've used R for doing little tasks like that *many* times. Working with student grade sheets is one example.

While R is powerful, it can take a handful of lines, or at least some careful thought about how to select the right rows and columns to get the calculation I'd want. The 'group' option looks like the real selling point for me for some quick summary info from the command line.

I'm building it from git now, I'll double check later that it's not already in the AUR, then I may put together a PKGBUILD ... unless you were planning on it.

jakobcreutzfeldt · 2014-05-15 21:56:57

Trilby wrote:

then I may put together a PKGBUILD ... unless you were planning on it.

I'm about to head to bed, so it's all yours!

oliver · 2014-05-16 13:04:10

compute looks great - thanks for mentioning it. I'm going to mark this badboy as solved

oliver · 2014-05-16 18:43:38

I'm going to attempt to submit compute to AUR unless anyone has already got the process underway. It will be my first time so don't expect something in 60 seconds

Trilby · 2014-05-16 18:55:22

Sorry about that - I intended to do so last night. I had the working PKGBUILD in a working directory in /tmp/. I then got distracted, forgot about it, and shutdown before bed. But is a trivially simple PKGBUILD to make (at least the git version I did was), it will a good first PKGBUILD experience. Let me know if you need any help with it.

oliver · 2014-05-16 18:57:37

Just posted it here:

https://bbs.archlinux.org/viewtopic.php?id=181613

Arch Linux

#1 2014-05-15 15:16:59

[solved] split or manipulate file based on matching entries in column

#2 2014-05-15 15:22:26

Re: [solved] split or manipulate file based on matching entries in column

#3 2014-05-15 15:42:29

Re: [solved] split or manipulate file based on matching entries in column

#4 2014-05-15 15:54:04

Re: [solved] split or manipulate file based on matching entries in column

#5 2014-05-15 21:33:40

Re: [solved] split or manipulate file based on matching entries in column

#6 2014-05-15 21:45:31

Re: [solved] split or manipulate file based on matching entries in column

#7 2014-05-15 21:48:55

Re: [solved] split or manipulate file based on matching entries in column

#8 2014-05-15 21:52:46

Re: [solved] split or manipulate file based on matching entries in column

#9 2014-05-15 21:56:57

Re: [solved] split or manipulate file based on matching entries in column

#10 2014-05-16 13:04:10

Re: [solved] split or manipulate file based on matching entries in column

#11 2014-05-16 18:43:38

Re: [solved] split or manipulate file based on matching entries in column

#12 2014-05-16 18:55:22

Re: [solved] split or manipulate file based on matching entries in column

#13 2014-05-16 18:57:37

Re: [solved] split or manipulate file based on matching entries in column

Board footer