#1 2009-07-30 07:10:17

karag
Member
Registered: 2006-09-05
Posts: 84

Request: linux solution for comparing large sets of data

Hi all -
I'm an enthusiastic Linux amateur with no programming skills at all. I would like to develop some kind of tool to compare large sets of data for work. I am simply not able to learn a programming language right now (I work full time, have 2 pre-schoolers and am studying for a masters), although I would like to. I have dabbled with LAMP before and guess something like that might work.

Here is a brief outline of what I need to do:

- I have two sets of data (this will be thousands of rows in .csv format)
- they should both hold exactly the same info, although one set formats it in a different way
- however, one set of data is missing info, for various reasons
- both sets need to be compared against each other
- a result is produced which lists the missing data

Because they format the same info in different ways, I guess something like a 'rule' or 'lookup table' will be needed to identify what should be considered a 'match'. Ideally, any solution could be accessed via a web interface or even run portably from USB.

So I'm just asking for someone to point me in the right direction with where to start - if there is a basic way of doing this in Linux I would love to use it and hopefully extend it somehow to my Windows work environment. Very grateful for any advice.


#2 2009-07-30 07:27:26

Xyne
Administrator/PM
Registered: 2008-08-03
Posts: 6,965
Website

Re: Request: linux solution for comparing large sets of data

Just to make sure that I've understood this correctly, you have 2 datasets that might look like this (abstractly):

a,b,,d
e,,g,h,
i,,k,l
A,B,C,D
E,F,G,H
I,J,K,L

and you would like a way to compare them to find the missing elements/cells in the former and produce a result which shows what they should be according to the latter, using a rule/hash table (C->c, F->f, J->j).
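
That rule/hash table is trivial to express in Python as a dict - a minimal sketch using the abstract values above, just to show the idea:

```python
# a minimal sketch of the "rule"/hash-table idea: a dict mapping entries
# as they appear in the second set to their equivalents in the first
rule = {"C": "c", "F": "f", "J": "j"}

# looking up a value from one set gives its equivalent in the other
print(rule["C"])  # prints "c"
```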

If that's the case, then yes, there is a basic way to do it, but it will involve some scripting. Not much, but I understand that if you're pressed for time then it would take too long to get your feet wet enough to manage it.

Can you give an example of the different formats and the data involved?
Does the rule/lookup table change between data sets?
Should the output format tell you the row and column for each missing datum?


My Arch Linux Stuff | Forum Etiquette | Community Ethos - Arch is not for everyone


#3 2009-07-30 08:40:26

karag
Member
Registered: 2006-09-05
Posts: 84

Re: Request: linux solution for comparing large sets of data

Hi Xyne - thanks for the reply...
the two data sets would look more like this:

A,B,C,D
E,F,G,H
I,J,K,L
A,b,c,D

i,J,K,l

the first set of data is assumed to be 'correct'.
both sets have the same info in the top and bottom row - even though the info is presented in different ways, they should be 'matched'.
the middle row is totally missing in the second set, so I want to have the details of

E,F,G,H

and other missing rows, so it can be input manually.

here is an example from data set 1, which is assumed to be correct:

NHI       task       date        outcome
ABC1234   Pneum 5m   19-Aug-08   Complete
ZZZ1234   Pneum 3m   21-Oct-08   Complete
LNX1234   DTaP 6w    20-Oct-08   Complete

and here is an example from data set 2:

NHI       task   indication   status      completed
ABC1234   PCV7   5m           Completed   19/08/2008

LNX1234   DTaP   6w           Completed   20/10/2008

The 'NHI' number is a unique identifier that is used consistently in both sets of data (3 alpha, 4 numeric).
as before, the top and bottom rows should be 'matched' - the same info is presented in different ways (task 'Pneum 5m' is the same as task 'PCV7' with indication '5m'). However, the middle row is missing in the second set, so I need it to recognise that there is no such task recorded on 21-Oct-08 for that NHI - and then to give me the details so it can be input.

data set 1 and 2 will always be produced in this format with that structure, so any lookup table will not need to be changed once set up.
the output should ideally print the whole of the missing row from data set 1, as is.

really appreciate your help with this (sorry if the code boxes above are messed up, can't tab!)


#4 2009-07-30 09:35:31

Xyne
Administrator/PM
Registered: 2008-08-03
Posts: 6,965
Website

Re: Request: linux solution for comparing large sets of data

The unique identifiers simplify things considerably if all you need is to know which lines in set 1 lack equivalents in set 2.

Here's a python script: *edit* with comments to make it clear what's going on

#!/usr/bin/env python

from sys import argv

# the file paths to set 1 and set 2, passed as arguments
path1 = argv[1]
path2 = argv[2]

# a set to keep track of which tags we've seen
nhi_set = set()

# open the file of set 2 and read each line,
# adding the unique NHI tag to the nhi_set
f = open(path2)
for line in f.readlines():
  nhi_set.add( line[0:7] )
f.close()

# open the file of set 1 and then print each line 
# which contains an NHI tag not found in set 2
f = open(path1)
for line in f.readlines():
  nhi = line[0:7]
  if nhi not in nhi_set:
    print line
f.close()

You'll need Python to run that but it's completely portable. Save it to a file (e.g. "compare-nhi"), make it executable and then invoke it:

./compare-nhi /path/to/set1 /path/to/set2

The script makes the following assumptions:
1) The first 7 characters of each line are always the NHI tag, i.e. there are never any spaces at the beginning of the line.
2) The NHI tag is the only thing that matters... it doesn't care about anything else on the line.

The script will only print out those lines in set 1 that lack corresponding lines in set 2. Output for the example sets given above:

ZZZ1234     Pneum 3m           21-Oct-08            Complete

It wouldn't be difficult to modify it to output a full replacement for set 2 if there is a clear rule for filling in each field. Date translation from set 1's format to set 2's format is straightforward and I see that the indication in set 2 is pulled from the task in set 1, but I don't see how "Pneum" in set 1 translates to "PCV7" in set 2.

There are of course several ways to make the script a bit more robust/fault-tolerant, such as using regular expressions to grab the NHI and ignoring whitespace, giving more info such as line numbers of missing lines, etc.

It would only take a few minutes to rewrite it in Perl if that would be more practical.

Last edited by Xyne (2009-07-30 09:39:22)




#5 2009-08-01 09:20:18

karag
Member
Registered: 2006-09-05
Posts: 84

Re: Request: linux solution for comparing large sets of data

Hi again -
This is great!
I've been playing around with this, trying to understand as much as I can - this is a great way to learn a little bit, hopefully to learn more in future...
It definitely works. I tested it on a larger set of data and it produced the correct result - I only need it to return the missing line(s) from set 1.

The formatting of the task is a bit weird - PCV7 is just an acronym for 'pneumococcal conjugate vaccine (7-valent)' so 'Pneum' is a lazy way of recording that (set 1 is produced as an output of a third party application which is, err, pretty awful).

I did forget to mention one important thing - ideally this script would also search for missing tasks recorded for an NHI which is already present in both sets of data. 'NHI' is a unique identifier for a person, and different tasks are recorded and stored for each person. A person would normally have 9 tasks recorded over a 15 month period. Whilst this script is perfect for identifying NHIs (people) totally missing from data set 2, it wouldn't detect if data set 1 had (for example) all 9 tasks recorded but data set 2 only had 6 - for the same NHI. For example, data set 1 has 4 tasks recorded for 2 NHIs:

ABC1234   DTaP 6w
ABC1234   Pneum 6w
XXX1234   DTaP 3m
XXX1234   Pneum 3m

But data set 2 has only 1 task for 1 of these NHIs:

ABC1234   DTaP 6w

so, ideally the result should be the last 3 rows of data set 1.

I realise this makes it much more complex!
Matching the tasks up might be tricky since they're formatted in different ways across different fields. As far as I can see there would be 2 possibilities to do this:
1. to find a unique match per row somehow - there are several more columns in the real life data which could be used. I can provide some examples if this might be a possibility. Or
2. to count the number of times an NHI occurs in data set 1, compare it with data set 2, and if there is a discrepancy return all rows with that NHI from data set 1. This would be easier, I guess, but would return a lot of 'false' results which would require more manual checking.

The other potential issue is that data in set 2 does not have the NHI at the beginning of the line - it's in the 6th column of the .csv data and its absolute position will always vary because of the fields that precede it. This is not a big deal for me because I can just move that column to the beginning, but I wonder if there is some way to search for it via some alphanumeric wildcard combination? There is no other data in either set that will always be in that format 'ABC1234'.
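
For what it's worth, that 'alpha-numeric wildcard' is exactly what a regular expression does. A sketch with Python's standard re module (find_nhi is just an illustrative helper name, not part of the script above):

```python
import re

# NHI pattern: 3 letters then 4 digits, per the "3 alpha, 4 numeric"
# description above; the \b word boundaries stop it matching inside
# longer codes such as batch numbers
NHI_RE = re.compile(r'\b[A-Za-z]{3}[0-9]{4}\b')

def find_nhi(line):
    """Return the first NHI-shaped token found anywhere in the line, or None."""
    m = NHI_RE.search(line)
    return m.group(0) if m else None
```

With this, the NHI can be found wherever it sits in the row, so the column would not need to be moved to the front.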

I'm sorry this probably makes it a lot more complex! Very grateful for your help.


#6 2009-08-01 11:17:14

iphitus
Forum Fellow
From: Melbourne, Australia
Registered: 2004-10-09
Posts: 4,927

Re: Request: linux solution for comparing large sets of data

What exact format is it in? You mentioned it being in CSV at the start, though the example sets above are not CSV.

If it's CSV, that makes things much easier. Python has a usable CSV module that will simplify any loading a great deal.
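
For illustration, loading CSV with that module looks roughly like this (a sketch in modern Python; the column names are taken from examples later in the thread, and io.StringIO just stands in for a real file):

```python
import csv
import io

# a stand-in for a real CSV file on disk
sample = io.StringIO('"NHI","Task"\n"ABC1234","PCV7"\n"LNX1234","DTaP"\n')

# DictReader maps each row to a dict keyed by the header names,
# handling quoting and embedded commas automatically
rows = list(csv.DictReader(sample))
print(rows[0]["NHI"], rows[0]["Task"])
```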


#7 2009-08-01 18:35:45

Xyne
Administrator/PM
Registered: 2008-08-03
Posts: 6,965
Website

Re: Request: linux solution for comparing large sets of data

Here's an updated version which should find all tasks in data set 1 which lack corresponding tasks in data set 2:

#!/usr/bin/env python

import re
from sys import argv

def main(argv):

  # a dictionary to hold the file paths to set 1, set 2 and the task table
  fpaths = {}
  
  # the command-line options which we will accept, in the form "opt=arg"
  options = set(['set1', 'set2', 'table'])
  
  # read in the command-line options
  for arg in argv[1:]:
    split_arg = arg.split('=', 1)
    if len(split_arg) > 1:
      opt, fpath = split_arg
      if opt in options and len(fpath) > 0:
        fpaths[opt] = fpath
  
  # check that we have all 3 file paths and exit with a usage message if not
  for opt in options:
    if not fpaths.has_key(opt):
      print "usage:\n    %s set1=/path/to/set1 set2=/path/to/set2 table=/path/to/table\n" % argv[0]
      print "where:\n    set1 is the file containing data set 1\n"
      print "    set2 is the file containing data set 2\n"
      print "    table is a file containing a task conversion table to map tasks"
      print "    in set 1 to tasks in set2. Each conversion pair should be on a"
      print "    single line and separated with a single space, e.g.\n"
      print "        Pneum PCV7\n"
      exit(1)
  

  # build the task conversion dictionary from the table file
  tasks = {}
  try:
    f = open(fpaths['table'])
  except IOError:
    print "error: unable to open", fpaths['table']
    exit(1)
  else:
    for line in f.readlines():
      columns = line.strip().split(' ', 1)
      if len(columns) > 1:
        tasks[columns[0]] = columns[1]


  # use a dictionary to store NHIs and corresponding tasks from set 2
  set2_data = {}
  try:
    f = open(fpaths['set2'])
  except IOError:
    print "error: unable to open", fpaths['set2']
    exit(1)
  else:
    for line in f.readlines():
      line = line.strip()
      m = re.match(r'([A-Za-z]{3}[0-9]{4})\s+(\S+)', line)
      if m:
        nhi = m.group(1)
        task = m.group(2)
        if set2_data.has_key(nhi):
          set2_data[nhi].append(task)
        else:
          set2_data[nhi] = [task]
      else:
        pass
        #print "warning: ignoring the following line in data set 2"
        #print "    ", line
    f.close()


  # now print out all lines in data set 1 which lack corresponding lines
  # in data set 2, using the conversion table to match tasks
  try:
    f = open(fpaths['set1'])
  except IOError:
    print "error: unable to open", fpaths['set1']
    exit(1)
  else:
    for line in f.readlines():
      line = line.strip()
      m = re.match(r'([A-Za-z]{3}[0-9]{4})\s+(\S+)', line)
      if m:
        nhi = m.group(1)
        task = m.group(2)
        if tasks.has_key(task):
          task = tasks[task]
        if not set2_data.has_key(nhi) or not task in set2_data[nhi]:
          print line
      else:
        pass
        #print "warning: ignoring the following line in data set 1"
        #print "    ", line
    f.close()
      

if __name__ == "__main__":
  main(argv)

I'm calling it "task_compare" for now but the name makes no difference.

In order for this to work, it requires a file in addition to the 2 data sets. The additional file is a conversion table for mapping tasks in set 1 to tasks in set 2 and takes the following format:

Pneum PCV7

Each row contains a single pair separated by a single space.

I've also switched to using named options to handle 3 different files, so the script should be invoked with

./task_compare set1=/path/to/set1 set2=/path/to/set2 table=/path/to/table

The order of the arguments is irrelevant.


I've received your email and will reply to that shortly.




#8 2009-08-10 08:52:39

karag
Member
Registered: 2006-09-05
Posts: 84

Re: Request: linux solution for comparing large sets of data

Hi Xyne & all -
thanks so much for this, and sorry not to reply sooner. My laptop screen died this week so I've been hastily getting a new machine together - a drama in itself.
I've tried this a few times, and played around with different source data, but I get no results - no errors, but it does not return anything. I wonder if I am doing this correctly.
iphitus made the good point that I have not provided data in exactly the format it will be produced - I was trying to be concise but realise it may have just been unhelpful. So, I have copied two sets of real life (anonymised) data exactly as they will be produced and given to me:
dataset1 is from the clinic and assumed to be correct:

"NHI","VACCINE","IMMDATE","OC","OUTCOME","BATCH","EXPIRY","DILUENTBATCH","DILUENTEXPIRY","PROVIDER","VACCINATOR","SITECODE","NIROPTDESCR"
"ABC1234","DTaP-IPV-Hep B/Hib6w",20-Oct-08,"G","Given","A21CA438D",15-Sep-10,,,"ABM","LS","RT","Opt on"
"ABC1234","Pneum, conj 6w",20-Oct-08,"G","Given",31256,9-Jun-10,,,"ABM","LS","LT","Opt on"
"ABC1234","DTaP-IPV-Hep B/Hib3m",1-Dec-08,"G","Given","A21FAA426CA",15-Sep-10,,,"ABM","MJ","RT","Opt on"
"ABC1234","Pneum, conj 3m",1-Dec-08,"G","Given",32500,15-Oct-10,,,"ABM","MJ","LT","Opt on"
"ABC1234","DTaP-IPV-Hep B/Hib5m",11-Feb-09,"G","Given","A21CA469G",15-Sep-10,,,"ABM","MART","RVL","Opt on"
"ABC1234","Pneum, conj 5m",11-Feb-09,"G","Given",33463,15-Oct-10,,,"ABM","MART","LVL","Opt on"
"ZZZ9311","DTaP-IPV-Hep B/Hib6w",24-Oct-08,"G","Given","A21CA438D",15-Sep-10,,,"JK","LS","LT","Opt on"
"ZZZ9311","Pneum, conj 6w",24-Oct-08,"G","Given",31256,9-Jun-10,,,"JK","LS","RT","Opt on"
"ZZZ9311","DTaP-IPV-Hep B/Hib3m",8-Dec-08,"G","Given","A21FAA426CA",15-Sep-10,,,"ABM","MJ","RT","Opt on"
"ZZZ9311","Pneum, conj 3m",8-Dec-08,"G","Given",32500,15-Oct-10,,,"ABM","MJ","LT","Opt on"
"ZZZ9311","DTaP-IPV-Hep B/Hib5m",17-Feb-09,"G","Given","A21CA469G",15-Sep-10,,,"ABM","LS","LT","Opt on"
"ZZZ9311","Pneum, conj 5m",17-Feb-09,"G","Given",33463,15-Oct-10,,,"ABM","LS","RT","Opt on"
"LNX9320","DTaP-IPV-Hep B/Hib6w",24-Oct-08,"G","Given","A21CA438D",15-Sep-10,,,"JK","LS","LT","Opt on"
"LNX9320","Pneum, conj 6w",24-Oct-08,"G","Given",31256,9-Jun-10,,,"JK","LS","RT","Opt on"
"LNX9320","DTaP-IPV-Hep B/Hib3m",8-Dec-08,"G","Given","A21FAA426CA",15-Sep-10,,,"ABM","MJ","RT","Opt on"
"LNX9320","Pneum, conj 3m",8-Dec-08,"G","Given",32500,15-Oct-10,,,"ABM","MJ","LT","Opt on"
"LNX9320","DTaP-IPV-Hep B/Hib5m",17-Feb-09,"G","Given","A21CA469G",15-Sep-10,,,"ABM","MJ","LT","Opt on"
"LNX9320","Pneum, conj 5m",17-Feb-09,"G","Given",33463,15-Oct-10,,,"ABM","MJ","RT","Opt on"
"SVG7698","Pneum, conj 5m",02/17/09,"G","Given",33463,10/15/10,,,"ABM","MJ","RT","Opt on"

In this example, the task and indication are combined (e.g. DTaP-IPV-Hep B/Hib5m) whereas in dataset2 they are separated. The other main difference is that the 3rd column (date given) is formatted 17-Feb-09, rather than 2/17/2009 as in dataset2 column 12.
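
As an aside, that date difference can be bridged with Python's datetime module. A sketch (ds1_to_ds2_date is a hypothetical helper; note it yields zero-padded dates like 02/17/2009, while dataset2 appears to drop leading zeros, so a real comparison should normalize both sides):

```python
from datetime import datetime

def ds1_to_ds2_date(d):
    """Convert a dataset1 date like '17-Feb-09' into dataset2's
    month/day/year style (zero-padded)."""
    # %d-%b-%y parses day, abbreviated month name, 2-digit year
    return datetime.strptime(d, "%d-%b-%y").strftime("%m/%d/%Y")

print(ds1_to_ds2_date("17-Feb-09"))  # prints "02/17/2009"
```
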
Here is dataset2, which represents the data I currently have in my database for this clinic:

"Clinic","Responsible Provider","Name","Nominated Provider","DOB","NHI","Task","Indication","Status","Code","Due","Completed","Vaccinator","Site","Batch No.","Expiry Date"
"D Medical Centre","Dr M",,"Unknown GP",9/5/2008,"ABC1234","DTaP-IPV-HepB/Hib","6w","Completed",,10/17/2008,10/20/2008,"Nrs. L ","OTH","A21CA438D",Sep-10
"D Medical Centre","Dr M",,"Unknown GP",9/5/2008,"ABC1234","PCV7","6w","Completed",,10/17/2008,10/20/2008,"Nrs. L ","OTH",31256,Jun-10
"D Medical Centre","Dr M",,"Unknown GP",9/5/2008,"ABC1234","DTaP-IPV-HepB/Hib","3m","Completed",,12/5/2008,12/1/2008,"Nrs. M ","OTH","A21FAA426CA",Sep-10
"D Medical Centre","Dr M",,"Unknown GP",9/5/2008,"ABC1234","PCV7","3m","Completed",,12/17/2008,12/1/2008,"Nrs. M ","OTH",32500,Oct-10
"D Medical Centre","Dr M",,"Unknown GP",9/5/2008,"ABC1234","DTaP-IPV-HepB/Hib","5m","Completed",,2/5/2009,2/11/2009,"Nrs. M ","RVL","A21CA469G",Sep-10
"D Medical Centre","Dr M",,"Unknown GP",9/5/2008,"ABC1234","PCV7","5m","Completed",,3/7/2009,2/11/2009,"Nrs. M ","LVL",33463,Oct-10
"D Medical Centre","Dr K",,"Unknown GP",9/6/2008,"ZZZ9311","DTaP-IPV-HepB/Hib","6w","Completed",,10/18/2008,10/24/2008,"Nrs. L ","OTH","A21CA438D",Sep-10
"D Medical Centre","Dr K",,"Unknown GP",9/6/2008,"ZZZ9311","PCV7","6w","Completed",,10/18/2008,10/24/2008,"Nrs. L ","OTH",31256,Jun-10
"D Medical Centre","Dr M",,"Unknown GP",9/6/2008,"ZZZ9311","DTaP-IPV-HepB/Hib","5m","Completed",,2/6/2009,2/17/2009,"Nrs. L ","OTH","A21CA469G",Sep-10
"D Medical Centre","Dr M",,"Unknown GP",9/6/2008,"ZZZ9311","PCV7","5m","Completed",,3/11/2009,2/17/2009,"Nrs. L ","OTH",33463,Oct-10
"D Medical Centre","Dr K",,"Unknown GP",9/6/2008,"LNX9320","DTaP-IPV-HepB/Hib","6w","Completed",,10/18/2008,10/24/2008,"Nrs. L ","OTH","A21CA438D",Sep-10
"D Medical Centre","Dr K",,"Unknown GP",9/6/2008,"LNX9320","PCV7","6w","Completed",,10/18/2008,10/24/2008,"Nrs. L ","OTH",31256,Jun-10
"D Medical Centre","Dr M",,"Unknown GP",9/6/2008,"LNX9320","DTaP-IPV-HepB/Hib","3m","Completed",,12/6/2008,12/8/2008,"Nrs. M ","OTH","A21FAA426CA",Sep-10
"D Medical Centre","Dr M",,"Unknown GP",9/6/2008,"LNX9320","PCV7","3m","Completed",,12/20/2008,12/8/2008,"Nrs. M ","OTH",32500,Oct-10

So we can see that data for NHI ABC1234 matches exactly. The "DTaP-IPV-Hep B/Hib6w" is the same as "DTaP-IPV-HepB/Hib" (task),"6w" (indication). The "Pneum, conj 6w" is the same as "PCV7" (task),"6w" (indication) - and so on for the subsequent doses 3m and 5m.

However, we have 2 tasks missing for NHI ZZZ9311, 2 missing for NHI LNX9320 and we have no record at all of NHI SVG7698. Therefore the output required from a comparison would be:

"ZZZ9311","DTaP-IPV-Hep B/Hib3m",8-Dec-08,"G","Given","A21FAA426CA",15-Sep-10,,,"ABM","MJ","RT","Opt on"
"ZZZ9311","Pneum, conj 3m",8-Dec-08,"G","Given",32500,15-Oct-10,,,"ABM","MJ","LT","Opt on"
"LNX9320","DTaP-IPV-Hep B/Hib5m",17-Feb-09,"G","Given","A21CA469G",15-Sep-10,,,"ABM","MJ","LT","Opt on"
"LNX9320","Pneum, conj 5m",17-Feb-09,"G","Given",33463,15-Oct-10,,,"ABM","MJ","RT","Opt on"
"SVG7698","Pneum, conj 5m",02/17/09,"G","Given",33463,10/15/10,,,"ABM","MJ","RT","Opt on"

This is exactly how it will be used (only for hundreds of thousands of rows!), so I'm really sorry I was not as clear beforehand. Would be very grateful for any advice or a point in the right direction about how this can be achieved. I must say that Python is (relatively) readable even for a complete beginner like myself; I even delved into some online tutorials today and am keen to learn more. It's a big task though with no programming experience.
let me know if I can clarify anything, and thanks again for your help.

Last edited by karag (2009-08-10 08:54:31)


#9 2009-08-10 15:06:32

Trent
Member
From: Baltimore, MD (US)
Registered: 2009-04-16
Posts: 990

Re: Request: linux solution for comparing large sets of data

I wish my computer science teachers had given assignments like this one. It may be complicated, but at least it's well-defined.

Python is a good choice for this project because it's not only easy to read and beginner-friendly, it has a great CSV module that will automate away much of the tedium involved in parsing CSV files.  Perl has its own strengths, but I personally advise you to use Python.  Look at http://docs.python.org/library/csv.html, especially the part about DictReader -- I think it will help in your situation.

I'm thinking your code will be something along the lines of this:
1. Read in dataset1 and dataset2.
2. Write a method that will compare an entry from dataset1 to one from dataset2, and return True if they are identical (False otherwise).
3. For every record in dataset1,
   a. search for a corresponding record in dataset2
   b. if a corresponding record is found in dataset2, delete it;
   c. if a corresponding record is not found, print the record.

Step 2 will probably be the most challenging.  It will help if you establish exactly what makes each record unique.  It seems that the combination of NHI, vaccination, and date of administration is sufficient to uniquely identify a record, correct?  Then you could write a method that converts those three fields from the formats they have in dataset2 to those in dataset1, and compare them directly.
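
The outline above might be sketched like this (an illustration only - record_key and missing_rows are hypothetical names, the uniqueness fields follow dataset1's header, and the set-2 keys are assumed to have already been translated into dataset1's format):

```python
def record_key(row):
    # uniqueness assumption from the post: NHI + vaccine + date given
    return (row["NHI"], row["VACCINE"], row["IMMDATE"])

def missing_rows(ds1_rows, ds2_keys):
    """Return dataset1 rows whose key has no counterpart in ds2_keys
    (a set of keys already converted to dataset1's format)."""
    unmatched = []
    for row in ds1_rows:
        k = record_key(row)
        if k in ds2_keys:
            ds2_keys.discard(k)    # step 3b: matched, so "delete" it
        else:
            unmatched.append(row)  # step 3c: no match, so report it
    return unmatched
```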

Last edited by Trent (2009-08-10 15:07:09)


#10 2009-08-10 15:44:09

Xyne
Administrator/PM
Registered: 2008-08-03
Posts: 6,965
Website

Re: Request: linux solution for comparing large sets of data

@Trent
I don't think this is a homework assignment. karag has sent me an email explaining the situation in greater detail and it appears to be a simple case of a non-programmer requiring a tool to get a job done. As he (she?) has already pointed out, he has "no programming skills at all". While Python might be relatively easy to learn (compared to other languages), not everyone can take the time to sit down and figure out how to code something. If you need to get to the airport on Friday and don't yet have a license, asking someone who does for a lift makes more sense than trying to get your own by then.

@karag
The full formatting makes a huge difference. Here's an updated version of the script:

#!/usr/bin/env python

# Author: Xyne
# Website: xyne.archlinux.ca
# Email: replace "xyne." with "xyne@" in the website URL

import csv
import re
from sys import argv, stdout

def main(argv):

  # a dictionary to hold the file paths to set 1, set 2 and the task table
  fpaths = {}

  # the command-line options which we will accept, in the form "opt=arg"
  options = set(['ds1', 'ds2', 't'])

  # read in the command-line options
  for arg in argv[1:]:
    split_arg = arg.split('=', 1)
    if len(split_arg) > 1:
      opt, fpath = split_arg
      if opt in options and len(fpath) > 0:
        fpaths[opt] = fpath

  # check that we have all 3 file paths and exit with a usage message if not
  for opt in options:
    if not fpaths.has_key(opt):
      print "usage:\n    %s ds1=/path/to/dataset1 ds2=/path/to/dataset2 t=/path/to/table\n" % argv[0]
      print '''where:
  "dataset1" is the csv file containing data set 1

  "dataset2" is the csv file containing data set 2

  "table" is a csv file which maps "VACCINE" entries in dataset1, minus the
  "Indication",  to "Task" entries in dataset2. This is best illustrated with
  an example.

  Given the following 2 "VACCINE" entries in dataset1

    "Pneum, conj 6w"
    "DTaP-IPV-Hep B/Hib5m"

  and the corresponding "Task" and "Indication" entries in dataset2:

    "PCV7","6w"
    "DTaP-IPV-HepB/Hib","5m"

  the table file would be constructed as follows:

    "VACCINE","Task"
    "Pneum, conj ","PCV7"
    "DTaP-IPV-Hep B/Hib","DTaP-IPV-HepB/Hib"

  Note the final space in "Pneum, conj ". This is so that when the "VACCINE"
  entry in the table ("Pneum, conj ") is combined with the "Indication" of
  dataset2 ("6w"), we get the "VACCINE" entry of dataset1 ("Pneum, conj 6w")

'''
      exit(1)


  # read the data sets and table
  dataset1 = csv.DictReader( open(fpaths['ds1']) )
  dataset2 = csv.DictReader( open(fpaths['ds2']) )
  table = csv.DictReader( open(fpaths['t']) )

  # create dictionary from table
  taskmap = {}
  for t in table:
    taskmap[ t['Task'] ] = t['VACCINE']

  # create NHI-VACCINE dictionary for dataset2 using
  # the taskmap dictionary to correlate the data sets
  ds2dict = {}
  for r in dataset2:
    nhi = r['NHI']
    if taskmap.has_key(r['Task']):
      vaccine = taskmap[ r['Task'] ] + r['Indication']
    else:
      vaccine = r['Task'] + r['Indication']
    # use sets to keep track of tasks
    if ds2dict.has_key(nhi):
      ds2dict[nhi].add(vaccine)
    else:
      ds2dict[nhi] = set([vaccine])

  # create a writer for printing the rows to stdout
  writer = csv.DictWriter(stdout, dataset1.fieldnames, quoting=csv.QUOTE_ALL)

  # print out the unmatched rows in dataset1
  for r in dataset1:
    nhi = r['NHI']
    if not ds2dict.has_key(nhi) or r['VACCINE'] not in ds2dict[nhi]:
      writer.writerow(r)





if __name__ == "__main__":
  main(argv)

This requires a csv file which maps "VACCINE" entries in data set 1 to "Task" entries in data set 2. Run the script without any arguments for further explanation. Here's a table for the example data sets given in your post:

"VACCINE","Task"
"Pneum, conj ","PCV7"
"DTaP-IPV-Hep B/Hib","DTaP-IPV-HepB/Hib"

and here's the output of running the script:

"ZZZ9311","DTaP-IPV-Hep B/Hib3m","8-Dec-08","G","Given","A21FAA426CA","15-Sep-10","","","ABM","MJ","RT","Opt on"
"ZZZ9311","Pneum, conj 3m","8-Dec-08","G","Given","32500","15-Oct-10","","","ABM","MJ","LT","Opt on"
"LNX9320","DTaP-IPV-Hep B/Hib5m","17-Feb-09","G","Given","A21CA469G","15-Sep-10","","","ABM","MJ","LT","Opt on"
"LNX9320","Pneum, conj 5m","17-Feb-09","G","Given","33463","15-Oct-10","","","ABM","MJ","RT","Opt on"
"SVG7698","Pneum, conj 5m","02/17/09","G","Given","33463","10/15/10","","","ABM","MJ","RT","Opt on"



#11 2009-08-11 00:58:02

Trent
Member
From: Baltimore, MD (US)
Registered: 2009-04-16
Posts: 990

Re: Request: linux solution for comparing large sets of data

Xyne wrote:

@Trent
I don't think this is a homework assignment. karag has sent me an email explaining the situation in greater detail and it appears to be a simple case of a non-programmer requiring a tool to get a job done. As he (she?) has already pointed out, he has "no programming skills at all". While Python might be relatively easy to learn (compared to other languages), not everyone can take the time to sit down and figure out how to code something. If you need to get to the airport on Friday and don't yet have a license, asking someone who does for a lift makes more sense than trying to get your own by then.

I didn't mean to imply that I thought it was a homework assignment, although I see that my post did give that impression (sorry about that).  I was just saying that a true-to-life example like this would have been a lot more educational than "center-align an equilateral triangle with side length N on an 80-column screen".


#12 2009-08-11 23:10:51

karag
Member
Registered: 2006-09-05
Posts: 84

Re: Request: linux solution for comparing large sets of data

WOW!
This is amazing - it works perfectly. Thank you so much!
I'm playing around with the table file and mapping a few other tasks that crop up, have just extended it to include MMR and other older vaccines.

Trent's point about using real examples is a good one, and actually it seems to work both ways - as a non-programmer I have come across a real life situation which can most efficiently be fixed with some code, so (all things being equal) there is an incentive for me to learn how to do this. Most people who are not 'into' computers probably wouldn't even consider or know how to ask about this possibility, so in that respect I'm lucky enough to be part of a vibrant and positive linux community.

Xyne, once I've finished all the mapping I'll be testing it on much larger data sets so will let you know how it goes.
Thanks again


#13 2009-08-19 02:27:21

karag
Member
Registered: 2006-09-05
Posts: 84

Re: Request: linux solution for comparing large sets of data

Hi again -

there were a lot more combinations of 'task' and 'indication' than I thought!
the sample data posted previously works fine - thank you so much for your help. But there are other task/indication combinations that do not match as things currently are.
I could post all the possible permutations for info, but thought I'd just ask:
(and this is based on my limited understanding of the mechanics of this script)

As things currently are, this script (via table.csv) takes the 'Indication' from dataset 2, and appends it to the main portion of the 'VACCINE' from dataset 1 to find a match.
Is it possible instead to have a more explicit lookup table which definitively lists what should be matched?
For example, instead of table.csv:

"VACCINE","Task"
"Pneum, conj ","PCV7"
"DTaP-IPV-Hep B/Hib","DTaP-IPV-HepB/Hib"

is this feasible:

"VACCINE","Task","Indication"
"Pneum, conj 6w","PCV7","6w"
"DTaP-IPV-Hep B/Hib6w","DTaP-IPV-HepB/Hib","6w"
"Pneum, conj 3m","PCV7","3m"
"DTaP-IPV-Hep B/Hib3m","DTaP-IPV-HepB/Hib","3m"
"Pneum, conj 5m","PCV7","5m"
"DTaP-IPV-Hep B/Hib5m","DTaP-IPV-HepB/Hib","5m"

I know this would have an impact on the rest of the code, but could make it much more extensible to apply to other vaccinations and to keep up to date with statutory changes.
So I'm guessing that the section

# create dictionary from table
  taskmap = {}
  for t in table:
    taskmap[ t['Task'] ] = t['VACCINE']

would become something like

# create dictionary from table
  taskmap = {}
  for t in table:
    taskmap[ t['Task'] + t['Indication'] ] = t['VACCINE']

Man, I'd love to understand this stuff!


#14 2009-08-19 03:21:46

Xyne
Administrator/PM
Registered: 2008-08-03
Posts: 6,965
Website

Re: Request: linux solution for comparing large sets of data

Can you give an example of a task/indication combination that fails with the previous code?

Here's an updated version all the same:

#!/usr/bin/env python

# Author: Xyne
# Website: xyne.archlinux.ca
# Email: replace "xyne." with "xyne@" in the website URL

import csv
import re
from sys import argv, stdout

exit_on_unmatched = True

def main(argv):

  # a dictionary to hold the file paths to set 1, set 2 and the task table
  fpaths = {}

  # the command-line options which we will accept, in the form "opt=arg"
  options = set(['ds1', 'ds2', 't'])

  # read in the command-line options
  for arg in argv[1:]:
    split_arg = arg.split('=', 1)
    if len(split_arg) > 1:
      opt, fpath = split_arg
      if opt in options and len(fpath) > 0:
        fpaths[opt] = fpath

  # check that we have all 3 file paths and exit with a usage message if not
  for opt in options:
    if not fpaths.has_key(opt):
      print "usage:\n    %s ds1=/path/to/dataset1 ds2=/path/to/dataset2 t=/path/to/table\n" % argv[0]
      print '''where:
  "dataset1" is the csv file containing data set 1

  "dataset2" is the csv file containing data set 2

  "table" is a csv file which matches "VACCINE" entries in dataset1 to "Task"
  and "Indication" entries in dataset2. This is best illustrated with an
  example:

  Given the following 2 "VACCINE" entries in dataset1

    "Pneum, conj 6w"
    "DTaP-IPV-Hep B/Hib5m"

  and the corresponding "Task" and "Indication" entries in dataset2:

    "PCV7","6w"
    "DTaP-IPV-HepB/Hib","5m"

  the table file would be constructed as follows:

    "VACCINE","Task","Indication"
    "Pneum, conj 6w","PCV7","6w"
    "DTaP-IPV-Hep B/Hib5m","DTaP-IPV-HepB/Hib","5m"

  Note the space before "6w" in "Pneum, conj 6w": each "VACCINE" entry in the
  table must match the corresponding dataset1 entry exactly, including any
  spacing.

'''
      exit(1)


  # read the data sets and table
  dataset1 = csv.DictReader( open(fpaths['ds1']) )
  dataset2 = csv.DictReader( open(fpaths['ds2']) )
  table = csv.DictReader( open(fpaths['t']) )

  # create dictionary from table
  taskmap = {}
  for t in table:
    # the ':::' is to prevent possible overlapping concatenations
    taskmap[ t['Task'] + ':::' + t['Indication'] ] = t['VACCINE']

  # create NHI-VACCINE dictionary for dataset2 using
  # the taskmap dictionary to correlate the data sets
  ds2dict = {}
  unmatched = set()
  for r in dataset2:
    nhi = r['NHI']
    mapped_task = r['Task'] + ':::' + r['Indication']
    if taskmap.has_key(mapped_task):
      vaccine = taskmap[mapped_task]
    elif exit_on_unmatched:
      unmatched.add( (r['Task'], r['Indication']) )
      continue
    else:
      vaccine = r['Task'] + r['Indication']
    # use sets to keep track of tasks
    if ds2dict.has_key(nhi):
      ds2dict[nhi].add(vaccine)
    else:
      ds2dict[nhi] = set([vaccine])

  # exit with a message if there are unmatched tasks and indications
  if len(unmatched) > 0:
    print '''The table does not specify a matching "VACCINE" entry in data set 1 for the
following "Task" and "Indication" combinations in data set 2:
'''
    writer = csv.DictWriter(stdout, ('Task','Indication'), quoting=csv.QUOTE_ALL)
    for r in unmatched:
      writer.writerow({'Task':r[0],'Indication':r[1]})
    exit(1)

  # create a writer for printing the rows to stdout
  writer = csv.DictWriter(stdout, dataset1.fieldnames, quoting=csv.QUOTE_ALL)

  # print out the unmatched rows in dataset1
  for r in dataset1:
    nhi = r['NHI']
    if not ds2dict.has_key(nhi) or r['VACCINE'] not in ds2dict[nhi]:
      writer.writerow(r)





if __name__ == "__main__":
  main(argv)

This now requires a table which specifies all matches. The "exit_on_unmatched" variable can be set to False to fall back to "VACCINE" = "Task" + "Indication" whenever no matching row is found in the table. Using the example data given above, the table file is:

"VACCINE","Task","Indication"
"DTaP-IPV-Hep B/Hib3m","DTaP-IPV-HepB/Hib","3m"
"DTaP-IPV-Hep B/Hib5m","DTaP-IPV-HepB/Hib","5m"
"DTaP-IPV-Hep B/Hib6w","DTaP-IPV-HepB/Hib","6w"
"Pneum, conj 3m","PCV7","3m"
"Pneum, conj 5m","PCV7","5m"
"Pneum, conj 6w","PCV7","6w"

The output is the same as in the previous post.
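As an aside, the ':::' separator in the script matters because plain concatenation of two fields can produce the same key for two different pairs. A quick illustration:

```python
# Two different field pairs collide under plain concatenation...
print('AB' + 'C' == 'A' + 'BC')                  # True
# ...but stay distinct once a separator is inserted between the fields.
print('AB' + ':::' + 'C' == 'A' + ':::' + 'BC')  # False
```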

If the table is not complete, you will get a list of entries which need to be added:

The table does not specify a matching "VACCINE" entry in data set 1 for the
following "Task" and "Indication" combinations in data set 2:

"DTaP-IPV-HepB/Hib","6w"
"DTaP-IPV-HepB/Hib","3m"
"PCV7","6w"
"DTaP-IPV-HepB/Hib","5m"
"PCV7","3m"
"PCV7","5m"
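If it helps, that unmatched listing can be turned into starter rows for table.csv with a few lines of Python. This is only a sketch: the unmatched pairs are pasted in as a string, and the generated "VACCINE" column is just a Task+Indication placeholder that still has to be corrected by hand to the real dataset1 strings.

```python
import csv
import io

# Unmatched ("Task","Indication") pairs as printed by the script (example data)
unmatched_csv = '"PCV7","6w"\n"DTaP-IPV-HepB/Hib","3m"\n'

out = io.StringIO()
writer = csv.writer(out, quoting=csv.QUOTE_ALL)
writer.writerow(['VACCINE', 'Task', 'Indication'])
for task, indication in csv.reader(io.StringIO(unmatched_csv)):
    # placeholder VACCINE entry: edit these rows to the real dataset1 strings
    writer.writerow([task + indication, task, indication])

print(out.getvalue())
```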



#15 2009-08-23 21:38:40

karag
Member
Registered: 2006-09-05
Posts: 84

Re: Request: linux solution for comparing large sets of data

Sorry to take so long to reply, and thanks for the new script! It sure does work on the data posted above. I've been testing it with PortablePython on Windows at work, where I was getting some incorrect matching. However, I've just spent time testing it on my Linux box and it works perfectly - even with the new combinations added.
Here are the extra combinations as they would be represented in the 'table.csv' file:

"DTaP-IPV-1 6w","DTaP-IPV","6w"
"Hib-HepB-1 6w","Hib-HepB","6w"
"DTaP-IPV-2 3m","DTaP-IPV","3m"
"Hib-HepB-2 3m","Hib-HepB","3m"
"DTaP-IPV-3 5m","DTaP-IPV","5m"
"HepB, paed-3 5m","HepB","5m"
"MMR-1 15m","MMR","15m"
"Hib 15m","Hib","15m"
"MMR-2 4y","MMR","4y"

What was happening was that, as I added the new combinations and tested the matching one at a time, at some point I would suddenly get ALL the 'DTaP-IPV-Hep B/Hib' lines returned alongside the correct results. I can't replicate this now on my Linux box, so I'll continue editing and testing.
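One way to chase phantom matches like that is to print the table fields with repr(), so stray spaces or Windows carriage returns become visible. A sketch with a deliberately dirty example row (not real data):

```python
# repr() exposes invisible characters that break exact-string matching.
row = {'Task': 'MMR ', 'Indication': '15m\r'}  # deliberately dirty example
print(repr(row['Task']), repr(row['Indication']))  # prints: 'MMR ' '15m\r'

key = row['Task'] + ':::' + row['Indication']
print(key == 'MMR' + ':::' + '15m')  # False: the stray space and \r spoil the match
```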

Adding the new tasks/indications under a new NHI to both datasets, the updated dataset files would be:
dataset1:

"NHI","VACCINE","IMMDATE","OC","OUTCOME","BATCH","EXPIRY","DILUENTBATCH","DILUENTEXPIRY","PROVIDER","VACCINATOR","SITECODE","NIROPTDESCR"
"ABC1234","DTaP-IPV-Hep B/Hib6w","20-Oct-08","G","Given","A21CA438D","15-Sep-10",,,"ABM","LS","RT","Opt on"
"ABC1234","Pneum, conj 6w","20-Oct-08","G","Given",31256,"9-Jun-10",,,"ABM","LS","LT","Opt on"
"ABC1234","DTaP-IPV-Hep B/Hib3m","1-Dec-08","G","Given","A21FAA426CA","15-Sep-10",,,"ABM","MJ","RT","Opt on"
"ABC1234","Pneum, conj 3m","1-Dec-08","G","Given",32500,"15-Oct-10",,,"ABM","MJ","LT","Opt on"
"ABC1234","DTaP-IPV-Hep B/Hib5m","11-Feb-09","G","Given","A21CA469G","15-Sep-10",,,"ABM","MART","RVL","Opt on"
"ABC1234","Pneum, conj 5m","11-Feb-09","G","Given",33463,"15-Oct-10",,,"ABM","MART","LVL","Opt on"
"ZZZ9311","DTaP-IPV-Hep B/Hib6w","24-Oct-08","G","Given","A21CA438D","15-Sep-10",,,"JK","LS","LT","Opt on"
"ZZZ9311","Pneum, conj 6w","24-Oct-08","G","Given",31256,"9-Jun-10",,,"JK","LS","RT","Opt on"
"ZZZ9311","DTaP-IPV-Hep B/Hib3m","8-Dec-08","G","Given","A21FAA426CA","15-Sep-10",,,"ABM","MJ","RT","Opt on"
"ZZZ9311","Pneum, conj 3m","8-Dec-08","G","Given",32500,"15-Oct-10",,,"ABM","MJ","LT","Opt on"
"ZZZ9311","DTaP-IPV-Hep B/Hib5m","17-Feb-09","G","Given","A21CA469G","15-Sep-10",,,"ABM","LS","LT","Opt on"
"ZZZ9311","Pneum, conj 5m","17-Feb-09","G","Given",33463,"15-Oct-10",,,"ABM","LS","RT","Opt on"
"LNX9320","DTaP-IPV-Hep B/Hib6w","24-Oct-08","G","Given","A21CA438D","15-Sep-10",,,"JK","LS","LT","Opt on"
"LNX9320","Pneum, conj 6w","24-Oct-08","G","Given",31256,"9-Jun-10",,,"JK","LS","RT","Opt on"
"LNX9320","DTaP-IPV-Hep B/Hib3m","8-Dec-08","G","Given","A21FAA426CA","15-Sep-10",,,"ABM","MJ","RT","Opt on"
"LNX9320","Pneum, conj 3m","8-Dec-08","G","Given",32500,"15-Oct-10",,,"ABM","MJ","LT","Opt on"
"LNX9320","DTaP-IPV-Hep B/Hib5m","17-Feb-09","G","Given","A21CA469G","15-Sep-10",,,"ABM","MJ","LT","Opt on"
"LNX9320","Pneum, conj 5m","17-Feb-09","G","Given",33463,"15-Oct-10",,,"ABM","MJ","RT","Opt on"
"SVG7698","Pneum, conj 5m","02/17/09","G","Given",33463,"10/15/10",,,"ABM","MJ","RT","Opt on"
"XXX9999","DTaP-IPV-1 6w","26-Aug-05","G","Given","AC20B040BB","15-Feb-08",,,"NS","MART","RVL","Opt on"
"XXX9999","Hib-HepB-1 6w","26-Aug-05","G","Given","F2033","15-Feb-07",,,"NS","MART","RVL","Opt on"
"XXX9999","DTaP-IPV-2 3m","26-Sep-05","G","Given","AC20B040BB","15-Feb-08",,,"NS","MART","RVL","Opt on"
"XXX9999","Hib-HepB-2 3m","26-Sep-05","G","Given","F2033","15-Feb-07",,,"NS","MART","RVL","Opt on"
"XXX9999","DTaP-IPV-3 5m","26-Nov-05","G","Given","AC20B040BB","15-Feb-08",,,"NS","MART","RVL","Opt on"
"XXX9999","HepB, paed-3 5m","26-Nov-05","G","Given","F2033","15-Feb-07",,,"NS","MART","RVL","Opt on"
"XXX9999","Hib 15m","18-Nov-06","G","Given","A72DA154A","15-May-10",,,"NS","MART","RVL","Opt on"
"XXX9999","MMR-1 15m","18-Nov-06","G","Given","G3634","15-Nov-07","G2987","15-Feb-08","NS","MART","RA","Opt on"

dataset2:

"Clinic","Responsible Provider","Name","Nominated Provider","DOB","NHI","Task","Indication","Status","Code","Due","Completed","Vaccinator","Site","Batch No.","Expiry Date"
"D Medical Centre","Dr M",,"Unknown GP","9/5/2008","ABC1234","DTaP-IPV-HepB/Hib","6w","Completed",,"10/17/2008","10/20/2008","Nrs. L ","OTH","A21CA438D","Sep-10"
"D Medical Centre","Dr M",,"Unknown GP","9/5/2008","ABC1234","PCV7","6w","Completed",,"10/17/2008","10/20/2008","Nrs. L ","OTH",31256,"Jun-10"
"D Medical Centre","Dr M",,"Unknown GP","9/5/2008","ABC1234","DTaP-IPV-HepB/Hib","3m","Completed",,"12/5/2008","12/1/2008","Nrs. M ","OTH","A21FAA426CA","Sep-10"
"D Medical Centre","Dr M",,"Unknown GP","9/5/2008","ABC1234","PCV7","3m","Completed",,"12/17/2008","12/1/2008","Nrs. M ","OTH",32500,"Oct-10"
"D Medical Centre","Dr M",,"Unknown GP","9/5/2008","ABC1234","DTaP-IPV-HepB/Hib","5m","Completed",,"2/5/2009","2/11/2009","Nrs. M ","RVL","A21CA469G","Sep-10"
"D Medical Centre","Dr M",,"Unknown GP","9/5/2008","ABC1234","PCV7","5m","Completed",,"3/7/2009","2/11/2009","Nrs. M ","LVL",33463,"Oct-10"
"D Medical Centre","Dr K",,"Unknown GP","9/6/2008","ZZZ9311","DTaP-IPV-HepB/Hib","6w","Completed",,"10/18/2008","10/24/2008","Nrs. L ","OTH","A21CA438D","Sep-10"
"D Medical Centre","Dr K",,"Unknown GP","9/6/2008","ZZZ9311","PCV7","6w","Completed",,"10/18/2008","10/24/2008","Nrs. L ","OTH",31256,"Jun-10"
"D Medical Centre","Dr M",,"Unknown GP","9/6/2008","ZZZ9311","DTaP-IPV-HepB/Hib","5m","Completed",,"2/6/2009","2/17/2009","Nrs. L ","OTH","A21CA469G","Sep-10"
"D Medical Centre","Dr M",,"Unknown GP","9/6/2008","ZZZ9311","PCV7","5m","Completed",,"3/11/2009","2/17/2009","Nrs. L ","OTH",33463,"Oct-10"
"D Medical Centre","Dr K",,"Unknown GP","9/6/2008","LNX9320","DTaP-IPV-HepB/Hib","6w","Completed",,"10/18/2008","10/24/2008","Nrs. L ","OTH","A21CA438D","Sep-10"
"D Medical Centre","Dr K",,"Unknown GP","9/6/2008","LNX9320","PCV7","6w","Completed",,"10/18/2008","10/24/2008","Nrs. L ","OTH",31256,"Jun-10"
"D Medical Centre","Dr M",,"Unknown GP","9/6/2008","LNX9320","DTaP-IPV-HepB/Hib","3m","Completed",,"12/6/2008","12/8/2008","Nrs. M ","OTH","A21FAA426CA","Sep-10"
"D Medical Centre","Dr M",,"Unknown GP","9/6/2008","LNX9320","PCV7","3m","Completed",,"12/20/2008","12/8/2008","Nrs. M ","OTH",32500,"Oct-10"
"D Medical Centre","Dr M",,"Unknown GP","9/6/2008","XXX9999","DTaP-IPV","6w","Completed",,"8/26/2005","12/8/2008","Nrs. M ","OTH","A21FAA426CA","Sep-10"
"D Medical Centre","Dr M",,"Unknown GP","9/6/2008","XXX9999","Hib-HepB","6w","Completed",,"8/26/2005","12/8/2008","Nrs. M ","OTH",32500,"Oct-10"
"D Medical Centre","Dr M",,"Unknown GP","9/6/2008","XXX9999","DTaP-IPV","3m","Completed",,"9/26/2005","12/8/2008","Nrs. M ","OTH","A21FAA426CA","Sep-10"
"D Medical Centre","Dr M",,"Unknown GP","9/6/2008","XXX9999","Hib-HepB","3m","Completed",,"9/26/2005","12/8/2008","Nrs. M ","OTH",32500,"Oct-10"
"D Medical Centre","Dr M",,"Unknown GP","9/6/2008","XXX9999","DTaP-IPV","5m","Completed",,"11/26/2005","12/8/2008","Nrs. M ","OTH","A21FAA426CA","Sep-10"
"D Medical Centre","Dr M",,"Unknown GP","9/6/2008","XXX9999","HepB","5m","Completed",,"11/26/2005","12/8/2008","Nrs. M ","OTH",32500,"Oct-10"
"D Medical Centre","Dr M",,"Unknown GP","9/6/2008","XXX9999","Hib","15m","Completed",,"10/18/2006","12/8/2008","Nrs. M ","OTH","A21FAA426CA","Sep-10"
"D Medical Centre","Dr M",,"Unknown GP","9/6/2008","XXX9999","MMR","15m","Completed",,"10/18/2006","12/8/2008","Nrs. M ","OTH",32500,"Oct-10"

As I say, testing on Windows was giving me some false results, but I can't replicate them now on Linux. I'll continue testing all the permutations I can find.
Thank you so much for your hard work, Xyne.


#16 2009-08-23 22:52:56

Xyne
Administrator/PM
Registered: 2008-08-03
Posts: 6,965
Website

Re: Request: linux solution for comparing large sets of data

I haven't tested it, but the problem on Windows may be due to the distinction it makes between text and binary files. I've modified the file-opening lines to hopefully handle this correctly:

#!/usr/bin/env python

# Author: Xyne
# Website: xyne.archlinux.ca
# Email: replace "xyne." with "xyne@" in the website URL

import csv
from sys import argv, stdout

exit_on_unmatched = True

def main(argv):

  # a dictionary to hold the file paths to set 1, set 2 and the task table
  fpaths = {}

  # the command-line options which we will accept, in the form "opt=arg"
  options = set(['ds1', 'ds2', 't'])

  # read in the command-line options
  for arg in argv[1:]:
    split_arg = arg.split('=', 1)
    if len(split_arg) > 1:
      opt, fpath = split_arg
      if opt in options and len(fpath) > 0:
        fpaths[opt] = fpath

  # check that we have all 3 file paths and exit with a usage message if not
  for opt in options:
    if not fpaths.has_key(opt):
      print "usage:\n    %s ds1=/path/to/dataset1 ds2=/path/to/dataset2 t=/path/to/table\n" % argv[0]
      print '''where:
  "dataset1" is the csv file containing data set 1

  "dataset2" is the csv file containing data set 2

  "table" is a csv file which matches "VACCINE" entries in dataset1 to "Task"
  and "Indication" entries in dataset2. This is best illustrated with an
  example:

  Given the following 2 "VACCINE" entries in dataset1

    "Pneum, conj 6w"
    "DTaP-IPV-Hep B/Hib5m"

  and the corresponding "Task" and "Indication" entries in dataset2:

    "PCV7","6w"
    "DTaP-IPV-HepB/Hib","5m"

  the table file would be constructed as follows:

    "VACCINE","Task","Indication"
    "Pneum, conj 6w","PCV7","6w"
    "DTaP-IPV-Hep B/Hib5m","DTaP-IPV-HepB/Hib","5m"

  Note the space before "6w" in "Pneum, conj 6w": each "VACCINE" entry in the
  table must match the corresponding dataset1 entry exactly, including any
  spacing.

'''
      exit(1)


  # read the data sets and table
  dataset1 = csv.DictReader( open(fpaths['ds1'], "rb") )
  dataset2 = csv.DictReader( open(fpaths['ds2'], "rb") )
  table = csv.DictReader( open(fpaths['t'], "rb") )

  # create dictionary from table
  taskmap = {}
  for t in table:
    # the ':::' is to prevent possible overlapping concatenations
    taskmap[ t['Task'] + ':::' + t['Indication'] ] = t['VACCINE']

  # create NHI-VACCINE dictionary for dataset2 using
  # the taskmap dictionary to correlate the data sets
  ds2dict = {}
  unmatched = set()
  for r in dataset2:
    nhi = r['NHI']
    mapped_task = r['Task'] + ':::' + r['Indication']
    if taskmap.has_key(mapped_task):
      vaccine = taskmap[mapped_task]
    elif exit_on_unmatched:
      unmatched.add( (r['Task'], r['Indication']) )
      continue
    else:
      vaccine = r['Task'] + r['Indication']
    # use sets to keep track of tasks
    if ds2dict.has_key(nhi):
      ds2dict[nhi].add(vaccine)
    else:
      ds2dict[nhi] = set([vaccine])

  # exit with a message if there are unmatched tasks and indications
  if len(unmatched) > 0:
    print '''The table does not specify a matching "VACCINE" entry in data set 1 for the
following "Task" and "Indication" combinations in data set 2:
'''
    writer = csv.DictWriter(stdout, ('Task','Indication'), quoting=csv.QUOTE_ALL)
    for r in unmatched:
      writer.writerow({'Task':r[0],'Indication':r[1]})
    exit(1)

  # create a writer for printing the rows to stdout
  writer = csv.DictWriter(stdout, dataset1.fieldnames, quoting=csv.QUOTE_ALL)

  # print out the unmatched rows in dataset1
  for r in dataset1:
    nhi = r['NHI']
    if not ds2dict.has_key(nhi) or r['VACCINE'] not in ds2dict[nhi]:
      writer.writerow(r)





if __name__ == "__main__":
  main(argv)

If the issue persists, make sure that the file encoding is the same for all the files. I don't think that actually matters, but it might.
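For reference, the Python 2 csv documentation does recommend opening files in binary mode on platforms that translate newlines; in later Python versions the equivalent safeguard is newline=''. A sketch of the round trip (the file name is made up):

```python
import csv
import os
import tempfile

# Round-trip a quoted row through a CSV file. In Python 2 the csv docs say to
# use binary mode ("rb"/"wb") so Windows newline translation can't corrupt
# rows; in Python 3 the same protection comes from newline=''.
path = os.path.join(tempfile.mkdtemp(), 'data.csv')  # made-up file name

with open(path, 'w', newline='') as f:
    csv.writer(f, quoting=csv.QUOTE_ALL).writerow(['NHI', 'VACCINE'])

with open(path, newline='') as f:
    print(next(csv.reader(f)))  # ['NHI', 'VACCINE']
```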



