Hi all -
I'm an enthusiastic Linux amateur with no programming skills at all. I would like to develop some kind of tool to compare large sets of data for work. I am simply not able to learn a programming language (I work full time, have 2 pre-schoolers and am studying for a masters), although I would like to, but I have dabbled with LAMP before and guess this might work.
Here is a brief outline of what I need to do:
-I have two sets of data (this will be thousands of rows in .csv format)
- they should both hold exactly the same info, although one set formats it in a different way
- however, one set of data is missing info, for various reasons
- both sets need to be compared against each other
- a result is produced which lists the data missing
Because they format the same info in different ways, I guess something like a 'rule' or 'lookup table' will be needed to identify what should be considered a 'match'. Ideally, any solution could be accessed via a web interface or even run portably from USB.
So I'm just asking for someone to point me in the right direction on where to start - if there is a basic way of doing this in Linux I would love to use it and hopefully extend it somehow to my Windows work environment. Very grateful for advice
Just to make sure that I've understood this correctly, you have 2 datasets that might look like this (abstractly):
a,b,,d
e,,g,h,
i,,k,l
A,B,C,D
E,F,G,H
I,J,K,L
and you would like a way to compare them to find the missing elements/cells in the former and produce a result which shows what they should be according to the latter, using a rule/hash table (C->c, F->f, J->j).
If that's the case, then yes, there is a basic way to do it, but it will involve some scripting. Not much, but I understand that if you're pressed for time then it would take too long to get your feet wet enough to manage it.
Can you give an example of the different formats and the data involved?
Does the rule/lookup table change between data sets?
Should the output format tell you the row and column for each missing datum?
My Arch Linux Stuff • Forum Etiquette • Community Ethos - Arch is not for everyone
Hi Xyne - thanks for the reply...
the two data sets would look more like this:
A,B,C,D
E,F,G,H
I,J,K,L
A,b,c,D
i,J,K,l
the first set of data is assumed to be 'correct'.
both sets have the same info in the top and bottom rows - even though the info is presented in different ways, they should be 'matched'.
the middle row is totally missing in the second set, so I want to have the details of
E,F,G,H
and other missing rows, so it can be input manually.
here is an example from data set 1, which is assumed to be correct:
NHI task date outcome
ABC1234 Pneum 5m 19-Aug-08 Complete
ZZZ1234 Pneum 3m 21-Oct-08 Complete
LNX1234 DTaP 6w 20-Oct-08 Complete
and here is an example from data set 2:
NHI task indication status completed
ABC1234 PCV7 5m Completed 19/08/2008
LNX1234 DTaP 6w Completed 20/10/2008
The 'NHI' number is a unique identifier that is used consistently in both sets of data (3 alpha, 4 numeric).
as before, the top and bottom row should be 'matched' - the same info is presented in different ways (task 'Pneum 5m' is the same as task 'PCV7' and indication '5m'). However, the middle row is missing in the second set, so I need it to recognise that there is no such task recorded on 21-Oct-08 for that NHI - and then to give me the details so they can be input.
data set 1 and 2 will always be produced in this format with that structure, so any lookup table will not need to be changed once set up.
the output should ideally print the whole of the missing row from data set 1, as is.
really appreciate your help with this (sorry if the code boxes above are messed up, can't tab!)
The unique identifiers simplify things considerably if all you need is to know which lines in set 1 lack equivalents in set 2.
Here's a Python script (*edit:* with comments added to make it clear what's going on):
#!/usr/bin/env python
from sys import argv

# the file paths to set 1 and set 2, passed as arguments
path1 = argv[1]
path2 = argv[2]

# a set to keep track of which tags we've seen
nhi_set = set()

# open the file of set 2 and read each line,
# adding the unique NHI tag to the nhi_set
f = open(path2)
for line in f.readlines():
    nhi_set.add( line[0:7] )
f.close()

# open the file of set 1 and then print each line
# which contains an NHI tag not found in set 2
f = open(path1)
for line in f.readlines():
    nhi = line[0:7]
    if nhi not in nhi_set:
        print line
f.close()
You'll need Python to run that but it's completely portable. Save it to a file (e.g. "compare-nhi"), make it executable and then invoke it:
./compare-nhi /path/to/set1 /path/to/set2
The script makes the following assumptions:
1) The first 7 characters of each line are always the NHI tag, i.e. there are never any spaces at the beginning of the line.
2) The NHI tag is the only thing that matters... it doesn't care about anything else on the line.
The script will only print out those lines in set 1 that lack corresponding lines in set 2. Output for the example sets given above:
ZZZ1234 Pneum 3m 21-Oct-08 Complete
It wouldn't be difficult to modify it to output a full replacement for set 2 if there is a clear rule for filling in each field. Date translation from set 1's format to set 2's format is straight-forward and I see that the indication in set 2 is pulled from the task in set 1, but I don't see how "Pneum" in set 1 translates to "PCV7" in set 2.
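For instance, the date translation could be done with the standard datetime module. A minimal sketch, assuming set 1 uses dates like "19-Aug-08" and set 2 uses "19/08/2008" as in the samples above:

```python
from datetime import datetime

def convert_date(d):
    # parse a set 1 style date, e.g. "19-Aug-08", and re-emit
    # it in the set 2 style seen above, e.g. "19/08/2008"
    return datetime.strptime(d, "%d-%b-%y").strftime("%d/%m/%Y")

print(convert_date("19-Aug-08"))  # → 19/08/2008
```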
There are of course several ways to make the script a bit more robust/fault-tolerant, such as using regular expressions to grab the NHI and ignoring whitespace, giving more info such as line numbers of missing lines, etc.
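For instance, a sketch of the regex approach, assuming the NHI is always 3 letters followed by 4 digits but not necessarily at the start of the line:

```python
import re

# assumed NHI shape: 3 letters followed by 4 digits, anywhere on the line
NHI_RE = re.compile(r'\b[A-Za-z]{3}[0-9]{4}\b')

def find_nhi(line):
    # return the first NHI-like token found on the line, or None
    m = NHI_RE.search(line)
    return m.group(0) if m else None

print(find_nhi('   ABC1234 Pneum 5m 19-Aug-08 Complete'))  # → ABC1234
```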
It would only take a few minutes to rewrite it in Perl if that would be more practical.
Last edited by Xyne (2009-07-30 09:39:22)
Hi again -
This is great!
I've been playing around with this, trying to understand as much as I can - this is a great way to learn a little bit, hopefully to learn more in future...
It definitely works. I tested it on a larger set of data and it produced the correct result - I only need it to return the missing line(s) from set 1.
The formatting of the task is a bit weird - PCV7 is just an acronym for 'pneumococcal conjugate vaccine (7-valent)' so 'Pneum' is a lazy way of recording that (set 1 is produced as an output of a third party application which is, err, pretty awful).
I did forget to mention one important thing - ideally this script would also search for missing tasks recorded for an NHI which is already present in both sets of data. 'NHI' is a unique identifier for a person, and different tasks are recorded and stored for each person. A person would normally have 9 tasks recorded over a 15-month period. Whilst this script is perfect for identifying NHIs (people) totally missing from data set 2, it wouldn't detect if data set 1 had (for example) all 9 tasks recorded but data set 2 only had 6 for the same NHI. For example, data set 1 has 4 tasks recorded for 2 NHIs:
ABC1234 DTaP 6w
ABC1234 Pneum 6w
XXX1234 DTaP 3m
XXX1234 Pneum 3m
But data set 2 has only 1 task for 1 of these NHIs:
ABC1234 DTaP 6w
so, ideally the result should be the last 3 rows of data set 1.
I realise this makes it much more complex!
Matching the tasks up might be tricky since they're formatted in different ways across different fields. As far as I can see there are 2 possibilities:
1. to find a unique match per row somehow - there are several more columns in the real life data which could be used. I can provide some examples if this might be a possibility. Or
2. to count the number of times an NHI occurs in data set 1, compare it with data set 2, and if there is a discrepancy return all rows with that NHI from data set 1. This would be easier, I guess, but would return a lot of 'false' results which would require more manual checking.
The other potential issue is that the data in set 2 does not have the NHI at the beginning of the line - it's in the 6th column of the .csv data and its absolute position will always vary because of the fields that precede it. This is not a big deal for me because I can just move that column to the beginning, but I wonder if there is some way to search for it via some alphanumeric wildcard combination? There is no other data in either set that will always be in the format 'ABC1234'.
I'm sorry this probably makes it a lot more complex! very grateful for your help.
What exact format is it in? You mentioned it being in CSV at the start, though the example sets above are not CSV.
If it's CSV, that makes things much easier. Python has a usable CSV module that will simplify the loading a great deal.
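For instance, a minimal sketch of what the csv module's DictReader gives you (the sample header and values here are only illustrative):

```python
import csv
import io

# a tiny stand-in for a csv data set (headers and values are illustrative)
sample = '"Clinic","NHI","Task","Indication"\n' \
         '"D Medical Centre","ABC1234","PCV7","6w"\n'

# DictReader keys each row by the header line, so fields can be
# looked up by column name regardless of their position
rows = list(csv.DictReader(io.StringIO(sample)))
print(rows[0]['NHI'])  # → ABC1234
```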
Here's an updated version which should find all tasks in data set 1 which lack corresponding tasks in data set 2:
#!/usr/bin/env python
import re
from sys import argv

def main(argv):
    # a dictionary to hold the file paths to set 1, set 2 and the task table
    fpaths = {}
    # the command-line options which we will accept, in the form "opt=arg"
    options = set(['set1', 'set2', 'table'])
    # read in the command-line options
    for arg in argv[1:]:
        split_arg = arg.split('=', 1)
        if len(split_arg) > 1:
            opt, fpath = split_arg
            if opt in options and len(fpath) > 0:
                fpaths[opt] = fpath
    # check that we have all 3 file paths and exit with a usage message if not
    for opt in options:
        if not fpaths.has_key(opt):
            print "usage:\n  %s set1=/path/to/set1 set2=/path/to/set2 table=/path/to/table\n" % argv[0]
            print "where:\n  set1 is the file containing data set 1\n"
            print "  set2 is the file containing data set 2\n"
            print "  table is a file containing a task conversion table to map tasks"
            print "  in set 1 to tasks in set 2. Each conversion pair should be on a"
            print "  single line and separated with a single space, e.g.\n"
            print "  Pneum PCV7\n"
            exit(1)
    # build the task conversion dictionary from the table file
    tasks = {}
    try:
        f = open(fpaths['table'])
    except IOError:
        print "error: unable to open", fpaths['table']
        exit(1)
    else:
        for line in f.readlines():
            columns = line.strip().split(' ', 1)
            if len(columns) > 1:
                tasks[columns[0]] = columns[1]
        f.close()
    # use a dictionary to store NHIs and corresponding tasks from set 2
    set2_data = {}
    try:
        f = open(fpaths['set2'])
    except IOError:
        print "error: unable to open", fpaths['set2']
        exit(1)
    else:
        for line in f.readlines():
            line = line.strip()
            m = re.match(r'([A-Za-z]{3}[0-9]{4})\s+(\S+)', line)
            if m:
                nhi = m.group(1)
                task = m.group(2)
                if set2_data.has_key(nhi):
                    set2_data[nhi].append(task)
                else:
                    set2_data[nhi] = [task]
            else:
                pass
                #print "warning: ignoring the following line in data set 2"
                #print "  ", line
        f.close()
    # now print out all lines in data set 1 which lack corresponding lines
    # in data set 2, using the conversion table to match tasks
    try:
        f = open(fpaths['set1'])
    except IOError:
        print "error: unable to open", fpaths['set1']
        exit(1)
    else:
        for line in f.readlines():
            line = line.strip()
            m = re.match(r'([A-Za-z]{3}[0-9]{4})\s+(\S+)', line)
            if m:
                nhi = m.group(1)
                task = m.group(2)
                if tasks.has_key(task):
                    task = tasks[task]
                if not set2_data.has_key(nhi) or not task in set2_data[nhi]:
                    print line
            else:
                pass
                #print "warning: ignoring the following line in data set 1"
                #print "  ", line
        f.close()

if __name__ == "__main__":
    main(argv)
I'm calling it "task_compare" for now but the name makes no difference.
In order for this to work, it requires a file in addition to the 2 data sets. The additional file is a conversion table for mapping tasks in set 1 to tasks in set 2 and takes the following format:
Pneum PCV7
Each row contains a single pair separated by a single space.
I've also switched to using named options to handle 3 different files, so the script should be invoked with
./task_compare set1=/path/to/set1 set2=/path/to/set2 table=/path/to/table
The order of the arguments is irrelevant.
I've received your email and will reply to that shortly.
hi xyne & all -
thanks so much for this, and sorry not to reply sooner. My laptop screen died this week so I've been hastily getting a new machine together, a drama in itself
I've tried this a few times, and played around with different source data, but I get no results - no errors, but it does not return anything. I wonder if I am doing this correctly.
iphitus made the good point that I have not provided data in exactly the format it will be produced - I was trying to be concise but realise it may have just been unhelpful. So, I have copied two sets of real life (anonymised) data exactly as they will be produced and given to me:
dataset1 is from the clinic and assumed to be correct-
"NHI","VACCINE","IMMDATE","OC","OUTCOME","BATCH","EXPIRY","DILUENTBATCH","DILUENTEXPIRY","PROVIDER","VACCINATOR","SITECODE","NIROPTDESCR"
"ABC1234","DTaP-IPV-Hep B/Hib6w",20-Oct-08,"G","Given","A21CA438D",15-Sep-10,,,"ABM","LS","RT","Opt on"
"ABC1234","Pneum, conj 6w",20-Oct-08,"G","Given",31256,9-Jun-10,,,"ABM","LS","LT","Opt on"
"ABC1234","DTaP-IPV-Hep B/Hib3m",1-Dec-08,"G","Given","A21FAA426CA",15-Sep-10,,,"ABM","MJ","RT","Opt on"
"ABC1234","Pneum, conj 3m",1-Dec-08,"G","Given",32500,15-Oct-10,,,"ABM","MJ","LT","Opt on"
"ABC1234","DTaP-IPV-Hep B/Hib5m",11-Feb-09,"G","Given","A21CA469G",15-Sep-10,,,"ABM","MART","RVL","Opt on"
"ABC1234","Pneum, conj 5m",11-Feb-09,"G","Given",33463,15-Oct-10,,,"ABM","MART","LVL","Opt on"
"ZZZ9311","DTaP-IPV-Hep B/Hib6w",24-Oct-08,"G","Given","A21CA438D",15-Sep-10,,,"JK","LS","LT","Opt on"
"ZZZ9311","Pneum, conj 6w",24-Oct-08,"G","Given",31256,9-Jun-10,,,"JK","LS","RT","Opt on"
"ZZZ9311","DTaP-IPV-Hep B/Hib3m",8-Dec-08,"G","Given","A21FAA426CA",15-Sep-10,,,"ABM","MJ","RT","Opt on"
"ZZZ9311","Pneum, conj 3m",8-Dec-08,"G","Given",32500,15-Oct-10,,,"ABM","MJ","LT","Opt on"
"ZZZ9311","DTaP-IPV-Hep B/Hib5m",17-Feb-09,"G","Given","A21CA469G",15-Sep-10,,,"ABM","LS","LT","Opt on"
"ZZZ9311","Pneum, conj 5m",17-Feb-09,"G","Given",33463,15-Oct-10,,,"ABM","LS","RT","Opt on"
"LNX9320","DTaP-IPV-Hep B/Hib6w",24-Oct-08,"G","Given","A21CA438D",15-Sep-10,,,"JK","LS","LT","Opt on"
"LNX9320","Pneum, conj 6w",24-Oct-08,"G","Given",31256,9-Jun-10,,,"JK","LS","RT","Opt on"
"LNX9320","DTaP-IPV-Hep B/Hib3m",8-Dec-08,"G","Given","A21FAA426CA",15-Sep-10,,,"ABM","MJ","RT","Opt on"
"LNX9320","Pneum, conj 3m",8-Dec-08,"G","Given",32500,15-Oct-10,,,"ABM","MJ","LT","Opt on"
"LNX9320","DTaP-IPV-Hep B/Hib5m",17-Feb-09,"G","Given","A21CA469G",15-Sep-10,,,"ABM","MJ","LT","Opt on"
"LNX9320","Pneum, conj 5m",17-Feb-09,"G","Given",33463,15-Oct-10,,,"ABM","MJ","RT","Opt on"
"SVG7698","Pneum, conj 5m",02/17/09,"G","Given",33463,10/15/10,,,"ABM","MJ","RT","Opt on"
In this example, the task and indication are combined (ie. DTaP-IPV-Hep B/Hib5m) where in dataset2 they are separated. The other main difference is that the 3rd column (date given) is formatted 17-Feb-09, rather than 2/17/2009 as in dataset2 column 12.
Here is dataset2, which represents the data I currently have in my database for this clinic:
"Clinic","Responsible Provider","Name","Nominated Provider","DOB","NHI","Task","Indication","Status","Code","Due","Completed","Vaccinator","Site","Batch No.","Expiry Date"
"D Medical Centre","Dr M",,"Unknown GP",9/5/2008,"ABC1234","DTaP-IPV-HepB/Hib","6w","Completed",,10/17/2008,10/20/2008,"Nrs. L ","OTH","A21CA438D",Sep-10
"D Medical Centre","Dr M",,"Unknown GP",9/5/2008,"ABC1234","PCV7","6w","Completed",,10/17/2008,10/20/2008,"Nrs. L ","OTH",31256,Jun-10
"D Medical Centre","Dr M",,"Unknown GP",9/5/2008,"ABC1234","DTaP-IPV-HepB/Hib","3m","Completed",,12/5/2008,12/1/2008,"Nrs. M ","OTH","A21FAA426CA",Sep-10
"D Medical Centre","Dr M",,"Unknown GP",9/5/2008,"ABC1234","PCV7","3m","Completed",,12/17/2008,12/1/2008,"Nrs. M ","OTH",32500,Oct-10
"D Medical Centre","Dr M",,"Unknown GP",9/5/2008,"ABC1234","DTaP-IPV-HepB/Hib","5m","Completed",,2/5/2009,2/11/2009,"Nrs. M ","RVL","A21CA469G",Sep-10
"D Medical Centre","Dr M",,"Unknown GP",9/5/2008,"ABC1234","PCV7","5m","Completed",,3/7/2009,2/11/2009,"Nrs. M ","LVL",33463,Oct-10
"D Medical Centre","Dr K",,"Unknown GP",9/6/2008,"ZZZ9311","DTaP-IPV-HepB/Hib","6w","Completed",,10/18/2008,10/24/2008,"Nrs. L ","OTH","A21CA438D",Sep-10
"D Medical Centre","Dr K",,"Unknown GP",9/6/2008,"ZZZ9311","PCV7","6w","Completed",,10/18/2008,10/24/2008,"Nrs. L ","OTH",31256,Jun-10
"D Medical Centre","Dr M",,"Unknown GP",9/6/2008,"ZZZ9311","DTaP-IPV-HepB/Hib","5m","Completed",,2/6/2009,2/17/2009,"Nrs. L ","OTH","A21CA469G",Sep-10
"D Medical Centre","Dr M",,"Unknown GP",9/6/2008,"ZZZ9311","PCV7","5m","Completed",,3/11/2009,2/17/2009,"Nrs. L ","OTH",33463,Oct-10
"D Medical Centre","Dr K",,"Unknown GP",9/6/2008,"LNX9320","DTaP-IPV-HepB/Hib","6w","Completed",,10/18/2008,10/24/2008,"Nrs. L ","OTH","A21CA438D",Sep-10
"D Medical Centre","Dr K",,"Unknown GP",9/6/2008,"LNX9320","PCV7","6w","Completed",,10/18/2008,10/24/2008,"Nrs. L ","OTH",31256,Jun-10
"D Medical Centre","Dr M",,"Unknown GP",9/6/2008,"LNX9320","DTaP-IPV-HepB/Hib","3m","Completed",,12/6/2008,12/8/2008,"Nrs. M ","OTH","A21FAA426CA",Sep-10
"D Medical Centre","Dr M",,"Unknown GP",9/6/2008,"LNX9320","PCV7","3m","Completed",,12/20/2008,12/8/2008,"Nrs. M ","OTH",32500,Oct-10
So we can see that data for NHI ABC1234 matches exactly. The "DTaP-IPV-Hep B/Hib6w" is the same as "DTaP-IPV-HepB/Hib" (task),"6w" (indication). The "Pneum, conj 6w" is the same as "PCV7" (task),"6w" (indication) - and so on for the subsequent doses 3m and 5m.
However, we have 2 tasks missing for NHI ZZZ9311, 2 missing for NHI LNX9320, and no record at all of NHI SVG7698. Therefore the output required from a comparison would be:
"ZZZ9311","DTaP-IPV-Hep B/Hib3m",8-Dec-08,"G","Given","A21FAA426CA",15-Sep-10,,,"ABM","MJ","RT","Opt on"
"ZZZ9311","Pneum, conj 3m",8-Dec-08,"G","Given",32500,15-Oct-10,,,"ABM","MJ","LT","Opt on"
"LNX9320","DTaP-IPV-Hep B/Hib5m",17-Feb-09,"G","Given","A21CA469G",15-Sep-10,,,"ABM","MJ","LT","Opt on"
"LNX9320","Pneum, conj 5m",17-Feb-09,"G","Given",33463,15-Oct-10,,,"ABM","MJ","RT","Opt on"
"SVG7698","Pneum, conj 5m",02/17/09,"G","Given",33463,10/15/10,,,"ABM","MJ","RT","Opt on"
This is exactly how it will be used (only for hundreds of thousands of rows!), so I'm really sorry I was not clearer beforehand. I would be very grateful for any advice or a point in the right direction about how this can be achieved. I must say that Python is (relatively) readable even for a complete beginner like myself; I even delved into some online tutorials today and am keen to learn more. It's a big task though with no programming experience.
let me know if I can clarify anything, and thanks again for your help.
Last edited by karag (2009-08-10 08:54:31)
I wish my computer science teachers had given assignments like this one. It may be complicated, but at least it's well-defined.
Python is a good choice for this project because it's not only easy to read and beginner-friendly, it has a great CSV module that will automate away much of the tedium involved in parsing CSV files. Perl has its own strengths, but I personally advise you to use Python. Look at http://docs.python.org/library/csv.html, especially the part about DictReader -- I think it will help in your situation.
I'm thinking your code will be something along the lines of this:
1. Read in dataset1 and dataset2.
2. Write a method that will compare an entry from dataset1 to one from dataset2, and return True if they are identical (False otherwise).
3. For every record in dataset1,
a. search for a corresponding record in dataset2
b. if a corresponding record is found in dataset2, delete it;
c. if a corresponding record is not found, print the record.
Step 2 will probably be the most challenging. It will help if you establish exactly what makes each record unique. It seems that the combination of NHI, vaccination, and date of administration is sufficient to uniquely identify a record, correct? Then you could write a method that converts those three fields from the formats they have in dataset2 to those in dataset1, and compare them directly.
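Those steps could be sketched roughly as follows. To be clear, the field names, date formats and the small task map below are assumptions drawn from the samples earlier in this thread, not a definitive implementation:

```python
from datetime import datetime

# hypothetical map from dataset2 task names back to dataset1 vaccine prefixes
TASKMAP = {'PCV7': 'Pneum, conj ', 'DTaP-IPV-HepB/Hib': 'DTaP-IPV-Hep B/Hib'}

def key1(row):
    # uniqueness key for a dataset1 record: NHI, vaccine and date given
    return (row['NHI'], row['VACCINE'],
            datetime.strptime(row['IMMDATE'], '%d-%b-%y').date())

def key2(row):
    # rebuild the same key from a dataset2 record: map the Task back to
    # the dataset1 name, append the Indication, and parse the date
    vaccine = TASKMAP.get(row['Task'], row['Task']) + row['Indication']
    return (row['NHI'], vaccine,
            datetime.strptime(row['Completed'], '%m/%d/%Y').date())

def missing(set1_rows, set2_rows):
    # step 3: every dataset1 record with no counterpart in dataset2
    seen = set(key2(r) for r in set2_rows)
    return [r for r in set1_rows if key1(r) not in seen]
```

Comparing parsed date objects rather than the raw strings sidesteps the two different date notations entirely.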
Last edited by Trent (2009-08-10 15:07:09)
@Trent
I don't think this is a homework assignment. karag has sent me an email explaining the situation in greater detail and it appears to be a simple case of a non-programmer requiring a tool to get a job done. As he (she?) has already pointed out, he has "no programming skills at all". While Python might be relatively easy to learn (compared to other languages), not everyone can take the time to sit down and figure out how to code something. If you need to get to the airport on Friday and don't yet have a license, asking someone who does for a lift makes more sense than trying to get your own by then.
@karag
The full formatting makes a huge difference. Here's an updated version of the script:
#!/usr/bin/env python
# Author: Xyne
# Website: xyne.archlinux.ca
# Email: replace "xyne." with "xyne@" in the website URL

import csv
import re
from sys import argv, stdout

def main(argv):
    # a dictionary to hold the file paths to set 1, set 2 and the task table
    fpaths = {}
    # the command-line options which we will accept, in the form "opt=arg"
    options = set(['ds1', 'ds2', 't'])
    # read in the command-line options
    for arg in argv[1:]:
        split_arg = arg.split('=', 1)
        if len(split_arg) > 1:
            opt, fpath = split_arg
            if opt in options and len(fpath) > 0:
                fpaths[opt] = fpath
    # check that we have all 3 file paths and exit with a usage message if not
    for opt in options:
        if not fpaths.has_key(opt):
            print "usage:\n  %s ds1=/path/to/dataset1 ds2=/path/to/dataset2 t=/path/to/table\n" % argv[0]
            print '''where:
  "ds1" is the csv file containing data set 1
  "ds2" is the csv file containing data set 2
  "t" is a csv file which maps "VACCINE" entries in dataset1, minus the
  "Indication", to "Task" entries in dataset2. This is best illustrated with
  an example.

  Given the following 2 "VACCINE" entries in dataset1:

    "Pneum, conj 6w"
    "DTaP-IPV-Hep B/Hib5m"

  and the corresponding "Task" and "Indication" entries in dataset2:

    "PCV7","6w"
    "DTaP-IPV-HepB/Hib","5m"

  the table file would be constructed as follows:

    "VACCINE","Task"
    "Pneum, conj ","PCV7"
    "DTaP-IPV-Hep B/Hib","DTaP-IPV-HepB/Hib"

  Note the final space in "Pneum, conj ". This is so that when the "VACCINE"
  entry in the table ("Pneum, conj ") is combined with the "Indication" of
  dataset2 ("6w"), we get the "VACCINE" entry of dataset1 ("Pneum, conj 6w").
'''
            exit(1)
    # read the data sets and table
    dataset1 = csv.DictReader( open(fpaths['ds1']) )
    dataset2 = csv.DictReader( open(fpaths['ds2']) )
    table = csv.DictReader( open(fpaths['t']) )
    # create dictionary from table
    taskmap = {}
    for t in table:
        taskmap[ t['Task'] ] = t['VACCINE']
    # create NHI-VACCINE dictionary for dataset2 using
    # the taskmap dictionary to correlate the data sets
    ds2dict = {}
    for r in dataset2:
        nhi = r['NHI']
        if taskmap.has_key(r['Task']):
            vaccine = taskmap[ r['Task'] ] + r['Indication']
        else:
            vaccine = r['Task'] + r['Indication']
        # use sets to keep track of tasks
        if ds2dict.has_key(nhi):
            ds2dict[nhi].add(vaccine)
        else:
            ds2dict[nhi] = set([vaccine])
    # create a writer for printing the rows to stdout
    writer = csv.DictWriter(stdout, dataset1.fieldnames, quoting=csv.QUOTE_ALL)
    # print out the unmatched rows in dataset1
    for r in dataset1:
        nhi = r['NHI']
        if not ds2dict.has_key(nhi) or r['VACCINE'] not in ds2dict[nhi]:
            writer.writerow(r)

if __name__ == "__main__":
    main(argv)
This requires a csv file which maps "VACCINE" entries in data set 1 to "Task" entries in data set 2. Run the script without any arguments for further explanation. Here's a table for the example data sets given in your post:
"VACCINE","Task"
"Pneum, conj ","PCV7"
"DTaP-IPV-Hep B/Hib","DTaP-IPV-HepB/Hib"
and here's the output of running the script:
"ZZZ9311","DTaP-IPV-Hep B/Hib3m","8-Dec-08","G","Given","A21FAA426CA","15-Sep-10","","","ABM","MJ","RT","Opt on"
"ZZZ9311","Pneum, conj 3m","8-Dec-08","G","Given","32500","15-Oct-10","","","ABM","MJ","LT","Opt on"
"LNX9320","DTaP-IPV-Hep B/Hib5m","17-Feb-09","G","Given","A21CA469G","15-Sep-10","","","ABM","MJ","LT","Opt on"
"LNX9320","Pneum, conj 5m","17-Feb-09","G","Given","33463","15-Oct-10","","","ABM","MJ","RT","Opt on"
"SVG7698","Pneum, conj 5m","02/17/09","G","Given","33463","10/15/10","","","ABM","MJ","RT","Opt on"
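To make the matching rule concrete: the script rebuilds each dataset1 "VACCINE" string by looking the dataset2 "Task" up in the table and appending the "Indication". A toy sketch of that lookup, using the example table above:

```python
# the example table above, as a dictionary keyed by dataset2's "Task"
taskmap = {
    'PCV7': 'Pneum, conj ',                    # note the trailing space
    'DTaP-IPV-HepB/Hib': 'DTaP-IPV-Hep B/Hib',
}

def to_vaccine(task, indication):
    # map a dataset2 Task back to its dataset1 prefix (falling back to
    # the Task itself), then append the Indication to rebuild "VACCINE"
    return taskmap.get(task, task) + indication

print(to_vaccine('PCV7', '6w'))               # → Pneum, conj 6w
print(to_vaccine('DTaP-IPV-HepB/Hib', '5m'))  # → DTaP-IPV-Hep B/Hib5m
```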
I didn't mean to imply that I thought it was a homework assignment, although I see that my post did give that impression (sorry about that). I was just saying that a true-to-life example like this would have been a lot more educational than "center-align an equilateral triangle with side length N on an 80-column screen".
WOW!
this is amazing
it works perfectly. Thank you so much!
I'm playing around with the table file and mapping a few other tasks that crop up, have just extended it to include MMR and other older vaccines.
Trent's point about using real examples is a good one, and actually it seems to work both ways - as a non-programmer I have come across a real life situation which can most efficiently be fixed with some code, so (all things being equal) there is an incentive for me to learn how to do this. Most people who are not 'into' computers probably wouldn't even consider or know how to ask about this possibility, so in that respect I'm lucky enough to be part of a vibrant and positive linux community.
Xyne, once I've finished all the mapping I'll be testing it on much larger data sets so will let you know how it goes.
Thanks again
Hi again -
there were a lot more combinations of 'task' and 'indication' than I thought
the sample data posted previously works fine - thank you so much for your help. But there are other task/indication combinations that do not match as things currently are.
I could post all the possible permutations for info, but thought I'd just ask:
(and this is based on my limited understanding of the mechanics of this script)
As things currently are, this script (via table.csv) takes the 'Indication' from dataset 2, and appends it to the main portion of the 'VACCINE' from dataset 1 to find a match.
Is it possible instead to have a more explicit lookup table which definitively lists what should be matched?
For example, instead of table.csv:
"VACCINE","Task"
"Pneum, conj ","PCV7"
"DTaP-IPV-Hep B/Hib","DTaP-IPV-HepB/Hib"
is this feasible:
"VACCINE","Task","Indication"
"Pneum, conj 6w","PCV7","6w"
"DTaP-IPV-Hep B/Hib6w","DTaP-IPV-HepB/Hib","6w"
"Pneum, conj 3m","PCV7","3m"
"DTaP-IPV-Hep B/Hib3m","DTaP-IPV-HepB/Hib","3m"
"Pneum, conj 5m","PCV7","5m"
"DTaP-IPV-Hep B/Hib5m","DTaP-IPV-HepB/Hib","5m"
I know this would have an impact on the rest of the code, but could make it much more extensible to apply to other vaccinations and to keep up to date with statutory changes.
So I'm guessing that the section
# create dictionary from table
taskmap = {}
for t in table:
    taskmap[ t['Task'] ] = t['VACCINE']
would become something like
# create dictionary from table
taskmap = {}
for t in table:
    taskmap[ t['Task'] + t['Indication'] ] = t['VACCINE']
Man, I'd love to understand this stuff
Can you give an example of a task/indication combination that fails with the previous code?
Here's an updated version all the same:
#!/usr/bin/env python
# Author: Xyne
# Website: xyne.archlinux.ca
# Email: replace "xyne." with "xyne@" in the website URL

import csv
import re
from sys import argv, stdout

exit_on_unmatched = True

def main(argv):
    # a dictionary to hold the file paths to set 1, set 2 and the task table
    fpaths = {}
    # the command-line options which we will accept, in the form "opt=arg"
    options = set(['ds1', 'ds2', 't'])
    # read in the command-line options
    for arg in argv[1:]:
        split_arg = arg.split('=', 1)
        if len(split_arg) > 1:
            opt, fpath = split_arg
            if opt in options and len(fpath) > 0:
                fpaths[opt] = fpath
    # check that we have all 3 file paths and exit with a usage message if not
    for opt in options:
        if not fpaths.has_key(opt):
            print "usage:\n  %s ds1=/path/to/dataset1 ds2=/path/to/dataset2 t=/path/to/table\n" % argv[0]
            print '''where:
  "ds1" is the csv file containing data set 1
  "ds2" is the csv file containing data set 2
  "t" is a csv file which matches "VACCINE" entries in dataset1 to "Task"
  and "Indication" entries in dataset2. This is best illustrated with an
  example.

  Given the following 2 "VACCINE" entries in dataset1:

    "Pneum, conj 6w"
    "DTaP-IPV-Hep B/Hib5m"

  and the corresponding "Task" and "Indication" entries in dataset2:

    "PCV7","6w"
    "DTaP-IPV-HepB/Hib","5m"

  the table file would be constructed as follows:

    "VACCINE","Task","Indication"
    "Pneum, conj 6w","PCV7","6w"
    "DTaP-IPV-Hep B/Hib5m","DTaP-IPV-HepB/Hib","5m"

  Note that the table must list the full "VACCINE" string for every
  "Task"/"Indication" combination.
'''
            exit(1)
    # read the data sets and table
    dataset1 = csv.DictReader( open(fpaths['ds1']) )
    dataset2 = csv.DictReader( open(fpaths['ds2']) )
    table = csv.DictReader( open(fpaths['t']) )
    # create dictionary from table
    taskmap = {}
    for t in table:
        # the ':::' is to prevent possible overlapping concatenations
        taskmap[ t['Task'] + ':::' + t['Indication'] ] = t['VACCINE']
    # create NHI-VACCINE dictionary for dataset2 using
    # the taskmap dictionary to correlate the data sets
    ds2dict = {}
    unmatched = set()
    for r in dataset2:
        nhi = r['NHI']
        mapped_task = r['Task'] + ':::' + r['Indication']
        if taskmap.has_key(mapped_task):
            vaccine = taskmap[mapped_task]
        elif exit_on_unmatched:
            unmatched.add( (r['Task'], r['Indication']) )
            continue
        else:
            vaccine = r['Task'] + r['Indication']
        # use sets to keep track of tasks
        if ds2dict.has_key(nhi):
            ds2dict[nhi].add(vaccine)
        else:
            ds2dict[nhi] = set([vaccine])
    # exit with a message if there are unmatched tasks and indications
    if len(unmatched) > 0:
        print '''The table does not specify a matching "VACCINE" entry in data set 1 for the
following "Task" and "Indication" combinations in data set 2:
'''
        writer = csv.DictWriter(stdout, ('Task', 'Indication'), quoting=csv.QUOTE_ALL)
        for r in unmatched:
            writer.writerow({'Task': r[0], 'Indication': r[1]})
        exit(1)
    # create a writer for printing the rows to stdout
    writer = csv.DictWriter(stdout, dataset1.fieldnames, quoting=csv.QUOTE_ALL)
    # print out the unmatched rows in dataset1
    for r in dataset1:
        nhi = r['NHI']
        if not ds2dict.has_key(nhi) or r['VACCINE'] not in ds2dict[nhi]:
            writer.writerow(r)

if __name__ == "__main__":
    main(argv)
This now requires a table which specifies all matches. The "exit_on_unmatched" variable at the top of the script can be set to False to fall back to "VACCINE" = "Task" + "Indication" when no matching row can be found in the table. Using the example data given above, the table file is:
"VACCINE","Task","Indication"
"DTaP-IPV-Hep B/Hib3m","DTaP-IPV-HepB/Hib","3m"
"DTaP-IPV-Hep B/Hib5m","DTaP-IPV-HepB/Hib","5m"
"DTaP-IPV-Hep B/Hib6w","DTaP-IPV-HepB/Hib","6w"
"Pneum, conj 3m","PCV7","3m"
"Pneum, conj 5m","PCV7","5m"
"Pneum, conj 6w","PCV7","6w"
The output is the same as in the previous post.
If the table is not complete, you will get a list of entries which need to be added:
The table does not specify a matching "VACCINE" entry in data set 1 for the
following "Task" and "Indication" combinations in data set 2:
"DTaP-IPV-HepB/Hib","6w"
"DTaP-IPV-HepB/Hib","3m"
"PCV7","6w"
"DTaP-IPV-HepB/Hib","5m"
"PCV7","3m"
"PCV7","5m"
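Once the table is complete, the comparison itself is just a per-NHI set-membership test: a dataset1 row is reported when its NHI never appears in dataset2, or when its "VACCINE" value is absent from that NHI's set of mapped vaccines. A stripped-down sketch (Python 3, with hypothetical in-memory data standing in for the CSV files):

```python
# Per-NHI vaccine sets, as the script builds them from dataset2 via the table.
ds2dict = {
    'ABC1234': {'DTaP-IPV-Hep B/Hib6w', 'Pneum, conj 6w'},
}

# dataset1 reduced to (NHI, VACCINE) pairs for the sketch.
dataset1 = [
    ('ABC1234', 'DTaP-IPV-Hep B/Hib6w'),  # matched in dataset2 -> not reported
    ('ABC1234', 'Pneum, conj 3m'),        # missing from dataset2 -> reported
    ('SVG7698', 'Pneum, conj 5m'),        # NHI absent entirely  -> reported
]

missing = [(nhi, vac) for nhi, vac in dataset1
           if nhi not in ds2dict or vac not in ds2dict[nhi]]
print(missing)
# -> [('ABC1234', 'Pneum, conj 3m'), ('SVG7698', 'Pneum, conj 5m')]
```

Using sets makes each membership test constant-time, so this stays fast even with thousands of rows.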
My Arch Linux Stuff • Forum Etiquette • Community Ethos - Arch is not for everyone
Sorry to take so long to reply, and thanks for the new script! It sure does work on the data posted above. I've been testing it on PortablePython on Windows at work, where I was getting some incorrect matching. However, I've just spent time testing it on my linux box and it works perfectly - even with the new combinations added.
Here are the extra combinations as they would be represented in the 'table.csv' file:
"DTaP-IPV-1 6w","DTaP-IPV","6w"
"Hib-HepB-1 6w","Hib-HepB","6w"
"DTaP-IPV-2 3m","DTaP-IPV","3m"
"Hib-HepB-2 3m","Hib-HepB","3m"
"DTaP-IPV-3 5m","DTaP-IPV","5m"
"HepB, paed-3 5m","HepB","5m"
"MMR-1 15m","MMR","15m"
"Hib 15m","Hib","15m"
"MMR-2 4y","MMR","4y"
What was happening was that, as I added the new combinations and tested the matching one at a time, at some point I would suddenly get ALL of the 'DTaP-IPV-Hep B/Hib' lines returned along with the correct results. I can't seem to replicate this now on my linux box, so I'll continue editing and testing.
Adding the new tasks/indications under a new NHI to both datasets, the updated dataset files would be:
dataset1:
"NHI","VACCINE","IMMDATE","OC","OUTCOME","BATCH","EXPIRY","DILUENTBATCH","DILUENTEXPIRY","PROVIDER","VACCINATOR","SITECODE","NIROPTDESCR"
"ABC1234","DTaP-IPV-Hep B/Hib6w","20-Oct-08","G","Given","A21CA438D","15-Sep-10",,,"ABM","LS","RT","Opt on"
"ABC1234","Pneum, conj 6w","20-Oct-08","G","Given",31256,"9-Jun-10",,,"ABM","LS","LT","Opt on"
"ABC1234","DTaP-IPV-Hep B/Hib3m","1-Dec-08","G","Given","A21FAA426CA","15-Sep-10",,,"ABM","MJ","RT","Opt on"
"ABC1234","Pneum, conj 3m","1-Dec-08","G","Given",32500,"15-Oct-10",,,"ABM","MJ","LT","Opt on"
"ABC1234","DTaP-IPV-Hep B/Hib5m","11-Feb-09","G","Given","A21CA469G","15-Sep-10",,,"ABM","MART","RVL","Opt on"
"ABC1234","Pneum, conj 5m","11-Feb-09","G","Given",33463,"15-Oct-10",,,"ABM","MART","LVL","Opt on"
"ZZZ9311","DTaP-IPV-Hep B/Hib6w","24-Oct-08","G","Given","A21CA438D","15-Sep-10",,,"JK","LS","LT","Opt on"
"ZZZ9311","Pneum, conj 6w","24-Oct-08","G","Given",31256,"9-Jun-10",,,"JK","LS","RT","Opt on"
"ZZZ9311","DTaP-IPV-Hep B/Hib3m","8-Dec-08","G","Given","A21FAA426CA","15-Sep-10",,,"ABM","MJ","RT","Opt on"
"ZZZ9311","Pneum, conj 3m","8-Dec-08","G","Given",32500,"15-Oct-10",,,"ABM","MJ","LT","Opt on"
"ZZZ9311","DTaP-IPV-Hep B/Hib5m","17-Feb-09","G","Given","A21CA469G","15-Sep-10",,,"ABM","LS","LT","Opt on"
"ZZZ9311","Pneum, conj 5m","17-Feb-09","G","Given",33463,"15-Oct-10",,,"ABM","LS","RT","Opt on"
"LNX9320","DTaP-IPV-Hep B/Hib6w","24-Oct-08","G","Given","A21CA438D","15-Sep-10",,,"JK","LS","LT","Opt on"
"LNX9320","Pneum, conj 6w","24-Oct-08","G","Given",31256,"9-Jun-10",,,"JK","LS","RT","Opt on"
"LNX9320","DTaP-IPV-Hep B/Hib3m","8-Dec-08","G","Given","A21FAA426CA","15-Sep-10",,,"ABM","MJ","RT","Opt on"
"LNX9320","Pneum, conj 3m","8-Dec-08","G","Given",32500,"15-Oct-10",,,"ABM","MJ","LT","Opt on"
"LNX9320","DTaP-IPV-Hep B/Hib5m","17-Feb-09","G","Given","A21CA469G","15-Sep-10",,,"ABM","MJ","LT","Opt on"
"LNX9320","Pneum, conj 5m","17-Feb-09","G","Given",33463,"15-Oct-10",,,"ABM","MJ","RT","Opt on"
"SVG7698","Pneum, conj 5m","02/17/09","G","Given",33463,"10/15/10",,,"ABM","MJ","RT","Opt on"
"XXX9999","DTaP-IPV-1 6w","26-Aug-05","G","Given","AC20B040BB","15-Feb-08",,,"NS","MART","RVL","Opt on"
"XXX9999","Hib-HepB-1 6w","26-Aug-05","G","Given","F2033","15-Feb-07",,,"NS","MART","RVL","Opt on"
"XXX9999","DTaP-IPV-2 3m","26-Sep-05","G","Given","AC20B040BB","15-Feb-08",,,"NS","MART","RVL","Opt on"
"XXX9999","Hib-HepB-2 3m","26-Sep-05","G","Given","F2033","15-Feb-07",,,"NS","MART","RVL","Opt on"
"XXX9999","DTaP-IPV-3 5m","26-Nov-05","G","Given","AC20B040BB","15-Feb-08",,,"NS","MART","RVL","Opt on"
"XXX9999","HepB, paed-3 5m","26-Nov-05","G","Given","F2033","15-Feb-07",,,"NS","MART","RVL","Opt on"
"XXX9999","Hib 15m","18-Nov-06","G","Given","A72DA154A","15-May-10",,,"NS","MART","RVL","Opt on"
"XXX9999","MMR-1 15m","18-Nov-06","G","Given","G3634","15-Nov-07","G2987","15-Feb-08","NS","MART","RA","Opt on"
dataset2:
"Clinic","Responsible Provider","Name","Nominated Provider","DOB","NHI","Task","Indication","Status","Code","Due","Completed","Vaccinator","Site","Batch No.","Expiry Date"
"D Medical Centre","Dr M",,"Unknown GP","9/5/2008","ABC1234","DTaP-IPV-HepB/Hib","6w","Completed",,"10/17/2008","10/20/2008","Nrs. L ","OTH","A21CA438D","Sep-10"
"D Medical Centre","Dr M",,"Unknown GP","9/5/2008","ABC1234","PCV7","6w","Completed",,"10/17/2008","10/20/2008","Nrs. L ","OTH",31256,"Jun-10"
"D Medical Centre","Dr M",,"Unknown GP","9/5/2008","ABC1234","DTaP-IPV-HepB/Hib","3m","Completed",,"12/5/2008","12/1/2008","Nrs. M ","OTH","A21FAA426CA","Sep-10"
"D Medical Centre","Dr M",,"Unknown GP","9/5/2008","ABC1234","PCV7","3m","Completed",,"12/17/2008","12/1/2008","Nrs. M ","OTH",32500,"Oct-10"
"D Medical Centre","Dr M",,"Unknown GP","9/5/2008","ABC1234","DTaP-IPV-HepB/Hib","5m","Completed",,"2/5/2009","2/11/2009","Nrs. M ","RVL","A21CA469G","Sep-10"
"D Medical Centre","Dr M",,"Unknown GP","9/5/2008","ABC1234","PCV7","5m","Completed",,"3/7/2009","2/11/2009","Nrs. M ","LVL",33463,"Oct-10"
"D Medical Centre","Dr K",,"Unknown GP","9/6/2008","ZZZ9311","DTaP-IPV-HepB/Hib","6w","Completed",,"10/18/2008","10/24/2008","Nrs. L ","OTH","A21CA438D","Sep-10"
"D Medical Centre","Dr K",,"Unknown GP","9/6/2008","ZZZ9311","PCV7","6w","Completed",,"10/18/2008","10/24/2008","Nrs. L ","OTH",31256,"Jun-10"
"D Medical Centre","Dr M",,"Unknown GP","9/6/2008","ZZZ9311","DTaP-IPV-HepB/Hib","5m","Completed",,"2/6/2009","2/17/2009","Nrs. L ","OTH","A21CA469G","Sep-10"
"D Medical Centre","Dr M",,"Unknown GP","9/6/2008","ZZZ9311","PCV7","5m","Completed",,"3/11/2009","2/17/2009","Nrs. L ","OTH",33463,"Oct-10"
"D Medical Centre","Dr K",,"Unknown GP","9/6/2008","LNX9320","DTaP-IPV-HepB/Hib","6w","Completed",,"10/18/2008","10/24/2008","Nrs. L ","OTH","A21CA438D","Sep-10"
"D Medical Centre","Dr K",,"Unknown GP","9/6/2008","LNX9320","PCV7","6w","Completed",,"10/18/2008","10/24/2008","Nrs. L ","OTH",31256,"Jun-10"
"D Medical Centre","Dr M",,"Unknown GP","9/6/2008","LNX9320","DTaP-IPV-HepB/Hib","3m","Completed",,"12/6/2008","12/8/2008","Nrs. M ","OTH","A21FAA426CA","Sep-10"
"D Medical Centre","Dr M",,"Unknown GP","9/6/2008","LNX9320","PCV7","3m","Completed",,"12/20/2008","12/8/2008","Nrs. M ","OTH",32500,"Oct-10"
"D Medical Centre","Dr M",,"Unknown GP","9/6/2008","XXX9999","DTaP-IPV","6w","Completed",,"8/26/2005","12/8/2008","Nrs. M ","OTH","A21FAA426CA","Sep-10"
"D Medical Centre","Dr M",,"Unknown GP","9/6/2008","XXX9999","Hib-HepB","6w","Completed",,"8/26/2005","12/8/2008","Nrs. M ","OTH",32500,"Oct-10"
"D Medical Centre","Dr M",,"Unknown GP","9/6/2008","XXX9999","DTaP-IPV","3m","Completed",,"9/26/2005","12/8/2008","Nrs. M ","OTH","A21FAA426CA","Sep-10"
"D Medical Centre","Dr M",,"Unknown GP","9/6/2008","XXX9999","Hib-HepB","3m","Completed",,"9/26/2005","12/8/2008","Nrs. M ","OTH",32500,"Oct-10"
"D Medical Centre","Dr M",,"Unknown GP","9/6/2008","XXX9999","DTaP-IPV","5m","Completed",,"11/26/2005","12/8/2008","Nrs. M ","OTH","A21FAA426CA","Sep-10"
"D Medical Centre","Dr M",,"Unknown GP","9/6/2008","XXX9999","HepB","5m","Completed",,"11/26/2005","12/8/2008","Nrs. M ","OTH",32500,"Oct-10"
"D Medical Centre","Dr M",,"Unknown GP","9/6/2008","XXX9999","Hib","15m","Completed",,"10/18/2006","12/8/2008","Nrs. M ","OTH","A21FAA426CA","Sep-10"
"D Medical Centre","Dr M",,"Unknown GP","9/6/2008","XXX9999","MMR","15m","Completed",,"10/18/2006","12/8/2008","Nrs. M ","OTH",32500,"Oct-10"
As I say, testing on Windows was giving me some false results, but I can't replicate this now on Linux. I'll continue testing all the permutations I can find.
Thank you so much for your hard work, Xyne.
I haven't tested it, but the problem on Windows may be due to the distinction that it makes between text and binary files. I've modified the file-opening lines to open the files in binary mode ("rb"), which should hopefully handle this correctly:
#!/usr/bin/env python
# Author: Xyne
# Website: xyne.archlinux.ca
# Email: replace "xyne." with "xyne@" in the website URL
import csv
import re
from sys import argv, stdout

exit_on_unmatched = True

def main(argv):
    # a dictionary to hold the file paths to set 1, set 2 and the task table
    fpaths = {}
    # the command-line options which we will accept, in the form "opt=arg"
    options = set(['ds1', 'ds2', 't'])
    # read in the command-line options
    for arg in argv[1:]:
        split_arg = arg.split('=', 1)
        if len(split_arg) > 1:
            opt, fpath = split_arg
            if opt in options and len(fpath) > 0:
                fpaths[opt] = fpath
    # check that we have all 3 file paths and exit with a usage message if not
    for opt in options:
        if opt not in fpaths:
            print "usage:\n  %s ds1=/path/to/dataset1 ds2=/path/to/dataset2 t=/path/to/table\n" % argv[0]
            print '''where:
  "dataset1" is the csv file containing data set 1
  "dataset2" is the csv file containing data set 2
  "table" is a csv file which matches "VACCINE" entries in dataset1 to "Task"
    and "Indication" entries in dataset2. This is best illustrated with an
    example:

    Given the following 2 "VACCINE" entries in dataset1
      "Pneum, conj 6w"
      "DTaP-IPV-Hep B/Hib5m"
    and the corresponding "Task" and "Indication" entries in dataset2:
      "PCV7","6w"
      "DTaP-IPV-HepB/Hib","5m"
    the table file would be constructed as follows:
      "VACCINE","Task","Indication"
      "Pneum, conj 6w","PCV7","6w"
      "DTaP-IPV-Hep B/Hib5m","DTaP-IPV-HepB/Hib","5m"

    Note that the "VACCINE" column holds the complete "VACCINE" string from
    dataset1 which corresponds to each "Task" and "Indication" pair in
    dataset2.
'''
            exit(1)
    # read the data sets and table
    # (binary mode prevents Windows from mangling line endings)
    dataset1 = csv.DictReader( open(fpaths['ds1'], "rb") )
    dataset2 = csv.DictReader( open(fpaths['ds2'], "rb") )
    table = csv.DictReader( open(fpaths['t'], "rb") )
    # create a dictionary from the table
    taskmap = {}
    for t in table:
        # the ':::' is to prevent possible overlapping concatenations
        taskmap[ t['Task'] + ':::' + t['Indication'] ] = t['VACCINE']
    # create the NHI-VACCINE dictionary for dataset2 using
    # the taskmap dictionary to correlate the data sets
    ds2dict = {}
    unmatched = set()
    for r in dataset2:
        nhi = r['NHI']
        mapped_task = r['Task'] + ':::' + r['Indication']
        if mapped_task in taskmap:
            vaccine = taskmap[mapped_task]
        elif exit_on_unmatched:
            unmatched.add( (r['Task'], r['Indication']) )
            continue
        else:
            vaccine = r['Task'] + r['Indication']
        # use sets to keep track of tasks
        if nhi in ds2dict:
            ds2dict[nhi].add(vaccine)
        else:
            ds2dict[nhi] = set([vaccine])
    # exit with a message if there are unmatched tasks and indications
    if len(unmatched) > 0:
        print '''The table does not specify a matching "VACCINE" entry in data set 1 for the
following "Task" and "Indication" combinations in data set 2:
'''
        writer = csv.DictWriter(stdout, ('Task','Indication'), quoting=csv.QUOTE_ALL)
        for r in unmatched:
            writer.writerow({'Task':r[0],'Indication':r[1]})
        exit(1)
    # create a writer for printing the rows to stdout
    writer = csv.DictWriter(stdout, dataset1.fieldnames, quoting=csv.QUOTE_ALL)
    # print out the rows in dataset1 which have no match in dataset2
    for r in dataset1:
        nhi = r['NHI']
        if nhi not in ds2dict or r['VACCINE'] not in ds2dict[nhi]:
            writer.writerow(r)

if __name__ == "__main__":
    main(argv)
If the issue persists, make sure that the file encoding is the same for all of the files. I don't think that actually matters, but it might.
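As an aside, the "rb" mode is the Python 2 idiom for the csv module. If this is ever run under Python 3 (e.g. a newer PortablePython build), the csv module instead expects the file to be opened in text mode with newline='', which likewise stops Windows from translating line endings inside quoted fields. A sketch of the Python 3 equivalent (the path and helper name are just for illustration):

```python
import csv

# Python 2: csv.DictReader(open(path, "rb"))
# Python 3: open in text mode with newline='' so that the csv module
# handles '\r\n' itself and newlines inside quoted fields survive intact.
def read_rows(path):
    with open(path, newline='') as f:
        return list(csv.DictReader(f))

# Usage (hypothetical file): rows = read_rows('dataset1.csv')
```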