I have a daily log file with hundreds of thousands of entries in the following format.
field1,field2,field3,field4,field5,field6,field7,field8,field9,20110516192001.100
field1,field2,field3,field4,field5,field6,field7,field8,field9,20110516192002.200
field1,field2,field3,field4,field5,field6,field7,field8,field9,20110516192003.300
field1,field2,field3,field4,field5,field6,field7,field8,field9,20110516192004.400
field1,field2,field3,field4,field5,field6,field7,field8,field9,20110516192005.500
It's always in the same format, and the 10th field is always the timestamp (YYYYMMDDHHMMSS.mmm, with milliseconds after the decimal point).
Since the file rotates daily, the 10th field will always be 20110516xxxxxx.xxx today and 20110517xxxxxx.xxx tomorrow.
What I want to do is only look at entries that have been written in the last 30 minutes.
At a high level, here's my plan:
1) Get the date/time from 30 minutes ago... write it to a variable
2) Iterate through the file line by line comparing the 10th field to the variable, if it's larger write the line to a tmp file
3) Use tmp file for my analysis
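Spelled out, the three steps above would look something like this in Python (a rough sketch; the function name and file paths are placeholders, not from any post in this thread):

```python
from datetime import datetime, timedelta

def recent_lines(src_path, dst_path, minutes=30):
    """Step 1: compute the cutoff; step 2: scan the log once, copying
    matching lines to a tmp file; step 3 (analysis) uses dst_path."""
    cutoff = (datetime.now() - timedelta(minutes=minutes)).strftime("%Y%m%d%H%M%S")
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            # The 10th field is the last one; drop the .mmm part.  A plain
            # string comparison works because the format is fixed-width.
            ts = line.rstrip("\n").rsplit(",", 1)[-1].split(".")[0]
            if ts >= cutoff:
                dst.write(line)
```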
This seems incredibly inefficient to me... what would be a more graceful way to do it? I have the regular Solaris tools at my disposal (plus Python).
Thanks
Last edited by oliver (2011-05-17 12:41:43)
I'm guessing something with awk will do the job, but I'm not familiar enough with awk to be able to write something for you.
Here's an easy way to work out 30 minutes ago though:
TZ_START=$(date +%Y%m%d%H%M%S -d '30 minutes ago')
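One caveat: `-d '30 minutes ago'` is a GNU date extension, and as far as I know the stock Solaris `/usr/bin/date` doesn't support it, so on Solaris the same value could come from the Python the OP already has (a minimal sketch):

```python
from datetime import datetime, timedelta

# Same YYYYMMDDHHMMSS string the GNU date command above would produce.
tz_start = (datetime.now() - timedelta(minutes=30)).strftime("%Y%m%d%H%M%S")
print(tz_start)
```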
Are you familiar with our Forum Rules, and How To Ask Questions The Smart Way?
BlueHackers // fscanary // resticctl
The algorithm you describe really is a viable approach. Since this is a log file, each line should have a timestamp later than all lines that precede it in the file. A more efficient algorithm could do a binary search through the file for the timestamp you are interested in. That would be easy enough to do in C or Python, but your algorithm could be fast enough. If so, you could try the following quick & dirty bash script.
#!/bin/bash

# Convert a YYYYMMDDHHMMSS timestamp to seconds since the epoch.
seconds() {
    secs=$(($1 % 100))
    mins=$(($1 / 100 % 100))
    hrs=$(($1 / 10000 % 100))
    days=$(($1 / 1000000 % 100))
    month=$(($1 / 100000000 % 100))
    year=$(($1 / 10000000000))
    # The command substitution must be quoted: the date string contains a space.
    LC_TIME=C date +%s -d "$(printf '%d-%02d-%02d %02d:%02d:%02d' \
        "$year" "$month" "$days" "$hrs" "$mins" "$secs")"
}

found=0
now=$(date +%s)
while IFS= read -r line
do
    if [ "$found" -eq 0 ]
    then
        ts=${line##*,}              # 10th field: everything after the last comma
        ts=$(seconds "${ts%.*}")    # drop the milliseconds, convert to epoch seconds
        diff=$(( (now - ts) / 60 ))
        [[ $diff -lt 30 ]] && found=1
    fi
    [[ $found -ne 0 ]] && echo "$line"
done < "$1"
It will write (to stdout) the first line timestamped within the last 30 minutes (ignoring milliseconds) and every line after it. You can redirect the output of the script to a file of your choice for analysis as follows:
$ ./script logfile > tmp
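For completeness, the binary search mentioned above can be sketched in Python. This is a hypothetical helper, not code from the thread: it assumes timestamps never decrease down the file, seeks by byte offset, aligns to the next full line, and homes in on the first line whose last field is at or past the cutoff, so only O(log n) lines are ever read:

```python
import os

def first_recent_offset(path, cutoff):
    """Return the byte offset of the first line whose last comma-separated
    field (YYYYMMDDHHMMSS.mmm, as bytes) is >= cutoff.  Assumes the
    timestamps never decrease down the file."""
    def probe(f, pos):
        # Align to the first complete line at or after byte offset pos.
        f.seek(pos)
        if pos > 0:
            f.readline()          # discard a possibly partial line
        start = f.tell()
        return start, f.readline()

    with open(path, "rb") as f:
        lo, hi = 0, os.fstat(f.fileno()).st_size
        while lo < hi:
            mid = (lo + hi) // 2
            _, line = probe(f, mid)
            ts = line.rstrip().rsplit(b",", 1)[-1]
            if line and ts < cutoff:
                lo = mid + 1      # probed line is too old: look later
            else:
                hi = mid          # recent enough (or past EOF): look earlier
        start, _ = probe(f, lo)
        return start
```

You would then seek() to the returned offset and read everything from there, instead of scanning the whole file line by line.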
Last edited by rockin turtle (2011-05-17 06:58:41)
Since it is a daily log file, you don't need to worry about the days and a simple numerical comparison will suffice:
TZ_START=$(date +%Y%m%d%H%M%S -d '30 minutes ago')
awk -F, -v ts="$TZ_START" '$10 >= ts' < old > new
That's what I was trying to get at, quigybo. Nice!
Since it is a daily log file, you don't need to worry about the days and a simple numerical comparison will suffice:
TZ_START=$(date +%Y%m%d%H%M%S -d '30 minutes ago')
awk -F, '{if($10 >= '$TZ_START') print}' < old > new
Just for the record, though there's no real advantage in this case, if you're using gawk, the timestamp can be generated in the gawk script:
gawk -F, 'BEGIN{time = strftime("%Y%m%d%H%M%S", systime() - 1800)}
{if($10 >= time) print}' < old > new
"...one cannot be angry when one looks at a penguin." - John Ruskin
"Life in general is a bit shit, and so too is the internet. And that's all there is." - scepticisle
nice - thank you all for the advice and support. I'm marking this as solved :-)