Hi,
I'm trying to scan through multiple records containing millions of numbers. What I need to do is print out the numbers that are sequential, like so:
If the record contains a sequence like this:
46859995600
..
46859995699
And there is no breakage in that sequence, then print:
46859995600 100
If that sequence is broken somewhere, then print the series of 10 that are sequential, e.g.:
46859995600 10
46859995620 10
46859995630 10
...
I already have a bash script for this, but it's way too slow. If anyone can help me out I'd be very grateful - perhaps an "awk" solution.
Thanks!
Offline
It might help if you provide some sample input and the bash script to compare against (perhaps it can even be made to run faster).
And what languages would be acceptable? Just bash (/zsh?) and awk, or also something like Python?
Offline
Is "millions" a colloquialism, or is it actually in the millions? Or is it billions, or thousands?
`wc -l <filename>` will give an exact answer to that. The magnitude of the file would influence what tool/approach I'd opt for. You also refer to records (as plural). Is this more than one file? If so, does this sequence need to be followed across files, or is each file a separate sequence? (or is this not files at all, but must be read from piped input - this makes a big difference)
Also, this sounds a bit like an X-Y problem. If you lay out your end goal, you can be provided with much better ways of approaching this.
Last edited by Trilby (2014-05-01 19:05:26)
"UNIX is simple and coherent" - Dennis Ritchie; "GNU's Not Unix" - Richard Stallman
Offline
Trilby, here is `wc -l file` from one of the files:
9012539 file
here is `head file`
461165002
461165003
461165008
461165021
461165028
461165032
461165036
461165047
461165050
461165054
and here is `tail file`
46119999990
46119999991
46119999992
46119999993
46119999994
46119999995
46119999996
46119999997
46119999998
46119999999
My end goal is to print out sequences of 100 or 10 from each file. The sequences do not need to be tracked across the files.
Offline
Please see the other questions in the thread - answers to all of them will be useful.
Additionally, I'm curious as to the desired result for when 150 sequential numbers are encountered. Should *each* of the last 50 output the "100" as the last 100 numbers have been in sequence? Or should a completed sequence of 100 reset the count?
If the bash script works as intended, but is just slow, then posting it would remove all ambiguity from your description.
Also, what you state as an end goal really isn't an end goal ... unless you are just really bored. What is this for? Is this homework?
"UNIX is simple and coherent" - Dennis Ritchie; "GNU's Not Unix" - Richard Stallman
Offline
It might help if you provide some sample input and the bash script to compare against (perhaps it can even be made to run faster).
And what languages would be acceptable? Just bash (/zsh?) and awk, or also something like Python?
Thanks Raynman, any solution that is fast enough would be OK. Here is my admittedly flawed and slow bash script.
rm -rfv flawedslow.sh
Last edited by xr4y (2014-05-02 08:55:42)
Offline
I'm curious as to the desired result for when 150 sequential numbers are encountered. Should *each* of the last 50 output the "100" as the last 100 numbers have been in sequence? Or should a completed sequence of 100 reset the count?
If the bash script works as intended, but is just slow, then posting it would remove all ambiguity from your description.
Also, what you state as an end goal really isn't an end goal ... unless you are just really bored. What is this for? Is this homework?
Trilby, I'm only interested in sequences ending with 00-99 or 0-9 if broken. Bash script posted. This is work related and not a homework.
Offline
I haven't thought it through much, but something like this immediately came to mind:
get next array[index] that's a multiple of 10
if array[index] is a multiple of 100:
    then if array[index] + 99 == array[index + 99] -> print 100 sequence
elif array[index] + 9 == array[index + 9] -> print 10 sequence
goto 0
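A rough awk rendering of that lookahead (just a sketch of my own, assuming sorted input with no duplicates; that's what makes the endpoint test sufficient: 100 strictly increasing numbers spanning exactly 99 must be consecutive):

```shell
# Sketch only: load sorted, duplicate-free numbers into an array, then jump
# ahead 99 (or 9) positions; if the value there is start+99 (or start+9),
# the whole block must be present, since the values strictly increase.
{ seq 500 599; seq 700 709; } | awk '
{ a[NR] = $1 }
END {
    i = 1
    while (i <= NR) {
        if (a[i] % 100 == 0 && a[i + 99] == a[i] + 99) {
            print a[i], 100
            i += 100
        } else if (a[i] % 10 == 0 && a[i + 9] == a[i] + 9) {
            print a[i], 10
            i += 10
        } else {
            i++
        }
    }
}'
```

The sample input here is just two generated runs; swap the seq pipeline for the real file. Holding all ~9 million lines in an array should still fit comfortably in memory.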
But whether the Constitution really be one thing, or another, this much is certain - that it has either authorized such a government as we have had, or has been powerless to prevent it. In either case, it is unfit to exist.
-Lysander Spooner
Offline
Whoa, the original bash approach would *definitely* be slow - and inaccurate. It will only find the 10s within chunks of 100 that have the *00 number. So, for example, the following series of 10 from 810-819 would be missed as there is no 800 to trigger the outermost for loop:
789 799 801 805 810 811 812 813 814 815 816 817 818 819
Also your bash is *very* slow as it has to grep through the entire file *many* times. Just read the input once, and count breaks in order:
--- bash script removed as it failed miserably at its goal ---
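The read-once idea can be sketched in awk instead (a rough sketch of my own, not the removed script; it assumes sorted input, one number per line, no duplicates or blank lines, and an awk like gawk that prints 11-digit integers exactly):

```shell
# Sketch only: single pass over sorted numbers. Group by hundred-block;
# when the block changes, print "<start> 100" if all 100 numbers were seen,
# otherwise print "<start> 10" for each complete ten-block.
seq 46859995600 46859995699 | awk '
function flush_block(t) {
    if (cnt == 100)
        print block * 100, 100
    else
        for (t = 0; t < 10; t++)
            if (tens[t] == 10)
                print block * 100 + t * 10, 10
    split("", tens)   # portable way to clear the array
    cnt = 0
}
{
    b = int($1 / 100)
    if (NR > 1 && b != block)
        flush_block()
    block = b
    cnt++
    tens[int($1 / 10) % 10]++
}
END { if (NR > 0) flush_block() }'
```

The seq here just generates one unbroken run of 100, so this prints "46859995600 100"; replace it with the real file.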
Last edited by Trilby (2014-05-02 00:25:13)
"UNIX is simple and coherent" - Dennis Ritchie; "GNU's Not Unix" - Richard Stallman
Offline
Whoa, the original bash approach would *definitely* be slow - and innaccurate. It will only find the 10s within chunks of 100 that have the *00 number.
Yeah, just added a -m to grep as such:
if [[ $(grep -w -m 1 ${base}${h} ${file}) ]]; then
...
Which made it a lot faster, since there are no duplicate numbers.
Last edited by xr4y (2014-05-01 21:32:36)
Offline
Quick reply - I just edited my post above with the code.
"UNIX is simple and coherent" - Dennis Ritchie; "GNU's Not Unix" - Richard Stallman
Offline
Wow, Trilby, you wrote that one fast, thanks. But `Trilby_script.sh < file` does not give any output.
Last edited by xr4y (2014-05-01 21:47:29)
Offline
If you send a sample input file to the email in my profile, I'd be happy to tweak this a bit. I didn't expect it to work right the first time, but the basic logic of looking at each line only once should drastically improve run time - especially if this is on an HDD.
EDIT: D'Oh, one obvious problem is the missing "$"s on the variables in several conditionals. Also, the subtraction conditional was backwards. These were both just fixed and it's now working on some tiny sample files I created (the new code has replaced the old above).
EDIT: actually I see this will still miss any sequences of 10 that are within a broken sequence of 100 if they come *before* the break in the sequence of 100. A little more revision will be needed for this. (edit: this should now be done too).
Last edited by Trilby (2014-05-01 22:10:43)
"UNIX is simple and coherent" - Dennis Ritchie; "GNU's Not Unix" - Richard Stallman
Offline
I just got the test file, and ran it with the newly revised code above. It seems to work. It's still not really fast; recoding it in C would make it a lot faster yet - but I suspect this is already far quicker than the original.
"UNIX is simple and coherent" - Dennis Ritchie; "GNU's Not Unix" - Richard Stallman
Offline
I really had a brain freeze on that one. Thanks for helping out, everyone, especially Trilby. Your code is more than fast enough.
Offline
One thing you failed to mention, but I think is implied -- is the list monotonic? If not, the code gets far more interesting.
Nothing is too wonderful to be true, if it be consistent with the laws of nature -- Michael Faraday
Sometimes it is the people no one can imagine anything of who do the things no one can imagine. -- Alan Turing
---
How to Ask Questions the Smart Way
Offline
Indeed. Also, note that the code above makes several assumptions: 1) monotonicity ... but "sort" can take care of that; 2) there is no zero - this would be easy to fix, but in the first draft I didn't bother with it, so 000-009 will not currently be counted; 3) the application is not mission critical; 4) there are no blank lines; 5) each line contains *only* the number.
On #3, my code may be faster, but it hasn't been checked for accuracy. Be sure to test for accuracy before trusting its output. In fact, I just tested it ... it failed.
I tested it after I recoded it in C. The speed went from 2min 3sec to 0.39sec ... that's nice. But the results file was *much* smaller.
The results from the bash version include several duplicates and some erroneous inclusions. Given that the C version so far seems more accurate (but still needs testing) and *much* faster, I'd go with patching it up further:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

unsigned long lt = 0, lh = 0;
unsigned long n, prev = 0, tens[10];
int nten = 0;

/* print a completed 100 block, or any completed 10 blocks, then reset */
void check_tens() {
    int i;
    if (lh) printf("%lu 100\n", lh);
    else for (i = 0; i < 10 && tens[i]; i++)
        printf("%lu 10\n", tens[i]);
    memset(tens, 0, 10 * sizeof(unsigned long));
    nten = 0;
}

int main() {
    unsigned long i; /* must be wide enough to count up from prev */
    memset(tens, 0, 10 * sizeof(unsigned long));
    while (fscanf(stdin, "%lu\n", &n) == 1) {
        if ((n % 10) == 0) lt = n;
        else if ((n % 10) == 9) {
            if (!lt || n - 9 == lt) tens[nten++] = lt;
            lt = 0;
        }
        if ((n % 100) == 0) lh = n;
        else if ((n % 100) == 99) check_tens();
        if (n - prev != 1) {
            lt = lh = 0;
            for (i = prev; i < n; i++)
                if ((i % 100) == 99) { check_tens(); break; }
            if ((n % 10) == 0) lt = n;
            if ((n % 100) == 0) lh = n;
        }
        prev = n;
    }
    lh = 0;
    check_tens();
    return 0;
}
EDIT: I can actually confirm the C version is not yet accurate. ARGH!
Last edited by Trilby (2014-05-02 00:11:09)
"UNIX is simple and coherent" - Dennis Ritchie; "GNU's Not Unix" - Richard Stallman
Offline
After a little evaluation of your bash script: it's fast, but not 100% accurate, as it prints out the 10s along with the 100 series.
If a sequence ranges from 57223300-57223399, then just print 57223300. If that sequence is broken at, say, 57223322, then print 57223300, 57223310, 57223330, etc.
And I just saw that you also posted C code. Thanks a lot for your time Trilby.
Offline
Trilby, your assumptions are correct on 1,2,4,5. Your C code is blazing fast, where does it fail in accuracy?
Offline
Oh, it failed in a ridiculous number of ways. Most notable was that it only counted a sequence of ten if it went from a zero to the next zero in the ones place. So 30 31 32 33 34 35 36 37 38 39 72 73 would not score the 30-39 sequence (it would have required 30-40). The new C version seems to work very well.
As it's brand new, of course, I'd still advise great caution and prescribe more testing. But I have fed it some test case input and it gave the expected output for every permutation of odd input I could think of [1]. I also gave it your file, then grep'ed the output for any "10s only" matches (e.g. where a 100 was broken, but the 10s were counted); I "by hand" verified the first several of these as well as several more from the middle of the grep-result list - each one tested was an accurate detection, and no other 10s in that 100s block were missed.
So I'm relatively confident the new code has it right.
note [1] - my inputs all still fit the assumptions of monotonicity and format of one and only one number on every single line of the input.
Last edited by Trilby (2014-05-02 00:26:41)
"UNIX is simple and coherent" - Dennis Ritchie; "GNU's Not Unix" - Richard Stallman
Offline
edit: updated code (v4)
I thought this was an interesting problem so I wrote my own C script. Can you test it and let me know how it works (accuracy, time)? I don't have the input file so I've only done some limited testing using files generated with variations of seq and some manual editing.
#include <stdlib.h>
#include <stdio.h>

static inline void
print_tens(unsigned long current_range, int tens_tracker)
{
    unsigned long i;
    for (i = current_range * 100; tens_tracker > 0; i += 10, tens_tracker >>= 1)
    {
        if (tens_tracker & 1)
        {
            printf("%lu 10\n", i);
        }
    }
}

int
main(int argc, char **argv)
{
    int tens_tracker = 0, sequential = 1, loop = 1;
    unsigned long current_value, current_range,
                  previous_value = -1, /* sentinel: wraps to ULONG_MAX */
                  previous_tens = 0, previous_range = 0;

    while (loop)
    {
        if (fscanf(stdin, "%lu\n", &current_value) != 1)
        {
            loop = 0;
            current_value = -1;
        }
        current_range = current_value / 100;
        previous_tens = previous_value % 100;
        if (sequential >= 100 && previous_tens == 99)
        {
            printf("%lu 100\n", previous_range * 100);
            sequential = 1;
            tens_tracker = 0;
        }
        else if (sequential >= 10 && (previous_value % 10) == 9)
        {
            tens_tracker += 1 << (previous_tens / 10);
        }
        if (tens_tracker && current_range != previous_range)
        {
            print_tens(previous_range, tens_tracker);
            tens_tracker = 0;
            sequential = 1;
        }
        if ((current_value - previous_value) == 1)
        {
            sequential++;
        }
        else
        {
            sequential = 1;
        }
        previous_value = current_value;
        previous_range = current_range;
    }
    return EXIT_SUCCESS;
}
This script assumes monotonicity and one value per line as well.
Last edited by Xyne (2014-05-04 21:48:06)
My Arch Linux Stuff • Forum Etiquette • Community Ethos - Arch is not for everyone
Offline
I wanna play. Is there any chance that one of the larger data sets can either be posted somewhere or that someone would mail it to me? I have an amateur Haskell version that I'll post after testing with a real world data set. It'll be after I sleep tonight though.
**EDIT**
Removed the recursive babbling...
Last edited by skottish (2014-05-02 02:06:16)
Offline
Thanks Trilby, your new C code is flawless as far as I can tell; I will do more tests.
Xyne, here is a test.
time tribly < file
time xyne < file
Trilby: real 0m1.566s
Xyne: real 0m1.575s
So basically the same run speed. Accuracy, though, is a bit off. Here is the tail output from both programs:
Trilby:
46119999000 100
46119999100 100
46119999200 100
46119999300 100
46119999400 100
46119999500 100
46119999600 100
46119999700 100
46119999800 100
46119999900 100
Xyne:
46119999900 100
46119999900 10
46119999910 10
46119999920 10
46119999930 10
46119999940 10
46119999950 10
46119999960 10
46119999970 10
46119999980 10
I want to thank everyone for their time and help, really appreciated. I've learnt a little more C as well in this never-ending learning process.
Offline
You know what they say -- there are two hard problems in computer science: cache invalidation, naming things, and off-by-one errors.
Offline
Incidentally, there is a name for this: RLE (Run Length Encoding)
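For a sorted list, a generic RLE pass in awk might look like this (a sketch of my own, separate from the thread's 10/100 reporting rule):

```shell
# Sketch only: collapse a sorted list of numbers into "start length"
# pairs, one per maximal consecutive run.
{ seq 1 5; seq 9 11; } | awk '
NR == 1 { start = prev = $1; next }
$1 != prev + 1 { print start, prev - start + 1; start = $1 }
{ prev = $1 }
END { if (NR) print start, prev - start + 1 }'
```

From there, the 10/100 output is just post-processing on the (start, length) pairs.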
Nothing is too wonderful to be true, if it be consistent with the laws of nature -- Michael Faraday
Sometimes it is the people no one can imagine anything of who do the things no one can imagine. -- Alan Turing
---
How to Ask Questions the Smart Way
Offline