You are not logged in.

#1 2014-05-01 18:04:33

xr4y
Member
Registered: 2011-05-06
Posts: 33

Find gaps in sequence of numbers

Hi,

I'm trying to scan through multiple records containing millions of numbers. What I need to do is to print out the numbers that are sequential as such:

If the record contains a sequence like this:

46859995600
..
46859995699

And there is no breakage in that sequence, then print:

46859995600 100

If that sequence is broken somewhere, then print series of 10 that are sequential, f.e:

46859995600 10
46859995620 10
46859995630 10
...

I already have a bash script for this, but it's way too slow. If anyone can help me out I'd be very grateful, perhaps a "awk" solution.

Thanks!

Offline

#2 2014-05-01 18:27:51

Raynman
Member
Registered: 2011-10-22
Posts: 1,539

Re: Find gaps in sequence of numbers

It might help if you provide some sample input and the bash script to compare against (perhaps it can even be made to run faster).

And what languages would be acceptable? Just bash (/zsh?) and awk, or also something like Python?

Offline

#3 2014-05-01 18:34:15

Trilby
Inspector Parrot
Registered: 2011-11-29
Posts: 29,548
Website

Re: Find gaps in sequence of numbers

Is "millions" a colloqiualism, or is it actually in the millions?  Or is it billions, or thousands?

`wc -l <filename>` will give an exact answer to that.  The magnitude of the file would influence what tool/approach I'd opt for.  You also refer to records (as plural).  Is this more than one file?  If so, does this sequence need to be followed across files, or is each file a separate sequence? (or is this not files at all, but must be read from piped input - this makes a big difference)

Also, this sounds a bit like an X-Y problem.  If you layout your end goal, you be provided with much better ways of approaching this.

Last edited by Trilby (2014-05-01 19:05:26)


"UNIX is simple and coherent..." - Dennis Ritchie, "GNU's Not UNIX" -  Richard Stallman

Offline

#4 2014-05-01 20:19:42

xr4y
Member
Registered: 2011-05-06
Posts: 33

Re: Find gaps in sequence of numbers

Trilby, here is `wc -l file` from one of the files:

9012539 file

here is `head file`

461165002
461165003
461165008
461165021
461165028
461165032
461165036
461165047
461165050
461165054

and here is `tail file`

46119999990
46119999991
46119999992
46119999993
46119999994
46119999995
46119999996
46119999997
46119999998
46119999999

My end goal is to print out sequences of 100 or 10 from each file. The sequences does not need to be tracked across the files.

Offline

#5 2014-05-01 20:31:07

Trilby
Inspector Parrot
Registered: 2011-11-29
Posts: 29,548
Website

Re: Find gaps in sequence of numbers

Please see the other questions in the thread - they will all be useful.

Additionally, I'm curious as to the desired result for when 150 sequential numbers are encountered.  Should *each* of the last 50 output the "100" as the last 100 numbers have been in sequence?  Or should a completed sequence of 100 reset the count?

If the bash script works as intended, but is just slow, then posting it would remove all ambiguity from your description.

Also, what you state as an end goal really isn't an end goal ... unless you are just really bored.  What is this for?  Is this homework?


"UNIX is simple and coherent..." - Dennis Ritchie, "GNU's Not UNIX" -  Richard Stallman

Offline

#6 2014-05-01 20:43:41

xr4y
Member
Registered: 2011-05-06
Posts: 33

Re: Find gaps in sequence of numbers

Raynman wrote:

It might help if you provide some sample input and the bash script to compare against (perhaps it can even be made to run faster).

And what languages would be acceptable? Just bash (/zsh?) and awk, or also something like Python?

Thanks Raynman, any solution that is fast enough would be ok. Here is my for sure flawed and slow bash script.

rm -rfv flawedslow.sh

Last edited by xr4y (2014-05-02 08:55:42)

Offline

#7 2014-05-01 20:50:52

xr4y
Member
Registered: 2011-05-06
Posts: 33

Re: Find gaps in sequence of numbers

Trilby wrote:

I'm curious as to the desired result for when 150 sequential numbers are encountered.  Should *each* of the last 50 output the "100" as the last 100 numbers have been in sequence?  Or should a completed sequence of 100 reset the count?

If the bash script works as intended, but is just slow, then posting it would remove all ambiguity from your description.

Also, what you state as an end goal really isn't an end goal ... unless you are just really bored.  What is this for?  Is this homework?

Trilby, I'm only interested in sequences ending with 00-99 or 0-9 if broken. Bash script posted. This is work related and not a homework.

Offline

#8 2014-05-01 21:10:07

alphaniner
Member
From: Ancapistan
Registered: 2010-07-12
Posts: 2,810

Re: Find gaps in sequence of numbers

I haven't thought it through much, but something like this immediately came to mind:

get next array[index] that's a multiple of 10

if array[index] is multiple of 100:
    then if array[index] + 99 == array[index + 99] > print 100 sequence

elif array[index] + 9 == array[index + 9] > print 10 sequence
goto 0

But whether the Constitution really be one thing, or another, this much is certain - that it has either authorized such a government as we have had, or has been powerless to prevent it. In either case, it is unfit to exist.
-Lysander Spooner

Offline

#9 2014-05-01 21:23:12

Trilby
Inspector Parrot
Registered: 2011-11-29
Posts: 29,548
Website

Re: Find gaps in sequence of numbers

Whoa, the original bash approach would *definitely* be slow - and innaccurate.  It will only find the 10s within chunks of 100 that have the *00 number.  So, for example, the following series of 10 from 810-819 would be missed as there is no 800 to trigger the outermost for loop:

789 799 801 805 810 811 812 813 814 815 816 817 818 819

Also your bash is *very* slow as it has to grep through the entire file *many* times.  Just read the input once, and count breaks in order:

--- bash script removed as it failed miserably at it's goal  ---

Last edited by Trilby (2014-05-02 00:25:13)


"UNIX is simple and coherent..." - Dennis Ritchie, "GNU's Not UNIX" -  Richard Stallman

Offline

#10 2014-05-01 21:30:30

xr4y
Member
Registered: 2011-05-06
Posts: 33

Re: Find gaps in sequence of numbers

Trilby wrote:

Whoa, the original bash approach would *definitely* be slow - and innaccurate.  It will only find the 10s within chunks of 100 that have the *00 number.

Yeah, just added a -m to grep as such:

if [[ $(grep -w -m 1 ${base}${h} ${file}) ]]; then
...

Which made it alot faster since there are no duplicate numbers.

Last edited by xr4y (2014-05-01 21:32:36)

Offline

#11 2014-05-01 21:34:50

Trilby
Inspector Parrot
Registered: 2011-11-29
Posts: 29,548
Website

Re: Find gaps in sequence of numbers

Quick reply - I just editted my post above with the code.


"UNIX is simple and coherent..." - Dennis Ritchie, "GNU's Not UNIX" -  Richard Stallman

Offline

#12 2014-05-01 21:45:22

xr4y
Member
Registered: 2011-05-06
Posts: 33

Re: Find gaps in sequence of numbers

Wow, Trilby, you wrote that one fast, thanks. But `Trilby_script.sh < file` does not give any output.

Last edited by xr4y (2014-05-01 21:47:29)

Offline

#13 2014-05-01 21:47:32

Trilby
Inspector Parrot
Registered: 2011-11-29
Posts: 29,548
Website

Re: Find gaps in sequence of numbers

If you send a sample input file to the email in my profile, I'd be happy to tweak this a bit.  I didn't expect it to work right the first time, but the basic logic of looking at each line only once should drastically improve run time - especially if this is on an HDD.

EDIT: D'Oh, one obvious problem is the missing "$"s on the variables in several conditionals.  Also, the subtraction conditional was backwards.  These were both just fixed and it's now working on some tiny sample files I created (the new code has replaced the old above).

EDIT: actually I see this will still miss any sequences of 10 that are within a broken sequence of 100 if they come *before* the break in the sequence of 100.  A little more revision will be needed for this. (edit: this should now be done too).

Last edited by Trilby (2014-05-01 22:10:43)


"UNIX is simple and coherent..." - Dennis Ritchie, "GNU's Not UNIX" -  Richard Stallman

Offline

#14 2014-05-01 22:13:01

Trilby
Inspector Parrot
Registered: 2011-11-29
Posts: 29,548
Website

Re: Find gaps in sequence of numbers

I just got the test file, and ran it with the newly revised code above.  It seems to work.  It's still not really fast, recoding in C would make it a lot faster yet - but I suspect this is far quicker than the original.


"UNIX is simple and coherent..." - Dennis Ritchie, "GNU's Not UNIX" -  Richard Stallman

Offline

#15 2014-05-01 22:15:32

xr4y
Member
Registered: 2011-05-06
Posts: 33

Re: Find gaps in sequence of numbers

I really had a brain freeze on that one, thanks for helping out everyone, specially Tribly. Your code is more than fast enough. smile

Offline

#16 2014-05-01 22:27:48

ewaller
Administrator
From: Pasadena, CA
Registered: 2009-07-13
Posts: 19,803

Re: Find gaps in sequence of numbers

One thing you failed to mention, but I think is implied -- Is the list monotonic?  If not, the code gets far more interesting tongue


Nothing is too wonderful to be true, if it be consistent with the laws of nature -- Michael Faraday
Sometimes it is the people no one can imagine anything of who do the things no one can imagine. -- Alan Turing
---
How to Ask Questions the Smart Way

Offline

#17 2014-05-01 22:31:32

Trilby
Inspector Parrot
Registered: 2011-11-29
Posts: 29,548
Website

Re: Find gaps in sequence of numbers

Indeed.  Also, note that the code above makes several assumptions: 1) monotonicity ... but "sort" can take care of that; 2) there is no zero.  This would be easy to fix, but in the first draft I didn't bother with it: 000-009 will not currently be counted; 3) the application is not mission critical; 4) there are no blank line; 5) each line contains *only* the number.

On #3, my code may be faster, but it hasn't been checked for accuracy.  Be sure to test for accuracy before trusting its output.  In fact, I just tested it ... it failed.

I tested it after I recoded it in C.  The speed went from 2min 3sec to 0.39sec ... that's nice.  But the results file was *much* smaller.

The results from the bash version includes several duplicates, and some erroneous inclusions.  Given that so far the C version seems more accurate (but still needs testing) and *much* faster, I'd go with patching this up further:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

unsigned long lt = 0, lh = 0;
unsigned long n, prev = 0, tens[10];
int nten=0;

void check_tens() {
	int i;
	if (lh) printf("%lu 100\n",lh);
	else for (i = 0; i < 10 && tens[i]; i++)
		printf("%lu 10\n",tens[i]);
	memset(tens,0,10*sizeof(unsigned long));
	nten = 0;
}

int main() {
	int i;
	memset(tens,0,10*sizeof(unsigned long));
	while (fscanf(stdin,"%lu\n",&n) == 1) {
		if ((n % 10) == 0) lt = n;
		else if ((n % 10) == 9) {
			if (!lt || n - 9 == lt) tens[nten++] = lt;
			lt = 0;
		}
		if ((n % 100) == 0) lh = n;
		else if ((n % 100) == 99) check_tens();
		if (n - prev != 1) {
			lt = lh = 0;
			for (i = prev; i < n; i++)
				if ((i % 100) == 99) { check_tens(); break; }
			if ((n % 10) == 0) lt = n;
			if ((n % 100) == 0) lh = n;
		}
		prev=n;
	}
	lh = 0;
	check_tens();
	return 0;
}

EDIT: I can actually confirm the C version is not yet accurate.  ARGH!

Last edited by Trilby (2014-05-02 00:11:09)


"UNIX is simple and coherent..." - Dennis Ritchie, "GNU's Not UNIX" -  Richard Stallman

Offline

#18 2014-05-01 22:39:41

xr4y
Member
Registered: 2011-05-06
Posts: 33

Re: Find gaps in sequence of numbers

After a little evaluation of your bash script, It's fast but not 100% accurate as it prints out the 10's along with the 100 series.

If a sequence ranges from 57223300-57223399, then just print 57223300. if that sequence was broken at say 57223322, then print 57223300, 57223310, 57223330 etc.

And I just saw that you also posted a C code. Thanks alot for your time Trilby.

Offline

#19 2014-05-01 23:33:41

xr4y
Member
Registered: 2011-05-06
Posts: 33

Re: Find gaps in sequence of numbers

Trilby, your assumptions are correct on 1,2,4,5. Your C code is blazing fast, where does it fail in accuracy?

Offline

#20 2014-05-02 00:16:49

Trilby
Inspector Parrot
Registered: 2011-11-29
Posts: 29,548
Website

Re: Find gaps in sequence of numbers

Oh, it failed in a ridiculous number of ways.  Most notable was that it only counted a sequence of ten of it went from a zero to the next zero in the ones place.  So 30 31 32 33 34 35 36 37 38 39 72 73 would not score the 30-39 sequence (it would have required 30-40).  The new C version seems to work very well.

As it's brand new of course, so I'd still advise great caution and prescribe more testing.  But I have fed it some test case input and it gave the expected output for every permutation of odd input I could think of [1].  I also gave it your file, then grep'ed the output for any "10s only" matches (e.g. where a 100 was broken, but the 10s were counted); I "by hand" verified the first several of these as well as several more from the middle of the grep-result list - each one tested was an accurate detection, and no other 10s in that 100s block were missed.

So I'm relatively confident the new code has it right.


note [1] - my inputs all still fit the assumptions of monotonicity and format of one and only one number on every single line of the input.

Last edited by Trilby (2014-05-02 00:26:41)


"UNIX is simple and coherent..." - Dennis Ritchie, "GNU's Not UNIX" -  Richard Stallman

Offline

#21 2014-05-02 01:14:46

Xyne
Administrator/PM
Registered: 2008-08-03
Posts: 6,963
Website

Re: Find gaps in sequence of numbers

edit: updated code (v4)

I thought this was an interesting problem so I wrote my own C script. Can you test it and let me know how it works (accuracy, time)? I don't have the input file so I've only done some limited testing using files generated with variations of seq and some manual editing.

#include <stdlib.h>
#include <stdio.h>

inline
void
print_tens(unsigned long current_range, int tens_tracker)
{
  unsigned long i;
  for (i = current_range * 100; tens_tracker > 0; i += 10, tens_tracker >>= 1)
  {
    if (tens_tracker & 1)
    {
      printf("%lu 10\n", i);
    }
  }
}



int
main(int argc, char * * argv)
{
  int tens_tracker = 0, sequential = 1, loop = 1;
  unsigned long current_value, current_range,
                previous_value = -1, previous_tens = 0, previous_range = 0;

  while (loop)
  {
    if (fscanf(stdin,"%lu\n",&current_value) != 1)
    {
      loop = 0;
      current_value = -1;
    }
    current_range = current_value / 100;

    previous_tens = previous_value % 100;
    if (sequential >= 100 && previous_tens == 99)
    {
      printf("%lu 100\n", previous_range * 100);
      sequential = 1;
      tens_tracker = 0;
    }
    else if (sequential >= 10 && (previous_value % 10) == 9)
    {
      tens_tracker += 1 << (previous_tens / 10);
    }

    if (tens_tracker && current_range != previous_range)
    {
      print_tens(previous_range, tens_tracker);
      tens_tracker = 0;
      sequential = 1;
    }

    if ((current_value - previous_value) == 1)
    {
      sequential ++;
    }
    else
    {
      sequential = 1;
    }
    previous_value = current_value;
    previous_range = current_range;
  }

  return EXIT_SUCCESS;
}

This script assumes monotonicity and one value per line as well.

Last edited by Xyne (2014-05-04 21:48:06)


My Arch Linux StuffForum EtiquetteCommunity Ethos - Arch is not for everyone

Offline

#22 2014-05-02 01:59:37

skottish
Forum Fellow
From: Here
Registered: 2006-06-16
Posts: 7,942

Re: Find gaps in sequence of numbers

I wanna play. Is there any chance that one of the larger data sets can either be posted somewhere or if someone would mail it to me? I have an amateur Haskell version that I'll post after testing with a real world data set. It'll be after I sleep tonight though.

**EDIT**

Remove the recursive babbling...

Last edited by skottish (2014-05-02 02:06:16)

Offline

#23 2014-05-02 08:53:31

xr4y
Member
Registered: 2011-05-06
Posts: 33

Re: Find gaps in sequence of numbers

Thank Trilby, your new C code is flawless as far as I can tell, will do more tests.

Xyne, here is a test.

time tribly < file
time xyne < file

Trilby: real    0m1.566s
Xyne: real    0m1.575s

So basicaly the same run speed. Accuracy though is a bit off. Here is the tail output from both programs:

Trilby:

46119999000 100
46119999100 100
46119999200 100
46119999300 100
46119999400 100
46119999500 100
46119999600 100
46119999700 100
46119999800 100
46119999900 100

Xyne:

46119999900 100
46119999900 10
46119999910 10
46119999920 10
46119999930 10
46119999940 10
46119999950 10
46119999960 10
46119999970 10
46119999980 10

I want to thank everyone for their time and help, really appreciated. I've learnt a little more C as well in this never ending learning process. smile

Offline

#24 2014-05-02 11:29:52

Trent
Member
From: Baltimore, MD (US)
Registered: 2009-04-16
Posts: 990

Re: Find gaps in sequence of numbers

You know what they say -- there are two hard problems in computer science: cache invalidation, naming things, and off-by-one errors.

Offline

#25 2014-05-02 13:11:12

ewaller
Administrator
From: Pasadena, CA
Registered: 2009-07-13
Posts: 19,803

Re: Find gaps in sequence of numbers

Incidentally, there is a name for this: RLE  (Run Length Encoding) smile


Nothing is too wonderful to be true, if it be consistent with the laws of nature -- Michael Faraday
Sometimes it is the people no one can imagine anything of who do the things no one can imagine. -- Alan Turing
---
How to Ask Questions the Smart Way

Offline

Board footer

Powered by FluxBB