#1 2009-08-24 08:25:05

ngoonee
Forum Fellow
From: Between Thailand and Singapore
Registered: 2009-03-17
Posts: 7,356

Fast C++ algorithm for finding quartiles/median on a large fixed-range dataset?

Hi all,

The easiest way to find the median/quartiles would be to sort the dataset's values and pick the nth value, n being 25%, 50%, or 75% of the dataset's size. Surely sorting (normally O(n log n)) isn't the most efficient method? My dataset has a fixed numerical range (which may be large, unfortunately, but is fully known at the start of computation) as well as a fixed size (from 300K to 1.5M values, depending on the dataset).

A few suggestions I've seen from Mr. Google are single-pass building of a histogram, which would be fast for a small numerical range but memory-intensive for larger ranges, or recursive partitioning based on a 'trial' pivot value (perhaps based on the mean). There are also some suggestions to use ordered containers in C++, like lists and sets, but my data is already in array format...
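
To make the histogram idea concrete, a rough sketch for the case where the values are integers within known bounds (the names and bounds here are made up for illustration):

#include <cstddef>
#include <vector>

// Counting-histogram quantile: O(n + range) time, O(range) memory.
// Assumes integer data in [lo, hi]; k is the 0-based rank wanted
// (e.g. k = n/4, n/2, 3n/4 for the quartiles).
int kth_by_histogram(const std::vector<int>& data, int lo, int hi, std::size_t k)
{
    std::vector<std::size_t> counts(hi - lo + 1, 0);
    for (int v : data)
        ++counts[v - lo];                 // single pass over the data
    std::size_t seen = 0;
    for (int v = lo; v <= hi; ++v) {
        seen += counts[v - lo];
        if (seen > k)                     // the k-th smallest value reached
            return v;
    }
    return hi;                            // not reached if k < data.size()
}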

Suggestions welcome smile.


Allan-Volunteer on the (topic being discussed) mailing lists. You never get the attention of the people who matter on the forums.
jasonwryan-Installing Arch is a measure of your literacy. Maintaining Arch is a measure of your diligence. Contributing to Arch is a measure of your competence.
Griemak-Bleeding edge, not bleeding flat. Edge denotes falls will occur from time to time. Bring your own parachute.

Offline

#2 2009-08-24 09:08:41

majiq
Member
Registered: 2009-03-06
Posts: 259

Re: Fast C++ algorithm for finding quartiles/median on a large fixed-range dataset?

Just have to point out that for your other suggestions: the histogram is essentially a bucket sorting algorithm, so you're still sorting :-P. The recursive partitioning (I don't fully understand that, by the by) seems like an attempt at just the O(n) splitting part of merge sort - and the merging is really where the sorting happens, so...?

Short of sorting, the only method I could suggest is an ordered container in C++, but then you have to consider the tradeoff between build time and execution time. I don't know what your hardware limitations are, but I'd just go with sort-and-pick.

Also, maybe you can find something other than the quartiles to work off of? Again, I don't know what you're doing, but as long as it's not a box-and-whisker plot, maybe there's something else you can use?

Offline

#3 2009-08-24 09:32:58

ngoonee
Forum Fellow
From: Between Thailand and Singapore
Registered: 2009-03-17
Posts: 7,356

Re: Fast C++ algorithm for finding quartiles/median on a large fixed-range dataset?

I'm basically comparing two distributions and finding the mapping which would best make distribution A fit distribution B. The easiest way I've found is to generate a hash map from the quartiles of the distributions, linearly connected. So quartile 1 of distribution A is assumed to be a perfect match for quartile 1 of distribution B, and so on and so forth.
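
Roughly like this, as a quick sketch (the names are made up; qa and qb hold min, Q1, median, Q3 and max of A and B):

// Map a value x from distribution A's scale onto distribution B's scale by
// linear interpolation between corresponding quartile points.
// qa and qb each hold {min, Q1, median, Q3, max} for A and B respectively.
double map_quartiles(double x, const double qa[5], const double qb[5])
{
    if (x <= qa[0]) return qb[0];
    if (x >= qa[4]) return qb[4];
    for (int i = 0; i < 4; ++i) {
        if (x <= qa[i + 1]) {
            double t = (x - qa[i]) / (qa[i + 1] - qa[i]);   // position within this segment
            return qb[i] + t * (qb[i + 1] - qb[i]);
        }
    }
    return qb[4];   // not reached
}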

And yes, I understand that everything is 'sort-of-like-a' sorting method, but sorting is pretty fixed in its memory/computational requirements, and I'd like to minimize those using the factors I know about beforehand (basically numerical range and distribution size). It doesn't matter what sorting algorithm you use; sorting a few hundred thousand values is still relatively slow, and totally unparallelizable.


Allan-Volunteer on the (topic being discussed) mailing lists. You never get the attention of the people who matter on the forums.
jasonwryan-Installing Arch is a measure of your literacy. Maintaining Arch is a measure of your diligence. Contributing to Arch is a measure of your competence.
Griemak-Bleeding edge, not bleeding flat. Edge denotes falls will occur from time to time. Bring your own parachute.

Offline

#4 2009-08-24 10:35:28

Allan
Pacman
From: Brisbane, AU
Registered: 2007-06-09
Posts: 11,400
Website

Re: Fast C++ algorithm for finding quartiles/median on a large fixed-range dataset?

ngoonee wrote:

It doesn't matter what sorting algorithm you use; sorting a few hundred thousand values is still relatively slow, and totally unparallelizable.

Not really... Knuth has a book all about sorting and searching algorithms.  It is quite interesting if you are into that sort of thing.

Offline

#5 2009-08-24 11:09:47

ngoonee
Forum Fellow
From: Between Thailand and Singapore
Registered: 2009-03-17
Posts: 7,356

Re: Fast C++ algorithm for finding quartiles/median on a large fixed-range dataset?

Allan wrote:
ngoonee wrote:

It doesn't matter what sorting algorithm you use; sorting a few hundred thousand values is still relatively slow, and totally unparallelizable.

Not really... Knuth has a book all about sorting and searching algorithms.  It is quite interesting if you are into that sort of thing.

Hi Allan. Yes, I know (in general) about various algorithms, and the performance improvement from one to another is terrific. What I was saying, unclearly, in the previous post is that sorting is still relatively slow (I believe the best algorithms are O(n log n)), whereas the median, for example, should be determinable in a single pass through the distribution (theoretically O(n)).

I will, however, look for that book to read tomorrow. Maths is obviously not a strength of mine.


Allan-Volunteer on the (topic being discussed) mailing lists. You never get the attention of the people who matter on the forums.
jasonwryan-Installing Arch is a measure of your literacy. Maintaining Arch is a measure of your diligence. Contributing to Arch is a measure of your competence.
Griemak-Bleeding edge, not bleeding flat. Edge denotes falls will occur from time to time. Bring your own parachute.

Offline

#6 2009-08-24 11:40:15

SiC
Member
From: Liverpool, England
Registered: 2008-01-10
Posts: 430

Re: Fast C++ algorithm for finding quartiles/median on a large fixed-range dataset?

If you are calculating the median, you have no option other than to sort the values. It is the middle value, rather than an average, so you need the data in order.

Quartiles suffer from the same problem.  You need ordered data.

Offline

#7 2009-08-24 11:50:33

keenerd
Package Maintainer (PM)
Registered: 2007-02-22
Posts: 647
Website

Re: Fast C++ algorithm for finding quartiles/median on a large fixed-range dataset?

ngoonee wrote:

It doesn't matter what sorting algorithm you use; sorting a few hundred thousand values is still relatively slow, and totally unparallelizable.

Only being a little silly here, but bubble sort is embarrassingly parallel. With n-1 processors, it will sort a list in O(n) time. Merge sort is a better candidate for parallelization, with n/2 processors on the first pass, n/4 on the second, etc., and O(n) time. Quicksort needs 1, then 2, then 4 processors, but each stage can start working before the previous one finishes, so it also works out to O(n) time. I think. Most any sort becomes O(n) with enough processors.

SiC wrote:

If you are calculating the median, you have no option other than to sort the values.

This is incorrect.  Oh, the irony of your user name.

Regarding your algo, O(n) is typical for finding the median. Think of sorting a list, but stopping early once the left and right sides have an equal number of elements.
http://en.wikipedia.org/wiki/Median_search
http://en.wikipedia.org/wiki/Selection_algorithm

For quartiles, just rerun median on the left and right halves.  So quartiles and median should take O(2n) for a single core, O(1.5n) for two cores, and O(n) for three cores.  You'd have a lot of interesting thread contention issues for the three core case, though.
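
A rough sketch of that selection idea (just illustrative, the names are mine):

#include <cstddef>
#include <utility>
#include <vector>

// Quickselect with a Lomuto partition: returns the element that would sit at
// index k if v[lo..hi) were sorted (k is an absolute index, lo <= k < hi),
// rearranging v as a side effect.  Average O(n); this plain version can
// degrade to O(n^2) on bad pivots.
int quickselect(std::vector<int>& v, std::size_t lo, std::size_t hi, std::size_t k)
{
    while (hi - lo > 1) {
        std::swap(v[lo + (hi - lo) / 2], v[hi - 1]);   // middle element as pivot
        int pivot = v[hi - 1];
        std::size_t store = lo;
        for (std::size_t i = lo; i + 1 < hi; ++i)
            if (v[i] < pivot)
                std::swap(v[i], v[store++]);
        std::swap(v[store], v[hi - 1]);                // pivot into its final slot
        if (k == store)
            return v[store];
        if (k < store)
            hi = store;        // keep searching the left part
        else
            lo = store + 1;    // keep searching the right part
    }
    return v[lo];
}

The median is quickselect(v, 0, v.size(), v.size()/2); the quartiles come from rerunning it on the two halves as described above.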

Last edited by keenerd (2009-08-24 11:57:10)

Offline

#8 2009-08-24 12:33:46

SiC
Member
From: Liverpool, England
Registered: 2008-01-10
Posts: 430

Re: Fast C++ algorithm for finding quartiles/median on a large fixed-range dataset?

The median search method uses a slight variation on Quicksort if I am not mistaken...  Hence it is sorting the list, but it is a very efficient way of doing it.

Last edited by SiC (2009-08-24 12:34:02)

Offline

#9 2009-08-24 13:17:22

cerbie
Member
Registered: 2008-03-16
Posts: 124

Re: Fast C++ algorithm for finding quartiles/median on a large fixed-range dataset?

ngoonee wrote:

It doesn't matter what sorting algorithm you use; sorting a few hundred thousand values is still relatively slow, and totally unparallelizable.

While I've never had the need to try it, it should be possible to make a parallel quicksort or merge sort, by spawning processes to work on partitions of the list. Let the OS manage the dirty work. Have it make new processes as part of the sort, up to a depth of n, operating in-process once that depth is reached (for instance, depth=3 -> 8 'leaf' processes for an i7), returning the list back to the parents to each do the same. With millions of values, I'm not sure if it would be best/easiest to pipe the data in and out, share memory, or just serialize it to text once each process is done with its part (I've not had to connect processes with anything but ordered bytes and file handles before, so I'm not sure how easily all that can be made to work).
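
Something along these lines, though this quick sketch uses threads rather than processes to keep the data sharing trivial (the depth limit plays the role of the process depth above; the names are mine):

#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Depth-limited parallel quicksort: spawn a helper thread for one half of
// each partition until 'depth' reaches zero, then finish sequentially.
// depth = 3 gives up to 8 'leaf' workers, as in the example above.
void parallel_quicksort(std::vector<int>& v, std::size_t lo, std::size_t hi, int depth)
{
    if (hi - lo < 2)
        return;
    int pivot = v[lo + (hi - lo) / 2];
    int* base = v.data();
    // Three-way split: [lo, a) < pivot, [a, b) == pivot, [b, hi) > pivot.
    int* m1 = std::partition(base + lo, base + hi, [pivot](int x) { return x < pivot; });
    int* m2 = std::partition(m1, base + hi, [pivot](int x) { return x == pivot; });
    std::size_t a = m1 - base, b = m2 - base;
    if (depth > 0) {
        std::thread left([&v, lo, a, depth] { parallel_quicksort(v, lo, a, depth - 1); });
        parallel_quicksort(v, b, hi, depth - 1);   // the two halves never overlap
        left.join();
    } else {
        parallel_quicksort(v, lo, a, 0);
        parallel_quicksort(v, b, hi, 0);
    }
}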

Last edited by cerbie (2009-08-24 13:18:41)


"If the data structure can't be explained on a beer coaster, it's too complex." - Felix von Leitner

Offline

#10 2009-08-24 13:53:33

majiq
Member
Registered: 2009-03-06
Posts: 259

Re: Fast C++ algorithm for finding quartiles/median on a large fixed-range dataset?

Could you - and I'm just shooting off my (mouth/fingers) here - take a look at the distributions of the two sets in terms of mean, standard deviation, skewness, and kurtosis and apply the mapping from there? Or is there no set/determined way to make those four play together yet? I'm sure there must be something somewhere... I don't know if it'd be faster or slower than figuring out how to parallelize the sorting algorithms, but it'd save the hassle of sorting.
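
All four can be computed in a couple of linear passes, something like this (just a sketch, names made up):

#include <cmath>
#include <vector>

struct Moments { double mean, stddev, skewness, kurtosis; };

// Two-pass computation of the four summary statistics mentioned above.
Moments moments(const std::vector<double>& x)
{
    const double n = static_cast<double>(x.size());
    double mean = 0.0;
    for (double v : x) mean += v;
    mean /= n;

    double m2 = 0.0, m3 = 0.0, m4 = 0.0;   // central moments
    for (double v : x) {
        double d = v - mean;
        m2 += d * d;
        m3 += d * d * d;
        m4 += d * d * d * d;
    }
    m2 /= n; m3 /= n; m4 /= n;

    Moments r;
    r.mean = mean;
    r.stddev = std::sqrt(m2);
    r.skewness = m3 / (m2 * std::sqrt(m2));
    r.kurtosis = m4 / (m2 * m2);            // 3.0 for a normal distribution
    return r;
}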

Offline

#11 2009-08-24 14:02:40

SiC
Member
From: Liverpool, England
Registered: 2008-01-10
Posts: 430

Re: Fast C++ algorithm for finding quartiles/median on a large fixed-range dataset?

ngoonee wrote:

I'm basically comparing two distributions and finding the mapping which would best make distribution A fit distribution B. The easiest way I've found is to generate a hash map from the quartiles of the distributions, linearly connected. So quartile 1 of distribution A is assumed to be a perfect match for quartile 1 of distribution B, and so on and so forth.

And yes, I understand that everything is 'sort-of-like-a' sorting method, but sorting is pretty fixed in its memory/computational requirements, and I'd like to minimize those using the factors I know about beforehand (basically numerical range and distribution size). It doesn't matter what sorting algorithm you use; sorting a few hundred thousand values is still relatively slow, and totally unparallelizable.

If you are assuming that the inter-quartile ranges for A and B can be mapped to each other, then you are also assuming that the median/mean should map to each other.

Given that, normality could be assumed (for most distributions of data this can be OK, but if either or both distributions are definitely non-normal, then you are going to bugger up).

Given this assumption of normality, you can map using the normal distribution.  This is a very trivial idea, but it would cover what majiq says and would be approximately O(3n).
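
In other words, something like this minimal sketch (the means and standard deviations come from a linear pass over each distribution, which is roughly where the O(3n) comes from; the names are made up):

// Map a value from distribution A onto distribution B's scale by assuming
// both are (roughly) normal: standardise against A, rescale to B.
double map_normal(double x, double meanA, double stdA, double meanB, double stdB)
{
    return (x - meanA) / stdA * stdB + meanB;
}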

Not entirely sure it is what you are looking for though.

Offline

#12 2009-08-24 22:00:01

kakTuZ
Member
From: Hannover, Germany
Registered: 2007-10-20
Posts: 86

Re: Fast C++ algorithm for finding quartiles/median on a large fixed-range dataset?

Finding the median (or the nth value) can be done in O(n), as you can see from keenerd's links. The STL provides an implementation for this: see the nth_element function. It does exactly what you request in your first post, dividing a set into two parts, where the first part contains n elements and all of these elements are smaller than the elements of the other part. And it does so in O(n).
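
For example, something like this for the median (taking a copy, since nth_element reorders the data):

#include <algorithm>
#include <vector>

// Median via std::nth_element: after the call, v[v.size()/2] holds the value
// that would sit there in a fully sorted vector (the upper median for even sizes).
double median(std::vector<double> v)      // taken by value: nth_element reorders it
{
    std::nth_element(v.begin(), v.begin() + v.size() / 2, v.end());
    return v[v.size() / 2];
}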

Last edited by kakTuZ (2009-08-24 22:11:09)

Offline

#13 2009-08-25 05:45:50

ngoonee
Forum Fellow
From: Between Thailand and Singapore
Registered: 2009-03-17
Posts: 7,356

Re: Fast C++ algorithm for finding quartiles/median on a large fixed-range dataset?

keenerd wrote:
ngoonee wrote:

It doesn't matter what sorting algorithm you use; sorting a few hundred thousand values is still relatively slow, and totally unparallelizable.

Only being a little silly here, but bubble sort is embarrassingly parallel. With n-1 processors, it will sort a list in O(n) time. Merge sort is a better candidate for parallelization, with n/2 processors on the first pass, n/4 on the second, etc., and O(n) time. Quicksort needs 1, then 2, then 4 processors, but each stage can start working before the previous one finishes, so it also works out to O(n) time. I think. Most any sort becomes O(n) with enough processors.

My bad smile. I'm from a graphics processing background, and to me parallelizable means separate threads from the start rather than spawning threads as you go. Reading your explanation made me feel really stupid, as of course the end result of the thread spawning is superbly parallel.

keenerd wrote:
SiC wrote:

If you are calculating the median, you have no option other than to sort the values.

This is incorrect.  Oh, the irony of your user name.

Regarding your algo, O(n) is typical for finding the median. Think of sorting a list, but stopping early once the left and right sides have an equal number of elements.
http://en.wikipedia.org/wiki/Median_search
http://en.wikipedia.org/wiki/Selection_algorithm

For quartiles, just rerun median on the left and right halves.  So quartiles and median should take O(2n) for a single core, O(1.5n) for two cores, and O(n) for three cores.  You'd have a lot of interesting thread contention issues for the three core case, though.

Thanks for this, it's actually similar to the following two suggestions:

cerbie wrote:
ngoonee wrote:

It doesn't matter what sorting algorithm you use; sorting a few hundred thousand values is still relatively slow, and totally unparallelizable.

While I've never had the need to try it, it should be possible to make a parallel quicksort or merge sort, by spawning processes to work on partitions of the list. ... I'm not sure if it would be best/easiest to pipe the data in and out, share memory, or just serialize it to text once each process is done with its part ...

Data management is a bit complicated but probably doable. Obviously it's preferable to avoid that complexity.

kakTuZ wrote:

Finding the median (or the nth value) can be done in O(n), as you can see from keenerd's links. The STL provides an implementation for this: see the nth_element function. It does exactly what you request in your first post, dividing a set into two parts, where the first part contains n elements and all of these elements are smaller than the elements of the other part. And it does so in O(n).

This is a champion suggestion, looks like almost exactly what I'm looking for. Will be trying this out.

@majiq and SiC (concerning median/mean mapping), yes those should map fairly well, but normality doesn't really hold for my data (I think). The previous nth_element suggestion seems the fastest way so far, will see how that works out.

Thanks all for the contributions. You guys are champs.


Allan-Volunteer on the (topic being discussed) mailing lists. You never get the attention of the people who matter on the forums.
jasonwryan-Installing Arch is a measure of your literacy. Maintaining Arch is a measure of your diligence. Contributing to Arch is a measure of your competence.
Griemak-Bleeding edge, not bleeding flat. Edge denotes falls will occur from time to time. Bring your own parachute.

Offline

#14 2009-08-25 07:55:48

ngoonee
Forum Fellow
From: Between Thailand and Singapore
Registered: 2009-03-17
Posts: 7,356

Re: Fast C++ algorithm for finding quartiles/median on a large fixed-range dataset?

So, besides the quartiles I'm also calculating the min and max value. What I have looks like this, with 'ptr' being the start of my array and 'size' being the number of elements. Results are saved in a 'quartiles' array. Ignoring definitions and initialization for the sake of brevity.

std::nth_element(ptr, ptr + size/2, ptr + size);              // median; also partitions the array around it
std::nth_element(ptr, ptr + size/4, ptr + size/2);            // lower quartile, searched within the lower half
std::nth_element(ptr + size/2, ptr + size*3/4, ptr + size);   // upper quartile, searched within the upper half
quartiles[0] = *(std::min_element(ptr, ptr + size/4));        // min lies in the smallest quarter
quartiles[1] = ptr[size/4];
quartiles[2] = ptr[size/2];
quartiles[3] = ptr[size*3/4];
quartiles[4] = *(std::max_element(ptr + size/4, ptr + size)); // max lies above the smallest quarter

Is this just about as efficient as it gets?


Allan-Volunteer on the (topic being discussed) mailing lists. You never get the attention of the people who matter on the forums.
jasonwryan-Installing Arch is a measure of your literacy. Maintaining Arch is a measure of your diligence. Contributing to Arch is a measure of your competence.
Griemak-Bleeding edge, not bleeding flat. Edge denotes falls will occur from time to time. Bring your own parachute.

Offline
