That seems related, but not quite the same. RLE (as I understand it) should unambiguously encode the original data. There would be no way to regenerate the original input data with the output of the code presented in this thread. All complete blocks of 10 or 100 would be regenerated, but for any broken blocks of 10, there would be no way to know which or how many of the 10 elements were absent from the original data.
Also, there would be no way to know from an output of "... 1030 1410 ..." whether there was simply no input between 1030 and 1410, or whether almost every number between those two was present except for one missing number in each block of 10. This could result in up to 90% data loss (in the very worst case) from the original input. And that is 90% of the input lines - if you quantify information as reduction in ambiguity, then it'd be far greater than 90%, as even if we knew how many numbers were in that gap, we still would not know *which* numbers were in that gap.
EDIT: also a pure RLE algorithm would be much easier to code than this - and it would produce a much smaller output (at least with the sample file I was sent, though this would depend on the 'porousness' of the input data ... there's probably a more accurate term for that I'm not thinking of).
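For the record, a lossless RLE over runs of consecutive integers only takes a few lines. Here is a quick sketch (my own untested illustration in Haskell, not code from this thread), where every (start, length) pair expands back to exactly the original numbers:
rle :: [Int] -> [(Int, Int)]
rle []     = []
rle (x:xs) = go x 1 xs
  where
    go start len (y:ys)
      | y == start + len = go start (len + 1) ys   -- still consecutive: extend the run
    go start len ys = (start, len) : rle ys        -- run broken (or input ended): emit it

unrle :: [(Int, Int)] -> [Int]
unrle = concatMap (\(start, len) -> [start .. start + len - 1])

main :: IO ()
main = do
    let xs = [3,4,5,6,20,21,40]
    print (rle xs)                 -- [(3,4),(20,2),(40,1)]
    print (unrle (rle xs) == xs)   -- True: the encoding loses nothing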
Last edited by Trilby (2014-05-02 14:15:29)
"UNIX is simple and coherent" - Dennis Ritchie; "GNU's Not Unix" - Richard Stallman
Offline
I suppose you are right. It does remind me of an algorithm I worked on in the '80s, where we were collecting data from accelerometers, LVDTs and load cells and would wait for long periods for something interesting to happen. We used an RLE-based system that required a certain delta from a previous sample to start a new run.
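Something along these lines, reconstructed from memory as a sketch rather than the original code (the threshold, the types, and the choice to compare against the last kept sample are all made up):
deltaRuns :: Double -> [Double] -> [(Int, Double)]
deltaRuns _ [] = []
deltaRuns threshold (x:xs) = (0, x) : go x (zip [1..] xs)
  where
    go _ [] = []
    go kept ((i, v):rest)
      | abs (v - kept) >= threshold = (i, v) : go v rest   -- big enough delta: start a new run
      | otherwise                   = go kept rest         -- otherwise stay in the current run

main :: IO ()
main = print (deltaRuns 0.5 [0.0, 0.1, 0.2, 1.0, 1.1, 3.0])   -- [(0,0.0),(3,1.0),(5,3.0)]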
Nothing is too wonderful to be true, if it be consistent with the laws of nature -- Michael Faraday
Sometimes it is the people no one can imagine anything of who do the things no one can imagine. -- Alan Turing
---
How to Ask Questions the Smart Way
Offline
I have updated my code to catch that error.
My Arch Linux Stuff • Forum Etiquette • Community Ethos - Arch is not for everyone
Offline
Xyne, your new C code is flawless.
Both C codes are equally fast, really. When I time both programs, one run Xyne's code wins and the next Trilby's, although we are talking about milliseconds.
The amount of geek power in the Arch community is really amazing. Thanks again everyone, I hope I can repay one day.
Offline
This is my amateur Haskell version. The data set that I used is 9,988,579 entries long. I thought that when I generated it I might get some sequences of length 100 or 10, but that didn't happen. I manually added some runs just to get this working. It's still slower and less generic than I'd like it to be, but it's pretty good. I didn't bother with the exact output format, but it would add almost nothing to the run time:
import qualified Data.List as DL
import qualified Data.Function as DF
import qualified Data.ByteString.Lazy.Char8 as DB
import Control.Applicative
parseIntRanges :: Integral b => b -> [b] -> [[b]]
parseIntRanges x xs = DL.groupBy ((==) `DF.on` (`div` x)) xs
tensAndHundreds :: Integral t => [[t]] -> [[t]]
tensAndHundreds [] = []
tensAndHundreds (x:xs)
    | length x < 10 = tensAndHundreds xs
    | length x == 100 = [head x,100] : tensAndHundreds xs
    | otherwise = tens (parseIntRanges 10 x) ++ tensAndHundreds xs
tens :: Num t => [[t]] -> [[t]]
tens [] = []
tens (x:xs)
    | length x == 10 = [head x,10] : tens xs
    | otherwise = tens xs
unJust :: Maybe (t, t1) -> t
unJust (Just (a, _)) = a
main = do
    contents <- map unJust <$> map DB.readInt <$> DB.lines <$> DB.getContents
    print $ tensAndHundreds $ parseIntRanges 100 contents
time cat data.txt | ./arch-q
[[0,10],[19000,100],[632570,10],[3147650,10],[4519000,100]]
real 0m3.168s
user 0m3.113s
sys 0m0.307s
time cat data.txt | ./xyne
0 10
19000 100
632570 10
3147650 10
4519000 100
real 0m1.734s
user 0m1.617s
sys 0m0.433s
Last edited by skottish (2014-05-03 17:20:08)
Offline
edit: updated code
v3: div -> quot, mod -> rem
v4: included changes suggested by skottish below
A version using a different approach is available below.
/edit
This thread may just end up in the "Try This" section.
Here's my own Haskell version. I have only tested it on a very small data set so I have no idea if it will be dead-slow or not. Please give it a try and let me know how it performs (or send me a real data set).
{-# OPTIONS_GHC -fwarn-unused-imports #-}
import qualified Data.ByteString.Lazy.Char8 as BS
import Data.ByteString.Builder as BB
import Data.Foldable as Fold
import Data.Monoid
import Data.Sequence as DS
import System.IO
readInts :: BS.ByteString -> [Int]
readInts b =
    case (BS.readInt b) of
        (Just (x, b')) -> x : (readInts b')
        _ -> if BS.null b then [] else readInts $ BS.tail b
sequenceLength :: Int -> [Int] -> [(Int, Int)]
sequenceLength _ [] = []
sequenceLength n (x:[]) = [(x,n)]
sequenceLength n (x:xs@(x':_)) =
    let
        n' =
            if (x' - x) == 1
                then n + 1
                else 1
    in
        (x,n) : (sequenceLength n' xs)
detectSequences' :: Int -> DS.Seq (Int, Bool) -> [(Int, Int)] -> [(Int, Bool)]
detectSequences' _ tens [] = Fold.foldr (:) [] tens
detectSequences' range tens ((x,l):ys) =
    let
        newRange = quot x 100
        tens' =
            if rem x 10 == 9 && l >= 10
                then tens |> (x, False)
                else tens
        rest ts = detectSequences' newRange ts ys
    in
        if rem x 100 == 99 && l >= 100
            then (x, True) : (rest DS.empty)
            else
                if newRange /= range && not (DS.null tens')
                    then Fold.foldr (:) (rest DS.empty) tens'
                    else rest tens'
detectSequences :: BS.ByteString -> [(Int, Bool)]
detectSequences bs =
    let
        xs = readInts bs
        ls = sequenceLength 1 xs
    in
        detectSequences' 0 DS.empty ls
getBuilder :: [(Int, Bool)] -> BB.Builder
getBuilder [] = mempty
getBuilder ((x,b):ys) =
    let
        (x',l) = if b then (x - 99, 100) else (x - 9, 10)
    in
        BB.intDec x' <>
        BB.char7 ' ' <>
        BB.intDec l <>
        BB.char7 '\n' <>
        getBuilder ys
main :: IO ()
main =
    BS.getContents >>=
    hPutBuilder stdout . getBuilder . detectSequences
Haskell is fun. ❤ ❤
Last edited by Xyne (2014-07-20 03:44:24)
My Arch Linux Stuff • Forum Etiquette • Community Ethos - Arch is not for everyone
Offline
This thread may just end up in the "Try This" section.
Here's my own Haskell version. I have only tested it on a very small data set so I have no idea if it will be dead-slow or not. Please give it a try and let me know how it performs (or send me a real data set).
...snip...
This is the output of your C version, my updated Haskell version, and your new Haskell version. Both of my data sets produce similar results:
cat data.txt | ./xyne
0 10
19000 100
632570 10
3147650 10
3147660 10
4519000 100
5454519000 100
real 0m1.733s
user 0m1.600s
sys 0m0.440s
cat data.txt | ./arch-q
0 10
19000 100
632570 10
3147650 10
3147660 10
4519000 100
5454519000 100
real 0m3.039s
user 0m2.957s
sys 0m0.333s
time cat data.txt | ./xhs
19000 10
19010 10
19020 10
19030 10
19040 10
19050 10
19060 10
19070 10
19080 10
631900 10
3147650 10
4519000 10
4519010 10
4519020 10
4519030 10
4519040 10
4519050 10
4519060 10
4519070 10
4519080 10
5454519000 100
real 0m3.263s
user 0m3.193s
sys 0m0.327s
I'm generating data sets with a combination of python and bash. The python script is simply:
#!/usr/bin/python2
import random
for i in range(100000000):
    print random.randint(100000,78910120)
I'm cleaning it up with sort and uniq. That run is going to produce a gigantic file. You'll get tons of matches of 10 in it and maybe you'll get lucky enough to have some runs of 100. My other data set, the one that produced the above results, is much smaller and was hand edited to contain results.
Last edited by skottish (2014-05-04 20:40:17)
Offline
I have updated both of my scripts. The C version initialized the previous value to 0, so it incorrectly detected 1-9 at the start of a file as a full sequence of 10 (as if the initial 0 had been present). The Haskell version had an off-by-one error in the sequence length function.
I wrote my own script to generate data (gen.py):
#!/usr/bin/env python3
import random
import sys
with open('data.txt', 'w') as f:
    for i in range(int(sys.argv[1])):
        if random.random() > 0.1:
            f.write('{:d}\n'.format(i))
The comparison script:
#!/bin/bash
if [[ ! -z $1 ]]
then
echo "generating data ($1)..."
./gen.py "$1"
echo
fi
echo "compiling..."
gcc -Wall -O3 test.c
gcc -Wall -O3 -o trilby trilby.c
ghc -O3 test.hs
ghc -O3 skottish.hs
echo
echo "testing my C version..."
time ./a.out < data.txt > a.txt
echo
echo "testing my Haskell version..."
time ./test < data.txt > b.txt
diff -q a.txt b.txt
echo
echo "testing Trilby's C version..."
time ./trilby < data.txt > c.txt
diff -q a.txt c.txt
echo
echo "testing skottish's Haskell version..."
time ./skottish < data.txt > d.txt
diff -q a.txt d.txt
echo
Output
compiling...
testing my C version...
real 0m1.267s
user 0m1.237s
sys 0m0.030s
testing my Haskell version...
real 0m1.483s
user 0m1.473s
sys 0m0.010s
testing Trilby's C version...
real 0m1.261s
user 0m1.247s
sys 0m0.013s
testing skottish's Haskell version...
real 0m2.162s
user 0m2.137s
sys 0m0.027s
Files a.txt and d.txt differ
The files obviously differ for the last one because I don't have skottish's updated version. Trilby's version seems to be a few tens of ms faster than my C version on average.
My Arch Linux Stuff • Forum Etiquette • Community Ethos - Arch is not for everyone
Offline
Excellent! Nice work. Both of your updated versions are doing well here.
My updated code only deals with the output format. The fact is, the bigger the data set, the more its poor performance shows. It's a very naive way to do things. I'm spending about 70% of my time in parseIntRanges: I first walk the whole list and perform integer division on all of it to parse it into the "0..99" ranges, then walk every sublist of length >= 10 and < 100 for the "0..9" ranges. On my set of 56,653,076 elements, your Haskell code is running ~21 seconds and mine ~29. Your C code is at ~10. I'm still spending ~22% mostly in IO. The rest of my code is working fairly well.
import qualified Data.List as DL
import qualified Data.Function as DF
import qualified Data.ByteString.Lazy.Char8 as DB
import Control.Applicative
parseIntRanges :: Integral b => b -> [b] -> [[b]]
parseIntRanges x = DL.groupBy ((==) `DF.on` (`div` x))
tensAndHundreds :: Integral t => [[t]] -> [[t]]
tensAndHundreds [] = []
tensAndHundreds (x:xs)
    | length x < 10 = tensAndHundreds xs
    | length x == 100 = [head x,100] : tensAndHundreds xs
    | otherwise = tens (parseIntRanges 10 x) ++ tensAndHundreds xs
tens :: Num t => [[t]] -> [[t]]
tens [] = []
tens (x:xs)
    | length x == 10 = [head x,10] : tens xs
    | otherwise = tens xs
unJust :: Maybe (t, t1) -> t
unJust (Just (a, _)) = a
display :: [[Int]] -> IO ()
display = mapM_ (\[a,b] -> putStrLn (show a ++ " " ++ show b))
main = do
    contents <- map unJust <$> map DB.readInt <$> DB.lines <$> DB.getContents
    display $ tensAndHundreds $ parseIntRanges 100 contents
Last edited by skottish (2014-05-06 23:49:27)
Offline
Noob question: How are you profiling your code?
My Arch Linux Stuff • Forum Etiquette • Community Ethos - Arch is not for everyone
Offline
Noob question: How are you profiling your code?
That's a great question:
http://www.randomhacks.net/articles/200 … ce-haskell
and in depth:
http://book.realworldhaskell.org/read/p … ation.html
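For reference, the usual GHC time-profiling workflow looks roughly like this (standard GHC flags; the file and program names are just placeholders):
ghc -O2 -prof -fprof-auto -rtsopts test.hs
./test +RTS -p -RTS < data.txt > /dev/null
That writes test.prof with per-function time and allocation percentages, which is where numbers like the ~70% above come from.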
Last edited by skottish (2014-05-08 21:29:05)
Offline
-snip-
I'm still spending ~22% mostly in IO. The rest of my code is working fairly well.
-snip-end-
Just a random idea: have a look at System.IO.Unsafe, in particular unsafePerformIO; maybe this way you can speed up the IO part.
Offline
Thanks. I think I've even used that before myself but I had completely forgotten about it if so.
My Arch Linux Stuff • Forum Etiquette • Community Ethos - Arch is not for everyone
Offline
Hey Xyne,
Swap out your divs for quots and your mods for rems in your Haskell code and recompile with -O2. You're running within a second of your C code on my machine. I'm still a little behind, but I'm finding optimizations without changing my primary algorithm. I'm over 45% faster now though.
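The reason that swap helps: quot and rem truncate toward zero, while div and mod round toward negative infinity, so for the non-negative values in these data sets they give identical answers but skip the extra sign adjustment. A throwaway check (my own example, not thread code):
main :: IO ()
main = do
    print (127 `div` 10, 127 `quot` 10)         -- (12,12): identical for non-negative operands
    print ((-127) `div` 10, (-127) `quot` 10)   -- (-13,-12): they only differ for negative operands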
**EDIT**
You're within a second on my large dataset. You're actually faster than your C code on my smaller set.
**EDIT 2**
And if you compile with llvm, you're significantly faster than your own C code. I read this on the Internet somewhere:
ghc -O2 -fllvm <your_file>
Last edited by skottish (2014-05-12 00:06:47)
Offline
@skottish
Interesting. I've made the recommended changes to my code above.
So is -O2 faster than -O3 in this case?
Did you compile the C code with llvm for the last comparison?
When I have some time I plan to post a version in another language just to see how it compares (new language for me, so this could be a fun way to get my feet wet).
My Arch Linux Stuff • Forum Etiquette • Community Ethos - Arch is not for everyone
Offline
So is -O2 faster than -O3 in this case?
Did you compile the C code with llvm for the last comparison?
I didn't spend a whole lot of time on this, but -O3 seems to slow things down for both C and Haskell some. I don't know much about compiler flags in general, but I knew that -O2 is generally regarded as sane, so that's what I was using. My only attempt with clang failed to compile your C. I'm sure I just missed a switch or something, but it may be some time before I can get around to it -- if ever.
FYI, your C with -O2:
real 0m9.875s
user 0m7.877s
sys 0m2.663s
Your Haskell with -O2 --make
real 0m10.010s
user 0m8.040s
sys 0m2.107s
Your Haskell with -O2 -fllvm
real 0m8.270s
user 0m6.287s
sys 0m2.073s
My current Haskell version with -O2 -fllvm
real 0m15.208s
user 0m13.157s
sys 0m2.073s
Last edited by skottish (2014-05-11 14:06:59)
Offline
I didn't want my first public appearance of Haskell code to be left as just pretty good code. I revisited this problem and was able to get my run time down to 10.1 seconds. To put that in perspective with my current testing setup, Xyne's C is at 9.4 seconds and Xyne's Haskell is at 8.1(!). Trilby's won't work because I intentionally put stuff at the beginning of the list starting with 0 in order to be complete. As before, I'm avoiding using counters in my solution, as I've never enjoyed managing state with them. And I felt all along that I could get the performance I'm looking for out of Haskell's standard libraries. It was getting fast.
Well, 10.1 seconds just wasn't fast enough for me and I decided to post at Stack Overflow to see if anyone had any advice. SO user dfeuer did. Some reworking later and I'm now at 9.2 seconds. I still feel like I can squeeze some more out of this. But just in case, here it is:
*** EDIT ***
There is a way, and now I'm at 7.0 seconds (down from 7.8). I found a post at SO about fast output. This is a modification of applicative's output solution.
import qualified Data.ByteString.Lazy.Builder as BB
import qualified Data.ByteString.Lazy.Char8 as DB
import qualified Data.Function as DF
import qualified Data.List as DL
import Control.Applicative
import Data.Monoid
import System.IO
groupByEqOn :: Eq b => (a -> b) -> [a] -> [[a]]
groupByEqOn fn = DL.groupBy ((==) `DF.on` fn)
filterLengthBy :: (Int -> Bool) -> [[a]] -> [[a]]
filterLengthBy fn = filter (flb fn)
    where flb fn xs = fn $ length xs
tensAndHundreds :: [[[Int]]] -> [(Int, Int)]
tensAndHundreds [] = []
tensAndHundreds (x:xs)
    | length x /= 10 = tens x ++ tensAndHundreds xs
    | otherwise = (head $ head x, 100) : tensAndHundreds xs
tens :: [[Int]] -> [(Int, Int)]
tens = map (\x -> (head x, 10))
dbIntVals :: DB.ByteString -> [Int]
dbIntVals xs =
    case (DB.readInt xs) of
        Just (x', xs') -> x' : dbIntVals (DB.tail xs')
        Nothing -> if xs /= DB.empty
                       then dbIntVals (DB.tail xs)
                       else []
generateResults :: [Int] -> [(Int, Int)]
generateResults xs = tensAndHundreds
    $ groupByEqOn ((`quot` 100) . head)
    $ filterLengthBy (== 10)
    $ groupByEqOn (`quot` 10) xs
displayResults xs = BB.hPutBuilder stdout
    $ mconcat $ map (\(a,b) -> (BB.intDec a
                                <> BB.char7 ' '
                                <> BB.intDec b
                                <> BB.char7 '\n')) xs
main :: IO ()
main = dbIntVals <$> DB.getContents >>=
    displayResults . generateResults
Xyne, check out dbIntVals. It started life from me learning from your readInts, but it contains that last change that brought 0.9 seconds of performance. This will work in your code as well. Essentially, I took what was your readInts function and combined it with BS.lines. You don't need the BS.lines function at all if you utilize the ByteString part of the Just value. Let me know if you update so that I can test and obsess about the numbers.
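Spelled out, the difference is roughly this (a standalone sketch with made-up names, using lazy ByteStrings like my code above):
import qualified Data.ByteString.Lazy.Char8 as DB

-- split into lines first, then parse each line:
intsViaLines :: DB.ByteString -> [Int]
intsViaLines = map parse . DB.lines
  where parse l = case DB.readInt l of
                      Just (x, _) -> x
                      Nothing     -> 0   -- placeholder; the real input is all numbers

-- let readInt walk the stream itself and skip the newline it stops on
-- (assumes every number is followed by a newline):
intsViaReadInt :: DB.ByteString -> [Int]
intsViaReadInt bs = case DB.readInt bs of
    Just (x, rest) -> x : intsViaReadInt (DB.tail rest)
    Nothing        -> []

main :: IO ()
main = do
    let sample = DB.pack "3\n4\n5\n"
    print (intsViaLines sample)     -- [3,4,5]
    print (intsViaReadInt sample)   -- [3,4,5]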
And once again Xyne, the output changes will help you as well.
Last edited by skottish (2014-07-18 21:43:07)
Offline
@skottish
Nice
I have included the changes in my post above. On a list of 89,997,960 elements (generated with the gen.py script above, with 10,000,000 as input), I get the following times (both compiled with ghc -O2 -fllvm):
$ time ./skottish < data.txt > skottish.output
real 0m11.862s
user 0m11.610s
sys 0m0.263s
$time ./xyne < data.txt > xyne.output
real 0m9.276s
user 0m9.007s
sys 0m0.277s
The output is identical.
While updating my code to use the builder I found no difference between using one builder per line (i.e. multiple calls to hPutBuilder) and one builder for the full output.
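In other words, on this data there was no measurable difference between the following two strategies (a standalone toy example; lineBuilders just stands in for the real per-line builders):
import Data.ByteString.Builder (Builder, hPutBuilder, intDec, char7)
import Data.Monoid ((<>), mconcat)
import System.IO (stdout)

lineBuilders :: [Builder]
lineBuilders = [ intDec n <> char7 ' ' <> intDec 10 <> char7 '\n' | n <- [0, 10 .. 40] ]

main :: IO ()
main = do
    mapM_ (hPutBuilder stdout) lineBuilders      -- one hPutBuilder call per line
    hPutBuilder stdout (mconcat lineBuilders)    -- versus one call for the whole output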
edit: same thing with 56,653,076 elements
time ./skottish < subdata.txt > skottish.output
real 0m7.442s
user 0m7.257s
sys 0m0.190s
time ./xyne < subdata.txt > xyne.output
real 0m5.909s
user 0m5.763s
sys 0m0.150s
Last edited by Xyne (2014-07-19 15:22:01)
My Arch Linux Stuff • Forum Etiquette • Community Ethos - Arch is not for everyone
Offline
Xyne,
You're at 5.7 seconds on my machine now! Sweet! I'm going to continue to try to speed up my code without using counters for a bit longer. If I don't succeed, I'm quite happy with how things are right now.
While updating my code to use the builder I found no difference between using one builder per line (i.e. multiple calls to hPutBuilder) and one builder for the full output.
*** EDIT ***
I found a faster version of groupBy from here and tried again with 'one builder per line', and now I'm at 5.8 seconds:
import qualified Data.ByteString.Lazy.Builder as BB
import qualified Data.ByteString.Lazy.Char8 as DB
import qualified Data.Function as DF
import qualified Data.List as DL
import Control.Applicative
import Data.Monoid
import System.IO
groupBy :: (a -> a -> Bool) -> [a] -> [[a]]
groupBy rel [] = []
groupBy rel (x:xs) = (x:ys) : groupBy rel zs
    where (ys,zs) = groupByAux x xs
          groupByAux x0 (x:xs)
              | rel x0 x = (x:ys, zs)
              where (ys,zs) = groupByAux x xs
          groupByAux y xs = ([], xs)
filterLengthBy :: (Int -> Bool) -> [[a]] -> [[a]]
filterLengthBy fn = filter (flb fn)
    where flb fn xs = fn $ length xs
outBB x y = BB.intDec x <> BB.char7 ' ' <> BB.intDec y <> BB.char7 '\n'
tensAndHundreds :: [[[Int]]] -> [BB.Builder]
tensAndHundreds [] = []
tensAndHundreds (x:xs)
    | length x /= 10 = tens x ++ tensAndHundreds xs
    | otherwise = outBB (head $ head x) 100 : tensAndHundreds xs
tens :: [[Int]] -> [BB.Builder]
tens = map (\x -> outBB (head x) 10)
dbIntVals :: DB.ByteString -> [Int]
dbIntVals xs =
    case (DB.readInt xs) of
        Just (x', xs') -> x' : dbIntVals (DB.tail xs')
        Nothing -> if xs /= DB.empty
                       then dbIntVals (DB.tail xs)
                       else []
generateResults :: [Int] -> [BB.Builder]
generateResults xs = tensAndHundreds
    $ groupBy ((==) `DF.on` (`quot` 100) . head)
    $ filterLengthBy (== 10)
    $ groupBy ((==) `DF.on` (`quot` 10)) xs
displayResults :: [BB.Builder] -> IO ()
displayResults xs = BB.hPutBuilder stdout
    $ mconcat xs
main :: IO ()
main = dbIntVals <$> DB.getContents >>=
    displayResults . generateResults
Last edited by skottish (2014-07-20 02:50:42)
Offline
edit
v2: assume presence of newline after each number and skip them with BS.tail as suggested by skottish below
/edit
Here's a new version using a different approach based on the ByteString Builder:
{-# OPTIONS_GHC -fwarn-unused-imports #-}
import qualified Data.ByteString.Lazy.Char8 as BS
import Data.ByteString.Builder as BB
import Data.Monoid
import System.IO
readInts :: BS.ByteString -> [Int]
readInts b =
    case (BS.readInt b) of
        (Just (x, b')) -> x : (readInts $ BS.tail b')
        _ -> []
getBuilder' :: BB.Builder -> BB.Builder -> BB.Builder -> Bool -> Int -> [Int] -> BB.Builder
getBuilder' tens _ _ _ _ [] = tens
getBuilder' tens ten hundred isHundred n (x:xs) =
    let
        t = rem x 10
        h = rem x 100
        isSeq = (x - n) == 1
        isHundred' = isHundred && isSeq
        ten' =
            BB.intDec x <>
            BB.char7 ' ' <>
            BB.intDec 10 <>
            BB.char7 '\n'
        tens' = tens <> ten
        hundred' =
            BB.intDec x <>
            BB.char7 ' ' <>
            BB.intDec 100 <>
            BB.char7 '\n'
    in
        if t == 9 && isSeq
            then
                if h == 99
                    then
                        if isHundred
                            then
                                hundred <> getBuilder' mempty mempty mempty False x xs
                            else
                                tens' <> getBuilder' mempty mempty mempty False x xs
                    else
                        getBuilder' tens' mempty hundred isHundred x xs
            else
                if t == 0
                    then
                        if h == 0
                            then
                                tens <> getBuilder' mempty ten' hundred' True x xs
                            else
                                getBuilder' tens ten' hundred isHundred' x xs
                    else
                        if isSeq
                            then
                                getBuilder' tens ten hundred isHundred x xs
                            else
                                tens <> getBuilder' mempty mempty mempty False x xs
getBuilder :: [Int] -> BB.Builder
getBuilder = getBuilder' mempty mempty mempty False (-1)
main :: IO ()
main =
    BS.getContents >>=
    hPutBuilder stdout . getBuilder . readInts
Results for 50M entries (your new script is faster than my old one on my system):
time ./skottish < subset.txt > skottish.output
real 0m5.004s
user 0m4.883s
sys 0m0.127s
time ./xyne < subset.txt > xyne.output
real 0m5.194s
user 0m5.017s
sys 0m0.180s
time ./xyne-new < subset.txt > xyne-new.output
real 0m3.723s
user 0m3.603s
sys 0m0.120s
Last edited by Xyne (2014-07-20 18:02:05)
My Arch Linux Stuff • Forum Etiquette • Community Ethos - Arch is not for everyone
Offline
Xyne,
When I saw this version, I had to do a double take just to figure out where all of your code went! While I don't have anything to offer with my code yet, if you replace this in readInts, you'll be significantly faster:
(Just (x, b')) -> x : (readInts b')
with:
(Just (x, b')) -> x : readInts (BS.tail b')
Every valid int has a newline at the beginning of the ByteString part of the Just. tail keeps you from iterating over them. In this case, we're talking about millions of newlines.
Last edited by skottish (2014-07-21 00:39:24)
Offline
Thanks. That made the code about 7.5% faster
Note: the getBuilder' function can be modified to use a case statement but this is slower on my system:
getBuilder' :: BB.Builder -> BB.Builder -> BB.Builder -> Bool -> Int -> [Int] -> BB.Builder
getBuilder' tens _ _ _ _ [] = tens
getBuilder' tens ten hundred isHundred n (x:xs) =
    let
        t = rem x 10
        h = rem x 100
        isSeq = (x - n) == 1
        isHundred' = isHundred && isSeq
        ten' =
            BB.intDec x <>
            BB.char7 ' ' <>
            BB.intDec 10 <>
            BB.char7 '\n'
        tens' = tens <> ten
        hundred' =
            BB.intDec x <>
            BB.char7 ' ' <>
            BB.intDec 100 <>
            BB.char7 '\n'
    in
        case (t, h, isSeq, isHundred) of
            (_, 99, True, True) -> hundred <> getBuilder' mempty mempty mempty False x xs
            (_, 99, True, False) -> tens' <> getBuilder' mempty mempty mempty False x xs
            (9, _, True, _) -> getBuilder' tens' mempty hundred isHundred x xs
            (_, 0, _, _) -> tens <> getBuilder' mempty ten' hundred' True x xs
            (0, _, _, _) -> getBuilder' tens ten' hundred isHundred' x xs
            (_, _, True, _) -> getBuilder' tens ten hundred isHundred x xs
            _ -> tens <> getBuilder' mempty mempty mempty False x xs
Last edited by Xyne (2014-07-20 18:02:42)
My Arch Linux Stuff • Forum Etiquette • Community Ethos - Arch is not for everyone
Offline
The horse is still breathing, so I'll continue...
It was suggested in the Stack Overflow post that I linked to that laziness may be hurting my performance. I tried some experiments before and it didn't seem to help, but I revisited the idea and it helps a lot. My current version is running at 5.1 seconds with my primary test set.
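For anyone following along, the textbook illustration of what lazy accumulation can cost is the foldl versus foldl' pair (a separate toy example, not part of the program below):
import Data.List (foldl')

main :: IO ()
main = do
    print (foldl  (+) 0 [1 .. 1000000 :: Int])   -- lazy: builds a deep chain of (+) thunks first
    print (foldl' (+) 0 [1 .. 1000000 :: Int])   -- strict: forces the running sum at each step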
import qualified Data.ByteString.Builder as BB
import qualified Data.ByteString.Char8 as DB
import qualified Data.Function as DF
import Control.Applicative
import Data.Monoid
import System.IO
groupBy :: (a -> a -> Bool) -> [a] -> [[a]]
groupBy rel [] = []
groupBy rel (x:xs) = (x:ys) : groupBy rel zs
    where (ys,zs) = groupByAux x xs
          groupByAux x0 (x:xs)
              | rel x0 x = (x:ys, zs)
              where (ys,zs) = groupByAux x xs
          groupByAux y xs = ([], xs)
filterLengthBy :: (Int -> Bool) -> [[a]] -> [[a]]
filterLengthBy fn = filter (flb fn)
    where flb fn xs = fn $ length xs
buildOutput x y = BB.intDec x <> BB.char7 ' ' <> BB.intDec y <> BB.char7 '\n'
tensAndHundreds :: [[[Int]]] -> [BB.Builder]
tensAndHundreds [] = []
tensAndHundreds (x:xs)
    | length x /= 10 = tens x ++ tensAndHundreds xs
    | otherwise = buildOutput (head $ head x) 100 : tensAndHundreds xs
tens :: [[Int]] -> [BB.Builder]
tens = foldr (\x acc -> buildOutput (head x) 10 : acc) []
dbIntVals :: DB.ByteString -> [Int]
dbIntVals xs =
    case (DB.readInt xs) of
        Just (x', xs') -> x' : dbIntVals (DB.tail xs')
        Nothing -> if xs == DB.empty
                       then []
                       else dbIntVals (DB.tail xs)
generateResults :: [Int] -> [BB.Builder]
generateResults xs = tensAndHundreds
    $! groupBy ((==) `DF.on` (`quot` 100) . head)
    $ filterLengthBy (== 10)
    $ groupBy ((==) `DF.on` (`quot` 10)) xs
displayResults :: [BB.Builder] -> IO ()
displayResults xs = BB.hPutBuilder stdout $ mconcat xs
main :: IO ()
main = DB.getContents >>=
    displayResults . generateResults . dbIntVals
Xyne,
Once again part of what I did will make your code significantly faster. The main thing is to not use the lazy version of Data.ByteString.Char8. I'm seeing your code run at 4.2 seconds here. I'm using the version from this post.
Just for fun I wrote a preliminary version in Clojure based on my last algorithm. Not surprisingly, considering how inefficient the idea is, it's "slow" at 28 seconds:
(ns arch-q.core
  (:gen-class)
  (:require [clojure.java.io :as io :only (reader)]))
(defn tens
  [coll]
  (map #(vector (first %) 10) coll))
(defn tens-and-hundreds
  [coll]
  (lazy-seq
    (when-let [[fst & rst] coll]
      (if (not= (count fst) 10)
        (concat (tens fst) (tens-and-hundreds rst))
        (cons [(ffirst fst) 100] (tens-and-hundreds rst))))))
(defn generate-results
  [coll]
  (tens-and-hundreds
    (partition-by #(quot (first %) 100)
      (filter #(= 10 (count %))
        (partition-by #(quot % 10)
          (map #(Long/parseLong %) coll))))))
(defn -main
  []
  (with-open [rdr (io/reader *in*)]
    (doseq [out (generate-results (line-seq rdr))]
      (println (first out) (second out)))))
Now I'm motivated to write a more efficient version just to see if I can get Clojure running closer to the times that the other versions are doing. Clojure is a dynamic language running on the JVM -- which is designed for static languages -- and I'd like to get the run time down. It's really a great, performant language, and I feel that I can do better with it.
Last edited by skottish (2014-07-27 23:47:46)
Offline
Here is the comparison of the lazy version from above (xyne-new) and the equivalent non-lazy version, where the only change is to replace
import qualified Data.ByteString.Lazy.Char8 as BS
with
import qualified Data.ByteString.Char8 as BS
, using a sample file with 50M entries:
time ./xyne-new < subset.txt > xyne-new.output
real 0m3.841s
user 0m3.690s
sys 0m0.153s
time ./xyne-new2 < subset.txt > xyne-new2.output
real 0m3.398s
user 0m3.220s
sys 0m0.180s
It is faster but the non-lazy version of getContents loads the entire file into memory, which is 420 MiB.
When I have time I hope to contribute a version in yet another language as a learning exercise.
My Arch Linux Stuff • Forum Etiquette • Community Ethos - Arch is not for everyone
Offline