Continuos ranges and missing data

In an effort to document techniques to deal with incomplete sequences of integers, this notebook shows how to obtain a list of continuous ranges of integers from a single sorted list, e.g. from this:

[1, 2, 3, 7, 8, 9, 10, 23, 24, 25, 27, 28, 35]

to this:

[1, 3], [7, 10], [23, 25], [27, 28], [35]

The resulting list is much more human readable and can be used to describe the corresponding values associated with each range. Isolated values will remain as such. This technique is integrated in https://github.com/steko/missing with other similar tools to list duplicates, missing values and such.

Data import

First of all let's import some data to work with:

In [2]:
nctn = open("nctn.txt").readlines()
nums = [int(n.strip()) for n in nctn]
nums = sorted(nums)

Take a look at our data:

In [4]:
print(len(nums))
1527
In [6]:
print(nums[:20])
[2238, 3319, 3324, 3328, 3329, 3330, 3363, 3372, 3376, 3378, 3380, 3382, 3386, 3391, 3392, 3410, 3412, 3413, 3416, 3423]

As you can see there are a few isolated numbers and then some ranges of two numbers. More long-spanning ranges are further down the list.

Find ranges

Define a function that will find the ranges for us:

In [8]:
def ranges(nums):
    '''Return a list of continuous ranges (min, max values) from a list of integers.'''
    
    rgs = [[nums[0]]]
    
    # these two make the code more readable and, incidentally, faster
    FIRST, LAST = 0, -1
    
    for n in nums:
        if rgs[LAST][LAST] == n - 1:
            if rgs[LAST][FIRST] == n -1:
                rgs[LAST].append(n)
            if rgs[LAST][FIRST] < n - 1:
                rgs[LAST][LAST] = n
        elif rgs[LAST][LAST] < n -1:
            rgs.append([n])
    return rgs

And now let's see what ranges we have in our data:

In [10]:
rgs = ranges(nums)
for r in rgs:
    try:
        print('{} - {}: {} numbers'.format(r[0], r[1], r[1] - r[0] + 1))
    except IndexError:
        print('{}: 1 number'.format(r[0]))
2238: 1 number
3319: 1 number
3324: 1 number
3328 - 3330: 3 numbers
3363: 1 number
3372: 1 number
3376: 1 number
3378: 1 number
3380: 1 number
3382: 1 number
3386: 1 number
3391 - 3392: 2 numbers
3410: 1 number
3412 - 3413: 2 numbers
3416: 1 number
3423: 1 number
3429: 1 number
3462: 1 number
3483: 1 number
3599 - 3602: 4 numbers
3675: 1 number
3689: 1 number
3768: 1 number
3776: 1 number
3786: 1 number
11865: 1 number
11867 - 11964: 98 numbers
13028 - 13087: 60 numbers
23032 - 23074: 43 numbers
47435 - 47443: 9 numbers
55644 - 55651: 8 numbers
55687: 1 number
55737: 1 number
55750 - 55775: 26 numbers
55777 - 55805: 29 numbers
55807 - 55878: 72 numbers
55880 - 55922: 43 numbers
55924 - 55945: 22 numbers
88038: 1 number
88050 - 88053: 4 numbers
88429 - 88826: 398 numbers
96146: 1 number
96149 - 96271: 123 numbers
96273 - 96275: 3 numbers
97763 - 97870: 108 numbers
98033: 1 number
104878 - 104899: 22 numbers
105396 - 105489: 94 numbers
105500: 1 number
105540 - 105569: 30 numbers
105571 - 105605: 35 numbers
105785 - 105821: 37 numbers
105823 - 105854: 32 numbers
105857 - 105926: 70 numbers
105928 - 105938: 11 numbers
105941 - 105945: 5 numbers
106704 - 106753: 50 numbers
107545 - 107561: 17 numbers
1004632 - 1004656: 25 numbers

Timing

The dataset we have been working with is rather small, but if you need to crunch big data a fast tool is really needed. Let's see how the ranges function performs:

In [11]:
%timeit rgs = ranges(nums)
1000 loops, best of 3: 806 ┬Ás per loop

Not bad for a naive algorithm like that.