I would be happy to hear your comments and suggestions.
Please feel free to drop me a note via twitter, email, or google+.

# Day 6 - One Python Benchmark per Day¶

## Determining if a string is a number¶

For this benchmark, we will be testing different approaches to determine is a string is a number.

Note that methods like .isdigit() and .isnumeric() are only evaluating if a string in an integer,
e.g., both '1.2'.isdigit() and '1.2'.isnumeric() return False.

## Functions¶

In [1]:
def is_number_tryexcept(s):
""" Returns True is string is a number. """
try:
float(s)
return True
except ValueError:
return False

In [2]:
from re import match as re_match

def is_number_regex(s):
""" Returns True is string is a number. """
if re_match("^\d+?\.\d+?$", s) is None: return s.isdigit() return True  In [3]: from re import compile as re_compile comp = re_compile("^\d+?\.\d+?$")

def compiled_regex(s):
""" Returns True is string is a number. """
if comp.match(s) is None:
return s.isdigit()
return True

In [4]:
def is_number_repl_isdigit(s):
""" Returns True is string is a number. """
return s.replace('.','',1).isdigit()


Quick note on why I am importing re.search as re_search:
To decrease the overhead for the lookup.

In [5]:
import re

%timeit re_match("^\d+?\.\d+?$", '1.2345') %timeit re.match("^\d+?\.\d+?$", '1.2345')

1000000 loops, best of 3: 1.3 µs per loop
1000000 loops, best of 3: 1.36 µs per loop


## Testing the functions for correct behavior¶

In [6]:
a_float = '1.1234'
inv_float = '1.12.34'
no_number = 'abc123'
an_int = '12345'

funcs = [
is_number_tryexcept,
is_number_regex,
compiled_regex,
is_number_repl_isdigit
]

for f in funcs:
assert (f(a_float) == True), 'Error in %s(%s)' %(f.__name__, a_float)
assert (f(inv_float) == False), 'Error in %s(%s)' %(f.__name__, inv_float)
assert (f(no_number) == False), 'Error in %s(%s)' %(f.__name__, no_number)
assert (f(an_int) == True), 'Error in %s(%s)' %(f.__name__, an_int)
print('ok')

ok


## Special cases where those functions do not work¶

In [7]:
a_float = '.1234'

print('Float notation ".1234" is not supported by:')
for f in funcs:
if not f(a_float):
print('\t -', f.__name__)

Float notation ".1234" is not supported by:
- is_number_regex
- compiled_regex


### Scientific notations¶

In [8]:
scientific1 = '1.000000e+50'
scientific2 = '1e50'

print('Scientific notation "1.000000e+50" is not supported by:')
for f in funcs:
if not f(scientific1):
print('\t -', f.__name__)

print('Scientific notation "1e50" is not supported by:')
for f in funcs:
if not f(scientific2):
print('\t -', f.__name__)

Scientific notation "1.000000e+50" is not supported by:
- is_number_regex
- compiled_regex
- is_number_repl_isdigit
Scientific notation "1e50" is not supported by:
- is_number_regex
- compiled_regex
- is_number_repl_isdigit


## Timing the functions¶

In [9]:
import timeit

test_cases = ['1.12345', '1.12.345', 'abc12345', '12345']
times_n = {f.__name__:[] for f in funcs}

for t in test_cases:
for f in funcs:
f = f.__name__
times_n[f].append(min(timeit.Timer('%s(t)' %f,
'from __main__ import %s, t' %f)
.repeat(repeat=3, number=1000000)))


### Preparing the plots¶

In [10]:
%matplotlib inline

In [11]:
import platform
import multiprocessing

def print_sysinfo():

print('\nPython version  :', platform.python_version())
print('compiler        :', platform.python_compiler())

print('\nsystem     :', platform.system())
print('release    :', platform.release())
print('machine    :', platform.machine())
print('processor  :', platform.processor())
print('CPU count  :', multiprocessing.cpu_count())
print('interpreter:', platform.architecture()[0])
print('\n\n')

In [31]:
from numpy import arange
import matplotlib.pyplot as plt

def plot():
labels = [('is_number_tryexcept','Try-except method'),
('is_number_regex', 'Regular expression'),
('compiled_regex', 'Compiled regular expression'),
('is_number_repl_isdigit', 'replace-isdigit method')
]

x_labels = ['float: "1.12345"',
'invalid float: "1.12.345"',
'no number: "abc12345"',
'integer: "12345"']

plt.rcParams.update({'font.size': 12})

ind = arange(len(test_cases))  # the x locations for the groups
width = 0.2

fig = plt.figure(figsize=(12,8))
colors = [(0,'c'), (1,'b'), (2,'g'), (3,'r')]

for l,c in zip(labels,colors):
ax.bar(ind + c[0]*width,
times_n[l[0]],
width,
alpha=0.5,
color=c[1],
label=l[1])
plt.grid()
ax.set_ylabel('time in microseconds')
ax.set_title('Methods for determening if a string is a number')
ax.set_xticks(ind + width)
ax.set_xticklabels(x_labels)
plt.legend(loc='upper right', fontsize=13)
plt.xlim([-0.2, 3.8])
plt.ylim([-0, 2])
plt.show()


# Results¶

In [32]:
print_sysinfo()
plot()

Python version  : 3.4.0
compiler        : GCC 4.2.1 (Apple Inc. build 5577)

system     : Darwin
release    : 13.2.0
machine    : x86_64
processor  : i386
CPU count  : 4
interpreter: 64bit



## Conclusion¶

The try-except approach appears to be slower than the replace-isdigit method for cases where the string is not a number - executing the except-block seems to be very costly. However, we have to consider that it also works for special cases like:

• '1.000000e+50'
• '1e50'
• '.12345'

So it really depends on the data set which method to choose - the more thorough "try-except" or the faster "replace-isdigit"