Julia and Statistical Computing

Why Use Julia?

Additional Reasons to Use Julia

  • Clean language design:
    • Conservative, highly selective choice of features
    • Appropriate for both prototype and production code
    • Familiar C-like syntax
    • Strong metaprogramming facilities derived from Lisp
  • Smooth interoperability with C/Fortran
  • Growing, dedicated community
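As a quick illustration of the C interoperability point, a `ccall` invokes a C function directly, with no wrapper code; a minimal sketch, assuming the `cos` symbol from the system math library is visible in the running process:

```julia
# Call C's cos via ccall: (symbol, return type, tuple of argument types, args...)
x = ccall(:cos, Cdouble, (Cdouble,), 0.0)
# cos(0.0) is 1.0, computed by the C library, not by Julia code
```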

Why You Shouldn't Use Julia

  • Julia is very young
  • The Julia package ecosystem is even younger
  • Breaking changes are still coming to the core language, and will be even more frequent outside of core Julia
  • Language features are still being added: your favorite may not exist yet
  • Code quality for packages varies from reasonably well tested to never tested to broken

So... Should You Use Julia?

That depends on your use case:

  • If you tend to build lots of tools from scratch, Julia is usable, but a little rough
  • If you tend to build upon lots of other packages, Julia isn't ready for you yet

The Basics of Julia

Julia has all of the basic types and functions you need to do math

In [1]:
# Integers
1
Out[1]:
1
In [2]:
# Floating point numbers
1.0
Out[2]:
1.0
In [3]:
# Strings
"a"
Out[3]:
"a"
In [4]:
# Complex numbers
1 + 2im
Out[4]:
1 + 2im
In [5]:
# Vectors
[1, 2, 3]
Out[5]:
3-element Array{Int64,1}:
 1
 2
 3
In [6]:
# Matrices
[1.0 2.0; 3.1 4.2]
Out[6]:
2x2 Array{Float64,2}:
 1.0  2.0
 3.1  4.2
In [7]:
# Elementary functions
sin(1.0)
Out[7]:
0.8414709848078965

Julia's Type System

In [8]:
subtypes(Any)[1:4]
Out[8]:
4-element Array{Any,1}:
 AbstractArray{T,N}
 AbstractCmd       
 AbstractRNG       
 Algorithm         
In [9]:
subtypes(Number)
Out[9]:
5-element Array{Any,1}:
 Complex{Float16}
 Complex{Float32}
 Complex{Float64}
 Complex{T<:Real}
 Real            
In [10]:
subtypes(Real)
Out[10]:
4-element Array{Any,1}:
 FloatingPoint       
 Integer             
 MathConst{sym}      
 Rational{T<:Integer}
In [11]:
subtypes(Integer)
Out[11]:
5-element Array{Any,1}:
 BigInt  
 Bool    
 Char    
 Signed  
 Unsigned
In [12]:
subtypes(None)
no method subtypes(Type{None})
while loading In[12], in expression starting on line 1

Why Are Types Important?

  • Abstraction: Assert that a computation works for many related types of objects
  • Parametric Polymorphism: Function behavior can be defined over many types at once
  • Ad Hoc Polymorphism: Function behavior can vary systematically across types
  • Type Safety: How should a function work if the wrong type of argument is used?

Abstraction / Parametric Polymorphism

In [1]:
sqrt(2)
Out[1]:
1.4142135623730951
In [2]:
sqrt(2.0)
Out[2]:
1.4142135623730951

Ad Hoc Polymorphism

In [3]:
1 + 1
Out[3]:
2
In [4]:
1.0 + 1.0
Out[4]:
2.0

Type Safety

In [5]:
"wat" + 1
no method +(ASCIIString, Int64)
while loading In[5], in expression starting on line 1
In [6]:
1 + "wat"
no method +(Int64, ASCIIString)
while loading In[6], in expression starting on line 1
In [7]:
"wat" - 1
no method -(ASCIIString, Int64)
while loading In[7], in expression starting on line 1
In [8]:
1 - "wat"
no method -(Int64, ASCIIString)
while loading In[8], in expression starting on line 1

Multiple Dispatch

The meaning of a function can depend upon the types of its inputs

In [9]:
function foo(a, b)
    return a + b
end
Out[9]:
foo (generic function with 1 method)
In [10]:
foo(1, 2)
Out[10]:
3
In [11]:
foo(1.0, 2)
Out[11]:
3.0
In [12]:
bar(a, b) = a * b
Out[12]:
bar (generic function with 1 method)
In [13]:
bar(2, 3)
Out[13]:
6
In [14]:
bar(1, "wat")
no method *(Int64, ASCIIString)
while loading In[14], in expression starting on line 1
 in bar at In[12]:1
In [15]:
bar("NaN", "NaN")
Out[15]:
"NaNNaN"
In [16]:
folly(a::Integer, b::Integer) = 1
Out[16]:
folly (generic function with 1 method)
In [17]:
folly(a::Float64, b::Float64) = 2
Out[17]:
folly (generic function with 2 methods)
In [18]:
folly(1, 2)
Out[18]:
1
In [19]:
folly(1.0, 2.0)
Out[19]:
2
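Each definition above adds one entry to the generic function's method table, which `methods` can inspect; a small sketch that redefines the same two methods:

```julia
# Two methods of one generic function, distinguished by argument types
folly(a::Integer, b::Integer) = 1
folly(a::Float64, b::Float64) = 2

n = length(methods(folly))  # the method table now has two entries
```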

User-Defined Types

In [20]:
type NewType
    a::Int
    b::UTF8String
end
In [21]:
x = NewType(1, "this is a string")
Out[21]:
NewType(1,"this is a string")
In [22]:
x.a, x.b
Out[22]:
(1,"this is a string")

Modules / Namespaces

In [23]:
module NewNamespace
    c = 42
    export c
end
In [24]:
c
c not defined
while loading In[24], in expression starting on line 1
In [25]:
using NewNamespace
In [26]:
c
Out[26]:
42
In [27]:
c = 41
Out[27]:
41
Warning: imported binding for c overwritten in module Main
In [28]:
NewNamespace.c
Out[28]:
42

What's the Catch?

If Julia looks like most dynamic languages, why is it so fast?

Designing for Efficiency

  • Julia's strict language definition makes it possible for the Julia compiler to do a lot of work:
    • Infer very specific types for variables
    • Perform static analysis of code
    • Provide custom binary implementations of high-level constructs based on specific types
    • Represent data in a maximally efficient way
  • Lots of valid code in Python and Ruby is illegal in Julia
  • Every Julia type has an obvious deconstruction into machine types
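The last point can be checked directly: concrete numeric types have a fixed, C-compatible machine layout. A sketch (`isbitstype` is the current spelling of this check; older Julia releases spelled it `isbits`):

```julia
# Concrete numeric types map straight onto machine representations
a = sizeof(Int64)                  # 8 bytes, one machine word
b = sizeof(Complex{Float64})       # 16 bytes: two adjacent Float64 fields
c = isbitstype(Complex{Float64})   # true: plain bits, no pointers or boxes
```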

Static Analysis

Julia generates a custom function implementation for every input type signature

In [29]:
foo(a, b) = a + b
Out[29]:
foo (generic function with 1 method)
In [30]:
code_llvm(foo, (Int, Int))
define i64 @julia_foo1292(i64, i64) {
top:
  %2 = add i64 %1, %0, !dbg !3947
  ret i64 %2, !dbg !3947
}
In [31]:
code_llvm(foo, (Float64, Float64))
define double @julia_foo1305(double, double) {
top:
  %2 = fadd double %0, %1, !dbg !3986
  ret double %2, !dbg !3986
}
In [32]:
code_native(foo, (Int, Int))
	.section	__TEXT,__text,regular,pure_instructions
Filename: In[29]
Source line: 1
	push	RBP
	mov	RBP, RSP
Source line: 1
	add	RDI, RSI
	mov	RAX, RDI
	pop	RBP
	ret
In [33]:
code_native(foo, (Float64, Float64))
	.section	__TEXT,__text,regular,pure_instructions
Filename: In[29]
Source line: 1
	push	RBP
	mov	RBP, RSP
Source line: 1
	vaddsd	XMM0, XMM0, XMM1
	pop	RBP
	ret

For simple functions, Julia's specialized implementation is similar to a natural C implementation

As Julia gets smarter, its static analysis tools improve all of the code you write:

In [34]:
function foo()
	r = 0
	for i in 1:12
		r += 2
	end
	return r
end

code_llvm(foo, ())
define i64 @julia_foo1307() {
pass2:
  ret i64 24, !dbg !3993
}

What You Lose

To make static analysis effective, Julia rules out certain types of code:

  • Reification of scope
  • Evaluating code in arbitrary scopes
  • Unexpected changes of type
  • Mature community and package system

R's reification of scope is ruled out by Julia's design:

f1 <- function(scope.number) {
    ls(env = sys.frame(scope.number))
}

f2 <- function() {
    a <- 1
    my.scope <- sys.nframe()
    f1(my.scope)
}

f2()

R's ability to mutate ad hoc scopes is ruled out by Julia's design:

g1 <- function(var.name, scope.number)
{
    assign("a", -1, envir = sys.frame(scope.number))
}

g2 <- function()
{
    a <- 1
    scope.number <- sys.nframe()
    print(a)
    g1("a", scope.number)
    print(a)
}

g2()

R's type mutation by assignment is ruled out by Julia's design:

v <- c(1, 2, 3)
v[1] <- "a"
print(v)
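The equivalent assignment fails in Julia, because an array's element type is fixed when the array is constructed; a sketch of the contrast:

```julia
v = [1.0, 2.0, 3.0]   # a Vector{Float64}: the element type cannot change
threw = false
try
    v[1] = "a"        # throws: a String cannot be converted to Float64
catch err
    threw = true
end
```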

To make defaults efficient, Julia assumes you want to work close to the metal:

  • Machine precision integers, not infinite precision integers
In [35]:
function fib{T <: Integer}(n::T)
    if n == zero(T)
        return zero(T)
    elseif n == one(T)
        return one(T)
    else
        a, b = zero(T), one(T)
        i = 1
        while i < n
            a, b = b, a + b
            i += 1
        end
        return b
    end
end
Out[35]:
fib (generic function with 1 method)
In [36]:
fib(95)
Out[36]:
-4953053512429003327
In [37]:
fib(BigInt(95))
Out[37]:
31940434634990099905
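The negative result from `fib(95)` is integer overflow: the true value exceeds `typemax(Int64)`, and machine integers wrap around modulo 2^64 rather than raising an error. A quick check:

```julia
# Machine integers silently wrap on overflow
m = typemax(Int64)            # 9223372036854775807
wrapped = m + 1               # wraps to typemin(Int64), no error is raised
```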

Statistics in Julia

  • StatsBase.jl (Mature, but incomplete)
  • DataArrays.jl (Usable, but immature)
  • DataFrames.jl (Usable, but immature)
  • Distributions.jl (Mature, complete)
  • Gadfly.jl (Mature, but incomplete)
  • GLM.jl (Usable, but immature)
  • Optim.jl, NLopt.jl, DualNumbers.jl, Calculus.jl, JuMP.jl (Mature, fairly complete)

StatsBase

In [38]:
using StatsBase
In [39]:
modes([1, 1, 2, 2, 3])
Out[39]:
2-element Array{Int64,1}:
 2
 1
In [40]:
corspearman(rand(100), rand(100))
Out[40]:
-0.08795679567956798

DataArrays

In [41]:
using DataArrays
In [42]:
da = @data([1, 2, NA, 4])
Out[42]:
4-element DataArray{Int64,1}:
 1  
 2  
  NA
 4  

DataFrames

In [43]:
using DataFrames
In [44]:
df = DataFrame(
    A = [1, 2, 3],
    B = ["a", "b", "c"],
    C = [1//2, 3//4, 5//6]
)
Out[44]:
| Row | A | B   | C    |
|-----|---|-----|------|
| 1   | 1 | "a" | 1//2 |
| 2   | 2 | "b" | 3//4 |
| 3   | 3 | "c" | 5//6 |

Distributions

In [45]:
using Distributions
In [46]:
srand(1)
In [47]:
x = rand(Normal(10, 1), 10)
Out[47]:
10-element Array{Float64,1}:
 10.6701 
 10.5509 
  9.93663
 11.3369 
  9.92685
  9.25454
  8.77994
  9.94682
  9.83486
  7.88463
In [48]:
pdf(Normal(10, 1), 1.0)
Out[48]:
1.0279773571668915e-18
In [49]:
loglikelihood(Normal(10, 1), x)
Out[49]:
-13.738557913500014

RDatasets

In [50]:
using RDatasets
In [51]:
iris = dataset("datasets", "iris")
Out[51]:
| Row | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
|-----|-------------|------------|-------------|------------|---------|
| 1   | 5.1         | 3.5        | 1.4         | 0.2        | setosa  |
| 2   | 4.9         | 3.0        | 1.4         | 0.2        | setosa  |
| 3   | 4.7         | 3.2        | 1.3         | 0.2        | setosa  |
| 4   | 4.6         | 3.1        | 1.5         | 0.2        | setosa  |
| 5   | 5.0         | 3.6        | 1.4         | 0.2        | setosa  |
| 6   | 5.4         | 3.9        | 1.7         | 0.4        | setosa  |
| 7   | 4.6         | 3.4        | 1.4         | 0.3        | setosa  |
| 8   | 5.0         | 3.4        | 1.5         | 0.2        | setosa  |
| 9   | 4.4         | 2.9        | 1.4         | 0.2        | setosa  |
| 10  | 4.9         | 3.1        | 1.5         | 0.1        | setosa  |
| 11  | 5.4         | 3.7        | 1.5         | 0.2        | setosa  |
| 12  | 4.8         | 3.4        | 1.6         | 0.2        | setosa  |
| 13  | 4.8         | 3.0        | 1.4         | 0.1        | setosa  |
| 14  | 4.3         | 3.0        | 1.1         | 0.1        | setosa  |
| 15  | 5.8         | 4.0        | 1.2         | 0.2        | setosa  |
| 16  | 5.7         | 4.4        | 1.5         | 0.4        | setosa  |
| 17  | 5.4         | 3.9        | 1.3         | 0.4        | setosa  |
| 18  | 5.1         | 3.5        | 1.4         | 0.3        | setosa  |
| 19  | 5.7         | 3.8        | 1.7         | 0.3        | setosa  |
| 20  | 5.1         | 3.8        | 1.5         | 0.3        | setosa  |
| 21  | 5.4         | 3.4        | 1.7         | 0.2        | setosa  |
| 22  | 5.1         | 3.7        | 1.5         | 0.4        | setosa  |
| 23  | 4.6         | 3.6        | 1.0         | 0.2        | setosa  |
| 24  | 5.1         | 3.3        | 1.7         | 0.5        | setosa  |
| 25  | 4.8         | 3.4        | 1.9         | 0.2        | setosa  |
| 26  | 5.0         | 3.0        | 1.6         | 0.2        | setosa  |
| 27  | 5.0         | 3.4        | 1.6         | 0.4        | setosa  |
| 28  | 5.2         | 3.5        | 1.5         | 0.2        | setosa  |
| 29  | 5.2         | 3.4        | 1.4         | 0.2        | setosa  |
| 30  | 4.7         | 3.2        | 1.6         | 0.2        | setosa  |

Optim

In [52]:
using Optim
In [53]:
x = rand(Normal(31, 11), 1_000)
Out[53]:
1000-element Array{Float64,1}:
 22.3331
 41.9854
 25.5507
 35.0022
 38.4288
 37.6098
 14.4765
 26.8999
 16.957 
 38.0962
 51.9647
 13.5306
 34.2712
  ⋮     
 15.7977
 21.9261
 23.4996
 40.0516
 27.46  
 36.5742
 33.2723
 21.4413
 31.422 
 39.2109
 24.5928
 39.204 
In [54]:
function nll(theta)
    -loglikelihood(Normal(theta[1], exp(theta[2])), x)
end
Out[54]:
nll (generic function with 1 method)
In [55]:
nll([1.0, 1.0])
Out[55]:
70556.9217598127
In [56]:
fit = optimize(nll, [0.0, 0.0])
Out[56]:
Results of Optimization Algorithm
 * Algorithm: Nelder-Mead
 * Starting Point: [0.0,0.0]
 * Minimum: [30.87859403957906,2.4004058957100334]
 * Value of Function at Minimum: 3819.342967
 * Iterations: 54
 * Convergence: true
   * |x - x'| < NaN: false
   * |f(x) - f(x')| / |f(x)| < 1.0e-08: true
   * |g(x)| < NaN: false
   * Exceeded Maximum Number of Iterations: false
 * Objective Function Calls: 106
 * Gradient Call: 0
In [57]:
theta = fit.minimum
Out[57]:
2-element Array{Float64,1}:
 30.8786 
  2.40041
In [58]:
Normal(theta[1], exp(theta[2]))
Out[58]:
Normal( μ=30.87859403957906 σ=11.027651548809786 )
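For this particular model the MLE has a closed form, and Distributions.jl can compute it directly with `fit_mle`, skipping the numerical optimizer; a sketch using the current stdlib module names (`Random`, `Statistics`), which postdate the `srand`-era syntax shown above:

```julia
using Distributions, Random, Statistics

Random.seed!(1)
x = rand(Normal(31, 11), 1_000)

# Closed-form MLE for a Normal sample: the sample mean and the
# maximum-likelihood (biased, divide-by-n) standard deviation
d = fit_mle(Normal, x)
```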