This lecture explores some of the key packages for working with data and doing statistics in Julia
In particular, we will examine the DataFrame
object in detail (i.e., construction, manipulation, querying, visualization, and nuances like missing data)
While Julia is not an ideal language for pure cookie-cutter statistical analysis, it has many useful packages to provide those tools as part of a more general solution
This list is not exhaustive, and others can be found in organizations such as JuliaStats, JuliaData, and QueryVerse
using InstantiateFromURL
github_project("QuantEcon/quantecon-notebooks-julia", version = "0.2.0")
Activated /home/qebuild/repos/lecture-source-jl/_build/website/jupyter/Project.toml Info quantecon-notebooks-julia 0.1.0 activated, 0.2.0 requested
using LinearAlgebra, Statistics
using DataFrames, RDatasets, DataFramesMeta, CategoricalArrays, Query, VegaLite
using DataVoyager, GLM
┌ Info: Recompiling stale cache file /home/qebuild/.julia/compiled/v1.2/RDatasets/JyIbx.ji for RDatasets [ce6b1742-4840-55fa-b093-852dadbb1d8b] └ @ Base loading.jl:1240
A useful package for working with data is DataFrames.jl
The most important data type provided is a DataFrame
, a two dimensional array for storing heterogeneous data
Although data can be heterogeneous within a DataFrame
, the contents of the columns must be homogeneous
(of the same type)
This is analogous to a data.frame
in R, a DataFrame
in Pandas (Python) or, more loosely, a spreadsheet in Excel
There are a few different ways to create a DataFrame
The first is to set up columns and construct a dataframe by assigning names
using DataFrames, RDatasets # RDatasets provides good standard data examples from R
# note use of missing
commodities = ["crude", "gas", "gold", "silver"]
last_price = [4.2, 11.3, 12.1, missing]
df = DataFrame(commod = commodities, price = last_price)
commod | price | |
---|---|---|
String | Float64⍰ | |
1 | crude | 4.2 |
2 | gas | 11.3 |
3 | gold | 12.1 |
4 | silver | missing |
Columns of the DataFrame
can be accessed by name using a symbol df[!, :col]
or a struct-style df.col
, as below
df[!, :price]
4-element Array{Union{Missing, Float64},1}: 4.2 11.3 12.1 missing
df.price
4-element Array{Union{Missing, Float64},1}: 4.2 11.3 12.1 missing
Note that the type of this array has values Union{Missing, Float64}
since it was created with a missing
value
df.commod
4-element Array{String,1}: "crude" "gas" "gold" "silver"
The DataFrames.jl
package provides a number of methods for acting on DataFrame
’s, such as describe
DataFrames.describe(df)
variable | mean | min | median | max | nunique | nmissing | eltype | |
---|---|---|---|---|---|---|---|---|
Symbol | Union… | Any | Union… | Any | Union… | Union… | DataType | |
1 | commod | crude | silver | 4 | String | |||
2 | price | 9.2 | 4.2 | 11.3 | 12.1 | 1 | Float64 |
While often data will be generated all at once, or read from a file, you can add to a DataFrame
by providing the key parameters
nt = (commod = "nickel", price= 5.1)
push!(df, nt)
commod | price | |
---|---|---|
String | Float64⍰ | |
1 | crude | 4.2 |
2 | gas | 11.3 |
3 | gold | 12.1 |
4 | silver | missing |
5 | nickel | 5.1 |
Named tuples can also be used to construct a DataFrame
, and have it properly deduce all types
nt = (t = 1, col1 = 3.0)
df2 = DataFrame([nt])
push!(df2, (t=2, col1 = 4.0))
t | col1 | |
---|---|---|
Int64 | Float64 | |
1 | 1 | 3.0 |
2 | 2 | 4.0 |
As we discussed in fundamental types, the semantics of missing
are that mathematical operations will not silently ignore it
In order to allow missing
in a column, you can create/load the DataFrame
from a source with missing
’s, or call allowmissing!
on a column
allowmissing!(df2, :col1) # necessary to add in a for col1
push!(df2, (t=3, col1 = missing))
push!(df2, (t=4, col1 = 5.1))
t | col1 | |
---|---|---|
Int64 | Float64⍰ | |
1 | 1 | 3.0 |
2 | 2 | 4.0 |
3 | 3 | missing |
4 | 4 | 5.1 |
We can see the propagation of missing
to caller functions, as well as a way to efficiently calculate with non-missing data
@show mean(df2.col1)
@show mean(skipmissing(df2.col1))
mean(df2.col1) = missing mean(skipmissing(df2.col1)) = 4.033333333333333
4.033333333333333
And to replace the missing
df2.col1 .= coalesce.(df2.col1, 0.0) # replace all missing with 0.0
4-element Array{Union{Missing, Float64},1}: 3.0 4.0 0.0 5.1
One way to do an additional calculation with a DataFrame
is to tuse the @transform
macro from DataFramesMeta.jl
using DataFramesMeta
f(x) = x^2
df2 = @transform(df2, col2 = f.(:col1))
t | col1 | col2 | |
---|---|---|---|
Int64 | Float64⍰ | Float64 | |
1 | 1 | 3.0 | 9.0 |
2 | 2 | 4.0 | 16.0 |
3 | 3 | 0.0 | 0.0 |
4 | 4 | 5.1 | 26.01 |
For data that is categorical
using CategoricalArrays
id = [1, 2, 3, 4]
y = ["old", "young", "young", "old"]
y = CategoricalArray(y)
df = DataFrame(id=id, y=y)
id | y | |
---|---|---|
Int64 | Categorical… | |
1 | 1 | old |
2 | 2 | young |
3 | 3 | young |
4 | 4 | old |
levels(df.y)
2-element Array{String,1}: "old" "young"
The DataFrame
(and similar types that fulfill a standard generic interface) can fit into a variety of packages
One set of them is the QueryVerse
Note: The QueryVerse, in the same spirit as R’s tidyverse, makes heavy use of the pipeline syntax |>
x = 3.0
f(x) = x^2
g(x) = log(x)
@show g(f(x))
@show x |> f |> g; # pipes nest function calls
g(f(x)) = 2.1972245773362196 (x |> f) |> g = 2.1972245773362196
To give an example directly from the source of the LINQ inspired Query.jl
using Query
df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])
x = @from i in df begin
@where i.age>50
@select {i.name, i.children}
@collect DataFrame
end
name | children | |
---|---|---|
String | Int64 | |
1 | Kirk | 2 |
While it is possible to just use the Plots.jl
library, there may be better options for displaying tabular data – such as VegaLite.jl
using RDatasets, VegaLite
iris = dataset("datasets", "iris")
iris |> @vlplot(
:point,
x=:PetalLength,
y=:PetalWidth,
color=:Species
)
Another useful tool for exploring tabular data is DataVoyager.jl
using DataVoyager
iris |> Voyager()
The Voyager()
function creates a separate window for analysis
While Julia is not intended as a replacement for R, Stata, and similar specialty languages, it has a growing number of packages aimed at statistics and econometrics
Many of the packages live in the JuliaStats organization
A few to point out
using GLM
x = randn(100)
y = 0.9 .* x + 0.5 * rand(100)
df = DataFrame(x=x, y=y)
ols = lm(@formula(y ~ x), df) # R-style notation
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}} y ~ 1 + x Coefficients: ────────────────────────────────────────────────────────────────────────── Estimate Std. Error t value Pr(>|t|) Lower 95% Upper 95% ────────────────────────────────────────────────────────────────────────── (Intercept) 0.268432 0.0135617 19.7934 <1e-35 0.24152 0.295345 x 0.904374 0.0130551 69.2735 <1e-84 0.878467 0.930281 ──────────────────────────────────────────────────────────────────────────
To explore data use the interactive DataVoyager and VegaLite
# cigar |> Voyager()
cigar |> @vlplot(
:point,
x=:Price,
y=:Sales,
color=:Year,
size=:NDI
)
UndefVarError: cigar not defined Stacktrace: [1] top-level scope at In[20]:1