Julia code is organized in packages, and package management is built into the Julia language.
The assumption is that packages are developed with git
and Julia will clone the whole repository when installing a package.
Users can have their packages registered on a special GitHub repository: METADATA.jl. Dependencies are tracked in the REQUIRE
file.
# update the local copy of METADATA
Pkg.update()
# install a registered package
Pkg.add("DataFrames")
# install any other package
#Pkg.clone("https://github.com/leethargo/PipeLayout.jl")
# checkout a branch of a package (default: master)
#Pkg.checkout("PipeLayout")
# list installed packages with versions
Pkg.status()
INFO: Updating METADATA... INFO: Updating PipeLayout master... INFO: Updating SCIP master... INFO: Computing changes...
18 required packages: - BenchmarkTools 0.0.8 - Cbc 0.3.2 - Clp 0.3.1 - Convex 0.5.0 - DataFrames 0.10.0 - GLPKMathProgInterface 0.3.4 - IJulia 1.5.1 - IndexedTables 0.1.7 - JuMP 0.17.1 - LightGraphs 0.9.0 - Plots 0.12.0 - ProfileView 0.2.1 - PyPlot 2.3.2 - Query 0.6.0 - SCIP 0.3.0+ master - SCS 0.3.1 - StatPlots 0.4.0 - StaticArrays 0.5.1 80 additional packages: - AxisAlgorithms 0.1.6 - BaseTestNext 0.2.2 - BinDeps 0.6.0 - Blosc 0.2.1 - Cairo 0.3.0 - Calculus 0.2.2 - ColorTypes 0.5.1 - Colors 0.7.3 - Combinatorics 0.4.1 - Compat 0.26.0 - Conda 0.5.3 - DataArrays 0.5.3 - DataStructures 0.5.3 - DataValues 0.1.1 - DiffBase 0.2.0 - Distances 0.4.1 - Distributions 0.13.0 - DocStringExtensions 0.3.3 - Documenter 0.11.1 - DualNumbers 0.3.0 - FileIO 0.4.1 - FixedPointNumbers 0.3.8 - FixedSizeArrays 0.2.5 - ForwardDiff 0.4.2 - GLPK 0.4.2 - GZip 0.3.0 - Graphics 0.2.0 - Gtk 0.13.0 - GtkReactive 0.2.1 - HDF5 0.8.1 - Interpolations 0.6.2 - IntervalSets 0.0.5 - IterTools 0.1.0 - IterableTables 0.3.0 - Iterators 0.3.1 - JLD 0.6.11 - JSON 0.12.0 - KernelDensity 0.3.2 - LaTeXStrings 0.2.1 - Lazy 0.11.7 - LegacyStrings 0.2.2 - LineSearches 0.1.5 - Loess 0.2.0 - MacroTools 0.3.7 - MathProgBase 0.6.4 - MbedTLS 0.4.5 - Measures 0.1.0 - NaNMath 0.2.5 - NamedTuples 4.0.0 - Optim 0.7.8 - PDMats 0.7.0 - PipeLayout 0.0.0- master (unregistered) - PlotThemes 0.1.4 - PlotUtils 0.4.2 - Polynomials 0.1.5 - PooledArrays 0.1.1 - PositiveFactorizations 0.0.4 - PyCall 1.13.0 - QuadGK 0.1.2 - Ratios 0.1.0 - Reactive 0.5.2 - RecipesBase 0.2.0 - Reexport 0.0.3 - Requires 0.4.3 - ReverseDiffSparse 0.7.3 - Rmath 0.1.7 - RoundingIntegers 0.0.2 - SHA 0.3.3 - SIUnits 0.1.0 - ShowItLikeYouBuildIt 0.0.1 - Showoff 0.1.1 - SimpleTraits 0.5.0 - SortingAlgorithms 0.1.1 - SpecialFunctions 0.1.1 - StatsBase 0.16.0 - StatsFuns 0.5.0 - TexExtensions 0.0.3 - URIParser 0.1.8 - WoodburyMatrices 0.2.2 - ZMQ 0.4.3
INFO: No packages to install, update or remove INFO: Package DataFrames is already installed
The goal of this project is the definition of an index fund, following the Dow Jones. That is, we want to select few stocks of the index, together with weights, that show a similar behavior to the overall index.
We start with price data of all the Dow Jones stocks from 2016. From the averages prices, we define weights of the stocks to be used
The data is provided in a file using comma-separated values and three columns:
;head dowjones2016.csv
date,symbol,price 2016-01-04,AAPL,105.349997999999999 2016-01-04,AXP,67.589995999999999 2016-01-04,BA,140.500000000000000 2016-01-04,CAT,67.989998000000000 2016-01-04,CSCO,26.410000000000000 2016-01-04,CVX,88.849997999999999 2016-01-04,DD,63.070000000000000 2016-01-04,DIS,102.980002999999996 2016-01-04,GE,30.709999000000000
Julia provides a function to read csv files into arrays:
?readcsv
search: readcsv readchomp @threadcall
readcsv(source, [T::Type]; options...)
Equivalent to readdlm
with delim
set to comma, and type optionally defined by T
.
data = readcsv("dowjones2016.csv")
data[1:5,:]
5×3 Array{Any,2}: "date" "symbol" "price" "2016-01-04" "AAPL" 105.35 "2016-01-04" "AXP" 67.59 "2016-01-04" "BA" 140.5 "2016-01-04" "CAT" 67.99
But we will use the DataFrames package for easier processing.
using DataFrames
df = readtable("dowjones2016.csv")
df[1:4, :]
date | symbol | price | |
---|---|---|---|
1 | 2016-01-04 | AAPL | 105.349998 |
2 | 2016-01-04 | AXP | 67.589996 |
3 | 2016-01-04 | BA | 140.5 |
4 | 2016-01-04 | CAT | 67.989998 |
We can now access the columns by name:
df[:price]
7560-element DataArrays.DataArray{Float64,1}: 105.35 67.59 140.5 67.99 26.41 88.85 63.07 102.98 30.71 177.14 131.07 135.95 33.99 ⋮ 58.87 62.14 50.83 32.48 84.08 122.42 160.04 109.62 78.02 53.38 69.12 90.26
Let's compute mean prices for the stocks, using a groupby-and-aggregate approach.
avg = by(df, :symbol, d -> DataFrame(avgprice = mean(d[:price])))
avg[1:4, :]
symbol | avgprice | |
---|---|---|
1 | AAPL | 104.6040078690476 |
2 | AXP | 63.79333337698412 |
3 | BA | 133.11150809920633 |
4 | CAT | 78.69801573015873 |
We can now use these averages to compute weights.
weights = DataFrame(symbol = avg[:symbol],
weight = avg[:avgprice] / sum(avg[:avgprice]))
weights[1:4, :]
symbol | weight | |
---|---|---|
1 | AAPL | 0.03995967436333611 |
2 | AXP | 0.02436962866171713 |
3 | BA | 0.05084979654236352 |
4 | CAT | 0.030063351736529197 |
We can also pivot the table into a two-way format.
# original dataframe
df[1:4, :]
date | symbol | price | |
---|---|---|---|
1 | 2016-01-04 | AAPL | 105.349998 |
2 | 2016-01-04 | AXP | 67.589996 |
3 | 2016-01-04 | BA | 140.5 |
4 | 2016-01-04 | CAT | 67.989998 |
# two-way table with symbols as columns
# rows columns data
prices = unstack(df, :date, :symbol, :price)
prices[1:4, 1:4]
date | AAPL | AXP | BA | |
---|---|---|---|---|
1 | 2016-01-04 | 105.349998 | 67.589996 | 140.5 |
2 | 2016-01-05 | 102.709999 | 66.550003 | 141.070007 |
3 | 2016-01-06 | 100.699997 | 64.419998 | 138.830002 |
4 | 2016-01-07 | 96.449997 | 63.84 | 133.009995 |
joined = join(df, weights, on=:symbol)
joined[1:4, :]
date | symbol | price | weight | |
---|---|---|---|---|
1 | 2016-01-04 | AAPL | 105.349998 | 0.03995967436333611 |
2 | 2016-01-05 | AAPL | 102.709999 | 0.03995967436333611 |
3 | 2016-01-06 | AAPL | 100.699997 | 0.03995967436333611 |
4 | 2016-01-07 | AAPL | 96.449997 | 0.03995967436333611 |
joined[:contribution] = joined[:weight] .* joined[:price]
joined[1:4, :]
date | symbol | price | weight | contribution | |
---|---|---|---|---|---|
1 | 2016-01-04 | AAPL | 105.349998 | 0.03995967436333611 | 4.209751614258111 |
2 | 2016-01-05 | AAPL | 102.709999 | 0.03995967436333611 | 4.104258113898577 |
3 | 2016-01-06 | AAPL | 100.699997 | 0.03995967436333611 | 4.023939088508923 |
4 | 2016-01-07 | AAPL | 96.449997 | 0.03995967436333611 | 3.8541104724647446 |
index = by(joined, :date, d -> DataFrame(value = sum(d[:contribution])))
index[1:4, :]
date | value | |
---|---|---|
1 | 2016-01-04 | 100.57292879489896 |
2 | 2016-01-05 | 100.51142239490156 |
3 | 2016-01-06 | 99.01420719993507 |
4 | 2016-01-07 | 96.60603263325876 |
using Plots # general plotting
pyplot() # backend, based on Python's matplotlib
Plots.PyPlotBackend()
x = cumsum(randn(10, 3))
10×3 Array{Float64,2}: 0.0967222 1.13685 -0.221413 2.63506 -0.836293 0.568922 1.97871 0.380401 0.673628 0.862146 1.3827 -1.16359 0.318709 1.72083 0.381898 2.27183 1.32943 -0.366597 1.48305 0.657708 1.07946 1.82443 0.120829 -0.108935 1.97516 -0.857797 -0.854739 1.2446 -0.0925 -1.0925
Plots will interprete the columns of the data as series to be plotted independently:
plot(x)
plot(x')
You can also add to existing plots, using the call plot!
.
plot(x, color=[:red :green])
plot!(x + 3, color=:black, alpha=0.5)
using StatPlots # for DataFrames integration
We can set common attributes for several plots using the with
wrapper:
with(grid=false, legend=false, xticks=false, ylim=(0,300)) do
plot(df, :date, :price, group=:symbol, color=:grey, alpha=0.4)
plot!(index, :date, :value, linewidth=2)
end
bar(weights, :symbol, :weight, xrotation=50, color=:weight, grid=false)
We know come to the decision problem, where we want to pick a small subset of the stocks together with some weights, such that this portfolio has a similar behavior to our overall Dow Jones index.
The model is based on a linear regression over the time series, but we minimize the loss using the L1-norm (absolute value), and allow only a fixed number of weights to take nonzero variable.
A high-level mathematical model might look like this ($w$: weights, $P$: prices, $I$: value of index):
\begin{align*} \text{minimize} \quad & \lVert w^T P - I \rVert_1 \\ \text{subject to} \quad & \lVert w \rVert_0 \le K \end{align*}For the curious: this can be expressed as a Mixed-Integer Linear Program in the following form:
\begin{align*} \text{minimize} \quad & \sum_d \Delta^+_d + \Delta^-_d & \\ \text{subject to} \quad & \sum_s P_{d,s} w_s = I_d + \Delta^+_d + \Delta^-_d & (\forall d) \\ & w_s \le p_s & (\forall s) \\ & \sum_s p_s \le K & \\ & w_s \ge 0, \quad p_s \in \{0,1\} & (\forall s) \\ & \Delta^+_d \ge 0, \quad \Delta^-_d \ge 0 & (\forall d) \end{align*}Several Julia packages are devoted to this kind of optimization, such as JuMP and Convex for modeling, solver backends like Cbc or SCIP and MathProgBase as glue. See JuliaOpt for an overview.
using JuMP # modeling
using Cbc # solver backend
# preparing data for indexing
syms = [Symbol(s) for s in weights[:symbol]]
days = 1:length(prices[:date])
@show size(syms) size(days);
size(syms) = (30,) size(days) = (252,)
We will formulate a model that should look quite close to the mathematical notation above.
Note the heavy use of Julia macros to define variables and constraints. The expressions are used as parsed by the Julia language and directly translated to the solver's internal form.
function find_fund(maxstocks; timelimit=10.0, gaplimit=0.01, lastday=200)
days = 1:lastday
fund = Model(solver=CbcSolver(seconds=timelimit, ratioGap=gaplimit))
# decisions
@variable(fund, pick[syms], Bin) # is stock included?
@variable(fund, weight[syms] ≥ 0) # what part of the portfolio
# auxiliary variables
@variable(fund, Δ⁺[days] ≥ 0) # positive slack
@variable(fund, Δ⁻[days] ≥ 0) # negative slack
# fit to Dow Jones index
for d in days
@constraint(fund, sum(prices[d,s] * weight[s] for s in syms) == index[d, :value] + Δ⁺[d] - Δ⁻[d])
end
# can only use stock if picked
for s in syms
@constraint(fund, weight[s] ≤ pick[s])
end
# few stocks allowed
@constraint(fund, sum(pick[s] for s in syms) ≤ maxstocks)
# minimize the absolute violation (L1 norm)
@objective(fund, :Min, sum(Δ⁺[d] + Δ⁻[d] for d in days))
status = solve(fund)
@show status
getvalue(weight)
end
find_fund (generic function with 1 method)
trainingdays = 100
sol = find_fund(3, timelimit=6, lastday=trainingdays)
WARNING: Not solved to optimality, status: UserLimit
status = :UserLimit
weight: 1 dimensions: [AAPL] = 0.0 [ AXP] = 0.47229506948301675 [ BA] = 0.0 [ CAT] = 0.0 [CSCO] = 0.0 [ CVX] = 0.0 [ DD] = 0.0 [ DIS] = 0.0 [ GE] = 0.0 [ GS] = 0.0 [ HD] = 0.0 [ IBM] = 0.0 [INTC] = 0.0 [ JNJ] = 0.0 [ JPM] = 0.0 [ KO] = 0.0 [ MCD] = 0.0 [ MMM] = 0.31629226616166206 [ MRK] = 0.0 [MSFT] = 0.4061951417784263 [ NKE] = 0.0 [ PFE] = 0.0 [ PG] = 0.0 [ TRV] = 0.0 [ UNH] = 0.0 [ UTX] = 0.0 [ V] = 0.0 [ VZ] = 0.0 [ WMT] = 0.0 [ XOM] = 0.0
solfund = sum(sol[s] * prices[:, s] for s in syms);
with(xticks=[0, trainingdays, length(days)], yticks=[]) do
plot(index, :date, :value, label="Dow Jones")
plot!(solfund, label="Index Fund")
end
errors = abs.(index[:value] - solfund)
with(bins=20) do
histogram(errors[trainingdays:252], label="later", color=:red)
histogram!(errors[1:trainingdays], alpha=0.8, label="training", color=:green)
end