Introduction to Packages

Julia code is organized in packages, and package management is built into the Julia language.

The assumption is that packages are developed with git and Julia will clone the whole repository when installing a package.

Users can have their packages registered on a special GitHub repository: METADATA.jl. Dependencies are tracked in the REQUIRE file.

In [1]:
# update the local copy of METADATA
Pkg.update()

# install a registered package
Pkg.add("DataFrames")

# install any other package
#Pkg.clone("https://github.com/leethargo/PipeLayout.jl")

# checkout a branch of a package (default: master)
#Pkg.checkout("PipeLayout")

# list installed packages with versions
Pkg.status()
INFO: Updating METADATA...
INFO: Updating PipeLayout master...
INFO: Updating SCIP master...
INFO: Computing changes...
18 required packages:
 - BenchmarkTools                0.0.8
 - Cbc                           0.3.2
 - Clp                           0.3.1
 - Convex                        0.5.0
 - DataFrames                    0.10.0
 - GLPKMathProgInterface         0.3.4
 - IJulia                        1.5.1
 - IndexedTables                 0.1.7
 - JuMP                          0.17.1
 - LightGraphs                   0.9.0
 - Plots                         0.12.0
 - ProfileView                   0.2.1
 - PyPlot                        2.3.2
 - Query                         0.6.0
 - SCIP                          0.3.0+             master
 - SCS                           0.3.1
 - StatPlots                     0.4.0
 - StaticArrays                  0.5.1
80 additional packages:
 - AxisAlgorithms                0.1.6
 - BaseTestNext                  0.2.2
 - BinDeps                       0.6.0
 - Blosc                         0.2.1
 - Cairo                         0.3.0
 - Calculus                      0.2.2
 - ColorTypes                    0.5.1
 - Colors                        0.7.3
 - Combinatorics                 0.4.1
 - Compat                        0.26.0
 - Conda                         0.5.3
 - DataArrays                    0.5.3
 - DataStructures                0.5.3
 - DataValues                    0.1.1
 - DiffBase                      0.2.0
 - Distances                     0.4.1
 - Distributions                 0.13.0
 - DocStringExtensions           0.3.3
 - Documenter                    0.11.1
 - DualNumbers                   0.3.0
 - FileIO                        0.4.1
 - FixedPointNumbers             0.3.8
 - FixedSizeArrays               0.2.5
 - ForwardDiff                   0.4.2
 - GLPK                          0.4.2
 - GZip                          0.3.0
 - Graphics                      0.2.0
 - Gtk                           0.13.0
 - GtkReactive                   0.2.1
 - HDF5                          0.8.1
 - Interpolations                0.6.2
 - IntervalSets                  0.0.5
 - IterTools                     0.1.0
 - IterableTables                0.3.0
 - Iterators                     0.3.1
 - JLD                           0.6.11
 - JSON                          0.12.0
 - KernelDensity                 0.3.2
 - LaTeXStrings                  0.2.1
 - Lazy                          0.11.7
 - LegacyStrings                 0.2.2
 - LineSearches                  0.1.5
 - Loess                         0.2.0
 - MacroTools                    0.3.7
 - MathProgBase                  0.6.4
 - MbedTLS                       0.4.5
 - Measures                      0.1.0
 - NaNMath                       0.2.5
 - NamedTuples                   4.0.0
 - Optim                         0.7.8
 - PDMats                        0.7.0
 - PipeLayout                    0.0.0-             master (unregistered)
 - PlotThemes                    0.1.4
 - PlotUtils                     0.4.2
 - Polynomials                   0.1.5
 - PooledArrays                  0.1.1
 - PositiveFactorizations        0.0.4
 - PyCall                        1.13.0
 - QuadGK                        0.1.2
 - Ratios                        0.1.0
 - Reactive                      0.5.2
 - RecipesBase                   0.2.0
 - Reexport                      0.0.3
 - Requires                      0.4.3
 - ReverseDiffSparse             0.7.3
 - Rmath                         0.1.7
 - RoundingIntegers              0.0.2
 - SHA                           0.3.3
 - SIUnits                       0.1.0
 - ShowItLikeYouBuildIt          0.0.1
 - Showoff                       0.1.1
 - SimpleTraits                  0.5.0
 - SortingAlgorithms             0.1.1
 - SpecialFunctions              0.1.1
 - StatsBase                     0.16.0
 - StatsFuns                     0.5.0
 - TexExtensions                 0.0.3
 - URIParser                     0.1.8
 - WoodburyMatrices              0.2.2
 - ZMQ                           0.4.3
INFO: No packages to install, update or remove
INFO: Package DataFrames is already installed

Creating an index fund

The goal of this project is the definition of an index fund, following the Dow Jones. That is, we want to select few stocks of the index, together with weights, that show a similar behavior to the overall index.

We start with price data of all the Dow Jones stocks from 2016. From the averages prices, we define weights of the stocks to be used

Loading the price data

The data is provided in a file using comma-separated values and three columns:

In [2]:
;head dowjones2016.csv
date,symbol,price
2016-01-04,AAPL,105.349997999999999
2016-01-04,AXP,67.589995999999999
2016-01-04,BA,140.500000000000000
2016-01-04,CAT,67.989998000000000
2016-01-04,CSCO,26.410000000000000
2016-01-04,CVX,88.849997999999999
2016-01-04,DD,63.070000000000000
2016-01-04,DIS,102.980002999999996
2016-01-04,GE,30.709999000000000

Julia provides a function to read csv files into arrays:

In [3]:
?readcsv
search: readcsv readchomp @threadcall

Out[3]:
readcsv(source, [T::Type]; options...)

Equivalent to readdlm with delim set to comma, and type optionally defined by T.

In [4]:
data = readcsv("dowjones2016.csv")
data[1:5,:]
Out[4]:
5×3 Array{Any,2}:
 "date"        "symbol"     "price"
 "2016-01-04"  "AAPL"    105.35    
 "2016-01-04"  "AXP"      67.59    
 "2016-01-04"  "BA"      140.5     
 "2016-01-04"  "CAT"      67.99    

But we will use the DataFrames package for easier processing.

In [5]:
using DataFrames
In [6]:
df = readtable("dowjones2016.csv")
df[1:4, :]
Out[6]:
datesymbolprice
12016-01-04AAPL105.349998
22016-01-04AXP67.589996
32016-01-04BA140.5
42016-01-04CAT67.989998

We can now access the columns by name:

In [7]:
df[:price]
Out[7]:
7560-element DataArrays.DataArray{Float64,1}:
 105.35
  67.59
 140.5 
  67.99
  26.41
  88.85
  63.07
 102.98
  30.71
 177.14
 131.07
 135.95
  33.99
   ⋮   
  58.87
  62.14
  50.83
  32.48
  84.08
 122.42
 160.04
 109.62
  78.02
  53.38
  69.12
  90.26

Let's compute mean prices for the stocks, using a groupby-and-aggregate approach.

In [8]:
avg = by(df, :symbol, d -> DataFrame(avgprice = mean(d[:price])))
avg[1:4, :]
Out[8]:
symbolavgprice
1AAPL104.6040078690476
2AXP63.79333337698412
3BA133.11150809920633
4CAT78.69801573015873

We can now use these averages to compute weights.

In [9]:
weights = DataFrame(symbol = avg[:symbol],
                    weight = avg[:avgprice] / sum(avg[:avgprice]))
weights[1:4, :]
Out[9]:
symbolweight
1AAPL0.03995967436333611
2AXP0.02436962866171713
3BA0.05084979654236352
4CAT0.030063351736529197

We can also pivot the table into a two-way format.

In [10]:
# original dataframe
df[1:4, :]
Out[10]:
datesymbolprice
12016-01-04AAPL105.349998
22016-01-04AXP67.589996
32016-01-04BA140.5
42016-01-04CAT67.989998
In [11]:
# two-way table with symbols as columns
#                    rows   columns  data
prices = unstack(df, :date, :symbol, :price)
prices[1:4, 1:4]
Out[11]:
dateAAPLAXPBA
12016-01-04105.34999867.589996140.5
22016-01-05102.70999966.550003141.070007
32016-01-06100.69999764.419998138.830002
42016-01-0796.44999763.84133.009995
In [12]:
joined = join(df, weights, on=:symbol)
joined[1:4, :]
Out[12]:
datesymbolpriceweight
12016-01-04AAPL105.3499980.03995967436333611
22016-01-05AAPL102.7099990.03995967436333611
32016-01-06AAPL100.6999970.03995967436333611
42016-01-07AAPL96.4499970.03995967436333611
In [13]:
joined[:contribution] = joined[:weight] .* joined[:price]
joined[1:4, :]
Out[13]:
datesymbolpriceweightcontribution
12016-01-04AAPL105.3499980.039959674363336114.209751614258111
22016-01-05AAPL102.7099990.039959674363336114.104258113898577
32016-01-06AAPL100.6999970.039959674363336114.023939088508923
42016-01-07AAPL96.4499970.039959674363336113.8541104724647446
In [14]:
index = by(joined, :date, d -> DataFrame(value = sum(d[:contribution])))
index[1:4, :]
Out[14]:
datevalue
12016-01-04100.57292879489896
22016-01-05100.51142239490156
32016-01-0699.01420719993507
42016-01-0796.60603263325876

Visualization the time series

In [15]:
using Plots      # general plotting
pyplot()         # backend, based on Python's matplotlib
Out[15]:
Plots.PyPlotBackend()
In [16]:
x = cumsum(randn(10, 3))
Out[16]:
10×3 Array{Float64,2}:
 0.0967222   1.13685   -0.221413
 2.63506    -0.836293   0.568922
 1.97871     0.380401   0.673628
 0.862146    1.3827    -1.16359 
 0.318709    1.72083    0.381898
 2.27183     1.32943   -0.366597
 1.48305     0.657708   1.07946 
 1.82443     0.120829  -0.108935
 1.97516    -0.857797  -0.854739
 1.2446     -0.0925    -1.0925  

Plots will interprete the columns of the data as series to be plotted independently:

In [17]:
plot(x)
Out[17]:
In [18]:
plot(x')
Out[18]:

You can also add to existing plots, using the call plot!.

In [19]:
plot(x, color=[:red :green])
plot!(x + 3, color=:black, alpha=0.5)
Out[19]:
In [20]:
using StatPlots  # for DataFrames integration

We can set common attributes for several plots using the with wrapper:

In [21]:
with(grid=false, legend=false, xticks=false, ylim=(0,300)) do
    plot(df, :date, :price, group=:symbol, color=:grey, alpha=0.4)
    plot!(index, :date, :value, linewidth=2)
end
Out[21]:
In [22]:
bar(weights, :symbol, :weight, xrotation=50, color=:weight, grid=false)
Out[22]:

Picking stocks

We know come to the decision problem, where we want to pick a small subset of the stocks together with some weights, such that this portfolio has a similar behavior to our overall Dow Jones index.

The model is based on a linear regression over the time series, but we minimize the loss using the L1-norm (absolute value), and allow only a fixed number of weights to take nonzero variable.

A high-level mathematical model might look like this ($w$: weights, $P$: prices, $I$: value of index):

\begin{align*} \text{minimize} \quad & \lVert w^T P - I \rVert_1 \\ \text{subject to} \quad & \lVert w \rVert_0 \le K \end{align*}

For the curious: this can be expressed as a Mixed-Integer Linear Program in the following form:

\begin{align*} \text{minimize} \quad & \sum_d \Delta^+_d + \Delta^-_d & \\ \text{subject to} \quad & \sum_s P_{d,s} w_s = I_d + \Delta^+_d + \Delta^-_d & (\forall d) \\ & w_s \le p_s & (\forall s) \\ & \sum_s p_s \le K & \\ & w_s \ge 0, \quad p_s \in \{0,1\} & (\forall s) \\ & \Delta^+_d \ge 0, \quad \Delta^-_d \ge 0 & (\forall d) \end{align*}

Several Julia packages are devoted to this kind of optimization, such as JuMP and Convex for modeling, solver backends like Cbc or SCIP and MathProgBase as glue. See JuliaOpt for an overview.

In [23]:
using JuMP # modeling
using Cbc  # solver backend
In [24]:
# preparing data for indexing
syms = [Symbol(s) for s in weights[:symbol]]
days = 1:length(prices[:date])

@show size(syms) size(days);
size(syms) = (30,)
size(days) = (252,)

We will formulate a model that should look quite close to the mathematical notation above.

Note the heavy use of Julia macros to define variables and constraints. The expressions are used as parsed by the Julia language and directly translated to the solver's internal form.

In [25]:
function find_fund(maxstocks; timelimit=10.0, gaplimit=0.01, lastday=200)
    days = 1:lastday

    fund = Model(solver=CbcSolver(seconds=timelimit, ratioGap=gaplimit))

    # decisions
    @variable(fund, pick[syms], Bin)    # is stock included?
    @variable(fund, weight[syms]  0)   # what part of the portfolio

    # auxiliary variables
    @variable(fund, Δ⁺[days]  0) # positive slack
    @variable(fund, Δ⁻[days]  0) # negative slack

    # fit to Dow Jones index
    for d in days
        @constraint(fund, sum(prices[d,s] * weight[s] for s in syms) == index[d, :value] + Δ⁺[d] - Δ⁻[d])
    end

    # can only use stock if picked
    for s in syms
        @constraint(fund, weight[s]  pick[s])
    end
                
    # few stocks allowed
    @constraint(fund, sum(pick[s] for s in syms)  maxstocks)
                          
    # minimize the absolute violation (L1 norm)
    @objective(fund, :Min, sum(Δ⁺[d] + Δ⁻[d] for d in days))
                            
                            
    status = solve(fund)
    @show status
    
    getvalue(weight)
end
Out[25]:
find_fund (generic function with 1 method)
In [26]:
trainingdays = 100
sol = find_fund(3, timelimit=6, lastday=trainingdays)
WARNING: Not solved to optimality, status: UserLimit
status = :UserLimit
Out[26]:
weight: 1 dimensions:
[AAPL] = 0.0
[ AXP] = 0.47229506948301675
[  BA] = 0.0
[ CAT] = 0.0
[CSCO] = 0.0
[ CVX] = 0.0
[  DD] = 0.0
[ DIS] = 0.0
[  GE] = 0.0
[  GS] = 0.0
[  HD] = 0.0
[ IBM] = 0.0
[INTC] = 0.0
[ JNJ] = 0.0
[ JPM] = 0.0
[  KO] = 0.0
[ MCD] = 0.0
[ MMM] = 0.31629226616166206
[ MRK] = 0.0
[MSFT] = 0.4061951417784263
[ NKE] = 0.0
[ PFE] = 0.0
[  PG] = 0.0
[ TRV] = 0.0
[ UNH] = 0.0
[ UTX] = 0.0
[   V] = 0.0
[  VZ] = 0.0
[ WMT] = 0.0
[ XOM] = 0.0
In [27]:
solfund = sum(sol[s] * prices[:, s] for s in syms);
In [28]:
with(xticks=[0, trainingdays, length(days)], yticks=[]) do
    plot(index, :date, :value, label="Dow Jones")
    plot!(solfund, label="Index Fund")
end
Out[28]:
In [29]:
errors = abs.(index[:value] - solfund)

with(bins=20) do
    histogram(errors[trainingdays:252], label="later", color=:red)
    histogram!(errors[1:trainingdays], alpha=0.8, label="training", color=:green)
end
Out[29]: