Introduction to Packages¶

Julia code is organized in packages, and package management is built into the Julia language.

The assumption is that packages are developed with git and Julia will clone the whole repository when installing a package.

Users can have their packages registered on a special GitHub repository: METADATA.jl. Dependencies are tracked in the REQUIRE file.

In [1]:

# update the local copy of METADATA
Pkg.update()

# install a registered package
Pkg.add("DataFrames")

# install any other package
#Pkg.clone("https://github.com/leethargo/PipeLayout.jl")

# checkout a branch of a package (default: master)
#Pkg.checkout("PipeLayout")

# list installed packages with versions
Pkg.status()

INFO: Updating METADATA...
INFO: Updating PipeLayout master...
INFO: Updating SCIP master...
INFO: Computing changes...

18 required packages:
 - BenchmarkTools                0.0.8
 - Cbc                           0.3.2
 - Clp                           0.3.1
 - Convex                        0.5.0
 - DataFrames                    0.10.0
 - GLPKMathProgInterface         0.3.4
 - IJulia                        1.5.1
 - IndexedTables                 0.1.7
 - JuMP                          0.17.1
 - LightGraphs                   0.9.0
 - Plots                         0.12.0
 - ProfileView                   0.2.1
 - PyPlot                        2.3.2
 - Query                         0.6.0
 - SCIP                          0.3.0+             master
 - SCS                           0.3.1
 - StatPlots                     0.4.0
 - StaticArrays                  0.5.1
80 additional packages:
 - AxisAlgorithms                0.1.6
 - BaseTestNext                  0.2.2
 - BinDeps                       0.6.0
 - Blosc                         0.2.1
 - Cairo                         0.3.0
 - Calculus                      0.2.2
 - ColorTypes                    0.5.1
 - Colors                        0.7.3
 - Combinatorics                 0.4.1
 - Compat                        0.26.0
 - Conda                         0.5.3
 - DataArrays                    0.5.3
 - DataStructures                0.5.3
 - DataValues                    0.1.1
 - DiffBase                      0.2.0
 - Distances                     0.4.1
 - Distributions                 0.13.0
 - DocStringExtensions           0.3.3
 - Documenter                    0.11.1
 - DualNumbers                   0.3.0
 - FileIO                        0.4.1
 - FixedPointNumbers             0.3.8
 - FixedSizeArrays               0.2.5
 - ForwardDiff                   0.4.2
 - GLPK                          0.4.2
 - GZip                          0.3.0
 - Graphics                      0.2.0
 - Gtk                           0.13.0
 - GtkReactive                   0.2.1
 - HDF5                          0.8.1
 - Interpolations                0.6.2
 - IntervalSets                  0.0.5
 - IterTools                     0.1.0
 - IterableTables                0.3.0
 - Iterators                     0.3.1
 - JLD                           0.6.11
 - JSON                          0.12.0
 - KernelDensity                 0.3.2
 - LaTeXStrings                  0.2.1
 - Lazy                          0.11.7
 - LegacyStrings                 0.2.2
 - LineSearches                  0.1.5
 - Loess                         0.2.0
 - MacroTools                    0.3.7
 - MathProgBase                  0.6.4
 - MbedTLS                       0.4.5
 - Measures                      0.1.0
 - NaNMath                       0.2.5
 - NamedTuples                   4.0.0
 - Optim                         0.7.8
 - PDMats                        0.7.0
 - PipeLayout                    0.0.0-             master (unregistered)
 - PlotThemes                    0.1.4
 - PlotUtils                     0.4.2
 - Polynomials                   0.1.5
 - PooledArrays                  0.1.1
 - PositiveFactorizations        0.0.4
 - PyCall                        1.13.0
 - QuadGK                        0.1.2
 - Ratios                        0.1.0
 - Reactive                      0.5.2
 - RecipesBase                   0.2.0
 - Reexport                      0.0.3
 - Requires                      0.4.3
 - ReverseDiffSparse             0.7.3
 - Rmath                         0.1.7
 - RoundingIntegers              0.0.2
 - SHA                           0.3.3
 - SIUnits                       0.1.0
 - ShowItLikeYouBuildIt          0.0.1
 - Showoff                       0.1.1
 - SimpleTraits                  0.5.0
 - SortingAlgorithms             0.1.1
 - SpecialFunctions              0.1.1
 - StatsBase                     0.16.0
 - StatsFuns                     0.5.0
 - TexExtensions                 0.0.3
 - URIParser                     0.1.8
 - WoodburyMatrices              0.2.2
 - ZMQ                           0.4.3

INFO: No packages to install, update or remove
INFO: Package DataFrames is already installed

Creating an index fund¶

The goal of this project is the definition of an index fund, following the Dow Jones. That is, we want to select few stocks of the index, together with weights, that show a similar behavior to the overall index.

We start with price data of all the Dow Jones stocks from 2016. From the averages prices, we define weights of the stocks to be used

Loading the price data¶

The data is provided in a file using comma-separated values and three columns:

In [2]:

;head dowjones2016.csv

date,symbol,price
2016-01-04,AAPL,105.349997999999999
2016-01-04,AXP,67.589995999999999
2016-01-04,BA,140.500000000000000
2016-01-04,CAT,67.989998000000000
2016-01-04,CSCO,26.410000000000000
2016-01-04,CVX,88.849997999999999
2016-01-04,DD,63.070000000000000
2016-01-04,DIS,102.980002999999996
2016-01-04,GE,30.709999000000000

Julia provides a function to read csv files into arrays:

In [3]:

?readcsv

search: readcsv readchomp @threadcall

Out[3]:

readcsv(source, [T::Type]; options...)

Equivalent to readdlm with delim set to comma, and type optionally defined by T.

In [4]:

data = readcsv("dowjones2016.csv")
data[1:5,:]

Out[4]:

5×3 Array{Any,2}:
 "date"        "symbol"     "price"
 "2016-01-04"  "AAPL"    105.35    
 "2016-01-04"  "AXP"      67.59    
 "2016-01-04"  "BA"      140.5     
 "2016-01-04"  "CAT"      67.99

But we will use the DataFrames package for easier processing.

In [5]:

using DataFrames

In [6]:

df = readtable("dowjones2016.csv")
df[1:4, :]

Out[6]:

	date	symbol	price
1	2016-01-04	AAPL	105.349998
2	2016-01-04	AXP	67.589996
3	2016-01-04	BA	140.5
4	2016-01-04	CAT	67.989998

We can now access the columns by name:

In [7]:

df[:price]

Out[7]:

7560-element DataArrays.DataArray{Float64,1}:
 105.35
  67.59
 140.5 
  67.99
  26.41
  88.85
  63.07
 102.98
  30.71
 177.14
 131.07
 135.95
  33.99
   ⋮   
  58.87
  62.14
  50.83
  32.48
  84.08
 122.42
 160.04
 109.62
  78.02
  53.38
  69.12
  90.26

Let's compute mean prices for the stocks, using a groupby-and-aggregate approach.

In [8]:

avg = by(df, :symbol, d -> DataFrame(avgprice = mean(d[:price])))
avg[1:4, :]

Out[8]:

	symbol	avgprice
1	AAPL	104.6040078690476
2	AXP	63.79333337698412
3	BA	133.11150809920633
4	CAT	78.69801573015873

We can now use these averages to compute weights.

In [9]:

weights = DataFrame(symbol = avg[:symbol],
                    weight = avg[:avgprice] / sum(avg[:avgprice]))
weights[1:4, :]

Out[9]:

	symbol	weight
1	AAPL	0.03995967436333611
2	AXP	0.02436962866171713
3	BA	0.05084979654236352
4	CAT	0.030063351736529197

We can also pivot the table into a two-way format.

In [10]:

# original dataframe
df[1:4, :]

Out[10]:

	date	symbol	price
1	2016-01-04	AAPL	105.349998
2	2016-01-04	AXP	67.589996
3	2016-01-04	BA	140.5
4	2016-01-04	CAT	67.989998

In [11]:

# two-way table with symbols as columns
#                    rows   columns  data
prices = unstack(df, :date, :symbol, :price)
prices[1:4, 1:4]

Out[11]:

	date	AAPL	AXP	BA
1	2016-01-04	105.349998	67.589996	140.5
2	2016-01-05	102.709999	66.550003	141.070007
3	2016-01-06	100.699997	64.419998	138.830002
4	2016-01-07	96.449997	63.84	133.009995

In [12]:

joined = join(df, weights, on=:symbol)
joined[1:4, :]

Out[12]:

	date	symbol	price	weight
1	2016-01-04	AAPL	105.349998	0.03995967436333611
2	2016-01-05	AAPL	102.709999	0.03995967436333611
3	2016-01-06	AAPL	100.699997	0.03995967436333611
4	2016-01-07	AAPL	96.449997	0.03995967436333611

In [13]:

joined[:contribution] = joined[:weight] .* joined[:price]
joined[1:4, :]

Out[13]:

	date	symbol	price	weight	contribution
1	2016-01-04	AAPL	105.349998	0.03995967436333611	4.209751614258111
2	2016-01-05	AAPL	102.709999	0.03995967436333611	4.104258113898577
3	2016-01-06	AAPL	100.699997	0.03995967436333611	4.023939088508923
4	2016-01-07	AAPL	96.449997	0.03995967436333611	3.8541104724647446

In [14]:

index = by(joined, :date, d -> DataFrame(value = sum(d[:contribution])))
index[1:4, :]

Out[14]:

	date	value
1	2016-01-04	100.57292879489896
2	2016-01-05	100.51142239490156
3	2016-01-06	99.01420719993507
4	2016-01-07	96.60603263325876

Visualization the time series¶

In [15]:

using Plots      # general plotting
pyplot()         # backend, based on Python's matplotlib

Out[15]:

Plots.PyPlotBackend()

In [16]:

x = cumsum(randn(10, 3))

Out[16]:

10×3 Array{Float64,2}:
 0.0967222   1.13685   -0.221413
 2.63506    -0.836293   0.568922
 1.97871     0.380401   0.673628
 0.862146    1.3827    -1.16359 
 0.318709    1.72083    0.381898
 2.27183     1.32943   -0.366597
 1.48305     0.657708   1.07946 
 1.82443     0.120829  -0.108935
 1.97516    -0.857797  -0.854739
 1.2446     -0.0925    -1.0925

Plots will interprete the columns of the data as series to be plotted independently:

In [17]:

plot(x)

Out[17]:

In [18]:

plot(x')

Out[18]:

You can also add to existing plots, using the call plot!.

In [19]:

plot(x, color=[:red :green])
plot!(x + 3, color=:black, alpha=0.5)

Out[19]:

In [20]:

using StatPlots  # for DataFrames integration

We can set common attributes for several plots using the with wrapper:

In [21]:

with(grid=false, legend=false, xticks=false, ylim=(0,300)) do
    plot(df, :date, :price, group=:symbol, color=:grey, alpha=0.4)
    plot!(index, :date, :value, linewidth=2)
end

Out[21]:

In [22]:

bar(weights, :symbol, :weight, xrotation=50, color=:weight, grid=false)

Out[22]:

Picking stocks¶

We know come to the decision problem, where we want to pick a small subset of the stocks together with some weights, such that this portfolio has a similar behavior to our overall Dow Jones index.

The model is based on a linear regression over the time series, but we minimize the loss using the L1-norm (absolute value), and allow only a fixed number of weights to take nonzero variable.

A high-level mathematical model might look like this ($w$: weights, $P$: prices, $I$: value of index):

\begin{align*} \text{minimize} \quad & \lVert w^T P - I \rVert_1 \\ \text{subject to} \quad & \lVert w \rVert_0 \le K \end{align*}

For the curious: this can be expressed as a Mixed-Integer Linear Program in the following form:

\begin{align*} \text{minimize} \quad & \sum_d \Delta^+_d + \Delta^-_d & \\ \text{subject to} \quad & \sum_s P_{d,s} w_s = I_d + \Delta^+_d + \Delta^-_d & (\forall d) \\ & w_s \le p_s & (\forall s) \\ & \sum_s p_s \le K & \\ & w_s \ge 0, \quad p_s \in \{0,1\} & (\forall s) \\ & \Delta^+_d \ge 0, \quad \Delta^-_d \ge 0 & (\forall d) \end{align*}

Several Julia packages are devoted to this kind of optimization, such as JuMP and Convex for modeling, solver backends like Cbc or SCIP and MathProgBase as glue. See JuliaOpt for an overview.

In [23]:

using JuMP # modeling
using Cbc  # solver backend

In [24]:

# preparing data for indexing
syms = [Symbol(s) for s in weights[:symbol]]
days = 1:length(prices[:date])

@show size(syms) size(days);

size(syms) = (30,)
size(days) = (252,)

We will formulate a model that should look quite close to the mathematical notation above.

Note the heavy use of Julia macros to define variables and constraints. The expressions are used as parsed by the Julia language and directly translated to the solver's internal form.

In [25]:

function find_fund(maxstocks; timelimit=10.0, gaplimit=0.01, lastday=200)
    days = 1:lastday

    fund = Model(solver=CbcSolver(seconds=timelimit, ratioGap=gaplimit))

    # decisions
    @variable(fund, pick[syms], Bin)    # is stock included?
    @variable(fund, weight[syms] ≥ 0)   # what part of the portfolio

    # auxiliary variables
    @variable(fund, Δ⁺[days] ≥ 0) # positive slack
    @variable(fund, Δ⁻[days] ≥ 0) # negative slack

    # fit to Dow Jones index
    for d in days
        @constraint(fund, sum(prices[d,s] * weight[s] for s in syms) == index[d, :value] + Δ⁺[d] - Δ⁻[d])
    end

    # can only use stock if picked
    for s in syms
        @constraint(fund, weight[s] ≤ pick[s])
    end
                
    # few stocks allowed
    @constraint(fund, sum(pick[s] for s in syms) ≤ maxstocks)
                          
    # minimize the absolute violation (L1 norm)
    @objective(fund, :Min, sum(Δ⁺[d] + Δ⁻[d] for d in days))
                            
                            
    status = solve(fund)
    @show status
    
    getvalue(weight)
end

Out[25]:

find_fund (generic function with 1 method)

In [26]:

trainingdays = 100
sol = find_fund(3, timelimit=6, lastday=trainingdays)

WARNING: Not solved to optimality, status: UserLimit

status = :UserLimit

Out[26]:

weight: 1 dimensions:
[AAPL] = 0.0
[ AXP] = 0.47229506948301675
[  BA] = 0.0
[ CAT] = 0.0
[CSCO] = 0.0
[ CVX] = 0.0
[  DD] = 0.0
[ DIS] = 0.0
[  GE] = 0.0
[  GS] = 0.0
[  HD] = 0.0
[ IBM] = 0.0
[INTC] = 0.0
[ JNJ] = 0.0
[ JPM] = 0.0
[  KO] = 0.0
[ MCD] = 0.0
[ MMM] = 0.31629226616166206
[ MRK] = 0.0
[MSFT] = 0.4061951417784263
[ NKE] = 0.0
[ PFE] = 0.0
[  PG] = 0.0
[ TRV] = 0.0
[ UNH] = 0.0
[ UTX] = 0.0
[   V] = 0.0
[  VZ] = 0.0
[ WMT] = 0.0
[ XOM] = 0.0

In [27]:

solfund = sum(sol[s] * prices[:, s] for s in syms);

In [28]:

with(xticks=[0, trainingdays, length(days)], yticks=[]) do
    plot(index, :date, :value, label="Dow Jones")
    plot!(solfund, label="Index Fund")
end

Out[28]:

In [29]:

errors = abs.(index[:value] - solfund)

with(bins=20) do
    histogram(errors[trainingdays:252], label="later", color=:red)
    histogram!(errors[1:trainingdays], alpha=0.8, label="training", color=:green)
end

Out[29]:

In [ ]: