In [1]:
#Pkg.add("DataFrames")
using DataFrames

DataFrames provides readtable, which loads a CSV file into a DataFrame object.

In [16]:
df = readtable("input/weather.csv")
Out[16]:
Row  Station  Date        Tmax  Tmin  Tavg  Depart  DewPoint  WetBulb  Heat  Cool  Sunrise  Sunset  CodeSum     Depth  Water1  SnowFall  PrecipTotal  StnPressure  SeaLevel  ResultSpeed  ResultDir  AvgSpeed
1    1        2007-05-01  83    50    67    14      51        56       0     2     0448     1849                0      M       0.0       0.00         29.10        29.82     1.7          27         9.2
2    2        2007-05-01  84    52    68    M       51        57       0     3     -        -                   M      M       M         0.00         29.18        29.82     2.7          25         9.6
3    1        2007-05-02  59    42    51    -3      42        47       14    0     0447     1850    BR          0      M       0.0       0.00         29.38        30.09     13.0         4          13.4
4    2        2007-05-02  60    43    52    M       42        47       13    0     -        -       BR HZ       M      M       M         0.00         29.44        30.08     13.3         2          13.4
5    1        2007-05-03  66    46    56    2       40        48       9     0     0446     1851                0      M       0.0       0.00         29.39        30.12     11.7         7          11.9
6    2        2007-05-03  67    48    58    M       40        50       7     0     -        -       HZ          M      M       M         0.00         29.46        30.12     12.9         6          13.2
7    1        2007-05-04  66    49    58    4       41        50       7     0     0444     1852    RA          0      M       0.0       T            29.31        30.05     10.4         8          10.8
8    2        2007-05-04  78    51    M     M       42        50       M     M     -        -                   M      M       M         0.00         29.36        30.04     10.1         7          10.4
9    1        2007-05-05  66    53    60    5       38        49       5     0     0443     1853                0      M       0.0       T            29.40        30.10     11.7         7          12.0
10   2        2007-05-05  66    54    60    M       39        50       5     0     -        -                   M      M       M         T            29.46        30.09     11.2         7          11.5
11   1        2007-05-06  68    49    59    4       30        46       6     0     0442     1855                0      M       0.0       0.00         29.57        30.29     14.4         11         15.0
12   2        2007-05-06  68    52    60    M       30        46       5     0     -        -                   M      M       M         0.00         29.62        30.28     13.8         10         14.5
13   1        2007-05-07  83    47    65    10      41        54       0     0     0441     1856    RA          0      M       0.0       T            29.38        30.12     8.6          18         10.5
14   2        2007-05-07  84    50    67    M       39        53       0     2     -        -                   M      M       M         0.00         29.44        30.12     8.5          17         9.9
15   1        2007-05-08  82    54    68    12      58        62       0     3     0439     1857    BR          0      M       0.0       0.00         29.29        30.03     2.7          11         5.8
16   2        2007-05-08  80    60    70    M       57        63       0     5     -        -       HZ          M      M       M         T            29.36        30.02     2.5          8          5.4
17   1        2007-05-09  77    61    69    13      59        63       0     4     0438     1858    BR HZ       0      M       0.0       0.13         29.21        29.94     3.9          9          6.2
18   2        2007-05-09  76    63    70    M       60        63       0     5     -        -       BR HZ       M      M       M         0.02         29.28        29.93     3.9          7          5.9
19   1        2007-05-10  84    56    70    14      52        60       0     5     0437     1859    BR          0      M       0.0       0.00         29.20        29.92     0.7          17         4.1
20   2        2007-05-10  83    59    71    M       52        61       0     6     -        -       BR HZ       M      M       M         0.00         29.26        29.91     2.0          9          3.9
21   1        2007-05-11  70    51    61    4       42        51       4     0     0436     1860                0      M       0.0       0.00         29.33        30.04     11.3         3          12.9
22   2        2007-05-11  73    49    61    M       44        51       4     0     -        -                   M      M       M         0.00         29.39        30.03     11.7         36         12.8
23   1        2007-05-12  64    46    55    -2      36        46       10    0     0435     1901                0      M       0.0       0.00         29.49        30.20     12.4         3          12.9
24   2        2007-05-12  65    47    56    M       37        46       9     0     -        -                   M      M       M         0.00         29.54        30.19     12.7         1          13.0
25   1        2007-05-13  69    43    56    -2      33        46       9     0     0434     1902                0      M       0.0       0.00         29.49        30.24     6.6          14         8.1
26   2        2007-05-13  69    44    57    M       32        46       8     0     -        -                   M      M       M         0.00         29.55        30.24     6.4          11         7.6
27   1        2007-05-14  90    56    73    15      47        59       0     8     0433     1903                0      M       0.0       0.00         29.23        29.97     16.9         21         17.3
28   2        2007-05-14  90    54    72    M       45        58       0     7     -        -                   M      M       M         0.00         29.31        29.98     14.1         21         14.6
29   1        2007-05-15  80    57    69    11      56        61       0     4     0432     1904    RA BR       0      M       0.0       0.38         29.13        29.84     8.1          27         12.3
30   2        2007-05-15  82    56    69    M       56        61       0     4     -        -       TSRA RA BR  M      M       M         0.60         29.19        29.83     8.1          25         10.8

DataFrames can be subset by rows or columns

In [4]:
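df[1:1]        # a single index selects columns: here, column 1 only
df[2:10]       # columns 2 through 10
df[1:2,2:10]   # rows 1-2 of columns 2-10 (only this last result is displayed)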
Out[4]:
Row  Date        Tmax  Tmin  Tavg  Depart  DewPoint  WetBulb  Heat  Cool
1    2007-05-01  83    50    67    14      51        56       0     2
2    2007-05-01  84    52    68    M       51        57       0     3
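
As a rough illustration of the range syntax used above, the sketch below applies the two-index form to a plain Julia matrix rather than a DataFrame. The matrix `m` and its values are made up for the example. Note that in the notebook a single index such as `df[2:10]` selects columns, which a plain matrix does not do, so only the `[rows, columns]` form carries over directly.

```julia
# Hypothetical stand-in for a small table: 3 rows, 4 columns (values are made up).
m = [11 12 13 14;
     21 22 23 24;
     31 32 33 34]

first_two_rows = m[1:2, :]     # rows 1-2, every column
middle_cols    = m[:, 2:3]     # every row, columns 2-3
block          = m[1:2, 2:3]   # rows 1-2 restricted to columns 2-3

println(size(first_two_rows))
println(block)
```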
In [9]:
names(df)
Out[9]:
22-element Array{Symbol,1}:
 :Station    
 :Date       
 :Tmax       
 :Tmin       
 :Tavg       
 :Depart     
 :DewPoint   
 :WetBulb    
 :Heat       
 :Cool       
 :Sunrise    
 :Sunset     
 :CodeSum    
 :Depth      
 :Water1     
 :SnowFall   
 :PrecipTotal
 :StnPressure
 :SeaLevel   
 :ResultSpeed
 :ResultDir  
 :AvgSpeed   
In [10]:
rename!(df,:Date,:Day)
df[:Day]
Out[10]:
2944-element DataArray{UTF8String,1}:
 "2007-05-01"
 "2007-05-01"
 "2007-05-02"
 "2007-05-02"
 "2007-05-03"
 "2007-05-03"
 "2007-05-04"
 "2007-05-04"
 "2007-05-05"
 "2007-05-05"
 "2007-05-06"
 "2007-05-06"
 "2007-05-07"
 ⋮           
 "2014-10-26"
 "2014-10-26"
 "2014-10-27"
 "2014-10-27"
 "2014-10-28"
 "2014-10-28"
 "2014-10-29"
 "2014-10-29"
 "2014-10-30"
 "2014-10-30"
 "2014-10-31"
 "2014-10-31"

Columns can also be selected by name, using symbols

In [27]:
df[:,[:Tmax,:Depart]]
Out[27]:
Row  Tmax  Depart
1    83    14
2    84    M
3    59    -3
4    60    M
5    66    2
6    67    M
7    66    4
8    78    M
9    66    5
10   66    M
11   68    4
12   68    M
13   83    10
14   84    M
15   82    12
16   80    M
17   77    13
18   76    M
19   84    14
20   83    M
21   70    4
22   73    M
23   64    -2
24   65    M
25   69    -2
26   69    M
27   90    15
28   90    M
29   80    11
30   82    M

Rows can be selected with a boolean condition

In [11]:
df[df[:Tavg].==68,:]
Out[11]:
Row  Station  Day  Tmax  Tmin  Tavg  Depart  DewPoint  WetBulb  Heat  Cool  Sunrise  Sunset  CodeSum  Depth  Water1  SnowFall  PrecipTotal  StnPressure  SeaLevel  ResultSpeed  ResultDir  AvgSpeed
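
The empty result above is probably a type mismatch rather than a lack of matching rows: the table preview contains Tavg values of 68, but columns that hold the "M" marker appear to be read as strings (the Day and Cool columns are displayed as `UTF8String` arrays). A minimal sketch in plain Julia, using a made-up vector `tavg` in place of the real column, shows that comparing strings with an integer never matches:

```julia
# Made-up stand-in for a column that was read as strings because it contains "M".
tavg = ["67", "68", "M", "68"]

int_mask = tavg .== 68      # string vs integer: false for every element
str_mask = tavg .== "68"    # string vs string: true where the text matches

println(sum(int_mask))      # number of matches against the integer
println(tavg[str_mask])     # elements selected by the string comparison
```

If that reading is right, comparing against the string `"68"` instead of the integer `68` should select the expected rows in the notebook; this is an untested suggestion based only on the column types shown.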
In [30]:
df[df[:Cool].=="M",:Cool]
Out[30]:
11-element DataArray{UTF8String,1}:
 "M"
 "M"
 "M"
 "M"
 "M"
 "M"
 "M"
 "M"
 "M"
 "M"
 "M"

You can assign a single value to every selected element of a DataArray, although assigning NA this way raised an error here:

In [31]:
df[df[:Cool].=="M",:Cool] = NA
type: non-boolean (NAtype) used in boolean context

Here we reload the file and replace every "M" value with NA, one column at a time

In [6]:
df = readtable("input/weather.csv")
for name in names(df)
  df[df[name].=="M",name] = NA
end
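
For readers on a recent Julia (1.0 or later), where the built-in `missing` value plays the role that `NA` plays in this notebook, the same clean-up idea can be sketched on a plain vector. The vector `raw` and the helper `clean_marker` are hypothetical names introduced for illustration; this is not the DataFrames API used above.

```julia
# Replace the "M" marker with `missing` and parse everything else as an integer.
# `clean_marker` is a hypothetical helper, not part of any package.
clean_marker(s) = s == "M" ? missing : parse(Int, s)

raw = ["67", "M", "51", "M"]        # made-up values
cleaned = map(clean_marker, raw)    # elements are either an Int or `missing`

n_missing = count(ismissing, cleaned)    # how many markers were replaced
total     = sum(skipmissing(cleaned))    # aggregate while skipping missing values

# Comparing with `missing` yields `missing`, so turn the result into a plain
# boolean mask before using it for indexing.
mask = coalesce.(cleaned .== 67, false)

println(n_missing)
println(total)
println(cleaned[mask])
```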

The RDatasets package provides plenty of data to play with

In [21]:
using RDatasets
In [22]:
RDatasets.packages()
Out[22]:
Row  Package       Title
1    COUNT         Functions, data and code for count data.
2    Ecdat         Data sets for econometrics
3    HSAUR         A Handbook of Statistical Analyses Using R (1st Edition)
4    HistData      Data sets from the history of statistics and data visualization
5    ISLR          Data for An Introduction to Statistical Learning with Applications in R
6    KMsurv        Data sets from Klein and Moeschberger (1997), Survival Analysis
7    MASS          Support Functions and Datasets for Venables and Ripley's MASS
8    SASmixed      Data sets from "SAS System for Mixed Models"
9    Zelig         Everyone's Statistical Software
10   adehabitatLT  Analysis of Animal Movements
11   boot          Bootstrap Functions (Originally by Angelo Canty for S)
12   car           Companion to Applied Regression
13   cluster       Cluster Analysis Extended Rousseeuw et al.
14   datasets      The R Datasets Package
15   gap           Genetic analysis package
16   ggplot2       An Implementation of the Grammar of Graphics
17   lattice       Lattice Graphics
18   lme4          Linear mixed-effects models using Eigen and S4
19   mgcv          Mixed GAM Computation Vehicle with GCV/AIC/REML smoothness estimation
20   mlmRev        Examples from Multilevel Modelling Software Review
21   nlreg         Higher Order Inference for Nonlinear Heteroscedastic Models
22   plm           Linear Models for Panel Data
23   plyr          Tools for splitting, applying and combining data
24   pscl          Political Science Computational Laboratory, Stanford University
25   psych         Procedures for Psychological, Psychometric, and Personality Research
26   quantreg      Quantile Regression
27   reshape2      Flexibly Reshape Data: A Reboot of the Reshape Package.
28   robustbase    Basic Robust Statistics
29   rpart         Recursive Partitioning and Regression Trees
30   sandwich      Robust Covariance Matrix Estimators
In [23]:
RDatasets.datasets("COUNT")
Out[23]:
Row  Package  Dataset     Title       Rows   Columns
1    COUNT    affairs     affairs     601    18
2    COUNT    azdrg112    azdrg112    1798   4
3    COUNT    azpro       azpro       3589   6
4    COUNT    badhealth   badhealth   1127   3
5    COUNT    fasttrakg   fasttrakg   15     9
6    COUNT    lbw         lbw         189    10
7    COUNT    lbwgrp      lbwgrp      6      7
8    COUNT    loomis      loomis      410    11
9    COUNT    mdvis       mdvis       2227   13
10   COUNT    medpar      medpar      1495   10
11   COUNT    rwm         rwm         27326  4
12   COUNT    rwm5yr      rwm5yr      19609  17
13   COUNT    ships       ships       40     7
14   COUNT    titanic     titanic     1316   4
15   COUNT    titanicgrp  titanicgrp  12     5
In [24]:
lbw = dataset("COUNT", "lbw")
Out[24]:
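Row  Low  Smoke  Race  Age  LWt  PTL  Ht  UI  FTV  BWt
1    0    0      2     19   182  0    0   1   0    2523
2    0    0      3     33   155  0    0   0   3    2551
3    0    1      1     20   105  0    0   0   1    2557
4    0    1      1     21   108  0    0   1   2    2594
5    0    1      1     18   107  0    0   1   0    2600
6    0    0      3     21   124  0    0   0   0    2622
7    0    0      1     22   118  0    0   0   1    2637
8    0    0      3     17   103  0    0   0   1    2637
9    0    1      1     29   123  0    0   0   1    2663
10   0    1      1     26   113  0    0   0   0    2665
11   0    0      3     19   95   0    0   0   0    2722
12   0    0      3     19   150  0    0   0   1    2733
13   0    0      3     22   95   0    1   0   0    2750
14   0    0      3     30   107  1    0   1   2    2750
15   0    1      1     18   100  0    0   0   0    2769
16   0    1      1     18   100  0    0   0   0    2769
17   0    0      2     15   98   0    0   0   0    2778
18   0    1      1     25   118  0    0   0   3    2782
19   0    0      3     20   120  0    0   1   0    2807
20   0    1      1     28   120  0    0   0   1    2821
21   0    0      3     32   121  0    0   0   2    2835
22   0    0      1     31   100  0    0   1   3    2835
23   0    0      1     36   202  0    0   0   1    2836
24   0    0      3     28   120  0    0   0   0    2863
25   0    0      3     25   120  0    0   1   2    2877
26   0    0      1     28   167  0    0   0   0    2877
27   0    1      1     17   122  0    0   0   0    2906
28   0    0      1     29   150  0    0   0   2    2920
29   0    1      2     26   168  0    0   0   0    2920
30   0    0      2     17   113  0    0   0   1    2920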