In [5]:
from IPython.display import HTML
def print_resource(txt,url,img): return HTML('<p class="resource-container"><a href="%s"><code>%s</code><span style="background: url(%s);"></span></a></p>' % (url,txt,img,))
print_resource("Prologue to Data Science",'#','datasciencestarterkit2.jpg')

It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.

~ Arthur Conan Doyle, Sherlock Holmes

But once you do have the data, what are you supposed to do with it? This prologue to Data Science takes care of the most difficult part of starting something: knowing where to start. The prologue is an annotated list of some of the best materials covering the prerequisite knowledge of stats, code and data you will need before doing data science. This guide was prepared for the benefit of Symbol & Key, Hong Kong's Data Science community, but it also functions as the pre-work for General Assembly's Data Science course. The Symbol & Key talk series is aimed at (aspiring) data practitioners, so the topics can sometimes be moderately technical. By following this guide, however, you should feel equipped with enough stats, code, and data knowledge to participate in the community and set out on your journey towards data science mastery.

Data Science Checklist

The modern Data Scientist doesn't always need to know the mathematics that goes on behind the scenes, but they do need to be intimately familiar with the characteristics of the various machine learning algorithms, e.g. which types of data they are suitable for, how to measure their accuracy, and how to interpret their output. That kind of knowledge is often only gained with experience, but there are ways to speed up the process. To optimise your learning, you need a balanced checklist: a Data Science Checklist that makes sure you are sensitive to the issues and challenges from the following three domains.

In [2]:
%matplotlib inline

import IPython
import pandas as pd
import seaborn as sns
from matplotlib_venn import venn3, venn3_circles
from matplotlib import pyplot as plt

# Utility Functions
def table(df, print_index=True, print_header=True):
    return IPython.display.display(HTML(df.to_html(index=print_index, header=print_header).replace('<table border="1" class="dataframe">','<table class="table table-striped table-hover">')))

def resource_table(ref):
    # Select the rows for this resource and every column from the third onwards.
    return table(df.loc[df.Ref == ref, df.columns[2:]], False, True)

# Generate a Venn Diagram
clist = ['red','blue','black']
subsets = (1, 1, 1, 1, 1, 1, 1)
data = {
    '100':'Stats',
    '010':'Code',
    '001':'Data',
    '111':'Data Science \n Checklist'
}

with sns.plotting_context("poster"):
    v = venn3(subsets, set_colors=clist, alpha=0.4)
    c = venn3_circles(subsets, alpha=0)

    for k in v.id2idx.keys():
        v.get_label_by_id(k).set_text('')
    
    for k, p in data.items():
        v.get_label_by_id(k).set_text(p)

    # Make the 'Data' circle slightly more opaque than the others.
    v.get_patch_by_id('001').set_alpha(.6)

    for label in data:
        v.get_label_by_id(label).set_alpha(1)
        v.get_label_by_id(label).set_color('white')
        v.get_label_by_id(label).set_fontweight(700)
        v.get_label_by_id(label).set_fontsize(20)
        v.get_label_by_id(label).set_fontname('Roboto')

To design and assess the validity of your data models, you'll need at least a stats vocabulary equivalent to that offered by a college-level course. The materials referenced in this guide provide a succinct refresher. But as you'll also want to implement and iterate over your data models once you've designed them, you also need some code skills. Data Scientists often work with a scripting language with strong machine learning libraries. R is a common contender, but for the purposes of this introduction Python is used to teach you the basics of computational thinking. Finally, the raw input of your models never comes quite as clean or in the format you want, so you will need the data skills to wrangle the data into submission.

Checklist

In [3]:
cols = ['Category','Ref','Media Type','Resource Depth','Level Up To','Style','HK$']
understanding_rank = ['bare essentials', 'conceptual', 'OK foundation', 'practical foundation','solid foundation']
df = pd.read_csv('resources.csv', names=cols)
df['Level Up To'] = pd.Categorical(df['Level Up To'], categories=understanding_rank, ordered=True)

If only there were a one-size-fits-all checklist that we could all step through and be on our way again. Unfortunately we all come to the checklist with different levels of experience and different learning needs. This checklist therefore provides a differentiated list of resources. For example, if you're already a programmer, there's a quick guide to the Python syntax. But if you're new to programming entirely, it's wise to spend some more time with some more expository materials. Whatever your skill level may be, pick at least one item from each section and add them to your checklist.
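As a rough illustration of how an ordered "Level Up To" column can be used to narrow down the list, here is a minimal sketch written for Python 3. The rows below are made up for the example and do not reproduce `resources.csv`; only the ranking labels are taken from the cell above.

```python
import pandas as pd

# Ranking labels as defined in the notebook, lowest to highest.
rank = ['bare essentials', 'conceptual', 'OK foundation',
        'practical foundation', 'solid foundation']

# Hypothetical rows standing in for resources.csv (not the real file).
resources = pd.DataFrame({
    'Ref': ['workshop', 'overview', 'lessons', 'book'],
    'Level Up To': ['bare essentials', 'conceptual',
                    'OK foundation', 'solid foundation'],
})
resources['Level Up To'] = pd.Categorical(
    resources['Level Up To'], categories=rank, ordered=True)

# Ordered categoricals can be compared against one of their labels.
enough = resources[resources['Level Up To'] >= 'OK foundation']
selected = list(enough['Ref'])
print(selected)
```

Because the category order is explicit, the comparison follows the ranking rather than alphabetical order.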

Resource Guide

Each resource is indicated by a banner image. Click on it to be taken to the actual resource, or receive further instructions on how to get it. There's a summary table under each banner, giving you a snapshot of the resource to help you evaluate the right one for you. It contains the following details:

Media Type

How is the information being delivered? Through books, videos or interactive coding exercises?

Resource Depth

Expressed in pages or time, this should give you an idea of how deeply the resource covers the materials, or rather an estimate of how much time you'll have to invest to cover it fully.

Level Up To

What can you expect to know once you've checked off that resource?

  • Bare Essentials : The depth you'd get from reading a summary on the topic at high-school level. You need to get way beyond this level before you can start thinking about data science, but it gives you pointers on some of the basic principles.
  • OK Foundation : Get to this level, and whenever you move on to a more advanced resource / project / class the topic will be familiar to you.
  • Practical Foundation : Obtain a working-level understanding of the concepts, without worrying too much about considerations beyond getting the job done.
  • Solid Foundation : Beyond just giving you a practical foundation, this resource also presents the content in the wider context of the field and explains why certain issues matter.

Style

We've all got different learning styles, so this is my best attempt to characterise the materials, either by pointing out how the content is presented, e.g. conversational, or who its intended audience is, e.g. tech-literate.

HK$

Most resources are free, but when they aren't I've expressed their retail price in HK Dollars. Whichever book you decide to buy, each one is worth its sticker price.

Checklist Requirement

Many of these resources go beyond the introductory level. Hence I've indicated which sections / units you are advised to complete to get the foundations down and satisfy the checklist requirement.

In [4]:
print('\n')

Code

Programming is making the computer do what you can’t be bothered to do, or do not have a long enough life span to do, yourself. Most analysis is crunched by programs, and most beautiful data visuals are now drawn by them. While R is the de facto standard for performing statistical analysis, it has quite a steep learning curve and there are other areas of data science for which it is not well suited. To avoid learning a new language for each specific problem domain, we will use Python and its numerous statistical libraries. If you are coming from R, you will find that much of its functionality can be replicated with NumPy, SciPy, matplotlib, and pandas.

In [5]:
print_resource("Google's Python Course",'https://developers.google.com/edu/python/','gdb.JPG')
In [6]:
resource_table('google-python')
Media Type | Resource Depth | Level Up To | Style | HK$
videos / notes / exercises | 1 day | solid foundation | tech literate | Free

Checklist Requirement

Up to Python Dict and File & Exercise: wordcount.py

Summary

This is a free class for people with a little bit of programming experience who want to learn Python. The class includes written materials, lecture videos, and lots of code exercises to practice Python coding. These materials are used within Google to introduce Python to people who have just a little programming experience.

In [7]:
print_resource("Python Language Essentials",
               'https://copy.com/6D3JLxGXWaPQGU2H','Python_natalensis_Smith_1840.jpg')
In [8]:
resource_table('pydata-python')
Media Type | Resource Depth | Level Up To | Style | HK$
book | 50 p. | practical foundation | reference | Free

Checklist Requirement

The full appendix to Python for Data Analysis

Summary

This 50 page appendix to 'Python for Data Analysis' is not intended to be an exhaustive introduction to the Python language but rather a biased, no-frills overview of features which are used repeatedly in Data Science projects. For new Python programmers, it is recommended that you supplement this with the official Python tutorial and potentially one of the many excellent (and much longer) books on general purpose Python programming.

In [9]:
print_resource('Codecademy','http://www.codecademy.com/en/tracks/python','codecademy.png')
Out[9]:
In [10]:
resource_table('codecademy-python')
Media Type | Resource Depth | Level Up To | Style | HK$
interactive lessons | ~ 9 hours | OK foundation | hand-holding | Free

Checklist Requirement

Up until Codecademy's exam statistics unit

Summary

Codecademy is a free website with tutorials that teach users rudimentary programming. Its Python course is aimed at non-programmers and will slowly take you through various programming concepts. The course is split into teaching and practice units, so you'll also learn why certain techniques are useful. Remember that Codecademy also provides an excellent glossary of concepts and techniques you'll likely employ in your adventures.

In [11]:
print_resource('Python Practice Book','http://anandology.com/python-practice-book/index.html','PythonPracticeBook.jpg')
In [12]:
resource_table('practicebook-python')
Media Type | Resource Depth | Level Up To | Style | HK$
online reference / exercises | ~ 7 hours | OK foundation | reference | Free

Checklist Requirement

Chapters 1 (Getting Started) and 2 (Working with Data). Also read A Plan for Spam by Paul Graham, which describes a method of detecting spam using the probability of a word occurring in spam.
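To make the idea of "probability of a word occurring in spam" concrete, here is a heavily simplified sketch written for Python 3. It is not the method from the essay, which adds weighting and other refinements; the two tiny corpora, the function names and the neutral value of 0.5 for unseen words are all assumptions made for illustration.

```python
from collections import Counter

# Tiny made-up corpora; a real filter trains on thousands of messages.
spam = ["win money now", "cheap money offer"]
ham = ["meeting notes attached", "lunch money owed"]


def word_counts(messages):
    """Count in how many messages each word appears."""
    counts = Counter()
    for message in messages:
        counts.update(set(message.split()))
    return counts


spam_counts = word_counts(spam)
ham_counts = word_counts(ham)


def spam_probability(word):
    """Share of a word's relative frequency that comes from spam."""
    in_spam = spam_counts[word] / len(spam)
    in_ham = ham_counts[word] / len(ham)
    if in_spam + in_ham == 0:
        return 0.5  # unseen word: assumed neutral, no evidence either way
    return in_spam / (in_spam + in_ham)


for word in ["win", "money", "meeting"]:
    print(word, round(spam_probability(word), 2))
```

A word such as "money" that shows up in both kinds of mail ends up with a middling score, while words seen only in one corpus land at the extremes.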

Summary

This book is prepared from the training notes of Anand Chitipothu. It is well presented and stays on topic, with lots of little exercises to check comprehension.

In [13]:
print_resource('General Assembly','https://generalassemb.ly/education?topic=8','gads.png')
Out[13]:
In [14]:
resource_table('ga-python')
Media Type | Resource Depth | Level Up To | Style | HK$
workshop | 150 min | bare essentials | interactive | 400

Checklist Requirement

The introductory course is run once every month in Hong Kong. Sign up through General Assembly.

If you are new to programming, however, you need to supplement this item with another Python entry on your checklist before you're ready to start doing Data Science.

Summary

This workshop explores Python's place in the scientific ecosystem, and how the language, with several readily-available open-source libraries, can serve as a powerful tool for data analysis. Designed as a stand-alone introduction to the data science aspects of Python, this class is also a recommended refresher for students planning to enroll in General Assembly's upcoming Data Science course.

In [15]:
print('\n')

Stats

Statistics is perhaps the starting point of data science. For most questions in the world we have neither measured every phenomenon nor asked every person what they think; instead we have a small recorded subset of conversations and measurements. Statistics helps us understand what we can and, just as importantly, cannot reasonably learn from that smaller group.
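The gap between a full population and the small subset we actually record can be simulated. The sketch below (Python 3, standard library only) draws a made-up population, samples 100 values from it, and reports the sample mean together with a rough margin of two standard errors. The numbers are invented purely for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# A made-up "population" of 10,000 measurements.
population = [random.gauss(170, 10) for _ in range(10000)]

# In practice we only get to see a small subset.
sample = random.sample(population, 100)

sample_mean = statistics.mean(sample)
# Standard error: how much the sample mean would vary between samples.
standard_error = statistics.stdev(sample) / len(sample) ** 0.5

print("population mean: %.1f" % statistics.mean(population))
print("sample mean: %.1f +/- %.1f" % (sample_mean, 2 * standard_error))
```

The standard error is what tells you how far to trust the sample mean as a stand-in for the population mean.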

In [16]:
print_resource('Think Stats: Statistics for Programmers','http://www.greenteapress.com/thinkstats/index.html','think_stats.png')
In [17]:
resource_table('think-stats')
Media Type | Resource Depth | Level Up To | Style | HK$
book | 3 hours | practical foundation | programmer oriented | Free

Checklist Requirement

  1. Preface (8 mins)
  2. Statistical Thinking for Programmers (14 mins)
  3. Descriptive Statistics (19 mins)
  4. Cumulative Distribution Functions (16 mins)
  5. Continuous Distributions (23 mins)
  6. Probability (23 mins)

Summary

Think Stats is an introduction to Probability and Statistics for people who have some exposure to Python. It emphasizes simple techniques you can use to explore real data sets and answer interesting questions. Readers are encouraged to work on a project with real datasets.

Because it uses a programming language, it covers data analysis from beginning to end: viewing data, calculating descriptive statistics, identifying outliers, describing data using the distributions (and explaining what the distributions really mean!). Going through this small book, the goal is understanding and using statistics, not just learning statistics.
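A flavour of that workflow, as a minimal Python 3 sketch using only the standard library: compute descriptive statistics on a small made-up data set, notice how one outlier pulls the mean away from the median, and build a simple empirical cumulative distribution function. The data and the `cdf` helper are assumptions for illustration and are not taken from the book.

```python
import statistics

# Made-up data with one obvious outlier.
values = [2, 3, 3, 4, 5, 5, 5, 6, 7, 40]

mean = statistics.mean(values)
median = statistics.median(values)


def cdf(x):
    """Empirical CDF: fraction of values less than or equal to x."""
    return sum(1 for v in values if v <= x) / len(values)


print("mean:", mean)      # pulled upwards by the outlier
print("median:", median)  # robust to the outlier
print("share of values <= 5:", cdf(5))
```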

In [18]:
title = 'Statistics in a Nutshell, 2nd Edition'
url = 'http://shop.oreilly.com/product/0636920023074.do'
img = 'nutshell.jpg'
print_resource(title,url,img)
In [19]:
resource_table('nutshell-stats')
Media Type | Resource Depth | Level Up To | Style | HK$
book | 190 p. | solid foundation | text book | 240

Checklist Requirement

  1. Basic Concepts of Measurement
  2. Probability [Optional]
  3. Inferential Statistics
  4. Descriptive Statistics and Graphic Displays
  5. Categorical Data
  6. The t-Test
  7. The Pearson Correlation Coefficient
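As a taste of the last chapter on that list, the Pearson correlation coefficient can be computed from first principles in a few lines. This is a plain Python 3 sketch with made-up numbers; the `pearson` function name is an assumption, and in practice you would likely reach for a library routine instead.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Made-up data: scores rise in lockstep with hours studied.
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 6, 8, 10]

print(pearson(hours, scores))
print(pearson(hours, scores[::-1]))
```

A perfectly linear increasing relationship gives a coefficient of 1, and reversing one of the sequences flips the sign.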

Summary

Statistics in a Nutshell is a clear and concise introduction and reference for anyone new to the subject, and a most useful book for those who want to apply this powerful tool to real problems. It is perhaps not so much an introduction to statistics as a tool for those who know there is a procedure that will help them solve a present problem, but can't remember what it was or exactly how to use it.

In [20]:
title = 'Naked Statistics'
url = 'http://www.amazon.com/Naked-Statistics-Stripping-Dread-Data-ebook/dp/B007Q6XLF2'
img = 'naked.jpg'
print_resource(title,url,img)
Out[20]:
In [21]:
resource_table('naked-stats')
Media Type | Resource Depth | Level Up To | Style | HK$
book | 200 p. | solid foundation | anecdotal | 80

Checklist Requirement

  1. What's the Point
  2. Descriptive Stats
  3. Deceptive Description
  4. Correlation
  5. Basic Probability & The Monty Hall Problem
  6. Problems with Probability
  7. The Importance of Data
  8. The Central Limit Theorem
  9. Inference
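The Monty Hall Problem from the list above is a good candidate for checking intuition with code. The following Python 3 sketch simulates the game many times with and without switching doors. The function names, trial count and seed are arbitrary choices for the illustration.

```python
import random


def play(switch, rng):
    """Play one round; return True if the final pick hides the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the one remaining closed door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch, trials=20000, seed=0):
    """Fraction of rounds won over many simulated games."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials


print("stay:  ", win_rate(switch=False))
print("switch:", win_rate(switch=True))
```

Switching should win roughly two times out of three, and staying roughly one time out of three.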

Summary

Charles Wheelan's Naked Statistics is an insightful book laced with college-style humor as indicated by the soft porn book cover. Read this book if you want to understand the concepts behind statistics without having to mine a text book. The book is a quick read at only 250 pages, much of it skimmable. It is especially valuable for digital analytics professionals and marketing executives who want to understand more about data science predictions which are essentially statistically-based "guesstimates".

In [22]:
title = 'Metacademy'
url = 'https://www.metacademy.org/graphs/concepts/expectation_and_variance#focus=expectation_and_variance&mode=learn'
img = 'metacademy.png'
print_resource(title,url,img)
Out[22]:
In [23]:
resource_table('metacademy-stats')
Media Type | Resource Depth | Level Up To | Style | HK$
meta-reference | 7 Hours | solid foundation | systematic | Free/$

Checklist Requirement

Sample items from Metacademy's Learning Plan. Probability and Random Variables are prerequisite topics prior to picking up Expectation and Variance. Review as is necessary for your level of understanding.
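For a quick feel of what Expectation and Variance mean, here is a small Python 3 sketch that computes both for a fair six-sided die using exact fractions. The die example and the helper names are assumptions chosen for illustration; they are not taken from Metacademy.

```python
from fractions import Fraction

# Probability distribution of a fair die: each face has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}


def expectation(dist):
    """E[X]: probability-weighted average of the outcomes."""
    return sum(x * p for x, p in dist.items())


def variance(dist):
    """Var[X]: expected squared distance from the mean."""
    mu = expectation(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())


print("E[X]   =", expectation(die))
print("Var[X] =", variance(die))
```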

Summary

Metacademy is built around an interconnected web of concepts, each one annotated with a short description, a set of learning goals, a (very rough) time estimate, and pointers to learning resources. The concepts are arranged in a prerequisite graph, which is used to generate a learning plan for a concept. It's pretty fantastic, and a much more elaborate implementation of the idea this checklist was built on.

In [24]:
title = 'School of Data'
url = 'http://schoolofdata.org/handbook/courses/the-math-you-need-to-start/'
img = 'SCODAbadges.png'
print_resource(title,url,img)
Out[24]:
In [25]:
resource_table('school-of-data')
Media Type | Resource Depth | Level Up To | Style | HK$
blog article | 90 mins | bare essentials | hand-holding | Free

Checklist Requirement

Read the full article. Since it only covers the bare essentials, it is recommended that you follow up this item with another item from the stats section.

Summary

Math seems to be a scary thing for many people. If you tend to get scared by thinking about numbers and what to do with them, this item is for you. It claims to "tame the beast and show you how much you can do – with counting, adding, and dividing numbers".

In [26]:
print('\n')

Data

Pandas is a Python library for doing data analysis. It's really fast and lets you do exploratory work incredibly quickly. You can think of pandas as the tool that holds your data. The better you know how to merge in new data and ask for a particular subset of data, the simpler it will be for you to bring more evidence into your dataset and answer more specific questions about your data.
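Merging in new data and asking for a particular subset might look like the sketch below (Python 3). The two small tables, their column names and their values are invented for the example.

```python
import pandas as pd

# Made-up sales records.
sales = pd.DataFrame({
    'city': ['Hong Kong', 'Taipei', 'Hong Kong'],
    'amount': [120, 80, 200],
})

# Made-up lookup table that adds extra evidence about each city.
offices = pd.DataFrame({
    'city': ['Hong Kong', 'Taipei'],
    'office': ['HK-01', 'TP-01'],
})

# Merge the lookup data into the sales records, keeping every sale.
merged = sales.merge(offices, on='city', how='left')

# Ask for a specific subset: large sales in Hong Kong.
big_hk = merged[(merged['city'] == 'Hong Kong') & (merged['amount'] > 150)]
print(big_hk)
```

Using `how='left'` keeps every row of `sales` even if a city were missing from the lookup table.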

In [27]:
title = 'Pandas Cookbook'
url = 'https://github.com/jvns/pandas-cookbook'
img = 'cookbook.png'
print_resource(title,url,img)
Out[27]:
In [28]:
resource_table('cookbook-pandas')
Media Type | Resource Depth | Level Up To | Style | HK$
notebooks / exercises | 2 hours | practical foundation | code demo | Free

Checklist Requirement

  1. A quick tour of the IPython Notebook - Shows off IPython's awesome tab completion and magic functions.
  2. Reading from a CSV - Reading your data into pandas is pretty much the easiest thing. Even when the encoding is wrong!
  3. Selecting data - It's not totally obvious how to select data from a pandas dataframe. Here I explain the basics (how to take slices and get columns)
  4. More selecting data - Here we get into serious slicing and dicing and learn how to filter dataframes in complicated ways, really fast.
  5. groupby and aggregate - The groupby/aggregate is seriously my favorite thing about pandas and I use it all the time. You should probably read this.
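To give an idea of what the groupby/aggregate pattern in the last chapter looks like, here is a minimal Python 3 sketch. The data frame is made up for the example and is not the data set used in the cookbook.

```python
import pandas as pd

# Made-up counts of cyclists recorded on different weekdays.
rides = pd.DataFrame({
    'weekday': ['Mon', 'Mon', 'Tue', 'Tue', 'Tue'],
    'cyclists': [10, 30, 20, 40, 60],
})

# Split the rows by weekday, then aggregate each group.
summary = rides.groupby('weekday')['cyclists'].agg(['sum', 'mean'])
print(summary)
```

Each row of `summary` corresponds to one weekday, with one column per aggregate.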

Summary

The goal of this cookbook is to give you some concrete examples for getting started with pandas. The docs are really comprehensive. However, I've often had people tell me that they have some trouble getting started, so these are examples with real-world data, and all the bugs and weirdness that that entails.

In [29]:
title = 'Learn Pandas'
url = 'https://bitbucket.org/hrojas/learn-pandas'
img = 'learn.png'
print_resource(title,url,img)
Out[29]:
In [30]:
resource_table('learn-pandas')
Media Type | Resource Depth | Level Up To | Style | HK$
notebooks | 4 hours | solid foundation | code demo | Free

Checklist Requirement

  1. Babynames
  2. Babynames Redux
  3. Customer Data
  4. Back to Basics
  5. Stack & Unstack
  6. Groupby
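Stack and unstack, one of the topics listed above, reshape a table between a wide and a long layout. The Python 3 sketch below uses a made-up table of counts per person and year; it is only meant to hint at what that notebook covers.

```python
import pandas as pd

# Made-up wide table: one row per person, one column per year.
wide = pd.DataFrame({'2013': [5, 7], '2014': [6, 9]},
                    index=['Ada', 'Bob'])

# stack() moves the column labels into an inner index level,
# giving one value per (person, year) pair.
long = wide.stack()

# unstack() reverses the operation and restores the wide layout.
back = long.unstack()

print(long)
print(back)
```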

Summary

A series of simple notebooks showcasing how to do data analysis with pandas, based on motivating examples such as finding baby name popularity and customer counts.

In [31]:
print_resource('Python for Data Analysis','http://shop.oreilly.com/product/0636920023784.do','pydata.png')
In [32]:
resource_table('pydata-pandas')
Media Type | Resource Depth | Level Up To | Style | HK$
book | 90 p. | solid foundation | reference | 150

Checklist Requirement

  • Chapter 1 : Preliminaries
  • Chapter 2 : Introductory Example
  • Chapter 5 : Getting Started with Pandas

Summary

Python for Data Analysis is concerned with the nuts and bolts of manipulating, processing, cleaning, and crunching data in Python. It is also a practical, modern introduction to scientific computing in Python, tailored for data-intensive applications. This is a book about the parts of the Python language and libraries you’ll need to effectively solve a broad set of data analysis problems.

In [33]:
print_resource('10 Min to Pandas','http://pandas.pydata.org/pandas-docs/stable/10min.html#min','10mintopandas.png')
Out[33]:
In [34]:
resource_table('10min-pandas')
Media Type | Resource Depth | Level Up To | Style | HK$
overview | 1 hour | conceptual | highlight features | Free

Checklist Requirement

This overview gives you an idea of what's possible with pandas, but it won't teach you how to do it, so follow this item up with another one from the data category.

Summary

Shows off what pandas can do.

Resources for Further Study

Upon strengthening your knowledge base in both Python and Statistics, you'll be ready to embark on your journey to become a Data Scientist. General Assembly's Data Science course picks up where this prologue ends. It first offers a chance to clarify anything that wasn't clear from the prologue. The instructor then sets off on an 11-week tour to develop the student's programming ability and knowledge of statistical methods. The course provides an in-depth overview of the most popular machine learning algorithms, and culminates in an individual data science project.

The course curriculum was developed in-house by General Assembly, but any further study of Data Science benefits from having these two books handy for reference.

Building Machine Learning Systems with Python shows you how to find patterns in raw data. The book starts by brushing up on your Python ML knowledge and introducing libraries, then moves on to more serious projects covering datasets, modelling, recommendations and how to improve them, and sound and image processing.

In [35]:
print_resource('Building Machine Learning Systems','http://shop.oreilly.com/product/9781782161400.do', 'building_machine_learning_python.png')

Python for Data Analysis is concerned with the nuts and bolts of manipulating, processing, cleaning, and crunching data in Python. It is also a practical, modern introduction to scientific computing in Python, tailored for data-intensive applications. It was authored by the lead developer of the pandas package that's been discussed here, so acts as an insider's guide to everything pandas.

In [36]:
print_resource('Python for Data Analysis','http://shop.oreilly.com/product/0636920023784.do','pydata.png')

With sophisticated analytics, cool new technologies, lean learning principles and agile delivery methods, data science is an exciting, emerging field to join.

Please reach out if you have questions about anything or need help!

All the best,

Mart van de Ven

In [37]:
HTML('''<script>

code_show=true;

function code_toggle() {
    if (code_show){ 
        $('div.input').hide();
        $('.output_scroll').removeClass('output_scroll');
        $('.prompt').hide();
    } else {
        $('div.input').show();
        $('.output_scroll').removeClass('output_scroll');
        $('.prompt').show();
    }
    code_show = !code_show
}
</script>
 
<a class='btn btn-warning btn-lg' style="margin:0 auto; display:block; max-width:320px" href="javascript:code_toggle()">TOGGLE CODE</a>''')
Out[37]:
TOGGLE CODE
In [38]:
HTML('''<link href='http://fonts.googleapis.com/css?family=Roboto|Open+Sans' rel='stylesheet' type='text/css'>
<link rel="stylesheet" type="text/css" href="./theme/custom.css">

<script>
$(function(){
    code_toggle()
})
</script>
''')
Out[38]: