from IPython.core.display import HTML
def print_resource(txt,url,img): return HTML('<p class="resource-container"><a href="%s"><code>%s</code><span style="background: url(%s);"></span></a></p>' % (url,txt,img,))
print_resource("Prologue to Data Science",'#','datasciencestarterkit2.jpg')
It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.
But once you do have the data, what are you supposed to do then? This prologue to Data Science takes care of the most difficult part of starting something: knowing where to start. The prologue is an annotated list of some of the best materials covering the prerequisite knowledge of stats, code and data you will need before doing data science. This guide was prepared for the benefit of Symbol & Key, Hong Kong's Data Science community, but it also functions as the pre-work for General Assembly's Data Science course. The Symbol & Key talk series is aimed at (aspiring) data practitioners, so the topics can sometimes be moderately technical. By following this guide, however, you should feel equipped with enough stats, code, and data knowledge to participate in the community and set out on your journey towards data science mastery.
The modern Data Scientist doesn't always need to know the mathematics that goes on behind the scenes, but they do need to be intimately familiar with the characteristics of the various machine learning algorithms, e.g. which types of data they are suitable for, how to measure their accuracy, and how to interpret their output. That kind of knowledge is often only gained with experience, but there are ways to speed up the process. To optimise your learning, you need a balanced checklist: a Data Science Checklist that keeps you sensitive to the issues and challenges from the following three domains.
%matplotlib inline
import IPython
import pandas as pd
import seaborn as sns
from matplotlib_venn import venn3, venn3_circles
from matplotlib import pyplot as plt
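# Utility Functions
def table(df, print_index=True, print_header=True):
    # Render a DataFrame as a Bootstrap-styled HTML table.
    html = df.to_html(index=print_index, header=print_header)
    html = html.replace('<table border="1" class="dataframe">', '<table class="table table-striped table-hover">')
    return IPython.display.display(HTML(html))
def resource_table(ref):
    # Show the summary columns (third column onwards) for the resource with this Ref.
    # .loc replaces the .ix indexer, which newer pandas releases no longer provide.
    return table(df.loc[df.Ref == ref, df.columns[2:]], False, True)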
# Generate a Venn Diagram
clist = ['red','blue','black']
subsets = (1, 1, 1, 1, 1, 1, 1)
data = {
'100':'Stats',
'010':'Code',
'001':'Data',
'111':'Data Science \n Checklist'
}
with sns.plotting_context("poster"):
    v = venn3(subsets, set_colors=clist, alpha=0.4)
    c = venn3_circles(subsets, alpha=0)
    # Clear the default region labels, then set only the ones defined in `data`.
    for k in v.id2idx.keys():
        v.get_label_by_id(k).set_text('')
    for k, p in data.items():
        v.get_label_by_id(k).set_text(p)
    v.get_patch_by_id('001').set_alpha(.6)
    # Style the remaining labels.
    for label in data:
        v.get_label_by_id(label).set_alpha(1)
        v.get_label_by_id(label).set_color('white')
        v.get_label_by_id(label).set_fontweight(700)
        v.get_label_by_id(label).set_fontsize(20)
        v.get_label_by_id(label).set_fontname('Roboto')
To design and assess the validity of your data models, you'll need at least a stats vocabulary equivalent to that offered by a college-level course. The materials referenced in this guide provide a succinct refresher. But as you'll also want to implement and iterate over your data models once you've designed them, you also need some code skills. Data Scientists often work with a scripting language that has strong machine learning libraries. R is a common contender, but for the purposes of this introduction Python is used to teach you the basics of computational thinking. Finally, the raw input of your models never comes in quite the format you want or as clean as you want it, so you will need the data skills to wrangle the data into submission.
cols = ['Category','Ref','Media Type','Resource Depth','Level Up To','Style','HK$']
understanding_rank = ['bare essentials', 'conceptual', 'OK foundation', 'practical foundation','solid foundation']
df = pd.read_csv('resources.csv', names=cols)
df['Level Up To'] = pd.Categorical(df['Level Up To'], categories=understanding_rank, ordered=True)
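Because `Level Up To` is stored as an ordered categorical, resources can be sorted and compared by rank rather than alphabetically. The sketch below is only an illustration: it uses a small hand-built frame (`demo`, a hypothetical name) with a few `Ref` and `Level Up To` values copied from the summary tables in this guide, instead of reading `resources.csv`.

```python
import pandas as pd

understanding_rank = ['bare essentials', 'conceptual', 'OK foundation',
                      'practical foundation', 'solid foundation']

# Hand-built stand-in for resources.csv, using values shown in this guide's tables.
demo = pd.DataFrame({
    'Ref': ['google-python', 'ga-python', 'codecademy-python', '10min-pandas'],
    'Level Up To': ['solid foundation', 'bare essentials',
                    'OK foundation', 'conceptual'],
})
demo['Level Up To'] = pd.Categorical(demo['Level Up To'],
                                     categories=understanding_rank,
                                     ordered=True)

# Sorting follows the rank order defined above, not alphabetical order.
ranked_refs = demo.sort_values('Level Up To')['Ref'].tolist()

# Ordered categoricals can also be compared against a category label.
at_least_ok = demo[demo['Level Up To'] >= 'OK foundation']
ok_refs = sorted(at_least_ok['Ref'].tolist())
```

Here `ranked_refs` runs from the lightest resource to the most thorough one, and `ok_refs` keeps only those that promise at least an "OK foundation".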
If only there were a one-size-fits-all checklist that we could all step through and be on our way again. Unfortunately we all come to the checklist with different levels of experience and different learning needs. This checklist therefore provides a differentiated list of resources. For example, if you're already a programmer, there's a quick guide to the Python syntax. But if you're new to programming entirely, it's wise to spend some more time with some more expository materials. Whatever your skill level may be, pick at least one item from each section and add them to your checklist.
Each resource is indicated by a banner image. Click on it to be taken to the actual resource, or receive further instructions on how to get it. There's a summary table under each banner, giving you a snapshot of the resource to help you evaluate the right one for you. It contains the following details:
Media Type: How is the information being delivered? Through books, videos or interactive coding exercises?
Resource Depth: Expressed in pages or time, this should give you an idea of how deeply the resource covers the material, or rather an estimate of how much time you'll have to invest to cover it fully.
Level Up To: What can you expect to know once you've checked off that resource?
Style: We've all got different learning styles, so this is my best attempt to characterise the materials, either by pointing out how the content is presented, e.g. conversational, or who its intended audience is, e.g. tech-literate.
HK$: Most resources are free, but when they aren't I've expressed their retail price in HK Dollars. Whichever book you decide to buy, they are all worth their respective sticker prices.
Many of these resources go beyond the introductory level. Hence I've indicated which sections / units you are advised to complete to get the foundations down and satisfy the checklist requirement.
print('\n')
Programming is making the computer do what you can’t be bothered to do, or do not have a long enough life span to do, yourself. Most analysis is crunched by programs, and now most beautiful data visuals are drawn by them too. While R is the de facto standard for performing statistical analysis, it has quite a steep learning curve and there are other areas of data science for which it is not well suited. To avoid learning a new language for a specific problem domain, we will use Python and its numerous statistical libraries. If you are coming from R, you will find that much of its functionality can be replicated with NumPy, SciPy, matplotlib, and pandas.
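As a rough illustration of that overlap, the sketch below mirrors a few everyday R calls (`mean`, `sd`, `summary`) with NumPy and pandas. The sample values are made up, and the R equivalents named in the comments are approximate rather than exact matches.

```python
import numpy as np
import pandas as pd

# Made-up sample values, purely for illustration.
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean_value = values.mean()       # roughly R's mean(x)
sample_sd = values.std(ddof=1)   # roughly R's sd(x), which divides by n - 1

# describe() reports count, mean, std and quartiles, similar in spirit to R's summary(x).
overview = pd.Series(values).describe()
```

Note the `ddof=1` argument: NumPy's `std` divides by `n` by default, whereas R's `sd` divides by `n - 1`.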
print_resource("Google's Python Course",'https://developers.google.com/edu/python/','gdb.JPG')
resource_table('google-python')
Media Type | Resource Depth | Level Up To | Style | HK$ |
---|---|---|---|---|
videos / notes / exercises | 1 day | solid foundation | tech literate | Free |
Up to Python Dict and File & Exercise: wordcount.py
This is a free class for people with a little bit of programming experience who want to learn Python. The class includes written materials, lecture videos, and lots of code exercises to practice Python coding. These materials are used within Google to introduce Python to people who have just a little programming experience.
print_resource("Python Language Essentials",
'https://copy.com/6D3JLxGXWaPQGU2H','Python_natalensis_Smith_1840.jpg')
resource_table('pydata-python')
Media Type | Resource Depth | Level Up To | Style | HK$ |
---|---|---|---|---|
book | 50 p. | practical foundation | reference | Free |
The full appendix to Python for Data Analysis
This 50 page appendix to 'Python for Data Analysis' is not intended to be an exhaustive introduction to the Python language but rather a biased, no-frills overview of features which are used repeatedly in Data Science projects. For new Python programmers, it is recommended that you supplement this with the official Python tutorial and potentially one of the many excellent (and much longer) books on general purpose Python programming.
print_resource('Codecademy','http://www.codecademy.com/en/tracks/python','codecademy.png')
resource_table('codecademy-python')
Media Type | Resource Depth | Level Up To | Style | HK$ |
---|---|---|---|---|
interactive lessons | ~ 9 hours | OK foundation | hand-holding | Free |
Up until Codecademy's exam statistics unit
Codecademy is a free website with tutorials that teach rudimentary programming. Its Python course is aimed at non-programmers and will slowly take you through various programming concepts. The course is split into teaching and practice units, so you'll also learn why certain techniques are useful. Remember that Codecademy also provides an excellent glossary of concepts and techniques you'll likely employ in your adventures.
print_resource('Python Practice Book','http://anandology.com/python-practice-book/index.html','PythonPracticeBook.jpg')
resource_table('practicebook-python')
Media Type | Resource Depth | Level Up To | Style | HK$ |
---|---|---|---|---|
online reference / exercises | ~ 7 hours | OK foundation | reference | Free |
Chapters 1 (Getting Started) and 2 (Working with Data). Also read A Plan for Spam by Paul Graham, which describes a method of detecting spam using the probability of a word occurring in spam.
This book is prepared from the training notes of Anand Chitipothu. It is well presented and stays on topic, with lots of little exercises to check comprehension.
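The idea behind that essay, estimating how likely a word is to show up in spam, can be sketched with the kind of basic Python these chapters cover. This is a simplified illustration on a made-up corpus, not Graham's actual formula; the function names and the 0.5 fallback for unseen words are assumptions of this sketch.

```python
from collections import Counter

# Tiny made-up corpora; a real filter would use thousands of messages.
spam = ["win money now", "free money offer", "win a free prize"]
ham = ["meeting moved to monday", "lunch money for monday", "see you at the meeting"]

def word_counts(messages):
    """Count in how many messages each word appears."""
    counts = Counter()
    for message in messages:
        counts.update(set(message.split()))
    return counts

spam_counts = word_counts(spam)
ham_counts = word_counts(ham)

def spam_probability(word):
    """Share of a word's message frequency that comes from spam."""
    spam_rate = spam_counts[word] / float(len(spam))
    ham_rate = ham_counts[word] / float(len(ham))
    if spam_rate + ham_rate == 0:
        return 0.5  # unseen word: no evidence either way (assumption of this sketch)
    return spam_rate / (spam_rate + ham_rate)
```

With this corpus, `spam_probability('free')` is high because the word only appears in spam, while `spam_probability('meeting')` is low because it only appears in legitimate mail.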
print_resource('General Assembly','https://generalassemb.ly/education?topic=8','gads.png')
resource_table('ga-python')
Media Type | Resource Depth | Level Up To | Style | HK$ |
---|---|---|---|---|
workshop | 150 min | bare essentials | interactive | 400 |
The introductory course is run once every month in Hong Kong. Sign up through General Assembly.
If you are new to programming, though, you need to supplement this item with another Python entry on your checklist before you're ready to start doing Data Science.
This workshop explores Python's place in the scientific ecosystem, and how the language, with several readily-available open-source libraries, can serve as a powerful tool for data analysis. Designed as a stand-alone introduction to the data science aspects of Python, this class is also a recommended refresher for students planning to enroll in General Assembly's upcoming Data Science course.
print('\n')
Statistics is perhaps the starting point of data science. For most questions in the world we have neither measured every phenomenon nor asked every person what they think; instead we have a small recorded subset of conversations and measurements. Statistics helps us understand what we can, and just as importantly what we cannot, reasonably learn from that smaller group.
print_resource('Think Stats: Statistics for Programmers','http://www.greenteapress.com/thinkstats/index.html','think_stats.png')
resource_table('think-stats')
Media Type | Resource Depth | Level Up To | Style | HK$ |
---|---|---|---|---|
book | 3 hours | practical foundation | programmer oriented | Free |
Think Stats is an introduction to Probability and Statistics for people who have some exposure to Python. It emphasizes simple techniques you can use to explore real data sets and answer interesting questions. Readers are encouraged to work on a project with real datasets.
Because it uses a programming language, it covers data analysis from beginning to end: viewing data, calculating descriptive statistics, identifying outliers, describing data using the distributions (and explaining what the distributions really mean!). Going through this small book, the goal is understanding and using statistics, not just learning statistics.
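To give a flavour of that workflow, here is a small sketch that computes descriptive statistics and flags outliers with the common 1.5 × IQR rule of thumb. The data and the choice of rule are assumptions made for illustration; they are not taken from the book.

```python
import numpy as np

# Made-up measurements with one suspicious value.
data = np.array([12.0, 13.0, 13.5, 14.0, 14.5, 15.0, 15.5, 16.0, 40.0])

mean = data.mean()
median = np.median(data)

# Interquartile range: the spread of the middle 50% of the data.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Flag points more than 1.5 * IQR outside the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
```

The single extreme value pulls the mean above the median, which is one reason to look at both before describing a data set.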
title = 'Statistics in a Nutshell, 2nd Edition'
url = 'http://shop.oreilly.com/product/0636920023074.do'
img = 'nutshell.jpg'
print_resource(title,url,img)
resource_table('nutshell-stats')
Media Type | Resource Depth | Level Up To | Style | HK$ |
---|---|---|---|---|
book | 190 p. | solid foundation | text book | 240 |
Statistics in a Nutshell is a clear and concise introduction and reference for anyone new to the subject, and it is especially useful for those who want to apply this powerful tool to real problems. It is perhaps less an introduction to statistics than a tool for those who know there is a procedure that will help them solve a present problem, but can't remember what it was or exactly how to use it.
title = 'Naked Statistics'
url = 'http://www.amazon.com/Naked-Statistics-Stripping-Dread-Data-ebook/dp/B007Q6XLF2'
img = 'naked.jpg'
print_resource(title,url,img)
resource_table('naked-stats')
Media Type | Resource Depth | Level Up To | Style | HK$ |
---|---|---|---|---|
book | 200 p. | solid foundation | anecdotal | 80 |
Charles Wheelan's Naked Statistics is an insightful book laced with college-style humor as indicated by the soft porn book cover. Read this book if you want to understand the concepts behind statistics without having to mine a text book. The book is a quick read at only 250 pages, much of it skimmable. It is especially valuable for digital analytics professionals and marketing executives who want to understand more about data science predictions which are essentially statistically-based "guesstimates".
title = 'Metacademy'
url = 'https://www.metacademy.org/graphs/concepts/expectation_and_variance#focus=expectation_and_variance&mode=learn'
img = 'metacademy.png'
print_resource(title,url,img)
resource_table('metacademy-stats')
Media Type | Resource Depth | Level Up To | Style | HK$ |
---|---|---|---|---|
meta-reference | 7 Hours | solid foundation | systematic | Free/$ |
Sample items from Metacademy's Learning Plan. Probability and Random Variables are prerequisite topics prior to picking up Expectation and Variance. Review as is necessary for your level of understanding.
Metacademy is built around an interconnected web of concepts, each one annotated with a short description, a set of learning goals, a (very rough) time estimate, and pointers to learning resources. The concepts are arranged in a prerequisite graph, which is used to generate a learning plan for a concept. It's pretty fantastic, and a much more elaborate implementation of the idea this checklist was built on.
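For a taste of the Expectation and Variance concept linked above, the two definitions can be worked through for a fair six-sided die. This is a minimal sketch chosen for illustration and is not part of Metacademy's materials; exact fractions are used so the results are not blurred by floating-point rounding.

```python
from fractions import Fraction

# A fair six-sided die: each face has probability 1/6.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

# E[X] = sum over x of x * P(X = x)
expectation = sum(x * p for x, p in pmf.items())

# Var(X) = E[(X - E[X])^2]
variance = sum((x - expectation) ** 2 * p for x, p in pmf.items())
```

The expectation comes out as 7/2 and the variance as 35/12, which you can check by hand against the definitions in the comments.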
title = 'School of Data'
url = 'http://schoolofdata.org/handbook/courses/the-math-you-need-to-start/'
img = 'SCODAbadges.png'
print_resource(title,url,img)
resource_table('school-of-data')
Media Type | Resource Depth | Level Up To | Style | HK$ |
---|---|---|---|---|
blog article | 90 mins | bare essentials | hand-holding | Free |
Read the full article. Since it only covers the bare essentials, it is recommended that you follow up this item with another item from the stats section.
Math seems to be a scary thing for many people. If you tend to get scared by thinking about numbers and what to do with them, this item is for you. It claims to "tame the beast and show you how much you can do – with counting, adding, and dividing numbers".
print('\n')
Pandas is a Python library for doing data analysis. It's really fast and lets you do exploratory work incredibly quickly. You can think of pandas as the tool that holds your data. The better you know how to merge in new data and ask for a particular subset of it, the simpler it will be to bring more evidence into your dataset and answer more specific questions about your data.
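Those two operations, merging in new data and asking for a subset, look roughly like this. The tables, column names and values below are made up for illustration.

```python
import pandas as pd

# Two made-up tables that share a 'name' key.
people = pd.DataFrame({'name': ['Ann', 'Bob', 'Cy'], 'age': [34, 27, 45]})
scores = pd.DataFrame({'name': ['Ann', 'Bob', 'Cy'], 'score': [88, 92, 75]})

# Merge in new data: one row per name, with columns from both tables.
merged = pd.merge(people, scores, on='name')

# Ask for a particular subset: only the rows where age is above 30.
over_30 = merged[merged['age'] > 30]
over_30_names = over_30['name'].tolist()
```

`pd.merge` performs an inner join by default, so only names present in both tables would survive; here every name appears in both.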
title = 'Pandas Cookbook'
url = 'https://github.com/jvns/pandas-cookbook'
img = 'cookbook.png'
print_resource(title,url,img)
resource_table('cookbook-pandas')
Media Type | Resource Depth | Level Up To | Style | HK$ |
---|---|---|---|---|
notebooks / exercises | 2 hours | practical foundation | code demo | Free |
The goal of this cookbook is to give you some concrete examples for getting started with pandas. The docs are really comprehensive. However, I've often had people tell me that they have some trouble getting started, so these are examples with real-world data, and all the bugs and weirdness that that entails.
title = 'Learn Pandas'
url = 'https://bitbucket.org/hrojas/learn-pandas'
img = 'learn.png'
print_resource(title,url,img)
resource_table('learn-pandas')
Media Type | Resource Depth | Level Up To | Style | HK$ |
---|---|---|---|---|
notebooks | 4 hours | solid foundation | code demo | Free |
print_resource('Python for Data Analysis','http://shop.oreilly.com/product/0636920023784.do','pydata.png')
resource_table('pydata-pandas')
Media Type | Resource Depth | Level Up To | Style | HK$ |
---|---|---|---|---|
book | 90 p. | solid foundation | reference | 150 |
Python for Data Analysis is concerned with the nuts and bolts of manipulating, processing, cleaning, and crunching data in Python. It is also a practical, modern introduction to scientific computing in Python, tailored for data-intensive applications. This is a book about the parts of the Python language and libraries you’ll need to effectively solve a broad set of data analysis problems.
print_resource('10 Min to Pandas','http://pandas.pydata.org/pandas-docs/stable/10min.html#min','10mintopandas.png')
resource_table('10min-pandas')
Media Type | Resource Depth | Level Up To | Style | HK$ |
---|---|---|---|---|
overview | 1 hour | conceptual | highlight features | Free |
Upon strengthening your knowledge base in both Python and Statistics, you'll be ready to embark on your journey to become a Data Scientist. General Assembly's Data Science course picks up where this prologue ends. It first offers a chance to clarify anything that wasn't clear from the prologue. The instructor then sets off on an 11-week tour to develop the student's programming ability and knowledge of statistical methods. The course provides an in-depth overview of the most popular machine learning algorithms, and culminates in an individual data science project.
The course curriculum was developed in-house by General Assembly, but any further study of Data Science benefits from having these two books handy for reference.
Building Machine Learning Systems with Python shows you how to find patterns in raw data. The book starts by brushing up on your Python ML knowledge and introducing libraries, and then moves on to more serious projects covering datasets, modelling, recommendations and how to improve them, and sound and image processing.
print_resource('Building Machine Learning Systems','http://shop.oreilly.com/product/9781782161400.do', 'building_machine_learning_python.png')
Python for Data Analysis is concerned with the nuts and bolts of manipulating, processing, cleaning, and crunching data in Python. It is also a practical, modern introduction to scientific computing in Python, tailored for data-intensive applications. It was authored by the lead developer of the pandas package that's been discussed here, so acts as an insider's guide to everything pandas.
print_resource('Python for Data Analysis','http://shop.oreilly.com/product/0636920023784.do','pydata.png')
With sophisticated analytics, cool new technologies, lean learning principles and agile delivery methods, data science is an exciting, emerging field to join.
Please reach out if you have questions about anything or need help!
All the best,
Mart van de Ven
HTML('''<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
$('.output_scroll').removeClass('output_scroll');
$('.prompt').hide();
} else {
$('div.input').show();
$('.output_scroll').removeClass('output_scroll');
$('.prompt').show();
}
code_show = !code_show
}
</script>
<a class='btn btn-warning btn-lg' style="margin:0 auto; display:block; max-width:320px" href="javascript:code_toggle()">TOGGLE CODE</a>''')
HTML('''<link href='http://fonts.googleapis.com/css?family=Roboto|Open+Sans' rel='stylesheet' type='text/css'>
<link rel="stylesheet" type="text/css" href="./theme/custom.css">
<script>
$(function(){
code_toggle()
})
</script>
''')