Lecture 0: Introduction to the Course Logistics



Welcome to the Cornell Data Science Training program!

Through these lectures and exercises, we aim to teach you basic data science concepts and how to apply them using a data science language. The topics we will cover include, but are not limited to:

  • Tools for Data Science
  • Data Cleaning and Manipulation
  • Data Visualization
  • Supervised Learning Algorithms
  • Unsupervised Learning
  • Meta-Learning

We'll start with the basics of the language and work our way up to applications of more advanced concepts. By the end of the course, you will have the foundation and basic skills to contribute to any subteam on Cornell Data Science, and to your career in this rapidly growing field.


Recap: Why Data Science? And why Python?

Before we begin to explore NumPy and all of its applications, let's recall why data science is so important today and why data scientists choose Python. Data science can be thought of as the basis for empirical research, since data is used to inform our hypotheses and provide observations. In many cases, this data is used by businesses or by scientists to inform their understanding of a phenomenon. We use a combination of exploratory data analysis and modeling to draw conclusions from large troves of data. Data science allows us to:

  • analyze the past, and
  • predict the future through large scale data processing.

Our more recent ability to collect data in real time from many places, including websites, smartphones, and environmental sensors, makes data science incredibly relevant in almost every industry, scientific discipline, and engineering endeavor today.

So why is Python a good programming language for data science? Well, it is:

  • an easy-to-learn and readable language
  • an open language with a vibrant community

Thanks to the efforts of this community, Python offers an ever-growing set of data management, analytical processing, and visualization libraries, like NumPy, pandas, and scikit-learn! Such libraries make Python applicable to every aspect of data science. Lastly, but very importantly, Jupyter Notebooks make Python-based analysis more reproducible and repeatable. They also provide built-in training and communication support to help with team collaboration.


Getting Started with Jupyter Notebooks

The features of Jupyter Notebooks have led to their rapidly gaining broad acceptance within the data science community. Here are some of the key ones:

  • Documented Data Science
    • Allows us to document the data science process by combining notes, code, and graphics
    • Allows others to read the notebooks and understand the motives behind each step and why decisions were made (good collaboration)

  • Reproducible science
    • Allows you to show others exactly how you conducted the research
    • Allows for replication and inspection of others' methods

  • Presentation of Results
    • Makes it easy to share notebooks with colleagues and to present results

  • Support for Julia, Python, and R
    • Three of the most popular programming languages for conducting data science

So how do we set up the Jupyter Notebook on our computer? Here are the instructions you need:


Instructions to Install Jupyter Notebook


If you are an experienced Python user, you can simply go to the Command Prompt (Windows) or Terminal (Mac) and use:

pip3 install jupyter

where pip3 is pip, the Python package installer, for Python 3.

However, if you are new to Python, we highly recommend downloading Anaconda.

Anaconda is a Python distribution that provides everything you need for Python data science, including the Python language itself, various libraries, and a package manager. If you don't really understand what we are talking about here, no worries! The main point is that if you are new to Python, downloading the Anaconda distribution will make your life easier.

Check the Anaconda setup instructions and install Jupyter Notebook. If you have any questions, please come to office hours and get help from the staff. Make sure that it runs on your computer before the project starts!

Once you finish setting it up on your computer, type the following command in your Terminal or Command Prompt:

jupyter notebook

and the notebook will launch in your computer's default web browser. Check this by downloading this note to your computer and opening it with Jupyter Notebook!
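If the jupyter notebook command does not work, it can help to first check what your Python installation can actually see. The following is only a minimal sketch, not an official diagnostic: the function name check_jupyter_setup is our own (hypothetical), and it assumes that the jupyter command should be on your PATH and that the notebook package should be importable from the Python interpreter you run it with. It uses only the Python standard library.

```python
import importlib.util
import shutil
import sys


def check_jupyter_setup():
    """Report which pieces needed for `jupyter notebook` are visible.

    Hypothetical helper for illustration; it only looks things up and
    does not install or launch anything.
    """
    return {
        # True if this interpreter is Python 3 (what pip3 and Anaconda target).
        "python3": sys.version_info.major >= 3,
        # True if a `jupyter` executable can be found on the PATH.
        "jupyter_command": shutil.which("jupyter") is not None,
        # True if the `notebook` package is importable by this interpreter.
        "notebook_package": importlib.util.find_spec("notebook") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_jupyter_setup().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

Save this as a .py file and run it with the same Python you installed Jupyter into. If an item is reported as MISSING, that is a good thing to mention when you ask the staff for help.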