Logistics

  • Python
  • IPython
  • AWS, Github and Hackpad (for final project)

Python Installation and Setup

Install Enthought Canopy https://www.enthought.com/products/canopy/academic/

  • Easy package install with Package Manager

  • Ideal environment: IPython (an enhanced Python shell for interactive and exploratory computing) with Editor.

  • If you don't already have it, go to the website and download Enthought Canopy for Academic Use with Python 2.7.4. (Academic Licence)

Get acquainted with the command line (a bit)

  • If you don't already know what your command line is, it's time to get started. If you are on a Mac, you already have a Unix command line (the Terminal application). Windows users don't, but you can go to "Start->Run" and type "cmd" to get a Windows command prompt.

  • In both operating systems, the most important command is "cd" to change directories until you get to the one your program is in. If you can figure out "cd to my program directory, then python [file]" you can do enough to get through the first assignment. In all operating systems that I know of, the following all work:

cd [full directory path] # go to full directory path
cd / # go to top level directory
cd directory # go to directory contained in the current directory
  • If you want to see what's in your current directory, you can use "ls" in Unix and "dir" in Windows.
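
The same navigation can also be done from inside Python, which is handy once you are in an IPython session. The sketch below uses only the standard os and tempfile modules; the helper name list_after_cd and the scratch directory are made up for illustration.

```python
import os
import tempfile


def list_after_cd(directory):
    """Change into `directory`, then return (current path, sorted entries)."""
    os.chdir(directory)                      # like: cd directory
    here = os.getcwd()                       # like: pwd on Unix
    return here, sorted(os.listdir(here))    # like: ls on Unix, dir on Windows


start = os.getcwd()                          # remember where we started
scratch = tempfile.mkdtemp()                 # a throwaway directory to visit
open(os.path.join(scratch, "test.py"), "w").close()

here, entries = list_after_cd(scratch)
print(here)
print(entries)
os.chdir(start)                              # go back, like: cd [full directory path]
```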

Editing and Running Code

  • Python comes with a programming environment called IDLE. It hasn't been updated, is slow, and has gross incompatibilities, even with some code that uses only the Python standard library.

  • If you have a favorite plaintext editor (e.g., Sublime Text), use that. If you don't have a favorite or don't know what that is, go download Canopy. It provides an easy-to-use Package Manager, IPython, and an Editor.

  • To run a Python program, save it in a text file (with the extension .py). Then run it by clicking the green arrow button or by typing %run test.py in IPython.
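
As a rough illustration, a file named test.py could look like the sketch below. The function name greet is made up for this example; the guard at the bottom is a common Python convention, not something Canopy requires.

```python
# test.py: a minimal script to try out with `python test.py` or `%run test.py`


def greet(place):
    """Return a greeting for `place`."""
    return "Hi, welcome to " + place


if __name__ == "__main__":
    # This part runs when the file is executed as a script,
    # but not when the file is imported as a module.
    print(greet("NTU"))
```

Keeping the work inside a function means the same file can later be imported from another script without printing anything.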

Alternatively, you can use IPython Notebook for more advanced work. Once ipython and ipython-notebook are installed, run the command ipython notebook --pylab=inline in the directory of interest to start a web server for working with IPython Notebooks. (NB: possibly only Firefox and Chrome are supported.)

Installing Packages

  • Installing packages with Canopy is rather easy, but oftentimes you'll need to install packages from source.

  • The first thing I would recommend installing is pip, because pip makes it a lot easier to install other Python things. (pip is a huge upgrade over easy_install and makes sure you have all of the dependencies before it tries to install a package.)

    • On a Mac/Unix:

      • (sudo) easy_install pip
      • (sudo) pip install ipython
    • So when installing a module later, try these steps:

      • Look for a binary installer
      • $ pip install [module name]
      • If it doesn't work, then download the archive, extract it wherever you want it installed, then $ python setup.py install in whatever directory has setup.py in it.
    • For any remotely popular library, one of those things should work.
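
After any of these steps, one quick way to confirm that a module really is available is to try importing it. The sketch below is only an illustration; the helper name is_installed is hypothetical, and the module names in the list are examples.

```python
import importlib


def is_installed(module_name):
    """Return True if `module_name` can be imported by this interpreter."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


for name in ["os", "nltk"]:
    if is_installed(name):
        print("%s: installed" % name)
    else:
        print("%s: missing, try: pip install %s" % (name, name))
```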

Natural Language Toolkit

  • The Python world is inhabited by many packages or libraries that provide useful things like array operations, plotting functions, and much more. We can import libraries of functions to expand the capabilities of Python in our programs. We'll start by importing nltk, pandas and other related packages to help us out.
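
To illustrate what importing looks like, here is a small sketch that uses two standard-library modules as stand-ins, so it runs even before nltk or pandas are installed. The sample sentence is made up.

```python
import math                       # import a whole module, then use math.sqrt
from collections import Counter   # import a single name from a module

# Count how often each word occurs in a made-up sentence.
words = "the cat saw the other cat".split()
counts = Counter(words)

print(counts["the"])     # number of times "the" occurs
print(math.sqrt(16))     # a function provided by the math module
```

Third-party packages are imported the same way once they are installed, often under a short alias such as import numpy as np.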

Installing NLTK

(Easiest) Use Canopy Package Manager

In [3]:
# Test if it works by typing
# import nltk

# and install all the corpora required
# nltk.download()
    

IPython Tutorial

An Interactive Computing and Development Environment

Install IPython (for those who do not want to install Canopy)

Reference

Bad news: difficult to install

  • Officially, IPython requires Python 2.6, 2.7, 3.1, or 3.2.
  • (OS X or Linux) easy_install ipython[zmq,qtconsole,notebook,test]
  • (Windows) IPython requires distribute, and it also requires the PyReadline library to properly support coloring and keyboard management (features that the default Windows console doesn’t have). So on Windows, the installation procedure is:

    • Install distribute. https://pypi.python.org/pypi/distribute
    • Install pyreadline. You can use the command easy_install pyreadline from a terminal, or the binary installer appropriate for your platform from the PyPI page.
    • Install IPython itself, which you can download from PyPI or from the IPython site. Note that on Windows 7, you must right-click and ‘Run as administrator’ for the Start menu shortcuts to be created.

IPython by default runs in a terminal window, but the normal terminal application supplied by Microsoft Windows is very primitive. You may want to download the excellent and free Console application (http://sourceforge.net/projects/console) instead, which is a far superior tool. You can even configure Console to give you an IPython tab by default, which makes it very convenient to create new IPython sessions directly from the working terminal.

Given a properly built Python, the basic interactive IPython shell will work with no external dependencies. However, some Python distributions (particularly on Windows and OS X) don’t come with a working readline module. The IPython shell will work without readline, but will lack many features that users depend on, such as tab completion and command line editing. If you install IPython with distribute (e.g. with easy_install), then the appropriate readline for your platform will be installed.

IPython Notebook: (Powerful, but not for beginners)

A web-based interactive computing environment for Python, R, shell scripts and other languages.

Reference

ipython notebook --pylab=inline

  • Introduction; demo

  • The IPython Notebook (or 'ipynb' for short) is one of the most exciting technologies for teaching and research in recent years. It is a completely open source, well architected, and fairly stable system for scientific computing and data exploration.

  • IPython Notebook addresses challenges raised by the emerging field of data science by providing a foundation for work that is interactive, repeatable, documented and sharable.

Wonderful tool for scientific modeling and presentation!

In [2]:
img = plt.imread('austro.png')  # read the image file into an array
plt.imshow(img)                 # display the array as an image
Out[2]:
<matplotlib.image.AxesImage at 0x10c580c50>

Getting started with IPython

IPython essentials (Rossant, 2013)

[Running the IPython console]

If Canopy has been installed correctly, you should be able to see the IPython console (click View-python). You can use the shell prompt like a regular Python interpreter.

In [3]:
print ("Hi, welcome to NTU")
Hi, welcome to NTU

[Using IPython as a system shell]

You can use the IPython command-line interface as an extended system shell (as in Unix-like systems).

In [4]:
pwd
Out[4]:
u'/Users/shukai 1/python_lab'

[Use the up and down arrow keys to browse the history]

[Use Tab key to complete your typing]

[Executing a script with the %run command]

Or, save the script as .py, and click the Run button in Canopy.

[Quick benchmarking]

You can use the %timeit magic command to do quick benchmarks in an interactive session.

In [5]:
%timeit [x*x for x in range(100000)]
10 loops, best of 3: 16.1 ms per loop
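
Outside IPython, a similar measurement can be made with the standard timeit module. The sketch below roughly mirrors the "10 loops, best of 3" scheme shown above; the function name squares is made up, and the timings will differ from machine to machine.

```python
import timeit


def squares(n=100000):
    """Build the same list of squares that the %timeit example builds."""
    return [x * x for x in range(n)]


# Three trials of 10 calls each; keep the fastest trial.
trials = timeit.repeat(squares, number=10, repeat=3)
best_ms_per_loop = min(trials) / 10 * 1000.0
print("10 loops, best of 3: %.1f ms per loop" % best_ms_per_loop)
```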

[Interactive computing with Pylab]

Use the %pylab magic command to enable scientific computing and exploratory data analysis. It loads numpy and matplotlib into the interactive namespace.

In [6]:
%pylab
Using matplotlib backend: MacOSX
Populating the interactive namespace from numpy and matplotlib
In [1]:
x = linspace(-10., 10., 1000)
plot(x, sin(x))
Out[1]:
[<matplotlib.lines.Line2D at 0x10c550610>]

[Reproducible research with IPython Notebook]

You can call the ipython notebook command in a shell, which will launch a local server on the default port 8888. Check the browser and create the Notebook there.

It is also easy to plot graphs.

In [23]:
y = randn(600)
plot(y)
Out[23]:
[<matplotlib.lines.Line2D at 0x1106bd290>]

A more advanced example uses the networkx network package (install it with sudo pip install networkx).

In [22]:
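import matplotlib.pyplot as plt
import networkx as nx

G = nx.star_graph(20)        # one hub connected to 20 outer nodes, so 20 edges
pos = nx.spring_layout(G)    # compute node positions
colors = list(range(20))     # one value per edge, mapped through edge_cmap
nx.draw(G, pos, node_color='#A0CBE2', edge_color=colors, width=4,
        edge_cmap=plt.cm.Blues, with_labels=False)
plt.savefig("edge_colormap.png")  # save as png
plt.show()  # display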

[Working with R]

In [9]:
%matplotlib inline
%load_ext rmagic
In [13]:
import numpy as np
import matplotlib.pyplot as plt
X = np.array([0,1,2,3,4])
Y = np.array([3,5,4,6,7])
plt.scatter(X, Y)
Out[13]:
<matplotlib.collections.PathCollection at 0x1100c52d0>
In [15]:
%Rpush X Y
%R lm(Y~X)$coef
Out[15]:
array([ 3.2,  0.9])
In [16]:
%R resid(lm(Y~X)); coef(lm(X~Y))
Out[16]:
array([-2.5,  0.9])
In [17]:
%%R -i X,Y -o XYcoef
XYlm = lm(Y~X)
XYcoef = coef(XYlm)
print(summary(XYlm))
par(mfrow=c(2,2))
plot(XYlm)
Call:
lm(formula = Y ~ X)

Residuals:
   1    2    3    4    5 
-0.2  0.9 -1.0  0.1  0.2 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   3.2000     0.6164   5.191   0.0139 *
X             0.9000     0.2517   3.576   0.0374 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.7958 on 3 degrees of freedom
Multiple R-squared:  0.81,	Adjusted R-squared: 0.7467 
F-statistic: 12.79 on 1 and 3 DF,  p-value: 0.03739 
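
As a cross-check on the R output, the same line can be fitted in plain Python with the usual least-squares formulas (slope = Sxy / Sxx, intercept = mean(Y) - slope * mean(X)). This is only a sketch for the five points used above; the helper name fit_line is made up.

```python
X = [0, 1, 2, 3, 4]
Y = [3, 5, 4, 6, 7]


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line ys ~ xs."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(X, Y)
residuals = [y - (intercept + slope * x) for x, y in zip(X, Y)]
print("intercept=%.1f slope=%.1f" % (intercept, slope))
print([round(r, 1) for r in residuals])
```

The intercept and slope agree with the 3.2 and 0.9 that R reports for lm(Y~X).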

[Running C code from IPython]

Use the Cython package

In [25]:
%load_ext cythonmagic
In [28]:
%%cython
def square(x):
    return x*x
In [29]:
square(10)
Out[29]:
100

Learning on Your Own

There are a lot of resources online for learning more about using Python for data analysis and NLP, e.g., http://www.codecademy.com/, as we mentioned.

Here we use IPython's feature for embedding videos to point you to a short video on YouTube on using IPython.

In [2]:
from IPython.display import YouTubeVideo
# a short video about using IPython
YouTubeVideo('26wgEsg9Mcc')
Out[2]:

Github Instructions (for Term Project)

Git is the source control software we’ll be using in tandem with Github, which is an online service that provides free git repositories.

Cloning the repository:

Windows Users

If you don't have Git installed, installation instructions can be found here: http://git-scm.com/download/win

You can install a GUI on top of Git from here: http://windows.github.com/

Once the GUI is installed, you can navigate to the repository page, https://github.com/uwescience/datasci_course_materials, and click on the 'Clone in Windows' button.

Linux Users

If you don't have Git installed, installation instructions can be found here: http://git-scm.com/download/linux

If you have Git installed, you need to clone the repository. From a terminal, navigate to where you want the repository to be located and perform a clone.

git clone https://github.com/uwescience/datasci_course_materials.git

Mac Users

If you don't have Git installed, installation instructions can be found here: http://git-scm.com/download/mac

If you have Git installed, you need to clone the repository. From a terminal, navigate to where you want the repository to be located and perform a clone.

git clone https://github.com/uwescience/datasci_course_materials.git

Updating the repository:

Linux and Mac Users

Navigate to the repository and perform a git pull:

cd datasci_course_materials
git pull

Windows Users

Perform a pull on the datasci_course_materials repository through the GUI. If you set the repository to stay in sync, the GUI will give you notifications when there are updates to be pulled.


AWS Setup (for ambitious term project)

In [2]:
from IPython.display import YouTubeVideo
# a short video about using IPython
YouTubeVideo('JMedTCa5lec')
Out[2]:

Homework

In [ ]: