#!/usr/bin/env python
# coding: utf-8

# # [NTDS'19] tutorial 1: introduction
# [ntds'19]: https://github.com/mdeff/ntds_2019
#
# [Michaël Defferrard](https://deff.ch), [EPFL LTS2](https://lts2.epfl.ch)

# ## Content
#
# 1. [Conda and Anaconda](#conda)
# 1. [Python](#python)
# 1. [Jupyter notebooks](#jupyter)
# 1. [Version control with git](#git)
# 1. [Scientific Python](#scipy)
# 1. [Resources to improve your Python skills](#improve)

# ## 1 Conda and Anaconda
#
# ![conda](figures/conda.jpg)
#
# [Conda](https://conda.io) is a package and environment manager. It allows you to create environments, ideally one per project, and install packages into them. It is available for Windows, macOS and Linux.
#
# [Anaconda](https://anaconda.org/download) is a commercial distribution that comes with many of the packages used by data scientists. [Miniconda](https://conda.io/miniconda.html) is a lighter open distribution. Both install `conda`, with which you'll be able to install many packages.
#
# [conda-forge](https://conda-forge.org) is a community-driven collection of recipes to build conda packages. It contains many more packages than the official defaults channel.

# Get basic information from your conda installation:

# In[1]:

get_ipython().system('conda info')

# List your environments:

# In[2]:

get_ipython().system('conda env list')

# List the packages in an environment:

# In[3]:

get_ipython().system('conda list -n ntds_2019')

# Install packages in an environment. If no environment name is given, the package is installed in the currently activated environment.

# In[4]:

get_ipython().system('conda install -n ntds_2019 git')

# **Want to know more?** Look at the [conda user guide](https://conda.io/docs/user-guide/overview.html).

# ## 2 Python
#
# [Python](https://python.org) is one of the main programming languages used by data scientists, along with [R](https://www.r-project.org) and [Julia](https://julialang.org).
# As an open and general-purpose language, it is replacing [MATLAB](https://mathworks.com/products/matlab.html) in many scientific and engineering fields. Python is also among the most popular languages for machine learning.
#
# Below are very basic examples of Python code. **Want to learn more?** Look at the [Python Tutorial](https://docs.python.org/3/tutorial/index.html).

# ### Control flow

# In[5]:

if 1 == 1:
    print('hello')

# In[6]:

for i in range(5):
    print(i)

# In[7]:

a = 4
while a > 2:
    print(a)
    a -= 1

# ### Data structures

# Lists are mutable, i.e., we can change the objects they store.

# In[8]:

a = [1, 2, 'hello', 3.2]
print(a)
a[2] = 'world'
print(a)

# Tuples are not mutable.

# In[9]:

(1, 2, 'hello')

# Sets contain unique values.

# In[10]:

a = {1, 2, 3, 3, 4}
print(a)
print(a.intersection({2, 4, 6}))

# Dictionaries map keys to values.

# In[11]:

a = {'one': 1, 'two': 2, 'three': 3}
a['two']

# ### Functions

# In[12]:

def add(a, b):
    return a + b

add(1, 4)

# ### Classes

# In[13]:

class A:
    d = 10

    def add(self, c):
        return self.d + c

a = A()
a.add(20)

# In[14]:

class B(A):
    def sub(self, c):
        return self.d - c

b = B()
print(b.add(20))
print(b.sub(20))

# ### Dynamic typing

# In[15]:

x = 1
x = 'abc'
x

# In[16]:

add('hel', 'lo')

# In[17]:

add([1, 2], [3, 4, 5])

# In[18]:

print(int('120') + 10)
print(str(120) + ' items')

# ## 3 Jupyter notebooks
#
# [Jupyter](https://jupyter.org) notebooks allow you to mix text, math, code, and results (numerical or figures) in a **single document**. They are intended for interactive computing and are very useful to explore data, teach concepts, and create reports. Code can be written in many programming languages, including Python, Julia, R, MATLAB, C++.

# ### Markdown text (and LaTeX math)
#
# A list:
#
# * item
# * item
#
# Text in a paragraph. Text can be *italic*, **bold**, `verbatim`. We can define [hyperlinks](https://github.com/mdeff/ntds_2019).
#
# A numbered list:
#
# 1. item
# 1. item
#
# Some inline math: $x = \frac12$
#
# Some display math:
# $$ f(x) = \frac{e^{-x}}{4} $$

# ### Code and results

# In[19]:

20 / 100 * 30

# ### Inline figures

# In[20]:

get_ipython().run_line_magic('matplotlib', 'inline')
import numpy as np
import matplotlib.pyplot as plt
y = np.random.uniform(size=100)
plt.plot(y);

# **Want to learn more?** Look at the [documentation](https://jupyter.org/documentation).

# ## 4 Version control with git
#
# ![git](figures/git.jpg)
#
# [git](https://git-scm.com) is an open-source distributed version control system. It allows users to collaborate on projects (not only software!), and to synchronize and track their changes. It is most often used with a hosting service such as [GitHub](https://github.com) or [GitLab](https://about.gitlab.com). Those services add many tools to facilitate issue tracking, code review, continuous integration, etc.
#
# * Decentralized: draw on the blackboard. Make it clear that all repos are the same.
# * Commits are local. We push / pull to sync with other repos.
# * Git is often used in a centralized fashion, with GitHub / GitLab being the syncing point for everybody. It does not have to be, but GitHub is easier to access than my laptop.
# * **Want to learn more?** Try this [interactive guide](https://try.github.io) or look at the more involved [user manual](https://git-scm.com/docs).
#
# ### Basic usage
#
# 1. Install with `conda install git`.
# 1. Everybody makes a clean clone (to be erased afterwards). Use HTTPS if not logged in to GitHub.
# 1. I add a fake file.
# 1. I commit. It is not on GitHub.
# 1. I push. It is on GitHub.
# 1. They pull. They see it on their machines.
#
# Two kinds of users:
# * Those who don't want to use git: just run `git pull` before every lab. **Do not modify the content of the folder.** That folder is like your inbox: you only copy files from there and modify them outside.
# * The power users make a branch for each of their solutions!
#
# ### Power users
#
# * Make a branch: `git branch assignment1_solution`
# * Work on that branch: `git checkout assignment1_solution`
# * Make and commit your modifications. You get a history of your changes!
# * Come back to master with `git checkout master` and get new material from the TAs with `git pull`. Again, you should never modify master (you could do it locally, but only the TAs have write access to the GitHub repo).
#
# ### Super-power users
#
# Those who want to back up or share their work on GitHub.
#
# 1. Create a GitHub account.
# 1. Create a repository (alternatively, fork mdeff/ntds_2019).
# 1. Add a remote repo: `git remote add my_repo git@github.com:username/ntds_2019.git`
# 1. Push your own branches to your repo: `git push -u my_repo assignment1_solution`.
# 1. Go to your GitHub page and see your changes.
#
# ### Contributors
#
# Same as before, except that you can now make a pull request for your changes to be integrated into master and be available to all of us.
#
# ### Collaborate with git and GitHub
#
# All the code for your projects will have to be handled as a repository on GitHub.
# While you don't have to collaborate with git (i.e., you can create a single commit at the end with all of your code), we highly recommend that you use it.
# It is a very good way to manage your project, as it allows you to come back to previous states, synchronize your changes without getting lost among versions, track who did what, discuss issues and code, etc.
# As such, we recommend that you use git from the start to learn the basics. Once you feel ready, create a repository for your project and start working on an assignment there.

# ## 5 Scientific Python
#
# Below are the basic packages used for scientific computing and data science.
# * [NumPy](https://www.numpy.org): N-dimensional arrays
# * [SciPy](https://www.scipy.org/scipylib/index.html): scientific computing
# * [matplotlib](https://matplotlib.org): powerful visualization
# * [pandas](https://pandas.pydata.org): data analysis
#
# **Want to learn more?** Look at the [Scipy Lecture Notes](https://www.scipy-lectures.org/).
#
# Finally, the packages below are useful for working with networks and graphs.
# * [NetworkX](https://networkx.github.io): network science
# * [graph-tool](https://graph-tool.skewed.de): network science
# * [scikit-learn](https://scikit-learn.org): graph embedding (dimensionality reduction)
# * [PyGSP](https://github.com/epfl-lts2/pygsp): graph signal processing
# * [Deep Graph Library](https://www.dgl.ai): deep learning on graphs with [PyTorch](https://pytorch.org)

# ## 6 Resources to improve your Python skills (for experienced Python users)
#
# We provide a non-exhaustive list of tools and concepts that can help you improve your Python coding skills.
# They are by no means things that you need to master immediately.
#
# * NumPy and PyTorch indexing and broadcasting rules.
#   They take some time to understand, but are really essential.
#   They will help you avoid writing loops, which are considerably slower and sometimes memory-inefficient.
#
# * Some common Python built-in and standard-library functions.
#     * [`enumerate`](https://docs.python.org/3/library/functions.html#enumerate)
#     * [`zip`](https://docs.python.org/3/library/functions.html#zip)
#     * [`itertools.product`](https://docs.python.org/3/library/itertools.html#itertools.product)
#
# * SciPy functions.
#     * `pdist` and `cdist` are considerably faster than loops to compute pairwise distances between objects (e.g., to build a nearest neighbors graph).
#
# * Object-oriented programming.
#     * Classes.
#     * Read the source code of libraries you commonly use, and try to understand how they organize it.
#       In particular, we advise you to write your methods following the conventions of the scikit-learn API.
#     * Inheritance.
#       When you implement different models that have the same role (such as different machine learning classifiers), methods defined in a base class allow you to avoid writing the same code several times.
#     * Abstract methods.
#       They let you declare which methods the subclasses (of the base class) must implement.
#
# * Google Python style guide.
#   Not essential, but following these rules makes things easier when you work in a group.
#
# * Unit tests.
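
# As an illustration of the indexing and broadcasting item above, here is a minimal sketch that computes pairwise Euclidean distances once with explicit loops and once with NumPy broadcasting. The array `points` and its values are hypothetical toy data, not taken from the course material.

```python
import numpy as np

# Hypothetical toy data: 4 points in 2 dimensions.
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0], [0.0, 1.0]])
n = len(points)

# Loop version: one pair of points at a time.
dist_loop = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        dist_loop[i, j] = np.sqrt(np.sum((points[i] - points[j]) ** 2))

# Broadcasting version: shapes (n, 1, 2) and (1, n, 2) broadcast to (n, n, 2).
diff = points[:, np.newaxis, :] - points[np.newaxis, :, :]
dist_broadcast = np.sqrt(np.sum(diff ** 2, axis=-1))

# Both approaches should give the same matrix.
print(np.allclose(dist_loop, dist_broadcast))
```

# Note that the broadcast version builds an intermediate array of shape (n, n, 2), so it trades memory for speed.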
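
# The helpers `enumerate`, `zip`, and `itertools.product` listed above can be illustrated with a short sketch. The names `nodes` and `degrees` and their values are hypothetical and only serve the illustration.

```python
from itertools import product

# Hypothetical toy data: three node labels and their degrees.
nodes = ['a', 'b', 'c']
degrees = [2, 1, 3]

# enumerate yields (index, element) pairs, avoiding manual counters.
indexed = list(enumerate(nodes))

# zip iterates over several sequences in lockstep.
degree_of = dict(zip(nodes, degrees))

# itertools.product yields the Cartesian product, replacing nested loops.
# Here: all ordered pairs of distinct nodes.
pairs = [(u, v) for u, v in product(nodes, repeat=2) if u != v]

print(indexed)
print(degree_of)
print(len(pairs))
```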
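
# The items on inheritance, abstract methods, and unit tests can be sketched together. This is only one possible layout: the classes `BaseClassifier` and `MajorityClassifier` are hypothetical, and the `fit` / `predict` / `score` names merely imitate the naming convention of the scikit-learn API.

```python
import unittest
from abc import ABC, abstractmethod
from collections import Counter


class BaseClassifier(ABC):
    """Hypothetical base class: code shared by all models lives here."""

    @abstractmethod
    def fit(self, X, y):
        """Subclasses must implement training."""

    @abstractmethod
    def predict(self, X):
        """Subclasses must implement prediction."""

    def score(self, X, y):
        # Inherited by every subclass: fraction of correct predictions.
        predictions = self.predict(X)
        correct = sum(p == t for p, t in zip(predictions, y))
        return correct / len(y)


class MajorityClassifier(BaseClassifier):
    """Hypothetical model that always predicts the most frequent label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self  # returning self allows chaining calls

    def predict(self, X):
        return [self.label_ for _ in X]


class TestMajorityClassifier(unittest.TestCase):
    def test_base_class_is_abstract(self):
        # A class with unimplemented abstract methods cannot be instantiated.
        with self.assertRaises(TypeError):
            BaseClassifier()

    def test_score(self):
        X = [[0], [1], [2], [3]]
        y = [1, 1, 1, 0]
        model = MajorityClassifier().fit(X, y)
        self.assertEqual(model.predict([[5]]), [1])
        self.assertEqual(model.score(X, y), 0.75)


# Run the tests without unittest.main(), which would try to exit the process
# (inconvenient inside a notebook).
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestMajorityClassifier)
result = unittest.TextTestRunner().run(suite)
```

# The `score` method is written once in the base class and reused by every subclass, while the abstract methods document what each new model has to provide.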