Welcome to the Material Science Machine Learning Jupyter Notebooks!

Here you will get an introduction to Material Science and Machine Learning using Python.

These examples will most likely only work with Python 3.6+.

While machine learning offers many algorithms and techniques, the workflow is almost always the same.

  1. gather the data (web scraping, experiments, simulations)
  2. explore the data
    • get a feel for what data you have
    • are there any interesting features to explore?
  3. sanitize the data
    • how do you handle missing data?
    • how do you handle categorical data? (for example 'metal', 'non-metal', 'spacegroup')
  4. apply machine learning algorithms
    • often the easiest part
  5. validate predictions
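The steps above can be sketched end-to-end with pandas and scikit-learn. This is a minimal illustration, not a real study: the dataset, column names, and the choice of logistic regression are all made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# 1. gather: a tiny, made-up materials dataset
df = pd.DataFrame({
    "band_gap": [0.0, 1.1, np.nan, 3.2, 0.0, 2.4],          # has missing data
    "category": ["metal", "non-metal", "non-metal",
                 "non-metal", "metal", "non-metal"],          # categorical data
    "is_conductor": [1, 0, 0, 0, 1, 0],
})

# 2. explore: get a feel for what data you have
print(df.describe())

# 3. sanitize: impute missing numbers, one-hot encode categories
prep = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), ["band_gap"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["category"]),
])

# 4. apply an algorithm (often the easiest part)
model = Pipeline([("prep", prep), ("clf", LogisticRegression())])

# 5. validate predictions on held-out data
X_train, X_test, y_train, y_test = train_test_split(
    df[["band_gap", "category"]], df["is_conductor"],
    test_size=0.33, stratify=df["is_conductor"], random_state=0)
model.fit(X_train, y_train)
acc = model.score(X_test, y_test)
print("test accuracy:", acc)
```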

How is Machine Learning different from Statistics?

While this is a huge generalization:

  • statisticians: care about understanding how the data is generated, and about understanding the model and its parameters
  • machine learning: mostly cares about the ability to predict

I feel that scientists fall mostly in the statisticians' camp.

Occam's razor: one should select the simplest model that describes the data.

As an example https://www.youtube.com/watch?v=1A1yaWS8gSg

A sophisticated model that predicts planet positions with circles can be replaced by a far simpler one that uses ellipses. This simplification comes from our understanding of the physics.

Which Machine Learning Algorithms should I use?

There are hundreds of algorithms to choose from. Always start with the simplest, so that you can judge how much better more complex models perform.

General Fields of Machine Learning. You will notice that some algorithms appear in multiple areas.


Classification: SVM, nearest neighbors, random forests, gradient boosting, neural networks.
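A minimal classification sketch with one of the simplest algorithms above, nearest neighbors. The two-cluster dataset here is synthetic, generated just for the example.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# two well-separated synthetic point clouds -> an easy classification task
X, y = make_blobs(n_samples=200, centers=[[-3, 0], [3, 0]],
                  cluster_std=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# predict the class of a point from its 5 nearest training points
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
acc = clf.score(X_test, y_test)
print("test accuracy:", acc)
```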

Great starting example dataset



Regression: SVR, ridge regression, Lasso, Bayesian methods, neural networks.

linear regression
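The simplest regression model, ordinary linear regression, fits a slope and intercept to noisy data. The data here is synthetic, generated from a known line so we can check the recovered parameters.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# noisy samples of the line y = 2x + 1
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2 * x.ravel() + 1 + rng.normal(scale=0.5, size=50)

model = LinearRegression().fit(x, y)
slope, intercept = model.coef_[0], model.intercept_
print("slope:", slope, "intercept:", intercept)  # close to 2 and 1
```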

I would like to highlight how awesome Bayesian methods are. pymc3 is the Python package to use. If you can create a model that describes your data, you can use Bayesian methods. If not, Gaussian processes are amazing (they are "parameter-free" fitting methods).

Gaussian process: notice how you get the variance of your prediction along with the prediction itself.

Gaussian Process
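A minimal Gaussian process sketch with scikit-learn (the training function sin(x) and the choice of an RBF kernel are just for illustration). The key point from above: `return_std=True` gives you the uncertainty of every prediction for free, small near the training points and large far away from them.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# a handful of noise-free observations of sin(x)
X_train = np.array([[1.0], [3.0], [5.0], [6.0], [8.0]])
y_train = np.sin(X_train).ravel()

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0)).fit(X_train, y_train)

# predict at the training points and at a point far from all of them
mean_train, std_train = gp.predict(X_train, return_std=True)
mean_far, std_far = gp.predict(np.array([[10.0]]), return_std=True)
print("std at training points:", std_train)   # essentially zero
print("std far from the data:", std_far)      # much larger
```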

Bayesian methods predicting the effect of regulation on coal miner deaths.

coal miner deaths

I am not very knowledgeable on neural networks, but pytorch is the most user-friendly way to get started.

Play with neural networks in your browser to get a feel for them. link
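To keep this notebook's dependencies small, here is a neural network sketch using scikit-learn's `MLPClassifier` rather than pytorch; the "two moons" dataset is a standard synthetic example that a linear model cannot separate but a small network can.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# two interleaved half-circles: not linearly separable
X, y = make_moons(n_samples=400, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# a small fully-connected network with two hidden layers of 16 units
net = MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=2000, random_state=0)
net.fit(X_train, y_train)
acc = net.score(X_test, y_test)
print("test accuracy:", acc)
```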


Clustering: k-Means, spectral clustering, mean-shift.

effect of cluster size etc
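A minimal k-Means sketch: unlike the methods above, clustering gets no labels at all and has to discover the groups on its own. The three-blob dataset here is synthetic, for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# three synthetic point clouds; the labels are never shown to the algorithm
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
sizes = [int((km.labels_ == k).sum()) for k in range(3)]
print("cluster sizes:", sizes)  # every point is assigned to one of 3 clusters
```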

Bias Variance Tradeoff

  • bias error: error of the model on the training data (too high → underfitting)
  • variance error: how much the model changes when trained on a different set of training data (too high → overfitting)
  • irreducible error: error that cannot be reduced regardless of algorithm (sometimes noise)

bias variance
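The tradeoff can be demonstrated by fitting polynomials of increasing degree to noisy data (the sine target and the degrees 1, 4, and 25 are arbitrary choices for the sketch): a straight line underfits (high bias), a very high degree memorizes the training noise (high variance), and a moderate degree sits in between.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# noisy samples of one period of a sine wave
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(scale=0.2, size=40)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

train_errs, test_errs = [], []
for degree in (1, 4, 25):
    feats = PolynomialFeatures(degree)
    model = LinearRegression().fit(feats.fit_transform(x_train), y_train)
    train_errs.append(mean_squared_error(y_train, model.predict(feats.transform(x_train))))
    test_errs.append(mean_squared_error(y_test, model.predict(feats.transform(x_test))))
    print(f"degree {degree:2d}: train MSE {train_errs[-1]:.3f}, test MSE {test_errs[-1]:.3f}")
```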

How do we tell where we are on the bias variance curve?

Cross validation: split your data into a training set and a test set. Use the training set to fit your model. Use the test set to evaluate the performance of your model.

Often you split your data 90% training, 10% testing.

sklearn provides many methods for automating this.
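For example, a single 90/10 split with `train_test_split`, and then 5-fold cross-validation with `cross_val_score`, which repeats the split automatically (the dataset and the logistic regression model are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=200, random_state=0)

# a single 90% training / 10% testing split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)
score = LogisticRegression(max_iter=1000).fit(X_train, y_train).score(X_test, y_test)
print("hold-out accuracy:", score)

# or let sklearn repeat the split for you: 5-fold cross-validation
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("cross-validation accuracies:", scores)
```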

Additional Resources

  • great introduction
  • Kaggle competitions that teach you how to use machine learning (best way to learn is to apply)
  • fast.ai, the place to learn about neural networks
  • Coursera, edX, Udacity: too many to name