This notebook demonstrates some of the basic features of matminer.
This notebook was last updated 11/15/18 for version 0.4.5 of matminer.
Note that in order to get the in-line plotting to work, you might need to start Jupyter notebook with a higher data rate, e.g., jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10. We recommend you do this before starting.
In this notebook, we will: load and clean an example data set, add composition- and structure-based descriptors with matminer, fit and cross-validate a machine learning model, and plot the predictions.
Matminer comes pre-loaded with several example data sets you can use. Below, we'll load a data set of computed elastic properties of materials which is sourced from the paper: "Charting the complete elastic properties of inorganic crystalline compounds", M. de Jong et al., Sci. Data. 2 (2015) 150009.
from matminer.datasets.convenience_loaders import load_elastic_tensor
df = load_elastic_tensor() # loads dataset in a pandas DataFrame object
The data set comes as a pandas DataFrame, which is a kind of "spreadsheet" object in Python. DataFrames have several useful methods you can use to explore and clean the data, some of which we'll explore below.
df.head()
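Beyond head(), a few other plain pandas methods are handy for a first look at a new data set (nothing matminer-specific here):
print(df.shape)           # (number of rows, number of columns)
print(df.dtypes)          # data type of each column
print(df.isnull().sum())  # count of missing entries per column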
The data set above has many columns - we won't need all of this data for our modeling. We'll mainly be trying to predict K_VRH and G_VRH (the Voigt-Reuss-Hill averages of the bulk and shear modulus, respectively) and the elastic_anisotropy. We can drop most of the other output data. We also don't need various metadata such as the cif string of the structure, since the crystal structure is already embedded in the structure column.
df.columns
unwanted_columns = ["volume", "nsites", "compliance_tensor", "elastic_tensor",
"elastic_tensor_original", "K_Voigt", "G_Voigt", "K_Reuss", "G_Reuss"]
df = df.drop(unwanted_columns, axis=1)
df.head()
A pandas DataFrame includes a function called describe() that helps determine statistics for the various numerical / categorical columns in the data.
df.describe()
Sometimes, the describe() function will reveal outliers that indicate mistakes in the data, for example negative (and therefore unphysical) minimum bulk/shear moduli, or maximum bulk/shear moduli that are unrealistically high. In this case, the data looks OK at first glance: there are no clear problems with the ranges of the various properties, so we won't filter out any data.
Note that the describe() function only describes numerical columns by default.
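If describe() had turned up unphysical rows, a standard pandas boolean filter is all it would take to drop them. The line below is only a sketch of what that would look like; we don't actually need it here since the data is fine:
# sketch only: keep rows whose bulk and shear moduli are positive
df = df[(df["K_VRH"] > 0) & (df["G_VRH"] > 0)]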
We are seeking to find relationships between the inputs (composition and crystal structure of a material) and outputs (elastic properties such as K_VRH, G_VRH, and elastic_anisotropy). To find such relations, we need to "featurize" the input data, i.e., turn it into numbers that meaningfully represent the underlying physical quantities. For example, one "feature" or "descriptor" of the composition of a material such as Nb4CoSi would be the standard deviation of the Pauling electronegativity of the elements in the compound (weighted by stoichiometry). Compositions with high values of this quantity would be more ionic, and compositions with lower values would tend towards covalent or metallic bonding. A descriptor of the crystal structure might be the average coordination number of sites; higher coordination numbers indicate more bonds and therefore might indicate stiffer materials. With matminer, we can start generating hundreds of possible descriptors using the descriptor library that is available. Data mining techniques can then help narrow down which descriptors are the most relevant to the target problem, using the available output data as a guide.
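As a rough illustration of the first descriptor mentioned above (and not matminer's actual implementation), the stoichiometry-weighted standard deviation of Pauling electronegativity can be computed directly with pymatgen; depending on your pymatgen version, the import path may be pymatgen rather than pymatgen.core:
import numpy as np
from pymatgen.core import Composition, Element
comp = Composition("Nb4CoSi").fractional_composition
fracs = comp.get_el_amt_dict()                     # stoichiometric fraction of each element
ens = {el: Element(el).X for el in fracs}          # Pauling electronegativity of each element
mean_en = sum(fracs[el] * ens[el] for el in fracs)
std_en = np.sqrt(sum(fracs[el] * (ens[el] - mean_en) ** 2 for el in fracs))
print("weighted std. dev. of electronegativity:", std_en)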
A major class of featurizers available in matminer uses the chemical composition to featurize the input data. Let's add some composition based features to our DataFrame.
The first step is to have a column representing the chemical composition as a pymatgen Composition object. One way to do this is to use the conversions Featurizers in matminer to turn a string composition (our formula column from before) into a pymatgen Composition.
from matminer.featurizers.conversions import StrToComposition
df = StrToComposition().featurize_dataframe(df, "formula")
df.head()
As you can see, we now have a new Composition column above. The visualization in the DataFrame is not so clear, but the column embeds both the elements and the amounts of each element in the composition (not shown).
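If you want to see what the new column actually holds, each entry is a pymatgen Composition object, so the usual pymatgen methods apply (a quick check, nothing more):
comp = df["composition"].iloc[0]   # first entry of the new column
print(comp)                        # formula of the Composition object
print(comp.get_el_amt_dict())      # elements and their amounts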
Next, we'll use one of the featurizers in matminer to add a suite of descriptors to the DataFrame.
from matminer.featurizers.composition import ElementProperty
ep_feat = ElementProperty.from_preset(preset_name="magpie")
df = ep_feat.featurize_dataframe(df, col_id="composition") # input the "composition" column to the featurizer
df.head()
As you can see, we now have many more columns in the DataFrame (a total of 141 columns - not all are shown). So we added many features to our data!
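If scrolling through df.head() is unwieldy, you can also ask the featurizer itself which columns it added:
print(len(ep_feat.feature_labels()))  # number of features this featurizer generates
print(ep_feat.feature_labels()[:10])  # the first few of the new column names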
As an aside, note that each featurizer also has a citations() function that tells you where to find out more about the Featurizer.
ep_feat.citations()
There are many more composition-based featurizers apart from ElementProperty available in the matminer.featurizers.composition module. Let's try the OxidationStates featurizer, which requires knowing the oxidation states of the various elements in the Composition. This information is not present currently, but we can use the conversions package to guess oxidation states and then apply the OxidationStates featurizer to the resulting column.
from matminer.featurizers.conversions import CompositionToOxidComposition
from matminer.featurizers.composition import OxidationStates
df = CompositionToOxidComposition().featurize_dataframe(df, "composition")
os_feat = OxidationStates()
df = os_feat.featurize_dataframe(df, "composition_oxid")
df.head()
As you can see, the end of our data frame now has some oxidation state based features!
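One quick way to confirm this is to look at the last few column names (plain pandas again); the oxidation-state features appear at the end:
print(df.columns[-5:])  # the most recently added columns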
Not all featurizers operate on compositions. Matminer can also analyze crystal structures and featurize those as well. Let's start by adding some simple density features.
from matminer.featurizers.structure import DensityFeatures
df_feat = DensityFeatures()
df = df_feat.featurize_dataframe(df, "structure") # input the structure column to the featurizer
df.head()
df_feat.feature_labels()
Again, we see more features in the last few columns: density, vpa, and packing fraction.
There are many more structure-based featurizers that perform much more complex and detailed analyses of the crystal structure; most of those are outside the scope of this example, though one simple additional featurizer is sketched below. After that, let's move on to doing some machine learning predictions.
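For instance, one of the simpler ones is GlobalSymmetryFeatures, which adds space-group and crystal-system information. The lines below are a sketch only; we don't apply them here, so the DataFrame used for machine learning stays unchanged (note that crystal_system would be a non-numeric column you'd need to exclude from the model inputs):
from matminer.featurizers.structure import GlobalSymmetryFeatures
# sketch only -- not applied in this notebook
# sym_feat = GlobalSymmetryFeatures()
# df = sym_feat.featurize_dataframe(df, "structure")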
We now have enough data to do some machine learning! We'll need to first determine what columns we consider input (independent variables) and what column we consider output (dependent variable).
For now, we'll use K_VRH (bulk modulus) as the output. You could retry this example with G_VRH, elastic_anisotropy, or poisson_ratio as the target output.
For the inputs, we'll use all the features we generated. That is, everything except the output data and non-numerical columns like composition and structure.
y = df['K_VRH'].values
excluded = ["G_VRH", "K_VRH", "elastic_anisotropy", "formula", "material_id",
"poisson_ratio", "structure", "composition", "composition_oxid"]
X = df.drop(excluded, axis=1)
print("There are {} possible descriptors:\n\n{}".format(X.shape[1], X.columns.values))
The scikit-learn library makes it easy to fit and cross-validate different types of regression models. Let's start with one of the simplest - a linear regression.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
lr = LinearRegression()
lr.fit(X, y)
# get fit statistics
print('training R2 = ' + str(round(lr.score(X, y), 3)))
print('training RMSE = %.3f' % np.sqrt(mean_squared_error(y_true=y, y_pred=lr.predict(X))))
This looks reasonable since linear regression is a simple (high bias) model. But to really validate that we are not over-fitting, we need to check the cross-validation score rather than the fitting score.
from sklearn.model_selection import KFold, cross_val_score
# Use 10-fold cross validation (90% training, 10% test)
crossvalidation = KFold(n_splits=10, shuffle=True, random_state=1)  # shuffle=True so that random_state has an effect
scores = cross_val_score(lr, X, y, scoring='neg_mean_squared_error', cv=crossvalidation, n_jobs=1)
rmse_scores = [np.sqrt(abs(s)) for s in scores]
r2_scores = cross_val_score(lr, X, y, scoring='r2', cv=crossvalidation, n_jobs=1)
print('Cross-validation results:')
print('Folds: %i, mean R2: %.3f' % (len(scores), np.mean(r2_scores)))
print('Folds: %i, mean RMSE: %.3f' % (len(scores), np.mean(rmse_scores)))
A root-mean-squared error of ~22 GPa is not bad! Let's see what this looks like on a plot.
Note that in order to get the code below to work, you might need to start Jupyter notebook with a higher data rate, e.g., jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10.
from matminer.figrecipes.plot import PlotlyFig
from sklearn.model_selection import cross_val_predict
pf = PlotlyFig(x_title='DFT (MP) bulk modulus (GPa)',
y_title='Predicted bulk modulus (GPa)',
title='Linear regression',
mode='notebook',
filename="lr_regression.html")
pf.xy(xy_pairs=[(y, cross_val_predict(lr, X, y, cv=crossvalidation)), ([0, 400], [0, 400])],
labels=df['formula'],
modes=['markers', 'lines'],
lines=[{}, {'color': 'black', 'dash': 'dash'}],
showlegends=False
)