Matminer introduction - Predicting bulk modulus

This notebook demonstrates some of the basic features of matminer.

This notebook was last updated 11/15/18 for version 0.4.5 of matminer.

Note that in order to get the in-line plotting to work, you might need to start Jupyter notebook with a higher data rate, e.g., jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10. We recommend you do this before starting.

Overview

In this notebook, we will:

  1. Load and examine a dataset in a pandas dataframe
  2. Add descriptors to the dataframe using matminer
  3. Train, compare, and visualize several machine learning methods with scikit-learn and matminer FigRecipes.

1. Load and process data set

Matminer comes pre-loaded with several example data sets you can use. Below, we'll load a data set of computed elastic properties of materials, sourced from the paper: "Charting the complete elastic properties of inorganic crystalline compounds", M. de Jong et al., Sci. Data 2 (2015) 150009.

In [1]:
from matminer.datasets.convenience_loaders import load_elastic_tensor
df = load_elastic_tensor()  # loads dataset in a pandas DataFrame object

1.1 A first look at the data set

The data set comes as a pandas DataFrame, which is a kind of "spreadsheet" object in Python. DataFrames have several useful methods you can use to explore and clean the data, some of which we'll explore below.
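For a quick first orientation, a few standard pandas calls are handy (a minimal sketch; these are plain pandas methods, nothing matminer-specific):

df.shape           # (number of rows, number of columns)
df.dtypes          # data type of each column
df.isnull().sum()  # count of missing entries per column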

In [2]:
df.head()
Out[2]:
material_id formula nsites space_group volume structure elastic_anisotropy G_Reuss G_VRH G_Voigt K_Reuss K_VRH K_Voigt poisson_ratio compliance_tensor elastic_tensor elastic_tensor_original
0 mp-10003 Nb4CoSi 12 124 194.419802 [[0.94814328 2.07280467 2.5112 ] Nb, [5.273... 0.030688 96.844535 97.141604 97.438674 194.267623 194.268884 194.270146 0.285701 [[0.004385293093993, -0.0016070693558990002, -... [[311.33514638650246, 144.45092552856926, 126.... [[311.33514638650246, 144.45092552856926, 126....
1 mp-10010 Al(CoSi)2 5 164 61.987320 [[0. 0. 0.] Al, [1.96639263 1.13529553 0.75278... 0.266910 93.939650 96.252006 98.564362 173.647763 175.449907 177.252050 0.268105 [[0.0037715428949660003, -0.000844229828709, -... [[306.93357350984974, 88.02634955100905, 105.6... [[306.93357350984974, 88.02634955100905, 105.6...
2 mp-10015 SiOs 2 221 25.952539 [[1.480346 1.480346 1.480346] Si, [0. 0. 0.] Os] 0.756489 120.962289 130.112955 139.263621 295.077545 295.077545 295.077545 0.307780 [[0.0019959391925840004, -0.000433146670736000... [[569.5291276937579, 157.8517489654999, 157.85... [[569.5291276937579, 157.8517489654999, 157.85...
3 mp-10021 Ga 4 63 76.721433 [[0. 1.09045794 0.84078375] Ga, [0. ... 2.376805 12.205989 15.101901 17.997812 49.025963 49.130670 49.235377 0.360593 [[0.021647143908635, -0.005207263618160001, -0... [[69.28798774976904, 34.7875015216915, 37.3877... [[70.13259066665267, 40.60474945058445, 37.387...
4 mp-10025 SiRu2 12 62 160.300999 [[1.0094265 4.24771709 2.9955487 ] Si, [3.028... 0.196930 100.110773 101.947798 103.784823 255.055257 256.768081 258.480904 0.324682 [[0.00410214297725, -0.001272204332729, -0.001... [[349.3767766177825, 186.67131003104407, 176.4... [[407.4791016459293, 176.4759188081947, 213.83...

1.2 Removing unneeded columns from the data set

The data set above has many columns; we won't need all of them for our modeling. We'll mainly be trying to predict K_VRH and G_VRH (the Voigt-Reuss-Hill averages of the bulk and shear modulus, respectively) and the elastic_anisotropy, so we can drop most of the other output data. We also don't need redundant metadata such as volume and nsites, since that information is already embedded in the structure column.

In [3]:
df.columns
Out[3]:
Index(['material_id', 'formula', 'nsites', 'space_group', 'volume',
       'structure', 'elastic_anisotropy', 'G_Reuss', 'G_VRH', 'G_Voigt',
       'K_Reuss', 'K_VRH', 'K_Voigt', 'poisson_ratio', 'compliance_tensor',
       'elastic_tensor', 'elastic_tensor_original'],
      dtype='object')
In [4]:
unwanted_columns = ["volume", "nsites", "compliance_tensor", "elastic_tensor", 
                    "elastic_tensor_original", "K_Voigt", "G_Voigt", "K_Reuss", "G_Reuss"]
df = df.drop(unwanted_columns, axis=1)
In [5]:
df.head()
Out[5]:
material_id formula space_group structure elastic_anisotropy G_VRH K_VRH poisson_ratio
0 mp-10003 Nb4CoSi 124 [[0.94814328 2.07280467 2.5112 ] Nb, [5.273... 0.030688 97.141604 194.268884 0.285701
1 mp-10010 Al(CoSi)2 164 [[0. 0. 0.] Al, [1.96639263 1.13529553 0.75278... 0.266910 96.252006 175.449907 0.268105
2 mp-10015 SiOs 221 [[1.480346 1.480346 1.480346] Si, [0. 0. 0.] Os] 0.756489 130.112955 295.077545 0.307780
3 mp-10021 Ga 63 [[0. 1.09045794 0.84078375] Ga, [0. ... 2.376805 15.101901 49.130670 0.360593
4 mp-10025 SiRu2 62 [[1.0094265 4.24771709 2.9955487 ] Si, [3.028... 0.196930 101.947798 256.768081 0.324682

1.3 Getting a feel for the data using descriptive statistics

A pandas DataFrame includes a method called describe() that computes summary statistics for the various numerical (and, optionally, categorical) columns in the data.

In [6]:
df.describe()
Out[6]:
space_group elastic_anisotropy G_VRH K_VRH poisson_ratio
count 1181.000000 1181.000000 1181.000000 1181.000000 1181.000000
mean 163.403895 2.145013 67.543145 136.259661 0.287401
std 65.040733 19.140097 44.579408 72.886978 0.062177
min 4.000000 0.000005 2.722175 6.476135 0.042582
25% 124.000000 0.145030 34.117959 76.435350 0.249159
50% 193.000000 0.355287 59.735163 130.382766 0.290198
75% 221.000000 0.923117 91.332142 189.574194 0.328808
max 229.000000 397.297866 522.921225 435.661487 0.467523

Sometimes, the describe() function will reveal outliers that indicate mistakes in the data, for example a negative (and hence unphysical) minimum bulk/shear modulus, or a maximum modulus that is implausibly high. In this case the data looks OK at first glance: there are no clear problems with the ranges of the various properties, so we won't filter out any data.
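Had describe() revealed unphysical entries, a boolean mask would be an easy way to drop them, e.g. (a hypothetical sketch; our data does not need this filter):

df = df[(df["K_VRH"] > 0) & (df["G_VRH"] > 0)]  # keep only physically sensible moduli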

Note that the describe() function only describes numerical columns by default.
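If you also want summaries of the non-numerical columns, standard pandas lets you request them explicitly:

df.describe(include='all')  # also summarizes object columns such as formula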

2. Add descriptors to the data ("featurization")

We are seeking to find relationships between the inputs (composition and crystal structure of a material) and outputs (elastic properties such as K_VRH, G_VRH, and elastic_anisotropy). To find such relationships, we need to "featurize" the input data, i.e., convert it into numbers that meaningfully represent the underlying physical quantities. For example, one "feature" or "descriptor" of the composition of a material such as Nb4CoSi is the standard deviation of the Pauling electronegativity of the elements in the compound (weighted by stoichiometry): compositions with high values of this quantity are more ionic, while compositions with lower values tend towards covalent or metallic bonding. A descriptor of the crystal structure might be the average coordination number of sites; higher coordination numbers mean more bonds and therefore might indicate stiffer materials. With matminer, we can generate hundreds of possible descriptors using its descriptor library. Data mining techniques can then help narrow down which descriptors are most relevant to the target problem, using the available output data as a guide.
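To make the electronegativity example concrete, here is a minimal sketch of that descriptor computed by hand with pymatgen and numpy; matminer's featurizers compute this and many variants for you, so this is for illustration only:

from pymatgen import Composition
import numpy as np

comp = Composition("Nb4CoSi")
fracs = np.array([comp.get_atomic_fraction(el) for el in comp.elements])
chis = np.array([el.X for el in comp.elements])  # Pauling electronegativities

# stoichiometry-weighted mean and standard deviation
mean_chi = np.sum(fracs * chis)
std_chi = np.sqrt(np.sum(fracs * (chis - mean_chi) ** 2))
print(std_chi)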

2.1 Add composition-based features

A major class of featurizers available in matminer uses the chemical composition to featurize the input data. Let's add some composition based features to our DataFrame.

The first step is to have a column representing the chemical composition as a pymatgen Composition object. One way to do this is to use the conversion featurizers in matminer (the matminer.featurizers.conversions module) to turn a string composition (our formula column from before) into a pymatgen Composition.

In [7]:
from matminer.featurizers.conversions import StrToComposition
df = StrToComposition().featurize_dataframe(df, "formula")
df.head()

Out[7]:
material_id formula space_group structure elastic_anisotropy G_VRH K_VRH poisson_ratio composition
0 mp-10003 Nb4CoSi 124 [[0.94814328 2.07280467 2.5112 ] Nb, [5.273... 0.030688 97.141604 194.268884 0.285701 (Nb, Co, Si)
1 mp-10010 Al(CoSi)2 164 [[0. 0. 0.] Al, [1.96639263 1.13529553 0.75278... 0.266910 96.252006 175.449907 0.268105 (Al, Co, Si)
2 mp-10015 SiOs 221 [[1.480346 1.480346 1.480346] Si, [0. 0. 0.] Os] 0.756489 130.112955 295.077545 0.307780 (Si, Os)
3 mp-10021 Ga 63 [[0. 1.09045794 0.84078375] Ga, [0. ... 2.376805 15.101901 49.130670 0.360593 (Ga)
4 mp-10025 SiRu2 62 [[1.0094265 4.24771709 2.9955487 ] Si, [3.028... 0.196930 101.947798 256.768081 0.324682 (Si, Ru)

As you can see, we now have a new composition column. Its rendering in the DataFrame is terse, but each entry embeds both the elements and the amount of each element in the composition (the amounts are not shown above).
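You can inspect one of these Composition objects directly using standard pymatgen attributes, for example:

comp = df["composition"].iloc[0]       # the Nb4CoSi entry
print(comp.elements)                   # the elements in the composition
print(comp.get_atomic_fraction("Nb"))  # stoichiometric fraction of Nb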

Next, we'll use one of the featurizers in matminer to add a suite of descriptors to the DataFrame.

In [8]:
from matminer.featurizers.composition import ElementProperty

ep_feat = ElementProperty.from_preset(preset_name="magpie")
df = ep_feat.featurize_dataframe(df, col_id="composition")  # input the "composition" column to the featurizer
df.head()

Out[8]:
material_id formula space_group structure elastic_anisotropy G_VRH K_VRH poisson_ratio composition minimum Number ... range GSmagmom mean GSmagmom avg_dev GSmagmom mode GSmagmom minimum SpaceGroupNumber maximum SpaceGroupNumber range SpaceGroupNumber mean SpaceGroupNumber avg_dev SpaceGroupNumber mode SpaceGroupNumber
0 mp-10003 Nb4CoSi 124 [[0.94814328 2.07280467 2.5112 ] Nb, [5.273... 0.030688 97.141604 194.268884 0.285701 (Nb, Co, Si) 14.0 ... 1.548471 0.258079 0.430131 0.0 194.0 229.0 35.0 222.833333 9.611111 229.0
1 mp-10010 Al(CoSi)2 164 [[0. 0. 0.] Al, [1.96639263 1.13529553 0.75278... 0.266910 96.252006 175.449907 0.268105 (Al, Co, Si) 13.0 ... 1.548471 0.619388 0.743266 0.0 194.0 227.0 33.0 213.400000 15.520000 194.0
2 mp-10015 SiOs 221 [[1.480346 1.480346 1.480346] Si, [0. 0. 0.] Os] 0.756489 130.112955 295.077545 0.307780 (Si, Os) 14.0 ... 0.000000 0.000000 0.000000 0.0 194.0 227.0 33.0 210.500000 16.500000 194.0
3 mp-10021 Ga 63 [[0. 1.09045794 0.84078375] Ga, [0. ... 2.376805 15.101901 49.130670 0.360593 (Ga) 31.0 ... 0.000000 0.000000 0.000000 0.0 64.0 64.0 0.0 64.000000 0.000000 64.0
4 mp-10025 SiRu2 62 [[1.0094265 4.24771709 2.9955487 ] Si, [3.028... 0.196930 101.947798 256.768081 0.324682 (Si, Ru) 14.0 ... 0.000000 0.000000 0.000000 0.0 194.0 227.0 33.0 205.000000 14.666667 194.0

5 rows × 141 columns

As you can see, we now have many more columns in the DataFrame (a total of 141 columns - not all are shown). So we added many features to our data!

As an aside, note that each featurizer also has a citations() function that tells you where to find out more about the Featurizer.

In [9]:
ep_feat.citations()
Out[9]:
['@article{ward_agrawal_choudary_wolverton_2016, title={A general-purpose machine learning framework for predicting properties of inorganic materials}, volume={2}, DOI={10.1038/npjcompumats.2017.28}, number={1}, journal={npj Computational Materials}, author={Ward, Logan and Agrawal, Ankit and Choudhary, Alok and Wolverton, Christopher}, year={2016}}']

2.2 Add more composition-based features

There are many more composition-based featurizers apart from ElementProperty available in matminer.featurizers.composition. Let's try the OxidationStates featurizer, which requires knowing the oxidation states of the various elements in the Composition. This information is not present yet, but we can use the conversions package to guess oxidation states and then apply the featurizer to the resulting column.

In [10]:
from matminer.featurizers.conversions import CompositionToOxidComposition
from matminer.featurizers.composition import OxidationStates

df = CompositionToOxidComposition().featurize_dataframe(df, "composition")

os_feat = OxidationStates()
df = os_feat.featurize_dataframe(df, "composition_oxid")
df.head()


Out[10]:
material_id formula space_group structure elastic_anisotropy G_VRH K_VRH poisson_ratio composition minimum Number ... maximum SpaceGroupNumber range SpaceGroupNumber mean SpaceGroupNumber avg_dev SpaceGroupNumber mode SpaceGroupNumber composition_oxid minimum oxidation state maximum oxidation state range oxidation state std_dev oxidation state
0 mp-10003 Nb4CoSi 124 [[0.94814328 2.07280467 2.5112 ] Nb, [5.273... 0.030688 97.141604 194.268884 0.285701 (Nb, Co, Si) 14.0 ... 229.0 35.0 222.833333 9.611111 229.0 (Nb0+, Co0+, Si0+) 0 0 0 0.000000
1 mp-10010 Al(CoSi)2 164 [[0. 0. 0.] Al, [1.96639263 1.13529553 0.75278... 0.266910 96.252006 175.449907 0.268105 (Al, Co, Si) 13.0 ... 227.0 33.0 213.400000 15.520000 194.0 (Al3+, Co2+, Co3+, Si4-) -4 3 7 3.872983
2 mp-10015 SiOs 221 [[1.480346 1.480346 1.480346] Si, [0. 0. 0.] Os] 0.756489 130.112955 295.077545 0.307780 (Si, Os) 14.0 ... 227.0 33.0 210.500000 16.500000 194.0 (Si4-, Os4+) -4 4 8 5.656854
3 mp-10021 Ga 63 [[0. 1.09045794 0.84078375] Ga, [0. ... 2.376805 15.101901 49.130670 0.360593 (Ga) 31.0 ... 64.0 0.0 64.000000 0.000000 64.0 (Ga0+) 0 0 0 0.000000
4 mp-10025 SiRu2 62 [[1.0094265 4.24771709 2.9955487 ] Si, [3.028... 0.196930 101.947798 256.768081 0.324682 (Si, Ru) 14.0 ... 227.0 33.0 205.000000 14.666667 194.0 (Si4-, Ru2+) -4 2 6 4.242641

5 rows × 146 columns

As you can see, the end of our DataFrame now has some oxidation-state-based features!

2.3 Add some structure based features

Not all featurizers operate on compositions; matminer can analyze and featurize crystal structures as well. Let's start by adding some simple density features.

In [11]:
from matminer.featurizers.structure import DensityFeatures

df_feat = DensityFeatures()
df = df_feat.featurize_dataframe(df, "structure")  # input the structure column to the featurizer
df.head()

Out[11]:
material_id formula space_group structure elastic_anisotropy G_VRH K_VRH poisson_ratio composition minimum Number ... avg_dev SpaceGroupNumber mode SpaceGroupNumber composition_oxid minimum oxidation state maximum oxidation state range oxidation state std_dev oxidation state density vpa packing fraction
0 mp-10003 Nb4CoSi 124 [[0.94814328 2.07280467 2.5112 ] Nb, [5.273... 0.030688 97.141604 194.268884 0.285701 (Nb, Co, Si) 14.0 ... 9.611111 229.0 (Nb0+, Co0+, Si0+) 0 0 0 0.000000 7.834556 16.201654 0.688834
1 mp-10010 Al(CoSi)2 164 [[0. 0. 0.] Al, [1.96639263 1.13529553 0.75278... 0.266910 96.252006 175.449907 0.268105 (Al, Co, Si) 13.0 ... 15.520000 194.0 (Al3+, Co2+, Co3+, Si4-) -4 3 7 3.872983 5.384968 12.397466 0.644386
2 mp-10015 SiOs 221 [[1.480346 1.480346 1.480346] Si, [0. 0. 0.] Os] 0.756489 130.112955 295.077545 0.307780 (Si, Os) 14.0 ... 16.500000 194.0 (Si4-, Os4+) -4 4 8 5.656854 13.968635 12.976265 0.569426
3 mp-10021 Ga 63 [[0. 1.09045794 0.84078375] Ga, [0. ... 2.376805 15.101901 49.130670 0.360593 (Ga) 31.0 ... 0.000000 64.0 (Ga0+) 0 0 0 0.000000 6.036267 19.180359 0.479802
4 mp-10025 SiRu2 62 [[1.0094265 4.24771709 2.9955487 ] Si, [3.028... 0.196930 101.947798 256.768081 0.324682 (Si, Ru) 14.0 ... 14.666667 194.0 (Si4-, Ru2+) -4 2 6 4.242641 9.539514 13.358418 0.598395

5 rows × 149 columns

In [12]:
df_feat.feature_labels()
Out[12]:
['density', 'vpa', 'packing fraction']

Again, we see more features in the last few columns: density, vpa, and packing fraction.

There are many more structure-based featurizers that perform far more complex and detailed analyses of the crystal structure; those are outside the scope of this example, but the sketch below shows that they follow the same pattern. After that, let's move on to some machine learning predictions.
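For instance, GlobalSymmetryFeatures (also in matminer.featurizers.structure) derives spacegroup and crystal-system descriptors. A sketch of how it would be applied, following the same pattern as DensityFeatures (left commented out here so the feature set used in the next section is unchanged):

from matminer.featurizers.structure import GlobalSymmetryFeatures

gsf = GlobalSymmetryFeatures()
# df = gsf.featurize_dataframe(df, "structure")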

3. Try some different machine learning models to relate input features to the bulk modulus

We now have enough data to do some machine learning! We'll need to first determine what columns we consider input (independent variables) and what column we consider output (dependent variable).

3.1 Define input data and output data

For now, we'll use K_VRH (bulk modulus) as the output. You could retry this example with G_VRH, elastic_anisotropy, or poisson_ratio as the target output.

For the inputs, we'll use all the features we generated. That is, everything except the output data and non-numerical columns like composition and structure.

In [13]:
y = df['K_VRH'].values
excluded = ["G_VRH", "K_VRH", "elastic_anisotropy", "formula", "material_id", 
            "poisson_ratio", "structure", "composition", "composition_oxid"]
X = df.drop(excluded, axis=1)
print("There are {} possible descriptors:\n\n{}".format(X.shape[1], X.columns.values))
There are 140 possible descriptors:

['space_group' 'minimum Number' 'maximum Number' 'range Number'
 'mean Number' 'avg_dev Number' 'mode Number' 'minimum MendeleevNumber'
 'maximum MendeleevNumber' 'range MendeleevNumber' 'mean MendeleevNumber'
 'avg_dev MendeleevNumber' 'mode MendeleevNumber' 'minimum AtomicWeight'
 'maximum AtomicWeight' 'range AtomicWeight' 'mean AtomicWeight'
 'avg_dev AtomicWeight' 'mode AtomicWeight' 'minimum MeltingT'
 'maximum MeltingT' 'range MeltingT' 'mean MeltingT' 'avg_dev MeltingT'
 'mode MeltingT' 'minimum Column' 'maximum Column' 'range Column'
 'mean Column' 'avg_dev Column' 'mode Column' 'minimum Row' 'maximum Row'
 'range Row' 'mean Row' 'avg_dev Row' 'mode Row' 'minimum CovalentRadius'
 'maximum CovalentRadius' 'range CovalentRadius' 'mean CovalentRadius'
 'avg_dev CovalentRadius' 'mode CovalentRadius'
 'minimum Electronegativity' 'maximum Electronegativity'
 'range Electronegativity' 'mean Electronegativity'
 'avg_dev Electronegativity' 'mode Electronegativity' 'minimum NsValence'
 'maximum NsValence' 'range NsValence' 'mean NsValence'
 'avg_dev NsValence' 'mode NsValence' 'minimum NpValence'
 'maximum NpValence' 'range NpValence' 'mean NpValence'
 'avg_dev NpValence' 'mode NpValence' 'minimum NdValence'
 'maximum NdValence' 'range NdValence' 'mean NdValence'
 'avg_dev NdValence' 'mode NdValence' 'minimum NfValence'
 'maximum NfValence' 'range NfValence' 'mean NfValence'
 'avg_dev NfValence' 'mode NfValence' 'minimum NValence'
 'maximum NValence' 'range NValence' 'mean NValence' 'avg_dev NValence'
 'mode NValence' 'minimum NsUnfilled' 'maximum NsUnfilled'
 'range NsUnfilled' 'mean NsUnfilled' 'avg_dev NsUnfilled'
 'mode NsUnfilled' 'minimum NpUnfilled' 'maximum NpUnfilled'
 'range NpUnfilled' 'mean NpUnfilled' 'avg_dev NpUnfilled'
 'mode NpUnfilled' 'minimum NdUnfilled' 'maximum NdUnfilled'
 'range NdUnfilled' 'mean NdUnfilled' 'avg_dev NdUnfilled'
 'mode NdUnfilled' 'minimum NfUnfilled' 'maximum NfUnfilled'
 'range NfUnfilled' 'mean NfUnfilled' 'avg_dev NfUnfilled'
 'mode NfUnfilled' 'minimum NUnfilled' 'maximum NUnfilled'
 'range NUnfilled' 'mean NUnfilled' 'avg_dev NUnfilled' 'mode NUnfilled'
 'minimum GSvolume_pa' 'maximum GSvolume_pa' 'range GSvolume_pa'
 'mean GSvolume_pa' 'avg_dev GSvolume_pa' 'mode GSvolume_pa'
 'minimum GSbandgap' 'maximum GSbandgap' 'range GSbandgap'
 'mean GSbandgap' 'avg_dev GSbandgap' 'mode GSbandgap' 'minimum GSmagmom'
 'maximum GSmagmom' 'range GSmagmom' 'mean GSmagmom' 'avg_dev GSmagmom'
 'mode GSmagmom' 'minimum SpaceGroupNumber' 'maximum SpaceGroupNumber'
 'range SpaceGroupNumber' 'mean SpaceGroupNumber'
 'avg_dev SpaceGroupNumber' 'mode SpaceGroupNumber'
 'minimum oxidation state' 'maximum oxidation state'
 'range oxidation state' 'std_dev oxidation state' 'density' 'vpa'
 'packing fraction']

3.2 Try a linear regression model using scikit-learn

The scikit-learn library makes it easy to fit and cross-validate different types of regression models. Let's start with one of the simplest - a linear regression.

In [14]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

lr = LinearRegression()

lr.fit(X, y)

# get fit statistics
print('training R2 = ' + str(round(lr.score(X, y), 3)))
print('training RMSE = %.3f' % np.sqrt(mean_squared_error(y_true=y, y_pred=lr.predict(X))))
training R2 = 0.927
training RMSE = 19.685
/Users/daniel/.conda/envs/matminer/lib/python3.6/site-packages/sklearn/linear_model/base.py:485: RuntimeWarning:

internal gelsd driver lwork query error, required iwork dimension not returned. This is likely the result of LAPACK bug 0038, fixed in LAPACK 3.2.2 (released July 21, 2010). Falling back to 'gelss' driver.

This looks reasonable since linear regression is a simple (high bias) model. But to really validate that we are not over-fitting, we need to check the cross-validation score rather than the fitting score.

In [15]:
from sklearn.model_selection import KFold, cross_val_score

# Use 10-fold cross validation (90% training, 10% test).
# Note: random_state only affects KFold when shuffle=True, so we omit it here.
crossvalidation = KFold(n_splits=10, shuffle=False)
scores = cross_val_score(lr, X, y, scoring='neg_mean_squared_error', cv=crossvalidation, n_jobs=1)
rmse_scores = [np.sqrt(abs(s)) for s in scores]
r2_scores = cross_val_score(lr, X, y, scoring='r2', cv=crossvalidation, n_jobs=1)

print('Cross-validation results:')
print('Folds: %i, mean R2: %.3f' % (len(scores), np.mean(r2_scores)))  # abs() would mask negative fold scores
print('Folds: %i, mean RMSE: %.3f' % (len(scores), np.mean(rmse_scores)))
Cross-validation results:
Folds: 10, mean R2: 0.901
Folds: 10, mean RMSE: 22.490

A root-mean-squared error of ~22 GPa is not bad! Let's see what this looks like on a plot.

As noted at the top of this notebook, you might need to start Jupyter with a higher data rate for the in-line plot below to render, e.g., jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10.

In [16]:
from matminer.figrecipes.plot import PlotlyFig
from sklearn.model_selection import cross_val_predict

pf = PlotlyFig(x_title='DFT (MP) bulk modulus (GPa)',
               y_title='Predicted bulk modulus (GPa)',
               title='Linear regression',
               mode='notebook',
               filename="lr_regression.html")

pf.xy(xy_pairs=[(y, cross_val_predict(lr, X, y, cv=crossvalidation)), ([0, 400], [0, 400])], 
      labels=df['formula'], 
      modes=['markers', 'lines'],
      lines=[{}, {'color': 'black', 'dash': 'dash'}], 
      showlegends=False
     )