authors: Evgeny Blokhin, Joseph Montoya
note: This notebook requires use of the MPDS data retrieval tool, which requires an account to the MPDS database. Please inquire about access to this database at the MPDS website.
Note that in order to get the in-line plotting to work, you might need to start Jupyter notebook with a higher data rate limit, e.g., jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10. We recommend you do this before starting.
This notebook was last updated 11/15/18 for version 0.4.5 of matminer.
We use the MPDSDataRetrieval tool to download crystalline structures from the MPDS database. Let's say we want to study the chemical bond between uranium and oxygen. Which length of this bond is most frequently reported in the world's scientific literature? The MPDS contains many crystalline structures with uranium and oxygen, so let's perform a quick data investigation to answer this question:
# Enter your MPDS API key below!
API_KEY = None
import pandas as pd
import numpy as np
import tqdm
from matminer.data_retrieval.retrieve_MPDS import MPDSDataRetrieval
from matminer.figrecipes.plot import PlotlyFig
Here we use a naive brute-force approach to calculate bond lengths between particular atom types in a crystalline environment. We are interested in neighboring atoms only, so we discard interatomic distances of 4 Angstroms or more. We represent a crystalline structure with ase's Atoms class and calculate distances using its get_all_distances method. Note the rounding of distances, marked by the comment NB in the code:
def calculate_lengths(ase_obj, elA, elB, limit=4):
    """
    Short helper function to get bond lengths between element A
    and element B.
    """
    assert elA != elB
    lengths = []
    all_lengths = ase_obj.get_all_distances()
    for n, atom in enumerate(ase_obj):
        if atom.symbol == elA:
            for m, neighbor in enumerate(ase_obj):
                if neighbor.symbol == elB:
                    dist = round(all_lengths[n][m], 2)  # NB occurrence <-> rounding
                    if dist < limit:
                        lengths.append(dist)
    return lengths
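To make the brute-force loop easier to follow, here is a minimal standard-library sketch of the same idea on a toy fragment. It is not part of the matminer or ase API: the helper name toy_lengths, the list-of-tuples representation, and the coordinates are all hypothetical, and the toy ignores crystal periodicity entirely.

```python
import math


def toy_lengths(atoms, el_a, el_b, limit=4.0):
    """Brute-force A-B distances below `limit`.

    `atoms` is a hypothetical list of (symbol, (x, y, z)) tuples,
    with Cartesian coordinates in Angstroms.
    """
    lengths = []
    for sym_a, pos_a in atoms:
        if sym_a != el_a:
            continue
        for sym_b, pos_b in atoms:
            if sym_b != el_b:
                continue
            # Round to 2 decimals, mirroring the NB line in calculate_lengths
            dist = round(math.dist(pos_a, pos_b), 2)
            if dist < limit:
                lengths.append(dist)
    return lengths


# Illustrative linear O-U-O fragment plus one distant oxygen (made-up coordinates)
toy_atoms = [
    ("U", (0.0, 0.0, 0.0)),
    ("O", (0.0, 0.0, 1.78)),
    ("O", (0.0, 0.0, -1.78)),
    ("O", (5.0, 0.0, 0.0)),
]

toy_result = toy_lengths(toy_atoms, "U", "O")
print(toy_result)  # the oxygen at 5.0 Angstroms is filtered out by the limit
```

Raising the limit above 5 Angstroms would also keep the distant oxygen, which shows how the cutoff decides what counts as a neighbor.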
Note that the crystalline structures are not retrieved from the MPDS by default, so we need to specify four additional fields:
cell_abc
sg_n
basis_noneq
els_noneq
On top of that, we also obtain the crystalline phase_ids, MPDS entry numbers, and chemical formulae. Note that the get_data API client method returns a plain Python list, whereas the get_dataframe API client method returns a Pandas dataframe. We use the former below:
client = MPDSDataRetrieval(api_key=API_KEY)
answer = client.get_data(
    criteria={"elements": "U-O", "props": "atomic structure", "classes": "binary"},
    fields={'S': ['phase_id', 'entry', 'chemical_formula', 'cell_abc',
                  'sg_n', 'basis_noneq', 'els_noneq']}
)
Got 172 hits
The MPDSDataRetrieval.compile_crystal API client method helps us handle the crystalline structure in the form of ase's Atoms. We then call the calculate_lengths function defined earlier.
lengths = []
for item in tqdm.tqdm(answer):
    crystal = MPDSDataRetrieval.compile_crystal(item, 'ase')
    if not crystal:
        continue
    lengths.extend(calculate_lengths(crystal, 'U', 'O'))
100%|██████████| 172/172 [00:24<00:00, 6.96it/s]
That runs a little slowly, since ase's Atoms objects are not designed for fast brute-force loops over hundreds of structures. We could use ase's neighbor_list method or employ a C extension here, but this is outside the scope of this exercise. The popular Pymatgen library can also be used instead of ase. (Mind, however, that Pymatgen and ase structure objects are not directly interchangeable.) In any case, we now have a flat list, lengths. Let's convert it into a Pandas dataframe and find which U-O distances occur more often than the others.
dfrm = pd.DataFrame(sorted(lengths), columns=['length'])
dfrm['occurrence'] = dfrm.groupby('length')['length'].transform('count')
dfrm.drop_duplicates('length', inplace=True)
What did we do here? We calculated the number of occurrences (counts) of each particular U-O length and then updated our dataframe dfrm with this information, creating a new column, occurrence. Here is the resulting distribution of bond lengths, rendered using matminer.figrecipes.plot.PlotlyFig.
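The link between rounding and occurrence counting can be illustrated on a toy list with the standard library alone. This is only a sketch: the distance values are made up for illustration and are not MPDS results, and collections.Counter stands in for the pandas groupby/transform/drop_duplicates steps.

```python
from collections import Counter

# Hypothetical raw U-O distances in Angstroms (illustrative values only)
raw = [1.7801, 1.7799, 2.352, 2.348, 2.351]

# Without rounding, every value is distinct, so each one occurs exactly once
raw_counts = Counter(raw)

# Rounding to 2 decimals groups nearly equal distances together
rounded_counts = Counter(round(d, 2) for d in raw)

# One (length, occurrence) row per distinct length, sorted by length,
# analogous to the deduplicated dataframe with its occurrence column
rows = sorted(rounded_counts.items())
print(rows)

# The single most frequent rounded length
most_common_length = rounded_counts.most_common(1)[0][0]
print(most_common_length)
```

The number of decimals is a trade-off: fewer decimals merge more distances into each peak, while more decimals split the peaks until every count approaches one.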
We can see below that the most frequent bond lengths between neighboring uranium and oxygen atoms are 1.78 and 2.35 Angstroms. This agrees with the well-known 1997 study of Burns et al. [1]. However, Burns et al. considered only 105 structures, whereas we processed more than 170, which lends further support to their findings on the uranyl ion geometry.
pf = PlotlyFig(dfrm, mode='notebook', x_title="Bond lengths (A)")
pf.histogram(cols=['length'], n_bins=50)