#!/usr/bin/env python
# coding: utf-8

# # Uranium-oxygen bond length analysis
# 
# *authors: Evgeny Blokhin, Joseph Montoya*
# 
# **note: This notebook requires use of the MPDS data retrieval tool, which requires an account to the MPDS database.  Please inquire about access to this database at the [MPDS website](https://mpds.io/). **
# 
# **Note that in order to get the in-line plotting to work, you might need to start Jupyter notebook with a higher data rate, e.g., ``jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10``. We recommend you do this before starting.**
# 
# 
# This notebook was last updated 11/15/18 for version 0.4.5 of matminer.
# 
# We use the MPDSDataRetrieval tool to download the crystalline structures from the MPDS database. Let's say we want to study the chemical bond of uranium and oxygen. What is the length of this bond the most frequently reported in the world's scientific literature? The MPDS contains many crystalline structures with uranium and oxygen, so let's perform a quick data investigation to answer this question:
#     

# In[1]:


# Enter your MPDS API key below!
API_KEY = None


# In[2]:


import pandas as pd
import numpy as np
import tqdm
from matminer.data_retrieval.retrieve_MPDS import MPDSDataRetrieval
from matminer.figrecipes.plot import PlotlyFig


# Here we use a naive brute-force approach to calculate bond lengths between particular atom types in a crystalline environment. Obviously, we are interested in neighboring atoms only, so we do not consider interatomic distances more than, let's say, 4 Angstroms. We then represent a crystalline structure with the **ase**'s `Atoms` class and calculate distances using its `get_distance` method. Note rounding of distances marked by comment `NB` in the code:
#     

# In[3]:


def calculate_lengths(ase_obj, elA, elB, limit=4):
    """
    Short helper function to get bond lengths between element A
    and element B.
    """
    assert elA != elB
    lengths = []
    all_lengths = ase_obj.get_all_distances()
    
    for n, atom in enumerate(ase_obj):
        if atom.symbol == elA:
            for m, neighbor in enumerate(ase_obj):
                if neighbor.symbol == elB:
                    dist = round(all_lengths[n][m], 2) # NB occurrence <-> rounding
                    if dist < limit:
                        lengths.append(dist)
    return lengths


# Note that the crystalline structures are not retrieved from the MPDS by default, so we need to specify additional four fields:
# * `cell_abc`
# * `sg_n`
# * `basis_noneq`
# * `els_noneq`
# 
# On top of that, we also obtain crystalline `phase_id`s, MPDS entry numbers, and chemical formulae. Note that `get_data` API client method returns a usual Python list, whereas `get_dataframe` API client method returns a Pandas dataframe. We use the former below:
#     

# In[4]:


client = MPDSDataRetrieval(api_key=API_KEY)

answer = client.get_data(criteria={"elements": "U-O",
                                 "props": "atomic structure", "classes": "binary"},
                         fields={'S':['phase_id', 'entry', 'chemical_formula', 'cell_abc', 
                                      'sg_n', 'basis_noneq', 'els_noneq']})


# `MPDSDataRetrieval.compile_crystal` API client method helps us to handle the crystalline structure in the **ase**'s `Atoms` flavor. We then call `calculate_lengths` function defined earlier.
#     

# In[5]:


lengths = []

for item in tqdm.tqdm(answer):
    crystal = MPDSDataRetrieval.compile_crystal(item, 'ase')
    if not crystal: continue
    lengths.extend(calculate_lengths(crystal, 'U', 'O'))


# That runs a little bit slow, since **ase**'s `Atoms` are expectedly not performing very well on hundreds of bond length calculations. We may want to use the **ase**'s **neighbor_list** method, or employ a C-extension here, but this is outside the scope of this exercise. A popular [Pymatgen](https://github.com/materialsproject/pymatgen) library can be also used instead of **ase**. (Mind however that **Pymatgen** and **ase** are generally incompatible.) Anyway now we have a flat list `lengths`. Let's convert it into a Pandas Dataframe and find which `U-O` distances occur more often than the others.
#     

# In[6]:


dfrm = pd.DataFrame(sorted(lengths), columns=['length'])
dfrm['occurrence'] = dfrm.groupby('length')['length'].transform('count')
dfrm.drop_duplicates('length', inplace=True)


# What did we do here? We calculated the numbers of occurrences (counts) of each particular `U-O` length and then updated our dataframe `dfrm` with this info, creating a new column `occurrence`. Here is a resulting distribution of bond lengths rendered using `matminer.figrecipes.plot.PlotlyFig`.
#     

# We can see below that the most frequent bond lengths between the neighboring uranium and oxygen atoms are 1.78 and 2.35 Angstroms. This agrees with the well-known study of Burns _et al._ [[1]](http://canmin.geoscienceworld.org/content/35/6/1551), done in 1997.  However, Burns considered only 105 structures, and we did more than 170, confirming even more thoroughly his findings on the uranyl ion geometry.

# In[7]:


pf = PlotlyFig(dfrm, mode='notebook', x_title="Bond lengths (A)")
pf.histogram(cols=['length'], n_bins=50)