This largely parallels the notebook Working with PDBsum in Jupyter & Demonstration of PDBsum protein interface data to dataframe script, except that there the data was a protein-protein interaction list in text form.
Here it is a ligand-protein chain interaction list in text form that will be converted to a dataframe.
Say an example from here links to the following as 'List of interactions' in the bottom right of the page:
http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/pdbsum/GetLigInt.pl?pdb=1wsv&ligtype=02&ligno=01
Then, based on the suggestion at the top here, that would be used in a curl command where the items after the ? in the original URL get placed into quotes and provided following the --data flag in the call to curl, like so:
curl -L -o data.txt --data "pdb=1wsv&ligtype=02&ligno=01" http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/pdbsum/GetLigInt.pl
Specifically, the --data "pdb=1wsv&ligtype=02&ligno=01" part comes from the end of the original URL.
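The same POST request can also be made from Python directly, without shelling out to curl. A minimal sketch using only the standard library (the requests package would work similarly); the fetch itself is left commented out since it needs network access:

```python
import urllib.parse
import urllib.request

url = "http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/pdbsum/GetLigInt.pl"
payload = {"pdb": "1wsv", "ligtype": "02", "ligno": "01"}

# urlencode yields the same string curl sends after the --data flag
encoded = urllib.parse.urlencode(payload)
print(encoded)  # pdb=1wsv&ligtype=02&ligno=01

# Uncomment to actually fetch (requires network access):
# with urllib.request.urlopen(url, data=encoded.encode()) as response:
#     with open("data.txt", "wb") as f:
#         f.write(response.read())
```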
Putting that into action in Jupyter to fetch the interactions list for the example as text:
!curl -L -o data.txt --data "pdb=1wsv&ligtype=02&ligno=01" http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/pdbsum/GetLigInt.pl
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  7082    0  7054  100    28  13697     54 --:--:-- --:--:-- --:--:-- 13751
To prove that the data file has been retrieved, we'll show the first 16 lines of it by running the next cell:
!head -16 data.txt
<PRE>
List of protein-ligand interactions
-----------------------------------
<P>
PDB code: 1wsv     Ligand THH
---------------------------
<P>
Hydrogen bonds
--------------

       <----- A T O M   1 ----->       <----- A T O M   2 ----->

       Atom Atom Res  Res              Atom Atom Res  Res
        no. name name no.  Chain        no. name name no.  Chain  Distance

  1.    670  N   LEU   88   A    -->  5701  OE2  THH 3001   A       3.08
Later in this series of notebooks, I'll demonstrate how to make this step even easier, needing just the PDB entry id and the chains you are interested in, and then how to loop over this process to get multiple data files for interactions from different structures.
To convert the data to a dataframe, we'll use a script.
If you haven't encountered Pandas dataframes before, I suggest you see the first two notebooks that come up when you launch a session from my blast-binder site. Those first two notebooks cover working with a dataframe of BLAST results.
To get that script, you can run the next cell. (It is not included in the repository this launches from, to ensure you always get the most current version, which is assumed to be the best available at the time.)
!curl -OL https://raw.githubusercontent.com/fomightez/structurework/master/pdbsum-utilities/pdbsum_ligand_interactions_list_to_df.py
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 24999  100 24999    0     0   173k      0 --:--:-- --:--:-- --:--:--  173k
We have the script now, and we already have a data file for it to process. To process the data file, run the next command, where we use Python to run the script and direct it at the results file, data.txt, that we made a few cells ago.
%run pdbsum_ligand_interactions_list_to_df.py data.txt
Provided interactions data read and converted to a dataframe...

A dataframe of the data has been saved as a file in a manner where other
Python programs can access it (pickled form).
RESULTING DATAFRAME is stored as ==> 'ligand_int_pickled_df.pkl'
As of writing this, the script we are using outputs a file that is a binary, compact form of the dataframe. (That means it is tiny and not human readable. It is called 'pickled'. Saving in that form may seem odd, but as illustrated below this is a very malleable form. And even more pertinent for dealing with data in Jupyter notebooks, there is actually an easier way to interact with this script in a Jupyter notebook that skips saving this intermediate file. So hang on through the longer, more traditional way of doing this before the easier way is introduced. I saved it in the compact form, and not the more typical tab-delimited form, because we mostly won't go this route and might as well make tiny files while working along to a better route. It is easy to convert back and forth using the pickled form, assuming you can match the Pandas/Python versions.)
We can take that file where the dataframe is pickled and bring it into active memory in this notebook with a command from the Pandas library. First, we have to import the Pandas library. Run the next cell to bring the dataframe into active memory. Note the file name comes from the name noted when we ran the script in the cell above.
import pandas as pd
df = pd.read_pickle("ligand_int_pickled_df.pkl")
When that last cell ran, you won't have noticed any output, but something happened. We can look at that dataframe by calling it in a cell.
df
Atom1 no. | Atom1 name | Atom1 Res name | Atom1 Res no. | Atom1 Chain | Atom2 no. | Atom2 name | Atom2 Res name | Atom2 Res no. | Atom2 Chain | Distance | type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 670 | N | LEU | 88 | A | 5701 | OE2 | THH | 3001 | A | 3.08 | Hydrogen bonds |
1 | 876 | O | VAL | 115 | A | 5721 | N8 | THH | 3001 | A | 2.82 | Hydrogen bonds |
2 | 1541 | OE1 | GLU | 204 | A | 5731 | NA2 | THH | 3001 | A | 2.45 | Hydrogen bonds |
3 | 1744 | NH2 | ARG | 233 | A | 5728 | O4 | THH | 3001 | A | 3.02 | Hydrogen bonds |
4 | 2801 | OH | TYR | 371 | A | 5707 | O2 | THH | 3001 | A | 2.78 | Hydrogen bonds |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
83 | 2800 | CZ | TYR | 371 | A | 5707 | O2 | THH | 3001 | A | 3.37 | Non-bonded contacts |
84 | 2801 | OH | TYR | 371 | A | 5704 | CA | THH | 3001 | A | 3.87 | Non-bonded contacts |
85 | 2801 | OH | TYR | 371 | A | 5705 | CT | THH | 3001 | A | 3.60 | Non-bonded contacts |
86 | 2801 | OH | TYR | 371 | A | 5707 | O2 | THH | 3001 | A | 2.78 | Non-bonded contacts |
87 | 2801 | OH | TYR | 371 | A | 5710 | O | THH | 3001 | A | 3.62 | Non-bonded contacts |
88 rows × 12 columns
You'll notice that if the list of data is large, the Jupyter environment displays just the head and tail to keep the output reasonable. There are ways to have Jupyter display it all, which we won't go into here.
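For the curious, one such way is Pandas' display options; `pd.option_context` lifts the row limit temporarily and restores the default afterwards. A quick sketch on a small stand-in dataframe, since the exact contents aren't the point here:

```python
import pandas as pd

# Small stand-in dataframe (the real one in this notebook has 88 rows)
df = pd.DataFrame({"Distance": [3.08, 2.82, 2.45],
                   "type": ["Hydrogen bonds"] * 3})

# Temporarily lift the row-display limit so no rows get elided;
# the default display setting is restored when the block exits
with pd.option_context("display.max_rows", None):
    print(df)
```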
Instead, we'll start to show some methods of dataframes that make them convenient. For example, you can use the head method to see the start, much as we used head on the command line above.
df.head()
Atom1 no. | Atom1 name | Atom1 Res name | Atom1 Res no. | Atom1 Chain | Atom2 no. | Atom2 name | Atom2 Res name | Atom2 Res no. | Atom2 Chain | Distance | type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 670 | N | LEU | 88 | A | 5701 | OE2 | THH | 3001 | A | 3.08 | Hydrogen bonds |
1 | 876 | O | VAL | 115 | A | 5721 | N8 | THH | 3001 | A | 2.82 | Hydrogen bonds |
2 | 1541 | OE1 | GLU | 204 | A | 5731 | NA2 | THH | 3001 | A | 2.45 | Hydrogen bonds |
3 | 1744 | NH2 | ARG | 233 | A | 5728 | O4 | THH | 3001 | A | 3.02 | Hydrogen bonds |
4 | 2801 | OH | TYR | 371 | A | 5707 | O2 | THH | 3001 | A | 2.78 | Hydrogen bonds |
Now what types of interactions are observed for this ligand?
To help answer that, we can group the results by the type column.
grouped = df.groupby('type')
for interaction_type, grouped_df in grouped:
    print(interaction_type)
    display(grouped_df)
Hydrogen bonds
Atom1 no. | Atom1 name | Atom1 Res name | Atom1 Res no. | Atom1 Chain | Atom2 no. | Atom2 name | Atom2 Res name | Atom2 Res no. | Atom2 Chain | Distance | type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 670 | N | LEU | 88 | A | 5701 | OE2 | THH | 3001 | A | 3.08 | Hydrogen bonds |
1 | 876 | O | VAL | 115 | A | 5721 | N8 | THH | 3001 | A | 2.82 | Hydrogen bonds |
2 | 1541 | OE1 | GLU | 204 | A | 5731 | NA2 | THH | 3001 | A | 2.45 | Hydrogen bonds |
3 | 1744 | NH2 | ARG | 233 | A | 5728 | O4 | THH | 3001 | A | 3.02 | Hydrogen bonds |
4 | 2801 | OH | TYR | 371 | A | 5707 | O2 | THH | 3001 | A | 2.78 | Hydrogen bonds |
Non-bonded contacts
Atom1 no. | Atom1 name | Atom1 Res name | Atom1 Res no. | Atom1 Chain | Atom2 no. | Atom2 name | Atom2 Res name | Atom2 Res no. | Atom2 Chain | Distance | type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 433 | SD | MET | 56 | A | 5724 | C4A | THH | 3001 | A | 3.83 | Non-bonded contacts |
6 | 434 | CE | MET | 56 | A | 5727 | C4 | THH | 3001 | A | 3.64 | Non-bonded contacts |
7 | 434 | CE | MET | 56 | A | 5729 | N3 | THH | 3001 | A | 3.44 | Non-bonded contacts |
8 | 434 | CE | MET | 56 | A | 5730 | C2 | THH | 3001 | A | 3.64 | Non-bonded contacts |
9 | 664 | CA | THR | 87 | A | 5701 | OE2 | THH | 3001 | A | 3.79 | Non-bonded contacts |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
83 | 2800 | CZ | TYR | 371 | A | 5707 | O2 | THH | 3001 | A | 3.37 | Non-bonded contacts |
84 | 2801 | OH | TYR | 371 | A | 5704 | CA | THH | 3001 | A | 3.87 | Non-bonded contacts |
85 | 2801 | OH | TYR | 371 | A | 5705 | CT | THH | 3001 | A | 3.60 | Non-bonded contacts |
86 | 2801 | OH | TYR | 371 | A | 5707 | O2 | THH | 3001 | A | 2.78 | Non-bonded contacts |
87 | 2801 | OH | TYR | 371 | A | 5710 | O | THH | 3001 | A | 3.62 | Non-bonded contacts |
83 rows × 12 columns
Same data as earlier, but we can clearly see we have hydrogen bonds and non-bonded contacts, and we immediately get a sense of which types of interactions are more abundant.
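To put actual numbers on that abundance, the `value_counts` method tallies the rows per type. Sketched here on a mock `type` column mirroring the counts seen above; on the real dataframe you'd call `df["type"].value_counts()` directly:

```python
import pandas as pd

# Mock 'type' column mirroring the 5 hydrogen bonds and 83 non-bonded
# contacts seen in the grouped output above
types = pd.Series(["Hydrogen bonds"] * 5 + ["Non-bonded contacts"] * 83,
                  name="type")

# Tally how many interactions of each type were recorded,
# sorted most-abundant first
counts = types.value_counts()
print(counts)
```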
What if we wanted to know the residues in chain A that interact with the ligand?
We can easily make a list of the residues in the 'Atom1 Res no.' column. Many will be repeated because that list comes from all the atoms of each residue. To limit it to showing a residue number only once, whether it has a single interaction or dozens of interactions with the ligand, we can use Python's set conversion of a list to limit it to the unique residues. The code in the next cell does that:
the_list = df["Atom1 Res no."].tolist()
residues_interacting_with_ligand = set(the_list)
residues_interacting_with_ligand
{56, 87, 88, 101, 102, 103, 115, 117, 176, 177, 196, 197, 204, 233, 242, 262, 371}
From that list, we can see what residues of chain A interact with the ligand.
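If you'd also like the residue names alongside the numbers, `drop_duplicates` on the two relevant columns pairs them up. A sketch on a mock dataframe using the same column names as the real one:

```python
import pandas as pd

# Mock of the two relevant columns; the real dataframe uses these same names
df = pd.DataFrame({"Atom1 Res name": ["LEU", "VAL", "VAL", "TYR"],
                   "Atom1 Res no.":  [88, 115, 115, 371]})

# One row per residue, ordered by residue number
unique_residues = (df[["Atom1 Res name", "Atom1 Res no."]]
                   .drop_duplicates()
                   .sort_values("Atom1 Res no."))
print(unique_residues)
```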
What if we wanted to know which residues are involved in hydrogen bonds? Then we can subset to the rows where the type column matches 'Hydrogen bonds' and look at those residues.
the_hbond_list = df[df["type"]=="Hydrogen bonds"]["Atom1 Res no."].tolist()
residues_interacting_with_ligand_via_hbond = set(the_hbond_list)
residues_interacting_with_ligand_via_hbond
{88, 115, 204, 233, 371}
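A related question, which residues contact the ligand but never through a hydrogen bond, falls out of a set difference between the two sets we just built (the values below are copied from the outputs above):

```python
# The two sets produced in the cells above
residues_interacting_with_ligand = {56, 87, 88, 101, 102, 103, 115, 117,
                                    176, 177, 196, 197, 204, 233, 242,
                                    262, 371}
residues_interacting_with_ligand_via_hbond = {88, 115, 204, 233, 371}

# Residues whose only contacts with the ligand are non-hydrogen-bond ones
non_hbond_residues = (residues_interacting_with_ligand
                      - residues_interacting_with_ligand_via_hbond)
print(sorted(non_hbond_residues))
```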
You may want to get a sense of what else you can do by examining the first two notebooks that come up when you launch a session from my blast-binder site. Those first two notebooks cover working with a dataframe of BLAST results.
Shortly, we'll cover how to bring the dataframe we just made into the notebook without dealing with a file intermediate; however, next I'll demonstrate how to save it as text for use elsewhere, such as in Excel.
I've tried to sell you on the power of the Python/Pandas dataframe, but it isn't for all uses or everyone. However, most everyone is accustomed to dealing with text-based tables or even Excel. In fact, a text-based table, perhaps tab- or comma-delimited, would be the better way to archive the data we are generating here. Python/Pandas makes it easy to go from the dataframe form to these tabular forms. You can even go back later from the table to the dataframe, which may be important if you are moving between different versions of Python/Pandas, as I briefly mentioned parenthetically above.
First, generating a text-based table.
#Save / write a TSV-formatted (tab-separated values/ tab-delimited) file
df.to_csv('pdbsum_data.tsv', sep='\t',index = False) #add `,header=False` to leave off header, too
Because df.to_csv()
defaults to dealing with csv, you can simply use df.to_csv('example.csv',index = False)
for comma-delimited (comma-separated) files.
You can see that worked by looking at the first few lines with the next command. (Feel free to make the number higher or delete it altogether; I restricted it to just a few lines to keep the output small.)
!head -5 pdbsum_data.tsv
Atom1 no.	Atom1 name	Atom1 Res name	Atom1 Res no.	Atom1 Chain	Atom2 no.	Atom2 name	Atom2 Res name	Atom2 Res no.	Atom2 Chain	Distance	type
670	N	LEU	88	A	5701	OE2	THH	3001	A	3.08	Hydrogen bonds
876	O	VAL	115	A	5721	N8	THH	3001	A	2.82	Hydrogen bonds
1541	OE1	GLU	204	A	5731	NA2	THH	3001	A	2.45	Hydrogen bonds
1744	NH2	ARG	233	A	5728	O4	THH	3001	A	3.02	Hydrogen bonds
If you need to go back from a tab-separated table to a dataframe, you can run something like the following cell.
reverted_df = pd.read_csv('pdbsum_data.tsv', sep='\t')
reverted_df.to_pickle('reverted_df.pkl') # OPTIONAL: pickle that data too
For a comma-delimited (CSV) file you'd use df = pd.read_csv('example.csv'), because the pd.read_csv() function defaults to comma as the separator (the sep parameter).
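If you want reassurance that nothing is lost going through text, you can round-trip a dataframe and compare with `equals`. A sketch using an in-memory buffer in place of a file on disk:

```python
import io
import pandas as pd

# Small dataframe standing in for the one made in this notebook
df = pd.DataFrame({"Atom1 Res no.": [88, 115], "Distance": [3.08, 2.82]})

# Write as tab-separated text, then read it back
buffer = io.StringIO()
df.to_csv(buffer, sep="\t", index=False)
buffer.seek(0)
reverted = pd.read_csv(buffer, sep="\t")

# Values and inferred dtypes both survive the text round trip
print(reverted.equals(df))
```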
You can verify the dataframe read in from the text-based table by viewing it with the next line.
reverted_df.head()
Atom1 no. | Atom1 name | Atom1 Res name | Atom1 Res no. | Atom1 Chain | Atom2 no. | Atom2 name | Atom2 Res name | Atom2 Res no. | Atom2 Chain | Distance | type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 670 | N | LEU | 88 | A | 5701 | OE2 | THH | 3001 | A | 3.08 | Hydrogen bonds |
1 | 876 | O | VAL | 115 | A | 5721 | N8 | THH | 3001 | A | 2.82 | Hydrogen bonds |
2 | 1541 | OE1 | GLU | 204 | A | 5731 | NA2 | THH | 3001 | A | 2.45 | Hydrogen bonds |
3 | 1744 | NH2 | ARG | 233 | A | 5728 | O4 | THH | 3001 | A | 3.02 | Hydrogen bonds |
4 | 2801 | OH | TYR | 371 | A | 5707 | O2 | THH | 3001 | A | 2.78 | Hydrogen bonds |
Generating an Excel spreadsheet from a dataframe.
Because this is a specialized need, a special module is required that I didn't bother installing by default, and so it needs to be installed before generating the Excel file. Running the next cell will handle both.
%pip install openpyxl
# save to excel (KEEPS multiINDEX, and makes sparse to look good in Excel straight out of Python)
df.to_excel('pdbsum_data.xlsx') # after openpyxl installed
Requirement already satisfied: openpyxl in /srv/conda/envs/notebook/lib/python3.7/site-packages (3.0.9)
Requirement already satisfied: et-xmlfile in /srv/conda/envs/notebook/lib/python3.7/site-packages (from openpyxl) (1.1.0)
Note: you may need to restart the kernel to use updated packages.
You'll need to download the file first to your computer and then view it locally as there is no viewer in the Jupyter environment.
Additionally, it is possible to add styles to dataframes, and styles such as shading of cells and coloring of text will be carried over to the Excel document made as well.
Excel files can be read into Pandas dataframes directly without needing to go through a text-based intermediate first.
# read Excel
df_from_excel = pd.read_excel('pdbsum_data.xlsx',engine='openpyxl') # see https://stackoverflow.com/a/65266270/8508004 where notes xlrd no longer supports xlsx
That can be viewed to convince yourself it worked by running the next command.
df_from_excel.head()
Unnamed: 0 | Atom1 no. | Atom1 name | Atom1 Res name | Atom1 Res no. | Atom1 Chain | Atom2 no. | Atom2 name | Atom2 Res name | Atom2 Res no. | Atom2 Chain | Distance | type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 670 | N | LEU | 88 | A | 5701 | OE2 | THH | 3001 | A | 3.08 | Hydrogen bonds |
1 | 1 | 876 | O | VAL | 115 | A | 5721 | N8 | THH | 3001 | A | 2.82 | Hydrogen bonds |
2 | 2 | 1541 | OE1 | GLU | 204 | A | 5731 | NA2 | THH | 3001 | A | 2.45 | Hydrogen bonds |
3 | 3 | 1744 | NH2 | ARG | 233 | A | 5728 | O4 | THH | 3001 | A | 3.02 | Hydrogen bonds |
4 | 4 | 2801 | OH | TYR | 371 | A | 5707 | O2 | THH | 3001 | A | 2.78 | Hydrogen bonds |
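Note the extra `Unnamed: 0` column above; it's the dataframe index that `to_excel` saved. Passing `index_col=0` to `pd.read_excel` avoids it, or you can drop the column after reading, demonstrated here on a mock dataframe:

```python
import pandas as pd

# When reading the real file, this avoids the extra column
# (assumes openpyxl is installed, as above):
# df_from_excel = pd.read_excel('pdbsum_data.xlsx', engine='openpyxl',
#                               index_col=0)

# Or drop it after the fact, shown on a mock dataframe
mock = pd.DataFrame({"Unnamed: 0": [0, 1], "Atom1 no.": [670, 876]})
cleaned = mock.drop(columns=["Unnamed: 0"])
print(cleaned.columns.tolist())  # ['Atom1 no.']
```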
Next, we'll cover how to bring the dataframe we just made into the notebook without dealing with a file intermediate.
First we'll check for the script we'll use and get it if we don't already have it.
(The thinking is once you know what you are doing you may have skipped all the steps above and not have the script you'll need yet. It cannot hurt to check and if it isn't present, bring it here.)
# Get a file if not yet retrieved / check if file exists
import os
file_needed = "pdbsum_ligand_interactions_list_to_df.py"
if not os.path.isfile(file_needed):
!curl -OL https://raw.githubusercontent.com/fomightez/structurework/master/pdbsum-utilities/{file_needed}
This is going to rely on approaches very similar to those illustrated here and here.
We obtained the pdbsum_ligand_interactions_list_to_df.py
script in the preparation steps above. However, instead of using it as an external script as we did earlier in this notebook, we want to use the core function of that script within this notebook for the options that involve no pickled-object file intermediate. Similar to the way we imported a lot of other useful modules in the first notebook and in a cell above, you can run the next cell to bring into this notebook's computational environment the main function associated with the pdbsum_ligand_interactions_list_to_df.py
script, aptly named pdbsum_ligand_interactions_list_to_df
. (As written below, the command to do that looks a bit redundant; however, the from part of the command actually references the pdbsum_ligand_interactions_list_to_df.py
script. It doesn't need the .py
extension because the import machinery only deals with such files.)
from pdbsum_ligand_interactions_list_to_df import pdbsum_ligand_interactions_list_to_df
We can demonstrate that worked by calling the function.
pdbsum_ligand_interactions_list_to_df()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_87/1310711332.py in <module>
----> 1 pdbsum_ligand_interactions_list_to_df()

TypeError: pdbsum_ligand_interactions_list_to_df() missing 1 required positional argument: 'data_file'
If the module had not been imported, you'd see ModuleNotFoundError: No module named 'pdbsum_ligand_interactions_list_to_df'
; instead, you should see it complain that it is missing the data_file
argument to act on, because you passed it nothing.
After importing the main function of that script into this running notebook, you are ready to demonstrate the approach that doesn't require a file intermediate. The imported pdbsum_ligand_interactions_list_to_df
function is used within the computational environment of the notebook, and the dataframe produced is assigned to a variable in the running notebook. In the end, the results are in an active dataframe in the notebook without needing to read the pickled dataframe. Bear in mind, though, that the pickled dataframe still gets made, and it is good to download and keep it since you'll find it convenient for getting back into an analysis without rerunning earlier steps.
direct_df = pdbsum_ligand_interactions_list_to_df("data.txt")
direct_df.head()
Provided interactions data read and converted to a dataframe...

A dataframe of the data has been saved as a file in a manner where other
Python programs can access it (pickled form).
RESULTING DATAFRAME is stored as ==> 'ligand_int_pickled_df.pkl'

Returning a dataframe with the information as well.
Atom1 no. | Atom1 name | Atom1 Res name | Atom1 Res no. | Atom1 Chain | Atom2 no. | Atom2 name | Atom2 Res name | Atom2 Res no. | Atom2 Chain | Distance | type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 670 | N | LEU | 88 | A | 5701 | OE2 | THH | 3001 | A | 3.08 | Hydrogen bonds |
1 | 876 | O | VAL | 115 | A | 5721 | N8 | THH | 3001 | A | 2.82 | Hydrogen bonds |
2 | 1541 | OE1 | GLU | 204 | A | 5731 | NA2 | THH | 3001 | A | 2.45 | Hydrogen bonds |
3 | 1744 | NH2 | ARG | 233 | A | 5728 | O4 | THH | 3001 | A | 3.02 | Hydrogen bonds |
4 | 2801 | OH | TYR | 371 | A | 5707 | O2 | THH | 3001 | A | 2.78 | Hydrogen bonds |
This may be how you prefer to use the script. Either option exists.
Continue on with other notebooks in the series if you wish.