
Demo of PDBrenum in your browser via MyBinder.org

If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!

Some tips:

Code cells have boxes around them.
To run a code cell, click on the cell and then either click the run button on the toolbar above or hit Shift+Enter. The Shift+Enter combo will also move you to the next cell, so it's a quick way to work through the notebook. Selecting Cell > Run All from the menu above the toolbar is a shortcut for running all the cells in the notebook.
While a cell is running, a * appears in the square brackets next to the cell. Once the cell has finished running, the asterisk will be replaced with a number.
In most cases you'll want to start from the top of the notebook and work your way down, running each cell in turn. Later cells might depend on the results of earlier ones.
To edit a code cell, just click on it and type stuff. Remember to run the cell once you've finished editing.

Step through running the cells below. Then substitute in your PDB entry identifiers of interest.

In [1]:

%run PDBrenum.py -rfla 1d5t 1bxw 2vl3 5e6h -PDB

Downloading PDB files: 100%|██████████| 4/4 [00:01<00:00,  2.07it/s]
Downloading SIFTS files: 100%|██████████| 4/4 [00:00<00:00, 12.80it/s]
Renumbering PDB files: 100%|██████████| 4/4 [00:03<00:00,  1.22it/s]

That's it. Really.
Below, this notebook will demonstrate that it worked and fill in some information about running the script here, where to find the output, options for running it elsewhere, and so on. But mostly that is it, as you'll see.

There are some other options that are handy. If instead you wanted the converted results in the mmCIF format, you'd use the following command here:

%run PDBrenum.py -rfla 1d5t 1bxw 2vl3 5e6h -mmCIF

Or simply leave off any reference to format, because the script defaults to mmCIF format if no type is indicated when calling it. -mmCIF_assembly, -PDB_assembly, and -all are also valid types.

Note the %run part is Jupyter magic for properly running a script in a Jupyter environment. If you were running the first demonstration command in a terminal, you'd use the following:

python PDBrenum.py -rfla 1d5t 1bxw 2vl3 5e6h -PDB

Depending on your system and how you installed Python, you may need to replace python with python3.

The -rfla flag in the call to the script above stands for --renumber_from_list_of_arguments and indicates that we are providing the PDB entry identifiers as part of the command. Use of a text file to provide the PDB IDs will be demonstrated below.
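If you would rather assemble that call from Python, for example to loop over batches of identifiers, the following is a minimal sketch. The helper name build_pdbrenum_command is hypothetical and not part of PDBrenum; the flags are the ones shown in this notebook, and the check that an identifier is four alphanumeric characters is an assumption made for illustration.

```python
import sys

# Format flags shown in this notebook and in the script's help output.
VALID_FORMATS = {"-PDB", "-mmCIF", "-PDB_assembly", "-mmCIF_assembly", "-all"}


def build_pdbrenum_command(pdb_ids, fmt="-mmCIF", script="PDBrenum.py"):
    """Return an argument list equivalent to the terminal command above.

    Hypothetical helper for illustration; it only assembles the command.
    """
    if fmt not in VALID_FORMATS:
        raise ValueError(f"unknown format flag: {fmt}")
    ids = [pdb_id.strip() for pdb_id in pdb_ids]
    for pdb_id in ids:
        # Assumption: PDB entry identifiers are four alphanumeric characters.
        if len(pdb_id) != 4 or not pdb_id.isalnum():
            raise ValueError(f"does not look like a PDB id: {pdb_id!r}")
    return [sys.executable, script, "-rfla", *ids, fmt]


command = build_pdbrenum_command(["1d5t", "1bxw", "2vl3", "5e6h"], fmt="-PDB")
print(" ".join(command[1:]))  # the script name followed by its arguments
```

The resulting list can be handed to subprocess.run(command, check=True) from the directory that holds PDBrenum.py. Building a list rather than a single string avoids any shell quoting issues.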

If you ever need the full list of options and flags, just call the script with the help flag, like below, to print out the full usage details:

%run PDBrenum.py --help

Running the cell below will do that.

In [2]:

%run PDBrenum.py --help

PDB.py
optional arguments:
-h, --help            show this help message and exit

-rftf text_file_with_PDB.txt, --renumber_from_text_file text_file_with_PDB.txt
This option will download and renumber specified files
usage $ python3 PDB.py -rftf text_file_with_PDB_in_it.txt -mmCIF
usage $ python3 PDB.py -rftf text_file_with_PDB_in_it.txt -PDB
usage $ python3 PDB.py -rftf text_file_with_PDB_in_it.txt -mmCIF_assembly
usage $ python3 PDB.py -rftf text_file_with_PDB_in_it.txt -PDB_assembly
usage $ python3 PDB.py -rftf text_file_with_PDB_in_it.txt -all

-rfla [6dbp 3v03 2jit ...], --renumber_from_list_of_arguments [6dbp 3v03 2jit ...]
This option will download and renumber specified files
usage $ python3 PDB.py -rfla 6dbp 3v03 2jit -mmCIF
usage $ python3 PDB.py -rfla 6dbp 3v03 2jit -PDB
usage $ python3 PDB.py -rfla 6dbp 3v03 2jit -mmCIF_assembly
usage $ python3 PDB.py -rfla 6dbp 3v03 2jit -PDB_assembly
usage $ python3 PDB.py -rfla 6dbp 3v03 2jit -all

-dftf text_file_with_PDB.txt, --download_from_text_file text_file_with_PDB.txt
This option will read given input file parse by space
or tab or comma or new line and download it example 
usage $ python3 PDB.py -dftf text_file_with_PDB_in_it.txt -mmCIF
usage $ python3 PDB.py -dftf text_file_with_PDB_in_it.txt -PDB
usage $ python3 PDB.py -dftf text_file_with_PDB_in_it.txt -mmCIF_assembly
usage $ python3 PDB.py -dftf text_file_with_PDB_in_it.txt -PDB_assembly
usage $ python3 PDB.py -dftf text_file_with_PDB_in_it.txt -all

-dfla [6dbp 3v03 2jit ...], --download_from_list_of_arguments 6dbp 3v03 2jit ...]
This option will read given list of arguments separated by space. 
Format of the list should be without any commas or quotation marks
usage $ python3 PDB.py -dfla 6dbp 3v03 2jit -mmCIF
usage $ python3 PDB.py -dfla 6dbp 3v03 2jit -PDB
usage $ python3 PDB.py -dfla 6dbp 3v03 2jit -mmCIF_assembly
usage $ python3 PDB.py -dfla 6dbp 3v03 2jit -PDB_assembly
usage $ python3 PDB.py -dfla 6dbp 3v03 2jit -all

-redb, --renumber_entire_database
This option will download and renumber entire PDB database in PDB or/and mmCIF format
usage $ python3 PDB.py -redb -mmCIF
usage $ python3 PDB.py -redb -PDB
usage $ python3 PDB.py -redb -mmCIF_assembly
usage $ python3 PDB.py -redb -PDB_assembly
usage $ python3 PDB.py -redb -all 

-dall, --download_entire_database
This option will download entire mmCIF database
usage $ python3 PDB.py -dall -mmCIF
usage $ python3 PDB.py -dall -PDB
usage $ python3 PDB.py -dall -mmCIF_assembly
usage $ python3 PDB.py -dall -PDB_assembly
usage $ python3 PDB.py -dall -all

-refr, --refresh_entire_database
This option will delete outdated files and download
fresh ones. This option makes sense and only works if
you work with entire database
usage $ python3 PDB.py -refr -mmCIF
usage $ python3 PDB.py -refr -PDB
usage $ python3 PDB.py -refr -mmCIF_assembly
usage $ python3 PDB.py -refr -PDB_assembly
usage $ python3 PDB.py -refr -all    

-PDB, --PDB_format_only
This option will specify working format to pdb format
-mmCIF, --mmCIF_format_only
This option will specify working format to mmCIF format (default)
-PDB_assembly, --PDB_assembly_format_only
This option will specify working format to pdb format
-mmCIF_assembly, --mmCIF_assembly_format_only
This option will specify working format to mmCIF format
-all, --all_formats   This option will work with both formats

argpar.add_argument("-sipm", "--set_default_input_path_to_mmCIF", type=str, help=argparse.SUPPRESS)
argpar.add_argument("-sipma", "--set_default_input_path_to_mmCIF_assembly", type=str, help=argparse.SUPPRESS)
argpar.add_argument("-sipp", "--set_default_input_path_to_PDB", type=str, help=argparse.SUPPRESS)
argpar.add_argument("-sippa", "--set_default_input_path_to_PDB_assembly", type=str, help=argparse.SUPPRESS)
argpar.add_argument("-sips", "--set_default_input_path_to_SIFTS", type=str, help=argparse.SUPPRESS)
argpar.add_argument("-sopm", "--set_default_output_path_to_mmCIF", type=str, help=argparse.SUPPRESS)
argpar.add_argument("-sopma", "--set_default_output_path_to_mmCIF_assembly", type=str, help=argparse.SUPPRESS)
argpar.add_argument("-sopp", "--set_default_output_path_to_PDB", type=str, help=argparse.SUPPRESS)
argpar.add_argument("-soppa", "--set_default_output_path_to_PDB_assembly", type=str, help=argparse.SUPPRESS)

-sipm, --set_default_input_path_to_mmCIF
This option will set default input path to mmCIF files (default: <./mmCIF>)
usage $ python3 PDB.py -sipm /Users/bulatfaezov/PycharmProjects/renum/venv/mmCIF
-sipp, --set_default_input_path_to_PDB
This option will set default input path to PDB files (default: <./PDB>)
usage $ python3 PDB.py -sipp /Users/bulatfaezov/PycharmProjects/renum/venv/PDB
-sipma, --set_default_input_path_to_mmCIF_assembly
This option will set default input path to mmCIF_assembly files (default: <./mmCIF_assembly>)
usage $ python3 PDB.py -sipm /Users/bulatfaezov/PycharmProjects/renum/venv/mmCIF_assembly
-sippa, --set_default_input_path_to_PDB_assembly
This option will set default input path to PDB_assembly files (default: <./PDB_assembly>)
usage $ python3 PDB.py -sipp /Users/bulatfaezov/PycharmProjects/renum/venv/PDB_assembly
-sips, --set_default_input_path_to_SIFTS
This option will set default input path to SIFTS files (default: <./SIFTS>)
usage $ python3 PDB.py -sips /Users/bulatfaezov/PycharmProjects/renum/venv/SIFTS
-sopm, --set_default_output_path_to_mmCIF
This option will set default output path to mmCIF files (default: <./output_mmCIF>)
usage $ python3 PDB.py -sopm /Users/bulatfaezov/PycharmProjects/renum/venv/output_mmCIF
-sopp, --set_default_output_path_to_PDB
This option will set default output path to PDB files (default: <./output_PDB>)
usage $ python3 PDB.py -sopp /Users/bulatfaezov/PycharmProjects/renum/venv/output_PDB
-sopma, --set_default_output_path_to_mmCIF_assembly
This option will set default output path to mmCIF_assembly files (default: <./output_mmCIF_assembly>)
usage $ python3 PDB.py -sopm /Users/bulatfaezov/PycharmProjects/renum/venv/output_mmCIF_assembly
-soppa, --set_default_output_path_to_PDB_assembly
This option will set default output path to PDB_assembly files (default: <./output_PDB_assembly>)
usage $ python3 PDB.py -sopp /Users/bulatfaezov/PycharmProjects/renum/venv/output_PDB_assembly

-sdmn, --set_default_mmCIF_num
This option will set default mmCIF number which will be added to 1 to end numbering in cases 
when there are no UniProt numbering (default: 50000)
usage $ python3 PDB.py -rfla 6dbp 3v03 2jit -mmCIF -sdmn 50000

-sdpn, --set_default_PDB_num
This option will set default PDB number which will be added to 1 to end numbering in cases 
when there are no UniProt numbering (default: 5000)
usage $ python3 PDB.py -rfla 6dbp 3v03 2jit -mmCIF -sdpn 5000

"-offz", "--set_to_off_mode_gzip"
By default program will compress files with gzip this option will turn that off
(default: gzip is on)
usage $ python3 PDB.py -rfla 6dbp 3v03 2jit -mmCIF -offz

"-nproc", "--set_number_of_processes"
By default program will use all available CPUs. User can reduce number of CPUs for PDBrenum.
In this example: only 4 CPUs will be used by the PDBrenum even if more CPUs available
(default: nproc = None)
usage $ python3 PDB.py -rfla 6dbp 3v03 2jit -mmCIF -nproc 4


Roland Dunbrack's Lab
Fox Chase Cancer Center
Philadelphia, PA
2020

Locating results and showing it worked

Let's demonstrate that the %run PDBrenum.py -rfla 1d5t 1bxw 2vl3 5e6h -PDB command we ran first worked.

When the script runs, it creates a directory for the data it obtains from the PDB. Because the demo command indicated we wanted the legacy PDB format, the script created a directory called PDB as it ran and saved the PDB files there.

We can see that in a few steps, first by running the following to list the contents of the working directory:

In [3]:

ls

binder/     LICENSE             output_PDB/      PDBrenum.py*  src/
demo.ipynb  log_corrected.txt   PDB/             README.md*
input.txt*  log_translator.txt  PDBrenum.ipynb*  SIFTS/

(Note that the listing of files and directories shows log_corrected.txt present in the directory along with this demo.ipynb notebook. That file holds useful information and will be discussed below in the section 'There's some good information that PDBrenum exposes as part of its process'.)

We see the PDB directory and we can check the contents of that with the following command:

In [4]:

ls PDB

pdb1bxw.ent.gz  pdb1d5t.ent.gz  pdb2vl3.ent.gz  pdb5e6h.ent.gz

Those files are compressed in the gzip format; however, by combining the Unix zcat command, which uncompresses, with the Unix head command, which grabs the start of a text file and displays it, we can view the start of one of them by running the following command:

In [5]:

!zcat PDB/pdb1bxw.ent.gz|head

HEADER    MEMBRANE PROTEIN                        03-OCT-98   1BXW              
TITLE     OUTER MEMBRANE PROTEIN A (OMPA) TRANSMEMBRANE DOMAIN                  
COMPND    MOL_ID: 1;                                                            
COMPND   2 MOLECULE: PROTEIN (OUTER MEMBRANE PROTEIN A);                        
COMPND   3 CHAIN: A;                                                            
COMPND   4 FRAGMENT: TRANSMEMBRANE DOMAIN;                                      
COMPND   5 ENGINEERED: YES;                                                     
COMPND   6 MUTATION: YES                                                        
SOURCE    MOL_ID: 1;                                                            
SOURCE   2 ORGANISM_SCIENTIFIC: ESCHERICHIA COLI BL21(DE3);                     

gzip: stdout: Broken pipe

The exclamation point at the beginning tells Jupyter that this is a Unix command to run in the shell. The ls command we used above is so commonly used that Jupyter has been told to recognize it without needing the exclamation point.

Don't mind the gzip: stdout: Broken pipe at the end; zcat is meant to write out an entire file, so it reports 'Broken pipe' when head stops reading after the first ten lines and zcat cannot write the rest of the file to its destination. (Also, if you ever see the gzip line somewhere other than the very end of the output, just run the cell again and it will probably move to the end where it should show.) The point is you can read the PDB file.
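If the 'Broken pipe' notice is distracting, the same peek can be done from a Python cell with the standard gzip module, which stops reading quietly once it has the lines it needs. This is a minimal sketch; head_gz is a hypothetical helper name, not something PDBrenum provides.

```python
import gzip
from itertools import islice
from pathlib import Path


def head_gz(path, n=10):
    """Return the first n lines of a gzip-compressed text file."""
    # "rt" opens the archive in text mode, so lines come back as str.
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]


# Only attempt the demo if the file from the run above is present.
example = Path("PDB") / "pdb1bxw.ent.gz"
if example.exists():
    for line in head_gz(example):
        print(line)
```

Because islice stops after n lines, only the start of the file is decompressed.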

So that is the initial input. What did the script do?

The output from the PDBrenum.py script gets saved in output_PDB/ because the PDB format was specified when calling the script. Using commands similar to those used to view the initial PDB files, the output can be viewed like so:

In [6]:

ls output_PDB

1bxw_renum.pdb.gz  1d5t_renum.pdb.gz  2vl3_renum.pdb.gz  5e6h_renum.pdb.gz

Alright, we can see the renumbered version of the file we looked at earlier is 1bxw_renum.pdb.gz.
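The two listings suggest a naming pattern: pdb<id>.ent.gz for the downloaded file and <id>_renum.pdb.gz for the renumbered one. Here is a small sketch that turns an identifier into both paths and reports whether the renumbered file is present. The pattern is inferred from the listings in this notebook rather than from PDBrenum's source, and expected_paths is a hypothetical helper.

```python
from pathlib import Path


def expected_paths(pdb_id, input_dir="PDB", output_dir="output_PDB"):
    """Return (downloaded file, renumbered file) paths for one entry.

    Assumption: file names follow the pattern seen in the listings above.
    """
    pdb_id = pdb_id.lower()
    original = Path(input_dir) / f"pdb{pdb_id}.ent.gz"
    renumbered = Path(output_dir) / f"{pdb_id}_renum.pdb.gz"
    return original, renumbered


for entry in ["1d5t", "1bxw", "2vl3", "5e6h"]:
    original, renumbered = expected_paths(entry)
    status = "found" if renumbered.exists() else "missing"
    print(f"{entry}: {renumbered} ({status})")
```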

But how do we see the difference?

We can add the Unix tail command to the 'pipe', feeding it the output of our earlier zcat and head combination, to show part of the middle of the original and renumbered files.

First let's display a section of the original by running the command below:

In [7]:

!zcat PDB/pdb1bxw.ent.gz|head -n 510|tail

gzip: stdout: Broken pipe
ATOM     11  C   ALA A   1      46.036  12.651  40.029  1.00 51.14           C  
ATOM     12  O   ALA A   1      47.195  12.259  40.003  1.00 53.33           O  
ATOM     13  CB  ALA A   1      44.229  11.697  41.473  1.00 53.91           C  
ATOM     14  N   PRO A   2      45.736  13.936  40.024  1.00 49.63           N  
ATOM     15  CA  PRO A   2      46.822  14.919  40.021  1.00 51.58           C  
ATOM     16  C   PRO A   2      47.754  14.618  41.197  1.00 56.34           C  
ATOM     17  O   PRO A   2      47.328  14.035  42.194  1.00 54.38           O  
ATOM     18  CB  PRO A   2      46.081  16.238  40.142  1.00 45.83           C  
ATOM     19  CG  PRO A   2      44.708  15.943  39.588  1.00 44.84           C  
ATOM     20  CD  PRO A   2      44.381  14.536  40.054  1.00 41.42           C

Now display the renumbered version by running the command below.
(The renumbered version gets an extra 14 lines in the header, which is why 510 is used in the command above and 524 in the command below.)

In [8]:

!zcat output_PDB/1bxw_renum.pdb.gz|head -n 524|tail

ATOM     11  C   ALA A  22      46.036  12.651  40.029  1.00 51.14           C  
ATOM     12  O   ALA A  22      47.195  12.259  40.003  1.00 53.33           O  
ATOM     13  CB  ALA A  22      44.229  11.697  41.473  1.00 53.91           C  
ATOM     14  N   PRO A  23      45.736  13.936  40.024  1.00 49.63           N  
ATOM     15  CA  PRO A  23      46.822  14.919  40.021  1.00 51.58           C  
ATOM     16  C   PRO A  23      47.754  14.618  41.197  1.00 56.34           C  
ATOM     17  O   PRO A  23      47.328  14.035  42.194  1.00 54.38           O  
ATOM     18  CB  PRO A  23      46.081  16.238  40.142  1.00 45.83           C  
ATOM     19  CG  PRO A  23      44.708  15.943  39.588  1.00 44.84           C  
ATOM     20  CD  PRO A  23      44.381  14.536  40.054  1.00 41.42           C  

gzip: stdout: Broken pipe

Comparing the results of the two commands shows that what the original PDB has as residues #1 and #2 correspond to residues #22 and #23 in the UniProt numbering.
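The same comparison can be made programmatically by reading the residue number out of the fixed columns of each ATOM record. The sketch below uses two lines copied from each of the outputs above and checks that only the residue number changed. residue_keys is a hypothetical helper, and the constant shift of 21 applies to this excerpt only; renumbering in general need not be a single offset.

```python
def residue_keys(lines):
    """Yield (chain, residue name, residue number) for each ATOM record."""
    for line in lines:
        if line.startswith("ATOM"):
            # Fixed columns of the legacy PDB format (1-based): residue name
            # 18-20, chain ID 22, residue sequence number 23-26.
            yield line[21], line[17:20], int(line[22:26])


original = [
    "ATOM     13  CB  ALA A   1      44.229  11.697  41.473  1.00 53.91           C  ",
    "ATOM     14  N   PRO A   2      45.736  13.936  40.024  1.00 49.63           N  ",
]
renumbered = [
    "ATOM     13  CB  ALA A  22      44.229  11.697  41.473  1.00 53.91           C  ",
    "ATOM     14  N   PRO A  23      45.736  13.936  40.024  1.00 49.63           N  ",
]

offsets = set()
for (chain_a, name_a, num_a), (chain_b, name_b, num_b) in zip(
    residue_keys(original), residue_keys(renumbered)
):
    # The renumbering should change only the residue number.
    assert (chain_a, name_a) == (chain_b, name_b)
    offsets.add(num_b - num_a)

print(offsets)  # {21}
```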

By viewing the corresponding UniProt entry (shown below for convenience), we can convince ourselves of the validity of this renumbering:

This sample above shows that the numbering has been corrected in 1bxw_renum.pdb.gz and the similarly processed PDB entries.

Locating the output for download

Above we showed how we can see the results listed from within this notebook and even display contents; however, if anything useful is created, you'll want to get those files out of the output directories and download them to your local computer. Jupyter has a file navigator accessible from the dashboard that allows you to download files from this session to your local machine. Click on the Jupyter icon in the upper left side above this notebook, next to 'demo'. That will take you to the Jupyter Dashboard. You should see the directory output_PDB listed there. Click on the name output_PDB to go into it, where you can click the checkbox next to a file name to get a 'Download' button at the top. Click 'Download' to initiate downloading the file to your local machine.
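When there are many output files, it may be more convenient to bundle them into one archive and download that single file from the dashboard. The following is a minimal sketch using Python's standard shutil module; bundle_directory and the archive name output_PDB_bundle are hypothetical choices, not anything PDBrenum creates.

```python
import shutil
from pathlib import Path


def bundle_directory(directory, archive_stem):
    """Zip every file under `directory` and return the archive's path."""
    # make_archive appends ".zip" to archive_stem and returns the full name.
    return Path(shutil.make_archive(str(archive_stem), "zip", root_dir=str(directory)))


# Only attempt the demo if the output directory from the run above exists.
if Path("output_PDB").is_dir():
    print(bundle_directory("output_PDB", "output_PDB_bundle"))
```

The archive is written to the working directory, next to output_PDB/, so it shows up in the same dashboard listing.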

Dealing with compression

The files that PDBrenum.py downloads and produces are compressed with gzip. At any point you can uncompress them with the gunzip Unix command. For example, to uncompress the above example output, use:

!gunzip output_PDB/1bxw_renum.pdb.gz

After that you can view the file directly as text by either navigating to it in the file navigator and clicking on it to open it in the Jupyter Dashboard, or running the command below to view the first few lines of it directly:

!head output_PDB/1bxw_renum.pdb

Substitute cat in place of head to display the entire file in this notebook.
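The gunzip command replaces the compressed file with the uncompressed one. If you would rather keep both, the following sketch does the same job with Python's standard gzip module and leaves the .gz file in place. gunzip_keep is a hypothetical helper name.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_keep(src):
    """Uncompress src (ending in .gz) next to it and return the new path.

    Unlike the gunzip command, this leaves the compressed file in place.
    """
    src = Path(src)
    if src.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got {src}")
    dst = src.with_suffix("")  # drops only the trailing ".gz"
    with gzip.open(src, "rb") as compressed, open(dst, "wb") as plain:
        shutil.copyfileobj(compressed, plain)
    return dst


# Only attempt the demo if the file from the run above is present.
example = Path("output_PDB") / "1bxw_renum.pdb.gz"
if example.exists():
    print(gunzip_keep(example))
```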

There's some good information that PDBrenum exposes as part of its process

It's important to point out that in the process PDBrenum exposes some information that can be useful in other contexts. Luckily, it makes that information available in an easy-to-access form.
The file log_corrected.txt, generated during the process, contains some useful information, such as the mapping of chain IDs for each PDB file to UniProt accession identifiers. It sits in the same directory where the output_PDB/ and PDB/ directories get generated, as illustrated above where ls was run in the section 'Locating results and showing it worked' after PDBrenum was first run.

In [9]:

cat log_corrected.txt

SP PDB_id chain_PDB   chain_auth  UniProt             SwissProt              uni_len chain_len     renum 5k_or_50k
+  5e6h   A           A           P29375              KDM5A_HUMAN                294       294         0         0
+  1bxw   A           A           P0A910              OMPA_ECOLI                 171       172       171         1
+  2vl3   A           A           P30044              PRDX5_HUMAN                161       162       161         1
+  2vl3   B           B           P30044              PRDX5_HUMAN                161       161       161         0
+  2vl3   C           C           P30044              PRDX5_HUMAN                161       161       161         0
+  1d5t   A           A           P21856              GDIA_BOVIN                 431       433         0         2

Demonstrations of various ways of taking advantage of this information to map chains in PDB files to UniProt IDs are found in a companion notebook, Demo of using PDBrenum to perform mapping of chain IDs in PDB files to UniProt IDs. That was originally suggested as an option to address this Biostars question: Mapping PDB ID + chain ID to UniProt ID. PDBrenum provides the necessary information, via the SIFTS database, parsed out as a side product of its efforts, and the information is in an easy-to-mine, fixed-width, text-based data table.
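As a quick illustration of how easy the table is to mine, here is a minimal sketch that turns rows like those shown above into a lookup from PDB ID and chain to UniProt accession. It assumes every field is filled in and contains no spaces, so that splitting on whitespace is safe, and it keys on the chain_auth column with one row per chain; parse_log is a hypothetical helper, and the sample text is copied from the output above.

```python
def parse_log(text):
    """Map (PDB_id, chain_auth) to the UniProt accession in that row."""
    rows = text.strip().splitlines()
    header = rows[0].split()
    mapping = {}
    for row in rows[1:]:
        # Assumption: no empty fields, so whitespace splitting lines up
        # each value with its column name.
        fields = dict(zip(header, row.split()))
        mapping[(fields["PDB_id"], fields["chain_auth"])] = fields["UniProt"]
    return mapping


sample = """\
SP PDB_id chain_PDB   chain_auth  UniProt             SwissProt              uni_len chain_len     renum 5k_or_50k
+  5e6h   A           A           P29375              KDM5A_HUMAN                294       294         0         0
+  1bxw   A           A           P0A910              OMPA_ECOLI                 171       172       171         1
+  2vl3   B           B           P30044              PRDX5_HUMAN                161       161       161         0
"""

chain_to_uniprot = parse_log(sample)
print(chain_to_uniprot[("1bxw", "A")])  # P0A910
```

To use the real file, pass the result of open("log_corrected.txt").read() to parse_log instead of the sample text.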

Using a list of PDB entry identifiers

You may have a lot of PDB entries that you want to process. The script allows for listing them in a separate text file with each id separated by a space and then indicating that file when calling the script. Such a file is included along with the script as input.txt. Let's examine the contents of that:

In [9]:

!head input.txt

2aa3 4zah 2aa2 2af2 2aac 2aaa 2asd

We can point the script at it when calling it, this time using the -rftf flag, like so:

In [10]:

%run PDBrenum.py -rftf input.txt -PDB

Downloading PDB files: 100%|██████████| 7/7 [00:01<00:00,  3.98it/s]
Downloading SIFTS files: 100%|██████████| 7/7 [00:32<00:00,  4.63s/it]
Renumbering PDB files: 100%|██████████| 7/7 [04:23<00:00, 37.65s/it]

When that is finished, we can run the following cell to see that output_PDB/ contains additional files that correspond to the contents of input.txt.

In [11]:

ls output_PDB

1bxw_renum.pdb.gz  2aa3_renum.pdb.gz  2af2_renum.pdb.gz  4zah_renum.pdb.gz
1d5t_renum.pdb.gz  2aaa_renum.pdb.gz  2asd_renum.pdb.gz  5e6h_renum.pdb.gz
2aa2_renum.pdb.gz  2aac_renum.pdb.gz  2vl3_renum.pdb.gz

Using the means we used to analyze the contents of 1bxw_renum.pdb.gz above, you could convince yourself those have been processed.
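To prepare a list file like input.txt for your own entries, a few lines of Python are enough. The sketch below writes identifiers separated by spaces, matching the layout of input.txt shown above, and reads them back while tolerating spaces, tabs, commas, and newlines, the separators the help text lists for the -dftf option. The helper names and the file name my_entries.txt are hypothetical, and whether -rftf accepts every one of those separators is not confirmed here.

```python
import re
import tempfile
from pathlib import Path


def write_id_file(pdb_ids, path):
    """Write identifiers to `path` on one line, separated by spaces."""
    Path(path).write_text(" ".join(pdb_ids) + "\n")


def read_id_file(path):
    """Read identifiers separated by spaces, tabs, commas, or newlines."""
    tokens = re.split(r"[\s,]+", Path(path).read_text())
    return [token for token in tokens if token]


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    list_file = Path(tmp) / "my_entries.txt"
    write_id_file(["2aa3", "4zah", "2aa2"], list_file)
    round_trip = read_id_file(list_file)

print(round_trip)  # ['2aa3', '4zah', '2aa2']
```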