If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!
Some tips:
Step through running the cells below. Then substitute in your PDB entry identifiers of interest.
%run PDBrenum.py -rfla 1d5t 1bxw 2vl3 5e6h -PDB
Downloading PDB files: 100%|██████████| 4/4 [00:01<00:00, 2.07it/s]
Downloading SIFTS files: 100%|██████████| 4/4 [00:00<00:00, 12.80it/s]
Renumbering PDB files: 100%|██████████| 4/4 [00:03<00:00, 1.22it/s]
That's it. Really.
Below, this notebook demonstrates that the command worked and fills in some details about running the script here, where to find the output, options for running it elsewhere, etc. But mostly that is it, as you'll see.
There are some other options that are handy. If instead you wanted the converted results in the mmCIF
format you'd use the following command here:
%run PDBrenum.py -rfla 1d5t 1bxw 2vl3 5e6h -mmCIF
Or simply leave off any reference to a format, because the script defaults to mmCIF
format if no type is indicated when calling it. -mmCIF_assembly
and -PDB_assembly
are also valid types.
Note the %run
part is a Jupyter magic command for properly running a script in a Jupyter environment. If you were running the first demonstration command in a terminal, you'd use the following:
python PDBrenum.py -rfla 1d5t 1bxw 2vl3 5e6h -PDB
Depending on your system and how you installed Python, you may need to replace python
with python3
.
The -rfla
flag in the call to the script above stands for --renumber_from_list_of_arguments
to indicate we are providing the PDB entry identifiers as part of the command. Use of a text file to provide the PDB ids will be demonstrated below.
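If you would rather assemble the same call from Python, for example to loop over several batches of entries, the sketch below builds the argument list. It is only an illustration: the helper name build_renum_command is made up for this example, the format names are the ones listed in the script's own --help output, and it assumes PDBrenum.py sits in the current working directory.

```python
import subprocess
import sys

# Format flags listed in the script's own --help output.
VALID_FORMATS = {"PDB", "mmCIF", "PDB_assembly", "mmCIF_assembly", "all"}


def build_renum_command(pdb_ids, fmt="mmCIF", script="PDBrenum.py"):
    """Return the argument list for renumbering the given PDB entries.

    Hypothetical helper for illustration; it is not part of PDBrenum.
    """
    if fmt not in VALID_FORMATS:
        raise ValueError(f"unknown format {fmt!r}; expected one of {sorted(VALID_FORMATS)}")
    if not pdb_ids:
        raise ValueError("at least one PDB entry identifier is required")
    # sys.executable is the interpreter running this code, which sidesteps
    # the python vs. python3 question.
    return [sys.executable, script, "-rfla", *pdb_ids, f"-{fmt}"]


command = build_renum_command(["1d5t", "1bxw", "2vl3", "5e6h"], fmt="PDB")
print(" ".join(command[1:]))
# To actually run it (this downloads files, so it is left commented out):
# subprocess.run(command, check=True)
```

Passing the arguments as a list rather than one string avoids shell quoting issues when the identifiers come from user input.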
If you ever need the full list of options and flags, just call the script with the help
flag, like below, to print out the full usage details:
%run PDBrenum.py --help
Running the below cell will do that.
%run PDBrenum.py --help
PDB.py optional arguments:

-h, --help
show this help message and exit

-rftf text_file_with_PDB.txt, --renumber_from_text_file text_file_with_PDB.txt
This option will download and renumber specified files
usage $ python3 PDB.py -rftf text_file_with_PDB_in_it.txt -mmCIF
usage $ python3 PDB.py -rftf text_file_with_PDB_in_it.txt -PDB
usage $ python3 PDB.py -rftf text_file_with_PDB_in_it.txt -mmCIF_assembly
usage $ python3 PDB.py -rftf text_file_with_PDB_in_it.txt -PDB_assembly
usage $ python3 PDB.py -rftf text_file_with_PDB_in_it.txt -all

-rfla [6dbp 3v03 2jit ...], --renumber_from_list_of_arguments [6dbp 3v03 2jit ...]
This option will download and renumber specified files
usage $ python3 PDB.py -rfla 6dbp 3v03 2jit -mmCIF
usage $ python3 PDB.py -rfla 6dbp 3v03 2jit -PDB
usage $ python3 PDB.py -rfla 6dbp 3v03 2jit -mmCIF_assembly
usage $ python3 PDB.py -rfla 6dbp 3v03 2jit -PDB_assembly
usage $ python3 PDB.py -rfla 6dbp 3v03 2jit -all

-dftf text_file_with_PDB.txt, --download_from_text_file text_file_with_PDB.txt
This option will read given input file parse by space or tab or comma or new line and download it example
usage $ python3 PDB.py -dftf text_file_with_PDB_in_it.txt -mmCIF
usage $ python3 PDB.py -dftf text_file_with_PDB_in_it.txt -PDB
usage $ python3 PDB.py -dftf text_file_with_PDB_in_it.txt -mmCIF_assembly
usage $ python3 PDB.py -dftf text_file_with_PDB_in_it.txt -PDB_assembly
usage $ python3 PDB.py -dftf text_file_with_PDB_in_it.txt -all

-dfla [6dbp 3v03 2jit ...], --download_from_list_of_arguments 6dbp 3v03 2jit ...]
This option will read given list of arguments separated by space.
Format of the list should be without any commas or quotation marks
usage $ python3 PDB.py -dfla 6dbp 3v03 2jit -mmCIF
usage $ python3 PDB.py -dfla 6dbp 3v03 2jit -PDB
usage $ python3 PDB.py -dfla 6dbp 3v03 2jit -mmCIF_assembly
usage $ python3 PDB.py -dfla 6dbp 3v03 2jit -PDB_assembly
usage $ python3 PDB.py -dfla 6dbp 3v03 2jit -all

-redb, --renumber_entire_database
This option will download and renumber entire PDB database in PDB or/and mmCIF format
usage $ python3 PDB.py -redb -mmCIF
usage $ python3 PDB.py -redb -PDB
usage $ python3 PDB.py -redb -mmCIF_assembly
usage $ python3 PDB.py -redb -PDB_assembly
usage $ python3 PDB.py -redb -all

-dall, --download_entire_database
This option will download entire mmCIF database
usage $ python3 PDB.py -dall -mmCIF
usage $ python3 PDB.py -dall -PDB
usage $ python3 PDB.py -dall -mmCIF_assembly
usage $ python3 PDB.py -dall -PDB_assembly
usage $ python3 PDB.py -dall -all

-refr, --refresh_entire_database
This option will delete outdated files and download fresh ones.
This option makes sense and only works if you work with entire database
usage $ python3 PDB.py -refr -mmCIF
usage $ python3 PDB.py -refr -PDB
usage $ python3 PDB.py -refr -mmCIF_assembly
usage $ python3 PDB.py -refr -PDB_assembly
usage $ python3 PDB.py -refr -all

-PDB, --PDB_format_only
This option will specify working format to pdb format

-mmCIF, --mmCIF_format_only
This option will specify working format to mmCIF format (default)

-PDB_assembly, --PDB_assembly_format_only
This option will specify working format to pdb format

-mmCIF_assembly, --mmCIF_assembly_format_only
This option will specify working format to mmCIF format

-all, --all_formats
This option will work with both formats

argpar.add_argument("-sipm", "--set_default_input_path_to_mmCIF", type=str, help=argparse.SUPPRESS)
argpar.add_argument("-sipma", "--set_default_input_path_to_mmCIF_assembly", type=str, help=argparse.SUPPRESS)
argpar.add_argument("-sipp", "--set_default_input_path_to_PDB", type=str, help=argparse.SUPPRESS)
argpar.add_argument("-sippa", "--set_default_input_path_to_PDB_assembly", type=str, help=argparse.SUPPRESS)
argpar.add_argument("-sips", "--set_default_input_path_to_SIFTS", type=str, help=argparse.SUPPRESS)
argpar.add_argument("-sopm", "--set_default_output_path_to_mmCIF", type=str, help=argparse.SUPPRESS)
argpar.add_argument("-sopma", "--set_default_output_path_to_mmCIF_assembly", type=str, help=argparse.SUPPRESS)
argpar.add_argument("-sopp", "--set_default_output_path_to_PDB", type=str, help=argparse.SUPPRESS)
argpar.add_argument("-soppa", "--set_default_output_path_to_PDB_assembly", type=str, help=argparse.SUPPRESS)

-sipm, --set_default_input_path_to_mmCIF
This option will set default input path to mmCIF files (default: <./mmCIF>)
usage $ python3 PDB.py -sipm /Users/bulatfaezov/PycharmProjects/renum/venv/mmCIF

-sipp, --set_default_input_path_to_PDB
This option will set default input path to PDB files (default: <./PDB>)
usage $ python3 PDB.py -sipp /Users/bulatfaezov/PycharmProjects/renum/venv/PDB

-sipma, --set_default_input_path_to_mmCIF_assembly
This option will set default input path to mmCIF_assembly files (default: <./mmCIF_assembly>)
usage $ python3 PDB.py -sipm /Users/bulatfaezov/PycharmProjects/renum/venv/mmCIF_assembly

-sippa, --set_default_input_path_to_PDB_assembly
This option will set default input path to PDB_assembly files (default: <./PDB_assembly>)
usage $ python3 PDB.py -sipp /Users/bulatfaezov/PycharmProjects/renum/venv/PDB_assembly

-sips, --set_default_input_path_to_SIFTS
This option will set default input path to SIFTS files (default: <./SIFTS>)
usage $ python3 PDB.py -sips /Users/bulatfaezov/PycharmProjects/renum/venv/SIFTS

-sopm, --set_default_output_path_to_mmCIF
This option will set default output path to mmCIF files (default: <./output_mmCIF>)
usage $ python3 PDB.py -sopm /Users/bulatfaezov/PycharmProjects/renum/venv/output_mmCIF

-sopp, --set_default_output_path_to_PDB
This option will set default output path to PDB files (default: <./output_PDB>)
usage $ python3 PDB.py -sopp /Users/bulatfaezov/PycharmProjects/renum/venv/output_PDB

-sopma, --set_default_output_path_to_mmCIF_assembly
This option will set default output path to mmCIF_assembly files (default: <./output_mmCIF_assembly>)
usage $ python3 PDB.py -sopm /Users/bulatfaezov/PycharmProjects/renum/venv/output_mmCIF_assembly

-soppa, --set_default_output_path_to_PDB_assembly
This option will set default output path to PDB_assembly files (default: <./output_PDB_assembly>)
usage $ python3 PDB.py -sopp /Users/bulatfaezov/PycharmProjects/renum/venv/output_PDB_assembly

-sdmn, --set_default_mmCIF_num
This option will set default mmCIF number which will be added to 1 to end numbering in cases when there are no UniProt numbering (default: 50000)
usage $ python3 PDB.py -rfla 6dbp 3v03 2jit -mmCIF -sdmn 50000

-sdpn, --set_default_PDB_num
This option will set default PDB number which will be added to 1 to end numbering in cases when there are no UniProt numbering (default: 5000)
usage $ python3 PDB.py -rfla 6dbp 3v03 2jit -mmCIF -sdpn 5000

"-offz", "--set_to_off_mode_gzip"
By default program will compress files with gzip this option will turn that off (default: gzip is on)
usage $ python3 PDB.py -rfla 6dbp 3v03 2jit -mmCIF -offz

"-nproc", "--set_number_of_processes"
By default program will use all available CPUs. User can reduce number of CPUs for PDBrenum. In this example: only 4 CPUs will be used by the PDBrenum even if more CPUs available (default: nproc = None)
usage $ python3 PDB.py -rfla 6dbp 3v03 2jit -mmCIF -nproc 4

Roland Dunbrack's Lab Fox Chase Cancer Center Philadelphia, PA 2020
Let's demonstrate that the %run PDBrenum.py -rfla 1d5t 1bxw 2vl3 5e6h -PDB
command we ran first worked.
When the script runs, it creates a directory for the data it obtains from the PDB. Because the demo command indicated we wanted the legacy PDB format, the script created a directory called PDB
as it ran and saved the PDB files there.
We can see that in a few steps. First, run the following to list the contents of the working directory:
ls
binder/ LICENSE output_PDB/ PDBrenum.py* src/ demo.ipynb log_corrected.txt PDB/ README.md* input.txt* log_translator.txt PDBrenum.ipynb* SIFTS/
(Note that the listing shows log_corrected.txt
present in the directory along with this demo.ipynb
notebook. That file holds useful information and is discussed below in the section 'There's some good information that PDBrenum exposes as part of its process'.)
We see the PDB
directory and we can check the contents of that with the following command:
ls PDB
pdb1bxw.ent.gz pdb1d5t.ent.gz pdb2vl3.ent.gz pdb5e6h.ent.gz
Those files are compressed in the gzip format; however, using the unix zcat
command to uncompress in combination with the unix command head
to grab the start of a text file and display, we can view the start of one of them by running the following command:
!zcat PDB/pdb1bxw.ent.gz|head
HEADER MEMBRANE PROTEIN 03-OCT-98 1BXW
TITLE OUTER MEMBRANE PROTEIN A (OMPA) TRANSMEMBRANE DOMAIN
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: PROTEIN (OUTER MEMBRANE PROTEIN A);
COMPND 3 CHAIN: A;
COMPND 4 FRAGMENT: TRANSMEMBRANE DOMAIN;
COMPND 5 ENGINEERED: YES;
COMPND 6 MUTATION: YES
SOURCE MOL_ID: 1;
SOURCE 2 ORGANISM_SCIENTIFIC: ESCHERICHIA COLI BL21(DE3);
gzip: stdout: Broken pipe
The exclamation point at the beginning tells Jupyter that this is a Unix command to run in the shell. The ls
command we used above is so commonly used that Jupyter has been told to recognize it without needing the exclamation point.
Don't mind the gzip: stdout: Broken pipe
at the end; zcat expects to write out the entire file, so it reports a 'Broken pipe' notice when head stops reading after the first few lines and the rest of the file has nowhere to go. (Also, if you ever see the gzip
line somewhere other than the very end of the output, just run the cell again and it will probably move to the end where it should show.) The point is you can read the PDB file.
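If the 'Broken pipe' notice bothers you, the same peek at the file can be done in Python, which simply stops reading after the requested number of lines. This is a minimal sketch using only the standard library; the helper name head_gz is made up for this example.

```python
import gzip
from itertools import islice


def head_gz(path, n=10):
    """Return the first n lines of a gzip-compressed text file."""
    # "rt" opens the compressed file in text mode so lines come back as str.
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]


# Example use in this notebook, once the demo command has been run:
# for line in head_gz("PDB/pdb1bxw.ent.gz"):
#     print(line)
```

Because nothing is piped between processes, there is no pipe to break.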
So that is the initial input. What did the script do?
The output from the PDBrenum.py
script gets saved over in output_PDB/
because the PDB format was specified when calling the script. Using commands similar to when viewing the initial PDB files, the output can be viewed like so:
ls output_PDB
1bxw_renum.pdb.gz 1d5t_renum.pdb.gz 2vl3_renum.pdb.gz 5e6h_renum.pdb.gz
Alright, we can see the renumbered version of the file we looked at earlier is 1bxw_renum.pdb.gz
.
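The two listings suggest a simple naming pattern: downloads appear as PDB/pdb<id>.ent.gz and renumbered files as output_PDB/<id>_renum.pdb.gz. The sketch below checks for both files for each entry. The pattern is inferred only from the listings in this demo (run with -PDB and default paths), and the helper name demo_paths is made up for this example.

```python
from pathlib import Path


def demo_paths(pdb_id, workdir="."):
    """Return (downloaded, renumbered) paths for one entry run with -PDB.

    Assumes the default directories and the lowercase file names seen in
    this demo's listings; other formats or custom paths will differ.
    """
    pdb_id = pdb_id.lower()
    base = Path(workdir)
    downloaded = base / "PDB" / f"pdb{pdb_id}.ent.gz"
    renumbered = base / "output_PDB" / f"{pdb_id}_renum.pdb.gz"
    return downloaded, renumbered


for entry in ["1d5t", "1bxw", "2vl3", "5e6h"]:
    original, renumbered = demo_paths(entry)
    # Prints True/True for an entry once it has been downloaded and renumbered.
    print(entry, original.exists(), renumbered.exists())
```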
But how do we see the difference?
We can add the Unix tail
command to the end of the 'pipe' that carries the output of our earlier zcat
and head
combination to show part of the middle of the original and renumbered files.
First let's display a section of the original by running the command below:
!zcat PDB/pdb1bxw.ent.gz|head -n 510|tail
gzip: stdout: Broken pipe
ATOM 11 C ALA A 1 46.036 12.651 40.029 1.00 51.14 C
ATOM 12 O ALA A 1 47.195 12.259 40.003 1.00 53.33 O
ATOM 13 CB ALA A 1 44.229 11.697 41.473 1.00 53.91 C
ATOM 14 N PRO A 2 45.736 13.936 40.024 1.00 49.63 N
ATOM 15 CA PRO A 2 46.822 14.919 40.021 1.00 51.58 C
ATOM 16 C PRO A 2 47.754 14.618 41.197 1.00 56.34 C
ATOM 17 O PRO A 2 47.328 14.035 42.194 1.00 54.38 O
ATOM 18 CB PRO A 2 46.081 16.238 40.142 1.00 45.83 C
ATOM 19 CG PRO A 2 44.708 15.943 39.588 1.00 44.84 C
ATOM 20 CD PRO A 2 44.381 14.536 40.054 1.00 41.42 C
Now display the renumbered version by running the command below.
(The renumbered version gets an extra 14 lines in the header, which is why 510
is used in the command above and 524
in the command below.)
!zcat output_PDB/1bxw_renum.pdb.gz|head -n 524|tail
ATOM 11 C ALA A 22 46.036 12.651 40.029 1.00 51.14 C
ATOM 12 O ALA A 22 47.195 12.259 40.003 1.00 53.33 O
ATOM 13 CB ALA A 22 44.229 11.697 41.473 1.00 53.91 C
ATOM 14 N PRO A 23 45.736 13.936 40.024 1.00 49.63 N
ATOM 15 CA PRO A 23 46.822 14.919 40.021 1.00 51.58 C
ATOM 16 C PRO A 23 47.754 14.618 41.197 1.00 56.34 C
ATOM 17 O PRO A 23 47.328 14.035 42.194 1.00 54.38 O
ATOM 18 CB PRO A 23 46.081 16.238 40.142 1.00 45.83 C
ATOM 19 CG PRO A 23 44.708 15.943 39.588 1.00 44.84 C
ATOM 20 CD PRO A 23 44.381 14.536 40.054 1.00 41.42 C
gzip: stdout: Broken pipe
Comparing the results of the two commands shows that what the original PDB has as residues #1
and #2
correspond to residues #22
and #23
in the UniProt numbering.
By viewing the corresponding UniProt entry (shown below for convenience), we can convince ourselves of the validity of this renumbering:
This sample above shows that the numbering has been corrected in 1bxw_renum.pdb.gz
and the similarly processed PDB entries.
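Rather than comparing excerpts by eye, the shift in numbering can be tallied over every ATOM record. The sketch below is one way to do that, not something PDBrenum itself provides: the function names are made up for this example, it relies on the fixed columns of the legacy PDB format (chain ID in column 22, residue number in columns 23-26), and it assumes both files list the same atoms in the same order, as the excerpts above suggest.

```python
from collections import Counter


def residue_keys(lines):
    """Yield (chain ID, residue number) for every ATOM record."""
    for line in lines:
        if line.startswith("ATOM"):
            # Legacy PDB format: column 22 is the chain ID and
            # columns 23-26 hold the residue sequence number.
            yield line[21], int(line[22:26])


def numbering_offsets(original_lines, renumbered_lines):
    """Count (chain, renumbered - original) offsets over paired ATOM records.

    Assumes both inputs list the same atoms in the same order; this is
    not checked here.
    """
    offsets = Counter()
    pairs = zip(residue_keys(original_lines), residue_keys(renumbered_lines))
    for (chain, old), (_, new) in pairs:
        offsets[(chain, new - old)] += 1
    return offsets


# Example use in this notebook, once the demo command has been run:
# import gzip
# with gzip.open("PDB/pdb1bxw.ent.gz", "rt") as before, \
#         gzip.open("output_PDB/1bxw_renum.pdb.gz", "rt") as after:
#     print(numbering_offsets(before, after))
```

For the residues shown above (1 becoming 22, 2 becoming 23), the offset is 21. A single offset per chain means a uniform shift, while several offsets would point to insertions or gaps relative to UniProt.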
Above we showed how we can see the results listed from within this notebook and even display contents; however, if anything useful is created, you'll want to get those files out of the output
directories and download them to your local computer. Jupyter has a file navigator accessible from the dashboard that allows you to download files from this session to your local machine. Click on the Jupyter icon in the upper left side above this notebook, next to 'demo'. That will take you to the Jupyter Dashboard. You should see the directory output_PDB
listed there. Click on the word output_PDB
and you should go into it where you can click the checkbox next to a file name and get a 'Download' button up at the top. Click 'Download' to initiate downloading the file to your local machine.
The files that get used in running PDBrenum.py
are compressed with gzip. At any point you can uncompress them with the gunzip
Unix command. For example, to uncompress the above example output, use:
!gunzip output_PDB/1bxw_renum.pdb.gz
After that you can view the file directly as text by either navigating to it in the file navigator and clicking on it to open it in the Jupyter Dashboard, or running the command below to view the first few lines of it directly:
!head output_PDB/1bxw_renum.pdb
Substitute cat
in place of head
to display the entire file in this notebook.
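Note that gunzip replaces the compressed file with the uncompressed one. If you would like to keep the .gz file as well, a short Python sketch like the following writes an uncompressed copy alongside it; the helper name gunzip_copy is made up for this example.

```python
import gzip
import shutil


def gunzip_copy(src, dst):
    """Write an uncompressed copy of src to dst, leaving src in place."""
    with gzip.open(src, "rb") as compressed, open(dst, "wb") as plain:
        # copyfileobj streams the data in chunks, so large files are fine.
        shutil.copyfileobj(compressed, plain)


# Example use in this notebook, as long as the .gz file is still present:
# gunzip_copy("output_PDB/1bxw_renum.pdb.gz", "output_PDB/1bxw_renum.pdb")
```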
It's important to point out that in the process PDBrenum exposes some information that can be useful in other contexts. Luckily, it makes that information available in an easy-to-access form.
One file generated during the process, log_corrected.txt,
contains some useful information, such as a mapping of chain IDs for each PDB file to UniProt accession identifiers. It sits in the same directory where the output_PDB/
and PDB/
directories get generated, as illustrated above where ls
was run in the section 'Locating results and showing it worked' after PDBrenum was first run.
cat log_corrected.txt
SP  PDB_id  chain_PDB  chain_auth  UniProt  SwissProt    uni_len  chain_len  renum  5k_or_50k
+   5e6h    A          A           P29375   KDM5A_HUMAN  294      294        0      0
+   1bxw    A          A           P0A910   OMPA_ECOLI   171      172        171    1
+   2vl3    A          A           P30044   PRDX5_HUMAN  161      162        161    1
+   2vl3    B          B           P30044   PRDX5_HUMAN  161      161        161    0
+   2vl3    C          C           P30044   PRDX5_HUMAN  161      161        161    0
+   1d5t    A          A           P21856   GDIA_BOVIN   431      433        0      2
Demonstrations of various ways of taking advantage of this information to map chains in PDB files to UniProt IDs are found in a companion notebook, Demo of using PDBrenum to perform mapping of chain IDs in PDB files to UniProt IDs. That was originally suggested as an option to address this Biostars question: Mapping PDB ID + chain ID to UniProt ID. PDBrenum provides the necessary information, via the SIFTS database, parsed out as a side product of its efforts, and the information is in an easy-to-mine, fixed-width, text-based data table.
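As a taste of what mining that table can look like, here is a minimal sketch that turns the log into a lookup from (PDB id, chain) to UniProt accession. It assumes the columns are separated by whitespace and that every row fills every column, as in the output shown above; rows that lack a value would need extra handling. The function name parse_log_corrected is made up for this example, and the sample rows are copied from the output above.

```python
def parse_log_corrected(text):
    """Map (PDB id, author chain ID) to UniProt accession.

    Assumes whitespace-separated columns and the header row seen in this
    demo's log_corrected.txt.
    """
    rows = [line.split() for line in text.splitlines() if line.strip()]
    header, records = rows[0], rows[1:]
    mapping = {}
    for fields in records:
        row = dict(zip(header, fields))
        mapping[(row["PDB_id"], row["chain_auth"])] = row["UniProt"]
    return mapping


sample = """\
SP PDB_id chain_PDB chain_auth UniProt SwissProt uni_len chain_len renum 5k_or_50k
+ 1bxw A A P0A910 OMPA_ECOLI 171 172 171 1
+ 2vl3 B B P30044 PRDX5_HUMAN 161 161 161 0
"""
print(parse_log_corrected(sample))

# With the real file in the working directory:
# with open("log_corrected.txt") as handle:
#     chain_to_uniprot = parse_log_corrected(handle.read())
```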
You may have a lot of PDB entries that you want to process. The script allows for listing them in a separate text file with each id separated by a space and then indicating that file when calling the script. Such a file is included along with the script as input.txt
. Let's examine the contents of that:
!head input.txt
2aa3 4zah 2aa2 2af2 2aac 2aaa 2asd
We can point the script at that file by using the -rftf
flag this time:
%run PDBrenum.py -rftf input.txt -PDB
Downloading PDB files: 100%|██████████| 7/7 [00:01<00:00, 3.98it/s]
Downloading SIFTS files: 100%|██████████| 7/7 [00:32<00:00, 4.63s/it]
Renumbering PDB files: 100%|██████████| 7/7 [04:23<00:00, 37.65s/it]
When that is finished, we can run the following cell to see that output_PDB/
contains additional files that correspond to the contents of input.txt
.
ls output_PDB
1bxw_renum.pdb.gz 2aa3_renum.pdb.gz 2af2_renum.pdb.gz 4zah_renum.pdb.gz 1d5t_renum.pdb.gz 2aaa_renum.pdb.gz 2asd_renum.pdb.gz 5e6h_renum.pdb.gz 2aa2_renum.pdb.gz 2aac_renum.pdb.gz 2vl3_renum.pdb.gz
Using the means we used to analyze the contents of 1bxw_renum.pdb.gz
above, you could convince yourself those have been processed.
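When the list of entries comes from a spreadsheet or another tool, it can help to tidy it before handing it to the script. The sketch below splits text on spaces, tabs, commas, or newlines, checks that each token looks like a four-character PDB id, drops duplicates, and leaves you with a list you can write out space-separated like input.txt. The helper name read_pdb_ids and the output file name are made up for this example, and lowercasing the ids is an assumption based on the lowercase ids used throughout this demo.

```python
import re

# Four-character PDB entry identifiers: a digit followed by three
# letters or digits.
PDB_ID_PATTERN = re.compile(r"^[0-9][A-Za-z0-9]{3}$")


def read_pdb_ids(text):
    """Split text on spaces, tabs, commas, or newlines and validate each id."""
    tokens = [token for token in re.split(r"[\s,]+", text) if token]
    bad = [token for token in tokens if not PDB_ID_PATTERN.match(token)]
    if bad:
        raise ValueError(f"not 4-character PDB ids: {bad}")
    # dict.fromkeys drops duplicates while preserving the original order.
    return list(dict.fromkeys(token.lower() for token in tokens))


ids = read_pdb_ids("2aa3 4zah, 2aa2\n2AA3")
print(" ".join(ids))

# To save a space-separated file for use with the -rftf flag
# (my_entries.txt is a hypothetical file name):
# with open("my_entries.txt", "w") as handle:
#     handle.write(" ".join(ids))
```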