蛋白质是大型生物分子,它由一个或多个由α-氨基酸残基组成的长链条组成。α-氨基酸分子呈线性排列,相邻α-氨基酸残基的羧基和氨基通过肽键连接在一起。 蛋白质的分子结构可划分为四级,以描述其不同的方面:
在整个第一章中,你即将会学习到以下的一些信息:
因此如果要在计算机中建立一个蛋白质的结构模型,就清楚地描述每一个原子的信息。在Rosetta中,Pose是管理蛋白质信息的中心,可以描述蛋白质一到四级结构所有的信息。而且这些信息是分层管理的比如:
分层式管理使得Pose的信息修改和更新变得容易。
除此以外,还有些外部对象如PDBinfo也负责转换和储存Pose与PDB之间的信息变换。
以下是一个Pose中的架构的示意图:
它们之间的关系可以从图上清晰表述: 最底层的是AtomType/ResidueType, 分别定义了原子类型和残基的基本拓扑结构,包括其中的原子成键方式,原子的内坐标等信息。这些信息定义了一个标准的氨基酸残基(Residue)应该包括什么,长什么样子等。氨基酸残基与残基通过树(AtomTree)的方式定义如何与其他残基进行连接,最后得到一个蛋白分子的构象(Conformation)。当然Pose中还有其他的信息如处理能量(Energies)的模块、处理物理约束的模块(ConstraintSet)以及处理中间缓存信息的模块(DataCache)。
Rosetta兼容最常规的两种记录结构格式:PDB和Silent文件:
Rosetta为PDB的结构信息读入提供了非常丰富的接口,此处我们介绍主要4种结构信息读入相关的函数:
一般而言,Rosetta输入的结构信息大多来源于PDB结构文件,经过处理后生成对应的Pose。以下将逐一介绍如何使用这些外部文件来生成Pose对象。
# 初始化(必须执行的步骤):
from pyrosetta import init
init()
PyRosetta-4 2021 [Rosetta PyRosetta4.conda.mac.cxx11thread.serialization.python37.Release 2021.31+release.c7009b3115c22daa9efe2805d9d1ebba08426a54 2021-08-07T10:04:12] retrieved from: http://www.pyrosetta.org (C) Copyright Rosetta Commons Member Institutions. Created in JHU by Sergey Lyskov and PyRosetta Team. core.init: {0} Checking for fconfig files in pwd and ./rosetta/flags core.init: {0} Rosetta version: PyRosetta4.conda.mac.cxx11thread.serialization.python37.Release r292 2021.31+release.c7009b3115c c7009b3115c22daa9efe2805d9d1ebba08426a54 http://www.pyrosetta.org 2021-08-07T10:04:12 core.init: {0} command: PyRosetta -ex1 -ex2aro -database /opt/miniconda3/lib/python3.7/site-packages/pyrosetta/database basic.random.init_random_generator: {0} 'RNG device' seed mode, using '/dev/urandom', seed=2087962570 seed_offset=0 real_seed=2087962570 thread_index=0 basic.random.init_random_generator: {0} RandomGenerator:init: Normal mode, seed=2087962570 RG_type=mt19937
from pyrosetta import pose_from_pdb
pdb_pose = pose_from_pdb('./data/pose_demo.pdb')
core.chemical.GlobalResidueTypeSet: {0} Finished initializing fa_standard residue type set. Created 983 residue types core.chemical.GlobalResidueTypeSet: {0} Total time to initialize 0.653008 seconds. core.import_pose.import_pose: {0} File './data/pose_demo.pdb' automatically determined to be of type PDB core.conformation.Conformation: {0} Found disulfide between residues 2 11 core.conformation.Conformation: {0} current variant for 2 CYS core.conformation.Conformation: {0} current variant for 11 CYS core.conformation.Conformation: {0} current variant for 2 CYD core.conformation.Conformation: {0} current variant for 11 CYD core.conformation.Conformation: {0} Found disulfide between residues 5 25 core.conformation.Conformation: {0} current variant for 5 CYS core.conformation.Conformation: {0} current variant for 25 CYS core.conformation.Conformation: {0} current variant for 5 CYD core.conformation.Conformation: {0} current variant for 25 CYD
表明,PDB文件自动读入成功,并发现2对二硫键,位于2、11、5、25位。
# 打印Pose的基本信息
print(pdb_pose)
PDB file name: ./data/pose_demo.pdb Total residues: 26 Sequence: HCFHCRNIRFCSEDEEELRRAREECK Fold tree: FOLD_TREE EDGE 1 26 -1
通过python的print函数,我们可以直接打印pose对象中的一些基本信息,比如这个pose的pdb文件来源是pose_demo.pdb; 其总氨基酸数目是26、氨基酸的序列是HCFHCRNIRFCSEDEEELRRAREECK。并且可以查看这个pose的foldtree连接上下游关系(可先忽略这个概念。)
除了直接从PDB文件中读取所有的原子信息,pyrosetta中也支持直接通过命令行从一级结构信息直接构建线性结构。 以下以构建5个连续的丙氨酸结构为例:
# 从一级序列生成线性结构
from pyrosetta import pose_from_sequence
seq_pose = pose_from_sequence('AAAAA')
seq_pose.dump_pdb('./data/5A.pdb')
True
将pose输出成PDB结构之后,在pymol中打开可查看到, 构建的模型的确是5个丙氨酸的线性结构:
pyrosetta中也有内置相关的函数,可以直接从PDB数据中下载某个PDB编号,并进行对应的清洗,得到Pose的方法。
# 我们还可以从rscb的PDB数据库中直接下载并读入生成Pose
from pyrosetta.toolbox.rcsb import pose_from_rcsb
rcsb_pose = pose_from_rcsb('4R80')
core.import_pose.import_pose: {0} File '4R80.clean.pdb' automatically determined to be of type PDB core.conformation.Conformation: {0} [ WARNING ] missing heavyatom: OXT on residue SER:CtermProteinFull 76 core.conformation.Conformation: {0} [ WARNING ] missing heavyatom: OXT on residue SER:CtermProteinFull 152
执行完代码后,你会发现当前文件夹中出现了4R80.clean.pdb和4R80.pdb两个PDB结构。使用pymol打开查看。 会发现执行clean之后的文件中,底物和水分子均被删除,侧链缺失的原子也被Rosetta自动补全。
RosettaSilent文件文件是一种Pose的压缩格式文件,可以用它存储大量的Pose信息,它的体积只有PDB文件的1/10。因此非常适合用于云端超算结果的储存和返回。但是Silent文件并不是人类可读的文件格式。需要使用额外的Rosetta app或从pyrosetta中进行读取。
silent文件内容的举例:
SEQUENCE: AAAAA
SCORE: score description
REMARK BINARY SILENTFILE
SCORE: 0.000 AAAAA
ANNOTATED_SEQUENCE: A[ALA:NtermProteinFull]AAAA[ALA:CtermProteinFull] AAAAA
LAAAAAAAAAAAAAAAAH/pu/AAAAAAAAAAA5DJAA9By1+DAAAAAz9Bo/kW+YAEAAAAAUBn//Q90F9767l5v8iuq+SPUx9r9XUQp8iuq+SPUx7Ta8D1P8iuq+SPUx7Ta8D1vKkf5/QGs677OEn2PNlPRAZShD9Lpxe5vEGQ0/0ntm/rH6O5vG8Q0/QHLd6Lj/dAw AAAAA
LPdUVAxrmE/jWThSJYSzfABiq1AULO60Ig9BsAhhUsAEkVXeJOUPwAV2LK/jEKQpJeHkYARaLqB067l5PhCZeA92Oz8zSuWgJHMPbAJ5GXB0OEn2vs6QgAxiHUCEpxe5P31SHAdnfzBkH6O5PbGrcA12gIBEj/dAQ AAAAA
LrbnxA14q0B0s39TJd7R9A14q0BUOqkjJ0+1AB58xnCkufCKJI7n7Atn0GDE0HBYpAmwABhpNDB067l5v2irtAh8jWCEHYfSJuU//ABYVVB0OEn2PJnHFBNkyDBEpxe5vMdr+A9EUBAkH6O5vtqr+A5fBhBEj/dAw AAAAA
LJ1IGBZHfrCk1ckiJbiwIBZtKVDEjCF+oms0OBJtfQDkXeboJ9X7QBBa4sCkM0TxJtv8GBlJbvD067l5Pe+ZIBRTvQC02lztJ3gnHB944lD0OEn2vML8IBlfOHEEpxe5PTboCBJQF0DkH6O5Pcf+HB13leDEj/dAQ AAAAA
LrbnRB14q0D0s39bJkrcXB14q0DUOqkrJpspZBpsDREkufCSJAL2ZB9VDbEEKAM4PIUMbBVs9YEEKAM4v2TkZBJR8bD067l5vRfpPBdscIEE8CabpMYzYBdIAlD0OEn2P+U7dBhuOcDEpxe5vccJYB5e/6CkH6O5vMjJYBZM2qDEj/dAw AAAAA
# 读入Silent文件
from pyrosetta.io import poses_from_silent
poses = poses_from_silent('./data/demo.silent')
poses = list(poses)
for pose in poses:
print(pose)
core.io.silent.SilentFileData: {0} Reading all structures from ./data/demo.silent core.io.silent.SilentFileData: {0} Finished reading 2 structures from ./data/demo.silent core.conformation.Conformation: {0} Found disulfide between residues 2 11 core.conformation.Conformation: {0} current variant for 2 CYD core.conformation.Conformation: {0} current variant for 11 CYD core.conformation.Conformation: {0} current variant for 2 CYD core.conformation.Conformation: {0} current variant for 11 CYD core.conformation.Conformation: {0} Found disulfide between residues 5 25 core.conformation.Conformation: {0} current variant for 5 CYD core.conformation.Conformation: {0} current variant for 25 CYD core.conformation.Conformation: {0} current variant for 5 CYD core.conformation.Conformation: {0} current variant for 25 CYD core.conformation.Conformation: {0} Found disulfide between residues 2 11 core.conformation.Conformation: {0} current variant for 2 CYD core.conformation.Conformation: {0} current variant for 11 CYD core.conformation.Conformation: {0} current variant for 2 CYD core.conformation.Conformation: {0} current variant for 11 CYD core.conformation.Conformation: {0} Found disulfide between residues 5 25 core.conformation.Conformation: {0} current variant for 5 CYD core.conformation.Conformation: {0} current variant for 25 CYD core.conformation.Conformation: {0} current variant for 5 CYD core.conformation.Conformation: {0} current variant for 25 CYD PDB file name: /Users/kunkun/Desktop/PyRosetta培训资料/data/pose_demo.pdb Total residues: 26 Sequence: HCFHCRNIRFCSEDEEELRRAREECK Fold tree: FOLD_TREE EDGE 1 26 -1 PDB file name: /Users/kunkun/Desktop/PyRosetta培训资料/data/pose_demo1.pdb Total residues: 26 Sequence: HCFHCRNIRFCSEDEEELRRAREECK Fold tree: FOLD_TREE EDGE 1 26 -1
运行结果表明,我准备的silent文件里面有2个Pose。通过python的list索引或for循环语句,我们可以轻松地访问每一个Pose的信息。
# 索引第二个Pose
first_pose = poses[1]
print(first_pose)
PDB file name: /Users/kunkun/Desktop/PyRosetta培训资料/data/pose_demo1.pdb Total residues: 26 Sequence: HCFHCRNIRFCSEDEEELRRAREECK Fold tree: FOLD_TREE EDGE 1 26 -1
从上面的4个例子中,我们已经学会了如何从Seq,PDB以及Silent文件中读入并生成Rosetta的Pose。那么接下来的最后一个例子就讲述一下如何从Pose输出对应的信息。
从已经生成的pose对象中输出PDB结构或Silent文件,仅需要调用Pose类中dump_pdb的方法函数即可或poses_to_silent函数即可。
# 从Pose输出pdb
pdb_pose.dump_pdb('./data/output.pdb')
True
# 输出5个重复的pose到silent文件中
from pyrosetta.io import poses_to_silent
for i in range(5):
poses_to_silent(pdb_pose, './data/multi_AAAAA.silent')
使用toolbox中的cleanATOM对PDB进行数据清洗,仅把ATOM和TER行进行分离:
from pyrosetta.toolbox import cleanATOM
cleanATOM("./data/4R80.pdb", out_file="./data/4R80_clean2.pdb")
如以PDB:4jfx结构为例,该结构中存在名为PTR(磷酸化酪氨酸)的残基。
Rosetta中内置可以识别的被修饰氨基酸有3种: PTR,SEP,TPO.
# 需要额外读入ptm-caa的参数文件:
init("-extra_res_fa ptm-caa/PTR.params")
ptms_pose = pose_from_pdb('./data/4jfx_peptide.pdb')
print(ptms_pose.sequence())
ptms_pose.dump_pdb('./data/ptm.pdb')
PyRosetta-4 2021 [Rosetta PyRosetta4.conda.mac.cxx11thread.serialization.python37.Release 2021.31+release.c7009b3115c22daa9efe2805d9d1ebba08426a54 2021-08-07T10:04:12] retrieved from: http://www.pyrosetta.org (C) Copyright Rosetta Commons Member Institutions. Created in JHU by Sergey Lyskov and PyRosetta Team. core.init: {0} Checking for fconfig files in pwd and ./rosetta/flags core.init: {0} Rosetta version: PyRosetta4.conda.mac.cxx11thread.serialization.python37.Release r292 2021.31+release.c7009b3115c c7009b3115c22daa9efe2805d9d1ebba08426a54 http://www.pyrosetta.org 2021-08-07T10:04:12 core.init: {0} command: PyRosetta -extra_res_fa ptm-caa/PTR.params -database /opt/miniconda3/lib/python3.7/site-packages/pyrosetta/database basic.random.init_random_generator: {0} 'RNG device' seed mode, using '/dev/urandom', seed=-1317805105 seed_offset=0 real_seed=-1317805105 thread_index=0 basic.random.init_random_generator: {0} RandomGenerator:init: Normal mode, seed=-1317805105 RG_type=mt19937 core.import_pose.import_pose: {0} File './data/4jfx_peptide.pdb' automatically determined to be of type PDB GNYVVTYA
True
参考: https://www.rosettacommons.org/docs/latest/rosetta_basics/non_protein_residues/Metals
目前Rosetta能够自动处理的一些常见金属离子: Zn, Cu, Fe, Mg, Na, K, Ca等。但注意的是,rosetta并没有认识金属和氨基酸残基形成化学键,仅是在处理是加了坐标约束。
# 在初始化的时候,需要设置metals的检测。
init("-in:auto_setup_metals")
metal_protein_pose = pose_from_pdb('./data/3cjk.pdb')
metal_protein_pose.dump_pdb('./data/3cjk_metalsetip.pdb')
PyRosetta-4 2021 [Rosetta PyRosetta4.conda.mac.cxx11thread.serialization.python37.Release 2021.31+release.c7009b3115c22daa9efe2805d9d1ebba08426a54 2021-08-07T10:04:12] retrieved from: http://www.pyrosetta.org (C) Copyright Rosetta Commons Member Institutions. Created in JHU by Sergey Lyskov and PyRosetta Team. core.init: {0} Checking for fconfig files in pwd and ./rosetta/flags core.init: {0} Rosetta version: PyRosetta4.conda.mac.cxx11thread.serialization.python37.Release r292 2021.31+release.c7009b3115c c7009b3115c22daa9efe2805d9d1ebba08426a54 http://www.pyrosetta.org 2021-08-07T10:04:12 core.init: {0} command: PyRosetta -in:auto_setup_metals -database /opt/miniconda3/lib/python3.7/site-packages/pyrosetta/database basic.random.init_random_generator: {0} 'RNG device' seed mode, using '/dev/urandom', seed=-667901329 seed_offset=0 real_seed=-667901329 thread_index=0 basic.random.init_random_generator: {0} RandomGenerator:init: Normal mode, seed=-667901329 RG_type=mt19937 core.import_pose.import_pose: {0} File './data/3cjk.pdb' automatically determined to be of type PDB core.io.pose_from_sfr.PoseFromSFRBuilder: {0} CD 171 was added by a jump, with base residue 81 core.conformation.Conformation: {0} [ WARNING ] missing heavyatom: OXT on residue ILE:CtermProteinFull 68 core.conformation.Conformation: {0} [ WARNING ] missing heavyatom: OXT on residue ARG:CtermProteinFull 143 core.util.metalloproteins_util: {0} Automatically setting covalent bonds between metal ions and metal-binding residues. core.util.metalloproteins_util: {0} Adding covalent linkage between residue 11's SG atom and residue 144's CD atom. core.util.metalloproteins_util: {0} Adding covalent linkage between residue 80's OG1 atom and residue 144's CD atom. core.util.metalloproteins_util: {0} Adding covalent linkage between residue 81's SG atom and residue 144's CD atom. core.util.metalloproteins_util: {0} Adding covalent linkage between residue 84's SG atom and residue 144's CD atom. core.util.metalloproteins_util: {0} Automatically setting up constraints between metal ions and metal-binding residues.
True
在接下来的章节,我们会介绍更多关于Pose的不同Layer的工作方式以及其中的信息提取方法。