
I am currently involved in building a website that aims to bring all papillomavirus information together in a single place. As part of this effort we are curating all known files on public servers (e.g. GenBank). One of the issues I ran into is that many (~50%) of the solved structures are not numbered according to the full-length protein. For example, a subdomain was crystallized (amino acids 310-450), but the crystallographer deposited it as residues 1-141. Does anyone know of a way to renumber the entire PDB file? I have found ways to renumber the sequence (identified by SEQRES), but this does not update the HELIX and SHEET information. I would appreciate any suggestions.



I frequently encounter this problem too. After abandoning an old Perl script I had for this, I've been experimenting with some Python instead. This solution assumes you have Biopython, ProDy (http://www.csb.pitt.edu/ProDy/#prody) and EMBOSS (http://emboss.sourceforge.net/) installed.

I used one of the papillomavirus PDB entries here.

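from Bio import AlignIO, SeqIO, ExPASy, SwissProt
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
# Bio.Alphabet was removed in Biopython 1.78, so that import is dropped.
# The EMBOSS command-line wrappers are deprecated in recent Biopython releases;
# this script assumes a version that still ships them.
from Bio.Emboss.Applications import NeedleCommandline
from prody.proteins.pdbfile import parsePDB, writePDB

# Placeholders: the post does not say which PDB entry or UniProt accession was used
pdbname = "XXXX"
accession = "XXXXXX"

# Three-letter to one-letter codes for the 20 standard amino acids
# (the original dictionary was cut off; non-standard residues need extra entries)
oneletter = {
    'ALA':'A', 'ARG':'R', 'ASN':'N', 'ASP':'D', 'CYS':'C', 'GLN':'Q', 'GLU':'E',
    'GLY':'G', 'HIS':'H', 'ILE':'I', 'LEU':'L', 'LYS':'K', 'MET':'M', 'PHE':'F',
    'PRO':'P', 'SER':'S', 'THR':'T', 'TRP':'W', 'TYR':'Y', 'VAL':'V'}

# Retrieve pdb to extract sequence
# Can probably be done with Bio.PDB but being able to use the vmd-like selection algebra is nice
structure = parsePDB(pdbname)
selection = "chain A"
pdbseq_str = ''.join(oneletter[i] for i in structure.select("protein and name CA and %s" % selection).getResnames())
pdbseq = SeqRecord(Seq(pdbseq_str), id=pdbname, description="")
SeqIO.write(pdbseq, "%s.fasta" % pdbname, "fasta")

# Retrieve reference sequence
handle = ExPASy.get_sprot_raw(accession)
swissseq = SwissProt.read(handle)
refseq = SeqRecord(Seq(swissseq.sequence), id=accession, description="")
SeqIO.write(refseq, "%s.fasta" % accession, "fasta")

# Do global alignment with needle from EMBOSS, stores entire sequences which makes numbering easier
needle_cli = NeedleCommandline(asequence="%s.fasta" % pdbname, bsequence="%s.fasta" % accession, gapopen=10, gapextend=0.5, outfile="needle.out")
needle_cli()  # actually run needle; without this call needle.out is never written
aln = AlignIO.read("needle.out", "emboss")

alnPDBseq = aln[0]
alnREFseq = aln[1]
# Initialize per-letter annotation for pdb sequence record
alnPDBseq.letter_annotations["resnum"] = [None] * len(alnPDBseq)
# Initialize annotation for reference sequence, assume first residue is #1
refnums, resnum = [], 0
for letter in alnREFseq.seq:
    if letter == '-':
        refnums.append(None)
    else:
        resnum += 1
        refnums.append(resnum)
alnREFseq.letter_annotations["resnum"] = refnums

# Set new residue numbers in alnPDBseq based on alignment
reslist = [[i, alnREFseq.letter_annotations["resnum"][i]] for i in range(len(alnREFseq)) if alnPDBseq[i] != '-']
for i, r in reslist:
    alnPDBseq.letter_annotations["resnum"][i] = r

# Set new residue numbers in the structure
# (assumes every PDB residue is aligned to a reference residue, not to a gap)
newresnums = [i for i in alnPDBseq.letter_annotations["resnum"] if i is not None]
resindices = structure.select("protein and name CA and %s" % selection).getResindices()
for newresnum, resindex in zip(newresnums, resindices):
    structure.select("resindex %d" % resindex).setResnums(newresnum)

# Write the renumbered structure (the output file name is a placeholder)
writePDB("%s_renumbered.pdb" % pdbname, structure)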
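
The numbering step is easier to follow in isolation. Below is a minimal sketch of it in plain Python, with no Biopython or ProDy required. It assumes the two aligned sequences are already available as equal-length strings with "-" for gaps; the function name renumber_from_alignment and its parameters are made up for this illustration and are not part of any library.

```python
def renumber_from_alignment(aligned_pdb, aligned_ref, ref_start=1, gap="-"):
    """Return one reference residue number (or None) per PDB residue.

    aligned_pdb and aligned_ref are the two rows of a pairwise alignment.
    ref_start is the number assigned to the first reference residue.
    """
    if len(aligned_pdb) != len(aligned_ref):
        raise ValueError("aligned sequences must have equal length")

    numbers = []
    ref_pos = ref_start - 1
    for pdb_char, ref_char in zip(aligned_pdb, aligned_ref):
        if ref_char != gap:
            ref_pos += 1          # advance along the reference sequence
        if pdb_char == gap:
            continue              # no PDB residue in this column
        # A PDB residue opposite a reference gap has no reference number
        numbers.append(ref_pos if ref_char != gap else None)
    return numbers


# Toy alignment: the crystallized fragment starts at reference residue 4
# and lacks reference residue 8.
print(renumber_from_alignment("---KLMN-PQ", "ACDKLMNEPQ"))
```

Returning None for residues that have no counterpart in the reference keeps the output the same length as the PDB sequence, so such insertions can be handled explicitly instead of silently shifting every number after them.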


Other tips

I'm the maintainer of pdb-tools - which may be a tool that can assist you.

I have recently modified the residue-renumber script within my application to provide more flexibility. It can now renumber hetatms and specific chains, and either force the residue numbers to be continuous or just add a user-specified offset to all residues.
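
For illustration, the "add an offset" mode can be sketched in a few lines of plain Python. This is not the pdb-tools implementation, just a rough sketch that assumes a standard fixed-column PDB file; the names RESNUM_FIELDS and shift_residue_numbers are invented for this example. It also shifts the residue numbers in HELIX and SHEET records, which is what the question asks about.

```python
# Zero-based (start, end) slices that hold residue sequence numbers,
# following the fixed-column layout of the PDB format.
RESNUM_FIELDS = {
    "ATOM": [(22, 26)],
    "HETATM": [(22, 26)],
    "ANISOU": [(22, 26)],
    "TER": [(22, 26)],
    "HELIX": [(21, 25), (33, 37)],
    "SHEET": [(22, 26), (33, 37), (50, 54), (65, 69)],
}


def shift_residue_numbers(lines, offset):
    """Return a new list of PDB lines with residue numbers shifted by offset."""
    shifted = []
    for line in lines:
        record = line[:6].strip()
        for start, end in RESNUM_FIELDS.get(record, []):
            text = line[start:end]
            if not text.strip():
                continue  # blank or missing field, e.g. a bare TER line
            new_number = int(text) + offset
            if not -999 <= new_number <= 9999:
                raise ValueError("residue number %d does not fit in 4 columns" % new_number)
            line = line[:start] + "%4d" % new_number + line[end:]
        shifted.append(line)
    return shifted


# Residue 1 of the deposited fragment becomes residue 310 of the full protein.
atom = "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N"
print(shift_residue_numbers([atom], 309)[0])
```

A single offset only works when the deposited fragment is contiguous in the full-length protein; structures with internal gaps need a per-residue mapping instead.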

Please let me know if this assists you.

  1. pdb-tools
  2. Phenix pdb-tools
  3. BioPython or Bio3D

Check the first one - it should fit your needs

License: CC-BY-SA with attribution
Not affiliated with StackOverflow