get.Positions: get.Positions

Description

This function reads a CIF file to extract the names and (x,y,z) coordinates of each residue. Then, the information within the CIF file itself is used to identify the canonical position for each amino acid.

Usage

get.Positions(CIF.File.Location, Fasta.File.Location, chain.required = "A",
			  RequiredModelNum = NULL)

Arguments

CIF.File.Location

The location of the CIF file to be read.

Fasta.File.Location

The location of the FASTA (or FASTA-like) file to be read.

chain.required

The side chain in the protein from which to extract positions in the CIF file.

RequiredModelNum

The required model num to extract positions from in the CIF file. If the RequiredModelNum == NULL, the method will use the first model number found in the file.

Value

Positions: A dataframe that shows the amino acids extracted, their position in protein ordering, the side chain and their position in 3D space.
External.Mismatch: This shows the mismatches that were found between the amino acids extracted from the file and the canonical sequence in the FASTA file.
PDB.Mismatch: This shows the mismatches that are known to exist using only the information available in the CIF file.
Result: The final status of extracting the amino acid sequence. It will either display "OK" or an error message.

Details

This algorithm attempts to recreate the canonical sequence by considering 3 seperate parts of a CIF file. The first part, the "_pdbx_poly_seq_scheme" section corresponds to the SEQRES entries in PDB files and lists all the amino acid residues in the protein that were studied including those that were not included in the 3D structure. The second part, the "struct_ref_seq" section, links the Uniprot (which provides the canonical sequence), PDB and internal CIF amino acid numbering schemes. Finally, the "_atom_site" section provides the (x,y,z) positional information. All three parts are tied together via a "seq_id" numerical identifier. It is thus possible to use the numbering scheme map provided in the "struct_ref_seq" section to calculate the canonical numbering for the residues listed in the "_pdbx_poly_seq_scheme" and then finally match them up to the residues listed in the "_atom_site" section.

Examples

Run this code

###################################################
#Observe that the final result from the code below is "OK". That is because the only
#mismatched residue at position 61, was documented in the CIF file as well.
#Thus it is considered a "reconciled" mismatch. It is up to the user to decide if 
#they want to include it in the position sequence or remove it. 

CIF<-"http://www.pdb.org/pdb/files/3GFT.cif"
Fasta<-"http://www.uniprot.org/uniprot/P01116-2.fasta"
KRAS.extracted.positions<- get.Positions(CIF, Fasta, "A")
###################################################

###################################################
#Observe that the final result from the code below is "FAILURE". For PIK3CA there were
#10 mismatched residues between the CIF file and the canonical sequence. 
#However, none of these residues are explained within the actual CIF file.

CIF<- "http://www.pdb.org/pdb/files/2RD0.cif"
Fasta<-"http://cancer.sanger.ac.uk/cosmic/sequence?width=700&ln=PIK3CA&type=protein&height=500"
PIK3CA.extracted.positions<- get.Positions(CIF,Fasta , "A")
###################################################


###################################################
#Observe that the final result from the code below is "OK". Here we use a different file
#location for the canonical sequence -- the UNIPROT database. The canonical sequence is slightly 
#different and matches up exactly to the extracted positions. As there is only 1 isoform listed
#on UNIPROT for PIK3CA we suggest that users obtain the mutational data and the canonical 
#sequence information from the same source. For example, if your mutation data was obtained from 
#COSMIC, you should use COSMIC to get the canonical protein sequence.

CIF <- "http://www.pdb.org/pdb/files/2RD0.cif"
Fasta <- "http://www.uniprot.org/uniprot/P42336.fasta"
PIK3CA.extracted.positions<- get.Positions(CIF, Fasta, "A")
###################################################

Run the code above in your browser using DataLab