get.AlignedPositions: get.AlignedPositions

Description

This function reads a CIF file to extract the names and (x,y,z) coordinates of each residue. It then performs a pairwise alignment to convert the amino acid ordering in the CIF file to the canonical ordering specified by the FASTA file. The first element in the returned list, $Positions, is the positions matrix required by the ClusterFind method.

Usage

get.AlignedPositions(CIF.File.Location, Fasta.File.Location, chain.required = "A",
		     RequiredModelNum = NULL, patternQuality = PhredQuality(22L),
		     subjectQuality = PhredQuality(22L), type = "global-local",
		     substitutionMatrix = NULL, fuzzyMatrix = NULL, gapOpening = -10,
		     gapExtension = -4, scoreOnly = FALSE)

Arguments

CIF.File.Location

The location of the CIF file to be read.

Fasta.File.Location

The location of the FASTA (or FASTA-like) file to be read.

chain.required

The chain in the protein structure (for example "A") from which to extract positions in the CIF file.

RequiredModelNum

The model number in the CIF file from which to extract positions. If RequiredModelNum is NULL, the method uses the first model number found in the file.

patternQuality

The patternQuality parameter in the pairwiseAlignment function in the Biostrings package.

subjectQuality

The subjectQuality parameter in the pairwiseAlignment function in the Biostrings package.

type

The type parameter in the pairwiseAlignment function in the Biostrings package. This should NOT be changed from "global-local": the canonical protein sequence from the FASTA file is used as the global pattern and the sequence extracted from the CIF file is used as the subject, so that parts of the subject are aligned to the entire pattern.

substitutionMatrix

The substitutionMatrix parameter in the pairwiseAlignment function in the Biostrings package.

fuzzyMatrix

The fuzzyMatrix parameter in the pairwiseAlignment function in the Biostrings package.

gapOpening

The gapOpening parameter in the pairwiseAlignment function in the Biostrings package.

gapExtension

The gapExtension parameter in the pairwiseAlignment function in the Biostrings package.

scoreOnly

The scoreOnly parameter in the pairwiseAlignment function in the Biostrings package.

Value

Positions: A data frame that shows the extracted amino acids, their numerical position in the canonical protein ordering, the protein chain being used and the amino acid positions in 3D space.
Diff.Count: A check that the remaining amino acids are in fact matched to the canonical protein. This is the number of remaining amino acids that do not match the canonical sequence and should be 0 if the alignment was successful.
Diff.Positions: A description of the mismatched amino acids, if any were found. If the alignment was successful, this will be NULL.
Alignment.Result: The raw alignment result returned by the pairwiseAlignment function in the Biostrings package.
Result: The final status of the alignment. If it is "OK", the alignment appears to have succeeded. If the alignment failed, this item contains an error message.
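
As a minimal sketch of how the returned list might be checked before its $Positions element is passed on, the helper below stops when Result is not "OK" or when Diff.Count is nonzero. The helper name check.aligned.positions and the mock list are hypothetical and are not part of the package; the sketch also assumes that Result is a single character string.

```r
# Hypothetical helper (not part of the package): return $Positions only
# when the documented status fields indicate a clean alignment.
check.aligned.positions <- function(aligned) {
  # Assumption: Result is a single character string equal to "OK" on success.
  if (!identical(aligned$Result, "OK")) {
    stop("Alignment failed: ", paste(aligned$Result, collapse = " "))
  }
  if (aligned$Diff.Count != 0) {
    stop(aligned$Diff.Count, " residue(s) still differ from the canonical sequence")
  }
  aligned$Positions
}

# Mock result for illustration only. It reuses the documented element names,
# but the data frame contents and column names are made up.
mock.result <- list(
  Positions = data.frame(residue = c("M", "T"),
                         x = c(0, 1.5), y = c(0, 0.5), z = c(0, 2.0),
                         stringsAsFactors = FALSE),
  Diff.Count = 0,
  Diff.Positions = NULL,
  Alignment.Result = NULL,
  Result = "OK"
)

positions <- check.aligned.positions(mock.result)
```

With a real call, the mock list would be replaced by the value returned from get.AlignedPositions.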

Details

This method is currently in BETA and is provided only as a convenient way to extract the required 3D positional information from a CIF file. Currently, CIF (and PDB) files can have a number of deviations from the canonical protein sequence including additional, missing and mismatched amino acids. The amino acid numbering in CIF files can also be different from the canonical protein. This makes it difficult to match up mutational data (from sources such as COSMIC) to 3D positional data (from sources such as the PDB) and necessitates the use of this function.

This method extracts the canonical amino acid sequence from the file at Fasta.File.Location. It then attempts to align the amino acids extracted from the CIF file to the canonical sequence using the pairwiseAlignment function in the package Biostrings that is available on Bioconductor. After alignment, any amino acids that are mismatched between the canonical sequence and the extracted sequence are automatically removed so that the ClusterFind method, which requires positional data as input, is only run on those amino acids which are correctly matched.
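
The mismatch-removal step described above can be illustrated with a small base R sketch. It is not the package's implementation: it assumes the alignment has already assigned each extracted residue a canonical position, and the sequence, coordinates and column names are toy values chosen for illustration.

```r
# Toy canonical sequence (hypothetical, not a real protein).
canonical <- strsplit("MTEYKLVVQG", "")[[1]]

# Toy residues as they might look after alignment: each row carries the
# canonical position it was mapped to. Position 1 is absent from the
# structure, and position 9 carries "H" where the canonical sequence has "Q".
extracted <- data.frame(
  residue            = c("T", "E", "Y", "K", "L", "V", "V", "H", "G"),
  canonical.position = 2:10,
  x                  = seq(1, 9),
  y                  = rep(0, 9),
  z                  = rep(0, 9),
  stringsAsFactors   = FALSE
)

# Compare each extracted residue with the canonical residue at its position.
is.match <- extracted$residue == canonical[extracted$canonical.position]

# Keep only the matched residues; record the mismatched ones separately.
positions      <- extracted[is.match, ]
diff.positions <- extracted[!is.match, ]

# Re-check the retained rows: this count plays the role of Diff.Count.
diff.count <- sum(positions$residue != canonical[positions$canonical.position])
```

In this toy case the row at canonical position 9 is dropped, which mirrors how a mismatched residue would be absent from the positions used downstream.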

References

Biostrings Package. Gentleman, R., Carey, V. J., Bates, D. M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., and others (2004). Bioconductor: Open software development for computational biology and bioinformatics. Genome Biology, Vol. 5, R80.

Examples

library(iPAC)  # package that provides get.AlignedPositions

# Observe that position 61 is missing. It is automatically dropped because the
# PDB data specifies it as an "H" while the FASTA sequence specifies it as "Q".
CIF <- "http://www.pdb.org/pdb/files/3GFT.cif"
Fasta <- "http://www.uniprot.org/uniprot/P01116-2.fasta"
get.AlignedPositions(CIF, Fasta, "A")
