iprotein: Amino Acid Compositions of Proteins

Description

Functions to identify proteins, get and set amino acid compositions, and calculate thermodynamic properties from group additivity.

Usage

iprotein(protein, organism=NULL)
ip2aa(protein, organism=NULL, residue=FALSE)
aa2eos(aa, state=thermo$opt$state)
seq2aa(protein, sequence)
dl.aa(protein)
aasum(aa, abundance = 1, average = FALSE, protein = NULL, organism = NULL)
read.aa(file = "protein.csv")
add.protein(aa, print.existing=FALSE)

Arguments

protein

character, name of protein; numeric, indices of proteins (rownumbers of thermo$protein)

organism

character, name of organism

residue

logical, compute per-residue counts?

aa

data frame, amino acid composition in the format of thermo$protein

state

character, physical state

sequence

character, protein sequence

abundance

numeric, abundances of proteins

average

logical, return the weighted average of amino acid counts?

file

character, path to file with amino acid compositions

print.existing

logical, print a message identifying existing proteins that were not added?

Details

A protein in CHNOSZ is defined by a name and by the counts of amino acids, stored in thermo$protein. The purpose of the functions described here is to identify proteins and work with their amino acid compositions. From the amino acid compositions, the thermodynamic properties of the proteins can be estimated (Dick et al., 2006) for use in other functions in the package.

Often, the names of proteins are sufficient to set up calculations using functions such as subcrt or species. The names of proteins in CHNOSZ are distinguished from those of other chemical species by having an underscore character ("_") that separates two identifiers, referred to as the protein and organism (but any other meaning can be attached to these names). An example is LYSC_CHICK.
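The naming convention can be illustrated with base R alone. This is only a sketch of how a name such as LYSC_CHICK breaks into its two identifiers; it does not use any CHNOSZ function, and the variable names are chosen for illustration.

```r
# Split a protein name at the underscore (base R only; not a CHNOSZ function)
name <- "LYSC_CHICK"
parts <- strsplit(name, "_", fixed = TRUE)[[1]]
protein <- parts[1]    # first identifier, referred to as the protein
organism <- parts[2]   # second identifier, referred to as the organism
# Rejoin the two identifiers to recover the full name
rejoined <- paste(protein, organism, sep = "_")
```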

The first few functions provide low-level operations:

iprotein returns the row number(s) of thermo$protein that match the protein names. The names can be supplied whole in the single protein argument, or split into separate protein and organism arguments (without the underscore). Any protein that is not matched returns NA and generates a message.

ip2aa returns the row(s) of thermo$protein that match the supplied protein names, or the protein indices found by iprotein. Set residue to TRUE to return the per-residue composition (i.e. the amino acid composition of the protein divided by the total number of residues). For this function only, if the protein argument is a data frame, it is returned unchanged, except possibly for the per-residue calculation.
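The per-residue calculation amounts to dividing each amino acid count by the total number of residues. The following base R sketch shows that arithmetic on a small named vector of counts (those of the ten-residue sequence LAAGKVEDSD); the vector layout is an assumption made for illustration and is not the format of thermo$protein.

```r
# Amino acid counts for the sequence LAAGKVEDSD (illustrative named vector)
counts <- c(Ala = 2, Asp = 2, Glu = 1, Gly = 1, Lys = 1, Leu = 1, Ser = 1, Val = 1)
# Per-residue composition: each count divided by the total number of residues
per_residue <- counts / sum(counts)
```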

aa2eos calculates the thermodynamic properties and equations-of-state parameters for the completely nonionized proteins using group additivity with parameters taken from Dick et al., 2006 (aqueous proteins) and LaRowe and Dick, 2012 (crystalline proteins and revised aqueous methionine sidechain group). The return value is a data frame in the same format as thermo$obigt. state indicates the physical state for the parameters used in the calculation (aq or cr).

The remaining functions are more likely to be called directly by the user:

seq2aa returns a data frame of amino acid composition, in the format of thermo$protein, corresponding to the provided sequence. Here, the protein argument indicates the name of the protein with an underscore (e.g. LYSC_CHICK).
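The idea of turning a sequence into amino acid counts can be sketched in base R as follows. This is not the implementation of seq2aa; it only counts one-letter codes, and the ordering of the 20 letters here is an assumption for illustration.

```r
# Count the 20 standard amino acids in a sequence (base R sketch, not seq2aa)
sequence <- "LAAGKVEDSD"
letters20 <- c("A", "C", "D", "E", "F", "G", "H", "I", "K", "L",
               "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")
residues <- strsplit(sequence, "")[[1]]
# Using a factor keeps amino acids that are absent from the sequence (count 0)
counts <- table(factor(residues, levels = letters20))
```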

dl.aa returns a data frame of amino acid composition, in the format of thermo$protein, retrieved from the sequence identified by protein if it is available from UniProt (http://uniprot.org; The UniProt Consortium, 2012). The name of the protein corresponds to the Entry name on the UniProt search pages.

aasum returns a data frame representing the sum of amino acid compositions in the rows of the input aa data frame. The amino acid compositions are multiplied by the indicated abundance; that argument is recycled to match the number of rows of aa. If average is TRUE the final sum is divided by the number of input compositions. The name used in the output is taken from the first row of aa or from protein and organism if they are specified. This function is useful for calculating bulk amino acid compositions in stress response experiments or localization studies; see read.expr for examples of its use.
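The abundance-weighted sum described here can be sketched in base R. The two-column data frame below is a hypothetical stand-in for the amino acid columns of aa, and the code is not the implementation of aasum; it only shows the recycling of abundance and the row-wise weighting.

```r
# Hypothetical amino acid counts for two proteins (rows)
aa <- data.frame(Ala = c(2, 4), Gly = c(1, 3))
# Recycle the abundances to match the number of rows
abundance <- rep(c(1, 2), length.out = nrow(aa))
# Multiply row i by abundance[i], then sum each column
weighted <- aa * abundance
total <- colSums(weighted)
```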

read.aa returns a data frame of amino acid composition based on the contents of the indicated file, which can be either a CSV file with the same column names as thermo$protein, or a FASTA file, which is read using read.fasta.

add.protein completes the loop: any amino acid composition returned by the *aa functions described above can be added to thermo$protein with this function, making it available to other functions in the package. Proteins in aa with the same name as one in thermo$protein are skipped. The function returns the row numbers of thermo$protein that were added and/or left unchanged (having the same protein-organism name).

References

Dick, J. M., LaRowe, D. E. and Helgeson, H. C. (2006) Temperature, pressure, and electrochemical constraints on protein speciation: Group additivity calculation of the standard molal thermodynamic properties of ionized unfolded proteins. Biogeosciences 3, 311-336. http://dx.doi.org/10.5194/bg-3-311-2006

The UniProt Consortium (2012) Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucl. Acid Res. 40, D71-D75. http://dx.doi.org/10.1093/nar/gkr981

Examples


# search by name in thermo$protein
ip1 <- iprotein("LYSC_CHICK")
ip2 <- iprotein("LYSC", "CHICK")
# these are the same
stopifnot(all.equal(ip1, ip2))
# two organisms with the same protein name
ip3 <- iprotein("MYG", c("HORSE", "PHYCA"))
# their amino acid compositions
print(aa <- ip2aa(ip3))
# their thermodynamic properties by group additivity
aa2eos(aa)

# an example of an unrecognized protein name
ip4 <- iprotein("MYGPHYCA")
stopifnot(is.na(ip4))

# manually adding a new protein
# Human Gastric juice peptide 1
aa <- seq2aa("GAJU_HUMAN", "LAAGKVEDSD")
ip <- add.protein(aa)
stopifnot(protein.length(ip)==10)
stopifnot(as.chemical.formula(protein.formula(ip))=="C41H69N11O18")

# downloading information about a protein
aa <- dl.aa("ALAT1_HUMAN")
ip6 <- add.protein(aa)
# now it's possible to calculate some properties ...
subcrt("ALAT1_HUMAN", c("cr", "aq"), c(-1, 1))$out[[1]]$H

# read a fasta file, calculate H/C and O/C for all proteins
# and averages by polypeptide chain, residue and mass
ffile <- system.file("extdata/fasta/HTCC1062.faa.xz", package="CHNOSZ")
aa <- read.aa(ffile)
pf <- as.data.frame(protein.formula(aa))
plot(pf$H/pf$C, pf$O/pf$C, pch=NA)
points(pf$H/pf$C, pf$O/pf$C, pch=20, cex=0.5, col="grey")
# average composition per polypeptide chain
chainaa <- aasum(aa, average=TRUE)
chainpf <- as.data.frame(protein.formula(chainaa))
points(chainpf$H/chainpf$C, chainpf$O/chainpf$C, pch=15, col="red")
# average by amino acid residue
pl <- protein.length(aa)
resaa <- aasum(aa, abundance=pl, average=TRUE)
respf <- as.data.frame(protein.formula(resaa))
points(respf$H/respf$C, respf$O/respf$C, pch=16, col="red")
# average by mass
pm <- mass(pf)
massaa <- aasum(aa, abundance=pm, average=TRUE)
masspf <- as.data.frame(protein.formula(massaa))
points(masspf$H/masspf$C, masspf$O/masspf$C, pch=17, col="red")
# add a legend and title
legend("topright", pch=c(20, 15, 16, 17), col=c("grey", rep("red", 3)),
  legend=c("protein", "chain average", "residue average", "mass average"))
title(main=paste("O/C vs H/C for HTCC1062 proteins\n",
  "n =", nrow(aa)))
