fast_read: Reading data from an input file

Description

Efficiently reads a tab-separated text file for iq processing.

Usage

fast_read(filename,
          sample_id = "R.Condition",
          primary_id = "PG.ProteinGroups",
          secondary_id = c("EG.ModifiedSequence", "FG.Charge", "F.FrgIon", "F.Charge"),
          intensity_col = "F.PeakArea",
          annotation_col = c("PG.Genes", "PG.ProteinNames"),
          filter_string_equal = c("F.ExcludedFromQuantification" = "False"),
          filter_string_not_equal = NULL,
          filter_double_less = c("PG.Qvalue" = "0.01", "EG.Qvalue" = "0.01"),
          filter_double_greater = NULL,
          intensity_col_sep = NULL,
          intensity_col_id = NULL,
          na_string = "0")

Value

A list is returned with the following components:

protein: A table of proteins in the first column followed by annotation columns.
sample: A vector of samples.
ion: A vector of fragment ions to be used for quantification.
quant_table: A list of four components: protein_list (index pointing to protein), sample_list (index pointing to sample), id (index pointing to ion), and quant (intensities).
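
As a rough illustration of how these components relate to each other, the sketch below builds a small mock return value by hand (it does not call fast_read) and expands quant_table back into a long table. It assumes that the indices are 1-based positions into protein, sample, and ion; the object name `res`, the ion labels, and all data values are made up for the example.

```r
# Hand-built mock of the returned list; every value here is illustrative.
res <- list(
  protein = data.frame(PG.ProteinGroups = c("P1", "P2"),
                       PG.Genes = c("G1", "G2"),
                       stringsAsFactors = FALSE),
  sample = c("A", "B"),
  ion = c("PEPK_2_y3_1", "PEPK_2_y4_1", "LLTR_2_y5_1"),
  quant_table = list(
    protein_list = c(1L, 1L, 2L, 2L),
    sample_list  = c(1L, 2L, 1L, 2L),
    id           = c(1L, 2L, 3L, 3L),
    quant        = c(1200, 950, 400, 380)
  )
)

# Assumption: each index is a 1-based position into protein, sample, or ion.
qt <- res$quant_table
long <- data.frame(
  protein   = res$protein[qt$protein_list, 1],
  sample    = res$sample[qt$sample_list],
  ion       = res$ion[qt$id],
  intensity = qt$quant,
  stringsAsFactors = FALSE
)
print(long)
```

Storing integer indices rather than repeated strings keeps the table compact, which is presumably why the result is organized this way.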

Arguments

filename: A long-format tab-separated text file with a primary column of protein identification, secondary columns of fragment ions, a column of sample names, a column for quantitative intensities, and extra columns for annotation.
primary_id: Unique values in this column form the list of proteins to be quantified.
secondary_id: A concatenation of these columns determines the fragment ions used for quantification.
sample_id: Unique values in this column form the list of samples.
intensity_col: The column for intensities.
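annotation_col: Annotation columns, reported after the protein column in the protein table.
filter_string_equal: A named vector of strings, where each name is a column and each value is the string to compare against. Only rows whose entry in that column equals the string are kept. Default F.ExcludedFromQuantification equal to "False".
filter_string_not_equal: A named vector of strings, where each name is a column and each value is the string to compare against. Only rows whose entry in that column differs from the string are kept.
filter_double_less: A named vector of strings, where each name is a column and each value is a numeric threshold written as a string. Only rows whose value in that column is less than the threshold are kept. Default PG.Qvalue < 0.01 and EG.Qvalue < 0.01.
filter_double_greater: A named vector of strings, where each name is a column and each value is a numeric threshold written as a string. Only rows whose value in that column is greater than the threshold are kept.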
intensity_col_sep: A separator character when entries in the intensity column contain multiple values.
intensity_col_id: The column for identities of multiple quantitative values.
na_string: The value considered as NA.
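
The filter arguments can be pictured with a small base R sketch. This is not the package's implementation; it assumes that each name in a filter vector is a column name, that the value is the string or threshold to compare against, and that all conditions are combined with a logical AND. The toy data frame `d` is invented for the example.

```r
# Toy long-format table; values are illustrative.
d <- data.frame(
  PG.Qvalue = c(0.001, 0.02, 0.005, 0.003),
  EG.Qvalue = c(0.002, 0.001, 0.05, 0.004),
  F.ExcludedFromQuantification = c("False", "False", "False", "True"),
  stringsAsFactors = FALSE
)

# Same shape as the defaults of fast_read.
filter_string_equal <- c("F.ExcludedFromQuantification" = "False")
filter_double_less  <- c("PG.Qvalue" = "0.01", "EG.Qvalue" = "0.01")

keep <- rep(TRUE, nrow(d))

# String filter: keep rows whose entry equals the given string.
for (col in names(filter_string_equal)) {
  keep <- keep & d[[col]] == filter_string_equal[[col]]
}

# Numeric filter: thresholds arrive as strings, so convert before comparing.
for (col in names(filter_double_less)) {
  keep <- keep & d[[col]] < as.numeric(filter_double_less[[col]])
}

kept <- d[keep, ]
print(kept)
```

Under these assumptions only the first row survives: the second and third fail a Q-value threshold and the fourth is excluded from quantification.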

Author

Thang V. Pham

Details

When entries in the intensity column contain multiple values, this function replicates the entries in the other columns, and secondary_id is appended with the corresponding entries in intensity_col_id when it is provided. Otherwise, the integer values 1, 2, 3, etc. are used.
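
The expansion described above can be sketched in base R as follows. This is only an approximation of the behavior, not the package's code: the helper `expand_rows`, the column names, the ";" separator, and the "_" used to join the suffix onto the secondary identifier are all assumptions made for the example.

```r
# Toy rows where the intensity column holds several ";"-separated values.
d <- data.frame(
  protein   = c("P1", "P2"),
  secondary = c("PEPK_2", "LLTR_3"),
  intensity = c("100;200;300", "50;60"),
  ids       = c("y3;y4;y5", "b2;b3"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one output row per value in the intensity column.
expand_rows <- function(d, sep = ";", use_ids = TRUE) {
  values <- strsplit(d$intensity, sep, fixed = TRUE)
  n <- lengths(values)
  # Suffix: entries of the id column when used, otherwise 1, 2, 3, ...
  suffix <- if (use_ids) {
    unlist(strsplit(d$ids, sep, fixed = TRUE))
  } else {
    as.character(sequence(n))
  }
  # Replicate the other columns once per value.
  row <- rep(seq_len(nrow(d)), times = n)
  data.frame(
    protein   = d$protein[row],
    secondary = paste(d$secondary[row], suffix, sep = "_"),
    intensity = as.numeric(unlist(values)),
    stringsAsFactors = FALSE
  )
}

with_ids    <- expand_rows(d)                   # suffixes taken from the id column
without_ids <- expand_rows(d, use_ids = FALSE)  # suffixes 1, 2, 3, ...
print(with_ids)
print(without_ids)
```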

References

Pham TV, Henneman AA, Jimenez CR. iq: an R package to estimate relative protein abundances from ion quantification in DIA-MS-based proteomics. Bioinformatics 2020 Apr 15;36(8):2611-2613.