MaxQuant changes capitalization, and sometimes even column names, between versions. This function therefore reads a txt file and returns unified column names, irrespective of the MQ version. It unifies access to columns (e.g. by using lower case for ALL columns) and ensures that columns are named identically across MQ versions:
alternative term          new term
-----------------------------------------
protease                  enzyme
protein.descriptions      fasta.headers
potential.contaminant     contaminant
mass.deviations           mass.deviations..da.
basepeak.intensity        base.peak.intensity
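The renaming described above can be sketched roughly as follows. This is a minimal illustration, not the package's actual code: `unify_colnames` is a hypothetical helper, and it assumes that alternative terms are replaced only when they match a whole column name.

```r
# Hypothetical helper (not part of the package): lower-case all column names
# and map alternative MaxQuant terms to the unified terms from the table.
# Assumption: a term is replaced only if it matches the full column name.
unify_colnames <- function(cn) {
  cn <- tolower(cn)
  renames <- c("protease"              = "enzyme",
               "protein.descriptions"  = "fasta.headers",
               "potential.contaminant" = "contaminant",
               "mass.deviations"       = "mass.deviations..da.",
               "basepeak.intensity"    = "base.peak.intensity")
  hit <- cn %in% names(renames)
  cn[hit] <- unname(renames[cn[hit]])
  cn
}

unify_colnames(c("Protease", "Potential.contaminant", "Raw.file"))
# "enzyme" "contaminant" "raw.file"
```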
A 'this' pointer. Use it to refer to or change internal members. It is added implicitly, so it is not required to call the function!
(Relative) path to a MQ txt file
String which is searched for "C" and "R". If found, [c]ontaminants and/or [r]everse hits are removed, provided the respective columns are present.
E.g. to filter both, filter = "C+R"
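To illustrate the filter logic, here is a small sketch. `filter_hits` is a hypothetical function, and the sketch assumes that flagged rows carry a "+" in the 'contaminant' and 'reverse' columns; the actual implementation may differ.

```r
# Hypothetical sketch: drop rows flagged as contaminant and/or reverse hit.
# Assumption: flagged rows carry "+" in the respective column.
filter_hits <- function(df, filter = "C+R") {
  if (grepl("C", filter) && "contaminant" %in% colnames(df)) {
    df <- df[!(df$contaminant %in% "+"), , drop = FALSE]
  }
  if (grepl("R", filter) && "reverse" %in% colnames(df)) {
    df <- df[!(df$reverse %in% "+"), , drop = FALSE]
  }
  df
}

d <- data.frame(id = 1:4,
                contaminant = c("", "+", "", ""),
                reverse     = c("", "", "+", ""),
                stringsAsFactors = FALSE)
filter_hits(d, "C+R")$id   # rows 2 and 3 are removed
filter_hits(d, "R")$id     # only row 3 is removed
```

Using `%in%` instead of `!=` keeps rows whose flag is NA, which can happen when read.delim() reads a completely empty column as logical.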
Allowed values are:
"pg" (proteinGroups) [default], adds abundance index columns (*AbInd*, replacing 'intensity')
"sm" (summary), splits into three row subsets (raw.file, condition, total)
Any other value will not add any special columns.
A vector of column names as read by read.delim(), e.g. spaces are already replaced by dots. If given, only columns with these names (ignoring lower/uppercase) will be returned (regex allowed). E.g. col_subset=c("^lfq.intensity.", "protein.name")
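A case-insensitive regex selection of this kind could look like the sketch below. `select_columns` is a hypothetical helper used only for illustration; it works on a plain vector of column names.

```r
# Hypothetical sketch: keep every column name that matches at least one
# of the given patterns, ignoring lower/uppercase.
select_columns <- function(cn, col_subset) {
  keep <- rep(FALSE, length(cn))
  for (pattern in col_subset) {
    keep <- keep | grepl(pattern, cn, ignore.case = TRUE)
  }
  cn[keep]
}

cn <- c("Proteins", "LFQ.intensity.A", "LFQ.intensity.B", "Intensity")
select_columns(cn, c("^lfq.intensity.", "proteins"))
# "Proteins" "LFQ.intensity.A" "LFQ.intensity.B"
```

Note that an unescaped dot in a pattern matches any character, so "^lfq.intensity." is slightly more permissive than it looks.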
If TRUE and a column 'raw.file' is present, an additional column 'fc.raw.file' will be added with common prefix AND common substrings removed (simplifyNames).
E.g. two rawfiles named 'OrbiXL_2014_Hek293_Control', 'OrbiXL_2014_Hek293_Treated' will give 'Control', 'Treated'.
If add_fs_col is a number AND the longest short-name is still longer than that number, the names are discarded and replaced by a running ID of the form 'file <x>', where <x> is a number from 1 to N.
If the function is called again and a mapping already exists, this mapping is used.
Should some raw.files be unknown (i.e. the mapping from the previous file is incomplete), the mapping will be augmented with them.
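The shortening of raw file names can be pictured with the simplified sketch below. It is not the package's simplifyNames: `shorten_names` and `max_len` are hypothetical, only the common prefix is stripped (common substrings are left alone), no mapping is remembered between calls, and at least two distinct names are assumed.

```r
# Hypothetical sketch: strip the longest common prefix; if the longest
# remaining name exceeds max_len characters, fall back to running IDs.
# Assumption: x contains at least two distinct names.
shorten_names <- function(x, max_len = 10) {
  ux <- unique(x)
  n <- min(nchar(ux))
  i <- 0
  # grow the prefix while all names still agree on the first i + 1 characters
  while (i < n && length(unique(substr(ux, 1, i + 1))) == 1) i <- i + 1
  short <- substring(ux, i + 1)
  if (max(nchar(short)) > max_len) short <- paste("file", seq_along(ux))
  unname(setNames(short, ux)[x])
}

raw <- c("OrbiXL_2014_Hek293_Control", "OrbiXL_2014_Hek293_Treated")
shorten_names(raw, max_len = 10)  # "Control" "Treated"
shorten_names(raw, max_len = 5)   # "file 1" "file 2"
```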
After reading the data, check for an unusual number of NAs to detect whether the file was corrupted by Excel or the like
[For type=='pg' only] An additional custom LFQ column ('cLFQ...') is created where zero values in LFQ columns are replaced by the following method IFF(!) the corresponding raw intensity is >0 (indicating that LFQ is erroneously 0):
"toNA": replace by NA
"impute": replace by lowest LFQ value >0 (simulating 'noise')
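The two replacement modes can be sketched for a single LFQ column as follows. `fix_lfq` is a hypothetical function; in particular, taking the lowest LFQ value >0 from the same column (rather than across all columns) is an assumption of this sketch.

```r
# Hypothetical sketch of both replacement modes for one LFQ column.
# Assumption: "impute" uses the lowest positive value of this column.
fix_lfq <- function(lfq, intensity, mode = c("toNA", "impute")) {
  mode <- match.arg(mode)
  # LFQ is zero although the raw intensity shows signal
  bad <- lfq == 0 & intensity > 0
  lfq[bad] <- if (mode == "toNA") NA else min(lfq[lfq > 0])
  lfq
}

lfq       <- c(0, 5e5, 0, 2e5)
intensity <- c(1e6, 4e5, 0, 3e5)
fix_lfq(lfq, intensity, "toNA")    # first value becomes NA, third stays 0
fix_lfq(lfq, intensity, "impute")  # first value becomes 2e5, third stays 0
```

The third entry is left at zero in both modes because its raw intensity is zero as well, so the LFQ value is not considered erroneous.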
Additional parameters passed on to read.delim()
A data.frame of the respective file
We also correct 'reporter.intensity.*' naming issues to MQ 1.6 convention, when 'reporter.intensity.not.corrected' is present.
MQ 1.5 uses: reporter.intensity.X and reporter.intensity.not.corrected.X
MQ 1.6 uses: reporter.intensity.X and reporter.intensity.corrected.X
Note: you must find a regex which matches both versions, or explicitly add both terms if you are requesting only a subset of columns!
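As an illustration of such a regex, the pattern below matches both the 'not.corrected' and the 'corrected' spelling. The column names in `cn` are made-up examples and `pattern` is only one possible choice.

```r
# Made-up example column names covering both naming conventions
cn <- c("reporter.intensity.0",
        "reporter.intensity.not.corrected.0",
        "reporter.intensity.corrected.0")

# The optional group "(not\\.)?" makes the pattern match either variant
pattern <- "^reporter\\.intensity\\.(not\\.)?corrected\\."
grepl(pattern, cn, ignore.case = TRUE)
# FALSE TRUE TRUE
```

Such a pattern could then be passed as one element of col_subset, alongside a pattern for the plain 'reporter.intensity.X' columns if those are needed as well.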
Example of usage:
mq = MQDataReader$new()
d_evd = mq$readMQ("evidence.txt", type="ev", filter="R", col_subset=c("proteins", "Retention.Length", "retention.time.calibration"))
If the file is empty, this function stops with an error.