After a GC-MS analysis, the chromatograph and the mass spectrometer produce different types of files. Each instrument vendor provides its own proprietary data formats, which should be converted to a common raw data format such as ANDI NetCDF or mzXML. The most commonly used file formats for mass spectral data, i.e. NetCDF, mzXML and ASCII, are accepted by MS.DataCreation. The proprietary format from Agilent Technologies can also be used directly. The detailed structure of the three types of input is given below.
(i) DataType=CDF. Each GC-MS analysis has its own folder, which contains a mass spectrum file in AIA/ANDI NetCDF, mzXML, mzData or mzML format, and a peak list stored in a file named peaklist.txt. Peaklist.txt should have column headings similar to peak/RT/firstscan/maxscan/lastscan/quantification1/quantification2. The first column contains the peak number, the second the retention time in minutes or seconds, the third the first scan of the peak, the fourth the scan at the apex (maxscan) and the fifth the last scan of the peak. Optionally, column 6 holds a quantitative measure of peak size (quantification1) and column 7 another quantitative measure of peak size (quantification2). Only maxscan is used if apex=TRUE in MS.clust. The sample name reported in the output matrix is extracted from the name of the AIA/ANDI files, so all AIA/ANDI files should have different names. All analysis folders should be grouped in one folder.
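The peak-list layout described above can be checked before running the analysis. The following is a minimal Python sketch, not the package's own R code. It assumes a tab-delimited file with exactly the column headings listed above (the text only says "similar to"), and the check that maxscan lies between firstscan and lastscan is an added sanity test, not a documented behaviour of MS.DataCreation. The function name read_peaklist and the example values are hypothetical.

```python
import csv
import io

# Column headings taken from the description above; the last two are optional.
REQUIRED = ["peak", "RT", "firstscan", "maxscan", "lastscan"]
OPTIONAL = ["quantification1", "quantification2"]


def read_peaklist(text, delimiter="\t"):
    """Parse peak-list text into a list of dicts and check its structure.

    The tab delimiter is an assumption; adjust it to match your export.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    peaks = []
    for row in reader:
        peak = {
            "peak": int(row["peak"]),
            "RT": float(row["RT"]),
            "firstscan": int(row["firstscan"]),
            "maxscan": int(row["maxscan"]),
            "lastscan": int(row["lastscan"]),
        }
        for name in OPTIONAL:
            if name in header and row[name] not in (None, ""):
                peak[name] = float(row[name])
        # Added sanity check (assumption): the apex scan lies inside the peak.
        if not peak["firstscan"] <= peak["maxscan"] <= peak["lastscan"]:
            raise ValueError(f"peak {peak['peak']}: maxscan outside scan window")
        peaks.append(peak)
    return peaks


# Made-up example content for a peaklist.txt file.
example = (
    "peak\tRT\tfirstscan\tmaxscan\tlastscan\tquantification1\tquantification2\n"
    "1\t5.12\t100\t104\t110\t15230\t12.5\n"
    "2\t7.48\t240\t245\t252\t8050\t6.6\n"
)
peaks = read_peaklist(example)
```

To check a real file, pass its contents, for example read_peaklist(open("peaklist.txt").read()).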
The function first checks for the presence of the AIA/ANDI and peaklist.txt files, verifies that the m/z range is consistent and checks the structure of the peaklist.txt files. It then collects each peak's retention time from peaklist.txt and looks for the corresponding mass spectra in the CDF files. Depending on the Apex option, either the mean mass spectrum of each peak is calculated or the mass spectrum at the apex is extracted. The intensity of each mass fragment, in counts, is then converted to a percentage of the most intense mass fragment in the spectrum. If quant = TRUE, one or two quantification columns, quantification1 and quantification2, are extracted for each peak from peaklist.txt and placed in columns 3 and 4, respectively, of the output initial_DATA matrix.
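The choice between the mean and the apex spectrum, followed by the conversion to relative percentages, can be illustrated with a short Python sketch. It is not the package's R implementation: the function name peak_spectrum, the dictionary of scans and the decision to average over all scans from firstscan to lastscan inclusive are assumptions made for the example.

```python
def peak_spectrum(scans, firstscan, maxscan, lastscan, apex=False):
    """Return the relative spectrum (percent of the highest fragment) of a peak.

    `scans` maps a scan number to a list of intensities in counts, one per m/z.
    """
    if apex:
        # Use only the scan at the apex of the peak.
        spectrum = list(scans[maxscan])
    else:
        # Average each m/z over all scans of the peak (inclusive window,
        # which is an assumption of this sketch).
        window = [scans[i] for i in range(firstscan, lastscan + 1)]
        spectrum = [sum(values) / len(window) for values in zip(*window)]
    highest = max(spectrum)
    if highest <= 0:
        raise ValueError("spectrum has no positive intensity")
    return [100.0 * value / highest for value in spectrum]


# Three made-up scans with three m/z values each.
scans = {
    10: [0.0, 200.0, 100.0],
    11: [0.0, 800.0, 200.0],
    12: [0.0, 200.0, 300.0],
}
mean_spectrum = peak_spectrum(scans, 10, 11, 12)             # [0.0, 100.0, 50.0]
apex_spectrum = peak_spectrum(scans, 10, 11, 12, apex=True)  # [0.0, 100.0, 25.0]
```

In both cases the most intense fragment is scaled to 100, so spectra from peaks of different sizes become directly comparable.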
(ii) DataType=Agilent. For Agilent Technologies instruments (using the default parameters), each GC-MS analysis returns a .D folder that contains a file rteres.txt with summary information on the chromatogram (analogous to a peak list). All analysis folders should have different names and should be grouped in one folder. The mass spectra should be exported in ANDI NetCDF format. These files can be generated in one step for several selected GC-MS analyses with the Chemstation data analysis software (Menu/File/Export to AIA/ANDI). By default, all CDF files are exported to a single folder, which may be used as pathCDF.
The sample name reported in the output matrix is extracted from the name of the .D folder, so all .D folders should have different names. Each AIA/ANDI file should have the same name as the corresponding .D folder.
The function first checks that every sample folder (.D) within the folder path contains a file rteres.txt and that pathCDF contains all the required CDF files. If a file is missing, the analysis stops and reports the name of the problematic sample. The analysis should be restarted after the sample has been corrected or removed. The function then collects each peak's retention time from rteres.txt and looks for the corresponding mass spectra in the CDF files. Depending on the Apex option, either the mean mass spectrum of each peak is calculated or the mass spectrum at the apex is extracted. The intensity of each mass fragment, in counts, is then converted to a percentage of the most intense mass fragment in the spectrum. If quant = TRUE, the two quantification columns CorrArea (corrected peak area) and PercTot (percent of the total corrected area) are extracted for each peak from rteres.txt and placed in columns 3 and 4, respectively, of the output initial_DATA matrix.
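The consistency check described above can be reproduced before launching the analysis. The Python sketch below is only an illustration, not the package's R code. It assumes that the CDF file carries the same base name as its .D folder with a .CDF extension (compared case-insensitively), and the function name find_incomplete_samples is hypothetical. Unlike the function described above, it lists every problematic sample instead of stopping at the first one.

```python
from pathlib import Path
import tempfile


def find_incomplete_samples(path, path_cdf):
    """Return names of .D folders lacking rteres.txt or a matching CDF file."""
    # Base names of all CDF files (assumed naming: <sample>.CDF).
    cdf_stems = {
        p.stem.lower()
        for p in Path(path_cdf).iterdir()
        if p.suffix.lower() == ".cdf"
    }
    problems = []
    for folder in sorted(Path(path).iterdir()):
        if not (folder.is_dir() and folder.suffix.lower() == ".d"):
            continue
        has_peaks = (folder / "rteres.txt").is_file()
        has_cdf = folder.stem.lower() in cdf_stems
        if not (has_peaks and has_cdf):
            problems.append(folder.name)
    return problems


# Small demonstration on a temporary directory tree.
with tempfile.TemporaryDirectory() as root:
    data = Path(root, "data")
    cdf = Path(root, "cdf")
    data.mkdir()
    cdf.mkdir()
    for name in ("s1", "s2", "s3"):
        (data / f"{name}.D").mkdir()
    (data / "s1.D" / "rteres.txt").write_text("")
    (data / "s2.D" / "rteres.txt").write_text("")  # s2 has no CDF file
    (cdf / "s1.CDF").write_text("")
    (cdf / "s3.CDF").write_text("")                # s3 has no rteres.txt
    incomplete = find_incomplete_samples(data, cdf)
```

Here incomplete contains s2.D and s3.D, the two samples that would need to be corrected or removed.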
(iii) DataType=ASCII. If your GC-MS raw data have been converted into the international ASCII format, all files (one per GC-MS analysis) should be grouped in one folder and first passed through the trans.ASCII function. The trans.ASCII function generates a folder output_date_time containing translated files compatible with MS.DataCreation. This output_date_time folder may be used as path. First, the chromatogram is smoothed according to the option N_filt (see the documentation of the function filter, method=convolution). Peaks are then detected as a succession of three points of increasing intensity directly followed by three points of decreasing intensity (all points must have an intensity higher than 10 kilocounts). The first and last peaks of the chromatogram are removed if they are incomplete. Finally, depending on the Apex option, the function either calculates the mean mass spectrum of each peak or extracts the mass spectrum at the apex, and the intensity of each mass fragment (in counts) is converted to a percentage of the most intense mass fragment in the spectrum.
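The peak detection rule can be sketched in Python as follows. This is one possible reading of the rule, not the package's R code: the apex is taken as the last of the three rising points, and it must be followed by three strictly decreasing points. The function name detect_apexes and the trace values are made up for the example, and the smoothing step is omitted.

```python
def detect_apexes(intensity, threshold=10_000):
    """Return indices of apexes: three rising points, then three falling ones.

    All six points must exceed `threshold` (10 kilocounts by default).
    """
    apexes = []
    for i in range(2, len(intensity) - 3):
        # Six consecutive points: two before the apex, the apex, three after.
        window = intensity[i - 2:i + 4]
        rising = window[0] < window[1] < window[2]
        falling = window[2] > window[3] > window[4] > window[5]
        if rising and falling and min(window) > threshold:
            apexes.append(i)
    return apexes


# Made-up smoothed chromatogram trace, in counts.
trace = [11e3, 12e3, 20e3, 50e3, 90e3, 60e3, 30e3, 15e3, 12e3, 11e3,
         12e3, 40e3, 35e3, 50e3, 20e3, 11e3]
apexes = detect_apexes(trace)  # [4]
```

Only the first, well-formed peak is reported. The irregular bump near the end of the trace is rejected because its intensity does not fall for three consecutive points.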
The output file, initial_DATA.txt, is saved in a folder called Output_MSDataCreation_resultdate_time. It contains the relative mass spectrum of each peak of all samples. The first column contains the sample name (the name of the folder containing the GC-MS analysis), the second column the peak retention time (or retention index), and the following columns the relative mass spectrum of the peak (within the range of the mass spectrum). If quant = TRUE, two columns are inserted after the retention time: the third column contains quantification 1 (corrected area for Agilent) and the fourth contains quantification 2 (percent of the total corrected area for Agilent). The relative mass spectrum then starts in the fifth column.
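The column layout of initial_DATA.txt can be summarised with a small Python sketch. The column labels used here (sample, RT and the m/z values as strings) are hypothetical placeholders, since the actual headings written by MS.DataCreation are not given above, and the m/z range of 50 to 52 is an arbitrary example.

```python
def output_columns(mz_min, mz_max, quant=False):
    """Return illustrative column labels for the initial_DATA matrix."""
    columns = ["sample", "RT"]
    if quant:
        # With quant = TRUE, two quantification columns precede the spectrum.
        columns += ["quantification1", "quantification2"]
    # One column per integer m/z within the range of the mass spectrum.
    columns += [str(mz) for mz in range(mz_min, mz_max + 1)]
    return columns


without_quant = output_columns(50, 52)
with_quant = output_columns(50, 52, quant=True)
```

Without quantification the spectrum starts in the third column. With quant=True it starts in the fifth.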