SQMtools (version 1.7.1)

loadSQM: Load a SqueezeMeta project into R

Description

This function takes the path to a project directory generated by SqueezeMeta (whose name is specified in the -p parameter of the SqueezeMeta.pl script) and parses the results into a SQM object. Alternatively, it can load the project data from a zip file produced by sqm2zip.py.

Usage

loadSQM(
  project_path,
  tax_mode = "prokfilter",
  trusted_functions_only = FALSE,
  single_copy_genes = "MGOGs",
  load_sequences = TRUE,
  engine = "data.table"
)

Value

SQM object containing the parsed project. If more than one path is provided in project_path, this function will return a SQMbunch object instead. The structure of this object is similar to that of a SQMlite object (see loadSQMlite) but with an extra entry named projects that contains one SQM object per input project. SQM and SQMbunch objects will otherwise behave similarly when used with the subset and plot functions from this package.

Arguments

project_path

character, a vector of project directories generated by SqueezeMeta, and/or zip files generated by sqm2zip.py.

tax_mode

character, which taxonomic classification should be loaded? SqueezeMeta applies the identity thresholds described in Luo et al., 2014. Use allfilter for applying the minimum identity threshold to all taxa, prokfilter for applying the threshold to Bacteria and Archaea, but not to Eukaryotes, and nofilter for applying no thresholds at all (default prokfilter).

trusted_functions_only

logical. If TRUE, only highly trusted functional annotations (best hit + best average) will be considered when generating aggregated function tables. If FALSE, best hit annotations will be used (default FALSE). Will only have an effect if project_path is not a zip file, and project_path/results/tables is not already present.

single_copy_genes

character, source of single copy genes for copy number normalization, either RecA (COG0468, RecA/RadA), MGOGs (COGs for 10 single copy and housekeeping genes, Salazar et al., 2019), MGKOs (KOs for 10 single copy and housekeeping genes, Salazar et al., 2019) or USiCGs (KOs for 15 single copy genes, Carr et al., 2013, Table S1). For MGOGs, MGKOs and USiCGs, the median coverage of the set of single copy genes will be used for normalization. Default MGOGs.

load_sequences

logical. If TRUE, contig and ORF sequences will be loaded into the SQM object. Setting it to FALSE will reduce memory usage. Default TRUE.

engine

character. Engine used to load the ORFs and contigs tables. Either data.frame or data.table (significantly faster if your project is large). Default data.table.
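
The normalization described under single_copy_genes can be illustrated with a small base R sketch. This is not the package's own code: the coverage values, the function id FUN_A and the SCG_* ids are made up, and the sketch only assumes that copy numbers are obtained by dividing coverages by the per-sample median coverage of the single copy genes.

```r
# Toy coverage matrix: rows are functions, columns are samples.
# All ids and values are hypothetical and only serve as an illustration.
cov <- matrix(
  c(20, 40,   # FUN_A, a function of interest
    10, 20,   # SCG_1, single copy gene
    12, 18,   # SCG_2, single copy gene
     8, 22),  # SCG_3, single copy gene
  ncol = 2, byrow = TRUE,
  dimnames = list(c("FUN_A", "SCG_1", "SCG_2", "SCG_3"),
                  c("sample1", "sample2"))
)
scg_ids <- c("SCG_1", "SCG_2", "SCG_3")  # placeholders, not the real MGOGs

# Median coverage of the single copy genes, computed separately per sample.
scg_cov <- apply(cov[scg_ids, , drop = FALSE], 2, median)

# Divide every column (sample) by its median single copy gene coverage.
# The result estimates the average number of copies per genome.
copy_number <- sweep(cov, 2, scg_cov, "/")
print(copy_number)
```

Using the median of several genes rather than a single marker makes the estimate less sensitive to one gene with unusual coverage, which is presumably why the multi-gene sets are offered alongside RecA.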

Prerequisites

Run SqueezeMeta! An example call for running it would be:

/path/to/SqueezeMeta/scripts/SqueezeMeta.pl -m coassembly -f fastq_dir -s samples_file -p project_dir
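
Once SqueezeMeta has finished, the project can be loaded from R. The following is a minimal sketch rather than a definitive recipe: project_dir is a placeholder for the directory given to -p, the arguments simply repeat the documented defaults, and the call is guarded so that nothing happens when SQMtools or the directory is missing.

```r
# Placeholder path: replace it with the directory created by SqueezeMeta (-p).
project_dir <- "project_dir"

if (requireNamespace("SQMtools", quietly = TRUE) && dir.exists(project_dir)) {
  library(SQMtools)
  # Arguments are spelled out with their documented defaults for clarity.
  SQM <- loadSQM(
    project_dir,
    tax_mode = "prokfilter",
    trusted_functions_only = FALSE,
    single_copy_genes = "MGOGs",
    load_sequences = TRUE,
    engine = "data.table"
  )
  # A single path is expected to give a SQM object, several paths a SQMbunch.
  print(class(SQM))
} else {
  message("SQMtools or the project directory is not available; nothing loaded.")
}
```

For large projects, passing load_sequences = FALSE should lower memory usage, at the cost of not having the contig and ORF sequences in the object.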

The SQM object structure

The SQM object is a nested list which contains the following information:

| lvl1 | lvl2 | lvl3 | type | rows/names | columns | data |
|---|---|---|---|---|---|---|
| $orfs | $table | | dataframe | orfs | misc. data | misc. data |
| | $abund | | numeric matrix | orfs | samples | abundances (reads) |
| | $bases | | numeric matrix | orfs | samples | abundances (bases) |
| | $cov | | numeric matrix | orfs | samples | coverages |
| | $cpm | | numeric matrix | orfs | samples | covs. / 10^6 reads |
| | $tpm | | numeric matrix | orfs | samples | tpm |
| | $seqs | | character vector | orfs | (n/a) | sequences |
| | $tax | | character matrix | orfs | tax. ranks | taxonomy |
| | $tax16S | | character vector | orfs | (n/a) | 16S rRNA taxonomy |
| | $markers | | list | orfs | (n/a) | CheckM1 markers |
| $contigs | $table | | dataframe | contigs | misc. data | misc. data |
| | $abund | | numeric matrix | contigs | samples | abundances (reads) |
| | $bases | | numeric matrix | contigs | samples | abundances (bases) |
| | $cov | | numeric matrix | contigs | samples | coverages |
| | $cpm | | numeric matrix | contigs | samples | covs. / 10^6 reads |
| | $tpm | | numeric matrix | contigs | samples | tpm |
| | $seqs | | character vector | contigs | (n/a) | sequences |
| | $tax | | character matrix | contigs | tax. ranks | taxonomies |
| | $bins | | character matrix | contigs | bin. methods | bins |
| $bins | $table | | dataframe | bins | misc. data | misc. data |
| | $length | | numeric vector | bins | (n/a) | length |
| | $abund | | numeric matrix | bins | samples | abundances (reads) |
| | $percent | | numeric matrix | bins | samples | percentages |
| | $bases | | numeric matrix | bins | samples | abundances (bases) |
| | $cov | | numeric matrix | bins | samples | coverages |
| | $cpm | | numeric matrix | bins | samples | covs. / 10^6 reads |
| | $tax | | character matrix | bins | tax. ranks | taxonomy |
| | $tax_gtdb | | character matrix | bins | tax. ranks | GTDB taxonomy |
| $taxa | $superkingdom | $abund | numeric matrix | superkingdoms | samples | abundances (reads) |
| | | $percent | numeric matrix | superkingdoms | samples | percentages |
| | $phylum | $abund | numeric matrix | phyla | samples | abundances (reads) |
| | | $percent | numeric matrix | phyla | samples | percentages |
| | $class | $abund | numeric matrix | classes | samples | abundances (reads) |
| | | $percent | numeric matrix | classes | samples | percentages |
| | $order | $abund | numeric matrix | orders | samples | abundances (reads) |
| | | $percent | numeric matrix | orders | samples | percentages |
| | $family | $abund | numeric matrix | families | samples | abundances (reads) |
| | | $percent | numeric matrix | families | samples | percentages |
| | $genus | $abund | numeric matrix | genera | samples | abundances (reads) |
| | | $percent | numeric matrix | genera | samples | percentages |
| | $species | $abund | numeric matrix | species | samples | abundances (reads) |
| | | $percent | numeric matrix | species | samples | percentages |
| $functions | $KEGG | $abund | numeric matrix | KEGG ids | samples | abundances (reads) |
| | | $bases | numeric matrix | KEGG ids | samples | abundances (bases) |
| | | $cov | numeric matrix | KEGG ids | samples | coverages |
| | | $cpm | numeric matrix | KEGG ids | samples | covs. / 10^6 reads |
| | | $tpm | numeric matrix | KEGG ids | samples | tpm |
| | | $copy_number | numeric matrix | KEGG ids | samples | avg. copies |
| | $COG | $abund | numeric matrix | COG ids | samples | abundances (reads) |
| | | $bases | numeric matrix | COG ids | samples | abundances (bases) |
| | | $cov | numeric matrix | COG ids | samples | coverages |
| | | $cpm | numeric matrix | COG ids | samples | covs. / 10^6 reads |
| | | $tpm | numeric matrix | COG ids | samples | tpm |
| | | $copy_number | numeric matrix | COG ids | samples | avg. copies |
| | $PFAM | $abund | numeric matrix | PFAM ids | samples | abundances (reads) |
| | | $bases | numeric matrix | PFAM ids | samples | abundances (bases) |
| | | $cov | numeric matrix | PFAM ids | samples | coverages |
| | | $cpm | numeric matrix | PFAM ids | samples | covs. / 10^6 reads |
| | | $tpm | numeric matrix | PFAM ids | samples | tpm |
| | | $copy_number | numeric matrix | PFAM ids | samples | avg. copies |
| $total_reads | | | numeric vector | samples | (n/a) | total reads |
| $misc | $project_name | | character vector | (empty) | (n/a) | project name |
| | $samples | | character vector | (empty) | (n/a) | samples |
| | $tax_names_long | $superkingdom | character vector | short names | (n/a) | full names |
| | | $phylum | character vector | short names | (n/a) | full names |
| | | $class | character vector | short names | (n/a) | full names |
| | | $order | character vector | short names | (n/a) | full names |
| | | $family | character vector | short names | (n/a) | full names |
| | | $genus | character vector | short names | (n/a) | full names |
| | | $species | character vector | short names | (n/a) | full names |
| | $tax_names_short | | character vector | full names | (n/a) | short names |
| | $KEGG_names | | character vector | KEGG ids | (n/a) | KEGG names |
| | $KEGG_paths | | character vector | KEGG ids | (n/a) | KEGG hierarchy |
| | $COG_names | | character vector | COG ids | (n/a) | COG names |
| | $COG_paths | | character vector | COG ids | (n/a) | COG hierarchy |
| | $ext_annot_sources | | character vector | COG ids | (n/a) | external databases |

If external databases for functional classification were provided to SqueezeMeta via the -extdb argument, the corresponding abundance (reads and bases), coverages, tpm and copy number profiles will be present in SQM$functions (e.g. results for the CAZy database would be present in SQM$functions$CAZy). Additionally, the extended names of the features present in the external database will be present in SQM$misc (e.g. SQM$misc$CAZy_names).
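
Because the SQM object is a nested list, its entries can be reached by chaining `$`. The sketch below uses a tiny hand-built stand-in rather than a real project: the object mock, its row ids and all its values are made up, and only the entry names follow the layout described above (including the CAZy example).

```r
# A tiny stand-in that mimics part of the documented layout.
# It is NOT produced by loadSQM; every id and value here is hypothetical.
samples <- c("sample1", "sample2")
mock <- list(
  functions = list(
    KEGG = list(abund = matrix(c(5, 3, 2, 7), nrow = 2,
                               dimnames = list(c("K00001", "K00002"), samples))),
    CAZy = list(abund = matrix(c(1, 4), nrow = 1,
                               dimnames = list("GH13", samples)))
  ),
  misc = list(
    samples = samples,
    CAZy_names = c(GH13 = "hypothetical extended name")
  )
)

# Entries are reached by chaining `$`, as in SQM$functions$KEGG$abund.
kegg_abund <- mock$functions$KEGG$abund

# External databases show up as extra entries next to KEGG, COG and PFAM.
extra_dbs <- setdiff(names(mock$functions), c("KEGG", "COG", "PFAM"))

# Extended feature names live in misc under "<database>_names".
extra_names <- mock$misc[[paste0(extra_dbs[1], "_names")]]

print(extra_dbs)
print(extra_names)
```

With a real project, the same pattern applied to names(SQM$functions) would be one way to check which external databases were loaded.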