read.sampleInfo: Read a sample information file and format appropriate metadata.

Description

Given a sample information file, the function checks whether it includes the information required to process the samples present on each sector/quadrant/region/lane. The function also adds any other columns required for processing, filled with default values, if they are not already defined.

Usage

read.sampleInfo(sampleInfoPath = NULL, splitBySector = TRUE,
  interactive = TRUE)

Arguments

sampleInfoPath

full or relative path to the sample information file, which holds the sample to quadrant/lane associations along with other metadata required to trim or process the sequences.

splitBySector

split the data frame into a list by sector column. Default is TRUE.

interactive

whether to prompt each time the function encounters an issue, or use the defaults. Default is TRUE.

Value

if splitBySector=TRUE, then a SimpleList object named by the quadrant/lane information defined in the sampleInfo file, else a data frame.
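
The shape of the two return types can be pictured with a small base-R sketch. This is only an analogy: it uses `split()` on a made-up data frame and yields a plain list, whereas the function itself is documented to return a SimpleList. The column values are invented for illustration.

```r
# Toy sample sheet; values are made up for illustration.
sampleInfo <- data.frame(
  sector = c("1", "1", "2"),
  sampleName = c("sampleA", "sampleB", "sampleC"),
  stringsAsFactors = FALSE
)

# Base-R analogue of the per-sector result: one data frame per sector value,
# with list names taken from the sector column.
bySector <- split(sampleInfo, sampleInfo$sector)

names(bySector)        # sector labels become the list names
nrow(bySector[["1"]])  # number of samples on sector "1"
```

With splitBySector=FALSE the result would correspond to the unsplit `sampleInfo` data frame above.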

Details

Required Column Description:
- sector => region/quadrant/lane of the sequencing plate the sample comes from. If files have been split by sample a priori, then use the filename associated with each sample, without the extension. If this is a filename, then be sure to enable the 'alreadyDecoded' parameter in findBarcodes, since the contents of this column are pasted together with the 'seqfilePattern' parameter in read.SeqFolder to find the appropriate file. For paired-end data, this is the basename of the FASTA/Q file holding the sample data from the LTR side. For example, files such as Lib3_L001_R2_001.fastq.gz or Lib3_L001_R2_001.fastq would be entered as Lib3_L001_R2_001, and consequently Lib3_L001_R1_001 would be used as the second pair.
- barcode => unique 4-12bp DNA sequence which identifies the sample. If providing filename as sector, then leave this blank since it is assumed that the data is already demultiplexed.
- primerltrsequence => DNA sequence of the viral LTR primer with/without the viral LTR sequence following the primer landing site. If already trimmed, then mark this as SKIP.
- sampleName => Name of the sample associated with the barcode
- sampleDescription => Detailed description of the sample
- gender => sex of the sample: male or female or NA
- species => species of the sample: Homo sapiens, Mus musculus, etc.
- freeze => UCSC freeze to which the sample should be aligned.
- linkerSequence => DNA sequence of the linker adaptor following the genomic sequence. If already trimmed, then mark this as SKIP.
- restrictionEnzyme => Restriction enzyme used for digestion and sample recovery. Can also be one of: Fragmentase or Sonication!
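
As a rough pre-flight check before calling the function, the required columns listed above can be verified with base R. The helper `missingRequiredCols` is hypothetical and not part of the package, and the exact-name matching it uses is an assumption. The second half sketches how a sector value could be derived from a file name, following the Lib3 example in the sector description; the regular expression is likewise an assumption.

```r
# Required column names, copied from the list above.
requiredCols <- c("sector", "barcode", "primerltrsequence", "sampleName",
                  "sampleDescription", "gender", "species", "freeze",
                  "linkerSequence", "restrictionEnzyme")

# Hypothetical helper (not a package function): report which required
# columns are absent from a sample sheet, using exact name matching.
missingRequiredCols <- function(sampleInfo) {
  setdiff(requiredCols, names(sampleInfo))
}

# Made-up, incomplete sample sheet.
sheet <- data.frame(sector = "1", barcode = "ACGTACGT",
                    sampleName = "sampleA", stringsAsFactors = FALSE)
missing <- missingRequiredCols(sheet)

# Deriving a sector value from already-demultiplexed files:
# strip the .fastq or .fastq.gz extension to get the basename.
files <- c("Lib3_L001_R2_001.fastq.gz", "Lib3_L001_R2_001.fastq")
sectors <- sub("\\.fastq(\\.gz)?$", "", files)

# Name of the corresponding second-pair file, as in the example above.
mates <- sub("_R2_", "_R1_", sectors, fixed = TRUE)
```
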
Metadata Parameter Column Description:
- ltrBitSequence => DNA sequence of the viral LTR following the primer landing site. Default is last 7bps of the primerltrsequence.
- ltrBitIdentity => percent of LTR bit sequence to match during the alignment. Default is 1.
- primerLTRidentity => percent of primer to match during the alignment. Default is 0.85.
- linkerIdentity => percent of linker sequence to match during the alignment. Default is 0.55. Only applies to non-primerID/random tag based linker search.
- primerIdInLinker => whether the linker adaptor contains a primerID/random tag. Default is FALSE.
- primerIdInLinkerIdentity1 => percent of sequence to match before the random tag. Default is 0.75. Only applies to primerID/random tag based linker search and when primerIdInLinker is TRUE.
- primerIdInLinkerIdentity2 => percent of sequence to match after the random tag. Default is 0.50. Only applies to primerID/random tag based linker search and when primerIdInLinker is TRUE.
- celltype => celltype information associated with the sample
- user => name of the user who prepared or processed the sample
- pairedEnd => is the data paired end? Default is FALSE.
- vectorFile => fasta file containing the vector sequence
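
The defaulting behaviour described for the metadata parameter columns can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's own code: `fillMetadataDefaults` is a hypothetical helper, the primer sequence is made up, and only the columns with a documented default are handled.

```r
# Defaults copied from the column descriptions above.
metadataDefaults <- list(
  ltrBitIdentity = 1,
  primerLTRidentity = 0.85,
  linkerIdentity = 0.55,
  primerIdInLinker = FALSE,
  primerIdInLinkerIdentity1 = 0.75,
  primerIdInLinkerIdentity2 = 0.50,
  pairedEnd = FALSE
)

# Hypothetical helper: add each column with its default only when the
# sample sheet does not already define it.
fillMetadataDefaults <- function(sampleInfo) {
  for (col in names(metadataDefaults)) {
    if (!col %in% names(sampleInfo)) {
      sampleInfo[[col]] <- metadataDefaults[[col]]
    }
  }
  # ltrBitSequence defaults to the last 7 bases of primerltrsequence.
  if (!"ltrBitSequence" %in% names(sampleInfo)) {
    primer <- sampleInfo$primerltrsequence
    sampleInfo$ltrBitSequence <- substr(primer, nchar(primer) - 6, nchar(primer))
  }
  sampleInfo
}

# Made-up sheet that already sets linkerIdentity, so that value is kept.
sheet <- data.frame(primerltrsequence = "GAAAATCTCTAGCA",
                    linkerIdentity = 0.6, stringsAsFactors = FALSE)
filled <- fillMetadataDefaults(sheet)
filled$ltrBitSequence  # last 7 bases of the primer
```
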
Processing Parameter Column Description:
- startWithin => upper bound limit of where the alignment should start within the query. Default is 3.
- alignRatioThreshold => cutoff for (alignment span/read length). Default is 0.7.
- genomicPercentIdentity => cutoff for (1-(misMatches/matches)). Default is 0.98.
- clusterSitesWithin => cluster integration sites within a defined window size based on frequency which corrects for any sequencing errors. Default is 5.
- keepMultiHits => whether to keep sequences/reads that return multiple best hits, aka ambiguous locations.
- processingDate => the date of processing
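
To make the three numeric cutoffs concrete, the sketch below applies the formulas given above to a made-up table of alignments. The column names (`qStart`, `alignSpan`, and so on) are illustrative rather than the package's own, and treating the cutoffs as inclusive is an assumption.

```r
# Made-up alignment records for three reads.
hits <- data.frame(
  qStart     = c(1, 2, 10),     # where the alignment starts in the query
  alignSpan  = c(95, 60, 90),   # aligned length
  readLength = c(100, 100, 100),
  matches    = c(94, 59, 89),
  misMatches = c(1, 1, 1)
)

# Quantities named in the column descriptions.
alignRatio      <- hits$alignSpan / hits$readLength   # alignRatioThreshold
percentIdentity <- 1 - hits$misMatches / hits$matches # genomicPercentIdentity

# Apply the documented defaults: startWithin = 3, alignRatioThreshold = 0.7,
# genomicPercentIdentity = 0.98.
keep <- hits$qStart <= 3 & alignRatio >= 0.7 & percentIdentity >= 0.98
keep
```

Only the first read passes all three filters here: the second falls below the alignment ratio cutoff, and the third starts too far into the query.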

Examples

runData <- system.file("extdata/FLX_sample_run",
                       package = "hiReadsProcessor")
read.sampleInfo(file.path(runData, "sampleInfo.xls"))
