qAlign: Align reads

Description

Create read alignments against reference genome and optional auxiliary targets if not yet existing. If necessary, also build target indices for the aligner.

Usage

qAlign(sampleFile,  genome,  auxiliaryFile=NULL,  aligner="Rbowtie",  maxHits=1,  paired=NULL,  splicedAlignment=FALSE,  snpFile=NULL,  bisulfite="no",  alignmentParameter=NULL,  projectName="qProject",  alignmentsDir=NULL,  lib.loc=NULL,  cacheDir=NULL,  clObj=NULL, checkOnly=FALSE)

Arguments

sampleFile

the name of a text file listing input sequence files and sample names (see ‘Details’).

genome

the reference genome for primary alignments, one of:

a string referring to a “BSgenome” package (e.g. "BSgenome.Hsapiens.UCSC.hg19"), which will be downloaded automatically from Bioconductor if not present
the name of a fasta sequence file containing one or several sequences (chromosomes) to be used as a reference. The aligner index will be created when necessary and stored in a default location (see ‘Details’).

auxiliaryFile

the name of a text file listing sequences to be used as additional targets for alignment of reads not mapping to the reference genome (see ‘Details’).

aligner

selects the aligner program to be used for aligning the reads. Currently, only “Rbowtie” is supported, which is an R wrapper package for ‘bowtie’ and ‘SpliceMap’ (see Rbowtie package).

maxHits

sets the maximal number of allowed mapping positions per read (default: 1). If a read produces more than maxHits alignments, no alignments will be reported for it. In case of a multi-mapping read, a single alignment is randomly selected.

paired

defines the type of paired-end library and can be set to one of no (single read experiment, default), fr (fw/rev), ff (fw/fw) or rf (rev/fw).

splicedAlignment

if TRUE, reads will be aligned by SpliceMap to produce spliced alignments (without using a database of known exon-exon junctions). Using splicedAlignment=TRUE will increase alignment times roughly by a factor of ten. The option can only be used for reads with a minimal length of 50nt; SpliceMap ignores shorter reads. Such short reads will not be contained in the BAM file, neither as mapped nor as unmapped reads.

snpFile

the name of a text file listing single nucleotide polymorphisms to be used for allele-specific alignment and quantification (see ‘Details’).

bisulfite

for bisulfite-converted samples (Bis-seq), the type of bisulfite library (“dir” for directional libraries, “undir” for undirectional libraries).

alignmentParameter

an optional string containing command line parameters to be used for the aligner, to overrule the default alignment parameters used by QuasR. Please use with caution; some alignment parameters may break assumptions made by QuasR. Default parameters are listed in ‘Details’.

projectName

an optional name for the alignment project.

alignmentsDir

the directory to be used for storing alignments (bam files). If set to NULL (default), bam files will be generated at the location of the input sequence files.

lib.loc

can be used to change the default library path of R. The library path is used by QuasR to store aligner index packages created from BSgenome reference genomes, or to install newly downloaded BSgenome packages.

cacheDir

specifies the location to store (potentially huge) temporary files. If set to NULL (default), the temporary directory of the current R session as returned by tempdir() will be used.

clObj

a cluster object, created by the package parallel, to enable parallel processing and speed up the alignment process.

checkOnly

if TRUE, prevents the automatic creation of alignments or aligner indices. This makes it possible to quickly check for missing alignment files without starting the potentially long process of their creation. In the case of missing alignments or indices, an exception is thrown.
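The checkOnly behaviour can be wrapped in a quick pre-flight test. The following is a minimal sketch, assuming that the exception is signalled as an ordinary R error condition. The helper name alignments_present and its align_fun argument are hypothetical and not part of QuasR.

```r
# Hypothetical helper (not part of QuasR): returns TRUE if all alignments and
# indices already exist, FALSE if the call with checkOnly = TRUE raises an error.
# align_fun defaults to QuasR::qAlign; because of lazy evaluation, QuasR is
# only needed when that default is actually used.
alignments_present <- function(sampleFile, genome, ...,
                               align_fun = QuasR::qAlign) {
    tryCatch({
        align_fun(sampleFile, genome, ..., checkOnly = TRUE)
        TRUE
    }, error = function(e) {
        message("Missing alignments or indices: ", conditionMessage(e))
        FALSE
    })
}
```

With QuasR installed, a call such as alignments_present("samples.txt", "genome.fa") (both file names are placeholders) would report whether a subsequent qAlign run has anything left to create.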

Value

A qProject object.

Details

Before generating new alignments, qAlign looks for previously generated alignments as well as for an aligner index. If no aligner index exists, it will be automatically created and stored in the same directory as the provided fasta file, or as an R package in the case of a BSgenome reference. The name of this R package will be the same as the BSgenome package name, with an additional suffix from the aligner (e.g. BSgenome.Hsapiens.UCSC.hg19.Rbowtie).

The generated bam files contain both aligned and unaligned reads. For paired-end samples, by default no alignments will be reported for read pairs where only one of the reads could be aligned.

sampleFile is a tab-delimited text file listing all the input sequences to be included in a given analysis. The file has either two (single-end) or three (paired-end) columns. The first row contains the column names, and each additional row contains the relative or absolute path and name of the input sequence file(s), as well as the corresponding sample name. Three input file formats are supported (fastq, fasta and bam). All input files in one sampleFile need to be in the same format, and are recognized by their extension (.fq, .fastq, .fa, .fasta, .fna, .bam), in raw or compressed form (e.g. .fastq.gz). If bam files are provided, then no alignments are generated by qAlign, and the alignments contained in the bam files will be used instead.
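As a rough illustration of extension-based format recognition, the sketch below classifies file names in base R. It is not QuasR's internal code: the helper guess_format is hypothetical, and the set of compression suffixes (gz, bz2, xz) is an assumption, since the text only names .gz as an example.

```r
# Illustrative only: classify input files by extension, optionally followed
# by a compression suffix (assumed here to be gz, bz2 or xz).
guess_format <- function(filenames) {
    base <- sub("\\.(gz|bz2|xz)$", "", filenames)   # strip compression suffix
    ext <- tolower(sub("^.*\\.", "", base))          # text after the last dot
    fmt <- c(fq = "fastq", fastq = "fastq",
             fa = "fasta", fasta = "fasta", fna = "fasta",
             bam = "bam")
    unname(fmt[ext])                                 # NA for unknown extensions
}

files <- c("chip_1_1.fq.bz2", "reads.fasta", "aligned.bam")
formats <- guess_format(files)

# All files listed in one sampleFile must share a single format
same_format <- length(unique(formats)) == 1
```

A check like same_format can catch a mixed sampleFile before an alignment run is started.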

The column names in sampleFile have to match those in the examples below, for a single-read experiment:

FileName	SampleName
chip_1_1.fq.bz2	Sample1

and for a paired-end experiment:

FileName1	FileName2	SampleName
rna_1_1.fq.bz2	rna_1_2.fq.bz2	Sample1

The “SampleName” column is the human-readable name for each sample that will be used as the sample label. Multiple sequence files may be associated with the same sample name, which instructs QuasR to combine those files.
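A sampleFile in this layout can be produced with base R. The sketch below reuses the file names from the examples above and writes to temporary paths; the helper write_sample_file is hypothetical and not part of QuasR.

```r
# Single-read and paired-end sample tables with the required column names
single <- data.frame(FileName = "chip_1_1.fq.bz2",
                     SampleName = "Sample1")
paired <- data.frame(FileName1 = "rna_1_1.fq.bz2",
                     FileName2 = "rna_1_2.fq.bz2",
                     SampleName = "Sample1")

# Hypothetical helper: tab-delimited, header row, no quoting, no row names
write_sample_file <- function(df, path) {
    write.table(df, file = path, sep = "\t", quote = FALSE,
                row.names = FALSE, col.names = TRUE)
    invisible(path)
}

single_file <- write_sample_file(single, tempfile(fileext = ".txt"))
paired_file <- write_sample_file(paired, tempfile(fileext = ".txt"))

# Read the files back to confirm the layout
single_cols <- colnames(read.delim(single_file))
paired_cols <- colnames(read.delim(paired_file))
```

The resulting path (for example single_file) is what would be passed as the sampleFile argument.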

auxiliaryFile is a tab-delimited text file listing one or several additional target sequence files in fasta format. Reads that do not map against the reference genome will be aligned against each of these target sequence files. The first row contains the column names, which have to match those in the example below:

FileName	AuxName

snpFile is a tab-delimited text file without a header and contains four columns with chromosome name, position, reference allele and alternative allele, as in the example below:

chr1	8596	G	A
chr1	18443	G	A
chr1	18981	C	T

The reference and alternative alleles will be injected into the reference genome, resulting in two separate genomes. All reads will be aligned separately to both of these genomes, and the alignments will be combined, only retaining the best alignment for each read. In the final alignment, each read will be marked with a tag that classifies it as reference (R), alternative (A) or unknown (U), the latter if the read maps equally well to both genomes.
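A snpFile with the four columns described above can be written and sanity-checked in base R. This is a minimal sketch using the three example SNPs; the column names in the data frame are for readability only and are not written to the file, and the checks are illustrative rather than QuasR's own validation.

```r
snps <- data.frame(chrom = c("chr1", "chr1", "chr1"),
                   pos   = c(8596L, 18443L, 18981L),
                   ref   = c("G", "G", "C"),
                   alt   = c("A", "A", "T"))

snp_file <- tempfile(fileext = ".txt")
# Tab-delimited and without a header line
write.table(snps, file = snp_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# Read back with explicit column classes, so that an allele column
# consisting only of "T" is not converted to logical TRUE
check <- read.delim(snp_file, header = FALSE,
                    colClasses = c("character", "integer",
                                   "character", "character"))
bases <- c("A", "C", "G", "T")
snps_ok <- ncol(check) == 4 &&
    all(check[[3]] %in% bases) &&
    all(check[[4]] %in% bases) &&
    all(check[[3]] != check[[4]])
```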

If bisulfite is set to “dir” or “undir”, reads will be C-to-T converted and aligned to a similarly converted genome.
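The idea behind the conversion can be shown on a toy example. The sketch below is not QuasR's implementation; the sequences are made up, only the forward strand is considered, and any differences between “dir” and “undir” libraries are not covered.

```r
# Made-up genome; the sequenced read stems from positions 3 to 13
genome <- "TTACGTTCGACCATT"

# After bisulfite treatment, unmethylated Cs are read as T. In this toy read
# only the C at read position 6 was methylated and therefore kept.
read <- "ATGTTCGATTA"

# In-silico C-to-T conversion of both read and genome
read_ct   <- chartr("C", "T", read)
genome_ct <- chartr("C", "T", genome)

# The original read is not found in the original genome ...
raw_hit <- regexpr(read, genome, fixed = TRUE)
# ... but the converted read is found in the converted genome
ct_hit <- regexpr(read_ct, genome_ct, fixed = TRUE)
```

Here raw_hit is -1 (no match) and ct_hit is 3: after conversion the read can be placed regardless of its methylation state.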

If alignmentParameter is NULL (recommended), qAlign will select default parameters that are suitable for the experiment type. Please note that for bisulfite or allele-specific experiments, each read is aligned multiple times, and the resulting alignments need to be combined. This requires special settings for the alignment parameters, and changing them is not recommended. For ‘simple’ experiments (neither bisulfite, allele-specific, nor spliced), alignments are generated using the parameters -m maxHits --best --strata. This will align reads with up to “maxHits” best hits in the genome and select one of those hits randomly.
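The effect of maxHits on ‘simple’ experiments can be modelled with a small sketch. Both helpers below are hypothetical and only mimic the rule described in the text; the actual selection is performed by the aligner, not by R code like this.

```r
# Hypothetical helper: build the parameter string quoted above
default_params <- function(maxHits = 1) {
    sprintf("-m %d --best --strata", as.integer(maxHits))
}

# Toy model of the reporting rule. 'hits' holds the positions of the best
# alignments found for one read.
report_alignment <- function(hits, maxHits = 1) {
    if (length(hits) == 0 || length(hits) > maxHits) {
        return(NA_integer_)              # no alignment reported for this read
    }
    hits[sample.int(length(hits), 1)]    # one of the hits, chosen at random
}

default_params(3)
report_alignment(c(10L, 20L, 30L), maxHits = 2)  # too many hits
report_alignment(c(5L, 9L), maxHits = 2)         # either 5 or 9
```

In this model a read with three equally good hits is suppressed when maxHits is 2, while a read with two hits is assigned to one of them at random.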

Examples

## Not run: 
#     # see qCount, qMeth and qProfile manual pages for examples
#     example(qCount)
#     example(qMeth)
#     example(qProfile)
# ## End(Not run)