sequenza (version 2.1.2)

read.seqz: Read an seqz or acgt format file

Description

Efficiently reads an seqz or acgt file into R.

Usage

read.seqz(file, nrows = -1, fast = FALSE, gz = TRUE, header = TRUE,
            colClasses = c("character", "integer", "character", "integer",
            "integer", "numeric", "numeric", "numeric", "character",
            "numeric", "numeric", "character", "character", "character"),
            chr.name = NULL, n.lines = NULL, ...)

read.acgt(file, colClasses = c("character", "integer", "character", "integer", "integer", "integer", "integer", "integer", "character"), ...)

Arguments

file

file name

nrows

number of rows to read from the file. Default is -1 (all rows).

fast

logical. If TRUE the file will be pre-parsed to count the number of rows; on some systems this can speed up the file reading.

gz

logical. If TRUE (the default) the function expects a gzipped file.

header

logical, indicating whether the file contains the names of the variables as its first line.

colClasses

character. A vector of classes to be assumed for the columns. The defaults match the seqz format (for read.seqz) and the acgt format (for read.acgt).

chr.name

if specified, only the selected chromosome will be extracted instead of the entire file.

n.lines

vector of length 2 specifying the first and last line to read from the file. If specified, only the selected portion of the file will be used. Requires the sed UNIX utility.

...

any arguments accepted by read.delim. For read.acgt, also any arguments accepted by read.seqz.
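As a rough illustration of what nrows, header and colClasses control, the sketch below uses only base R: it writes a small gzipped, tab-separated file and reads it back with read.delim. The toy table and its four columns are invented for the example; this is not the implementation of read.seqz.

```r
# Toy table with made-up values (assumption: not a real seqz file).
toy <- data.frame(
  chromosome = c("1", "1", "X"),
  position   = c(100L, 101L, 500L),
  base.ref   = c("A", "C", "G"),
  depth      = c(30L, 28L, 41L),
  stringsAsFactors = FALSE
)

# Write it as a gzipped, tab-separated file with a header line.
path <- tempfile(fileext = ".txt.gz")
con <- gzfile(path, "w")
write.table(toy, con, sep = "\t", quote = FALSE, row.names = FALSE)
close(con)

# colClasses fixes the type of every column up front, so chromosome "1"
# stays a character string; nrows = 2 stops after two data rows.
first.two <- read.delim(gzfile(path), header = TRUE, nrows = 2,
                        colClasses = c("character", "integer",
                                       "character", "integer"))
str(first.two)
```

Declaring the column classes avoids type guessing, which matters for columns such as chromosome that mix numeric-looking and text values.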

Format

seqz is a tab-separated text file with column headers. The file currently has 14 columns. The first 3 columns are derived from the original pileup file and contain:

chromosome

with the chromosome name

position

with the base position

base.ref

with the base in the reference genome used (usually hg19). Note that base.ref is NOT necessarily the base in the normal specimen.

The remaining 11 columns contain the following information:

depth.normal

read depth observed in the normal sample

depth.tumor

read depth observed in the tumor sample

depth.ratio

ratio of depth.tumor to depth.normal

Af

A-allele frequency observed in the tumor sample

Bf

B-allele frequency observed in the tumor sample in heterozygous positions

zygosity.normal

zygosity of the normal (reference) sample. "hom" corresponds to AA or BB, whereas "het" corresponds to AB or BA

GC.percent

GC-content (percent), calculated from the reference genome in fixed nucleotide windows

good.reads

number of reads that passed the quality threshold (threshold specified in the pre-processing software), in the tumor specimen

AB.normal

base(s) found in the germline sample; for heterozygous positions AB are sorted using the values of Af and Bf respectively

AB.tumor

base(s) found in the tumor sample that are not present in the normal specimen. The field includes all the variants found in the tumor alignment, separated by a colon. Each variant contains the base and the observed frequency

tumor.strand

frequency of the variant nucleotides detected in the forward orientation. The field has the same structure as AB.tumor and gives, for each variant, the fraction of the reads carrying that variant that are oriented in the forward direction
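To make the relationship between some of these columns concrete, here is a minimal sketch on a hand-made data frame. The values are invented, only a subset of the 14 columns is included, and the plain division shown is simply the ratio described above; the pre-processing software may round or otherwise adjust the stored values.

```r
# Toy subset of seqz columns (assumption: invented values for illustration).
seqz.toy <- data.frame(
  chromosome      = c("1", "1", "1"),
  position        = c(1000L, 1050L, 1100L),
  depth.normal    = c(40L, 50L, 20L),
  depth.tumor     = c(60L, 50L, 10L),
  Af              = c(0.6, 1.0, 0.7),
  Bf              = c(0.4, 0.0, 0.3),
  zygosity.normal = c("het", "hom", "het"),
  stringsAsFactors = FALSE
)

# depth.ratio is the tumor depth divided by the normal depth.
seqz.toy$depth.ratio <- seqz.toy$depth.tumor / seqz.toy$depth.normal

# Bf is only informative at positions that are heterozygous in the normal.
het <- seqz.toy[seqz.toy$zygosity.normal == "het", ]
het[, c("position", "depth.ratio", "Bf")]
```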

The acgt file format is similar to the seqz format, but contains only 9 columns. The first 3 are the same as in the seqz file, derived from the pileup format. The remaining 6 columns contain the following information:

read.depth

read depth. The column is derived from the pileup file

A

number of times A was observed among the reads that were above the quality threshold

C

number of times C was observed among the reads that were above the quality threshold

G

number of times G was observed among the reads that were above the quality threshold

T

number of times T was observed among the reads that were above the quality threshold

strand

string indicating the frequencies of reads in the forward strand for A, C, G and T, respectively, separated by ":".
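The per-base counts and the colon-separated strand string can be unpacked with base R. The sketch below works on an invented two-row data frame that mimics the acgt columns described above; the values and the object names (acgt.toy, base.freq, strand.fwd) are assumptions made for the example.

```r
# Toy acgt-like rows (assumption: invented values for illustration).
acgt.toy <- data.frame(
  read.depth = c(20L, 15L),
  A = c(18L, 0L), C = c(0L, 7L), G = c(2L, 8L), T = c(0L, 0L),
  strand = c("0.5:0:1:0", "0:0.43:0.5:0"),
  stringsAsFactors = FALSE
)

# Per-position base frequencies among the reads above the quality threshold.
counts <- as.matrix(acgt.toy[, c("A", "C", "G", "T")])
base.freq <- counts / rowSums(counts)

# Split the strand string on ":" into one numeric column per base (A, C, G, T).
strand.fwd <- do.call(rbind,
                      lapply(strsplit(acgt.toy$strand, ":", fixed = TRUE),
                             as.numeric))
colnames(strand.fwd) <- c("A", "C", "G", "T")

base.freq
strand.fwd
```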

Details

read.seqz provides efficient access to a file by chromosome or by line number. The content of seqz and acgt files is described in the Format section.
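The idea of reading only a range of lines can be sketched in base R with the skip and nrows arguments of read.delim. This is an illustration only, not how read.seqz is implemented (the n.lines argument relies on sed). The helper read.line.range is hypothetical, and it counts physical file lines with the header as line 1, which may differ from the convention used by n.lines.

```r
# Toy tab-separated file: three chromosomes, three positions each (invented values).
path <- tempfile(fileext = ".txt")
toy <- data.frame(chromosome = rep(c("1", "2", "X"), each = 3),
                  position   = rep(c(10L, 20L, 30L), times = 3),
                  stringsAsFactors = FALSE)
write.table(toy, path, sep = "\t", quote = FALSE, row.names = FALSE)

# Hypothetical helper: read file lines first..last (header = line 1).
read.line.range <- function(file, first, last, ...) {
  # Recover the column names from the header line.
  col.names <- names(read.delim(file, nrows = 1))
  # Skip everything before 'first', then read only the requested lines.
  read.delim(file, header = FALSE, skip = first - 1,
             nrows = last - first + 1, col.names = col.names, ...)
}

# Chromosome "2" occupies file lines 5 to 7 in this toy file.
chr2 <- read.line.range(path, 5, 7,
                        colClasses = c("character", "integer"))
chr2
```

Knowing the line range of a chromosome in advance means the rest of the file never has to be parsed, which is the motivation for reading by line number instead of filtering by chromosome name.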

See Also

read.delim.

Examples

# NOT RUN {
data.file <- system.file("data", "example.seqz.txt.gz", package = "sequenza")

## read chromosome 1 from an seqz file.
seqz.data <- read.seqz(data.file, chr.name = 1)

## Fast access to chromosome X using the file metrics
gc.stats <- gc.sample.stats(data.file)
chrX <- gc.stats$file.metrics[gc.stats$file.metrics$chr == "X", ]
seqz.data <- read.seqz(data.file, n.lines = c(chrX$start, chrX$end))

## Compare the running time of the two different methods.
system.time(read.seqz(data.file, n.lines = c(chrX$start, chrX$end)))
system.time(seqz.data <- read.seqz(data.file, chr.name = "X"))
# }
