sequenza (version 1.0.5)

read.abfreq: Read an ABfreq or acgt format file

Description

Efficiently reads an ABfreq or acgt file into R.

Usage

read.abfreq(file, nrows = -1, fast = FALSE, gz = TRUE, header = TRUE,
            colClasses = c("character", "integer", "character", "integer",
            "integer", "numeric", "numeric", "numeric", "character",
            "numeric", "numeric", "character", "character"),
            chr.name = NULL, n.lines = NULL, ...)

read.acgt(file, colClasses = c("character", "integer", "character", "integer",
          "integer", "integer", "integer", "integer"), ...)

Arguments

file
file name
nrows
number of rows to read from the file. Default is -1 (all rows).
fast
logical. If TRUE the file will be pre-parsed to count the number of rows; in some cases this can speed up the file reading.
gz
logical. If TRUE (the default) the function expects a gzipped file.
header
logical, indicating whether the file contains the names of the variables as its first line.
colClasses
character. A vector of classes to be assumed for the columns. By default the acgt and ABfreq format is expected.
chr.name
if specified, only the selected chromosome will be extracted instead of the entire file.
n.lines
vector of length 2 specifying the first and last line to read from the file. If specified, only the selected portion of the file will be used. Requires the sed UNIX utility.
...
any arguments accepted by read.delim. For read.acgt, also any arguments accepted by read.abfreq.
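As a rough illustration of what header, nrows and colClasses control, the sketch below reads a small gzipped, tab-separated file with base R's read.delim. It does not call read.abfreq itself; the file contents and the object name toy are made up for the example, and only four of the ABfreq columns are used.

```r
# Build a tiny gzipped, tab-separated file with made-up values (illustrative only).
tmp <- tempfile(fileext = ".txt.gz")
con <- gzfile(tmp, "w")
writeLines(c("chromosome\tn.base\tbase.ref\tdepth.normal",
             "1\t100\tA\t30",
             "1\t101\tC\t28",
             "X\t500\tG\t15"), con)
close(con)

# Explicit column classes keep chromosome names such as "1" and "X" as
# character and spare read.delim from guessing the type of each column.
toy <- read.delim(gzfile(tmp), header = TRUE, nrows = -1,
                  colClasses = c("character", "integer",
                                 "character", "integer"))
str(toy)
```

Without the explicit classes, a file containing only numeric chromosome names would have its chromosome column read as integer, which is one reason a fixed colClasses vector is a sensible default.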

Value

ABfreq is a tab-separated text file with column headers. The file currently has 13 columns. The first 3 columns are derived from the original pileup file and contain:
  • chromosome: the chromosome names
  • n.base: the base positions
  • base.ref: the base in the reference genome used (usually hg19). Note that base.ref is NOT the base of the germline.

The remaining 10 columns contain the following information:
  • depth.normal: read depth observed in the normal sample
  • depth.sample: read depth observed in the tumor sample
  • depth.ratio: ratio of depth.sample and depth.normal
  • Af: A-allele frequency observed in the tumor sample
  • Bf: B-allele frequency observed in the tumor sample
  • ref.zygosity: zygosity of the reference sample. "hom" corresponds to AA or BB, whereas "het" corresponds to AB or BA
  • GC.percent: GC-content (percent), calculated from the reference genome in fixed nucleotide windows
  • good.s.reads: number of reads that passed the quality threshold (threshold specified in the pre-processing software)
  • AB.germline: base found in the germline sample
  • AB.sample: base found in the tumor sample

The acgt file format is similar to the ABfreq format, but contains only 8 columns. The first 3 are the same as in the ABfreq file, derived from the pileup format. The remaining 5 columns contain the following information:
  • read.depth: read depth. The column is derived from the pileup file
  • A: number of times A was observed among the reads that were above the quality threshold
  • C: number of times C was observed among the reads that were above the quality threshold
  • G: number of times G was observed among the reads that were above the quality threshold
  • T: number of times T was observed among the reads that were above the quality threshold
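To make the relationships between some of these columns concrete, here is a minimal sketch on hand-made data frames. It assumes that depth.ratio is depth.sample divided by depth.normal, and that the per-base counts in an acgt file cannot add up to more than read.depth because they only include reads above the quality threshold. Both are readings of the column descriptions above, not behaviour verified against the package, and all values are invented.

```r
# Toy ABfreq-like rows (invented values, only a subset of the 13 columns).
abf <- data.frame(
  chromosome   = c("1", "1"),
  n.base       = c(100L, 101L),
  depth.normal = c(30L, 40L),
  depth.sample = c(60L, 20L),
  stringsAsFactors = FALSE
)

# Assumption: the ratio is tumor depth over normal depth.
abf$depth.ratio <- abf$depth.sample / abf$depth.normal

# Toy acgt-like rows: per-base counts among reads above the quality threshold.
acgt <- data.frame(
  read.depth = c(30L, 28L),
  A = c(29L, 0L),
  C = c(0L, 14L),
  G = c(0L, 0L),
  T = c(0L, 12L)
)

# Number of reads that contributed a base call at each position.
good.reads <- rowSums(acgt[, c("A", "C", "G", "T")])

# In this toy data the filtered counts never exceed the raw pileup depth.
all(good.reads <= acgt$read.depth)
```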

Details

read.abfreq allows efficient access to a file by chromosome or by line number. The contents of the ABfreq and acgt files are described in the Value section.
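The two access patterns can be sketched with base R alone. This is not how read.abfreq is implemented (the n.lines argument relies on sed); it only illustrates the idea of selecting by chromosome versus selecting a known range of rows. The toy file, the variables first and last, and the convention of counting data rows from 1 after the header are assumptions made for this example.

```r
# Toy gzipped file with a header and four data rows (invented values).
tmp <- tempfile(fileext = ".txt.gz")
con <- gzfile(tmp, "w")
writeLines(c("chromosome\tn.base\tbase.ref",
             "1\t100\tA", "1\t101\tC",
             "X\t500\tG", "X\t501\tT"), con)
close(con)

cls <- c("character", "integer", "character")

# By chromosome: read every row, then keep the chromosome of interest.
all.rows <- read.delim(gzfile(tmp), colClasses = cls)
by.chr   <- all.rows[all.rows$chromosome == "X", ]

# By line range: skip the header plus the data rows before `first`,
# then read only the rows that are needed.
first <- 3
last  <- 4
by.lines <- read.delim(gzfile(tmp), header = FALSE,
                       skip = first, nrows = last - first + 1,
                       col.names = names(all.rows), colClasses = cls)

# Both approaches return the same positions.
identical(by.chr$n.base, by.lines$n.base)
```

The second approach avoids parsing the rows that come before the requested range, which is why knowing the start and end lines of a chromosome in advance can pay off on large files.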

See Also

read.delim.

Examples

data.file <-  system.file("data", "abf.data.abfreq.txt.gz", package = "sequenza")
## read chromosome 1 from an ABfreq file.
abf.data <- read.abfreq(data.file, chr.name = 1)

## fast access to chromosome X using the file metrics
gc.stats <- gc.sample.stats(data.file)
chrX <- gc.stats$file.metrics[gc.stats$file.metrics$chr == "X", ]
abf.data <- read.abfreq(data.file, n.lines = c(chrX$start, chrX$end))

## Compare the running time of the two methods.
system.time(read.abfreq(data.file, n.lines = c(chrX$start, chrX$end)))
system.time(abf.data <- read.abfreq(data.file, chr.name = "X"))