9. Advanced - Cell-index maps for reading and writing

Description

This part defines read and write maps that can be used to remap cell indices before reading and writing data from and to file, respectively.

This package provides methods to create read and write (cell-index) maps from Affymetrix CDF files. These can be used to store the cell data in an optimal order so that when data is read it is read in contiguous blocks, which is faster.

In addition, read maps may be used to read CEL files that have been "reshuffled" by other software. For instance, the dChip software (http://www.dchip.org/) rotates Affymetrix Exon, Tiling and Mapping 500K data. See the example below for how to read such data "unrotated".

For more details on how cell indices are defined, see 2. Cell coordinates and cell indices.

Motivation

When reading data from file, it is faster to read the data in the order in which it is stored than in, say, a random order. The main reason for this is that the read arm of the hard drive has to move more if data is not read consecutively. The same applies when writing data to file. The read and write cache of the file system may compensate a bit for this, but not completely.

In Affymetrix CEL files, cell data is stored in order of cell indices. Moreover, (except for a few early chip types) Affymetrix randomizes the locations of the cells such that cells in the same unit (probeset) are scattered across the array. Thus, when reading CEL data arranged by units using for instance readCelUnits(), the order of the cells requested is both random and scattered. Since CEL data is often queried unit by unit (except for some probe-level normalization methods), one can improve the speed of reading data by saving data such that cells in the same unit are stored together. A write map is used to remap cell indices to file indices. When later reading that data back, a read map is used to remap file indices to cell indices. Read and write maps are described next.

Definition of read and write maps

Consider cell indices $i=1, 2, ..., N*K$ and file indices $j=1, 2, ..., N*K$. A read map is then a bijective (one-to-one) function $h()$ such that $$i = h(j),$$ and the corresponding write map is the inverse function $h^{-1}()$ such that $$j = h^{-1}(i).$$ Since the mapping is required to be bijective, it holds that $i = h(h^{-1}(i))$ and that $j = h^{-1}(h(j))$. For example, consider the "reversing" read map function $h(j)=N*K-j+1$. The write map function is $h^{-1}(i)=N*K-i+1$. To verify the bijective property of this map, we see that $h(h^{-1}(i)) = h(N*K-i+1) = N*K-(N*K-i+1)+1 = i$ as well as $h^{-1}(h(j)) = h^{-1}(N*K-j+1) = N*K-(N*K-j+1)+1 = j$.

Read and write maps in R

In this package, read and write maps are represented as integer vectors of length $N*K$ with unique elements in ${1,2,...,N*K}$. Consider cell and file indices as in the previous section.

For example, the "reversing" read map in the previous section can be represented as

    readMap <- (N*K):1

Given a vector j of file indices, the cell indices are then obtained as i = readMap[j]. The corresponding write map is

    writeMap <- (N*K):1

and given a vector i of cell indices, the file indices are then obtained as j = writeMap[i].

Note also that the bijective property holds for this mapping, that is i == readMap[writeMap[i]] and i == writeMap[readMap[i]] are both TRUE.
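As a minimal sketch of the above, assuming a hypothetical toy array with only six cells (the variable nbrOfCells stands in for $N*K$ and is not part of the package), the reversing maps and their round-trip property can be checked in base R:

```r
# Toy stand-in for N*K (hypothetical; real arrays have far more cells)
nbrOfCells <- 6L

# "Reversing" read map: file index j -> cell index i
readMap <- nbrOfCells:1

# The reversing map is its own inverse, so the write map looks identical
writeMap <- nbrOfCells:1

# Look up the cell indices stored at file positions 1, 2 and 3
j <- 1:3
i <- readMap[j]
print(i)

# Round trips in both directions recover the original indices
idx <- seq_len(nbrOfCells)
roundTrip1 <- all(idx == readMap[writeMap[idx]])
roundTrip2 <- all(idx == writeMap[readMap[idx]])
print(roundTrip1)
print(roundTrip2)
```

The reversing map is a special case in that the read and write maps coincide; for a general map the two vectors differ.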

Because the mapping is bijective, the write map can be calculated from the read map by:

    writeMap <- order(readMap)

and vice versa:

    readMap <- order(writeMap)

Note, the invertMap() method is much faster than order().
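To illustrate why order() inverts a map, here is a sketch in base R using an arbitrary toy permutation. The helper invertByAssignment() is a hypothetical name introduced only for this illustration; it is not the package's invertMap() implementation, but it shows one way an inverse can be computed without sorting.

```r
# An arbitrary toy read map: a permutation of 1..5
readMap <- c(3L, 1L, 5L, 2L, 4L)

# Inverse via sorting: for a permutation, order() gives the position
# at which each of the values 1..n occurs
writeMap <- order(readMap)

# Hypothetical helper: invert by direct assignment, no sorting required
invertByAssignment <- function(map) {
  inv <- integer(length(map))
  inv[map] <- seq_along(map)
  inv
}
writeMap2 <- invertByAssignment(readMap)

# Both approaches agree, and inverting twice gives back the original map
print(identical(writeMap, writeMap2))
print(identical(order(writeMap), readMap))
```

Sorting costs on the order of n log n operations, whereas direct assignment is linear in the map length, which is one plausible reason a dedicated inversion routine can outperform order().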

Since most algorithms for Affymetrix data are based on probeset (unit) models, it is natural to read data unit by unit. Thus, to optimize the speed, cells should be stored in contiguous blocks of units. The method readCdfUnitsWriteMap() can be used to generate a write map from a CDF file such that if the units are read in order, readCelUnits() will read the cell data in order. Example:

    # Find any CDF file
    cdfFile <- findCdf()

    # Get the order of cell indices
    indices <- readCdfCellIndices(cdfFile)
    indices <- unlist(indices, use.names=FALSE)

    # Get an optimal write map for the CDF file
    writeMap <- readCdfUnitsWriteMap(cdfFile)

    # Get the read map
    readMap <- invertMap(writeMap)

    # Validate correctness
    indices2 <- readMap[indices]  # == 1, 2, 3, ..., N*K

Warning: do not misunderstand this example. It cannot be used to improve the reading speed of default CEL files. For that, the data in the CEL files first has to be rearranged (by the corresponding write map).
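The rearrangement itself can be illustrated with a toy sketch in base R that uses plain vectors instead of CEL files. It follows the conventions of the section "Definition of read and write maps" (a write map takes cell indices to file indices, a read map takes file indices to cell indices). The unit layout, the names unitCells, cellData and fileData, and all values are hypothetical.

```r
# Hypothetical layout: two units whose cells are scattered over 6 cells
unitCells <- list(unitA = c(5L, 2L, 6L), unitB = c(1L, 4L, 3L))

# Toy cell data, indexed by cell index
cellData <- c(10, 20, 30, 40, 50, 60)

# Read map: file position j holds cell readMap[j], units stored back to back
readMap <- unlist(unitCells, use.names = FALSE)

# Write map: cell i is stored at file position writeMap[i]
writeMap <- order(readMap)

# "Write" the data in the rearranged order
fileData <- numeric(length(cellData))
fileData[writeMap] <- cellData

# "Read" the first contiguous block of three file positions
j <- 1:3
cellsRead <- readMap[j]    # which cells those positions hold
valuesRead <- fileData[j]  # their values

print(cellsRead)
print(valuesRead)
print(identical(valuesRead, cellData[unitCells$unitA]))
```

In this toy layout, all cells of the first unit are obtained from one consecutive run of file positions, which is the access pattern the rearranged file is meant to enable.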

Reading rotated CEL files

It might be that a CEL file was rotated by other software, e.g. the dChip software rotates Affymetrix Exon, Tiling and Mapping 500K arrays 90 degrees clockwise, and the data remains rotated when exported as CEL files. To read such data in a non-rotated way, a read map can be used to "unrotate" the data. The 90-degree clockwise rotation that dChip effectively uses to store such data is explained by:

    h <- readCdfHeader(cdfFile)
    # (x,y) chip layout rotated 90 degrees clockwise
    nrow <- h$cols
    ncol <- h$rows
    y <- (nrow-1):0
    x <- rep(1:ncol, each=nrow)
    writeMap <- as.vector(y*ncol + x)

Thus, to read this data "unrotated", use the following read map:

    readMap <- invertMap(writeMap)
    data <- readCel(celFile, indices=1:10, readMap=readMap)
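The construction of the rotation map can be tried on a toy layout without any CDF or CEL file. The sketch below assumes a hypothetical chip with 2 rows and 3 columns, with the list h standing in for the header that readCdfHeader() would return, and it uses order() in place of invertMap() so that it runs in base R.

```r
# Hypothetical header of a tiny chip (stand-in for readCdfHeader() output)
h <- list(rows = 2L, cols = 3L)

# Dimensions of the rotated layout, as in the construction above
nrow <- h$cols
ncol <- h$rows
y <- (nrow - 1):0
x <- rep(1:ncol, each = nrow)
writeMap <- as.vector(y * ncol + x)
print(writeMap)

# The write map must be a permutation of 1..(nrow*ncol)
isPermutation <- identical(sort(writeMap), seq_len(nrow * ncol))
print(isPermutation)

# Base-R inverse (the package's invertMap() serves the same purpose)
readMap <- order(writeMap)
print(readMap)

# Composing the two maps gives the identity
isIdentity <- all(readMap[writeMap] == seq_along(writeMap))
print(isIdentity)
```

Checking that the map is a permutation before inverting it is a cheap safeguard: a map with repeated or missing indices has no inverse and would silently scramble the data.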