liayson (version 1.0.5)

segmentExpression2CopyNumber: Calling CNVs.

Description

Maps single cell expression profiles to copy number profiles.

Usage

segmentExpression2CopyNumber(eps, gpc, cn, seed=0, outF=NULL, maxPloidy=8,
                             nCores=2, stdOUT="log.applyAR2seg")

Value

Segment-by-cell matrix of copy number states.

Arguments

eps

Segment-by-cell matrix of expression.

gpc

Number of genes expressed per cell.

cn

Average copy number across cells for each segment (i.e. row in eps).

seed

The fraction of entries in the a-priori segment-by-cell copy number matrix to be used as the seed for association rule mining.

outF

Output file prefix under which intermediary heatmaps and histograms are printed, or NULL (default) to print nothing.

maxPloidy

The maximum ploidy to accept as solution.

nCores

The number of threads used.

stdOUT

Log-file to which standard output is redirected during parallel processing.

Author

Noemi Andor

Details

Let \(S := \{S_1, S_2, ..., S_n\}\) be the set of \(n\) genomic segments obtained from bulk DNA-sequencing. Let \(E_{ij}\) and \(G_{ij}\) be the average number of UMIs and the number of expressed genes, respectively, per segment \(i\) per cell \(j\). The segment-by-cell expression matrix is first normalized by gene coverage. For each \(x \in S\), the linear regression model
$$E_{x*} \sim \sum_{i \in S}G_{i*} $$
fits the average segment expression per cell onto the cell's overall gene coverage. The model's residuals \(R_{ij}\) reflect inter-cell differences in expression per segment that cannot be explained by differential gene coverage per cell. A first approximation of the segment-by-cell copy number matrix \(CN\) is given by
$$CN_{ij} := R_{ij} \cdot \frac{cn_i}{\bar{R}_{i*}}$$
where \(cn_i\) is the population-average copy number of segment \(i\) derived from DNA-seq. This transformation of \(E_{ij}\) into \(CN_{ij}\) is in essence a numerical optimization that shifts the distribution of each segment to the average value expected from bulk DNA-seq.
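The normalization and rescaling steps above can be sketched in base R. This is a minimal illustration on simulated data, not the package's implementation: the matrix `E`, the coverage vector `gpc_sim`, and the bulk estimates `cn_bulk` are hypothetical, and the per-cell gene count is used as the single predictor. Because residuals of a least-squares fit with an intercept average to zero, the sketch adds each segment's mean expression back before rescaling; that re-centring is an assumption of this example.

```r
set.seed(1)
n_cell <- 50
# Hypothetical number of genes expressed per cell (overall gene coverage).
gpc_sim <- round(runif(n_cell, 500, 3000))
# Simulated segment-by-cell expression: 3 segments whose signal scales with coverage.
level <- c(1, 2, 3)
E <- sapply(gpc_sim, function(g) level * g / 1000 + rnorm(3, sd = 0.05))
rownames(E) <- paste0("S", 1:3)

# Regress each segment's expression on gene coverage and keep the residuals.
# Assumption: the segment mean is added back so that row means are non-zero.
R <- t(apply(E, 1, function(e) residuals(lm(e ~ gpc_sim)) + mean(e)))

# Hypothetical population-average copy numbers from bulk DNA-seq.
cn_bulk <- c(S1 = 1.8, S2 = 2.0, S3 = 3.1)

# CN_ij = R_ij * (cn_i / mean_j(R_ij)): shift each row to its bulk average.
CN <- R * (cn_bulk / rowMeans(R))

round(rowMeans(CN), 2)  # row means now match cn_bulk
```

Multiplying a matrix by a vector of length `nrow` recycles the vector down the columns, so each row `i` is scaled by its own factor `cn_i / mean(R_i*)`.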

Let \(x' \in CN\) be the measured copy number of a given segment-cell pair, and \(x\) its corresponding true copy number state. The probability of assigning copy number \(x\) to a cell \(j\) at locus \(i\) depends on:
A. Cell \(j\)'s read count at locus \(i\), calculated conditional on the measurement \(x'\). Using a Gaussian smoothing kernel, we compute the kernel density estimate of the read counts at locus \(i\) across cells to identify the major (\(M\)) and the minor (\(m\)) copy number states of \(i\) as the highest and second highest peak of the fit respectively. Then we calculate the proportion of cells expected at state \(m\) as \(f = \frac{cn_i - M}{m - M} \). The probability of assigning copy number \(x\) to a cell \(j\) at locus \(i\) is calculated as:
$$P_A(x|x') \sim \begin{cases}
0, & \text{if } x \notin \{m, M\} \\
P_{ij}(x' | N(m, sd = f)), & \text{if } x = m \\
P_{ij}(x' | N(M, sd = 1-f)), & \text{if } x = M
\end{cases}$$

B. Cell \(j\)'s read count at other loci, i.e. how similar the cell is to other cells that have copy number \(x\) at locus \(i\). We use Apriori, an algorithm for association rule mining, to find groups of loci that tend to have correlated copy number states across cells. Let \(V_{i,K \to x}\) be the set of rules concluding copy number \(x\) for locus \(i\), where \(k \in K\) are copy number profiles of up to \(n=4\) loci in the form \(\{S_1=x_1, S_2=x_2, ..., S_n=x_n\}\). Further, let \(C_r\) be the confidence of a rule \(r \in V_{i,K \to x}\). For each cell \(j \in J\) matching any of the copy number profiles in \(K\), we calculate
$$P_B(x) \sim \sum_{r \in V_{i,K \to x}} C_r,$$
the cumulative confidence of the rules in support of \(x\) at \(i\).
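Criterion A can be sketched in base R with `density()` and `dnorm()`. This is a hedged illustration on simulated measurements, not the package's code: `x_obs`, `cn_i`, and the helper `p_a()` are hypothetical names, peak locations are rounded to the nearest integer state as an assumption, and the Gaussian density stands in for the probability up to a constant.

```r
set.seed(42)
# Simulated measured copy numbers x' of one locus across 100 cells (hypothetical):
# 70 cells near state 2 and 30 cells near state 3.
x_obs <- c(rnorm(70, mean = 2, sd = 0.15), rnorm(30, mean = 3, sd = 0.15))
cn_i <- 2.3  # hypothetical bulk-derived average copy number of the locus

# Kernel density estimate with a Gaussian smoothing kernel.
d <- density(x_obs, kernel = "gaussian")

# Local maxima of the fit, ordered from highest to lowest.
is_peak <- which(diff(sign(diff(d$y))) == -2) + 1
peaks <- is_peak[order(d$y[is_peak], decreasing = TRUE)]

# Assumption: peaks are rounded to integer states and duplicates are dropped.
peak_states <- unique(round(d$x[peaks]))
M <- peak_states[1]  # major state (highest peak)
m <- peak_states[2]  # minor state (second highest peak)

# Expected proportion of cells at the minor state.
f <- (cn_i - M) / (m - M)

# P_A(x | x'), up to a constant: zero outside {m, M}, Gaussian density otherwise.
p_a <- function(x, x_prime) {
  if (x == m) return(dnorm(x_prime, mean = m, sd = f))
  if (x == M) return(dnorm(x_prime, mean = M, sd = 1 - f))
  0
}

# Scores of states 1 to 4 for a cell measured at x' = 2.1.
sapply(1:4, p_a, x_prime = 2.1)
```

The sketch only makes sense when \(cn_i\) lies strictly between \(M\) and \(m\), so that both `f` and `1 - f` are valid standard deviations.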

We first obtain a seed of cell-segment pairs by assigning a-priori copy number states only when \(\max_{x \in [1,8]} P_A(x|x') > t\) for a threshold \(t\). We use this seed as input to B. Finally, the a-posteriori copy number for segment \(i\) in cell \(j\) is calculated as:
$$\arg\max_{x \in [1,8]} \left( P_A(x|x') + P_B(x) \right)$$
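Criterion B and the final combination can be sketched in base R as follows. To stay self-contained, the sketch does not run Apriori; it enumerates single-locus rules by brute force and sums their confidences, which is a simplification of the rule mining described above. The seed matrix `seed_cn`, the helper `cumulative_confidence()`, and the `p_a` values are all hypothetical.

```r
# Toy seed matrix of copy number states: loci in rows, cells in columns.
# Two hypothetical clones with 6 and 4 cells respectively.
clone_a <- c(S1 = 2, S2 = 3, S3 = 1)
clone_b <- c(S1 = 2, S2 = 2, S3 = 2)
seed_cn <- cbind(replicate(6, clone_a), replicate(4, clone_b))

states <- 1:8  # candidate copy number states, as in the text

# Cumulative confidence of all single-locus rules {S_k = s_k} -> {target = x}
# whose left-hand side matches the cell's profile.
# Rule confidence = (#cells matching both sides) / (#cells matching the left side).
cumulative_confidence <- function(seed_cn, target, profile, states) {
  sapply(states, function(x) {
    conf <- sapply(names(profile), function(k) {
      lhs <- seed_cn[k, ] == profile[[k]]
      if (!any(lhs)) return(0)
      sum(lhs & seed_cn[target, ] == x) / sum(lhs)
    })
    sum(conf)
  })
}

# A cell with known states at S1 and S2; its state at S3 is to be inferred.
profile <- c(S1 = 2, S2 = 3)
p_b <- cumulative_confidence(seed_cn, "S3", profile, states)

# Hypothetical P_A values for the same cell at S3 (states 1 to 8).
p_a <- c(0.5, 1.2, 0, 0, 0, 0, 0, 0)

a_priori <- states[which.max(p_a)]            # criterion A alone
a_posteriori <- states[which.max(p_a + p_b)]  # criteria A and B combined
c(a_priori = a_priori, a_posteriori = a_posteriori)
```

In this toy case the measurement alone favours state 2, but every seeded cell sharing the profile at S2 carries state 1 at S3, so the combined score selects state 1.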

References

Andor, N.*, Lau, B.*, Catalanotti, C., Kumar, V., Sathe, A., Belhocine, K., Wheeler, T., et al. (2018) Joint single cell DNA-Seq and RNA-Seq of gastric cancer reveals subclonal signatures of genomic instability and gene expression. doi: https://doi.org/10.1101/445932

Borgelt C & Kruse R. (2002) Induction of Association Rules: Apriori Implementation.

See Also

apriori

Examples

## Calculate the number of genes expressed per cell:
library(liayson)
data(epg)
gpc = apply(epg > 0, 2, sum)

## Call the function:
data(eps)
data(segments)
cn = segments[rownames(eps), "CN_Estimate"]
cnps = segmentExpression2CopyNumber(eps, gpc, cn, seed=0.5, nCores=2, stdOUT="log")
head(eps[, 1:3])   ## Expression of the first three cells
head(cnps[, 1:3])  ## Copy number of the first three cells
