segmentExpression2CopyNumber: Calling CNVs.

Description

Maps single cell expression profiles to copy number profiles.

Usage

segmentExpression2CopyNumber(eps, gpc, cn, seed=0, outF=NULL, maxPloidy=8, 
                                       nCores=2, stdOUT="log.applyAR2seg")

Value

Segment-by-cell matrix of copy number states.

Arguments

eps: Segment-by-cell matrix of expression.
gpc: Number of genes expressed per cell.
cn: Average copy number across cells for each segment (i.e. row in eps).
seed: The fraction of entries in a-priori segment-by-cell copy number matrix to be used as seed for association rule mining.
outF: Output file prefix in which to print intermediary heatmaps and histograms, or NULL (default) if no print.
maxPloidy: The maximum ploidy to accept as solution.
nCores: The numbers of threads used.
stdOUT: Log-file to which standard output is redirected during parallel processing.

Author

Noemi Andor

Details

Let S := { $S_1, S_2, ... S_n$ } be the set of $n$ genomic segments obtained from bulk DNA-sequencing. Let $E_{ij}$ and $G_{ij}$ be the average number of UMIs and the number of expressed genes per segment $i$ per cell $j$. The segment-by-cell expression matrix is first normalized by gene coverage. For each $x \in S$, the linear regression model:
$$E_{x*} \sim \sum_{i \in S}G_{i*} $$
, fits the average segment expression per cell onto the cell's overall gene coverage. The model's residuals $R_{ij}$ reflect inter-cell differences in expression per segment that cannot be explained by differential gene coverage per cell. A first approximation of the segment-by-cell copy number matrix CN is given by:
$$CN_{ij} := R_{ij} * (cn_i / \bar{R_{i*}} )$$
, where $cn_i$ is the population-average copy number of segment $i$ derived from DNA-seq. Above transformation of $E_{ij}$ into $CN_{ij}$ is in essence a numerical optimization, shifting the distribution of each segment to the average value expected from bulk DNA-seq.

Let $x' \in CN$ be the measured copy number of a given segment-cell pair, and $x$ its corresponding true copy number state. The probability of assigning copy number $x$ to a cell $j$ at locus $i$ depends on:
A. Cell $j$'s read count at locus $i$, calculated conditional on the measurement $x'$. Using a Gaussian smoothing kernel, we compute the kernel density estimate of the read counts at locus $i$ across cells to identify the major ($M$) and the minor ($m$) copy number states of $i$ as the highest and second highest peak of the fit respectively. Then we calculate the proportion of cells expected at state $m$ as $f = \frac{cn_i - M}{m - M} $. The probability of assigning copy number $x$ to a cell $j$ at locus $i$ is calculated as:
$ P_A(x|x') \sim $
: $ 0, if x \notin {m,M}$
: $ P_{ij}(x'|N(m, sd = f)), if x == m$
: $ P_{ij}(x'|N(M, sd = 1-f)), if x == M$

B. Cell $j$'s read count at other loci, i.e. how similar the cell is to other cells that have copy number $x$ at locus $i$. We use Apriori - an algorithm for association rule mining - to find groups of loci that tend to have correlated copy number states across cells. Let $V_{i,K \to x}$ be the set of rules concluding copy number $x$ for locus $i$, where $k \in K$ are copy number profiles of up to $n=4$ loci in the form { $S_1=x_1, S_2=x_2, ... S_n=x_n$ }. Further let $C_r$ be the confidence of a rule $r \in V_{i,K \to x}$. For each cell $j \in J$ matching any of the copy number profiles in $K$, we calculate:
$ P_B(x) \sim \sum_{r \in V_{i,K \to x}}C_r $
, the cumulative confidence of the rules in support of $x$ at $i$.

We first obtain a seed of cell-segment pairs by assigning a-priori copy number states only when $argmax_{x \in [1,8]} P_A (x|x') > t$. We use this seed as input to B. Finally, a-posteriori copy number for segment $i$ in cell $j$ is calculated as:
$$argmax_{x \in [1,8]} P_A(x|x') + P_B(x) $$

References

Andor, N.*, Lau, B.*, Catalanotti, C., Kumar, V., Sathe, A., Belhocine, K., Wheeler, T., et al. (2018) Joint single cell DNA-Seq and RNA-Seq of gastric cancer reveals subclonal signatures of genomic instability and gene expression. doi: https://doi.org/10.1101/445932

Borgelt C & Kruse R. (2002) Induction of Association Rules: Apriori Implementation.

Examples

Run this code

##Calculate number of genes expressed per each cell:
data(epg)
gpc = apply(epg>0, 2, sum)

##Call function:
data(eps)
data(segments)
cn=segments[rownames(eps),"CN_Estimate"]
# \donttest{
	cnps = segmentExpression2CopyNumber(eps, gpc, cn, seed=0.5, nCores=2, stdOUT="log")
	head(eps[,1:3]); ##Expression of first three cells
	head(cnps[,1:3]); ##Copy number of first three cells
# }

Run the code above in your browser using DataLab