Maps single cell expression profiles to copy number profiles.
segmentExpression2CopyNumber(eps, gpc, cn, seed=0, outF=NULL, maxPloidy=8,
nCores=2, stdOUT="log.applyAR2seg")
Segment-by-cell matrix of copy number states.
Segment-by-cell matrix of expression.
Number of genes expressed per cell.
Average copy number across cells for each segment (i.e. row in eps).
The fraction of entries in a-priori segment-by-cell copy number matrix to be used as seed for association rule mining.
Output file prefix in which to print intermediary heatmaps and histograms, or NULL (default) if no print.
The maximum ploidy to accept as solution.
The numbers of threads used.
Log-file to which standard output is redirected during parallel processing.
Noemi Andor
Let S := { \(S_1, S_2, ... S_n\) } be the set of \(n\) genomic segments obtained from bulk DNA-sequencing. Let \(E_{ij}\) and \(G_{ij}\) be the average number of UMIs and the number of expressed genes per segment \(i\) per cell \(j\). The segment-by-cell expression matrix is first normalized by gene coverage. For each \(x \in S\), the linear regression model:
$$E_{x*} \sim \sum_{i \in S}G_{i*} $$
, fits the average segment expression per cell onto the cell's overall gene coverage. The model's residuals \(R_{ij}\) reflect inter-cell differences in expression per segment that cannot be explained by differential gene coverage per cell. A first approximation of the segment-by-cell copy number matrix CN is given by:
$$CN_{ij} := R_{ij} * (cn_i / \bar{R_{i*}} )$$
, where \(cn_i\) is the population-average copy number of segment \(i\) derived from DNA-seq. Above transformation of \(E_{ij}\) into \(CN_{ij}\) is in essence a numerical optimization, shifting the distribution of each segment to the average value expected from bulk DNA-seq.
Let \(x' \in CN\) be the measured copy number of a given segment-cell pair, and \(x\) its corresponding true copy number state. The probability of assigning copy number \(x\) to a cell \(j\) at locus \(i\) depends on:
A. Cell \(j\)'s read count at locus \(i\), calculated conditional on the measurement \(x'\).
Using a Gaussian smoothing kernel, we compute the kernel density estimate of the read counts at locus \(i\) across cells to identify the major (\(M\)) and the minor (\(m\)) copy number states of \(i\) as the highest and second highest peak of the fit respectively. Then we calculate the proportion of cells expected at state \(m\) as \(f = \frac{cn_i - M}{m - M} \). The probability of assigning copy number \(x\) to a cell \(j\) at locus \(i\) is calculated as:
\( P_A(x|x') \sim \)
: \( 0, if x \notin {m,M}\)
: \( P_{ij}(x'|N(m, sd = f)), if x == m\)
: \( P_{ij}(x'|N(M, sd = 1-f)), if x == M\)
B. Cell \(j\)'s read count at other loci, i.e. how similar the cell is to other cells that have copy number \(x\) at locus \(i\). We use Apriori - an algorithm for association rule mining - to find groups of loci that tend to have correlated copy number states across cells. Let \(V_{i,K \to x}\) be the set of rules concluding copy number \(x\) for locus \(i\), where \(k \in K\) are copy number profiles of up to \(n=4\) loci in the form { \(S_1=x_1, S_2=x_2, ... S_n=x_n\) }. Further let \(C_r\) be the confidence of a rule \(r \in V_{i,K \to x}\). For each cell \(j \in J\) matching any of the copy number profiles in \(K\), we calculate:
\( P_B(x) \sim \sum_{r \in V_{i,K \to x}}C_r \)
, the cumulative confidence of the rules in support of \(x\) at \(i\).
We first obtain a seed of cell-segment pairs by assigning a-priori copy number states only when \(argmax_{x \in [1,8]} P_A (x|x') > t\). We use this seed as input to B. Finally, a-posteriori copy number for segment \(i\) in cell \(j\) is calculated as:
$$argmax_{x \in [1,8]} P_A(x|x') + P_B(x) $$
Andor, N.*, Lau, B.*, Catalanotti, C., Kumar, V., Sathe, A., Belhocine, K., Wheeler, T., et al. (2018) Joint single cell DNA-Seq and RNA-Seq of gastric cancer reveals subclonal signatures of genomic instability and gene expression. doi: https://doi.org/10.1101/445932
Borgelt C & Kruse R. (2002) Induction of Association Rules: Apriori Implementation.
apriori
##Calculate number of genes expressed per each cell:
data(epg)
gpc = apply(epg>0, 2, sum)
##Call function:
data(eps)
data(segments)
cn=segments[rownames(eps),"CN_Estimate"]
# \donttest{
cnps = segmentExpression2CopyNumber(eps, gpc, cn, seed=0.5, nCores=2, stdOUT="log")
head(eps[,1:3]); ##Expression of first three cells
head(cnps[,1:3]); ##Copy number of first three cells
# }
Run the code above in your browser using DataLab