# prepareCalib: Prepare Hi-C calibration

## Description

Builds a calibration function by fitting a subset of FISH distances and Hi-C frequencies with a power law model (see Details). The user provides either the number of distances to fit (taken in increasing order) or a subset of selected distances. The user can also choose how the distance threshold is estimated, or provide one explicitly.

## Usage

prepareCalib(data, npoints, threshold = NULL, useMax = TRUE, delta = 0.05, buffer = 1.0)

## Arguments

data

A data frame with 2 mandatory columns, distances and frequencies, holding matching FISH distances and Hi-C frequencies, respectively.
This data structure can be prepared with `prepareData`.

npoints

An integer or an integer vector. If an integer n is given, then the shortest n distances and their matching frequencies will be used. Otherwise, the indices in the integer vector indicate the subset of distances and frequencies to use from 'data'.
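
The two forms of 'npoints' can be illustrated with base R. This is a minimal sketch on made-up data, not the package's internal code: `select_subset` is a hypothetical helper, and it assumes that "the shortest n distances" means ordering by `distances` and keeping the first n rows.

```r
# Toy data with the two mandatory columns (values are made up for illustration).
toy <- data.frame(
  distances   = c(0.9, 0.3, 1.5, 0.6, 1.2),
  frequencies = c(40, 210, 12, 95, 22)
)

# Hypothetical helper: returns the rows that a given 'npoints' would select,
# assuming a single integer means "the n shortest distances" and a longer
# vector means "these row indices of 'data'".
select_subset <- function(data, npoints) {
  if (length(npoints) == 1) {
    idx <- order(data$distances)[seq_len(npoints)]
  } else {
    idx <- npoints
  }
  data[idx, ]
}

sub_n   <- select_subset(toy, 3)          # the three shortest distances
sub_idx <- select_subset(toy, c(1, 2, 4)) # explicit row indices
```

In this sketch a vector of length one is always read as a count, so a single explicit index cannot be expressed; whether `prepareCalib` behaves the same way is not stated here.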

threshold

Optional numeric, set to NULL by default. If provided, will be used as the distance range threshold of the calibration.

useMax

Optional Boolean, set to TRUE by default and ignored if 'threshold' is given. When TRUE, the maximal provided FISH distance is used for the distance range threshold. Otherwise, the threshold is estimated as the maximal FISH distance that presents a small enough deviation (< delta) from the model.

delta

Optional numeric, set to 0.05 by default and ignored if 'threshold' is given or if 'useMax' is set to TRUE. Defines the acceptable deviation from the model, when the distance range threshold is estimated from the fit (see details).

buffer

Optional numeric, set to 1.0 by default and ignored if 'useMax' is set to FALSE. Defines a constant that is added to the threshold value when 'useMax' is set to TRUE.
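
The way 'threshold', 'useMax', 'delta' and 'buffer' interact can be sketched as follows. This is an illustration of the documented precedence only, not the package's implementation: `choose_threshold` is a hypothetical helper, `alpha` and `log_beta` stand for the slope and intercept of the log-log fit, and two points are assumptions, namely that the maximum is taken over all distances in 'data' and that `NA` is returned when no distance lies within 'delta' of the model.

```r
# Hypothetical helper mirroring the documented precedence of the arguments.
choose_threshold <- function(data, alpha, log_beta, threshold = NULL,
                             useMax = TRUE, delta = 0.05, buffer = 1.0) {
  if (!is.null(threshold)) {
    return(threshold)                      # an explicit value takes priority
  }
  if (useMax) {
    return(max(data$distances) + buffer)   # maximal distance plus the buffer
  }
  # Distance predicted by the model for each observed frequency (see Details).
  predicted <- exp((log(data$frequencies) - log_beta) / alpha)
  ok <- abs(predicted - data$distances) < delta
  if (!any(ok)) {
    return(NA_real_)                       # assumption: nothing within delta
  }
  max(data$distances[ok])
}

# Made-up data following C = 50 * D^(-1) exactly, except for the last point.
toy <- data.frame(distances   = c(0.5, 1.0, 2.0, 4.0),
                  frequencies = c(100, 50, 25, 20))

t_given <- choose_threshold(toy, alpha = -1, log_beta = log(50), threshold = 3)
t_max   <- choose_threshold(toy, alpha = -1, log_beta = log(50))
t_fit   <- choose_threshold(toy, alpha = -1, log_beta = log(50), useMax = FALSE)
```

With these toy values the explicit threshold is returned unchanged, the 'useMax' branch adds the buffer to the largest distance, and the fit-based branch stops at the last distance that the model reproduces to within 'delta'.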

## Value

A list with the following objects:
- calib - a list defining the calibration, with the following objects: f - the calibration function (the power law model), and params - a list of parameters for f (the parameters of the model and the threshold).
- fit - the return value of `lm`, used to solve the linear regression.

## Details

We use a power law model to relate a set of FISH distances, D, and a matching set of contact frequencies, C:

$C \sim \beta D^{\alpha}$

Taking the log of this equation gives a linear dependency:

$\log(C) \sim \log(\beta) + \alpha \log(D)$

Here, we consider only a subset of distances for solving the latter equation and estimate alpha and beta with a linear regression.
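
The log-log regression can be reproduced with base R. The following is a minimal sketch on synthetic data, assuming a known power law with multiplicative noise; the variable names and parameter values are made up for illustration, and for brevity the fit uses all points rather than a subset.

```r
# Synthetic data drawn from a known power law, C = beta * D^alpha, with noise.
set.seed(1)
true_alpha <- -1.2
true_beta  <- 50
D <- seq(0.2, 2, length.out = 30)
C <- true_beta * D^true_alpha * exp(rnorm(length(D), sd = 0.05))

# Linear regression on the log-log scale: log(C) ~ log(beta) + alpha * log(D).
fit <- lm(log(C) ~ log(D))

# The slope estimates alpha; the intercept estimates log(beta).
alpha_hat    <- unname(coef(fit)[2])
log_beta_hat <- unname(coef(fit)[1])
beta_hat     <- exp(log_beta_hat)
```

Because the intercept of the fit is on the log scale, it has to be exponentiated before it can be used as the multiplicative constant of the power law.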
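The threshold t, which defines the range limit of Hi-C (the distance above which Hi-C frequencies are no longer informative), can be set to the maximal distance in D or estimated more restrictively from the fit:

$t = \max_{D} \{\, | e^{(\log(C) - \log(\beta))/\alpha} - D | < \delta \,\}$

Here the exponential term is the distance that the fitted model predicts for the frequency C, obtained by solving the linear equation for D, so t is the largest FISH distance whose predicted value deviates from it by less than $\delta$.

## References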

Y. Shavit, F.K. Hamey, P. Lio', FisHiCal: an R package for iterative FISH-based calibration of Hi-C data, 2014 (submitted).

## Examples

data(match)
npoints = 10 # number of points to fit
# prepareCalib computes threshold according to the fit
# useMax is set to FALSE
res = prepareCalib(match, npoints, useMax = FALSE)
calib = res$calib
calib
fit = res$fit
alpha = calib$params[[1]]
beta = calib$params[[2]]
threshold = calib$params[[3]]
# plot
plot(match$frequencies ~ match$distances, xlab = "distances",
ylab = "frequencies")
lines((exp(beta)*match$distances^alpha)~match$distances,
col = "red")
plot(log(match$frequencies) ~ log(match$distances),
xlab = "log(distances)", ylab = "log(frequencies)")
abline(fit, col = "red")
# plot the estimated threshold
abline(h = beta + log(threshold)*alpha, lty = 3)