matchDatasets: Match Two Data Sets by Location

Description

This function matches records in two data sets for subsequent enrichment analysis.

For each record in the primary data set (data1), it finds the records in the auxiliary data set (data2) that overlap with it or lie within the flanking distance (flank). If multiple such auxiliary records are found, the one whose center is closest to the center of the primary record is selected. If no such record is found, the primary record is left unmatched.
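The matching rule described above can be sketched in plain R. This is a minimal illustration, not the package's implementation: the helper name matchSketch is hypothetical, and it assumes that the distance between two records is the gap between their intervals, that centers are the midpoints of start and end, and that records on different chromosomes never match.

```r
# Hypothetical helper illustrating the matching rule; not the package's code.
matchSketch = function(data1, data2, flank = 0) {
    chr1 = as.character(data1[[1]])
    chr2 = as.character(data2[[1]])
    center1 = (data1[[2]] + data1[[3]]) / 2
    center2 = (data2[[2]] + data2[[3]]) / 2
    best = rep(NA_integer_, nrow(data1))
    for (i in seq_len(nrow(data1))) {
        # Assumed distance: gap between intervals (zero or negative = overlap).
        gap = pmax(data2[[2]] - data1[[3]][i], data1[[2]][i] - data2[[3]])
        ok = which(chr2 == chr1[i] & gap <= flank)
        if (length(ok) > 0) {
            # Among the candidates, keep the one with the closest center.
            best[i] = ok[which.min(abs(center2[ok] - center1[i]))]
        }
    }
    keep = !is.na(best)
    list(data1 = data1[keep, , drop = FALSE],
         data2 = data2[best[keep], , drop = FALSE])
}

# Small made-up data: three primary records, four auxiliary point records.
d1 = data.frame(chr = "chr1", start = c(100, 300, 997),
                end = c(200, 400, 997), stat = 1:3)
d2 = data.frame(chr = "chr1", start = c(140, 165, 350, 900),
                end = c(140, 165, 350, 900), stat = 1:4)

res0 = matchSketch(d1, d2, flank = 0)      # third primary record is dropped
res100 = matchSketch(d1, d2, flank = 100)  # third record matches the one at 900
```

The loop is quadratic in the number of records, which is fine for illustration but would be slow on large data sets.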

Usage

matchDatasets(data1, data2, flank = 0)

Arguments

data1

A data frame with the primary data set. Must have at least 4 columns:

Chromosome name.
Start position.
End position.
P-value or test statistic.
Optional additional columns.

data2

A data frame with the auxiliary data set. Must satisfy the same format criteria as the primary data set.

flank

Allowed distance between matched records. Set to zero to require overlap.

Value

Returns a list with matched data sets.

data1

The primary data set without unmatched records.

data2

The auxiliary data set records matching those in data1 above. Note that an auxiliary record is duplicated if it is the best match for multiple records in the primary data set.

Examples

# NOT RUN {
data1 = read.csv(text =
"chr,start,end,stat
chr1,100,200,1
chr1,150,250,2
chr1,200,300,3
chr1,300,400,4
chr1,997,997,5
chr1,998,998,6
chr1,999,999,7")

data2 = read.csv(text =
"chr,start,end,stat
chr1,130,130,1
chr1,140,140,2
chr1,165,165,3
chr1,200,200,4
chr1,240,240,5
chr1,340,340,6
chr1,350,350,7
chr1,360,360,8
chr1,900,900,9")

# Match data sets requiring overlap (flank = 0).
matchDatasets(data1, data2, 0)

# Match data sets with a flank.
# The last records are now matched.
matchDatasets(data1, data2, 100)
# }
