tidygenomics

Tidy Verbs for Dealing with Genomic Data Frames

Description

Handle genomic data within data frames just as you would with GRanges. This package provides methods to deal with genomic intervals the "tidy way", which makes them simpler to integrate into the general data munging process. The API is inspired by the popular bedtools and the genome_join() method from the fuzzyjoin package.

Installation

install.packages("tidygenomics")

# Or to get the latest development version
devtools::install_github("const-ae/tidygenomics")

Documentation

genome_intersect

Joins two data frames based on their genomic overlap. Unlike genome_join, it updates the boundaries to reflect the overlap of the regions.

library(tidygenomics)

x1 <- data.frame(id = 1:4,
                 chromosome = c("chr1", "chr1", "chr2", "chr2"),
                 start = c(100, 200, 300, 400),
                 end = c(150, 250, 350, 450))

x2 <- data.frame(id = 1:4,
                 chromosome = c("chr1", "chr2", "chr2", "chr1"),
                 start = c(140, 210, 400, 300),
                 end = c(160, 240, 415, 320))

genome_intersect(x1, x2, by=c("chromosome", "start", "end"), mode="both")
id.x chromosome id.y start end
   1       chr1    1   140 150
   4       chr2    3   400 415
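
To make the boundary update concrete, here is a minimal base-R sketch that reproduces the result above without the package. It is an illustration only, not how tidygenomics is implemented, and it assumes closed intervals (both start and end are part of the region).

```r
# Illustrative base-R sketch; not the tidygenomics implementation.
x1 <- data.frame(id = 1:4,
                 chromosome = c("chr1", "chr1", "chr2", "chr2"),
                 start = c(100, 200, 300, 400),
                 end = c(150, 250, 350, 450))
x2 <- data.frame(id = 1:4,
                 chromosome = c("chr1", "chr2", "chr2", "chr1"),
                 start = c(140, 210, 400, 300),
                 end = c(160, 240, 415, 320))

# Pair every interval in x1 with every interval in x2 on the same chromosome.
pairs <- merge(x1, x2, by = "chromosome", suffixes = c(".x", ".y"))

# Closed intervals overlap when each one starts before the other ends.
hits <- pairs[pairs$start.x <= pairs$end.y & pairs$start.y <= pairs$end.x, ]

# The intersection runs from the later start to the earlier end.
overlap <- data.frame(id.x = hits$id.x,
                      chromosome = hits$chromosome,
                      id.y = hits$id.y,
                      start = pmax(hits$start.x, hits$start.y),
                      end = pmin(hits$end.x, hits$end.y))
overlap <- overlap[order(overlap$id.x), ]
print(overlap)
```

The cross join via merge() is quadratic per chromosome, so it is only suitable for small examples like this one.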

genome_subtract

Subtracts one data frame from the other. This can be used to split the x data frame into smaller areas.

x1 <- data.frame(id = 1:4,
                chromosome = c("chr1", "chr1", "chr2", "chr1"),
                start = c(100, 200, 300, 400),
                end = c(150, 250, 350, 450))

x2 <- data.frame(id = 1:4,
                chromosome = c("chr1", "chr2", "chr1", "chr1"),
                start = c(120, 210, 300, 400),
                end = c(125, 240, 320, 415))

genome_subtract(x1, x2, by=c("chromosome", "start", "end"))
id chromosome start end
 1       chr1   100 119
 1       chr1   126 150
 2       chr1   200 250
 3       chr2   300 350
 4       chr1   416 450
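
The splitting behaviour in the output above can be sketched for a single pair of intervals. The helper name subtract_interval is hypothetical and not part of tidygenomics. The closed-interval convention (the remaining pieces end at cut_start - 1 and resume at cut_end + 1) is inferred from the example output.

```r
# Hypothetical helper for illustration; not part of tidygenomics.
# Removes the closed interval [cut_start, cut_end] from [start, end]
# and returns the pieces that remain.
subtract_interval <- function(start, end, cut_start, cut_end) {
  # No overlap: the interval is returned unchanged.
  if (cut_end < start || cut_start > end) {
    return(data.frame(start = start, end = end))
  }
  # Piece to the left of the cut, then piece to the right of it.
  pieces <- data.frame(start = c(start, cut_end + 1),
                       end = c(cut_start - 1, end))
  # Drop pieces that the cut swallowed completely.
  pieces <- pieces[pieces$start <= pieces$end, ]
  rownames(pieces) <- NULL
  pieces
}

print(subtract_interval(100, 150, 120, 125))  # two pieces: 100-119 and 126-150
print(subtract_interval(400, 450, 400, 415))  # one piece: 416-450
```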

genome_join_closest

Joins two data frames based on their genomic location. If no overlap is found, the closest interval is used.

# tibble::tibble() replaces the deprecated data_frame()
x1 <- tibble::tibble(id = 1:4,
                     chr = c("chr1", "chr1", "chr2", "chr3"),
                     start = c(100, 200, 300, 400),
                     end = c(150, 250, 350, 450))

x2 <- tibble::tibble(id = 1:4,
                     chr = c("chr1", "chr1", "chr1", "chr2"),
                     start = c(220, 210, 300, 400),
                     end = c(225, 240, 320, 415))

genome_join_closest(x1, x2, by=c("chr", "start", "end"), distance_column_name="distance", mode="left")
id.x chr.x start.x end.x id.y chr.y start.y end.y distance
   1  chr1     100   150    2  chr1     210   240       59
   2  chr1     200   250    1  chr1     220   225        0
   2  chr1     200   250    2  chr1     210   240        0
   3  chr2     300   350    4  chr2     400   415       49
   4  chr3     400   450   NA    NA      NA    NA       NA
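
The values in the distance column appear to count the positions strictly between two intervals, with 0 for overlapping ones. The sketch below reproduces the numbers from the example. The helper name interval_distance is hypothetical, and the "- 1" is inferred from the output above rather than taken from the package source.

```r
# Hypothetical helper; the formula is inferred from the example output,
# not from the tidygenomics source.
interval_distance <- function(start.x, end.x, start.y, end.y) {
  # Later start minus earlier end, minus one, is the number of
  # positions lying strictly between the two intervals.
  gap <- pmax(start.x, start.y) - pmin(end.x, end.y) - 1
  # Overlapping intervals give a negative gap, which is reported as 0.
  pmax(gap, 0)
}

interval_distance(100, 150, 210, 240)  # 59
interval_distance(200, 250, 220, 225)  # 0
interval_distance(300, 350, 400, 415)  # 49
```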

genome_cluster

Adds a new column with a cluster ID. Two intervals share a cluster if they overlap or lie within max_distance of each other.

x1 <- data.frame(id = 1:4, bla=letters[1:4],
                chromosome = c("chr1", "chr1", "chr2", "chr1"),
                start = c(100, 120, 300, 260),
                end = c(150, 250, 350, 450))

genome_cluster(x1, by=c("chromosome", "start", "end"))
id bla chromosome start end cluster_id
 1   a       chr1   100 150          0
 2   b       chr1   120 250          0
 3   c       chr2   300 350          2
 4   d       chr1   260 450          1

genome_cluster(x1, by=c("chromosome", "start", "end"), max_distance=10)
id bla chromosome start end cluster_id
 1   a       chr1   100 150          0
 2   b       chr1   120 250          0
 3   c       chr2   300 350          1
 4   d       chr1   260 450          0
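
The grouping rule can be sketched as a single sweep over the intervals sorted by chromosome and start. The helper cluster_intervals is hypothetical and is not the package's algorithm. Its labels start at 1 and are arbitrary, so only the grouping should be compared with the output above, not the numbers. The exact boundary test (start minus furthest end compared against max_distance) is also an assumption.

```r
# Hypothetical helper; cluster labels are arbitrary and are not meant
# to match the numbering produced by genome_cluster().
cluster_intervals <- function(df, max_distance = 0) {
  ord <- order(df$chromosome, df$start)
  cluster <- integer(nrow(df))
  current <- 0L
  last_chr <- NA_character_
  reach <- -Inf  # furthest end seen in the current cluster
  for (i in ord) {
    chr <- as.character(df$chromosome[i])
    # Start a new cluster on a new chromosome or when the gap is too wide.
    if (is.na(last_chr) || chr != last_chr ||
        df$start[i] - reach > max_distance) {
      current <- current + 1L
      reach <- df$end[i]
    } else {
      reach <- max(reach, df$end[i])
    }
    last_chr <- chr
    cluster[i] <- current
  }
  cluster
}

x1 <- data.frame(id = 1:4, bla = letters[1:4],
                 chromosome = c("chr1", "chr1", "chr2", "chr1"),
                 start = c(100, 120, 300, 260),
                 end = c(150, 250, 350, 450))

cluster_intervals(x1)                     # rows 1 and 2 share a cluster
cluster_intervals(x1, max_distance = 10)  # rows 1, 2 and 4 share a cluster
```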

genome_complement

Calculates the complement of a genomic region.

x1 <- data.frame(id = 1:4,
                 chromosome = c("chr1", "chr1", "chr2", "chr1"),
                 start = c(100, 200, 300, 400),
                 end = c(150, 250, 350, 450))

genome_complement(x1, by=c("chromosome", "start", "end"))
chromosome start end
      chr1     1  99
      chr1   151 199
      chr1   251 399
      chr2     1 299
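
As the output shows, the complement covers position 1 up to the first interval and the gaps between intervals, but nothing after the last interval, presumably because the chromosome length is unknown. A minimal base-R sketch of that behaviour follows. The helper complement_intervals is hypothetical, and the 1-based start is an assumption read off the example.

```r
# Hypothetical helper; assumes coordinates start at 1 and reports
# nothing after the last interval on each chromosome.
complement_intervals <- function(df) {
  per_chr <- lapply(split(df, df$chromosome), function(d) {
    d <- d[order(d$start), ]
    # cummax keeps the furthest end seen so far, so overlapping
    # input intervals do not produce spurious gaps.
    covered_to <- cummax(d$end)
    gap_start <- c(1, head(covered_to, -1) + 1)
    gap_end <- d$start - 1
    keep <- gap_start <= gap_end
    data.frame(chromosome = d$chromosome[keep],
               start = gap_start[keep],
               end = gap_end[keep])
  })
  out <- do.call(rbind, per_chr)
  rownames(out) <- NULL
  out
}

x1 <- data.frame(id = 1:4,
                 chromosome = c("chr1", "chr1", "chr2", "chr1"),
                 start = c(100, 200, 300, 400),
                 end = c(150, 250, 350, 450))
gaps <- complement_intervals(x1)
print(gaps)
```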

genome_join

Classical join function based on the overlap of the intervals. Implemented and maintained in the fuzzyjoin package and documented here only for completeness.

# tibble::tibble() replaces the deprecated data_frame()
x1 <- tibble::tibble(id = 1:4,
                     chr = c("chr1", "chr1", "chr2", "chr3"),
                     start = c(100, 200, 300, 400),
                     end = c(150, 250, 350, 450))

x2 <- tibble::tibble(id = 1:4,
                     chr = c("chr1", "chr1", "chr1", "chr2"),
                     start = c(220, 210, 300, 400),
                     end = c(225, 240, 320, 415))

fuzzyjoin::genome_join(x1, x2, by=c("chr", "start", "end"), mode="inner")
id.x chr.x start.x end.x id.y chr.y start.y end.y
   2  chr1     200   250    1  chr1     220   225
   2  chr1     200   250    2  chr1     210   240

fuzzyjoin::genome_join(x1, x2, by=c("chr", "start", "end"), mode="left")
id.x chr.x start.x end.x id.y chr.y start.y end.y
   1  chr1     100   150   NA    NA      NA    NA
   2  chr1     200   250    1  chr1     220   225
   2  chr1     200   250    2  chr1     210   240
   3  chr2     300   350   NA    NA      NA    NA
   4  chr3     400   450   NA    NA      NA    NA

fuzzyjoin::genome_join(x1, x2, by=c("chr", "start", "end"), mode="anti")
id  chr start end
 1 chr1   100 150
 3 chr2   300 350
 4 chr3   400 450
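
For readers who want to see how the modes relate, the inner and anti results can be approximated in base R. This is a rough sketch and not the fuzzyjoin implementation. Note that merge() keeps a single chr column instead of chr.x and chr.y.

```r
# Illustrative base-R sketch; not how fuzzyjoin implements genome_join().
x1 <- data.frame(id = 1:4,
                 chr = c("chr1", "chr1", "chr2", "chr3"),
                 start = c(100, 200, 300, 400),
                 end = c(150, 250, 350, 450))
x2 <- data.frame(id = 1:4,
                 chr = c("chr1", "chr1", "chr1", "chr2"),
                 start = c(220, 210, 300, 400),
                 end = c(225, 240, 320, 415))

# Candidate pairs: every combination of rows on the same chromosome.
pairs <- merge(x1, x2, by = "chr", suffixes = c(".x", ".y"))

# "inner": keep only the pairs whose intervals overlap.
inner <- pairs[pairs$start.x <= pairs$end.y & pairs$start.y <= pairs$end.x, ]

# "anti": rows of x1 that take part in no overlapping pair.
anti <- x1[!x1$id %in% inner$id.x, ]

print(inner)
print(anti)
```

A "left" result is then the inner rows plus the anti rows padded with NA in the x2 columns.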

Inspiration

If you have any additional questions or encounter issues, please raise them on the GitHub page.

Version

0.1.2

License

GPL-3

Last Published

August 8th, 2019

Functions in tidygenomics (0.1.2)