misha (version 5.3.1)

gintervals.annotate: Annotates 1D intervals using nearest neighbors

Description

Annotates one-dimensional intervals by finding nearest neighbors in another set of intervals and adding selected columns from the neighbors to the original intervals.

Usage

gintervals.annotate(
  intervals,
  annotation_intervals,
  annotation_columns = NULL,
  column_names = NULL,
  dist_column = "dist",
  max_dist = Inf,
  na_value = NA,
  maxneighbors = 1,
  tie_method = c("first", "min.start", "min.end"),
  overwrite = FALSE,
  keep_order = TRUE,
  intervals.set.out = NULL,
  ...
)

Value

A data frame containing the original intervals plus the requested annotation columns (and optional distance column). If maxneighbors > 1, rows may be duplicated per input interval to accommodate multiple neighbors.

Arguments

intervals

Intervals to annotate (1D).

annotation_intervals

Source intervals containing annotation data (1D).

annotation_columns

Character vector of column names to copy from annotation_intervals. If NULL (default), all non-basic columns are used, i.e. every column other than the coordinate/strand columns chrom, start, end, chrom1, start1, end1, chrom2, start2, end2, strand.

column_names

Optional custom names for the annotation columns. If provided, must have the same length as annotation_columns. Defaults to using the original names.

dist_column

Name of the distance column to include. Use NULL to omit the distance column. Defaults to "dist".

max_dist

Maximum absolute distance. Defaults to Inf (no threshold). When finite, rows whose neighbor lies beyond the threshold (|dist| > max_dist) are retained, but their annotation columns are set to na_value.

na_value

Value(s) to use for annotations when beyond max_dist or when no neighbor is found. Can be a single scalar recycled for all columns, or a named list/vector supplying per-column values matching column_names.

maxneighbors

Maximum number of neighbors per interval (duplicates intervals as needed). Defaults to 1.

tie_method

Tie-breaking when distances are equal: one of "first" (arbitrary but stable), "min.start" (smaller neighbor start first), or "min.end" (smaller neighbor end first). Applies when maxneighbors > 1.

overwrite

When FALSE (default), errors if selected annotation columns would overwrite existing columns in intervals. When TRUE, conflicting base columns are replaced by the annotation columns.

keep_order

If TRUE (default), preserves the original order of intervals rows in the output.

intervals.set.out

Optional name of an intervals set to which the function result is written. Defaults to NULL.

...

Additional arguments forwarded to gintervals.neighbors (e.g., mindist, maxdist).
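The tie_method orderings described above can be pictured with plain base R. This is only an illustrative sketch, not misha's implementation: the candidate table `cands` and its values are hypothetical, and distances are assumed to be compared by absolute value.

```r
# Hypothetical candidate neighbors for a query interval [1000, 1100);
# both candidates are assumed to sit at the same distance (a tie).
cands <- data.frame(
    start = c(1200, 800),
    end   = c(1300, 900),
    label = c("right", "left"),
    dist  = c(100, 100),
    stringsAsFactors = FALSE
)

# "min.start": sort by absolute distance, then by smaller neighbor start.
by_start <- cands[order(abs(cands$dist), cands$start), ]

# "min.end": sort by absolute distance, then by smaller neighbor end.
by_end <- cands[order(abs(cands$dist), cands$end), ]

print(by_start$label)
print(by_end$label)
```

In this toy case both orderings place the "left" candidate first, because it has both the smaller start and the smaller end.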

Details

The function wraps and extends gintervals.neighbors to provide convenient column selection/renaming, optional distance inclusion, distance thresholding with custom NA values, multiple neighbors per interval, and deterministic tie-breaking. Currently supports 1D intervals only.

- When annotation_columns = NULL, all non-basic columns present in annotation_intervals are included.
- Setting dist_column = NULL omits the distance column.
- If no neighbor is found for an interval, annotation columns are filled with na_value and the distance (when present) is NA_real_.
- Column name collisions are handled as follows: when overwrite=FALSE a clear error is emitted; when overwrite=TRUE, base columns with the same names are replaced by annotation columns.
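The thresholding rule (rows are kept, annotation columns are replaced by na_value) can be sketched in base R. The helper `mask_beyond_max_dist` and the toy table are hypothetical and only illustrate the documented behavior under stated assumptions; they are not misha code. The sketch leaves the distance column untouched, since the documentation does not say what happens to it beyond max_dist.

```r
# Hypothetical helper (not part of misha): replace annotation values for rows
# whose neighbor is missing (NA distance) or farther away than max_dist.
mask_beyond_max_dist <- function(df, ann_cols, max_dist, na_value,
                                 dist_col = "dist") {
    too_far <- is.na(df[[dist_col]]) | abs(df[[dist_col]]) > max_dist
    for (col in ann_cols) {
        # A named list supplies per-column values; a scalar is reused as-is.
        fill <- if (is.list(na_value)) na_value[[col]] else na_value
        df[[col]][too_far] <- fill
    }
    df
}

# Toy neighbor table; the distances (50 and 350) are assumed values.
res <- data.frame(
    chrom = "chr1", start = c(1000, 5000), end = c(1100, 5050),
    remark = c("a", "b"), score = c(10, 20), dist = c(50, 350),
    stringsAsFactors = FALSE
)

masked <- mask_beyond_max_dist(
    res, c("remark", "score"),
    max_dist = 200,
    na_value = list(remark = "no_ann", score = -1)
)
print(masked)
```

Both rows survive; only the second row, whose neighbor is farther than 200, has its remark and score replaced by the per-column fill values.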

Examples

# Prepare toy data
intervs <- gintervals(1, c(1000, 5000), c(1100, 5050))
ann <- gintervals(1, c(900, 5400), c(950, 5500))
ann$remark <- c("a", "b")
ann$score <- c(10, 20)

# Basic usage with default columns (all non-basic columns)
gintervals.annotate(intervs, ann)

# Select specific columns, with custom names and distance column name
gintervals.annotate(
    intervs, ann,
    annotation_columns = c("remark"),
    column_names = c("ann_remark"),
    dist_column = "ann_dist"
)

# Distance threshold with scalar NA replacement
gintervals.annotate(
    intervs, ann,
    annotation_columns = c("remark"),
    max_dist = 200,
    na_value = "no_ann"
)

# Multiple neighbors with deterministic tie-breaking
nbrs <- gintervals.annotate(
    gintervals(1, 1000, 1100),
    {
        x <- gintervals(1, c(800, 1200), c(900, 1300))
        x$label <- c("left", "right")
        x
    },
    annotation_columns = "label",
    maxneighbors = 2,
    tie_method = "min.start"
)
nbrs

# Overwrite existing columns in the base intervals
intervs2 <- intervs
intervs2$remark <- c("orig1", "orig2")
gintervals.annotate(intervs2, ann, annotation_columns = "remark", overwrite = TRUE)
