gvtrack.create: Creates a new virtual track

Description

Creates a new virtual track.

Usage

gvtrack.create(
  vtrack = NULL,
  src = NULL,
  func = NULL,
  params = NULL,
  dim = NULL,
  sshift = NULL,
  eshift = NULL,
  filter = NULL,
  ...
)

Value

None.

Arguments

vtrack

virtual track name

src

source (track/intervals). NULL for PWM functions. For value-based tracks, provide a data frame with columns chrom, start, end, and one numeric value column. The data frame functions as an in-memory sparse track and supports all track-based summarizer functions. Intervals must not overlap.

func

function name (see the tables in Details)

params

function parameters (see the tables in Details)

dim

use 'NULL' or '0' for 1D iterators. '1' converts a 2D iterator to (chrom1, start1, end1), '2' converts a 2D iterator to (chrom2, start2, end2)

sshift

shift of 'start' coordinate

eshift

shift of 'end' coordinate

filter

genomic mask to apply. Can be:

A data.frame with columns 'chrom', 'start', 'end' (intervals to mask)
A character string naming an intervals set
A character string naming a track (must be intervals-type track)
A list of any combination of the above (all will be unified)
NULL to clear the filter

...

additional PWM parameters

Details

This function creates a new virtual track named 'vtrack' with the given source, function and parameters. 'src' can be either a track, intervals (1D or 2D), or a data frame with intervals and a numeric value column (value-based track). The tables below summarize the supported combinations.

Value-based tracks

Value-based tracks are data frames containing genomic intervals with associated numeric values. They function as in-memory sparse tracks without requiring track creation in the database. To create a value-based track, provide a data frame with columns chrom, start, end, and one numeric value column (any name is acceptable). Value-based tracks support all track-based summarizer functions (e.g., avg, min, max, sum, stddev, quantile, nearest, exists, size, first, last, sample, and position functions), but do not support overlapping intervals. They behave like sparse tracks in aggregation: values are aggregated using count-based averaging (each interval contributes equally regardless of length), not coverage-based averaging.
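The difference between count-based and coverage-based averaging can be illustrated with plain R arithmetic. The sketch below does not call misha: the data frame vals and the iterator bounds are made up for illustration, and it assumes that every source interval overlapping the iterator contributes its value exactly once.

```r
# Hypothetical value-based track: three non-overlapping intervals.
vals <- data.frame(
    chrom = "chr1",
    start = c(100, 300, 500),
    end   = c(200, 400, 1000),
    score = c(10, 20, 30)
)

# Hypothetical iterator interval [0, 1000).
it_start <- 0
it_end <- 1000

# Overlap length (in bp) between each source interval and the iterator.
overlap <- pmax(0, pmin(vals$end, it_end) - pmax(vals$start, it_start))
hit <- overlap > 0

# Count-based average: each overlapping interval contributes equally.
count_avg <- mean(vals$score[hit])

# Coverage-based average: each interval is weighted by its overlap length.
coverage_avg <- sum(vals$score[hit] * overlap[hit]) / sum(overlap[hit])

print(c(count_based = count_avg, coverage_based = coverage_avg))
```

The long third interval pulls the coverage-based average towards 30, while the count-based average stays at the plain mean of the three scores.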

Track-based summarizers

Source	func	params	Description
Track	avg	NULL	Average track value in the iterator interval.
Track (1D)	exists	vals (optional)	Returns 1 if any value exists (or specific vals if provided), 0 otherwise.
Track (1D)	first	NULL	First value in the iterator interval.
Track (1D)	last	NULL	Last value in the iterator interval.
Track	max	NULL	Maximum track value in the iterator interval.
Track	min	NULL	Minimum track value in the iterator interval.
Dense / Sparse / Array track	nearest	NULL	Average value inside the iterator; for sparse tracks with no samples in the interval, falls back to the closest sample outside the interval (by genomic distance).
Track (1D)	sample	NULL	Uniformly sampled source value from the iterator interval.
Track (1D)	size	NULL	Number of non-NaN values in the iterator interval.
Dense / Sparse / Array track	stddev	NULL	Unbiased standard deviation of values in the iterator interval.
Dense / Sparse / Array track	sum	NULL	Sum of values in the iterator interval.
Dense / Sparse / Array track	quantile	Percentile in [0, 1]	Quantile of values in the iterator interval.
Dense track	global.percentile	NULL	Percentile of the interval average relative to the full-track distribution.
Dense track	global.percentile.max	NULL	Percentile of the interval maximum relative to the full-track distribution.
Dense track	global.percentile.min	NULL	Percentile of the interval minimum relative to the full-track distribution.

Track position summarizers

Source	func	params	Description
Track (1D)	first.pos.abs	NULL	Absolute genomic coordinate of the first value.
Track (1D)	first.pos.relative	NULL	Zero-based position (relative to interval start) of the first value.
Track (1D)	last.pos.abs	NULL	Absolute genomic coordinate of the last value.
Track (1D)	last.pos.relative	NULL	Zero-based position (relative to interval start) of the last value.
Track (1D)	max.pos.abs	NULL	Absolute genomic coordinate of the maximum value inside the iterator interval.
Track (1D)	max.pos.relative	NULL	Zero-based position (relative to interval start) of the maximum value.
Track (1D)	min.pos.abs	NULL	Absolute genomic coordinate of the minimum value inside the iterator interval.
Track (1D)	min.pos.relative	NULL	Zero-based position (relative to interval start) of the minimum value.
Track (1D)	sample.pos.abs	NULL	Absolute genomic coordinate of a uniformly sampled value.
Track (1D)	sample.pos.relative	NULL	Zero-based position (relative to interval start) of a uniformly sampled value.

For max.pos.relative, min.pos.relative, first.pos.relative, last.pos.relative, sample.pos.relative, iterator modifiers (including sshift / eshift and 1D projections generated via gvtrack.iterator) are applied before the position is reported. In other words, the returned coordinate is always 0-based and measured from the start of the iterator interval after all modifier adjustments.
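As a worked example of the rule above, the following base R arithmetic shows how a relative position would be derived once shifts are applied. The coordinates and the location of the maximum are hypothetical, and the sketch assumes that sshift and eshift are simply added to the start and end coordinates.

```r
# Hypothetical iterator interval and shifts (as set via gvtrack.iterator).
it_start <- 1000
it_end <- 1100
sshift <- -50
eshift <- 50

# Interval actually scanned after the modifiers are applied.
scan_start <- it_start + sshift
scan_end <- it_end + eshift

# Suppose the maximum value sits at this absolute coordinate.
max_pos_abs <- 1010

# The relative position is zero-based and measured from the shifted start,
# not from the original interval start.
max_pos_relative <- max_pos_abs - scan_start
print(max_pos_relative)
```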

Interval-based summarizers

Source	func	params	Description
1D intervals	distance	Minimal distance from center (default 0)	Signed distance using normalized formula when inside intervals, distance to edge when outside; see notes below for exact formula.
1D intervals	distance.center	NULL	Distance from iterator center to the closest interval center, `NA` if outside all intervals.
1D intervals	distance.edge	NULL	Edge-to-edge distance from iterator interval to closest source interval (like `gintervals.neighbors`); see notes below for strand handling.
1D intervals	coverage	NULL	Fraction of iterator length covered by source intervals (after unifying overlaps).
1D intervals	neighbor.count	Max distance (>= 0)	Number of source intervals whose edge-to-edge distance from the iterator interval is within params (no unification).
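The effect of unifying overlaps in coverage can be sketched in base R. This is an illustration of the arithmetic only, not misha's implementation; the intervals are hypothetical and half-open [start, end) coordinates are assumed.

```r
# Hypothetical iterator interval [0, 100) and overlapping source intervals.
it_start <- 0
it_end <- 100
src <- data.frame(start = c(10, 20, 90), end = c(30, 50, 120))

# Mark every base of the iterator that is covered by at least one source
# interval. Marking the same base twice has no extra effect, which is what
# unifying overlapping intervals amounts to.
covered <- logical(it_end - it_start)
for (i in seq_len(nrow(src))) {
    from <- max(src$start[i], it_start)
    to <- min(src$end[i], it_end)
    if (from < to) {
        covered[(from - it_start + 1):(to - it_start)] <- TRUE
    }
}

# Fraction of the iterator length that is covered.
coverage <- mean(covered)
print(coverage)
```

The first two source intervals overlap and together cover 40 bp, and the third is clipped to the last 10 bp of the iterator, so half of the iterator is covered.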

2D track summarizers

Source	func	params	Description
2D track	area	NULL	Area covered by intersections of track rectangles with the iterator interval.
2D track	weighted.sum	NULL	Weighted sum of values where each weight equals the intersection area.

Motif (PWM) summarizers

Source	func	Key params	Description
NULL (sequence)	pwm	pssm, bidirect, prior, extend, spat_*	Log-sum-exp score of motif likelihoods across all anchors inside the iterator interval.
NULL (sequence)	pwm.max	pssm, bidirect, prior, extend, spat_*	Maximum log-likelihood score among all anchors (per-position union across strands).
NULL (sequence)	pwm.max.pos	pssm, bidirect, prior, extend, spat_*	1-based position of the best-scoring anchor (signed by strand when `bidirect = TRUE`); coordinates are always relative to the iterator interval after any `gvtrack.iterator()` shifts/extensions.
NULL (sequence)	pwm.count	pssm, score.thresh, bidirect, prior, extend, strand, spat_*	Count of anchors whose score exceeds `score.thresh` (per-position union).

K-mer summarizers

Source	func	Key params	Description
NULL (sequence)	kmer.count	kmer, extend, strand	Number of k-mer occurrences whose anchor lies inside the iterator interval.
NULL (sequence)	kmer.frac	kmer, extend, strand	Fraction of possible anchors within the interval that match the k-mer.

Masked sequence summarizers

Source	func	Key params	Description
NULL (sequence)	masked.count	NULL	Number of masked (lowercase) base pairs in the iterator interval.
NULL (sequence)	masked.frac	NULL	Fraction of base pairs in the iterator interval that are masked (lowercase).

The sections below provide additional notes for motif, interval, k-mer, and masked sequence functions.

Motif (PWM) notes

pssm: Position-specific scoring matrix (matrix or data frame) with columns A, C, G, T; extra columns are ignored.
bidirect: When TRUE (default), both strands are scanned and combined per genomic start (per-position union). The strand argument is ignored. When FALSE, only the strand specified by strand is scanned.
prior: Pseudocount added to frequencies (default 0.01). Set to 0 to disable.
extend: Extends the fetched sequence so boundary-anchored motifs retain full context (default TRUE). The END coordinate is padded by motif_length - 1 for all strand modes; anchors must still start inside the iterator.
Neutral characters (N, n, *) contribute the mean log-probability of the corresponding PSSM column on both strands.
strand: Used only when bidirect = FALSE; 1 scans the forward strand, -1 scans the reverse strand. For pwm.max.pos, strand = -1 reports the hit position at the end of the match (still relative to the forward orientation).
score.thresh: Threshold for pwm.count. Anchors with log-likelihood >= score.thresh are counted; only one count per genomic start.
Spatial weighting (spat_factor, spat_bin, spat_min, spat_max): optional position-dependent weights applied in log-space. Provide a positive numeric vector spat_factor; spat_bin (integer > 0) defines bin width; spat_min/spat_max restrict the scanning window.
pwm.max.pos: Positions are reported 1-based relative to the final scan window (after iterator shifts and spatial trimming). Ties resolve to the most 5' anchor; the forward strand wins ties at the same coordinate. Values are signed when bidirect = TRUE (positive for forward, negative for reverse).
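To make the relationship between pwm, pwm.max, pwm.max.pos and pwm.count concrete, the sketch below scores a short sequence in base R. It is a simplified illustration, not misha's implementation: it scans the forward strand only, ignores extend and spatial weighting, and assumes that the prior is added to every frequency before each row is renormalized. The sequence, the 3-column motif and the threshold are hypothetical.

```r
# Hypothetical 3-position PSSM with columns A, C, G, T (motif "ACG").
pssm <- matrix(
    c(
        0.7, 0.1, 0.1, 0.1,
        0.1, 0.7, 0.1, 0.1,
        0.1, 0.1, 0.7, 0.1
    ),
    ncol = 4, byrow = TRUE,
    dimnames = list(NULL, c("A", "C", "G", "T"))
)

# Assumption: the pseudocount is added and rows are renormalized.
prior <- 0.01
log_p <- log((pssm + prior) / rowSums(pssm + prior))

dna <- "TTACGACGTT" # hypothetical sequence
bases <- strsplit(dna, "")[[1]]
k <- nrow(pssm)
anchors <- seq_len(length(bases) - k + 1)

# Log-likelihood of the motif at every anchor (forward strand only).
scores <- vapply(anchors, function(i) {
    cols <- match(bases[i:(i + k - 1)], colnames(pssm))
    sum(log_p[cbind(seq_len(k), cols)])
}, numeric(1))

# pwm-like score: numerically stable log-sum-exp over all anchors.
m <- max(scores)
pwm_total <- m + log(sum(exp(scores - m)))

# pwm.max-like score and 1-based position (first anchor wins ties).
pwm_max <- m
pwm_max_pos <- which.max(scores)

# pwm.count-like score: anchors at or above a hypothetical threshold.
pwm_count <- sum(scores >= -2)

print(c(total = pwm_total, max = pwm_max, pos = pwm_max_pos, count = pwm_count))
```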

Spatial weighting enables position-dependent weighting for modeling positional biases. Bins are 0-indexed from the scan start. When using gvtrack.iterator() shifts (e.g., sshift = -50, eshift = 50), bins index from the expanded scan window start, not the original interval. Both strands use the same bin at each genomic position. Positions beyond the last bin reuse the final bin's weight. If the window size is not divisible by spat_bin, the last bin is shorter (e.g., scanning 500 bp with 40 bp bins yields bins 0-11 of 40 bp plus bin 12 of 20 bp). Use spat_min and spat_max to restrict scanning to a range divisible by spat_bin if needed.
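The bin arithmetic described above can be checked in base R. The sketch reproduces the 500 bp window with 40 bp bins; the weight vector is a placeholder, and the clamping of offsets to the last bin follows the rule that positions beyond the last bin reuse its weight.

```r
spat_bin <- 40L
window_size <- 500L
spat_factor <- rep(1, 13) # hypothetical weights, one per bin

# Zero-based offset of every position from the scan window start.
offset <- seq_len(window_size) - 1L

# Zero-based bin index; offsets past the last bin reuse the final bin.
bin <- pmin(offset %/% spat_bin, length(spat_factor) - 1L)

# Number of positions per bin.
bin_sizes <- table(bin)
print(bin_sizes)

# Weight looked up for each position (R vectors are 1-based).
weight <- spat_factor[bin + 1L]
```

Bins 0 to 11 each hold 40 positions and bin 12 holds the remaining 20.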

PWM parameters can be supplied either as a single list (params) or via named arguments (see examples).

Interval distance notes

distance: Given the center 'C' of the current iterator interval, returns 'DC * X/2' where 'DC' is the normalized distance to the center of the interval that contains 'C', and 'X' is the value of the parameter (default: 0). If no interval contains 'C', the result is 'D + X/2' where 'D' is the distance between 'C' and the edge of the closest interval.

distance.center: Given the center 'C' of the current iterator interval, returns NaN if 'C' is outside of all intervals, otherwise returns the distance between 'C' and the center of the closest interval.

distance.edge: Computes edge-to-edge distance from the iterator interval to the closest source interval, using the same calculation as gintervals.neighbors. Returns 0 for overlapping intervals. Distance sign depends on the strand column of source intervals; returns unsigned (absolute) distance if no strand column exists. Returns NA if no source intervals exist on the current chromosome.

For distance and distance.center, distance can be positive or negative depending on the position of the coordinate relative to the interval and the strand (-1 or 1) of the interval. Distance is always positive if strand = 0 or if the strand column is missing. The result is NA if no intervals exist for the current chromosome.

Difference between distance functions: The distance function measures from the center of the iterator interval (a single coordinate point) to the closest edge of source intervals when outside, or returns a normalized distance within the interval when inside. The distance.center function measures from the center of the iterator interval to the center of source intervals. The distance.edge function measures edge-to-edge distance between intervals, exactly like gintervals.neighbors. Use distance.edge when you need the same distance computation as gintervals.neighbors within a virtual track context.

K-mer notes

kmer: DNA sequence (case-insensitive) to count.
extend: If TRUE (default), counts kmers whose anchor lies in the interval even if the kmer extends beyond it; when FALSE, only kmers fully contained in the interval are considered.
strand: 1 counts forward-strand occurrences, -1 counts reverse-strand occurrences, 0 counts both strands (default). For palindromic kmers, consider using 1 or -1 to avoid double counting.

K-mer parameters can be supplied as a list or via named arguments (see examples).
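The double counting of palindromic k-mers mentioned above can be reproduced with a small base R sketch. It does not call misha: count_kmer and revcomp are hypothetical helpers, the sequence is made up, and the sketch assumes that a reverse-strand hit corresponds to an occurrence of the reverse complement on the forward sequence.

```r
# Count occurrences of `kmer` in `dna`, anchored at every possible start.
count_kmer <- function(dna, kmer) {
    dna <- toupper(dna)
    kmer <- toupper(kmer)
    k <- nchar(kmer)
    n <- nchar(dna)
    if (n < k) {
        return(0L)
    }
    starts <- seq_len(n - k + 1L)
    sum(substring(dna, starts, starts + k - 1L) == kmer)
}

# Reverse complement of a DNA string.
revcomp <- function(s) {
    comp <- chartr("ACGT", "TGCA", toupper(s))
    paste(rev(strsplit(comp, "")[[1]]), collapse = "")
}

dna <- "ACGTCGAT" # hypothetical sequence

# "CG" is its own reverse complement, so each site is hit on both strands.
fwd <- count_kmer(dna, "CG")
rev_hits <- count_kmer(dna, revcomp("CG"))
print(c(forward = fwd, reverse = rev_hits, both = fwd + rev_hits))
```

Because every CG site is found once per strand, the combined count is twice the forward count, which is why strand = 1 or strand = -1 is suggested for palindromic k-mers.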

Modify iterator behavior with 'gvtrack.iterator' or 'gvtrack.iterator.2d'.

Examples

options(gmax.processes = 2)

gdb.init_examples()

gvtrack.create("vtrack1", "dense_track", "max")
gvtrack.create("vtrack2", "dense_track", "quantile", 0.5)
gextract("dense_track", "vtrack1", "vtrack2",
    gintervals(1, 0, 10000),
    iterator = 1000
)

gvtrack.create("vtrack3", "dense_track", "global.percentile")
gvtrack.create("vtrack4", "annotations", "distance")
gdist(
    "vtrack3", seq(0, 1, length.out = 10), "vtrack4",
    seq(-500, 500, 200)
)

gvtrack.create("cov", "annotations", "coverage")
gextract("cov", gintervals(1, 0, 1000), iterator = 100)

pssm <- matrix(
    c(
        0.7, 0.1, 0.1, 0.1, # Example PSSM
        0.1, 0.7, 0.1, 0.1,
        0.1, 0.1, 0.7, 0.1,
        0.1, 0.1, 0.7, 0.1,
        0.1, 0.1, 0.7, 0.1,
        0.1, 0.1, 0.7, 0.1
    ),
    ncol = 4, byrow = TRUE
)
colnames(pssm) <- c("A", "C", "G", "T")
gvtrack.create(
    "motif_score", NULL, "pwm",
    list(pssm = pssm, bidirect = TRUE, prior = 0.01)
)
gvtrack.create("max_motif_score", NULL, "pwm.max",
    pssm = pssm, bidirect = TRUE, prior = 0.01
)
gvtrack.create("max_motif_pos", NULL, "pwm.max.pos",
    pssm = pssm
)
gextract(
    c(
        "dense_track", "motif_score", "max_motif_score",
        "max_motif_pos"
    ),
    gintervals(1, 0, 10000),
    iterator = 500
)

# Kmer counting examples
gvtrack.create("cg_count", NULL, "kmer.count", kmer = "CG", strand = 1)
gvtrack.create("cg_frac", NULL, "kmer.frac", kmer = "CG", strand = 1)
gextract(c("cg_count", "cg_frac"), gintervals(1, 0, 10000), iterator = 1000)

gvtrack.create("at_pos", NULL, "kmer.count", kmer = "AT", strand = 1)
gvtrack.create("at_neg", NULL, "kmer.count", kmer = "AT", strand = -1)
gvtrack.create("at_both", NULL, "kmer.count", kmer = "AT", strand = 0)
gextract(c("at_pos", "at_neg", "at_both"), gintervals(1, 0, 10000), iterator = 1000)

# GC content
gvtrack.create("g_frac", NULL, "kmer.frac", kmer = "G")
gvtrack.create("c_frac", NULL, "kmer.frac", kmer = "C")
gextract("g_frac + c_frac", gintervals(1, 0, 10000),
    iterator = 1000,
    colnames = "gc_content"
)

# Masked base pair counting
gvtrack.create("masked_count", NULL, "masked.count")
gvtrack.create("masked_frac", NULL, "masked.frac")
gextract(c("masked_count", "masked_frac"), gintervals(1, 0, 10000), iterator = 1000)

# Scale the G fraction by the unmasked fraction of each interval
gvtrack.create("gc", NULL, "kmer.frac", kmer = "G")
gextract("gc * (1 - masked_frac)",
    gintervals(1, 0, 10000),
    iterator = 1000,
    colnames = "gc_unmasked"
)

# Value-based track examples
# Create a data frame with intervals and numeric values
intervals_with_values <- data.frame(
    chrom = "chr1",
    start = c(100, 300, 500),
    end = c(200, 400, 600),
    score = c(10, 20, 30)
)
# Use as value-based sparse track (functions like sparse track)
gvtrack.create("value_track", intervals_with_values, "avg")
gvtrack.create("value_track_max", intervals_with_values, "max")
gextract(c("value_track", "value_track_max"),
    gintervals(1, 0, 10000),
    iterator = 1000
)

# Spatial PWM examples
# Create a PWM with higher weight in the center of intervals
pssm <- matrix(
    c(
        0.7, 0.1, 0.1, 0.1,
        0.1, 0.7, 0.1, 0.1,
        0.1, 0.1, 0.7, 0.1,
        0.1, 0.1, 0.1, 0.7
    ),
    ncol = 4, byrow = TRUE
)
colnames(pssm) <- c("A", "C", "G", "T")

# Spatial factors: low weight at edges, high in center
# For 200bp intervals with 40bp bins: bins start at offsets 0, 40, 80, 120, 160
spatial_weights <- c(0.5, 1.0, 2.0, 1.0, 0.5)

gvtrack.create(
    "spatial_pwm", NULL, "pwm",
    list(
        pssm = pssm,
        bidirect = TRUE,
        spat_factor = spatial_weights,
        spat_bin = 40L
    )
)

# Compare with non-spatial PWM
gvtrack.create(
    "regular_pwm", NULL, "pwm",
    list(pssm = pssm, bidirect = TRUE)
)

gextract(c("spatial_pwm", "regular_pwm"),
    gintervals(1, 0, 10000),
    iterator = 200
)

# Using spatial parameters with iterator shifts
gvtrack.create(
    "spatial_extended", NULL, "pwm.max",
    pssm = pssm,
    spat_factor = c(0.5, 1.0, 2.0, 2.5, 2.0, 1.0, 0.5),
    spat_bin = 40L
)
# Scan window will be 280bp (100bp + 2*90bp)
gvtrack.iterator("spatial_extended", sshift = -90, eshift = 90)
gextract("spatial_extended", gintervals(1, 0, 10000), iterator = 100)

# Using spat_min/spat_max to restrict scanning to a window
# For 500bp intervals, scan only positions 30-470 (440bp window)
gvtrack.create(
    "window_pwm", NULL, "pwm",
    pssm = pssm,
    bidirect = TRUE,
    spat_min = 30, # 1-based position
    spat_max = 470 # 1-based position
)
gextract("window_pwm", gintervals(1, 0, 10000), iterator = 500)

# Combining spatial weighting with window restriction
# Scan positions 50-450 with spatial weights favoring the center
gvtrack.create(
    "window_spatial_pwm", NULL, "pwm",
    pssm = pssm,
    bidirect = TRUE,
    spat_factor = c(0.5, 1.0, 2.0, 2.5, 2.0, 1.0, 0.5, 1.0, 0.5, 0.5),
    spat_bin = 40L,
    spat_min = 50,
    spat_max = 450
)
gextract("window_spatial_pwm", gintervals(1, 0, 10000), iterator = 500)