
vectra

vectra is an R-native columnar query engine for datasets larger than RAM.

Write dplyr-style pipelines against multi-GB files on a laptop. Data streams through a C11 pull-based engine one row group at a time, so peak memory stays bounded regardless of file size.

Quick Start

Point vectra at any file and query it with dplyr verbs. Nothing runs until collect().

library(vectra)

# CSV: lazy scan with type inference
tbl_csv("measurements.csv") |>
  filter(temperature > 30, year >= 2020) |>
  group_by(station) |>
  summarise(avg_temp = mean(temperature), n = n()) |>
  collect()

# GeoTIFF: climate rasters as tidy data
tbl_tiff("worldclim_bio1.tif") |>
  filter(band1 > 0) |>
  mutate(temp_c = band1 / 10) |>
  collect()

# Point extraction: sample raster values at coordinates, no terra needed
tiff_extract_points("worldclim_bio1.tif",
                    x = c(10.5, 11.2), y = c(47.1, 47.3))

# SQLite: zero-dependency, no DBI required
tbl_sqlite("survey.db", "responses") |>
  filter(year == 2025) |>
  left_join(tbl_sqlite("survey.db", "sites"), by = "site_id") |>
  collect()

For repeated queries, convert to vectra's native .vtr format for faster reads:

write_vtr(big_df, "data.vtr", batch_size = 100000)

tbl("data.vtr") |>
  filter(x > 0, region == "EU") |>
  group_by(region) |>
  summarise(total = sum(value), n = n()) |>
  collect()

Append new data without rewriting the file, or do a key-based diff between two snapshots:

# Append new rows as a new row group; existing data untouched
append_vtr(new_rows_df, "data.vtr")

# Logical diff: what was added or deleted between two snapshots?
d <- diff_vtr("snapshot_old.vtr", "snapshot_new.vtr", key_col = "id")
collect(d$added)   # rows present in new but not old
d$deleted          # key values present in old but not new

Fuzzy string matching runs inside the C engine with no round-trip to R:

tbl("taxa.vtr") |>
  filter(levenshtein(species, "Quercus robur") <= 2) |>
  mutate(similarity = jaro_winkler(species, "Quercus robur")) |>
  arrange(desc(similarity)) |>
  collect()

Register a star schema to avoid flat-table column creep. Define the links once, then pull only what you need:

s <- vtr_schema(
  fact    = tbl("observations.vtr"),
  species = link("sp_id", tbl("species.vtr")),
  site    = link("site_id", tbl("sites.vtr"))
)

# Pull columns from any dimension; joins are built automatically
lookup(s, count, species$name, site$habitat) |> collect()
#> species: all 500 keys matched
#> site: 3/500 unmatched keys (X1, X2, X3)

Use explain() to inspect the optimized plan:

tbl("data.vtr") |>
  filter(x > 0) |>
  select(id, x) |>
  explain()
#> vectra execution plan
#>
#> ProjectNode [streaming]
#>   FilterNode [streaming]
#>     ScanNode [streaming, 2/5 cols (pruned), predicate pushdown, v3 stats]
#>
#> Output columns (2):
#>   id <int64>
#>   x <double>

Why vectra

Querying large datasets in R usually means Arrow (requires compiled binaries matching your platform), DuckDB (links a 30 MB bundled library), or Spark (requires a JVM and cluster configuration).

vectra is a self-contained C11 engine compiled as a standard R extension. No external libraries, no JVM, no runtime configuration. It provides:

  • Streaming execution: data flows one row group at a time, never fully in memory
  • Zero-copy filtering: selection vectors avoid row duplication
  • Query optimizer: column pruning skips unneeded columns at scan; predicate pushdown uses per-rowgroup min/max statistics to skip entire row groups
  • Hash joins: build right, stream left; join a 50 GB fact table against a lookup without materializing both
  • External sort: 1 GB memory budget with automatic spill-to-disk
  • Window functions: row_number(), rank(), dense_rank(), lag(), lead(), cumsum(), cummean(), cummin(), cummax()
  • String expressions: nchar(), substr(), grepl() evaluated in the engine without round-tripping to R
  • Multiple data sources: .vtr, CSV, SQLite, and GeoTIFF all produce the same lazy query nodes
  • Integer TIFF output: write rasters as int16/int32/uint8/uint16/float32 with embedded GDAL metadata for 5-10x smaller files
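As a sketch of the window functions listed above (the file sensors.vtr and its station and value columns are hypothetical; this assumes window verbs respect group_by(), as in dplyr):

```r
# Number readings within each station and compute the change from the
# previous reading; both window functions evaluate inside the C engine.
tbl("sensors.vtr") |>
  group_by(station) |>
  mutate(rn = row_number(),
         delta = value - lag(value)) |>
  filter(rn <= 10) |>
  collect()
```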

Features

  • Transform: filter(), select(), mutate(), transmute(), rename(), relocate()
  • Aggregate: group_by(), summarise() (n, sum, mean, min, max, sd, var, first, last, any, all, median, n_distinct), count(), tally(), distinct()
  • Join: left_join(), inner_join(), right_join(), full_join(), semi_join(), anti_join(), cross_join(), lookup()
  • Order: arrange(), slice_head(), slice_tail(), slice_min(), slice_max(), slice()
  • Window: row_number(), rank(), dense_rank(), lag(), lead(), cumsum(), cummean(), cummin(), cummax(), ntile(), percent_rank(), cume_dist()
  • Date/Time: year(), month(), day(), hour(), minute(), second(), as.Date() (in filter()/mutate())
  • String: nchar(), substr(), grepl(), tolower(), toupper(), trimws(), paste0(), gsub(), sub(), startsWith(), endsWith() (in filter()/mutate())
  • String similarity: levenshtein(), levenshtein_norm(), dl_dist(), dl_dist_norm(), jaro_winkler() for fuzzy matching in filter()/mutate(), with optional max_dist early termination
  • Expression: abs(), sqrt(), log(), exp(), floor(), ceiling(), round(), log2(), log10(), sign(), trunc(), if_else(), between(), %in%, as.numeric(), pmin(), pmax(), resolve(), propagate() (in filter()/mutate())
  • Combine: bind_rows(), bind_cols(), across()
  • Schema: vtr_schema(), link(), lookup() for star schema definition and dimension lookup with match reporting
  • I/O: tbl(), tbl_csv(), tbl_sqlite(), tbl_tiff(), write_vtr(), write_csv(), write_sqlite(), write_tiff(), tiff_extract_points(), tiff_metadata(), append_vtr(), delete_vtr(), diff_vtr()
  • Inspect: explain(), glimpse(), print(), pull()
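For example, the Date/Time and String verbs can combine in one pipeline (a sketch; events.csv and its ts and msg columns are hypothetical):

```r
# year(), as.Date(), tolower(), grepl(), and nchar() all evaluate in the
# engine during the streaming scan, not in R.
tbl_csv("events.csv") |>
  filter(year(ts) >= 2024, grepl("error", tolower(msg))) |>
  mutate(day = as.Date(ts), msg_len = nchar(msg)) |>
  count(day) |>
  collect()
```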

Full tidyselect support in select(), rename(), relocate(), and across(): starts_with(), ends_with(), contains(), matches(), where(), everything(), all_of(), any_of().
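A sketch of those helpers (column names are hypothetical):

```r
# Keep the id column plus every measurement column, then aggregate all
# of the selected measurement columns at once with across().
tbl("data.vtr") |>
  select(id, starts_with("temp_")) |>
  group_by(id) |>
  summarise(across(starts_with("temp_"), mean)) |>
  collect()
```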

Installation

# CRAN
install.packages("vectra")

# Development version
pak::pak("gcol33/vectra")

Support

"Software is like sex: it's better when it's free." — Linus Torvalds

If this package saved you some time, buying me a coffee is a nice way to say thanks.

License

MIT (see the LICENSE.md file)

Citation

@software{vectra,
  author = {Colling, Gilles},
  title = {vectra: Columnar Query Engine for Larger-Than-RAM Data},
  year = {2026},
  url = {https://github.com/gcol33/vectra}
}

Version: 0.6.2
License: MIT + file LICENSE
Maintainer: Gilles Colling
Last Published: May 8th, 2026

Functions in vectra (0.6.2)

  • print.vectra_node: Print a vectra query node
  • pull: Extract a single column as a vector
  • mutate: Add or transform columns
  • materialize: Materialize a vectra node into a reusable in-memory block
  • reframe: Summarise with variable-length output per group
  • left_join: Join two vectra tables
  • relocate: Relocate columns
  • fuzzy_join: Fuzzy join two vectra tables by string distance
  • tbl_sqlite: Create a lazy table reference from a SQLite database
  • link: Define a link between a fact table and a dimension table
  • glimpse: Get a glimpse of a vectra table
  • tbl_csv: Create a lazy table reference from a CSV file
  • tbl: Create a lazy table reference from a .vtr file
  • tbl_tiff: Create a lazy table reference from a GeoTIFF raster
  • tbl_xlsx: Create a lazy table reference from an Excel (.xlsx) file
  • slice: Select rows by position
  • tiff_metadata: Read GDAL_METADATA from a GeoTIFF
  • ungroup: Remove grouping from a vectra query
  • transmute: Keep only columns from mutate expressions
  • tiff_extract_points: Extract raster values at point coordinates
  • summarise: Summarise grouped data
  • select: Select columns from a vectra query
  • tiff_crs: Read CRS metadata from a GeoTIFF
  • rename: Rename columns
  • vec_write_time_cube: Write a 4D time-cube raster to .vec
  • slice_head: Select first or last rows
  • vec_write_raster: Write a raster matrix or 3D array to a .vec raster file
  • vec_extract_points: Extract band values at (x, y) points from a .vec raster
  • vec_build_overviews: Build overview pyramids for a .vec raster
  • vec_read_time_slice: Read a single time slice from a .vec time cube
  • vec_read_pixel_series: Read the full time series at a single pixel from a .vec time cube
  • write_vtr: Write data to a .vtr file
  • vec_open_raster: Open a .vec raster
  • tiff_band_names: Read per-band names from a GeoTIFF
  • lookup: Look up columns from linked dimension tables
  • vec_raster_layout: Tile layout of an open .vec raster
  • vec_raster_times: Distinct time stamps stored in a .vec time cube
  • vec_close_raster: Close a .vec raster handle
  • vec_to_tiff: Export a .vec raster to GeoTIFF
  • vec_read_window: Read a window of pixels from a .vec raster
  • write_tiff: Write query results to a GeoTIFF file
  • write_sqlite: Write query results or a data.frame to a SQLite table
  • write_csv: Write query results or a data.frame to a CSV file
  • vtr_schema: Create a star schema over linked vectra tables
  • count: Count observations by group
  • append_vtr: Append rows to an existing .vtr file
  • across: Apply a function across multiple columns
  • bind_rows: Bind rows or columns from multiple vectra tables
  • collect: Execute a lazy query and return a data.frame
  • block_lookup: Probe a materialized block by column value
  • cross_join: Cross join two vectra tables
  • block_fuzzy_lookup: Fuzzy-match query keys against a materialized block
  • create_index: Create a hash index on a .vtr file column
  • arrange: Sort rows by column values
  • explain: Print the execution plan for a vectra query
  • filter: Filter rows of a vectra query
  • distinct: Keep distinct/unique rows
  • diff_vtr: Compute the logical diff between two .vtr files
  • desc: Mark a column for descending sort order
  • delete_vtr: Logically delete rows from a .vtr file
  • head.vectra_node: Limit results to first n rows
  • group_by: Group a vectra query by columns
  • has_index: Check if a hash index exists for a .vtr column