git2rdata v0.1

Store and Retrieve Data.frames in a Git Repository

Make versioning of data.frame easy and efficient using git repositories.

The git2rdata package

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

Please visit the git2rdata website at https://ropensci.github.io/git2rdata/. The vignette code on the website links to a rendered version of the vignette. Functions have a link to their help file.

Rationale

The git2rdata package is an R package for writing and reading dataframes as plain text files. Important information is stored in a metadata file.

  1. Storing metadata makes it possible to maintain the classes of variables. By default, the data is optimized for file storage prior to writing. The optimization is most effective on data containing factors. The optimization makes the data less human readable and can be turned off. Details on the implementation are available in vignette("plain_text", package = "git2rdata").
  2. Storing metadata also makes it possible to minimize row-based diffs between two consecutive commits. This is a useful feature when storing data as plain text files under version control. Details on this part of the implementation are available in vignette("version_control", package = "git2rdata"). Although git2rdata was envisioned with a git workflow in mind, it can also be used in combination with other version control systems like Subversion or Mercurial.
  3. git2rdata is intended to facilitate a reproducible and traceable workflow. A toy example is given in vignette("workflow", package = "git2rdata").
  4. vignette("efficiency", package = "git2rdata") provides some insight into the efficiency in terms of file storage, git repository size and speed for writing and reading.
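Why a fixed row order keeps diffs small can be illustrated with base R alone. The following is a minimal sketch of the general idea, not git2rdata's actual implementation. The helper `to_lines()` is hypothetical: it sorts a dataframe on a key column before serialising it, so the same data always yields the same text regardless of the row order in memory.

```r
# Hypothetical helper (not part of git2rdata): sort on a key column, then
# serialise the dataframe as tab-separated lines of text.
to_lines <- function(df, key) {
  sorted <- df[order(df[[key]]), , drop = FALSE]
  capture.output(
    write.table(sorted, sep = "\t", row.names = FALSE, quote = FALSE)
  )
}

original <- data.frame(
  id = c(3L, 1L, 2L),
  value = c("c", "a", "b"),
  stringsAsFactors = FALSE
)
# Same rows, different order in memory.
shuffled <- original[c(2, 3, 1), ]

# Sorting before writing makes both versions serialise identically,
# so a version control system would see no change at all.
same_text <- identical(to_lines(original, "id"), to_lines(shuffled, "id"))

# Adding one observation changes exactly one line of text.
updated <- rbind(
  original,
  data.frame(id = 4L, value = "d", stringsAsFactors = FALSE)
)
new_lines <- setdiff(to_lines(updated, "id"), to_lines(original, "id"))
```

Without the sort, appending or reshuffling rows could rewrite large parts of the file even though the underlying data barely changed.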

Why Use Git2rdata?

  • You can store dataframes as plain text files.
  • The dataframe you read has exactly the same information content as the one you wrote.
    • No changes in data type.
    • Factors keep their original levels, including their order.
    • Date and date-time are stored in an unambiguous format, documented in the metadata.
  • The data and the metadata are stored in a standard and open format, making it readable by other software.
  • Data and metadata are checked during the reading. The user is informed if there is tampering with the data or metadata.
  • Git2rdata integrates with the git2r package for working with git repositories from R.
    • Another option is using git2rdata solely for writing to disk and handling the plain text files with your favourite version control system outside of R.
  • The optimization reduces the required disk space by about 30% for both the working directory and the git history.
  • Reading data from an HDD is about 30% faster than read.table(); writing to an HDD takes about 70% more time than write.table().
  • Git2rdata is useful as a tool in a reproducible and traceable workflow. See vignette("workflow", package = "git2rdata").
  • You can detect when a file was last modified in the git history. Use this to check whether an existing analysis is obsolete due to new data. This lets you skip rerunning analyses that are still up to date, saving resources.
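The point about factors keeping their levels and order can be sketched in base R. This is an illustration of the general idea of storing integer codes alongside level metadata, under the assumption that this resembles what the optimization does. The package's actual storage format is described in vignette("plain_text", package = "git2rdata").

```r
# A factor with a deliberately non-alphabetical level order.
size <- factor(
  c("small", "large", "medium", "small"),
  levels = c("small", "medium", "large")
)

# What a plain text file could hold: integer codes for the data ...
codes <- as.integer(size)
# ... and the ordered levels as separate metadata.
stored_levels <- levels(size)

# Rebuilding the factor from the codes plus the metadata restores
# both the values and the original level order.
restored <- factor(stored_levels[codes], levels = stored_levels)
identical(restored, size)
```

Reading the labels back without the metadata would typically produce alphabetically ordered levels, which is the kind of silent change the metadata file is meant to prevent.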

Installation

Install the development version

# installation requires the "remotes" package
# install.packages("remotes")

# install with vignettes (recommended)
remotes::install_github(
  "ropensci/git2rdata", 
  build = TRUE, 
  dependencies = TRUE, 
  build_opts = c("--no-resave-data", "--no-manual")
)
# install without vignettes
remotes::install_github("ropensci/git2rdata")

Usage in a Nutshell

Dataframes are stored using write_vc() and retrieved with read_vc(). Both functions share the arguments root and file. root refers to a base location where the dataframe should be stored. It can either point to a local directory or a local git repository. file is the file name to use and can include a path relative to root. Make sure the relative path stays within root.

# using a local directory
library(git2rdata)
root <- "~/myproject" 
write_vc(my_data, file = "rel_path/filename", root = root)
read_vc(file = "rel_path/filename", root = root)
# alternatively, let root point to a local git repository
root <- git2r::repository("~/my_git_repo")

More details on storing dataframes as plain text files are available in vignette("plain_text", package = "git2rdata").
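A self-contained round trip might look like the sketch below. It assumes git2rdata is installed and that write_vc() accepts a `sorting` argument naming the column(s) that fix the row order. The dataframe `my_data`, the file name "my_data" and the temporary directory are made up for illustration.

```r
library(git2rdata)

# A throwaway directory standing in for the project root.
root <- tempfile("git2rdata-example-")
dir.create(root)

# Illustrative data: an integer key, a factor with a custom level order
# and a date column.
my_data <- data.frame(
  id = 1:3,
  species = factor(c("oak", "beech", "oak"), levels = c("oak", "beech")),
  observed = as.Date(c("2019-01-01", "2019-02-01", "2019-03-01"))
)

# Write the data and its metadata; sorting on "id" fixes the row order.
write_vc(my_data, file = "my_data", root = root, sorting = "id")

# The data file and the metadata file are written under root.
list.files(root)

# Read the data back and compare the variable classes and factor levels.
restored <- read_vc(file = "my_data", root = root)
sapply(restored, class)
levels(restored$species)
```

Comparing `sapply(restored, class)` with `sapply(my_data, class)` is a quick way to check that no data type changed on the way to disk and back.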

# using a git repository
library(git2rdata)
repo <- repository("~/my_git_repo")
pull(repo)
write_vc(my_data, file = "rel_path/filename", root = repo, stage = TRUE)
commit(repo, "My message")
push(repo)
read_vc(file = "rel_path/filename", root = repo)

Please read vignette("version_control", package = "git2rdata") for more details on using git2rdata in combination with version control.

What data sizes can git2rdata handle?

The recommendation for git repositories is to use files smaller than 100 MiB, an overall repository size of less than 1 GiB and fewer than 25k files. The individual file size is the limiting factor. Storing the airbag dataset (DAAG::nassCDS) with write_vc() requires on average 68 (optimized) or 97 (verbose) bytes per record. The 100 MiB file limit for this data is reached after about 1.5 million (optimized) or 1 million (verbose) observations.

Storing a 90% random subset of the airbag dataset requires 370 kiB (optimized) or 400 kiB (verbose) storage in the git history. Updating the dataset with other 90% random subsets requires on average 60 kiB (optimized) to 100 kiB (verbose) per commit. The git history limit of 1 GiB will be reached after 17k (optimized) to 10k (verbose) commits.

Your mileage might vary.
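The limits quoted above follow from simple arithmetic on the reported averages. The sketch below reproduces them in base R. The variable names are illustrative, and the inputs are the figures for the airbag dataset given in the text.

```r
# File size limit: 100 MiB expressed in bytes.
file_limit <- 100 * 1024^2

# Average storage per record reported for the airbag dataset.
bytes_per_record <- c(optimized = 68, verbose = 97)

# Number of records that fit in a single 100 MiB file.
max_records <- floor(file_limit / bytes_per_record)
max_records
# optimized: 1542023, verbose: 1081006

# Repository size limit: 1 GiB expressed in kiB.
history_limit <- 1024^2

# Size of the first commit and average growth per later commit, in kiB.
first_commit <- c(optimized = 370, verbose = 400)
per_commit <- c(optimized = 60, verbose = 100)

# Number of update commits before the history reaches 1 GiB.
max_commits <- floor((history_limit - first_commit) / per_commit)
max_commits
# optimized: 17470, verbose: 10481
```

Swapping in the per-record and per-commit sizes measured on your own data gives a rough estimate of how far a dataset can grow before it hits these limits.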

Citation

Please use the output of citation("git2rdata").

Folder Structure

  • R: The source scripts of the R functions with documentation in Roxygen format
  • man: The help files in Rd format
  • inst/efficiency: pre-calculated data to speed up vignette("efficiency", package = "git2rdata")
  • tests/testthat: R scripts with unit tests using the testthat framework
  • vignettes: source code for the vignettes describing the package
  • man-roxygen: templates for documentation in Roxygen format
  • pkgdown: additional source files for the git2rdata website
  • .github: guidelines and templates for contributors
git2rdata
├── .github 
├─┬ inst
│ └── efficiency
├── man 
├── man-roxygen 
├── pkgdown
├── R
├─┬ tests
│ └── testthat
└── vignettes

Contributions

Contributions to git2rdata are welcome. Please read our Contributing guidelines first. The git2rdata project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Functions in git2rdata

Name Description
pull Re-exported Function From git2r
repository Re-exported Function From git2r
relabel Relabel Factor Levels by Updating the Metadata
upgrade_data Upgrade Files to the New Version
status Re-exported Function From git2r
write_vc Store a Data.Frame as a Git2rdata Object on Disk
rm_data Remove Data Files From Git2rdata Objects
recent_commit Retrieve the Most Recent File Change
list_data List Available Git2rdata Files Containing Data
is_git2rdata Check Whether a Git2rdata Object is Valid.
is_git2rmeta Check Whether a Git2rdata Object Has Valid Metadata.
meta Optimize an Object for Storage as Plain Text and Add Metadata
read_vc Read a Git2rdata Object from Disk
push Re-exported Function From git2r
prune_meta Prune Metadata Files
commit Re-exported Function From git2r
git2rdata-package git2rdata: Store and Retrieve Data.frames in a Git Repository

Vignettes of git2rdata

Name
efficiency.Rmd
plain_text.Rmd
version_control.Rmd
workflow.Rmd

Details

License GPL-3
Encoding UTF-8
LazyData true
RoxygenNote 6.1.1
URL https://github.com/ropensci/git2rdata, https://doi.org/10.5281/zenodo.1485309
BugReports https://github.com/ropensci/git2rdata/issues
Collate 'clean_data_path.R' 'git2rdata-package.R' 'write_vc.R' 'is_git2rdata.R' 'is_git2rmeta.R' 'list_data.R' 'meta.R' 'prune.R' 'read_vc.R' 'recent_commit.R' 'reexport.R' 'relabel.R' 'upgrade_data.R'
VignetteBuilder knitr
Language en-GB
NeedsCompilation no
Packaged 2019-06-15 09:44:56 UTC; thierry_onkelinx
Repository CRAN
Date/Publication 2019-06-17 14:20:04 UTC
