jailbreakr

Warning: This project is in the early scoping stages; do not use for anything other than amusement/frustration purposes

Data Liberator. To extract tabular data people put in nontabular structures in a program designed to hold tables.

Installation

Requires the development version of xml2 (for xml_find_lgl) as well as cellrangr and linen. Chances are you'll want rexcel too.

devtools::install_github(c("hadley/xml2",
                           "rsheets/linen",
                           "rsheets/cellranger",
                           "rsheets/rexcel",
                           "rsheets/jailbreakr"))

Goals

There are two large excel spreadsheet corpora; it would be nice to use these to get a feel for what fraction of spreadsheets we can handle or the range of non-table-like data out there.

The first is the EUSES corpus of 4,447 spreadsheets (16,853 worksheets). This is all xls files (rather than xlsx) and therefore need either an xls -> xlsx conversion or support in jailbreakr for xls files.

The second, larger, one is the Enron corpus of 15,770 spreadsheets (79,983)

Roadmap

  • data structure package:

    • linen? General representation of spreadsheet data, plus some limited low-level operations on that data
    • depends on cell ranger, tibble
    • constructor function
    • print methods
    • subsetting, range extraction etc.
    • plot method - for quickly getting a feel for structure, or a shiny app
    • summary: this has n sheets, no formulae, 3 plots, etc, things about the references between the sheets?
    • where it came from (excel, googlesheet, etc), with filenames, reference ids etc.
    • probably needs references to handle multiple sheets and formulae within them, definitely if we need to do things with plots, but make them immutable at first?
    • md5 or other "id" so that we can see if the upstream source has changed. This is different for googlesheets where the id is properly baked into the sheet
  • low level packages:

    • googlesheets
    • rexcel
    • these depend on linen, and will have to provide things like ids and filenames to satisfy all the features that linen will do.
  • jailbreakr

    • uses output in linen format that is provided by googlesheets or rexcel

Ideas

Can we feed things through openrefine or something?

Copy Link

Version

Down Chevron

Version

0.0.1

License

GPL-3

Issues

Pull Requests

Stars

Forks

Maintainer

Last Published

January 1st, 1970

Functions in jailbreakr (0.0.1)