split_metadata: Split metadata from a worksheet

Description

Split metadata from a worksheet or region of a worksheet (a worksheet view).

Usage

split_metadata(x, include_merged = TRUE, min_jump = 2, min_data_block = 5)
split_metadata_find(x, include_merged = TRUE, min_jump = 2, min_data_block = 5)
split_metadata_apply(x, n)

Arguments

A worksheet or a worksheet view

include_merged

Include merged cells when computing row width. This is generally going to be what you want to do, but if you have a sheet with metadata rows that are entirely merged across, you probably will do better by turning this option off (in that case they'd be counted as a single column wide).

min_jump

The minimum number of columns increase that we'll count as moving from the metadata block to the data block. A single column (1) is probably going to be prone to false positives, and in sheets with this pattern the jump is often quite large.

min_data_block

The minimum number of rows without decreasing in size before we can conclude that we're in the data block. If we run off the end of the worksheet before reaching this number, we'll conclude no metadata was found.

A scalar integer representing the number of rows to consider to be metadata (i.e., equivalent to the return value of split_metadata_find.

Value

For split_metadata and split_metadata_apply, a worksheet view; in this view the data$metadata element will be a worksheet view of the metadata. For split_metadata_find, a single integer representing the number of rows of metadata found (with zero indicating no metadata).

Details

There is a common pattern in spreadsheets where data is stored as:

mmm
mmm
HHHHHHHH
dddddddd
dddddddd

where m is some metadata (perhaps indicating table name, creator, dates, etc), H is the header and d is the actual data. This function will split the metadata (m) part off, leaving a table that is more suitable for further processing. In many ways this is like the skip argument to readxl::read_excel and read.csv, but we will retain the metadata (somewhere!).

The idea here is that the metadata block starts when we get a shift in the number of non-blank cells. There needs to be some heuristics here to help, and things will need to be tuneable: do merged cells count as non-empty (include_merged; default is to include them), how big a jump we look for (min_jump; default is 2 columns), how many rows of the same size do we look for in the data block (min_data_block; default is 5 rows).

Other things that might be useful, but which aren't supported yet, include looking for different colours and fonts in the metadata and the main block; when we switch from one to the other we're likely to see things like a change here.