oai (version 0.3.0)

dumpers: Result dumpers

Description

Result dumpers are functions for handling chunks of results from an OAI-PMH service "on the fly". Handling can include processing the chunks or writing them to files, databases, etc.

Usage

dump_raw_to_txt(res, args, as, file_pattern = "oaidump",
  file_dir = ".", file_ext = ".xml")

dump_to_rds(res, args, as, file_pattern = "oaidump", file_dir = ".", file_ext = ".rds")

dump_raw_to_db(res, args, as, dbcon, table_name, field_name, ...)

Arguments

res

results; the type depends on as. Not to be specified by the user

args

list, query arguments, not to be specified by the user

as

character, type of result to return, not to be specified by the user

file_pattern, file_dir, file_ext

character; respectively the initial part of the file name, the directory name, and the file extension used to create file names. These arguments are passed to the tempfile() arguments pattern, tmpdir, and fileext, respectively.

dbcon

DBI-compliant database connection

table_name

character, name of the database table to write into

field_name

character, name of the field in database table to write into

...

arguments passed to/from other functions

Value

Dumpers should return NULL or a value that will be collected and returned by the function using the dumper.

dump_raw_to_txt returns the name of the created file.

dump_to_rds returns the name of the created file.

dump_raw_to_db returns NULL.

Details

Often the result of a request to an OAI-PMH service is so large that it is split into chunks that need to be requested separately using resumptionToken. By default, functions like list_identifiers() or list_records() request these chunks under the hood and return all of them concatenated in a single R object. This is convenient, but insufficient when dealing with large result sets that might not fit into RAM. A result dumper is a function that is called on each result chunk. Dumper functions can write chunks to files or databases, perform initial pre-processing or extraction, and so on.

A result dumper needs to be a function that accepts at least the arguments res, args, and as. Their values are supplied internally by the enclosing function. There may be additional arguments, including .... Dumpers should return NULL or a value that will be collected and returned by the function calling the dumper (e.g. list_records()).
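The requirements above can be illustrated with a minimal sketch of a custom dumper. The function name dump_raw_append and its file argument are hypothetical and not part of the package; the sketch also assumes that res arrives as a character string when as is "raw". The dumper is called by hand on a made-up chunk, which is roughly what the harvesting function would do for every chunk.

```r
# Hypothetical custom dumper (not part of oai): appends each raw XML chunk
# to a single file and returns the number of characters in that chunk.
dump_raw_append <- function(res, args, as, file = "oaidump_all.xml", ...) {
  # This sketch only handles raw XML text
  stopifnot(as == "raw")
  cat(res, "\n", file = file, sep = "", append = TRUE)
  nchar(res)
}

# Call the dumper directly on a made-up chunk
f <- tempfile(fileext = ".xml")
n <- dump_raw_append("<record/>", args = list(), as = "raw", file = f)
n
readLines(f)
```

With a harvesting function, such a dumper would presumably be supplied as dumper = dump_raw_append together with dumper_args = list(file = f), and the returned character counts would be collected by the caller.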

Currently result dumpers can be used with functions: list_identifiers(), list_records(), and list_sets(). To use a dumper with one of these functions you need to:

  • Pass it as an additional argument dumper

  • Pass optional additional arguments to the dumper function in a list as the dumper_args argument

See Examples. Below we provide more details on the dumpers currently implemented.

dump_raw_to_txt writes raw XML to text files. It requires as == "raw". File names are created using tempfile(). By default the files are written to the current working directory and have the format oaidump*.xml, where * is a random string in hex.
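To see what such a file name looks like, the following sketch calls tempfile() directly with the default values listed under Usage. It only generates a name and does not create a file; the variable name f is illustrative.

```r
# Generate a file name the way the defaults of dump_raw_to_txt describe it:
# pattern "oaidump", current directory, extension ".xml"
f <- tempfile(pattern = "oaidump", tmpdir = ".", fileext = ".xml")
basename(f)

# The random part consists of hexadecimal digits
grepl("^oaidump[0-9a-f]+\\.xml$", basename(f))
```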

dump_to_rds saves results in an .rds file via saveRDS(). The type of object being saved is determined by the as argument. File names are generated in the same way as by dump_raw_to_txt, but with the default extension .rds.
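Once the chunks are on disk they can be read back one at a time with readRDS(). The sketch below simulates two saved chunks with made-up data instead of harvesting them, and assumes each chunk was saved as a data frame with the same columns; the actual object type depends on as.

```r
# Simulate two chunks saved to .rds files (made-up data; a real harvest
# with dump_to_rds would create these files itself)
d <- tempfile("rdsdemo")
dir.create(d)
chunks <- list(data.frame(identifier = c("id1", "id2")),
               data.frame(identifier = "id3"))
fnames <- vapply(chunks, function(x) {
  f <- tempfile(pattern = "oaidump", tmpdir = d, fileext = ".rds")
  saveRDS(x, f)
  f
}, character(1))

# Read every chunk back and stack the chunks into one data frame
all_ids <- do.call(rbind, lapply(fnames, readRDS))
nrow(all_ids)

# clean-up
unlink(d, recursive = TRUE)
```

For result sets too large for RAM, the same loop over fnames could process each chunk separately instead of binding them together.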

dump_raw_to_db writes raw XML to a single text column of a table in a database. It requires as == "raw". The database connection dbcon should be a connection object as created by DBI::dbConnect() from package DBI. As such, it can connect to any database supported by DBI. The records are written to the field field_name in the table table_name using DBI::dbWriteTable(). If the table does not exist, it is created. If it does, the records are appended. Any additional arguments are passed to DBI::dbWriteTable().

References

OAI-PMH specification https://www.openarchives.org/OAI/openarchivesprotocol.html

See Also

Functions supporting the dumpers: list_identifiers(), list_sets(), and list_records()

Examples

# NOT RUN {
### Dumping raw XML to text files

# This will write a set of XML files to a temporary directory
fnames <- list_identifiers(from="2018-06-01T",
                           until="2018-06-14T",
                           as="raw",
                           dumper=dump_raw_to_txt,
                           dumper_args=list(file_dir=tempdir()))
# vector of file names created
str(fnames)
all( file.exists(fnames) )
# clean-up
unlink(fnames)


### Dumping raw XML to a database

# Connect to in-memory SQLite database
con <- DBI::dbConnect(RSQLite::SQLite(), dbname=":memory:")
# Harvest and dump the results into field "bar" of table "foo"
list_identifiers(from="2018-06-01T",
                 until="2018-06-14T",
                 as="raw",
                 dumper=dump_raw_to_db,
                 dumper_args=list(dbcon=con,
                                  table_name="foo",
                                  field_name="bar") )
# Count records, should be 101
DBI::dbGetQuery(con, "SELECT count(*) as no_records FROM foo")

DBI::dbDisconnect(con)




# }
