parBatchByIndex

When doing a select where the condition is a large number of ids, it is not always possible to include them all in a single SQL statement. This function breaks the list of ids into chunks and allows the indexProcessor function to deal with just a small number of ids at a time. See batchByIndex for the non-parallel version.

Usage
parBatchByIndex(allIndices, indexProcessor, reduce, cl, batchSize = 1e+05)
Arguments

allIndices
The full set of indices to process; it is broken into batches of at most batchSize elements.

indexProcessor
The function run on each batch of indices, for example a function that runs a query for the ids in that batch.

reduce
A function run on the list of batch results after all jobs have finished.

cl
A SNOW cluster on which to run the batch jobs.

batchSize
The maximum number of indices passed to the indexProcessor function in a single batch.
Value

The results of the individual indexProcessor function runs are collected into a list, with the order of batches maintained, and the reduce function is run on that list; the idea is that the reduce function merges all the results together into one result. The value returned by the reduce function is then returned.
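To make the batching and reduce semantics concrete, here is a minimal sequential sketch of the behaviour described above, written in plain base R. It only illustrates the idea; it is not the package's implementation and it leaves out the parallel execution on the cluster.

# sequential illustration of the batch/reduce idea (not the actual implementation)
allIndices <- 1:10   # the full set of indices
batchSize  <- 3      # tiny batch size so the chunking is visible

# split the indices into consecutive batches of at most batchSize elements, keeping their order
batches <- split(allIndices, ceiling(seq_along(allIndices) / batchSize))

# stand-in indexProcessor: just sums each batch (in practice this might run a query)
indexProcessor <- function(indexBatch) sum(indexBatch)

# apply the processor to every batch; the result is a list with one element per batch
results <- lapply(batches, indexProcessor)

# reduce merges the per-batch results into a single value
reduce <- function(results) sum(unlist(results))
reduce(results)   # 55, the same as sum(1:10)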
See Also

batchByIndex, the non-parallel version of this function.
Examples

## Not run:
# cl = makeCluster(2)  # create a SNOW cluster
#
# # function to run a query for each batch of indices
# job = function(indexBatch) {
#   dbGetQuery(dbConnection,
#              paste("SELECT weight FROM table WHERE id IN (", paste(indexBatch, collapse = ","), ")"))
# }
#
# # function to combine all the results, in this case by summing them up
# reduce = function(results) sum(unlist(results))
#
# indices = 1:10000
#
# # run the queries in parallel and then sum the results
# totalWeight = parBatchByIndex(indices, job, reduce, cl, 1000)
#
# # shut the cluster down when finished
# stopCluster(cl)
## End(Not run)
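When the indexProcessor queries a database, as in the example above, the connection object generally cannot be serialised to the SNOW workers, so the dbConnection used inside job must already exist on each worker. One common pattern, shown here only as a sketch (the parallel package, the RSQLite backend and the "weights.db" file name are illustrative assumptions, not something this help page prescribes), is to open a connection on every worker before calling parBatchByIndex and close it afterwards:

library(parallel)    # assumption: the SNOW cluster is created with parallel::makeCluster
cl <- makeCluster(2)

# open one DBI connection per worker; the job function will then find `dbConnection`
# in each worker's global environment
clusterEvalQ(cl, {
  library(DBI)
  dbConnection <- dbConnect(RSQLite::SQLite(), "weights.db")  # hypothetical database file
  NULL  # do not ship the connection object back to the master
})

# ... call parBatchByIndex(indices, job, reduce, cl, 1000) as in the example above ...

# close the connections and shut the cluster down when finished
clusterEvalQ(cl, DBI::dbDisconnect(dbConnection))
stopCluster(cl)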