lidR-parallelism: Parallel computation in lidR

Description

This document explain how to process point clouds taking advantage of parallel processing in the lidR package. The lidR package has two levels of parallelism, which is why it is difficult to understand how it works. This page aims to provide users with a clear overview of how to take advantage of multicore processing even if they are not comfortable with the parallelism concept.

Arguments

Algorithm-based parallelism

When processing a point cloud we are applying an algorithm on data. This algorithm may or may not be natively parallel. In lidR some algorithms are fully computed in parallel, but some are not because they are not parallelizable, while some are only partially parallelized. It means that some portions of the code are computed in parallel and some are not. When an algorithm is natively parallel in lidR it is always a C++ based parallelization with OpenMP. The advantage is that the computation is faster without any consequence for memory usage because the memory is shared between the processors In short, algorithm-based parallelism provides a significant gain without any cost for your R session and your system (but obviously there is a greater workload for the processors). By default lidR uses half of your cores but you can control this with set_lidr_threads. For example, the lmf algorithm is natively parallel. The following code is computed in parallel:

las  <- readLAS("file.las")
tops <- tree_detection(las, lmf(2))

However, as stated above, not all algorithms are parallelized or even parallelizable. For example, li2012 is not parallelized. The following code is computed in serial:

las <- readLAS("file.las")
dtm <- lastrees(las, li2012())

To know which algorithms are parallelized users can refer to the documentation or use the function is.parallelised.

is.parallel(lmf(2))   #> TRUE
is.parallel(li2012()) #> FALSE

chunk-based parallelism

When processing a LAScatalog, the internal engine splits the dataset into chunks and each chunk is read and processed sequentially in a loop. But actually this loop can be parallelized with the future package. By defaut the chunks are processed sequentially, but they can be processed in parallel by registering an evaluation strategy. For example, the following code is evaluated sequentially:

ctg <- readLAScatalog("folder/")
out <- grid_metrics(ctg, mean(Z))

But this one is evaluated in parallel with two cores:

library(future)
plan(multisession, workers = 2L)
ctg <- readLAScatalog("folder/")
out <- grid_metrics(ctg, mean(Z))

With chunk-based parallelism any algorithm can be parallelized by processing several subsets of a dataset. However, there is a strong cost associated with this type of parallelism. When processing several chunks at a time, the computer needs to load the corresponding point clouds. Assuming the user processes one square kilometer chunks in parallel with 4 cores, then 4 chunks are loaded in the computer memory. This may be too much and the speed-up is not guaranteed since there is some overhead involved in reading several files at a time. Once this point is understood, chunk-based parallelism is very powerful since all the algorithms can be parallelized whether or not they are natively parallel.

Nested parallelism - part 1

Previous sections stated that some algorithms are natively parallel, such as lmf, and some are not, such as li2012. Anyway, users can split the dataset into chunks to process them simultaneously with the LAScatalog processing engine. Let's assume that the user's computer has four cores, what happens in this case:

library(future)
plan(multisession, workers = 4L)
set_lidr_threads(4L)
ctg <- readLAScatalog("folder/")
out <- tree_detection(ctg, lmf(2))

Here the catalog will be split into chunks that will be processed in parallel. And each computation itself implies a parallelized task. This is a nested parallelism task and it is bad! Hopefully the lidR package handles such cases and chooses by default to give precedence to chunk-based parallelism. In this case chunks will be processed in parallel and the points will be processed serially. The question of nested parallel loops is irrelevant. The catalog processing engine has precedence rules that are guaranteed to avoid nested parallelism. This precedence rule aims to (1) always work (2) preserve behaviors of lidR version 2.0.y.

Nested parallelism - part 2

We explained rules of precedence. But actually the user can tune the engine more accurately. Let's define the following function:

myfun = function(cluster, ...)
{
  las <- readLAS(cluster)
  if (is.empty(las)) return(NULL)
  las  <- lasnormalize(las, tin())
  tops <- tree_detection(las, lmf(2))
  bbox <- extent(cluster)
  tops <- crop(tops, bbox)
  return(tops)
}
out <- catalog_apply(ctg, myfun, ws = 5)

This function used two algorithms, one is partially parallelized (tin) and one is fully parallelized lmf. The user can manually use both OpenMP and future. By default the engine will give precedence to chunk-based parallelism because it works in all cases but the user can impose something else. In the following 2 workers are attributed to future and 2 workers are attributed to OpenMP.

plan(multisession, workers = 2L)
set_lidr_threads(2L)
catalog_apply(ctg, myfun, ws = 5)

The rule is simple. If the number of workers needed is greater than the number of available workers then OpenMP is disabled. Let suppose we have a quadcore machine:

# 2 chunks 2 threads: OK
plan(multisession, workers = 2L)
set_lidr_threads(2L)
# 4 chunks 1 threads: OK
plan(multisession, workers = 4L)
set_lidr_threads(1L)
# 1 chunks 4 threads: OK
plan(sequential)
set_lidr_threads(4L)
# 3 chunks 2 threads: NOT OK
# Needs 6 workers, OpenMP threads are set to 1 i.e. sequential processing
plan(multisession, workers = 3L)
set_lidr_threads(2L)