psel: Preference Selection

Description

Evaluates a preference on a given data set, i.e., returns the maximal elements of the data set with respect to the given preference order.

Usage

psel(df, pref, ...)
psel.indices(df, pref, ...)

Arguments

df

A data frame or, for a grouped preference selection, a grouped data frame. See below for details.

pref

The preference order constructed via complex_pref and base_pref. All variables occurring in the definition of pref must be either columns of the data frame df or variables/functions of the environment where pref was defined.

...

Additional (optional) parameters for top(-level)-k selections: top, at_least, top_level, and_connected, and show_level.

Top-k Preference Selection

For a given top value of k, the k best elements and their level values are returned. The level values are determined as follows:

All the maxima of a data set w.r.t. a preference have level 1.

The maxima of the remainder, i.e., the dataset without the level 1 maxima, have level 2. The n-th iteration of "Take the maxima from the remainder" leads to tuples of level n.
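
The level assignment described above can be sketched in base R. The helper pareto_levels below is hypothetical (it is not part of rPref) and assumes a Pareto preference in which lower values are better in every column. The guarded call at the end assumes rPref is installed and shows how the same data might be passed to psel with a top value.

```r
# Hypothetical helper, not part of rPref: assigns level 1 to the maxima,
# level 2 to the maxima of the remainder, and so on.
# Assumption: lower values are preferred in every column.
pareto_levels <- function(m) {
  m <- as.matrix(m)
  level <- integer(nrow(m))
  remaining <- seq_len(nrow(m))
  current <- 1L
  while (length(remaining) > 0) {
    # Row i is dominated if another remaining row j is no worse in every
    # column and strictly better in at least one.
    dominated <- vapply(remaining, function(i) {
      any(vapply(remaining, function(j) {
        all(m[j, ] <= m[i, ]) && any(m[j, ] < m[i, ])
      }, logical(1)))
    }, logical(1))
    level[remaining[!dominated]] <- current   # maxima of the remainder
    remaining <- remaining[dominated]         # continue with the rest
    current <- current + 1L
  }
  level
}

pts <- data.frame(x = c(1, 2, 3, 2, 3), y = c(3, 2, 1, 3, 3))
lev <- pareto_levels(pts)
print(cbind(pts, level = lev))

# With rPref installed, a top-k selection over all rows returns the tuples
# together with their level values.
if (requireNamespace("rPref", quietly = TRUE)) {
  library(rPref)
  print(psel(pts, low(x) * low(y), top = nrow(pts)))
}
```

In this toy data set the first three points are mutually incomparable and form level 1; (2, 3) is only dominated by level 1 points and gets level 2; (3, 3) is additionally dominated by (2, 3) and gets level 3.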

Grouped Preference Selection

With psel it is also possible to perform a preference selection where the maxima are calculated for every group separately. The groups have to be created with group_by from the dplyr package. The preference selection preserves the grouping, i.e., the groups are restored after the preference selection.

For example, if the summarize function from dplyr is applied to psel(group_by(...), pref), the summarizing is done for the set of maxima of each group. This can be used, e.g., to calculate the number of maxima in each group; see the examples below.

A {top, at_least, top_level} preference selection is applied to each group separately. A top = k selection returns the k best tuples for each group. Hence, if there are 3 groups in df, each containing at least 2 elements, and we have top = 2, then 6 tuples will be returned.
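
The tuple count in the paragraph above can be checked on mtcars, which has 3 cyl groups with more than 2 rows each. The first part is a base R sketch that assumes the single-attribute preference "low mpg", for which the two best tuples of a group are simply the two rows with the smallest mpg. The second part is guarded and assumes that rPref and dplyr are installed.

```r
# Base R sketch: top-2 per group for the preference "low mpg".
groups <- split(mtcars, mtcars$cyl)
top2 <- do.call(rbind, lapply(groups, function(g) {
  head(g[order(g$mpg), ], 2)   # two rows with the smallest mpg in this group
}))
nrow(top2)  # 3 groups x 2 tuples = 6

# The same selection expressed with psel on a grouped data frame.
if (requireNamespace("rPref", quietly = TRUE) &&
    requireNamespace("dplyr", quietly = TRUE)) {
  library(rPref)
  library(dplyr)
  res <- psel(group_by(mtcars, cyl), low(mpg), top = 2)
  print(summarise(res, n()))   # number of returned tuples per group
}
```

If several rows of a group tie on mpg, the base R sketch breaks the tie by row order; the rows chosen by psel may differ in that case.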

Parallel Computation

On multicore machines, the preference selection can be run in parallel using a divide-and-conquer approach. Note that, depending on the data set, this is not always faster than a single-threaded computation. To activate parallel computation within rPref, set:

options(rPref.parallel = TRUE)

If this option is not set, rPref will use single-threaded computation by default.
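
Since parallel computation is not always faster, it can be worth timing both modes on the data at hand. The following is a minimal sketch that assumes rPref is installed and that the option is read each time psel is called. mtcars is used only to keep the example self-contained; it is far too small for a meaningful timing difference.

```r
# Enable parallel computation, remembering the previous setting.
old <- options(rPref.parallel = TRUE)
parallel_on <- isTRUE(getOption("rPref.parallel"))

if (requireNamespace("rPref", quietly = TRUE)) {
  library(rPref)
  p <- low(mpg) * low(hp)

  # Time the selection with the parallel option switched on ...
  t_par <- system.time(sky_par <- psel(mtcars, p))

  # ... and again with single-threaded computation.
  options(rPref.parallel = FALSE)
  t_seq <- system.time(sky_seq <- psel(mtcars, p))

  print(c(parallel = t_par[["elapsed"]], sequential = t_seq[["elapsed"]]))
}

# Restore whatever was set before.
options(old)
```

Saving the return value of options() and passing it back at the end restores the previous state, including the case where the option was not set at all.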

Details

The difference between the two variants of the preference selection is:

The psel function returns a subset of the data set which are the maxima according to the given preference.
The function psel.indices returns just the row indices of the maxima (except top-k queries with show_level = TRUE, see top-k preference selection). Hence psel(df, pref) is equivalent to df[psel.indices(df, pref),] for non-grouped data frames.
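
The stated equivalence between the two variants can be tried out directly. This is a small sketch that assumes rPref is installed; the preference and data set are the same as in the examples on this page.

```r
if (requireNamespace("rPref", quietly = TRUE)) {
  library(rPref)
  p <- low(mpg) * low(hp)

  idx <- psel.indices(mtcars, p)   # row indices of the maxima
  sky <- psel(mtcars, p)           # the maxima as a data frame

  # Compare the directly selected maxima with the rows picked by index.
  print(all.equal(sky, mtcars[idx, ]))
}
```

Working with indices can be convenient when the maxima should be marked in the original data frame rather than extracted from it.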

Examples

library(rPref)

# Skyline and top-k/at-least skyline
psel(mtcars, low(mpg) * low(hp))
psel(mtcars, low(mpg) * low(hp), top = 5)
psel(mtcars, low(mpg) * low(hp), at_least = 5)

# visualize the skyline in a plot
sky1 <- psel(mtcars, high(mpg) * high(hp))
plot(mtcars$mpg, mtcars$hp)
points(sky1$mpg, sky1$hp, lwd=3)

# grouped preference with dplyr
library(dplyr)
psel(group_by(mtcars, cyl), low(mpg))

# return the number of maxima in each group
summarise(psel(group_by(mtcars, cyl), low(mpg)), n())