POMDP: Define a POMDP Problem

Description

Defines all the elements of a POMDP problem including the discount rate, the set of states, the set of actions, the set of observations, the transition probabilities, the observation probabilities, and rewards.

Usage

POMDP(
  states,
  actions,
  observations,
  transition_prob,
  observation_prob,
  reward,
  discount = 0.9,
  horizon = Inf,
  terminal_values = NULL,
  start = "uniform",
  name = NA
)
O_(action = "*", end.state = "*", observation = "*", probability)
T_(action = "*", start.state = "*", end.state = "*", probability)
R_(action = "*", start.state = "*", end.state = "*", observation = "*", value)

Arguments

states

a character vector specifying the names of the states. Note that state names have to start with a letter.

actions

a character vector specifying the names of the available actions. Note that action names have to start with a letter.

observations

a character vector specifying the names of the observations. Note that observation names have to start with a letter.

transition_prob

Specifies action-dependent transition probabilities between states. See Details section.

observation_prob

Specifies the probability that an action/state combination produces an observation. See Details section.

reward

Specifies the reward structure, which depends on the action, states and observations. See Details section.

discount

numeric; discount factor between 0 and 1.

horizon

numeric; Number of epochs. Inf specifies an infinite horizon.

terminal_values

a vector with the terminal values for each state or a matrix specifying the terminal rewards via a terminal value function (e.g., the alpha component produced by solve_POMDP). A single 0 specifies that all terminal values are zero.

start

Specifies the initial probabilities for each state (i.e., the initial belief), typically as a vector or the string 'uniform' (default). This belief is used to calculate the total expected cumulative reward. It is also used by some solvers. See Details section for more information.

name

a string to identify the POMDP problem.

action, start.state, end.state, observation, probability, value

Values used in the helper functions O_(), R_(), and T_() to create an entry for observation_prob, reward, or transition_prob above, respectively. The default value '*' matches any action/state/observation.

Value

The function returns an object of class POMDP, which is a list containing the model specification. solve_POMDP() reads the object and adds a list element named 'solution'.

Details

We use the following notation. A POMDP is a 7-tuple:

\((S,A,T,R, \Omega ,O, \gamma)\).

\(S\) is the set of states; \(A\) is the set of actions; \(T\) are the conditional transition probabilities between states; \(R\) is the reward function; \(\Omega\) is the set of observations; \(O\) are the conditional observation probabilities; and \(\gamma\) is the discount factor. We will use lower case letters to represent a member of a set, e.g., \(s\) is a specific state. To refer to the size of a set we will use cardinality, e.g., the number of actions is \(|A|\).

Names used for mathematical symbols in code

\(S, s, s'\): 'states', 'start.state', 'end.state'
\(A, a\): 'actions', 'action'
\(\Omega, o\): 'observations', 'observation'

State names, actions and observations can be specified as strings or index numbers (e.g., start.state can be specified as the index of the state in states). For the specification as data.frames below, '*' can be used to mean any start.state, end.state, action or observation.

The specifications below map to the format used by pomdp-solve (see http://www.pomdp.org).

Specification of transition probabilities: \(T(s' | s, a)\)

The probability of transitioning to state \(s'\) given the start state \(s\) and action \(a\). The transition probabilities can be specified in the following ways:

A data frame with columns named exactly like the arguments of T_(). You can use rbind() with the helper function T_() to create this data frame.
A named list of matrices, one for each action. Each matrix is square with rows representing start states \(s\) and columns representing end states \(s'\). The strings 'identity' or 'uniform' can be specified instead of a matrix.
A function with the same arguments as T_(), but without default values, that returns the transition probability.
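
As an illustration of the named-list form, the following base-R sketch builds the transition matrices for the Tiger problem used in the Examples section and checks that every row is a probability distribution. It does not call the package; the explicit matrices are assumed to correspond to what the strings 'identity' and 'uniform' stand for, and the row-sum check is only a sanity test written for this sketch.

```r
# Base-R sketch: transition probabilities as a named list of matrices.
states  <- c("tiger-left", "tiger-right")
actions <- c("listen", "open-left", "open-right")
n <- length(states)

# Rows are start states s, columns are end states s'.
identity_m <- diag(n)                          # the state never changes
uniform_m  <- matrix(1 / n, nrow = n, ncol = n) # every end state equally likely
dimnames(identity_m) <- list(states, states)
dimnames(uniform_m)  <- list(states, states)

transition_prob <- list(
  "listen"     = identity_m,  # assumed equivalent to the string "identity"
  "open-left"  = uniform_m,   # assumed equivalent to the string "uniform"
  "open-right" = uniform_m
)

# Each row is a distribution over end states, so it has to sum to 1.
rows_ok <- vapply(
  transition_prob,
  function(m) all(abs(rowSums(m) - 1) < 1e-8),
  logical(1)
)
rows_ok
```

A single probability \(T(s' | s, a)\) is then read as transition_prob[[a]][s, s'], for example transition_prob[["listen"]]["tiger-left", "tiger-right"].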

Specification of observation probabilities: \(O(o | s', a)\)

The POMDP specifies the probability for each observation \(o\) given an action \(a\) and that the system transitioned to the end state \(s'\). These probabilities can be specified in the following ways:

A data frame with columns named exactly like the arguments of O_(). You can use rbind() with helper function O_() to create this data frame.
A named list of matrices, one for each action. Each matrix has rows representing end states \(s'\) and columns representing observations \(o\). The strings 'identity' or 'uniform' can be specified instead of a matrix.
A function with the same arguments as O_(), but without default values, that returns the observation probability.
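
The matrix form and the function form describe the same probabilities. The base-R sketch below tabulates a function like obs_f from the Examples section for the action "listen" and compares the result with the corresponding matrix. The use of outer() and Vectorize() is a choice made for this sketch, not something the package requires.

```r
# Base-R sketch: the same observation model as a matrix and as a function.
states       <- c("tiger-left", "tiger-right")
observations <- c("tiger-left", "tiger-right")

# Matrix form for action "listen": rows are end states s', columns observations o.
listen_m <- rbind(c(0.85, 0.15),
                  c(0.15, 0.85))
dimnames(listen_m) <- list(end.state = states, observation = observations)

# Function form with the arguments action, end.state and observation.
obs_f <- function(action, end.state, observation) {
  if (action == "listen") {
    if (end.state == observation) return(0.85)
    return(0.15)
  }
  1 / 2
}

# Tabulate the function over all (end.state, observation) pairs.
from_f <- outer(
  states, observations,
  Vectorize(function(s, o) obs_f("listen", s, o))
)

same <- isTRUE(all.equal(unname(listen_m), from_f))
same
```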

Specification of the reward function: \(R(s, s', o, a)\)

The reward function can be specified in the following ways:

A data frame with columns named exactly like the arguments of R_(). You can use rbind() with helper function R_() to create this data frame.
A list of lists. The list levels are 'action' and 'start.state'. The list elements are matrices with rows representing end states \(s'\) and columns representing an observation \(o\).
A function with the same arguments as R_(), but without default values, that returns the reward.
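
The list-of-lists form can be hard to picture, so here is a base-R sketch of that structure for the Tiger rewards from the Examples section. The helper const_m() is hypothetical and exists only in this sketch; it fills a matrix with one value because the Tiger rewards do not depend on the end state or the observation.

```r
# Base-R sketch: rewards as a list (action) of lists (start.state) of matrices.
states       <- c("tiger-left", "tiger-right")
actions      <- c("listen", "open-left", "open-right")
observations <- c("tiger-left", "tiger-right")

# Hypothetical helper for this sketch: a constant matrix with
# rows = end states s' and columns = observations o.
const_m <- function(value) {
  matrix(
    value,
    nrow = length(states), ncol = length(observations),
    dimnames = list(end.state = states, observation = observations)
  )
}

reward <- list(
  "listen"     = list("tiger-left" = const_m(-1),   "tiger-right" = const_m(-1)),
  "open-left"  = list("tiger-left" = const_m(-100), "tiger-right" = const_m(10)),
  "open-right" = list("tiger-left" = const_m(10),   "tiger-right" = const_m(-100))
)

# Index by action, then start state, then [end state, observation].
r <- reward[["open-left"]][["tiger-right"]]["tiger-left", "tiger-left"]
r
```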

Start Belief

This belief is used to calculate the total expected cumulative reward printed with the solved model. The function reward() can be used to calculate rewards for any belief.

Some methods use this belief to decide which belief states to explore (e.g., the finite grid method). The default initial belief is a uniform distribution over all states. Setting start = NULL specifies that no initial belief state is used.

Options to specify the start belief state are:

A probability distribution over the states. That is, a vector of \(|S|\) probabilities, that add up to \(1\).
The string "uniform" for a uniform distribution over all states.
An integer in the range \(1\) to \(|S|\) to specify the index of a single starting state.
A string specifying the name of a single starting state.
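
To make the options concrete, the base-R sketch below converts each of the four forms into a probability vector over the states. The function as_belief() is hypothetical and is not the package's own conversion code; it assumes that a single number denotes a state index whenever there is more than one state.

```r
# Base-R sketch: turning each way of specifying `start` into a belief vector.
states <- c("tiger-left", "tiger-right")

# Hypothetical helper, written for illustration only.
as_belief <- function(start, states) {
  n <- length(states)
  if (is.character(start) && length(start) == 1) {
    if (start == "uniform") return(setNames(rep(1 / n, n), states))
    idx <- match(start, states)
    if (is.na(idx)) stop("unknown state name: ", start)
    return(setNames(as.numeric(seq_len(n) == idx), states))
  }
  if (is.numeric(start) && length(start) == 1 && n > 1) {
    # Assumption: a single number is the index of the starting state.
    if (!(start %in% seq_len(n))) stop("state index out of range")
    return(setNames(as.numeric(seq_len(n) == start), states))
  }
  if (is.numeric(start) && length(start) == n && abs(sum(start) - 1) < 1e-8) {
    return(setNames(as.numeric(start), states))
  }
  stop("cannot interpret 'start' as a belief")
}

b_uniform <- as_belief("uniform", states)     # 0.5 0.5
b_index   <- as_belief(2, states)             # 0 1
b_name    <- as_belief("tiger-left", states)  # 1 0
b_vector  <- as_belief(c(0.3, 0.7), states)   # 0.3 0.7
```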

Time-dependent POMDPs

Time dependence of transition probabilities, observation probabilities and reward structure can be modeled with a sequence of episodes, where each episode is a run of consecutive epochs that share the same settings. The episode lengths are specified as a vector for horizon: the length of the vector is the number of episodes and each value is the length of that episode in epochs. Transition probabilities, observation probabilities and/or the reward structure can then be given as a list with one element per episode. See solve_POMDP() for more details and an example.
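
The relation between epochs and episodes can be sketched in base R as follows. The horizon c(3, 2) and the per-episode costs are made-up values for illustration, and the sketch only shows the bookkeeping; it does not call the solver.

```r
# Base-R sketch: mapping epochs to episodes for a time-dependent model.
horizon <- c(3, 2)  # assumed: episode 1 lasts 3 epochs, episode 2 lasts 2 epochs

total_epochs     <- sum(horizon)                              # 5
episode_of_epoch <- rep(seq_along(horizon), times = horizon)  # 1 1 1 2 2

# One setting per episode, here a made-up cost of listening.
listen_cost <- list(-1, -2)

# The setting in force at each epoch is the one of the episode it belongs to.
cost_per_epoch <- vapply(
  episode_of_epoch,
  function(e) listen_cost[[e]],
  numeric(1)
)
cost_per_epoch  # -1 -1 -1 -2 -2
```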

References

pomdp-solve website: http://www.pomdp.org

Examples

# NOT RUN {
## Defining the Tiger Problem (it is also available via data(Tiger), see ? Tiger)

Tiger <- POMDP(
  name = "Tiger Problem",
  discount = 0.75,
  states = c("tiger-left" , "tiger-right"),
  actions = c("listen", "open-left", "open-right"),
  observations = c("tiger-left", "tiger-right"),
  start = "uniform",

  transition_prob = list(
    "listen" =     "identity",
    "open-left" =  "uniform",
    "open-right" = "uniform"
  ),

  observation_prob = list(
    "listen" = rbind(c(0.85, 0.15),
                     c(0.15, 0.85)),
    "open-left" =  "uniform",
    "open-right" = "uniform"
  ),

  # the reward helper expects: action, start.state, end.state, observation, value
  # missing arguments default to '*' matching any value.
  reward = rbind(
    R_("listen",                    value =   -1),
    R_("open-left",  "tiger-left",  value = -100),
    R_("open-left",  "tiger-right", value =   10),
    R_("open-right", "tiger-left",  value =   10),
    R_("open-right", "tiger-right", value = -100)
  )
)

Tiger

# Defining the Tiger problem using functions

trans_f <- function(action, start.state, end.state) {
  # listening does not move the tiger
  if (action == 'listen') {
    if (end.state == start.state) return(1)
    return(0)
  }

  return(1/2) ### all other actions have a uniform distribution
}

obs_f <- function(action, end.state, observation) {
  # listening reveals the tiger's position with probability 0.85
  if (action == 'listen') {
    if (end.state == observation) return(0.85)
    return(0.15)
  }

  return(1/2) ### all other actions produce uninformative observations
}

rew_f <- function(action, start.state, end.state, observation) {
  if(action == 'listen') return(-1)
  if(action == 'open-left' && start.state == 'tiger-left') return(-100)
  if(action == 'open-left' && start.state == 'tiger-right') return(10)
  if(action == 'open-right' && start.state == 'tiger-left') return(10)
  if(action == 'open-right' && start.state == 'tiger-right') return(-100)
  stop('Not possible')
}

Tiger_func <- POMDP(
  name = "Tiger Problem",
  discount = 0.75,
  states = c("tiger-left" , "tiger-right"),
  actions = c("listen", "open-left", "open-right"),
  observations = c("tiger-left", "tiger-right"),
  start = "uniform",
  transition_prob = trans_f,
  observation_prob = obs_f,
  reward = rew_f
)

Tiger_func
# }
