Defines all the elements of a POMDP problem including the discount rate, the set of states, the set of actions, the set of observations, the transition probabilities, the observation probabilities, and rewards.
POMDP(
  states,
  actions,
  observations,
  transition_prob,
  observation_prob,
  reward,
  discount = 0.9,
  horizon = Inf,
  terminal_values = 0,
  start = "uniform",
  max = TRUE,
  name = NA
)

O_(action = "*", end.state = "*", observation = "*", probability)

T_(action = "*", start.state = "*", end.state = "*", probability)

R_(action = "*", start.state = "*", end.state = "*", observation = "*", value)
states: a character vector specifying the names of the states.
actions: a character vector specifying the names of the available actions.
observations: a character vector specifying the names of the observations.
transition_prob: Specifies the action-dependent transition probabilities between states. See Details section.
observation_prob: Specifies the probability that an action/state combination produces an observation. See Details section.
reward: Specifies the reward structure dependent on action, states and observations. See Details section.
discount: numeric; discount factor between 0 and 1.
horizon: numeric; number of epochs. Inf specifies an infinite horizon.
terminal_values: a vector with the terminal values for each state, or a matrix specifying the terminal rewards via a terminal value function (e.g., the alpha component produced by solve_POMDP). A single 0 specifies that all terminal values are zero.
start: Specifies the initial probabilities for each state (i.e., the initial belief), typically as a vector or the string "uniform" (default). This belief is used to calculate the total expected cumulative reward. It is also used by some solvers. See Details section for more information.
max: logical; is this a maximization problem (maximize reward) or a minimization (minimize cost specified in reward)?
name: a string to identify the POMDP problem.
action, start.state, end.state, observation, probability, value: values used in the helper functions O_(), R_(), and T_() to create an entry for observation_prob, reward, or transition_prob above, respectively. The default value "*" matches any action/state/observation.
The function returns an object of class POMDP, which is a list with an element called model containing a list with the model specification. solve_POMDP reads the object and adds a list element called solution.
POMDP problems can be solved using solve_POMDP. More details about the available specifications can be found in [1].
We use the following notation. A POMDP is a 7-tuple \((S, A, T, R, \Omega, O, \gamma)\): \(S\) is the set of states; \(A\) is the set of actions; \(T\) are the conditional transition probabilities between states; \(R\) is the reward function; \(\Omega\) is the set of observations; \(O\) are the conditional observation probabilities; and \(\gamma\) is the discount factor. We use lower-case letters to represent a member of a set, e.g., \(s\) is a specific state, and cardinality to refer to the size of a set, e.g., the number of actions is \(|A|\).
Specification of transition probabilities
The probability of transitioning from state \(s\) to state \(s'\) given action \(a\) is \(T(s' | s, a)\). The transition probabilities can be specified in the following ways (a sketch in R follows the list):
A data frame with 4 columns, where the columns specify action \(a\), start.state \(s\), end.state \(s'\), and the transition probability \(T(s' | s, a)\), respectively. The first 3 columns can be either character (the name of the action or state) or integer indices. You can use rbind() with the helper function T_() to create this data frame.
A named list of \(|A|\) matrices, one for each action. Each matrix is square of size \(|S| \times |S|\). Instead of a matrix, the string "identity" or "uniform" can also be specified.
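For illustration, here is a hedged sketch of the data frame form, using the state and action names of the Tiger problem from the Examples below; the "*" wildcard is the helper functions' default described above.

## Transition probabilities as a data frame built with rbind() and T_()
## (a sketch; assumes the Tiger problem's states and actions).
## Listening leaves the tiger where it is; opening a door resets the
## problem, so the tiger ends up behind either door with probability 0.5.
trans_df <- rbind(
  T_("listen", "tiger-left",  "tiger-left",  1),
  T_("listen", "tiger-right", "tiger-right", 1),
  T_("open-left",  "*", "*", 0.5),   # "*" matches any start/end state
  T_("open-right", "*", "*", 0.5)
)
## trans_df can then be passed as transition_prob to POMDP().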
Specification of observation probabilities
The POMDP specifies, for each observation \(o\), the probability \(O(o | s', a)\) of observing \(o\) given that action \(a\) was taken and the system transitioned to end state \(s'\). These probabilities can be specified in the following ways (a sketch in R follows the list):
A data frame with 4 columns, where the columns specify the action \(a\), the end.state \(s'\), the observation \(o\), and the probability \(O(o | s', a)\), respectively. The first 3 columns can be either character (the name of the action, state, or observation), integer indices, or "*" to indicate that the observation probability applies to all actions or states. You can use rbind() with the helper function O_() to create this data frame.
A named list of \(|A|\) matrices. Each matrix is of size \(|S| \times |\Omega|\). The name of each matrix is the action it applies to. Instead of a matrix, the string "uniform" can also be specified.
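Similarly, here is a hedged sketch of the data frame form, again using the names of the Tiger problem from the Examples below.

## Observation probabilities as a data frame built with rbind() and O_()
## (a sketch; assumes the Tiger problem's actions, states and observations).
## Listening identifies the correct door with probability 0.85; after
## opening a door the observation is uninformative ("*" matches anything).
obs_df <- rbind(
  O_("listen", "tiger-left",  "tiger-left",  0.85),
  O_("listen", "tiger-left",  "tiger-right", 0.15),
  O_("listen", "tiger-right", "tiger-left",  0.15),
  O_("listen", "tiger-right", "tiger-right", 0.85),
  O_("open-left",  "*", "*", 0.5),
  O_("open-right", "*", "*", 0.5)
)
## obs_df can then be passed as observation_prob to POMDP().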
Specification of the reward function
The reward function \(R(s, s', o, a)\) can be specified in the following ways (a sketch of the list form follows the list):
A data frame with 5 columns, where the columns specify action \(a\), start.state \(s\), end.state \(s'\), observation \(o\), and the associated reward \(R(s, s', o, a)\), respectively. The first 4 columns can be either character (the names of the action, states, or observation), integer indices, or "*" to indicate that the reward applies to all transitions. Use rbind() with the helper function R_() to create this data frame.
A named list of \(|A|\) lists. Each list contains \(|S|\) named matrices representing the start states \(s\). Each matrix is of size \(|S| \times |\Omega|\), representing the end states \(s'\) and observations.
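For comparison with the data frame form used in the Examples below, here is a hedged sketch of the list form for the Tiger problem; the helper m() is purely illustrative.

## Reward as a named list: action -> start state -> |S| x |Omega| matrix
## (rows are end states s', columns are observations o).
m <- function(value) matrix(value, nrow = 2, ncol = 2,
  dimnames = list(c("tiger-left", "tiger-right"),    # end states s'
                  c("tiger-left", "tiger-right")))   # observations o
reward_list <- list(
  "listen"     = list("tiger-left" = m(-1),   "tiger-right" = m(-1)),
  "open-left"  = list("tiger-left" = m(-100), "tiger-right" = m(10)),
  "open-right" = list("tiger-left" = m(10),   "tiger-right" = m(-100))
)
## reward_list can then be passed as reward to POMDP().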
Start Belief
This belief is used to calculate the total expected cumulative reward printed with the solved model. The function reward can be used to calculate rewards for any belief. Some methods use this belief to decide which belief states to explore (e.g., the finite grid method). The default initial belief is a uniform distribution over all states. Setting start = NULL means that no initial belief state is used.
Options to specify the start belief state are (a short sketch in R follows the list):
a probability distribution over the states, i.e., a vector of \(|S|\) probabilities that add up to \(1\).
the string "uniform" for a uniform distribution over all states.
an integer in the range \(1\) to \(|S|\) to specify the index of a single starting state.
a string specifying the name of a single starting state.
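A few short sketches of these options, using the state names of the Tiger problem from the Examples below:

## Equivalent ways to set the initial belief (sketch; Tiger problem names):
b1 <- c(0.5, 0.5)     # explicit belief vector over the |S| = 2 states
b2 <- "uniform"       # uniform distribution over all states (default)
b3 <- 1               # index of a single starting state
b4 <- "tiger-left"    # name of a single starting state
## Any of these can be passed as the start argument of POMDP().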
Time-dependent POMDPs
Time dependence of the transition probabilities, the observation probabilities, and the reward structure can be modeled by considering a set of episodes, each representing a block of epochs with the same settings. The length of each episode is specified as a vector for horizon, where the length of the vector is the number of episodes and each value is the length of the episode in epochs. The transition probabilities, observation probabilities and/or the reward structure can then contain a list with one element per episode. See solve_POMDP for more details and an example.
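The following is only a hedged sketch, not the package's reference example: it assumes that the episodes are named consistently in horizon and in the reward list, and the episode names and the changing listening cost are purely illustrative. See solve_POMDP for the authoritative example.

## Time-dependent variant of the Tiger problem (sketch): two episodes of
## 3 epochs each; only the reward structure differs between episodes.
Tiger_td <- POMDP(
  name = "Tiger Problem (time-dependent reward)",
  discount = 0.75,
  states = c("tiger-left", "tiger-right"),
  actions = c("listen", "open-left", "open-right"),
  observations = c("tiger-left", "tiger-right"),
  start = "uniform",
  horizon = c(cheap_listening = 3, costly_listening = 3),
  transition_prob = list("listen" = "identity",
                         "open-left" = "uniform", "open-right" = "uniform"),
  observation_prob = list("listen" = rbind(c(0.85, 0.15), c(0.15, 0.85)),
                          "open-left" = "uniform", "open-right" = "uniform"),
  reward = list(
    cheap_listening = rbind(
      R_("listen",                    v =   -1),
      R_("open-left",  "tiger-left",  v = -100),
      R_("open-left",  "tiger-right", v =   10),
      R_("open-right", "tiger-left",  v =   10),
      R_("open-right", "tiger-right", v = -100)),
    costly_listening = rbind(
      R_("listen",                    v =  -10),
      R_("open-left",  "tiger-left",  v = -100),
      R_("open-left",  "tiger-right", v =   10),
      R_("open-right", "tiger-left",  v =   10),
      R_("open-right", "tiger-right", v = -100))
  )
)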
[1] For further details on how the POMDP solver used in this R package works, see http://www.pomdp.org
# NOT RUN {
## The Tiger Problem
library("pomdp")

Tiger <- POMDP(
  name = "Tiger Problem",
  discount = 0.75,
  states = c("tiger-left", "tiger-right"),
  actions = c("listen", "open-left", "open-right"),
  observations = c("tiger-left", "tiger-right"),
  start = "uniform",

  transition_prob = list(
    "listen"     = "identity",
    "open-left"  = "uniform",
    "open-right" = "uniform"),

  observation_prob = list(
    "listen"     = rbind(c(0.85, 0.15),
                         c(0.15, 0.85)),
    "open-left"  = "uniform",
    "open-right" = "uniform"),

  # the reward helper expects: action, start.state, end.state, observation, value
  reward = rbind(
    R_("listen",                    v =   -1),
    R_("open-left",  "tiger-left",  v = -100),
    R_("open-left",  "tiger-right", v =   10),
    R_("open-right", "tiger-left",  v =   10),
    R_("open-right", "tiger-right", v = -100)
  )
)
Tiger
Tiger$model
# }