We use the following notation. The POMDP is a 7-tuple \((S, A, T, R, \Omega, O, \gamma)\): \(S\) is the set of states; \(A\) is the set of actions; \(T\) specifies the conditional transition probabilities between states; \(R\) is the reward function; \(\Omega\) is the set of observations; \(O\) specifies the conditional observation probabilities; and
\(\gamma\) is the discount factor. We will use lower case letters to
represent a member of a set, e.g., \(s\) is a specific state. To refer to
the size of a set we will use cardinality, e.g., the number of actions is
\(|A|\).
Names used for mathematical symbols in code
\(S, s, s'\): 'states', 'start.state', 'end.state'
\(A, a\): 'actions', 'action'
\(\Omega, o\): 'observations', 'observation'
State names, actions and observations can be specified as strings or as index numbers (e.g., start.state can be specified as the index of the state in states). For the specification as data.frames below, '*' or NA can be used to mean any start.state, end.state, action or observation. Note that '*' is internally always represented as NA.
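As a small sketch (the state, action and observation labels here are made up for illustration), the sets are plain character vectors and members can be referenced either by name or by index:

    states       <- c("left", "right")            # S
    actions      <- c("listen", "open")           # A
    observations <- c("hear-left", "hear-right")  # Omega

    # "right" and 2 refer to the same state, since "right" is states[2].
    # In the data.frame specifications below, '*' (stored internally as NA)
    # stands for any start.state, end.state, action or observation.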
The specifications below map to the format used by pomdp-solve (see http://www.pomdp.org).
Specification of transition probabilities: \(T(s' | s, a)\)
The probability of transitioning to state \(s'\) given the current state \(s\) and action \(a\). The transition probabilities can be specified in the following ways (a short sketch follows the list):
A data.frame with columns named exactly like the arguments of T_(). You can use rbind() with the helper function T_() to create this data frame.
A named list of matrices, one for each action. Each matrix is square, with rows representing start states \(s\) and columns representing end states \(s'\). Instead of a matrix, the strings 'identity' or 'uniform' can also be specified.
A function with the same arguments as T_(), but no default values, that returns the transition probability.
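A minimal sketch of these options, assuming the helpers come from the pomdp package and that T_() takes the action, start state, end state and probability in that order; it reuses the made-up two-state example from above (actions 'listen' and 'open'):

    library(pomdp)

    # Option 1: data.frame built with rbind() and the helper T_().
    # 'listen' keeps the state; 'open' resets it uniformly ('*' = any start state).
    trans_df <- rbind(
      T_("listen", "left",  "left",  1),
      T_("listen", "right", "right", 1),
      T_("open",   "*",     "left",  0.5),
      T_("open",   "*",     "right", 0.5)
    )

    # Option 2: named list of matrices (rows = start states, columns = end states);
    # the strings 'identity' and 'uniform' are shorthands for the corresponding matrices.
    trans_list <- list(
      "listen" = "identity",
      "open"   = matrix(0.5, nrow = 2, ncol = 2,
                        dimnames = list(c("left", "right"), c("left", "right")))
    )

    # Option 3: a function with the same arguments as T_(), but no defaults.
    trans_fun <- function(action, start.state, end.state) {
      if (action == "listen") return(as.numeric(start.state == end.state))
      0.5
    }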
Specification of observation probabilities: \(O(o | s', a)\)
The POMDP specifies the probability for each observation \(o\) given an
action \(a\) and that the system transitioned to the end state
\(s'\). These probabilities can be specified in the following ways (a short sketch follows the list):
A data frame with columns named exactly like the arguments of O_(). You can use rbind() with the helper function O_() to create this data frame.
A named list of matrices, one for each action. Each matrix has rows representing end states \(s'\) and columns representing observations \(o\). Instead of a matrix, the strings 'identity' or 'uniform' can also be specified.
A function with the same arguments as O_(), but no default values, that returns the observation probability.
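A sketch of the first two options for the same made-up example, assuming O_() takes the action, end state, observation and probability in that order; 'listen' produces a noisy but informative observation, while 'open' gives no information:

    library(pomdp)

    # Option 1: data.frame built with rbind() and the helper O_().
    obs_df <- rbind(
      O_("listen", "left",  "hear-left",  0.85),
      O_("listen", "left",  "hear-right", 0.15),
      O_("listen", "right", "hear-left",  0.15),
      O_("listen", "right", "hear-right", 0.85),
      O_("open",   "*",     "*",          0.5)
    )

    # Option 2: named list of matrices (rows = end states, columns = observations).
    obs_list <- list(
      "listen" = rbind(c(0.85, 0.15),
                       c(0.15, 0.85)),
      "open"   = "uniform"
    )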
Specification of the reward function: \(R(s, s', o, a)\)
The reward function can be specified in the following ways (a short sketch follows the list):
A data frame with columns named exactly like the arguments of R_(). You can use rbind() with the helper function R_() to create this data frame.
A list of lists. The list levels are 'action' and 'start.state'. The list elements are matrices with rows representing end states \(s'\) and columns representing observations \(o\).
A function with the same arguments as R_(), but no default values, that returns the reward.
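A sketch of the first two options for the same made-up example, assuming R_() takes the action, start state, end state, observation and the reward value in that order; '*' matches any value in a column:

    library(pomdp)

    # Option 1: data.frame built with rbind() and the helper R_().
    reward_df <- rbind(
      R_("listen", "*",     "*", "*",   -1),   # small cost for listening
      R_("open",   "left",  "*", "*", -100),   # opening while the state is 'left' is penalized
      R_("open",   "right", "*", "*",   10)    # opening while the state is 'right' is rewarded
    )

    # Option 2: list of lists, first level 'action', second level 'start.state';
    # each element is a matrix with rows = end states and columns = observations.
    reward_list <- list(
      "listen" = list("left"  = matrix(-1,   2, 2),
                      "right" = matrix(-1,   2, 2)),
      "open"   = list("left"  = matrix(-100, 2, 2),
                      "right" = matrix(10,   2, 2))
    )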
Start Belief
The initial belief state of the agent is a distribution over the states. It is used to calculate the total expected cumulative reward printed with the solved model. The function reward() can be used to calculate rewards for any belief. Some methods use this belief to decide which belief states to explore (e.g., the finite grid method).
Options to specify the start belief state are:
A probability distribution over the states, i.e., a vector of \(|S|\) probabilities that add up to \(1\).
The string "uniform"
for a uniform
distribution over all states.
An integer in the range \(1\) to \(|S|\) specifying the index of a single starting state.
A string specifying the name of a single starting state.
The default initial belief is a uniform distribution over all states.
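A small sketch, continuing the made-up two-state example with states named "left" and "right"; each line is one alternative way to set the start belief:

    start <- c(0.5, 0.5)   # explicit probability vector of length |S| (sums to 1)
    start <- "uniform"     # uniform distribution over all states (the default)
    start <- 1             # index of a single starting state (here "left")
    start <- "left"        # name of a single starting state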
Time-dependent POMDPs
Time dependence of transition probabilities, observation probabilities and the reward structure can be modeled by considering a set of episodes, each covering a number of consecutive epochs with the same settings. The length of each episode is specified as a vector for horizon, where the length of the vector is the number of episodes and each value is the length of that episode in epochs. Transition probabilities, observation probabilities and/or the reward structure can contain a list with the values for each episode. A short sketch follows; see solve_POMDP() for more details and a complete example.
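A rough sketch of the episode structure (names and values are made up, and the exact format of the per-episode list is documented in solve_POMDP()): a horizon of c(2, 3) defines two episodes of 2 and 3 epochs, and the transition probabilities can then be given as a list with one specification per episode:

    # Two episodes: the first lasts 2 epochs, the second 3 epochs.
    horizon <- c(2, 3)

    # One transition specification per episode (here as named lists of matrices);
    # observation probabilities and the reward structure can be made
    # time-dependent in the same way.
    transition_prob <- list(
      list("listen" = "identity", "open" = "identity"),  # episode 1
      list("listen" = "identity", "open" = "uniform")    # episode 2
    )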