solve_MDP: Solve an MDP Problem

Description

A simple implementation of value iteration and modified policy iteration.

Usage

solve_MDP(
  model,
  horizon = NULL,
  discount = NULL,
  terminal_values = NULL,
  method = "value",
  eps = 0.01,
  max_iterations = 1000,
  k_backups = 10,
  verbose = FALSE
)
q_values_MDP(model, U = NULL)
random_MDP_policy(model, prob = NULL)
approx_MDP_policy_evaluation(pi, model, U = NULL, k_backups = 10)

Arguments

model

a POMDP problem specification created with POMDP(). Alternatively, a POMDP file or the URL for a POMDP file can be specified.

horizon

an integer with the number of epochs for problems with a finite planning horizon. If set to Inf, the algorithm continues running iterations till it converges to the infinite horizon solution. If NULL, then the horizon specified in model will be used. For time-dependent POMDPs a vector of horizons can be specified (see Details section).

discount

discount factor in range \([0, 1]\). If NULL, then the discount factor specified in model will be used.

terminal_values

a vector with terminal utilities for each state. If NULL, then a vector of all 0s is used.

method

string; one of the following solution methods: 'value', 'policy'.

eps

maximum error allowed in the utility of any state (i.e., the maximum policy loss).

max_iterations

maximum number of iterations allowed to converge. If the maximum is reached then the non-converged solution is returned with a warning.

k_backups

number of look ahead steps used for approximate policy evaluation used by method 'policy'.

verbose

logical, if set to TRUE, the function provides the output of the pomdp solver in the R console.

a vector with state utilities (expected sum of discounted rewards from that point on).

prob

probability vector for actions.

a policy as a data.frame with columns state and action.

Value

solve_MDP() returns an object of class POMDP which is a list with the model specifications (model), the solution (solution). The solution is a list with the elements:

policy a list representing the policy graph. The list only has one element for converged solutions.
converged did the algorithm converge (NA) for finite-horizon problems.
delta final delta (infinite-horizon only)
iterations number of iterations to convergence (infinite-horizon only)

q_values_MDP() returns a state by action matrix specifying the Q-function, i.e., the utility value of executing each action in each state.

random_MDP_policy() returns a data.frame with columns state and action to define a policy.

approx_MDP_policy_evaluation() is used by the modified policy iteration algorithm and returns an approximate utility vector U estimated by evaluating policy pi.

Examples

Run this code

# NOT RUN {
data(Maze)
Maze

# use value iteration
maze_solved <- solve_MDP(Maze, method = "value")
policy(maze_solved)

# value function (utility function U)
plot_value_function(maze_solved)

# Q-function (states times action)
q_values_MDP(maze_solved)

# use modified policy iteration
maze_solved <- solve_MDP(Maze, method = "policy")
policy(maze_solved)

# finite horizon
maze_solved <- solve_MDP(Maze, method = "value", horizon = 3)
policy(maze_solved)

# create a random policy where action n is very likely and approximate 
#  the value function. We change the discount factor to .9 for this.
Maze_discounted <- Maze
Maze_discounted$discount <- .9
pi <- random_MDP_policy(Maze_discounted, prob = c(n = .7, e = .1, s = .1, w = 0.1))
pi

# compare the utility function for the random policy with the function for the optimal 
#  policy found by the solver.
maze_solved <- solve_MDP(Maze)

approx_MDP_policy_evaluation(pi, Maze, k_backup = 100)
approx_MDP_policy_evaluation(policy(maze_solved)[[1]], Maze, k_backup = 100)

# Note that the solver already calculates the utility function and returns it with the policy
policy(maze_solved)
# }

Run the code above in your browser using DataLab

Description

Usage

Arguments

Value

See Also

Examples