
pomdp (version 1.0.2)

simulate_MDP: Simulate Trajectories in an MDP

Description

Simulate trajectories through an MDP. The start state for each trajectory is randomly chosen using the specified probability distribution. At each epoch, an action is chosen from an epsilon-greedy policy and the state is then updated using the model's transition probabilities.
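
For intuition, the following minimal sketch (not the package's internal code) illustrates the epsilon-greedy action choice described above, assuming a vector of policy actions indexed by state and a vector of available actions:

# illustrative only: with probability epsilon pick a random action,
# otherwise follow the policy's action for the current state
choose_action <- function(policy_actions, state, actions, epsilon) {
  if (runif(1) < epsilon) sample(actions, 1) else policy_actions[[state]]
}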

Usage

simulate_MDP(
  model,
  n = 100,
  start = NULL,
  horizon = NULL,
  visited_states = FALSE,
  epsilon = NULL,
  verbose = FALSE
)

Arguments

model

an MDP model.

n

number of trajectories.

start

probability distribution over the states for choosing the starting states for the trajectories. Defaults to "uniform".

horizon

number of epochs for the simulation. If NULL then the horizon for the model is used.

visited_states

logical; should all states visited along the trajectories be returned? If FALSE, only the final state of each trajectory is returned.

epsilon

the probability of choosing a random action under the epsilon-greedy policy. The default is 0 for solved models and 1 for unsolved models.

verbose

logical; report the simulation parameters that were used.

Value

A vector of state ids (for the final epoch only, or for all visited states if visited_states = TRUE). Attributes containing action counts and rewards for each trajectory may be available.

See Also

Other MDP: MDP(), solve_MDP()

Examples

data(Maze)

# solve the MDP with no discounting
sol <- solve_MDP(Maze, discount = 1)
sol
policy(sol)

## Example 1: simulate 10 trajectories; only the final state of each trajectory is returned
sim <- simulate_MDP(sol, n = 10, horizon = 10, verbose = TRUE)
head(sim)

# additional data is available as attributes
names(attributes(sim))
attr(sim, "avg_reward")
colMeans(attr(sim, "action"))

## Example 2: simulate trajectories that always start in state s_1
sim <- simulate_MDP(sol, n = 100, start = "s_1", horizon = 10)
sim

# the average reward is an estimate of the utility of state s_1 under the optimal policy:
policy(sol)[[1]][1,]
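
## A further sketch: for an unsolved model, epsilon defaults to 1
## (purely random actions); visited_states = TRUE returns all states
## visited along each trajectory instead of only the final states.
sim <- simulate_MDP(Maze, n = 10, horizon = 10, visited_states = TRUE)
head(sim)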

