pomdp (version 1.1.3)

simulate_MDP: Simulate Trajectories in an MDP

Description

Simulate trajectories through an MDP. The start state for each trajectory is randomly chosen using the specified start distribution. Actions are then chosen following an epsilon-greedy policy and the state is updated using the transition model.
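
For intuition, a single epsilon-greedy action choice can be sketched in plain R as follows. This is an illustrative sketch, not the package's internal code; choose_action, policy_action, and actions are hypothetical names.

# sketch of one epsilon-greedy step (illustrative, not pomdp internals)
choose_action <- function(policy_action, actions, epsilon) {
  if (runif(1) < epsilon)
    sample(actions, 1)   # explore: pick a uniformly random action
  else
    policy_action        # exploit: follow the policy's action
}

choose_action("up", c("up", "down", "left", "right"), epsilon = 0.1)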

Usage

simulate_MDP(
  model,
  n = 100,
  start = NULL,
  horizon = NULL,
  return_states = FALSE,
  epsilon = NULL,
  delta_horizon = 0.001,
  engine = "cpp",
  verbose = FALSE,
  ...
)

Value

A list with elements:

  • avg_reward: The average discounted reward.

  • reward: Reward for each trajectory.

  • action_cnt: Action counts.

  • state_cnt: State counts.

  • states: if return_states = TRUE, a vector with the ids of the visited states (for the final epoch or for all epochs). Attributes containing the action counts and the reward for each trajectory may be attached.
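
The components can be inspected like any R list; a short sketch (reusing the Maze example from the Examples section below):

library(pomdp)
data(Maze)
sol <- solve_MDP(Maze, discount = 1)
sim <- simulate_MDP(sol, n = 10, horizon = 10, return_states = TRUE)

str(sim)            # the components listed above
sim$avg_reward      # average discounted reward over the 10 trajectories
head(sim$states)    # ids of the visited states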

Arguments

model

an MDP model.

n

number of trajectories.

start

probability distribution over the states for choosing the starting states for the trajectories. Defaults to "uniform".

horizon

number of epochs for the simulation. If NULL then the horizon for the model is used.

return_states

logical; return visited states.

epsilon

the probability of choosing a random action under the epsilon-greedy policy. The default is 0 for solved models and 1 for unsolved models.

delta_horizon

precision used to determine the effective horizon for infinite-horizon problems (a sketch follows this argument list).

engine

'cpp' or 'r' to perform the simulation using the faster C++ implementation or a native R implementation.

verbose

logical; report the parameters used for the simulation.

...

further arguments are ignored.
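
As an illustration of how delta_horizon limits the simulation length, the following sketch computes an effective horizon as the first epoch at which the discount factor falls below delta_horizon. This is an assumption for illustration only; the exact rule used by simulate_MDP() may differ.

# illustrative only: effective horizon for a discounted infinite-horizon problem
effective_horizon <- function(discount, delta_horizon = 0.001)
  ceiling(log(delta_horizon) / log(discount))

effective_horizon(discount = 0.9)   # 66 epochs for the default delta_horizon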

Author

Michael Hahsler

Details

A native R implementation is available (engine = 'r') and the default is a faster C++ implementation (engine = 'cpp').

Both implementations support parallel execution using the package foreach. To enable parallel execution, a parallel backend like doParallel needs to be available and registered (see doParallel::registerDoParallel()). Note that small simulations are slower when parallelization is used. Therefore, C++ simulations with n * horizon less than 100,000 are always executed using a single worker.
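
A minimal sketch of a parallel run, assuming the doParallel package is installed (n * horizon = 200,000 here, so the C++ engine distributes the trajectories across the registered workers):

library(pomdp)
library(doParallel)
registerDoParallel(cores = 2)     # register a parallel backend

data(Maze)
sol <- solve_MDP(Maze, discount = 1)
sim <- simulate_MDP(sol, n = 20000, horizon = 10, verbose = TRUE)
sim$avg_reward

stopImplicitCluster()             # release the workers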

See Also

Other MDP: MDP(), POMDP_accessors, solve_MDP(), transition_graph()

Examples

data(Maze)

# solve the MDP using no discounting
sol <- solve_MDP(Maze, discount = 1)
sol

# U in the policy is an estimate of the utility of being in a state when following the optimal policy.
policy(sol)
matrix(policy(sol)[[1]]$action, nrow = 3, dimnames = list(1:3, 1:4))[3:1, ]

## Example 1: simulate 100 trajectories following the policy; visited states are not returned
sim <- simulate_MDP(sol, n = 100, horizon = 10, verbose = TRUE)
sim

# Note that all simulations start at s_1 and that the simulated avg. reward 
# is therefore an estimate of the U value for the start state s_1.
policy(sol)[[1]][1,] 

# Calculate proportion of actions taken in the simulation
round_stochastic(sim$action_cnt / sum(sim$action_cnt), 2)

# reward distribution
hist(sim$reward)

## Example 2: simulate trajectories starting from a uniform distribution over all
#             states and return all visited states
sim <- simulate_MDP(sol, n = 100, start = "uniform", horizon = 10, return_states = TRUE)
sim$avg_reward

# how often was each state visited?
table(sim$states)
matrix(table(sim$states), nrow = 3, dimnames = list(1:3, 1:4))[3:1, ]
