Maze: Steward Russell's 4x3 Maze MDP

Description

The 4x3 maze described in Chapter 17 of the the textbook: "Artificial Intelligence: A Modern Approach" (AIMA).

Arguments

Format

An object of class MDP.

Details

The simple maze has the following layout:

    1234        Transition model:
   ######             .8 (action direction)
  3#   +#              ^
  2# # -#              |
  1#    #         .1 <-|-> .1
   ######

We represent the maze states as a matrix with 3 rows (north/south) and 4 columns (east/west). The states are labeled s_1 through s_12 and are fully observable. The # (state s_5) in the middle of the maze is an obstruction and not reachable. Rewards are associated with transitions. The default reward (penalty) is -0.04. Transitioning to + (state s_12) gives a reward of 1.0, transitioning to - (state s_11) has a reward of -1.0. States s_11 and s_12 are terminal (absorbing) states.

Actions are movements (north, south, east, west). The actions are unreliable with a .8 chance to move in the correct direction and a 0.1 chance to instead to move in a perpendicular direction leading to a stochastic transition model.

Note that the problem has reachable terminal states which leads to a proper policy (that is guaranteed to reach a terminal state). This means that the solution also converges without discounting (discount = 1).

References

Russell, S. J. and Norvig, P., & Davis, E. (2021). Artificial intelligence: a modern approach. 4rd ed.

Examples

Run this code

# NOT RUN {
# The problem can be loaded using data(Maze).

# Here is the complete problem definition:

S <- paste0("s_", seq_len(3 * 4))
s2rc <- function(s) {
  if(is.character(s)) s <- match(s, S)
  c((s - 1) %% 3 + 1, (s - 1) %/% 3 + 1)
}
rc2s <- function(rc) S[rc[1] + 3 * (rc[2] - 1)]

A <- c("north", "south", "east", "west")

T <- function(action, start.state, end.state) {
  action <- match.arg(action, choices = A)
  
  if (start.state %in% c('s_11', 's_12', 's_5')) {
    if (start.state == end.state) return(1)
    else return(0)
  }

  if(action %in% c("north", "south")) error_direction <- c("east", "west")
  else error_direction <- c("north", "south")
  
  rc <- s2rc(start.state)
  delta <- list(north = c(+1, 0), south = c(-1, 0), 
                east = c(0, +1), west = c(0, -1))
  P <- matrix(0, nrow = 3, ncol = 4)
  
  add_prob <- function(P, rc, a, value) {
    new_rc <- rc + delta[[a]]
    if (new_rc[1] > 3 || new_rc[1] < 1 || new_rc[2] > 4 || new_rc[2] < 1 
      || (new_rc[1] == 2 && new_rc[2]== 2))
      new_rc <- rc
    P[new_rc[1], new_rc[2]] <- P[new_rc[1], new_rc[2]] + value
    P
  }
 
 P <- add_prob(P, rc, action, .8)
 P <- add_prob(P, rc, error_direction[1], .1)
 P <- add_prob(P, rc, error_direction[2], .1)
 P[rbind(s2rc(end.state))]
}

T("n", "s_1", "s_2")

R <- rbind(
 R_(end.state   = '*',     value = -0.04),
 R_(end.state   = 's_11',  value = -1),
 R_(end.state   = 's_12',  value = +1),
 R_(start.state = 's_11',  value = 0),
 R_(start.state = 's_12',  value = 0),
 R_(start.state = 's_5',  value = 0)
)


Maze <- MDP(
 name = "Stuart Russell's 3x4 Maze",
 discount = 1,
 horizon = Inf,
 states = S,
 actions = A,
 transition_prob = T,
 reward = R
) 

Maze
str(Maze) 

maze_solved <- solve_MDP(Maze, method = "value")
policy(maze_solved)

# show the utilities and optimal actions organized in the maze layout (like in the AIMA textbook)
matrix(policy(maze_solved)[[1]]$U, nrow = 3, dimnames = list(1:3, 1:4))[3:1, ]
matrix(policy(maze_solved)[[1]]$action, nrow = 3, dimnames = list(1:3, 1:4))[3:1, ]

# Note: the optimal actions for the states with a utility of 0 are artefacts and should be ignored. 
# }

Run the code above in your browser using DataLab