The decentralized partially observable Markov decision process
(DEC-POMDP) framework is a model to represent multiple agents making
decisions under uncertainty. It is an extension of the partially
observable Markov decision process (POMDP) framework and
a specific case of a partially observable
stochastic game (POSG) (see Hansen, et al., 2004).
For two agents, a DEC-POMDP can be described as follows:
M = &lang S, A1, A2, P, R, &Omega1,
&Omega2, O, T &rang
S, the set of states with initial state s0
A1, the set of actions for agent 1
A2, the set of actions for agent 2
P, the state transition probabilities: P(s'| s, a1, a2), the probability of the environment transitioning to state s' given it was in state s and agents 1 and 2 took actions a1 and a2 respectively
R, the global reward function: R,(s, a1, a2), the immediate reward the system receives for being in state s and agent 1 taking action a1 and agent 2 taking a2
&Omega1, the set of observations for agent 1
&Omega2, the set of observations for agent 2
O, the observation probabilities: O(o1, o2| s, a1, a2), the probability of agent 1 seeing observation o1 and agent 2 seeing observation o2, given the state is s and agent 1 takes action a1 and agent 2 takes a2 (the observation probabilities can also depend on the resulting state)
T, the horizon, whether infinite or if finite, a positive integer
when T is infinite a discount factor, 0 ≤ &gamma < 1 , is used
A DEC-MDP is defined in the same way except:
the state must be jointly fully observable. That is, the state is uniquely defined by the observations of all agents. So, if O(o1, o2| s, a1, a2) > 0 then P(s'| o1, o2)=1.
"... there is currently no good way to combine game theoretic and POMDP
control strategies." -- Russell and Norvig, 2003
DEC-POMDP algorithms
Optimal and approximate algorithms have been developed for general DEC-POMDPs and many subclasses. The worst case complexity of the general case has been shown to be NEXP-complete (Bernstein et al., 2002) with optimal solutions found by dynamic programming (Hansen et al., 2004) and approximate solutions may be found using bounded policy iteration (Bernstein et al., 2005). A reduction in complexity is seen when the agents are mostly independent (Becker et al., 2003) and communication can be explicitly modeled and analyzed (Goldman et al., 2004).