As agents are built for ever more complex environments, methods that
consider the uncertainty in the system have strong advantages. This
uncertainty is common in domains such as autonomous navigation, inventory
management, sensor networks and e-commerce. When choices are
made in a decentralized manner by a set of decision makers, the problem can
be modeled as a decentralized partially observable Markov decision process
(Dec-POMDP). While the Dec-POMDP model offers a rich framework for
cooperative sequential decision making under uncertainty, the computational
complexity of the model presents an important research challenge.
It is an extension of the partially
observable Markov decision process (POMDP) framework and a specific
case of a partially observable stochastic game (POSG) (see Hansen, et al.,
2004).
Dec-POMDPs represent a sequential problem. At each stage, each agent takes
an action and receives:
A local observation
A joint immediate reward
A local policy for each agent is a mapping from its observation sequences to
actions, Ω* → A (The state is unknown, so it may be beneficial to remember
history)
A joint policy is a local policy for each agent
The goal is to maximize expected cumulative reward over a finite or infinite
number of steps
Note that decisions for each agent are made solely on local information, but
these decisions can affect the global reward and each agent's transitions
and observations.
For two agents, a DEC-POMDP can be described as follows:
M = <S, A1, A2, P, R, Ω1, Ω2,
O, T >
S, the set of states with initial state s0
A1, the set of actions for agent 1
A2, the set of actions for agent 2
P, the state transition probabilities: P(s'|
s, a1, a2), the probability of the
environment transitioning to state s' given it was in state s
and agents 1 and 2 took actions a1 and a2
respectively
R, the global reward function: R,(s, a1,
a2), the immediate reward the system receives for being
in state s and agent 1 taking action a1 and
agent 2 taking a2
Ω1, the set of observations for agent 1
Ω2, the set of observations for agent 2
O, the observation probabilities: O(o1, o2|
s, a1, a2), the probability of agent 1
seeing observation o1 and agent 2 seeing observation
o2, given the state is s and agent 1 takes action a1
and agent 2 takes a2 (the observation probabilities
can also depend on the resulting state)
T, the horizon, whether infinite or if finite, a positive
integer
when T is infinite a discount factor, 0 ≤ γ < 1 ,
is used
A DEC-MDP is defined in the same way except:
the state must be jointly fully observable. That is, the state is
uniquely defined by the observations of all agents. So, if O(o1,
o2| s, a1, a2) > 0 then
P(s'| o1, o2)=1.
"... there is currently no good way to combine game theoretic and POMDP
control strategies." -- Russell and Norvig, 2003
DEC-POMDP algorithms
Optimal and approximate algorithms have been developed for general
DEC-POMDPs as well as a number of subclasses. The worst case complexity of
the general case has been shown to be NEXP-complete (Bernstein et al., 2002)
with optimal solutions found by dynamic programming (Hansen et al., 2004)
and latter by methods that can exploit structure when present. A reduction
in complexity to NP is seen when the agents are mostly independent (Becker
et al., 2003) and communication can be explicitly modeled and analyzed
(Goldman et al., 2004).
For an overview of DEC-POMDPs, subclasses and algorithms, see our latest
tutorial [pdf][webpage].