Monday, February 23, 2009

Thoughts on an Adaptive Local Search for Planning


Non-stationary Decision Making


The basic question discussed in this article is
Out of a set of possible choices (of heuristics, for example), which choice should be made?
Markov Decision Processes (MDPs) are an often-used mechanism for this problem. The MDP model requires a set of states {s} and a set of actions {a} that take the agent from one state to another with a certain probability P(s_(t+1) | s_t, a). An immediate reward R_a(s_t, s_(t+1)) is received on this transition. The solution of an MDP is a policy that specifies which action a to take in each state s.
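To make this concrete, a toy MDP can be written down directly as transition probabilities and rewards. The states, actions, probabilities, and reward values below are made up purely for illustration and are not tied to any particular planning domain.

```python
import random

# A toy MDP, purely illustrative: states, actions, transition
# probabilities P(s_{t+1} | s_t, a), and immediate rewards R_a(s_t, s_{t+1}).
states = ["s0", "s1", "s2"]
actions = ["a0", "a1"]

# transitions[(s, a)] = list of (next state, probability) pairs
transitions = {
    ("s0", "a0"): [("s1", 0.8), ("s0", 0.2)],
    ("s0", "a1"): [("s2", 1.0)],
    ("s1", "a0"): [("s2", 0.6), ("s1", 0.4)],
    ("s1", "a1"): [("s0", 1.0)],
    ("s2", "a0"): [("s2", 1.0)],
    ("s2", "a1"): [("s0", 0.5), ("s2", 0.5)],
}

# rewards[(s, a, s')] = immediate reward on that transition (0 otherwise)
rewards = {
    ("s0", "a0", "s1"): 1.0,
    ("s1", "a0", "s2"): 5.0,
}

def step(state, action):
    """Sample a successor state and reward for taking `action` in `state`."""
    successors, probs = zip(*transitions[(state, action)])
    next_state = random.choices(successors, weights=probs)[0]
    return next_state, rewards.get((state, action, next_state), 0.0)

# A policy is simply a mapping from state to action.
policy = {"s0": "a0", "s1": "a0", "s2": "a1"}
```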

Our problem of making a choice is complicated by the following factors:

1. Uncertainty in the real world introduces non-determinism into the MDP. This is commonly studied under Partially Observable MDPs (POMDPs): the underlying states are not observable to the agent, who must instead infer the underlying state from observations. In general, only approximate solutions of a POMDP can be determined.

2. Efficient approximate algorithms for solving POMDPs exist; [2], for example, presents a point-based anytime algorithm. However, for the very quick responses needed when selecting small actions within local search, they are still too slow. Solving POMDPs would be good for long-term learning; for short-term, non-stationary learning, low-cost approaches such as the one in [1] are more suitable (a simplified sketch of such a scheme appears after this list).

3. An MDP does not incorporate changes in the state structure over time. For example, in local search, many more local minima are usually encountered as the search progresses than near the start of the search, so the policy must evolve as time goes on. Squeaky wheel optimization in local search is one example of such an evolving approach.
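To illustrate what a low-cost, non-stationary scheme can look like, the sketch below keeps one weight per heuristic, selects heuristics with probability proportional to their weights, and updates the chosen weight with a constant step size so that recent rewards dominate older ones. This only conveys the flavour of such approaches; it is not Nareyek's exact update rule from [1], and the class name, step size, and weight floor are arbitrary.

```python
import random

class NonStationarySelector:
    """Low-cost heuristic selection in the spirit of [1] (illustrative sketch,
    not the exact update rule from the paper)."""

    def __init__(self, heuristic_names, alpha=0.2):
        self.weights = {name: 1.0 for name in heuristic_names}
        self.alpha = alpha  # constant step size: recent rewards dominate

    def choose(self):
        # Probability matching: select proportionally to current weights.
        names = list(self.weights)
        return random.choices(names, weights=[self.weights[n] for n in names])[0]

    def update(self, name, reward):
        # Constant-step-size update tracks a non-stationary reward signal.
        w = self.weights[name]
        self.weights[name] = max(0.01, w + self.alpha * (reward - w))
```

In a local-search loop, the reward could simply be the improvement in the objective achieved by the chosen heuristic on the current iteration.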

However, the simple approach for non-stationary learning described in [1] can be augmented with a model to make it more efficient. The idea is that more time can probably be allocated to making a larger decision.

For this purpose, each heuristic contains the following (a possible data structure is sketched after this list):
- goodness of the heuristic (goodness = the difficulty level that can be improved by applying the heuristic)
- confidence of goodness of the heuristic
- time needed to execute the heuristic
- resources needed to execute the heuristic
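In code, each heuristic could simply carry these four fields. The class and field names below are assumptions made for illustration, not part of any fixed interface.

```python
from dataclasses import dataclass

@dataclass
class HeuristicProfile:
    name: str
    goodness: float    # difficulty level that can be improved by the heuristic
    confidence: float  # confidence in the goodness estimate
    exec_time: float   # estimated time to execute the heuristic (seconds)
    resources: float   # estimated resource cost (abstract units)
```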

Incorporating Short-Term Goal Aspects

We will have two systems: one that handles long-term aspects (like the difficulty of a goal or the priority of a goal), and another that handles short-term aspects (like remaining time and available resources).

Every decision considers both the short-term and long-term aspects of the choice.
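One simple way to realise this is a scoring function in which the short-term system vetoes choices that do not fit the remaining time or resources, and the long-term system ranks the rest by goal priority and difficulty. The weights and the hard feasibility cutoff below are arbitrary illustration values, and the function reuses the HeuristicProfile fields from the sketch above.

```python
def score_choice(profile, goal_priority, goal_difficulty,
                 remaining_time, available_resources):
    """Combine long-term and short-term aspects into a single score
    (illustrative weighting only)."""
    # Short-term system: reject choices that cannot finish in time
    # or that exceed the available resources.
    if profile.exec_time > remaining_time or profile.resources > available_resources:
        return float("-inf")
    # Long-term system: prefer heuristics that address difficult,
    # high-priority goals.
    long_term = goal_priority * min(profile.goodness, goal_difficulty)
    # Mild penalty for slow, expensive heuristics.
    short_term = -0.1 * profile.exec_time - 0.05 * profile.resources
    return long_term + short_term
```

The best choice is then simply the candidate with the highest score.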

Making a Choice

Decisions taken when there is sufficient time available are usually different from those taken when little time remains before execution must start. For example, exploration of other, untried options will take a back seat when little time is available.
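One way to express this is an exploration probability that shrinks as the deadline approaches; the linear schedule and the 0.3 ceiling below are arbitrary choices for illustration.

```python
import random

def pick_heuristic(weights, remaining_time, total_time):
    """Explore untried or weaker options only while there is slack time."""
    explore_prob = 0.3 * (remaining_time / total_time)  # shrinks towards 0
    names = list(weights)
    if random.random() < explore_prob:
        return random.choice(names)                  # exploration
    return max(names, key=lambda n: weights[n])      # exploitation
```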

One question is whether, when confidence in the goodness is low (that is, the goodness estimate has a large standard deviation), the choice maker should assume an optimistic or a pessimistic stance. This may be a research question.
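The two stances can be made precise with confidence-interval style scores: the optimistic stance adds a multiple of the standard deviation to the mean goodness (in the spirit of upper-confidence-bound rules), while the pessimistic stance subtracts it. The factor k below is arbitrary.

```python
def optimistic_score(goodness_mean, goodness_std, k=1.0):
    # Assume the heuristic is as good as its upper confidence bound suggests.
    return goodness_mean + k * goodness_std

def pessimistic_score(goodness_mean, goodness_std, k=1.0):
    # Assume the heuristic is only as good as its lower confidence bound.
    return goodness_mean - k * goodness_std
```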

General Observation

In general, we note that more sophisticated and resource-intensive approaches are more applicable when a large decision is being made, or when the granularity of the decision is high, that is, when every choice will take substantial time and resources to implement. Therefore, effective profiling of time and resource usage is important.
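A minimal way to gather such profiles is to time each heuristic invocation and keep running totals; the sketch below only tracks wall-clock time, and measuring other resources (memory, for instance) would need additional instrumentation.

```python
import time
from collections import defaultdict

class TimeProfiler:
    """Running average of wall-clock time per heuristic (illustrative only)."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    def run(self, name, heuristic, *args, **kwargs):
        start = time.perf_counter()
        result = heuristic(*args, **kwargs)
        self.totals[name] += time.perf_counter() - start
        self.counts[name] += 1
        return result

    def average(self, name):
        return self.totals[name] / self.counts[name] if self.counts[name] else 0.0
```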

[1] Alexander Nareyek. Choosing Search Heuristics by Non-Stationary Reinforcement Learning.
[2] Joelle Pineau, Geoff Gordon, and Sebastian Thrun. Point-Based Value Iteration: An Anytime Algorithm for POMDPs.
