Here is the entire collection of data produced by all the planners participating in IPC3 (nearly 6MB). At the conference, we had an opportunity to identify those planners whose performance we considered to be outstanding according to the criteria identified below. Now we invite the community to draw its own conclusions based on the entire data set.
The data sets are organised under the subdirectory "IPCResults/PLANS". The data extracted from these plans can be found in the subdirectory "IPCResults/Collected". The data sets are simple files containing one line for each problem solved in a given problem set. Each line contains five values: the problem number, the plan quality, a second plan quality value, the number of steps in the plan and, finally, the time taken to produce the plan. Only plans that validated are entered. The two plan quality measures are determined by the problem instance metric where the problem stipulates one; otherwise the metric defaults to plan length. In cases where plan length is used, or in the case of non-temporal domains using "total-time" as a metric (which is considered equivalent to plan length for non-temporal domains), plan length is measured either as the number of steps or as the number of distinct points in the plan at which activity occurs (equivalent to "Graphplan length"). The first plan quality measure is that derived using plan length measured in steps and the second is that derived using Graphplan length. In most cases the values are identical.
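As a rough illustration of the five-value line format described above, the following Python sketch reads one collected results file into records. It assumes that the values on each line are separated by whitespace and that the problem number and step count are integers; neither detail is stated here, so check a real file first. The names `ResultLine`, `parse_result_line` and `parse_results` are hypothetical, and the sample text in the usage note is invented for the example rather than taken from the data.

```python
from typing import List, NamedTuple


class ResultLine(NamedTuple):
    """One line of a collected results file (field names are hypothetical)."""
    problem: int              # problem number
    quality_steps: float      # quality derived using plan length in steps
    quality_graphplan: float  # quality derived using Graphplan length
    steps: int                # number of steps in the plan
    time: float               # time taken to produce the plan (units not stated)


def parse_result_line(line: str) -> ResultLine:
    # Assumption: the five values are whitespace-separated.
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 values, got {len(fields)}: {line!r}")
    problem, q_steps, q_graphplan, steps, time_taken = fields
    return ResultLine(
        int(problem),
        float(q_steps),
        float(q_graphplan),
        int(steps),
        float(time_taken),
    )


def parse_results(text: str) -> List[ResultLine]:
    # Blank lines are skipped; every other line must hold exactly five values.
    return [parse_result_line(line) for line in text.splitlines() if line.strip()]
```

For example, `parse_results("1 5 5 5 0.01\n2 7 6 7 0.2\n")` yields two records, the second of which has differing quality values because its Graphplan length is shorter than its step count.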
An extra set of results for LPG, generated after the competition (due to a bug resolved after the event), is available here. It should be unpacked in the IPCResults directory and provides the Satellite-Complex set of results for both quality and speed settings.
We made a qualitative judgement based on the coverage (how many problems were tackled), the ratio of successful plans to problems tackled and the quality of the solutions generated. We also considered speed of planners, but believe that an order of magnitude is easily accounted for in details of implementation. We favoured a high coverage and high ratio of success in combination, making a qualitative judgement on the boundary between high coverage combined with moderate ratio and good coverage combined with high ratio. Because the competition was concerned with pushing the frontier of temporal and metric planning we felt that coverage was a very important factor in judging the performance of the planners. Of course, coverage is not necessarily indicative of quality. Therefore we considered ratio an equally important criterion.
|Planner||Problems solved||Problems attempted||Success ratio||Capabilities|
|FF||237 (+70)||284 (+76)||83% (85%)||(Strips, Numeric, HardNumeric)|
|LPG||372||428||87%||(Strips, Numeric, HardNumeric, SimpleTime, Time)|
|MIPS||331||508||65%||(Strips, Numeric, HardNumeric, SimpleTime, Time, Complex)|
|SHOP2||899||904||99%||(Strips, Numeric, HardNumeric, SimpleTime, Time, Complex)|
|TALPlanner||610||610||100%||(Strips, SimpleTime, Time)|
|TLPlan||894||894||100%||(Strips, Numeric, HardNumeric, SimpleTime, Time, Complex)|
|TP4||26||204||13%||(Numeric, SimpleTime, Time, Complex)|
Note that FF attempted 76 additional problems intended for the handcoded planners and solved 70 of them successfully. IxTeT solved 9 problems with plans accepted by the validator and attempted a further 10 problems producing plans that could not be validated. IxTeT requires recoding of the domain and problem instances from PDDL and this must be carried out by hand.
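The success ratio column can be recomputed from the solved and attempted counts in the table. The short Python sketch below does so, assuming the published percentages are the ratio of problems solved to problems attempted, rounded to the nearest whole percent; that rounding rule is an assumption, although it reproduces every figure in the table. The function name `success_ratio` is hypothetical.

```python
def success_ratio(solved: int, attempted: int) -> int:
    """Percentage of attempted problems solved, rounded to the nearest percent."""
    return round(100 * solved / attempted)


# (problems solved, problems attempted), copied from the table above.
results = {
    "FF": (237, 284),
    "LPG": (372, 428),
    "MIPS": (331, 508),
    "SHOP2": (899, 904),
    "TALPlanner": (610, 610),
    "TLPlan": (894, 894),
    "TP4": (26, 204),
}

for planner, (solved, attempted) in results.items():
    print(f"{planner}: {success_ratio(solved, attempted)}%")

# FF including the 76 additional problems intended for the handcoded planners,
# of which it solved 70.
ff_extended = success_ratio(237 + 70, 284 + 76)
```

Under this assumption `ff_extended` is 85, matching the bracketed figure in the FF row.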
On these criteria we identified one fully automated and one hand-coded planner as demonstrating distinguished performance of the first order. These were:
Identifying winners is difficult because it seems to undervalue the efforts of the other participants, many of whom also performed extremely impressively. In particular, certain planners achieved outstanding performance in particular tracks even though they did not display broad coverage of the entire data set. For example, FF out-performed its competitors in many of the Numeric and Strips problems, but it didn't compete in the temporal domains, giving it lower overall coverage. Similarly, TALPlanner exhibited extremely good performance in many of the temporal domains, but didn't participate in the numeric domains, lowering its overall coverage. Our decision to use coverage and ratio as the criteria for identifying conference prize-winners is not intended to devalue excellent performance in a smaller subset of the domains.