Evaluation

In each domain, performance will be evaluated by calculating the average cumulative reward accrued across a series of independent test runs.  Measuring cumulative reward emphasizes the on-line nature of reinforcement learning and requires agents to intelligently balance exploration and exploitation.
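As a rough illustration of this metric, the sketch below sums the rewards within each run and then averages those totals across runs. The function name and the list-of-lists input format are assumptions made for this example; they are not part of the competition software.

```python
from statistics import mean


def average_cumulative_reward(runs):
    """Return the mean, over runs, of each run's total reward.

    `runs` uses a hypothetical format: one list of per-step rewards
    for each independent test run.
    """
    if not runs:
        raise ValueError("at least one run is required")
    return mean(sum(rewards) for rewards in runs)


# Made-up reward sequences for three independent runs.
example_runs = [
    [-1.0, -1.0, -1.0, 0.0],
    [-1.0, -1.0, 0.0],
    [-1.0, -1.0, -1.0, -1.0, 0.0],
]
score = average_cumulative_reward(example_runs)
```

Because every step's reward is counted, including steps taken while the agent is still learning, the cost of exploration shows up directly in the score.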

Furthermore, this year's competition will utilize new evaluation paradigms designed to encourage algorithms that generalize well to previously unseen tasks. In particular, each domain will be generalized, altered, or unknown.



Generalized Domains

For each generalized domain (Mountain Car, Helicopter Hovering, Tetris), we will select a set of parameters affecting domain dynamics (e.g. noise in sensors or actuators) as well as a distribution over these parameters. In the competition training software, every run will use one of several parameter settings sampled from this distribution (the "training set"). In the competition test software, every run will use a parameter setting sampled independently from the same distribution (the "test set"). This evaluation scheme is designed to test learning algorithms across a range of parameters. As a result, only algorithms that are robust to the range of parameters can expect to perform well in the competition.

For the generalized domains, the organizers will choose some parameters to modify, which induces a whole class of MDPs: generalized versions of the base problem. The organizers will also choose the distributions from which these parameter values are sampled, which in turn induces a distribution over MDPs that will be used in the competition. The participants will never be told explicitly what the parameter variables or their instantiated values are. By "k MDPs", we mean k different variations on the base problem induced by choosing k parameter configurations from the chosen distribution.
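The idea of inducing k MDPs by sampling k parameter configurations can be sketched as follows. The parameter names, their ranges, and the uniform distributions are purely hypothetical, since the actual parameters and distributions are not disclosed to participants.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainParams:
    # Hypothetical parameters; the real ones are not disclosed.
    sensor_noise: float
    actuator_noise: float


def sample_params(rng):
    """Draw one parameter configuration from an assumed distribution."""
    return DomainParams(
        sensor_noise=rng.uniform(0.0, 0.1),
        actuator_noise=rng.uniform(0.0, 0.2),
    )


def sample_mdps(k, rng):
    """Draw k configurations independently; each induces one MDP variant."""
    return [sample_params(rng) for _ in range(k)]


rng = random.Random(0)
training_set = sample_mdps(5, rng)  # settings used by the training software
test_set = sample_mdps(5, rng)      # drawn independently, same distribution
```

Both sets come from the same distribution but are drawn independently, so an agent tuned to the specific training configurations has no guarantee of seeing them again at test time.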

Evaluation will use a three-phased approach.

Phase 1) Training

Participants will be allowed unlimited interaction with k training MDPs.  The participants can decide which of the k MDPs they want to interact with and have unlimited time and computation to learn whatever they want from the MDP (in a sample-based way).

Phase 2) Proving

Participants will be evaluated on p proving MDPs, and this procedure will be automated by software that will be provided by the organizing committee.  Proving can be done once a week, and the cumulative reward over the proving MDPs will decide the ordering on the leaderboard.  The proving MDPs will be the same each week, but the order in which the agent sees them will be randomized.
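A minimal sketch of one proving run is shown below, assuming a hypothetical `run_once` callable that plays a single run on one MDP and returns its cumulative reward. It also assumes that a higher total ranks higher on the leaderboard. None of these names come from the software provided by the organizing committee.

```python
import random


def proving_run(run_once, proving_mdps, rng):
    """Evaluate once on each proving MDP, in a freshly shuffled order.

    `run_once(mdp)` is a hypothetical callable returning the cumulative
    reward of a single run on `mdp`.
    """
    order = list(proving_mdps)  # copy so the fixed set is left untouched
    rng.shuffle(order)          # same MDPs every week, new order each time
    return sum(run_once(mdp) for mdp in order)


def rank_leaderboard(totals):
    """Return team names ordered by total reward, highest first (assumed)."""
    return sorted(totals, key=totals.get, reverse=True)
```

Each MDP is visited exactly once per proving run, which matches the rule that there are no repeated runs over the same MDP in proving or testing.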

Phase 3) Testing

Participants will be evaluated on a third set of MDPs, the t testing MDPs.  Every participant gets the same testing MDPs, and cumulative reward over the testing MDPs decides the winner of the competition.  This testing will be performed at the end of the competition.  Results will not be announced until ICML.

This proposal does not include multiple runs over the same MDP in proving or testing.  This is deliberate, so that participants do not bias subsequent "independent" runs of the same MDP by reusing what they learned in previous runs.

 

Altered Domains

For each altered domain (Keepaway, Real-Time Strategy), we will select one fixed version of the problem, which will be used for every run in the competition training software. We will also release a list of domain parameters that may change between training and testing. In the competition test software, some or all of these parameters will be altered. This particular evaluation scheme is designed to test online learning with little pretraining.

For the altered domains, we will also use a three-phase system.  There will be three MDPs: the training MDP, the proving MDP, and the testing MDP.  Again, the proving MDP gives participants a feeling for how well their algorithm generalizes from the training MDP to an unseen MDP without tainting the novelty of the test MDP.  As in the generalized domains, each team can do one proving run per week.

 

Polyathlon

For the unknown domain (Polyathlon), no competition training software will be available. Competitors will submit their best general-purpose reinforcement learning agents, which will be evaluated on a suite of unknown domains. Information regarding what participants can and cannot expect will be released November 1st. In the Polyathlon, algorithms must learn to perform unknown tasks online with little prior domain knowledge or no pretraining.

Details still to be decided ...

 
