
Google Artificial Intelligence (AI) researchers have designed a method that predicts which machine learning models will deliver the best results. The method, off-policy classification (OPC), assesses the performance of AI-driven agents by treating evaluation as a classification problem.

The researchers announced this in the paper Off-Policy Evaluation via Off-Policy Classification. The research team emphasizes that the approach works with image input and scales to tasks including vision-based robot interactions, according to VentureBeat.

OPC builds on reinforcement learning, a technique in which rewards are used to steer a software policy toward goals.

Learning from old data

Fully off-policy reinforcement learning (RL) is a variant in which an agent learns entirely from old data, which is attractive because it enables model iteration without a physical robot. With fully off-policy RL, one can train several models on the same fixed dataset, as collected by previous agents, and then select the best one, explains Google software engineer Alex Irpan.
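To make the idea of learning only from old data concrete, here is a minimal sketch in Python. It is not the researchers' code: the tabular setup, the function name offline_q_learning, the toy three-state dataset and the hyperparameters are all assumptions chosen for illustration. The point is only that the agent replays logged transitions and never interacts with an environment or a robot.

```python
import random
from collections import defaultdict


def offline_q_learning(transitions, n_actions, gamma=0.9, lr=0.5, epochs=50, seed=0):
    """Fit a tabular Q-function from a fixed list of logged transitions.

    Each transition is (state, action, reward, next_state, done).
    No new data is collected: the same old transitions are replayed each epoch.
    """
    rng = random.Random(seed)
    q = defaultdict(lambda: [0.0] * n_actions)
    data = list(transitions)
    for _ in range(epochs):
        rng.shuffle(data)
        for state, action, reward, next_state, done in data:
            # Standard Q-learning target: reward plus discounted best future value.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += lr * (target - q[state][action])
    return q


# Hypothetical logged data from a 3-state chain (illustrative only).
# Action 1 moves toward the goal; action 0 ends the episode without reward.
LOGGED = [
    (0, 1, 0.0, 1, False),
    (1, 1, 0.0, 2, False),
    (2, 1, 1.0, 2, True),   # success at the end of the episode
    (0, 0, 0.0, 0, True),   # failure
    (1, 0, 0.0, 1, True),   # failure
    (2, 0, 0.0, 2, True),   # failure
]

# Two candidate models trained on the same fixed dataset with different settings.
candidates = {
    "gamma=0.9": offline_q_learning(LOGGED, n_actions=2, gamma=0.9),
    "gamma=0.5": offline_q_learning(LOGGED, n_actions=2, gamma=0.5),
}
```

Both candidates come from the same logged data; choosing between them is the evaluation problem that OPC is meant to address.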

According to the researchers, the road to OPC was a challenging one. Off-policy reinforcement learning makes it possible to train an AI model, but not to evaluate it. In addition, ground-truth evaluation is generally too inefficient for methods that require evaluating a large number of models.

The researchers have now been able to solve this with the help of OPC. They assume that tasks have little or no randomness in the way states change. They also assume that agents either succeed or fail at the end of each trial.

Q-learning algorithm

Furthermore, OPC uses a so-called Q-function, learned with a Q-learning algorithm, to estimate the total future rewards of actions. Agents choose the actions with the largest projected rewards, and their performance is measured by how often the selected actions are effective.

That performance in turn depends on how well the Q-function classifies actions as effective versus catastrophic. The accuracy of these classifications acts as an off-policy evaluation score.
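The idea of scoring a Q-function by how well it classifies actions can be sketched as follows. This is a simplified illustration, not the estimator from the paper: it assumes every logged state-action pair already carries a label (1 for effective, 0 for catastrophic), which real logged data may not provide, and the names greedy_action and classification_score as well as the 0.5 threshold are hypothetical.

```python
def greedy_action(q_row):
    """Return the index of the action with the largest predicted Q-value."""
    return max(range(len(q_row)), key=lambda a: q_row[a])


def classification_score(q_values, labels, threshold=0.5):
    """Fraction of logged (state, action) pairs the Q-function classifies correctly.

    q_values: predicted Q-value for each logged pair.
    labels:   1 if the action was effective, 0 if it was catastrophic
              (assumed to be fully known in this sketch).
    A pair is predicted "effective" when its Q-value exceeds the threshold.
    """
    if not q_values or len(q_values) != len(labels):
        raise ValueError("q_values and labels must be non-empty and of equal length")
    correct = sum(
        int((q > threshold) == bool(label)) for q, label in zip(q_values, labels)
    )
    return correct / len(q_values)


# Illustrative values only: the last pair is a catastrophic action
# that the Q-function wrongly rates above the threshold.
example_q = [0.9, 0.8, 0.2, 0.6]
example_labels = [1, 1, 0, 0]
score = classification_score(example_q, example_labels)
```

A higher score means the Q-function separates effective from catastrophic actions more reliably, so a greedy agent built on it should pick effective actions more often.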


The team trained machine learning policies in simulation using fully off-policy reinforcement learning. These were then evaluated using off-policy scores, which were tabulated from previously collected real-world data. For example, the team reports that a variant of OPC, SoftOPC, performed best at predicting the final success rate for a robot grasping task.

The team generated SoftOPC scores for 15 models of varying robustness, 7 of which were trained entirely in simulation. According to the team, these scores correlated closely with real-world success and were significantly more reliable than baseline methods.
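One way to check whether an evaluation score tracks real success is a rank correlation between the two. The sketch below uses Spearman's rank correlation as an assumed choice of measure; the article does not say which one the team used. The numbers are made up for illustration and are not results from the paper.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation for two equal-length sequences without ties."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Made-up numbers for four hypothetical models (not taken from the paper).
offline_scores = [0.10, 0.35, 0.20, 0.50]   # off-policy evaluation scores
real_success = [0.20, 0.50, 0.60, 0.90]     # measured success rates
rho = spearman(offline_scores, real_success)
```

A value of rho close to 1 would mean that ranking models by their off-policy score gives nearly the same order as ranking them by real success, which is what makes such a score useful for model selection.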

This news article was automatically translated from Dutch to give Techzine.eu a head start. All news articles after September 1, 2019 are written in native English and NOT translated. All our background stories are written in native English as well. For more information read our launch article.