
Google Artificial Intelligence (AI) researchers have developed a method that predicts which machine learning models will deliver the best results. Off-policy classification (OPC) evaluates the performance of AI-driven agents by treating evaluation as a classification problem.

The researchers describe the approach in the paper Off-Policy Evaluation via Off-Policy Classification. The AI research team emphasizes that the method works with image input and scales well to tasks such as vision-based robot interaction, according to VentureBeat.

OPC builds on reinforcement learning, a technique in which rewards are used to steer a software agent's policy toward goals.

Learning from old data

Fully off-policy reinforcement learning is a variant in which an agent learns entirely from old data, which is attractive because it enables model iteration without a physical robot. With fully off-policy RL, one can train several models on the same fixed dataset collected by previous agents and then select the best one, explains Google software engineer Alex Irpan.
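
To make "learning entirely from old data" concrete, the sketch below runs a plain batch Q-learning loop over a fixed log of transitions, with no new environment interaction. It is a generic, hedged illustration in Python with an invented toy dataset and hyperparameters; it is not the setup used by the Google team.

```python
# Minimal sketch of fully off-policy (batch) Q-learning: the agent never
# interacts with the environment; it only replays a fixed log of transitions.
# The environment size, dataset, and hyperparameters are invented for illustration.
import numpy as np

n_states, n_actions = 5, 2
gamma, alpha = 0.95, 0.1

# Fixed dataset of logged transitions (state, action, reward, next_state, done),
# e.g. collected earlier by a different policy or robot.
rng = np.random.default_rng(0)
dataset = [
    (rng.integers(n_states), rng.integers(n_actions),
     float(rng.random() < 0.3), rng.integers(n_states), bool(rng.random() < 0.2))
    for _ in range(1000)
]

Q = np.zeros((n_states, n_actions))
for _ in range(50):                      # several passes over the same fixed data
    for s, a, r, s_next, done in dataset:
        target = r if done else r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])

print(Q)
```

Several candidate models could be trained this way on exactly the same dataset; the question the researchers tackle is how to pick the best one without deploying each candidate on a real robot.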

According to the researchers, the road to OPC was a challenging one: with fully off-policy RL there is no straightforward way to evaluate a model over the course of training, and ground-truth evaluation is generally too inefficient for methods that require evaluating a large number of models.

The researchers have now been able to solve this with the help of OPC. They assume that tasks have little or no randomness in the way states change, and that agents either succeed or fail at the end of each trial.

Q-learning algorithm

Furthermore, OPC relies on a so-called Q-function, learned with a Q-learning algorithm, to estimate the total future reward of each action. Agents choose the action with the largest projected reward, and their performance is measured by how often the chosen actions are effective.

That, in turn, depends on how well the Q-function classifies actions as effective versus catastrophic. The accuracy of these classifications serves as the off-policy evaluation score.
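
As a rough illustration of that scoring idea, the sketch below treats the Q-function's output as a binary classifier over logged state-action pairs labeled effective or catastrophic, and uses its accuracy as the evaluation score. The labels, threshold, and numbers are invented, and the actual OPC metric in the paper additionally handles the fact that true labels are only partially observable in logged data.

```python
# Hedged sketch: scoring a learned Q-function by how well it separates
# effective from catastrophic actions in previously logged data.
# The labels, threshold, and Q-values here are illustrative placeholders.
import numpy as np

def opc_style_score(q_values, labels, threshold=0.5):
    """Classification-accuracy-style evaluation score.

    q_values : array of Q(s, a) for logged state-action pairs
    labels   : 1 if the logged action was effective, 0 if catastrophic
    """
    predictions = (q_values >= threshold).astype(int)
    return float((predictions == labels).mean())

# Logged state-action pairs scored by two candidate models (made-up numbers).
labels = np.array([1, 1, 0, 1, 0, 0, 1, 0])
model_a_q = np.array([0.9, 0.7, 0.2, 0.8, 0.4, 0.1, 0.6, 0.3])
model_b_q = np.array([0.6, 0.4, 0.7, 0.5, 0.8, 0.2, 0.3, 0.9])

print("model A score:", opc_style_score(model_a_q, labels))  # higher -> preferred
print("model B score:", opc_style_score(model_b_q, labels))
```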

SoftOPC

The team trained models in simulation using fully off-policy reinforcement learning and then evaluated them with off-policy scores computed from previously collected real-world data. For example, the team reports that a variant of OPC, SoftOPC, performed best at predicting the final success rate for a robot grasping task.

According to the team, SoftOPC scores were generated for 15 models of varying robustness, 7 of which were trained entirely in simulation. The scores correlated closely with real-world success and were significantly more reliable than baseline methods.
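
The kind of comparison reported here, checking how well an off-policy score tracks real-world success across many candidate models, can be sketched as a rank-correlation computation. The figures below are invented, and Spearman correlation is used only as a generic way to measure how well a score ranks models; it is not necessarily the statistic used in the paper.

```python
# Hedged sketch: how closely an off-policy score ranks models the same way
# their real-world success rates do. All numbers are invented for illustration.
import numpy as np
from scipy.stats import spearmanr

# One entry per candidate model (e.g. models of varying robustness).
off_policy_scores = np.array([0.81, 0.64, 0.92, 0.55, 0.73, 0.88, 0.60])
real_success_rate = np.array([0.78, 0.60, 0.90, 0.50, 0.70, 0.85, 0.65])

rho, p_value = spearmanr(off_policy_scores, real_success_rate)
print(f"rank correlation: {rho:.2f} (p={p_value:.3f})")

# A score that correlates strongly with real success can be used to pick
# the best model without deploying every candidate on a physical robot.
best = int(np.argmax(off_policy_scores))
print("model selected by off-policy score:", best)
```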
