Exploring Meta-Learning in Robotics


Rapid development of more accurate simulator engines has given robotics researchers a unique opportunity to generate sufficient amounts of data that can be used to train robotic policies for real-world deployment. However, moving trained policies from “sim-to-real” remains one of the greatest challenges of modern robotics, due to the subtle differences between the simulation and real domains, termed the “reality gap”. While some recent approaches, such as imitation learning and offline reinforcement learning, leverage existing data to prepare a policy for the reality gap, a more common approach is to simply provide more data by varying properties of the simulated environment, a process called domain randomization.
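As a rough illustration of domain randomization, the sketch below draws a fresh set of simulator properties for every training episode. The property names and the friction and mass ranges are hypothetical values chosen for illustration; only the 10.0V to 16.8V voltage span echoes numbers that appear later in this article. It is not the configuration used in the paper.

```python
import random

# Hypothetical ranges, for illustration only (not taken from the paper).
PARAM_RANGES = {
    "foot_friction": (0.4, 1.2),
    "body_mass_kg": (4.0, 6.0),
    "battery_voltage": (10.0, 16.8),
}


def sample_sim_params(rng):
    """Draw one randomized set of simulator properties."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in PARAM_RANGES.items()}


def randomized_episodes(num_episodes, seed=0):
    """Return one independently sampled parameter set per training episode."""
    rng = random.Random(seed)
    return [sample_sim_params(rng) for _ in range(num_episodes)]


episodes = randomized_episodes(3)
for params in episodes:
    print(params)
```

A policy trained across all such episodes sees many variants of the dynamics, which is what makes it stable but not specialized to any single one of them.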

However, domain randomization can sacrifice performance for stability, as it seeks to optimize for a decent, stable policy across all tasks, but offers little room for improving the policy on a specific task. This lack of a common optimal policy between simulation and reality is frequently a problem in robotic locomotion applications, where there are varying physical forces at play, such as leg friction, body mass, and terrain differences. For example, given the same initial conditions for the robot’s position and balance, the surface type will determine the optimal policy — for an incoming flat surface encountered in simulation, the robot could accelerate to a higher speed, while for an incoming rugged and bumpy surface encountered in the real world, it should walk slowly and carefully to prevent falling.

In “Rapidly Adaptable Legged Robots via Evolutionary Meta-Learning”, we present a particular type of meta-learning based on evolutionary strategies (ES), an approach generally believed to only work well in simulation, which we use to effectively and efficiently adapt a policy to a real-world robot in a completely model-free manner. Compared to previous approaches for adapting meta-policies, such as standard policy gradients which do not allow sim-to-real adaptation, ES enables a robot to quickly overcome the reality gap and adapt to dynamic changes in the real world, some of which may not be encountered in simulation. This represents the first instance of successfully using ES for on-robot adaptation.

Our algorithm quickly adapts a legged robot’s policy to dynamics changes. In this example, the battery voltage dropped from 16.8V to 10V which reduced motor power, and a 500g mass was also placed on the robot’s side, causing it to turn rather than walk straight. The policy is able to adapt in only 50 episodes (or 150s of real-world data).

This research falls under the general class of meta-learning techniques, and is demonstrated on a legged robot. At a high level, meta-learning learns to solve an incoming task quickly without completely retraining from scratch, by combining past experiences with small amounts of experience from the incoming task. This is especially beneficial in the sim-to-real case, where most of the past experiences come cheaply from simulation, while a minimal, yet necessary amount of experience is generated from the real world task. The simulation experiences allow the policy to possess a general level of behavior for solving a distribution of tasks, while the real-world experiences allow the policy to fine-tune specifically to the real-world task at hand.
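The idea of scoring a policy by how well it performs after a small amount of task-specific experience can be sketched on a toy problem. Everything here is an assumption made for illustration: the scalar "policy" `theta`, the quadratic `reward`, the simple `adapt` routine, and the three task targets. It shows only the structure of a meta-objective, not the method used in the paper.

```python
import random


def reward(theta, task_target):
    """Toy episode return: higher when theta is closer to the task's target."""
    return -(theta - task_target) ** 2


def adapt(theta, task_target, rng, num_queries=20, step=0.3):
    """Fine-tune theta with a small budget of task-specific queries."""
    best, best_reward = theta, reward(theta, task_target)
    for _ in range(num_queries):
        candidate = best + rng.gauss(0.0, step)
        candidate_reward = reward(candidate, task_target)
        if candidate_reward > best_reward:
            best, best_reward = candidate, candidate_reward
    return best


def meta_objective(theta, task_targets, seed=0):
    """Average reward measured after adaptation, over a set of tasks."""
    rng = random.Random(seed)
    adapted = [adapt(theta, t, rng) for t in task_targets]
    total = sum(reward(a, t) for a, t in zip(adapted, task_targets))
    return total / len(task_targets)


tasks = [-1.0, 0.5, 2.0]   # hypothetical task distribution
meta_theta = 0.5           # shared starting point for every task
before = sum(reward(meta_theta, t) for t in tasks) / len(tasks)
after = meta_objective(meta_theta, tasks)
print(before, after)
```

In this toy setting, meta-training would search for the `meta_theta` that maximizes `meta_objective`, so that a few queries on a new task are enough to reach a good policy.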

In order to train a policy to meta-learn, it is necessary to encourage a policy to adapt during simulation. Normally, this can be achieved by applying model-agnostic meta-learning (MAML), which searches for a meta-policy that can adapt to a specific task quickly using small amounts of task-specific data. The standard approach to computing such meta-policies is by using policy gradient methods, which seek to improve the likelihood of selecting the same action given the same state. In order to determine the likelihood of a given action, the policy must be stochastic, allowing for the action selected by the policy to have a randomized component. The real-world environment for deploying such robotic policies is also highly stochastic, as there can be slight differences in motion arising naturally, even if starting from the exact same state and action sequence. The combination of using a stochastic policy inside a stochastic environment creates two conflicting objectives:

  1. Decreasing the policy’s stochasticity may be crucial, as otherwise the high-noise problem might be exacerbated by the additional randomness from the policy’s actions.
  2. However, increasing the policy’s stochasticity may also benefit exploration, as the policy needs to use random actions to probe the type of environment to which it adapts.

These two competing objectives, which have been noted before, seek to both decrease and increase the policy’s stochasticity and may cause complications.

Evolutionary Strategies in Robotics
Instead, we resolve these challenges by applying ES-MAML, an algorithm that leverages a drastically different paradigm for high-dimensional optimization — evolutionary strategies. The ES-MAML approach updates the policy based solely on the sum of rewards collected by the agent in the environment. The function used for optimizing the policy is black-box, mapping the policy parameters directly to this reward. Unlike policy gradient methods, this approach does not need to collect state/action/reward tuples and does not need to estimate action likelihoods. This allows the use of deterministic policies and exploration based on parameter changes and avoids the conflict between stochasticity in the policy and in the environment.
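To make the black-box view concrete, here is a minimal sketch of a generic ES gradient estimator with antithetic (mirrored) perturbations. It uses only the total reward returned for each parameter vector. The `total_reward` function is a hypothetical stand-in for running an episode, and the step size, noise scale, and sample counts are illustrative assumptions. This is a textbook ES update, not the full ES-MAML algorithm.

```python
import random


def total_reward(params):
    """Stand-in for one episode's summed reward (hypothetical toy objective)."""
    target = [1.0, -2.0, 0.5]
    return -sum((p - t) ** 2 for p, t in zip(params, target))


def es_gradient(f, params, rng, sigma=0.1, num_pairs=20):
    """Antithetic ES estimate of the gradient of the Gaussian-smoothed f."""
    grad = [0.0] * len(params)
    for _ in range(num_pairs):
        eps = [rng.gauss(0.0, 1.0) for _ in params]
        plus = f([p + sigma * e for p, e in zip(params, eps)])
        minus = f([p - sigma * e for p, e in zip(params, eps)])
        scale = (plus - minus) / (2.0 * sigma * num_pairs)
        for i, e in enumerate(eps):
            grad[i] += scale * e
    return grad


def es_ascent(f, params, steps=200, lr=0.05, seed=0):
    """Repeatedly step along the ES gradient estimate to increase f."""
    rng = random.Random(seed)
    for _ in range(steps):
        grad = es_gradient(f, params, rng)
        params = [p + lr * g for p, g in zip(params, grad)]
    return params


theta = es_ascent(total_reward, [0.0, 0.0, 0.0])
print(theta, total_reward(theta))
```

Note that the policy itself is never asked for action probabilities. Exploration comes entirely from perturbing the parameters, which is why a deterministic policy is sufficient.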

In this paradigm, querying the objective usually involves running episodes in the simulator, but we show that ES can also be applied to episodes collected on real hardware. ES optimization can be easily distributed and also works well for training efficient compact policies, a phenomenon with profound robotic implications, since policies with fewer parameters can be deployed more easily on real hardware and often lead to more efficient inference and power usage. We confirm the effectiveness of ES in training compact policies by learning adaptable meta-policies with fewer than 130 parameters.

The ES optimization paradigm is very flexible. It can be used to optimize non-differentiable objectives, such as the total reward objective in our robotics case. It also works in the presence of substantial (potentially adversarial) noise. In addition, the most recent forms of ES methods (e.g., guided ES) are much more sample-efficient than previous versions.

This flexibility is critical for efficient adaptation of locomotion meta-policies. Our results show that adaptation with ES can be conducted with a small number of additional on-robot episodes. Thus, ES is no longer just an attractive alternative to the state-of-the-art algorithms, but defines a new state of the art for several challenging RL tasks.

Adaptation in Simulation
We first examine the types of adaptation that emerge when training with ES-MAML in simulation. When testing the policy in simulation, we found that the meta-policy forces the robot to fall down when the dynamics become too unstable, whereas the adapted policy allows the robot to re-stabilize and walk again. Furthermore, when the robot’s leg settings change, the meta-policy de-synchronizes the robot’s legs causing the robot to turn sharply, while the adapted policy corrects the robot so it can walk straight again.

The meta-policy’s gait, which experiences issues when facing a difficult dynamics task. Left: The meta-policy lets the robot fall down. Center: The adapted policy ensures the robot continues to walk correctly. Right: Comparative measurement of the robot’s height.
The meta-policy’s gait under changes to the robot’s leg settings. Left: The meta-policy allows the robot to veer to the right. Center: The adapted policy ensures the robot continues to walk in a straight line. Right: Comparative measurement of the robot’s walking direction.

Adaptation in the Real World
Despite the good performance of ES-MAML in simulation, applying it to a real robot is still a challenge. To effectively adapt in the noisy environment of the real world while requiring as little real-world data as possible, we introduce batch hill-climbing, an add-on to ES-MAML based on previous work for zeroth-order blackbox optimization. Rather than performing standard hill-climbing, which iteratively updates the input one query at a time according to a deterministic objective, batch hill-climbing samples a parallel batch of queries to determine the next input, making it robust to large amounts of noise in the objective.
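One way to picture batch hill-climbing is the sketch below: at each iteration, a batch of perturbed candidates is scored alongside the current parameters under a noisy objective, and the best-scoring one becomes the next input. The details are assumptions made for illustration, including the toy `noisy_reward`, the Gaussian perturbations, the batch size, and the choice to re-score the current point each round. The paper's exact procedure may differ.

```python
import random


def noisy_reward(params, rng, noise_std=0.2):
    """Toy noisy objective standing in for a real-robot episode return."""
    clean = -sum(p ** 2 for p in params)
    return clean + rng.gauss(0.0, noise_std)


def batch_hill_climb(params, rng, iterations=10, batch_size=5, step=0.3):
    """Score a batch of perturbed candidates plus the current point each
    iteration, and keep whichever scored best as the next point."""
    current = list(params)
    for _ in range(iterations):
        candidates = [current] + [
            [p + step * rng.gauss(0.0, 1.0) for p in current]
            for _ in range(batch_size)
        ]
        scores = [noisy_reward(c, rng) for c in candidates]
        current = candidates[scores.index(max(scores))]
    return current


rng = random.Random(0)
start = [2.0, -2.0]
adapted = batch_hill_climb(start, rng)
print(start, adapted)
```

Because every candidate in a batch is independent, the queries in one iteration could be collected in parallel, and a single noisy evaluation has less influence on the outcome than it would in one-at-a-time hill-climbing.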

We then test our method on the following two tasks, which are designed to significantly change the dynamics from the normal setting of the robot:

In the mass-voltage task (left), a 500g weight is placed on the robot’s side and the voltage is dropped to 10.0V from 16.8V. In the friction task (right), we replaced the rubber feet with tennis balls, to significantly reduce friction and hinder walking.

For the mass-voltage task, the initial meta-policy steered the robot significantly to the right due to the extra mass and voltage change, which caused an imbalance in the robot’s body and leg motors. However, after 30 episodes of adaptation using our method, the robot straightens its walking pose, and after 50 episodes, the robot is able to balance its body completely and walk longer distances. In comparison, training from scratch on an easier, noiseless task in simulation alone required approximately 90,000 episodes, showing that our method significantly reduces sample complexity on expensive real-world data.

Qualitative changes during the adaptation phase under the mass-voltage task.

We compared our method only to domain randomization and the standard policy gradient approach to MAML (PG-MAML), presenting the final policies qualitatively along with metrics from the real robot to show how our method adapts. We found that neither the domain randomization baseline nor the PG-MAML baseline adapts as well as our method.

Comparisons between Domain Randomization and PG-MAML, and metric differences between our method’s meta-policy and adapted policy. Top: Comparison for the mass-voltage task. Our method stabilizes the robot’s roll angle. Bottom: Comparison for the friction task. Our method results in longer trajectories.

Future Work
This work exposes several avenues for future development. One option is to make algorithmic improvements to reduce the number of real-world rollouts required for adaptation. Another area for advancement is the use of model-based reinforcement learning techniques for a lifelong learning system, in which the robot can continuously collect data and quickly adjust its policy to learn new skills and to operate optimally in new environments.

This article has been published from the source link without modifications to the text. Only the headline has been changed.
