Conclusion
In summary, we’ve shown that if one formulates the RL problem as a mathematical objective with constraints, then repeated applications of convex duality, in conjunction with a carefully chosen convex regularizer, yield an equivalent problem without constraints. The resulting unconstrained problem is easy to implement in practice and applicable in a wide range of settings. We’ve already applied our general framework to the optimization of agent behavior policies as well as to policy evaluation and imitation learning. We’ve found that our algorithms are not only more mathematically principled than existing RL methods, but that they also often yield better practical performance, demonstrating the value of unifying mathematical principles with practical implementation.
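To make the pattern concrete, here is a rough sketch of the kind of transformation described above. The specific notation is an assumption introduced here for illustration rather than taken from this section: $d^{\mathcal{D}}$ denotes an off-policy data distribution, $D_f$ an $f$-divergence regularizer with convex conjugate $f^*$, $\mu_0$ the initial state distribution, and $\alpha > 0$ a regularization weight.

% Constrained form (schematic): maximize expected reward over state-action
% distributions d, subject to Bellman flow constraints, with a convex
% f-divergence regularizer toward the data distribution d^D.
\begin{align*}
\max_{d \ge 0}\quad & \mathbb{E}_{(s,a)\sim d}\big[r(s,a)\big] \;-\; \alpha\, D_f\!\left(d \,\|\, d^{\mathcal{D}}\right) \\
\text{s.t.}\quad & d(s,a) \;=\; (1-\gamma)\,\mu_0(s)\,\pi(a\mid s)
  \;+\; \gamma \sum_{s',a'} d(s',a')\, P(s\mid s',a')\, \pi(a\mid s)
  \qquad \forall\, s,a.
\end{align*}

% Introducing Lagrange multipliers Q(s,a) for the flow constraints and
% applying convex (Fenchel) duality to the regularizer eliminates d,
% leaving an unconstrained problem over Q alone that can be estimated
% from samples of mu_0, pi, and d^D.
\begin{align*}
\min_{Q}\quad (1-\gamma)\, \mathbb{E}_{s_0\sim\mu_0,\, a_0\sim\pi}\big[Q(s_0,a_0)\big]
 \;+\; \alpha\, \mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\!\left[ f^*\!\left(\frac{r(s,a) + \gamma\, \mathbb{E}_{s'\sim P,\, a'\sim\pi}\big[Q(s',a')\big] - Q(s,a)}{\alpha}\right)\right].
\end{align*}

The second objective involves no explicit constraints and only expectations over sampled data, which is what makes the unconstrained form straightforward to optimize with standard stochastic-gradient tools.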