AdaX: Adaptive Gradient Descent with Exponential Long Term Memory

Publication
arXiv preprint arXiv:2004.09740