Mar 27, 2020
I once taught “Learning Rate” as “Desperate Index”. You should use something like 0.001 so you aren’t too desperate for a solution :)
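For what it’s worth, here is a minimal sketch (assuming PyTorch and its Adam optimizer; the tiny linear model is just a stand-in for illustration) of what a modest learning rate looks like in practice:

```python
import torch

# Stand-in model for illustration; any torch.nn.Module works the same way.
model = torch.nn.Linear(10, 1)

# A "not too desperate" learning rate of 0.001.
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
```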
Besides that, I don’t really see gradient clipping or the exploding/vanishing gradient problems mentioned much anymore. I know they still occur; I just don’t see them being as relevant today as they once were.
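In case it helps, a rough sketch (assuming PyTorch; the model, dummy data, and hyperparameters are placeholders) of how gradient clipping typically slots into a training step:

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
criterion = torch.nn.MSELoss()

x, y = torch.randn(32, 10), torch.randn(32, 1)  # dummy batch

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()

# Rescale gradients so their global norm is at most 1.0,
# which guards against exploding gradients.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```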
On a side note about batch norm: I recently became aware that you can train a ResNet for CIFAR-10 by training only the batch norm layers. I have written an article on that, and I think you might like it.
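If it’s useful, here’s a hedged sketch (assuming PyTorch and torchvision’s resnet18; the exact setup in my article may differ) of freezing everything except the batch norm parameters:

```python
import torch
import torchvision

# 10 output classes for CIFAR-10.
model = torchvision.models.resnet18(num_classes=10)

# Freeze all parameters first, then re-enable gradients
# for the BatchNorm layers only.
for param in model.parameters():
    param.requires_grad = False
for module in model.modules():
    if isinstance(module, torch.nn.BatchNorm2d):
        for param in module.parameters():
            param.requires_grad = True

# Optimize only the batch norm parameters.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=0.001)
```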