1 min read · Aug 31, 2020
One thing caught my eye: the sparsemax paper is from 2016, Attention Is All You Need is from 2017, and BERT is from 2018. This means the activation predates the "boom" of attention. Is the data in Table 3 still reliable for today's models?
In other words, has this been re-tried recently, and is it still "just a tiny bit better" than softmax for attention models?
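For anyone who wants to re-try it: sparsemax is simply the Euclidean projection of the logits onto the probability simplex, so it is cheap to drop into an attention layer in place of softmax. A minimal NumPy sketch (my own, based on the closed-form solution in the 2016 paper, not code from the article):

```python
import numpy as np

def sparsemax(z):
    """Sparsemax: project the logit vector z onto the probability simplex.
    Unlike softmax, it can assign exactly zero weight to low-scoring entries."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]              # logits in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum      # which entries stay nonzero
    k_z = k[support][-1]                     # support size
    tau = (cumsum[support][-1] - 1) / k_z    # threshold subtracted from z
    return np.maximum(z - tau, 0.0)

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()
```

For example, `sparsemax([2.0, 1.0, 0.1])` puts all mass on the first entry, while `softmax` of the same logits keeps every entry strictly positive; both always sum to 1.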