Ygor Serpa
1 min readAug 31, 2020

One thing caught my eye. The sparsemax paper is from 2016, "Attention Is All You Need" is from 2017, and BERT is from 2018. This means this activation came before the "boom" of attention. Is the data in Table 3 still reliable for today's models?

In other words, has this been re-tried recently, and is it still "just a tiny bit better" than softmax for attention models?
