Ygor Serpa
Aug 16, 2021


Hey, thanks for the feedback. Indeed, my original explanation was quite short and didn't capture the whole picture. You are absolutely right about the correction term that replaces dropout at test time; however, as you pointed out, it is not exactly the same, "but it suffices". This is a great example of a "programmer" solution versus a "mathematician" solution: it is not perfect, but it works (and can be coded in no time).

Going a bit deeper, if you deactivate 10% of the neurons during training, you only have "90% of the signal". At test time, you deactivate nothing, so you scale everything down to 90% to pretend you are in the same circumstances as during training. The problem is that this is a rough approximation. For instance, the weights of the next layer were learned against a sparser input. In other words, we only made the problem less severe; we didn't solve it.
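To make that concrete, here is a minimal NumPy sketch of the classic scheme (the function names and the toy input are mine, just for illustration): drop neurons at train time, keep everything but scale by the keep probability at test time.

```python
import numpy as np

rng = np.random.default_rng(0)
p_drop = 0.1  # fraction of neurons dropped during training

def dropout_train(x, p_drop, rng):
    # Training: zero out a random fraction of the activations.
    mask = rng.random(x.shape) >= p_drop
    return x * mask

def dropout_test(x, p_drop):
    # Test time: keep everything, but scale down by the keep probability
    # so the next layer sees roughly the magnitude it saw during training.
    # This is an approximation, not an exact equivalence.
    return x * (1.0 - p_drop)

x = rng.normal(size=5)
print(dropout_train(x, p_drop, rng))  # sparse, full-strength activations
print(dropout_test(x, p_drop))        # dense, slightly scaled activations
```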

Nowadays, we have better ideas to address this. Monte Carlo Dropout simply keeps dropout active at test time, without any change. The benefit is that you can run inference multiple times on the same input and get slightly different versions of the output. If you take the variance of these outputs, you get a useful estimate of the model's confidence for that input. In other words, if running X through the network several times yields the same output, confidence is high; if not, the algorithm is just wildly guessing. As a bonus, you get the same behavior during training and testing.
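A rough sketch of that idea, using a toy one-layer "network" (the weights and repetition count below are placeholders of my own): dropout stays on, the same input goes through the network many times, and the variance of the outputs serves as the uncertainty estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
p_drop = 0.1

def forward_with_dropout(x, w, rng):
    # Dropout stays active at inference time; every call draws a fresh mask.
    mask = rng.random(x.shape) >= p_drop
    return (x * mask) @ w

x = rng.normal(size=8)       # one input sample
w = rng.normal(size=(8, 1))  # toy single-layer weights

# Run the same input several times and look at the spread of the outputs.
samples = np.array([forward_with_dropout(x, w, rng) for _ in range(100)])
prediction = samples.mean(axis=0)   # averaged prediction
uncertainty = samples.var(axis=0)   # large spread = low confidence
print(prediction, uncertainty)
```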

Another interesting approach is Gaussian Dropout. Instead of zeroing entries, you apply controlled Gaussian noise to all of them. This way, you alter the activations without shifting their expected value and without losing any of the original signal. At test time, no correction is needed. As a side note, Gaussian Dropout can be viewed as a form of data augmentation (adding noise) applied inside the network.
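A minimal sketch of the multiplicative variant (I'm assuming noise centered at 1 with the usual p/(1-p) strength; at test time you simply skip the noise):

```python
import numpy as np

rng = np.random.default_rng(0)
p_drop = 0.1
sigma = np.sqrt(p_drop / (1.0 - p_drop))  # common choice of noise strength

def gaussian_dropout(x, sigma, rng):
    # Multiply by noise centered at 1: activations are perturbed, but their
    # expected value stays the same, so no test-time correction is needed.
    noise = rng.normal(loc=1.0, scale=sigma, size=x.shape)
    return x * noise

x = rng.normal(size=5)
print(gaussian_dropout(x, sigma, rng))  # noisy but full-signal activations
```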

Hope this helps clarify the issue.

Best regards,
