Including key.bias or leaving it out gives equivalent results; BEiT drops it in its implementation for numerical stability during fp16 training. The equivalence holds because the key.bias terms are cancelled by the softmax function.
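Softmax(q, k) = exp(q.weight * key.weight + q.bias * key.weight + q.weight * key.bias + q.bias * key.bias) / Z
Here q.weight and key.weight are shorthand for the weight-only parts of the projections (weight times input), and Z is the normalizer: the sum of the same exponential over all keys.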
Because the query is the same for all keys, the term (q.weight * key.bias + q.bias * key.bias) is identical across keys. It therefore appears as the same constant factor in the numerator and in every term of Z, and it cancels without affecting the softmax result.
For any constant C:
exp(a) / (exp(a) + exp(b)) == exp(a + C) / (exp(a + C) + exp(b + C))
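A quick numerical check of the cancellation, as a minimal sketch rather than BEiT's actual code. It assumes plain linear projections q = W_q x + b_q and k = W_k x + b_k, uses random toy values, and omits the usual 1/sqrt(d) scaling since that does not change the argument. All names (`attention_weights`, `w_q`, `b_k`, and so on) are made up for illustration.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear(weight, bias, x):
    """Compute weight @ x + bias, with weight given as a list of rows."""
    return [dot(row, x) + b for row, b in zip(weight, bias)]


def softmax(scores):
    """Softmax over a list of scores.

    Subtracting the max is itself an instance of the shift identity:
    adding the same constant to every score leaves the output unchanged.
    """
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(x_query, x_keys, w_q, b_q, w_k, b_k):
    """Attention weights of one query over all keys (hypothetical helper)."""
    q = linear(w_q, b_q, x_query)
    scores = [dot(q, linear(w_k, b_k, x)) for x in x_keys]
    return softmax(scores)


def rand_vec(rng, n):
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def rand_mat(rng, rows, cols):
    return [rand_vec(rng, cols) for _ in range(rows)]


rng = random.Random(0)
dim, num_keys = 4, 5
w_q, b_q = rand_mat(rng, dim, dim), rand_vec(rng, dim)
w_k, b_k = rand_mat(rng, dim, dim), rand_vec(rng, dim)
x_query = rand_vec(rng, dim)
x_keys = [rand_vec(rng, dim) for _ in range(num_keys)]

# Same query and keys, once with the key bias and once with it zeroed out.
with_bias = attention_weights(x_query, x_keys, w_q, b_q, w_k, b_k)
without_bias = attention_weights(x_query, x_keys, w_q, b_q, w_k, [0.0] * dim)

# The two distributions should agree up to floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(with_bias, without_bias))
print(max_diff)
```

The query bias is kept in both runs on purpose: the term q.bias * key.weight differs from key to key, so only key.bias is redundant.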