The previous version of the comment erroneously claimed that the top
half of c3 held y_1. In fact it holds y_2, but we clobber it anyway
because the objective is to carry up into y_1, so mark it as
don't-care (like lo).
// Finally extract the answer. This complicated dance is better than
// storing to memory and loading, because the piecemeal stores
// inhibit store forwarding.
- movdqa \c3, \t // (y_0, y_1)
+ movdqa \c3, \t // (y_0, ?)
movdqa \lo, \t // (y^*_0, ?, ?, ?)
psrldq \t, 8 // (y_2, 0)
psrlq \c3, 32 // (floor(y_0/B), ?)