+/// memory, typically in warm L1 cache. The packed operands are read from
+/// memory into working registers XMM0--XMM3 and processed immediately.
+/// The following conventional argument names and locations are used
+/// throughout.
+///
+///   Arg  Format    Location    Notes
+///
+///   U    packed    [EAX]
+///   X    packed    [EBX]       In Montgomery multiplication, X = N
+///   V    expanded  [ECX]
+///   Y    expanded  [EDX]       In Montgomery multiplication, Y = (A + U V) M
+///   M    expanded  [ESI]       -N^{-1} (mod B^4)
+///   N                          Modulus, for Montgomery multiplication
+///   A    packed    [EDI]       Destination/accumulator
+///   C    carry     XMM4--XMM7
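+///
+/// As a quick check of the Montgomery notes (a sketch, assuming that Y is
+/// reduced modulo B^4 and taking C = 0): since N M == -1 (mod B^4),
+///
+///   U V + X Y + A  =  (A + U V) + N Y
+///                  == (A + U V) (1 + N M)    (mod B^4)
+///                  == 0                      (mod B^4)
+///
+/// so, under those assumptions, the low four words of the sum vanish and
+/// only the carry part remains.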
+///
+/// The calculation is some variant of
+///
+/// A' + C' B^4 <- U V + X Y + A + C
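+///
+/// As an illustrative bound (a sketch, assuming that U, V, X, Y and A are
+/// each less than B^4, which this section does not state explicitly): if
+/// C <= 2 B^4 - 2 on entry, then
+///
+///   U V + X Y + A + C  <=  2 (B^4 - 1)^2 + (B^4 - 1) + (2 B^4 - 2)
+///                      =   2 B^8 - B^4 - 1
+///                      =   (2 B^4 - 2) B^4 + (B^4 - 1)
+///
+/// so A' < B^4 and C' <= 2 B^4 - 2 on exit.  The same bound therefore
+/// carries over from one step to the next, and the carry can be one bit
+/// wider than the four-word operands.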
+///
+/// The low-level functions fit into a fairly traditional (finely integrated)
+/// operand-scanning loop over operand pairs (U, X) (indexed by j) and (V, Y)
+/// (indexed by i).
+///
+/// The variants are as follows.
+///
+///   Function  Variant             Use         i    j
+///
+///   mmul4     A = C = 0           Montgomery  0    0
+///   dmul4     A = 0               Montgomery  0    +
+///   mmla4     C = 0               Montgomery  +    0
+///   dmla4     exactly as shown    Montgomery  +    +
+///   mont4     U = V = C = 0       Montgomery  any  0
+///
+///   mul4zc    U = V = A = C = 0   Plain       0    0
+///   mul4      U = V = A = 0       Plain       0    +
+///   mla4zc    U = V = C = 0       Plain       +    0
+///   mla4      U = V = 0           Plain       +    +