Unfortunately, due to the complexity and specialized nature of AVX-512, such optimizations are typically reserved for performance-critical applications and require expertise in low-level programming and processor microarchitecture.
Note that handwritten assembly is not compared here against a C version of the same implementation. The 94x speedup compares a non-SIMD C implementation with a SIMD assembly implementation, so it does not isolate the effect of handwritten assembly from the effect of SIMD.
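To make the distinction concrete, here is a minimal sketch of what a like-for-like C comparison could look like: the same reduction written once as plain scalar C and once as SIMD C. This is not the benchmarked code. The array-sum workload and the names `sum_scalar`, `sum_simd`, and `v8f` are hypothetical, and the SIMD version assumes GCC/Clang vector extensions rather than AVX-512 intrinsics so that it builds on machines without AVX-512.

```c
#include <stddef.h>
#include <string.h>

/* Scalar baseline (hypothetical workload): one addition per iteration. */
float sum_scalar(const float *data, size_t n) {
    float total = 0.0f;
    for (size_t i = 0; i < n; i++) {
        total += data[i];
    }
    return total;
}

/* Assumption: GCC/Clang vector extension type holding 8 floats (32 bytes).
 * The compiler lowers it to whatever SIMD width the target supports. */
typedef float v8f __attribute__((vector_size(32)));

/* SIMD variant of the same reduction: 8 lanes accumulated in parallel. */
float sum_simd(const float *data, size_t n) {
    v8f acc = {0};
    size_t i = 0;

    /* Main loop: process full groups of 8 floats. */
    for (; i + 8 <= n; i += 8) {
        v8f chunk;
        memcpy(&chunk, data + i, sizeof chunk); /* alignment-safe load */
        acc += chunk;                           /* lane-wise addition */
    }

    /* Horizontal reduction of the 8 partial sums. */
    float total = 0.0f;
    for (int lane = 0; lane < 8; lane++) {
        total += acc[lane];
    }

    /* Scalar tail for the remaining n % 8 elements. */
    for (; i < n; i++) {
        total += data[i];
    }
    return total;
}
```

Timing `sum_scalar` against `sum_simd` would measure the SIMD gain within C. Timing `sum_simd` against a handwritten-assembly equivalent would measure the gain from assembly itself, which is the comparison missing from the 94x figure. Note that the two functions add elements in a different order, so for general floating-point input their results can differ in the last bits.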