In scientific application using triangular grids it is often the case to load a value from all three vertices of the triangle which is spread across RAM. In order to use SIMD instruction to speed up execution time of the program the three values must be stored into a single SIMD register to perform e.g. a parallel multiplication (for a scalar product). One approach to archive this is to load the values one by one into multiple registers and combine them in an additional step. Would it be a better approach to load the three values in on step into one register? This can be done with the AVX gather instruction.
As I don't know how to use SIMD Intrinsics in fortran, this example is C++ only.
We start by native approach how to load three values into a SIMD register using intrinsics datatypes like __m256d which can hold up to four 64bit double precision floating-point values where actually only three are used.
__m256d f1(const std::vector<double>& values, const std::array<int,3>& idx) { const __m256d vec{ values[idx[0]], values[idx[1]], values[idx[2]]}; return vec; }
This generate the following assembler code independent of passed optimization flags -O1 -O2 -O3
movq (%rdi), %rax # movslq (%rsi), %rcx # load idx0 in register rcx movslq 4(%rsi), %rdx # load idx1 in register rdx movslq 8(%rsi), %rsi # load idx2 in register rsi vmovsd (%rax,%rcx,8), %xmm0 # load values[idx[0]] into xmm0 vmovq (%rax,%rsi,8), %xmm1 # load values[idx[2]] into xmm1 vmovhpd (%rax,%rdx,8), %xmm0, %xmm0 # load values[idx[1]] into hight part of xmm0 vinsertf128 $0x1, %xmm1, %ymm0, %ymm0 # combine xmm0 and xmm1 into 512bit register ymm0 ret
To load 3 values at once we can use the _mm256_mask_i32gather_pd instruction which take a base address and offsets in a variable of type __m128i. To only load 3 instead of 4 values there is an additional mask variable.
The C++ code looks like this now:
__m256d f2(const std::vector<double>& values, const std::array<int,3>& idx) { // mask to load 3 instead of 4 values const __m128i mask = _mm_setr_epi32(-1, -1, -1, 0); // Copy Element Node IDs into SIMD register const __m128i vindex = _mm_maskload_epi32(idx.data(), mask); // Load 3 double values constexpr int scale = 8; // mask to load 3 values but this time as 256d variable (but docu say it must be 256i, bug?) const __m256d mask2 = _mm256_setr_pd (-1, -1, -1, 0); const __m256d vec{_mm256_mask_i32gather_pd (__m256d{}, values.data(), vindex, mask2, scale)}; return vec; }
This generate the following assembler code independent of passed optimization flags -O1 -O2 -O3
movq (%rdi), %rax # vmovapd .LC1(%rip), %ymm2 # load mask2 into ymm2 vxorpd %xmm0, %xmm0, %xmm0 # zero register xmm0 vmovdqa .LC0(%rip), %xmm1 # load mask1 into xmm1 vpmaskmovd (%rsi), %xmm1, %xmm1 # load idx into xmm1 vgatherdpd %ymm2, (%rax,%xmm1,8), %ymm0 # load values[idx] into ymm0 ret
With AVX512 it is even possible the gather and scatter 3 values at ones. This time the mask variable is easier to create.
void f3(std::vector<double>& values, const std::array<int,3>& idx) { // Copy Element Node IDs into SIMD register constexpr int mask = 0b00000111; const __m128i vindex = _mm_maskz_loadu_epi32(mask, idx.data()); // Load 3 double values constexpr int scale = 8; const __m256d vec{_mm256_mmask_i32gather_pd (__m256d{}, mask, vindex, values.data(), scale)}; // Store 3 double vales _mm256_mask_i32scatter_pd(values.data(), mask, vindex, vec, scale); }
This generate the following assembler code independent of passed optimization flags -O1 -O2 -O3
movl $7, %eax # load mask into eax vxorpd %xmm0, %xmm0, %xmm0 # zero register xmm0 kmovw %eax, %k1 # convert mask into the mask register k1 movq (%rdi), %rax # vmovdqu32 (%rsi), %xmm1{%k1}{z} # load idx into xmm1 kmovw %k1, %k2 # copy mask register. why is that so? bug? vgatherdpd (%rax,%xmm1,8), %ymm0{%k2} # load values[idx] into ymm0 vscatterdpd %ymm0, (%rax,%xmm1,8){%k1} # stpre ymm0 into values[idx] vzeroupper # can be optimization away? ret
To measure I created a little testcase with the presented different variants on the following CPUs: Intel Core i5-10500, Intel Xeon E5-2695 v4, Intel Core i9-7900X. Unfortunately there is no speedup using AVX. There is even no slowdown (except for 1-2%).
https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=m256_i32gather&expand=4474&ig_expand=4252,4250,2125,3720,4058,3723,3723&techs=AVX_ALL