As a first step, we need to build a short program that creates two 1000-element integer arrays and fills them with random numbers in the range -1000 to +1000. After that, we need to sum those two arrays element-by-element into a third, separate array. Lastly, we sum the third array and print the result.
Auto-Vectorization
Auto-vectorization is a technique in parallel computing in which the compiler transforms scalar loops into sequences of vector (SIMD) operations, so that a single instruction processes several data elements at once and the program runs more efficiently.
In order to produce efficient, fully vectorized code, we need to separate the different stages of the computation so that each stage can operate on many elements at once, rather than lumping everything together and processing a single element at a time.
We can see an example of this in the two short programs below:
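Here is a minimal sketch of what the first, single-loop version might look like. The variable names and the printf format string are assumptions; the array size follows the requirements above, and the rand() % 2000 - 1000 form matches the constants 2000 and 1000 that appear in the disassembly further down.

#include <stdio.h>
#include <stdlib.h>

#define SIZE 1000

int main() {
    int a[SIZE], b[SIZE], c[SIZE];
    int total = 0;

    /* One loop does everything, one element at a time:
       fill both arrays, add the pair, and accumulate the total.
       The rand() calls inside the loop body block vectorization. */
    for (int i = 0; i < SIZE; i++) {
        a[i] = rand() % 2000 - 1000;
        b[i] = rand() % 2000 - 1000;
        c[i] = a[i] + b[i];
        total += c[i];
    }

    printf("Total: %d\n", total);
    return 0;
}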
Although this code is a working solution to our programming needs (as specified in the code requirements), it is not auto-vectorized: all the elements of the arrays are processed one at a time inside a single loop, which makes the program needlessly slow.
Take a look at this version of the very same code (rand.c):
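The following is a sketch of what rand.c plausibly looks like in this split-loop form, reconstructed from the description below and from the structure visible in the disassembly (two rand() calls per iteration of the first loop, a separate summation loop, and the constants 2000 and 1000). The names and the printf format string are assumptions.

#include <stdio.h>
#include <stdlib.h>

#define SIZE 1000

int main() {
    int a[SIZE], b[SIZE], c[SIZE];
    int total = 0;

    /* Loop 1: fill both arrays with random values.
       The calls to rand() keep this loop scalar. */
    for (int i = 0; i < SIZE; i++) {
        a[i] = rand() % 2000 - 1000;
        b[i] = rand() % 2000 - 1000;
    }

    /* Loop 2: element-wise sum into a third array, plus the running
       total. There are no function calls here, so the compiler is
       free to vectorize this loop with SIMD instructions. */
    for (int i = 0; i < SIZE; i++) {
        c[i] = a[i] + b[i];
        total += c[i];
    }

    printf("Total: %d\n", total);
    return 0;
}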
In this version, the code can be vectorized: the work is split into two (or more) loops instead of a single one, so the filling of the arrays and the element-wise summation are handled separately, and the summation loop can operate on several elements of the arrays simultaneously.
Let's compile the 2nd version of our program and disassemble the code.
gcc -O3 rand.c -o rand
It compiles and runs.
Now let's disassemble it:
objdump -d rand
Let's observe the output of the <main> section:
00000000004004f0 <main>:
4004f0: d283f010 mov x16, #0x1f80 // Allocate and initialise stack space
4004f4: cb3063ff sub sp, sp, x16
4004f8: a9007bfd stp x29, x30, [sp]
4004fc: 910003fd mov x29, sp
400500: a90153f3 stp x19, x20, [sp,#16]
400504: 5289ba74 mov w20, #0x4dd3 // Loop number 1. Filling arrays with random values.
400508: a9025bf5 stp x21, x22, [sp,#32]
40050c: 910103b5 add x21, x29, #0x40
400510: 913f83b6 add x22, x29, #0xfe0
400514: f9001bf7 str x23, [sp,#48]
400518: 72a20c54 movk w20, #0x1062, lsl #16
40051c: 5280fa13 mov w19, #0x7d0 // #2000
400520: d2800017 mov x23, #0x0 // #0
400524: 97ffffe3 bl 4004b0 <rand@plt> // Use of rand() values
400528: 9b347c01 smull x1, w0, w20 // rand() % 2000 computed with a multiply-and-shift trick (scalar integer instructions)
40052c: 9367fc21 asr x1, x1, #39
400530: 4b807c21 sub w1, w1, w0, asr #31
400534: 1b138020 msub w0, w1, w19, w0
400538: 510fa000 sub w0, w0, #0x3e8
40053c: b8376aa0 str w0, [x21,x23]
400540: 97ffffdc bl 4004b0 <rand@plt> // Use of rand() values
400544: 9b347c01 smull x1, w0, w20 // rand() % 2000 via multiply and shift (scalar integer instructions)
400548: 9367fc21 asr x1, x1, #39
40054c: 4b807c21 sub w1, w1, w0, asr #31
400550: 1b138020 msub w0, w1, w19, w0
400554: 510fa000 sub w0, w0, #0x3e8
400558: b8376ac0 str w0, [x22,x23]
40055c: 910012f7 add x23, x23, #0x4 // Loop condition
400560: f13e82ff cmp x23, #0xfa0
400564: 54fffe01 b.ne 400524 <main+0x34>
400568: 4f000401 movi v1.4s, #0x0 // Vectorization: zero the vector accumulator
40056c: d2800000 mov x0, #0x0 // Loop number 2. Summing the elements of the arrays
400570: 3ce06aa0 ldr q0, [x21,x0] // Load four ints from the first array into a vector register
400574: 3ce06ac2 ldr q2, [x22,x0] // Load four ints from the second array
400578: 91004000 add x0, x0, #0x10 // Advance by 16 bytes (four 32-bit elements)
40057c: f13e801f cmp x0, #0xfa0
400580: 4ea28400 add v0.4s, v0.4s, v2.4s // Vectorization: four element-wise additions at once
400584: 4ea08421 add v1.4s, v1.4s, v0.4s // Vectorization: accumulate partial sums
400588: 54ffff41 b.ne 400570 <main+0x80> // Loop condition
40058c: 4eb1b821 addv s1, v1.4s // Vectorization: horizontal add across the four lanes
400590: 90000000 adrp x0, 400000 <_init-0x468> // Print total sum
400594: 911e0000 add x0, x0, #0x780
400598: 0e043c21 mov w1, v1.s[0]
40059c: 97ffffd1 bl 4004e0 <printf@plt>
4005a0: f9401bf7 ldr x23, [sp,#48]
4005a4: a94153f3 ldp x19, x20, [sp,#16]
4005a8: 52800000 mov w0, #0x0 // #0
4005ac: a9425bf5 ldp x21, x22, [sp,#32]
4005b0: d283f010 mov x16, #0x1f80 // #8064
4005b4: a9407bfd ldp x29, x30, [sp]
4005b8: 8b3063ff add sp, sp, x16
4005bc: d65f03c0 ret // Return

We can see that our program has been vectorised and that vector registers and SIMD instructions have been used in the summation loop.
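To make the mapping concrete, the summation loop above is roughly equivalent to the following hand-written NEON intrinsics code. This is only an illustration of what the compiler generated, not part of the original program; the function name is made up, and it assumes the element count is a multiple of four (1000 is).

#include <arm_neon.h>

/* Roughly what the compiler produced for the second loop:
   four 32-bit integers are processed per iteration, partial sums
   are kept in a vector register, and a single horizontal add
   reduces them to a scalar at the end. */
int32_t sum_pairs(const int32_t *a, const int32_t *b, int n)
{
    int32x4_t acc = vdupq_n_s32(0);            /* movi v1.4s, #0x0        */

    for (int i = 0; i < n; i += 4) {
        int32x4_t va = vld1q_s32(&a[i]);       /* ldr q0, [x21,x0]        */
        int32x4_t vb = vld1q_s32(&b[i]);       /* ldr q2, [x22,x0]        */
        int32x4_t vc = vaddq_s32(va, vb);      /* add v0.4s, v0.4s, v2.4s */
        acc = vaddq_s32(acc, vc);              /* add v1.4s, v1.4s, v0.4s */
    }

    return vaddvq_s32(acc);                    /* addv s1, v1.4s          */
}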
Reflection
As we can see, auto-vectorization goes a long way towards optimising and speeding up code. As an active research topic in compiler design, there is a lot to be learnt and said about its advantages and implementation methods, many of which are still under research and development. This lab demonstrated the lengths compilers can go to in order to optimise programs, especially when working with loops that process large amounts of data.
I have found this topic intriguing (if taxing) and helpful for understanding how vector registers and SIMD instructions work and how they can be applied to real-life applications when necessary. I look forward to learning more about it in the future and expanding my knowledge as much as possible.