Monday 23 October 2017

SPO600 - Lab 5 - Vectorization Lab

In this lab we will be exploring single instruction/multiple data (SIMD) vectorization, and the auto-vectorization capabilities of the GCC compiler on the AArch64 server.

As a first step, we need to write a short program that creates two 1000-element integer arrays and fills them with random numbers in the range -1000 to +1000. The program then sums those two arrays element-by-element into a third, separate array. Lastly, it sums the third array and prints the result.

Auto-Vectorization
Auto-vectorization is a compiler technique from parallel computing that makes programs more efficient by transforming scalar loops into sequences of vector (SIMD) operations, so that several array elements are processed by a single instruction.
In order to get efficient, fully vectorized code, we need to structure the work so that independent operations sit in separate, simple loops the compiler can recognize, rather than lumping everything together and processing a single element at a time.

We can see an example of that in the two short versions of the program below.
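Here is a minimal sketch of the first, single-loop version (the original listing isn't reproduced here, so the array and variable names are my own assumptions):

// rand.c -- first attempt: all of the work happens in a single loop
#include <stdio.h>
#include <stdlib.h>

#define SIZE 1000

int main(void) {
        int a[SIZE], b[SIZE], c[SIZE];
        int total = 0;

        for (int i = 0; i < SIZE; i++) {
                a[i] = rand() % 2000 - 1000;  // random value, roughly -1000 to +1000
                b[i] = rand() % 2000 - 1000;
                c[i] = a[i] + b[i];           // element-by-element sum into a third array
                total += c[i];                // running total of the third array
        }

        printf("Total: %d\n", total);
        return 0;
}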

Although this code is a working solution to the requirements, the compiler cannot auto-vectorize it: because the calls to rand() are mixed into the same loop as the additions, every element has to be processed one at a time, which makes the program slower than it needs to be.

Take a look at this version of the very same code (rand.c):
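A sketch of what the restructured version might look like, again with assumed names; the key change is that the random-number filling and the summing are split into separate loops:

// rand.c -- second version: filling and summing are done in separate loops
#include <stdio.h>
#include <stdlib.h>

#define SIZE 1000

int main(void) {
        int a[SIZE], b[SIZE], c[SIZE];
        int total = 0;

        // Loop 1: fill both arrays with random values
        for (int i = 0; i < SIZE; i++) {
                a[i] = rand() % 2000 - 1000;
                b[i] = rand() % 2000 - 1000;
        }

        // Loop 2: element-by-element sum into the third array, plus the running total
        // (no function calls here, so the compiler is free to vectorize this loop)
        for (int i = 0; i < SIZE; i++) {
                c[i] = a[i] + b[i];
                total += c[i];
        }

        printf("Total: %d\n", total);
        return 0;
}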
In this version the code can be vectorized: the work is divided into two loops, so the loop that does the element-by-element addition contains no function calls and the compiler is free to process several elements of the arrays at once.

Let's compile the second version of our program and disassemble the resulting binary.
gcc -O3 rand.c -o rand
It compiles and runs.
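As a side note, newer releases of GCC can also report which loops were vectorized without reading the disassembly; depending on the GCC version installed on the server, a command along these lines should print a vectorization report:
gcc -O3 -fopt-info-vec rand.c -o rand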
Now let's disassemble it:
objdump -d rand
Let's observe the output of the <main> section:
00000000004004f0 <main>:
  4004f0:       d283f010        mov     x16, #0x1f80 // Allocate stack space for the frame and the arrays
  4004f4:       cb3063ff        sub     sp, sp, x16
  4004f8:       a9007bfd        stp     x29, x30, [sp]
  4004fc:       910003fd        mov     x29, sp
  400500:       a90153f3        stp     x19, x20, [sp,#16]
  400504:       5289ba74        mov     w20, #0x4dd3 // Loop number 1. Filling arrays with random values.
  400508:       a9025bf5        stp     x21, x22, [sp,#32]
  40050c:       910103b5        add     x21, x29, #0x40
  400510:       913f83b6        add     x22, x29, #0xfe0
  400514:       f9001bf7        str     x23, [sp,#48]
  400518:       72a20c54        movk    w20, #0x1062, lsl #16
  40051c:       5280fa13        mov     w19, #0x7d0                     // #2000
  400520:       d2800017        mov     x23, #0x0                       // #0
  400524:       97ffffe3        bl      4004b0 <rand@plt> // call rand() for the first array
  400528:       9b347c01        smull   x1, w0, w20 // scalar multiply by a "magic" constant -- start of the % 2000 calculation (not a SIMD instruction)
  40052c:       9367fc21        asr     x1, x1, #39
  400530:       4b807c21        sub     w1, w1, w0, asr #31
  400534:       1b138020        msub    w0, w1, w19, w0 // w0 = rand() % 2000
  400538:       510fa000        sub     w0, w0, #0x3e8 // shift into the -1000..+999 range
  40053c:       b8376aa0        str     w0, [x21,x23] // store into the first array
  400540:       97ffffdc        bl      4004b0 <rand@plt> // call rand() again for the second array
  400544:       9b347c01        smull   x1, w0, w20 // same scalar % 2000 calculation for the second array
  400548:       9367fc21        asr     x1, x1, #39
  40054c:       4b807c21        sub     w1, w1, w0, asr #31
  400550:       1b138020        msub    w0, w1, w19, w0
  400554:       510fa000        sub     w0, w0, #0x3e8
  400558:       b8376ac0        str     w0, [x22,x23] // store into the second array
  40055c:       910012f7        add     x23, x23, #0x4 // advance the index by 4 bytes (one int)
  400560:       f13e82ff        cmp     x23, #0xfa0 // loop condition: 0xfa0 = 4000 bytes = 1000 ints
  400564:       54fffe01        b.ne    400524 <main+0x34>
  400568:       4f000401        movi    v1.4s, #0x0 // Vectorization: zero the vector accumulator (four partial sums)
  40056c:       d2800000        mov     x0, #0x0 // Loop 2: add the two arrays, four elements per iteration
  400570:       3ce06aa0        ldr     q0, [x21,x0] // load 4 ints from the first array
  400574:       3ce06ac2        ldr     q2, [x22,x0] // load 4 ints from the second array
  400578:       91004000        add     x0, x0, #0x10 // advance by 16 bytes (4 ints)
  40057c:       f13e801f        cmp     x0, #0xfa0
  400580:       4ea28400        add     v0.4s, v0.4s, v2.4s // Vectorization: add 4 pairs of elements at once
  400584:       4ea08421        add     v1.4s, v1.4s, v0.4s // Vectorization: accumulate into the partial sums
  400588:       54ffff41        b.ne    400570 <main+0x80> // Loop condition
  40058c:       4eb1b821        addv    s1, v1.4s // Vectorization: reduce the 4 partial sums to a single total
  400590:       90000000        adrp    x0, 400000 <_init-0x468> // Print the total: load the format string address
  400594:       911e0000        add     x0, x0, #0x780
  400598:       0e043c21        mov     w1, v1.s[0] // move the result out of the vector register for printf
  40059c:       97ffffd1        bl      4004e0 <printf@plt>
  4005a0:       f9401bf7        ldr     x23, [sp,#48]
  4005a4:       a94153f3        ldp     x19, x20, [sp,#16]
  4005a8:       52800000        mov     w0, #0x0                        // #0
  4005ac:       a9425bf5        ldp     x21, x22, [sp,#32]
  4005b0:       d283f010        mov     x16, #0x1f80                    // #8064
  4005b4:       a9407bfd        ldp     x29, x30, [sp]
  4005b8:       8b3063ff        add     sp, sp, x16
  4005bc:       d65f03c0        ret // Return
We can see that the summing loop has been vectorized: the compiler uses the 128-bit vector registers (q0, q2, v0, v1) and SIMD instructions (the ldr q loads, the add v.4s additions and the addv reduction) to process four integers per iteration, while the fill loop stays scalar because of its calls to rand().
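For comparison, the vectorized summing loop is roughly equivalent to the following hand-written sketch using the AArch64 NEON intrinsics from arm_neon.h. This is only an illustration of what the instructions above are doing, not what GCC actually generated, and it assumes the element count is a multiple of four:

#include <arm_neon.h>

// Sum a[i] + b[i] over n elements, four at a time (n assumed to be a multiple of 4)
int32_t sum_arrays(const int32_t *a, const int32_t *b, int n) {
        int32x4_t acc = vdupq_n_s32(0);                  // like movi v1.4s, #0x0 above
        for (int i = 0; i < n; i += 4) {
                int32x4_t va = vld1q_s32(a + i);         // like ldr q0, [x21,x0]
                int32x4_t vb = vld1q_s32(b + i);         // like ldr q2, [x22,x0]
                acc = vaddq_s32(acc, vaddq_s32(va, vb)); // like the two add v.4s instructions
        }
        return vaddvq_s32(acc);                          // like addv s1, v1.4s
}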

Reflection
As we can see, auto-vectorization goes a long way towards optimising and speeding up your code. It is an active research topic in compiler design, and a lot can be learnt and said about its advantages and implementation methods, many of which are still under research and development. This lab demonstrated the depth compilers can go to in order to optimise programs, especially when working with loops that process large amounts of data.
I have found this topic intriguing (yet taxing) and helpful for understanding how vector registers and SIMD work and how they can be applied to real-life applications when necessary. I look forward to learning more about it in the future and expanding my knowledge as much as possible.
