Monday, 23 October 2017

SPO600 - Lab 5 - Vectorization Lab

In this lab we will be exploring single instruction/multiple data (SIMD) vectorization, and the auto-vectorization capabilities of the GCC compiler on the AArch64 server.

On the first step, we need to build a short program that creates two 1000 element integer arrays and fills them with random numbers in the range -1000 to +1000. After that, we will need to sum those two arrays element-by-element to another, separate array. Lastly, we will sum the third array and prints the result.

Auto-Vectorization
Automatic-vectorization is a method in parallel computing used to create multiple-process programs more efficient by transforming loops into sequences of vector operations.
In order to create an efficient, fully vectorized code, we need to divide the processes that take place, so they can work simultaneously as separate, rather than lumped in into a single element at a time.

We can see an example of that in these two short programs created:

Albeit this code is a working solution to our programming needs (as specified in the code requirements), this program isn't auto-vectorized since all the elements of the arrays are processed one at a time, making it extremely inefficient and long.

Take a look at this version of the very same code (rand.c):
In this version, the code is vectorized, since we divide both processes to work simultaneously on each elements of the arrays, in 2 (or more) loops, instead of a single one.

Let's compile the 2nd version of our program and disassemble the code.
gcc -O3 rand.c -o rand
It compiles and runs.
Now let's disassemble it:
objdump -d rand
Let's observe the output of the <main> section:
00000000004004f0 <main>:
  4004f0:       d283f010        mov     x16, #0x1f80 // Assign and initialises space
  4004f4:       cb3063ff        sub     sp, sp, x16
  4004f8:       a9007bfd        stp     x29, x30, [sp]
  4004fc:       910003fd        mov     x29, sp
  400500:       a90153f3        stp     x19, x20, [sp,#16]
  400504:       5289ba74        mov     w20, #0x4dd3 // Loop number 1. Filling arrays with random values.
  400508:       a9025bf5        stp     x21, x22, [sp,#32]
  40050c:       910103b5        add     x21, x29, #0x40
  400510:       913f83b6        add     x22, x29, #0xfe0
  400514:       f9001bf7        str     x23, [sp,#48]
  400518:       72a20c54        movk    w20, #0x1062, lsl #16
  40051c:       5280fa13        mov     w19, #0x7d0                     // #2000
  400520:       d2800017        mov     x23, #0x0                       // #0
  400524:       97ffffe3        bl      4004b0 <rand@plt> // Use of rand() values
  400528:       9b347c01        smull   x1, w0, w20 // use of vector registers and SIMD instructions
  40052c:       9367fc21        asr     x1, x1, #39
  400530:       4b807c21        sub     w1, w1, w0, asr #31
  400534:       1b138020        msub    w0, w1, w19, w0
  400538:       510fa000        sub     w0, w0, #0x3e8
  40053c:       b8376aa0        str     w0, [x21,x23]
  400540:       97ffffdc        bl      4004b0 <rand@plt> //// Use of rand() values
  400544:       9b347c01        smull   x1, w0, w20 // use of vector registers and SIMD instructions
  400548:       9367fc21        asr     x1, x1, #39
  40054c:       4b807c21        sub     w1, w1, w0, asr #31
  400550:       1b138020        msub    w0, w1, w19, w0
  400554:       510fa000        sub     w0, w0, #0x3e8
  400558:       b8376ac0        str     w0, [x22,x23]
  40055c:       910012f7        add     x23, x23, #0x4 // Loop condition
  400560:       f13e82ff        cmp     x23, #0xfa0
  400564:       54fffe01        b.ne    400524 <main+0x34>
  400568:       4f000401        movi    v1.4s, #0x0 // Vectorization
  40056c:       d2800000        mov     x0, #0x0 // Adding elements of the first loop
  400570:       3ce06aa0        ldr     q0, [x21,x0]
  400574:       3ce06ac2        ldr     q2, [x22,x0]
  400578:       91004000        add     x0, x0, #0x10
  40057c:       f13e801f        cmp     x0, #0xfa0
  400580:       4ea28400        add     v0.4s, v0.4s, v2.4s // Vectorization
  400584:       4ea08421        add     v1.4s, v1.4s, v0.4s // Vectorization
  400588:       54ffff41        b.ne    400570 <main+0x80> // Loop condition
  40058c:       4eb1b821        addv    s1, v1.4s // Vectorization
  400590:       90000000        adrp    x0, 400000 <_init-0x468> // Print total sum
  400594:       911e0000        add     x0, x0, #0x780
  400598:       0e043c21        mov     w1, v1.s[0]
  40059c:       97ffffd1        bl      4004e0 <printf@plt>
  4005a0:       f9401bf7        ldr     x23, [sp,#48]
  4005a4:       a94153f3        ldp     x19, x20, [sp,#16]
  4005a8:       52800000        mov     w0, #0x0                        // #0
  4005ac:       a9425bf5        ldp     x21, x22, [sp,#32]
  4005b0:       d283f010        mov     x16, #0x1f80                    // #8064
  4005b4:       a9407bfd        ldp     x29, x30, [sp]
  4005b8:       8b3063ff        add     sp, sp, x16
  4005bc:       d65f03c0        ret // Return
We can see that our program has been vectorised and there has been a use of vector registers and SIMD instructions.

Reflection
As we can see, Auto-Vectorization goes a long way with optimising and speeding up your code. Being the major research topic in computer science it is today, a lot can be learnt and said about its advantages and implementation methods, most which are still under research/development. This lab demonstrated the depth compilers can go to in order to optimise programs, especially when working with loops who process a large amount of values and data.
I have found this topic intriguing (yet taxing) and helpful to understanding how vector registers and SIMD work and how it can be applied towards real life applications, when necessary. I will be looking forward to learn more about it during the future, and expand my knowledge as much as possible for better results.

Sunday, 1 October 2017

SPO600 - Lab 4 - Code Building Lab

In lab number 4, we will be building a software package.
In the first step, we will choose a software package from the Free Software Foundation's GNU Project.
A full list of software packages can be found here: Link.

I have decided to build the software package of Barcode. Barcode is a "a tool to convert text strings to printed bars. It supports a variety of standard codes to represent the textual strings and creates postscript output."

Now after picking the package we would like to work with, we will log into our Linux system and download the package using "wget" command:
 wget ftp://ftp.gnu.org/gnu/barcode/barcode-0.99.tar.gz
This is what we should be getting from the console:
barcode-0.99.tar.gz 100%[===================>] 869.85K  2.87MB/s    in 0.3s

2017-10-01 19:47:58 (2.87 MB/s) - ‘barcode-0.99.tar.gz’ saved [890730]
 Next, we will unzip the file we have downloaded using the "tar" command:
 tar -zxvf barcode-0.99.tar.gz
 After we unzipping the file, we can see there should be an instruction file (often named as INSTALL). In this case, the INSTALL file tells us to look at INSTALL.generic for basic instructions. By reading the file INSTALL.generic we can see the following instructions:

From the document, we understand that the next step would be to run the "configure" command.
After the configuration is done, we will run the command "make" to compile the package.
The fourth step "make install" would install the programs and relevant data, which is something we do not want, so we won't do this part.
 After running the command "make" which should be finished in a few minutes, we will get a new file called "barcode".
By running it we can test the software package:


It works!


Part 2: Build and test glibc

In this part we will find and build the source code for the latest released version of the GNU Standard C Library (glibc), which can be found at the glibc website.
Now we will download it to our system using the "wget" command:
wget http://ftp.gnu.org/gnu/glibc/glibc-2.26.tar.gz
 This is what we are supposed to be getting:
glibc-2.26.tar.gz   100%[===================>]  28.00M  19.6MB/s    in 1.4s

2017-10-01 20:00:41 (19.6 MB/s) - ‘glibc-2.26.tar.gz’ saved [29355499/29355499] 
Next, we will unpack the file using "tar -zxvf".
Same as with more other software packages, the installing instructions are within the INSTALL file, which we will now open and skim through it.
The INSTALL file states that:
The GNU C Library cannot be compiled in the source directory.  You must
build it in a separate build directory.  For example, if you have
unpacked the GNU C Library sources in '/src/gnu/glibc-VERSION', create a
directory '/src/gnu/glibc-build' to put the object files in.  This
allows removing the whole build directory in case an error occurs, which
is the safest way to get a fresh start and should always be done.
As a safety measure so we will create a new folder called "glibc-build" and compile the file there, using a prefix to our command:
../glibc-2.26/configure --prefix=/home/ragarunov/glibc-build
Then we will run the command "make".
After a long compiling process, we can finally begin to test our library!

Testing:
The library provides us the file "testrun.sh" that can be used to test our own version. Using that, we can test our version of glibc by creating a simple Hello World program in C:

It works!
Now, we will try to put a bug and run the program. We will do so with a simple array and a loop:
[ragarunov@xerxes glibc-build]$ cat test.c
#include <stdio.h>

int main () {
        int num[4] = {1, 2, 3, 4};
        int i;

        for (i = 0; i<5; i++) {
                printf("%d", num[i]);
                printf("\n");
        }

        return 0;
}
After compiling and running the command:
./testrun.sh /home/ragarunov/lab4/glibc-build/test
Prints:
1
2
3
4
1219370488
 In both tests, the library worked well and compiled the files as necessary!

Override:
The override mechanism is commonly used in object oriented programming languages as a feature to provide subclasses implementations that will replace (or override) that implementation that has already been given by its parent class.

Multiarch:
Multiarch is a term that refers the capability of a system to install and run applications of multiple different binary targets on the same system. It is used to simplify cross-building of libraries and headers that are required for a system during building.