Saturday, 16 December 2017

SPO600 - Project - Stage 1

For the SPO600 project we have to find an open source software project and identify a hash or checksum function in its source code.
I came across and picked the open source project named Hashids.
From its official website:
"Hashids is a small open-source library that generates short, unique, non-sequential ids from numbers.
It converts numbers like 347 into strings like “yr8”, or array of numbers like [27, 986] into “3kTMd”.
You can also decode those ids back. This is useful in bundling several parameters into one or simply using them as short UIDs."
Hashids is a relatively small project that offers hashing as a way of managing IDs (for example, websites like YouTube hash their video IDs).

Benchmarking:
For the purpose of benchmarking, I have created a simple tester that encodes numbers using 2 different testing methods:
For the first method: 11 different numbers will be encoded manually and then timed together.
The code to trigger the encode() function and run the software is:
clock_t begin = clock();
std::cout << hash.encode({123}) << std::endl;
std::cout << hash.encode({123456}) << std::endl;
....
....
std::cout << "Time: " << (std::clock() - begin) / (double)(CLOCKS_PER_SEC / 1000) << " ms" << std::endl;
These are the results that I got:
The run-times are at best 0.106ms and at worst 0.130ms for 6 tests.

For the second method: Run a for loop that will encode 5,000,000 values.
The code to trigger the encode() function and run the software is:
int i;
clock_t begin = clock();
for (i = 0; i < 5000000; i++)
       hash.encode({i});

std::cout << "Last value (i) is: " << i << std::endl;
std::cout << "Time: " << (std::clock() - begin) / (double)(CLOCKS_PER_SEC / 1000) << " ms" << std::endl;
 These are the results that I got:
The run-times are at best 3730.58ms and at worst 3922.57ms for 7 tests.
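
As a side note, the millisecond conversion used in both tests can be pulled into a small helper. This is only a sketch in plain C and not part of Hashids; the names elapsed_ms and time_once_ms are made up for illustration, and the work callback stands in for whatever is being measured. Keep in mind that clock() reports the processor time used by the program, not wall-clock time.

```c
#include <time.h>

/* Convert a clock() interval to milliseconds.
 * Dividing by CLOCKS_PER_SEC as a double avoids integer truncation. */
double elapsed_ms(clock_t begin, clock_t end)
{
    return (double)(end - begin) * 1000.0 / (double)CLOCKS_PER_SEC;
}

/* Time a single run of a workload.
 * `work` is a hypothetical callback standing in for the encode() calls. */
double time_once_ms(void (*work)(void))
{
    clock_t begin = clock();
    work();
    return elapsed_ms(begin, clock());
}
```

Calling time_once_ms() several times and keeping the best and worst values would give the same kind of range reported above.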

Optimisation:
At first glance, I can already see that the Makefile uses the -O2 flag. Since the software uses very few resources, the time differences might be small, so I will be using milliseconds to benchmark performance.
In addition to the flag, I will be focusing on the encode() function inside "hashids.h" and will attempt to reduce its calls to other functions to make it more self-contained. The code is designed almost entirely around vectors and is already well optimised, but it can hopefully still be made faster with these changes and some algorithm improvements.

In stage 2 of the project I will implement the changes discussed in this post and report back with the results.

Saturday, 18 November 2017

SPO600 - Lab 6 - Algorithm Selection Lab

In lab number 6, we will be choosing between algorithms for adjusting the volume of audio samples by comparing 3 different approaches. Digital sound is represented as signed 16-bit integer samples. Each stereo channel carries 44.1 or 48 thousand samples per second, for a total of 88.2 or 96 thousand samples per second across the two channels.
Those samples can be scaled by a volume factor in the range of 0.0000 to 1.0000, with 0 being silent and 1 being the loudest (the original volume).

In order to do the benchmarking properly, we will generate a large array of 500 million int16_t elements, each holding a random sound sample, and use a volume factor of 0.75.
I will be using the Aarchie server for the purpose of this lab.

#1 - Multiply each sample by the floating point volume factor 0.75:
In this approach, we simply go through each element of the array, multiply it by the volume factor (0.75), and store the new value back into the array.
This is the code we have used for this approach:

And the results, before and after optimisation:

We can see that the optimisation improved the execution time significantly.
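
For reference, a minimal sketch of this approach could look like the following. It is an illustration rather than the exact code that was benchmarked; the function name scale_float is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Scale every sample in place by a floating-point volume factor. */
void scale_float(int16_t *samples, size_t count, float volume)
{
    for (size_t i = 0; i < count; i++) {
        /* int16_t -> float multiply -> truncate back to int16_t */
        samples[i] = (int16_t)(samples[i] * volume);
    }
}
```

Every sample costs an integer-to-float conversion, a floating-point multiply and a conversion back, which is the work the other two approaches try to avoid.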

#2 - Pre-calculate a lookup table (array) of all possible sample values multiplied by the volume factor, and look up each sample in that table to get the scaled values:
Using this approach, we will need to create another array that will function as a lookup table that contains all the samples already scaled.
Let's look at the code:

And the results, before and after optimisation:

Once again the optimisation significantly improved execution times, while producing the same total result.
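
A minimal sketch of the lookup-table idea is shown here, assuming the table is indexed by the sample value plus 32768 so that negative samples map to valid indexes. The names build_table, scale_lookup and TABLE_SIZE are hypothetical and this is not the exact code that was benchmarked.

```c
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 65536  /* one entry per possible int16_t value */

/* Fill the table: entry (v + 32768) holds v scaled by the volume factor. */
void build_table(int16_t table[TABLE_SIZE], float volume)
{
    for (int v = INT16_MIN; v <= INT16_MAX; v++) {
        table[v + 32768] = (int16_t)(v * volume);
    }
}

/* Replace each sample with its precomputed scaled value. */
void scale_lookup(int16_t *samples, size_t count,
                  const int16_t table[TABLE_SIZE])
{
    for (size_t i = 0; i < count; i++) {
        samples[i] = table[samples[i] + 32768];
    }
}
```

The floating-point work is done only 65,536 times while building the table; after that each sample needs just one array read.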

#3 - Bitwise conversion of volume factor:
In this algorithm, we first convert our volume factor to a fixed-point integer by multiplying it by 256, the binary value that represents "1" when 8 bits are used for the fraction. We then multiply each sample by that integer, shift the result right by 8 bits, and assign it back to the array.
Let's have a look at the code:

And the results, before and after optimisation:
As we can see, the execution time has improved.
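
A minimal sketch of the fixed-point idea, assuming 8 fractional bits as described above (so 0.75 becomes 192). The function name scale_fixed is hypothetical and this is not the exact code that was benchmarked.

```c
#include <stddef.h>
#include <stdint.h>

/* Scale samples with an integer factor instead of floating point. */
void scale_fixed(int16_t *samples, size_t count, float volume)
{
    /* 256 represents 1.0, so a volume of 0.75 becomes 192 */
    int32_t factor = (int32_t)(volume * 256.0f);

    for (size_t i = 0; i < count; i++) {
        /* multiply in 32 bits, then drop the 8 fractional bits;
         * note: >> on a negative value is implementation-defined in C */
        samples[i] = (int16_t)((samples[i] * factor) >> 8);
    }
}
```

The conversion from floating point happens once, outside the loop, so the loop body contains only an integer multiply and a shift.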

We can see that approach #3 was the fastest of all, with and without optimisation. However, its result is different from that of the other two approaches. This is due to the bit-shifting of negative values: in C, right-shifting a negative signed value is implementation-defined, and on servers that use an arithmetic shift (like Aarchie) the result is rounded toward negative infinity, while the floating-point conversion truncates toward zero. This can be fixed by turning negative values into positive ones for the purpose of shifting and then restoring their sign.
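
The fix described above could be sketched like this, shifting only non-negative magnitudes. The function name scale_fixed_signed is hypothetical, and factor is assumed to be the volume multiplied by 256 (192 for 0.75).

```c
#include <stdint.h>

/* Fixed-point scale that only ever shifts non-negative values, so the
 * result truncates toward zero like the floating-point version. */
int16_t scale_fixed_signed(int16_t sample, int32_t factor)
{
    int32_t value = sample;   /* widen first so that -32768 can be negated */

    if (value < 0) {
        return (int16_t)-((-value * factor) >> 8);
    }
    return (int16_t)((value * factor) >> 8);
}
```

For example, a sample of -3 with a factor of 192 gives -576 before shifting. An arithmetic shift of -576 by 8 bits yields -3, while this version shifts 576 to get 2 and returns -2, matching the truncated floating-point result of -2.25.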

Monday, 23 October 2017

SPO600 - Lab 5 - Vectorization Lab

In this lab we will be exploring single instruction/multiple data (SIMD) vectorization, and the auto-vectorization capabilities of the GCC compiler on the AArch64 server.

In the first step, we need to build a short program that creates two 1000-element integer arrays and fills them with random numbers in the range -1000 to +1000. After that, we need to sum those two arrays element-by-element into a third, separate array. Lastly, we sum the third array and print the result.

Auto-Vectorization
Automatic vectorization is a compiler technique from parallel computing that makes programs more efficient by transforming loops into sequences of vector operations.
To get fully vectorized code, we need to separate the steps of the program so that each one can operate on several elements simultaneously, rather than lumping everything together and processing a single element at a time.

We can see an example of that in these two short programs created:

Although this code meets the requirements, the program isn't auto-vectorized: all the elements of the arrays are processed one at a time in a single loop, which makes it inefficient.

Take a look at this version of the very same code (rand.c):
In this version, the code is vectorized, since we split the work into 2 (or more) loops instead of a single one, which lets the compiler process several elements of the arrays at once.
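
A minimal sketch of the two-loop structure, written as two functions so that each loop is easy to see. This is an illustration of the idea rather than the exact rand.c; the names fill_random, add_and_sum and SIZE are assumptions.

```c
#include <stdlib.h>

#define SIZE 1000

/* Loop 1: fill both arrays with values in -1000..+1000.
 * The rand() calls have to happen one at a time, so this loop stays scalar. */
void fill_random(int *a, int *b, int n)
{
    for (int i = 0; i < n; i++) {
        a[i] = rand() % 2001 - 1000;
        b[i] = rand() % 2001 - 1000;
    }
}

/* Loop 2: element-wise add plus a running total.
 * There are no function calls here, so the compiler is free to
 * process several elements per iteration with SIMD instructions. */
int add_and_sum(const int *a, const int *b, int *c, int n)
{
    int total = 0;
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
        total += c[i];
    }
    return total;
}
```

Keeping the rand() calls out of the arithmetic loop is the design choice that matters: a loop that calls a function with side effects on every iteration cannot be turned into vector operations.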

Let's compile the 2nd version of our program and disassemble the code.
gcc -O3 rand.c -o rand
It compiles and runs.
Now let's disassemble it:
objdump -d rand
Let's observe the output of the <main> section:
00000000004004f0 <main>:
  4004f0:       d283f010        mov     x16, #0x1f80 // Assign and initialises space
  4004f4:       cb3063ff        sub     sp, sp, x16
  4004f8:       a9007bfd        stp     x29, x30, [sp]
  4004fc:       910003fd        mov     x29, sp
  400500:       a90153f3        stp     x19, x20, [sp,#16]
  400504:       5289ba74        mov     w20, #0x4dd3 // Loop number 1. Filling arrays with random values.
  400508:       a9025bf5        stp     x21, x22, [sp,#32]
  40050c:       910103b5        add     x21, x29, #0x40
  400510:       913f83b6        add     x22, x29, #0xfe0
  400514:       f9001bf7        str     x23, [sp,#48]
  400518:       72a20c54        movk    w20, #0x1062, lsl #16
  40051c:       5280fa13        mov     w19, #0x7d0                     // #2000
  400520:       d2800017        mov     x23, #0x0                       // #0
  400524:       97ffffe3        bl      4004b0 <rand@plt> // Use of rand() values
  400528:       9b347c01        smull   x1, w0, w20 // scalar multiply by the constant in w20 (first step of rand() % 2000)
  40052c:       9367fc21        asr     x1, x1, #39
  400530:       4b807c21        sub     w1, w1, w0, asr #31
  400534:       1b138020        msub    w0, w1, w19, w0
  400538:       510fa000        sub     w0, w0, #0x3e8
  40053c:       b8376aa0        str     w0, [x21,x23]
  400540:       97ffffdc        bl      4004b0 <rand@plt> // Use of rand() values
  400544:       9b347c01        smull   x1, w0, w20 // scalar multiply by the constant in w20 (first step of rand() % 2000)
  400548:       9367fc21        asr     x1, x1, #39
  40054c:       4b807c21        sub     w1, w1, w0, asr #31
  400550:       1b138020        msub    w0, w1, w19, w0
  400554:       510fa000        sub     w0, w0, #0x3e8
  400558:       b8376ac0        str     w0, [x22,x23]
  40055c:       910012f7        add     x23, x23, #0x4 // Loop condition
  400560:       f13e82ff        cmp     x23, #0xfa0
  400564:       54fffe01        b.ne    400524 <main+0x34>
  400568:       4f000401        movi    v1.4s, #0x0 // Vectorization
  40056c:       d2800000        mov     x0, #0x0 // Adding elements of the first loop
  400570:       3ce06aa0        ldr     q0, [x21,x0]
  400574:       3ce06ac2        ldr     q2, [x22,x0]
  400578:       91004000        add     x0, x0, #0x10
  40057c:       f13e801f        cmp     x0, #0xfa0
  400580:       4ea28400        add     v0.4s, v0.4s, v2.4s // Vectorization
  400584:       4ea08421        add     v1.4s, v1.4s, v0.4s // Vectorization
  400588:       54ffff41        b.ne    400570 <main+0x80> // Loop condition
  40058c:       4eb1b821        addv    s1, v1.4s // Vectorization
  400590:       90000000        adrp    x0, 400000 <_init-0x468> // Print total sum
  400594:       911e0000        add     x0, x0, #0x780
  400598:       0e043c21        mov     w1, v1.s[0]
  40059c:       97ffffd1        bl      4004e0 <printf@plt>
  4005a0:       f9401bf7        ldr     x23, [sp,#48]
  4005a4:       a94153f3        ldp     x19, x20, [sp,#16]
  4005a8:       52800000        mov     w0, #0x0                        // #0
  4005ac:       a9425bf5        ldp     x21, x22, [sp,#32]
  4005b0:       d283f010        mov     x16, #0x1f80                    // #8064
  4005b4:       a9407bfd        ldp     x29, x30, [sp]
  4005b8:       8b3063ff        add     sp, sp, x16
  4005bc:       d65f03c0        ret // Return
We can see that the summing loop of our program has been vectorised: it uses vector registers (v0 to v2, loaded as q0 and q2) and SIMD instructions such as add with the .4s arrangement and addv.
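
The scalar smull/asr/sub/msub sequence in the first loop is worth a closer look: instead of a division instruction, the compiler computes the remainder by multiplying with the constant 0x10624dd3, which is roughly 2^39 / 2000. A minimal C sketch of the same arithmetic follows; the function names are hypothetical, and the shift of a negative value assumes an arithmetic shift, as GCC provides (rand() never returns a negative value anyway).

```c
#include <stdint.h>

/* Quotient x / 2000 computed the way the disassembly does it. */
int32_t div2000(int32_t x)
{
    int64_t product = (int64_t)x * 0x10624dd3;  /* smull x1, w0, w20 */
    int32_t q = (int32_t)(product >> 39);       /* asr x1, x1, #39 */
    return q - (x >> 31);                       /* sub w1, w1, w0, asr #31 */
}

/* x % 2000 - 1000, matching the msub and the sub #0x3e8 instructions. */
int32_t to_range(int32_t x)
{
    return x - div2000(x) * 2000 - 1000;
}
```

Note that the constants in the disassembly (0x7d0 = 2000 and 0x3e8 = 1000) give values from -1000 to 999.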

Reflection
As we can see, auto-vectorization goes a long way towards optimising and speeding up your code. Being the major research topic in computer science that it is today, a lot can be learnt and said about its advantages and implementation methods, most of which are still under research/development. This lab demonstrated the lengths compilers can go to in order to optimise programs, especially when working with loops that process a large amount of values and data.
I found this topic intriguing (yet taxing) and helpful for understanding how vector registers and SIMD work and how they can be applied to real-life applications when necessary. I look forward to learning more about it in the future and expanding my knowledge as much as possible for better results.

Sunday, 1 October 2017

SPO600 - Lab 4 - Code Building Lab

In lab number 4, we will be building a software package.
In the first step, we will choose a software package from the Free Software Foundation's GNU Project.
A full list of software packages can be found here: Link.

I have decided to build the Barcode software package. Barcode is "a tool to convert text strings to printed bars. It supports a variety of standard codes to represent the textual strings and creates postscript output."

Now after picking the package we would like to work with, we will log into our Linux system and download the package using "wget" command:
 wget ftp://ftp.gnu.org/gnu/barcode/barcode-0.99.tar.gz
This is what we should be getting from the console:
barcode-0.99.tar.gz 100%[===================>] 869.85K  2.87MB/s    in 0.3s

2017-10-01 19:47:58 (2.87 MB/s) - ‘barcode-0.99.tar.gz’ saved [890730]
 Next, we will unzip the file we have downloaded using the "tar" command:
 tar -zxvf barcode-0.99.tar.gz
After unpacking the file, we can see that there is an instruction file (often named INSTALL). In this case, the INSTALL file tells us to look at INSTALL.generic for basic instructions. By reading the file INSTALL.generic we can see the following instructions:

From the document, we understand that the next step would be to run the "configure" command.
After the configuration is done, we will run the command "make" to compile the package.
The fourth step "make install" would install the programs and relevant data, which is something we do not want, so we won't do this part.
 After running the command "make" which should be finished in a few minutes, we will get a new file called "barcode".
By running it we can test the software package:


It works!


Part 2: Build and test glibc

In this part we will find and build the source code for the latest released version of the GNU Standard C Library (glibc), which can be found at the glibc website.
Now we will download it to our system using the "wget" command:
wget http://ftp.gnu.org/gnu/glibc/glibc-2.26.tar.gz
 This is what we are supposed to be getting:
glibc-2.26.tar.gz   100%[===================>]  28.00M  19.6MB/s    in 1.4s

2017-10-01 20:00:41 (19.6 MB/s) - ‘glibc-2.26.tar.gz’ saved [29355499/29355499] 
Next, we will unpack the file using "tar -zxvf".
As with most other software packages, the installation instructions are in the INSTALL file, which we will now open and skim through.
The INSTALL file states that:
The GNU C Library cannot be compiled in the source directory.  You must
build it in a separate build directory.  For example, if you have
unpacked the GNU C Library sources in '/src/gnu/glibc-VERSION', create a
directory '/src/gnu/glibc-build' to put the object files in.  This
allows removing the whole build directory in case an error occurs, which
is the safest way to get a fresh start and should always be done.
As a safety measure, we will create a new folder called "glibc-build" and build the library there, passing a prefix to the configure command:
../glibc-2.26/configure --prefix=/home/ragarunov/glibc-build
Then we will run the command "make".
After a long compiling process, we can finally begin to test our library!

Testing:
The library provides us the file "testrun.sh" that can be used to test our own version. Using that, we can test our version of glibc by creating a simple Hello World program in C:

It works!
Now, we will introduce a bug and run the program. We will do so with a simple array and a loop that reads one element too many:
[ragarunov@xerxes glibc-build]$ cat test.c
#include <stdio.h>

int main () {
        int num[4] = {1, 2, 3, 4};
        int i;

        for (i = 0; i<5; i++) {
                printf("%d", num[i]);
                printf("\n");
        }

        return 0;
}
After compiling and running the command:
./testrun.sh /home/ragarunov/lab4/glibc-build/test
Prints:
1
2
3
4
1219370488
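In both tests, the programs ran against our newly built library as expected! The last value printed in the second test is garbage, because num[4] reads past the end of the 4-element array.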

Override:
The override mechanism is commonly used in object oriented programming languages as a feature to provide subclasses implementations that will replace (or override) that implementation that has already been given by its parent class.

Multiarch:
Multiarch is a term that refers to the capability of a system to install and run applications of multiple different binary targets on the same system. It is used to simplify cross-building of libraries and headers that are required for a system during building.

Thursday, 28 September 2017

SPO600 - Lab 3

In this lab we will be experimenting with assembler on the x86_64 and aarch64 platforms.

I found assembly language intriguing and challenging, both in the group task and later during my individual work.

After reviewing the code that was given to our group, which printed "Loop", the first task was to modify the loop program so that it counts from 0 to 9.
This was done by converting the integer loop index to its digit character in ASCII/ISO-8859-1/Unicode UTF-8, where the digits 0-9 are the codes 48-57 (0x30-0x39).
After initialising the loop index to 0 and moving it to a register, I added the current loop value to %r8 (which is register number 8 in 64-bit mode).
The next step is to write the resulting character into the string, at a position called pos.
The print code is similar to the one shown in the Assembler Basics tutorial, as it fulfills the needs of the program.

Code that was developed in class printed the result:
Loop: 0
Loop: 1
Loop: 2
Loop: 3
Loop: 4
Loop: 5
Loop: 6
Loop: 7
Loop: 8
Loop: 9
  
In order to print two-digit values and count from 0 to 30, we add a second digit: we divide the loop index by 10, so that the quotient gives the first (tens) digit and the remainder gives the second (ones) digit. We then add the quotient and the remainder to the registers r8 and r9.
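
The same digit logic can be sketched in C, which may make the register juggling in the assembly easier to follow. This is only an illustration; the function name format_loop_line is hypothetical, and it assumes the index is between 0 and 99.

```c
#include <string.h>

/* Write "Loop: NN\n" into buf (at least 10 bytes) for 0 <= i <= 99,
 * leaving a space where a leading zero would otherwise be. */
void format_loop_line(int i, char *buf)
{
    const char line[] = "Loop:   \n";
    memcpy(buf, line, sizeof line);     /* copy includes the trailing NUL */

    int quotient  = i / 10;   /* tens digit, like %rax after div */
    int remainder = i % 10;   /* ones digit, like %rdx after div */

    if (quotient != 0) {
        buf[6] = (char)('0' + quotient);    /* '0' is ASCII 48 */
    }
    buf[7] = (char)('0' + remainder);
}
```

Adding 48 (the character '0') to a value from 0 to 9 is what turns the number into its printable digit.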

Code developed:
.text
.global    _start

sout = 1
start = 0                       /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 31                        /* loop exits when the index hits this number (loop condition is i<max) */

_start:
    mov     $start,%r15         /* loop index */

loop:

    /* Set to 0 */
    mov    $48, %r8
    mov $48, %r9
   
    /* Calculation */
    mov    $0,%rdx
    mov    $10,%r10
    mov    %r15,%rax
   
    div    %r10 /* division */
   
    add    %rax,%r8
    add    %rdx,%r9
   
    /* add the current number */
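    mov %r9b, posB      /* always store the ones digit */

    cmp $0x30, %r8      /* is the tens digit '0'? */
    je    continue      /* if so, leave the space in place */
    mov    %r8b, pos    /* otherwise store the tens digit */

continue:
    /* print (taken from example) */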
    mov    $len,%rdx                       /* message length */
    mov    $msg,%rsi                       /* message location */
    mov    $sout,%rdi                      /* file descriptor stdout */
    mov    $1,%rax                         /* syscall sys_write */
    syscall

    inc     %r15                /* increment index */
    cmp     $max,%r15           /* see if we're done */
    jne     loop                /* loop if we're not */

    mov     $0,%rdi             /* exit status */
    mov     $60,%rax            /* syscall sys_exit */
    syscall
   
.data

msg:    .ascii      "Loop:    \n"
.set len , . - msg
/* set position of number right after the msg */
.set pos , msg + 6
.set posB , msg + 7

Would print the following result:
Loop:  0
Loop:  1
Loop:  2
Loop:  3
Loop:  4
Loop:  5
Loop:  6
Loop:  7
Loop:  8
Loop:  9
Loop: 10
Loop: 11
Loop: 12
Loop: 13
Loop: 14
Loop: 15
Loop: 16
Loop: 17
Loop: 18
Loop: 19
Loop: 20
Loop: 21
Loop: 22
Loop: 23
Loop: 24
Loop: 25
Loop: 26
Loop: 27
Loop: 28
Loop: 29
Loop: 30



Aarch64:
The aarch64 version is quite similar to the x86_64 version logic-wise, so the same process can be applied to it as well. However, the instructions, registers and operand order are different, which goes to show how meticulous the process of debugging assembly language is.
I found it more difficult to debug on Aarch64.
The task we were assigned to do was to create a loop that goes from 0 to 30.

The code that was developed:
.text
.global    _start

sout = 1
start = 0
max = 31

_start:    
       
    mov    x4, start
   
loop:
   
    /* set */
    mov    w8, 48
    mov    w9, 48
   
    /* calculations */
    mov    w2, 10
    udiv    w1, w4, w2
    msub    w5, w1, w2, w4
    add    w8, w8, w1
    add    w9, w9, w5

    adr    x1, msg
    strb    w9, [x1, 7]
   
    cmp    w8, 0x30
    beq continue
    strb    w8, [x1, 6]
   
continue:

    /* print */
    mov    x0, sout
    mov x2, len
   
    mov x8, 64
    svc 0
   
    add x4, x4, 1
    cmp x4, max
    b.ne    loop
   
    mov     x0, 0         /* status -> 0 */
    mov     x8, 93        /* exit is syscall #93 */
    svc     0              /* invoke syscall */
   

/* print data */
.data
msg:     .ascii      "Loop:   \n"
len=     . - msg

Would print the following result:
Loop:  0
Loop:  1
Loop:  2
Loop:  3
Loop:  4
Loop:  5
Loop:  6
Loop:  7
Loop:  8
Loop:  9
Loop: 10
Loop: 11
Loop: 12
Loop: 13
Loop: 14
Loop: 15
Loop: 16
Loop: 17
Loop: 18
Loop: 19
Loop: 20
Loop: 21
Loop: 22
Loop: 23
Loop: 24
Loop: 25
Loop: 26
Loop: 27
Loop: 28
Loop: 29
Loop: 30

In conclusion
I have found Assembly more intriguing the deeper I went into it, yet unnecessarily challenging, especially nowadays. Personally, the coding on the Aarch64 server was more difficult to figure out and solve, due to difficulties I had with the initial loop.

Thursday, 21 September 2017

SPO600 - Lab 2 - Compiled C Lab

In lab 2 we will be investigating C source code and the output of the C compiler.

The source code we will be working with is a basic Hello World C program:
#include <stdio.h>

int main() {
    printf("Hello World!\n");
}
We will be compiling this program using the "gcc" command along with the following compiler options:
-g - enable debugging information
-O0 - do not optimise
-fno-builtin - do not use builtin function optimizations

After compiling, the size of the file is 73088 bytes. Using the objdump command, we can examine the binary object that was produced and see that there are more than 30 different sections, of which ".text" is the section that holds our code. By closely inspecting the output of our objdump command, we can also see that the string our program is intended to print is stored in the .rodata section:
Contents of section .rodata:
 400660 01000200 00000000 00000000 00000000  ................
 400670 00000000 00000000 00000000 00000000  ................
 400680 48656c6c 6f20576f 726c6421 0a00      Hello World!..

Now we will attempt to re-compile the code with different options and examine the changes in the output file:
Adding the "-static" option to the compile command increased the file size to 834,456 bytes. Examining the binary object with objdump now takes longer, because the sections contain much more information and the system needs more time to print the entire output.
Next, we will compile the program without the "-fno-builtin" option. The file size decreases slightly, to 834,448 bytes. Without this option the compiler is free to use its built-in function optimisations; for example, GCC can replace a printf() call that only prints a fixed string ending in a newline with puts().
Next, we will re-compile the code, this time removing the "-g" option from the compile command. The file size is reduced to 832,032 bytes, because debugging information is no longer included in the binary.

Adding additional arguments (numbers 1-8 and two strings) to the code results in every argument being moved to a register and added to a stack:

Now, we will move the printf() function to a new function called output() and call that function from the main() function:
Compiling the code and examining the binary object, we can see that the instructions for the printf() call, which previously resided in the <main> section, have moved to a new section called <output>, while <main> now calls <output>.
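
A minimal sketch of what the output() function might look like. As an assumption made for illustration, this version returns printf()'s character count so that the call can be checked; the version described above may simply have returned nothing.

```c
#include <stdio.h>

/* The printf() call moved out of main() into its own function.
 * Returns the number of characters printed. */
int output(void)
{
    return printf("Hello World!\n");
}
```

With this in place, main() contains little more than a call to output(), which is what appears as a branch to <output> in the disassembly.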

Lastly, we will replace the "-O0" option with "-O3" in our compile command. Changing the 0 to 3 raises the optimisation level from none to the highest of the numbered levels, which turns on most of the optimisation features GCC offers.


Wednesday, 13 September 2017

SPO600 - Lab 1

Rust
"Rust is a systems programming language that runs blazingly fast, prevents segfaults, and guarantees thread safety."
https://www.rust-lang.org/en-US/index.html

Rust is distributed under both the MIT License and the Apache License 2.0.
Because Rust is such an expansive system of projects, patches are approved and maintained by The Rust Project Developers in the rust-lang organization on GitHub (https://github.com/rust-lang/rust/).

An example of a change can be found at #44223, which was implemented by Eduard-Mihai Burtescu to resolve an issue with existing symbols and bounds.
In order to contribute to the project, I would go through the Contribute page on the official web page and the Contributing to Rust document on the project's GitHub, and choose one of the numerous ways in which users can contribute.

Hoodie
"Hoodie is a free and Open Source Software for building applications for the web and iOS"
http://hood.ie/intro/

Hoodie operates under the Apache License 2.0.
An example of a contributed patch can be seen in issue #595, where a 404.html page was missing when sending a request with accept: text/html to the URL /hoodie/unkown.
A fix was contributed by Jamie Tanna and was reviewed by Gregor Martynus.

Contributions to Hoodie can be made with coding, design, documentation, events, or money.
In order to make an effective contribution, it is advised to go to Hoodie's Milestone page and check the status of the work currently going on in the project.