Ronen - SPO600

Saturday, 6 January 2018

SPO600 - Project - Stage 2

For stage number 2 of the SPO600 project I will be trying to optimise the Hashids package.
As noted in my stage 1 report, the optimisation will be focused on flags and the encode() function that is inside "hashids.h".

Flags:
First, we will start with the Makefile and see whether the compiler flags can be improved.
The current flags that are assigned are:

-g -Wall -O2 -pipe -Wextra -Wno-sign-compare

As the code itself mostly uses vectors, I presume changing the -O2 flag to -O3 won't make a significant difference, or will only slightly improve things. Since there isn't really any downside to it, and since we're going to dig deeper into the functions as well, I will go ahead and change the flag to -O3, which may or may not be useful at a later stage.

-g -Wall -O3 -pipe -Wextra -Wno-sign-compare

After compiling and running the project, with no code changes, just as I suspected, the run times remained more or less the same:

Worst run-time is at 3912.18ms and best run-time is at 3878.04ms.
Now let's move onto the code adjustments!

Code:
Going over the encode() function in the hashids.h file, I can already tell the code is well maintained and tuned.
However, given that the code is intricate and also calls into several other functions, I decided to look for optimisation opportunities in the loops inside the functions.
Previous experience (and in some cases, common sense) suggests that the most resource-draining work will occur inside the loops of the code.
Looking at the first for loop in the function:

    int values_hash = 0;
    int i = 0;
    for (Iterator iter = begin; iter != end; ++iter) {
      values_hash += (*iter % (i + 100));
      ++i;
    };

I would be lying if I said I wasn't disappointed to find this code is highly optimised. It takes care of initialisation, incrementation, and assignment.
Moving forward!

Going over the second for loop in the function, which is longer and more expensive:

for (Iterator iter = begin; iter != end; ++iter) {
    uint64_t number = *iter;

    std::string alphabet_salt;
    alphabet_salt.push_back(lottery);
    alphabet_salt.append(_salt).append(alphabet);

    alphabet = _reorder(alphabet, alphabet_salt);

    std::string last = _hash(number, alphabet);
    output.append(last);

    number %= last[0] + i;
    output.push_back(_separators[number % _separators.size()]);
};

I can already tell that my main focus should be the variable declarations inside the loop and the functions that are being called, which most likely have loops inside them as well. That would mean a run time of O(n^2) at best, and possibly worse.

As my first step, I will move unnecessary declarations (that happen every time the loop runs) outside of the loop.
I will take uint64_t number and std::string alphabet_salt and declare them both outside of the loop.
std::string last will remain inside the loop, because it is initialised from the return value of _hash() on every iteration.
Eventually my loop looks like this:

    i = 0;
    std::string alphabet_salt;
    uint64_t number;

    for (Iterator iter = begin; iter != end; ++iter) {

      number = *iter;
      alphabet_salt.clear();  // reset the hoisted string; otherwise it keeps growing across iterations
      alphabet_salt.push_back(lottery);
      alphabet_salt.append(_salt).append(alphabet);

      alphabet = _reorder(alphabet, alphabet_salt);

      std::string last = _hash(number, alphabet);
      output.append(last);

      number %= last[0] + i;
      output.push_back(_separators[number % _separators.size()]);
      ++i;
    };
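
One thing to watch when hoisting a std::string out of a loop is that it keeps its contents between iterations unless it is cleared. The sketch below is not Hashids code: build_salts_inside and build_salts_hoisted are hypothetical helpers, and the string they build (a lottery character followed by a salt) is only loosely modelled on the loop above. It illustrates that calling clear() at the top of each iteration gives the same strings as declaring the variable inside the loop.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: the buffer is declared inside the loop.
std::vector<std::string> build_salts_inside(char lottery, const std::string& salt, int rounds) {
    assert(rounds >= 0);
    std::vector<std::string> result;
    for (int i = 0; i < rounds; ++i) {
        std::string buffer;            // fresh, empty string on every iteration
        buffer.push_back(lottery);
        buffer.append(salt);
        result.push_back(buffer);
    }
    return result;
}

// Hypothetical helper: the buffer is hoisted out of the loop and reset each time.
std::vector<std::string> build_salts_hoisted(char lottery, const std::string& salt, int rounds) {
    assert(rounds >= 0);
    std::vector<std::string> result;
    std::string buffer;                // declared once, before the loop
    for (int i = 0; i < rounds; ++i) {
        buffer.clear();                // empties the string; without this it would keep growing
        buffer.push_back(lottery);
        buffer.append(salt);
        result.push_back(buffer);
    }
    return result;
}
```

Both helpers return the same list of strings. In common standard library implementations clear() leaves the allocated capacity in place, which is where any saving from hoisting would come from.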

Now we move on to the _hash() function, which is inside the hashids.cpp file.
The _hash() function is essentially a single while loop that pushes back values into the output string and divides the number by the array size until it reaches 0.
I couldn't find much room for improvement in there.
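
The description above amounts to writing the number in a base whose digits are the characters of the alphabet. The following is a minimal sketch of that idea, not the library's actual implementation: the name hash_sketch is made up, and the assumption that the most significant digit comes first is mine.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Sketch (assumed behaviour): write `number` in base alphabet.size(),
// using the characters of `alphabet` as digits, most significant digit first.
std::string hash_sketch(std::uint64_t number, const std::string& alphabet) {
    assert(alphabet.size() >= 2);      // a base needs at least two digits
    std::string output;
    do {
        // Take the lowest digit and put it in front of what we have so far.
        output.insert(output.begin(), alphabet[number % alphabet.size()]);
        number /= alphabet.size();     // drop that digit
    } while (number != 0);
    return output;
}
```

With the alphabet "0123456789abcdef", hash_sketch(255, ...) gives "ff". The loop runs once per output character, so there is little left to trim, which fits the observation above.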

The _reorder() function, however, contains more to work with.
Similarly to _hash(), the _reorder() also uses a while loop in order to process the class's elements and return them to our encode() function.
This is what it looks like:

while (i > 0) {
    index %= salt.length();
    int integer = salt[index];
    integer_sum += integer;
    unsigned int j = (integer + index + integer_sum) % i;

    std::swap(input[i], input[j]);

    --i;
    ++index;
};

Once again, the first thing I spot is variables being declared inside the loop, which I expect to make it slower, since the variables are declared anew every single time the loop executes.
I will declare int integer and unsigned int j outside the loop and initialise them to 0.
This is how my function looks now:

int integer = 0;
unsigned int j = 0;

while (i > 0) {
    index %= salt.length();
    integer = salt[index];
    integer_sum += integer;
    j = (integer + index + integer_sum) % i;

    std::swap(input[i], input[j]);

    --i;
    ++index;
};
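
To experiment with the shuffle in isolation, the loop can be wrapped in a small free function. This is a sketch under assumptions: the post only shows the loop body, so the starting values (i at the last index, index and integer_sum at 0) and the early return for an empty salt are my guesses, reorder_sketch is a made-up name, and the salt characters are read as unsigned so that the modulo never sees a negative number.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Sketch: shuffle `input` in a way that depends only on `salt`.
// Assumed initial values: i = last index, index = 0, integer_sum = 0.
std::string reorder_sketch(std::string input, const std::string& salt) {
    if (salt.empty() || input.size() < 2) {
        return input;                  // nothing to shuffle with, or nothing to shuffle
    }
    assert(input.size() <= 1000000);   // keeps the int arithmetic below far from overflow
    int i = static_cast<int>(input.size()) - 1;
    int index = 0;
    int integer_sum = 0;
    int integer = 0;                   // declared outside the loop, as in the modified version
    int j = 0;
    while (i > 0) {
        index %= static_cast<int>(salt.length());
        integer = static_cast<unsigned char>(salt[index]);
        integer_sum += integer;
        j = (integer + index + integer_sum) % i;   // always in the range [0, i)
        std::swap(input[i], input[j]);
        --i;
        ++index;
    }
    return input;
}
```

The function is deterministic: the same input and salt always give the same ordering, and an empty salt leaves the input untouched.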

With all these changes, now let's try to run the tester!

Benchmarking:
In stage 1, we got run-time results of "at best 3730.58ms and at worst 3922.57ms for 7 tests", which averaged 3846.29ms.

When running the same tester, which encodes 5,000,000 values in a for loop, we get the following results:

The run-times are at best 3642.28ms and at worst 3892.31ms for 7 tests!
On the bright side, we managed to get the software to run in under 3900ms for all tests!
On the even brighter side, the best run-time has improved noticeably and the runs now average 3824.36ms, a 21.93ms speed-up in average run time.
Comparing the best result of each stage, the speed-up is 88.3ms.
Comparing the worst result of each stage, the speed-up is 30.26ms.
All in all, it seems that on all metrics, the run-times have improved, even if only slightly.
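
To put these differences in proportion, the arithmetic can be written as two small helpers. The function names are made up for this sketch; the numbers fed into them are the ones reported in this post.

```cpp
#include <cassert>

// Absolute speed-up in milliseconds: how much time the new run saves.
double speedup_ms(double before_ms, double after_ms) {
    assert(before_ms > 0.0);
    return before_ms - after_ms;
}

// Relative speed-up as a percentage of the original run time.
double speedup_percent(double before_ms, double after_ms) {
    assert(before_ms > 0.0);
    return 100.0 * (before_ms - after_ms) / before_ms;
}
```

With the averages, speedup_ms(3846.29, 3824.36) is 21.93ms, which speedup_percent puts at roughly 0.57%. For the best runs, 88.3ms is about 2.4% of 3730.58ms.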

I've been experiencing several speed-related difficulties with the server in the last couple of days (I reckon everyone else, like me, waited until the last moment), so I (want to) believe that in a set-up similar to the one I ran my initial benchmark on (on 16 December 2017), I would have had even faster results.

Personal reflections and conclusions:
Doing the project has helped me learn more about open source packages and projects. I found the initial stage, finding a suitable project to work with, to be the most stressful and challenging part, mostly due to the huge, boundless, never-ending number of packages and projects that exist online. The optimisation stage made me learn and experiment more with methods that we've practised in class and in our labs.

Most importantly, I have learnt to take into consideration the amount of resources that will be used by the system with every piece of code that I write, and to adjust my coding habits accordingly to become more processor-friendly and efficient. I am looking forward to learning more about this subject in the future and improving as a programmer and an open-source enthusiast!

Monday, 1 January 2018

SPO600 - Lab 7 - Inline Assembler Lab

In lab 7, we will be exploring the use of inline assembler, and its use in open source software.
First, we will start with some background:

"inline assembler is a feature of some compilers that allows low-level code written in assembly language to be embedded within a program, among code that otherwise has been compiled from a higher-level language such as C or Ada."

In simple words, inline assembler allows us to embed assembly language code in our high-level code (such as C) and have it work more efficiently, due to the nature of assembly language.

In the first part of the lab, we will download a program provided in the lab instructions and build the program. Then, we will test the performance of the solution by adjusting sample size (in vol.h) and arrays.
We can see that the current sample size is 250000. Running an unaltered version of the code produces the following results:

Now we will change the number of samples to 500 million, similar to what we had in the previous lab (lab 6). For comparing the results of the code to lab 6 results, we will compile both without an -O3 flag:

And:

Processing times without any flag seem to be faster for the inline assembler code.
Compiling with an -O3 flag produces relatively similar run-times for both programs. I attribute this to the fact that with -O3 the run-time is no longer solely "code-dependent", whereas without flags it relies wholly on the quality of the code (and of the asm written in it).

Q: what is an alternate approach?
An alternate approach would be not to assign registers for the variables in_cursor, out_cursor and vol_int. Doing so will result in the compiler finding and allocating registers on its own.

Q: should we use 32767 or 32768 in next line? why?
32767, because it is the maximum value of the data type int16_t.
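
A small sketch of why 32767 is the safe choice, assuming the volume factor is stored as a signed 16-bit integer (the helper name to_fixed16 is made up for illustration and is not taken from the lab code):

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Sketch: scale a volume factor in [0.0, 1.0] to a signed 16-bit integer.
// Multiplying by 32767 keeps the factor 1.0 inside the int16_t range;
// 1.0 * 32768 would be one past the largest representable value.
std::int16_t to_fixed16(float factor) {
    assert(factor >= 0.0f && factor <= 1.0f);
    const float scale = static_cast<float>(std::numeric_limits<std::int16_t>::max()); // 32767
    return static_cast<std::int16_t>(factor * scale);   // conversion truncates towards zero
}
```

For the factor 0.75 used in these labs this gives 24575, and for 1.0 it gives exactly 32767.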

Q: what does it mean to "duplicate" values here?
Duplicating values means copying what's in register 22 (vol_int, in this case) into every one of the eight 16-bit lanes of vector register v1 (v1.8h).

Q: what happens if we remove the following lines? Why?
The following lines are responsible for assigning corresponding values into the code's variables for scaling functionality. Removing them results in a Segmentation fault due to a scaling failure caused by the lack of this process.

Q: are the results usable? are they correct?
The results are usable and correct. We can see that with each run the scaling values are changing and so is the summed result.

Part 2 - Individual task:
For this part, I have picked the open source package "BusyBox".
A bit of background for BusyBox from the official website:

"BusyBox combines tiny versions of many common UNIX utilities into a single small executable. It provides replacements for most of the utilities you usually find in GNU fileutils, shellutils, etc. The utilities in BusyBox generally have fewer options than their full-featured GNU cousins; however, the options that are included provide the expected functionality and behave very much like their GNU counterparts. BusyBox provides a fairly complete environment for any small or embedded system."

How much assembly-language code is present?
Using an "egrep" command I was able to find quite a few inline assembler pieces in the package.
The majority of the assembly code can be found in the networking and include directories.

Which platform(s) it is used on?
BusyBox can be used on any platform that is supported by gcc. Its module loading is currently limited to ARM, CRIS, H8/300, x86, ia64, x86_64, m68k, MIPS, PowerPC, S390, SH3/4/5, Sparc, and v850e.

Why it is there (what it does)?
The functions I decided to focus on can be found in "networking/tls_pstm_montgomery_reduce.c"
The file contains a handful of inline assembler functions whose purpose is to optimise operations within different platforms/architectures.

What happens on other platforms?
It seems that the directory networking is responsible for ensuring the software runs properly on all platforms using meticulous inline assembler coding per platform.

Your opinion of the value of the assembler code VS the loss of portability/increase in complexity of the code.
As far as I can see, assembler code can be extremely efficient for projects but can be very time-consuming and draining to write. The intricacy of assembler code and the necessity to adjust it for every architecture can make it detrimental to large-scale projects and make standard tasks such as debugging or maintenance very menial and taxing. However, assembler code can definitely be valuable in projects such as "BusyBox" that strive to be all-encompassing and to make operations faster, more efficient and overall better.

Saturday, 16 December 2017

SPO600 - Project - Stage 1

For the SPO600 project we are supposed to find an Open Source Software Project and identify a Hash or Checksum Function in its source code.
I have come across and picked the open source project named Hashids.
From its official website:

"Hashids is a small open-source library that generates short, unique, non-sequential ids from numbers.
It converts numbers like 347 into strings like “yr8”, or array of numbers like [27, 986] into “3kTMd”.
You can also decode those ids back. This is useful in bundling several parameters into one or simply using them as short UIDs."

Hashids is a relatively small project that offers hashing as a method of managing IDs (for example in websites like YouTube that hash their videos' IDs).

Benchmarking:
For the purpose of benchmarking, I have created a simple tester that will encode numbers in 2 different testing methods:
For the first method: 11 different numbers will be encoded manually and then timed together.
The code to trigger the encode() function and run the software is:

clock_t begin = clock();
std::cout << hash.encode({123}) << std::endl;
std::cout << hash.encode({123456}) << std::endl;
....
....
std::cout << "Time: " << (std::clock() - begin) / (double)(CLOCKS_PER_SEC / 1000) << " ms" << std::endl;

These are the results that I got:

The run-times are at best 0.106ms and at worst 0.130ms for 6 tests.

For the second method: Run a for loop that will encode 5,000,000 values.
The code to trigger the encode() function and run the software is:

int i;
clock_t begin = clock();
for (i = 0; i < 5000000; i++)
    hash.encode({i});

std::cout << "Last value (i) is: " << i << std::endl;

std::cout << "Time: " << (std::clock() - begin) / (double)(CLOCKS_PER_SEC / 1000) << " ms" << std::endl;

These are the results that I got:

The run-times are at best 3730.58ms and at worst 3922.57ms for 7 tests.
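
For reference, the conversion from clock ticks to milliseconds used in both testers can be isolated in a helper. This is only a sketch, and elapsed_ms is a made-up name. Note that std::clock() reports processor time used by the program rather than wall-clock time.

```cpp
#include <cassert>
#include <ctime>

// Convert the difference between two std::clock() readings to milliseconds.
// Multiplying before dividing avoids the integer division CLOCKS_PER_SEC / 1000,
// which would truncate on a platform where CLOCKS_PER_SEC is not a multiple of 1000.
double elapsed_ms(std::clock_t begin, std::clock_t end) {
    assert(end >= begin);
    return 1000.0 * static_cast<double>(end - begin) / static_cast<double>(CLOCKS_PER_SEC);
}
```

A tester would take one reading before the loop and one after it, and pass both to elapsed_ms.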

Optimisation:
At first glance, I can already spot that the Makefile is using an -O2 flag. Given the minimal resources the software uses for operation, the time differences might be small, so we will be using milliseconds to benchmark the current performance times.
In addition to the flag, I will be focusing on the function named encode() that is inside "hashids.h" and will attempt to adjust the foreign function calling processes to make the function more independent. Considering the fact the code is almost entirely designed to be using vectors, it is already well optimised but can hopefully still be made faster with these updates and algorithm improvements.

In stage 2 of the project I will be implementing the changes discussed in this post and will report back with results of the outcome.

Saturday, 18 November 2017

SPO600 - Lab 6 - Algorithm Selection Lab

In lab number 6, we will be selecting one of two algorithms for adjusting the volume of audio samples using 3 different approaches. Digital sound is typically represented as signed 16-bit integer samples, at a sample rate of 44.1 or 48 thousand samples per second per channel, for a total of 88.2 or 96 thousand samples per second for the two channels of a stereo stream.
Those samples can be scaled by a volume factor in the range of 0.0000 to 1.0000, with 0 being silence and 1 being the original (loudest) volume.

In order for us to do the benchmarking properly, we will generate a large array of 500 million int16_t elements, each holding a random sound sample, and use a volume factor of 0.75.
I will be using the Aarchie server for the purpose of this lab.

#1 - Multiply each sample by the floating point volume factor 0.75:
In this approach, we will be simply going through each element of the array, and multiplying it by the volume factor (0.75). Then we will store the new value back into the array.
This is the code we have used for this approach:

And the results, before and after optimisation:

We can see that the optimisation improved the execution time significantly.

#2 - Pre-calculate a lookup table (array) of all possible sample values multiplied by the volume factor, and look up each sample in that table to get the scaled values:

Using this approach, we will need to create another array that will function as a lookup table that contains all the samples already scaled.

Let's look at the code:

And the results, before and after optimisation:

Once again the optimisation significantly improved execution times, while keeping the same total result.

#3 - Bitwise conversion of volume factor:

In this algorithm, we will first convert our volume factor to a fixed-point integer by multiplying it by a binary number that represents the value "1", in our case 256. Each sample is then multiplied by this integer factor, and the product is shifted right by 8 bits before being assigned back to the array.

Let's have a look at the code:

And the results, before and after optimisation:

As we can see, the execution time has improved.

We can see that approach #3 was the fastest of all, with and without optimisation. In the third approach, the result is different from that of the other two approaches. This is due to the bit-shifting of negative values: an arithmetic right shift rounds towards negative infinity, whereas converting a floating-point value to an integer truncates towards zero, and how negative values are shifted can differ between platforms (as on Aarchie). This can be fixed by turning negative values into positive ones for the purpose of shifting, and then returning them to their original sign.
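
For readers who want to try the three approaches side by side, here is a minimal per-sample sketch. It is not the code that was benchmarked in this lab: the function names are made up, and the real programs loop over the whole 500-million-element array rather than handling one sample per call.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Approach 1: multiply by the floating-point factor (the conversion truncates towards zero).
std::int16_t scale_float(std::int16_t sample, float factor) {
    return static_cast<std::int16_t>(sample * factor);
}

// Approach 2: pre-compute every possible result once, then look samples up.
// The table is indexed by the sample converted to an unsigned 16-bit value.
std::vector<std::int16_t> build_table(float factor) {
    assert(factor >= 0.0f && factor <= 1.0f);
    std::vector<std::int16_t> table(65536);
    for (int s = -32768; s <= 32767; ++s) {
        table[static_cast<std::uint16_t>(s)] = static_cast<std::int16_t>(s * factor);
    }
    return table;
}

std::int16_t scale_lookup(std::int16_t sample, const std::vector<std::int16_t>& table) {
    return table[static_cast<std::uint16_t>(sample)];
}

// Approach 3: fixed-point. 256 stands for 1.0, so 0.75 becomes 192.
std::int16_t scale_fixed(std::int16_t sample, float factor) {
    const std::int32_t vol = static_cast<std::int32_t>(factor * 256.0f);
    // Right-shifting a negative value is implementation-defined before C++20;
    // mainstream compilers shift arithmetically, rounding towards negative infinity.
    return static_cast<std::int16_t>((sample * vol) >> 8);
}
```

For the sample 1000 and factor 0.75, all three return 750. For the sample -1001, the first two return -750, while the third returns -751 on compilers that shift arithmetically, which is the kind of mismatch described above.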

Monday, 23 October 2017

SPO600 - Lab 5 - Vectorization Lab

In this lab we will be exploring single instruction/multiple data (SIMD) vectorization, and the auto-vectorization capabilities of the GCC compiler on the AArch64 server.

As the first step, we need to build a short program that creates two 1000-element integer arrays and fills them with random numbers in the range -1000 to +1000. After that, we will need to sum those two arrays element-by-element into another, separate array. Lastly, we will sum the third array and print the result.

Auto-Vectorization
Auto-vectorization is a technique in which the compiler makes a program more efficient by transforming loops into sequences of vector operations.
In order to get efficient, fully vectorized code, we need to separate the work that takes place into simple loops, so that several elements can be processed simultaneously rather than a single element at a time.

We can see an example of that in these two short programs created:

Although this code is a working solution to our programming needs (as specified in the code requirements), this program isn't auto-vectorized, since all the elements of the arrays are processed one at a time, making it inefficient and slow.

Take a look at this version of the very same code (rand.c):

In this version, the code is vectorized, since we split the work into 2 (or more) loops instead of a single one, so that each loop can process several elements of the arrays simultaneously.
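
The split into separate loops can be sketched as follows. This is not the original rand.c: it is a C++ rendering of the structure described above, sum_two_random_arrays is a made-up name, and the seed parameter is only there to make runs repeatable.

```cpp
#include <cassert>
#include <cstdlib>

// Sketch: fill two arrays with values in [-1000, 1000], add them element by
// element into a third array, then total the third array.
long sum_two_random_arrays(unsigned seed) {
    const int kSize = 1000;
    int a[kSize], b[kSize], c[kSize];
    std::srand(seed);

    // Loop 1: rand() is a function call, so this loop is not expected to vectorize.
    for (int i = 0; i < kSize; ++i) {
        a[i] = std::rand() % 2001 - 1000;
        b[i] = std::rand() % 2001 - 1000;
    }

    // Loop 2: simple element-wise arithmetic, a candidate for SIMD instructions.
    for (int i = 0; i < kSize; ++i) {
        c[i] = a[i] + b[i];
    }

    // Loop 3: a reduction over one array, which compilers can also vectorize.
    long total = 0;
    for (int i = 0; i < kSize; ++i) {
        total += c[i];
    }
    assert(total >= -2000L * kSize && total <= 2000L * kSize);
    return total;
}
```

Keeping the rand() calls in their own loop leaves the other two loops free of function calls, which is what gives the compiler a chance to use vector instructions for them at -O3.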

Let's compile the 2nd version of our program and disassemble the code.

gcc -O3 rand.c -o rand

It compiles and runs.
Now let's disassemble it:

objdump -d rand

Let's observe the output of the <main> section:

00000000004004f0 <main>:
4004f0:       d283f010        mov     x16, #0x1f80 // Assign and initialises space
4004f4:       cb3063ff        sub     sp, sp, x16
4004f8:       a9007bfd        stp     x29, x30, [sp]
4004fc:       910003fd        mov     x29, sp
400500:       a90153f3        stp     x19, x20, [sp,#16]

400504:       5289ba74        mov     w20, #0x4dd3 // Loop number 1. Filling arrays with random values.
400508:       a9025bf5        stp     x21, x22, [sp,#32]
40050c:       910103b5        add     x21, x29, #0x40
400510:       913f83b6        add     x22, x29, #0xfe0
400514:       f9001bf7        str     x23, [sp,#48]
400518:       72a20c54        movk    w20, #0x1062, lsl #16
40051c:       5280fa13        mov     w19, #0x7d0                     // #2000
400520:       d2800017        mov     x23, #0x0                       // #0
400524:       97ffffe3        bl      4004b0 <rand@plt> // Use of rand() values

400528:       9b347c01        smull   x1, w0, w20 // scalar multiply by a magic constant: first step of rand() % 2000 without a divide
40052c:       9367fc21        asr     x1, x1, #39
400530:       4b807c21        sub     w1, w1, w0, asr #31
400534:       1b138020        msub    w0, w1, w19, w0
400538:       510fa000        sub     w0, w0, #0x3e8
40053c:       b8376aa0        str     w0, [x21,x23]

400540:       97ffffdc        bl      4004b0 <rand@plt> // Use of rand() values
400544:       9b347c01        smull   x1, w0, w20 // scalar multiply by a magic constant: first step of rand() % 2000 without a divide
400548:       9367fc21        asr     x1, x1, #39
40054c:       4b807c21        sub     w1, w1, w0, asr #31
400550:       1b138020        msub    w0, w1, w19, w0
400554:       510fa000        sub     w0, w0, #0x3e8
400558:       b8376ac0        str     w0, [x22,x23]

40055c:       910012f7        add     x23, x23, #0x4 // Loop condition
400560:       f13e82ff        cmp     x23, #0xfa0
400564:       54fffe01        b.ne    400524 <main+0x34>
400568:       4f000401        movi    v1.4s, #0x0 // Vectorization
40056c:       d2800000        mov     x0, #0x0 // Adding elements of the first loop
400570:       3ce06aa0        ldr     q0, [x21,x0]
400574:       3ce06ac2        ldr     q2, [x22,x0]
400578:       91004000        add     x0, x0, #0x10
40057c:       f13e801f        cmp     x0, #0xfa0
400580:       4ea28400        add     v0.4s, v0.4s, v2.4s // Vectorization
400584:       4ea08421        add     v1.4s, v1.4s, v0.4s // Vectorization
400588:       54ffff41        b.ne    400570 <main+0x80> // Loop condition
40058c:       4eb1b821        addv    s1, v1.4s // Vectorization
400590:       90000000        adrp    x0, 400000 <_init-0x468> // Print total sum
400594:       911e0000        add     x0, x0, #0x780
400598:       0e043c21        mov     w1, v1.s[0]
40059c:       97ffffd1        bl      4004e0 <printf@plt>
4005a0:       f9401bf7        ldr     x23, [sp,#48]
4005a4:       a94153f3        ldp     x19, x20, [sp,#16]
4005a8:       52800000        mov     w0, #0x0                        // #0
4005ac:       a9425bf5        ldp     x21, x22, [sp,#32]
4005b0:       d283f010        mov     x16, #0x1f80                    // #8064
4005b4:       a9407bfd        ldp     x29, x30, [sp]
4005b8:       8b3063ff        add     sp, sp, x16
4005bc:       d65f03c0        ret // Return

We can see that our program has been vectorised and there has been a use of vector registers and SIMD instructions.

Reflection
As we can see, auto-vectorization goes a long way towards optimising and speeding up your code. Being the major research topic in computer science that it is today, a lot can be learnt and said about its advantages and implementation methods, most of which are still under research/development. This lab demonstrated the lengths compilers can go to in order to optimise programs, especially when working with loops that process a large amount of values and data.
I have found this topic intriguing (yet taxing) and helpful for understanding how vector registers and SIMD work and how they can be applied to real-life applications, when necessary. I am looking forward to learning more about it in the future, and to expanding my knowledge as much as possible for better results.

Sunday, 1 October 2017

SPO600 - Lab 4 - Code Building Lab

In lab number 4, we will be building a software package.
In the first step, we will choose a software package from the Free Software Foundation's GNU Project.
A full list of software packages can be found here: Link.

I have decided to build the software package Barcode. Barcode is "a tool to convert text strings to printed bars. It supports a variety of standard codes to represent the textual strings and creates postscript output."

Now after picking the package we would like to work with, we will log into our Linux system and download the package using "wget" command:

wget ftp://ftp.gnu.org/gnu/barcode/barcode-0.99.tar.gz

This is what we should be getting from the console:

barcode-0.99.tar.gz 100%[===================>] 869.85K 2.87MB/s in 0.3s

2017-10-01 19:47:58 (2.87 MB/s) - ‘barcode-0.99.tar.gz’ saved [890730]

Next, we will unzip the file we have downloaded using the "tar" command:

tar -zxvf barcode-0.99.tar.gz

After unpacking the file, we can see that there is an instruction file (often named INSTALL). In this case, the INSTALL file tells us to look at INSTALL.generic for basic instructions. By reading the file INSTALL.generic we can see the following instructions:

From the document, we understand that the next step would be to run the "configure" command.
After the configuration is done, we will run the command "make" to compile the package.
The fourth step "make install" would install the programs and relevant data, which is something we do not want, so we won't do this part.
After running the command "make" which should be finished in a few minutes, we will get a new file called "barcode".
By running it we can test the software package:

It works!

Part 2: Build and test glibc

In this part we will find and build the source code for the latest released version of the GNU Standard C Library (glibc), which can be found at the glibc website.
Now we will download it to our system using the "wget" command:

wget http://ftp.gnu.org/gnu/glibc/glibc-2.26.tar.gz

This is what we are supposed to be getting:

glibc-2.26.tar.gz 100%[===================>] 28.00M 19.6MB/s in 1.4s

2017-10-01 20:00:41 (19.6 MB/s) - ‘glibc-2.26.tar.gz’ saved [29355499/29355499]

Next, we will unpack the file using "tar -zxvf".
As with most other software packages, the installation instructions are in the INSTALL file, which we will now open and skim through.
The INSTALL file states that:

The GNU C Library cannot be compiled in the source directory. You must
build it in a separate build directory. For example, if you have
unpacked the GNU C Library sources in '/src/gnu/glibc-VERSION', create a
directory '/src/gnu/glibc-build' to put the object files in. This
allows removing the whole build directory in case an error occurs, which
is the safest way to get a fresh start and should always be done.

As a safety measure, we will create a new folder called "glibc-build" and build the library there, passing a prefix to the configure command:

../glibc-2.26/configure --prefix=/home/ragarunov/glibc-build

Then we will run the command "make".
After a long compiling process, we can finally begin to test our library!

Testing:
The library provides us the file "testrun.sh" that can be used to test our own version. Using that, we can test our version of glibc by creating a simple Hello World program in C:

It works!
Now, we will try to introduce a bug and run the program. We will do so with a simple array and a loop that reads one element past the end of the array:
[ragarunov@xerxes glibc-build]$ cat test.c

#include <stdio.h>

int main () {
        int num[4] = {1, 2, 3, 4};
        int i;

        for (i = 0; i<5; i++) {
                printf("%d", num[i]);
                printf("\n");
        }

        return 0;
}

After compiling and running the command:

./testrun.sh /home/ragarunov/lab4/glibc-build/test

Prints:

1
2
3
4
1219370488

In both tests, the library worked well and compiled the files as necessary!

Override:
The override mechanism is commonly used in object-oriented programming languages as a feature that lets a subclass provide an implementation that replaces (or overrides) the implementation already given by its parent class.

Multiarch:
Multiarch is a term that refers to the capability of a system to install and run applications for multiple different binary targets on the same system. It is used to simplify cross-building of libraries and headers that are required for a system during building.

Thursday, 28 September 2017

SPO600 - Lab 3

In this lab we will be experimenting with assembler on the x86_64 and aarch64 platforms.

I have found the assembly language intriguing and challenging on both the group task and later on during the individual work process I have gone through.

After reviewing the piece of code that was given to us (the group), which printed "Loop", the first task was to modify our loop program so that it counts from 0 to 9.
This was done by converting the integer to its digit character in ASCII/ISO-8859-1/Unicode UTF-8, where the digits are 48-57 (0x30-0x39).
After initialising the loop index value to 0 and moving it to a register, I added the current loop value to %r8 (which is register number 8 in 64-bit mode).
The next step is to write the current digit into the string, at a location called pos.
The print function is similar to the one shown in the Assembler Basics tutorial, as it fulfills the needs of the program.

Code that was developed in class printed the result:

Loop: 0
Loop: 1
Loop: 2
Loop: 3
Loop: 4
Loop: 5
Loop: 6
Loop: 7
Loop: 8
Loop: 9

In order to get the code to print two-digit values and count from 0 to 30, we initialise another digit and divide the number by 10, so that the quotient becomes the first digit and the remainder the second. We then add the quotient and the remainder to the registers r8 and r9.
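
The same digit arithmetic can be sketched in C++ before reading the assembler. The function name loop_line is made up, and the message layout (digits at positions 6 and 7) is taken from the aarch64 data section further down in this post.

```cpp
#include <cassert>
#include <string>

// Sketch: build the line that the assembler loop prints for index i (0..99).
std::string loop_line(int i) {
    assert(i >= 0 && i <= 99);
    std::string msg = "Loop:   \n";          // positions 6 and 7 hold the two digits
    const int quotient = i / 10;             // tens digit (the quotient of the division)
    const int remainder = i % 10;            // ones digit (the remainder of the division)
    if (quotient != 0) {
        msg[6] = static_cast<char>('0' + quotient);   // '0' is 48 in ASCII
    }
    msg[7] = static_cast<char>('0' + remainder);
    return msg;
}
```

Leaving position 6 untouched when the quotient is 0 is what suppresses the leading zero for the values 0 to 9.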

Code developed:

.text
.global    _start

sout = 1
start = 0                       /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 31                        /* loop exits when the index hits this number (loop condition is i<max) */

_start:
    mov     $start,%r15         /* loop index */

loop:

    /* Set to 0 */
    mov    $48, %r8
    mov $48, %r9

    /* Calculation */
    mov    $0,%rdx
    mov    $10,%r10
    mov    %r15,%rax

    div    %r10 /* division */

    add    %rax,%r8
    add    %rdx,%r9

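    /* add the current number */
    mov    %r9b, posB                      /* ones digit is always written */

    cmp    $0x30, %r8                      /* is the tens digit '0'? */
    je     continue                        /* yes: skip it to suppress the leading zero */
    mov    %r8b, pos                       /* tens digit */

continue:
    /* print (taken from example) */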
    mov    $len,%rdx                       /* message length */
    mov    $msg,%rsi                       /* message location */
    mov    $sout,%rdi                      /* file descriptor stdout */
    mov    $1,%rax                         /* syscall sys_write */
    syscall

    inc     %r15                /* increment index */
    cmp     $max,%r15           /* see if we're done */
    jne     loop                /* loop if we're not */

    mov     $0,%rdi             /* exit status */
    mov     $60,%rax            /* syscall sys_exit */
    syscall

.data

msg:    .ascii      "Loop:    \n"
.set len , . - msg
/* set position of number right after the msg */
.set pos , msg + 6
.set posB , msg + 7

Would print the following result:

Loop: 0
Loop: 1
Loop: 2
Loop: 3
Loop: 4
Loop: 5
Loop: 6
Loop: 7
Loop: 8
Loop: 9
Loop: 10
Loop: 11
Loop: 12
Loop: 13
Loop: 14
Loop: 15
Loop: 16
Loop: 17
Loop: 18
Loop: 19
Loop: 20
Loop: 21
Loop: 22
Loop: 23
Loop: 24
Loop: 25
Loop: 26
Loop: 27
Loop: 28
Loop: 29
Loop: 30

Aarch64:
The aarch64 version is quite similar to the x86_64 version logic-wise, so the same process can be applied to it as well. However, the instructions, registers and operand order are different, which goes to show how meticulous the process of debugging assembly language is.
I have found it more difficult to debug on Aarch64.
The task we were assigned to do was to create a loop that goes from 0 to 30.

The code that was developed:

.text
.global    _start

sout = 1
start = 0
max = 31

_start:

    mov    x4, start

loop:

    /* set */
    mov    w8, 48
    mov    w9, 48

    /* calculations */
    mov    w2, 10
    udiv    w1, w4, w2
    msub    w5, w1, w2, w4
    add    w8, w8, w1
    add    w9, w9, w5

    adr    x1, msg
    strb    w9, [x1, 7]

    cmp    w8, 0x30
    beq continue
    strb    w8, [x1, 6]

continue:

    /* print */
    mov    x0, sout
    mov x2, len

    mov x8, 64
    svc 0

    add x4, x4, 1
    cmp x4, max
    b.ne    loop

    mov     x0, 0         /* status -> 0 */
    mov     x8, 93        /* exit is syscall #93 */
    svc     0              /* invoke syscall */


/* print data */
.data
msg:     .ascii      "Loop:   \n"
len=     . - msg

Would print the following result:

Loop: 0
Loop: 1
Loop: 2
Loop: 3
Loop: 4
Loop: 5
Loop: 6
Loop: 7
Loop: 8
Loop: 9
Loop: 10
Loop: 11
Loop: 12
Loop: 13
Loop: 14
Loop: 15
Loop: 16
Loop: 17
Loop: 18
Loop: 19
Loop: 20
Loop: 21
Loop: 22
Loop: 23
Loop: 24
Loop: 25
Loop: 26
Loop: 27
Loop: 28
Loop: 29
Loop: 30

In conclusion
I have found Assembly very intriguing, all the more so the deeper you go into it, yet unnecessarily challenging, especially nowadays. Personally, I found coding on the Aarch64 server more difficult to figure out and to solve, due to difficulties I had with the initial loop.