Saturday 6 January 2018

SPO600 - Project - Stage 2

For stage 2 of the SPO600 project, I will be trying to optimise the Hashids package.
As noted in my stage 1 report, the optimisation will be focused on flags and the encode() function that is inside "hashids.h".

Flags:
First, we will start with the Makefile, and see if there are any possible improvements we can make to the compiler flags.
The current flags that are assigned are:
-g -Wall -O2 -pipe -Wextra -Wno-sign-compare
As the code mostly uses vectors, I presume changing the -O2 flag to -O3 won't make a significant difference, or will only slightly improve things. Since there isn't really any downside to it, and since we're going to dig deeper into the file's functions as well, I will go ahead and amend the flags to use -O3, which may or may not prove useful at a later stage.

-g -Wall -O3 -pipe -Wextra -Wno-sign-compare
After running and compiling the project, with no code changes, just as I suspected, the run-times remained more or less the same:
Worst run-time is at 3912.18ms and best run-time is at 3878.04ms.
Now let's move on to the code adjustments!

Code:
Going over the encode() function in the hashids.h file, I can already tell the code is well maintained and tuned.
However, given how intricate the code is, and how much it communicates with other functions, I decided to look for optimisation opportunities in the loops inside the functions.
Previous experience (and in some cases, common sense) suggests that the most resource-draining processing will occur inside the loops of the code.
Looking at the first for loop in the function:
    int values_hash = 0;
    int i = 0;
    for (Iterator iter = begin; iter != end; ++iter) {
      values_hash += (*iter % (i + 100));
      ++i;
    }
I would be lying if I said I wasn't disappointed to find this code is highly optimised. It takes care of initialisation, incrementation, and assignment.
Moving forward!

Going over the second for loop in the function, which is longer and more resource-consuming:
for (Iterator iter = begin; iter != end; ++iter) {
    uint64_t number = *iter;

    std::string alphabet_salt;
    alphabet_salt.push_back(lottery);
    alphabet_salt.append(_salt).append(alphabet);

    alphabet = _reorder(alphabet, alphabet_salt);

    std::string last = _hash(number, alphabet);
    output.append(last);

    number %= last[0] + i;
    output.push_back(_separators[number % _separators.size()]);
}

I can already construe that my main focus should be the variable declarations inside the loop, and the functions being called from it, which most likely contain loops of their own. That would mean a run time of O(n^2) at best, and possibly worse.

As my first step, I will move unnecessary declarations (ones that happen every time the loop runs) outside of the loop.
I will take uint64_t number and std::string alphabet_salt and declare them both outside of the loop. Note that alphabet_salt then has to be emptied at the top of each iteration, otherwise it would keep growing across iterations and change the function's behaviour.
std::string last will remain inside the loop, because the way it is assigned its value requires it to be declared anew on every pass.
Eventually my loop looks like this:
    i = 0;
    std::string alphabet_salt;
    uint64_t number;

    for (Iterator iter = begin; iter != end; ++iter) {

      number = *iter;

      // Empty alphabet_salt on every pass now that it is declared outside
      // the loop, so it does not keep growing across iterations.
      alphabet_salt.clear();
      alphabet_salt.push_back(lottery);
      alphabet_salt.append(_salt).append(alphabet);

      alphabet = _reorder(alphabet, alphabet_salt);

      std::string last = _hash(number, alphabet);
      output.append(last);

      number %= last[0] + i;
      output.push_back(_separators[number % _separators.size()]);
      ++i;
    }
Now we move on to the _hash() function, which is inside the hashids.cpp file.
The _hash() function is essentially a single while loop that pushes back values into the output string and divides the number by the array size until it reaches 0.
I couldn't find much room for improvement in there.
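For illustration, here is a hedged sketch of the shape such a loop takes, based on the description above rather than the exact package source:

    #include <cstdint>
    #include <string>

    std::string _hash(uint64_t number, const std::string &alphabet) {
      std::string hashed;
      do {
        // Emit one alphabet character per pass, then shrink the number
        // until there is nothing left to encode.
        hashed.push_back(alphabet[number % alphabet.size()]);
        number /= alphabet.size();
      } while (number > 0);
      return hashed;
    }

Since the number of passes is bounded by the number of "digits" in the input (logarithmic in its value), there is very little to restructure here.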

The _reorder() function, however, contains more to work with.
Similarly to _hash(), _reorder() also uses a while loop to process the class's elements and return them to our encode() function.
This is what it looks like:
while (i > 0) {
    index %= salt.length();
    int integer = salt[index];
    integer_sum += integer;
    unsigned int j = (integer + index + integer_sum) % i;

    std::swap(input[i], input[j]);

    --i;
    ++index;
}
Once again, I can spot variables being declared inside the loop, making it slower, since every single time the loop executes the variables are declared anew and take up space.
I will declare int integer and unsigned int j outside the loop and initialise them both to 0.
This is how my function looks now:
  int integer = 0;
  unsigned int j = 0;

  while (i > 0) {
    index %= salt.length();
    integer = salt[index];
    integer_sum += integer;
    j = (integer + index + integer_sum) % i;

    std::swap(input[i], input[j]);

    --i;
    ++index;
  }
With all these changes, now let's try to run the tester!

Benchmarking:
In stage 1, we got run-time results of "at best 3730.58ms and at worst 3922.57ms for 7 tests", which averaged out to 3846.29ms.

When running the same tester, which encodes 5,000,000 values in a for loop, we get the following results:
The run-times are at best 3642.28ms and at worst 3892.31ms for 7 tests!
On the bright side, we managed to get the software to run under 3900ms for all tests!
On the even brighter side, the best run-time has improved noticeably, and the runs now average 3824.36ms, a 21.93ms speed-up in average run time.
Comparing the best results of each run, we can see that the speed-up is 88.3ms.
Comparing the worst results of each run, we can see that the speed-up is 30.26ms.
All in all, it seems that on all metrics the run-times have improved, even if only slightly.
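For reference, here is a hedged sketch of the kind of tester described above; the namespace, class name, and the encode(begin, end) signature are assumptions based on the package code shown earlier, not the actual tester source:

    #include <chrono>
    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    #include <vector>
    #include "hashids.h"

    int main() {
      hashidsxx::Hashids hash;   // namespace/class name assumed
      std::size_t total = 0;     // keeps results live so the loop is not optimised away

      auto start = std::chrono::steady_clock::now();
      for (uint64_t value = 0; value < 5000000; ++value) {
        std::vector<uint64_t> values{value};
        total += hash.encode(values.begin(), values.end()).size();
      }
      auto finish = std::chrono::steady_clock::now();

      std::cout << std::chrono::duration<double, std::milli>(finish - start).count()
                << "ms (" << total << " characters encoded)" << std::endl;
      return 0;
    }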


I've been experiencing several speed-related difficulties with the server over the last couple of days (I reckon everyone else did the same as me and waited until the last moment), so I (want to) believe that in a set-up similar to the one I ran my initial benchmark on (December 16, 2017), I would have had even faster results.

Personal reflections and conclusions: 
Doing the project has helped me learn more about open source packages and projects. I found the initial stage, finding a suitable project to work with, to be the most stressful and challenging part, mostly due to the huge, boundless, never-ending number of packages and projects that exist online. The optimisation stage let me learn and experiment more with methods that we've practised in class and in our labs.

Most importantly, I have learnt to take into consideration the amount of resources the system will use with every piece of code that I write, and to adjust my coding habits to become more processor-friendly and efficient. I am looking forward to learning more about this subject in the future and improving as a programmer and an open-source enthusiast!

Monday 1 January 2018

SPO600 - Lab 7 - Inline Assembler Lab

In lab 7, we will be exploring the use of inline assembler, and its use in open source software.
First, we will start with some background:
"inline assembler is a feature of some compilers that allows low-level code written in assembly language to be embedded within a program, among code that otherwise has been compiled from a higher-level language such as C or Ada."
In simple words, inline assembler allows us to embed assembly-language code in our high-level code (such as C) and have it run more efficiently, owing to the low-level nature of assembly language.
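To make that concrete, here is a minimal sketch using GCC's extended inline assembler (assuming an AArch64 machine, like the class servers); the operand constraints let the compiler pick the registers:

    #include <cstdint>
    #include <iostream>

    int main() {
      std::uint64_t a = 5, b = 7, sum;

      // A single AArch64 "add" instruction: "=r" marks the output register,
      // "r" marks the inputs, and the compiler allocates all three.
      __asm__ ("add %0, %1, %2"
               : "=r" (sum)
               : "r" (a), "r" (b));

      std::cout << sum << std::endl;   // prints 12
      return 0;
    }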

In the first part of the lab, we will download a program provided in the lab instructions and build the program. Then, we will test the performance of the solution by adjusting sample size (in vol.h) and arrays.
We can see that the current sample size is 250000. Running an unaltered version of the code produces the following results:

Now we will change the number of samples to 500 million, similar to what we had in the previous lab (lab 6). To compare the results of this code with the lab 6 results, we will compile both without the -O3 flag:

And:

Processing times without any flags seem to be faster for the inline assembler code.
Compiling with the -O3 flag produces relatively similar run-times for both programs. I attribute this to -O3 producing a run-time that is not solely "code-dependent", whereas compiling with no flags relies wholly on the quality of the code (and the asm functionality coded into it).

Q: what is an alternate approach?
An alternate approach would be not to assign specific registers to the variables in_cursor, out_cursor, and vol_int; the compiler would then find and allocate registers on its own, as sketched below.
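Here is a hedged sketch of what that could look like; the variable names (in_cursor, out_cursor, vol_int) come from the lab code, but the surrounding function and loop are an assumed reconstruction:

    #include <cstddef>
    #include <cstdint>

    // Scale "samples" 16-bit sound samples (count assumed to be a multiple
    // of 8) without pinning any variable to a fixed register.
    void scale(std::int16_t *in_cursor, std::int16_t *out_cursor,
               std::int16_t vol_int, std::size_t samples) {
      for (std::size_t i = 0; i < samples; i += 8) {
        __asm__ (
          "dup v1.8h, %w[vol]          \n\t" // copy the factor into all 8 lanes
          "ldr q0, [%[in]], #16        \n\t" // load 8 samples, post-increment
          "sqdmulh v0.8h, v0.8h, v1.8h \n\t" // scale all 8 lanes at once
          "str q0, [%[out]], #16"            // store 8 samples, post-increment
          : [in] "+r" (in_cursor), [out] "+r" (out_cursor)
          : [vol] "r" (vol_int)
          : "v0", "v1", "memory");
      }
    }

The "+r"/"r" constraints and the clobber list tell the compiler exactly which values flow in and out, so it is free to choose the general-purpose registers itself instead of being forced onto r20-r22.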


Q: should we use 32767 or 32768 in next line? why?
32767, as it is the maximum value of the int16_t data type; a quick check follows below.
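A trivial sketch to confirm the bound:

    #include <cstdint>
    #include <iostream>
    #include <limits>

    int main() {
      // int16_t covers -32768 to 32767, so 32767 is the largest value a
      // positive 16-bit sample can hold.
      std::cout << std::numeric_limits<std::int16_t>::max() << std::endl;  // 32767
      std::cout << std::numeric_limits<std::int16_t>::min() << std::endl;  // -32768
      return 0;
    }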

Q: what does it mean to "duplicate" values here?
To "duplicate" values here means to copy what's in register 22 (vol_int, in this case) into each of the eight 16-bit lanes of vector v1.8h.

Q: what happens if we remove the following lines? Why?
These lines are responsible for assigning the corresponding values to the variables used for scaling. Removing them results in a segmentation fault, since the scaling process is left without the values it needs.

Q: are the results usable? are they correct?
The results are usable and correct. We can see that with each run the scaling values change, and so does the summed result.


Part 2 - Individual task:
For this part, I have picked the open source package "BusyBox".
A bit of background on BusyBox, from the official website:
"BusyBox combines tiny versions of many common UNIX utilities into a single small executable. It provides replacements for most of the utilities you usually find in GNU fileutils, shellutils, etc. The utilities in BusyBox generally have fewer options than their full-featured GNU cousins; however, the options that are included provide the expected functionality and behave very much like their GNU counterparts. BusyBox provides a fairly complete environment for any small or embedded system."

How much assembly-language code is present?
Using an egrep search, I was able to find quite a few inline assembler pieces in the package.
The majority of the assembly code can be found in the networking and include directories.

Which platform(s) is it used on?
BusyBox can be used on any platform that is supported by gcc. Its module loading is currently limited to ARM, CRIS, H8/300, x86, ia64, x86_64, m68k, MIPS, PowerPC, S390, SH3/4/5, Sparc, and v850e.

Why is it there (what does it do)?
The functions I decided to focus on can be found in "networking/tls_pstm_montgomery_reduce.c"
The file contains a handful of inline assembler functions whose purpose is to optimise operations on different platforms/architectures.

What happens on other platforms?
It seems that the networking directory is responsible for ensuring the software runs properly on all platforms, with meticulous inline assembler coding per platform.
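To illustrate the pattern (a generic sketch of the per-platform approach, not BusyBox's actual code), such files typically select a hand-written assembler version per architecture and fall back to portable C everywhere else:

    #include <cstdint>

    static inline std::uint32_t byte_swap(std::uint32_t x) {
    #if defined(__x86_64__) || defined(__i386__)
      __asm__ ("bswap %0" : "+r" (x));                // x86: one instruction
      return x;
    #elif defined(__aarch64__)
      std::uint32_t r;
      __asm__ ("rev %w0, %w1" : "=r" (r) : "r" (x));  // AArch64 equivalent
      return r;
    #else
      // Portable C fallback for every other platform.
      return ((x >> 24) & 0x000000ffu) |
             ((x >>  8) & 0x0000ff00u) |
             ((x <<  8) & 0x00ff0000u) |
             ((x << 24) & 0xff000000u);
    #endif
    }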

Your opinion of the value of the assembler code VS the loss of portability/increase in complexity of the code.
As far as I can see, assembler code can be extremely efficient for a project, but it can also be very time-consuming and draining. The intricacy of assembler code and the need to adjust it for every architecture can make it detrimental to large-scale projects, turning standard tasks such as debugging or maintenance into menial, taxing work. However, assembler code can definitely be valuable in projects such as BusyBox, which strive to be all-encompassing and to make operations faster, more efficient, and overall better.