**Chapter 2
Planning and Design**

**2.1 High Level Objectives and Aims**

The purpose of this design phase is to outline the high-level objectives that the proof-of-concept project is to achieve. They are:

1. Produce an optimised version of a Gauss-Jordan Inversion algorithm suitable for parallelisation.

2. Produce a parallelised version of a Gauss-Jordan Inversion algorithm.

3. Compare the run timings of any such algorithm over a base-line comparison.

4. For matrices of various sizes, determine the optimal algorithm for each dimension, taking into account the hardware of the system and the compiler used.

5. Create a compiled function that incorporates the results of aim 4 and can be called by an external program to process a square matrix.

**2.2 Design Decisions**

**2.2.1 Programming language**

To choose the right programming language for the project, a set of requirements had to be considered. They were:

1) Author’s prior experience

2) Language’s ability to deal with threading

3) Portability and Support

4) Fast code production

5) Environment that final generated function would be used in

**Author’s prior experience**

The author has previously written algorithmic code in C, C++, Java, Eiffel and the M68k assembly language.

**Language’s ability to deal with threading**

Both C and C++ have the ability to use OpenMP or POSIX threads. Java has its own threading model as well as POSIX threading.

**Portability and Support**

C, C++ and Java are widely supported on multiple hardware and operating system platforms. Other languages such as Eiffel and Haskell do not have large communities or companies driving their development forward.

**Fast Code Production**

Current C/C++ compilers such as gcc and the Intel compiler have an array of optimisation switches for various processors and for in-lining and loop unrolling. Java's compiler only produces bytecode, which is translated into machine code at runtime; although Java's speed is catching up with that of C/C++, it is still slower.

**Environment**

The final function should suit the most common computation environments, which favour speed and portability. C or C++ are therefore the obvious choices if the final Gaussian algorithm is to be compatible with other programs.

These requirements narrowed the final choice of programming language to C or C++. The author chose C++ because its memory-management operators, 'new' and 'delete', avoid the complexity of using malloc() and free() to define arrays and matrices in memory, and make memory leaks easier to avoid.

Decision: C++ is the language of choice.

**2.2.2 Mathematical Techniques**

Three Gaussian based algorithms are used in this project with one acting as a base-line result. The base algorithm is a modified version of Numerical Recipes in C++’s *gaussj* function.

The gaussj function is quite a complex implementation of the Gauss-Jordan algorithm, with numerous single-letter variables that make it hard to comprehend. The code itself is not directly parallelisable and is intended to run serially. The author edited the function's matrix declarations to use the new and delete operators in place of the proprietary matrix format that Numerical Recipes uses. Standardising the matrix format means the eventual timings can be compared and contrasted fairly.

The first of the algorithms written by the author, on which *gaussjordanSerial* and *gaussjordanOMP* are based, is the classic method for a Gauss-Jordan Inversion. The first part of the algorithm is the Gaussian Elimination technique. It requires two matrices: the original matrix and an Identity Matrix. The algorithm starts on the upper-left entry of the first matrix and, if the entry is non-zero, performs elementary row operations on the rows below it with the aim of creating a column of zeros below the lead diagonal. If the entry is zero then the whole row is swapped with a row below it that has a non-zero entry in the same column. If there are no non-zero entries in the column then the matrix is singular and non-invertible. Once the first column is complete, the next entry on the lead diagonal is processed.

The second part of the algorithm is a form of Gaussian Elimination in reverse: the algorithm starts on the lower-right element and uses elementary row operations on the rows above it to form a column of zeros above the lead diagonal. This continues, with each row also being scaled by its diagonal value, until the first matrix becomes the Identity Matrix.

All row operations upon the first matrix are replicated upon a second matrix that starts out as the Identity Matrix.

Once the algorithm has completed, the second matrix holds the inverse of the original matrix.
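The two-pass procedure described above can be sketched as follows. This is an illustration of the classic method, not the author's *gaussjordanSerial* code; the function name and Matrix type are assumptions:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Two-pass Gauss-Jordan inversion: the forward pass zeros the column
// below each pivot, the reverse pass zeros the column above it, and
// every row operation on 'a' is mirrored on 'inv', which starts as
// the Identity Matrix. Returns false if the matrix is singular.
bool gaussJordanInvert(Matrix a, Matrix& inv) {
    const std::size_t n = a.size();
    inv.assign(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i) inv[i][i] = 1.0;

    // Forward pass: create zeros below the lead diagonal.
    for (std::size_t p = 0; p < n; ++p) {
        if (a[p][p] == 0.0) {                  // swap in a non-zero pivot
            std::size_t r = p + 1;
            while (r < n && a[r][p] == 0.0) ++r;
            if (r == n) return false;          // singular, non-invertible
            std::swap(a[p], a[r]);
            std::swap(inv[p], inv[r]);
        }
        for (std::size_t row = p + 1; row < n; ++row) {
            double f = a[row][p] / a[p][p];
            for (std::size_t c = 0; c < n; ++c) {
                a[row][c]   -= f * a[p][c];
                inv[row][c] -= f * inv[p][c];
            }
        }
    }
    // Reverse pass: create zeros above the lead diagonal, then scale
    // each pivot row so 'a' becomes the Identity Matrix.
    for (std::size_t p = n; p-- > 0; ) {
        for (std::size_t row = 0; row < p; ++row) {
            double f = a[row][p] / a[p][p];
            for (std::size_t c = 0; c < n; ++c) {
                a[row][c]   -= f * a[p][c];
                inv[row][c] -= f * inv[p][c];
            }
        }
        double d = a[p][p];
        for (std::size_t c = 0; c < n; ++c) {
            a[p][c]   /= d;
            inv[p][c] /= d;
        }
    }
    return true;
}
```

Because every elementary row operation applied to the first matrix is replicated on the second, the second matrix ends as the inverse once the first has been reduced to the identity.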

The second algorithm, on which *gaussjordanSerial_Combo* and *gaussjordanOMP_Combo* are based, attempts to perform both the first and second stages of the Gauss-Jordan Inversion at the same time. The algorithm starts exactly like the previous one, but after a column has been reduced to zeros (excluding the lead diagonal value) the algorithm also performs elementary row operations on the rows above the pivot value.
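A sketch of this combined single-sweep idea follows; it is illustrative only and not the project's *gaussjordanSerial_Combo* implementation, and the names are assumptions:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Combined variant: each pivot column is cleared both below *and*
// above the pivot in a single sweep, so only one pass over the
// matrix is needed. Row operations on 'a' are mirrored on 'inv'.
bool gaussJordanCombined(Matrix a, Matrix& inv) {
    const std::size_t n = a.size();
    inv.assign(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i) inv[i][i] = 1.0;

    for (std::size_t p = 0; p < n; ++p) {
        if (a[p][p] == 0.0) {                  // swap in a non-zero pivot
            std::size_t r = p + 1;
            while (r < n && a[r][p] == 0.0) ++r;
            if (r == n) return false;          // singular matrix
            std::swap(a[p], a[r]);
            std::swap(inv[p], inv[r]);
        }
        // Eliminate the pivot column in every other row at once.
        for (std::size_t row = 0; row < n; ++row) {
            if (row == p || a[row][p] == 0.0) continue;
            double f = a[row][p] / a[p][p];
            for (std::size_t c = 0; c < n; ++c) {
                a[row][c]   -= f * a[p][c];
                inv[row][c] -= f * inv[p][c];
            }
        }
        double d = a[p][p];                    // normalise the pivot row
        for (std::size_t c = 0; c < n; ++c) {
            a[p][c]   /= d;
            inv[p][c] /= d;
        }
    }
    return true;
}
```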

**2.2.3 Algorithm Design**

The design of each algorithm is crucial to optimising for parallelisation. As mentioned previously, the gaussj function is complex and does not lend itself to having distinct independent sections. By rewriting the algorithms from scratch it was possible to write these sections of code so that they were independent and hence parallelisable.

Additionally, each section of the algorithmic code was designed to be composed of simple instructions such as additions and multiplications rather than function calls. This reduced the complexity of the machine code generated by the compiler and hence the number of processing cycles required. However, the more expensive processor instructions, such as multiplication and division, must be used sparingly; this applies especially to division, which takes more cycles per instruction than multiplication.
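The point about division can be illustrated with a small sketch (the function name is an assumption): rather than dividing every element of a row by its pivot, the reciprocal is computed once and each element is multiplied by it, leaving one division per row instead of one per element.

```cpp
#include <cassert>
#include <cstddef>

// Scale a row by 1/pivot using a single division followed by cheap
// multiplications, instead of n expensive divisions.
void scaleRow(double* row, std::size_t n, double pivot) {
    double recip = 1.0 / pivot;        // the only division
    for (std::size_t c = 0; c < n; ++c)
        row[c] *= recip;               // multiplications only
}
```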

Decision: Write new Gauss-Jordan Inversion algorithms from scratch and section code for efficient parallelisation.

**2.2.4 Algorithm Parallelisation**

The next design decision was how to implement a parallelised algorithm. There are two standard methods for running sections of code in parallel. The first is POSIX threads, or pthreads as they are more commonly known. This method requires each parallelised section to be encapsulated as a function, which would have complicated the code further and added machine instructions to the final compiled version.

The second method for parallelising the program is OpenMP. It is a pragma-based specification for a set of compiler directives, library routines, and environment variables that can be used to specify shared-memory parallelism in Fortran and C/C++ programs. It has been implemented on computers ranging from supercomputers through SMP servers to multicore desktops, and all variations in between. It has a documented list of requirements that code must meet for OpenMP to successfully parallelise 'for' loops, which are used predominantly in these algorithms. They are:

1. The loop variable must be a signed integer.

2. The loop condition must not change during the execution of the loop.

3. Only simple relational operators, such as <, <=, > and >=, are used in the loop condition.

4. The loop contains no jumps into or out of the loop body.

5. On each iteration the loop variable is incremented or decremented by a fixed amount, so that the loop cannot run indefinitely because the loop condition never evaluates to false.
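A loop meeting these conditions might look like the following sketch (illustrative, not the project's code). Each iteration updates an independent row, so OpenMP can hand rows to different threads:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Eliminate the pivot column 'p' in all rows below the pivot.
// The loop satisfies the conditions listed above: a signed integer
// loop variable, a fixed bound, a simple relational test (<), no
// jumps in or out, and a plain increment.
void eliminateBelow(std::vector<std::vector<double>>& a, int p) {
    int n = static_cast<int>(a.size());
    #pragma omp parallel for
    for (int row = p + 1; row < n; ++row) {
        double f = a[row][p] / a[p][p];
        for (int c = 0; c < n; ++c)
            a[row][c] -= f * a[p][c];
    }
}
```

Compiled without OpenMP support the pragma is simply ignored and the loop runs serially, which is convenient for producing the serial baseline from the same source.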

The author chose to use OpenMP as the method of choice for threading of the code. Its straightforward declarations allowed for more work to be done on optimising other areas of the algorithms.

Decision: OpenMP as the API of choice.

**2.2.5 Sample Matrices**

A range of matrix dimensions needed to be chosen so that the algorithms could be compared in a realistic manner. All algorithms were therefore run over a series of identical matrices from 3×3 through to 100×100, and occasionally over larger dimensions where an interesting trend continued beyond that range. The matrix entries are created by a standalone program that generates a series of ten million random signed integer values between -4999 and 5000 and writes them to a file. This data file then provides test values for the project's algorithms.
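A hypothetical sketch of such a generator, using the C standard library's rand(); the names and structure are assumptions, not the project's actual program:

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// One random signed integer in the range [-4999, 5000]:
// rand() % 10000 yields 0..9999, shifted down by 4999.
int randomEntry() {
    return std::rand() % 10000 - 4999;
}

// Write 'count' random values to 'filename', one per line, to serve
// as a pool of matrix entries for the test runs.
bool writeRandomValues(const char* filename, long count, unsigned seed) {
    std::FILE* out = std::fopen(filename, "w");
    if (!out) return false;
    std::srand(seed);
    for (long i = 0; i < count; ++i)
        std::fprintf(out, "%d\n", randomEntry());
    std::fclose(out);
    return true;
}
```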

Decision: Range 3×3 → 100×100

**2.2.6 Compilers**

Once the language was chosen, a decision had to be taken on which compilers the program should be built with. For the sake of brevity, and to keep the number of variables within the project at a manageable level, the author decided to use only two different compilers. G++ was an obvious first choice since it is the most widely ported, and newer versions such as 4.2 now have OpenMP support. The second compiler had to support OpenMP, which made the Portland and Intel compilers the primary candidates. Unfortunately the Portland compiler cannot be used without payment, whereas the Intel compiler offered a fully featured evaluation at no cost.

Decision: G++ and Intel Compilers