There is plenty of helpful advice around the web, and the following list is only a selection targeted at this project. Similar arguments are grouped together to ease reference during design and implementation. Information taken from the referenced sites has been reorganized, so it may differ from the originals; if you think its use is incorrect, or you have other suggestions, please contact me.
Object Oriented
It is always possible to use object-oriented techniques, even in the C language (not only Objective-C), which has no explicit support for objects. The way to make C follow the object-oriented directive is to represent the data members of a class as members of a structure, and its methods as functions taking a pointer to that structure as their first parameter (see here). In any case, this strategy leaves object members exposed, implying memory management problems, hindering encapsulation and limiting reuse-based solutions.
On the other hand, design choices need not aim only at perfection. In this project there are few ´objects´, in fact only one type (layer), so its data can be kept free, with no memory management or encapsulation drawbacks. Moreover, with a lot of small entities, using classes would introduce access delays, due to arrays made of classes instead of simple data types. In addition, a class is meant to keep together data belonging to the same conceptual object, but in this case reading irrelevant data has to be considered overhead, so tearing the data apart allows treating each part separately, only when it is needed.
Efficiency
Processes have time and space efficiency, referring respectively to execution speed and memory used. An accurate balance between design and execution needs must be struck, because the more a design leans towards speed, the more memory it tends to use, and vice versa. The paging mechanism of modern kernels manages process memory, swapping data back and forth between cache, RAM and drives. This mechanism shares system memory among active processes, limiting in practice the memory devoted to each process.
The project aim is to simulate a large number of neurons in real time. These two conditions explicitly conflict, because real-time operation would spend memory in favor of speed, in spite of the space needed by neuron data. Hence, in order to reduce page and segmentation faults, data and functions have to be conceived with both space and time wariness.
Here, numbered in succession, is a series of rules optimizing code for both temporal and spatial efficiency.
1 Avoid dynamic memory allocation
Run-time memory allocation raises the risks of page fragmentation and failed de-allocation, with consequent memory leaks. Static allocation can be used in place of run-time allocation. It implies recompiling on each change, but makes the system steady and neat.
2 Prepare system for Blocking
On RISC architectures the cache line represents the smallest amount of data that can be transferred between RAM and cache. So sizing the text and data segments (but also stack and heap) to the cache dimension will improve transfers between RAM and cache, enhancing time performance. This implies writing code with great care for its size, and adjusting it once finished so that it fits.
3 Check code carefully
Cache and register operations are much faster than memory operations, but there is not much space, so all compilers try to keep in registers the data that is supposed to be heavily used. But this usage is driven by programmers, so it is up to them to check which instructions are really needed.
3.1 Decompose big expressions
Compilers do this task quite well, but designers should know perfectly the algorithms they use, so they can break up big expressions. Some sub-expressions may be computed before others and then recomposed for the total result. As an example:

X = A * log(Y) + pow(log(Y), 2);

is a heavy expression that can be partitioned, introducing an explicit temporary variable t:

t = log(Y);
X = A * t + pow(t, 2);

and even the exponentiation can be saved by:

X = (A + t) * t;

3.2 Simplify expressions
Explicit simplification of big expressions is also a programmer's duty. Unchecked code can waste instructions. For example, some compilers do simplify an expression such as:

y = x * (3/x) ==> y = 3

but surely cannot accomplish expression-function checking:

y = foo(x) * (3/foo(x)) --> y = 3

Even the order of evaluation is important: in floating point, (a + b) + c is not the same as a + (b + c).
4 Explicitly assign process priority
Besides memory management there is another aspect undermining temporal efficiency. With scheduling, the kernel fools processes, letting each think it is the only one running, by assigning them slices of CPU time. The scheduling time assigned to each process is determined by its priority, normally assigned by the system. Priority can be changed at the programmer's ease with functions like sched_setscheduler() and setpriority(), possibly following information acquired from the system's proc filesystem (/proc/cpuinfo, /proc/stat).
5 Reduce total task number
It is possible to use GNU/Linux systems with a text-only interface (command line), reducing the number of executing tasks with respect to the high number of tasks brought by a user-friendly graphical interface.
6 Avoid multi-threaded design
Multi-threaded design complicates a system by introducing synchronization primitives to access resources shared among different threads, and overhead due to the context changes necessary to execute different threads. Raising the number of calculation lines is best done by raising the number of machines running the system.
Variables
Data types used to represent variables should be chosen carefully. Type dimensions are especially important: in most cases the range of values provided by a data type is larger than the range of values really assumed by the variable stored in it.
7 Prefer WORD-size variables
32-bit microprocessors typically access memory by performing 32-bit bus cycles, so they perform arithmetic operations and parameter passing at integer (4-byte) level. The compiler will first convert char and short operands into integers, perform the operations and then convert the result back to the original type. Compilers also have to follow the 4-byte alignment restriction: if a user-defined data structure violates this rule, compilers have to add pad bytes so that the structure does not violate the target microprocessor's restrictions. This yields poor memory efficiency, which has to be avoided.
8 Avoid bit fields
With bitwise operators (~, &, ^, |, <<, >>) bit subsets inside a data type can be inserted and extracted. This practice allows saving space, stuffing data types to their limit. Efficiency varies according to data use: inserting and extracting a bit field takes some instruction cycles, resulting in lower time efficiency, but if the data amount is huge and accessed infrequently, space efficiency is enhanced.
Another variable facet is the place of declaration. Variables defined locally to a function (automatic) stay in registers only during computations on them. On the contrary, globally defined variables stay in the process data segment. Therefore the right place for a variable is an important choice.
9 Code and data being used together should be placed physically together
Locality of reference for code and data affects efficiency. Indeed, compilers move variables among registers, cache, RAM and drives according to their use. If a function is declared far from the variables it uses, this will result in cycles to recall a variable from its position, and other cycles to recall the function data that was stacked to access the variable. In C++, all of an object's data and code is in only one place. The same can be done when coding in C: simply, related variables and functions can be declared together.
10 Limit the maximum number of live variables
If the number of active variables is small, the compiler will be able to fit them into registers, without memory reads or stack pushes, avoiding frame pointer setting cycles. This can be achieved by keeping expressions simple and small, not using too many variables in a function, dividing large functions into smaller ones, declaring local variables in the innermost scope, and using ´register´ for frequently-used variables, so they are allocated to a register with a very high priority.
11 Minimize the use of global variables
The reasons why it is preferable to limit variables are valid also for global ones. If a function makes frequent use of global variables, these have to be recalled each time from the data segment. To replace memory access times with register access times, it may be suitable to copy those global variables into local ones, so that they can be assigned to registers.
12 Manage byte ordering among different machines
When transferring data among different machines, byte ordering has to be faced. Microprocessors support big-endian or little-endian byte ordering, named after Swift's "Gulliver's Travels". Big-endian is an order in which the most significant byte is stored first, i.e. at the lowest address; little-endian is an order in which the least significant byte is stored first. Sending data between two machines having different byte ordering implies routines to convert between big-endian and little-endian formats (such as those available on the RealtimeMantra site).
Mathematics and expressions
Checking code from a numerical point of view lets programmers fit data and algorithms perfectly, so that the compiler can enhance both spatial and temporal efficiency. This can be done with a series of tips specific to mathematical data and functions.
13 Integer arithmetic is much faster than floating-point arithmetic
Integer operations can be done directly by the processor, while floating-point operations rely on external FPUs or floating-point maths libraries. In both cases extra work is needed to route data to other processors or invoke external functions. A useful remedy is to evaluate the accuracy the data effectively needs. As an example, if data only needs to be accurate to two decimal places, everything could be scaled up by 100, computed in integers, and then converted back to floating point.
14 Simplify floating-point expressions
Since floating-point operations generally lead to loss of precision, the compiler cannot apply many optimizations which are normally performed on integers. Therefore, it is beneficial to perform floating-point optimizations manually, if it is known they are correct (see the examples in 3.2).
15 Floating point multiplication is faster than division
It is expedient to substitute divisions, especially by a constant, with multiplications by the inverse. For example:

x = x / 3.0

becomes

x = x * (1.0/3.0)

and

x = x / 2.0

becomes

x = x * 0.5

16 Avoid remainder (or modulus) operator (%)
Since the remainder operation implies a division, the use of an if statement is preferable, as it produces much faster code. For example, consider (++count % 60), where count is a variable spanning between 0 and 59: the process will be more efficient with if(++count >= 60) count = 0.
17 Addition is quicker than multiplication
Resolving a product into a repetition of additions is faster, because the same register is used, instead of more than one register for the multiplication operands and result (x+x+x instead of x*3).
18 Calculate in advance expressions between constants
Frequent repetition of expressions between constants is a waste of CPU time, since those calculations will always end with the same result. Calculate them once, instead, and assign them to a variable or defined constant, as Izhikevich (2003) does, assigning the constant 1-a of his algorithm to a variable.
19 Avoid calling other functions
Transcendental functions, like sin, exp and log, as well as sqrt, are implemented using series of multiplications and additions in double precision. As a result, these operations are at least ten times slower than normal ones. A function can often be approximated using a lookup table, which can be equally suitable. A lookup table is usually less accurate than calculating the value properly, but for many applications this does not matter, and it increases performance significantly. Values, at the desired precision, are computed at initialization time and stored in an array. In any case, for space efficiency, it is better to evaluate the effective use, since functions like pow can be resolved into a series of multiplications.
20 Substitute multiplications, divisions and modulus by power of 2 with shift
Shifting bits is equal to multiplying or dividing by a power of two. 00001010 is binary 10; shifting it by one bit (<< 1) results in 00010100, i.e. 10*2=20. The same 10, 00001010, can be divided by 4 (>> 2), becoming 00000010, i.e. 2 with integer truncation. Shift operations generate integers inside the modulus given by the number of bits representing the value. For example, with 4 bits there are the integers from 0 to 15: from 0000 to 1111. If bit 1 is shifted from the rightmost position, 0001, to the leftmost one, 1000, it will yield in turn 1, 2, 4, 8. But a subsequent shift will get to 0000, since the 1 went out of the scope of the number of bits provided. In this way, multiplication, division and modulus become very efficient operations, since they need only one instruction cycle. But, besides the integer-only representation, sign also reduces the representational space, because one bit has to be used to represent the sign value. In addition, sign has to be treated separately during shift operations, so its treatment would be inefficient. For all these reasons, shift operations should be adopted carefully.
Conditional and relational statements
Compilers can be helped in setting up the execution flow by an efficient use of conditional and relational statements.
21 Keep statement body simple
Conditional statements should be short, and the variables involved must have an explicit value at the moment of evaluation. So, in order to let the compiler optimize, functions must be executed outside the statement.
22 Place frequent case labels first
During the execution of a statement, all conditions are tested in left-to-right order. If the most frequent condition is the first, the testing often ends there, dropping the rest. The same is valid for an if-else instruction cascade: if the first statement holds, the other ones are not evaluated and control flow passes to the subsequent instructions.
23 Many ´if´ can be a switch
Switch use should be preferred when case labels are in a narrow range. In this case the compiler does not generate an if-else-if cascade for the switch statement. Instead, it generates a jump table, with a performance independent of the number of case entries in the switch statement. It can be eased further by putting all non-frequent cases into only one label (e.g. default).
23.1 Use lookup tables for complex decision making
Many times, complex decision-making code with cascades of if-else statements can be replaced with a simple lookup table. If the case labels are dense and switch between several small pieces of code, they can be implemented more efficiently using a lookup table. For example, a switch over a narrow range of cases, simply returning several values, can be substituted by direct access (by means of the case value) into a string containing the values to return. Considering the bytes used to express the switch statement and its return values, instead of one single string, the former will end up occupying more space than the latter.
23.2 Break big switch statements into nested switches
Switches, like if-else condition cascades, benefit from the programmer's knowledge of the problem space. For example, case labels, like conditions, can be broken down in a binary fashion, to gain all the power of this kind of search.
Functions and Parameters
Functions allow problem decomposition into smaller parts. The advantage is having separate instructions operating almost independently on data (encapsulation), since the way data are fed to functions is by passing parameters. But function calls have a cost in terms of time efficiency, due to stacking the variables of the calling environment, setting the function's variables into registers to start the computation, and then returning values while restoring the previous environment.
24 __inline or #define very small functions
Compilers substitute each call of a function declared with the keyword __inline with its body, instead of a normal call. This results in faster code, since it removes the overhead of a function call and the associated parameter passing, but it adversely affects space efficiency, particularly if the inlined function is large and used often (e.g. in a loop). Transforming a very small function into a #defined one is the same as explicitly inlining the function.
25 Don't define a return value if not used
Since return values are pushed on the stack and need reading cycles, if the return value of a function is not used or relevant, it should be avoided. The compiler cannot determine whether the return value is being used, so it will always create instructions for an unused return value.
26 Use ´leaf´ functions
A function which does not call any other function is known as a leaf function. Frequently called functions should always be leaf functions. They are compiled very efficiently, and the cost of stacking some registers on entry and popping them on exit is very small compared to the useful work done by a well-operating function. To ensure that a function is compiled as a leaf function, simply avoid calling other functions, or use __inline-ing. Any operation which is converted to a call into the C library (such as division, or any floating-point operation when the software floating-point library is used) will result in non-leaf compilation.
27 Avoid recursion
Recursion can be very elegant and neat, but it creates many more function calls, which can become a large overhead.
28 Avoid function calls in main system loop
If a function is often called from within a loop, it may be possible to put that loop inside the function, avoiding the overhead (and stack changes) of calling the function repeatedly. Alternatively, consider inlining the function.
29 Reduce the number of parameters
If a function has more than 4 parameters, the remaining ones are pushed on the stack and then popped out. In addition, if there are a lot of parameters, some compilers pass everything on the stack for flow consistency. Small functions should have few parameters (or better, none at all), so they will not use the stack for argument passing. If a function needs that many parameters, it must do a considerable amount of work, to give back the time used to set it up.
30 Avoid parameters with type different from bus size
Since C compilers manage parameter passing at integer level, types shorter than the bus size (usually 32 bits), such as char and short, will first be converted into integers and then, after computation, converted back to the original type. For the same reason, minimize the number of long parameters, as these take two argument words. This also applies to doubles if floating-point libraries are enabled.
31 Avoid functions with a variable number of parameters
These functions are effectively compiled by passing all their arguments on the stack. The same holds for functions with a parameter that is passed partially in a register and partially on the stack (split-argument).
32 Use references for passing big parameters
When passing parameters bigger than normal types (such as structures), the cost of copying the object onto the stack can be prohibitive. A pointer to that object would accomplish the same task without exceeding 4 bytes. There are some useful hints about pointers as parameters:
32.1 Put related arguments in a structure, and pass a pointer to the structure to functions
This will reduce the number of parameters and increase readability.
32.2 Declare parameters not altered by the function as pointers to const
This will tell the compiler not to create instructions for returning the value, saving space and time.
Pre-processing
Directives and conditional compilation can help speed up the system. Tokenized substitutions are performed, yielding global values and macro expansions. Changes are made in only one place, keeping code simple, and frequent code fragments are expressed as a single term, making code neat.
33 Perform as many operations as possible at compile-time
Computations and type conversions on constants, and the indexing of arrays with constant indexes, can already be performed by the pre-compiler.
33.1 ´#define´ system variables
In order to keep the code aligned with theory, some constant variable names can be used throughout the code, having only one definition.
33.2 ´#define´ macros
Small pieces of code (not deserving the name of functions) can easily be defined, avoiding the overhead of a function call and keeping the code neat. This practice has one main disadvantage: a function has an exact address that can be stored in a pointer, while a macro can only be expanded.
33.3 Select code to be compiled
Conditional compilation permits having different ways of coding in the same source file. If certain conditions are verified, the compiler will remove code from the binaries, saving text segment size.
Loops and Arrays
Loops are a central part of any simulation over time, so their efficiency is really needed. Counter management, termination conditions and body content should be carefully designed, coded and tested to get the full advantage.
34 Invert counter direction
Decrementing the counter (i--) instead of incrementing it (i++) can produce a faster loop. Indeed, it is quicker to process i-- as the test condition: if i is non-zero, the counter is decremented and the loop continues until the counter reaches zero. This makes a considerable difference, since loop instructions must be considered part of the executed code. Testing the condition and counting are usually kept separate, but merging them is perfectly legal, saving one instruction that, in longer loops, makes a difference.
35 Use simple termination conditions

for(i=0; i<10; i++) ...

is equal to:

for(i=10; i!=0; i--) ...

and, better:

for(i=10; i--; ) ...

In this case the postfix unary operator (--) is exploited to first evaluate the counter, then decrement it.
Long calculations such as function calls need registers (and even stack). Loop conditional expressions are part of the loop, but the compiler optimizes them separately. In such situations it may be wiser to move the condition inside the loop.
36 Break loop early
In fact, it is often not necessary to process the entirety of a loop. Carefully choosing the condition and break-ing speeds up loop termination.
37 Divide one bigger loop into more smaller ones
If the loop does a lot of work or handles huge data, it might not fit into the cache. In this case, separate loops may be faster, as each one can run completely in the cache.
38 Use fixed size arrays
That is, avoid dynamic memory allocation (principle 1). Dynamic memory allocation in a real-time system lowers temporal efficiency, due to the spreading of the allocated memory. Fixed size arrays, with dimensions known at compile time, should be adopted in its place.
39 Array should have power of two data size
The compiler performs array indexing by multiplying the index by the data size. In the case of a structure, if the structure size is a power of 2, an expensive multiply operation will be replaced by an inexpensive shift operation.
40 Indexing order
In the C language, bi-dimensional arrays are stored in row-wise order, so array indexing has to be done looping on rows. For example, having x[10][5]:

for(i=0; i<10; i++){ for(j=0; j<5; j++) ... }

is right, while:

for(j=0; j<5; j++){ for(i=0; i<10; i++) ... }

is wrong, because it jumps over addresses at every entry.
41 Partitioning array into blocks
Principle 2 refers to a blocking system to minimize cache misses and page and segment faults. Functions and structures can be pruned to fit the cache size; arrays need some extra care.
Once the cache size is known, paging activity can be minimized by partitioning large matrices into blocks small enough to fit the cache or page size (whichever is smaller). The array "iteration space" can be rolled by two loops instead of only one: the first jumps block by block, the second rolls through each entry of a block. As an example, take an array x[100], split into blocks of size 10:
for(iB=0; iB<100; iB=iB+10){ for(i=iB; i<iB+10; i++) ... }

This principle leads straightforwardly to time-efficient coding, but it relies on detecting the cache size.
General Principles
On several sites, two principles are frequently cited for adoption during the design phase: the 0-1-N rule and the open-closed principle.
42 "0 or 1 or N rule"
This main principle states that, when designing a system, every feature should be passed through the following sieves:
- Is the feature really needed?
- Having to support the feature, try to implement only one instance
- If only one instance is not enough, then support for "n" instances should be designed
Only if a feature is really needed should it be implemented. Then, if the feature cannot be contained in only one instance, it is better to conceive a general-purpose design, even if the instances are only a few at the moment, because they will surely grow.
43 Open-closed principle
This principle states: "Software entities like classes, modules and functions should be open for extension but closed for modifications". It means that those entities should be designed in such a way that they do not need to be changed when new functionality is added. If applied, this principle grants that most changes will be handled as new functions and new structures, and changes to existing code will be minimized.
Memory Organization
| Level | Capacity | Access time | Transfer rate(GB/s) | Place |
|---|---|---|---|---|
| Registers | ~ 1 KB | ~ 0.2 ns (1 clock cycle) | - | CPU |
| I level cache | ~ 32 KB | ~ 0.4 ns (2/4 clock cycles) | - | Integrated circuit |
| II level cache | ~ 1/2 MB | ~ 1/2 ns (5/10 clock cycles) | ~ 100 | Integrated circuit |
| III level cache | ~ 2/8 MB | ~ 5 ns | ~ 50 | Integrated circuit (not always available) |
| RAM | ~ 2/8 GB | ~ 50 ns (1st WORD), ~ 10 ns (subsequent WORDs) | ~ 5/10 | Motherboard |
| Hard drives | > 300 GB | ~ 10 ms | 0.15/0.6 | Case |


