Multicore - Tools¶
Introduction¶
Multicore is a term associated with parallel processing and refers to the use of several cores that are simultaneously engaged in the execution of a program. Parallel processing divides computationally intensive operations into smaller parts and distributes these tasks among several cores. The goal is to speed up the processing of such operations.
Characteristics of computationally-intensive applications:
Large amounts of data to process
Complex algorithms that require heavy computations
The control program of a system, which was previously executed on a single processor core, can itself be regarded as a complex algorithm. Dividing this complex program flow into several subprograms (task partitioning) that can be executed as independently as possible enables shorter cycle times. This results in a better response time of the system because the sampling rates can be increased.
Goals of task partitioning:
Evenly dividing the effort among all available cores
Minimization of contention for shared resources
Up to now, the processing of very large amounts of data (e.g., image processing in real time) has often been outsourced to specialized control components, because the computing capacity of the PLC's single processor core has not been sufficient to handle both the control logic and the data processing in the required time window.
The ability to use multiple processor cores directly from the PLC program now opens up a wide range of additional possibilities. The parallel processing of data, distributed over several processor cores, now makes it possible to execute a complex processing sequence (e.g., image processing in real time) in the required time window. In many cases, it is no longer necessary to implement parallel data processing outside the PLC with special tools and programming languages. Parallel data processing becomes an integral part of the PLC program.
Example¶
The new possibilities will be illustrated with the example of industrial image processing, where freshly captured images are evaluated and the result influences the sequence of further production steps.
In our example, the image processing consists of three steps:
Image capture due to an external event
Image analysis
Decision on rejection as scrap or forwarding to subsequent process steps
With a single-core PLC, the second step was outsourced to an external system that was triggered by an external event. The result of the image analysis was provided through a physical input of the PLC.
With a multi-core PLC, the following structure is feasible: The image processing takes place in a library block, which is triggered by a local input variable and ensures a minimum processing time for image evaluation based on the number of currently available processor cores. The calculations are thus dynamically distributed to an optimal number of processor cores. Subsequently, the results are summarized in order to decide whether the current product meets the requirements. In addition to the computational effort for image processing, there is also the effort for distributing the data and the effort for summarizing the results. Therefore, the image processing efficiency does not increase linearly with the number of processor cores. The different efforts should be in a reasonable proportion to each other. Only then is a shorter processing time per product possible.
Measurement¶
Below we present a list of procedures and tools to measure and compare the runtime of applications and libraries on single and multicore systems.
- TaskManager:
This tool allows you to view and evaluate basic data on the currently existing tasks. Based on the cycle time and the processing time of the task with the highest priority, the remaining computing time available for the other tasks can be determined very well. This is a first indication of how successful a multi-core distribution can be. If the remaining computing time which the task with the highest priority leaves for the other tasks is very small, it quickly becomes difficult to handle the other tasks adequately in the required time window.
- TraceManager:
If the SysCpuMultiCore and CmpTraceMgr components are integrated in the runtime system, then two system traces are configured, which can be uploaded into the CODESYS environment for analyzing the CPU load and the IEC task load.
- Profiler:
- Profiler Watch List: Provides the possibility to assign a specific instance of a function block to the watch list (also while in online mode). For each entry in the list, the number of calls and the processing times (min, max, avg) are displayed.
- Profiling a Task Cycle: A configuration dialog can be used to determine how the runtime measurement is to be performed. With the next download, the PLC program is instrumented in such a way that the required data is recorded. The result of the measurement can be uploaded into the CODESYS environment. A call tree view gives you an idea of how much time was spent in which part of the program.
- ExecutionTimeReporter:
Each function block from the CommonBehaviourModel family can be observed with an ExecutionTimeReporter. For each state of the BehaviourModel, it records the time intervals in which that state is active. With this data, the real manufacturing operations associated with these states can be analyzed precisely.
Atomic Operations¶
Atomic operations are mainly used to implement locking mechanisms for the synchronization of data and processes. Another variant is the non-blocking synchronization of data structures, whereby objects are modified only by atomic operations and explicit locking of the objects becomes unnecessary.
An operation acting on shared memory is atomic if it completes in a single step relative to other tasks. When an atomic store is performed on a shared variable, no other task can observe the modification half-complete. When an atomic load is performed on a shared variable, it reads the entire value as it appeared at a single moment in time. Non-atomic loads and stores do not make those guarantees.
Typical CPU instructions for implementing atomic operations on current processors are:
Atomic load and store of integral IEC data types like LINT, __XWORD, POINTER, INTERFACE, …
Atomic read-modify-write operations like AtomicTestAndSet, AtomicAdd, or AtomicCompareAndSwap.
Please take a look at the DemoMultiCore.project example project to see the executable methods.
AtomicCompareAndSwap¶
Example
FUNCTION AtomicCompareAndSwap : BOOL
VAR_INPUT
pxwAddress : POINTER TO __XWORD;
xwExpected : __XWORD;
xwNewValue : __XWORD;
END_VAR
IF pxwAddress <> 0 AND_THEN pxwAddress^ = xwExpected THEN
pxwAddress^:= xwNewValue;
AtomicCompareAndSwap := TRUE;
ELSE
AtomicCompareAndSwap := FALSE;
END_IF
With the help of an AtomicTestAndSet the operation AtomicCompareAndSwap can be implemented as:
// +-------------------------------------------------------------------------------------------------------------+
// | ``_xwSync`` is context variable of type ``__XWORD`` to synchronize all tasks/cores in the related context |
// +-------------------------------------------------------------------------------------------------------------+
WHILE AtomicTestAndSet(ADR(_xwSync)) = TRUE DO
SchedWaitSleep(uliSleepOneUs);
END_WHILE
IF pxwAddress <> 0 AND_THEN pxwAddress^ = xwExpected THEN
pxwAddress^:= xwNewValue;
AtomicCompareAndSwap := TRUE;
ELSE
AtomicCompareAndSwap := FALSE;
END_IF
_xwSync := 0;
AtomicTestAndSet¶
Example
The value at the given address is set; if it was already set, TRUE (the old value) will be returned.FUNCTION AtomicTestAndSet : BOOL
VAR_INPUT
pxwAddress : POINTER TO __XWORD;
END_VAR
IF pxwAddress <> 0 THEN
AtomicTestAndSet := pxwAddress^ <> 0;
pxwAddress^ := 1;
ELSE
AtomicTestAndSet := FALSE;
END_IF
With the help of an AtomicCompareAndSwap the operation AtomicTestAndSet can be implemented as:
IF AtomicCompareAndSwap(pxwAddress:=pxwAddress, xwExpected:=0, xwNewValue:=1) THEN
AtomicTestAndSet := FALSE;
ELSE
AtomicTestAndSet := TRUE;
END_IF
AtomicAdd¶
Example
FUNCTION AtomicAdd : __XWORD
VAR_INPUT
pxwAddress : POINTER TO __XWORD;
xwValue : __XWORD;
END_VAR
VAR_OUTPUT
xError : BOOL;
xOk : BOOL;
END_VAR
IF pxwAddress <> 0 THEN
AtomicAdd := pxwAddress^;
pxwAddress^ := AtomicAdd + xwValue;
ELSE
xError := TRUE;
END_IF
With the help of an AtomicCompareAndSwap the operation AtomicAdd can be implemented as:
IF pxwAddress <> 0 THEN
REPEAT
AtomicAdd := pxwAddress^;
xOk := AtomicCompareAndSwap(pxwAddress:=pxwAddress, xwExpected:=AtomicAdd, xwNewValue:=AtomicAdd + xwValue);
UNTIL xOk
END_REPEAT;
ELSE
xError := TRUE;
END_IF
Applications in CODESYS¶
IEC Operators (Some are not yet implemented!):
dwOldValue := TEST_AND_SET(dwValue);
xOk := COMPARE_AND_SWAP(pxwAddress, xwExpectedValue, xwNewValue);
diOldValue := __XADD(pdiAddress, diValueToAdd);
CODESYS runtime system support:
udiResult := SysCpuAtomicCompareAndSwap(pxwAddress, ADR(xwNewValue), ADR(xwExpected), SIZEOF(__XWORD));
diNewValue := SysCpuAtomicAdd(pdiAddress, diValueToAdd, pResult);
udiOldValue := SysCpuTestAndSet(pudiAddress, 0);
Note
The parameters of SysCpuAtomicAdd are based on the DINT data type. The examples above are based on the __XWORD data type.
Memory Ordering¶
Memory reordering performed by the processor at run-time is invisible to a single-threaded program. It only becomes apparent when a shared memory area is manipulated without any mutual exclusion between the related tasks; the effects of processor reordering are only visible in multi-core environments. When writing lock-free code, special care must often be taken to enforce correct memory ordering. Otherwise, surprising things can happen.
Here’s one of the simplest examples. Suppose we have two global integers X and Y, both are initially 0.
VAR GLOBAL
X : INT := 0;
Y : INT := 0;
END_VAR
Two cores, running in parallel, execute the following code:
| Core1 | Core2 |
|---|---|
| `X := 1;` | `Y := 1;` |
| `A := Y;` | `B := X;` |
Each processor stores 1 into one of the integer variables, then loads the other integer into a local variable. Now, no matter which processor writes 1 to memory first, it’s natural to expect the other processor to read that value back, which means we should end up with either A = 1, B = 1, or perhaps both. But that won’t necessarily be the case. Sometimes we can observe that A and/or B equals 0 at the end of this example – a quite counterintuitive result, to say the least!
One way to understand this is that most processor families are allowed to reorder the memory interactions of machine instructions according to certain rules, as long as this never changes the execution of a single-threaded program. In particular, each processor is allowed to delay the effect of a store past any load from a different location. As a result, it might end up as though the instructions had executed in this order:
| Core1 | Core2 |
|---|---|
| `A := Y;` | `B := X;` |
| `X := 1;` | `Y := 1;` |
You can enforce correct memory ordering on the processor by issuing an instruction which acts as a memory barrier.
__MEMORYBARRIER();
Example
A code snippet to illustrate variables and types
VAR GLOBAL
g_xIsPublished : BOOL;
g_datValue : DATA;
END_VAR
A code snippet running on core #1
IF g_xIsPublished THEN // Load and check shared flag
datResult := g_datValue; // Load published value
x := g_datValue.x;
y := g_datValue.y;
END_IF
A code snippet running on core #2
g_datValue.x := x; // Publish some data
g_datValue.y := y; // Publish some data
__MEMORYBARRIER();
g_xIsPublished := TRUE; // Set shared flag to indicate availability of data
Note
If, at a certain position in the program, the result of a sequence of write operations should become visible to a program running on another core, a __MEMORYBARRIER operation must be inserted at exactly this position.
Example
Below, a pseudocode snippet of a queue implementation illustrates another fact: atomic operations like AtomicCompareAndSwap, AtomicAdd, and AtomicTestAndSet act like a memory barrier. Therefore, an extra __MEMORYBARRIER is not necessary in this use case.
TYPE NODE :
STRUCT
pNext : POINTER TO NODE;
END_STRUCT
END_TYPE
VAR
g_pTail : POINTER TO NODE;
g_pHead : POINTER TO NODE;
END_VAR
METHOD Enqueue
VAR_INPUT
pNode : POINTER TO NODE;
END_VAR
VAR
pTail : POINTER TO NODE;
END_VAR
// Keep trying until Enqueue is done
REPEAT
pTail := g_pTail;
IF AtomicCompareAndSwap(ADR(pTail^.pNext), 0, pNode) THEN
// __MEMORYBARRIER();
// Enqueue is done. Exit loop
EXIT;
ELSE
// Tail was not pointing to the last node
// Try to swing Tail to the next node
AtomicCompareAndSwap(ADR(g_pTail), pTail, pTail^.pNext);
// __MEMORYBARRIER();
END_IF
UNTIL FALSE
END_REPEAT
// Enqueue is done. Try to swing Tail to the inserted node
AtomicCompareAndSwap(ADR(g_pTail), pTail, pNode);
// __MEMORYBARRIER();
An interesting aspect of the Enqueue method is that it is lazy: it takes place in two distinct steps. To make this method lock-free, tasks may need to help one another. Normally, the Enqueue method locates the last node in the queue and performs the following two steps:

It calls AtomicCompareAndSwap to append the new node, and
it calls AtomicCompareAndSwap to change the queue’s tail pointer from the prior last node to the current last node.

Because these two steps are not executed atomically, every other method call must be prepared to encounter a half-finished Enqueue call, and to finish the job.
Consistent Values Swapping¶
With the help of a UNION type, a reference value (here an interface reference) can be reinterpreted as an __XWORD and swapped consistently between tasks by means of AtomicCompareAndSwap:
TYPE IObserver_XWORD_Converter :
UNION
itf : IObserver;
xwd : __XWORD;
END_UNION
END_TYPE
METHOD SetObserver : IObserver
VAR_INPUT
itfObserver : IObserver;
END_VAR
VAR_OUTPUT
eErrorID : ERROR;
END_VAR
VAR
xOk : BOOL;
cvtOldValue : IObserver_XWORD_Converter;
cvtNewValue : IObserver_XWORD_Converter;
END_VAR
cvtNewValue.itf := itfObserver;
REPEAT
cvtOldValue.itf := _itfObserver;
xOk := AtomicCompareAndSwap(ADR(_itfObserver), cvtOldValue.xwd, cvtNewValue.xwd);
UNTIL xOk
END_REPEAT
SetObserver := cvtOldValue.itf;