Decentralized Consistency
The design of distributed applications in Lingua Franca requires care, particularly if the coordination of the federation is decentralized. The intent of this post is to illustrate and handle the challenges arising from designing distributed applications in Lingua Franca, with the help of two realistic use cases.
Indefinite wait for inputs: aircraft door use case
Aircraft doors on passenger flights are currently managed manually by flight attendants. Before takeoff, the flight attendants arm the door; if the door is opened in this state, an evacuation slide is automatically inflated and deployed for emergency evacuation. When the aircraft is at a gate, before opening the door, the flight attendants disarm it to avoid the deployment of the evacuation slide. Flight attendants are allowed to disarm the door only when they see through the porthole the ramp that will allow the passengers to disembark the aircraft.
Consider the above Lingua Franca program that implements a simplified system to remotely open an aircraft door that is in the armed state.
The door implements two independent remote services, door disarming and door opening, encoded by two different reactions in the Door reactor.
Suppose the pilot in the cockpit issues a command to open the door.
We would also like to automate the disarming of the door using a camera to verify the presence of a ramp. When the camera determines that the ramp is present, it triggers the disarming service. The camera detection is triggered by the door open command issued from the cockpit.
There are different ways to design and refactor the above system, for example, by removing the direct connection between the Cockpit and Door reactors. Our design choice is meant to highlight that door disarming and opening are two different and independent remote services triggered by two different commands issued by two different system actors. Therefore, each actor has an independent connection to the door to request its service.
The purpose of the system is to open the door in reaction to the command from the cockpit whether or not a ramp is present. If a ramp is present, it is imperative that the door be disarmed before being opened. Hence, the door, upon receiving the open command from the cockpit, should wait for input from the camera before opening.
The order in which messages are processed is crucial in this application. When the disarm and open commands arrive with the same tag, the disarm service needs to be invoked before opening the door, otherwise the escape slide will be erroneously deployed.
Lingua Franca guarantees determinism in the execution order of reactions with logically simultaneous inputs, and the order is given by the order of declaration of the reactions inside the reactor. It is then sufficient to declare the disarm reaction before the open one. The diagram confirms the execution order by labeling the disarm reaction with 1 and the open reaction with 2.
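To make the ordering rule concrete, here is a minimal Python sketch of it. This is a toy model, not Lingua Franca code: the `Door` class, its `step` method, and the log strings are hypothetical names chosen for illustration. It only shows that when both inputs are present at the same tag, running reactions in declaration order disarms the door before it opens.

```python
class Door:
    """Toy model of a reactor whose reactions fire in declaration order."""

    def __init__(self):
        self.armed = True
        self.log = []
        # Declaration order matters: disarm is listed before open.
        self.reactions = [("disarm", self.on_disarm), ("open", self.on_open)]

    def on_disarm(self):
        self.armed = False
        self.log.append("disarmed")

    def on_open(self):
        # Opening an armed door deploys the evacuation slide.
        self.log.append("slide deployed" if self.armed else "opened safely")

    def step(self, present_inputs):
        """Run each reaction whose trigger is present at the current tag."""
        for trigger, reaction in self.reactions:
            if trigger in present_inputs:
                reaction()
        return self.log


both = Door().step({"open", "disarm"})   # logically simultaneous inputs
only_open = Door().step({"open"})        # no ramp detected, door stays armed
```

Swapping the two entries in `reactions` would make the first case deploy the slide, which is exactly the hazard the declaration order prevents.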
The problem is that even though the messages are logically simultaneous, they do not arrive at the same physical time. In fact, the open command from the cockpit is likely to arrive before the clearance from the camera because the camera runs an expensive computer vision algorithm. The door, consequently, has to wait for both inputs before invoking the opening service.
This is an example of an application that cannot safely proceed without assurance on its inputs. The following section explains how to obtain the desired behavior in Lingua Franca using the decentralized coordinator (the centralized coordinator automatically provides the required assurance).
Consistency with decentralized coordination
The application is implemented as a federated program with decentralized coordination, which means that the advancement of logical time in each single federate is not subject to approval from any centralized entities, but it is done locally based on the input it receives from the other federates and on its local physical clock.
Let us consider the case when the Door reactor receives the open command from the Cockpit reactor, but not yet the disarm command from the Camera reactor. As previously observed, the Door cannot proceed to invoke the opening service, because it needs to wait for the Camera to send the disarm command.
But how long should it wait?
The decentralized coordinator in Lingua Franca allows you to customize this waiting time. Each federate can be assigned an attribute called maxwait that controls how long the federate should wait for inputs from other federates before processing an event, such as an input it has just received.
More precisely, maxwait is the maximum amount of time a federate waits before advancing its logical time to some value t. Specifically, to advance to logical time t, the federate waits until either all inputs are known up to and including time t or its local physical clock exceeds t + maxwait.
An input is known up to and including time t if a message with timestamp t or greater has been received on that input port.
At the expiration of the maxwait, the federate assumes that any unresolved ports will not receive any messages with timestamps t or earlier.
It can then advance its logical time to t.
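The advancement rule above can be written as a single predicate. The following Python sketch is a model of the rule as stated in the text, not the runtime's implementation; `can_advance`, `known_up_to`, and `FOREVER` are hypothetical names, and all times are plain numbers in milliseconds.

```python
import math

FOREVER = math.inf  # stands in for maxwait = forever


def can_advance(t, known_up_to, physical_time, maxwait):
    """Return True if a federate may advance its logical time to t.

    known_up_to maps each input port to the largest timestamp received on
    that port so far (use -math.inf if nothing has arrived yet).
    """
    # Condition 1: every input is known up to and including t.
    all_known = all(ts >= t for ts in known_up_to.values())
    # Condition 2: the local physical clock has passed t + maxwait.
    timed_out = physical_time > t + maxwait
    return all_known or timed_out


# The open command has arrived with timestamp 100, the disarm command has not.
ports = {"open": 100, "disarm": -math.inf}
finite = can_advance(100, ports, physical_time=131, maxwait=30)
never = can_advance(100, ports, physical_time=10_000, maxwait=FOREVER)
```

With a finite maxwait of 30 the federate gives up waiting once the clock passes 130; with `FOREVER` the second condition can never hold, so only the arrival of the missing input unblocks it.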
In our example, we want the door to wait indefinitely for both disarm and open commands to arrive before processing any of them. In Lingua Franca, this is obtained by setting maxwait to forever. The Door reactor cannot safely proceed without assurance about the inputs.
The implementation of the Door reactor and its instantiation are shown below:
The maxwait attribute is specified at instantiation time within the main reactor. Right before creating the instance of the Door reactor for which we want to set the attribute, we use the @maxwait annotation that takes as input the maxwait value.
The reactions of the Door reactor provide fault handlers that are invoked in case the federate assumed inputs were known up to timestamp t and then later received a message with timestamp t or less. When maxwait is forever, these fault handlers should never be invoked.
For finite values of maxwait, it is always possible for messages to get sufficiently delayed that the fault handlers will be invoked.
When they are invoked, the current tag will be greater than the intended tag of the message.
This type of fault is called a safe-to-process (STP) violation because messages are being handled out of tag order.
The intended tag of the input can be accessed as shown in the code above.
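As a rough illustration of what an STP violation looks like from the receiving side, the sketch below compares a message's intended tag with the federate's current tag. It is a simplified model under stated assumptions: `Tag`, `classify_message`, and the returned labels are hypothetical, tags are modeled as (time, microstep) pairs compared lexicographically, and times are in milliseconds.

```python
from typing import NamedTuple


class Tag(NamedTuple):
    time: int       # logical time in ms
    microstep: int


def classify_message(current_tag, intended_tag):
    """Decide how a message is handled, given the federate's current tag.

    Returns a (handler, lag_ms) pair, where lag_ms is how far logical time
    has already moved past the tag the message was meant for.
    """
    if intended_tag < current_tag:
        # The federate already advanced past the intended tag: STP violation.
        return ("fault_handler", current_tag.time - intended_tag.time)
    # The federate has not passed this tag yet, so the message is in order.
    return ("reaction", 0)


late = classify_message(Tag(150, 0), Tag(100, 0))
on_time = classify_message(Tag(100, 0), Tag(100, 0))
```

A fault handler could use the lag to decide whether the stale input is still worth acting on.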
Multirate inputs: automatic emergency braking
Consider the above Lingua Franca implementation of an automatic emergency braking system, one of the most critical ADAS systems that modern cars are equipped with.
The controller system modeled by the AutomaticEmergencyBraking reactor reads data coming from two sensors, a lidar and a radar, and uses both to detect objects or pedestrians that cross the trajectory of the car.
This is a sensor fusion problem, where a diversity of sensors is used to get better reliability.
When one of the two sensors signals the presence of an object at a distance shorter than a configurable threshold, the controller triggers the brake to stop the car and avoid crashing into it.
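The braking rule just described amounts to a threshold check over whichever readings are present. The Python sketch below is only an illustration: `should_brake`, its parameters, and the use of `None` for an absent input are assumptions, and the distances and threshold are made-up values in meters.

```python
def should_brake(lidar_m, radar_m, threshold_m):
    """Return True if any present reading is closer than the threshold.

    A reading of None means that sensor has no input at the current tag.
    """
    readings = [d for d in (lidar_m, radar_m) if d is not None]
    return any(d < threshold_m for d in readings)


clear = should_brake(lidar_m=42.0, radar_m=40.5, threshold_m=20.0)
radar_hit = should_brake(lidar_m=42.0, radar_m=12.0, threshold_m=20.0)
lidar_only = should_brake(lidar_m=8.0, radar_m=None, threshold_m=20.0)
```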
The sensors are modeled with their own timer that triggers the generation of data. The clocks of all federates are automatically synchronized by the clock synchronization algorithm of the Lingua Franca runtime (unless this is disabled). Typically, in a real use case of this kind, the clock of sensor devices cannot be controlled by Lingua Franca, but a way to work around this limitation is to resample the data collected by sensors with the timing given by a clock that the runtime can control. The sensor reactors of our application are then modeling this resampling of sensor data so that alignment of data from the two sensors is well defined and sensor fusion becomes possible.
The lidar sensor has a sampling frequency that is twice that of the radar, as indicated by the timers in the corresponding reactors; the lidar timer has a period of 50ms, while that of the radar is 100ms.
Their deadline is equal to their period and is enforced using the dedicated DeadlineCheck reactors, following the guidelines of how to work with deadlines.
The sensor behavior in the application can be simulated for testing purposes in a way that each sensor constantly produces distance values above the threshold (i.e., no objects in the way), and then at a random time it sends a distance value below the threshold, indicating the presence of a close object. When the AutomaticEmergencyBraking reactor receives that message, it signals the BrakingSystem reactor to brake the car, and the whole system shuts down.
Desired system properties
Availability is a crucial property of this application, because we want the automatic emergency braking system to brake as fast as possible when a close object is detected. Consistency is also necessary, as sensor fusion happens with sensor data produced at the same logical time. Even if this is not implemented in our simplified example, sensor fusion in a more general scenario helps rule out false positives, i.e., cases in which one of the sensors erroneously detects a close object that would induce an unnecessary and dangerous braking. False positives are caused by the weaknesses of the specific sensor. For example, rainy or foggy weather reduces the accuracy of lidar sensors. The key concept is to gather data produced at the same logical time by all sensors and combine them to have a more accurate estimate of possible collisions. Consistency and in-order data processing are then required.
Consistency challenge
The application is once again implemented as a federated program with decentralized coordination.
Consistency problems may arise when a federate receives data from two or more federates, as is the case for the AutomaticEmergencyBraking reactor.
The controller expects to receive input from both sensors at times 0ms, 100ms, 200ms, etc. Let's consider as an example the case where the remote connection between the controller and the radar has a slightly larger delay than that between the controller and the lidar. The lidar input will then always arrive slightly earlier than the radar one. When the controller receives the lidar input, should it process the data immediately, or should it wait for the radar input to come? Sensor fusion requires consistency: if the controller processes the input from the lidar and then the radar data comes, the control action elaborated upon the arrival of the lidar data does not take into account both sensors, even though it should. Hence, in our use case, the AutomaticEmergencyBraking reactor needs to wait for both inputs before processing new data.
In our application, if we aim to process all incoming data with the same logical time to realize sensor fusion, then we can set maxwait = forever to wait indefinitely for the radar input before processing the lidar input.
Note that this might not be a good choice in this example because if a fault causes one of the sensors to stop sending messages, the ADAS system will stop working.
Hence, in practice, we will probably want a smaller value for maxwait, and we will want to add fault detection and mitigation to the application.
Fault handling will be addressed in a later blog. Here we assume no such faults.
Availability challenge
Even without faults, however, setting maxwait to forever creates problems when only the lidar input is expected (50ms, 150ms, 250ms, etc): the controller cannot process that input until an input from the radar comes, because maxwait will never expire. For example, if the single lidar input comes at time 50ms, it has to wait until time 100ms before being processed. If that input was signaling the presence of a close object, the detection would be delayed by 50ms, which may potentially mean crashing into the object. The automatic emergency braking system must be available, otherwise it might not brake in time to avoid collisions.
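The 50ms figure follows from simple arithmetic on the radar period. The sketch below works it out for a few tags; it is a back-of-the-envelope model that ignores network latency, and `wait_with_maxwait_forever` is a hypothetical helper name.

```python
RADAR_PERIOD_MS = 100


def wait_with_maxwait_forever(t_ms):
    """Logical-time gap before a lidar sample tagged t_ms can be processed.

    With maxwait = forever, the radar input must be known up to t_ms, which
    first happens when the radar message tagged at or after t_ms arrives.
    """
    # Round t_ms up to the next multiple of the radar period.
    next_radar_ms = -(-t_ms // RADAR_PERIOD_MS) * RADAR_PERIOD_MS
    return next_radar_ms - t_ms


waits = {t: wait_with_maxwait_forever(t) for t in (0, 50, 100, 150)}
```

Tags where both sensors fire incur no extra wait, while every lidar-only tag is held back by half a radar period.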
The ideal maxwait value for maximum availability in the time instants with only the lidar input is 0, because if a single input is expected, no wait is necessary.
Summing up, consistency for sensor fusion requires maxwait = forever when inputs from both sensors are expected (or some finite value for fault tolerance), while availability calls for maxwait = 0 when only the lidar input is coming. The two values are at odds, and any value in between would mean sacrificing both properties at the same time.
Dynamic adjustment of maxwait
The knowledge of the timing properties of the application under analysis enables the a priori determination of the time instants when both inputs are expected and those when only the lidar has new data available.
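Because both timers are periodic and start at time 0, the set of inputs expected at any tag can be computed ahead of time. A minimal sketch, assuming the 50ms and 100ms periods given earlier (`expected_inputs` is a hypothetical helper, not part of Lingua Franca):

```python
LIDAR_PERIOD_MS = 50
RADAR_PERIOD_MS = 100


def expected_inputs(t_ms):
    """Return the sensors that produce data at logical time t_ms."""
    sensors = []
    if t_ms % LIDAR_PERIOD_MS == 0:
        sensors.append("lidar")
    if t_ms % RADAR_PERIOD_MS == 0:
        sensors.append("radar")
    return sensors


schedule = {t: expected_inputs(t) for t in range(0, 250, 50)}
```

The schedule alternates between tags with both inputs and tags with the lidar input only, which is the pattern the dynamic adjustment exploits.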
Lingua Franca allows the maxwait to be changed dynamically in the reaction body using the lf_set_fed_maxwait API, which takes the new maxwait value as its input parameter.
This capability of the language permits the automatic emergency braking federate to:
- start with maxwait statically set to forever (or some finite value for fault tolerance), because at time 0 (startup) both sensors produce data;
- set maxwait to 0 after processing both inputs with the same logical time, because the next data will be sent by the lidar only;
- set maxwait back to forever after processing the lidar input alone, because the next data will be sent by both sensors.
This dynamic solution guarantees both consistency and availability as long as lidar data arrives within 50 ms.
The implementation and the instantiation of the AutomaticEmergencyBraking reactor are shown below:
The sensor_fusion() function combines the data and returns true if braking is needed.
The lidar_analysis() function uses only lidar data to make a (presumably more conservative) decision.
The n_invocs integer state variable counts the number of times the reaction of the AutomaticEmergencyBraking reactor is invoked. This variable is used to determine how many inputs the reaction expects to see at the next invocation and set the maxwait accordingly. Even invocation numbers mean that the next reaction invocation will happen with both sensor inputs present, so maxwait is set to forever; with odd invocation numbers, the next reaction invocation will see new data from the lidar only, and maxwait is then set to 0.
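The parity logic can be checked with a small Python model. This is a sketch of the bookkeeping only: `AebModel` and `react` are hypothetical names, the plain attribute assignment stands in for the call to lf_set_fed_maxwait, and it assumes the counter is incremented before the parity test.

```python
import math

FOREVER = math.inf  # stands in for maxwait = forever


class AebModel:
    """Toy model of the invocation counter in AutomaticEmergencyBraking."""

    def __init__(self):
        self.n_invocs = 0
        self.maxwait = FOREVER  # static initial value: both sensors fire at 0

    def react(self):
        """Count one reaction invocation and pick maxwait for the next one."""
        self.n_invocs += 1
        if self.n_invocs % 2 == 0:
            self.maxwait = FOREVER  # next invocation expects lidar and radar
        else:
            self.maxwait = 0        # next invocation expects lidar only
        return self.maxwait


model = AebModel()
sequence = [model.react() for _ in range(4)]  # invocations at 0, 50, 100, 150 ms
```

The resulting sequence alternates 0, forever, 0, forever, matching the lidar-only tags at 50ms and 150ms and the two-sensor tags at 100ms and 200ms.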
Clearly, detecting and handling faults would be needed in a practical implementation. This will be the topic of a subsequent blog.