Embedded Control System

(Muppala, Ciardo & Trivedi)

Introduction
The Model
Experimental Results

Related publications: [KNP04c, KNP07b]

Introduction

This case study illustrates the applicability of probabilistic model checking and the PRISM tool to the process of dependability analysis. We model an embedded control system, closely based on the one presented in [MCT94], the structure of which is shown below.

The system comprises an input processor (I) which reads and processes data from three sensors (S1, S2, S3). It is then polled by a main processor (M) which, in turn, passes instructions to an output processor (O). This process occurs cyclically, with the length of the cycle controlled by a timer in the main processor. The output processor takes the instructions it receives and uses them to control two actuators (A1, A2). Finally, a bus connects the three processors together. A concrete example of such a system might be a gas boiler, where the sensors are thermostats and the actuators are valves.

Any of the three sensors can fail, but they are used in triple modular redundancy: the input processor can determine sufficient information to proceed provided two of the three are functional. If more than one becomes unavailable, the processor reports this fact to the main processor and the system is shut down. In similar fashion, it is sufficient for only one of the two actuators to be working, but if this is not the case, the output processor tells the main processor to shut the system down. The I/O processors themselves can also fail. This can be either a permanent fault or a transient fault. In the latter case, the situation can be rectified automatically by the processor rebooting itself. In either case, if the I or O processor is unavailable and this leads to M being unable either to read data from I or send instructions to O, then M is forced to skip the current cycle. If M detects that the number of consecutive cycles skipped exceeds a limit, MAX_COUNT, then it assumes that either I or O has failed and shuts the system down. Unless specified otherwise, we assume a value of 2 for MAX_COUNT. Lastly, the main processor can also fail, in which case the system is automatically shut down.

The mean times to failure for a sensor, actuator and processor are 1 month, 2 months and 1 year, respectively. A transient fault occurs in a processor once a day on average. The mean times for a timer cycle to expire and for a processor reboot to complete are 1 minute and 30 seconds, respectively. We assume that all these delays are distributed exponentially; hence the system can be modelled as a continuous-time Markov chain (CTMC).

The Model

Our PRISM model of the system comprises 6 modules, one for the sensors, one for the actuators, one for each processor and one for the connecting bus. Below we give the PRISM language description.

ctmc

// constants
const int MAX_COUNT;
const int MIN_SENSORS = 2;
const int MIN_ACTUATORS = 1;

// rates
const double lambda_p = 1/(365*24*60*60); // 1 year
const double lambda_s = 1/(30*24*60*60); // 1 month
const double lambda_a = 1/(2*30*24*60*60); // 2 months
const double tau = 1/60; // 1 min
const double delta_f = 1/(24*60*60); // 1 day
const double delta_r = 1/30; // 30 secs

// sensors
module sensors
	
	s : [0..3] init 3; // number of sensors working
	
	[] s>1 -> s*lambda_s : (s'=s-1); // failure of a single sensor
	
endmodule

// input processor
// (takes data from sensors and passes onto main processor)
module proci
	
	i : [0..2] init 2; // 2=ok, 1=transient fault, 0=failed
	
	[] i>0 & s>=MIN_SENSORS -> lambda_p : (i'=0); // failure of processor
	[] i=2 & s>=MIN_SENSORS -> delta_f : (i'=1); // transient fault
	[input_reboot] i=1 & s>=MIN_SENSORS -> delta_r : (i'=2); // reboot after transient fault
	
endmodule

// actuators
module actuators
	
	a : [0..2] init 2; // number of actuators working
	
	[] a>0 -> a*lambda_a : (a'=a-1); // failure of a single actuator
	
endmodule

// output processor
// (receives instructions from main processor and passes onto actuators)
module proco = proci [ i=o, s=a, input_reboot=output_reboot, MIN_SENSORS=MIN_ACTUATORS ] endmodule

// main processor
// (takes data from proci, processes it, and passes instructions to proco)
module procm
	
	m : [0..1] init 1; // 1=ok, 0=failed
	count : [0..MAX_COUNT+1] init 0; // number of consecutive skipped cycles
	
	// failure of processor
	[] m=1 -> lambda_p : (m'=0);
	// processing completed before timer expires - reset skipped cycle counter
	[timeout]  comp -> tau : (count'=0);
	// processing not completed before timer expires - increment skipped cycle counter
	[timeout] !comp -> tau : (count'=min(count+1, MAX_COUNT+1));
	
endmodule

// connecting bus
module bus
	
	// flags
	// main processor has processed data from input processor
	// and sent corresponding instructions to output processor (since last timeout)
	comp : bool init true; 
	// input processor has data ready to send
	reqi : bool init true; 
	// output processor has instructions ready to be processed
	reqo : bool init false;
	
	// input processor reboots
	[input_reboot]  true -> 1 :
	// performs a computation if has already done so or
	// it is up and ouput clear (i.e. nothing waiting)
	(comp'=(comp | (m=1 & !reqo))) 
	// up therefore something to process
	& (reqi'=true)
	// something to process if not functioning and either
	// there is something already pending
	// or the main processor sends a request
	& (reqo'=!(o=2 & a>=1) & (reqo | m=1));
	
	// output processor reboots
	[output_reboot] true -> 1 :
	// performs a computation if it has already or
	// something waiting and is up
	// (can be processes as the output has come up and cleared pending requests)
	(comp'=(comp | (reqi & m=1)))
	// something to process it they are up or
	// there was already something and the main processor acts
	// (output now up must be due to main processor being down)
	& (reqi'=(i=2 & s>=2) | (reqi & m=0))
	// output and actuators up therefore nothing can be pending
	& (reqo'=false);
	
	// main processor times out
	[timeout] true -> 1 :
	// performs a computation if it is up something was pending
	// and nothing is waiting for the output
	(comp'=(reqi & !reqo & m=1))
	// something to process if up or
	// already something and main process cannot act 
	// (down or outputs pending)
	& (reqi'=(i=2 & s>=2) | (reqi & (reqo | m=0)))
	// something to process if they are not functioning and 
	// either something is already pending
	// or the main processor acts
	& (reqo'=!(o=2 & a>=1) & (reqo | (reqi & m=1)));
	
endmodule

// the system is down
formula down = (i=2&s<MIN_SENSORS)|(count=MAX_COUNT+1)|(o=2&a<MIN_ACTUATORS)|(m=0);
// transient failure has occured but the system is not down
formula danger = !down & (i=1 | o=1);
// the system is operational
formula up = !down & !danger;

// reward structures

rewards "up"
	up : 1/3600;
endrewards
rewards "danger"
	danger : 1/3600;
endrewards
rewards "down"
	down : 1/3600;
endrewards

View: printable version Download: embedded.sm

The module for the sensors, includes an integer variable s representing the number of sensors currently working. The module's behaviour is described by one guarded command which represents the failure of a single sensor. Its guard "s>0" states this can occur at any time, except when all sensors have already failed. The action s'=s-1 simply decrements the counter of functioning sensors. The rate of this action is s*r, where r is the rate for a single sensor and s is the PRISM variable referred to previously which denotes the number of active sensors.

The module for the input has a single variable i with range [0..2] which indicates which of the three possible states the processor is in, i.e. whether it is working, is recovering from a transient fault, or has failed. The three guarded commands in the module correspond, respectively, to the processor failing, suffering a transient fault, and rebooting. The commands themselves are fairly self explanatory. Two points of note are as follows. Firstly, the guards of these commands can refer to variables from other modules, as evidenced by the use of s>=2. This is because the input processor ceases to function once it has detected that less than two sensors are operational. Secondly, the last command contains an additional label input_reboot, placed between the square brackets at the start of the command. This is used for synchronising actions between modules, i.e. allowing two or more modules to make transitions simultaneously. Here, this is used to notify the main processor of the reboot as soon as it occurs.

Model Statistics

The tables below shows statistics and construction times of the models when the value of MAX_COUNT varies. The tables include:

the number of states in the CTMC representing the model;
both the number of nodes and leaves of the MTBDD which represents the model;
the construct time which equals the time taken for the system description to be parsed and converted to the MTBDD representing it, and the time to perform reachability, identify the non-reachable states and filter MTBDD accordingly;
the number of iterations required to find the reachable states (which is performed via a BDD based fixpoint algorithm).

MAX_COUNT	States	Nodes	Leaves	Construction
MAX_COUNT	States	Nodes	Leaves	Time (s)	Iter.s
2	2633	1194	9	0.065	14
3	3478	1144	9	0.066	15
4	4323	1558	9	0.082	16
5	5168	1448	9	0.073	17
6	6013	1586	9	0.083	18
7	6858	1405	9	0.076	19
8	7703	1950	9	0.082	20

Experimental Results

We have used PRISM to analyse a number of dependability properties using probabilistic model checking. The full list of properties is available here. First, we consider the probability of the system shutting itself down. Note that there are four distinct types of failure which can cause a shutdown: faults in

the sensors
the actuators
the input/output processors
the main processor.

We can analyse how likely each of these is to be the cause of the shutdown, as time passes. We use the CSL property:

P=? [ !"down" U<=T "fail_j" ]

where j=1...4, refers to one of the four failures above and "down" denotes that any of the failures has occurred, i.e. "down"="fail_1" | "fail_2" | "fail_3" | "fail_4". The atomic propositions which make up the property would in practice be predicates over the variables from the PRISM language description. More precisely:

"fail_1", the failure of more than one sensor, is written "s<2 & i=2", meaning that the number of working sensors has dropped below 2 and the input processor is functioning (and so can report the failure).
"fail_2", the failure of more than one actuator, is written "a<1 & o=2", meaning that the number of working actuators has dropped below 1 and the output processor is functioning (and so can report the failure).
"fail_3", the failure of the input out processors, is written "COUNT=MAX_COUNT+1", meaning that the number of consecutive cycles skipped exceeds a limit MAX_COUNT.
"fail_4", the failure of the main processor, is written "m=0", meaning that the main processor has failed.

The above property denotes the probability that failure j is the first to occur. Note that, for example, if an actuator fails, the sensors, unaware of this, will continue to operate and may subsequently fail. Hence, we need to determine the likelihood of each failure occurring, before any of the others do. In the figures below, we plot the results of this computation over two ranges of values for T: the first 24 hours and the first month of operation. We can see, for example, that while initially the I/O processors are more likely to cause a system shutdown, in the long run it is the sensors which are most unreliable.

By omitting the bound <=T from the above CSL formula:

P=? [ !"down" U "fail_j" ]

we can ask the model checker to compute the long-run failure probability of each failure type. The results are:

Failure type		Probability
Sensors		0.6214
Actuators		0.0877
I/O processors		0.2425
Main processor		0.0484

Note that the above probabilities sum to one, i.e. the system will eventually shut down with probability 1.

Second, we consider a number of cost-based properties. We classify the states of our model into three types: "up", where everything is functioning, "down", where the system has shut down, and "danger", where a (possibly transient) failure has occurred but has yet to cause a shutdown (e.g. if the I or O processor has failed but the M processor has yet to detect this). We can reason about the time spent in each class by assigning a cost of 1 to those states, 0 to all others (as shown in the reward structures "down", "danger" and "up" defined in the prism description above), and then computing the total cumulated reward. We use the properties:

R{"reward_structure"}=? [ C<=T ]
R{"reward_structure"}=? [ F "down" ]

These give us the expected time spent in each class, up until time T and before the system shuts down, respectively.

We plot the results for the first value in the figures below, again for two ranges of T: the first day and the first month of operation.

plots expected times until time T (hours)

plots expected times until time T (days)

In the table below, we give the results of the second property, as we vary MAX_COUNT, the number of skipped cycles which the main processor waits before deciding that the input/output processors have failed. We see that increasing the value of MAX_COUNT increases the expected time until failure, but also has an adverse effect on the expected time spent in "danger" states.

MAX_COUNT:	Expected time:
MAX_COUNT:	danger (hrs):	up (days):
1	0.236	14.323
2	0.293	17.660
3	0.318	19.100
4	0.327	19.628
5	0.330	19.809
6	0.331	19.871
7	0.332	19.891

Finally, to illustrate a different type of cost structure, we compute the expected number of processor reboots which occur over time. This is done by assigning a cost of 1 to all transitions which constitute a reboot and 0 to all others. The results are plotted below.

const double T; // time bound

// causes of failues
label "fail_sensors" = i=2&s<MIN_SENSORS; // sensors have failed
label "fail_actuators" = o=2&a<MIN_ACTUATORS; // actuators have failed
label "fail_io" = count=MAX_COUNT+1; // IO has failed
label "fail_main" = m=0; // ,main processor has failed

// system status
label "down" = (i=2&s<MIN_SENSORS)|(count=MAX_COUNT+1)|(o=2&a<MIN_ACTUATORS)|(m=0); // system has shutdown
label "danger" = !down & (i=1 | o=1); // transient fault has occured
label "up" = !down & !danger; // system is operational

// Probability of any failure occurring within T hours
P=? [ true U<=T*3600 "down" ]

// Probability of each failure type occurring first (within T hours)
P=? [ !"down" U<=T*3600 "fail_sensors" ]
P=? [ !"down" U<=T*3600 "fail_actuators" ]
P=? [ !"down" U<=T*3600 "fail_io" ]
P=? [ !"down" U<=T*3600 "fail_main" ]

// Probability of each failure type occurring first (within T days)
P=? [ !"down" U<=T*3600*24 "fail_sensors" ]
P=? [ !"down" U<=T*3600*24 "fail_actuators" ]
P=? [ !"down" U<=T*3600*24 "fail_io" ]
P=? [ !"down" U<=T*3600*24 "fail_main" ]

// Probability of any failure occurring within T days
P=? [ true U<=T*3600*24 "down" ]

// Long-run probability of each failure type occurring
P=? [ !"down" U "fail_sensors" ]
P=? [ !"down" U "fail_actuators" ]
P=? [ !"down" U "fail_io" ]
P=? [ !"down" U "fail_main" ]

// Expected time spent in "up"/"danger"/"down" by time T
R{"up"}=? [ C<=T*3600 ]
R{"danger"}=? [ C<=T*3600 ]
R{"down"}=? [ C<=T*3600 ]

// Expected time spent in "up"/"danger" before "down"
R{"up"}=? [ F "down" ]
R{"danger"}=? [ F "down" ]

View: printable version Download: embedded.csl

Embedded Control System

Contents

Introduction

The Model

Experimental Results

Case Studies