System Analysis and Modeling In this work

2020-05-27

System Analysis and Modeling - In this work, a three-level hierarchical model, combining combinatorial and state-space models, is used: (1) first, the failure rates of the TMS server subsystem units are estimated through continuous-time Markov chains; (2) then, the single server node failure rate is produced through a Fault Tree, which ORs the results of the previous phase; (3) finally, at the top level, the overall cluster is modeled. The resulting model is specified and analyzed using a formal verification method, namely the probabilistic model checking tool PRISM [18]. This allows automatic verification of specific properties of the probabilistic model, which is useful to determine whether the cluster complies with SIL2 requirements.

Mitigation Strategies Enforcement - The most relevant outcomes of the TMS modeling are leveraged to propose ways to improve the final cluster THR. In this work, mitigation strategies are proposed to address both software and hardware failures. The identified solutions were: (i) increasing the number M of Active nodes and enforcing an M-to-1 configuration; (ii) reducing Single Points of Failure (SPF); (iii) enforcing software rejuvenation of the server nodes to reduce the impact of aging-related failures, taking advantage of the system reboot to alternate the Active node. The latter is chosen because it is a good compromise between cost and impact on the overall THR.

Experimental Validation - Experiments play an important role in the certification process. Standards (e.g., EN50129) clearly state that “Fail-safe behavior of component under adverse conditions shall be demonstrated”, and that it is desirable to obtain “Evidences that the failure mode will not occur as a result of component ratings being exceeded”.
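Step (2) of the hierarchical modeling, in which a Fault Tree ORs the unit failure rates, can be illustrated with a minimal Python sketch. It assumes independent units with exponentially distributed times to failure, so that the OR gate reduces to a sum of rates. The unit names and numeric rates below are hypothetical placeholders, not values from the paper.

```python
import math

def or_gate_rate(unit_rates):
    """Failure rate of an OR gate over independent exponential units.

    The server fails as soon as any unit fails, so the rates add up.
    """
    return sum(unit_rates)

def or_gate_failure_probability(unit_rates, t_hours):
    """Probability that at least one unit has failed by time t_hours."""
    survival = 1.0
    for rate in unit_rates:
        # Each unit survives up to t with probability exp(-rate * t).
        survival *= math.exp(-rate * t_hours)
    return 1.0 - survival

# Hypothetical per-hour failure rates of the server subsystem units
# (illustrative values only, not taken from the paper).
unit_rates_per_hour = {"hardware": 2e-6, "os": 5e-6, "crm": 3e-6}

server_rate = or_gate_rate(unit_rates_per_hour.values())
p_fail = or_gate_failure_probability(unit_rates_per_hour.values(), 1000.0)
print(f"server failure rate: {server_rate:.3e} per hour")
print(f"P(server failure within 1000 h): {p_fail:.6f}")
```

Under these assumptions the two functions agree: the probability computed unit by unit equals 1 - exp(-server_rate * t).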
The experimental phase in this paper aims to: (1) demonstrate the accuracy of the TMS node failure rate estimation; (2) determine the TMS TTARF in order to choose a proper rejuvenation period. The system is subjected to a stress loading scheme through a workload generator. The resulting failure and degradation data sets are then used for a QALT/ADT analysis. QALT and ADT have proven to be effective solutions to measure reliability metrics and, at the same time, to understand and quantify the effects of stress. They are usually applied to HW components; however, they have also been shown to be feasible for observing the behavior of SW suffering from software aging [20], [22].
Modeling and analysis - The three-level hierarchical modeling of the TMS cluster is described in more detail here. In the remainder of this section, different rates, all exponentially distributed, are used. While some, like failure rates, depend on the unit or system under study, others are fixed. The MTTR, for example, is based on ASTS service agreements, which guarantee unit replacement within 18 h. The Time to Switch, equal to 30 s, is evaluated through on-field tests of Active-Standby switches. In the same manner, the Time to Reboot has been measured and is equal to 302 s.
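Because the rates are exponentially distributed, each fixed time above maps to a transition rate equal to the reciprocal of its mean. The following Python sketch performs that conversion in per-hour units. Treating the 18 h replacement guarantee as a mean repair time is an assumption made here for illustration, and the failure rate used in the availability example is a hypothetical placeholder.

```python
SECONDS_PER_HOUR = 3600.0

def rate_per_hour(mean_time_hours):
    """Rate of an exponential distribution with the given mean (in hours)."""
    if mean_time_hours <= 0:
        raise ValueError("mean time must be positive")
    return 1.0 / mean_time_hours

def steady_state_availability(failure_rate, repair_rate):
    """Long-run availability of a two-state (up/down) Markov model."""
    return repair_rate / (failure_rate + repair_rate)

# Fixed times quoted in the text, converted to per-hour rates.
# Assumption: the 18 h replacement bound is used as the mean time to repair.
repair_rate = rate_per_hour(18.0)
switch_rate = rate_per_hour(30.0 / SECONDS_PER_HOUR)   # about 120 per hour
reboot_rate = rate_per_hour(302.0 / SECONDS_PER_HOUR)

# Hypothetical unit failure rate (per hour), for illustration only.
example_failure_rate = 1e-5
availability = steady_state_availability(example_failure_rate, repair_rate)

print(f"repair rate: {repair_rate:.5f} per hour")
print(f"switch rate: {switch_rate:.2f} per hour")
print(f"reboot rate: {reboot_rate:.3f} per hour")
print(f"availability of the example unit: {availability:.6f}")
```

Keeping every rate in the same time unit matters when the values are fed into a single Markov model, since mixing seconds and hours would distort the transition probabilities.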
Mitigation strategies - The results of the TMS formal model verification showed that the current system configuration is not compliant with SIL2 bounds. Hence, mitigation strategies need to be defined and applied in order to reach the desired level of reliability. One approach consists in using different cluster configurations; in this sense, two possibilities may be pursued. An additional approach aims at reducing the failure probability of the most critical components, i.e., the COTS OS and its CRM, which proved to be the weak points of the cluster (Fig. 5c). Regarding the OS, measures can be taken, for example, at kernel level, as suggested by Pierce et al. [3], who provide thorough guidelines in this sense. The idea is to configure the kernel to serve only the critical application by disabling unused modules, peripheral drivers, the X Window System graphical interface, and unused user processes. Other actions can be enforced on Linux SUSE and, in particular, on Pacemaker/Corosync: there are CRM settings that can affect the response of the system under failure conditions and that define policies for the management of the critical service. All of these, however, provide only a small improvement, which is also difficult to quantify.