#### HIGHLY RELIABLE DESIGN BASED ON TSC CIRCUITS

Pavel Kubalík

Informatics and Computer Science, 4th year, part-time study. Supervisor: Hana Kubátová

Department of Computer Science and Engineering, Czech Technical University, Karlovo nám. 13, 121 35 Prague 2

e-mail: (xkubalik, kubatova)@fel.cvut.cz

Abstract. This paper deals with the architecture of highly reliable digital circuits based on totally self-checking blocks implemented in FPGAs. A duplex system is used as the basic structure of this reliable design. The whole design implemented in an FPGA is divided into individual functional parts. Every part is modified to ensure totally self-checking properties, which are evaluated using our method of detailed fault classification. Reconfiguration is used to further increase the reliability parameters. Combinational benchmark circuits, represented as two-level networks (truth tables), have been used to evaluate the quality of the adapted duplex system. All experimental results were obtained from implementations in Xilinx FPGAs using EDA tools.

**Keywords**. FPGA, totally self-checking (TSC) circuit, dependability, concurrent error detection (CED), error detecting codes.

## **1** Introduction

The design process is shorter when an FPGA, rather than an ASIC, is used as the final implementation platform. FPGAs enable in-system reconfiguration to correct functional bugs or to update the circuit design to implement new standards. FPGA devices are also used in mission-critical applications such as aviation, medicine and space missions, where the design must be reliable.

The FPGA configuration is stored in SRAM, and any change of this memory may lead to a malfunction of the implemented circuit. A Single Event Upset (SEU) [1], caused by high-energy particles striking sensitive parts of the device, is one way in which the configuration memory can be changed. Some effects of SEUs on the FPGA configuration memory are described in [2]. Such changes are referred to as soft errors and cannot be detected by an offline test without interrupting the operation of the circuit.
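To illustrate why a single flipped configuration bit matters, the following sketch models a 4-input look-up table (LUT) as a 16-bit truth table and an SEU as the inversion of one of those bits. This is a simplified assumption for illustration only; the bit ordering and the helper names (`lut_eval`, `seu`, `AND4`) are hypothetical and do not describe the configuration format of any particular FPGA.

```python
from itertools import product

def lut_eval(config: int, inputs: tuple) -> int:
    """Evaluate a hypothetical 4-input LUT.

    Bit i of `config` stores the output for the input combination whose
    binary index is i (inputs[0] is the least significant bit).
    """
    index = sum(bit << pos for pos, bit in enumerate(inputs))
    return (config >> index) & 1

def seu(config: int, bit: int) -> int:
    """Model a single event upset as a flip of one configuration bit."""
    return config ^ (1 << bit)

AND4 = 1 << 15  # only index 15 (all four inputs equal to 1) yields 1

upset = seu(AND4, 0)  # flip the bit that holds the output for input 0000
changed = [x for x in product((0, 1), repeat=4)
           if lut_eval(AND4, x) != lut_eval(upset, x)]
print(changed)  # [(0, 0, 0, 0)]
```

The corrupted LUT still behaves correctly for 15 of its 16 input combinations, so the wrong function becomes visible only when the affected input combination actually occurs.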

Concurrent Error Detection (CED) techniques enable faster detection of a soft error (an error that can be corrected by reconfiguration) caused by an SEU. SEUs can also change values in the embedded memory used in the design and cause data corruption. FPGAs are fabricated in sub-micron technologies with ever smaller transistors. As a result, changes in FPGA memory contents caused by SEUs can be observed even at sea level [3]. This is another reason why CED techniques are important.

There are many papers [4, 5] focused on CED techniques. CED techniques can be divided into three basic groups according to the type of redundancy used: area redundancy, time redundancy and information redundancy. Area redundancy means duplication or triplication of the original circuit. Time redundancy means repeating the same computation and comparing the results. Information redundancy is based on Error Detecting (ED) codes and leads either to an area overhead or to time redundancy. Our method uses information redundancy; because the ED codes require additional logic, it also results in area redundancy.

CED techniques are based on three basic terms:

- Fault Security (FS) property means that for each modeled fault, every erroneous output vector produced does not belong to the output code.
- Self-Testing (ST) property means that for each modeled fault, there is an input vector occurring during normal operation that produces an output vector that does not belong to the code.
- Totally Self-Checking (TSC) property means that the circuit satisfies both the FS and ST properties.
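The three definitions can be stated compactly as follows. The symbols are our own notation, not taken from the text: $F$ is the set of modeled faults, $I_N$ the set of input vectors occurring during normal operation, $C$ the output code, $y(x)$ the fault-free output for input $x$, and $y_f(x)$ the output in the presence of fault $f$.

$$
\begin{aligned}
\mathrm{FS}  &:\quad \forall f \in F,\ \forall x \in I_N:\ y_f(x) \neq y(x) \Rightarrow y_f(x) \notin C,\\
\mathrm{ST}  &:\quad \forall f \in F,\ \exists x \in I_N:\ y_f(x) \notin C,\\
\mathrm{TSC} &:\quad \mathrm{FS} \wedge \mathrm{ST}.
\end{aligned}
$$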

The paper is organized as follows. Section 2 presents our method of generating the TSC circuits used in our architecture. Section 3 describes our fault classification. Section 4 presents our architecture and the computation of its reliability parameters. Section 5 concludes the paper.

## **2** Totally self-checking circuits

CED techniques based on ED codes are widely used [6, 7]. However, many authors do not evaluate the FS and ST properties of the resulting circuit. In our method, a TSC circuit consists of the original combinational circuit, a parity bit predictor and a checker (Fig. 1).



Fig. 1: Basic structure of TSC circuit

In our approach, the parity predictor is generated from the original combinational circuit. Together with the outputs of the parity predictor, the primary outputs of the original circuit are encoded by an ED code. The most important criteria of ED code quality are the resulting area overhead and the fault coverage. Our experimental results show that higher fault coverage leads to a higher area overhead, and a lower area overhead leads to lower fault coverage. Simple ED codes are used because a low area overhead is the most important criterion in many real situations. In our solution, we use single parity ED codes to ensure the TSC properties. Our results for the tested ED codes are described in [8]. The area overhead of single parity ED codes is in many cases higher than 75%.
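The structure of Fig. 1 with a single even-parity code can be sketched as follows. The example circuit, the function names and the way the predictor is derived (by reusing the original function in software) are illustrative assumptions; in hardware, the predictor is a separate circuit synthesized from the original one.

```python
from itertools import product

def parity(bits) -> int:
    """Even-parity bit of a sequence of 0/1 values (XOR of all bits)."""
    p = 0
    for b in bits:
        p ^= b
    return p

def original(x):
    """Hypothetical 3-input, 2-output combinational circuit."""
    x0, x1, x2 = x
    return (x0 & x1, x1 ^ x2)

def predictor(x) -> int:
    """Parity predictor: the parity of the original outputs, computed from the inputs."""
    return parity(original(x))

def checker(outputs, predicted: int) -> bool:
    """True (OK) if the outputs and the predicted bit form an even-parity codeword."""
    return parity(outputs) == predicted

def flip(outputs, positions):
    """Invert the output bits at the given positions (models an erroneous output)."""
    return tuple(b ^ 1 if i in positions else b for i, b in enumerate(outputs))

for x in product((0, 1), repeat=3):
    y, p = original(x), predictor(x)
    assert checker(y, p)                   # fault-free: always OK
    assert not checker(flip(y, {0}), p)    # single-bit error: detected
    assert checker(flip(y, {0, 1}), p)     # double-bit error: escapes the checker
```

A single parity bit detects every error that changes an odd number of output bits, but an error in an even number of bits yields another valid codeword. This is how a fault can produce an undetectable error despite the added redundancy.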

## **3** Fault Classification

The use of ED codes, possibly combined with special synthesis methods, does not necessarily ensure the TSC property. To compare different methods, we need to evaluate how many faults violate the FS and ST properties. In the common fault classification, faults are divided into two groups according to their testability. This classification is not sufficient for our purpose: we must also distinguish whether an output change caused by a fault is detectable by the ED code used.

Fault detection can be based on two different approaches: comparison of two values (duplication) and ED codes. In the first case, the outputs of two units are compared. Assuming that only one fault can occur at a time, at least one unit produces correct values. Hence, if the comparator is assumed to be fault-free, every error caused by any fault in a unit is detectable. Evaluating the error detection capability in the second case is more complicated. The correct output is not known during operation, so the fault detection ability depends only on the ED code used, and it is not guaranteed that every fault causes a detectable error. A different approach to fault classification is therefore necessary. For each input vector, the response of a circuit in the presence of a fault falls into one of three groups:

1. No error: the fault does not affect the output values. The data is not corrupted, but the presence of the fault is not detected.
2. Detectable error: the fault changes the outputs into a non-codeword. This is the best case, because the presence of the fault is detected.
3. Undetectable error: the output vector is a valid codeword but is incorrect (an incorrect codeword). This is the worst case, because the checker is not able to detect the error.
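The three response groups can be expressed as a small classification function. The sketch assumes an even-parity code over the whole output word (functional outputs followed by the parity bit); the function names are hypothetical.

```python
def is_codeword(word) -> bool:
    """Even-parity code: a word is valid if it contains an even number of 1s."""
    return sum(word) % 2 == 0

def classify_response(correct, observed) -> str:
    """Classify the response to one input vector in the presence of a fault."""
    if observed == correct:
        return "no error"
    if not is_codeword(observed):
        return "detectable error"
    return "undetectable error"   # valid but incorrect codeword

correct = (1, 1, 0)                           # outputs 1, 1 followed by parity bit 0
print(classify_response(correct, (1, 1, 0)))  # no error
print(classify_response(correct, (0, 1, 0)))  # detectable error
print(classify_response(correct, (0, 0, 0)))  # undetectable error
```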

Every circuit has a set of admissible input vectors. Depending on how the circuit responds to a fault over this set, the faults can be divided into four classes:

- A) Faults that do not affect the output for any input vector. This group represents faults occurring in redundant parts. These faults have no impact on the FS property, but if such a fault can occur, the circuit cannot be ST.
- B) Faults that are detectable by at least one input vector and do not produce an incorrect codeword for any other input vector. These faults have no negative impact on the FS and ST properties.
- C) Faults that cause an incorrect codeword for at least one input vector and are not detectable by any input vector. Faults from this class cause undetectable errors. If any fault in the circuit belongs to this class, the circuit is neither FS nor ST.
- D) Faults that cause an undetectable error for at least one input vector and a detectable error for at least one other input vector. Although these faults are detectable, the circuit does not satisfy the FS property.
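Given the response groups of one fault over all admissible input vectors, its class follows from two questions: does any vector give a detectable error, and does any vector give an undetectable error? A minimal sketch, assuming the hypothetical labels "no error", "detectable error" and "undetectable error" for the three response groups:

```python
def fault_class(responses) -> str:
    """Return class A, B, C or D for one fault.

    `responses` holds one label per admissible input vector:
    "no error", "detectable error" or "undetectable error".
    """
    kinds = set(responses)
    detectable = "detectable error" in kinds
    undetectable = "undetectable error" in kinds
    if not detectable and not undetectable:
        return "A"   # never affects the outputs
    if detectable and not undetectable:
        return "B"   # detected, never silently wrong
    if undetectable and not detectable:
        return "C"   # silently wrong, never detected
    return "D"       # detectable for some vectors, undetectable for others
```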

With regard to the definitions of the FS and ST properties, we can state the following theorems:

- A circuit is FS and ST if and only if all faults belong to class B.
- A circuit is FS if and only if all faults belong to class A or B.
- A circuit is ST if and only if all faults belong to class B or D.

These theorems follow directly from the definitions of FS and ST.
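The theorems translate directly into a check over the class labels of all modeled faults. A minimal sketch; the input format (one letter per fault) is an assumption.

```python
from collections import Counter

def circuit_properties(classes):
    """Return (FS, ST, TSC) for a circuit given one class letter 'A'-'D' per fault."""
    counts = Counter(classes)
    fs = counts["C"] == 0 and counts["D"] == 0   # all faults in class A or B
    st = counts["A"] == 0 and counts["C"] == 0   # all faults in class B or D
    return fs, st, fs and st                     # TSC <=> all faults in class B

print(circuit_properties("BBBB"))  # (True, True, True)
print(circuit_properties("ABBB"))  # (True, False, False)
print(circuit_properties("BBBD"))  # (False, True, False)
```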

To compare different techniques for TSC circuit design, the distribution of the considered faults into the classes defined above has to be obtained, which requires a suitable fault simulator. Most simulators (such as FSIM [9] or HOPE [10]) cannot produce this classification. We have used the simulator described in [11], which has the following features:

- The simulation is performed on circuits described in a netlist format (EDIF).
- Stuck-at-0 and stuck-at-1 faults on the inputs and outputs of components are considered.
- Combinational and sequential circuits are supported.
- The simulator supports circuits whose inputs, outputs and internal states (in the case of a sequential circuit) are encoded by even parity, multiple parity or a 1-out-of-N code. Multiple code groups can be used to ensure TSC. The simulator also supports Hamming-like codes and the M-out-of-N code.

Design rules must be enforced to preserve the information redundancy [8]. When some design rules are violated, the FS property may be satisfied only partially. To compare different methods, it is then useful to evaluate the percentage of faults that satisfy the FS property, i.e., "how much the circuit satisfies the FS property".

| Circuit | Inputs | Outputs | All faults | X    | A  | B    | C  | D   |
|---------|--------|---------|------------|------|----|------|----|-----|
| alu1    | 12     | 9       | 2594       | 2566 | 28 | 2566 | 0  | 0   |
| apla    | 10     | 13      | 632        | 632  | 0  | 522  | 3  | 107 |
| b11     | 8      | 32      | 418        | 416  | 2  | 321  | 42 | 53  |
| br1     | 12     | 9       | 594        | 594  | 0  | 369  | 78 | 147 |
| al2     | 16     | 48      | 628        | 627  | 1  | 576  | 17 | 34  |
| alu2    | 10     | 9       | 830        | 819  | 11 | 757  | 0  | 62  |
| alu3    | 10     | 9       | 622        | 622  | 0  | 572  | 0  | 50  |

Tab. 1. Combinational circuits and even parity

Tab. 1 shows our experimental results for a single parity predictor. The first three columns describe the benchmark circuits used, and the next column gives the total number of faults. Column X shows the number of faults found testable by the standard classification; in every row, X = B + C + D. The last four columns give our fault classification, which is described in more detail in [12].
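As a cross-check between the two tables, the sketch below computes, from the class counts of Tab. 1, the percentage of faults in class A or B (FS) and in class B or D (ST). Rounded to whole percent, the FS values for apla, b11, br1, al2 and alu3 come out as 83, 77, 62, 92 and 92, which appears to coincide with the FS column of Tab. 2. The text does not state this relation explicitly, so treat it as our reading of the data.

```python
# (A, B, C, D) fault counts per circuit, copied from Tab. 1.
TAB1 = {
    "alu1": (28, 2566, 0, 0),
    "apla": (0, 522, 3, 107),
    "b11": (2, 321, 42, 53),
    "br1": (0, 369, 78, 147),
    "al2": (1, 576, 17, 34),
    "alu2": (11, 757, 0, 62),
    "alu3": (0, 572, 0, 50),
}

def fs_percent(a, b, c, d):
    """Share of faults that cannot cause an undetectable error (classes A and B)."""
    return 100 * (a + b) / (a + b + c + d)

def st_percent(a, b, c, d):
    """Share of faults detectable by at least one input vector (classes B and D)."""
    return 100 * (b + d) / (a + b + c + d)

for name, counts in TAB1.items():
    print(f"{name}: FS = {fs_percent(*counts):.1f} %, ST = {st_percent(*counts):.1f} %")
```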

The evaluation of the FS property (the number of faults that belong to class A or B) does not require knowledge of the set of admissible input words: if a fault does not produce an incorrect codeword for any input word, it cannot cause an undetectable error for any admissible input word. Therefore, we can use the exhaustive test set for combinational circuits and a test that exercises all transitions for sequential circuits.

The evaluation of the ST property (the number of faults that belong to class B or D) is more complicated, because some input words may never appear. For combinational circuits, where the set of admissible input words is not defined, the exhaustive test set is generated. In real operation, however, some input words may never occur, so some faults may remain undetectable, which decreases the final fault coverage. The reconfiguration process is initiated after a fault is detected. The time needed to localize the faulty part is not negligible and must be included in the calculation of the reliability parameters. An alternative reconfiguration approach omits the localization process and downloads new configuration data of the faulty block into the FPGA.
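Restricting the evaluation to a subset of input vectors can only remove response kinds. A fault of class A or B therefore stays in class A or B, which is why evaluating FS on the exhaustive set is safe, whereas a fault counted as detectable (class B or D) may lose all of its detecting vectors. The sketch shows the second case for one hypothetical fault; the responses, the admissible set and the name `fault_class_over` are invented for illustration.

```python
from itertools import product

def fault_class_over(responses, vectors) -> str:
    """Class A-D of one fault, taking only the given input vectors into account."""
    kinds = {responses[x] for x in vectors}
    detectable = "detectable error" in kinds
    undetectable = "undetectable error" in kinds
    return {(False, False): "A", (True, False): "B",
            (False, True): "C", (True, True): "D"}[(detectable, undetectable)]

# Hypothetical fault in a 2-input circuit: only input 11 exposes it.
responses = {x: "no error" for x in product((0, 1), repeat=2)}
responses[(1, 1)] = "detectable error"

exhaustive = list(product((0, 1), repeat=2))
admissible = [(0, 0), (0, 1), (1, 0)]     # assumption: 11 never occurs in operation

print(fault_class_over(responses, exhaustive))  # B: counted as detectable (ST holds)
print(fault_class_over(responses, admissible))  # A: never detected in operation (ST lost)
```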

## **4** Reliability of our architecture

Since our previous results show that full (100%) satisfaction of the TSC properties is difficult to achieve, we have proposed a new structure based on two FPGAs (Fig. 2). Each FPGA contains a TSC circuit and a comparator. The TSC circuit is composed of small blocks, each of which satisfies the TSC property. Methods for satisfying the TSC property in a compound design are described in [13].

Every FPGA has one primary input, one primary output and two pairs of OK/FAIL checking signals. The first pair, generated by the TSC circuit, serves as additional information. The probability that this information is correct depends on the degree to which the TSC properties are satisfied. When the TSC property is satisfied for only, e.g., 83% of faults, the checking information is also correct with a probability of 83%. This means that the OK signal is correct for 83% of the errors that occur (the same probability applies to both the OK and FAIL signals).



Fig. 2. Reconfigurable duplex system

To increase the reliability parameters further, two comparators are added, one in each FPGA. Each comparator compares the outputs of both FPGAs and generates the FAIL signal when they differ. This information alone is not sufficient to determine which TSC circuit failed; the additional information identifying the faulty circuit is provided by the TSC circuit itself. When the outputs differ and one of the circuits generates the FAIL signal, the faulty circuit is correctly identified, and the correct outputs can be processed by the next circuit. When the outputs differ and both circuits signal correct operation, both circuits must be stopped and reconfiguration must be initiated for both of them.
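The decision rules described above can be summarised as a small function. This is a sketch under stated assumptions: the names are hypothetical, and two cases that the text does not discuss are handled by our own choice, namely equal outputs with a FAIL signal (we forward the common value and reconfigure the flagged FPGA) and different outputs with both FAIL signals active (we stop and reconfigure both, as when neither signals FAIL).

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Decision:
    output: Optional[int]         # value forwarded to the next circuit; None = stopped
    reconfigure: Tuple[int, ...]  # FPGAs (1 and/or 2) to be reconfigured

def duplex_decide(out1: int, out2: int, fail1: bool, fail2: bool) -> Decision:
    """Hypothetical decision logic of the reconfigurable duplex system (Fig. 2).

    out1/out2 are the FPGA outputs; fail1/fail2 are the FAIL signals of the
    TSC checkers inside each FPGA.
    """
    if out1 == out2:
        # Assumption (not covered by the text): agreeing outputs are accepted,
        # and any FPGA whose TSC checker signals FAIL is scheduled for reconfiguration.
        flagged = tuple(i for i, f in ((1, fail1), (2, fail2)) if f)
        return Decision(out1, flagged)
    if fail1 and not fail2:
        return Decision(out2, (1,))   # FPGA 1 identified as faulty
    if fail2 and not fail1:
        return Decision(out1, (2,))   # FPGA 2 identified as faulty
    # Outputs differ and the checkers do not single out one FPGA: stop both.
    return Decision(None, (1, 2))
```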

| C    | FS [%] | Single parity S[b] | Single parity A<sub>ss</sub> | Duplex S[b] | Duplex A<sub>ss</sub> | Triplex S[b] | Triplex A<sub>ss</sub> |
|------|--------|--------------------|------------------------------|-------------|-----------------------|--------------|------------------------|
| apla | 83     | 349k               | 0.9<sub>5</sub>787           | 233k        | 0.9<sub>5</sub>184    | 233k         | 0.9<sub>8</sub>986     |
| b11  | 77     | 252k               | 0.9<sub>5</sub>856           | 233k        | 0.9<sub>5</sub>412    | 233k         | 0.9<sub>8</sub>993     |
| br1  | 62     | 257k               | 0.9<sub>5</sub>750           | 233k        | 0.9<sub>5</sub>402    | 233k         | 0.9<sub>8</sub>992     |
| al2  | 92     | 242k               | 0.9<sub>5</sub>951           | 233k        | 0.9<sub>5</sub>434    | 233k         | 0.9<sub>8</sub>993     |
| alu3 | 92     | 520k               | 0.9<sub>5</sub>783           | 233k        | 0.9<sub>4</sub>879    | 233k         | 0.9<sub>8</sub>970     |

Tab. 2. Availability parameters

Tab. 2 summarises the fault security (FS) and the bit-stream size S[b] used, together with the results of the availability computation for three models (single parity, duplex and triplex). Here "C" is the benchmark circuit, "FS" is the percentage of faults that satisfy the FS property (classes A and B in Tab. 1), "S[b]" is the configuration memory size for one FPGA in bits, and "A<sub>ss</sub>" is the steady-state availability [14]. The notation 0.9<sub>n</sub> stands for n nines, e.g., 0.9<sub>5</sub>787 = 0.99999787.
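The text does not give the availability models behind Tab. 2. As a generic illustration only, the steady-state availability of a single repairable unit with constant failure rate λ and repair rate μ (a two-state Markov model) is A<sub>ss</sub> = MTTF / (MTTF + MTTR) = μ / (λ + μ). The sketch also assumes, hypothetically, that the SEU-induced failure rate grows linearly with the configuration memory size S[b]; the per-bit upset rate and the repair rate in the example are arbitrary placeholders, not values from this paper.

```python
def steady_state_availability(failure_rate: float, repair_rate: float) -> float:
    """A_ss = MTTF / (MTTF + MTTR) = mu / (lambda + mu) for one repairable unit
    with constant failure rate lambda and repair rate mu."""
    mttf = 1.0 / failure_rate
    mttr = 1.0 / repair_rate
    return mttf / (mttf + mttr)

def seu_failure_rate(config_bits: int, upset_rate_per_bit: float) -> float:
    """Hypothetical model: every configuration bit is upset independently at the
    same constant rate, and every upset causes a failure."""
    return config_bits * upset_rate_per_bit

# Placeholder parameters (per hour), chosen only to exercise the functions.
lam = seu_failure_rate(233_000, 1e-12)
print(steady_state_availability(lam, repair_rate=1.0))
```

Real models for the duplex and triplex structures need more states (e.g., one or both units failed, detection and reconfiguration times), so this two-state formula is only the building block.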

## **5** Conclusion

The proposed structure can increase the availability parameters of an adapted duplex system with minimal area overhead. Thanks to the output comparators, we can use circuits whose TSC property is satisfied for less than 100% of faults. The structure increases reliability parameters through duplication combined with identification of the faulty circuit. It has a smaller area overhead than a triplex system (TMR) but better reliability parameters than a plain duplex system. The reconfiguration process allows the faulty part to be corrected, which further increases the reliability parameters. Our present research focuses on a real implementation of the structure in the AT94K40 and on a precise calculation of its reliability parameters.

## Acknowledgment

This research has been in part supported by the GA102/03/0672 grant and MSM6840770014 research program.

## References

[1] QuickLogic Corporation: Single Event Upsets in FPGAs, 2003, www.quicklogic.com

[2] Bellato, M, Bernardi, P, Bortolato, D, Candelori, A, Ceschia, M, Paccagnella, A, Rebaudengo, M, Sonza Reorda, M, Violante, M, Zambolin, P.: Evaluating the effects of SEUs affecting the configuration memory of an SRAM-based FPGA, Design Automation Event for Electronic System in Europe 2004, pp. 584-589.

[3] Normand, E.: Single Event Upset at Ground Level, IEEE Transactions on Nuclear Science, vol. 43, 1996, pp. 2742-2750.

[4] Mohanram, K, Sogomonyan, E. S, Gössel, M, Touba, N. A.: Synthesis of Low-Cost Parity-Based Partially Self-Checking Circuits, Proceedings of the 9th IEEE International On-Line Testing Symposium, 2003, pp. 35.

[5] Drineas, P, Makris, Y.: Concurrent Fault Detection in Random Combinational Logic, In: Proceedings of the IEEE International Symposium on Quality Electronic Design (ISQED), 2003, pp. 425-430.

[6] Mitra, S, McCluskey, E. J.: Which Concurrent Error Detection Scheme To Choose? Proc. International Test Conf., 2000, pp. 985-994.

[7] Bolchini, C, Salice, F, Sciuto, D.: Design Self-Checking FPGAs through Error Detection Codes, 17th IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT'02), 2002, pp. 60.

[8] Kubalik, P, Kubatova, H.: Minimization of the Hamming Code Generator in Self Checking Circuits, In: Proceedings of the 15th International Workshop on Discrete-Event System Design - DESDes'04, 2004, pp. 161-166.

[9] Lee, H. K, Ha, D. S.: An Efficient Forward Fault Simulation Algorithm Based on the Parallel Pattern Single Fault Propagation, Proc. of the 1991 International Test Conference, pp. 946-955, Oct. 1991.

[10] Lee, H. K, Ha, D. S.: HOPE: An Efficient Parallel Fault Simulator for Synchronous Sequential Circuits, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 15, No. 9, pp. 1048-1058, September 1996.

[11] Kafka, L.: Design of TSC circuits implemented in FPGA, CTU FEE, 2004.

[12] Kafka, L, Kubalík, P, Kubátová, H, Novák, O.: Fault Classification for Self-checking Circuits Implemented in FPGA, Proceedings of IEEE Design and Diagnostics of Electronic Circuits and Systems Workshop, Sopron, University of Western Hungary, 2005, pp. 228-231.

[13] Kubalik, P, Kubatova, H.: High Reliable FPGA Based System Design Methodology, Work in Progress Session of 30th EUROMICRO and DSD 2004, Universitat Linz 2004 pp. 30-31.

[14] Pradhan, D. K.: Fault-Tolerant Computer System Design, Prentice-Hall, Inc., New Jersey, 1996.