
Are your test cases good enough? What test case quality means for error detection

Frank Büchner*


Even a "good-looking" set of test cases, which also reaches 100% code coverage, can overlook defects in the software. Only "good" test cases detect errors. But how do you find such test cases?


The unit test specification for a simple test object could look like this: a start value and a length define a range. Determine whether another value lies in this range or not. The end of the range does not belong to the range. All values are integers.

The test set shown in Figure 1 consists of three test cases that together reach 100% code coverage (MC/DC). Nevertheless, it has a weak point: not all requirements have been tested. In particular, there is no test case that checks whether the end of the range really does not belong to the range. You could also say that we have tested without limit values. A test case with the values 5, 2 and 7 for start, length and value would fail because of a defect in the software, namely a wrong relational operator in a decision ('<' instead of '<=').
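The article does not show the source code of the test object; the following is a minimal sketch (in C) of what the faulty implementation could look like, with the function name and structure assumed, written so that exactly this test case fails:

  /* Hypothetical sketch of the test object; only its behavior is described in
     the text. Returns 1 if value lies in [start, start+length), otherwise 0. */
  int is_value_in_range(int start, int length, int value)
  {
      int end = start + length;          /* the end does not belong to the range */

      if (value < start || end < value)  /* DEFECT: should be 'end <= value'     */
      {
          return 0;                      /* not in range */
      }
      return 1;                          /* in range */
  }

  /* Test case start = 5, length = 2, value = 7: expected result 0 (7 is the end
     of the range and does not belong to it), actual result 1 -> the test case
     fails and reveals the wrong relational operator ('<' instead of '<=').     */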


Limit values, min/max values, extreme values, illegal values

Limit values

How many test cases with which data do we need to test the limit values sufficiently? Let us take the (partial) specification "Input value less than 5" as an example. How could this be implemented, and which tests detect faulty implementations?

The table in Figure 2 shows possible implementations for "Input value less than 5" in the first column. The first two lines are correct implementations; all other lines contain incorrect implementations. The second column indicates how likely it is that (a) such a defect will be implemented and (b) an implemented defect will be overlooked in a code review. The first probability is estimated from the number of characters that differ between the correct and the incorrect implementation; the second from the visual difference.

For example, only one character differs between the correct implementation (i<5) and the incorrect implementation (i>5) (the relational operator is wrong), and this difference is visually inconspicuous. Therefore, such a defect is classified as likely. In contrast, the incorrect implementation (i!=5) requires two wrong characters, and the visual difference is large. Therefore, such a defect is classified as unlikely.

The three other columns in the table indicate the result of the respective implementation for the input values 4, 5 and 6. The value 5 is the limit value from the specification "Input value less than 5", and 4 and 6 represent limit value −1 and limit value +1. The results in the table that are shown in bold type and highlighted in red are wrong; this means that the corresponding test cases reveal the incorrect implementation. The question now is whether two test cases with the input values 4 and 5 reveal a sufficient number of incorrect implementations or whether three test cases are necessary.

The two test cases with 4 and 5 do not reveal the two incorrect implementations (i!=5) and (i<>5). These two faulty implementations are each a test for inequality, expressed in different programming languages. The input value 6 (i.e. a third test case) would reveal them. The faulty implementation (i<>5) is considered likely; on the other hand, programming languages in which '<>' is used as the inequality operator are rarely used in embedded systems. The faulty implementation (i!=5) is classified as unlikely. In my opinion, we therefore do not necessarily need the test case with the value 6.

A special case is the faulty implementation (i==4), which is not detected by any of the three input values. I do not consider this critical, because (i==4) is wrong in two ways: the relational operator and the value are both obviously wrong, which should be noticed immediately during a review. However, if this incorrect implementation is to be uncovered by testing, we need another test case, for example with the value 3. With this input value, the expected result is 1, but the actual result is 0, which uncovers the incorrect implementation. From this we could conclude that we need four test cases to be on the safe side, especially if no code review is planned.
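The following sketch (not from the article) illustrates this reasoning: a tiny harness compares the correct implementation of "Input value less than 5" against some of the faulty implementations from Figure 2 and prints which of the input values 3, 4, 5 and 6 reveals which defect.

  #include <stdio.h>

  /* Candidate implementations of "Input value less than 5"; each faulty
     function stands for one row of Figure 2 ('<>' is not valid C and is
     therefore represented by '!=' only). */
  static int correct(int i)    { return i < 5;  }   /* specified behavior       */
  static int faulty_le(int i)  { return i <= 5; }   /* '<=' instead of '<'      */
  static int faulty_ge(int i)  { return i >= 5; }   /* '>=' instead of '<'      */
  static int faulty_ne(int i)  { return i != 5; }   /* '!=' instead of '<'      */
  static int faulty_eq4(int i) { return i == 4; }   /* wrong operator and value */

  int main(void)
  {
      const int inputs[] = { 3, 4, 5, 6 };  /* limit value 5 and its neighbors */
      int (*faulty[])(int) = { faulty_le, faulty_ge, faulty_ne, faulty_eq4 };
      const char *names[]  = { "(i<=5)", "(i>=5)", "(i!=5)", "(i==4)" };

      for (unsigned f = 0; f < sizeof faulty / sizeof faulty[0]; f++)
      {
          for (unsigned k = 0; k < sizeof inputs / sizeof inputs[0]; k++)
          {
              if (faulty[f](inputs[k]) != correct(inputs[k]))
              {
                  printf("input %d reveals the faulty implementation %s\n",
                         inputs[k], names[f]);
              }
          }
      }
      return 0;   /* output shows: (i!=5) is revealed only by 6, (i==4) only by 3 */
  }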

Four test cases are proposed by René Tuinhout ([1] and [4]); he calls this approach Black Box Boundary Value Analysis (B3VA). But we can assume that the range "less than 5" has a lower (left) limit; in the worst case, this is INT_MIN. And we can further assume that this limit will also be checked, i.e. that there will be a test case with this limit as input. This test case would reveal the incorrect implementation (i==4).


Note: The question mark at the implementation (i<=4) in the second line stems from the fact that this implementation is correct but does not reflect the requirement. A reviewer will therefore most likely stumble over this implementation. And if a programming error creeps into the implementation with the '4', for example if (i==4) is implemented instead, it is not detected by the black-box test cases with 4, 5 and 6.

Conclusion: In embedded systems (with '!=' as the inequality operator) we can consider two test cases (with the values 4 and 5) as sufficient, provided reviews are performed and the number of test cases is to be kept as low as possible. On the other hand, in some cases you need four test cases to uncover all faulty implementations.
For test services by Hitex, the rule is to test with three values (in this case 4, 5 and 6).

Min/Max values

We can consider the largest possible input value and the smallest possible input value (i.e. the "most negative" input value) as special cases of limit values. Therefore, we should also test with such input values. The following example (from [1]) shows that this makes sense.

The function abs_short() in Figure 3 works correctly for input values like -5, 0 and 5; these three inputs correctly result in 5, 0 and 5. These three test cases also achieve 100% code coverage. But the input value -32768, the smallest ("most negative") value of a signed 16-bit number, does not give +32768, but -32768 (i.e. the input value again). This is because the correct result cannot be represented with 16 bits. (Background: -32768 = 0x8000. Negating in two's complement means subtracting 1 and inverting the bits: 0x8000 - 1 = 0x7FFF; inverted, this is again 0x8000, the value we started with.)
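Figure 3 is not reproduced here; the following is a minimal sketch of an abs_short() function that shows exactly this effect (the implementation is assumed, only the behavior is taken from the text):

  #include <limits.h>   /* SHRT_MIN is -32768 for a 16-bit short */

  /* Hypothetical sketch of abs_short() as described for Figure 3. */
  short abs_short(short value)
  {
      if (value < 0)
      {
          value = -value;   /* for SHRT_MIN the mathematical result +32768 cannot
                               be represented in 16 bits; storing it back into the
                               short wraps around to -32768 again                 */
      }
      return value;
  }

  /* abs_short(-5) == 5, abs_short(0) == 0, abs_short(5) == 5: tests pass and
     reach 100% code coverage; abs_short(-32768) == -32768: the defect is
     revealed only by the min/max test case.                                   */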

Extreme values

Extreme (or unusual) input values are not directly limit values or min/max values but are special in other respects. Let us take as an example a minimum function that has three unsigned integers (a, b and c) as input and returns the smallest of these input values as its result.

The table in Figure 4 shows almost a dozen passed test cases for a minimum function. The first three test cases check that the smallest value is determined correctly regardless of which input variable it occurs at. In the other test cases, minimum/maximum values are used. In addition, 100% code coverage is achieved by this set of test cases.

What is deficient in this test case set? Well, no extreme or unusual input values are used. For example, there is no test case where all three values are equal and positive, such as (3, 3, 3). If we were to run this test case, the result would erroneously be 0 (and not 3 as expected). So the test case (3, 3, 3) reveals a defect in the software.
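The article does not show the code of the minimum function; one hypothetical implementation that behaves like this, because all comparisons are strict and a questionable default value is returned when no strictly smallest value exists, could look as follows:

  /* Hypothetical implementation of the minimum function (not from the article). */
  unsigned int min3(unsigned int a, unsigned int b, unsigned int c)
  {
      unsigned int min = 0;              /* questionable default value        */

      if      (a < b && a < c) min = a;  /* a is the strictly smallest value  */
      else if (b < a && b < c) min = b;  /* b is the strictly smallest value  */
      else if (c < a && c < b) min = c;  /* c is the strictly smallest value  */
      /* if no strictly smallest value exists, as for (3, 3, 3), the default 0
         is returned erroneously                                              */
      return min;
  }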

For a sort function, extreme or exceptional test cases are, for example: the values to be sorted are already sorted, they are sorted in reverse order, all values are equal, or there are no values at all to sort.

Illegal values

If we look at the simple introductory example, the specification says "All values are integers." This also applies to the length of the range. But is a negative length a valid input value? Probably not. In any case, it is revealing to run a test case with the value 5 as the start value and the value -2 as the length. Is the value 4 recognized as lying in the range?
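Using the hypothetical is_value_in_range() sketched at the beginning of this article, such a test case could look as follows; the expected result is deliberately left open because the specification does not define it:

  #include <stdio.h>

  extern int is_value_in_range(int start, int length, int value);  /* sketch above */

  int main(void)
  {
      /* (presumably) illegal input: start = 5, length = -2, value = 4 */
      int result = is_value_in_range(5, -2, 4);

      /* The specification does not say what the correct result is; whatever the
         function returns should trigger a clarification of the requirements or
         a defensive check for length < 0 in the implementation. */
      printf("is_value_in_range(5, -2, 4) = %d\n", result);
      return 0;
  }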

As a rule of thumb: always look for (possibly) illegal input values and execute test cases with these values.

Methods for determining the test case quality

Equivalence class formation

A constant problem with test case creation is that an input variable can take on too many possible values, and it is practically impossible to use all these values in test cases, especially if these values also have to be combined with all possible values of other input variables (combinatorial explosion). The formation of equivalence classes addresses this problem.

Equivalence class formation assigns a class to each possible input value. The classes are formed in such a way that all values in a class are considered equivalent for the test. Equivalent for the test means that if a value from a certain class detects a certain error, all other values in this class do the same. Under this assumption, you can consider any value from a class as a substitute for all other values in that class. The number of values that have to be considered for a certain input variable may thus be reduced considerably, and the combination with other values becomes practicable. An example can be found in Figure 6.
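Figure 6 is not reproduced here; as an illustration (assumed, not taken from the figure), equivalence classes for the specification "Input value less than 5" could be recorded like this, where the chosen representatives are arbitrary:

  /* Illustrative equivalence classes for "Input value less than 5". */
  typedef struct {
      const char *description;    /* what the class contains               */
      int         representative; /* any value of the class stands for all */
      int         expected;       /* expected result of the test object    */
  } equivalence_class_t;

  static const equivalence_class_t classes[] = {
      { "values less than 5 (result: true)",   2, 1 },  /* e.g. INT_MIN .. 4 */
      { "values of 5 or more (result: false)", 9, 0 },  /* e.g. 5 .. INT_MAX */
  };

  /* Boundary value analysis then adds the limit value 5 and its neighbor 4
     (and, if desired, 6) as additional test inputs.                          */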

However, we must be aware that a mistake in forming the equivalence classes can mean that not all values in a class are actually equivalent for the test. If, in addition, a value is chosen for the test that does not detect a certain error, although another value from the same class would, this error slips through. It is the responsibility of the person forming the equivalence classes to build them carefully.

The Classification Tree Method

The classification tree method (CTM) is a test case specification method that makes use of the methods discussed so far.

The classification tree method starts with an analysis of the requirements. This determines which relevant inputs there are; relevant inputs are those that you want to vary during the tests. Next, you consider the values that a relevant input can assume. If there are too many values, classes are formed according to the equivalence class method. Then limit values, illegal values and extreme input values are considered.

This results in the so-called classification tree. It forms the upper part of a test case specification according to the classification tree method. The root of the tree is at the top, the tree grows from top to bottom, and the leaves of the tree form the head of the combination table. The combination table is the lower part of a test case specification according to the classification tree method. Each row represents a test case. The classes from which values are used in a test case are determined by marks in the rows. Both the design of the tree and the decision about the number of test cases, as well as the setting of marks in the rows, are human tasks (and therefore unfortunately subject to human error).

Figure 7 is an example of a test case specification according to the classification tree method. The root of the tree is called "suspension", i.e. the test object appears to be the suspension of a motor vehicle. Two (test-relevant) aspects are considered for its test, namely speed and steering angle. Both are classifications; they are displayed in rectangular frames. Both classifications are divided into equivalence classes, which are displayed without frames. For "steering angle" there are three equivalence classes: "left", "central" and "right". From the classification tree we cannot see how the values in a particular class are coded. This depends on the implementation and does not interest the classification tree method, because it is based on a black-box approach. A test case specification according to the classification tree method is therefore abstract.

If one does not consider "central" to be an extreme steering angle, there are no limit, extreme or illegal values for the steering angle. With "speed" this is different. The classification "speed" is divided into the two equivalence classes "valid" and "invalid". The latter class guarantees that invalid values for "speed" are used in the test, because each class in the tree must be selected at least once for a test case. The class "invalid" is further subdivided according to the classification "too low or too high?". This results in the additional classes "negative" and "> v_max". Test cases with values from these classes find out what happens when the unexpected/impossible occurs. The valid speeds are divided into "normal" and "extreme" speeds. We can assume that the class "zero" for a valid speed contains only a single value (probably the value 0), as does the class "v_max" (which probably contains the maximum value from the requirements). These are limit values.

The combination table (the lower part of the figure) consists of seven rows and thus specifies seven test cases. The test cases can be named. The marks in each row indicate from which classes values are selected for that test case. This results in an (abstract) test case specification that indicates the purpose of each test case. In this example, the purpose is also expressed by the name of the test case, but this does not necessarily always have to be the case.
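The classes in the tree are abstract; to execute the test cases, concrete values have to be chosen from the marked classes. As an illustration, three of the abstract test cases could be turned into concrete test data as follows (all concrete numbers, including v_max, are assumptions and not taken from Figure 7):

  /* Hypothetical concrete test data for three abstract "suspension" test cases. */
  typedef struct {
      const char *name;          /* purpose of the test case                */
      double      speed_kmh;     /* representative value of the speed class */
      double      steering_deg;  /* < 0 = left, 0 = central, > 0 = right    */
  } suspension_test_t;

  static const suspension_test_t tests[] = {
      { "normal: low speed, steering left",         30.0, -15.0 },
      { "extreme: v_max, steering central",        250.0,   0.0 },  /* limit value   */
      { "invalid: negative speed, steering right",  -1.0,  10.0 },  /* illegal value */
  };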

The overall test case specification shows that there are only three "normal" test cases (the first three test cases). You can also see that there is no "normal" test case with a low speed and steering angle to the right. If necessary, you can add more "normal" test cases. However, the question here is not whether three normal test cases are sufficient or not, but the point is that it is obvious that there are only three normal test cases. This is an important advantage of the classification tree method.

The unit test tool TESSY contains a graphical editor for classification trees. Thus, unit tests for TESSY can be conveniently specified using the classification tree method.

Methods for deriving test cases

Recommendations from ISO 26262

Figure 8: Methods from ISO 26262 for deriving test cases. (Image: Hitex)

In Part 6, Section 9, ISO 26262:2011 lists the methods for deriving test cases for the software unit test in Table 11 [7]; see also Figure 8. The degree to which a method is recommended depends on the Automotive Safety Integrity Level (ASIL). The ASIL ranges from A to D, with D placing the highest demands on risk reduction. Methods that are highly recommended are marked with a double plus sign ("++"); methods that are recommended have a single plus sign ("+").

  • Method 1a from Table 11 requires test cases to be derived from the requirements. This is highly recommended for all ASILs. Considering the requirements first in order to derive test cases is the obvious approach.
  • Method 1b from Table 11 requires that equivalence classes be used to derive test cases. This is recommended for ASIL A and highly recommended for ASIL B to D.
  • Method 1c from Table 11 requires that limit values be considered to derive test cases. This is recommended for ASIL A and highly recommended for ASIL B to D.
  • Method 1d from Table 11 requires that "error guessing" be used to derive test cases. This is recommended for ASIL A to D.

Methods 1a, 1b and 1c have already been discussed in the previous sections. Method 1d is explained below.

Error guessing

Error guessing usually requires an experienced tester who is able to find error-sensitive ("exciting") test cases based on his or her experience. Therefore, error guessing is also referred to as experience-based testing. Error guessing is a non-systematic method of specifying test cases (in contrast to the first three methods, which are systematic). Admittedly, when checklists or error reports are used, error guessing can also gain a certain systematic character. Error guessing is related to limit, extreme and illegal values, because test cases that result from error guessing often contain such values.

Alternatives

This section discusses further methods to obtain test cases that have not been considered before.

Deriving Test Cases from the Source Code

It is tempting to use a tool to automatically generate test cases from the source code, for example with the aim of achieving 100% code coverage. There are different technical approaches, for example, backtracking or genetic algorithms. Both freely available tools and commercial tools offer this possibility for test case generation. Why not use it on a large scale? Well, there are at least two aspects you should be aware of:

  • 1. Omissions: Test cases derived from the code cannot find omissions in the code. For example, if a requirement says "if the first parameter is equal to the second parameter, a certain error number shall be returned" and the implementation of that requirement is missing, this missing code will never be detected by test cases derived from the code. You need test cases derived from the requirements to find non-implemented functionality.
  • 2. Correctness: You cannot decide on the basis of the code alone whether it is correct or not. For example, you do not know whether the decision (i<5) or (i<=5) implements the intended functionality. This requires test cases derived from the requirements or a test oracle. A test oracle is an instance that can decide for a set of input values whether the result is correct or not (see the sketch after this list).
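As an illustration of such a test oracle (assumed code, independent of the implementation under test), for the range example from the beginning of this article it could simply restate the requirement:

  /* Hypothetical test oracle for the range example: an independent piece of
     code that decides, for a set of input values, what the correct result is. */
  int range_oracle(int start, int length, int value)
  {
      /* restates the requirement directly: start <= value < start + length */
      return (value >= start) && (value < start + length);
  }

  /* A generated test case (start, length, value) is judged by comparing the
     result of the test object with range_oracle(start, length, value).       */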

Therefore, it is not sufficient to use only test cases derived from the source code. You also need test cases derived from the requirements. But wouldn't it be a good idea to have the main work of creating test cases done by a tool? You can then manually check whether the requirements have been tested and if not, improve them accordingly.

A study [5] tried to answer exactly this question. The study came to the following four main conclusions:

  • 1. Automatically generated test cases achieve higher code coverage than manually created test cases.
  • 2. Automatically generated test cases do not lead to the discovery of more errors.
  • 3. Automatically generated test cases have a negative effect on the ability to understand the intended behavior of the code (of the classes).
  • 4. Automatically generated test cases do not necessarily lead to better mutation scores.

The study used the tool EvoSuite, which automatically generates tests for Java classes. It was an empirical study in which one hundred students were supposed to find errors in Java code. Half of the students started the search with test cases generated by EvoSuite; the other half started with their own test cases derived from the requirements.

My conclusion from the study is that automated test case generation offers no advantage (for example, less effort or more errors found); on the other hand, it has no disadvantage either. This basic statement of "no advantage / no disadvantage" surprises me.

Of course, one can discuss the boundary conditions of the study (e.g. the programming language used, the knowledge of the students, etc.) and consider whether they also apply to embedded software development.

Random Test Data / Fuzzing

Like deriving test cases from source code, it is also tempting to use randomly generated test input data. With automated test execution, many test cases can be executed in a short time. But: A (functional) test case needs an expected result! And without a test oracle (see above) it can be very time-consuming to check for each test case whether the result obtained is correct or not.

However, there is a useful application of randomly generated test data: if you need to port or optimize "old" legacy code for which there may never have been a specification, it is useful to first run tests with randomly generated test data on the old code, record the results, and repeat the same tests on the reworked code afterward. If you get identical results, you can assume with some certainty that the work on the old code was successful. Apart from that, I consider random input testing without result checking to be a kind of robustness test. Robustness tests only detect coarse malfunctions (e.g. crashes).
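A minimal sketch of this record-and-compare approach (the function names and the fixed seed are assumptions; in practice the results of the first run would be stored and compared later):

  #include <stdio.h>
  #include <stdlib.h>

  /* Hypothetical interfaces of the legacy code and of its reworked version. */
  extern int legacy_filter(int input);
  extern int reworked_filter(int input);

  int main(void)
  {
      srand(12345);                              /* fixed seed: reproducible data */

      for (long n = 0; n < 100000; n++)
      {
          int input = rand();                    /* random, but repeatable input  */

          if (legacy_filter(input) != reworked_filter(input))
          {
              printf("mismatch for input %d\n", input);
              return 1;                          /* rework changed the behavior   */
          }
      }
      printf("identical results for all random inputs\n");
      return 0;
  }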

Nevertheless, this kind of random input testing can lead to the detection of safety and security vulnerabilities. Testing a test object with syntactically correct but otherwise random test data is also called "fuzzing".

Mutation test

As we have seen in the previous sections, 100% code coverage does not guarantee the quality of the test cases.

But how can you evaluate the quality of the test cases? One possibility is mutation testing (in IEC 61508 [8] these tests are called "error seeding"). If you have a set of passed test cases, you mutate the code and re-execute the test cases. Mutation means changing the code semantically, but in such a way that it remains syntactically correct. For example, you can change a decision from (i<5) to (i<=5), or you could replace a logical OR with a logical AND in a decision. The question now is whether the set of existing test cases detects this change in the code, i.e. whether at least one of these test cases fails. The quality of the test cases can be inferred from the number of detected or overlooked mutations.
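As an illustration (a hand-made mutant; mutation tools generate such variants automatically), the relational-operator example from above as a single mutant:

  /* Original decision and one mutant of it ('<' changed to '<='); the code
     stays syntactically correct, only its meaning changes.                  */
  int less_than_five(int i)        { return i < 5;  }   /* original */
  int less_than_five_mutant(int i) { return i <= 5; }   /* mutant   */

  /* A test suite that contains the boundary test case (input 5, expected 0)
     "kills" this mutant, because the mutant returns 1 for input 5. If no test
     case fails, the mutant survives, which hints at a missing boundary test. */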

Conclusion

As we have seen, 100% code coverage does not automatically mean good test cases. But you need good test cases to find bugs in the code. The task of creating good test cases is a difficult (mainly human) task; the use of tools should be treated with caution.

List of references

[1] Grünfelder, Stephan: Software-Test für Embedded Systems, 2. Auflage, dpunkt.verlag GmbH, Heidelberg, 2017.

[2] Spillner, Andreas, et al.: Basiswissen Softwaretest, 5. Auflage, dpunkt.verlag GmbH, Heidelberg, 2012.

[3] Liggesmeyer, Peter: Software-Qualität, 2. Auflage, Spectrum Akademischer Verlag, Heidelberg, 2009.

[4] Tuinhout, René: The Software Testing Fallacy, Testing Experience 02/2008.

[5] Fraser, Gordon, et al.: Does Automated Unit Test Generation Really Help Software Testers? A Controlled Empirical Study, ACM Transactions on Software Engineering and Methodology, Vol. 24, No. 4, Article 23, August 2015.

[6] http://www.hitex.de/tessy: More about the unit test tool TESSY

[7] ISO 26262, International Standard, Road vehicles – Functional Safety, First edition, 2011

[8] IEC 61508, Functional safety of electrical/electronic/programmable electronic safety-related systems, part 7, IEC, 2000

This article was first published in German by Elektronikpraxis.

* Frank Büchner holds a Diploma in Computer Science from the Karlsruhe University of Technology, now KIT. For several years he has been dedicated to testing and software quality. He regularly imparts his knowledge through lectures and professional articles. Currently, he is working as "Principal Engineer Software Quality" at Hitex GmbH in Karlsruhe.
