Thursday, November 7, 2013

Pairwise Testing


In computer systems, defects usually involve a single condition, independent of any other condition in the system. If there's a problem with a device or a variable or a setting, the problem is usually in that device or variable or setting alone. As an example, in a word processing program, a given font may be rendered improperly, irrespective of the printer, printer connection, or version of Windows. In some cases, however, there are two or more bad dancers at the party; normally everything is all right, but if one encounters the other while in a certain state, they trip over each other's feet. In testing, we want to be sure that we don't miss problems based on conflicts between two or more conditions, variables, or configurations, so we often test in combinations in order to find defects most efficiently. All printers except one might render a given font perfectly, and that printer might render all fonts perfectly except for that one. How can we maximize the chance of finding such a problem in the limited time we have to test?

When we consider such a task, we begin almost immediately to run into the problem of "complete testing". Suppose that a defect depends upon every variable in the system being in one specific state. If we could try all of the possible combinations, checking each aspect of the system in each of its states combined with all of the other aspects in each of their possible states, we would thereby test every state and every combination of states in the system. With an effective test of this nature, we'd be guaranteed to find any defect that depended on the state of the system.

That guarantee of finding any state-based defect is very desirable. Alas, it's completely infeasible; even for a small program, with a relatively simple set of variables and relatively few possible states per variable, the total number of possible valid states in combination is intractably large. To arrive at that number, we consider each variable, and count the number of valid states for it. We then multiply the numbers of all those valid states for each variable together.
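
If it helps to see that arithmetic in code, here's a tiny sketch (Python, with invented state counts) that simply multiplies the per-variable counts together:

```python
# A minimal sketch: the total number of combinations is the product of the
# number of valid states for each variable. The counts here are invented.
import math

states_per_variable = [3, 2, 4]           # three hypothetical variables
print(math.prod(states_per_variable))     # 24 possible combinations
```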

Suppose that there are five variables, each represented by a letter of the alphabet. And suppose that each variable can contain a value from one to five. Let's set variables A, B, C, and D all equal to 1. If those values are fixed, variable E can have a value from one to five—so five combinations there. We'll keep track of the total number of combinations in our leftmost column; our first five combinations will be numbered from 1 to 5.

Table 1: Varying column E only

Combination Number   A   B   C   D   E
1                    1   1   1   1   1
2                    1   1   1   1   2
3                    1   1   1   1   3
4                    1   1   1   1   4
5                    1   1   1   1   5


Let's set variable D to two instead of one, and keep A, B, and C at one. Five more combinations, numbered 6-10.

Table 2: Varying Column E, having changed D

Combination Number   A   B   C   D   E
6                    1   1   1   2   1
7                    1   1   1   2   2
8                    1   1   1   2   3
9                    1   1   1   2   4
10                   1   1   1   2   5

Then we'll set D to 3, then to 4, and then to 5, setting E to values from 1 to 5 each time. We'll have to go through 25 combinations before we exhaust all of the combinations for D and E, and only then can we change C from 1 to 2. The columns on the right will roll over relatively quickly, but as we move leftwards, columns A, B, and C will change much less often, like the odometer on a car.

In fact, if there are only five variables in the program, each with five possible states, the program would be absurdly simple, yet we'd have to test 3125 combinations—five combinations for the first variable, times five for the second, times five for the third, times five for the fourth, times five for the fifth—5 x 5 x 5 x 5 x 5. If we could test one combination per minute, complete testing would take seven and a half days of continuous, seven-hour-a-day tests—assuming that we didn't die of boredom first. There's got to be a way to reduce the number of tests to something that we can handle.
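
For the curious, here's a small sketch of that "odometer" enumeration; itertools.product rolls the rightmost column over fastest, just as the tables above do. (This is illustration, not a suggestion that you run all 3125 tests.)

```python
# Enumerate every combination of five variables, each with values 1..5.
from itertools import product

combos = list(product(range(1, 6), repeat=5))   # columns A, B, C, D, E

print(len(combos))    # 3125 = 5 x 5 x 5 x 5 x 5
print(combos[0])      # (1, 1, 1, 1, 1) -- combination 1 in Table 1
print(combos[4])      # (1, 1, 1, 1, 5) -- combination 5
print(combos[5])      # (1, 1, 1, 2, 1) -- combination 6 in Table 2
```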

To start figuring out how to cut down the number of tests, let's assume a simple program—an even simpler program than the one above, in which we have only two checkboxes. A checkbox can be in one of two states—checked or unchecked. Imagine that our defect depends upon checkbox A being cleared (that is, unchecked) and checkbox B being set (checked). If we try all of the possible settings in combination with one another, we'll find the defect. We'll set up a table, with a column for each variable and a row for each combination. Within each cell, we'll put one of the variable's possible states—set or cleared. In the leftmost column, we'll put an index number so that we can refer to each combination by number.

Table 3: Possible states for two checkboxes

Combination Number   Box A     Box B
1                    Set       Set
2 (defect!)          Cleared   Set
3                    Set       Cleared
4                    Cleared   Cleared

Combinations 1, 3, and 4 work all right, but Combination 2 exhibits the problem. We'd require four tests to make sure that we had covered all of the combinations in which A and B could be found. Two variables, and two settings for each; four combinations.
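
To make that concrete, here's a small sketch (in Python) that walks the four combinations and flags the one that trips a hypothetical defect condition; the has_defect function is invented to stand in for the bug described above.

```python
# Check all four checkbox combinations against a hypothetical defect.
def has_defect(box_a: bool, box_b: bool) -> bool:
    return (not box_a) and box_b          # A cleared and B set, as in the text

def state(checked: bool) -> str:
    return "Set" if checked else "Cleared"

# Enumerate in the same order as Table 3 (Box A varies fastest).
combos = [(a, b) for b in (True, False) for a in (True, False)]
for number, (box_a, box_b) in enumerate(combos, start=1):
    note = " (defect!)" if has_defect(box_a, box_b) else ""
    print(f"{number}{note}  Box A={state(box_a)}  Box B={state(box_b)}")
```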

Now consider something only a little more complex: a black-box test of a dialog representing travel needs in North America, with two sets of radio buttons and a checkbox on it. The first set of radio buttons specifies a destination, by country; the second affords the choice of coach, business, or first class; and the checkbox offers the choice of an aisle or window seat. When the user makes her selections and presses "OK" to close the dialog, the application takes a different path of execution for each combination of options in their given states.

As usual, the simplest kind of problem to detect is one that is triggered by a single variable in a single state. This is called a single-mode fault [Phadke1997]. For example, consider an application that contains a defect that causes it to behave bizarrely when the "destination" option is set to "Canada". Naturally, the way to find such a defect would be to ensure that we run at least one test with "destination" set to "Canada".

Assume now that there is a defect that depends upon two conditions—for example, the destination field being set to "Mexico" and the service class being set to "Business" at the same time. This is known as a double-mode fault. It should be clear that if we run at least one test with "destination" set to "Mexico" and the "Class" field set to "Business", we'll encounter the defect.

With two objects and two options for each, we could compile a relatively straightforward table. However, with any additional complexity, the required testing effort grows very quickly. For two radio buttons (each with three possible states) and one checkbox, there are 3 x 3 x 2, or 18 possible combinations to test. Here's a table that shows each possible combination:



Table 4: Possible states for two radio buttons and one checkbox

Test           Destination   Class            Seat Preference
1              Canada        Coach            Aisle
2              Mexico        Coach            Aisle
3 (defect!)    USA           Coach            Aisle
4              Canada        Business Class   Aisle
5              Mexico        Business Class   Aisle
6              USA           Business Class   Aisle
7              Canada        First Class      Aisle
8              Mexico        First Class      Aisle
9              USA           First Class      Aisle
10             Canada        Coach            Window
11             Mexico        Coach            Window
12 (defect!)   USA           Coach            Window
13             Canada        Business Class   Window
14             Mexico        Business Class   Window
15             USA           Business Class   Window
16             Canada        First Class      Window
17             Mexico        First Class      Window
18             USA           First Class      Window
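
If you'd rather generate a table like that than type it out, here's a sketch that reproduces the same 18 combinations, using the option values from the example dialog:

```python
# Generate the 18 combinations of Table 4. Destination varies fastest and
# seat preference slowest, to match the order of the table.
destinations = ["Canada", "Mexico", "USA"]
classes = ["Coach", "Business Class", "First Class"]
seats = ["Aisle", "Window"]

rows = [(d, c, s) for s in seats for c in classes for d in destinations]
for number, (d, c, s) in enumerate(rows, start=1):
    print(number, d, c, s)

print(len(rows))    # 18 = 3 x 3 x 2
```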


We invent tests in response to risks. There's a risk that the developer has written his code with a single-mode fault, so we do want to be sure that we test every variable in each of its states if we can. There is a lesser risk that the developer has introduced a double-mode fault, so ideally we'll want to produce at least one test of each variable in each of its states with every other variable in each of its states. If our theory of error is that a developer has made a double-mode mistake (say, if there is a conflict between one country, Mexico, and one service class, Business), we need to run at least one test that includes the combination of "Mexico" and "Business". On the other hand, if we run a number of equivalent tests on Mexico/Business, we could be duplicating effort and wasting time.

The most difficult kind of problem to find by black-box testing is one in which several variables are involved, and each must be in a specific, unique state to trigger the problem. For example, when the country field is set to "USA", the seat choice to "aisle", and the service class to "First", then and only then will the bug be triggered. This is known as a triple-mode fault, or more generally as a multi-mode fault, which describes faults associated with three or more parameters. For testers, finding this kind of problem depends on one of two approaches: testing all combinations of variables and states in the program, or some other technique combined with... well, luck. The first option is flatly impossible for anything but the most trivial program. The key to the second option is to optimize the tests so as to be as lucky as possible.

In 1997, researchers at Telcordia Technologies (formerly Bell Communications Research, or Bellcore) published a paper by Siddhartha Dalal et al., "The Combinatorial Design Approach to Automatic Test Generation" [Telcordia 1997]. Telcordia's studies suggest that "most field faults were caused by either incorrect single values or by an interaction of pairs of values." If that's generally correct, we ought to focus our testing on the risk of single-mode and double-mode faults. The Telcordia paper suggests that we can get excellent coverage by choosing tests such that 1) each state of each variable is tested, and 2) each variable in each of its states is tested in a pair with every other variable in each of its states. This is called pairwise testing or all-pairs testing.

Focusing on single- and double-mode faults reduces dramatically the number of combinations that we have to run. If all of the pairs in a given combination exist in other combinations, we can drop that combination. For example, consider combination 14 from Table 4 above: the Mexico/Business pair also appears in combination 5, the Business/Window pair appears in combination 13, and the Mexico/Window pair appears in combination 11. We can drop combination 14 without leaving any of its pairs untested.

Here's a table, based on Table 4 above, that tests all of the pairs of values:

Table 5: Possible paired states for two radio buttons and one checkbox

Number        Destination   Class            Seat Preference
1             Canada        Coach            Aisle
3 (defect!)   USA           Coach            Aisle
5             Mexico        Business Class   Aisle
8             Mexico        First Class      Aisle
9             USA           First Class      Aisle
11            Mexico        Coach            Window
13            Canada        Business Class   Window
15            USA           Business Class   Window
16            Canada        First Class      Window

This table makes sure that Canada is tested at least once with each of Coach, Business Class, and First Class, and with the checkbox in the Aisle state and the Window state. Similarly, every option is tested with every other option. We capture the double-mode fault exposed in combination number 3; we need not test combination 12, since its pairs are covered by combination 15 (which tests Destination=USA and Seat=Window), combination 11 (Class=Coach and Seat=Window) and combination 3 (which tests Destination=USA and Class=Coach, and exposes the conflict between them). Using the all-pairs theory as the basis for selecting tests, in this case we cut in half the number of test requirements, from 18 to nine.
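
If you'd rather check that claim than take my word for it, here's a sketch that collects every pair of values from the full 18-row table and confirms that the nine rows above cover all of them:

```python
# Verify the all-pairs claim for Table 5: every (column, value) pair that
# occurs in the full table also occurs in the reduced table.
from itertools import combinations, product

full_table = list(product(["Canada", "Mexico", "USA"],
                          ["Coach", "Business Class", "First Class"],
                          ["Aisle", "Window"]))

reduced_table = [
    ("Canada", "Coach", "Aisle"),            # 1
    ("USA", "Coach", "Aisle"),               # 3 (defect!)
    ("Mexico", "Business Class", "Aisle"),   # 5
    ("Mexico", "First Class", "Aisle"),      # 8
    ("USA", "First Class", "Aisle"),         # 9
    ("Mexico", "Coach", "Window"),           # 11
    ("Canada", "Business Class", "Window"),  # 13
    ("USA", "Business Class", "Window"),     # 15
    ("Canada", "First Class", "Window"),     # 16
]

def pairs(row):
    # every pair of (column index, value), taken two columns at a time
    return {((i, row[i]), (j, row[j])) for i, j in combinations(range(len(row)), 2)}

required = set().union(*(pairs(row) for row in full_table))
covered = set().union(*(pairs(row) for row in reduced_table))
print(required <= covered)    # True: all 21 pairs are covered by nine rows
```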

Now, how did we manage to winnow out the duplicates? In this case, the table of all possible combinations has 18 entries. The easiest approach is to work from the bottom of the table upwards: if every pair in a line also appears somewhere in the lines that remain, we can remove that line.

- Line 18 contains USA, First Class, and Window.
  1. The USA-First Class pair appears in line 9.
  2. The First Class-Window pair appears in lines 16 and 17.
  3. The USA-Window pair appears in lines 12 and 15.

So we can drop line 18.

- Line 17 contains Mexico, First Class, and Window.
  1. The Mexico-First Class pair appears in line 8.
  2. The First Class-Window pair appears in line 16.
  3. The Mexico-Window pair appears in lines 11 and 14.

We can drop line 17 too.

- Line 16 contains Canada, First Class, and Window.
  1. Canada-First Class appears in line 7.
  2. Canada-Window appears in line 13.
  3. First Class-Window doesn't appear anywhere else in the lines that remain.

We must keep line 16 to preserve the First Class-Window pair.
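
That bottom-up check is tedious by hand, but it's easy to hand to a computer. Here's a sketch of the same procedure: walk the full table from the bottom, and drop any line whose pairs all appear in the lines that remain. (The nine lines it keeps aren't necessarily the same nine as in Table 5, but every pair stays covered.)

```python
# Drop rows from the bottom up whenever all of their pairs are already
# covered by the rows that remain.
from itertools import combinations

def pairs(row):
    return {((i, row[i]), (j, row[j])) for i, j in combinations(range(len(row)), 2)}

destinations = ["Canada", "Mexico", "USA"]
classes = ["Coach", "Business Class", "First Class"]
seats = ["Aisle", "Window"]

# Build the 18 rows in the same order as Table 4.
rows = [(d, c, s) for s in seats for c in classes for d in destinations]

kept = list(rows)
for row in reversed(rows):
    others = [r for r in kept if r is not row]
    if pairs(row) <= set().union(*(pairs(r) for r in others)):
        kept = others          # every pair appears elsewhere; drop this row

print(len(rows), "->", len(kept))    # 18 -> 9
```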

This is a reasonable, albeit tedious, approach when the main table is very small, but even a modest increase in the number of options makes it unmanageable. As we add variables or values, three things happen. First, the size of the all-combinations table expands exponentially. Second, the effort associated with sifting out the duplicated pairs becomes enormous. On the third point, there is some good news: the all-pairs selection cuts the number of required combinations much more dramatically than in our example above. Assume a program that has 75 binary options; the table of all possible combinations of states would extend to 2^75 entries. (Just for fun, that number happens to be 37,778,931,862,957,161,709,568.) Telcordia's paper notes that 75 parameters of two states each can be pair-checked with only 28 combinations! [Telcordia, 1997]

So all-pairs testing is a very powerful technique for reducing the number of tests to be run, but there are two problems to consider.

The first problem is that if the defect in our example requires three variables to be just so (Destination=Canada, Class=Business, and Seat Preference=Aisle), our smaller table won't trigger the defect. Still, simply bashing away at random will be even less efficient. One workaround is to complement pairwise testing with other testing and QA techniques—code inspections, coverage tools, boundary tests and risk-based tests, to name but a few.

The second problem is that an all-pairs table takes significant time to construct and to check for anything but a small number of variables, each with a small number of possible states. If you want to test for more than pairs--triples, or n-tuples--the problem quickly becomes intractable for a human test planner. Fortunately, there are a few solutions to this problem.

- Telcordia has created a Web-based tool called the Automatic Efficient Test Generator (AETG). You can find more information about the AETG Web service here; you can also find a paper which describes the principles here. The paper is tough but worthwhile reading. AETG-based testing isn't for everyone, but the principles behind it are worth considering, and the service may be useful for your project.

- James Bach has written a tool called ALL PAIRS, a command-line Perl script that will create all-pairs tables. James' site is http://www.satisfice.com/. ALL PAIRS is not as efficient as AETG, but as he says, "Telcordia Technologies...has a web-based tool that does a somewhat better job than All Pairs...However, their tool costs $6000 a seat. Compare that to FREE."
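
Neither tool is required to see the underlying idea, though. Here's a minimal, unoptimized sketch of a greedy generator: start from the full table of combinations, and repeatedly pick whichever candidate row covers the most pairs that haven't been covered yet. It won't match AETG or ALL PAIRS for efficiency, but it produces a small all-pairs table for modest inputs.

```python
# A naive greedy all-pairs generator (illustration only, not AETG/ALL PAIRS).
from itertools import combinations, product

def pairs(row):
    return {((i, row[i]), (j, row[j])) for i, j in combinations(range(len(row)), 2)}

def greedy_all_pairs(parameters):
    candidates = list(product(*parameters))
    uncovered = set().union(*(pairs(c) for c in candidates))
    chosen = []
    while uncovered:
        # pick the candidate that covers the most still-uncovered pairs
        best = max(candidates, key=lambda c: len(pairs(c) & uncovered))
        chosen.append(best)
        uncovered -= pairs(best)
    return chosen

tests = greedy_all_pairs([["Canada", "Mexico", "USA"],
                          ["Coach", "Business Class", "First Class"],
                          ["Aisle", "Window"]])
print(len(tests))    # a small all-pairs table (nine rows for this example)
```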

At this point, you should have a sufficient understanding of pairwise testing to understand its value (and if you don't, I'd like you to let me know). If you're interested in more formal math surrounding pairwise testing, read on.

About Orthogonal Arrays

Pairwise testing is strongly influenced by a mathematical construct called an orthogonal array, or an OA. Orthogonal arrays are used in a variety of disciplines, including medical research, manufacturing, metallurgy, polling, and other fields that require testing and statistical sampling.

An orthogonal array has specific properties. First, an OA is a rectangular array or table of values, presented in rows and columns, like a database or spreadsheet. In this spreadsheet, each column represents a variable or parameter. Here are some column headers for a compatibility test matrix.

Table 6: Column Headers for a Test Matrix

Combination Number   Display Resolution   Operating System   Printer

The value of each variable is chosen from a set known as an alphabet. This alphabet doesn't have to be composed of letters—it's more abstract than that; consider the alphabet to be "available choices" or "possible values". A specific value, represented by a symbol within an alphabet, is formally called a level. That said, we often use letters to represent those levels; we can also use numbers, words, or any other symbol. As an example, think of levels in terms of a variable that has Low, Medium, and High settings. We can represent those settings in our table using the letters A, B, and C. This gives us a three-letter, or three-level, alphabet.

At the intersection of each row and column, we have a cell. Each cell contains a variable set to a certain level. Thus in our table, each row represents a possible combination of variables and values, as in Table 4 above.

Imagine a table that contains combinations of display settings, operating systems, and printers. One column represents display resolution (800 x 600, or Low; 1024 x 768, or Medium; and 1600 x 1200, or High); a second column represents operating systems (Windows 98, Windows 2000, and Windows XP); and a third column represents printer types (PostScript, LaserJet, and BubbleJet). We want to make sure to test each operating system at each display resolution; each printer with each operating system; and each display resolution with each printer. We'd need a table with 3 x 3 x 3 rows—27 rows—to represent all of the possible combinations; here are the first five rows in such a table:

Table 7: Compatibility Test Matrix (first five rows of 27)

Combination Number   Display Resolution   Operating System   Printer
1                    Low                  Win98              PostScript
2                    Low                  Win98              LaserJet
3                    Low                  Win98              BubbleJet
4                    Low                  Win2000            PostScript
5                    Low                  Win2000            LaserJet

Writing all those options out takes time; let's use shorthand. All we need is a legend or mapping that associates each value with a letter of the alphabet: Low=A, Medium=B, High=C; Windows 98=A, Windows 2000=B, and Windows XP=C; PostScript=A, LaserJet=B, and BubbleJet=C. Again, this is only the first five rows in the table.

Table 8: Compatibility Test Matrix (first five rows of 27)

Combination Number   Display Resolution   Operating System   Printer
1                    A                    A                  A
2                    A                    A                  B
3                    A                    A                  C
4                    A                    B                  A
5                    A                    B                  B

Note that the meaning of the letter depends on the column it's in. In fact, the table can mean whatever we like; we can construct very general OAs using letters, instead of specific values or levels. When it comes time to test something, we can associate new meanings with each column and with each letter. Again, the first five lines:

Table 9: Compatibility Test Matrix (first five rows of 27)

Combination Number   Variable 1   Variable 2   Variable 3
1                    A            A            A
2                    A            A            B
3                    A            A            C
4                    A            B            A
5                    A            B            B

While we can replace the letters with specific values, a particular benefit of this approach is that we can broaden the power of combination testing by replacing the letters with specific classes of values. One risk associated with free-form text fields is that they might not be checked properly for length; the program could accept more data than expected, copy that data to a location (or "buffer") in memory, and thereby overwrite memory beyond the expected end of the buffer. Another risk is that the field might accept an empty or null value even though some data is required. So instead of associating specific text strings with the levels in column 3, we could instead map A to "no data", B to "data from 1 to 20 characters", and C to "data from 21 to 65,336 characters". By doing this, we can test for risks broader than those associated with specific enumerated values.
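
As a sketch of that idea, here's one way to express the legend in code: the abstract rows stay the same, and a per-column mapping translates each letter into either a concrete value or a class of data. The particular legend below is invented for illustration.

```python
# Map abstract letters to concrete values or data classes, column by column.
legend = {
    "Variable 1": {"A": "800 x 600", "B": "1024 x 768", "C": "1600 x 1200"},
    "Variable 2": {"A": "Windows 98", "B": "Windows 2000", "C": "Windows XP"},
    "Variable 3": {"A": "no data",
                   "B": "1 to 20 characters",
                   "C": "more than 20 characters"},
}

abstract_rows = [("A", "A", "A"), ("A", "A", "B"), ("A", "A", "C"),
                 ("A", "B", "A"), ("A", "B", "B")]   # first five rows of Table 9

columns = list(legend)
for row in abstract_rows:
    print({column: legend[column][letter] for column, letter in zip(columns, row)})
```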

For example, suppose that we are testing an insurance application. Assume that Variable 1 represents a tri-state checkbox (unchecked = A = no children, checked = B = dependent children, and greyed out = C = adult children). Next assume that Variable 2 represents a set of radio buttons (labeled single, married, or divorced). Finally assume that Variable 3 represents a free-form text field of up to 20 characters for the spouse's first name. Suppose further that the text field is properly range-checked to make sure that it's 20 characters or fewer. However, suppose that a bug exists wherein the application produces a garbled record when the spouse's name is empty, but only when the "married" radio button is selected. By using classes of data (null, valid, excessive) in Column 3, pairwise testing can help us find the bug.

Back to orthogonal arrays. There are two kinds of orthogonal arrays: those which use the same-sized alphabet over all of the columns, and those which use a larger alphabet in at least one column. The latter type is called a "mixed-alphabet" orthogonal array, or simply a "mixed orthogonal array". If we decide to add another display resolution, we need to add another letter (D) to track it; we would then have a mixed orthogonal array.

A regular (that is, non-mixed) orthogonal array has two other properties: strength and index.

Formally, an orthogonal array of strength S and index I over an alphabet A is a rectangular array with elements from A having the property that, given any S columns of the array, and given any S elements of A (equal or not), there are exactly I rows of the array where those elements appear in those columns.

Only a mathematician could appreciate that definition. Let's figure out what it means in practical terms.

If we're using an OA to test for faults, strength essentially refers to the mode of the fault for which we're checking. While checking for a double-mode fault, we consider pairs of columns and pairs of letters from the alphabet; we would need a table with a strength of 2. We select a strength by choosing a certain number of columns from the table and the same number of values from the alphabet; we'll call that number S. The array is orthogonal if, in our S columns, each combination of S symbols appears the same number of times; or to put it another way, when S = 2, each pair of symbols must appear the same number of times. Thus we are performing pairwise or all-pairs testing when we create test conditions using orthogonal arrays of strength 2.
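
In code, the strength-2 property is a counting exercise: for every pair of columns, tally how often each pair of symbols appears. The array is orthogonal at strength 2 if every tally is the same, and that common tally is the index. Here's a sketch using a small, hand-made strength-2 array over a two-letter alphabet:

```python
# Count symbol pairs over every pair of columns; equal counts => strength 2.
from collections import Counter
from itertools import combinations

def pair_counts(rows):
    counts = Counter()
    for row in rows:
        for i, j in combinations(range(len(row)), 2):
            counts[(i, j, row[i], row[j])] += 1
    return counts

rows = [("A", "A", "A"),     # a classic strength-2 array over {A, B}:
        ("A", "B", "B"),     # in every pair of columns, each of the four
        ("B", "A", "B"),     # symbol pairs appears exactly once (index 1)
        ("B", "B", "A")]
print(set(pair_counts(rows).values()))    # {1}
```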

Were we checking for a triple-mode fault (a fault that depends on three variables set to certain values), we would need to look at three columns at a time, and combinations of three letters from the alphabet: a table with a strength of 3. In general, strength determines how many variables we want to test in combination with each other.

An index is a more complicated concept. The orthogonal array has an index I if there are exactly I rows in which any given S values from the alphabet appear in any given S columns. This effectively means that an OA with an index must use the same alphabet in every column, or at least alphabets of equal size, which amounts to the same thing. A mixed-alphabet orthogonal array therefore cannot have an index.

The index is important when you want to make sure not only that each combination is tested, but that each combination is tested the same number of times. That's important in manufacturing and product safety tests, because it allows us to account for wear or friction between components. When we test combinations of components using orthogonal arrays, we find not only which combination of components will break, but also which combination will break first.

This leads us to the major difference between orthogonal array testing and all-pairs testing. If we are searching for simple conflicts between variables, we would presume that we'll expose the defect on the first test of a conflicting value pair. If we stick to that presumption, it is not very useful to test the same combination of paired values more than once, so when testers use pairwise testing, they use orthogonal arrays of strength 2 to produce pairwise test conditions, and, to save time, they'll tend to stick to an index of 1, to avoid duplication of effort. In cases where the alphabets for each column are of mixed size, we'll take the hit and test some pairs once and some pairs more than once.

Moreover, unless we specifically plan for it, it's not very likely that the variables in a piece of software or the parameters in configuration testing will have the same number of levels per variable. If we adjust all of our columns such that they have the same-sized alphabets, our array grows. This gives us an orthogonal array that is balanced, such that each combination appears the same number of times, but at a cost: the array is much larger than necessary for all-pairs. In all-pairs testing, the combinations are pairs of variables in states such that each variable in each of its states is paired at least once with every other variable in each of its states. In strictly orthogonal arrays, if one pair is tested three times, all pairs have to be tested three times, whether the extra tests reveal more information or not.

A strictly orthogonal array will have a certain number of rows for which the value of one or more of the variables is irrelevant. This isn't necessary as long as we've tested each pair at least once; thus in practical terms, a test suite based on a strictly orthogonal array has some wasted tests. Consequently, in pairwise testing, our arrays will tend not to have an index, and will be "nearly orthogonal".

So let's construct an orthogonal array. We have three columns, representing three variables. We'll choose an alphabet of Red, Green, and Blue--that's a three-level alphabet. Then we'll arrange things into a table for all of the possible combinations:

Table 10: All Combinations for Three Variables of Three Levels Each

Combination   A       B       C
1             Red     Red     Red
2             Red     Red     Green
3             Red     Red     Blue
4             Red     Green   Red
5             Red     Green   Green
6             Red     Green   Blue
7             Red     Blue    Red
8             Red     Blue    Green
9             Red     Blue    Blue
10            Blue    Red     Red
11            Blue    Red     Green
12            Blue    Red     Blue
13            Blue    Green   Red
14            Blue    Green   Green
15            Blue    Green   Blue
16            Blue    Blue    Red
17            Blue    Blue    Green
18            Blue    Blue    Blue
19            Green   Red     Red
20            Green   Red     Green
21            Green   Red     Blue
22            Green   Green   Red
23            Green   Green   Green
24            Green   Green   Blue
25            Green   Blue    Red
26            Green   Blue    Green
27            Green   Blue    Blue

Now, for each pair of columns, AB, AC, and BC, each pair of colours appears exactly three times. Our table is an orthogonal array of level 3, strength 2, and index 3. In order to save testing effort, let's reduce the appearance of each pair to once.

Table 11: All-Pairs Array, Three Variables of Three Levels Each

Combination   A       B       C
2             Red     Red     Green
4             Red     Green   Red
9             Red     Blue    Blue
12            Blue    Red     Blue
14            Blue    Green   Green
16            Blue    Blue    Red
19            Green   Red     Red
24            Green   Green   Blue
26            Green   Blue    Green

How did I happen to choose these nine specific combinations? That was the easy part: I cheated. By trial and error, I couldn't get the table down to fewer than 12 entries without spending more time than I felt it was worth. The All Pairs tool mentioned above got me down to 10 combinations. However, on the Web, there are precalculated orthogonal array tables for certain numbers of variables and alphabets; one is here; that's the one that I used to produce the chart above.
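
If you're suspicious of borrowed tables (a healthy trait in a tester), the result is easy to check: for each pair of columns in Table 11, every pair of colours should appear exactly once.

```python
# Confirm that the nine rows of Table 11 cover every colour pair exactly once.
from collections import Counter
from itertools import combinations

rows = [("Red", "Red", "Green"), ("Red", "Green", "Red"), ("Red", "Blue", "Blue"),
        ("Blue", "Red", "Blue"), ("Blue", "Green", "Green"), ("Blue", "Blue", "Red"),
        ("Green", "Red", "Red"), ("Green", "Green", "Blue"), ("Green", "Blue", "Green")]

counts = Counter((i, j, row[i], row[j])
                 for row in rows
                 for i, j in combinations(range(3), 2))
print(len(counts), set(counts.values()))    # 27 {1}: every pair, exactly once
```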

Mixed-alphabet orthogonal arrays permit alphabets of different sizes in the columns. The requirement is that the number of rows where given elements occur in a given number of prescribed columns is constant over all choices of the elements, but is allowed to vary for different sets of columns. Such an orthogonal array does not have an index, and is known as a "nearly orthogonal array." The airline example above is of this "mixed alphabet" type. The consequence is that some combinations will be tested more often than others. Note that in Table 5, there are two instances of Canada as the destination and Window as the seating preference, but each of those tests is significant with respect to the Class column. In cases where a single column has a larger alphabet than all the others, we'll test all pairs, but some pairs may be duplicated or may have variables that can be set to "don't care" values. If there's an opportunity to broaden the testing in a "don't care" value, for example by trying a variety of representatives of a class that might otherwise be considered equivalent, we can extend the coverage of the tests somewhat.

Once again, the goal of using orthogonal arrays in testing is to get the biggest bang for the testing buck, performing the smallest number of tests that are likely to expose defects. To keep the numbers low, the desired combinations are pairs of variables in each of their possible states, rather than combinations of three or more variables. This makes pairwise testing a kind of subset of orthogonal array testing. An orthogonal array isn't restricted to pairs (strength 2); it can be triples—strength 3, or combinations of three variables; or n-tuples—strength n, or combinations of n variables. However, orthogonal arrays with a strength greater than 2 are large, complex to build, and generally return too many elements to make them practicable. In addition, in the case of mixed alphabets, strictly orthogonal arrays produce more pairwise tests than we need to do a good job, so all-pairs testing uses nearly orthogonal arrays, which represents reasonable coverage and reduced risk at an impressive savings of time and effort.
