Comparing Two Black-Box Testing Strategies for Software Product Lines
Abstract: Software Product Line (SPL) testing has been considered a challenging task, mainly due to the diversity of products that might be generated from an SPL. To deal with this problem, several techniques for specifying and deriving product-specific functional test cases have been proposed. However, there is not much empirical evidence of the benefits and drawbacks of these techniques. To provide this kind of evidence, we conduct studies that compare two design techniques for black-box manual tests: a generic technique that we have observed in an industrial test execution environment, and a product-specific technique whose functional test cases could be derived using any SPL technique that considers variations in functional tests. We evaluate their impact from the point of view of the test execution process, obtaining results that indicate that executing product-specific test cases is faster and generates fewer errors.
Infrastructure Material
From the links below you can download all the material used to perform both experiments. This material includes the RGMS products with instructions on how to run them, the test suites used (written in Portuguese), the Testwatcher tool, and the training and dry-run material (also written in Portuguese).
Instructions
Test Environment and the database installer we used
Test Suites
Collected Data
Here you can download the data that we collected during the experiments. This material includes the sheets generated by the Testwatcher tool and the CRs (Change Requests) reported by each subject.
1st experiment data
2nd experiment data
Data Analysis
Here we provide the R scripts used to run our data analysis, together with the data files. We also display some results and graphics that we could not include in the JUCS paper due to lack of space.
Data to run the analysis
Below we show the box plots for each experiment.
In the graphics below we show the individual execution times. The first dotplot shows that, regardless of the feature used to run the tests, 17 of the 18 subjects ran the test suites faster using the specific technique (ST). The second dotplot shows that 8 of the 10 subjects ran the test suites faster using the ST.
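For illustration only, the sketch below shows how such plots could be produced in R. It assumes a hypothetical data frame named times with columns time, subject, feature, technique and replica (one row per test suite execution); these names are ours and may differ from the ones used in the downloadable scripts and data files.

  library(lattice)

  # Hypothetical input: one row per test suite execution,
  # with columns time, subject, feature, technique and replica
  times <- read.csv("execution_times.csv")   # hypothetical file name
  times[c("subject", "feature", "technique", "replica")] <-
    lapply(times[c("subject", "feature", "technique", "replica")], factor)

  # Box plot of execution times per technique
  bwplot(time ~ technique, data = times, ylab = "Execution time")

  # Dotplot of individual execution times per subject, grouped by technique
  dotplot(time ~ subject, data = times, groups = technique,
          auto.key = TRUE, ylab = "Execution time")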
After the descriptive analysis we proceeded to the hypothesis tests. Because we used a Latin square design, we created an effect model for our response variable (execution time). This model states that the response variable is the sum of the influence factors considered in our experiment (Latin square replica, subject, feature and technique) plus the residual. With this effect model we can run an ANOVA test to check whether the trend observed in the descriptive graphics is statistically significant. But before running the ANOVA, we first needed to run some tests to check whether the ANOVA assumptions held for our data.
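As a sketch, over the hypothetical times data frame introduced above, such an additive effect model can be fitted in R as follows (the actual scripts may use different names):

  # Additive effect model for the Latin square design:
  # time = overall mean + replica + subject + feature + technique + residual
  model <- aov(time ~ replica + subject + feature + technique, data = times)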
First we checked the assumption of equality or homogeneity of variances, that is, that the variance of the data in the different groups should be the same. Below we can see the Box-Cox plot, which suggests, at the 95% confidence level, that our model residuals maintain an approximately constant variance. We can see this because the interval of values of the transformation parameter lambda whose log-likelihood lies above the 95% line contains the value 1, meaning that no transformation of the response is needed.
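A minimal sketch of this check, again over the hypothetical fit above (MASS::boxcox expects an lm fit, so we refit the same formula with lm):

  library(MASS)

  # Same additive model, refitted with lm so that boxcox() applies
  lm_model <- lm(time ~ replica + subject + feature + technique, data = times)

  # Box-Cox plot: if the interval of lambda values above the 95% line
  # contains 1, no transformation of the response is needed
  boxcox(lm_model, lambda = seq(-2, 2, by = 0.1))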
The second assumption that we examined was whether the distribution of the residuals followed a normal distribution. We ran the Shapiro-Wilk hypothesis test to examine this property. It tests the null hypothesis that the data set follows a normal distribution, so a high p-value means we cannot reject this hypothesis. At the 95% confidence level we could not reject the null hypothesis in either experiment. In the first experiment the p-value was 0.1456, and in the second one it was 0.4659.
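In R this check amounts to a single call on the residuals of the effect model (sketch over the hypothetical fit above):

  # Shapiro-Wilk test on the model residuals;
  # a p-value above 0.05 means normality cannot be rejected
  shapiro.test(residuals(model))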
The last property that we wanted to investigate was whether our model was additive, that is, whether there was no interaction between our control factors. So we ran the Tukey test of additivity, which tests the null hypothesis that the model is additive. Once more, we had high p-values in both experiments (0.5743 in the first one and 0.7976 in the second one), hence we cannot reject the null hypothesis that our model is indeed additive.
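The downloadable scripts contain the calls we actually used; as an illustration only, Tukey's one-degree-of-freedom test for non-additivity can also be computed by hand by adding the squared fitted values of the additive model as an extra term and testing its significance, as sketched below over the hypothetical fit:

  # Hand-rolled Tukey one-degree-of-freedom test for non-additivity:
  # add the squared fitted values of the additive model as an extra regressor
  times$fit_sq <- fitted(model)^2
  augmented <- aov(time ~ replica + subject + feature + technique + fit_sq,
                   data = times)

  # The F test for the fit_sq row is the non-additivity test;
  # a high p-value means additivity cannot be rejected
  summary(augmented)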
Finally, we ran the ANOVA test to examine whether the technique factor had a significant impact on the execution time. This time the null hypothesis stated that there was no significant difference between the execution time means achieved with the generic technique (GT) and with the ST. Again we used a 95% confidence level, and in both experiments the p-value (0.0001 in the first one and 0.0109 in the second one) allowed us to reject the null hypothesis. Our conclusion is that, within the scope of our studies, there is a significant difference between the GT and ST execution time means, with ST showing smaller values than GT.
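Over the hypothetical fit above, the final ANOVA table is obtained with a single call; the technique row carries the p-value compared against 0.05, and a quick look at the per-technique means shows which technique is faster:

  # ANOVA table for the effect model; the 'technique' row gives the p-value
  summary(model)

  # Mean execution time per technique (GT vs. ST)
  aggregate(time ~ technique, data = times, FUN = mean)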
In case of any problem, please contact:
-- PaolaAccioly - 2012-06-01