Written by David Foster, CEO of Caveon
In the beginning, one of my biggest concerns about the SmartItem was how it would contribute (or not) to the reliability of an exam. As you may have learned by now, each SmartItem can be presented in tens of thousands, even hundreds of thousands, of ways. No two test takers see the same items, and the effect compounds with the number of SmartItems on each test. Logically and psychometrically, I thought it would all work out, but I was looking forward to seeing the first reliability calculations from an information technology certification test, one of the first to use the SmartItem.
The test had 62 SmartItems, and the reliability statistic was based on the responses to those items by 70 individuals. The reliability was calculated at .83, a decent and acceptable result. Whew!
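The article doesn't name the coefficient behind that .83; for a fixed-length test scored right/wrong, coefficient alpha (equivalent to KR-20 for dichotomous items) is the usual choice. Here is a minimal Python sketch, with a synthetic 70 x 62 response matrix standing in for the real data; the function name and the simple logistic data model are illustrative, not Caveon's actual procedure:

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Coefficient alpha for a persons-by-items score matrix.

    With dichotomous (0/1) items this is equivalent to KR-20.
    """
    k = scores.shape[1]
    item_var = scores.var(axis=0, ddof=1)         # variance of each item column
    total_var = scores.sum(axis=1).var(ddof=1)    # variance of the total scores
    return (k / (k - 1)) * (1 - item_var.sum() / total_var)

# Synthetic stand-in for the exam described above: 70 candidates x 62 items.
# A simple logistic ability/difficulty model makes item responses correlate,
# so alpha lands well above zero (pure coin-flip data would not).
rng = np.random.default_rng(7)
ability = rng.normal(size=(70, 1))
difficulty = rng.normal(size=(1, 62))
responses = (rng.random((70, 62)) < 1 / (1 + np.exp(-(ability - difficulty)))).astype(int)
print(f"alpha = {cronbach_alpha(responses):.2f}")
```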
For this IT certification exam, a field test was conducted and item analyses were run to evaluate the performance of the SmartItems. For each SmartItem, we calculated the p-value (the proportion of examinees answering the item correctly) to see which questions were too easy or too difficult. We also used the point-biserial correlation as a measure of the SmartItem’s ability to discriminate between more competent and less competent candidates. These are statistics based on classical test theory (CTT) and are routinely used to evaluate items on most high-stakes tests in use today.
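Both statistics are a few lines of code over the same persons-by-items matrix. A sketch under the same assumptions as the reliability example above; note that the item-rest ("corrected") form of the point-biserial is shown here, though the uncorrected item-total form is also common, and the 0.10 cutoff for "near zero" is an illustrative choice, not one from this analysis:

```python
import numpy as np

def item_analysis(scores: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """CTT item statistics for a persons-by-items 0/1 response matrix.

    Returns (p_values, point_biserials): the proportion of examinees
    answering each item correctly, and each item's correlation with
    the score on the rest of the test.
    """
    p_values = scores.mean(axis=0)
    totals = scores.sum(axis=1)
    pbis = np.empty(scores.shape[1])
    for j in range(scores.shape[1]):
        rest = totals - scores[:, j]              # item-rest ("corrected") total
        pbis[j] = np.corrcoef(scores[:, j], rest)[0, 1]
    return p_values, pbis

# Same synthetic 70 x 62 matrix as in the reliability sketch above.
rng = np.random.default_rng(7)
ability = rng.normal(size=(70, 1))
difficulty = rng.normal(size=(1, 62))
responses = (rng.random((70, 62)) < 1 / (1 + np.exp(-(ability - difficulty)))).astype(int)

p_values, pbis = item_analysis(responses)
too_hard = np.where(p_values < 0.20)[0]           # difficulty threshold used below
flat = np.where(np.abs(pbis) < 0.10)[0]           # "near zero" cutoff is illustrative
print(f"{too_hard.size} items with p < .20, {flat.size} with near-zero point-biserial")
```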
My first peek at these data was gratifying. The analysis looked… well… normal. I’ve seen hundreds of these analyses over my career, and this set looked like the others: some items performed better than others, with most performing in acceptable ranges. A small number performed poorly, which is to be expected. Of the 62 SmartItems, only 4 had p-values below .20, indicating that these 4 might be too difficult to keep on the exam. In addition, only 8 items had correlations close to zero. The large majority, therefore, performed as designed and built.
Given that these were SmartItems, and were therefore viewed differently by each candidate, it was wonderful to see that they could serve competently on a high-stakes certification exam. Based on the individual SmartItem statistics, it isn’t too surprising that the resulting reliability was high.
After this, our interest was piqued; we wanted to know how well…