A Caveon colleague and I recently wrote a chapter on how test security can be improved by considering item and test designs. A prominent reviewer of the chapter made this clear statement: “We don’t design/develop test items with regard to test security.” (He added that items were designed with the sole purpose of measuring a particular skill or construct.) I have thought about that comment a great deal since receiving his review, and I have come to the conclusion, based on more than 30 years as a practicing psychometrician, that he is simply wrong. Let me give a simple example to help show why I reached this conclusion.
With almost all computerized tests, multiple-choice items have their options randomized before presentation. This is done for at least one good security reason: to make sure that test takers seated next to each other cannot easily copy from one another. At the test level, the items themselves are presented in random order for the same reason. Another reason to randomize options is to make item harvesting a bit more difficult. If a person (a “harvester”) sent to memorize a particular question on an exam recalls only the A, B, or C label of the correct option, that information is no longer very useful, because the options will appear in different positions for subsequent test takers.
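To make the mechanics concrete, here is a minimal sketch of how per-examinee randomization might work. The Item structure, the present_form function, and the choice to seed the shuffle with an examinee identifier are my own illustrative assumptions, not a description of any particular testing platform.

```python
import random
from dataclasses import dataclass

@dataclass
class Item:
    stem: str
    options: list[str]  # illustrative: options[0] holds the keyed (correct) answer

def present_form(items: list[Item], examinee_id: str) -> list[Item]:
    """Return a shuffled copy of the form for one examinee.

    Both the item order and each item's option order are randomized, so
    neighbors see different sequences and a harvested "the answer was B"
    memory is of little use to later test takers.
    """
    rng = random.Random(examinee_id)  # per-examinee seed (an assumption for illustration)
    shuffled_items = []
    for item in rng.sample(items, k=len(items)):  # randomize item order
        opts = item.options[:]
        rng.shuffle(opts)                         # randomize option order
        shuffled_items.append(Item(item.stem, opts))
    return shuffled_items
```

Seeding by an examinee identifier, rather than shuffling unpredictably, has the side benefit that the exact form a given test taker saw can be reproduced later for review.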
Cheating on tests and the related test fraud category of item harvesting are a growing threat to the validity of test scores. In my opinion, today they likely represent a larger threat to validity than poor item design or the authoring of bad items. Testing programs need to use every tool in the toolbox to put up effective defenses against these threats and to detect test fraud as soon as it occurs. Two of these tools are item design and test design. I’m going to devote the rest of my remarks to the former. Let me start with a couple of analogies.
A building is designed and built with several purposes in mind. One is that it has to be functional for the reason it is built. But buildings also have to be built with durability, maintenance, accessibility, and many other purposes in mind, including security. It seems unreasonable to think that anything should be built by considering only a single purpose. Even a simple toy like a yo-yo is designed to be both fun and safe, and probably affordable as well. Test items are more complex than yo-yos and are likewise built with several purposes in mind. They are certainly built to measure a particular skill (I use the term “skill” very broadly), as the reviewer noted. They are also built so that they can be scored easily, multiple-choice questions being a prime example. They are also designed to be presented on available and chosen presentation platforms, such as a test booklet or a computer screen. I just listened to an address by a very respected psychometrician at the conference of the International Test Commission who admonished us all that items should be designed with the principles of Universal Design in mind, so that a question measures individuals with disabilities as well as it measures those without them. I could come up with more examples, but these make it clear that the measurement of a skill or construct is not the only factor to consider when designing a test item. Why not security?
Items, like tests, can be designed to improve security. That effort can take a defensive position, such as limiting the exposure of the items to make it more difficult to steal the content or to cheat, or it can serve a more active monitoring purpose, with the design being such that when a cheater takes and answers the item, the item helps to detect the cheating. An example of the defensive approach is the Discrete Option Multiple Choice (DOMC) item. One way the DOMC item has security value is that, by presenting options one at a time (instead of all at once) and by ending the item before all options are presented, it reduces or limits the exposure of item content without harming its measurement value. This exposure benefit makes it difficult to steal the content and difficult for subsequent test takers to profit from the theft. As a monitoring and detection device, items can also be designed to detect cheaters without affecting their measurement capability. For example, the options in a multiple-choice item can be purposefully arranged (as opposed to randomized or fixed in order) so that, working together with other items on the test, they provide a code that will reveal the source of items later discovered on the Internet; a small sketch of this idea follows. Decoding the order of options for items discovered this way could lead right back to the parties involved in the theft.
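To illustrate the detection idea, here is a toy sketch in which each item's option ordering is drawn from the set of possible permutations, and the sequence of permutation choices across the items of a form spells out an identifier that can later be read back from leaked content. The function names and the simple digit-per-item encoding are hypothetical simplifications for illustration, not a description of any deployed system.

```python
from itertools import permutations

def encode_order(options, digit):
    """Arrange one item's options using the `digit`-th permutation of their positions."""
    perms = list(permutations(range(len(options))))
    order = perms[digit % len(perms)]
    return [options[i] for i in order]

def watermark_form(items, session_code):
    """Arrange each item's options so that, across the form, they encode session_code.

    items: list of (stem, options) tuples; session_code: one small integer per item.
    Recovering the permutation used for each item found in leaked content
    reconstructs the code, pointing back to the session that saw this form.
    """
    return [(stem, encode_order(opts, digit))
            for (stem, opts), digit in zip(items, session_code)]

def decode_leaked_item(original_options, leaked_options):
    """Recover the digit embedded in one leaked item (assumes option texts are distinct)."""
    perms = list(permutations(range(len(original_options))))
    order = tuple(original_options.index(o) for o in leaked_options)
    return perms.index(order)
```

With four options per item there are 24 possible orderings, so even a handful of items is enough to encode an identifier that distinguishes many test sessions, all without changing the content the examinee sees or how the item measures.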
When I consider the many possibilities that computerizing tests affords us, I see some great things in store for the field of testing. The ability to create truly new and innovative items and tests that solve some of our very difficult and long-standing problems is exciting. As we leave behind the era of paper-based testing, I hope we can achieve a proper balance between useful, rapid technology-based advancement and the preservation of valuable principles from our history.