What is Exam Validity?
The Commission on Accreditation of Allied Health Education Programs (CAAHEP) standards state that EMS educational programs must “provide both the students and program faculty with valid and timely indications of the students’ progress.”1 In other words, your program must demonstrate that students are on track to meet the expectations set for them at the outset of the semester. To effectively do so, you must use valid examinations. Validity is the most important factor to consider when developing an exam. But what does it mean for a test item or exam to be valid?
Some quick vocab to get us started: Testing literature often refers to “test items,” which may seem unfamiliar. An item is simply a test question. Not all test questions are posed in question form, so they are referred to as items. A terminal competency exam is commonly referred to as a final. Back to validity!
What does validity mean?
The Standards for Educational and Psychological Testing defines validity as “the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests.”2 Essentially, determining validity is a search for truth: a valid test is an accurate, true measure of student competency. In our case, we should think of validity as a continuum of evidence. The more evidence we gather in support of our intended interpretation of the scores, the stronger the test’s validity. Validity for a terminal competency exam in EMS is measured by how well the exam assesses a graduating student’s cognitive abilities.
How do I demonstrate validity?
A few different techniques are available for investigating the validity of an examination. These analyses fall under the domain of psychometrics, a field of psychological study concerned with the theory and technique of mental measurement.
(A brief disclaimer: I’m no psychometrician. Reading this post won’t teach you classical test theory, item response theory, or how to set a cut score. That said, read on to learn the broader strokes of validity.)
The CoAEMSP guideline interpretations elaborate on the importance of validity, stating that:
Validity must be demonstrated on major exams, but methods may vary depending on the number of students. All exams should be reviewed by item analysis, which includes difficulty index (p+) and discrimination index (point bi-serial [sic] correlation). The results of the review must be documented as well as any changes to exams that resulted from the review. Programs with large enrollments may be able to employ recognized mathematical formulas.3
Yeah, item analysis can seem seriously complex. Let’s define each of those terms.
Difficulty & Discrimination
The item difficulty index indicates how often the correct answer was selected. For example, if 12 students in a 24-student cohort answer item #1 correctly, the difficulty index (p-value) for that item is 50% (values range from 0% to 100%). Although this value is simple to calculate, it doesn’t provide enough information on its own for a well-informed item analysis: it tells us how many students chose the correct answer, but nothing about which students did. This is why item discrimination is crucial.
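To make the arithmetic concrete, here is a minimal sketch in Python; the response data are made up for illustration and are not drawn from any real exam:

```python
# Difficulty index (p-value): the proportion of students who answered the item correctly.
# Illustrative data for one item in a 24-student cohort (1 = correct, 0 = incorrect).
responses = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1,
             0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0]

p_value = sum(responses) / len(responses)
print(f"Difficulty index: {p_value:.0%}")  # 12 of 24 correct -> 50%
```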
The item discrimination index shows which students in the cohort answer the item correctly (the point-biserial correlation). In other words, it tells us whether high-performing students answer the item correctly more frequently than low-performing students. This value ranges from -1.0 to 1.0; the closer the discrimination index is to 1.0, the better the item performs. If low-performing students answer the item correctly more often than high-performing students, indicated by a negative discrimination index, the item likely needs to be edited to improve its validity.
Where calculating the difficulty index of an item is simple division, calculating discrimination is a bit more complex. The value preferred by the CoAEMSP to indicate discrimination is the point-biserial. A point-biserial correlation relates each student’s response to an item (correct or incorrect) to that student’s overall exam score.
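As a rough illustration of the calculation (not Fisdap’s implementation; the function name, sample responses, and scores below are made up), a point-biserial can be computed like this in Python:

```python
from statistics import mean, pstdev

def point_biserial(item_responses, total_scores):
    """Point-biserial correlation between a dichotomous item score
    (1 = correct, 0 = incorrect) and each student's overall exam score."""
    p = mean(item_responses)          # difficulty index for the item
    q = 1 - p
    correct = [s for r, s in zip(item_responses, total_scores) if r == 1]
    incorrect = [s for r, s in zip(item_responses, total_scores) if r == 0]
    sd = pstdev(total_scores)         # spread of overall exam scores
    if sd == 0 or not correct or not incorrect:
        return 0.0                    # no variation: discrimination is undefined
    return (mean(correct) - mean(incorrect)) / sd * (p * q) ** 0.5

# Illustrative data: one item's results and overall exam percentages for 8 students.
item = [1, 1, 1, 0, 1, 0, 0, 0]
totals = [92, 88, 85, 81, 79, 74, 70, 62]
print(f"Point-biserial: {point_biserial(item, totals):.2f}")  # about 0.77
```

In this toy example the students with the highest overall scores are the ones answering the item correctly, so the point-biserial is strongly positive; a value near zero or below would flag the item for review.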
Fisdap’s Test Item Analysis Report can be really useful at this point: This report automatically imports student scores and calculates the difficulty index and point biserial values. (To learn more about the Test Item Analysis Report, head over to Fisdap’s help documentation or check out our webinar on “Testing & Accreditation.”)
Item Validity vs. Test Validity
So you’ve gone through rigorous content and grammar review and calculated the difficulty and discrimination indices. You have a valid item! You’re done, right?
Unfortunately, not quite. It’s possible to craft a quality test item that, on its own, is an effective measure of one part of the educational domain, but measuring the validity of an entire exam is another undertaking. This is the chief reason that exams constructed from test banks are bad news: picking and choosing items, no matter how valid each may be, without measuring how effectively the collection performs as a whole can result in an invalid measurement tool.
There aren’t necessarily different “types” of test validity, but rather a variety of types of evidence used to demonstrate it. Developing and Validating Test Items sets forth an “argument-based” approach to test validation. The authors’ checklist contains sixteen questions, each representing a “different and complementary type of validity evidence.”
Consider how you might answer each of these questions for your exam:
- What type of target domain is intended?
- How is the target domain organized?
- How is the universe of generalization organized?
- How much fidelity is there between the target domain and the universe of generalization?
- Which item formats will be used?
- How are items developed?
- Is the scoring key accurate?
- What is the content for each item?
- What is the intended cognitive demand for each item for a typical set of test takers?
- Were items edited?
- Were items reviewed for currency and effectiveness?
- Were items reviewed for unnecessary linguistic complexity?
- Were items reviewed for fairness?
- Were items pretested effectively?
- Was the internal structure of the test content studied?
- Who decides whether an item stays or goes or gets revised?4
Once you have completed all the steps outlined in this checklist and the paragraphs above, you’ve got yourself a valid exam. Congratulations!
Have more questions? Contact Christina Morley at [email protected] to learn more about Fisdap's valid examinations.
Sources Cited:
1. CoAEMSP Interpretations of the CAAHEP Standards and Guidelines, Committee on Accreditation of Educational Programs for the Emergency Medical Services Profession (2014), Section IV.A.1.
2. Standards for Educational and Psychological Testing, American Educational Research Association (Washington, DC: American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999), 9.
3. CoAEMSP Interpretations of the CAAHEP Standards and Guidelines, Committee on Accreditation of Educational Programs for the Emergency Medical Services Profession (2014), Section IV.A.1.1.
4. Thomas M. Haladyna and Michael C. Rodriguez, Developing and Validating Test Items (New York, NY: Routledge, 2013), 12.