What is Item Response Theory (IRT) and how is it calculated?

Item response theory (IRT) is a framework widely used in psychometrics. It applies mathematical models and methods to achieve two key goals: assessing how test items perform and understanding the characteristics of test takers.

IRT models can be used to build adaptive tests, estimate test-taker ability, and examine the properties of test items (difficulty, discrimination, and guessing).

These models estimate how likely a test taker is to answer an item correctly, given their ability.

Commonly used IRT Models

• 1PL model

The 1PL model, popularly known as the Rasch model and one of the simplest IRT models, assumes the probability of a correct response depends on only two quantities: the test taker's ability and the item's difficulty.

Every item is assumed to discriminate equally well between ability levels, so difficulty is the only parameter that varies from item to item.

The 1PL model is used for dichotomously scored items (right/wrong), such as true-or-false questions or multiple-choice questions with a single correct answer.
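As a rough sketch, the 1PL response probability can be written as a logistic function of the gap between ability and difficulty (the function name and pure-Python style here are illustrative, not from any particular library):

```python
import math

def p_correct_1pl(theta, b):
    """1PL (Rasch) model: probability of a correct response given
    test-taker ability theta and item difficulty b, both on the logit scale."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# When ability exactly equals difficulty, the probability is 0.5
print(p_correct_1pl(0.0, 0.0))  # 0.5
```

When ability exceeds difficulty, the probability rises above 0.5; when difficulty exceeds ability, it falls below.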

• 2PL model

The 2PL model extends the 1PL model with a second item parameter, discrimination, which describes how sharply the probability of a correct response changes as ability increases.

Under the 2PL model, the probability of a correct response is determined by the test taker's ability together with the item's difficulty and discrimination.

The model is also used for dichotomous items and is well suited to tests whose items vary in difficulty and discrimination.

• 3PL model

This model extends the 2PL model with a third parameter that captures guessing. The guessing parameter is the probability that a test taker with very low ability still answers correctly by chance, which depends in part on how many response options the item offers.

The model is a good fit for items where guessing is possible, such as multiple-choice questions with several response options.
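A sketch of the 3PL form, which adds a guessing floor `c` (for a four-option multiple-choice item, c is often near 0.25; names are illustrative):

```python
import math

def p_correct_3pl(theta, a, b, c):
    """3PL model: c is the guessing (lower-asymptote) parameter, so even a
    very low-ability test taker answers correctly with probability about c."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
```

As ability drops, the probability approaches c rather than zero; as ability rises, it still approaches 1.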

Uses of IRT

Researchers calculate IRT parameters from test takers' responses to a test. The data are usually binary, indicating whether each test taker answered each item correctly or incorrectly.

Using maximum likelihood estimation (MLE), researchers fit the IRT model's parameters to this response data. Once the item parameters have been estimated, they can be used to estimate each test taker's ability and latent traits, and to characterize each item's difficulty.
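To make the estimation step concrete, here is a toy maximum-likelihood sketch for the Rasch model: item difficulties are treated as known and a test taker's ability is found by a simple grid search. A real fitting routine would estimate the item parameters as well, typically with numerical optimization; the names and grid here are illustrative.

```python
import math

def rasch_p(theta, b):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def log_likelihood(theta, difficulties, responses):
    """Log-likelihood of a binary response vector at ability theta.
    responses[i] is 1 if item i was answered correctly, else 0."""
    ll = 0.0
    for b, x in zip(difficulties, responses):
        p = rasch_p(theta, b)
        ll += math.log(p) if x else math.log(1.0 - p)
    return ll

def estimate_ability(difficulties, responses):
    """Grid-search MLE for ability over a plausible range of theta."""
    grid = [i / 100.0 for i in range(-400, 401)]
    return max(grid, key=lambda t: log_likelihood(t, difficulties, responses))
```

For example, a test taker who answers an easy and a medium item correctly but misses a hard one gets an ability estimate a bit below the hard item's difficulty.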

Ability is usually reported as a standard normal score, such as a z-score or t-score, which lets a test taker's performance be compared with other individuals or with a group. Item difficulty is expressed on the same latent scale, typically a logit or probit scale, which locates each item within a specific range.

This scale can be used to compare items' difficulty levels or to develop new items at a target difficulty. IRT also supports item selection, item-bias detection, and Adaptive Testing; for instance, it can identify test items suited to a specific target population.

IRT can detect bias in test items by comparing how subgroups of comparable ability perform on the same items, checking whether each item assesses test takers from all backgrounds fairly.
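A toy sketch of the idea: given responses to one item from two subgroups already matched on overall ability, flag the item when their correct-answer rates differ markedly. The helper name and threshold are illustrative; real analyses use formal differential item functioning (DIF) statistics.

```python
def dif_flag(group_a, group_b, threshold=0.1):
    """Flag potential item bias: compare proportion-correct on a single item
    between two ability-matched subgroups (responses are 1 = correct, 0 = wrong)."""
    pa = sum(group_a) / len(group_a)
    pb = sum(group_b) / len(group_b)
    return abs(pa - pb) > threshold
```

If the groups really are matched on ability, a large gap in correct-answer rates suggests the item behaves differently across backgrounds.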

IRT Models & Adaptive Testing

IRT is also used in Adaptive Testing, where the items presented to a test taker are chosen based on their performance on previous items.

This type of testing aims to give each test taker items whose difficulty matches their ability.

IRT models also drive item selection during Adaptive Testing by identifying the items best suited to the test taker's current ability estimate.
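One common selection rule can be sketched as picking the item with the greatest Fisher information at the current ability estimate, using the 2PL information formula a²p(1 − p). The item-bank format and names below are illustrative:

```python
import math

def p_2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * p * (1 - p)."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

def pick_next_item(theta, item_bank):
    """item_bank: list of (item_id, a, b); return the most informative item."""
    return max(item_bank, key=lambda it: item_information(theta, it[1], it[2]))
```

Information peaks where an item's difficulty is near the test taker's ability, so this rule naturally serves medium items to average performers and hard items to strong ones.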

Tangible benefits of IRT

  • Improved measurement precision: Ability is estimated with reference to each item's difficulty, yielding more precise and dependable test scores.
  • Better test design: Tests can be designed more effectively by selecting items based on their estimated difficulty; items can be added, modified, or removed to improve test quality.
  • Support for high-stakes testing: More dependable scores help identify test takers at risk of failing and support high-stakes uses such as college admissions or job selection.
  • Better diagnostic information: IRT provides richer diagnostic information than legacy scoring techniques by pinpointing each test taker's strengths and weaknesses and showing where they did well or struggled.
  • Better ability estimation: Ability can be estimated even from a small number of items, which makes IRT useful for shorter, lower-stakes tests.
  • Flexibility in handling missing data and item formats: IRT accommodates missing responses and a variety of item formats, including multiple-choice and Likert-scale items.
  • Use of IRT models in computerized Adaptive Testing (CAT): IRT plays a vital role in CAT, adapting test items to the test taker's ability and making exams more efficient and less time-consuming.
IRT delivers accurate and practical methods for evaluating the ability of the test taker. It can improve test design, enhance measurement precision, and offer better diagnostic information about the test taker.