Test Theories: TCT And IRT

One of the most important parts of any intervention is the evaluation, and that evaluation is often heavily conditioned by the results of the tests administered. But how have these tests been constructed, and how do we know that they are good?
Tests are used in psychology as measuring instruments. As a rough analogy, just as we use a meter rule to measure length, we can use a test to measure intelligence, memory, attention and so on. One difference between the two is that tests are not so easy to build, nor are they so easy to apply.

Furthermore, just as a single measurement does not allow us to talk about the volume of an object, the administration of a single test does not allow us to give a diagnosis or propose an intervention. Thus, tests are important for evaluation, but they do not determine it on their own.

This is where the psychologist plays the most important role: they have to use the information obtained from the test, and from other sources, to shape a coherent evaluation that leads to the planning of the intervention. In other words, it is in integrating results from different sources that the quality of the professional is most noticeable. This expertise is achieved with knowledge, but also with years of experience.

Brief history of test theories

The origin of tests is usually traced to the examinations used by Chinese emperors around 3000 BC, which aimed to evaluate the professional competence of the officials who were to enter their service. (1)

Modern tests have their closest origins in the measurements carried out by Galton (1822-1911) in his laboratory. However, James Cattell was the first to use the term mental test, in 1890. Since these first tests did not turn out to be very predictive of human cognitive capacity, researchers such as Binet and Simon (1905) introduced cognitive tasks in their new scale to assess aspects such as judgment, understanding and reasoning.

The Binet scale opened up a tradition of individual scales. In addition to cognitive tests, great advances were made in personality tests.

Why are test theories necessary?

Alongside all these advances, measurement theories (test theories) began to develop, and they directly affect tests as the instruments they are. Psychometrics arose from the concern to build instruments that measure what we want them to measure and do so with the least possible error. It requires every test or measuring instrument worthy of the name to be both valid and reliable.

Remember that reliability is understood as the stability or consistency of the measurements when the measurement process is repeated. In other words, a test will be more reliable the better it replicates the results when measuring two subjects – or the same subject on different occasions – who have the same level in what is measured. For its part, validity refers to the degree to which empirical evidence and theory support the interpretation of test scores. (2)
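As a rough illustration of reliability understood as stability across occasions, the sketch below computes the correlation between two administrations of the same test (often called test-retest reliability). The scores are made up for the example and the function name `pearson` is ours, so this is a minimal sketch rather than a full reliability study.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores of six people on two administrations of the same test.
first_administration = [12, 15, 9, 20, 17, 11]
second_administration = [13, 14, 10, 19, 18, 10]

# The closer this value is to 1, the more stable the measurements are.
test_retest_reliability = pearson(first_administration, second_administration)
print(round(test_retest_reliability, 2))
```

A value close to 1 would indicate that people keep roughly the same relative position on both occasions.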

There are two major theories, or approaches, for analyzing and building this type of instrument: the classical theory of tests (TCT) and the item response theory (IRT).

The classical theory of tests (TCT)

It is the dominant theory in the construction and analysis of tests. The reason: it is relatively easy to build tests that meet the minimum requirements of this paradigm. Evaluating the test itself is also relatively simple in terms of the aforementioned properties: reliability and validity.

It has its origin in the work of Spearman at the beginning of the 20th century. Later, in 1968, Lord and Novick reformulated this theory and ushered in the new IRT approach.

This theory is based on the classical linear model. Proposed by Spearman, the model assumes that the score a person obtains on a test, which we call the empirical score and usually designate by the letter X, is made up of two components. (2)

On the one hand, we find the subject’s true score in the test (V), and on the other, the error (e). It is expressed as follows: X = V + e.

Spearman adds three assumptions to this theory:

  • First, the true score (V) is defined as the mathematical expectation of the empirical score: it is the average score a person would obtain if they took the test an infinite number of times.
  • Second, there is no relationship between the size of the true scores and the size of the errors that affect those scores.
  • Finally, the measurement errors in one test are not related to the measurement errors in a different test.
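The model X = V + e and its three assumptions can be made visible with a small simulation. The sketch below is only illustrative: the means, standard deviations, sample sizes and variable names are our own choices, and the simulation builds the assumptions in by drawing true scores and errors independently, so the checks simply show what the assumptions look like in data.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(42)  # fixed seed so the run is reproducible
n_people = 5000

# Hypothetical population: true scores V and independent errors e on two tests.
true_scores = [rng.gauss(50, 10) for _ in range(n_people)]
errors_test_a = [rng.gauss(0, 5) for _ in range(n_people)]
errors_test_b = [rng.gauss(0, 5) for _ in range(n_people)]

# Empirical score on test A: X = V + e.
observed_a = [v + e for v, e in zip(true_scores, errors_test_a)]

# Assumption 1: for one person with V = 60, the mean of many repeated
# administrations approaches the true score.
one_person_v = 60.0
repeated_scores = [one_person_v + rng.gauss(0, 5) for _ in range(10000)]
mean_repeated = sum(repeated_scores) / len(repeated_scores)

print("mean of repeated scores:", round(mean_repeated, 2))
# Assumption 2: true scores and errors are uncorrelated (value near 0).
print("corr(V, e):", round(pearson(true_scores, errors_test_a), 3))
# Assumption 3: errors on different tests are uncorrelated (value near 0).
print("corr(e_A, e_B):", round(pearson(errors_test_a, errors_test_b), 3))
```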

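To complete this theory, Spearman defines parallel tests as those that measure the same thing but with different items. More formally, two tests are parallel when each person has the same true score on both and the error variance is the same in both.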

Limitations of the classical approach

The first limitation is that, within this theory, measurements are not invariant with respect to the instrument used. This means that if a psychologist evaluated the intelligence of three people with a different test for each one, the results would not be directly comparable. But why does this happen?

The results of the three instruments are not on the same scale: each test has its own. To compare, for example, the intelligence of several people who have been evaluated with different intelligence tests, the raw scores obtained from each test must be transformed onto a common scale.

The problem is that, by transforming the scores in this way, we assume that the normative groups used to build the scales of the different tests are comparable (same mean, same standard deviation), which is difficult to guarantee in practice. (1) The IRT approach was a great advance in this respect, since it aims to place the results obtained with different instruments on the same scale.
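To make the transformation concrete, here is a minimal sketch that converts raw scores from two different tests to a common scale with mean 100 and standard deviation 15. The norm-group means and standard deviations, the raw scores and the function name `to_standard_scale` are all hypothetical; the point is only to show where the assumption about comparable normative groups enters.

```python
def to_standard_scale(raw_score, norm_mean, norm_sd,
                      target_mean=100.0, target_sd=15.0):
    """Convert a raw score to a common scale using the test's norm group."""
    z = (raw_score - norm_mean) / norm_sd  # position relative to the norm group
    return target_mean + target_sd * z

# Hypothetical test A: norm group with mean 30 and SD 6; a person scores 36.
score_a = to_standard_scale(36, norm_mean=30, norm_sd=6)

# Hypothetical test B: norm group with mean 50 and SD 10; a person scores 60.
score_b = to_standard_scale(60, norm_mean=50, norm_sd=10)

print(score_a, score_b)  # both 115.0
```

Both people land on 115, but that equality only means the same thing if the two norm groups really are comparable, which is exactly the assumption questioned above.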

The second limitation of this approach is that the properties of the tests are not invariant with respect to the people used to estimate them. Within the framework of TCT, important psychometric properties of a test, such as item difficulty or reliability, depend on the sample used to calculate them. This problem also finds a solution, at least partially, in the IRT approach.
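A small example shows this sample dependence. In the sketch below, the classical difficulty index of an item (the proportion of people who answer it correctly) is computed for the same item in two made-up samples; the response data and the function name `difficulty_index` are hypothetical.

```python
def difficulty_index(responses):
    """Classical difficulty index: proportion of correct answers (1 = correct)."""
    return sum(responses) / len(responses)

# Responses to the SAME item in two hypothetical samples.
high_ability_sample = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
low_ability_sample = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]

print(difficulty_index(high_ability_sample))  # 0.8
print(difficulty_index(low_ability_sample))   # 0.3
```

The item looks easy in one sample and hard in the other, even though the item itself has not changed.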

Item Response Theory (IRT)

The item response theory (IRT) was born as a complement to the classical test theory. In other words, TCT and IRT can be used to evaluate the same test, and both can assign a score or weight to each item, although they may give a different result for each person. IRT yields a much better calibrated instrument; the problem is that this paradigm is associated with a much higher cost and requires the participation of specialized professionals.

IRT has several assumptions, but perhaps the most important is that any measurement instrument should be consistent with one idea: there is a functional relationship between the values of the variable measured by the items and the probability of answering them correctly. This function is called the Item Characteristic Curve (ICC). What do we assume then?

Something that from the outside may seem very logical but that TCT leaves unspecified. For example, the most difficult items would be those that only the people with the highest level of ability answer correctly. On the other hand, an item that everyone answers correctly would be of no use to us because it would have no power to discriminate. In other words, it would not give any information. This is just a small sketch of the revolution proposed by IRT.
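One common way to write an item characteristic curve is the three-parameter logistic model, with a discrimination parameter a, a difficulty parameter b and a pseudo-guessing parameter c. The article does not say which functional form is meant, so the sketch below assumes this model; the function name `icc_3pl` and the item parameter values are hypothetical.

```python
import math

def icc_3pl(theta, a, b, c):
    """Probability of a correct answer at ability level theta (3PL model).

    a: discrimination, b: difficulty, c: pseudo-guessing (lower asymptote).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical items: one hard, and one so easy that almost everyone passes it.
hard_item = {"a": 1.5, "b": 2.0, "c": 0.2}
very_easy_item = {"a": 1.5, "b": -4.0, "c": 0.2}

for theta in (-2, -1, 0, 1, 2, 3):
    p_hard = icc_3pl(theta, **hard_item)
    p_easy = icc_3pl(theta, **very_easy_item)
    print(theta, round(p_hard, 2), round(p_easy, 2))
```

In the printed values, the hard item separates people at high ability levels, while the very easy item gives almost the same probability to everyone in this range, so it hardly discriminates.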

Figure: item characteristic curve.

To see the differences between the two measurement models more clearly, we can take as a reference the table by José Muñiz (2010):

Table 1. Differences between TCT and IRT (Muñiz, 2010)

| Aspect | TCT | IRT |
|---|---|---|
| Model | Linear | Nonlinear |
| Assumptions | Weak (easy for the data to meet) | Strong (difficult for the data to meet) |
| Invariance of measurements | No | Yes |
| Invariance of the test properties | No | Yes |
| Scale of scores | Between 0 and the maximum score on the test | Unbounded (between -∞ and +∞) |
| Emphasis | Test | Item |
| Item-test relationship | Unspecified | Item characteristic curve |
| Description of the items | Difficulty and discrimination indices | Parameters a, b, c |
| Measurement errors | Standard error of measurement, common to the whole sample | Information function (varies with ability level) |
| Sample size | Can work well with samples of approximately 200 to 500 subjects | More than 500 subjects are recommended |
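The "Measurement errors" row can be illustrated with a short sketch. It assumes the two-parameter logistic model (no guessing parameter), for which the information of an item at ability level theta is a² · P · (1 − P), and the standard error of the ability estimate is 1 over the square root of the summed item information. The item parameters and function names are hypothetical.

```python
import math

def p_correct_2pl(theta, a, b):
    """Probability of a correct answer under the 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information_2pl(theta, a, b):
    """Item information at ability theta: a^2 * P * (1 - P)."""
    p = p_correct_2pl(theta, a, b)
    return a ** 2 * p * (1.0 - p)

def standard_error(theta, items):
    """Standard error of the ability estimate: 1 / sqrt(test information)."""
    test_information = sum(item_information_2pl(theta, a, b) for a, b in items)
    return 1.0 / math.sqrt(test_information)

# Hypothetical test of four items, each given as (a, b).
hypothetical_items = [(1.2, -1.0), (1.5, 0.0), (1.0, 0.5), (2.0, 1.0)]

for theta in (-3, -1, 0, 1, 3):
    print(theta, round(standard_error(theta, hypothetical_items), 2))
```

Unlike the single standard error of measurement used in TCT, the error here changes with the ability level: it is smallest where the items are concentrated and grows toward the extremes.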

This is how the two test theories are related. Although they are almost contemporary, IRT was born in response to the limitations of TCT. Even so, research still has a long way to go in this field of psychometrics.
