By definition, a standardized test is administered and scored in a consistent, or “standard”, manner. Such tests are designed so that the questions, the conditions of administration, the scoring procedures, and the interpretation of results all remain consistent.
Norming, Reliability and Validity of Psychometric Tests
Standardized tests can be composed of true–false items, multiple-choice questions, authentic assessments or essays; almost any form of assessment can be standardized. When psychometric assessments are created, questions are measured on scales, and these too are often most valid when standardized after creation.
Broadly speaking, we should look for these three factors when creating or standardizing psychometric tests:
- Nature of Reliability
- Understanding Validity
- Importance of Norming
A test is reliable if it produces similar results over time, across repeated administrations, or under similar circumstances.
Take a professional dart player as an example: his or her ability to consistently hit a designated target – even if not the bullseye – under specified conditions would classify him or her as an excellent and reliable player. That, however, says nothing about validity. In psychometric terms, a reliable test is one that produces stable results over time.
Over the years, scholars and researchers have developed multiple ways to check for reliability. These include testing the same participants at different points in time, or presenting participants with different versions of the same test to see how consistent the results are.
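Test–retest reliability of this kind is usually expressed as the correlation between two administrations of the same test. The sketch below is illustrative only – the function name and the scores are made up, not drawn from any real instrument – but it shows how a Pearson correlation quantifies the agreement:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two lists of scores (e.g. test vs. retest)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical scores from the same five participants, two weeks apart
first_sitting = [32, 28, 35, 30, 25]
second_sitting = [31, 27, 36, 29, 26]
print(round(pearson_r(first_sitting, second_sitting), 2))  # → 0.96
```

A value close to 1.0 (here about 0.96) indicates that participants kept roughly the same rank order across sittings, which is the essence of test–retest reliability.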
Suffice it to say that an assessment must demonstrate good reliability before it can be considered valid.
Validity is defined as a test’s ability to measure what it claims to measure. A test with high validity ensures its items (questions) remain closely linked to the test’s intended focus.
It is understandable to expect a test used in organizations to shed light on how a candidate would perform in a particular job. With this in mind, it is essential to reiterate the difference between reliability and validity, with the former being a prerequisite to the latter.
Let’s consider the same dart player. In repeated trials, he or she continues to miss the mark consistently by about two inches. This certainly implies reliable aim: each shot hits the board in a region two inches from the target. Yet it is difficult not to question his or her validity as a professional – since he or she never hits the bullseye, which is the aim of every professional dart player – in comparison with his or her peers.
Reliability and validity go hand in hand, but reliability by no means guarantees validity. As our example shows, having the first without the second yields great consistency – but a consistently inaccurate one.
Just as with reliability, there are established ways to test for validity: for example, correlating test scores with an external criterion such as later job performance (criterion validity), or having subject-matter experts judge whether the items cover the intended domain (content validity).
Even with a test that is both reliable and valid, a question remains about its results. An assessment fails without quantifiable results, but, as is often stated, human beings are far from easily quantifiable.
Psychometric tests are therefore often normed against groups for comparison. Norming avoids looking at individual items or questions, and instead compares an individual’s total score with that of a representative sample.
A representative sample means using a group of children when developing a test for children, and a group of adults when developing a test for adults. Samples are also generally made representative of the wider population on demographic factors such as age, gender, education and religion.
This is standard practice primarily because a psychometric test score of, say, 30 correct out of 40 is meaningless unless it is compared to the performance of others at a similar level on the same test. The practice of using relative scores becomes all the more important when interpreting ability test results.
If you score at the 94th percentile on a trait like extraversion, it means you are more extraverted than 94% of the sample group from which the test makers derived the norms. If, on the other hand, you scored 94% on a math test, it simply means you answered about 94 in every 100 questions correctly.
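The percentile interpretation above amounts to a simple rank against a norm sample. A minimal sketch, using made-up scores for a hypothetical 40-item scale:

```python
def percentile_rank(score, norm_sample):
    """Percent of the norm group scoring strictly below the given score."""
    below = sum(1 for s in norm_sample if s < score)
    return round(100 * below / len(norm_sample))

# Hypothetical norm group of ten raw scores (real norms use large samples)
norm = [18, 22, 25, 27, 28, 29, 30, 31, 33, 36]
print(percentile_rank(30, norm))  # → 60: higher than 60% of the norm group
```

Note that the same raw score of 30 would land at a different percentile against a different norm group, which is exactly why choosing an appropriate norm group matters.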
It’s important to note, however, that every test has its own appropriate norm group, and scores are best interpreted within the context of the role. For example, if a role involves numerical work but without the time pressure found in a testing scenario, a candidate with below-average results on a timed numerical reasoning test may be given the benefit of the doubt.
Where possible, it also makes sense to take the candidate’s response style into account when interpreting percentile scores. Response style concerns the trade-off between speed and accuracy: some people prefer a slower approach through ability tests – which are a part of psychometrics – to emphasize accuracy, while others cover more items at lower accuracy.
Psychological constructs such as personality have no right or wrong answers associated with them, and therefore cannot be marked using percentages. This is why academics and researchers alike resort to norming, among other methods, to make sense of scores on personality assessments.
With growing concerns over cost, convenience and other logistical challenges, technology-enabled assessments have also become popular over time, simply because they streamline the process, reduce costs, increase efficiency, and allow employers to assess and analyse more data points than was previously thought possible.
Know-how about the creation and standardization of psychometric tests aside, it is also imperative to understand how best to determine the quality of a psychometric test.
Coming soon… stand by for Chapter 3.