Reliability of Personality Tests

The world around us has changed dramatically with the spread of new technology. From communicating with people on the other side of the globe to submitting school- or work-related documents, everything has become easier and faster with the internet, computers, and smartphones. The hiring process is no exception. Online job hunting allows candidates to search and apply for more positions than before and improves their chances of finding a job they really love. However, it also means that where a position once attracted tens of applications (or a few hundred at most), it is now not unusual for big companies to receive thousands of applications for a single position. The standard hiring process, a phone interview with every candidate followed by an in-person interview with those who pass the first round, became too demanding on employers’ time and resources, so the recruitment process needed updating.

Incorporating pre-employment personality assessment into the recruitment process allows employers to assess more candidates quickly and to invite for an in-person interview only those who seem the best fit for the role and the company. Such personality testing can give hiring managers objective, quantifiable measures on which to base decisions, making it more likely that they hire the best person for the position in question. However, can these tests really deliver the results they promise? Are they reliable?

Reliability in Psychometrics

Reliability refers to the stability or consistency of a measure. A reliable measure will produce the same or very similar results when used repeatedly under constant conditions. A person’s height is often used as an example of an extremely reliable measure: the difference between two measurements of one person’s height taken only a few minutes apart is usually so small that it cannot be detected with a standard measurement tool. Measuring psychological constructs, such as personality, is not as straightforward as measuring physical quantities (e.g. height or weight). Nevertheless, it still has to be reliable. A test that produces different results every time it is used, even though the subject of measurement and the conditions are the same, is useless. It reveals nothing about the person, as there is no certainty which of the measurements, if any, is actually correct.

Before even asking about the reliability of a measure, certain conditions need to be taken into account because they influence the overall reliability of the measure. First, to get stable results on repeated assessments of a construct, the construct itself needs to be relatively stable and consistent over time. People’s mood, for instance, changes rather quickly, so measuring it with a testing tool and hoping to get consistent results does not make sense. Personality, however, has been found to be relatively stable over time from childhood through adulthood, so hiring managers can assess people’s personalities and draw conclusions about their future job performance from the results. If a candidate is assessed as extraverted because the personality test showed them to be assertive, talkative, and sociable, there is a high chance that they will still exhibit these characteristics in a few months or years.

The conditions of the testing situation are also particularly important for the reliability of a measure. Oftentimes, one of the first questions asked before someone is presented with a standardized test, whether it is a personality, skill, or intelligence test, is “How are you feeling?” or “Are you feeling alright? Do you need to refresh with a glass of water?” Of course, the assessors want the test taker to feel good, but the reason behind the questions is also to ensure that the person taking the test is in the right physical and mental state to take the test. They ask all of the candidates the same questions and try to ensure that things like temperature or lighting in the room are consistent between all assessments. Moreover, instructions for the assessment also have to be consistent for everyone to ensure equal conditions, so either trained and skilled people are used to explain the ins and outs of the assessment or standardized computer instructions are presented to all candidates. Once these conditions are met, the reliability of the measure can be interpreted in a meaningful way and it makes sense to explore how reliable an assessment is.

Reliability, however, cannot be computed exactly; it can only be estimated. Several types of estimates provide information about different aspects of a measure’s consistency.

Test-retest reliability

Test-retest reliability is an estimate of consistency over time, so for this type of reliability it is extremely important that the measured construct is stable over time. To measure test-retest reliability, a test is administered twice to the same group of people, and the correlation between the two sets of results is determined. If the test is reliable, there should not be any significant difference between the results, and their correlation should be high.

When administering the same assessment tool twice to the same people, choosing the right time lag between the two rounds of testing is essential for obtaining meaningful data. If the time between the two assessments is too short, people might remember the questions and how they answered them, which inflates the reliability estimate. If the time lag is too long, there is a risk that even relatively stable constructs will change under unexpected circumstances. Intelligence, a highly stable construct, can deteriorate with age or after experiences such as injuries. Waiting a long time between two measurements may reduce the risk of recall, but it brings difficulties of its own, so choosing the right time frame for the measured construct is a vital part of correctly determining test-retest reliability.
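The test-retest procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scores are invented, and `pearson` is a hypothetical helper name for a plain Pearson correlation, which is one common way to compare the two rounds.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Invented extraversion scores (1-5 scale) for six people, tested twice.
first_round = [4.2, 2.1, 3.5, 4.8, 1.9, 3.0]
second_round = [4.0, 2.4, 3.6, 4.6, 2.2, 2.9]

# A value close to 1 suggests the test ranks people consistently over time.
test_retest_r = pearson(first_round, second_round)
print(f"test-retest correlation: {test_retest_r:.2f}")
```

A high correlation alone does not rule out recall effects, so the time lag between the two rounds still has to be chosen with care.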

Internal consistency

This type of reliability estimate determines the consistency of a person’s answers across various questions related to the same construct. Usually, a test consists of several items that all ask about the same underlying construct in slightly different ways. Extraversion, for instance, might be assessed with the item “I see myself as someone who is talkative” but also with “I see myself as someone who is outgoing, sociable.” When a person strongly agrees with the first statement, there is a high probability that the second statement also applies to them, as both are generally true of extraverted people. If, however, the answers to such items are not correlated, the items do not measure the same construct and the test lacks internal consistency.
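One common statistic for estimating internal consistency, not named in the text above, is Cronbach's alpha. It compares the summed variance of the individual items with the variance of the total score. The sketch below uses invented responses, and `cronbach_alpha` is a hypothetical function name.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha for a table with one row per person, one column per item."""
    k = len(responses[0])                       # number of items
    items = list(zip(*responses))               # transpose: one tuple per item
    item_variance_sum = sum(variance(col) for col in items)
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)

# Invented answers of five people to three extraversion items (1-5 agreement scale).
responses = [
    [5, 4, 5],
    [2, 2, 1],
    [4, 4, 4],
    [1, 2, 2],
    [3, 3, 3],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha: {alpha:.2f}")
```

Values closer to 1 indicate that the items move together, which is what one would expect if they all tap the same construct.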

Inter-rater reliability

The assessors (or raters) and their judgment play a big part in some types of assessments. If a rater assesses a person’s leadership skills based on their answers to hypothetical situations, the assessment only has objective value if a different rater would reach the same or very similar conclusions from the candidate’s answers. This type of consistency estimate is called inter-rater reliability and is usually calculated as the correlation between the raters’ ratings.
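As a rough illustration of the correlation between raters mentioned above, the following sketch averages the pairwise Pearson correlations of three raters. The ratings, the rater labels, and the `pearson` helper are all invented for the example; other agreement statistics exist and may suit a given rating scheme better.

```python
from itertools import combinations
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length lists of ratings."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Invented leadership ratings (1-10) given by three raters to the same six candidates.
ratings = {
    "rater_a": [7, 4, 9, 5, 6, 8],
    "rater_b": [6, 5, 9, 4, 6, 7],
    "rater_c": [7, 5, 8, 5, 5, 8],
}

# Correlate every pair of raters, then average the pairwise values.
pairwise = {
    (a, b): pearson(ratings[a], ratings[b])
    for a, b in combinations(ratings, 2)
}
mean_r = mean(list(pairwise.values()))

for pair, r in pairwise.items():
    print(pair, f"{r:.2f}")
print(f"mean inter-rater correlation: {mean_r:.2f}")
```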

Parallel-forms reliability

Some assessments have several versions that test the same construct with different items, created by dividing a large pool of items into equivalent parts. To determine the reliability of the test, both versions are administered to the same people one after another, and the correlation between the results is checked. Determining reliability with such parallel forms can help overcome the problem of recalled questions and answers that affects test-retest reliability. However, proving that the versions are truly equivalent can be challenging.
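The idea of splitting an item pool into two forms and correlating the results can be sketched as follows. The responses are invented, and assigning alternating items to each form is an assumption made for brevity; building real parallel forms also involves matching items on content and difficulty.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Invented answers of six people to a pool of six items (1-5 scale).
pool_responses = [
    [5, 4, 5, 5, 4, 4],
    [2, 1, 2, 2, 3, 2],
    [3, 3, 4, 3, 3, 4],
    [1, 2, 1, 1, 2, 1],
    [4, 4, 3, 4, 5, 4],
    [2, 3, 3, 2, 2, 3],
]

# Assumption: items at even positions form version A, odd positions version B.
form_a_items = range(0, 6, 2)
form_b_items = range(1, 6, 2)

form_a_scores = [sum(row[i] for i in form_a_items) for row in pool_responses]
form_b_scores = [sum(row[i] for i in form_b_items) for row in pool_responses]

parallel_forms_r = pearson(form_a_scores, form_b_scores)
print(f"parallel-forms correlation: {parallel_forms_r:.2f}")
```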

Reliability of the most used personality tests

Once it is clear what reliability and reliability estimates are, it becomes obvious that personality assessment tests as a whole cannot simply be declared reliable or unreliable. It largely depends on the assessment itself. While some personality assessment tools are highly reliable and can produce objective and quantifiable results, other personality tests are probably no more reliable than a weekly zodiac horoscope in a magazine.

One of the most widely used personality assessment tests is the Myers–Briggs Type Indicator (MBTI). Of the Fortune 100 companies, 89 use this assessment tool in their recruitment process or with their current employees, which means that several million people have taken this test in recent years. The test divides people into 16 personality types based on which combination of four dichotomous categories (Introversion/Extraversion, Intuition/Sensing, Thinking/Feeling, Judging/Perceiving) most closely represents their personality. Although it is one of the most used tests, researchers have pointed out that the assessment is not very reliable. After only a five-week lag between the first and second testing with the MBTI, test takers have a 50% chance of being assigned a different type. Five weeks is usually considered too soon for retesting, as recall might still be an issue for some people, yet even after such a short period the results were not consistent. The assessment has been criticized not only for its low reliability but also for the lack of scientific evidence for the dichotomous categories and for the absolutist nature of a theory that labels people as one extreme or the other; that, however, is a question of the validity of the assessment rather than its reliability. Despite this criticism and these serious objections, the assessment tool remains popular and is still in use, probably because companies and employees are used to the test, know what to expect, and find it quite simple to use. Nevertheless, they should not forget that the results are far from reliable and that no important decisions should be based on results that can differ from one testing to the next.

Another popular theory often used for personality assessment is the Big Five Personality Traits model, also known as the Five-Factor Model. Although not as widespread in the recruitment world as the MBTI, this theory has endured since its creation in the second half of the 20th century and is still growing in popularity today. It postulates that five broad traits or factors (openness to experience, conscientiousness, extraversion, agreeableness, and neuroticism) cover almost every personality trait or characteristic there is. In contrast with the MBTI, these traits are not considered dichotomous absolutes but continua, and a person can fall anywhere along each one. This allows for more variability and less labeling, making the theory more reflective of the real world. The theory, its validity, and its reliability have been scrutinized extensively over the past few decades, and researchers have found that it can be considered broadly universal (although with the best results in the Western world) and one of the most reliable personality theories. Many different measures are based on the Big Five theory, for instance, the NEO-PI-R, the International Personality Item Pool (IPIP), or the tools offered at Test Center, and these measures have shown high reliability, so they can be used effectively in the hiring process. Test-retest reliability is usually sufficiently high for all of the traits, and internal consistency estimates also tend to be higher than for the MBTI. Psychology researchers from various parts of the world recommend the Big Five over other personality assessment theories because of these characteristics. That is also why Test Center based its assessment tools on this theory, while aiming to make the tests even more reliable by using proprietary artificial intelligence methods to adjust scoring coefficients.

Reliability does not mean validity

While reliability tells us how consistent a measure is, there is a second important indicator of the quality of psychometric measurement tools: validity. Validity indicates to what extent the measure really measures what it claims to measure. A tool assessing intelligence should really measure the level of intelligence and not another construct, such as memory. Validity is usually determined by checking how well the results of the assessment correspond with results obtained by a different tool that measures the same construct, or how well the results correspond with established theories that are known to be true and valid. The high reliability of an assessment tool does not automatically mean that the tool is also valid. A test can produce stable, consistent results, but if the tool measures a completely different construct, the results will still not give us the information we want. Assessing only the reliability of a tool is, therefore, not enough to ensure correct and usable outcomes.

That is why using the MBTI assessment tool in the recruitment process has been highly criticized: not only the reliability but also the validity of the measure has been questioned since its creation. The results not only differ from one testing to the next, but there is also no certainty that they provide any valuable information about personality. Tests based on the Big Five, on the other hand, have been properly validated, and there is sound evidence for the five traits of human personality regardless of people’s location or background.
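The difference between reliability and validity discussed in this section can be made concrete with a small numeric sketch. All numbers are invented: a hypothetical test gives nearly identical scores on two occasions, yet those scores barely relate to the scores from an established measure of the construct it claims to assess.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Invented scores of six people on a hypothetical test, taken twice.
test_time_1 = [10, 20, 30, 40, 50, 60]
test_time_2 = [11, 19, 31, 39, 52, 58]

# Invented scores of the same people on an established measure of the target construct.
established_measure = [3, 5, 2, 5, 2, 4]

reliability_r = pearson(test_time_1, test_time_2)        # consistency over time
validity_r = pearson(test_time_1, established_measure)   # agreement with the established measure

print(f"reliability estimate: {reliability_r:.2f}")
print(f"validity estimate:    {validity_r:.2f}")
```

In this made-up case the test is highly consistent but tells us almost nothing about the construct of interest, which is exactly the situation a reliability check alone cannot detect.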


Using personality assessment tools in the recruitment process can allow companies to make quicker and better hiring decisions based on objective, quantifiable measures. However, to gain these advantages, the selection of the test itself is extremely important because not all personality tests are equal in what they can offer. Reliability is one indicator of the quality of a tool; it shows how consistent a measure is. A reliable test should yield very similar results under consistent conditions. If the results are not the same or similar, either the measured construct is not stable or the test is not reliable. Using a test that is not reliable is worse than not using a test at all: it provides no valuable information about the candidate while costing both the candidate and the hiring managers time and resources, so choosing a trustworthy tool made by professionals is essential. Validity is another indicator of the quality of an assessment tool; it shows whether the tool really measures what it claims to measure. For an assessment tool to produce valuable results on which decisions can be based, it has to be both reliable and valid. Choosing the right personality assessment tool can help ensure that newly hired talent will not only fulfill their responsibilities but will also fit into the team of co-workers and share the company’s values. This is vital for them to want to stay and prosper in the company, resulting in low turnover, more efficient handling of resources, and even higher sales and customer satisfaction, as these are all results of having competent and fulfilled staff.

Last Updated on October 20, 2020