Reliability and validity

The reliability of an assessment tool is the extent to which it measures learning consistently.

The validity of an assessment tool is the extent to which it measures what it was designed to measure.


Reliability

The reliability of an assessment tool is the extent to which it consistently and accurately measures learning.

When the results of an assessment are reliable, we can be confident that repeated or equivalent assessments will provide consistent results. This puts us in a better position to make generalised statements about a student’s level of achievement, which is especially important when we are using the results of an assessment to make decisions about teaching and learning, or when we are reporting back to students and their parents or caregivers. No results, however, can be completely reliable. There is always some random variation that may affect the assessment, so educators should always be prepared to question results.

Factors that can affect reliability:

  • The length of the assessment – a longer assessment generally produces more reliable results.
  • The suitability of the questions or tasks for the students being assessed.
  • The phrasing and terminology of the questions.
  • The consistency in test administration – for example, the length of time given for the assessment, instructions given to students before the test.
  • The design of the marking schedule and moderation of marking procedures.
  • The readiness of students for the assessment – for example, a hot afternoon or straight after physical activity might not be the best time for students to be assessed.

How to be sure that a formal assessment tool is reliable

Check the user manual for evidence of the reliability coefficient. Reliability coefficients range from 0 to 1; a coefficient of 0.9 or more indicates a high degree of reliability.
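In practice the manual reports this coefficient for you, but it can help to see where such a number comes from. One common reliability coefficient is Cronbach's alpha, which compares the variance of individual items with the variance of students' total scores. Below is a minimal sketch with entirely made-up scores for four students on a three-item assessment (the data and function name are illustrative, not from any real tool):

```python
from statistics import pvariance

def cronbach_alpha(scores):
    """Cronbach's alpha for a list of per-student item-score lists."""
    k = len(scores[0])                      # number of items
    items = list(zip(*scores))              # one column of scores per item
    item_var = sum(pvariance(col) for col in items)
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - item_var / total_var)

# Hypothetical scores: 4 students x 3 items
scores = [
    [3, 3, 3],
    [2, 2, 1],
    [3, 2, 3],
    [1, 1, 1],
]
print(round(cronbach_alpha(scores), 2))  # → 0.92 (a high degree of reliability)
```

When the items rise and fall together across students, as here, alpha is high; inconsistent items drag it down.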

Assessment tool manuals contain comprehensive administration guidelines. It is essential to read the manual thoroughly before conducting the assessment.


Validity

Educational assessment should always have a clear purpose. Nothing will be gained from assessment unless the assessment has some validity for the purpose. For that reason, validity is the most important single attribute of a good test.

The validity of an assessment tool is the extent to which it measures what it was designed to measure, without contamination from other characteristics. For example, a test of reading comprehension should not require mathematical ability.

There are several different types of validity:

  • Face validity: do the assessment items appear to be appropriate?
  • Content validity: does the assessment content cover what you want to assess?
  • Criterion-related validity: how well do the results correlate with an external criterion, such as an established assessment of the same skill or later performance?
  • Construct validity: are you measuring what you think you're measuring?
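Criterion-related validity is typically quantified as a correlation between scores on the assessment and scores on the external criterion measure. The sketch below computes Pearson's r for made-up paired scores (both score lists and the function name are hypothetical):

```python
from statistics import fmean, pstdev

def pearson_r(xs, ys):
    """Pearson correlation between two paired score lists."""
    mx, my = fmean(xs), fmean(ys)
    cov = fmean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return cov / (pstdev(xs) * pstdev(ys))

# Hypothetical data: a new test vs an established benchmark assessment
new_test = [12, 15, 9, 20, 17]
benchmark = [14, 16, 10, 21, 18]
print(round(pearson_r(new_test, benchmark), 2))  # → 0.99
```

A correlation near 1 suggests the new assessment ranks students much as the criterion measure does; a low correlation would cast doubt on its criterion-related validity.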

It is fairly obvious that a valid assessment should have good coverage of the criteria (concepts, skills, and knowledge) relevant to the purpose of the assessment. The important notion here is purpose. For example:

  • The PROBE test is a form of reading running record which measures reading behaviours and includes some comprehension questions. It allows teachers to see the reading strategies that students are using, and potential problems with decoding. The test would not, however, provide in-depth information about a student’s comprehension strategies across a range of texts.
  • STAR (Supplementary Test of Achievement in Reading) is not designed as a comprehensive test of reading ability. It focuses on assessing students’ vocabulary understanding, basic sentence comprehension, and paragraph comprehension. It is most appropriately used for students who don’t score well on more general testing (such as PAT or e-asTTle), as it provides a more fine-grained analysis of basic comprehension strategies.

There is an important relationship between reliability and validity. An assessment that has very low reliability will also have low validity; clearly a measurement with very poor accuracy or consistency is unlikely to be fit for its purpose. But, by the same token, the things required to achieve a very high degree of reliability can impact negatively on validity. For example, consistency in assessment conditions leads to greater reliability because it reduces 'noise' (variability) in the results. On the other hand, one of the things that can improve validity is flexibility in assessment tasks and conditions. Such flexibility allows assessment to be set appropriate to the learning context and to be made relevant to particular groups of students. Insisting on highly consistent assessment conditions to attain high reliability will result in little flexibility, and might therefore limit validity.

The Overall Teacher Judgment balances these considerations, weighing the reliability of a formal assessment tool against the flexibility to use other evidence in making a judgment.

Further reading

NZCER has kindly allowed us to use two valuable articles by Charles Darr from SET magazine: Set 2, 2005 and Set 3, 2005.