Abstract
This document evaluates several arguments related to the use of science scores as a proxy measure for reading literacy in the framework of measuring and monitoring SDG 4.1.1 at a global scale. SDG indicator 4.1.1 measures the proportion of children and young people (a) in grades 2/3; (b) at the end of primary; and (c) at the end of lower secondary achieving at least a minimum proficiency level in (i) reading and (ii) mathematics, by sex. The arguments presented here are evaluated in relation to problems with the validity of these measures when scores from different subjects (e.g., literacy, science) are used interchangeably. Our evaluation focuses on three aspects: problems associated with differences in the conceptual frameworks on which the different tests are based; problems associated with the different interpretations that can be made of the scores analysed; and problems associated with the relevant differences observed when these measures are correlated with student background factors such as gender.
Multiple efforts have been made to define standards that establish the quality of educational assessments. The main standards refer to the validity, reliability and fairness of tests. The Standards for Educational and Psychological Testing defines validity as the “degree to which evidence and theory support the interpretations of test scores for proposed uses of tests” (AERA et al., 1999, 2014). This definition reminds us that one of the main aspects of validity concerns what we intend to do with the results of the tests (i.e., the scores). In that sense, the validation process involves accumulating relevant evidence to provide a scientific basis for the proposed interpretation of the scores. In other words, “a clear articulation of each intended test score interpretation for a specified use should be set forth, and appropriate validity evidence in support of each intended interpretation should be provided”. From these definitions, a series of specific considerations regarding the theoretical framework and the possible uses of scores from International Large-Scale Assessments (ILSA) to measure SDG 4.1.1 can be derived. Arguably, the most relevant are the following:
- Conceptual framework: the conceptual definition of the construct(s) the test intends to assess. The construct or constructs should be clearly described.
- Intended interpretation: the test developers' intended interpretation and use of test scores (e.g., disaggregation). A rationale should be presented for each intended interpretation of test scores for a given use.
- Correlation with relevant factors: the association of different test scores with relevant background/sociodemographic factors (e.g., gender).
| Original language | English |
|---|---|
| Place of Publication | Montreal |
| Publisher | UNESCO Institute for Statistics |
| Commissioning body | UNESCO |
| Number of pages | 10 |
| Volume | WG/GAML/6 |
| Publication status | Published - 24 Nov 2022 |
| Event | Global Alliance to Monitor Learning (GAML). Technical Cooperation Group 9th meeting - Duration: 22 Nov 2022 → 24 Nov 2022 https://tcg.uis.unesco.org/9th-meeting-of-the-tcg/ |
Publication series
| Name | Global Alliance to Monitor Learning (GAML) |
|---|
Keywords
- SDG
- SDG 4
- Measurement