Errors: What are they and how significant are they?
Errors in educational research and measurement arise from four main sources:
In addition, since it is rarely possible to take measurements of a complete population there are
Sampling errors are of two types. First, there are errors arising from a sample not being fully representative of the population from which it is drawn: sample bias. Secondly, there are errors that arise from the variability among the cases included in the sample, which can be estimated from information on the variability between cases and the number of cases: standard errors of sampling. Estimates of the standard error of sampling permit inferences to be drawn about the range of a characteristic in the population.
The word "error" has many meanings. The most common meaning is concerned with the idea of a "mistake" which does not apply in this context. A further meaning is concerned with "the difference between an observed or estimated numerical result and the true or exact one". However, in educational research the "true value" is both unknown and unknowable and this meaning does not apply. In statistical work the term "error" simply means "the action of wandering", since the observed values are dispersed about a central value and are assumed to be as likely to be greater than this central value as they are to be less than the central value. This "errant" or "wandering" nature of observations applies to all four types of error considered above. However, it does not apply to sample bias.
In order to make some allowance for sample bias, prior knowledge is needed. The making of statistical estimates that are based on prior knowledge lies in the domain of Bayesian statistics. Bayesian procedures were employed in statistical estimation in the Australian Studies of School Performance in 1976 and 1981 (Morgan, 1978), but have not been used since in other Australian studies of student achievement. The examination of error in educational research, when Bayesian procedures are not employed, is built around the idea of the importance of findings; namely, pattern of results, size of effect and statistical significance.
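The standard error of sampling described earlier, estimated from the variability between cases and the number of cases, can be illustrated with a minimal sketch. It assumes a simple random sample, which samples of students clustered within schools typically do not satisfy, so it should be read as a lower bound rather than as the procedure used in any of the studies discussed. The scores and function names are invented for illustration.

```python
import math
import statistics


def standard_error_of_mean(scores):
    """Standard error of the mean under simple random sampling: s / sqrt(n)."""
    n = len(scores)
    s = statistics.stdev(scores)  # sample standard deviation (n - 1 denominator)
    return s / math.sqrt(n)


def approx_95_interval(scores):
    """Approximate 95% interval for the population mean (normal approximation)."""
    mean = statistics.fmean(scores)
    se = standard_error_of_mean(scores)
    return mean - 1.96 * se, mean + 1.96 * se


# Hypothetical scaled scores for a small sample of students (assumption).
sample = [48.0, 52.0, 50.0, 47.0, 53.0, 51.0, 49.0, 50.0]

se = standard_error_of_mean(sample)
low, high = approx_95_interval(sample)
print(f"mean = {statistics.fmean(sample):.2f}, SE = {se:.2f}, "
      f"95% interval = ({low:.2f}, {high:.2f})")
```

The interval narrows as the number of cases grows and widens as the variability between cases increases, which is the sense in which the standard error permits inferences about the population.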
This paper is concerned with the examination of errors in several recent Australian research studies.
Statistical Significance
The initial problem to be considered in the examination of data collected in the Basic Skills Testing Program and through the Course Experience Questionnaire, where an attempt is made to involve all members of the target population, is whether a sample survey or a census has been undertaken. Inevitably, there are losses at both the institution level and the student-within-institution level. An initial question that must be asked is whether the losses introduce bias, and how the extent of bias could be assessed. Further questions must be addressed.
Errors of Measurement
For 20 years the ACER has gradually advanced the use of Rasch measurement in school systems across Australia, in spite of marked opposition led by academics from the University of London and some statisticians in the United States. The major advantages of Rasch measurement procedures are as follows. First, provided that a test can be considered unidimensional, and the items within it satisfy the requirement of unidimensionality, estimates of performance on the scale are independent of the items employed. Secondly, the estimates of performance are also independent of the persons employed in the calibration of the scale. As a consequence of these properties, tests employed at different grade levels and on different occasions can readily be equated on a single common scale.
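The independence property can be made concrete with the dichotomous Rasch model, in which the probability of a correct response depends only on the difference between person ability and item difficulty, both expressed in logits. The sketch below uses invented abilities and difficulties (assumptions, not values from any of the testing programs) and shows that the log-odds difference between two persons is the same whichever item is used to compare them.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def log_odds(p):
    """Natural log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))


# Hypothetical person abilities and item difficulties, in logits (assumptions).
person_a, person_b = 1.2, -0.3
item_difficulties = [-1.5, 0.0, 0.8, 2.0]

differences = []
for d in item_difficulties:
    diff = (log_odds(rasch_probability(person_a, d))
            - log_odds(rasch_probability(person_b, d)))
    differences.append(diff)
    print(f"item difficulty {d:+.1f}: log-odds difference = {diff:.3f}")
```

Every item yields the same comparison between the two persons, namely the difference in their abilities. This holds only to the extent that the data fit the model, which is why the requirement of unidimensionality matters.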
Errors in Rasch Scaled Scores
Table 6 records the errors of measurement for persons on the 1999 Basic Skills Tests of Literacy and Numeracy at selected score levels on the scales of measurement developed, which are expressed in logits.
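Why the error of measurement differs between score levels can be sketched with the usual asymptotic standard error of a Rasch person measure, the reciprocal of the square root of the test information. The test below is hypothetical (41 items with evenly spaced difficulties) and is not a reconstruction of the Basic Skills Tests or of the values in Table 6.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def person_standard_error(ability, difficulties):
    """Asymptotic standard error of a person measure: 1 / sqrt(test information)."""
    information = sum(
        p * (1.0 - p)
        for p in (rasch_probability(ability, d) for d in difficulties)
    )
    return 1.0 / math.sqrt(information)


# Hypothetical test: 41 items with difficulties from -2 to +2 logits (assumption).
difficulties = [-2.0 + 0.1 * i for i in range(41)]

errors = {
    theta: person_standard_error(theta, difficulties)
    for theta in (-4.0, -2.0, 0.0, 2.0, 4.0)
}
for theta, se in errors.items():
    print(f"ability {theta:+.1f} logits: standard error = {se:.2f} logits")
```

Under these assumptions the error is smallest where the items are well targeted on the person and grows towards the extremes of the scale, where few items provide information.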
Equating Errors
The tests administered each year are converted to measures on a standard scale that was constructed several years earlier. Each year the test for that year is equated with the tests employed in earlier years and with the scale developed from those tests. Consideration needs to be given to estimating the equating errors introduced. The magnitude of the equating errors will depend on the equating procedure employed (concurrent equating, anchor item equating, or common item difference equating) and on the sizes of the errors associated with the calibration of the separate tests being equated. It would seem that the magnitude of the errors of equating should also depend on the numbers of common items and common persons involved in the test equating operation, as well as on the complexity of structure of the data for both items and persons.
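One simple reading of common item difference equating can be sketched as follows. The equating constant is taken as the mean difference between the difficulties of the common items in two separate calibrations, and its standard error as the standard deviation of those differences divided by the square root of the number of common items. This is an illustrative assumption rather than the procedure used in any particular program, and it ignores the calibration errors of the individual items and the number of common persons. All values are invented.

```python
import math
import statistics


def common_item_shift(old_difficulties, new_difficulties):
    """Return (shift, standard_error) for placing the new test on the old scale.

    Both lists hold difficulties of the same common items, in the same order.
    """
    if len(old_difficulties) != len(new_difficulties):
        raise ValueError("common items must be paired")
    differences = [old - new for old, new in zip(old_difficulties, new_difficulties)]
    shift = statistics.fmean(differences)
    se = statistics.stdev(differences) / math.sqrt(len(differences))
    return shift, se


# Hypothetical difficulties (logits) of five common items in two calibrations.
old_scale = [-1.10, -0.40, 0.05, 0.60, 1.35]
new_scale = [-1.45, -0.70, -0.35, 0.35, 0.90]

shift, se = common_item_shift(old_scale, new_scale)
print(f"equating constant = {shift:.3f} logits, standard error = {se:.3f} logits")
```

With more common items, or with common items whose difficulties shift more consistently between calibrations, the estimated equating error in this sketch becomes smaller.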
Calibration Errors
The test of fit applied to the data for a particular item, in order to assess whether the item satisfies the requirement of unidimensionality, is a relative test and not an absolute one. Moreover, a balance must be achieved between the bandwidth of a range of items, each assessing slightly different aspects of the characteristic under examination, and the fidelity of the items with respect to the unidimensional scale (see Cronbach, 1960). As a consequence, it is sometimes necessary to eliminate from a test those items that have apparently high fidelity but assess over too narrow a bandwidth, and thus supply redundant information, as well as items that have too low fidelity and are spread over too wide a bandwidth. Different computer programs for Rasch scaling use different indices for testing item fit, and these indices give very different results for items of marginal fit. Moreover, the effects of sample structure on indices of item fit need further investigation, since only one study has been carried out in Australia (Farish, 1984).
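How two fit indices can disagree about the same item is sketched below using the outfit and infit mean squares, two widely reported Rasch fit statistics. Whether these are the indices the programs referred to above actually report is not stated in the text, so the choice is an assumption, as are the data.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def item_fit(responses, abilities, difficulty):
    """Return (outfit, infit) mean squares for one item.

    responses: 0/1 scores of each person on the item
    abilities: person measures in logits, in the same order
    """
    squared_residuals = []
    variances = []
    for x, theta in zip(responses, abilities):
        p = rasch_probability(theta, difficulty)
        squared_residuals.append((x - p) ** 2)
        variances.append(p * (1.0 - p))
    # Outfit: unweighted mean of squared standardized residuals.
    outfit = sum(r / v for r, v in zip(squared_residuals, variances)) / len(responses)
    # Infit: information-weighted, so off-target persons count for less.
    infit = sum(squared_residuals) / sum(variances)
    return outfit, infit


# Hypothetical data: six persons, one item of difficulty 0 logits (assumption).
abilities = [-2.5, -1.0, -0.5, 0.5, 1.0, 2.5]
expected_pattern = [0, 0, 0, 1, 1, 1]
one_surprise = [1, 0, 0, 1, 1, 1]  # the least able person succeeds unexpectedly

for label, responses in (("expected", expected_pattern), ("one surprise", one_surprise)):
    outfit, infit = item_fit(responses, abilities, 0.0)
    print(f"{label}: outfit = {outfit:.2f}, infit = {infit:.2f}")
```

A single unexpected response from a person far from the item's difficulty inflates the outfit index much more than the infit index, so an item of marginal fit may be flagged by one criterion and accepted by the other.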
Conclusion
While this paper has addressed issues concerned with errors of measurement and sampling in testing programs, it is necessary to recognize the marked contribution that advances in educational measurement have made and are making to education, particularly in Australian schools. The emphasis has turned during recent decades from innovation and the liberation of teachers from the dominating effects of public examinations to the advancement of student learning and development, in which growth in student performance is recorded on profiles, each with a scale of performance. These scales of performance relate not only to the areas of the curriculum and the strands within those areas, but also to the instruction and testing provided across the grades of schooling and the learning outcomes attained by individual students. Underlying these scales of performance is the principle of strong and meaningful measurement. As a consequence it is becoming possible for teachers and school principals to report to students, their parents, the school community and the community at large the extent of learning that is taking place in schools. Of particular importance, however, is the information given to individual students in order to show them in clearly identifiable terms the growth that they have achieved over a particular period of time as a result of their efforts in the classroom and at school. It is also important to recognize that the scales of learning extend across all levels of schooling and beyond. Learning does not end at the completion of schooling or tertiary education, but extends throughout life as opportunities for lifelong learning and development are followed.
References
Adams, R. J. and Khoo, S. T. (1993). QUEST: The Interactive Test Analysis System. Melbourne: ACER.
Keeves, J. P., Johnson, T. G. and Afrassa, T. M. (2000). Errors: What are they and how significant are they? International Education Journal, 1(3), 164-180. [Online] http://iej.cjb.net