International Education Journal


The impact of training on rater variability

Steven Barrett
University of South Australia
steven.barrett@unisa.edu.au

Abstract

In the five years from 1993 to 1998, total Commonwealth Government spending on education fell from 4.9 to 4.4 per cent of Gross Domestic Product. Australian universities have responded to this changed funding environment through the increased use of casual teaching staff. The aim of this study was to develop, implement and evaluate a short, cost-effective training package designed to improve the rating performance of casual teaching staff. The pre- and post-training performance of a group of raters was measured using the Partial Credit Model, an extension of the Rasch model.

The intervention was largely unsuccessful. The study may have identified the existence of cultural barriers to the training of academic staff, both casual and tenured. This study should be repeated using a revised method and a more extensive training procedure. The participants in the proposed follow-up study should also be interviewed to identify their views about training.

Keywords: rater training, Rasch model, Partial Credit Model

Introduction


During the 1980s and 1990s, Commonwealth Governments of both political persuasions proudly pointed to apparently increasing levels of government spending as proof of their commitment to education. However, in the five years from 1993 to 1998, total Commonwealth Government spending on education fell from 4.9 to 4.4 per cent of Gross Domestic Product. Moreover, Commonwealth Government expenditure per student in the higher education sector has been falling consistently since 1983, and the rate of decline has increased significantly since the Coalition Government took office in 1996. Australian universities have responded to the changed funding environment, inter alia, through the increased casualisation of the teaching staff. This casualisation has not gone unnoticed by Australian students. For example, a recent study (Barrett, 1999) found that students are concerned about the possible effects of casualisation on marker consistency. Two important results emerged from that study. First, it identified considerable inter-rater and intra-rater variability. Second, it demonstrated that the raters were constantly making four of the five common rating errors identified by Saal et al (1980). However, the study offered no explanation as to why the rating performance of sessional staff was significantly lower than that of tenured or contract staff.

This study aims to develop, implement and evaluate a cost-effective intervention to improve the rating performance of sessional staff. The intervention took the form of a short, half-hour training package delivered as part of the markers' meeting for a subject. The aim of the training package was to improve rater performance by reducing the incidence of the five common rater errors identified by Saal et al (1980), namely severity or leniency, the halo effect, the central tendency effect, the restriction of range effect and poor inter-rater reliability or agreement. The incidence of these rating errors can be measured in an Item Response Theory framework using the Partial Credit Model (Masters, 1982), which is an extension of the Simple Logistic Model (Rasch, 1960).

Standard setting judges


Examination marking requires raters to make complex judgments and decisions quickly in order to meet increasingly tight end of semester deadlines. On the other side of the marking equation are students, who require raters to make consistent judgments about the minimum level of competence for each grade. Consistency in examination marking can only be expected if the raters are highly knowledgeable in the domain in which ratings are made (Jaeger, 1991). Such raters are referred to as experts. However, the increased casualisation of academe means that examinations are increasingly being marked by groups of people who vary quite markedly in their level of expertise. Consequently, raters are increasingly novices, which adversely affects the consistency of ratings.

Jaeger (1991; 4) argues that experts can be described with respect to eight criteria. First, experts excel in their own domains of knowledge. Second, experts are able to perceive large meaningful patterns in their domain of experience. Third, experts are able to perform rapidly in their domain of experience. Fourth, experts see and represent a problem in their own domain at a deeper, more principled level, than novices. Fifth, experts spend time analysing a problem qualitatively. Sixth, experts have strong self-monitoring skills. Seventh, experts are more accurate than novices at judging problem difficulty. Finally, expertise lies more in elaborated semantic memory than in a general reasoning process. Most importantly, novices provide estimates of item difficulty that are incompatible with the estimates of other raters (van der Linden, 1982). The key to ensuring consistent rater performance lies in the selection process. However, in many university departments, the field of potential raters is often limited. Hence, the challenge is to assist novices to perform like experts.

The logical solution to making novices perform like experts is training. However, this may not be as easy as it first seems, as raters need to acknowledge the context in which rating occurs. In addition, marking an examination with a large number of candidates is a complex process that consists of three elements: a team of raters, interacting with a set of test items, through the use of a particular standard setting process (Plake et al 1991). An analysis of these three elements identifies a range of factors, in addition to the expert/novice dichotomy, which may affect intra-rater consistency.

The first source of intra-rater inconsistency is a range of factors that are related to the raters themselves. This is not surprising as any team of raters will differ with respect to experience, specialties and professional skill. Individual raters may also have idiosyncratic perceptions about the knowledge or skills that are required to demonstrate the minimum level of competence for a particular test item. This is more likely to be a cause of concern if the examination contains items that test a broad range of skills or knowledge. Furthermore, inconsistencies in rater performance may be exacerbated by fatigue during the rating process (Plake et al, 1991).

A second set of factors that may lead to intra-rater inconsistencies are related to the items and the examination. The perceptions of raters about the quality of items or the appropriateness of an examination may lead to more inconsistent ratings. Plake et al (1991) cites the example of raters who disagreed about the validity of an examination for certification purposes. Raters who felt that the examination was not valid were less conscientious and more prone to lapses of concentration, thereby accentuating the fatigue effect. Plake et al (1991) also argued that the rater factors and the examination factors may interact with each other to produce a third source of inconsistencies. For example, novice raters may be less consistent when marking long examinations that contain complex and demanding items.

Finally, there are a number of factors relating to the rating process itself. For example, the absence of a marking guide may be a source of inconsistency when raters are confronted with unfamiliar content. Furthermore, rater inconsistency may result if the group of raters is unable to meet and discuss the rating process beforehand.

Plake et al (1991) argue that there are five strategies that can be used to reduce rater inconsistencies.

Periodic retraining: The rating process is periodically interrupted to conduct additional group discussion. These discussions ensure that the raters maintain consistent definitions of the minimally competent candidate.

Estimations of minimally competent test performance: This involves providing raters with empirical data relating to the performance of previous candidates on similar or identical test items. Such information can range from simply providing raters with the proportion of previous candidates who passed a particular item, to providing raters with estimates of the person statistics obtained from analysing previous examinations using Item Response Theory.

Empirical data on item performance: Raters can be provided with data relating to the difficulty of individual items in an examination. Again, this information can range from pass rates on test items to estimates of item parameters obtained using Item Response Theory.

Providing descriptive data relating to the performance of raters: Raters can be provided with information about the distribution of marks for the entire rating team. This requires raters to provide information to the entire group on two occasions. After receiving the first batch of information, raters should review all of their previous ratings in light of the judgments made by the other raters. Raters could use the information shared the second time for a variety of purposes, such as to review their ratings further or as a basis for attaining a group consensus. A minimal sketch of this kind of descriptive feedback follows this list.
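As an illustration of the last of these strategies, the minimal Python sketch below computes the kind of descriptive rater feedback that could be circulated to a rating team. The marks and rater labels are invented for the example and are not data from this study.

```python
# Hypothetical sketch: descriptive feedback on rater behaviour.
from collections import defaultdict
from statistics import mean, pstdev

ratings = [
    ("Rater 1", 62), ("Rater 1", 55), ("Rater 2", 48),
    ("Rater 2", 51), ("Rater 3", 70), ("Rater 3", 66),
]  # illustrative (rater, mark) pairs only

marks_by_rater = defaultdict(list)
for rater, mark in ratings:
    marks_by_rater[rater].append(mark)

for rater, marks in sorted(marks_by_rater.items()):
    print(f"{rater}: n={len(marks)}, mean={mean(marks):.1f}, sd={pstdev(marks):.1f}")
```

Sharing summaries of this kind lets each rater see how severe or lenient they are relative to the rest of the team before reviewing their earlier ratings.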

Training is a necessary condition if rater inconsistencies are to be minimised, if not eliminated. Mills, Melican and Ahluwalia (1991) argue that the training of raters should achieve four important outcomes. First, training provides a context within which the rating process occurs. Second, training defines the tasks to be performed by the raters. Third, training minimises the effects of variables other than item difficulty on the rating process. Fourth, training develops a common definition of the minimally competent candidate. Furthermore, there are three measurable criteria that can be used to determine whether a rater is well trained (Reid, 1991). First, ratings should be stable throughout the rating process. Second, ratings should reflect the relative difficulties of the test items. Third, ratings should reflect realistic expectations of the performance of the candidates. However, the key question remains: how should raters be trained? Hambleton and Powell (1983) argue that this is a difficult question to answer because training procedures are poorly documented in most reports of standard setting studies. Nevertheless, this brief review of the literature provided a framework within which the intervention at the centre of the present study was developed.

The Partial Credit Model


The examination results that form the basis of this study were analysed using the computer program Conquest (Adams and Khoo, 1993). This program fits item response and latent regression models to data obtained from both dichotomously scored and polychotomously scored tests (Wu, Adams and Wilson, 1998). The data were analysed using the Partial Credit Model (Masters, 1982), which is an extension of the Simple Logistic Model (Rasch, 1960). The Simple Logistic Model is only appropriate where items are dichotomously scored, such as in true/false or multiple-choice tests. The Partial Credit Model, by contrast, facilitates the analysis of cognitive or attitudinal items that have two or more levels of response. The levels of response have to be ordered, but they do not have to be on a specified scale. Hence, the Partial Credit Model is ideal for analysing the effects of student ability and item difficulty on the performance of students answering extended response questions. Moreover, the Partial Credit Model converts the ordered category scores to interval scaled scores.

Rasch (1960) developed a latent trait model for dichotomously scored items. All statistical models that are used to operationalise Item Response Theory specify a relationship between the observed performance of examinees on a test and unobservable or latent traits that are assumed to underlie the observed performance. This relationship is the item characteristic curve (Hambleton 1989). When test data fit the Rasch model, the requirements that underlie Item Response Theory have been met. The Rasch model produces item-free estimates of student ability or performance and sample-free or person-free estimates of the item parameters. That is, the Rasch model is independent of both the items on the test and the sample of people to whom the test is administered. Moreover, the Rasch model can be used to equate readily the performance of different students answering different items on a test, which replaces the concept of parallel test forms that characterises classical test theory.

The Rasch model (Rasch, 1960) estimates the probability of an examinee gaining a correct answer to a dichotomously scored item as an exponential function of the difference between the ability of the person and the difficulty of the item. The Simple Logistic Model can be expressed as:

$$\pi_{ni1} = \frac{\exp(\beta_n - \delta_i)}{1 + \exp(\beta_n - \delta_i)}$$

where $\pi_{ni1}$ is the probability for person n of success on item i, $\beta_n$ is the ability of person n, $\delta_i$ is the difficulty of item i, and $\pi_{ni0} = 1 - \pi_{ni1}$ is the probability of an incorrect answer on item i.

This is the only latent trait model for dichotomously scored responses for which the number of successes, $r_n$, is a sufficient statistic for the person parameter (Masters, 1982; 152).
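The expression above can be computed directly. The following Python sketch (the parameter values are illustrative only and do not come from the study) evaluates the Simple Logistic Model probability for a given person ability and item difficulty.

```python
import math

def rasch_probability(beta_n: float, delta_i: float) -> float:
    """Probability that person n answers dichotomous item i correctly
    under the Simple Logistic Model:
    exp(beta_n - delta_i) / (1 + exp(beta_n - delta_i))."""
    return math.exp(beta_n - delta_i) / (1.0 + math.exp(beta_n - delta_i))

# A person 0.5 logits more able than the item is difficult:
print(rasch_probability(beta_n=1.0, delta_i=0.5))  # approximately 0.62
```

When $\beta_n = \delta_i$ the probability of success is 0.5, and the probability approaches 1 as ability increasingly exceeds item difficulty.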

The general applicability of the Simple Logistic Model (Rasch, 1960) is greatly reduced as not all test data are dichotomously scored. Masters (1982) argues that there are four other observation formats that record ordered levels of responses.

Repeated trials: The data are obtained from a fixed number of independent attempts at each item on a test.

Counts: There is no upper limit to the number of independent successes or failures a person can make on an item.

Rating scales: Respondents are presented with a fixed set of ordered response alternatives that are used with every item.

Partial credit: Data are obtained from a test that requires the prior identification of several ordered levels of performance on each item and where partial credit is awarded for partial success on items.

The Partial Credit Model developed by Masters (1982) is an extension of the Simple Logistic Model, which overcomes this substantial shortcoming. The model was developed by estimating parameters for the difficulties associated with a series of performance levels within each item. Masters (1982) argues that the difficulty of the kth level in an item governs the probability of responding in category k rather than in category k - 1. The probability of person n scoring x on item i is specified by Masters (1982; 158) as:

$$\pi_{nix} = \frac{\exp \sum_{j=0}^{x} (\beta_n - \delta_{ij})}{\sum_{k=0}^{m_i} \exp \sum_{j=0}^{k} (\beta_n - \delta_{ij})}, \qquad x = 0, 1, \ldots, m_i$$

where, for notational convenience, $\sum_{j=0}^{0} (\beta_n - \delta_{ij}) \equiv 0$.

The model estimates the probability of person n scoring x on the $m_i$ performance levels of item i as a function of the ability of the person, $\beta_n$, on the variable being measured and the difficulties, $\delta_{ij}$, of the $m_i$ levels in item i. The observation x is a count of the successfully completed item levels, and only the difficulties of these completed levels appear in the numerator of the model. The model provides estimates of person ability, $\beta_n$, and level difficulty, $\delta_{ij}$.
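As a minimal sketch of the parameterisation given above (the ability and threshold values are invented for illustration), the category probabilities of the Partial Credit Model can be computed as follows.

```python
import math
from typing import List

def pcm_probabilities(beta_n: float, deltas_i: List[float]) -> List[float]:
    """Partial Credit Model category probabilities for an item with
    level difficulties deltas_i = [delta_i1, ..., delta_im].
    Returns the probabilities of scoring 0, 1, ..., m_i (Masters, 1982)."""
    # Cumulative sums of (beta_n - delta_ij); the empty sum for x = 0 is 0.
    cumulative = [0.0]
    for delta_ij in deltas_i:
        cumulative.append(cumulative[-1] + (beta_n - delta_ij))
    numerators = [math.exp(value) for value in cumulative]
    denominator = sum(numerators)
    return [numerator / denominator for numerator in numerators]

# An item with three ordered levels (illustrative values only):
probabilities = pcm_probabilities(beta_n=0.2, deltas_i=[-0.5, 0.1, 0.8])
print([round(p, 3) for p in probabilities])  # the probabilities sum to 1.0
```

The probabilities across the $m_i + 1$ score categories sum to one, and increasing $\beta_n$ shifts probability towards the higher categories.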

Methods


 Subjects

The Division of Business and Enterprise at the University of South Australia requires all undergraduate students to take a "core" of eight subjects. One of these eight subjects is Economic Environment, a principles of macroeconomics subject. Approximately 1,200 students commenced this subject in Semester 1, 1999, of whom 810 sat the final examination. Of these students, 100 students went on to complete Business Economics, a principles of microeconomics subject. Despite the obvious difference in content between these two subjects, the students were taught and assessed by the same group of staff. Hence, this study was a comparison of the rating performance of those staff members who marked both the Economic Environment Semester 1, 1999 and Business Economics, Semester 2, 1999 final examinations. Student performance on the Semester 1 examination was assessed by eight markers, three of whom were employed to mark the Semester 2 examination. The Semester 2 markers are an interesting group of three people as they include the subject convener and two sessional tutors.

The script books in this study were randomly allocated to raters, who marked all items on the paper. No crossover occurs when raters mark items that are not marked by other raters or when raters only mark the work of their own students. By contrast, crossover between items, students and raters is maximised when raters mark a random sample of all papers and mark all items. Maximised crossover ensures that the Partial Credit Model fully separates the rater, student and item effects (Barrett, 1999).
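The allocation principle described above can be sketched as follows. The function name, rater labels and script numbers are illustrative only and are not part of the study's actual procedure.

```python
# Minimal sketch: whole script books are shuffled and dealt to raters in
# rotation, and every rater marks every item on the scripts they receive.
import random

def allocate_scripts(script_ids, raters, seed=1):
    """Randomly allocate whole script books to raters."""
    rng = random.Random(seed)
    shuffled = list(script_ids)
    rng.shuffle(shuffled)
    allocation = {rater: [] for rater in raters}
    for index, script in enumerate(shuffled):
        allocation[raters[index % len(raters)]].append(script)
    return allocation

allocation = allocate_scripts(range(1, 101), ["Rater 1", "Rater 2", "Rater 3"])
print({rater: len(scripts) for rater, scripts in allocation.items()})
```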

 The intervention

The aim of this study was to develop, implement and evaluate a short training package to improve the rating performance of sessional staff prior to the marking of the Semester 2 examination. Markers' meetings in the Division of Business and Enterprise tend to be rather brief and informal affairs. The main items under consideration are the distribution of script books, a brief discussion of the marking guide and the establishment of deadlines. The intervention that was evaluated in this study was a 30 minute training session conducted as part of the markers' meeting for Business Economics. The training package comprised three parts, which addressed four of the five strategies for improving intra-judge consistency reported by Plake et al (1991).

The first component of the training package was a presentation by the author to the raters about the nature of the five common rater errors (Saal et al, 1980). The aim was to sensitise the raters to the types of errors they were committing.

The second component was a discussion of the performance of the people who marked the 1998 Business Economics and the Semester 1, 1999 Economic Environment examinations. This discussion was based on the results of the Partial Credit Model analysis of the marking of these examinations, and introduced the three raters to the concept that student performance is the outcome of complex interactions between student ability, rater performance and item difficulty, which could be separated from each other using Item Response Theory. This phase of the training package concluded with a discussion of the performance of each rater during the Semester 1 examination for Economic Environment.

The third component was a new style of marking guide that was developed in conjunction with the subject convener. Previous marking guides tended to focus on content with marks being awarded for particular points. Such marking guides did not reward answers that were qualitatively better than others. They also penalised candidates who took a different approach to answering questions. Consequently, the subject convener developed a marking guide that outlined the minimum level of achievement for the grades of pass, credit and distinction.

 

 Results


The aim of this study was to evaluate the effectiveness of a training package designed to reduce the incidence of the five common rater errors identified by Saal et al (1980), namely (a) leniency or severity, (b) the halo effect for a person or an item, (c) the central tendency effect, (d) restriction of range and (e) inter-rater reliability or agreement. Figure 1 summarises the rating performances of the eight raters who marked the Semester 1 examination for Economic Environment. The figure clearly shows two groups of raters. Raters 2, 5, 7 and 8 were more severe than Raters 1, 3, 4 and 6. The severe raters tended to be more experienced university teachers. This group included the convener of Economic Environment, the convener of Business Economics and two highly experienced sessional staff who have previously held academic appointments. Conversely, Raters 1, 3 and 4 were relatively inexperienced sessional staff who had only recently completed their honours degrees. Paradoxically, Rater 6 was a long serving tenured member of staff who had previously been the convener of both Economic Environment and Business Economics.

The item estimates shown in Figure 1 indicate that the three essay questions on the Economic Environment paper were all of approximately the same level of difficulty. This is quite unusual, as examinations tend to contain items that vary in difficulty. The absence of variation in item difficulty is reflected in the estimates of the rater*item interaction. However, the disturbing point shown in Figure 1 is that these items are too difficult for the majority of students. Figure 2 summarises the rating performances of the three raters who marked the Semester 2 examination for Business Economics. The variations in the item estimates shown in Figure 2 are more typical of an essay style examination. Furthermore, the figure also shows that the difficulty of these four items is more appropriate for this group of students. The vertical scale of Figures 1 and 2 is an interval scale, the units of which are logits. Some parameters could not be shown on these figures. In Figure 1 each "x" represents 17.6 students and in Figure 2 each "x" represents 2.5 students.

The rater estimates reported in Table 1 and shown in the rater column of Figure 1 indicate that, on average, Rater 2 was a harder marker than both sessional markers before the training was undertaken. However, the rater by item estimates reported in Table 2 show that there is some variation in the severity of rating for individual items. Both sessional staff (Raters 1 and 3) marked Item 1 harder than Rater 2, while Rater 1 was the hardest marker for Item 3. The post-training estimates of rater severity reported in Table 1 show that, on average, the sessional markers were more severe than the subject convener. This paradox may have been the result of the two sessional raters experiencing some performance anxiety.

Evidence of the extent of inter-rater reliability or agreement between the raters is best obtained by inspecting the rater*item columns of Figures 1 and 2. These figures are produced from the tables of rater*item estimates provided by Conquest. This type of error would be absent if each rater had correctly estimated the difficulty of each item. In Figure 1, the range of item difficulties for the Economic Environment examination is only 0.018 logits, but the range of rater*item estimates is 0.124, which is clearly greater. This increase is largely due to Rater 8 marking Item 1 as if it were a much harder item, while marking Items 2 and 3 as if they were much easier items. Rater 5 also marked Item 2 as if it were much more difficult. Figure 1 therefore suggests that, with the exception of Rater 8, and to some extent Rater 5, there was strong inter-rater reliability or agreement, that is consistency, between the ratings of the group of people who marked the Economic Environment examination.
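The comparison being made here is simply between the spread of the item estimates and the spread of the rater*item estimates. The short sketch below uses invented placeholder values, not the estimates reported in Tables 1 and 2.

```python
# Hedged sketch: if every rater estimated item difficulty correctly, the
# spread of the rater*item estimates would be no wider than the spread of
# the item estimates themselves.
def logit_range(estimates):
    return max(estimates) - min(estimates)

item_estimates = [0.35, 0.42, 0.38]            # illustrative item difficulties (logits)
rater_item_estimates = [0.21, 0.55, 0.33,
                        0.47, 0.29, 0.61]      # illustrative rater*item estimates

print(f"item range:       {logit_range(item_estimates):.3f} logits")
print(f"rater*item range: {logit_range(rater_item_estimates):.3f} logits")
# A markedly wider rater*item range points to weaker inter-rater agreement.
```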

 
Figure 1: Economic Environment Semester 1, 1999, Map of Latent Distributions and Response Model Parameter Estimates.


Figure 2: Business Economics Semester 2, 1999, Map of Latent Distributions and Response Model Parameter Estimates.

Table 1: Estimates of Rater Parameters

An asterisk next to a parameter estimate indicates that it is constrained.

Table 2: Estimates of Rater by Item Parameters

An asterisk next to a parameter estimate indicates that it is constrained.

Figure 2 demonstrates that the broad inter-rater agreement or reliability that previously existed is largely absent after the training. The increased variation in the difficulty of the items on the Business Economics examination appears to have been translated into a greater dispersion of the rater*item estimates. The variation in item difficulty is 0.323 logits, whereas the range of the rater*item estimates is 0.473. This reduction in inter-rater reliability or agreement stems from the inability of the raters to estimate accurately the difficulty of these items. For example, Rater 1 correctly estimated the difficulty of Items 1 and 3 on the Economic Environment examination and marked them accordingly, but she reversed the order of the hardest and easiest items on the Business Economics examination. That is, she marked Item 4 as if it were the easiest (not the hardest) item and marked Item 2 as if it were the hardest (not the easiest) item. These observations suggest that the items on an examination paper should all be of the same level of difficulty in order to achieve the highest possible level of inter-rater reliability or agreement when the rating team includes people who are not experts.

Conclusions


 Since the mid-1990s, the real level of Commonwealth funding for university places has been falling. The university sector has responded to these cuts in a variety of ways. Increased employment of inexperienced sessional staff has been a fairly universal response by universities. Students are concerned that increased casualisation has led to a reduction in marker consistency. The aim of this study was to develop, implement and evaluate a cost-effective training package designed to reduce the incidence of several common rater errors. The study identified the widespread presence of only two rater errors, a marked variation in rater severity or leniency and a lack of inter-rater reliability or agreement. The lack of agreement between the raters lends support to student concerns about the lack of consistency between markers. Furthermore, it would appear that this training package did little, if anything, to improve the performance of the sessional staff raters. The only area of improvement observed was that sessional staff members were rating more severely than the subject convener after the training, rather than being more lenient as was the case before the intervention. The apparent lack of success of this study may be explained in terms of the attitudes of academics to training and shortcomings with the study design.

The issue of training academics to make careful and consistent ratings has received very little attention in the past. Indeed, markers in universities do not expect such training to be incorporated into a rating exercise, such as marking a final examination. Hence, the first obstacle is that there are considerable cultural barriers to overcome before academics are likely to accept the need for training as part of a standard setting exercise. Second, it is not clearly understood how the training of academics should be undertaken, as the extensive literature on the topic relates primarily to school teachers. Nevertheless, it is clear that the 30 minutes allowed for this training package was inadequate. Furthermore, the training may have produced some performance anxiety on the part of the subjects, which might explain the paradoxical increase in severity. Third, suggested activities for a training program for academics include a detailed scoring breakdown for each item, systematic cross checking of rater performances and detailed discussions between raters of their expectations for each item. However, it is not possible to undertake such activities when there are large numbers of students, large numbers of raters and tight deadlines. The training of raters is important. Unfortunately, Australian universities are spending no money on reviewing the critical process of evaluating marking procedures. Clearly, more than the training of a small number of disparate groups of raters is required: the culture of the higher education sector needs to be changed.

A second reason why this study did not achieve its goals may be shortcomings in its design. The study analysed the performance of both a cohort of students and a small group of raters over an academic year. It was decided that this design would eliminate the confounding effects that are generated from studying two different groups of raters or students. However, the small number of items marked by each rater may have produced inaccurate estimates of the item and rater parameters. The study could instead have compared the performance of the larger group of raters who marked the Economic Environment examinations at the end of Semester 1 and again at the end of Semester 2, in which case the sample sizes would have been about 850 and 400 respectively. Furthermore, the number of raters being evaluated would rise to eight, which would reduce the number of parameter estimates that were constrained and hence greatly reduce the amount of missing data. Clearly, in fairness to students, much more work should be undertaken to examine the processes used for marking in universities and to improve marker performance through rigorous and informed training of markers.

 

References


Adams, R.J. and Khoo, S-T. (1993) Conquest: The Interactive Test Analysis System, ACER Press, Hawthorn.

Barrett, S.R.F. (1999) Question choice and Marker Variability: Insights From Item Response Theory, Unfolding Landscapes in Engineering Education, Proceedings 11th Australasian Conference on Engineering Education, pp. 240-245, University of South Australia, September 1999.

Engelhard, G.Jr (1994) Examining Rater Error in the Assessment of Written Composition With a Many-Faceted Rasch Model, Journal of Educational Measurement, 31(2), 179-196.

Engelhard, G.Jr and Stone, G.E. (1998) Evaluating the Quality of Ratings Obtained From Standard-Setting Judges, Educational and Psychological Measurement, 58(2), 179-196.

Hambleton, R.K. (1989) Principles of Selected Applications of Item Response Theory, in R. Linn, (ed.) Educational Measurement, 3rd ed., MacMillan, New York, 147-200.

Jaeger, R.M. (1991) Selection of Judges for Standard-Setting, Educational Measurement: Issues and Practice, 10(2), 3-10.

Keeves, J.P. and Alagumalai, S. (1999) New Approaches to Research, in G.N. Masters and J.P. Keeves, Advances in Educational Measurement, Research and Assessment, 23-42, Pergamon, Amsterdam.

Masters, G.N. (1982) A Rasch Model for Partial Credit Scoring, Psychometrika, 47, 149-174.

Mills, C.N., Melican, G.J. and Ahluwalia, N.T. (1991) Defining Minimal Competence, Educational Measurement: Issues and Practice, 10(2), 7-14.

Plake, B.S., Melican, G.J. and Mills, C.N. (1991) Factors Influencing Intrajudge Consistency During Standard-Setting, Educational Measurement: Issues and Practice, 10(2), 15-26.

Rasch, G. (1960) Probabilistic Models for Some Intelligence and Attainment Tests, University of Chicago Press, Chicago.

Reid, J.B. (1991) Training Judges to Generate Standard-Setting Data, Educational Measurement: Issues and Practice, 10(2), 11-14.

Saal, F.E., Downey, R.G. and Lahey, M.A. (1980) Rating the Ratings: Assessing the Psychometric Quality of Rating Data, Psychological Bulletin, 88(2), 413-428.

van der Linden, W.J. (1982) A Latent Trait Method for Determining Intrajudge Inconsistency in the Angoff and Nedelsky Techniques of Standard Setting, Journal of Educational Measurement, 19(4), 295-308.

Wu, M.L., Adams, R.J. and Wilson, M.R. (1998) ACER Conquest: Generalised Item Response Modelling Software, ACER Press, Hawthorn.

   


 Barrett, S. (2001) The impact of training on rater variability. International Education Journal, 2 (1), 49-58 [Online] http://iej.cjb.net




All text and graphics © 1999-2001 Shannon Research Press