Automated Scoring
What it is and why it’s a big deal

By David Williamson

Broadly speaking, automated scoring means using machines to evaluate things typically evaluated by people. The "automated" portion refers to the use of machines, typically computers, while the "scoring" portion can be broadly defined. In education, scoring usually means numeric scores, classifications or grades for academic work, but it can also include providing performance feedback: a detailed analysis of the positive and/or negative characteristics of a response. The phrase "typically evaluated by people" anchors the term to current societal expectations and the cutting edge of technology. For example, in the 1940s automated scoring referred to getting a machine to grade multiple-choice answer sheets. Today, machine scoring of multiple-choice answer sheets is passé and taken for granted, so the term refers to using computers to score more sophisticated responses such as essays, algebraic equations and spoken responses.

There are a number of terms with similar meanings that are sometimes used interchangeably with automated scoring. "AI scoring" is a common synonym, with AI referring to artificial intelligence, the field of study that supplies many of the techniques and tools used in automated scoring systems. Another common term is "automated evaluation," chosen to emphasize the use of such systems for performance feedback rather than solely for issuing a score. Other terms include automated grading, machine scoring, automated rating, machine marking and similar combinations of the "automated" and "scoring" elements.

In many domains, including educational instruction and assessment, there is a natural tension between what is fundamentally valued and what is efficient. The benefit of automated scoring is that it provides a way to do more of the things that are valued by making them more efficient. One way to illustrate this is through the example of a teacher of composition (writing). This teacher may firmly believe that the best way to become a better writer is to write and to receive constructive feedback on that writing. This feedback serves as the basis for improvement, both for that particular piece of writing and for future writing assignments. The more of this writing-feedback cycle students get, the more they can improve their writing ability.

Now assume that this teacher has five classes of students per day, with 20 students per class, for a total of 100 students. If the teacher takes a mere 15 minutes to read and provide feedback on a five-page paper, each writing assignment will take 25 hours to grade for the entire group of students. That estimate only grows if the teacher spends more time on students who need more feedback. Even though the teacher may believe that writing with informed feedback is the best way to learn, there is a limit to the number of such assignments the teacher can give while still providing informed feedback. The teacher may therefore make greater use of more efficient, but less valued, learning activities (such as multiple-choice quizzes on the principles of writing). An automated system that could provide some or all of the feedback to students on their writing would make this process more efficient, and therefore provide more opportunities to do what is valued in education.
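To make that arithmetic concrete, here is a minimal sketch in Python. The class counts and minutes-per-paper figure come from the example above; the variable names and printed summary are purely illustrative.

```python
# A minimal sketch of the grading-workload arithmetic above.
# Figures are taken from the example in the text; names are illustrative.

classes_per_day = 5
students_per_class = 20
minutes_per_paper = 15  # time to read and give feedback on one paper

total_students = classes_per_day * students_per_class  # 100 students
total_minutes = total_students * minutes_per_paper     # 1,500 minutes
total_hours = total_minutes / 60                       # 25 hours

print(f"{total_students} students x {minutes_per_paper} min/paper "
      f"= {total_hours:.0f} hours per assignment")
```

Running this prints "100 students x 15 min/paper = 25 hours per assignment," and raising minutes_per_paper for students who need more feedback only pushes the total higher.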

This example illustrates the tension between what is valued (i.e., writing practice and feedback) and what is efficient (i.e., homework that is easy to score). While the example is set in education, similar challenges exist in testing. We want tests that represent what is valued most (e.g., writing or solving algebraic equations), but for efficiency, tests may use only multiple-choice items even when a richer representation of performance would be preferred. This is because, like the grading of homework, the grading of test responses requires extensive time from experts, increasing both the cost of tests and the wait for scores. Automated scoring offers a way to have more of what is valued while still meeting expectations for efficiency, in both learning and testing.

In considering how automated scoring can shift this balance toward more of what is valued, it is important to define both what is valued and what constitutes efficiency. On the value side, there is much discussion of which parts of the human role an automated system might do well and which parts it might do less well. This is often referred to as construct representation: the extent to which the construct of interest is represented well by the score or feedback. On the efficiency side, several aspects matter, including the cost of producing scores, how quickly they are delivered and the effort involved in getting them out. There are many different perspectives on how to evaluate and differentially weight these aspects of value and efficiency. That will be the topic of a future post.
