FAQ - Test Scores 

During field testing, about half a million responses generated by more than 10,000 native and non-native speakers with 126 different home languages were gathered. These responses are scored by a pool of human raters on a number of traits (content, vocabulary, fluency, pronunciation). The human scores on a large subset of the responses serve to train the automatic scoring system. The system builds acoustic models and language models for the responses and compares the actual responses with the human scores they receive. Based on this large data set, the automatic scoring system ‘learns’ the correspondence between characteristics of the responses and the human scores awarded to them. Once the machine has been trained, it independently scores responses that were not used in the training. These independent machine-generated scores are then compared with the human scores awarded to the same responses. If they correlate highly enough (typically > 0.90), the automatic scoring system is considered to be sufficiently accurate.
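The final check described above, comparing machine scores with human scores on responses held out from training, can be sketched as a Pearson correlation test. This is a minimal illustration only: the function names, the sample scores, and the 1-to-5 scale are assumptions made for the example, not part of Pearson's system; only the 0.90 threshold comes from the text.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


def is_accurate_enough(human, machine, threshold=0.90):
    """True if held-out machine scores correlate with human scores above the threshold."""
    return pearson_r(human, machine) > threshold


# Hypothetical held-out responses (invented numbers, assumed 1-5 scale).
human_scores = [2.0, 3.5, 4.0, 1.5, 5.0, 3.0]
machine_scores = [2.2, 3.4, 4.1, 1.7, 4.8, 3.1]

print(is_accurate_enough(human_scores, machine_scores))
```

The key design point is that the responses used for this check are never seen during training, so a high correlation reflects generalization rather than memorization.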
The Pearson automatic system for assessing spoken language is the result of years of research in speech recognition, statistical modeling, linguistics, and testing theory. Over the last 11 years, more than 27 million test questions making use of this technology have been delivered, responded to, and automatically scored for individuals from over 100 countries around the world. Sample projects making use of the technology include:
• the assessment of spoken Spanish for the US Department of Homeland Security
• a test of aviation English, co-developed with the Federal Aviation Administration, to certify that pilots and air traffic controllers meet the language requirements set forth by the International Civil Aviation Organization (ICAO)
• a test of Dutch language and culture for the Justice Ministry's Immigration and Naturalization Service in the Netherlands
• an assessment of children's oral reading fluency to support the US No Child Left Behind and Reading First initiatives
The Pearson speech recognizers are more highly attuned than other speech recognizers because they have been optimized using thousands of speech samples from native and non-native speakers worldwide. Research shows that the Pearson speech recognizer is generally able to understand words as well as a native listener.
“Manner of speech” and “Content of speech” are kept entirely distinct in the scoring. For example, if test takers use the correct vocabulary in their answer but they pronounce the word poorly, the speech processors will credit the vocabulary score but debit the pronunciation score accordingly.
Typically, the correlations between two human scorers reach values between 0.80 and 0.90. Very skilled, well-trained scorers may even reach 0.93 or 0.95. Machine scores typically correlate with expert human ratings at the same level as two human raters do. Machine scores correlate better with well-trained expert human scorers than with human scorers who have low interrater agreement. Furthermore, machine scoring is trained on a large pool of human scorers: more than 250 in the case of PTE Academic. This means that the machine scores are in fact closer to the scores from the ideal human rater than the scores from individual human raters, or even pairs of human raters.
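The statistical intuition behind pooling many raters can be illustrated with a small simulation. It rests on a simplifying assumption that is not stated in the text: each human rating is modeled as an "ideal" score plus independent random noise. The simulation does not model Pearson's scoring system; it only shows that an average over a large pool (250 raters here, matching the figure above) lands closer to the ideal score than any single rater does. All names and parameter values are hypothetical.

```python
import random


def mean_abs_error(estimates, truth):
    """Average absolute distance between estimated scores and ideal scores."""
    return sum(abs(e - t) for e, t in zip(estimates, truth)) / len(truth)


def simulate(n_responses=200, n_raters=250, noise_sd=0.5, seed=0):
    """Compare one rater's error with the error of the pooled average.

    Assumed model: rating = ideal score + independent Gaussian noise.
    """
    rng = random.Random(seed)
    truth = [rng.uniform(1.0, 5.0) for _ in range(n_responses)]
    single, pooled = [], []
    for ideal in truth:
        ratings = [ideal + rng.gauss(0.0, noise_sd) for _ in range(n_raters)]
        single.append(ratings[0])               # one individual rater
        pooled.append(sum(ratings) / n_raters)  # average over the whole pool
    return mean_abs_error(single, truth), mean_abs_error(pooled, truth)


single_err, pooled_err = simulate()
print(f"single rater error: {single_err:.3f}, pooled error: {pooled_err:.3f}")
```

Under this noise model, the pooled error shrinks roughly with the square root of the number of raters, which is why a system trained on a large pool can track the ideal rater more closely than one or two individuals.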
In many ways, automated scoring gives more clinical, objective scoring than humans do. The machine scoring model always gives the same assessment of a spoken response and is not influenced by external factors such as the speaker’s appearance, personality, or body language. Furthermore, automated scoring gives standardized assessments regardless of where in the world the test is taken. Human scorers, on the other hand, give more impressionistic judgments and are influenced by a variety of external factors which lead to less consistent, less standardized scoring.
The system Pearson uses for automatically scoring speech is a proprietary and patented system not available to other companies. Versant has been operational for more than 10 years. The system is now available for several languages (English, Spanish, Dutch) and systems for more languages are under development (Arabic, French, German, Japanese). Versant has been used by hundreds of educational and commercial institutions worldwide. It is trusted by these institutions and also by governmental bodies (US Department of Homeland Security, Immigration and Naturalization Service in the Netherlands).
PTE Academic-recognizing organizations will be able to access test taker results online through a secure, password-protected website. Results will include an Overall Score, feedback on Communicative Skills (Reading, Writing, Listening and Speaking), and feedback on six Enabling Skills (Oral Fluency, Pronunciation, Spelling, Written Discourse, Vocabulary and Grammar). Recognizing institutions will have timely and accessible information.
Yes. One important aspect of accuracy reflects whether a test taker who takes a test one day will score similarly if he or she takes it on another occasion. Another aspect of accuracy is whether two different assessors who listen to a test taker talking, or who read the test taker’s writing, would award the test taker the same score. These are both considered measures of test “reliability”. Years of research have demonstrated that Pearson’s automated scoring is at least as reliable as human scoring, if not more reliable. If confronted with an identical speech or written sample on repeated occasions, the Pearson automated system will produce identical scores. Such impartiality coupled with consistency is very difficult to achieve with even the best-trained human assessors.
Pearson conducted “validation studies” to make sure that the machine’s scores are comparable to scores given by skilled human assessors. In a validation study, a new set of test takers’ responses (never seen by the machine) is scored both by human assessors and by the automated scoring system. When the human scores are compared with the machine scores, they are found to be remarkably similar. In fact, the difference between the human score and the machine score is usually smaller than the difference between one human score and another human score. This holds for both written and spoken assessments.
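The comparison described above, human-versus-machine difference against human-versus-human difference, can be sketched as follows. The scores are invented for illustration, and the choice of mean absolute difference as the measure is an assumption; the text does not say which statistic the validation studies used.

```python
def mean_abs_diff(a, b):
    """Average absolute difference between two equal-length score lists."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical scores for the same five unseen responses.
human_a = [3.0, 4.0, 2.0, 5.0, 3.5]
human_b = [3.5, 3.5, 2.5, 4.5, 3.0]
machine = [3.2, 3.8, 2.2, 4.8, 3.3]

# How far apart are the two human assessors?
human_human_gap = mean_abs_diff(human_a, human_b)

# How far is the machine from each human, on average?
human_machine_gap = (
    mean_abs_diff(human_a, machine) + mean_abs_diff(human_b, machine)
) / 2

print(human_machine_gap < human_human_gap)
```

In this made-up data set the machine sits between the two assessors, so its gap to either human is smaller than the gap between the humans themselves.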