How Standardized is Standardized Testing?

By Kenneth H. Wodtke

Can standardized group tests of achievement or ability be administered to young children in a manner which yields meaningful and valid information? A growing body of research, including work by the author, suggests that such testing may be inappropriate and in some cases even harmful to young school children. In spite of the available research showing that group testing is unsuitable for young children who are not cognitively or socially mature enough to take such tests, school districts throughout the U.S. continue to require teachers on an ever-increasing basis to administer tests to millions of children. The results of such testing are used for instructional grouping, student promotion decisions, referrals to special education, parent conferences, program evaluation, distribution of federal and state funds, and increasingly for the personnel evaluation of teachers and principals. According to one national survey conducted by Mary C. Austin and Coleman Morrison of the Harvard University Reading Project, more than 80% of teachers reported that they “always” or “often” used readiness tests to determine when children should begin formal reading instruction, and that “mental maturity,” as determined by psychological measurement, was used “always” or “often” by 40% as a deciding factor.

For the results of group testing to be valid and meaningful, the tests must be administered in a reasonably standardized manner in a context in which all children can do their best work on the test. Educational psychologists are unanimous and vehement in their view that the testing context must be standardized. This reflects not some fondness for rules, but the statistically proven fact that test results are simply not reliable and may well grossly distort reality if conditions are not standardized.

GROSS INCONSISTENCIES FOUND

Are school tests administered in conformity with the requirement that the testing conditions be the same for all? The results of the author’s research suggest that uniform group testing of young school children may be the exception rather than the rule. In my research, observers kept a detailed record of exactly what took place during the administration of standardized readiness tests in eight kindergarten classrooms. The results showed that while the testing in some classrooms conformed well to the ideal of scientific measurement, a number of classrooms revealed such extreme variability in testing conditions that their test results would be rendered totally invalid.

One source of variation during the testing involved the tendency of the children to call out answers to test questions. Young and inexperienced with the testing process, they apparently did not understand that calling out answers was inappropriate during testing. Only in one classroom did the children refrain from calling out answers. In the classroom with the greatest incidence of such behavior, the children verbalized answers during the presentation of 52% of the test questions. Approximately two-thirds of the answers called out in this classroom were correct. In addition, talking, laughter, shouting, getting out of seats, dropping pencils or crayons, etc., were also common, often contributing to a chaotic testing environment. Such behavior occurred during the presentation of 3% to 37% of the test questions in the eight classrooms and frequently created distractions for other children trying to take the test.

The size of testing groups varied from 2 to 18 children, creating very different management problems for the teachers during the testing. Several teachers administered four subtests in one sitting without rest breaks (the examiner’s manual for this test called for no more than two subtests in one sitting), and the required Practice Test booklet was not utilized in another classroom. Interruptions during the testing were common, including announcements on the school intercom by principals, secretaries, and older students. One examiner tested in the cloakroom and nurse’s room. Such variation from commonly accepted testing practice would not be conducive to optimal or valid test performance in young children. None of the test scores, however, were declared invalid, as would be required for such non-standardized testing conditions.

CUES FROM THE TEACHER

In all but one classroom, the teacher served as the test examiner. One category of the examiner’s behavior which was observed in the study was the extent to which the examiner prompted the children or provided cues to correct answers. While five examiners exhibited little or no cueing, one cued on 4% of the test items, a second on 9%, and a third helped the children on 24% of the questions. Obviously, comparing the test results of these latter three classrooms with classrooms or schools in which examiners adhered more closely to the standardized procedures would be totally inappropriate; however, school districts frequently do make such comparisons.

Other significant variations in testing included examiners who inadvertently omitted critical sections of the test instructions and numerous instances of test items which were read to the children incorrectly. In a classroom in which the auditory memory subtest of the Metropolitan Readiness Tests was administered, the teacher omitted the instruction to the children “to close your eyes and listen.” Thus, she totally invalidated the scores as measures of “auditory memory.”

DAMAGING RESULTS

These results raise serious questions concerning the indiscriminate group testing of very young children. Kindergarten and first grade children, many of whom are non-readers, may simply be too young to understand the requirements of a group testing situation. When test scores resulting from non-standardized testing conditions are used in decisions which may permanently affect a child’s educational future, such as ability grouping or grade promotion, such practices may be harmful and raise serious ethical questions. Similar ethical issues would be involved if such test results were used in the personnel evaluation of teachers or principals. According to Professor Samuel Meisels of the University of Michigan’s Center for Human Growth and Development, improper labeling of children based on test scores “has resulted in young children being denied a free and appropriate public education.”

A kindergarten classroom surely cannot be likened to a scientific laboratory. It is more accurately described as a complex social system which may differ from other classrooms on many dimensions, such as the children’s level of development, student cultural or socioeconomic characteristics, parental values, teacher-student relations, and teacher philosophy and instructional practice. All of these potential differences between classrooms and schools may affect the testing process. The scientific ideal of standardized measurement may simply be unattainable in many classrooms — and especially in classrooms with very young children.

School systems report test results to parents and the general public as if they were based on scientific measurement. In fact, the testing situation often includes confusion, student anxiety, resistance, immature children, inadequately trained examiners, and many other problems that are endemic to schools. In such cases, school officials are promoting the illusion of valid test results which conceals underlying practices detrimental to the instructional process and harmful to children’s educational development.

TOWARD A MORE ENLIGHTENED POLICY

As a matter of testing policy, I believe these results support a recommendation that group tests should not be administered on a general or unrestricted basis to children until they have developed reading proficiency and an adequate level of cognitive and emotional maturity. This would mean excluding most kindergarten, first, and second grade pupils from group testing. Special problems arise in testing large groups of young children who cannot read, who do not yet know how to follow complex instructions, and who are developmentally unable to concentrate for a prolonged period on a cognitive task in a group setting. It is difficult to imagine how valid test results can be obtained in a setting where the examiner must read all of the test instructions and questions to a group of very young children in lock-step. Children who fall behind in the test, fail to find the correct page, become confused, experience anxiety, or mark at random, cannot be adequately identified and responded to by a single examiner in a group testing situation. Most teachers I have talked to agree that group testing at these early grade levels is highly problematic at best.

Discontinuing group testing in the early grades is not without some precedent. A number of school districts in our own area do not, as a matter of policy, administer group tests in kindergarten. The principle which justifies this sensible policy would certainly apply to the first grade and to many second grades as well. These enlightened school districts have thrived without kindergarten group testing, and apparently parents and teachers have supported the policy. The argument that such early testing is needed to provide essential information for early learning is not borne out in the experience of these districts. Unfortunately, the Milwaukee Public Schools is not one of these more enlightened school districts.

Group testing is a deeply ingrained practice in the culture of schools and in the administrative bureaucracy of large school districts. The multi-million dollar testing industry places great pressure on district administrators to spend our tax money on more frequent testing or the purchase of the latest editions of standardized tests. The annual administration of group tests is used to justify school district funding requirements and provides employment for a large staff of testing personnel at the district office level (not to mention creating profits for large publishing companies). In spite of the savings in direct costs and instructional time which could be realized if group testing were discontinued in grades K-2, school administrators are not likely to be easily persuaded of the merits of such a proposal. In fact, the administration of the Milwaukee Public Schools refused the author’s request to conduct follow-up research on the important question of kindergarten testing, even though the proposal was awarded funds in a competitive peer review. Clearly some school administrators can be quite defensive about their testing program, to the point of refusing researchers the opportunity to conduct observational research on testing.

One way that a testing policy might be modified to prohibit inappropriate testing of young children is through the pressure of parents’ groups, teachers’ professional associations, and teachers’ unions. Teachers as test examiners are in the best position to know whether the group testing of their kindergarten, first, or second grade children is defensible or in the children’s best interests. Where teachers, through their union or professional association, identify serious problems in their group testing program, they can begin with an educational campaign for parents and the general public regarding the need to discontinue such practices. A proposal to discontinue, or at least severely restrict, group testing in grades K-2 could be made by a teachers’ union either directly to the school district administration or through the collective bargaining process.

The time spent inappropriately group testing children can be put to better use in instruction and developmental activities. Since school administrators will undoubtedly want to continue a district-wide group testing program, they can begin at the third grade when at least the majority of the students will be cognitively and emotionally mature enough to handle a group testing situation. Such a testing policy would be much more consistent with the responsibilities of schools for enhancing the general welfare and development of children than current group testing policy in many school districts.

*This article is a summary of a research manuscript entitled, “Social Context Effects in Early School Testing: An Observational Study of the Testing Process.” Copies of the complete research report may be obtained by writing to the author, Professor Kenneth H. Wodtke, Department of Educational Psychology, University of Wisconsin-Milwaukee, P.O. Box 413, Milwaukee, WI 53201.

(For additional research see Samuel J. Meisels’s article, “Uses and Abuses of Developmental Screening and School Readiness Testing,” in the January 1987 issue of Young Children.)

Included in:

Volume 1, No.2

Spring 1987