Steve Cordogan, Ed.D., was previously director of research and evaluation at Township High School District 214 in Arlington Heights. He is a researcher, consultant and an adjunct professor of graduate educational statistics and research at Aurora University.
For many years, we have struggled to make sense of the volumes of high-stakes test data that surround us. We have our homegrown school-based tests, federally mandated state accountability tests, national tests, international tests, college entrance tests, workplace readiness tests and a variety of other tests which can have a profound influence on test-takers and their schools.
Because test data are more important than ever, we ingest more test data than ever, particularly from local, national and social media. For most subjects, we read articles and advertisements with a critical eye, forming opinions about their accuracy and reliability. However, we do not always think of articles on test data the same way. We assume test data are somehow objective, scientific and significant.
As a statistician, I argue that we need to read much more critically when evaluating claims about test data. This article examines issues surrounding current and future high-stakes test data.
High-stakes tests: More important than ever
We are entering a new world of testing, with the introduction of the Common Core State Standards and tests built to assess whether students have mastered them. Additionally, we will soon utilize these tests not only for student and school evaluation, but also for formal teacher evaluation. Despite current backlash against high-stakes testing, it is not going away.
Test data are vital for measuring student, school and district academic performance. Test data can identify at-risk students and guide their remediation. Data offer a reality check on a school’s perception of its performance by enabling comparisons with other schools. Schools can use such data to guide improvements in academic performance, such as identifying which curricular changes enhance student learning. The data are the most accessible measures of school accountability.
High-stakes testing truly is high-stakes. It usually requires many hours of student and teacher time. It subjects students to stressful hours of testing and labels them with scores that can have lifelong implications. Aggregate scores can label schools as desirable or undesirable, impact teacher and administrator careers, and ultimately be used to judge the educational systems of an entire state or country. So, as we stand on the threshold of a new testing era, we have to get it right.
Even the best test data have limits
We expect a lot out of tests. In a relatively brief time (as short as a few minutes for some computer-adaptive tests), we expect to know how much a student knows across a complex subject area.
But most standardized tests were never meant to solely define student achievement. They were meant as additional data points to supplement student learning as measured by classroom grades. When they were developed, we did not expect brief standardized tests to be as accurate a measure of student ability as a teacher’s appraisal of student performance over nine months of observation and testing. Almost every research study ever conducted has shown that future student classroom performance, whether in primary, secondary or post-secondary educational institutions, is best predicted by past classroom performance as measured by grades, not test scores. In our quest to improve accountability, we have lost sight of this extremely important fact.
However, we cannot compare grade data between schools. Test scores are one of the best ways we can compare overall performance of students across schools, states and countries. Measurements of college and career success might ideally be more authentic comparison measures, but the data are harder to gather.
Test scores cannot be used to measure school quality without important reservations. The most important factor in understanding test scores is student demographic characteristics. In Illinois, the percentage of students in a school receiving free and reduced lunches — that statistic alone — can predict over 70 percent of the variance in school ACT performance. In a more narrowly defined area, the Chicago suburbs, a combination of the percentage of adults in the community with bachelor’s degrees and the percentage of students who are black, Hispanic or Native American can explain 93 to 95 percent of the difference in district ACT performance. That means school and district performance is largely determined by the students who enter the school, not what the school does for the students. Some schools do significantly outperform expectations set by their demographics, and some underperform, but no open-enrollment public school in Illinois can escape the reality of demographics determining academic performance.
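The idea of a single demographic variable "explaining" most of the variance in school test scores can be illustrated with a toy regression. The sketch below uses made-up numbers, not the actual Illinois data; for one predictor, the variance explained (R-squared) is simply the squared correlation between predictor and outcome:

```python
# Illustration with invented data (NOT the Illinois dataset): how the share of
# variance in school test scores explained by one demographic predictor is
# computed. For a single predictor, R^2 is the squared Pearson correlation.

def r_squared(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))  # covariance numerator
    sxx = sum((a - mx) ** 2 for a in x)                   # variance of x
    syy = sum((b - my) ** 2 for b in y)                   # variance of y
    return (sxy * sxy) / (sxx * syy)

# Hypothetical schools: (% free/reduced-lunch students, average ACT composite)
pct_frl = [5, 15, 25, 40, 55, 70, 85]
avg_act = [24.8, 23.9, 22.5, 20.9, 19.4, 18.2, 16.9]

print(f"R^2 = {r_squared(pct_frl, avg_act):.2f}")  # share of variance explained
```

With numbers like these, a single poverty measure explains nearly all of the between-school score differences, which is the pattern the article describes at a smaller magnitude.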
One proposed way around demographic determination of school performance is to use growth models, in which average student growth across several grades is used instead of a final performance score to evaluate schools. For example, if a school’s students grow from an average ACT EXPLORE score of 12 to a score of 18 over the first three years of high school, its performance is considered equal to that of a school whose students grow from 15 to 21.
However, demographics also predetermine most growth levels. Students with more at-risk characteristics have both lower initial scores and lower growth. For example, I found that students entering a large suburban high school district with an average score of 12 on the ACT EXPLORE test grew only an average of four points to their PSAE ACT score, while students with an average EXPLORE score of 21 grew by more than eight points. Growth models will not compensate for differences in student characteristics, and any use of growth levels to evaluate schools or teachers that does not consider demographics or initial performance levels will penalize those who work with at-risk students.
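The penalty described above can be made concrete with a small illustration. The numbers below echo the article's example but are otherwise invented:

```python
# Hypothetical illustration (numbers echo the article's example, not real
# district data): raw growth comparisons ignore that growth itself depends on
# where students start.

# Average entering EXPLORE score and average exit ACT score for two schools
school_a = {"explore": 12, "act": 16}    # at-risk entering cohort: 4.0 points growth
school_b = {"explore": 21, "act": 29.5}  # high-scoring cohort: 8.5 points growth

def raw_growth(school):
    return school["act"] - school["explore"]

for name, school in [("A", school_a), ("B", school_b)]:
    print(f"School {name}: growth = {raw_growth(school):+.1f} points")

# A growth model that judges both schools against one fixed target (say, +6
# points) labels School A a failure and School B a success, even if each school
# produced exactly the growth typical for its entering students.
```

This is why the author argues growth targets must be conditioned on demographics or initial scores rather than applied uniformly.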
Tracking school improvement across years is also confounded by the school’s initial performance level and by demographic changes in its student body. For example, it is much easier for a school to improve if it was seriously underperforming initially, while the initially high-performing school has much less room for growth. An increase in the at-risk student population will usually lower academic performance, a change that cannot be attributed to school quality. Mathematically identifiable changes in a school’s performance may mean little in terms of actual instructional improvement.
There are other issues. Some tests are more accurate than others. Different versions of tests are not always consistent. And random fluctuations in data can cause substantial differences in school performance between years, particularly for smaller schools and subgroups like special education students.
With assessments as with any product, consumers must be educated to understand what the data really mean. Test data can provide us with useful answers if we use truly high-quality tests that are taken seriously by students, and we then account for the possible impact of demographics. But when testing companies and media outlets release data, they will report the data with few, if any, caveats. This means that we cannot — and should not — believe everything we read about test data.
Media outlets are businesses, not public servants
There certainly is nothing wrong with being a business, competing and trying to make a profit — this is a vital part of the economy of any country. But because media outlets are in the business of making money, even content as seemingly objective as test data needs to contribute to that bottom line.
A most unfortunate reality of modern media coverage is that negative news gets more attention than neutral or positive news. While local education news, often based on school press releases, may be positive, statewide news is disproportionately negative.
There have been countless examples of such reporting in the past few years, but I will focus on one article, “Illinois ACT scores post biggest drop in a decade,” printed in the Chicago Tribune on Aug. 21, 2013. The story ignored the fact that extended-time accommodated students, who comprise 10 percent of Illinois ACT takers and score much lower than the rest of the students, were included in the reported average for the first time. In reality, ACT scores in Illinois for students without extended-time accommodations (those included in all ACT reports in all years prior to 2013) had reached a 12-year record high since universal testing began. In other words, the headline was the opposite of the truth. In the year since, ACT scores in Illinois increased to another record high, as shown in the graph below.
Attempts in 2011 and 2014 to prompt newspaper publishers in all major metropolitan areas of Illinois to publish the data on record-high performance levels were mostly ignored, despite the fact that the data were public and could be verified. Jim Broadway’s State School News Service was the only media outlet to report the findings.
Testing companies are businesses, not public servants
Testing companies are too often assumed to be altruistic. Regardless of whether they are for-profit or technically not-for-profit, they are businesses whose survival depends on selling products. The distinction between for-profit and not-for-profit may be meaningless in terms of credibility (the National Football League is legally classified as a not-for-profit organization too).
When organizations make announcements about test results, they want to receive publicity. Publicity helps sell products. Unfortunately, again, negative news gets more attention than neutral or positive news. So most releases of test data from testing companies stress negative aspects: students are not improving, are doing worse than before, or are failing to meet standards. This reporting sustains demand for their products, either to continue monitoring the situation through further testing, or through purchase of the organization’s other products, which are touted to improve student learning — or at least improve test performance.
The most compelling example of such reporting is the ACT Corporation. ACT used its test findings to create a set of benchmark score levels in each test subject area. Students had to achieve this score level to be considered college-ready. The methodology was seriously flawed, but that issue is beyond the scope of this article.
For the past few years, ACT has issued a press release near the end of summer to announce the most recent graduating class test findings. ACT uses the occasion to claim that, according to its benchmarks, only around 25 percent of U.S. students are ready for college. This figure is invalidated by all other research I have seen, and has been refuted by ACT’s own research. In fact, the study ACT conducted to validate the benchmarks (available at www.act.org/research/policymakers/pdf/2005-2.pdf) found that 65 percent of students who met none of the benchmarks persisted to a second year in college with a better than C+ average. Almost two-thirds of the very lowest-testing students succeeded, according to ACT’s own standard. Obviously, the benchmarks’ value in predicting first-year college success is limited and utterly unworthy of a press release.
Furthermore, ACT omitted its own reading test from the study. The only possible reason I can identify is that the reading test did not have any predictive value. Despite such knowledge, ACT continues to use the same reading test.
The accuracy of using tests to measure student performance is not simply a debate on test psychometrics and academics. Testing and reporting of test data is big business, with multi-million and even multi-billion dollar companies. This is about money, jobs and even the survival of major organizations when a new test like PARCC (flawed but promising) is introduced. If a new test is accepted, an old test will be shoved aside, and a testing company will be substantially downsized or cease to exist.
We need objective measures of student and school performance. However, the serious use of high-stakes test data that followed implementation of No Child Left Behind showed that we did not understand the limitations of such data. Forming a conclusion, and even an emotional reaction, around a piece of data we see in the media is natural, but there is a very good chance that our conclusions, like the data upon which they are based, will be inaccurate. That is why school leaders, the media and the public need to become more educated consumers of test data.