Third Education Group® Review / Articles, Volume 1 Number 2

©All Rights Reserved, Richard P. Phelps, 2007




THE SOURCE OF LAKE WOBEGON


Richard P Phelps



 ABSTRACT

John J. Cannell's late-1980s “Lake Wobegon” reports suggested widespread deliberate educator manipulation of norm-referenced standardized test (NRT) administrations and results, resulting in artificial test score gains. The Cannell studies have been referenced in education research since, but as evidence that high stakes (and not cheating or lax security) cause test score inflation. This article examines that research and Cannell's data for evidence that high stakes cause test score inflation. No such evidence is found. Indeed, the evidence indicates that, if anything, the absence of high stakes is associated with artificial test score gains. The variable most highly correlated with test score inflation is general performance on achievement tests, with traditionally low-performing states exhibiting more test score inflation, on low-stakes norm-referenced tests, than traditionally high-performing states, regardless of whether or not a state also maintains a high-stakes testing program. The unsupported high-stakes-cause-test-score-inflation hypothesis seems to derive from the surreptitious substitution of an antiquated definition of the term “high stakes” and a few studies afflicted with left-out-variable bias.




Introduction


We know that tests that are used for accountability tend to be taught to in ways that produce inflated scores.

                                                                                                          – D. Koretz, CRESST 1992, p.9 


Corruption of indicators is a continuing problem where tests are used for accountability or other high-stakes purposes.

                                                                                                          – R.L. Linn, CRESST 2000, p.5


 The negative effects of high stakes testing on teaching and learning are well known.

Under intense political pressure, test scores are likely to go up without a corresponding improvement in student learning…

all tests can be corrupted.

                                                                                                          – L.A. Shepard, CRESST 2000


High stakes… lead teachers, school personnel, parents, and students to focus on just one thing:

raising the test score by any means necessary. There is really no way that current tests

can simultaneously be a legitimate indicator of learning and an object of concerted attention.

– E.L. Baker, CRESST 2000, p.18



People cheat. Educators are people. Therefore, educators cheat. Not all educators, nor all people, but some.

   

   This simple syllogism would seem incontrovertible. As is true for the population as a whole, some educators will risk cheating even in the face of measures meant to prevent or detect it. More will try to cheat in the absence of anti-cheating measures. As is also true for the population as a whole, some courageous and highly-principled souls will refuse to cheat even when many of their colleagues do.

   

   Some education researchers, however, assert that deliberate educator cheating had nothing to do with the Lake Wobegon effect. Theirs are among the most widely cited and celebrated articles in the education policy research literature. Members of the federally funded Center for Research on Education Standards and Student Testing (CRESST) have, for almost two decades, asserted that high stakes cause “artificial” test score gains. They identify “teaching to the test” (i.e., test prep or test coaching) as the direct mechanism that produces this “test score inflation.”



             The High-Stakes-Cause-Test-Score-Inflation Hypothesis

             The empirical evidence they cite to support their claim is less than abundant, however, largely consisting of,

 

∙first, a quasi-experiment they conducted themselves fifteen years ago in an unidentified school district with unidentified tests (Koretz, Linn, Dunbar, Shepard 1991),

∙second, certain patterns in the pre- and post-test scores from the first decade or so of the Title I Evaluation and Reporting System (Linn 2000, pp.5, 6), and

∙third, the famous late-1980s “Lake Wobegon” reports of John Jacob Cannell (1987, 1989), as they interpret them.

   

   Since the publication of Cannell’s Lake Wobegon reports, it has, indeed, become “well known” that accountability tests produce score inflation. Well known or, at least, very widely believed. Many, and probably most, references to the Lake Wobegon reports in education research and policy circles since the late 1980s have identified high stakes, and only high stakes, as the cause of test score inflation (i.e., test score gains not related to achievement gains).

   

   But, how good is the evidence?


   In addition to studying the sources the CRESST researchers cite, I have analyzed Cannell’s data in search of evidence. I surmised that if high stakes cause test score inflation, one should find the following:

 

∙grade levels closer to a high-stakes event (e.g., a high school graduation test) showing more test score inflation than grade levels further away;

∙direct evidence that test coaching (i.e., teaching to the test), when isolated from other factors, increases test scores; and

∙an association between stakes in a state testing program and test score inflation.


   One could call this the “weak” version of the high-stakes-cause-score-inflation hypothesis.


   I further surmised that if high stakes alone, and no other factor, cause artificial test score gains, one should find no positive correlation between test score gains and other factors, such as lax test security, educator cheating, student and teacher motivation, or tightening alignment between standards, curriculum, and test content.


   One could call this the “strong” version of the high-stakes-cause-score-inflation hypothesis.



John Jacob Cannell and the “Lake Wobegon” Reports


Welcome to Lake Wobegon, where all the women are strong, all the men are good-looking,

and all the children are above average.

                                                                                      Garrison Keillor, A Prairie Home Companion

 

It is clear that the standardized test results that were widely reported as part of accountability systems

in the 1980s were giving an inflated impression of student achievement.

                                                                                                          – R.L. Linn, CRESST 2000, p.7

 

In 1987, a West Virginia physician, John Jacob Cannell, published the results of a study, Nationally Normed Elementary Achievement Testing in America’s Public Schools. He had been surprised that West Virginia students kept scoring “above the national average” on a national norm-referenced standardized test (NRT), given the state’s low relative standing on other measures of academic performance. He surveyed the situation in other states and with other NRTs and discovered that, according to their norm-referenced test scores, the students in every state were “above the national average” on elementary achievement tests.

 

   The phenomenon was dubbed the “Lake Wobegon Effect,” in tribute to the mythical radio comedy community of Lake Wobegon, where “all the children are above average.” The Cannell report implied that half the school superintendents in the country were lying about their schools’ academic achievement. It further implied that, with poorer results, the other half might lie, too.

 

   School districts could purchase NRTs “off-the-shelf” from commercial test publishers and administer them on their own. With no “external” test administrators watching, school and district administrators were free to manipulate any and all aspects of the tests. They could look at the test items beforehand, and let their teachers look at them, too. They could give the students as much time to finish as they felt like giving them. They could keep using the same form of the test year after year. They could even score the tests themselves. The results from these internally-administered tests primed many a press release. (See Cannell 1989, Chapter 3)

 

   Cannell followed up with a second report (1989), How Public Educators Cheat on Standardized Achievement Tests, in which he added similar state-by-state information for the secondary grades. He also provided detailed results of a survey of test security practices in the 50 states (pp.50–102), and printed some of the feedback he received from teachers in response to an advertisement his organization had placed in Education Week in spring 1989 (Chapter 3).



Institutional Responses to the Cannell Reports


The proper use of tests can result in wiser decisions about individuals and programs than would be

the case without their use…. The improper use of tests, however, can cause considerable harm.…

                                                                                                          – AERA, APA, & NCME 1999, p.1


The Lake Wobegon controversy led many of the testing corporations to be more timely

in producing new norms tables to accompany their tests.

                                                                                                          – M. Chatterji 2003, p.25

 

The natural response to widespread cheating in most non-education fields would be to tighten security and to transfer the evaluative function to an external agency or agencies—agencies with no, or at least fewer, conflicts of interest. This is how testing with stakes has been organized in hundreds of other countries for decades.


     Steps in this direction have been taken in the United States, too, since publication of Cannell’s Reports. For example, it is now more common for state agencies, and less common for school districts, to administer tests with stakes. In most cases, this trend has paralleled both a tightening of test security and greater transparency in test development and administration.


     There was a time long ago when education officials could administer a test statewide and then keep virtually all the results to themselves. In those days, those education officials with their fingers on the score reports could look at the summary results first, before deciding whether or not to make them public via a press release. Few reporters then even covered systemwide, and mostly diagnostic, testing, much less knew when the results arrived at the state education department offices. But, again, this was long ago.


 

Legislative Responses

     Between then and now, we have seen both California (in 1978) and New York State (in 1979) pass “truth in testing” laws that give individual students, or their parents, access to the corrected answers from standardized tests, not just their scores. The laws also require test developers to submit technical reports, specifying how they determined their test’s reliability and validity, and they require schools to explain the meaning of the test scores to individual students and their parents, while maintaining the privacy of all individual student test results.


     Between then and now, we have seen the U.S. Congress pass the Family Educational Rights and Privacy Act (FERPA), also called the Buckley Amendment (after the sponsor, Senator James Buckley (NY)), which gives individual students and their parents similar rights of access to test information and assurances of privacy. Some federal legislation concerning those with disabilities has also enhanced individual students’ and parents’ rights vis-à-vis test information (e.g., the Rehabilitation Act of 1973).

   

 

Judicial Responses

     Between then and now, the courts, both state and federal, have rendered verdicts that further enhance the public’s right to access test-related information. Debra P. v. Turlington (1981) (Debra P. being a Florida student and Mr. Turlington being Florida’s education superintendent at the time) is a case in point. A high school student who failed a nationally-norm-referenced high school graduation examination sued, employing the argument that it was not constitutional for the state to deny her a diploma based on her performance on a test that was not aligned to the curriculum to which she had been exposed. In other words, for students to have a fair chance at passing a test, they should be exposed to the domain of subject matter content that the test covers; in fairness, they should have some opportunity to learn in school what they must show they have learned on a graduation test. In one of the most influential legal cases in U.S. education history, the court sided with Debra P. against the Florida Education Department.


     A more recent and even higher profile case (GI Forum v. Texas Education Agency (2000)), however, reaffirmed that students still must pass a state-mandated test to graduate, if state law stipulates that they must.

 

Response of the Professions

     Cannell’s public-spirited work, and the shock and embarrassment resulting from his findings within the psychometric world, likely gave a big push to reform as well. The industry bible, the Standards for Educational and Psychological Testing, mushroomed in size between its 1985 and 1999 editions, and now consists of 264 individual standards (i.e., rules, guidelines, or instructions) (American Educational Research Association 1999, pp. 4, 5):

 

“The number of standards has increased from the 1985 Standards for a variety of reasons.… Standards dealing with important nontechnical issues, such as avoiding conflicts of interest and equitable treatment of all test takers, have been added… such topics have not been addressed in prior versions of the Standards.”


     The Standards now comprise 123 individual standards related to test construction, evaluation, and documentation, 48 individual standards on fairness issues, and 93 individual standards on the various kinds of testing applications (e.g., credentialing, diagnosis, and educational assessment). Close to a hundred member & research organizations, government agencies, and test development firms sponsor the development of the Standards and pledge to honor them.


     Nowadays, to be legally defensible, the development, administration, and reporting of any high-stakes test must adhere to the Standards which, technically, are neither laws nor government regulations but are, nonetheless, regarded in law and practice as if they were (Buckendahl & Hunt 2005).



Education Researchers’ Response to the Cannell Reports


There are many reasons for the Lake Wobegon Effect,

most of which are less sinister than those emphasized by Cannell

                                                                                                          – R.L. Linn, CRESST 2000, p.7

 

The Cannell Reports attracted a flurry of research papers (and no group took to the task more vigorously than those at the Center for Research on Education Standards and Student Testing (CRESST)). Most researchers concurred that the Lake Wobegon Effect was real—across most states, many districts, and most grade levels, more aggregate average test scores were above average than would have been expected by chance—many more.


     But, what caused the Lake Wobegon Effect? In his first (1987) report, Cannell named most of the prime suspects—educator dishonesty (i.e., cheating) and conflict of interest, lax test security, inadequate or outdated norms, inappropriate populations tested (e.g., low-achieving students used as the norm group, or excluded from the operational test administration), and teaching the test.


     In a table that “summarizes the explanations given for spuriously high scores,” Shepard (1990, p.16) provided a cross-tabulation of alleged causes with the names of researchers who had cited them. Conspicuous in their absence from Shepard’s table, however, were Cannell’s two primary suspects—educator dishonesty and lax test security. This research framework presaged what was to come, at least from the CRESST researchers. The Lake Wobegon Effect continued to receive considerable attention and study from mainstream education researchers, especially those at CRESST, but Cannell’s main points—that educator cheating was rampant and test security inadequate—were dismissed out of hand, and persistently ignored thereafter.

 


Semantically Bound


The most pervasive source of high-stakes pressure identified by respondents was media coverage.

                                                                                                          – L.A. Shepard, CRESST 1990, p.17


In his second (1989) report, Cannell briefly discussed the nature of stakes in testing. The definition of “high stakes” he employed, however, would be hardly recognizable today. According to Cannell (1989, p.9),

 

“Professor Jim Popham at UCLA coined the term, ‘high stakes’ for tests that have consequences. When teachers feel judged by the results, when parents receive reports of their child’s test scores, when tests are used to promote students, when test scores are widely reported in the newspapers, then the tests are ‘high stakes.’”


     Researchers at the Center for Research on Education Standards and Student Testing (CRESST) would use the same definition. For example, Shepard (1990, p.17) wrote:

 

“Popham (1987) used the term high-stakes to refer to both tests with severe consequences for individual pupils, such as non-promotion, and those used to rank schools and districts in the media. The latter characterization clearly applies to 40 of the 50 states [in 1990]. Only four states conduct no state testing or aggregation of local district results; two states collect state data on a sampling basis in a way that does not put the spotlight on local districts. [Two more states] report state results collected from districts on a voluntary basis. Two additional states were rated as relatively low-stakes by their test coordinators; in these states, for example, test results are not typically page-one news, nor are district rank-orderings published.”


     Nowadays, the definition that Cannell and Shepard attributed to Popham is rather too broad to be useful, as it is difficult to imagine a systemwide test that would not fit within it. The summary results of any systemwide test must be made public. Thus, if media coverage is all that is necessary for a test to be classified as “high stakes,” all systemwide tests are high stakes tests. If all tests are high stakes then, by definition, there are no low-stakes tests and the terms “low stakes” and “high stakes” make no useful distinctions.


     This is a bit like calling all hours daytime. One could argue that there’s some validity to doing so, as there is at all times some amount of light present, from the moon and the stars, for example, even if it is sometimes an infinitesimal amount (on cloudy, moonless nights, for example), or from fireflies, perhaps. But, the word “daytime” becomes much diminished in utility once its meaning encompasses its own opposite.


     Similarly, one could easily make a valid argument that any test must have some stakes for someone; otherwise why would anyone make the effort to administer or take it? But, stakes vary, and calling any and all types of stakes, no matter how slight, “high” leaves one semantically constrained.


     To my observation, most who join height adjectives to the word “stakes” in describing test impacts these days roughly follow this taxonomy:

 

High Stakes – consequences that are defined in law or regulations result from exceeding, or not, one or more score thresholds. For a student, for example, the consequences could be completion of a level of education, or not, or promotion to the next grade level or not. For a teacher, the consequences could be job retention or not, or salary increase or bonus, or not.

 

Medium Stakes – partial or conditional consequences that are defined in law or regulations result from exceeding, or not, one or more score thresholds. For a student, for example, the consequences could be an award, or not, admission to a selective, but non-required course of study, or not, or part of a “moderated” or “blended” score or grade, only the whole of which has high-stakes consequences.

 

Low Stakes – the school system uses test scores in no manner that is consequential for students or for educators that is defined in law or regulations. Diagnostic tests, particularly when they are administered to anonymous samples of or individual students, are often considered low-stakes tests.


     The definitions for “high-stakes test” and “low-stakes test” in the Standards for Educational and Psychological Testing (1999) are similar to mine above:

 

“High-stakes test. A test used to provide results that have important, direct consequences for examinees, programs, or institutions involved in the testing.” (p.176)

 

“Low-stakes test. A test used to provide results that have only minor or indirect consequences for examinees, programs, or institutions involved in the testing.” (p.178)


     Note that, by either taxonomy, the fact that a school district superintendent or a school administrator might be motivated to artificially inflate test scores (to, for example, avoid embarrassment or pad a résumé) does not give a test high or medium stakes. By these taxonomies, avoiding discomfiture is not considered to be a “stake” of the same magnitude as, say, a student being denied a diploma or a teacher losing a job. Administrator embarrassment is not a direct consequence of the testing nor, many would argue, is it an important consequence of the testing.


     By either taxonomy, then, all but one of the tests analyzed by Cannell in his late 1980s-era Lake Wobegon reports were low stakes tests. With one exception (the Texas TEAMS), none of the Lake Wobegon tests was standards-based and none carried any direct or important state-imposed or state-authorized consequences for students, teachers, or schools.


     Still, high stakes or no, some were motivated to tamper with the integrity of test administrations and to compromise test security. That is, some people cheated in administering the tests, and then misrepresented the results.



Wriggling Free of the Semantic Noose


The phrase, teaching the test, is evocative but, in fact, has too many meanings to be directly useful.

                                                                                                          – L.A. Shepard, CRESST 1990, p.17.


The curriculum will be degraded when tests are ‘high stakes,’ and when specific test content is known in advance.

                                                                                                          – J.J. Cannell 1989, p.26


Cannell reacted to the semantic constraint of Popham’s overly broad definition of “high stakes” by coining yet another term, “legitimate high stakes,” which he contrasted with other high stakes that, presumably, were not “legitimately” high. Cannell’s “legitimate high stakes” tests are equivalent to what most today would identify as medium- or high-stakes tests (i.e., standards-based, accountability tests). Cannell’s “not legitimately high stakes” tests (the nationally-normed achievement tests administered in the 1980s mostly for diagnostic reasons) would be classified as low-stakes tests in today’s most common terminology. (See, for example, Cannell 1989, pp.20, 23)


     But, as Cannell so effectively demonstrated, even those low-stakes test scores seemed to matter a great deal to someone. The people to whom the test scores mattered the most were district and school administrators who could publicly advertise the (artificial) test score gains as evidence of their own performance.


     Then and now, however, researchers at the Center for Research on Education Standards and Student Testing (CRESST) neglected to make the “legitimate/non-legitimate,” or any other, distinction between the infinitely broad Popham definition of “high stakes” and the far more narrow meaning of the term common today. Both then and now, they have left the definition of “high stakes” flexible and, thus, open to easy misinterpretation. “High stakes” could mean pretty much anything one wanted it to mean, and serve any purpose.



Defining “Test Score Inflation”

 

Cannell’s reports …began to give public credence to the view that scores on high-stakes tests could be inflated.

                                                                                                          – D.M. Koretz, et al. CRESST 1991, p.2

 

Not only can the definition of the term “high stakes” be manipulated and confusing, so can the definition of “test score inflation.” Generally, the term describes increases (usually over time) in test scores on achievement tests that do not represent genuine achievement gains but, rather, gains due to something not related to achievement (e.g., cheating, “teaching to the test” (i.e., test coaching)). To my knowledge, however, the term has never been given a measurable, quantitative definition.

 

For some of the analysis here, however, I needed a measurable definition and, so, I created one. Using Cannell’s state-level data (Cannell 1989, Appendix I), I averaged the number of percentage-points above the 50th percentile across grades for each state, for which such data were available. In table 1 below, the average number of percentage points above the 50th percentile is shown for states with some high-stakes testing (6.1 percentage points) and for states with no high-stakes testing (12.1 percentage points).

 

 

Table 1.

State had high-stakes test?     Average number of percentage points above 50th percentile
Yes (N=13)                      6.1
No (N=12)                       12.2

25 states had insufficient data

SOURCE: J.J. Cannell, How Public Educators Cheat on Standardized Achievement Tests, Appendix I.

 

 

 

At first blush, it would appear that test score inflation is not higher in high-stakes testing states. Indeed, it appears to be lower.

 

The comparison above, however, does not control for the fact that some states generally score above the 50th percentile on standardized achievement tests even when their test scores are not inflated. To adjust the percentage-point averages for the two groups of states (those with high stakes and those without), I used average state mathematics percentile scores from the 1990 or 1992 National Assessment of Educational Progress (NAEP) to compensate (NCES, p.725).

   

For example, in Cannell’s second report (1989), Wisconsin’s percentage-point average above the 50th percentile on norm-referenced tests (NRTs) is +20.3 (p.98). But, Wisconsin students tend to score above the national average on achievement tests no matter what the circumstances, so the +20.3 percentage points may not represent “inflation” but actual achievement that is higher than the national average. To adjust, I calculated the percentile-point difference between Wisconsin’s average percentile score on the 1990 NAEP and the national average percentile score on the 1990 NAEP: +14 percentage points. Then, I subtracted the +14 from the +20.3 to arrive at an “adjusted” test score “inflation” number of +6.3.
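The arithmetic of this indicator can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`raw_inflation`, `adjusted_inflation`) are hypothetical, and the per-grade percentile values are invented for the example and are not taken from Cannell's appendix. Only the Wisconsin figures (+20.3 and +14) come from the text.

```python
from statistics import mean

def raw_inflation(percentiles_by_grade):
    """Average number of percentage points above the 50th percentile across grades."""
    return mean(p - 50.0 for p in percentiles_by_grade)

def adjusted_inflation(raw_points, naep_points_above_national):
    """Subtract the state's NAEP advantage (or deficit) from the raw figure."""
    return round(raw_points - naep_points_above_national, 1)

# Hypothetical per-grade percentile scores, for illustration only.
example_grades = [62.0, 58.0, 60.0]
print(raw_inflation(example_grades))  # 10.0

# Wisconsin, using the two figures quoted in the text.
wisconsin = adjusted_inflation(20.3, 14.0)
print(wisconsin)  # 6.3
```

A state that scores below the national average on NAEP has a negative second argument, so its adjusted figure is larger than its raw figure.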

   

I admit that this is a rough way of calculating a “test score inflation” indicator. Just one problem is the reduction in the number of data points. Between the presence (or not) of statewide NRT administration and the presence (or not) of NAEP scores from 1990 or 1992, half of the states in the country lack the necessary data to make the calculation. Nonetheless, as far as I know, this is the first attempt to apply any precision to the measurement of an “inflation” factor.

   

With the adjustment made (see table 2 below), at second blush, it would appear that states with high-stakes tests might have more “test score inflation” than states with no high-stakes tests, though the result is still not statistically significant.

 

Table 2.

State had high-stakes test?     Average number of percentage points above 50th percentile (adjusted)
Yes (N=13)                      11.4
No (N=12)                       8.2

25 states had insufficient data

SOURCE: J.J. Cannell, How Public Educators Cheat on Standardized Achievement Tests, Appendix I.

 

These data at least lean in the direction that the CRESST folk have indicated they should, but not yet very convincingly.
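A short sketch shows how the adjustment reverses the comparison. It uses the group averages as printed in tables 1 and 2, and it assumes that the same 13 and 12 states make up the groups in both tables, in which case the average NAEP difference for each group can be backed out as the raw average minus the adjusted average. The variable and function names are illustrative only.

```python
# Group averages as printed in tables 1 and 2
# (percentage points above the 50th percentile).
raw = {"high_stakes": 6.1, "no_high_stakes": 12.2}
adjusted = {"high_stakes": 11.4, "no_high_stakes": 8.2}

def gap(table):
    """High-stakes group average minus no-high-stakes group average."""
    return round(table["high_stakes"] - table["no_high_stakes"], 1)

# Since adjusted = raw - NAEP difference, the implied average
# NAEP difference for each group is raw - adjusted.
implied_naep = {group: round(raw[group] - adjusted[group], 1) for group in raw}

print(gap(raw))       # -6.1
print(gap(adjusted))  # 3.2
print(implied_naep)   # {'high_stakes': -5.3, 'no_high_stakes': 4.0}
```

Under that assumption, the high-stakes states averaged about 5 points below the national NAEP average and the other states about 4 points above it, which is what moves the two groups in opposite directions.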

 

 

Testing the “Strong” Version of the High-Stakes-Cause-Score-Inflation Hypothesis

 

Research has continually shown that increases in scores… reflect factors other than increased student achievement.

Standards-based assessments do not have any better ability to correct this problem.

                                                                                                  – R.L. Linn, CRESST 1998, p.3

 

As mentioned earlier, the “strong” test of the high-stakes-[alone]-cause[s]-test-score-inflation hypothesis requires that we be unable to find a positive correlation between test score gains and any of the other suspected factors, such as lax test security and educator cheating.

 

Examining Cannell’s data, I assembled four simple cross-tabulation tables. Two compare the presence of high stakes in the states to, respectively, their item rotation practices and their level of test security as described by Cannell in his second report. The next two tables compare the average number of percentage points above the 50th percentile (adjusted for baseline performance with NAEP scores) on the “Lake Wobegon” tests (a rough measure of “test score inflation”) to their item rotation practices and their level of test security.

 

 

Item Rotation

Cannell noted in his first report that states that rotated items had no problem with test score inflation. (Cannell 1987, p.7) In his second report, he prominently mentions item rotation as one of the solutions to the problem of artificial test score gains.

   

According to Cannell, 20 states employed no item rotation and 16 of those 20 had no high-stakes testing. Twenty-one states rotated items and the majority, albeit slight, had high-stakes testing (see table 3 below).

 

 

Table 3.

                                Did state rotate test items?
State had high-stakes test?     yes       no
Yes                             11        4
No                              10        16

9 states had insufficient data

SOURCE: J.J. Cannell, How Public Educators Cheat on Standardized Achievement Tests, Appendix I.
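The counts in table 3 can be checked against the totals quoted in the text with a small Python sketch. The dictionary layout and names are assumptions made for illustration; the numbers are those of the table.

```python
# Counts from table 3: rows are high-stakes status, columns are item rotation.
table3 = {
    "high_stakes": {"rotated": 11, "not_rotated": 4},
    "no_high_stakes": {"rotated": 10, "not_rotated": 16},
}
STATES_WITHOUT_DATA = 9

def column_total(table, column):
    """Number of states in one rotation column, summed over both rows."""
    return sum(row[column] for row in table.values())

rotated = column_total(table3, "rotated")
not_rotated = column_total(table3, "not_rotated")

# Totals quoted in the text: 21 rotating states, 20 non-rotating states.
print(rotated, not_rotated)  # 21 20
# Every state should be accounted for exactly once.
print(rotated + not_rotated + STATES_WITHOUT_DATA)  # 50
# Share of rotating states with high-stakes testing (the slight majority).
print(round(table3["high_stakes"]["rotated"] / rotated, 2))  # 0.52
# Share of non-rotating states with no high-stakes testing.
print(round(table3["no_high_stakes"]["not_rotated"] / not_rotated, 2))  # 0.8
```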

 

 

Contrasting the average “test score inflation,” as calculated above (i.e., the average number of percentage points above the 50th percentile (adjusted by NAEP performance)), between item-rotating and non-item-rotating states, it would appear that states that rotated items had less test score inflation (see table 4 below).

 

 

Table 4.

                                                                        Did state rotate test items?
                                                                        yes       no
Average number of percentage points above 50th percentile (adjusted)    9.3       10.0
                                                                        N=12      N=9

29 states had insufficient data

SOURCE: J.J. Cannell, How Public Educators Cheat on Standardized Achievement Tests, Appendix I.

 

 

Level of Test Security

Cannell administered a survey of test security practices and received replies from all but one state (Cannell 1989, Appendix I). As Cannell himself noted, the results require some digesting. For just one example, a state could choose to describe the test security practices for a test for which security was tight and not describe the test security practices for other tests, for which security was lax, or vice versa. Most states at the time administered more than one testing program.

   

I classified a state’s security practices as “lax” if they claimed to implement only one or two of the dozen or so practices about which Cannell inquired. I classified a state’s security practices as “moderate” if they claimed to implement about half of Cannell’s list. Finally, I classified a state’s security practices as “tight” if they claimed to implement close to all of the practices on Cannell’s list.

   

These three levels of test security are cross-tabulated with the presence (or not) of high-stakes testing in a state in table 5 below. Where there was lax test security, only four of 19 states had high-stakes testing. Where there was moderate test security, only four of 14 states had high-stakes testing. Where there was tight test security, however, eight of ten states had high-stakes testing.

 

 

Table 5.

                                What was the quality of test security in the state?
State had high-stakes test?     Lax       Moderate  Tight
Yes                             4         4         8
No                              15        10        2

7 states had insufficient data

SOURCE: J.J. Cannell, How Public Educators Cheat on Standardized Achievement Tests, Appendix I.
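Expressed as proportions, the pattern in table 5 is easier to see. The following Python sketch simply divides the counts from the table; the names are illustrative assumptions and nothing beyond the table's own figures is used.

```python
# Counts from table 5, keyed by security level.
table5 = {
    "lax": {"high_stakes": 4, "no_high_stakes": 15},
    "moderate": {"high_stakes": 4, "no_high_stakes": 10},
    "tight": {"high_stakes": 8, "no_high_stakes": 2},
}

def high_stakes_share(counts):
    """Fraction of states at one security level that had high-stakes testing."""
    total = counts["high_stakes"] + counts["no_high_stakes"]
    return counts["high_stakes"] / total

shares = {level: round(high_stakes_share(c), 2) for level, c in table5.items()}
print(shares)  # {'lax': 0.21, 'moderate': 0.29, 'tight': 0.8}
```

The share of states with high-stakes testing rises from roughly one in five among lax-security states to four in five among tight-security states.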

 

 

Contrasting the average “test score inflation,” as calculated above (i.e., the average number of percentage points above the 50th percentile (adjusted by NAEP performance)), between lax, moderate, and tight test security states, it would appear that states with tighter test security tended to have less test score inflation (see table 6 below).

 

 

Table 6.

                                                                        What was the quality of test security in the state?
                                                                        Lax       Moderate  Tight
Average number of percentage points above 50th percentile (adjusted)    10.6      9.7       8.9
                                                                        N=12      N=5       N=6

27 states had insufficient data

SOURCE: J.J. Cannell, How Public Educators Cheat on Standardized Achievement Tests, Appendix I.

 

 

At the very least, these four tables confound the issue. There emerges a rival hypothesis—Cannell’s—that item rotation and tight test security prevent test score inflation. In the tables above, both item rotation and tight test security appear to be negatively correlated with test score inflation. Moreover, both appear to be positively correlated with the presence of high-stakes testing.