Understanding a Vicious Cycle: The Relationship Between Student Discipline and Student Academic Outcomes

While numerous studies have demonstrated a correlation between exclusionary discipline and negative student outcomes, this relationship is likely confounded by other factors related to the underlying misbehavior or risk of disciplinary referral. Using 10 years of student-level demographic, achievement, and disciplinary data from all K–12 public schools in Arkansas, we find that exclusionary consequences are related to worse academic outcomes (e.g., test scores and grade retention) than less exclusionary consequences, controlling for type of behavioral infraction. However, despite controlling for a robust set of covariates, sensitivity checks demonstrate that the estimated relationships between consequences and academic outcomes may still be driven by selection bias into consequence type. Implications for policy and practice are discussed.


Recently, many school districts and states have enacted policies to limit suspensions, likely in response to the well-established link between suspensions and undesirable academic outcomes. There is also substantial evidence that out-of-school suspensions (OSSs) and expulsions are disproportionately assigned to certain types of students, particularly students of color (e.g., Anderson & Ritter, 2017; Anyon et al., 2014; Skiba, Chung et al., 2014). However, it is also possible that the relationship between consequences and negative student outcomes is correlational rather than causal. For the sake of crafting good policy, it is important that policymakers gain a better grasp of the true impacts of various types of disciplinary responses.
In this paper, we estimate the relationships between seven types of disciplinary responses to student behavior and two academic outcomes (math test scores and grade retention) while controlling for a rich set of observable characteristics that predict selection into disciplinary referrals and consequences. While estimating a causal relationship is difficult because of the potential for reverse causality or common causes, our detailed data provide a unique opportunity to estimate these relationships. Specifically, our contribution is the ability to control for infraction type, thereby disentangling the impact of the consequence from the underlying factors causing misbehavior.
In the sections that follow, we summarize prior evidence on the relationship between student discipline and student outcomes, describe our data and analytic approach, and discuss our results and their implications for designing policy solutions.
There are several mechanisms through which academic declines could occur, but the complex relationship between behavior, consequences, and academic outcomes leads to uncertainty about the causal relationships. Lower academic achievement could be a result of lost instruction when suspended (Scott & Barrett, 2004). However, students struggling academically exhibit undesirable behaviors in later time periods (e.g., Arcia, 2006; Beck & Muschkin, 2012; Choi, 2007), raising questions about the causal direction. Suspended or expelled students might develop feelings of isolation, stigmatization, or disengagement from school following a suspension, which could translate into poor academic outcomes; however, students may have had these feelings prior to their exclusion as well (Morrison et al., 2001).
Moreover, it is difficult to isolate the impact of the consequence itself. Most studies have estimated the difference in outcomes for excluded students, relative to nonexcluded students, without direct comparisons to students who behaved similarly but were not excluded. For example, a comparison of the relative effects of OSS and in-school suspension (ISS) would be informative for educators and policymakers, as ISS is a commonly used alternative to OSS. However, a literature review noted a lack of research on the effects of ISS (Noltemeyer et al., 2015).
Due to these challenges, the existing literature generally fails to distinguish the impact of punishment from the underlying factors leading to it. In this section, we summarize the literature on the relationship between student discipline, academic achievement, grade retention, and drop-out and discuss the evidence on the impacts of suspension-reducing policies.

Relationship Between Student Discipline and Achievement
A meta-analysis of 24 studies from 1986 to 2012 determined there was a significant, negative relationship between suspensions and academic achievement (Noltemeyer et al., 2015). OSS was more strongly associated with achievement (r = -0.25) than ISS was (r = -0.10), but the authors do not emphasize this difference because few studies assessed ISS and OSS separately. About 42% of the included studies assessed the correlation between school-level suspension rates and achievement, rather than using student-level data, and most did not control for presuspension academic or behavioral factors (Noltemeyer et al., 2015).
Several student-level studies have found a negative relationship between OSS and academic achievement conditional on demographic and contextual characteristics (Arcia, 2006; Cobb-Clark et al., 2015; Kinsler, 2013; Raffaele-Mendez, 2003). Yet these studies did not control for baseline achievement, leaving an important variable omitted. Indeed, Cobb-Clark et al. (2015) conducted a sensitivity test proposed by Altonji, Elder, and Taber (2005) to assess whether selection bias might be driving their results. They concluded that the estimated relationship between suspensions and academic outcomes is unlikely to be causal and is likely a function of other differences, not controlled for, between suspended and nonsuspended students.

Relationship Between Disciplinary Consequences, Grade Retention, and Drop-Out
Disciplinary issues, including those indicated by the observance of school suspensions, are commonly found to predict both grade retention and drop-out (Balfanz et al., 2014; Carpenter & Ramirez, 2007; Chu & Ready, 2018; Fabelo et al., 2011; Suh & Suh, 2007; Swanson et al., 2017). Next, we describe a few of the recent studies on this topic, focusing on student-level studies in particular. Suh and Suh (2007), using the National Longitudinal Survey of Youth (NLSY97), found that suspended students were 77.5% more likely to drop out than nonsuspended students, controlling for factors such as GPA, absenteeism, fighting, receiving threats in school, family structure, socioeconomic status, and school contextual factors. Notably, survey data often provide many more covariates than are typically available in administrative data. Fabelo et al. (2011) estimated that Texas students suspended or expelled for discretionary violations (those for which exclusion is not mandated) were about twice as likely to be retained in grade as their nonexcluded but otherwise similar peers (in terms of demographics and prior achievement, attendance, and past disciplinary issues) in similar schools.
Using 7 years of student- and infraction-level data from Arkansas, Swanson et al. (2017) found that students who received exclusionary discipline in eighth grade were 2.5 percentage points (PPTs) more likely to be retained in ninth grade, conditional on demographics, baseline achievement, school fixed effects, and, notably, the types of infractions reported.
Chu and Ready (2018) used a within-student approach, comparing student-semesters with and without suspensions, and found that students were more likely to drop out the semester following a suspension. Bias would remain in these estimates, however, if an external shock contributed to both suspension and drop-out. In a separate matching analysis, they found that students suspended in the first three semesters of high school were less likely to graduate than their peers with similar demographics, eighth grade test scores, and pre-high school history of absences, tardies, and suspensions, who were not suspended in their first three semesters of high school. They do not control for the types of behaviors that led to suspensions.
An important result across these studies is that the magnitude of the relationship between exclusionary discipline and academic outcomes is often diminished when controlling for student, school, and family characteristics (e.g., Fabelo et al., 2011; Swanson et al., 2017). While a few account for student behavior in some way (Chu & Ready, 2018; Fabelo et al., 2011; Suh & Suh, 2007; Swanson et al., 2017), only one compared exclusionary and nonexclusionary consequences for similar infractions (Swanson et al., 2017), which directly addresses the fundamental policy question of the impact of using exclusionary discipline for a given offense.

Limited Evidence Isolates the Impact of Consequences Separately From Behavior
The studies just described, plus others that focus on outcomes such as criminal involvement (e.g., Wolf & Kupchik, 2017), demonstrate a statistically significant relationship between discipline and student outcomes after controlling for student demographics and baseline achievement. Thus, in a literature review, Skiba, Arredondo, and Williams (2014) argued that "above and beyond individual, family, and community risk factors, exclusionary school discipline makes a significant contribution in and of itself to a range of negative developmental outcomes" (p. 556). However, with few exceptions (e.g., Swanson et al., 2017), a key variable remains omitted: the misbehavior leading to these disciplinary consequences. Therefore, a key contribution of our study is the ability to compare outcomes for students who receive different disciplinary responses for the same type of infractions.

Suspension-Reducing Policies and Student Outcomes
One strategy to better understand the effects of suspension is to assess what happens when suspensions are banned or limited. A small but growing literature on suspension-reducing policies indicates a mix of benefits and unintended consequences with respect to student outcomes.
For example, in some cases, reducing suspensions resulted in improved attendance and achievement overall (Hinze-Pifer & Sartain, 2018) or for suspended students (Steinberg & Lacoe, 2018). In others, attendance (but not academic achievement) improved following reductions in the length of suspensions (Sartain et al., 2015). Of course, this evidence base is new and growing and there is no consensus. Indeed, some studies have found that reductions in suspensions may also have unintended consequences such as increased truancy and declines in achievement (Lacoe & Steinberg, 2018) as well as deteriorating school climate (Sartain et al., 2015; Lacoe & Steinberg, 2018).
Given the limited and mixed nature of these findings, it is too early to draw conclusions about the overall impacts of these policies, but the current study, which estimates the relationship between disciplinary consequences and academic outcomes while controlling for a uniquely rich set of observable characteristics, sheds some light on what we should expect from suspension-reducing policies.

Data
This study uses 10 years of de-identified demographic, achievement, and disciplinary data from all K-12 public schools in Arkansas provided by the Arkansas Department of Education (ADE) for 2007-08 through 2016-17. Demographic data include race, gender, grade, special education status, limited English proficiency (LEP), and free- and reduced-price lunch (FRL) eligibility. Academic achievement data include scores on state mathematics and reading (ELA) tests in Grades 3 through 8 from 2008-09 to 2016-17. All test scores are standardized by test type, grade level, and academic year to have a mean of zero and standard deviation of one (i.e., z scores).¹ The available data do not include indicators of grade retention, but we are able to infer these outcomes based on student grade-level assignments in consecutive years.²
Discipline data are provided at the incident level and include indicators for infraction type, consequence type and, in some cases, the length of the consequence in days. There were 1,940,772 infractions during our 10-year panel. Three subjective categories, disorderly conduct (28.7%), insubordination (23.8%), and other nonspecified infractions (28.1%),³ represent over 80% of infractions. Other types include fighting (6.8%), truancy (6.3%), bullying (2.4%), tobacco (1.2%), student assault (1.0%), drugs (0.6%), vandalism (0.5%), knives (0.2%), staff assault (0.2%), alcohol (0.2%), and gang-related activity (0.1%). Gun, explosives, and club infractions are extremely rare. There are seven consequence types recorded: ISS (37.3%), other nonspecified consequences (27.0%),⁴ OSS (21.8%), corporal punishment (12.6%), no action (0.8%), referrals to Alternative Learning Environments (ALE) (0.3%), and expulsions (0.1%). We aggregate disciplinary data to the student-by-academic year level, creating variables that indicate the number of infractions and consequences, by type.
There are, on average, 0.4 infractions per student per year, or 2.7 infractions per student among the student-years with at least one referral.Middle and high school students have more infractions than younger students; 3rd through 5th graders have on average 0.26 infractions per year, 6th through 8th graders average 0.59, and 9th through 12th graders average 0.55.
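The three data-preparation steps described in this section (standardizing scores within test type, grade, and year; inferring retention from consecutive grade assignments; and collapsing incidents to student-year counts) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the record layouts, the use of the population standard deviation, and the rule that a student is "retained" when the same grade appears in the following year are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev


def standardize(scores):
    """Z-score (group_key, raw_score) pairs within each group.

    group_key stands in for (test type, grade, academic year).
    Assumes every group has at least two distinct scores.
    """
    by_group = defaultdict(list)
    for key, raw in scores:
        by_group[key].append(raw)
    stats = {k: (mean(v), pstdev(v)) for k, v in by_group.items()}
    return [(raw - stats[key][0]) / stats[key][1] for key, raw in scores]


def infer_retention(grade_by_year):
    """Flag year t as retained when the student is in the same grade in t + 1.

    grade_by_year maps academic year (int) to grade level (int) for one
    student; years with no following-year record are left unflagged.
    """
    flags = {}
    for year, grade in grade_by_year.items():
        next_grade = grade_by_year.get(year + 1)
        if next_grade is not None:
            flags[year] = next_grade == grade
    return flags


def aggregate(incidents):
    """Collapse (student, year, infraction, consequence) rows to counts by type."""
    infractions = defaultdict(Counter)
    consequences = defaultdict(Counter)
    for student, year, infraction, consequence in incidents:
        infractions[(student, year)][infraction] += 1
        consequences[(student, year)][consequence] += 1
    return infractions, consequences


# Hypothetical records for one student-year.
incidents = [
    ("s1", 2015, "fighting", "OSS"),
    ("s1", 2015, "truancy", "ISS"),
    ("s1", 2015, "fighting", "ISS"),
]
infractions, consequences = aggregate(incidents)
print(dict(infractions[("s1", 2015)]))
print(dict(consequences[("s1", 2015)]))
```

Because each incident carries exactly one infraction type and one consequence type in this layout, the two sets of counts for a student-year always sum to the same total.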

Methods
We use a variety of specifications to estimate the relationships between each consequence type and two academic outcomes (math test scores and grade retention).⁵ We estimate models using disciplinary consequences in the current year (CY), the prior year (PY), or both, to test for persistent relationships over time. We estimate nested models that start with a narrow set of controls and sequentially add more controls, in order to observe the change in the estimated relationships. Test score models are estimated for students in Grades 3-8, in which annual testing occurs. Grade retention models are estimated using students in Grades 9-11, because the risk of discipline-induced drop-out, perhaps due in part to grade retention, is more common in high school.
Our fully specified linear regression model incorporating only CY discipline measures regresses the outcome for student i in year t, $y_{it}$, on the CY consequence counts ($Cons_{it}$), the CY infraction counts ($Infrac_{it}$), student characteristics ($X_{it}$), school fixed effects, and academic year indicators. Our fullest models that incorporate PY discipline measures, in lieu of or in conjunction with the CY discipline measures, replace or supplement these count vectors with their PY counterparts, $Cons_{it-1}$ and $Infrac_{it-1}$. In the test score models, we control for lagged measures of the outcome, $y_{it-1}$, or, in equations that include PY discipline measures, for twice-lagged versions, $y_{it-2}$, as a prediscipline baseline measure.⁶ For the grade retention outcomes, we estimate discrete choice probit models predicting the probability of grade retention for student i in year t as a function of the same variables, except that $y_{it-1}$ and $y_{it-2}$ are replaced by eighth grade (baseline) test scores, because it is extremely rare for a student to be retained 2 years in a row. For 9th to 11th grade students, of the 2.85% of student-year observations with grade retention, only 7.88% also experienced grade retention in the PY. The grade retention results are very similar with and without the eighth grade test score controls.⁷
The variables of interest are the vectors of consequence counts, $Cons_{it}$ and $Cons_{it-1}$. Each vector includes six variables giving the number of incidents of each of six types of consequences (expulsions, referrals to an ALE, ISS, corporal punishment, "other" nonspecified consequences, and no action) in the CY and PY, respectively, with OSS consequences as the reference category. For these count variables, each incident is counted once, regardless of the number of days associated with it.
Two vectors, $Infrac_{it}$ and $Infrac_{it-1}$, include incident counts for each of 17 infraction types, which allow us to control for the type of misbehavior leading to disciplinary consequences. A remaining limitation is that infraction types include a range of behaviors, and the resulting consequence type is likely related to underlying but unobservable characteristics of the misbehavior. For example, within the disorderly conduct category, more severe behaviors might result in more exclusionary consequences and may also be associated with worse academic outcomes regardless of consequence type. Inconsistent reporting practices within or across schools could also bias the results, although we do use school fixed effects to address such differences, and all other time-invariant characteristics, across schools. In addition, given that our outcome measures are annual measures, we are not estimating the outcomes associated with each particular incident. Rather, we estimate the correlation between the cumulative set of consequences, controlling for the cumulative set of infractions in the same year. Thus, we are not accounting for heterogeneity of the estimated relationships based on the other reported misbehavior that year, which may be something that school administrators consider when determining consequences.⁸ Despite these limitations, our approach represents an improvement upon prior work that generally does not control for the behavior leading to suspensions.
The coefficients should be interpreted differently in the models with and without infraction controls. In the models that control for infraction counts, the infraction controls and consequence controls are perfectly collinear, because the total number of infractions equals the total number of consequences. In these models, the OSS consequence is dropped and used as the reference category, and as a result, we estimate the "impact" of six consequence types, relative to receiving an OSS, for students referred for the same infraction types. We use OSS as the reference category because it is most commonly the focus of research and political discussions surrounding discipline reform. In contrast, in the models that do not include these infraction controls, there is a seventh variable indicating the count of OSS consequences. In these models, we estimate the "impact" of seven consequence types (including OSS), without accounting for reported behavior.
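The "relative to OSS" reading follows directly from the counting identity stated above. As a sketch, write $C^{k}_{it}$ for the count of consequence type $k$ and $I^{j}_{it}$ for the count of infraction type $j$; the coefficient symbols $\gamma_k$ and $\phi_j$ are notation introduced here for illustration, and the identity is assumed to hold exactly for every student-year.

```latex
\begin{align*}
% Counting identity: every reported infraction has exactly one consequence
\sum_{k} C^{k}_{it} &= \sum_{j=1}^{17} I^{j}_{it}
\quad\Longrightarrow\quad
C^{OSS}_{it} = \sum_{j=1}^{17} I^{j}_{it} - \sum_{k \neq OSS} C^{k}_{it} \\[4pt]
% Substitute into a model with all seven consequence counts
\sum_{k} \gamma_k C^{k}_{it} + \sum_{j} \phi_j I^{j}_{it}
&= \sum_{k \neq OSS} (\gamma_k - \gamma_{OSS})\, C^{k}_{it}
 + \sum_{j} (\phi_j + \gamma_{OSS})\, I^{j}_{it}
\end{align*}
```

Only the differences $\gamma_k - \gamma_{OSS}$ are identified once infraction counts are included, which is why each of the six estimated consequence coefficients is a contrast with OSS for the same reported infractions.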
We account for school-level time-invariant characteristics with school fixed effects, $\sigma_s$, and for state-wide differences over time using academic year indicators, $\tau_t$. We control for student characteristics, $X_{it}$, including binary indicators of gender, FRL status, special education status, LEP status, race/ethnicity (Black, Hispanic, Asian, and other, with White as the reference group), and grade levels. There may be remaining endogeneity concerns due to omitted variables, so we do not interpret these estimates as causal, but rather as the relationships between disciplinary consequences and student outcomes, controlling for a uniquely rich set of covariates including type of behavior reported. These relationships are directly relevant to discussions surrounding discipline policy reform.
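One way to write the three test score specifications described above, using the symbols the text defines (the outcome $y_{it}$ and its lags, $Cons$, $Infrac$, $X_{it}$, $\sigma_s$, $\tau_t$), is sketched below. The coefficient names ($\beta$, $\gamma$, $\phi$, $\theta$), the error term $\varepsilon_{it}$, and the exact arrangement of terms are notational assumptions for illustration and may differ from the authors' own equations.

```latex
\begin{align*}
% CY discipline measures only, lagged outcome as baseline
y_{it} &= \beta_0 + \beta_1 y_{it-1} + Cons_{it}\,\gamma_0 + Infrac_{it}\,\phi_0
          + X_{it}\,\theta + \sigma_s + \tau_t + \varepsilon_{it} \\[4pt]
% PY discipline measures only, twice-lagged outcome as baseline
y_{it} &= \beta_0 + \beta_1 y_{it-2} + Cons_{it-1}\,\gamma_1 + Infrac_{it-1}\,\phi_1
          + X_{it}\,\theta + \sigma_s + \tau_t + \varepsilon_{it} \\[4pt]
% CY and PY discipline measures together
y_{it} &= \beta_0 + \beta_1 y_{it-2} + Cons_{it}\,\gamma_0 + Cons_{it-1}\,\gamma_1
          + Infrac_{it}\,\phi_0 + Infrac_{it-1}\,\phi_1
          + X_{it}\,\theta + \sigma_s + \tau_t + \varepsilon_{it}
\end{align*}
```

For the grade retention outcome, the same right-hand side would enter a probit model, with the lagged outcome replaced by eighth grade test scores as the text describes.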
One concern with administrative discipline datasets is the underreporting of infractions that do not result in suspensions or expulsions. Even though we do not know how many instances of misbehavior are not reported, we do at least observe a variety of consequences beyond simply suspensions or expulsions. Less than a quarter of reported infractions result in the most exclusionary consequences (expulsions, ALE referrals, and OSS), 37% result in ISS, and about 40% result in other types of consequences (see Table 1).⁹ Some infractions such as truancy seldom result in OSS, but for others, like drugs or alcohol, students receive OSS in almost 90% of cases. Within each infraction type, there is variation in disciplinary response, and several infraction types resulted in a relatively even mix of OSS and non-OSS consequences.¹⁰ The use of school fixed effects helps control for differences in reporting patterns across schools.
Descriptive statistics for the full state and four key analytic samples are in Table 2. The analytic samples are generally similar to the entire state, except that students are less likely to be FRL-eligible or LEP in the grade retention sample, which only includes Grades 9-11.
In Table 3, we provide descriptive statistics for five groups of student-academic year observations: all observations, those with any infractions, those with any exclusionary discipline, those with at least one ISS, and those with at least one OSS. Disciplined students, particularly excluded students, are more likely to be older, non-White, FRL-eligible, receiving special education services, lower performing, and retained in grade than the general student population. The first column shows the relative rarity of exclusionary discipline in general. The average student has 0.408 infractions and 0.089 OSS incidents per year.

Relationship Between Discipline and Student Achievement
Table 4 shows the relationship between disciplinary consequences and math test scores. Recall that the coefficients in the models with infraction controls (columns 2, 4, 6, 8, and 9) should be interpreted as the relationship between consequences and math test scores, relative to the relationship between an OSS consequence and math test scores, for the same reported infraction(s). This differs from columns 1, 3, 5, and 7, which estimate the relationship between consequences and outcomes without controlling for infractions. In columns 1-4, CY measures of disciplinary outcomes are used; in columns 5-8, PY measures of disciplinary outcomes are used; and in column 9, both CY and PY measures are included.
The results in columns 1, 3, 5, and 7 indicate a consistently negative or null relationship between counts of consequences and test scores. The largest relationships are between expulsions and test scores, and the magnitude tends to decline as the severity or degree of exclusion declines. Columns 2, 4, 6, 8, and 9 also demonstrate that more exclusionary consequences are associated with lower test scores. For example, in these columns, which use OSS as the reference category, the coefficients on expulsion and ALE are generally negative, while the coefficients on less exclusionary consequences are generally positive. After controlling for infraction types and lagged math test scores (column 4) and relative to an OSS consequence, each expulsion is associated with 0.103 SD lower math test scores, each ISS incident is associated with 0.013 SD higher scores, and each "other" (generally nonexclusionary) consequence is associated with 0.026 SD higher scores. Thus, there is a clear relationship between degree of exclusion and achievement outcomes.
Notably, without controlling for student behavior (columns 1, 3, and 5), even "no action/warning" has a negative relationship to test scores, demonstrating the importance of infraction controls for drawing conclusions about the impact of consequences. Similarly, the estimated adverse relationships are greatly diminished after including baseline test scores, indicating the importance of baseline characteristics. Thus, a key takeaway from Table 4 is the importance of controlling for reported infraction type and baseline achievement when assessing the potential effects of consequences on academic achievement. Further, there do appear to be some persistent relationships between test scores and disciplinary consequences in the PY. For example, relative to OSS in the PY, ALE in the PY is associated with lower test scores, and some nonexclusionary consequences in the PY (ISS and "other") are associated with higher test scores. Notably, when adding the PY measures (column 9), the point estimates on the CY measures change very little (relative to column 4). We may be underestimating the importance of PY consequences if they produce future bad behavior, as this relationship would be captured in our controls for CY discipline. On the other hand, it is also possible that remaining unobservables are driving part of these relationships, which appear persistent over time.
Note. Descriptive statistics for student-by-year observations. Current year (CY) samples refer to columns 3 and 4 in Tables 4 and 5. The "fully specified" samples refer to columns 7-9 in Table 4 and columns 7, 8, and 10 in Table 5. The math and ELA test scores reported in the grade retention samples refer to eighth grade (baseline) test scores. FRL = free- and reduced-price lunch; ELA = English language arts.

Relationship Between Discipline and Grade Retention
The results of our grade retention models are in Table 5. In columns 1-4, CY measures of disciplinary outcomes are used; in columns 5-8, PY measures are used; and in columns 9-10, both CY and PY measures are included. Columns 3, 4, 7, 8, and 10 control for eighth grade test scores. The models that do not control for reported infraction types (columns 1, 3, 5, and 7) generally indicate that exclusionary consequences such as expulsions, referrals to ALE, OSS, and ISS are all associated with higher risk of grade retention. "Other" consequences are sometimes associated with lower risk of grade retention. Columns 2, 4, 6, and 8-10 control for the types of infractions reported. These results indicate that more exclusionary consequences like expulsion and ALE, particularly in the CY, are generally associated with a higher likelihood of grade retention, relative to OSS. On the other hand, ISS, corporal punishment, no action/warning, and "other" consequences are associated with lower risk, relative to OSS, indicating, as in Table 4, that less exclusionary consequences have a weaker association with negative academic outcomes. To interpret the size of these coefficients, it is important to note that grade retention is quite rare. Only 2.85% of student-year observations in Grades 9-11 indicated grade retention, and so a 0.7 PPT increase in the likelihood of grade retention for each OSS incident (as in column 3) represents a 25% increase, a large effect. Columns 9-10 suggest that PY consequences are also predictive of grade retention, independent of CY consequences, with all the less exclusionary consequences having a statistically significant difference, relative to OSS. As in Table 4, when adding the PY measures (column 10), the point estimates on the CY measures change very little (relative to column 4).
Note. Descriptive statistics are based on student-year observations (obs.) over the full panel and include all grades unless otherwise specified. For the outcome variables, we limit the observations to those for Grades 3-8 (test score outcomes) or 9-11 (grade retention outcome). The counts for these 3-8 and 9-11 grade outcomes are the count of all student observations in those grades, even if they did not have the outcome variable. ISS = in-school suspension; OSS = out-of-school suspension; FRL = free- and reduced-price lunch; ALE = referral to Alternative Learning Environment; ELA = English language arts.
Relative to Table 4, the inclusion of baseline test score measures does not change the point estimates as much (e.g., comparing columns 1 and 3 and comparing columns 2 and 4). Including eighth grade test scores greatly diminishes our sample size, because a significant proportion (18.73%) of students enter our dataset after eighth grade. Therefore, given the similarity between the results with and without eighth grade scores, for additional tests, we focus on the broader sample that does not require eighth grade scores.
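The conversion from a percentage-point change to a relative change, used above to interpret the column 3 estimate in Table 5, is simple arithmetic. A minimal sketch using the two figures quoted in the text (the helper name is hypothetical):

```python
def relative_change(ppt_change, base_rate_pct):
    """Express a percentage-point change as a percent of the base rate."""
    return 100.0 * ppt_change / base_rate_pct


# Figures quoted in the text: 0.7 PPT per OSS incident, 2.85% base retention rate.
print(round(relative_change(0.7, 2.85), 1))  # prints 24.6, i.e., roughly 25%
```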

Assessing Remaining Selection Bias
To assess whether selection bias may remain, we conduct sensitivity tests proposed by Altonji et al. (2005) and Oster (2017). Altonji et al. (2005) propose the use of an estimation of the degree of observed selection on observable characteristics as a source of information about the potential selection on unobservables. In particular, their proposed method allows for the estimation of the ratio of selection on unobservables to selection on observables that would be required to attribute the entire estimated effect to selection bias. An assumption required in the case of Altonji et al.'s (2005) test is that if all unobservables were actually observed, the outcome variable could be fully explained (i.e., an R-squared of 1), which Oster (2017) argues is not reasonable in the presence of measurement error. Thus, she proposes testing the robustness of the results to alternative assumptions about the maximum R-squared possible.
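To make the "ratio of selection on unobservables to selection on observables" concrete, the sketch below uses the widely cited linear approximation associated with Oster's approach: the bias-adjusted coefficient is roughly the controlled coefficient minus delta times the coefficient movement, scaled by how much R-squared could still rise. Setting the adjusted coefficient to zero and solving gives the ratio. This is a simplification for illustration only; it is not the exact estimator in either paper, it is not the authors' computation, and the function name and all input numbers are hypothetical.

```python
def delta_for_zero_effect(beta_short, r2_short, beta_full, r2_full, r2_max=1.0):
    """Selection ratio (unobservables / observables) that would drive the
    bias-adjusted coefficient to zero under the linear approximation

        beta_adj ~= beta_full
                    - delta * (beta_short - beta_full)
                            * (r2_max - r2_full) / (r2_full - r2_short)

    beta_short, r2_short: estimate and R-squared with few or no controls.
    beta_full, r2_full:   estimate and R-squared with the full control set.
    r2_max:               assumed R-squared if unobservables were observed.
    """
    movement = (beta_short - beta_full) * (r2_max - r2_full)
    return beta_full * (r2_full - r2_short) / movement


# Hypothetical inputs, not taken from the paper's tables.
ratio_rmax_1 = delta_for_zero_effect(-0.30, 0.10, -0.10, 0.60)
ratio_rmax_08 = delta_for_zero_effect(-0.30, 0.10, -0.10, 0.60, r2_max=0.8)
print(ratio_rmax_1, ratio_rmax_08)
```

A ratio well below 1 means that unobservables much less important than the included controls would suffice to explain away the estimate. Lowering the assumed maximum R-squared raises the ratio, which is why the choice of that maximum matters for the conclusion.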
After conducting these tests,¹¹ we find that we cannot rule out the existence of significant remaining bias, creating uncertainty about whether the estimated relationship is the causal impact of suspension or the result of reverse causality, other confounds, or a mix thereof. For the test score models, the amount of selection on unobservable characteristics would only have to be about 0.26 to 0.56 times as large as the degree of observed selection on observable characteristics to explain all of the estimated effects of expulsion, relative to OSS. To explain all of the estimated effects of ISS on test scores, relative to OSS, selection on unobservables would only have to be about 0.02 to 0.05 times as large as selection on observables. Selection bias in the grade retention models appears even more problematic. Thus, similarly to Cobb-Clark et al. (2015), while this does not prove our estimates are necessarily biased, we cannot rule out the possibility that our observed effects, even after including infraction type controls, are due to remaining selection on unobservables and that true causal effects are actually negligible.
Note. Robust standard errors, clustered at the school level, are in parentheses. All models include school, academic year, and grade fixed effects and controls for student gender, free- and reduced-price lunch (FRL) status, special education status, limited English proficiency (LEP) status, and race/ethnicity (White, Black, Hispanic, Asian, and Other). Current year (CY) and prior year (PY) infraction counts are the number of each of 17 infraction types in the current or prior year, respectively. ELA = English language arts; ALE = referral to Alternative Learning Environment; OSS = out-of-school suspension; ISS = in-school suspension; corp. pun. = corporal punishment; cons. = consequences. *p < 0.1. **p < 0.05. ***p < 0.01.

Subgroup Heterogeneity
Despite a lack of support for a causal interpretation, we hypothesize that relationships might differ for certain groups of students if their family background, prior achievement, language proficiency, or disability status make it difficult to make up for lost instructional time. Therefore, we estimated separate results for FRL/non-FRL, White/non-White, LEP/non-LEP, special/regular education, and students whose first test score was above/below average, focusing on the models including CY measures of consequences, with infraction controls.¹²
In general, there are not large differences across subgroups with respect to the relationship between disciplinary consequences and math test scores. In particular, the estimates on corporal punishment and other action were quite similar across subgroups. One notable difference is that ISS is estimated to be potentially more beneficial for relatively disadvantaged groups, when compared to OSS.
Similarly, in our grade retention models by subgroup, the likelihood of grade retention for relatively disadvantaged groups is more sensitive to consequence type. The higher risk associated with expulsion and the lower risk associated with ISS and "other" consequences, relative to receiving OSS, are larger in magnitude for relatively disadvantaged subgroups. If these relationships are causal, this might suggest that the population of students at risk for grade retention, who are also generally at higher risk of exclusionary discipline, is relatively small but also particularly sensitive to the choice of disciplinary consequence. These findings are consistent whether or not eighth grade math and reading test scores are controlled for.

Robustness of Results Within Restricted Samples of Ever-Disciplined and Ever-Excluded Students
It is possible that students who are never disciplined or excluded for disciplinary reasons affect our estimates if the relationships between covariates and test scores are different for this set of students. Therefore, we re-estimate our models using ever-disciplined students and ever-excluded students (those who received expulsion, OSS, or ALE referral). The results are largely robust to these different samples, except that for the grade retention models, the estimated differences between OSS and other consequences were largest in the full sample. 13

Estimate of Nonlinear Relationships
We also tested whether the first exclusionary consequence has a different impact than subsequent consequences by replacing $Cons_{it}$ and $Cons_{it-1}$ with a vector of binary indicators for whether the student had at least one, at least two, at least three, or four or more of that consequence type during the year, with zero incidents as the reference category. We do not draw conclusions about expulsions and ALE, for which multiple incidents are extremely rare. For other consequences, the first incident generally has the largest relationship to academic outcomes, suggesting that focusing on preventative strategies and reducing exclusionary discipline for first infractions may be effective.
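The indicator coding described above can be sketched as follows. This is a minimal illustration under our reading of the description: each count of a consequence type is expanded into cumulative threshold indicators, with zero incidents as the omitted reference. The function name and the variable-naming scheme (`oss_ge1`, `oss_ge2`, and so on) are hypothetical.

```python
THRESHOLDS = (1, 2, 3, 4)  # at least one, at least two, at least three, four or more


def threshold_indicators(count, consequence="oss"):
    """Expand a yearly count of one consequence type into binary indicators.

    A count of zero yields all zeros, which serves as the reference category.
    """
    if count < 0:
        raise ValueError("count must be nonnegative")
    return {f"{consequence}_ge{k}": int(count >= k) for k in THRESHOLDS}


# A hypothetical student-year with two out-of-school suspensions.
print(threshold_indicators(2))
```

Under this cumulative coding, the coefficient on the "at least two" indicator would capture the additional association of a second incident beyond the first, which is what allows the first incident to be compared with later ones.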

Discussion and Conclusions
In light of the concerns that exclusionary discipline harms students academically, we set out to generate a better understanding of the magnitude of and the nature of the relationship (e.g., causal or correlational) between disciplinary responses and academic outcomes after controlling for selection into disciplinary consequences. This work makes a key contribution to the field's knowledge about the potential impact of disciplinary consequences: Controlling for reported infractions is important when estimating the relationship between disciplinary consequences and student outcomes. In addition, controlling for baseline test scores is also important, particularly when estimating the relationship between consequences and test scores.
Not surprisingly, the controls for baseline test scores were more influential in models predicting test scores than in models predicting grade retention. Why might the sensitivity of results to the inclusion of baseline test scores as controls vary across academic outcomes? Different results could be due to differences in the analytic samples, grade level in particular (e.g., Grades 3-8 for the test score outcomes and Grades 9-11 for grade retention). Further, serial correlation in test scores is stronger than the correlation between test scores and grade retention, as grade retention is particularly rare. Another possibility is that low achieving students may be more likely to misbehave and perform poorly on standardized tests regardless of the type of punishment (exclusionary or not) but that grade retention is more likely if the resulting punishment is exclusionary, perhaps resulting in lower attendance and failure to complete a course.
Even with the inclusion of important controls such as baseline test scores and behavioral infractions, we still have concerns about remaining selection bias. Specifically, in both our math achievement and grade retention models, the results of sensitivity tests proposed by Altonji et al. (2005) and Oster (2017) suggest there may be bias remaining in our estimated effects with full controls.
Overall, our results indicate that the choice of disciplinary consequence is not driving the entire decline in academic progress, and it is important to consider the relative influence of various approaches in response to student behavior. The recent policy focus on limiting suspensions might miss the mark if there is little guidance about appropriate alternatives. For example, if instead of suspensions, schools refer more students to ALE, this might be a more costly and potentially more harmful approach. Further, simply replacing OSS with ISS may not eliminate the academic decline of disciplined students, because even ISS is associated with negative academic outcomes for students.
Less exclusionary consequences such as those in the "other" consequence category (e.g., Saturday school, detentions, parent/guardian conferences) are associated with better outcomes, relative to both OSS and ISS, although future work is needed to understand more about which specific approaches schools are using and having success with.
Another key finding is that the first disciplinary consequence in a given year is associated with larger academic declines. This implies that policies should target preventative approaches, rather than waiting until students get into major trouble before getting involved. Supporting this idea, many scholars have argued for more proactive discipline focusing on preventing misbehavior by setting clear expectations and teaching students prosocial behaviors (Chin, Dowdy, Jimerson, & Rime, 2012; Sharkey & Fenning, 2012).
Another important takeaway is that, particularly in terms of grade retention, the association between suspensions, expulsions, and negative academic outcomes may be larger for students from historically underserved backgrounds. Given that these students are at elevated risk of being suspended or expelled, this has direct implications for educational attainment gaps.
A few limitations remain. First, we do not estimate the potential impacts on all adverse student outcomes (e.g., we do not link these data to court records to test hypotheses related to the school-to-prison pipeline or assess impacts on student attendance or drop-out), nor do we test the specific mechanisms through which these impacts might occur. In addition, we do not estimate the impacts of suspensions on the nonsuspended students in the school.
Second, we use administrative data that may include attrition or measurement error. With respect to attrition, we observe that students expelled or referred to ALE are slightly more likely to drop out of the Arkansas public school dataset altogether, which may mean that we are slightly underestimating the negative consequences for some students, if the students who were most harmed dropped out or sought other schooling options. 14 Another attrition-related concern is that students who we report as being retained in grade may have effectively dropped out, despite being enrolled for a short period of time. We found, however, that the results were generally similar when dropping students for whom this could possibly be the case. 15 With respect to measurement error, administrative data only indicate a category of offense, but the underlying behavior of one student's insubordination, for example, may be very different from another's. If anything, we expect that unobservables, such as severity of offense within infraction type, would be correlated with the risk of exclusionary discipline and student academic outcomes in a way that would bias our estimated relationships upward in magnitude. For example, within the category of disorderly conduct, if we assume that students committing worse offenses tend to receive more exclusionary consequences and have worse academic outcomes, we would be overstating the magnitude of the relationship between exclusion and academic outcomes. Thus, we might view our estimates of the relationships between exclusionary consequences and student outcomes as upper bounds on the true relationships. We also use school fixed effects to help account for differences in reporting patterns or policies across schools that are stable over time.
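The direction of this bias can be illustrated with a small simulation. The sketch below is purely hypothetical and uses no study data: it assumes an unobserved severity variable that raises the chance of an exclusionary consequence and lowers test scores, while the consequence itself has no effect on scores by construction. The function name, sample size, and all parameter values are assumptions made for illustration.

```python
import random


def simulate_naive_gap(n_students=20000, seed=0):
    """Mean score gap (excluded minus not excluded) when exclusion has no true effect."""
    rng = random.Random(seed)
    excluded_scores, other_scores = [], []
    for _ in range(n_students):
        # Unobserved severity of the offense within a single infraction category.
        severity = rng.gauss(0, 1)
        # Worse offenses are more likely to draw an exclusionary consequence.
        excluded = severity + rng.gauss(0, 1) > 0
        # Scores depend on severity only; the consequence has zero effect here.
        score = -0.5 * severity + rng.gauss(0, 1)
        (excluded_scores if excluded else other_scores).append(score)
    mean_excluded = sum(excluded_scores) / len(excluded_scores)
    mean_other = sum(other_scores) / len(other_scores)
    return mean_excluded - mean_other


gap = simulate_naive_gap()
print(f"naive score gap with a true effect of zero: {gap:.2f}")
```

Even though exclusion has no effect in this setup, the naive comparison shows a clearly negative gap, which is why estimates that cannot observe severity are best read as upper bounds in magnitude.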
Further, there are likely unobservable characteristics of students that are associated with their risk of discipline and academic outcomes, and earlier versions of this work found that accounting for student heterogeneity greatly attenuated the estimated relationships. 16 Finally, we rely on school district reports of discipline, and some incidents never make it into the system. This is the case for any analysis using this type of administrative data. A potential area for future research would be ethnographic research to determine the extent to which misbehaviors are simply underreported and how this varies by type of student, teacher, or school.
This study provides a novel look at the impacts of disciplinary responses. Our key contributions are the ability to compare the relationships between various types of consequences and student outcomes, controlling for a unique set of covariates, including reported behavioral infractions, as well as conducting tests for remaining bias, following Altonji et al. (2005) and Oster (2017). To our knowledge, only one study of the relationship between exclusionary discipline and student outcomes was able to control for the particular infractions leading to consequences (Swanson et al., 2017), and only one (Cobb-Clark et al., 2015) applied the Altonji et al. (2005) test.
Going forward, as states or districts consider discipline policy reforms, there is a compelling argument for studying the impact of such reforms at the same time. For example, policies aimed at reducing suspensions should consider what the appropriate counterfactual response should be. While there is some experimental evidence supporting the use of School-Wide Positive Behavioral Interventions and Supports (Bradshaw, Mitchell, & Leaf, 2010; Horner et al., 2009) and nonexperimental evidence suggesting the benefits of restorative justice (Fronius, Persson, Guckenberg, Hurley, & Petrosino, 2016), overall, there is little causal evidence on the effectiveness of alternative disciplinary approaches (Steinberg & Lacoe, 2016). Further, interventions such as Positive Behavioral Interventions and Supports (PBIS; Kaufman et al., 2010; Vincent & Tobin, 2011) and restorative justice (Hashim, Strunk, & Dhaliwal, 2018) do not necessarily eliminate racial disproportionalities in suspension, and some suspension-reducing policies also have been implemented inequitably (Anderson, 2018). Thus, evaluations of new programs or policies should address the potential for unintended outcomes as well.

Observations from the final study year, 2016-17, are dropped from the grade retention analyses, as without a future year of data it is impossible to infer retention. Similarly, 12th graders are removed from these analyses as it would be difficult to distinguish between two counterfactuals to grade retention: graduation and drop-out. Dropping 12th graders allows us to identify grade retentions more consistently across grade levels.
3 "Other" nonspecified infractions were coded as a specific infraction type at the school level but were grouped into an "other" category when reported by the ADE. This is not a researcher-created category.
4 "Other" nonspecified consequences were coded as a specific consequence type at the school level but were grouped into an "other" category when reported by the ADE. This is not a researcher-created category. Conversations with the ADE Assistant Commissioner for Research and Technology, Eric Saunders, indicated that the majority of these other consequences are detentions, bus suspensions, parent/guardian conferences, Saturday school, or warnings. In fact, in 2016-17, the state started separately reporting additional categories, and in that year at least, 29% were detention, 13% were warnings, 6.2% were Saturday school, 4.5% were bus suspensions, 2.4% were parent/guardian conferences, and the rest were still nonspecified. This reiterates that this category consists predominantly of nonexclusionary consequences.
5 We also estimate effects on reading/English language arts test scores, and the results were generally similar to the math results. Results are available by request.
6 We also test a variety of specifications including (a) lagged versions of both test scores, (b) lagged and twice lagged versions of the same subject test score, and (c) lagged and twice lagged versions of both test scores. The results are generally robust to these various specifications.
7 The results are largely robust to the inclusion of this control, but because inclusion of this control greatly diminishes the sample size, we generally focus on models predicting grade retention without the eighth grade test scores.
8 However, we do test for heterogeneous or nonlinear relationships based on whether the consequence was the first, second, third, or fourth or more for the student in that year and estimate the first infraction to be more highly associated with negative outcomes.
9 See Note 4 for more detail on these "other" nonspecified consequences.
10 Table 1 shows the results for all infractions across all grades, but there are some differences based on grade level. For example, relative to the full sample, students in Grades 9-12 were less likely to receive corporal punishment and more likely to receive ISS or "other." Students in Grades 3-8 were more likely to receive ISS and OSS and less likely to receive "other." There were also some more nuanced differences by infraction type. Tables for Grades 3-8 and Grades 9-12 are available from the authors by request.
11 We conducted the Oster (2017) test using the psacalc user-written Stata command.
12 Tables are available by request.
13 For example, in the full sample, each CY ISS is associated with a 0.54 PPT lower likelihood of grade retention relative to OSS. On a base grade retention rate of 2.9%, this represents an 18.6% decline. In the ever-disciplined sample, each CY ISS is associated with a 0.8 PPT decrease in the likelihood of grade retention, relative to receiving OSS. On a base grade retention rate of 6.7%, this represents an 11.9% decline. In the ever-excluded sample, each CY ISS is associated with a 1.35 PPT decrease in the likelihood of grade retention, relative to receiving OSS. On a base grade retention rate of 11.6%, this represents an 11.6% decline. Results are available by request.
14 We modeled exit from the dataset using a similar approach as in our main models and found that ALE and expulsion were associated with a 1.6 to 1.9 PPT increase in the likelihood of attrition from the data. OSS was not associated with a statistically significantly higher risk of attrition, except when compared to less exclusionary consequences for similar types of infractions.
15 We estimated models that drop all observations for any students who were ever completely missing attendance data or for whom their days attended totaled less than 30 days in any given year, limiting the sample to students who attended school for at least one sixth of a typical school year. While many of the estimates on expulsion, ALE, no action/warning, and other were noisily estimated and lost significance in these new samples, the estimated coefficients on OSS, ISS, and corporal punishment were quite stable to this sample restriction.
16 With the goal of addressing this, in earlier versions of this work, we estimated dynamic panel data models using within-student variation to identify the relationship between exclusionary discipline and academic achievement and found that accounting for student heterogeneity greatly attenuated the estimated relationships. However, given the assumptions required and challenges with the data available, there was still not strong support for causal identification in that case.
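The relative declines reported in Note 13 follow from dividing each percentage-point (PPT) difference by the base grade retention rate of the corresponding sample. A minimal sketch of that arithmetic, using the figures quoted in the note (the function and variable names are illustrative):

```python
def relative_decline_pct(ppt_change, base_rate_pct):
    """Express a percentage-point change as a percent of the base rate."""
    if base_rate_pct <= 0:
        raise ValueError("base rate must be positive")
    return 100.0 * ppt_change / base_rate_pct


# (PPT decrease per CY ISS relative to OSS, base retention rate in %), from Note 13.
samples = {
    "full sample": (0.54, 2.9),
    "ever-disciplined": (0.8, 6.7),
    "ever-excluded": (1.35, 11.6),
}

for name, (ppt, base) in samples.items():
    print(f"{name}: {relative_decline_pct(ppt, base):.1f}% decline")
```

The absolute difference grows across the restricted samples, but the relative decline is largest in the full sample because its base retention rate is the lowest.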

Table 1
Percent of Incidents Resulting in Various Consequences
Note. ALE = referral to Alternative Learning Environment; Corp. Pun. = corporal punishment; ISS = in-school suspension; OSS = out-of-school suspension. "Other" categories for both infractions and consequences refer to incidents that were not included in one of the state reporting categories and are not researcher-created categories.

Table 2
Descriptive Statistics for State and Analytic Samples

Table 3
Descriptive Statistics for Student-Year Observations, by Level of Discipline Exposure

Table 4
Relationship Between Disciplinary Consequences and Math Test Scores in Grades 3-8
Note. Robust standard errors, clustered at the school level, are in parentheses. All models include school fixed effects, academic year fixed effects, grade level fixed effects, and student demographic controls including gender, FRL status, special education status, LEP, and a vector of race/ethnicity indicators (White, Black, Hispanic, Asian, and Other). CY and PY infraction (inf.) counts are vectors of variables representing the number of infractions of each type in the current year (CY) or prior year (PY).

Table 5
Relationship Between Disciplinary Consequences and Grade Retention, Grades 9-11

Arkansas administered the Partnership for Assessment of Readiness for College and Careers (PARCC) exam, and in 2015-16 and 2016-17, Arkansas administered the ACT Aspire tests. To test the sensitivity of our results to these testing administration changes, we estimated additional models using only the six ACTAAP years, and the results are generally robust. Results are available by request.