While working on a project, I recently came across a report with astounding results. Results that were, honestly, too good (well, strong) to be true. Logically, it just didn’t make sense that the authors found such a strong effect. I read and re-read it a couple times. On the surface, the statistical technique seemed legit. But something was off. I dug a bit deeper and on subsequent reads I think I solved the puzzle. Here’s what I found.
The report in question, “Final Report on the Evaluation of the First Offender Prostitution Program,” was written by Michael Shively et al. in 2008 and funded by the National Institute of Justice. The authors used time series data in two types of quasi-experimental designs to assess the impact that the San Francisco John School (known as the First Offender Prostitution Program, FOPP) had on re-offending. Time series analysis is a type of quantitative analysis that uses chronologically sequenced observations to examine some social or behavioral phenomenon over time, such as public opinion, public policy, or crime rates. Basically, you use this analysis when you have data on some phenomenon and are interested in explaining how it unfolds over time, that is, its temporal ordering. In this case, they used California arrest data from 1985 to 2005.
The authors employed two types of quasi-experimental designs, Differences-in-Differences (DiD) and Regression Discontinuity (RD). Quasi-experimental designs are similar to randomized experimental designs except that there is no random assignment. Through a variety of techniques, quasi-experimental designs construct control and treatment groups ex post facto.
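To make the DiD idea concrete, here is a minimal sketch of the basic calculation: the change in the treatment group minus the change in the control group. The rates below are hypothetical numbers I made up for illustration, not figures from the report, and `did_estimate` is just a helper name I chose.

```python
# Hypothetical re-arrest rates (percent), invented for illustration only.
# These are NOT the figures from the report.
rates = {
    ("treatment", "pre"): 8.0,
    ("treatment", "post"): 4.5,
    ("control", "pre"): 6.0,
    ("control", "post"): 5.5,
}


def did_estimate(rates):
    """Return (change in treatment group) minus (change in control group)."""
    treatment_change = rates[("treatment", "post")] - rates[("treatment", "pre")]
    control_change = rates[("control", "post")] - rates[("control", "pre")]
    return treatment_change - control_change


effect = did_estimate(rates)
print(f"DiD estimate: {effect:+.1f} percentage points")
```

Subtracting the control group's change is meant to strip out whatever would have happened anyway. That only works if the two groups would have moved together without the program.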
For the DiD model, comparing the control group (the rest of California, excluding San Francisco) with the treatment group (San Francisco), the authors found a difference: the recidivism rate in San Francisco fell approximately 50% after 1995, when the program was implemented. Let me repeat that: They state that the FOPP reduced recidivism (defined as being re-arrested for solicitation within 15 months) by approximately 50%. This number sounds astonishing. In fact, the authors admit this finding even surprised them: “[the FOPP’s] design appeared to violate several of the principles of effective intervention with offenders that have been derived from more than 40 years of research” (p. 81).
The problem lies in how the treatment and control groups were constructed and in how differently the two groups trend. The assumption in experimental and quasi-experimental designs is that the treatment and control groups are the same (read: very similar) or at least have random differences. For the DiD models (for brevity’s sake I’ll only be discussing the DiD model, but the RD model has somewhat similar issues), the treatment and control groups are not sufficiently comparable for several reasons.
First, logically, it doesn’t make much sense to compare a city with the rest of the state. That aside, I think the more serious methodological problem is that the analyses fail to meet a key DiD assumption: the two groups must have similar trends (here, similar trends in recidivism rates) prior to program implementation. Only then can any change observed after implementation be attributed to the program. In other words, you must compare apples to apples in order to assume that the treatment and control groups are “very similar.”
The report does not meet this requirement. San Francisco and the rest of California have vastly different trends in the years prior to 1995. In the few years leading up to 1995, San Francisco had abnormally high re-offending rates (Table 21, p. 75) compared to the rest of California. These different trends open the door to other possible explanations for the decrease in re-offending after 1995. The “effect” of FOPP could just be re-offending rates returning to more normal levels, which makes the results of the recidivism analyses conducted in this report inconclusive at best.
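Here is a small sketch of how a temporary pre-program spike can produce a large DiD “effect” even when the program does nothing. All numbers are hypothetical and invented for illustration. They do not reproduce the report’s data, and treating 1995 as the first post-program year is my assumption for the sketch.

```python
# Hypothetical yearly re-offense rates (percent), invented for illustration.
# Neither series contains any program effect: the "treatment" city simply
# spikes for three years before 1995 and then returns to its usual level.
years = list(range(1989, 2001))
control = {year: 6.0 for year in years}
treatment = {year: 6.0 for year in years}
for spike_year in (1992, 1993, 1994):
    treatment[spike_year] = 12.0


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def did(treatment, control, pre_years, post_years):
    """Difference-in-differences of mean rates, post minus pre."""
    t_change = mean(treatment[y] for y in post_years) - mean(treatment[y] for y in pre_years)
    c_change = mean(control[y] for y in post_years) - mean(control[y] for y in pre_years)
    return t_change - c_change


# Assumption for this sketch: 1995 counts as the first post-program year.
post = [y for y in years if y >= 1995]
short_pre = [1992, 1993, 1994]            # only the spike years
long_pre = [y for y in years if y < 1995]  # spike plus earlier "normal" years

spurious_short = did(treatment, control, short_pre, post)
spurious_long = did(treatment, control, long_pre, post)
print(f"DiD using only spike years as baseline: {spurious_short:+.1f}")
print(f"DiD using a longer baseline:            {spurious_long:+.1f}")
```

In this made-up example the estimate is negative under both baselines and changes size with the choice of pre-period, even though no program effect was built into the data. That is the pattern to watch for when the treatment group starts from an unusual high.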
Takeaway: A quick review of the literature suggests that this methodological assumption is not often discussed with these types of models. You can avoid making this mistake by (1) providing some logical explanation for why you are choosing your treatment and control groups and (2) providing evidence (graphical and/or statistical) that the two groups have similar trends prior to implementation.
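As one rough way to produce the statistical evidence in point (2), you could fit a straight line to each group’s pre-program rates and compare the slopes. This is a descriptive sketch, not a formal test, and it is not what the report’s authors did. The data are hypothetical, and `ols_slope` is a helper I wrote for this example.

```python
def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    return numerator / denominator


# Hypothetical pre-program rates (percent), invented for illustration.
pre_years = [1990, 1991, 1992, 1993, 1994]
treatment_pre = [6.0, 7.0, 9.0, 11.0, 12.0]  # climbing before the program
control_pre = [6.0, 6.1, 5.9, 6.0, 6.1]      # roughly flat

treatment_slope = ols_slope(pre_years, treatment_pre)
control_slope = ols_slope(pre_years, control_pre)
slope_gap = treatment_slope - control_slope

print(f"Treatment pre-trend: {treatment_slope:+.2f} points per year")
print(f"Control pre-trend:   {control_slope:+.2f} points per year")
print(f"Gap in pre-trends:   {slope_gap:+.2f} points per year")
```

A large gap between the two slopes is a warning sign that the groups were already diverging before the program started. Plotting both series over the pre-period gives the graphical version of the same check.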
I encourage you to examine this topic more fully. Read the report. Do you agree with my assessment? Let’s start a dialogue.