Sunlight Project: Finding Legislative Plagiarism – Data Science for Social Good Fellowship

Eugenia Giraudy & Joe Walsh

In 2005, Florida implemented a new “Stand Your Ground” law, which legally protected the use of deadly force in self-defense. The law, which removes the “duty to retreat” when a person is threatened with serious bodily harm, gained national attention after George Zimmerman fatally shot Trayvon Martin in 2012.

Soon after its passage in Florida, Stand Your Ground laws went “viral,” spreading to other parts of the country. Currently, at least two dozen states have implemented a version of Florida’s legislation. These laws didn’t arise in response to broad, spontaneous popular demand. Interest groups, in particular the National Rifle Association and the American Legislative Exchange Council (ALEC), drafted a model bill to ease passage across the country. Ten states have passed nearly identical bills to the ones Florida used and ALEC promoted (see Figure 1 for identical paragraphs in ALEC’s legislative proposal and Michigan’s bill).

This summer, Data Science for Social Good fellows Matt Burgess, Eugenia Giraudy, and Julian Katz-Samuels, technical mentor Joe Walsh, and project manager Lauren Haynes are working with the Sunlight Foundation to automatically detect cases of text reuse in state legislative bills. The goal is to build a computational tool that identifies text reuse far more quickly than a manual, human search. This tool will help journalists, scholars, and advocacy groups shed light on the origins and diffusion of state legislation.

State governments have a central but understudied role for policy making. Each year, states spend 1.5 trillion dollars on programs and services for their citizens, and state legislatures pass about 75 times more bills than Congress. However, state legislators often lack the time, staff, and expertise necessary to draft each bill. As a consequence, they often rely on legislation written by outside entities, such as interest groups or legislators from other states.

Legislative text reuse is not a new phenomenon. For more than 100 years, interest groups have written model legislation to influence the language used in state and federal bills. The first evidence of model legislation dates back to 1910, when the Russell Sage Foundation published a model tenement house law, and to 1912, when the annual meeting of the Association of Life Insurance Presidents included a report on progress of a model law on the registration of vital statistics.

One member of the DSSG team even has experience with this practice. As an intern in the Michigan House of Representatives in 2004, Joe modified another state’s National Nurse’s Week resolution and submitted it to his boss, and the House passed HR 237 a few weeks later. Eleven years later, the resolution still resembles the American Nursing Association’s current model resolution.

Despite negative connotations, legislative text reuse has benefits for constituents. As the Supreme Court has noted, US states may operate as “laboratories,” adopting new policies and seeing what works. If a Connecticut law saved thousands of residents each year, one would likely want Colorado to adopt similar text. In an influential study, political scientist Craig Volden found that states are more likely to emulate successful policies than failing policies, especially when those policies lower program costs or are made by the legislature rather than the administrative agency. That’s why both conservative and liberal groups do it.

Still, identifying legislative text reuse is difficult. It’s hard enough at the federal level, where most reused text comes from previously unsuccessful congressional bills, all the bills are stored in one format, and there are armies of journalists to sort through it all. Finding text reuse at the state level is far more difficult because bills may originate in one of 49 other states, each state stores its bills in a different format, and there are far fewer journalistic resources to cover each state capital.

Automated text analysis provides useful tools to tackle this problem. In particular, computational tools can be of great help to identify shared language across thousands of bills and laws. This contrasts starkly with previous approaches, which can only focus on a small number of model bills, forcing researchers to pick specific legislation or topics.

Our partner for this project, the Sunlight Foundation, has worked hard to collect state legislative texts in a reasonably standard format, which we can then use to detect legislative text reuse at the state level. The data includes 500,000 bills and 200,000 resolutions, ranging from 2007 to 2015. In addition, we have collected more than 1500 pieces of model legislation, from both conservative and liberal groups.

To detect legislative text reuse we search for short sequences of text that occur both in model legislation and state bills. The algorithms perform too slowly on large sets of text, so we first use a customized search engine that we built using Elasticsearch to limit the number of documents we use for comparison. Sometimes legislators use most of the text but change a word here and there to meet the state’s legal needs; to account for this, the algorithm allows for synonyms.

At the end of the summer, the team hopes to create a website where users can search for text reuse in state legislatures. In addition, we will share our methods and algorithms with the world so future users can build on and modify our approach to meet their needs. This tool will not only shed light on state-level politics, but it will also help us better understand the role of interest groups on our governments.