Legislative Influence Detector (LID)

Follow updates on this project on Twitter (@InfluenceDetect) to learn where state legislation is coming from. You can also search legislation at lid.dssg.io.

Journalists, researchers, and concerned citizens would like to know who’s actually writing legislative bills. But trying to read those bills, let alone trace their source, is tedious and time consuming. This is especially true at the state level, where important policy decisions are made every day. State legislatures consider roughly 70,000 bills each year, covering taxes, education, healthcare, crime, transportation, and more.

To solve this problem, we have created a tool we call the “Legislative Influence Detector” (LID, for short). LID helps watchdogs turn a mountain of text into digestible insights about the origin and diffusion of policy ideas and the real influence of various lobbying organizations. LID draws on more than 500,000 state bills (collected by the Sunlight Foundation) and 2,400 pieces of model legislation written by lobbyists (collected by us, ALEC Exposed, and other groups), searches for similarities, and flags them for review. LID users can then investigate the matches to look for possible lobbyist and special interest influence.

The screenshot below shows LID at work. On the left-hand side is text from Wisconsin Senate Bill 179 (2015), which bans most abortions past the 19th week of pregnancy. On the right-hand side, LID found and presented SB 179’s highest-ranked match, Louisiana Senate Bill 593 (2012). The highlighting shows that these text sections match each other almost perfectly. Where differences exist, they are usually misspellings like “neurodeveolopmental” or formatting differences like “16”/“sixteen”.

LID finds legislative influence more quickly and easily than other tools. Reading bills manually takes too long. Google helps, but users can only search for short strings, not complete documents, and they must weed through many non-legislative results to find good matches. Inspired by Wilkerson, Smith, and Stramp (2015) and Hertel-Fernandez and Kashin, LID takes seconds to use, searches the entire document for matches, and returns only state bills and model legislation in the results.

Government transparency is key to democracy, as is the public’s ability to understand the true influences at work in the legislative systems charged with their representation and protection. LID shines a light in the dark places of the legislative process, adding transparency and accountability to state government.

How Does It Work?

When users want to check for text matches for any given legislative document, they simply copy and paste the document text into LID’s input box. LID then quickly scans its data set of bills and model legislation for similarities and returns the best matches with the matching text highlighted for the user’s review.

NOTE: LID is not yet robust enough to handle significant public traffic. We hope to make the interactive tool available to the public in the coming months. In the meantime, we ran all the documents we have through LID, stored the matches in files, and made them available for download below.

We use the Smith-Waterman local-alignment algorithm to find matching text across documents. This algorithm takes a piece of text from each document and compares the two word by word, adding points for matches and subtracting points for mismatches. Unfortunately, local alignment is too slow to run on every pair of documents in a collection as large as ours: it could take thousands of years to finish analyzing the legislation. We speed up the analysis by first limiting the number of documents that need to be compared. Elasticsearch, our database of choice for this project, efficiently calculates Lucene scores. When we use LID to search for a document, Elasticsearch quickly compares that document against all others and returns the 100 most similar documents as measured by their Lucene scores. We then run the local-alignment algorithm on only those 100.
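
To make the scoring concrete, here is a minimal Python sketch of word-level Smith-Waterman alignment. It is an illustration, not LID's actual code: the function name smith_waterman, the whitespace tokenization, and the two example sentences are assumptions made for this sketch. The scoring values (add 2 for a match, subtract 1 for a mismatch or a gap) are the ones listed in the description of the score field.

```python
def smith_waterman(query_words, comparison_words, match=2, mismatch=-1, gap=-1):
    """Return (best_score, query_span, comparison_span) for the best local alignment.

    Spans are half-open (start, end) word indices into each input list.
    """
    rows, cols = len(query_words) + 1, len(comparison_words) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the scoring matrix; a cell never drops below zero, which is what
    # lets the alignment start and stop anywhere in either document.
    for i in range(1, rows):
        for j in range(1, cols):
            same = query_words[i - 1] == comparison_words[j - 1]
            diagonal = score[i - 1][j - 1] + (match if same else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            score[i][j] = max(0, diagonal, up, left)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score reaches zero.
    i, j = best_pos
    end_i, end_j = i, j
    while i > 0 and j > 0 and score[i][j] > 0:
        same = query_words[i - 1] == comparison_words[j - 1]
        if score[i][j] == score[i - 1][j - 1] + (match if same else mismatch):
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, (i, end_i), (j, end_j)


# Made-up example sentences, split on whitespace.
query = "the state shall ban the procedure after sixteen weeks".split()
comparison = ("this act means the state shall ban the procedure "
              "after 16 weeks of pregnancy").split()

best_score, query_span, comparison_span = smith_waterman(query, comparison)
print(best_score)
print(" ".join(query[query_span[0]:query_span[1]]))
print(" ".join(comparison[comparison_span[0]:comparison_span[1]]))
```

The matrix has one cell per pair of words, so the work grows with the product of the two document lengths. That cost is why a cheap first pass that keeps only the most similar candidates matters so much.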

Many tools, such as the typical school plagiarism detector, use a “bag of words” approach, which is much faster. Local alignments are better suited to legislation for two reasons. First, word order provides additional information about the bill’s contents. Bag of words treats “I kicked the ball,” “The ball I kicked,” and even “kicked the I ball” the same, but Smith-Waterman does not. Second, bag of words can only compare entire documents, while Smith-Waterman looks at parts of a document. If a lobbyist writes 1 page in a 900-page bill, bag of words will struggle to find it but Smith-Waterman will not. This increases the chance that LID will find legislative influence.
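
Both points can be checked with a few lines of Python. This is a rough sketch under our own assumptions: the helper names, the made-up passage, and the filler words are invented for illustration, cosine similarity stands in for a typical bag-of-words comparison, and the standard library's difflib finds the longest run of shared words as a simple order-aware stand-in (it is not Smith-Waterman).

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def bag_of_words(text):
    """Count words, ignoring case and word order."""
    return Counter(text.lower().split())


def cosine_similarity(bag_a, bag_b):
    """Cosine similarity between two word-count bags."""
    dot = sum(count * bag_b[word] for word, count in bag_a.items())
    norm_a = math.sqrt(sum(count * count for count in bag_a.values()))
    norm_b = math.sqrt(sum(count * count for count in bag_b.values()))
    return dot / (norm_a * norm_b)


def longest_shared_run(words_a, words_b):
    """Length of the longest run of consecutive words found in both lists."""
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    return matcher.find_longest_match(0, len(words_a), 0, len(words_b)).size


# Point 1: word order. All three sentences produce the same bag of words,
# but an order-aware comparison can tell them apart.
sentences = ["I kicked the ball", "The ball I kicked", "kicked the I ball"]
bags = [bag_of_words(s) for s in sentences]
same_bags = all(bag == bags[0] for bag in bags)
shared_run = longest_shared_run(sentences[0].lower().split(),
                                sentences[1].lower().split())

# Point 2: a short passage buried in a long bill. The filler words are
# hypothetical stand-ins for the rest of the bill's text.
passage = "no person shall perform the procedure after sixteen weeks gestation".split()
filler = [f"filler{i}" for i in range(990)]
long_bill = filler[:500] + passage + filler[500:]

whole_document_similarity = cosine_similarity(Counter(passage), Counter(long_bill))
passage_run = longest_shared_run(passage, long_bill)

print(same_bags, shared_run)
print(round(whole_document_similarity, 3), passage_run)
```

In this toy case the whole-document similarity is low because the passage is diluted by the rest of the bill, while the order-aware comparison still recovers the full passage.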

You can find the code here.

How Do I Use the Datasets?

LID provides datasets in two formats: comma-separated values (CSV) and JavaScript Object Notation (JSON). Spreadsheet programs such as Excel can read CSV files, but JSON files need more specialized tools, such as jsonlite (for R), json (for Python), or jq (for the Linux command line). We dropped some fields from the CSV files (e.g., the text of the query document), but the JSON files have all the data. The downloadable files contain the results for comparisons across states and between lobbying groups and states. Here are the fields:

query_document_id: The bill ID for the search document. The first two characters reference the state, the next set of characters references the session (e.g. 2015-2016), and the last set of characters identifies the bill.
comparison_document_id: The bill ID for the comparison document.
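lucene_score: The Lucene score measures how well the comparison document matched the query document in the first step, the Elasticsearch search. The better the match, the higher the number. We only use these scores as a rough way to limit the number of comparison documents passed to the second step, the Smith-Waterman alignment.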
score: The Smith-Waterman measure. The more similar two pieces of text are, the higher the score. The algorithm adds 2 when the words match and subtracts 1 when the words don’t or when there is a gap. If you order the results from highest to lowest, you’ll have the best results at the top and the worst results at the bottom.
Whatever Smith-Waterman cutoff you use, know that you’re making a tradeoff: The higher the cutoff, the fewer false positives (incorrectly flagged matches) and the more false negatives (missed matches) you will get. We have found a decent set of matches only using scores over 100, but if you’re worried about missing other bills, you can check the matches with scores under 100 too.
query_document_start: The starting position of the matched text in the query document.
query_document_end: The ending position of the matched text in the query document.
comparison_document_start: The starting position of the matched text in the comparison document.
comparison_document_end: The ending position of the matched text in the comparison document.
query_document_text: The matched text in the query document.
comparison_document_text: The matched text in the comparison document.
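
As a starting point, the following Python sketch reads a CSV with these fields, keeps matches above a Smith-Waterman cutoff, and sorts them from best to worst. It rests on several assumptions: the CSV header names are taken to equal the field names above, the sample rows (IDs and scores) are made up for illustration and do not come from the real files, and the function name strong_matches is ours. The cutoff of 100 is the one suggested in the score description.

```python
import csv
import io

# Made-up rows for illustration only; the real files may format IDs differently.
SAMPLE_CSV = """query_document_id,comparison_document_id,lucene_score,score
wi_2015_SB179,la_2012_SB593,12.5,840
wi_2015_SB179,tx_2013_HB2,3.1,64
il_2015_HB100,mi_2014_SB20,5.7,150
"""


def strong_matches(csv_file, cutoff=100):
    """Return rows whose Smith-Waterman score exceeds cutoff, best match first."""
    rows = [row for row in csv.DictReader(csv_file) if float(row["score"]) > cutoff]
    rows.sort(key=lambda row: float(row["score"]), reverse=True)
    for row in rows:
        # Per the field description, the first two characters give the state.
        row["query_state"] = row["query_document_id"][:2]
    return rows


matches = strong_matches(io.StringIO(SAMPLE_CSV))
for row in matches:
    print(row["query_state"], row["query_document_id"],
          row["comparison_document_id"], row["score"])
```

To run this on a downloaded file, pass an open file handle instead of the io.StringIO object, for example open("your_download.csv", newline="") with whatever name you saved the file under. Lowering the cutoff trades fewer missed matches for more false positives.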

LID has generated the most comprehensive datasets of publicly available legislative text re-use. Still, the data are not complete. We cannot find matches for bills that failed to make their way into Sunlight’s database, nor can we use model legislation that is not public. We also miss bills rewritten to avoid detection, although LID does make it more difficult for lobbyists and legislators to work in secret by forcing them to rewrite their text every time they want to introduce their legislation.

Finally, we have not yet run comparisons within states (for example, we do not compare an Illinois bill to past Illinois bills). Many states reintroduce nearly entire bills, sometimes automatically, which means the matches are strong and the algorithm is slow. We plan to run the algorithm on all the bills soon.

Where Do I Download the Dataset?

We have two datasets: bill-to-bill comparisons and model legislation-to-bill comparisons. The former shows how bills spread across legislatures, and the latter shows how ideas spread across lobbying groups and legislatures:

Bill-to-bill comparisons in CSV format (updated through May, 2015)
Bill-to-bill comparisons in JSON format (updated through May, 2015)
Model legislation-to-bill comparisons in CSV format (updated through May, 2015)
Model legislation-to-bill comparisons in JSON format (updated through May, 2015)

Next Steps

The data we present here are just the beginning. We are continuing to develop LID to make it more useful and accessible:

Look for all matches: To get results by the end of the DSSG fellowship, we only looked for matches across states, but many state legislatures re-introduce bills. We will re-run our documents through LID to find matches within states too.
Look for new matches every day: State legislators introduce bills nearly every day. We would like to run those bills through LID every night and flag text similarities for review. We may set up an email system to alert users of matches.
Make the interactive tool public: At the moment, only we can run queries on LID. We would like to open the tool to the public. This requires increased system robustness.
Automatically search for other lobbyist sources: Our set of model legislation comes from a small number of lobbyists. We will expand LID’s ability to find model legislation by automatically searching Google or Yahoo! for sources.
Improve the matches: LID only counts exact matches. We would like to credit synonyms and other small differences, which LID currently treats as mismatches.

Who are we?

You can also find us on Twitter: @InfluenceDetect.