import pandas as pd
import matplotlib.pyplot as plt
import lzma
import json
import urllib.request
import shutil
%matplotlib inline
url = 'https://www.dropbox.com/s/6psloxyc0ekqzh2/data.jsonl.xz?dl=1'
# download the compressed sample file to the working directory
with urllib.request.urlopen(url) as response, open("data.jsonl.xz", 'wb') as out_file:
    shutil.copyfileobj(response, out_file)
This demo is designed to get you up and running with a sample of Caselaw Access Project (CAP) data, available at https://capapi.org/bulk-access/. Feel free to reuse, modify, or distribute it however you'd like. Most of this code is written to be adaptable to different chunks of CAP data: you can substitute in another .xz file corresponding to a different jurisdiction or mess around with any number of other parameters.
First, let's get the data into a format we can work with by decompressing the Arkansas bulk file and pulling 1-year batches of cases at 10-year intervals from 1900 to 2000. This should give us a good sampling over time.
# a list to hold the cases we're sampling
cases = []
# decompress the file line by line
with lzma.open("data.jsonl.xz") as infile:
    for line in infile:
        # decode the file into a convenient format
        record = json.loads(str(line, 'utf-8'))
        # if the decision date on the case matches one we're interested in, add to our list
        if int(record['decision_date'][:4]) in range(1900, 2010, 10):
            cases.append(record)
print("Number of Cases: {}".format(len(cases)))
Number of Cases: 5020
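If you want to reuse the sampling step with a different file or a different set of years, one option is to wrap it in a small generator. The sketch below is only an illustration: `sample_cases` and the toy records are made up for this example, and the records are compressed in memory so the block runs without downloading anything.

```python
import io
import json
import lzma


def sample_cases(lines, years):
    """Yield parsed records whose decision year is in `years`."""
    for line in lines:
        record = json.loads(line)
        if int(record["decision_date"][:4]) in years:
            yield record


# Toy stand-in for a bulk file: three made-up records, xz-compressed in memory.
toy_records = [
    {"id": 1, "decision_date": "1900-01-06"},
    {"id": 2, "decision_date": "1905-03-12"},
    {"id": 3, "decision_date": "1910-07-01"},
]
raw = "\n".join(json.dumps(r) for r in toy_records).encode("utf-8")
compressed = io.BytesIO(lzma.compress(raw))

# Opening in text mode ("rt") yields str lines, so no manual decoding is needed.
with lzma.open(compressed, "rt", encoding="utf-8") as infile:
    sampled = list(sample_cases(infile, set(range(1900, 2010, 10))))

print([r["id"] for r in sampled])  # [1, 3]
```

Passing the years as a set keeps the membership test cheap, and passing a filename to `lzma.open` instead of the in-memory buffer should work the same way on a real bulk file.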
Let's take a look at the case format by accessing the first entry in our list of matching cases.
cases[0]
{'id': 1152523, 'name': 'Halliday v. Smith', 'name_abbreviation': 'Halliday v. Smith', 'decision_date': '1900-01-06', 'docket_number': '', 'first_page': '310', 'last_page': '313', 'citations': [{'type': 'official', 'cite': '67 Ark. 310'}], 'volume': {'volume_number': '67'}, 'reporter': {'full_name': 'Arkansas Reports'}, 'court': {'id': 8808, 'name': 'Arkansas Supreme Court', 'name_abbreviation': 'Ark.', 'jurisdiction_url': None, 'slug': 'ark'}, 'jurisdiction': {'id': 34, 'slug': 'ark', 'name': 'Ark.', 'name_long': 'Arkansas', 'whitelisted': True}, 'casebody': {'data': {'judges': [], 'parties': ['Halliday v. Smith.'], 'opinions': [{'type': 'majority', 'author': 'Bunn, C. J.', 'text': 'Bunn, C. J.\nThis is a bill to enjoin the defendant, Benjamin H. Smith, from obstructing an alleged public road, and from interfering, to prevent their free passage along said road, with the plaintiff’s employees and tenants.\nThe complaint stated, among other things, “that there is a public road running from the gate on east bank of Yellow Bayou, near the Yellow Bayou Gin-house, and extending east, by the residence of B. H. 
Smith, Esq., on the Bellevue plantation, to the Mississippi river, there intersecting the road leading from Lin wood to Luna Landing; that this road has been a public road, and travelled as such for the last thirty years and more, and has been recognized and treated as a public road by the county court and circuit court of the said county of Chicot.”\nThe answer denied that the road had ever been a public road, and denied also that it had ever been attempted to use and treat it as such until in 1882 or 1883, and denied also that the road at that time was established by the county court, and denied that it has been a public road since that time, and, in fact, put in issue all the material allegations of the complaint.\nThe orders of the county court of Chicot county, made at its April term, 1882, and January term, 1883, exhibited with the complaint, and the evidence in relation thereto, do not purport to treat said road as a pre-existing public road, and the evidence does not sustain the allegation that it was treated as a public road previously to that time. The said orders of the county court, viewed as an attempt to lay out and establish a new road, do not purport to have been made in compliance with the provisions of the statutes on the subject, either in respect to notice or any other essential particular. They are therefore void, and were without force and effect from the beginning, and are of course subject to collateral, as well as direct, attack, and to be treated as nullities in any case.\nTbe transcript of the proceedings of the Chicot circuit court, exhibited with the bill, show that, after the defendant had entered his remonstrance in the county court against the attempt to make the road a public road, he had himself indicted in the circuit court for obstructing said road, and was convicted and fined. The reason given by the defendant for this strange proceeding, as given in his testimony, is shown thus:\n“Question. 
State, if you remember, why your attorney advised you to submit to an indictment and bring your suit ir the chancery court by injunction?\n“Answer. He said, if I was indicted, and the cause was brought up, this would settle the whole road case, and I was told by Judge Bradley that the restraining order would be the proper course to pursue, and I told my attorney what Judge Bradley (the then circuit judge) had said, and he made out the papers, and Judge Bradley granted the injunction.”\nPlaintiff contends that defendant estopped himself by this conduct from denying the existence of the public road. Defendant’s acts may have the appearance of admitting the fact that the road was a public road, but, under the circumstances, it was only for the purpose of laying the foundation of a more formal controversy of that fact in the proper tribunal. Besides, what a defendant may do in a criminal court can hardly be pleaded as an estoppel against him. It is plain that whatever defendant did was done in furtherance of his resistance to the establishment of this road. The evidence of the conduct and acts of the parties to this controversy, subsequent to the said orders of the county court, not only do not show an acquiescence on the part of the defendant, the owner of the plantation directly affected, in the use of the public road, but a constant and continual warfare and struggle against the plaintiff and his friends, who as persistently and continuously sought to have the road treated as a public road. 
There is, therefore, no ground whatever upon which it can be s.aid that this road was ever made or became a public road, in the meaning of the law then in force, either by prescription, by use, by the establishment by the county court, or by the assent and acquiescence of the owner of the land affected.\nThere was objection made to the reading of some of the depositions of defendant, one of the grounds being that the same were taken prematurely, that is, before the answer was filed. The complaint was filed September 11, 1890. The answer was filed November 20, 1890. The notice to take depositions was served on plaintiff’s attorney on November 24, 1890, and under that notice the depositions of C. C. Martin, J. F. Ward, James McMurry and Joseph Bryan, for defendant, were taken on November 26, 1890. Notice to take depositions on the 27th day of November, 1890, was served on plaintiff’s attorney November 24, 1890, and the deposition of H. N. Merriman, the county judge presiding when the said orders were made, was taken thereunder. The deposition of J. H. Worner and possibly some others was taken two or three days before the answer was filed. Without stopping to discuss the materiality of the objection on this ground, the depositions taken nQt subject to this particular objection fully sustain the defendant’s contention. The other grounds of objection do not appear material and prejudicial to plaintiff.\nThe decree was to the effect that the demui’rer to the complaint bé sustained, because the plaintiff, showed no legal capacity to sue, and also to the effect that there is no equity in the bill.\nThere is no provision in the statutes giving a right of action to an individual against another for obstructing a public road, but, without discussing the demurrer to the bill, we find no error in the decree of the chancellor to the effect that there is no equity in the bill itself. The decree is therefore affirmed.'}], 'head_matter': 'Halliday v. 
Smith.\nOpinion delivered January 6, 1900.\nJno. M. Rose, Special Judge.\nJD. R. Reynolds and Jno. 0. Gonnerly, for appellant.\nRose, Remingway & Rose, for appellee.\nAppeal from Chicot Circuit Court in Chancery.\nAs to the right to an injunction, see 35 Ark. 497; 40 Ark. 83. A judgment establishing a road cannot be collaterally attacked. 47 Ark. 431.\nThe order of the <o mty court was obtained through fraud, and is void. 42 Ark. 348; 2 Fr. Judg. § 489; Big. Fraud, 87. The statutory notice was requisite to the validity of the order. 51 Ark. 34; 65 Ark. 94; id. 142, 143; 13 Ark. 491; 52 id. 312; 55 id. 30; 54 id. 642; 59 id. 487; Sand. & H. Dig., § 4190. It was a taking of property without due process of law. 43 Ark. 545; 5 id. 409; id. 217; 3 id. 536; Ell. Streets and Roads, 233. There never having been any petition to the county court, it had no jurisdiction. Sand. & H. Dig., § 2817; 2 Rap. & Law. Diet. 958; 93 ü. S. 283; 55 Ark. 566. Nor did appellee’s appearance a year later, to file his protest, validate the order. 58 Ark. 186; 47 Pac. 330; 64 Ark. 108. Nor can the conviction in the criminal ease amount to an estoppel here. 1 Greenl. Ev. § 537; 13 Ark. 217; 15 id. 319. At most, it was an admission, and appellee can show it to have been made under a mistake of law. 15 Ark. 62; 1 Greenl. Ev. §§ 204, 205 , 2062, 209 ; 22 Ark. 496; 23 Ark. 134; 11 id. 263; 49 id. 300; 17 id. 221; 32 id. 266.\nEstoppel—Indictment.—Where, for the purpose of laying a foundation for a formal contest as to the existence of a public road, defendant had himself indicted, and was convicted and fined, for obstructing the same, such proceedings do not estop him from subsequently denying that the road was a public highway, in an action brought to restrain him from obstructing such road. (Page 312.)', 'attorneys': ['Jno. M. Rose, Special Judge.', 'JD. R. Reynolds and Jno. 0. Gonnerly, for appellant.', 'Rose, Remingway & Rose, for appellee.']}, 'status': 'ok'}}
A lot of info here, but it's quite messy. Let's pull out a few case metadata attributes we're interested in, leaving the actual case text aside for now: the decision date (year only), case name, case citation, court, and opinion count. We'll put them into a pandas DataFrame for easier manipulation.
# use a list comprehension to pull out the metadata attributes specified above
case_metadata = [{'year': int(case['decision_date'][:4]),
                  'name': case['name'],
                  'citation': case['citations'][0]['cite'],
                  'court': case['court']['name'],
                  'opinion_count': len(case['casebody']['data']['opinions'])}
                 for case in cases]
# lists of dictionaries like `case_metadata` convert easily into DataFrames
metadata_df = pd.DataFrame(case_metadata)
metadata_df.head()
| | citation | court | name | opinion_count | year |
|---|---|---|---|---|---|
| 0 | 67 Ark. 310 | Arkansas Supreme Court | Halliday v. Smith | 1 | 1900 |
| 1 | 67 Ark. 314 | Arkansas Supreme Court | Leach v. State | 1 | 1900 |
| 2 | 67 Ark. 325 | Arkansas Supreme Court | Doster v. Manistee National Bank | 1 | 1900 |
| 3 | 67 Ark. 318 | Arkansas Supreme Court | Cash v. Kirkham | 1 | 1900 |
| 4 | 67 Ark. 320 | Arkansas Supreme Court | Rowland v. McGuire | 1 | 1900 |
Yay, we've got our first usable data! As minimal as this metadata is, we should still be able to get some useful insights out of it. First, let's check how many cases we have from each year in our sample.
metadata_df['year'].value_counts().sort_index()
1900    164
1910    493
1920    565
1930    506
1940    394
1950    326
1960    258
1970    352
1980    767
1990    594
2000    601
Name: year, dtype: int64
There's clearly a lot of variation in publication volume across the sampled years, from a low of 164 cases in 1900 to a peak of 767 in 1980. Let's break it down by court.
metadata_df['court'].value_counts().sort_index()
Arkansas Court of Appeals     776
Arkansas Supreme Court       4244
Name: court, dtype: int64
All of the cases in our sample are from either the Arkansas Court of Appeals or the Arkansas Supreme Court, with a large majority belonging to the latter. Let's look at the number of opinions per case.
metadata_df['opinion_count'].value_counts().sort_index()
1    4279
2     633
3      89
4      16
5       3
Name: opinion_count, dtype: int64
The majority of our cases have a single opinion. Let's try to identify some trends in opinion volume over time.
# get frequency of opinion counts for each year
n_opinions = [[year,
               metadata_df[(metadata_df['year'] == year) & (metadata_df["opinion_count"] == 1)].shape[0],
               metadata_df[(metadata_df['year'] == year) & (metadata_df["opinion_count"] == 2)].shape[0],
               metadata_df[(metadata_df['year'] == year) & (metadata_df["opinion_count"] == 3)].shape[0],
               metadata_df[(metadata_df['year'] == year) & (metadata_df["opinion_count"] >= 4)].shape[0]]
              for year in metadata_df['year'].unique()]
# reformat for graph: a list of years, then one list of counts per opinion-count bucket
n_opinions = list(zip(*n_opinions))
n_opinions = [list(item) for item in n_opinions]
plt.figure(figsize=(10,7))
ind = n_opinions[0]
# the last bucket holds cases with four or more opinions
labels = ["1", "2", "3", "4+"]
handles = []
for i, count in enumerate(n_opinions[1:]):
    # stack each bar on top of the running total of the previous buckets
    bot = [sum(x) for x in zip(*n_opinions[1:i+1])]
    bot = [0]*len(ind) if not bot else bot
    h = plt.bar(ind, count, 5, bottom=bot, label=labels[i])
    handles.append(h)
plt.legend(handles=handles[::-1], title="Opinion Count")
plt.xlabel("Year")
plt.ylabel("Cases")
plt.title("Opinion counts in cases by year")
plt.show()
Interesting – it seems that cases from 1980, 1990, and 2000 tend to have more opinions than those from earlier years in the sample.
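The four near-identical filters used to build `n_opinions` above can also be collapsed into a single pass over the data. Here is a minimal sketch of that idea using `collections.Counter`; the `rows` list is made-up data standing in for the `(year, opinion_count)` pairs in `metadata_df`, not real counts.

```python
from collections import Counter

# Hypothetical (year, opinion_count) pairs standing in for metadata_df rows.
rows = [(1900, 1), (1900, 1), (1900, 2), (1980, 1), (1980, 3), (1980, 5), (1980, 4)]

# Cap the opinion count at 4 so that 4, 5, ... share a single "4+" bucket.
tally = Counter((year, min(n, 4)) for year, n in rows)

years = sorted({year for year, _ in rows})
# One list per bucket, one entry per year: the shape the stacked bars need.
buckets = [[tally[(year, b)] for year in years] for b in (1, 2, 3, 4)]

print(years)    # [1900, 1980]
print(buckets)  # [[2, 1], [1, 0], [0, 1], [0, 2]]
```

A `Counter` returns 0 for keys it has never seen, so years with no cases in a given bucket need no special handling.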
Now let's get to the rich part of the dataset – the opinions themselves! These are a bit messier to wrangle than the metadata was. There are a couple ways that we might structure our dataframe, but to keep it simple we'll just do one opinion per row. If a case has multiple opinions, each will be a separate row (linked by the case id).
# Loop through cases and build rows with case metadata AND opinion metadata/text.
# We load in all of the keys initially, then modify the ones we want to.
opinion_data = []
for case in cases:
    for opinion in case["casebody"]["data"]["opinions"]:
        temp = {}
        # copy every case-level field except the nested casebody
        keys = list(case.keys())
        keys.remove('casebody')
        for key in keys:
            temp[key] = case[key]
        # add the opinion-level fields (type, author, text)
        keys = list(opinion.keys())
        for key in keys:
            temp[key] = opinion[key]
        opinion_data.append(temp)
opinions_df = pd.DataFrame(opinion_data)
# flatten nested fields: first citation, court name, decision year
opinions_df["citations"] = opinions_df["citations"].apply(lambda x:x[0]['cite'])
opinions_df["court"] = opinions_df["court"].apply(lambda x:x['name'])
opinions_df["decision_date"] = opinions_df["decision_date"].apply(lambda x:int(x[:4]))
opinions_df = opinions_df.drop(["docket_number", "first_page",
                                "last_page", "name_abbreviation",
                                "reporter", "volume", "jurisdiction"], axis=1)
opinions_df = opinions_df[["id", "name", "decision_date", "court", "citations", "author", "type", "text"]]
opinions_df.head()
| | id | name | decision_date | court | citations | author | type | text |
|---|---|---|---|---|---|---|---|---|
| 0 | 1152523 | Halliday v. Smith | 1900 | Arkansas Supreme Court | 67 Ark. 310 | Bunn, C. J. | majority | Bunn, C. J.\nThis is a bill to enjoin the defe... |
| 1 | 1152531 | Leach v. State | 1900 | Arkansas Supreme Court | 67 Ark. 314 | Bunn, C. J. | majority | Bunn, C. J.\nThis is an indictment against Rob... |
| 2 | 1152554 | Doster v. Manistee National Bank | 1900 | Arkansas Supreme Court | 67 Ark. 325 | Wood, J. | majority | Wood, J.\nThis suit is between judgment credit... |
| 3 | 1152577 | Cash v. Kirkham | 1900 | Arkansas Supreme Court | 67 Ark. 318 | Battle, J. | majority | Battle, J.\nZ. L. Kirkham presented two accoun... |
| 4 | 1152578 | Rowland v. McGuire | 1900 | Arkansas Supreme Court | 67 Ark. 320 | Battle, J. | majority | Battle, J.\nAlice J. Rowland, a married woman,... |
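After dropping some extraneous information, we're left with a number of useful attributes for each opinion: the case id, case name, decision year, court, citation, opinion author, opinion type, and opinion text.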
Let's try to answer a more complicated question using this corpus. Comparing cases from 1900, 1910, 1920, and 1930 against cases from 1970, 1980, 1990, and 2000, what words can we say are distinctive to each time period? Are there words from opinions dating to the beginning of the 20th century that don't occur in opinions dating to the end of the 20th century?
We'll start by implementing a basic n-gram search function and graphing our results.
def search_ngram(ngram):
    # for each sampled year, return (year, occurrences of ngram / total words)
    pairs = []
    for year in opinions_df["decision_date"].unique():
        temp = opinions_df[opinions_df["decision_date"] == year]["text"].tolist()
        temp = " ".join(temp).lower()
        n = len(temp.split(" "))
        ngram_count = temp.count(ngram.lower())
        pairs.append((year, ngram_count/n))
    return pairs

def graph_ngram(pairs, ax, title):
    # plot the (year, fraction) pairs on the given axes
    x,y = [list(x) for x in zip(*pairs)]
    ax.plot(x,y)
    ax.set_title(title)
    ax.set_xlabel("Year")
    ax.set_ylabel("fraction of words")
    return ax

fig, axes = plt.subplots(2, 2, figsize=(15,8))
graph_ngram(search_ngram("plow"), axes[0,0], "Plow")
graph_ngram(search_ngram("computer"), axes[0,1], "Computer")
graph_ngram(search_ngram("horse"), axes[1,0], "Horse")
graph_ngram(search_ngram("truck"), axes[1,1], "Truck")
plt.tight_layout()
plt.show()
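One thing to keep in mind: `search_ngram` relies on `str.count`, which counts substrings, so a search for "plow" also matches "plowed" and "plowman". That may be exactly what you want for a rough trend line. If you'd rather count whole words only, a regular expression with word boundaries is one possible approach. The helper names and the sample sentence below are made up for illustration.

```python
import re


def count_substring(text, term):
    """Count every occurrence of `term`, including inside longer words."""
    return text.lower().count(term.lower())


def count_whole_word(text, term):
    """Count occurrences of `term` only where it stands alone as a word."""
    pattern = r"\b" + re.escape(term.lower()) + r"\b"
    return len(re.findall(pattern, text.lower()))


sample = "The plowman plowed with a plow. His horse disliked horseback riders."
print(count_substring(sample, "plow"), count_whole_word(sample, "plow"))    # 3 1
print(count_substring(sample, "horse"), count_whole_word(sample, "horse"))  # 2 1
```

`re.escape` keeps terms containing punctuation, such as "wal-mart", from being interpreted as regex syntax.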
Fewer horses and plows, more trucks and computers! Let's get a little bit more sophisticated (still keeping it simple) and find a list of words which occur fewer than 10 times in cases from 1900-1930 but frequently in cases from 1970-2000.
def tokenize_cases(cases):
    # join all opinions, lowercase, strip common punctuation, and split on spaces
    cases = " ".join(cases).lower()
    cases = cases.replace("\n", " ").replace(",", "").replace(";", "").replace("'", "").replace("’", "")
    cases = cases.replace("(", "").replace(")", "").replace(".", "").replace("?", "").replace("!", "").split(" ")
    return cases

early_cases = tokenize_cases(opinions_df[opinions_df["decision_date"] <= 1930]["text"].tolist())
late_cases = tokenize_cases(opinions_df[opinions_df["decision_date"] >= 1970]["text"].tolist())
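The chained `replace` calls only strip the punctuation marks they list, so characters like colons, percent signs, and quotation marks stay attached to words (tokens such as `states:`, `part:`, and `10%` appear in the results further down). A regex-based tokenizer is one alternative. This is just a sketch: `tokenize_text` is a hypothetical helper, and it only recognizes ASCII letters and digits, so accented characters would split a word.

```python
import re


def tokenize_text(text):
    """Lowercase `text` and return runs of ASCII letters/digits, keeping internal hyphens."""
    # Drop apostrophes first so "plaintiff's" becomes "plaintiffs", as in tokenize_cases.
    text = text.lower().replace("'", "").replace("’", "")
    # A token is one or more alphanumerics, optionally joined by single hyphens.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text)


tokens = tokenize_text("The statute states: “Wal-Mart’s share rose 10%.”")
print(tokens)  # ['the', 'statute', 'states', 'wal-marts', 'share', 'rose', '10']
```

Because the pattern describes what a token is rather than listing what to remove, any punctuation not explicitly allowed is dropped automatically.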
# count how often each word occurs in the early cases
early_cases_dict = {}
for word in early_cases:
    if word in early_cases_dict:
        early_cases_dict[word] += 1
    else:
        early_cases_dict[word] = 1
len(early_cases_dict)
41725
# count late-period words that are rare (<10 occurrences) or absent in the early period
new_words_dict = {}
for word in late_cases:
    if word in early_cases_dict:
        if early_cases_dict[word] < 10:
            if word in new_words_dict:
                new_words_dict[word] += 1
            else:
                new_words_dict[word] = 1
    else:
        if word in new_words_dict:
            new_words_dict[word] += 1
        else:
            new_words_dict[word] = 1
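The two counting loops above can be written more compactly with `collections.Counter`, which handles the "seen before or not" bookkeeping for you. The following is a minimal sketch on made-up token lists (`early_tokens` and `late_tokens` stand in for `early_cases` and `late_cases`), not a drop-in replacement tested against the full corpus.

```python
from collections import Counter

# Made-up token lists standing in for early_cases and late_cases.
early_tokens = ["horse"] * 12 + ["plow"] * 3 + ["court"] * 20
late_tokens = ["court"] * 15 + ["plow"] * 2 + ["computer"] * 4 + ["horse"]

early_counts = Counter(early_tokens)
late_counts = Counter(late_tokens)

# Keep late-period words seen fewer than `threshold` times in the early period.
# A Counter returns 0 for missing keys, so words absent from the early period qualify too.
threshold = 10
new_words = Counter({word: n for word, n in late_counts.items()
                     if early_counts[word] < threshold})

print(new_words.most_common())  # [('computer', 4), ('plow', 2)]
```

`most_common()` also takes care of the sort-by-frequency step, and `most_common(100)` would return only the top 100 entries.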
def isInt(s):
    # True if the token is a plain integer (e.g. a page or section number)
    try:
        int(s)
        return True
    except ValueError:
        return False

# sort by frequency (descending), drop pure numbers, keep the top 100
sorted_new_words = sorted(new_words_dict.items(), key=lambda x:-x[1])
sorted_new_words = [item for item in sorted_new_words if not isInt(item[0])]
sorted_new_words = sorted_new_words[:100]
print("Most common words from 1970-2000 that occurred <10 times from 1900-1930\n")
print("{0:14}|{1:5}".format("Word", "Occurrences"))
print("-------------------------")
for word in sorted_new_words:
    print("{0:14}|{1:5}".format(word[0], word[1]))
Most common words from 1970-2000 that occurred <10 times from 1900-1930

Word          |Occurrences
-------------------------
sw2d          | 8878
repl          | 1700
coverage      |  774
victim        |  678
emphasis      |  606
sw3d          |  598
timely        |  549
marijuana     |  536
percent       |  507
disagree      |  492
factors       |  486
suppress      |  482
cir           |  479
compensable   |  440
problem       |  422
mistrial      |  401
probation     |  400
victims       |  399
surgery       |  393
sentencing    |  389
convictions   |  385
problems      |  374
aggravated    |  368
procedures    |  353
visitation    |  343
cocaine       |  332
subsection    |  319
program       |  317
factual       |  300
wal-mart      |  297
activity      |  294
ineffective   |  292
counsels      |  290
f2d           |  284
pm            |  273
activities    |  268
someone       |  260
parole        |  259
evidentiary   |  257
voter         |  257
prosecutors   |  252
underlying    |  251
despite       |  251
eg            |  248
sentences     |  246
im            |  241
document      |  239
unemployment  |  239
pretrial      |  237
informant     |  236
arguing       |  235
plus          |  232
allegedly     |  232
rehabilitation|  232
scheduled     |  231
additionally  |  231
questioning   |  231
paternity     |  231
states:       |  228
prison        |  224
thats         |  222
ie            |  219
basic         |  216
jeopardy      |  216
contraband    |  215
certification |  215
disclosure    |  213
ultimately    |  212
part:         |  211
initially     |  207
potential     |  205
trailer       |  204
newbern       |  201
parental      |  197
group         |  196
battery       |  195
popular       |  195
reveals       |  194
miranda       |  194
sera          |  192
indication    |  189
workmens      |  186
sellers       |  186
10%           |  184
fogleman      |  183
similarly     |  182
cert          |  182
proffered     |  181
procedural    |  181
hearings      |  179
wage          |  179
habitual      |  178
firearm       |  174
commitment    |  172
smiths        |  172
womack        |  170
model         |  170
motorist      |  169
bureau        |  168
parking       |  168
fig, axes = plt.subplots(2, 2, figsize=(15,8))
graph_ngram(search_ngram("coverage"), axes[0,0], "Coverage")
graph_ngram(search_ngram("victim"), axes[0,1], "Victim")
graph_ngram(search_ngram("marijuana"), axes[1,0], "Marijuana")
graph_ngram(search_ngram("wal-mart"), axes[1,1], "Wal-Mart")
plt.tight_layout()
plt.show()