In [76]:
import pandas as pd
import matplotlib.pyplot as plt
import lzma
import json
import urllib.request
import shutil
%matplotlib inline
In [77]:
url = 'https://www.dropbox.com/s/6psloxyc0ekqzh2/data.jsonl.xz?dl=1'

with urllib.request.urlopen(url) as response, open("data.jsonl.xz", 'wb') as out_file:
    shutil.copyfileobj(response, out_file)

This demo is designed to get you up and running with a sample of Caselaw Access Project (CAP) data, available at https://capapi.org/bulk-access/. Feel free to reuse, modify, or distribute it however you'd like. Most of this code is written to be adaptable to different chunks of CAP data: you can substitute in another .xz file corresponding to a different jurisdiction, or change the sampled years and other parameters.

First, let's get the data into a format we can work with by decompressing the Arkansas bulk file and pulling one-year batches of cases at 10-year intervals from 1900 to 2000. This should give us a good sampling over time.

In [78]:
#a list to hold the cases we're sampling
cases = []

#decompress the file line by line
with lzma.open("data.jsonl.xz") as infile:
    for line in infile:
        #decode the file into a convenient format
        record = json.loads(str(line, 'utf-8'))
        #if the decision date on the case matches one we're interested in, add to our list
        if int(record['decision_date'][:4]) in range(1900, 2010, 10):
            cases.append(record)

print("Number of Cases: {}".format(len(cases)))
Number of Cases: 5020
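
Since the loading step is the part you are most likely to change when swapping in another jurisdiction, it can help to wrap it in a function. The sketch below is one way to do that; the name `load_cases_by_year` and its parameters are illustrative and not part of any CAP tooling. It assumes, as the file above does, one JSON record per line with a `decision_date` that starts with a four-digit year.

```python
import json
import lzma

def load_cases_by_year(path, years):
    """Return the records in an .xz-compressed JSONL file whose decision year is in `years`.

    Hypothetical helper: `path` is a local file path, `years` any iterable of ints.
    """
    wanted = set(years)
    selected = []
    # "rt" mode asks lzma to decode each line to str, so no manual decoding is needed
    with lzma.open(path, "rt", encoding="utf-8") as infile:
        for line in infile:
            record = json.loads(line)
            # decision_date looks like "1900-01-06"; the first four characters are the year
            if int(record["decision_date"][:4]) in wanted:
                selected.append(record)
    return selected
```

With the file downloaded above, `load_cases_by_year("data.jsonl.xz", range(1900, 2010, 10))` should select the same records as the loop in the previous cell.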

Let's take a look at the case format by accessing the first entry in our list of matching cases.

In [4]:
cases[0]
Out[4]:
{'id': 1152523,
 'name': 'Halliday v. Smith',
 'name_abbreviation': 'Halliday v. Smith',
 'decision_date': '1900-01-06',
 'docket_number': '',
 'first_page': '310',
 'last_page': '313',
 'citations': [{'type': 'official', 'cite': '67 Ark. 310'}],
 'volume': {'volume_number': '67'},
 'reporter': {'full_name': 'Arkansas Reports'},
 'court': {'id': 8808,
  'name': 'Arkansas Supreme Court',
  'name_abbreviation': 'Ark.',
  'jurisdiction_url': None,
  'slug': 'ark'},
 'jurisdiction': {'id': 34,
  'slug': 'ark',
  'name': 'Ark.',
  'name_long': 'Arkansas',
  'whitelisted': True},
 'casebody': {'data': {'judges': [],
   'parties': ['Halliday v. Smith.'],
   'opinions': [{'type': 'majority',
     'author': 'Bunn, C. J.',
     'text': 'Bunn, C. J.\nThis is a bill to enjoin the defendant, Benjamin H. Smith, from obstructing an alleged public road, and from interfering, to prevent their free passage along said road, with the plaintiff’s employees and tenants.\nThe complaint stated, among other things, “that there is a public road running from the gate on east bank of Yellow Bayou, near the Yellow Bayou Gin-house, and extending east, by the residence of B. H. Smith, Esq., on the Bellevue plantation, to the Mississippi river, there intersecting the road leading from Lin wood to Luna Landing; that this road has been a public road, and travelled as such for the last thirty years and more, and has been recognized and treated as a public road by the county court and circuit court of the said county of Chicot.”\nThe answer denied that the road had ever been a public road, and denied also that it had ever been attempted to use and treat it as such until in 1882 or 1883, and denied also that the road at that time was established by the county court, and denied that it has been a public road since that time, and, in fact, put in issue all the material allegations of the complaint.\nThe orders of the county court of Chicot county, made at its April term, 1882, and January term, 1883, exhibited with the complaint, and the evidence in relation thereto, do not purport to treat said road as a pre-existing public road, and the evidence does not sustain the allegation that it was treated as a public road previously to that time. The said orders of the county court, viewed as an attempt to lay out and establish a new road, do not purport to have been made in compliance with the provisions of the statutes on the subject, either in respect to notice or any other essential particular. 
They are therefore void, and were without force and effect from the beginning, and are of course subject to collateral, as well as direct, attack, and to be treated as nullities in any case.\nTbe transcript of the proceedings of the Chicot circuit court, exhibited with the bill, show that, after the defendant had entered his remonstrance in the county court against the attempt to make the road a public road, he had himself indicted in the circuit court for obstructing said road, and was convicted and fined. The reason given by the defendant for this strange proceeding, as given in his testimony, is shown thus:\n“Question. State, if you remember, why your attorney advised you to submit to an indictment and bring your suit ir the chancery court by injunction?\n“Answer. He said, if I was indicted, and the cause was brought up, this would settle the whole road case, and I was told by Judge Bradley that the restraining order would be the proper course to pursue, and I told my attorney what Judge Bradley (the then circuit judge) had said, and he made out the papers, and Judge Bradley granted the injunction.”\nPlaintiff contends that defendant estopped himself by this conduct from denying the existence of the public road. Defendant’s acts may have the appearance of admitting the fact that the road was a public road, but, under the circumstances, it was only for the purpose of laying the foundation of a more formal controversy of that fact in the proper tribunal. Besides, what a defendant may do in a criminal court can hardly be pleaded as an estoppel against him. It is plain that whatever defendant did was done in furtherance of his resistance to the establishment of this road. 
The evidence of the conduct and acts of the parties to this controversy, subsequent to the said orders of the county court, not only do not show an acquiescence on the part of the defendant, the owner of the plantation directly affected, in the use of the public road, but a constant and continual warfare and struggle against the plaintiff and his friends, who as persistently and continuously sought to have the road treated as a public road. There is, therefore, no ground whatever upon which it can be s.aid that this road was ever made or became a public road, in the meaning of the law then in force, either by prescription, by use, by the establishment by the county court, or by the assent and acquiescence of the owner of the land affected.\nThere was objection made to the reading of some of the depositions of defendant, one of the grounds being that the same were taken prematurely, that is, before the answer was filed. The complaint was filed September 11, 1890. The answer was filed November 20, 1890. The notice to take depositions was served on plaintiff’s attorney on November 24, 1890, and under that notice the depositions of C. C. Martin, J. F. Ward, James McMurry and Joseph Bryan, for defendant, were taken on November 26, 1890. Notice to take depositions on the 27th day of November, 1890, was served on plaintiff’s attorney November 24, 1890, and the deposition of H. N. Merriman, the county judge presiding when the said orders were made, was taken thereunder. The deposition of J. H. Worner and possibly some others was taken two or three days before the answer was filed. Without stopping to discuss the materiality of the objection on this ground, the depositions taken nQt subject to this particular objection fully sustain the defendant’s contention. 
The other grounds of objection do not appear material and prejudicial to plaintiff.\nThe decree was to the effect that the demui’rer to the complaint bé sustained, because the plaintiff, showed no legal capacity to sue, and also to the effect that there is no equity in the bill.\nThere is no provision in the statutes giving a right of action to an individual against another for obstructing a public road, but, without discussing the demurrer to the bill, we find no error in the decree of the chancellor to the effect that there is no equity in the bill itself. The decree is therefore affirmed.'}],
   'head_matter': 'Halliday v. Smith.\nOpinion delivered January 6, 1900.\nJno. M. Rose, Special Judge.\nJD. R. Reynolds and Jno. 0. Gonnerly, for appellant.\nRose, Remingway & Rose, for appellee.\nAppeal from Chicot Circuit Court in Chancery.\nAs to the right to an injunction, see 35 Ark. 497; 40 Ark. 83. A judgment establishing a road cannot be collaterally attacked. 47 Ark. 431.\nThe order of the <o mty court was obtained through fraud, and is void. 42 Ark. 348; 2 Fr. Judg. § 489; Big. Fraud, 87. The statutory notice was requisite to the validity of the order. 51 Ark. 34; 65 Ark. 94; id. 142, 143; 13 Ark. 491; 52 id. 312; 55 id. 30; 54 id. 642; 59 id. 487; Sand. & H. Dig., § 4190. It was a taking of property without due process of law. 43 Ark. 545; 5 id. 409; id. 217; 3 id. 536; Ell. Streets and Roads, 233. There never having been any petition to the county court, it had no jurisdiction. Sand. & H. Dig., § 2817; 2 Rap. & Law. Diet. 958; 93 ü. S. 283; 55 Ark. 566. Nor did appellee’s appearance a year later, to file his protest, validate the order. 58 Ark. 186; 47 Pac. 330; 64 Ark. 108. Nor can the conviction in the criminal ease amount to an estoppel here. 1 Greenl. Ev. § 537; 13 Ark. 217; 15 id. 319. At most, it was an admission, and appellee can show it to have been made under a mistake of law. 15 Ark. 62; 1 Greenl. Ev. §§ 204, 205 , 2062, 209 ; 22 Ark. 496; 23 Ark. 134; 11 id. 263; 49 id. 300; 17 id. 221; 32 id. 266.\nEstoppel—Indictment.—Where, for the purpose of laying a foundation for a formal contest as to the existence of a public road, defendant had himself indicted, and was convicted and fined, for obstructing the same, such proceedings do not estop him from subsequently denying that the road was a public highway, in an action brought to restrain him from obstructing such road. (Page 312.)',
   'attorneys': ['Jno. M. Rose, Special Judge.',
    'JD. R. Reynolds and Jno. 0. Gonnerly, for appellant.',
    'Rose, Remingway & Rose, for appellee.']},
  'status': 'ok'}}

There's a lot of information here, but it's quite messy. Let's pull out a few case metadata attributes we're interested in, leaving the actual case text aside for now: the decision date (year only), case name, case citation, court, and opinion count. We'll put them into a pandas DataFrame for easier manipulation.

In [5]:
# use a list comprehension to pull out the metadata attributes specified above
case_metadata = [{'year': int(case['decision_date'][:4]),
                'name': case['name'],
                'citation': case['citations'][0]['cite'],
                'court': case['court']['name'],
                'opinion_count': len(case['casebody']['data']['opinions'])} 
                 for case in cases]

# lists of dictionaries like `case_metadata` convert easily into Dataframes
metadata_df = pd.DataFrame(case_metadata)
metadata_df.head()
Out[5]:
   citation     court                   name                              opinion_count  year
0  67 Ark. 310  Arkansas Supreme Court  Halliday v. Smith                 1              1900
1  67 Ark. 314  Arkansas Supreme Court  Leach v. State                    1              1900
2  67 Ark. 325  Arkansas Supreme Court  Doster v. Manistee National Bank  1              1900
3  67 Ark. 318  Arkansas Supreme Court  Cash v. Kirkham                   1              1900
4  67 Ark. 320  Arkansas Supreme Court  Rowland v. McGuire                1              1900

Yay, we've got our first usable data! As minimal as this metadata is, we should still be able to get some useful insights out of it. First, let's check how many cases we have from each year in our sample.

In [6]:
metadata_df['year'].value_counts().sort_index()
Out[6]:
1900    164
1910    493
1920    565
1930    506
1940    394
1950    326
1960    258
1970    352
1980    767
1990    594
2000    601
Name: year, dtype: int64

There's clearly a lot of variation in publication volume across the sampled years, from a low of 164 cases in 1900 to a peak of 767 in 1980. Let's break it down by court.

In [7]:
metadata_df['court'].value_counts().sort_index()
Out[7]:
Arkansas Court of Appeals     776
Arkansas Supreme Court       4244
Name: court, dtype: int64

All of the cases in our sample are from either the Arkansas Court of Appeals or the Arkansas Supreme Court, with a large majority belonging to the latter. Let's look at the number of opinions per case.

In [8]:
metadata_df['opinion_count'].value_counts().sort_index()
Out[8]:
1    4279
2     633
3      89
4      16
5       3
Name: opinion_count, dtype: int64

The large majority of our cases (4,279 of 5,020) have a single opinion. Let's try to identify some trends in opinion volume over time.

In [9]:
# get frequency of opinion counts for each year
n_opinions = [[year, 
               metadata_df[(metadata_df['year'] == year) & (metadata_df["opinion_count"] == 1)].shape[0],
               metadata_df[(metadata_df['year'] == year) & (metadata_df["opinion_count"] == 2)].shape[0],
               metadata_df[(metadata_df['year'] == year) & (metadata_df["opinion_count"] == 3)].shape[0],
               metadata_df[(metadata_df['year'] == year) & (metadata_df["opinion_count"] >= 4)].shape[0]]
              for year in metadata_df['year'].unique()]

# reformat for graph
n_opinions = list(zip(*n_opinions))
n_opinions = [list(item) for item in n_opinions]
plt.figure(figsize=(10,7))

ind = n_opinions[0]
handles = []
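# the four categories are 1, 2, 3, and 4-or-more opinions
labels = ["1", "2", "3", "4+"]
for i, count in enumerate(n_opinions[1:]):
    # the bottom of this layer is the sum of the layers already drawn beneath it
    bot = [sum(x) for x in zip(*n_opinions[1:i+1])]
    bot = [0]*len(ind) if not bot else bot
    h = plt.bar(ind, count, 5, bottom=bot, label=labels[i])
    handles.append(h)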
    
plt.legend(handles=handles[::-1], title="Opinion Count")
plt.xlabel("Year")
plt.ylabel("Cases")
plt.title("Opinion counts in cases by year")
plt.show()

Interesting – it seems that cases from 1980, 1990, and 2000 tend to have more opinions than those from earlier years in the sample.
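
The stacking logic in the plotting loop above is easy to get wrong, so here is a small standalone sketch of just that step. Each layer of a stacked bar chart needs a `bottom` offset equal to the running sum of the layers beneath it. The function name `stack_bottoms` and the toy counts are made up for illustration.

```python
def stack_bottoms(layers):
    """Given equal-length lists of counts (one list per layer, bottom layer first),
    return the `bottom` offsets to pass to plt.bar for each layer."""
    n = len(layers[0])
    running = [0] * n          # nothing is drawn under the first layer
    bottoms = [running]
    for layer in layers[:-1]:
        # add this layer's heights to get the starting height of the next one
        running = [r + v for r, v in zip(running, layer)]
        bottoms.append(running)
    return bottoms

# toy data: rows are opinion-count categories, columns are years
layers = [[10, 20], [3, 5], [1, 2]]
print(stack_bottoms(layers))  # [[0, 0], [10, 20], [13, 25]]
```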

Now let's get to the rich part of the dataset – the opinions themselves! These are a bit messier to wrangle than the metadata was. There are a couple ways that we might structure our dataframe, but to keep it simple we'll just do one opinion per row. If a case has multiple opinions, each will be a separate row (linked by the case id).

In [10]:
#Loop through cases and build rows with case metadata AND opinion metadata/text.
#We load in all of the keys initially, then modify the ones we want to.

opinion_data = []
for case in cases:
    for opinion in case["casebody"]["data"]["opinions"]:
        temp = {}
        keys = list(case.keys())
        keys.remove('casebody')
        for key in keys:         
            temp[key] = case[key]
        keys = list(opinion.keys())
        for key in keys:         
            temp[key] = opinion[key]
        opinion_data.append(temp)

opinions_df = pd.DataFrame(opinion_data)
opinions_df["citations"] = opinions_df["citations"].apply(lambda x:x[0]['cite'])
opinions_df["court"] = opinions_df["court"].apply(lambda x:x['name'])
opinions_df["decision_date"] = opinions_df["decision_date"].apply(lambda x:int(x[:4]))
opinions_df = opinions_df.drop(["docket_number", "first_page", 
                                "last_page", "name_abbreviation",
                                "reporter", "volume", "jurisdiction"], axis=1)
opinions_df = opinions_df[["id", "name", "decision_date", "court", "citations", "author", "type", "text"]]

opinions_df.head()
Out[10]:
   id       name                              decision_date  court                   citations    author       type      text
0  1152523  Halliday v. Smith                 1900           Arkansas Supreme Court  67 Ark. 310  Bunn, C. J.  majority  Bunn, C. J.\nThis is a bill to enjoin the defe...
1  1152531  Leach v. State                    1900           Arkansas Supreme Court  67 Ark. 314  Bunn, C. J.  majority  Bunn, C. J.\nThis is an indictment against Rob...
2  1152554  Doster v. Manistee National Bank  1900           Arkansas Supreme Court  67 Ark. 325  Wood, J.     majority  Wood, J.\nThis suit is between judgment credit...
3  1152577  Cash v. Kirkham                   1900           Arkansas Supreme Court  67 Ark. 318  Battle, J.   majority  Battle, J.\nZ. L. Kirkham presented two accoun...
4  1152578  Rowland v. McGuire                1900           Arkansas Supreme Court  67 Ark. 320  Battle, J.   majority  Battle, J.\nAlice J. Rowland, a married woman,...

After dropping some extraneous information, we're left with a number of useful attributes for each opinion:

  • id (assigned by CAP database): A unique case identifier that we can use to link opinions belonging to the same case
  • name: The case's name
  • decision_date: The year of the decision (the code above truncates the full date to its year)
  • court: The court in which the case was heard and decided
  • citations: The official citation to the case
  • author: The author of the opinion
  • type: The type of the opinion (e.g. 'majority', 'dissent', 'concurrence')
  • text: The full text of the opinion
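
The flattening loop above can also be written more compactly with dictionary unpacking. The sketch below is an equivalent formulation rather than a replacement; `flatten_opinions` is an illustrative name and the toy case is invented. As in the original loop, opinion keys win if a case and an opinion happen to share a key.

```python
def flatten_opinions(cases):
    """Return one dict per opinion, carrying the case-level fields along."""
    rows = []
    for case in cases:
        # every case-level field except the nested casebody
        case_fields = {k: v for k, v in case.items() if k != "casebody"}
        for opinion in case["casebody"]["data"]["opinions"]:
            # later keys override earlier ones, so opinion fields take precedence
            rows.append({**case_fields, **opinion})
    return rows

# invented toy record with the same nesting as the CAP data shown earlier
toy_cases = [{
    "id": 1,
    "name": "A v. B",
    "casebody": {"data": {"opinions": [
        {"type": "majority", "author": "X, J.", "text": "..."},
        {"type": "dissent", "author": "Y, J.", "text": "..."},
    ]}},
}]
rows = flatten_opinions(toy_cases)
print(len(rows))  # 2
```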

Let's try to answer a more complicated question using this corpus. Comparing cases from 1900, 1910, 1920, and 1930 against cases from 1970, 1980, 1990, and 2000, what words can we say are distinctive to each time period? Are there words from opinions dating to the beginning of the 20th century that don't occur in opinions dating to the end of the 20th century?

We'll start by implementing a basic n-gram search function and graphing our results.

In [23]:
def search_ngram(ngram):
    pairs = []
    for year in opinions_df["decision_date"].unique():
        temp = opinions_df[opinions_df["decision_date"] == year]["text"].tolist()
        temp = " ".join(temp).lower()
        n = len(temp.split(" "))
        ngram_count = temp.count(ngram.lower())
        pairs.append((year, ngram_count/n))
    return pairs

def graph_ngram(pairs, ax, title):
    x,y = [list(x) for x in zip(*pairs)]
    ax.plot(x,y)
    ax.set_title(title)
    ax.set_xlabel("Year")
    ax.set_ylabel("fraction of words")
    return ax

fig, axes = plt.subplots(2, 2, figsize=(15,8))
graph_ngram(search_ngram("plow"), axes[0,0], "Plow")
graph_ngram(search_ngram("computer"), axes[0,1], "Computer")
graph_ngram(search_ngram("horse"), axes[1,0], "Horse")
graph_ngram(search_ngram("truck"), axes[1,1], "Truck")

plt.tight_layout()
plt.show()
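
One caveat about `search_ngram`: `str.count` counts substring matches, so a search for "plow" also counts "plowed" and "plowing". If whole-word matches are what you want, a regular expression with word boundaries is one option. The helper name `count_whole_word` below is hypothetical, and the sample sentence is made up.

```python
import re

def count_whole_word(text, term):
    """Count case-insensitive occurrences of `term` that are not part of a longer word."""
    # \b matches at a word boundary; re.escape keeps characters like "-" literal
    pattern = r"\b" + re.escape(term.lower()) + r"\b"
    return len(re.findall(pattern, text.lower()))

sample = "The plow was plowed under; a plow-horse and a horse."
print(sample.lower().count("plow"))      # 3 (substring matches, includes "plowed")
print(count_whole_word(sample, "plow"))  # 2 ("plow" and the "plow" in "plow-horse")
```

Note that `\b` treats a hyphen as a boundary, so hyphenated compounds still match their parts. Whether that is desirable depends on the question being asked.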

Fewer horses and plows, more trucks and computers! Let's get a little bit more sophisticated (still keeping it simple) and find a list of words which occur fewer than 10 times in cases from 1900-1930 but frequently in cases from 1970-2000.

In [28]:
def tokenize_cases(cases):
    cases = " ".join(cases).lower()
    cases = cases.replace("\n", " ").replace(",", "").replace(";", "").replace("'", "").replace("’", "")
    cases = cases.replace("(", "").replace(")", "").replace(".", "").replace("?", "").replace("!", "").split(" ")
    return cases

early_cases = tokenize_cases(opinions_df[opinions_df["decision_date"] <= 1930]["text"].tolist())
late_cases = tokenize_cases(opinions_df[opinions_df["decision_date"] >= 1970]["text"].tolist())
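
The chained `replace` calls above only strip the punctuation they list, so tokens can keep characters such as colons or percent signs attached. A regular-expression tokenizer is one alternative. The sketch below is an assumption about what a cleaner tokenizer might look like, not the method used for the results in this notebook, and switching to it would change the counts. `tokenize_text` and `TOKEN_RE` are illustrative names.

```python
import re

# a token is a run of letters/digits, optionally joined by single hyphens (e.g. "wal-mart")
TOKEN_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")

def tokenize_text(text):
    """Lowercase `text` and return its word tokens."""
    # drop apostrophes and periods first, as the original tokenizer does,
    # so "that's" becomes "thats" and "S.W.2d" becomes "sw2d"
    text = re.sub(r"['’.]", "", text.lower())
    return TOKEN_RE.findall(text)

print(tokenize_text('The court states: "Wal-Mart\'s claim (10%) fails."'))
# ['the', 'court', 'states', 'wal-marts', 'claim', '10', 'fails']
```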
In [29]:
early_cases_dict = {}

for word in early_cases:
    if word in early_cases_dict:
        early_cases_dict[word] += 1
    else:
        early_cases_dict[word] = 1

len(early_cases_dict)
Out[29]:
41725
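
The counting loop above is the pattern that `collections.Counter` from the standard library implements. A minimal sketch on a made-up token list:

```python
from collections import Counter

tokens = ["the", "court", "the", "road", "court", "the"]
counts = Counter(tokens)        # same result as the manual if/else loop

print(counts["the"])            # 3
print(counts["missing"])        # 0 (absent words count as zero, no KeyError)
print(counts.most_common(2))    # [('the', 3), ('court', 2)]
```

Applied here, `Counter(early_cases)` should produce the same counts as `early_cases_dict`.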
In [66]:
new_words_dict = {}

for word in late_cases:
    if word in early_cases_dict:
        if early_cases_dict[word] < 10:
            if word in new_words_dict:
                new_words_dict[word] += 1
            else:
                new_words_dict[word] = 1
    else:
        if word in new_words_dict:
            new_words_dict[word] += 1
        else:
            new_words_dict[word] = 1
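
The nested branches above reduce to a single rule: count a late-period word if it appeared fewer than 10 times in the early period, where "never appeared" counts as zero. A hedged sketch of that rule using `Counter`, with the hypothetical name `rare_early_counts` and invented toy tokens:

```python
from collections import Counter

def rare_early_counts(early_tokens, late_tokens, threshold=10):
    """Count late-period words that occur fewer than `threshold` times in the early period."""
    early = Counter(early_tokens)
    # Counter returns 0 for unseen words, which covers the "not in early" branch
    return Counter(w for w in late_tokens if early[w] < threshold)

# toy data: "horse" is common early, "truck" is rare early, "computer" is absent early
early = ["horse"] * 12 + ["truck"] * 2
late = ["horse", "truck", "truck", "computer"]
result = rare_early_counts(early, late)
print(dict(result))  # {'truck': 2, 'computer': 1}
```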
In [67]:
def isInt(s):
    try: 
        int(s)
        return True
    except ValueError:
        return False

sorted_new_words = sorted(new_words_dict.items(), key=lambda x:-x[1])
sorted_new_words = [item for item in sorted_new_words if not isInt(item[0])]
sorted_new_words = sorted_new_words[:100]

print("Most common words from 1970-2000 that occurred <10 times from 1900-1930\n")
print("{0:14}|{1:5}".format("Word", "Occurrences"))
print("-------------------------")
for word in sorted_new_words:
    print("{0:14}|{1:5}".format(word[0], word[1]))
Most common words from 1970-2000 that occurred <10 times from 1900-1930

Word          |Occurrences
-------------------------
sw2d          | 8878
repl          | 1700
coverage      |  774
victim        |  678
emphasis      |  606
sw3d          |  598
timely        |  549
marijuana     |  536
percent       |  507
disagree      |  492
factors       |  486
suppress      |  482
cir           |  479
compensable   |  440
problem       |  422
mistrial      |  401
probation     |  400
victims       |  399
surgery       |  393
sentencing    |  389
convictions   |  385
problems      |  374
aggravated    |  368
procedures    |  353
visitation    |  343
cocaine       |  332
subsection    |  319
program       |  317
factual       |  300
wal-mart      |  297
activity      |  294
ineffective   |  292
counsels      |  290
f2d           |  284
pm            |  273
activities    |  268
someone       |  260
parole        |  259
evidentiary   |  257
voter         |  257
prosecutors   |  252
underlying    |  251
despite       |  251
eg            |  248
sentences     |  246
im            |  241
document      |  239
unemployment  |  239
pretrial      |  237
informant     |  236
arguing       |  235
plus          |  232
allegedly     |  232
rehabilitation|  232
scheduled     |  231
additionally  |  231
questioning   |  231
paternity     |  231
states:       |  228
prison        |  224
thats         |  222
ie            |  219
basic         |  216
jeopardy      |  216
contraband    |  215
certification |  215
disclosure    |  213
ultimately    |  212
part:         |  211
initially     |  207
potential     |  205
trailer       |  204
newbern       |  201
parental      |  197
group         |  196
battery       |  195
popular       |  195
reveals       |  194
miranda       |  194
sera          |  192
indication    |  189
workmens      |  186
sellers       |  186
10%           |  184
fogleman      |  183
similarly     |  182
cert          |  182
proffered     |  181
procedural    |  181
hearings      |  179
wage          |  179
habitual      |  178
firearm       |  174
commitment    |  172
smiths        |  172
womack        |  170
model         |  170
motorist      |  169
bureau        |  168
parking       |  168
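
The table above ranks words by raw late-period counts, which does not account for the two periods containing different amounts of text. One possible refinement, not part of the analysis in this notebook, is to compare smoothed relative frequencies. The function name `rate_ratio`, the add-one smoothing, and the toy tokens below are all assumptions made for illustration.

```python
from collections import Counter

def rate_ratio(early_tokens, late_tokens, smoothing=1.0):
    """Return {word: late_rate / early_rate} using additively smoothed relative frequencies."""
    early, late = Counter(early_tokens), Counter(late_tokens)
    vocab = set(early) | set(late)
    # denominators include the smoothing mass so each period's rates sum to 1
    early_total = len(early_tokens) + smoothing * len(vocab)
    late_total = len(late_tokens) + smoothing * len(vocab)
    return {
        w: ((late[w] + smoothing) / late_total) / ((early[w] + smoothing) / early_total)
        for w in vocab
    }

# toy corpora: "truck" only appears late, "horse" only early, "court" in both
ratios = rate_ratio(["horse"] * 3 + ["court"], ["truck"] * 3 + ["court"])
for word, ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(word, round(ratio, 2))
```

Ratios well above 1 point to words that are relatively more common in the later period, and ratios well below 1 to words more common in the earlier one.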
In [68]:
fig, axes = plt.subplots(2, 2, figsize=(15,8))
graph_ngram(search_ngram("coverage"), axes[0,0], "Coverage")
graph_ngram(search_ngram("victim"), axes[0,1], "Victim")
graph_ngram(search_ngram("marijuana"), axes[1,0], "Marijuana")
graph_ngram(search_ngram("wal-mart"), axes[1,1], "Wal-Mart")

plt.tight_layout()
plt.show()