How many newspaper titles are available for each year in the Chronicling America collection?¶

To answer this question, we'll use the Python library pandas, which is intended for data analysis. The ultimate goal is to create a CSV file with a column for each year and a column with the count of titles available for that year.

I'm still wrapping my brain around pandas and I'm sure there are other ways to achieve similar results more efficiently. Let me know!

The initial steps of this workbook are from the "Read a Plain Text File of Newspapers into a DataFrame" notebook posted by the Library of Congress at https://github.com/LibraryOfCongress/data-exploration/blob/master/Newspapers.txt.ipynb

In [538]:

import pandas as pd

The newspapers.txt file on Chronicling America is delimited by the pipe | character, not a comma, with spaces around it. You can read it into a pandas DataFrame with the following command. Note we have to specify the python engine because we have to use a regex (regular expression) to define the separator. Also, we're not bringing in the final column from the original file, which is a "more info" link.

In [539]:

papers = pd.read_csv('http://chroniclingamerica.loc.gov/newspapers.txt', sep=' \| ', skipinitialspace=True, usecols=[0,1,2,3,4,5,6,7,8], engine='python')

In [540]:

type(papers)

Out[540]:

pandas.core.frame.DataFrame

Here's what the first few lines of the initial data looks like:

In [541]:

papers.head()

Out[541]:

	Persistent Link	State	Title	LCCN	OCLC	ISSN	No. of Issues	First Issue Date	Last Issue Date
0	http://chroniclingamerica.loc.gov/lccn/sn85025...	Alabama	Chattanooga daily rebel. (Selma, Ala.) 1865-1865	sn85025905	2638820	2328-5869	3	April 19, 1865	April 27, 1865
1	http://chroniclingamerica.loc.gov/lccn/sn82015...	Alabama	The Chattanooga Daily Rebel. (Chattanooga, Ten...	sn82015209	8807897	2328-5826	228	Aug. 9, 1862	April 22, 1864
2	http://chroniclingamerica.loc.gov/lccn/sn82014...	Alabama	The daily Chattanooga rebel. (Griffin, Ga.) 18...	sn82014371	8793427	2328-5761	41	June 9, 1864	April 15, 1865
3	http://chroniclingamerica.loc.gov/lccn/sn83045...	Alabama	Memphis daily appeal. (Memphis, Tenn.) 1847-1886	sn83045160	9355541	2166-1898	8088	Jan. 1, 1857	Jan. 27, 1886
4	http://chroniclingamerica.loc.gov/lccn/sn94051...	Arizona	The argus. (Holbrook, Ariz.) 1895-1900	sn94051341	25084608	2375-169X	220	Dec. 12, 1895	May 5, 1900

How many papers are listed in the set?

In [542]:

len(papers.index)

Out[542]:

Some of these fields in the Last Issue Date have trailing delimiters. Let's use a regular expression to remove a space followed by a | in those cells.

In [543]:

papers["Last Issue Date"] = papers["Last Issue Date"].str.replace('(\s+\|)', '')

Now convert each of those dates to a datetime format. That will make it easier to pull out just the year later on.

In [544]:

papers["First Issue Date"] = pd.to_datetime(papers["First Issue Date"])

In [545]:

papers["Last Issue Date"] = pd.to_datetime(papers["Last Issue Date"])

What data type is the Issue date column?

In [546]:

papers["Last Issue Date"].dtype

Out[546]:

dtype('<M8[ns]')

That will work. So what does the data look like with the dates converted to date formats?

In [374]:

papers.head()

Out[374]:

	Persistent Link	State	Title	LCCN	OCLC	ISSN	No. of Issues	First Issue Date	Last Issue Date
0	http://chroniclingamerica.loc.gov/lccn/sn85025...	Alabama	Chattanooga daily rebel. (Selma, Ala.) 1865-1865	sn85025905	2638820	2328-5869	3	1865-04-19	1865-04-27
1	http://chroniclingamerica.loc.gov/lccn/sn82015...	Alabama	The Chattanooga Daily Rebel. (Chattanooga, Ten...	sn82015209	8807897	2328-5826	228	1862-08-09	1864-04-22
2	http://chroniclingamerica.loc.gov/lccn/sn82014...	Alabama	The daily Chattanooga rebel. (Griffin, Ga.) 18...	sn82014371	8793427	2328-5761	41	1864-06-09	1865-04-15
3	http://chroniclingamerica.loc.gov/lccn/sn83045...	Alabama	Memphis daily appeal. (Memphis, Tenn.) 1847-1886	sn83045160	9355541	2166-1898	8088	1857-01-01	1886-01-27
4	http://chroniclingamerica.loc.gov/lccn/sn94051...	Arizona	The argus. (Holbrook, Ariz.) 1895-1900	sn94051341	25084608	2375-169X	220	1895-12-12	1900-05-05

Now let's pull out the years and put them into their own new columns.

In [553]:

papers["Start"] = papers["First Issue Date"].dt.year

In [554]:

papers["Last"] = papers["Last Issue Date"].dt.year

In [555]:

papers.head()

Out[555]:

	Persistent Link	State	Title	LCCN	OCLC	ISSN	No. of Issues	First Issue Date	Last Issue Date	Start	Last
0	http://chroniclingamerica.loc.gov/lccn/sn85025...	Alabama	Chattanooga daily rebel. (Selma, Ala.) 1865-1865	sn85025905	2638820	2328-5869	3	1865-04-19	1865-04-27	1865	1865
1	http://chroniclingamerica.loc.gov/lccn/sn82015...	Alabama	The Chattanooga Daily Rebel. (Chattanooga, Ten...	sn82015209	8807897	2328-5826	228	1862-08-09	1864-04-22	1862	1864
2	http://chroniclingamerica.loc.gov/lccn/sn82014...	Alabama	The daily Chattanooga rebel. (Griffin, Ga.) 18...	sn82014371	8793427	2328-5761	41	1864-06-09	1865-04-15	1864	1865
3	http://chroniclingamerica.loc.gov/lccn/sn83045...	Alabama	Memphis daily appeal. (Memphis, Tenn.) 1847-1886	sn83045160	9355541	2166-1898	8088	1857-01-01	1886-01-27	1857	1886
4	http://chroniclingamerica.loc.gov/lccn/sn94051...	Arizona	The argus. (Holbrook, Ariz.) 1895-1900	sn94051341	25084608	2375-169X	220	1895-12-12	1900-05-05	1895	1900

Just for kicks, let's create a new dataframe with a row for each year of a title's coverage. This might be useful for other purposes later on.

In [556]:

years_df = pd.DataFrame(columns=("State","Title","Year"))
years_df.head()

Out[556]:

	State	Title	Year

In [557]:

# For each title in the original dataframe, 
# create a range for the years from Start to Last. 
# Then add a new row for each year of coverage. 
for i in range(len(papers.index)):
    for year in range(papers.loc[i]["Start"], (papers.loc[i]["Last"] + 1)):
        s = pd.Series({"State": papers.loc[i]["State"], "Title": papers.loc[i]["Title"], "Year": year})
        years_df.loc[len(years_df)]= s.astype(str)
        

How many rows are in this new dataframe?

In [558]:

len(years_df.index)

Out[558]:

In [559]:

years_df.head()

Out[559]:

	State	Title	Year
0	Alabama	Chattanooga daily rebel. (Selma, Ala.) 1865-1865	1865
1	Alabama	The Chattanooga Daily Rebel. (Chattanooga, Ten...	1862
2	Alabama	The Chattanooga Daily Rebel. (Chattanooga, Ten...	1863
3	Alabama	The Chattanooga Daily Rebel. (Chattanooga, Ten...	1864
4	Alabama	The daily Chattanooga rebel. (Griffin, Ga.) 18...	1864

That's a useful table. Let's save it as CSV, without the index column.

In [483]:

years_df.to_csv(path_or_buf="years-list.csv", index=False)

Now roll up this list by year, with a show the count for each year, and put in a new DataFrame.

In [560]:

years_count_df = pd.DataFrame(years_df.groupby(["Year"]).size())

In [561]:

years_count_df.head()

Out[561]:

	0
Year
1789	2
1790	2
1791	3
1792	3
1793	3

In [562]:

len(years_count_df)

Out[562]:

That's not enough years between 1789 and 1924, so some years must be missing. We'll need to find and insert those years, with a count of 0. Then we need to sort the index so that those new index values fall into place in the timeline, rather than being added at the end.

In [563]:

#need to identify missing years (e.g. 1810s) and add a count of 0
for y in range(1789,1925):
    x = str(y)
    if x not in years_count_df.index.values:
        years_count_df.loc[x]=[0]

years_count_df.index.tolist()

Out[563]:

['1789',
 '1790',
 '1791',
 '1792',
 '1793',
 '1794',
 '1795',
 '1796',
 '1797',
 '1798',
 '1799',
 '1800',
 '1801',
 '1802',
 '1803',
 '1804',
 '1805',
 '1806',
 '1807',
 '1808',
 '1809',
 '1828',
 '1829',
 '1830',
 '1831',
 '1832',
 '1833',
 '1834',
 '1835',
 '1836',
 '1837',
 '1838',
 '1839',
 '1840',
 '1841',
 '1842',
 '1843',
 '1844',
 '1845',
 '1846',
 '1847',
 '1848',
 '1849',
 '1850',
 '1851',
 '1852',
 '1853',
 '1854',
 '1855',
 '1856',
 '1857',
 '1858',
 '1859',
 '1860',
 '1861',
 '1862',
 '1863',
 '1864',
 '1865',
 '1866',
 '1867',
 '1868',
 '1869',
 '1870',
 '1871',
 '1872',
 '1873',
 '1874',
 '1875',
 '1876',
 '1877',
 '1878',
 '1879',
 '1880',
 '1881',
 '1882',
 '1883',
 '1884',
 '1885',
 '1886',
 '1887',
 '1888',
 '1889',
 '1890',
 '1891',
 '1892',
 '1893',
 '1894',
 '1895',
 '1896',
 '1897',
 '1898',
 '1899',
 '1900',
 '1901',
 '1902',
 '1903',
 '1904',
 '1905',
 '1906',
 '1907',
 '1908',
 '1909',
 '1910',
 '1911',
 '1912',
 '1913',
 '1914',
 '1915',
 '1916',
 '1917',
 '1918',
 '1919',
 '1920',
 '1921',
 '1922',
 '1923',
 '1924',
 '1810',
 '1811',
 '1812',
 '1813',
 '1814',
 '1815',
 '1816',
 '1817',
 '1818',
 '1819',
 '1820',
 '1821',
 '1822',
 '1823',
 '1824',
 '1825',
 '1826',
 '1827']

Oops, those new rows got added at the end, which won't make sense for time-ordered data. Let's sort the index.

In [564]:

years_count_df.sort_index(inplace=True)

In [565]:

len(years_count_df)

Out[565]:

Now to create a chart showing the numbers of papers each year.

In [566]:

% matplotlib inline

In [567]:

years_count_df.plot(kind='barh', figsize=(20,75)).invert_yaxis()

Let's save this data to CSV! This file will be useful in other applications for examining the data.

In [568]:

years_count_df.to_csv(path_or_buf="final-counts.csv")

Change of pace: want to hear a sonification of this graph? I posted a sound file on Soundcloud. Be sure to read the description to get the details on what you're hearing.

So what are the min and max counts? How many years altogether in this file?

In [570]:

years_count_df.describe()

Out[570]:

	0
count	136.000000
mean	177.544118
std	179.249252
min	0.000000
25%	1.000000
50%	130.000000
75%	320.500000
max	526.000000

One other way to look at the data from one of the earlier dataframes is to group by year and then group within that by state. (We're going to ignore the years with no titles.)

In [402]:

years_df.groupby(["Year","State"]).size()

Out[402]:

Year  State               
1789  New York                 1
      Pennsylvania             1
1790  New York                 1
      Pennsylvania             1
1791  New York                 1
      Pennsylvania             2
1792  New York                 1
      Pennsylvania             2
1793  New York                 1
      Pennsylvania             2
1794  Pennsylvania             2
1795  Pennsylvania             2
1796  Pennsylvania             2
1797  Pennsylvania             1
1798  Pennsylvania             1
1799  Pennsylvania             1
1800  District of Columbia     1
      Pennsylvania             2
1801  District of Columbia     1
1802  District of Columbia     1
1803  District of Columbia     1
1804  District of Columbia     1
1805  District of Columbia     1
1806  District of Columbia     1
1807  District of Columbia     1
1808  District of Columbia     1
1809  District of Columbia     1
1828  Georgia                  1
1829  Georgia                  2
1830  Georgia                  1
                              ..
1922  Kansas                   4
      Kentucky                17
      Louisiana               18
      Maryland                 5
      Michigan                 7
      Minnesota               14
      Mississippi              9
      Missouri                17
      Montana                  4
      Nebraska                 5
      Nevada                   3
      New Mexico               7
      New York                 3
      North Carolina           9
      North Dakota            10
      Ohio                     2
      Oklahoma                 7
      Oregon                  11
      Pennsylvania             2
      South Carolina          20
      Tennessee               12
      Texas                    4
      Utah                     5
      Vermont                 11
      Virginia                 6
      Washington               7
      West Virginia            3
      Wisconsin                1
1923  Michigan                 1
1924  Michigan                 1
dtype: int64