https://github.com/fchasen/bookhunters
Fred, Luis, Sonali
When we search for books on the internet, we are often overwhelmed with results coming from various sources in both physical and digital locations. Futhermore, it’s difficult to get direct trusted urls to books. Project Gutenberg, HathiTrust and Open Library, all provide an extensive library of books online, each with their own large repositories. By combining their catalogs, Book Hunters enables querying for a book across those different sources, our project will also highlight key statistics about the three datasets which will make it easier for the user to choose the library that best meets their needs.
Open Library - http://openlibrary.org/
Open Library is an open, editable library catalog, building towards a web page for every book ever published.
HathiTrust - http://www.hathitrust.org/
HathiTrust is a partnership of major research institutions and libraries working to ensure that the cultural record is preserved and accessible long into the future. There are more than sixty partners in HathiTrust, and membership is open to institutions worldwide.
Gutenberg - http://www.gutenberg.org/
Project Gutenberg is the first and largest single collection of free electronic books, or eBooks. Michael Hart, founder of Project Gutenberg, invented eBooks in 1971 and continues to inspire the creation of eBooks and related technologies today. All books in Gutenberg are in Public Domain and may exist across the format spectrum from simple plain text to mp3.
We have followed the plan pretty closely as far as functionality we wanted to establish in regard to finding and downloading books immediately. As far as calculating when books would enter Public Domain, this proved to be difficult. First, depending on when the book was published, a copyright can either be renewed or is automatically renewed. The system would have to keep track of not just the date, but the status of a renewal and whether the copyright was held by an individual author or a corporation.
Due to the large amount of records that the data sources contain, we needed to move to a more persistent storage in order to run various metrics and achieve better retrieval performance.
Hmmmmm, where do we start
The goal here is to show some general metrics which will help if/when you want to choose what data source to look at more closely
In order to reproduce the results, there is some setup required:
Fred
Luis
Sonali
The goal for the presentation was to show that Notebooks could still be used even though the data source is a database. Our main next step would be to turn this into a web application, with a sophisticated UI in order to help people find "on demand" books. Other update include:
Each library's data needed to be imported into the databse differently.
The following links provided the urls to the data and code we used to import each of the datasets.
This process takes several hours, so shown for reproducibility only.
Open Library - https://ec2-54-225-70-191.compute-1.amazonaws.com:80/5fb75e17-1eb2-4eb1-8d52-2eb5c3e18fe9
HathiTrust - https://ec2-54-225-70-191.compute-1.amazonaws.com:80/69a6a181-4243-4b85-8a30-d74e90584bc7
Gutenberg - https://ec2-54-225-70-191.compute-1.amazonaws.com:80/82ab62ce-c46f-4c3f-9c2c-97fe3f2a6220
#Importing libraries
import matplotlib.pyplot as plt
import pandas as pd
from itertools import islice
import numpy as np
#Useful way of priting dataframe using html
from IPython.core.display import HTML
import IPython.core.display
/usr/local/lib/python2.7/dist-packages/pytz/__init__.py:35: UserWarning: Module logging was already imported from /usr/lib/python2.7/logging/__init__.pyc, but /usr/local/lib/python2.7/dist-packages is being added to sys.path from pkg_resources import resource_stream
#Connect to database, our data is in mysql hosted on ec2 cluster. Since the size of data was huge hence it was essential to use a database.
import MySQLdb
db = MySQLdb.connect(host="bookhunters.cicwejvpogpp.us-east-1.rds.amazonaws.com", port=3306, user="bookhunters", passwd="wwod13pw!",
db="bookhunters", charset='utf8')
db.set_character_set('utf8')
cursor = db.cursor()
cursor.execute('SET NAMES utf8')
cursor.execute('SET CHARACTER SET utf8')
0L
# execute SQL select statement to fetch results from database
cursor.execute("select 'Hathtrust' , count(1) from ht_books union select 'Gutenberg', count(1) from gut_books union select 'Open library', count(1) from ol_books")
results = cursor.fetchall()
#Store results in a dataframe
l = list(results)
df = pd.DataFrame(l, columns=["Source","Count"])
df = df.set_index(df.Source)
df
Source | Count | |
---|---|---|
Source | ||
Hathtrust | Hathtrust | 4418000 |
Gutenberg | Gutenberg | 42213 |
Open library | Open library | 197245 |
df['Count'].plot(kind = 'bar',title="Total books")
<matplotlib.axes.AxesSubplot at 0x2ca5f50>
#Plot graphs based on publish date
cursor.execute("select pubdate,'Hathitrust' as Source, COUNT(1) from ht_books group by pubdate, 'Hathitrust' union select publish_date,'Openlibrary' as Source, COUNT(1) from ol_books group by publish_date,'Openlibrary' union select created,'Gutenberg' as Source, COUNT(1) from gut_books group by created,'Gutenberg'")
dates = cursor.fetchall()
dateslist = list(dates)
dfdates=pd.DataFrame(dateslist,columns=["Year","Source","Count"])
#display(HTML(dfdates[:-25].to_html()))
#Parsing the dates field to fetch only the year, remove redundant text.
import re
def returnnumber(num):
try:
tmp = re.findall(("(\d{4})") ,num)
if (len(tmp)==1):
return int("".join(tmp))
else:
return None
except:
return None
#Removing null fields from the dataframe to add another column called FilteredDate to store parsed date.
dfdates.dropna()
s = pd.Series()
s = dfdates.Year.apply(returnnumber)
dfdates["FilteredDate"]=s
dfdates.sort('FilteredDate')
dfdates = dfdates[(dfdates.FilteredDate>1990) & (dfdates.FilteredDate<2007)]
#display(HTML(dfdates.to_html()))
#Plot the books publishes by year between 1990 and 2002
d = dfdates.pivot_table('Count', rows='FilteredDate', cols='Source', aggfunc=sum)
d
Source | Gutenberg | Hathitrust | Openlibrary |
---|---|---|---|
FilteredDate | |||
1991 | 5 | 58774 | 1389 |
1992 | 13 | 52561 | 1498 |
1993 | 36 | 32109 | 1832 |
1994 | 58 | 16576 | 1724 |
1995 | 110 | 16136 | 1620 |
1996 | 335 | 14864 | 1649 |
1997 | 335 | 13812 | 1679 |
1998 | 350 | 12870 | 1703 |
1999 | 344 | 19557 | 1630 |
2000 | 356 | 12103 | 1817 |
2001 | 332 | 11023 | 1427 |
2002 | 388 | 10277 | 1423 |
2003 | 1481 | 10003 | 1398 |
2004 | 7124 | 9891 | 1262 |
2005 | 5145 | 9425 | 1222 |
2006 | 3637 | 8401 | 1147 |
d.plot(title="Total books collection by year", linewidth=2.5 )
<matplotlib.axes.AxesSubplot at 0x31e4bd0>
def ipynb_input(varname, prompt=''):
"""Prompt user for input and assign string val to given variable name."""
js_code = ("""
var value = prompt("{prompt}","");
var py_code = "{varname} = '" + value + "'";
IPython.notebook.kernel.execute(py_code);
""").format(prompt=prompt, varname=varname)
return IPython.core.display.Javascript(js_code)
ipynb_input("fromyear", prompt='Enter the from year: ')
ipynb_input("toyear", prompt='Enter the to year: ')
dfdatesuser = dfdates[(dfdates.FilteredDate>int(fromyear)) & (dfdates.FilteredDate<int(toyear))]
duser = dfdatesuser.pivot_table('Count', rows='FilteredDate', cols='Source', aggfunc=sum)
duser.plot(title="Books collection by year", linewidth=2.5 )
<matplotlib.axes.AxesSubplot at 0x3316e10>
cursor.execute("select lang as lang,count(1) as count,'Hathitrust' as Source from ht_books where lang is not null group by (lang)")
dflang_ht = pd.DataFrame(list(cursor.fetchall()),columns=['Language','Count','Source'])
cursor.execute("select lang as lang ,count(1) as count,'Gutenberg' as Source from gut_books where lang is not null group by (lang)")
dflang_gb = pd.DataFrame(list(cursor.fetchall()),columns=['Language','Count','Source'])
cursor.execute("select language as lang,count(1) as count, 'Openlibrary' as Source from ol_books where language is not null group by (language)")
dflang_ol = pd.DataFrame(list(cursor.fetchall()),columns=['Language','Count','Source'])
#Function to translate date abbreviation to their full form
def language(l):
langdict= { 'eng':'English',
'ger':'German',
'fre':'French',
'chi':'Chinese',
'jpn':'Japan',
'rus':'Russian',
'en':'English',
'fr':'French',
'de':'Dutch',
'nl':'Dutch',
'pt':'Portugese',
'spa':'Spanish',
'ita':'Italian',
'lat':'Latin',
'fr':'French',
'fi':'Finnish' }
try:
if len(l)>3:
tmp= l[-3:]
else:
tmp=l
return langdict[tmp]
except:
return 'Other'
dflang_gb6= dflang_gb[dflang_gb.Language.str.strip()!=""].sort('Count')[::-1][:6]
dflang_ol6 = dflang_ol[dflang_ol.Language.str.strip()!=""].sort('Count')[::-1][:6]
dflang_ht6 = dflang_ht[dflang_ht.Language.str.strip()!=""].sort('Count')[::-1][:6]
dflang_gb6['Lang']= dflang_gb[dflang_gb.Language.str.strip()!=""].sort('Count')[::-1][:6].Language.apply(language)
dflang_ol6['Lang']= dflang_ol[dflang_ol.Language.str.strip()!=""].sort('Count')[::-1][:6].Language.apply(language)
dflang_ht6['Lang']= dflang_ht[dflang_ht.Language.str.strip()!=""].sort('Count')[::-1][:6].Language.apply(language)
df_mer_lang = dflang_gb6.append(dflang_ol6).append(dflang_ht6)
df_mer_lang = df_mer_lang.pivot_table('Count', rows='Lang', cols='Source', aggfunc=sum)
df_mer_lang.fillna(0)
Source | Gutenberg | Hathitrust | Openlibrary |
---|---|---|---|
Lang | |||
Chinese | 0 | 221058 | 0 |
Dutch | 1520 | 0 | 0 |
English | 35198 | 2352951 | 108873 |
Finnish | 712 | 0 | 0 |
French | 2158 | 343066 | 14479 |
German | 0 | 398413 | 9456 |
Italian | 0 | 0 | 2323 |
Japan | 0 | 212208 | 0 |
Latin | 0 | 0 | 1898 |
Portugese | 528 | 0 | 0 |
Russian | 0 | 147235 | 0 |
Spanish | 0 | 0 | 2785 |
fig = plt.figure()
fig.set_size_inches(10,10)
ax1 = fig.add_subplot(2, 2, 1)
df_mer_lang.fillna(0)['Gutenberg'].plot(kind="barh", title="Gutenberg",color='r')
ax2 = fig.add_subplot(2, 2, 2)
df_mer_lang.fillna(0)['Openlibrary'].plot(kind="barh", title="Openlibrary",color='b')
ax3 = fig.add_subplot(2, 2, 3)
print "\n"
print "\n"
df_mer_lang.fillna(0)['Hathitrust'].plot(kind="barh", title="Hathitrust",color='g')
<matplotlib.axes.AxesSubplot at 0x384ed10>
query = " ".join(["select COUNT(FORM) AS formatcnt, FORM as format, Source",
"from( select case WHEN gut_files.FORMAT LIKE '%epub%' then 'epub'",
"else case WHEN gut_files.FORMAT LIKE '%pdf%' THEN 'pdf' ",
"else case WHEN gut_files.format like '%text%' THEN 'text' END END END AS FORM , 'Gutenberg' as Source",
"from gut_files) a",
"group by FORM , Source",
"union SELECT count('PDF') as formatcnt,'pdf' as format, 'Hathitrust'as Source from ht_books where Access='allow' group by 'PDF','Hathitrust' ",
"union select count('epub') as formatcnt,'epub' as format, 'Openlibrary' as Source from ol_books group by 'epub','Openlibrary' ",
"union select count('text') as formatcnt, 'text' as format, 'Openlibrary' as Source from ol_books group by 'text','Openlibrary' ",
"union select count('pdf') as formatcnt, 'pdf' as format, 'Openlibrary' as Source from ol_books group by 'pdf','Openlibrary' "
])
cursor.execute(query)
formatres = list(cursor.fetchall())
dfformatres = pd.DataFrame(formatres,columns=['Count','Format','Source'])
dfformatres = dfformatres.pivot_table('Count', rows='Format', cols='Source', aggfunc=sum)
dfformatres.plot(kind="bar", stacked=True)
<matplotlib.axes.AxesSubplot at 0x33f7650>
def maketable(df):
imagedict={6:"http://fchasen.com/cal/open-data/icons/pdf-icon.png",
8:"http://fchasen.com/cal/open-data/icons/epub-icon.png",
7:"http://fchasen.com/cal/open-data/icons/text-icon.png"}
columns=df.columns
table ="<table><thead>"
table +=" <tr><th>Results</th>"
for i in columns:
table += "<th>{0}</th>".format(i)
table += "</tr></thead><tbody>"
for k,s in df.iterrows():
table +="<tr>"
table += "<th>{0}</th>".format(k)
for colnum,j in enumerate(s):
#print type(j)
try:
if (colnum in [6,7,8]):
if j is not None and j[:4]=="http":
table += "<td style='word-break:break-word;'><a href='{0}' target='_blank' style='word-break:break-word;'><img src='{1}' style='height:50px; margin: 0 auto;'/></td>".format(j,imagedict[colnum])
else:
table += "<td style='text-align:center'><img src='http://fchasen.com/cal/open-data/icons/none-icon.png' style='height:30px; margin: 10px auto;'/></td>"
else:
table += "<td>{0}</td>".format(j.encode('utf-8'))
except:
table += "<td>{0}</td>".format(j)
table += "</tr>"
table += "</tbody></table>"
return table
def searchBook(q, o="title"):
if o == "title":
g = "title"
ol = "title"
h = "Title"
if o == "author":
ol = "ol_authors.name"
g = "creator"
h = "Imprint"
query = "".join(["select distinct 'Gutenberg' as source, g.title as title, " ,
"g.creator as author, g.lang as language,'Unknown' as pubdate, 'public domain' as rights, ",
"(select about from gut_files where format like '%pdf%' and etext_id=g.etext_id limit 1) as pdf, ",
"(select about from gut_files where format like '%text%' and etext_id=g.etext_id limit 1) as text, ",
"(select about from gut_files where format like '%epub%' and etext_id=g.etext_id limit 1) as epub ",
"from gut_books g ",
"where lower("+g+") like lower('%"+q+"%') " ,
"union ",
"select distinct 'Hathitrust' as source, Title as title, ",
"'Unknown' as author, lang as language, PubDate as pubdate, case Access when 'allow' then 'public domain' else 'non public' end as rights, ",
"concat('http://hdl.handle.net/2027/',VolumeID) as pdf,'N/A' as epub, 'N/A' as text from ",
"ht_books where lower("+h+") like lower('%"+q+"%') and Access = 'allow' ",
"union ",
"select distinct 'OpenLibrary' as source, title, name as author, ",
"language, publish_date as pubdate, 'Unknown' as rights, ",
"concat('http://www.archive.org/download/',ocaid, '/', ocaid, '.pdf') as pdf, ",
"concat('http://www.archive.org/download/',ocaid, '/', ocaid, '_djvu.txt') as text, ",
"concat('http://www.archive.org/download/',ocaid, '/', ocaid, '.epub') as epub ",
"from ol_books LEFT JOIN ol_authors ON ol_books.author_key = ol_authors.key where lower("+ol+") like lower('%"+q+"%')"])
cursor.execute(query)
res = list(cursor.fetchall())
if(len(res)):
dfresult = pd.DataFrame(res ,columns=['source','title', 'author','language','pubdate','rights','pdf','text','epub'] )
dfresult['language']=dfresult.language.apply(language)
print "Total Results:",len(dfresult)
return dfresult
else:
print "No results returned, try again"
return False
inputHTML = '''
<input id="book-search" stype="width:400px">
<select id="book-select">
<option value="title">Title</option>
<option value="author">Author</option>
</select>
<input type="submit" id="book-submit" value="Search">'''
js_code = ("""
var $body = $("body");
$body.on('click', '#book-submit', function() {
var $search = $('#book-search'),
$select = $('#book-select');
var py_code1 = "queryterm = '" + $search.val() + "'";
var py_code2 = "on = '" + $select.val() + "'";
console.log('submit:', $search.val(), $select.val());
IPython.notebook.kernel.execute(py_code1);
IPython.notebook.kernel.execute(py_code2);
});
""")
IPython.core.display.Javascript(js_code)
HTML(inputHTML)
dfresult = searchBook(queryterm, on)
Total Results: 12
source = dfresult.groupby('source')
source['source'].value_counts().plot(kind="bar", title="Count distribution of searched results")
<matplotlib.axes.AxesSubplot at 0x3ac8b10>
lbl= dfresult['language'].value_counts().index
#dfresult.value_counts().plot(kind="bar")
pie(dfresult['language'].value_counts(), labels=lbl, autopct='%1.1f%%')
#type(dfresult['language'])
([<matplotlib.patches.Wedge at 0x3f58990>], [<matplotlib.text.Text at 0x3f58e50>], [<matplotlib.text.Text at 0x3f58f50>])
display(HTML(maketable(dfresult)))
if on == "title":
h = "Title"
if on == "author":
h = "Imprint"#Fetching books only from hathitrust
query = "".join(["select Title as title, 'Unknown' as subtitle, 'Unknown' as author,",
"lang as language, PubDate as pubdate, Access as rights, concat('http://hdl.handle.net/2027/',VolumeID) as link,",
"'pdf' as format from ht_books where "+h+" like '%" ,
queryterm,
"%'"])
query
cursor.execute(query)
htres = list(cursor.fetchall())
dfht= pd.DataFrame(htres , columns=['title','subtitle','author','language','pubdate','rights','link','format'])
#dislaying results from hathitrust
display(HTML(dfht.to_html()))
title | subtitle | author | language | pubdate | rights | link | format | |
---|---|---|---|---|---|---|---|---|
0 | The hound of the Baskervilles : another advent... | Unknown | Unknown | eng | 1921 | allow | http://hdl.handle.net/2027/mdp.39015062324325 |
dfht.pubdate.apply(returnnumber).value_counts().plot(kind="barh", title = "number of books published per year for the queries book")
<matplotlib.axes.AxesSubplot at 0x3fefa50>
dfht.language.apply(language)
lbl = dfht.language.apply(language).value_counts().index
pie(dfht.language.apply(language).value_counts(), labels=lbl, autopct='%1.1f%%')
([<matplotlib.patches.Wedge at 0x487b410>, <matplotlib.patches.Wedge at 0x487ba50>, <matplotlib.patches.Wedge at 0x487bfd0>, <matplotlib.patches.Wedge at 0x487d5d0>, <matplotlib.patches.Wedge at 0x487db90>, <matplotlib.patches.Wedge at 0x4882190>, <matplotlib.patches.Wedge at 0x4882750>], [<matplotlib.text.Text at 0x487b8d0>, <matplotlib.text.Text at 0x487bf10>, <matplotlib.text.Text at 0x487d510>, <matplotlib.text.Text at 0x487dad0>, <matplotlib.text.Text at 0x487dfd0>, <matplotlib.text.Text at 0x4882690>, <matplotlib.text.Text at 0x4882c50>], [<matplotlib.text.Text at 0x487b9d0>, <matplotlib.text.Text at 0x487bad0>, <matplotlib.text.Text at 0x487d090>, <matplotlib.text.Text at 0x487d650>, <matplotlib.text.Text at 0x487dc10>, <matplotlib.text.Text at 0x4882210>, <matplotlib.text.Text at 0x48827d0>])
Analyis of Hathitrust's non digitized data. This consists of volume of books before 1923 https://ec2-54-225-70-191.compute-1.amazonaws.com:80/25162c63-57c0-4c4d-b809-5c6a8f55b11b