Book Hunters

https://github.com/fchasen/bookhunters

Team Members

Fred, Luis, Sonali

Problem and Overview

When we search for books on the internet, we are often overwhelmed with results coming from various sources in both physical and digital locations. Futhermore, it’s difficult to get direct trusted urls to books. Project Gutenberg, HathiTrust and Open Library, all provide an extensive library of books online, each with their own large repositories. By combining their catalogs, Book Hunters enables querying for a book across those different sources, our project will also highlight key statistics about the three datasets which will make it easier for the user to choose the library that best meets their needs.

Data sources

Open Library

Open Library - http://openlibrary.org/

Open Library is an open, editable library catalog, building towards a web page for every book ever published.

HathiTrust

HathiTrust - http://www.hathitrust.org/

HathiTrust is a partnership of major research institutions and libraries working to ensure that the cultural record is preserved and accessible long into the future. There are more than sixty partners in HathiTrust, and membership is open to institutions worldwide.

Gutenberg

Gutenberg - http://www.gutenberg.org/

Project Gutenberg is the first and largest single collection of free electronic books, or eBooks. Michael Hart, founder of Project Gutenberg, invented eBooks in 1971 and continues to inspire the creation of eBooks and related technologies today. All books in Gutenberg are in Public Domain and may exist across the format spectrum from simple plain text to mp3.

The Plan vs. The Execution

We have followed the plan pretty closely as far as functionality we wanted to establish in regard to finding and downloading books immediately. As far as calculating when books would enter Public Domain, this proved to be difficult. First, depending on when the book was published, a copyright can either be renewed or is automatically renewed. The system would have to keep track of not just the date, but the status of a renewal and whether the copyright was held by an individual author or a corporation.

Bookhunters Database Schema

Due to the large amount of records that the data sources contain, we needed to move to a more persistent storage in order to run various metrics and achieve better retrieval performance.

Bookhunters Schema

Trials and Tribulations

Hmmmmm, where do we start

  • Configuring systems and databases
  • Vocabulary issue between the disparate sources and formats (XML, CSV, JSON)
  • Large datasets
  • Deciding the overlap between the disparate sources

Analyzing The Data

The goal here is to show some general metrics which will help if/when you want to choose what data source to look at more closely

  • Total book counts for each source
  • Books by year published/released
  • Books by language
  • Books by format
  • Search example and results
    • Links to formats
  • Hathi Trust publications per search

Results Reproduction

In order to reproduce the results, there is some setup required:

  1. Install a MySQL database and the necessary python libraries needed to load the data sources
  2. Download the latest data dumps from the various sources
  3. Use provided python data loading scripts for each of the data sources
  4. Run the final notebook

Workload Distribution

Fred

  • OpenLibrary dataset expert, data loader and database schema designer
  • Notebook HTML author

Luis

  • General system and database configuration
  • Gutenberg dataset expert, data loader and database schema designer

Sonali

  • Graphing and visualizations author across the datasets
  • Hathi Trust dataset expert, data loader and database schema designer

Where Do We Go From Here?

The goal for the presentation was to show that Notebooks could still be used even though the data source is a database. Our main next step would be to turn this into a web application, with a sophisticated UI in order to help people find "on demand" books. Other update include:

  • Create a materialized view between the three tables to simplify search and increase performance across the databases
  • Incorporating more statistics from the sources, i.e. - Gutenberg number of downloads
  • Including and mitigating subject vocabulary issues between the sources or possibly looking up book metadata from a source like Wikipedia
  • Incorporating book rankings technologies along with user metrics in order to identify books the user may enjoy

Import libraries and Connect to database

Each library's data needed to be imported into the databse differently.

The following links provided the urls to the data and code we used to import each of the datasets.

This process takes several hours, so shown for reproducibility only.

Open Library - https://ec2-54-225-70-191.compute-1.amazonaws.com:80/5fb75e17-1eb2-4eb1-8d52-2eb5c3e18fe9

HathiTrust - https://ec2-54-225-70-191.compute-1.amazonaws.com:80/69a6a181-4243-4b85-8a30-d74e90584bc7

Gutenberg - https://ec2-54-225-70-191.compute-1.amazonaws.com:80/82ab62ce-c46f-4c3f-9c2c-97fe3f2a6220

In [1]:
#Importing libraries

import matplotlib.pyplot as plt
import pandas as pd
from itertools import islice
import numpy as np
#Useful way of priting dataframe using html
from IPython.core.display import HTML
import IPython.core.display
/usr/local/lib/python2.7/dist-packages/pytz/__init__.py:35: UserWarning: Module logging was already imported from /usr/lib/python2.7/logging/__init__.pyc, but /usr/local/lib/python2.7/dist-packages is being added to sys.path
  from pkg_resources import resource_stream
In [2]:
#Connect to database, our data is in mysql hosted on ec2 cluster. Since the size of data was huge hence it was essential to use a database.
import MySQLdb
db = MySQLdb.connect(host="bookhunters.cicwejvpogpp.us-east-1.rds.amazonaws.com", port=3306, user="bookhunters", passwd="wwod13pw!",
db="bookhunters", charset='utf8')
db.set_character_set('utf8')
cursor = db.cursor()
cursor.execute('SET NAMES utf8')
cursor.execute('SET CHARACTER SET utf8')
Out[2]:
0L
In [4]:
# execute SQL select statement to fetch results from database
cursor.execute("select 'Hathtrust' , count(1) from ht_books union select 'Gutenberg', count(1) from gut_books union select 'Open library', count(1) from ol_books")
results = cursor.fetchall()

Fetching overall statistics

Total Count

In [5]:
#Store results in a dataframe
l = list(results)
df = pd.DataFrame(l, columns=["Source","Count"])
df = df.set_index(df.Source)
df
Out[5]:
Source Count
Source
Hathtrust Hathtrust 4418000
Gutenberg Gutenberg 42213
Open library Open library 197245

Books count in a bar graph

In [6]:
df['Count'].plot(kind = 'bar',title="Total books")
Out[6]:
<matplotlib.axes.AxesSubplot at 0x2ca5f50>
In [7]:
#Plot graphs based on publish date
cursor.execute("select pubdate,'Hathitrust' as Source, COUNT(1) from ht_books group by pubdate, 'Hathitrust' union select publish_date,'Openlibrary' as Source, COUNT(1) from ol_books group by publish_date,'Openlibrary' union select created,'Gutenberg' as Source, COUNT(1) from gut_books group by created,'Gutenberg'")
dates = cursor.fetchall()
In [8]:
dateslist = list(dates)
dfdates=pd.DataFrame(dateslist,columns=["Year","Source","Count"])
#display(HTML(dfdates[:-25].to_html()))

Number of books by Year

In [9]:
#Parsing the dates field to fetch only the year, remove redundant text.
import re
def returnnumber(num): 
    try:
        tmp = re.findall(("(\d{4})") ,num)
        if (len(tmp)==1):
            return int("".join(tmp))
        else:
            return None
    except:
        return None
    
In [10]:
#Removing null fields from the dataframe to add another column called FilteredDate to store parsed date.
dfdates.dropna()
s = pd.Series()
s = dfdates.Year.apply(returnnumber)
dfdates["FilteredDate"]=s
In [11]:
dfdates.sort('FilteredDate')
dfdates = dfdates[(dfdates.FilteredDate>1990) & (dfdates.FilteredDate<2007)]
#display(HTML(dfdates.to_html()))
In [12]:
#Plot the books publishes by year between 1990 and 2002
d = dfdates.pivot_table('Count', rows='FilteredDate', cols='Source', aggfunc=sum)
d
Out[12]:
Source Gutenberg Hathitrust Openlibrary
FilteredDate
1991 5 58774 1389
1992 13 52561 1498
1993 36 32109 1832
1994 58 16576 1724
1995 110 16136 1620
1996 335 14864 1649
1997 335 13812 1679
1998 350 12870 1703
1999 344 19557 1630
2000 356 12103 1817
2001 332 11023 1427
2002 388 10277 1423
2003 1481 10003 1398
2004 7124 9891 1262
2005 5145 9425 1222
2006 3637 8401 1147
In [13]:
d.plot(title="Total books collection by year", linewidth=2.5 )
Out[13]:
<matplotlib.axes.AxesSubplot at 0x31e4bd0>
In [14]:
def ipynb_input(varname, prompt=''):
    """Prompt user for input and assign string val to given variable name."""
    js_code = ("""
        var value = prompt("{prompt}","");
        var py_code = "{varname} = '" + value + "'";
        IPython.notebook.kernel.execute(py_code);
    """).format(prompt=prompt, varname=varname)
    return IPython.core.display.Javascript(js_code)
In [15]:
ipynb_input("fromyear", prompt='Enter the from year: ')
Out[15]:
In [16]:
ipynb_input("toyear", prompt='Enter the to year: ')
Out[16]:
In [17]:
dfdatesuser = dfdates[(dfdates.FilteredDate>int(fromyear)) & (dfdates.FilteredDate<int(toyear))]
duser = dfdatesuser.pivot_table('Count', rows='FilteredDate', cols='Source', aggfunc=sum)
duser.plot(title="Books collection by year", linewidth=2.5 )
Out[17]:
<matplotlib.axes.AxesSubplot at 0x3316e10>

Plot number of books by language

In [18]:
cursor.execute("select lang as lang,count(1) as count,'Hathitrust' as Source from ht_books where lang is not null group by (lang)") 
dflang_ht = pd.DataFrame(list(cursor.fetchall()),columns=['Language','Count','Source'])
cursor.execute("select lang as lang ,count(1) as count,'Gutenberg' as Source from gut_books where lang is not null group by (lang)")
dflang_gb = pd.DataFrame(list(cursor.fetchall()),columns=['Language','Count','Source'])
cursor.execute("select language as lang,count(1) as count, 'Openlibrary' as Source from ol_books where language is not null group by (language)")
dflang_ol = pd.DataFrame(list(cursor.fetchall()),columns=['Language','Count','Source'])
In [8]:
#Function to translate date abbreviation to their full form
def language(l):
    langdict= { 'eng':'English',
                'ger':'German',
                'fre':'French',
                'chi':'Chinese',
                'jpn':'Japan',
                'rus':'Russian',
                'en':'English',
                'fr':'French',
                'de':'Dutch',
                'nl':'Dutch',
                'pt':'Portugese',
                'spa':'Spanish',
                'ita':'Italian',
                'lat':'Latin',
                'fr':'French',
                'fi':'Finnish'    }
    try:
        if len(l)>3:
            tmp= l[-3:]
        else:
            tmp=l
        return langdict[tmp]
    except:
        return 'Other'
    
In [20]:
dflang_gb6= dflang_gb[dflang_gb.Language.str.strip()!=""].sort('Count')[::-1][:6]
dflang_ol6 = dflang_ol[dflang_ol.Language.str.strip()!=""].sort('Count')[::-1][:6]
dflang_ht6 = dflang_ht[dflang_ht.Language.str.strip()!=""].sort('Count')[::-1][:6]
dflang_gb6['Lang']= dflang_gb[dflang_gb.Language.str.strip()!=""].sort('Count')[::-1][:6].Language.apply(language)
dflang_ol6['Lang']= dflang_ol[dflang_ol.Language.str.strip()!=""].sort('Count')[::-1][:6].Language.apply(language)
dflang_ht6['Lang']= dflang_ht[dflang_ht.Language.str.strip()!=""].sort('Count')[::-1][:6].Language.apply(language)
df_mer_lang = dflang_gb6.append(dflang_ol6).append(dflang_ht6)

Books by language

In [21]:
df_mer_lang = df_mer_lang.pivot_table('Count', rows='Lang', cols='Source', aggfunc=sum)
df_mer_lang.fillna(0)
Out[21]:
Source Gutenberg Hathitrust Openlibrary
Lang
Chinese 0 221058 0
Dutch 1520 0 0
English 35198 2352951 108873
Finnish 712 0 0
French 2158 343066 14479
German 0 398413 9456
Italian 0 0 2323
Japan 0 212208 0
Latin 0 0 1898
Portugese 528 0 0
Russian 0 147235 0
Spanish 0 0 2785

Plot books count by language for the three data sources

In [22]:
fig = plt.figure()
fig.set_size_inches(10,10)
ax1 = fig.add_subplot(2, 2, 1)
df_mer_lang.fillna(0)['Gutenberg'].plot(kind="barh", title="Gutenberg",color='r')
ax2 = fig.add_subplot(2, 2, 2)
df_mer_lang.fillna(0)['Openlibrary'].plot(kind="barh", title="Openlibrary",color='b')
ax3 = fig.add_subplot(2, 2, 3)
print "\n"
print "\n"
df_mer_lang.fillna(0)['Hathitrust'].plot(kind="barh", title="Hathitrust",color='g')



Out[22]:
<matplotlib.axes.AxesSubplot at 0x384ed10>

Number of books by format

In [23]:
query = " ".join(["select COUNT(FORM) AS formatcnt, FORM as format, Source",
"from( select case WHEN gut_files.FORMAT LIKE '%epub%' then 'epub'", 
"else case WHEN gut_files.FORMAT LIKE '%pdf%' THEN 'pdf' ",
"else case WHEN gut_files.format like '%text%' THEN 'text'  END END END AS FORM ,  'Gutenberg' as Source",
"from gut_files) a",
"group by FORM , Source",
"union SELECT count('PDF') as formatcnt,'pdf'  as format, 'Hathitrust'as Source from ht_books where Access='allow' group by 'PDF','Hathitrust' ",
"union select count('epub') as formatcnt,'epub' as format, 'Openlibrary' as Source from ol_books group by 'epub','Openlibrary' ",
"union select count('text') as formatcnt, 'text' as format, 'Openlibrary' as Source from ol_books group by 'text','Openlibrary' ",
"union select count('pdf') as formatcnt, 'pdf' as format, 'Openlibrary' as Source from ol_books group by 'pdf','Openlibrary' "
])
cursor.execute(query)
formatres = list(cursor.fetchall())
dfformatres = pd.DataFrame(formatres,columns=['Count','Format','Source'])
In [24]:
dfformatres = dfformatres.pivot_table('Count', rows='Format', cols='Source', aggfunc=sum)
In [25]:
dfformatres.plot(kind="bar", stacked=True)
Out[25]:
<matplotlib.axes.AxesSubplot at 0x33f7650>

Code for Book Query

In [3]:
def maketable(df):
    imagedict={6:"http://fchasen.com/cal/open-data/icons/pdf-icon.png",
               8:"http://fchasen.com/cal/open-data/icons/epub-icon.png",
               7:"http://fchasen.com/cal/open-data/icons/text-icon.png"}
    columns=df.columns
    
    table ="<table><thead>"
    table +=" <tr><th>Results</th>"
    for i in columns:
        table += "<th>{0}</th>".format(i)
    table += "</tr></thead><tbody>"
    for k,s  in df.iterrows():
        table +="<tr>"
        table +=    "<th>{0}</th>".format(k)
    
        for colnum,j in enumerate(s):
            #print type(j)
            try:
                if (colnum in [6,7,8]):
                    if j is not None and j[:4]=="http":
                        table +=    "<td style='word-break:break-word;'><a href='{0}' target='_blank' style='word-break:break-word;'><img src='{1}' style='height:50px; margin: 0 auto;'/></td>".format(j,imagedict[colnum])
                    else:
                        table +=    "<td style='text-align:center'><img src='http://fchasen.com/cal/open-data/icons/none-icon.png' style='height:30px; margin: 10px auto;'/></td>"
                else:
                    table +=    "<td>{0}</td>".format(j.encode('utf-8'))
            except:
                table +=    "<td>{0}</td>".format(j)
        table += "</tr>"

    table += "</tbody></table>"
    return table
In [4]:
def searchBook(q, o="title"):
    if o == "title":
        g = "title"
        ol = "title"
        h = "Title"
        
    if o == "author":
        ol = "ol_authors.name"
        g = "creator"
        h = "Imprint" 
        
    
    query = "".join(["select distinct 'Gutenberg' as source, g.title as title, " ,
    "g.creator as author,  g.lang as language,'Unknown' as pubdate, 'public domain' as rights, ",
    "(select about from gut_files where format like '%pdf%' and etext_id=g.etext_id limit 1) as pdf, ",
    "(select about from gut_files where format like '%text%' and etext_id=g.etext_id limit 1) as text, ",
    "(select about from gut_files where format like '%epub%' and etext_id=g.etext_id limit 1) as epub ",
    "from gut_books g  ",
    "where lower("+g+") like lower('%"+q+"%')  "  ,
    "union ",
    "select distinct 'Hathitrust' as source, Title as title,  ",
    "'Unknown' as author, lang as language, PubDate as pubdate, case Access when 'allow' then 'public domain' else 'non public' end as rights, ",
    "concat('http://hdl.handle.net/2027/',VolumeID) as pdf,'N/A' as epub, 'N/A' as text from ",
    "ht_books where lower("+h+") like lower('%"+q+"%') and Access = 'allow' ",
    "union ",
    "select distinct 'OpenLibrary' as source, title, name as author, ",
    "language, publish_date as pubdate, 'Unknown' as rights, ",
    "concat('http://www.archive.org/download/',ocaid, '/', ocaid, '.pdf') as pdf, ",
    "concat('http://www.archive.org/download/',ocaid, '/', ocaid, '_djvu.txt') as text, ",
    "concat('http://www.archive.org/download/',ocaid, '/', ocaid, '.epub') as epub ",
    "from ol_books LEFT JOIN ol_authors ON ol_books.author_key = ol_authors.key where lower("+ol+") like lower('%"+q+"%')"])
    
    cursor.execute(query)
    res = list(cursor.fetchall())
    
    if(len(res)):
        dfresult = pd.DataFrame(res ,columns=['source','title', 'author','language','pubdate','rights','pdf','text','epub'] )
        dfresult['language']=dfresult.language.apply(language)
        print "Total Results:",len(dfresult)
        return dfresult
    else:
        print "No results returned, try again"
        return False
In [5]:
inputHTML = '''
<input id="book-search" stype="width:400px">
    <select id="book-select">
       <option value="title">Title</option>
      <option value="author">Author</option>
    </select> 
<input type="submit" id="book-submit" value="Search">'''

js_code = ("""
    var $body = $("body");
        
    $body.on('click', '#book-submit', function() {
        var $search = $('#book-search'),
            $select = $('#book-select');
        
        var py_code1 = "queryterm = '" + $search.val() + "'";
        var py_code2 = "on = '" + $select.val() + "'";

        console.log('submit:', $search.val(), $select.val());
        
        IPython.notebook.kernel.execute(py_code1);
        IPython.notebook.kernel.execute(py_code2);

    });
    

    """)


IPython.core.display.Javascript(js_code)
Out[5]:

Search for a Book

In [6]:
HTML(inputHTML)
Out[6]:

Total search results

In [17]:
dfresult = searchBook(queryterm, on)
Total Results: 12

Distribution by source

In [18]:
source = dfresult.groupby('source')
source['source'].value_counts().plot(kind="bar", title="Count distribution of searched results")
Out[18]:
<matplotlib.axes.AxesSubplot at 0x3ac8b10>

Distribution by language

In [19]:
lbl= dfresult['language'].value_counts().index
#dfresult.value_counts().plot(kind="bar")
pie(dfresult['language'].value_counts(), labels=lbl, autopct='%1.1f%%')
#type(dfresult['language'])
Out[19]:
([<matplotlib.patches.Wedge at 0x3f58990>],
 [<matplotlib.text.Text at 0x3f58e50>],
 [<matplotlib.text.Text at 0x3f58f50>])
In [20]:
display(HTML(maketable(dfresult)))
Resultssourcetitleauthorlanguagepubdaterightspdftextepub
0GutenbergMoby DickMelville, Herman, 1819-1891EnglishUnknownpublic domain
1GutenbergMoby Dick: or, the White WhaleMelville, Herman, 1819-1891EnglishUnknownpublic domain
2GutenbergMoby Dick, or, the whaleMelville, Herman, 1819-1891EnglishUnknownpublic domain
3GutenbergMoby DickMelville, Herman, 1819-1891EnglishUnknownpublic domain
4GutenbergMoby DickMelville, Herman, 1819-1891EnglishUnknownpublic domain
5HathitrustThe meaning of Moby Dick, by William S. Gleim.UnknownEnglish1938public domain
6HathitrustThe meaning of Moby Dick.UnknownEnglish1962public domain
7HathitrustMoby Dick,UnknownEnglish1929public domain
8HathitrustMoby Dick,UnknownEnglish1929public domain
9OpenLibraryMoby Dick With Readers Guide (R 89 ALP)NoneEnglishJune 1970Unknown
10OpenLibraryMoby DickNoneEnglish1966Unknown
11OpenLibraryMoby DickNoneEnglish1988Unknown

APPENDIX

Hathitrust results

In [13]:
if on == "title":
    h = "Title"
        
if on == "author":
    h = "Imprint"#Fetching books only from hathitrust


query = "".join(["select Title as title, 'Unknown' as subtitle, 'Unknown' as author,",
"lang as language, PubDate as pubdate, Access as rights, concat('http://hdl.handle.net/2027/',VolumeID) as link,",
"'pdf' as format from ht_books where "+h+" like '%" ,
queryterm,
"%'"])
query
cursor.execute(query)
htres = list(cursor.fetchall())
dfht= pd.DataFrame(htres , columns=['title','subtitle','author','language','pubdate','rights','link','format'])
In [14]:
#dislaying results from hathitrust
display(HTML(dfht.to_html()))
title subtitle author language pubdate rights link format
0 The hound of the Baskervilles : another advent... Unknown Unknown eng 1921 allow http://hdl.handle.net/2027/mdp.39015062324325 pdf
In [35]:
dfht.pubdate.apply(returnnumber).value_counts().plot(kind="barh", title = "number of books published per year for the queries book")
Out[35]:
<matplotlib.axes.AxesSubplot at 0x3fefa50>
In [36]:
dfht.language.apply(language)
lbl = dfht.language.apply(language).value_counts().index
pie(dfht.language.apply(language).value_counts(), labels=lbl, autopct='%1.1f%%')
Out[36]:
([<matplotlib.patches.Wedge at 0x487b410>,
  <matplotlib.patches.Wedge at 0x487ba50>,
  <matplotlib.patches.Wedge at 0x487bfd0>,
  <matplotlib.patches.Wedge at 0x487d5d0>,
  <matplotlib.patches.Wedge at 0x487db90>,
  <matplotlib.patches.Wedge at 0x4882190>,
  <matplotlib.patches.Wedge at 0x4882750>],
 [<matplotlib.text.Text at 0x487b8d0>,
  <matplotlib.text.Text at 0x487bf10>,
  <matplotlib.text.Text at 0x487d510>,
  <matplotlib.text.Text at 0x487dad0>,
  <matplotlib.text.Text at 0x487dfd0>,
  <matplotlib.text.Text at 0x4882690>,
  <matplotlib.text.Text at 0x4882c50>],
 [<matplotlib.text.Text at 0x487b9d0>,
  <matplotlib.text.Text at 0x487bad0>,
  <matplotlib.text.Text at 0x487d090>,
  <matplotlib.text.Text at 0x487d650>,
  <matplotlib.text.Text at 0x487dc10>,
  <matplotlib.text.Text at 0x4882210>,
  <matplotlib.text.Text at 0x48827d0>])

Analyis of Hathitrust's non digitized data. This consists of volume of books before 1923 https://ec2-54-225-70-191.compute-1.amazonaws.com:80/25162c63-57c0-4c4d-b809-5c6a8f55b11b