We are using Python 3 today, but this code should work almost without change in Python 2. Two libraries are required: ebooklib and BeautifulSoup (bs4).
Both can be installed with pip.
We are also going to use the standard library component re:
import ebooklib
from ebooklib import epub
from bs4 import BeautifulSoup
import re
ebooklib.epub breaks the epub up into items. These are files within the zip archive. Generally most books have one item (i.e. file) per chapter. That is the case for our book.
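Since an epub is a zip archive, the one-file-per-chapter layout can be seen with nothing more than the standard library. The sketch below builds a tiny stand-in archive in memory; the file names in it are made up for illustration and are not taken from the real book.

```python
import io
import zipfile

# Build a tiny stand-in archive in memory (hypothetical file names).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mimetype", "application/epub+zip")
    archive.writestr("OEBPS/b01-c01.xhtml", "<h1>BRAN</h1>")
    archive.writestr("OEBPS/b01-c02.xhtml", "<h1>CATELYN</h1>")
    archive.writestr("OEBPS/cover.jpg", b"")

# Read it back: every entry is one file, and each chapter is its own file.
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()

chapter_files = [name for name in names if name.endswith(".xhtml")]
print(chapter_files)
```

To look inside a real book, pass its path to zipfile.ZipFile instead of the in-memory buffer.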
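Of these items, there are three categories we want to keep: items that are not chapters at all, chapters that are not tied to a single character (such as the appendix), and chapters for our chosen character.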
def is_not_chapter(item):
    # Chapters are document items; anything else is kept untouched.
    return item.get_type() != ebooklib.ITEM_DOCUMENT
All the normal chapters in our case are named along the lines of b01-c01, for book 1 (as it is a compilation) chapter 1. Special chapters like the appendix don't follow this pattern. We can check for it with a regex.
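def is_univeral_chapter(chapter):
    # Normal chapters have ids like b01-c01; anything else is shared by all characters.
    # A raw string keeps \d from being treated as a string escape.
    return not re.match(r"b\d\d.c\d\d", chapter.get_id())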
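To see what the pattern accepts, here is a small sketch that runs it over a handful of ids. The ids are hypothetical, chosen only to show both outcomes.

```python
import re

# Same pattern as above: "b", two digits, any one character, "c", two digits.
CHAPTER_ID = re.compile(r"b\d\d.c\d\d")

# Hypothetical item ids, made up for illustration.
sample_ids = ["b01-c01", "b04-c46", "appendix", "cover", "b01-prologue"]

# Ids that do NOT look like a numbered chapter are the "universal" ones.
universal = [item_id for item_id in sample_ids if not CHAPTER_ID.match(item_id)]
print(universal)  # ['appendix', 'cover', 'b01-prologue']
```

Note that re.match only anchors at the start of the string, so an id with extra text after the chapter number would still count as a normal chapter.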
In this particular book all the character names are in the chapter headings.
However, it represents them in two different ways. In some sections the heading is an <h1> element; in others it is a <p class="ct"> element. We'll check for both.
Notice that is_character is a higher-order function: it returns a function. That makes it work nicely with filter, which is useful for testing if you've already stripped down to just the normal chapters.
filter(is_character("JON"), chapters)
def is_character(name):
    def inner(chapter):
        # Name the parser explicitly so BeautifulSoup does not have to guess one.
        soup = BeautifulSoup(chapter.get_content(), 'html.parser')
        # Try each way of marking a heading, in order.
        heading_matchers = [lambda: soup.find_all('h1'),
                            lambda: soup.find_all(class_='ct')]
        headings = []
        for matcher in heading_matchers:
            headings = matcher()
            if len(headings) > 0:
                break
        else:
            # The loop never hit break: no heading of either kind was found.
            return False
        assert len(headings) == 1
        heading = headings[0]
        chapter_character_name = heading.text.strip()
        return chapter_character_name == name
    return inner
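The for/else in the function above is easy to misread, so here is the same control flow in isolation. This is only a sketch: first_nonempty and the stand-in matchers are hypothetical names, and plain lists take the place of BeautifulSoup results.

```python
def first_nonempty(matchers):
    """Return the first non-empty result, or None if every matcher comes back empty."""
    result = None
    for matcher in matchers:
        result = matcher()
        if result:
            break        # found something, so the else branch is skipped
    else:
        return None      # runs only when the loop finished without break
    return result


# Stand-in matchers: the first finds nothing, the second finds one heading.
print(first_nonempty([lambda: [], lambda: ["JON"]]))  # ['JON']
print(first_nonempty([lambda: [], lambda: []]))       # None
```

The else belongs to the for, not to the if: it is the "nothing matched" branch.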
Another higher-order function, again to make it work with filter. In this case it is a closure.
def keep_item(character_name):
    # Build the character test once; inner closes over it.
    is_our_character = is_character(character_name)
    def inner(item):
        return (is_not_chapter(item)
                or is_univeral_chapter(item)
                or is_our_character(item))
    return inner
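The same shape can be tried out without an epub at hand. The sketch below is a generalisation, not the function above: keep_if_any, the three small predicates, and the tuple stand-ins for items are all hypothetical.

```python
def keep_if_any(*predicates):
    """Build a predicate that is true when any of the given predicates is true."""
    def inner(item):
        # inner closes over predicates, just as keep_item's inner closes over its test.
        return any(predicate(item) for predicate in predicates)
    return inner


def is_not_document(item):
    return item[1] != "document"


def is_unnumbered(item):
    return not item[0].startswith("b")


def is_jon(item):
    return item[2] == "JON"


# Hypothetical stand-ins for epub items: (id, kind, heading) tuples.
items = [
    ("cover", "image", None),
    ("b01-c01", "document", "BRAN"),
    ("b01-c05", "document", "JON"),
    ("appendix", "document", None),
]

# filter returns a lazy iterator in Python 3, so wrap it in list() to get a list back.
kept = list(filter(keep_if_any(is_not_document, is_unnumbered, is_jon), items))
print([item[0] for item in kept])  # ['cover', 'b01-c05', 'appendix']
```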
We'll also modify the title, so the new book doesn't get confused with the original. There is also a helper function below to work out the new filename.
def rewrite_book_by_character(filename, character):
    book = epub.read_epub(filename)
    # Keep only the items that pass the filter for this character.
    book.items = list(filter(keep_item(character), book.items))
    book.title += ": " + character + " POVs_ONLY"
    new_filename = get_new_filename(filename, character)
    epub.write_epub(new_filename, book, {})
    return new_filename
def get_new_filename(filename, character):
    import os.path
    filename_base, ext = os.path.splitext(filename)
    new_filename = filename_base + "_" + character + "_ONLY" + ext
    return new_filename
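As a quick check of the naming scheme, here is a standalone copy of the helper under the hypothetical name new_name, run on two paths. The second path is made up to show that only the last extension is split off.

```python
import os.path


def new_name(filename, character):
    # Same approach as get_new_filename: split off the extension, then rebuild.
    base, ext = os.path.splitext(filename)
    return base + "_" + character + "_ONLY" + ext


print(new_name("asoiaf01-04.epub", "JON"))       # asoiaf01-04_JON_ONLY.epub
print(new_name("books/asoiaf.v2.epub", "ARYA"))  # books/asoiaf.v2_ARYA_ONLY.epub
```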
from IPython.display import FileLink
filename = rewrite_book_by_character('asoiaf01-04.epub', "JON")
FileLink(filename)
Copyright (c) 2015 Lyndon White
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.