#!/usr/bin/env python
# coding: utf-8

# # Python classes and iterators for the Buckeye Corpus
# 
# This package is for iterating through the [Buckeye Corpus](http://buckeyecorpus.osu.edu/) annotations in Python. It uses the annotation timestamps to cross-reference the .words, .phones, and .log files, and can be used to extract sound clips from the .wav files. It is available on [PyPI](https://pypi.org/) under the name `buckeye` and is tested to work with Python 2.7 and 3.6.
# 
# This document contains a short tutorial for using the package. The source for the package is on [GitHub](https://github.com/scjs/buckeye). The docstrings in `buckeye.py`, `containers.py`, and `utterance.py` have more detail about usage.
# 
# If you need to create Praat TextGrids based on the Buckeye Corpus, the [tgre](https://github.com/scjs/tgre) or [textgrid](https://github.com/kylebgorman/textgrid) packages may be helpful. You can also read the unzipped corpus files directly into Praat with this Praat menu option: `Open -> Read from special tier file -> Read IntervalTier from Xwaves`

# # Corpus organization
# 
# The corpus files are organized into one zipped archive per speaker. Forty speakers were interviewed for the corpus, and each speaker is assigned a code-name, from `s01` through `s40`. Each speaker's interview is divided into about six tracks. For example, the first track for the first speaker is called `s0101a`. Each track has three annotation files (e.g., `s0101a.words`, `s0101a.phones`, and `s0101a.log`), each of which contains the time-alignments for one annotation tier. Each track also includes a `.wav` file with the audio and a `.txt` file containing a list of speaker turns without time-alignments.

# # Speaker
# 
# A `Speaker` instance is created by calling `Speaker.from_zip` with the path to one
# of the zipped speaker archives, which can be downloaded from the corpus website.

# In[1]:


from __future__ import print_function  # keeps print(..., end=' ') valid on Python 2.7

import buckeye

speaker = buckeye.Speaker.from_zip('speakers/s01.zip')
speaker


# This will open and process the annotations in each of the sub-archives inside
# the speaker archive (the tracks, such as `s0101a` and `s0101b`). If an optional
# `load_wavs` argument is set to `True` when creating a `Speaker` instance, the
# wav files associated with each track will also be loaded into memory:

# In[2]:


speaker = buckeye.Speaker.from_zip('speakers/s01.zip', load_wavs=True)
speaker


# Otherwise, only the annotations are loaded.
# 
# Each `Speaker` instance has the speaker's code-name, sex, age, and interviewer
# sex available as attributes.

# In[3]:


print(speaker.name)
print(speaker.sex) # f for female, m for male
print(speaker.age) # o for old, y for young
print(speaker.interviewer) # f for female, m for male


# The tracks can be accessed by iterating through the `Speaker` instance. There is more detail about accessing the annotations below under the heading **Track**.

# In[4]:


for track in speaker:
    print(track.name)


# The tracks can also be accessed as a list through the `tracks` attribute.

# In[5]:


print(speaker.tracks)


# # Track
# 
# Each speaker has about six tracks. Individual tracks can be retrieved by indexing the `Speaker` instance.

# In[6]:


speaker


# In[7]:


track = speaker[0]
track


# The annotations and recordings for each track are stored in the `words`, `phones`, `log`, `txt`, and `wav` attributes. See below for more information on each attribute.

# If you don't want to load all of the tracks for a speaker, there are two ways to load tracks individually.
# 
# First, you can call the `Track.from_zip` method directly on a zipped track archive.

# In[8]:


another_track = buckeye.Track.from_zip('s0303b.zip')
another_track


# Second, if you're working with the original uncompressed files, you can create a track without using the `from_zip` method by passing filepaths for the five track files (ending in `.words`, `.phones`, `.log`, `.txt`, and `.wav`), plus the name of the track, as arguments. For example:

# In[9]:


another_track = buckeye.Track(name='s0303b',
                              words='tracks/s0303b/s0303b.words',
                              phones='tracks/s0303b/s0303b.phones',
                              log='tracks/s0303b/s0303b.log',
                              txt='tracks/s0303b/s0303b.txt',
                              wav='tracks/s0303b/s0303b.wav' # wav is optional
                             )

another_track


# ### Words
# 
# The `words` attribute stores a list of Word and Pause instances, created from the `.words` file.

# In[10]:


track.words[:10]


# Word instances have these nine attributes:
# 
# * `orthography` - the word's spelling
# * `beg` - the timestamp when the word begins (relative to the start of the track), in seconds
# * `end` - the timestamp when the word ends
# * `dur` - the duration of the word
# * `phonemic` - the canonical transcription
# * `phonetic` - the close transcription
# * `pos` - the word's part of speech
# * `misaligned` - marked as True if the word has a negative duration, or if the phonetic transcription doesn't match what's in the `.phones` file
# * `phones` - a list of references to Phone instances that have the labels and timestamps for the phonetic transcription

# In[11]:


word = track.words[4]

print(word.orthography)
print(word.beg)
print(word.end)
print(word.dur)
print(word.phonemic)
print(word.phonetic)
print(word.pos)
print(word.misaligned)


# The phones are retrieved based on the timestamps for the word and for the entries in the `.phones` file.

# In[12]:


word.phones


# Phones have four attributes:
# 
# * `seg` - the pseudo-ARPABET transcription of the phone
# * `beg` - the timestamp when the phone begins (relative to the start of the track), in seconds
# * `end` - the timestamp when the phone ends
# * `dur` - duration

# In[13]:


for phone in word.phones:
    print(phone.seg, phone.beg, phone.end, phone.dur)
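

# The `seg` and `dur` attributes are enough to summarize how long each kind of
# phone lasts. The following is a minimal sketch rather than part of the
# package: `mean_phone_durations` is a hypothetical helper name, and it assumes
# that phones without a usable `dur` value should simply be skipped.

# In[ ]:


from collections import defaultdict


def mean_phone_durations(phones):
    """Return a dict mapping each `seg` label to its mean duration in seconds."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for phone in phones:
        # Skip phones whose duration is missing or unavailable.
        dur = getattr(phone, 'dur', None)
        if dur is None:
            continue
        totals[phone.seg] += dur
        counts[phone.seg] += 1
    return {seg: totals[seg] / counts[seg] for seg in totals}


# For example, `mean_phone_durations(word.phones)` summarizes one word, and
# `mean_phone_durations(track.phones)` summarizes a whole track.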


# Many of the annotations are things like `<SIL>` (silence) or `<IVER>` (the interviewer's turn). These are stored as Pause instances, rather than Word instances. Pause instances have six attributes:
# 
# * `entry` - the kind of Pause, e.g. `<SIL>`
# * `beg` - when the pause begins
# * `end` - when the pause ends
# * `dur` - duration
# * `misaligned` - marked as True if the Pause has a negative duration
# * `phones` - a list of references to Phone instances that are associated with this Pause, e.g. one or more `SIL` tokens

# In[14]:


pause = track.words[1]

print(pause.entry, pause.beg, pause.end, pause.dur, pause.misaligned)


# In[15]:


pause.phones
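

# Because Word and Pause instances share one list, it is often useful to
# separate them. The sketch below is not part of the package:
# `split_words_and_pauses` is a hypothetical helper, and it assumes that only
# Word instances have an `orthography` attribute, as the attribute lists in
# this tutorial suggest.

# In[ ]:


def split_words_and_pauses(items):
    """Split a list of annotations into (words, pauses), preserving order."""
    words = []
    pauses = []
    for item in items:
        # Assumption: anything without an `orthography` is treated as a pause.
        if hasattr(item, 'orthography'):
            words.append(item)
        else:
            pauses.append(item)
    return words, pauses


# For example, `words, pauses = split_words_and_pauses(track.words)`.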


# ### Phones
# 
# The phones in a track can also be accessed directly through the `phones` attribute of a Track instance.

# In[16]:


for phone in track.phones[:10]:
    print(phone.seg, phone.beg, phone.end, phone.dur)


# ### Log
# 
# The list of entries in the Track's `.log` file can be accessed through the `log` attribute, which stores a list of the `LogEntry` instances for the Track.
# 
# `LogEntry` instances have `entry`, `beg`, `end`, and `dur` attributes.

# In[17]:


for log in track.log:
    print(log.entry, log.beg, log.end, log.dur)


# You can call the `get_logs()` method of a Track to retrieve the log entries that overlap with the given timestamps.
# 
# For example, the log entries that overlap with the interval from 60 seconds to 62 seconds can be found like this:

# In[18]:


logs = track.get_logs(60.0, 62.0)

for log in logs:
    print(log.entry, log.beg, log.end)
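

# The same method can cross-reference the log with other annotations, because
# Word and Pause instances also carry `beg` and `end` timestamps. A minimal
# sketch, assuming a hypothetical helper named `log_entries_for` that is not
# part of the package:

# In[ ]:


def log_entries_for(track, item):
    """Return the labels of the log entries that overlap an annotated item."""
    return [log.entry for log in track.get_logs(item.beg, item.end)]


# For example, `log_entries_for(track, track.words[4])` lists the log labels
# that overlap the fifth entry in the track's `words` list.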


# ### Txt

# The `txt` attribute holds a list of speaker turns without timestamps, read from
# the `.txt` file in the track.

# In[19]:


track.txt[1]


# ### Wav
# 
# If a Speaker instance is created with `load_wavs=True`, each Track will also have a `wav` attribute that stores a `Wave_read` instance from Python's standard `wave` module.

# In[20]:


speaker = buckeye.Speaker.from_zip('speakers/s01.zip', load_wavs=True)
track = speaker[0]

track.wav


# You can extract sound clips from the wav file with the `clip_wav()` method of each Track:

# In[21]:


track.clip_wav('myclip.wav', 60.0, 62.0)


# This will create a wav file in the current directory called `myclip.wav` which
# contains the sound between 60 and 62 seconds in the track audio.
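# 
# Because Word instances carry their own timestamps, the same method can cut
# out every token of a given word. The following is a sketch under stated
# assumptions, not part of the package: `clip_word_tokens` and its file-naming
# scheme are hypothetical, the track must have been loaded with its wav file,
# and the output directory must already exist.

# In[ ]:


import os


def clip_word_tokens(track, orthography, out_dir='.'):
    """Write one clip per aligned token of `orthography`; return the paths."""
    paths = []
    for index, item in enumerate(track.words):
        # Pause instances are assumed to have no `orthography`, so they never match.
        if getattr(item, 'orthography', None) != orthography:
            continue
        # Misaligned words may have a negative duration, so skip them.
        if item.misaligned:
            continue
        filename = '{}_{}_{:04d}.wav'.format(track.name, orthography, index)
        path = os.path.join(out_dir, filename)
        track.clip_wav(path, item.beg, item.end)
        paths.append(path)
    return paths


# For example, `clip_word_tokens(track, 'okay')` writes one wav file to the
# current directory for each aligned token of "okay" in the track.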

# # Corpus generator

# The `corpus()` generator function is a convenience for iterating through all of
# the speaker archives together. Put all forty speaker archives in one directory,
# such as a directory named `speakers`. Create a new generator with this directory as an argument.

# In[22]:


corpus = buckeye.corpus('speakers/')


# The generator will yield the `Speaker` instances in numerical order. Like any generator, it can only be iterated over once; call `corpus()` again to start a new pass.

# In[23]:


for speaker in corpus:
    print(speaker.name, end=' ')


# If you're using a `corpus()` generator, you can set `load_wavs` to `True` and it will be passed down to every `Track` instance, so that all of the wav files will be loaded.

# In[24]:


corpus = buckeye.corpus('speakers/', load_wavs=True)
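

# Since each `Speaker` yields tracks and each track has a `words` list, the
# generator can drive corpus-wide summaries. The following is a minimal sketch,
# not part of the package: `count_orthographies` is a hypothetical helper, and
# it assumes that Pause instances have no `orthography` attribute.

# In[ ]:


from collections import Counter


def count_orthographies(speakers):
    """Count how often each spelling occurs across the given speakers."""
    counts = Counter()
    for speaker in speakers:
        for track in speaker:
            for item in track.words:
                # Entries without an `orthography` (assumed to be pauses) are skipped.
                orthography = getattr(item, 'orthography', None)
                if orthography is not None:
                    counts[orthography] += 1
    return counts


# For example, `count_orthographies(buckeye.corpus('speakers/')).most_common(10)`
# lists the ten most frequent spellings. Loading the wav files is not needed
# for this, so `load_wavs` can be left unset.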