#!/usr/bin/env python
# coding: utf-8

# # Python classes and iterators for the Buckeye Corpus
#
# This package is for iterating through the [Buckeye Corpus](http://buckeyecorpus.osu.edu/) annotations in Python. It uses the annotation timestamps to cross-reference the .words, .phones, and .log files, and can be used to extract sound clips from the .wav files. It is available on [PyPI](https://pypi.org/) under the name `buckeye` and is tested to work with Python 2.7 and 3.6.
#
# This document contains a short tutorial for using the package. The source for the package is on [GitHub](https://github.com/scjs/buckeye). The docstrings in `buckeye.py`, `containers.py`, and `utterance.py` have more detail about usage.
#
# If you need to create Praat TextGrids based on the Buckeye Corpus, the [tgre](https://github.com/scjs/tgre) or [textgrid](https://github.com/kylebgorman/textgrid) packages may be helpful. You can also read the unzipped corpus files directly into Praat with this Praat menu option: `Open -> Read from special tier file -> Read IntervalTier from Xwaves`

# # Corpus organization
#
# The corpus files are organized into one zipped archive per speaker. Forty speakers were interviewed for the corpus, and each speaker is assigned a code name, from `s01` through `s40`. Each speaker's interview is divided into about 6 tracks. For example, the first track for the first speaker is called `s0101a`. Each track contains three annotation files (e.g., `s0101a.words`, `s0101a.phones`, and `s0101a.log`), each of which contains the time alignments for one annotation tier. Each track also includes a `.wav` file with the audio, plus a `.txt` file containing a list of speaker turns without time alignments.

# # Speaker
#
# A `Speaker` instance is created by calling the `from_zip` method on one of the zipped speaker
# archives that can be downloaded from the corpus website.

# In[1]:

import buckeye

speaker = buckeye.Speaker.from_zip('speakers/s01.zip')
speaker

# This will open and process the annotations in each of the sub-archives inside
# the speaker archive (the tracks, such as `s0101a` and `s0101b`). If an optional
# `load_wavs` argument is set to `True` when creating a `Speaker` instance, the
# wav files associated with each track will also be loaded into memory:

# In[2]:

speaker = buckeye.Speaker.from_zip('speakers/s01.zip', load_wavs=True)
speaker

# Otherwise, only the annotations are loaded.
#
# Each `Speaker` instance has the speaker's code-name, sex, age, and interviewer
# sex available as attributes.

# In[3]:

print(speaker.name)
print(speaker.sex)          # f for female, m for male
print(speaker.age)          # o for old, y for young
print(speaker.interviewer)  # f for female, m for male

# The tracks can be accessed by iterating through the `Speaker` instance. There is more detail about accessing the annotations below under the heading **Track**.

# In[4]:

for track in speaker:
    print(track.name)

# The tracks can also be accessed as a list through the `tracks` attribute.

# In[5]:

print(speaker.tracks)

# # Track
#
# Each speaker has 6 or so tracks.

# In[6]:

speaker

# In[7]:

track = speaker[0]
track

# The annotations and recordings for each track are stored in the `words`, `phones`, `log`, `txt`, and `wav` attributes. See below for more information on each attribute.

# If you don't want to load all of the tracks for a speaker, there are two ways to load tracks individually.
#
# First, you can call the `Track.from_zip` method directly on a zipped track archive.

# In[8]:

another_track = buckeye.Track.from_zip('s0303b.zip')
another_track

# Second, if you're working with the original uncompressed files, you can create a track without using the `from_zip` method by passing filepaths for the five track files (ending in `.words`, `.phones`, `.log`, `.txt`, and `.wav`), plus the name of the track, as arguments.
# For example:

# In[9]:

another_track = buckeye.Track(
    name='s0303b',
    words='tracks/s0303b/s0303b.words',
    phones='tracks/s0303b/s0303b.phones',
    log='tracks/s0303b/s0303b.log',
    txt='tracks/s0303b/s0303b.txt',
    wav='tracks/s0303b/s0303b.wav'  # wav is optional
)
another_track

# ### Words
#
# The `words` attribute stores a list of Word and Pause instances, created from the `.words` file.

# In[10]:

track.words[:10]

# Word instances have these nine attributes:
#
# * `orthography` - the word's spelling
# * `beg` - the timestamp when the word begins (relative to the start of the track), in seconds
# * `end` - the timestamp when the word ends
# * `dur` - the duration of the word
# * `phonemic` - the canonical transcription
# * `phonetic` - the close transcription
# * `pos` - the word's part of speech
# * `misaligned` - marked as True if the word has a negative duration, or if the phonetic transcription doesn't match what's in the `.phones` file
# * `phones` - a list of references to Phone instances that have the labels and timestamps for the phonetic transcription

# In[11]:

word = track.words[4]

print(word.orthography)
print(word.beg)
print(word.end)
print(word.dur)
print(word.phonemic)
print(word.phonetic)
print(word.pos)
print(word.misaligned)

# The phones are retrieved based on the timestamps for the word and for the entries in the `.phones` file.

# In[12]:

word.phones

# Phones have four attributes:
#
# * `seg` - the pseudo-ARPABET transcription of the phone
# * `beg` - the timestamp when the phone begins (relative to the start of the track), in seconds
# * `end` - the timestamp when the phone ends
# * `dur` - duration

# In[13]:

for phone in word.phones:
    print(phone.seg, phone.beg, phone.end, phone.dur)

# Many of the annotations are things like `<SIL>` (silence) or `<IVER>` (the interviewer's turn). These are stored as Pause instances, rather than Word instances. Pause instances have six attributes:
#
# * `entry` - the kind of Pause, e.g. `<SIL>`
# * `beg` - when the pause begins
# * `end` - when the pause ends
# * `dur` - duration
# * `misaligned` - marked as True if the Pause has a negative duration
# * `phones` - a list of references to Phone instances that are associated with this Pause, e.g. one or more `SIL` tokens

# In[14]:

pause = track.words[1]
print(pause.entry, pause.beg, pause.end, pause.dur, pause.misaligned)

# In[15]:

pause.phones

# ### Phones
#
# The phones in a track can also be accessed directly through the `phones` attribute of a Track instance.

# In[16]:

for phone in track.phones[:10]:
    print(phone.seg, phone.beg, phone.end, phone.dur)

# ### Log
#
# The list of entries in the Track's `.log` file can be accessed through the `log` attribute, which stores a list of the `LogEntry` instances for the Track.
#
# `LogEntry` instances have `entry`, `beg`, `end`, and `dur` attributes.

# In[17]:

for log in track.log:
    print(log.entry, log.beg, log.end, log.dur)

# You can call the `get_logs()` method of a Track to retrieve the log entries that overlap with the given timestamps.
#
# For example, the log entries that overlap with the interval from 60 seconds to 62 seconds can be found like this:

# In[18]:

logs = track.get_logs(60.0, 62.0)

for log in logs:
    print(log.entry, log.beg, log.end)

# ### Txt
# The `txt` attribute holds a list of speaker turns without timestamps, read from
# the `.txt` file in the track.

# In[19]:

track.txt[1]

# ### Wav
#
# If a Speaker instance is created with `load_wavs=True`, each Track will also have a `wav` attribute that stores a `Wave_read` instance.

# In[20]:

speaker = buckeye.Speaker.from_zip('speakers/s01.zip', load_wavs=True)
track = speaker[0]
track.wav

# You can extract sound clips from the wav file with the `clip_wav()` method of each Track:

# In[21]:

track.clip_wav('myclip.wav', 60.0, 62.0)

# This will create a wav file in the current directory called `myclip.wav` which
# contains the sound between 60 and 62 seconds in the track audio.
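
# To make the idea of clipping by timestamp concrete, here is a rough sketch that uses only the standard-library `wave` module. It is not the package's implementation of `clip_wav()`: the helper name `clip_frames` is hypothetical, and the one-second, 16 kHz silent recording built in memory is an assumption made so the example runs without any corpus files.

```python
import io
import wave


def clip_frames(wav, beg, end):
    """Return the raw frames between beg and end (in seconds) of a Wave_read."""
    rate = wav.getframerate()
    first = int(round(beg * rate))  # index of the first frame to keep
    last = int(round(end * rate))   # index just past the last frame to keep
    wav.setpos(first)
    return wav.readframes(last - first)


# Build a one-second, 16-bit mono recording of silence in memory.
# This stands in for a track's audio; the parameters are illustrative.
buffer = io.BytesIO()
writer = wave.open(buffer, 'wb')
writer.setnchannels(1)
writer.setsampwidth(2)
writer.setframerate(16000)
writer.writeframes(b'\x00\x00' * 16000)
writer.close()

# Reopen the same bytes for reading and clip the middle half second.
buffer.seek(0)
reader = wave.open(buffer, 'rb')
clip = clip_frames(reader, 0.25, 0.75)
clip_length = len(clip) // reader.getsampwidth()  # number of frames (mono)
print(clip_length)
```

# Since the `wav` attribute of a loaded Track stores a `Wave_read` instance, a helper along these lines could in principle be applied to `track.wav` as well, though `clip_wav()` is the documented route.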

# # Corpus generator
#
# The `corpus()` generator function is a convenience for iterating through all of
# the speaker archives together. Put all forty speaker archives in one directory,
# such as a directory named `speakers`, and create a new generator with this directory as an argument.

# In[22]:

corpus = buckeye.corpus('speakers/')

# The generator will yield the `Speaker` instances in numerical order.

# In[23]:

for speaker in corpus:
    print(speaker.name, end=' ')

# If you're using a `corpus()` generator, you can set `load_wavs` to `True` and it will be passed down to every `Track` instance, so that all of the wav files will be loaded.

# In[24]:

corpus = buckeye.corpus('speakers/', load_wavs=True)
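
# As a sketch of how the pieces in this tutorial fit together, the loop below counts word tokens and sums word durations per speaker, skipping pauses. So that it runs without the corpus files, it uses small stand-in classes (`FakeWord`, `FakePause`, `FakeTrack`, and `FakeSpeaker`, all hypothetical) that mimic only the attributes described in this document. Checking for an `orthography` attribute is one assumed way to tell words from pauses, not necessarily the approach the package recommends.

```python
from collections import namedtuple

# Hypothetical stand-ins that mimic only the attributes described in this tutorial.
FakeWord = namedtuple('FakeWord', ['orthography', 'beg', 'end', 'dur'])
FakePause = namedtuple('FakePause', ['entry', 'beg', 'end', 'dur'])
FakeTrack = namedtuple('FakeTrack', ['name', 'words'])


class FakeSpeaker(object):
    """Iterable over tracks, with a `name` attribute."""

    def __init__(self, name, tracks):
        self.name = name
        self.tracks = tracks

    def __iter__(self):
        return iter(self.tracks)


def word_totals(speakers):
    """Map each speaker name to (number of words, total word duration)."""
    totals = {}
    for speaker in speakers:
        count, duration = 0, 0.0
        for track in speaker:
            for item in track.words:
                # Pauses have `entry` instead of `orthography`, so skip them.
                if hasattr(item, 'orthography'):
                    count += 1
                    duration += item.dur
        totals[speaker.name] = (count, duration)
    return totals


# Made-up data for illustration only; these are not real corpus values.
demo = [FakeSpeaker('s01', [FakeTrack('s0101a', [
    FakePause('<SIL>', 0.0, 0.5, 0.5),
    FakeWord('okay', 0.5, 1.0, 0.5),
    FakeWord('um', 1.0, 1.25, 0.25),
])])]

totals = word_totals(demo)
print(totals)
```

# With the real package installed and the archives in place, the generator returned by `buckeye.corpus('speakers/')` could presumably be passed to `word_totals` in place of `demo`.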