Python classes and iterators for the Buckeye Corpus

This package is for iterating through the Buckeye Corpus annotations in Python. It uses the annotation timestamps to cross-reference the .words, .phones, and .log files, and can be used to extract sound clips from the .wav files. It is available on PyPI under the name buckeye and is tested to work with Python 2.7 and 3.6.

This document contains a short tutorial for using the package. The source for the package is on GitHub. The docstrings in buckeye.py, containers.py, and utterance.py have more detail about usage.

If you need to create Praat TextGrids based on the Buckeye Corpus, the tgre or textgrid packages may be helpful. You can also read the unzipped corpus files directly into Praat with this Praat menu option: Open -> Read from special tier file -> Read IntervalTier from Xwaves.

Corpus organization

The corpus files are organized into one zipped archive per speaker. The corpus contains interviews with 40 speakers, and each speaker is assigned a code-name from s01 through s40. Each speaker's interview is divided into about six tracks. For example, the first track for the first speaker is called s0101a. Each track contains three annotation files (e.g., s0101a.words, s0101a.phones, and s0101a.log), each of which holds the time-alignments for one annotation tier. Each track also includes a .wav file with the audio and a .txt file that lists the speaker turns without time-alignments.
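
The naming scheme above can be unpacked with a small helper. This is only a sketch and not part of the buckeye package: split_track_name and track_files are hypothetical names, and the pattern assumes that every track name is a speaker code-name followed by two digits and a letter, as in s0101a.

```python
import re

# Assumed pattern: speaker code-name (such as s01) plus a track suffix (such as 01a).
TRACK_NAME = re.compile(r'^(s\d{2})(\d{2}[a-z])$')

# The five file types that the text above lists for each track.
EXTENSIONS = ('.words', '.phones', '.log', '.txt', '.wav')


def split_track_name(name):
    """Return (speaker_code, track_suffix) for a name like 's0101a'."""
    match = TRACK_NAME.match(name)
    if match is None:
        raise ValueError('unexpected track name: {!r}'.format(name))
    return match.group(1), match.group(2)


def track_files(name):
    """List the five file names that belong to one track."""
    return [name + ext for ext in EXTENSIONS]


print(split_track_name('s0101a'))  # ('s01', '01a')
print(track_files('s0101a'))
```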

Speaker

A Speaker instance is created by calling the from_zip method on one of the zipped speaker archives that can be downloaded from the corpus website.

In [1]:

import buckeye

speaker = buckeye.Speaker.from_zip('speakers/s01.zip')
speaker

Out[1]:

Speaker("s01")

This will open and process the annotations in each of the sub-archives inside the speaker archive (the tracks, such as s0101a and s0101b). If the optional load_wavs argument is set to True when creating a Speaker instance, the wav files associated with each track are also loaded into memory:

In [2]:

speaker = buckeye.Speaker.from_zip('speakers/s01.zip', load_wavs=True)
speaker

Out[2]:

Speaker("s01")

Otherwise, only the annotations are loaded.

Each Speaker instance has the speaker's code-name, sex, age, and interviewer sex available as attributes.

In [3]:

print(speaker.name)
print(speaker.sex) # f for female, m for male
print(speaker.age) # o for old, y for young
print(speaker.interviewer) # f for female, m for male

s01
f
y
f
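
These attributes make it easy to pick out a subset of speakers. The sketch below is an illustration rather than a feature of the package: select_speakers is a hypothetical helper, DemoSpeaker is a stand-in that only mimics the four attribute names, and the entries are made up so that the example runs without the corpus files.

```python
from collections import namedtuple

# Stand-in carrying the four attributes described above. It is not the
# package's Speaker class; it only lets the sketch run without corpus files.
DemoSpeaker = namedtuple('DemoSpeaker', ['name', 'sex', 'age', 'interviewer'])


def select_speakers(speakers, **criteria):
    """Yield speakers whose attributes equal every given criterion."""
    for speaker in speakers:
        if all(getattr(speaker, key) == value
               for key, value in criteria.items()):
            yield speaker


# Made-up entries for illustration; these are not real corpus metadata.
demo = [
    DemoSpeaker('demo1', 'f', 'y', 'f'),
    DemoSpeaker('demo2', 'm', 'o', 'f'),
    DemoSpeaker('demo3', 'f', 'o', 'm'),
]

for speaker in select_speakers(demo, sex='f', age='o'):
    print(speaker.name)  # prints demo3
```

Because the helper only reads attributes by name, it should work the same way on any objects that expose name, sex, age, and interviewer.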

The tracks can be accessed by iterating through the Speaker instance. There is more detail about accessing the annotations below under the heading Track.

In [4]:

for track in speaker:
    print(track.name)

s0101a
s0101b
s0102a
s0102b
s0103a

The tracks can also be accessed as a list through the tracks attribute.

In [5]:

print(speaker.tracks)

[Track("s0101a"), Track("s0101b"), Track("s0102a"), Track("s0102b"), Track("s0103a")]

Track

Each speaker has about six tracks, which can be accessed by indexing into the Speaker instance.

In [6]:

speaker

Out[6]:

Speaker("s01")

In [7]:

track = speaker[0]
track

Out[7]:

Track("s0101a")

The annotations and recordings for each track are stored in the words, phones, log, txt, and wav attributes. See below for more information on each attribute.

If you don't want to load all of the tracks for a speaker, there are two ways to load tracks individually.

First, you can call the Track.from_zip method directly on a zipped track archive.

In [8]:

another_track = buckeye.Track.from_zip('s0303b.zip')
another_track

Out[8]:

Track("s0303b")

Second, if you're working with the original uncompressed files, you can create a track without using the from_zip method by passing filepaths for the five track files (ending in .words, .phones, .log, .txt, and .wav), plus the name of the track, as arguments. For example:

In [9]:

another_track = buckeye.Track(name='s0303b',
                              words='tracks/s0303b/s0303b.words',
                              phones='tracks/s0303b/s0303b.phones',
                              log='tracks/s0303b/s0303b.log',
                              txt='tracks/s0303b/s0303b.txt',
                              wav='tracks/s0303b/s0303b.wav' # wav is optional
                             )

another_track

Out[9]:

Track("s0303b")
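
When all of a track's files sit in one directory, the keyword arguments can be assembled programmatically. The helper below is a hypothetical convenience, not a function provided by the package; it assumes the files are named after the track, as in the example above, and uses the same keyword names.

```python
import os


def track_kwargs(directory, name, include_wav=True):
    """Build keyword arguments for a track whose files share one directory."""
    kwargs = {'name': name}
    # The annotation files use their tier name as the file extension.
    for key in ('words', 'phones', 'log', 'txt'):
        kwargs[key] = os.path.join(directory, name + '.' + key)
    # The wav file is optional, so it can be left out.
    if include_wav:
        kwargs['wav'] = os.path.join(directory, name + '.wav')
    return kwargs


kwargs = track_kwargs('tracks/s0303b', 's0303b')
for key in sorted(kwargs):
    print(key, kwargs[key])
```

With the package installed and the files in place, the resulting dictionary could then be unpacked into the constructor, e.g. buckeye.Track(**kwargs).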

Words

The words attribute stores a list of Word and Pause instances, created from the .words file.

In [10]:

track.words[:10]

Out[10]:

[Pause('{B_TRANS}', 0.0, 0.102385),
 Pause('<SIL>', 0.102385, 4.275744),
 Pause('<NOISE>', 4.275744, 8.513518),
 Pause('<IVER>', 8.513518, 32.216575),
 Word('okay', 32.216575, 32.622045, ['ow', 'k', 'ey'], ['k', 'ay'], 'NN'),
 Pause('<IVER>', 32.622045, 37.129002),
 Pause('<VOCNOISE>', 37.129002, 38.123014),
 Pause('<IVER>', 38.123014, 44.617996),
 Word('um', 44.617996, 44.946848, ['ah', 'm'], ['ah', 'm'], 'UH'),
 Pause('<SIL>', 44.946848, 45.355708)]

Word instances have these nine attributes:

orthography - the word's spelling
beg - the timestamp when the word begins (relative to the start of the track), in seconds
end - the timestamp when the word ends
dur - the duration of the word
phonemic - the canonical transcription
phonetic - the close transcription
pos - the word's part of speech
misaligned - marked as True if the word has a negative duration, or if the phonetic transcription doesn't match what's in the .phones file
phones - a list of references to Phone instances that have the labels and timestamps for the phonetic transcription

In [11]:

word = track.words[4]

print(word.orthography)
print(word.beg)
print(word.end)
print(word.dur)
print(word.phonemic)
print(word.phonetic)
print(word.pos)
print(word.misaligned)

okay
32.216575
32.622045
0.4054700000000011
['ow', 'k', 'ey']
['k', 'ay']
NN
False

The phones are retrieved based on the timestamps for the word and for the entries in the .phones file.

In [12]:

word.phones

Out[12]:

[Phone('k', 32.216575, 32.376593), Phone('ay', 32.376593, 32.622045)]
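
The idea of matching phones to a word by timestamp can be sketched as follows. This is an illustration of the general approach, not the package's actual matching code: phones_within is a hypothetical helper, the tolerance parameter is an assumption, and Phone here is a stand-in namedtuple filled with values copied from the outputs in this tutorial.

```python
from collections import namedtuple

# Stand-in for illustration; not the package's Phone class.
Phone = namedtuple('Phone', ['seg', 'beg', 'end'])


def phones_within(phones, beg, end, tolerance=1e-6):
    """Return the phones that begin and end inside the interval [beg, end]."""
    return [p for p in phones
            if p.beg >= beg - tolerance and p.end <= end + tolerance]


# A few consecutive entries from the .phones tier of s0101a.
phones = [
    Phone('IVER', 8.513763, 32.216575),
    Phone('k', 32.216575, 32.376593),
    Phone('ay', 32.376593, 32.622045),
    Phone('IVER', 32.622045, 37.129002),
]

# Timestamps of the word 'okay' from the .words tier.
matched = phones_within(phones, 32.216575, 32.622045)
print([p.seg for p in matched])  # ['k', 'ay']
```

Comparing the matched labels against the word's phonetic attribute is one way to picture the mismatch check that the misaligned attribute describes.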

Phones have four attributes:

seg - the pseudo-ARPABET transcription of the phone
beg - the timestamp when the phone begins (relative to the start of the track), in seconds
end - the timestamp when the phone ends
dur - duration

In [13]:

for phone in word.phones:
    print(phone.seg, phone.beg, phone.end, phone.dur)

k 32.216575 32.376593 0.16001800000000088
ay 32.376593 32.622045 0.24545200000000023

Many of the annotations are things like <SIL> (silence) or <IVER> (the interviewer's turn). These are stored as Pause instances, rather than Word instances. Pause instances have six attributes:

entry - the kind of Pause, e.g. <SIL>
beg - when the pause begins
end - when the pause ends
dur - duration
misaligned - marked as True if the Pause has a negative duration
phones - a list of references to Phone instances that are associated with this Pause, e.g. one or more SIL tokens

In [14]:

pause = track.words[1]

print(pause.entry, pause.beg, pause.end, pause.dur, pause.misaligned)

<SIL> 0.102385 4.275744 4.1733590000000005 False

In [15]:

pause.phones

Out[15]:

[Phone('SIL', 0.102385, 4.275744)]
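
Since the words list mixes Word and Pause instances, analyses usually need to tell them apart. One option is to check for the orthography attribute, which the lists above give for Word but not for Pause. The sketch below uses stand-in namedtuples with a subset of the documented attributes, and is_word is a hypothetical helper; with the real package, an isinstance check against the Word class would presumably work as well.

```python
from collections import namedtuple

# Stand-ins with a subset of the attributes listed above (not the package's classes).
Word = namedtuple('Word', ['orthography', 'beg', 'end'])
Pause = namedtuple('Pause', ['entry', 'beg', 'end'])

# First ten entries of s0101a, copied from the output of track.words[:10].
entries = [
    Pause('{B_TRANS}', 0.0, 0.102385),
    Pause('<SIL>', 0.102385, 4.275744),
    Pause('<NOISE>', 4.275744, 8.513518),
    Pause('<IVER>', 8.513518, 32.216575),
    Word('okay', 32.216575, 32.622045),
    Pause('<IVER>', 32.622045, 37.129002),
    Pause('<VOCNOISE>', 37.129002, 38.123014),
    Pause('<IVER>', 38.123014, 44.617996),
    Word('um', 44.617996, 44.946848),
    Pause('<SIL>', 44.946848, 45.355708),
]


def is_word(entry):
    """Treat anything with an orthography attribute as a word."""
    return hasattr(entry, 'orthography')


words = [e.orthography for e in entries if is_word(e)]
# Total time covered by <SIL> pauses in this excerpt.
silence = sum(e.end - e.beg for e in entries
              if not is_word(e) and e.entry == '<SIL>')

print(words)  # ['okay', 'um']
print(round(silence, 6))
```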

Phones

The phones in a track can also be accessed directly through the phones attribute of a Track instance.

In [16]:

for phone in track.phones[:10]:
    print(phone.seg, phone.beg, phone.end, phone.dur)

{B_TRANS} 0.0 0.102385 0.102385
SIL 0.102385 4.275744 4.1733590000000005
NOISE 4.275744 8.513763 4.238019
IVER 8.513763 32.216575 23.702811999999998
k 32.216575 32.376593 0.16001800000000088
ay 32.376593 32.622045 0.24545200000000023
IVER 32.622045 37.129002 4.506957
VOCNOISE 37.129002 38.123014 0.9940119999999979
IVER 38.123014 44.617996 6.494982
ah 44.617996 44.820731 0.2027350000000041
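
A flat list of phones lends itself to simple tallies. As a rough sketch, the hypothetical helper below sums the time covered by each label. It uses a stand-in Phone namedtuple holding the ten entries printed above and computes each duration as end minus beg.

```python
from collections import defaultdict, namedtuple

# Stand-in for illustration; not the package's Phone class.
Phone = namedtuple('Phone', ['seg', 'beg', 'end'])

# The ten entries printed above for s0101a.
phones = [
    Phone('{B_TRANS}', 0.0, 0.102385),
    Phone('SIL', 0.102385, 4.275744),
    Phone('NOISE', 4.275744, 8.513763),
    Phone('IVER', 8.513763, 32.216575),
    Phone('k', 32.216575, 32.376593),
    Phone('ay', 32.376593, 32.622045),
    Phone('IVER', 32.622045, 37.129002),
    Phone('VOCNOISE', 37.129002, 38.123014),
    Phone('IVER', 38.123014, 44.617996),
    Phone('ah', 44.617996, 44.820731),
]


def total_duration_by_label(phones):
    """Sum the durations (end minus beg) of the phones that share a label."""
    totals = defaultdict(float)
    for phone in phones:
        totals[phone.seg] += phone.end - phone.beg
    return dict(totals)


totals = total_duration_by_label(phones)
for label in sorted(totals):
    print(label, round(totals[label], 6))
```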

Log

The list of entries in the Track's .log file can be accessed through the log attribute, which stores a list of the LogEntry instances for the Track.

LogEntry instances have entry, beg, end, and dur attributes.

In [17]:

for log in track.log:
    print(log.entry, log.beg, log.end, log.dur)

<VOICE=modal> 0.0 61.142603 61.142603
<VOICE=creaky> 61.142603 61.397647 0.25504399999999805
<VOICE=modal> 61.397647 176.705681 115.30803399999999
<VOICE=creaky> 176.705681 177.442715 0.7370339999999942
<VOICE=modal> 177.442715 208.458474 31.015759000000003
<VOICE=creaky> 208.458474 208.998197 0.5397230000000093
<IVER_overlap-start> 208.998197 218.326046 9.327848999999986
<IVER_overlap-end> 218.326046 218.619639 0.29359300000001554
<IVER_overlap-start> 218.619639 281.4126 62.79296099999999
<IVER_overlap-end> 281.4126 282.015381 0.6027809999999931
<VOICE=modal> 282.015381 283.01414 0.9987590000000068
<VOICE=creaky> 283.01414 283.342991 0.328850999999986
<IVER_overlap-start> 283.342991 286.3691 3.0261090000000195
<IVER_overlap-end> 286.3691 286.587431 0.21833099999997785
<IVER_overlap-start> 286.587431 358.243781 71.65635000000003
<IVER_overlap-end> 358.243781 358.766553 0.5227719999999749
<VOICE=modal> 358.766553 570.891209 212.12465600000002
<VOICE=creaky> 570.891209 570.988848 0.09763899999995829
<IVER_overlap-start> 570.988848 595.595736 24.606888000000026
<IVER_overlap-end> 595.595736 596.178854 0.5831180000000131

You can call the get_logs() method of a Track to retrieve the log entries that overlap with the given timestamps.

For example, the log entries that overlap with the interval from 60 seconds to 62 seconds can be found like this:

In [18]:

logs = track.get_logs(60.0, 62.0)

for log in logs:
    print(log.entry, log.beg, log.end)

<VOICE=modal> 0.0 61.142603
<VOICE=creaky> 61.142603 61.397647
<VOICE=modal> 61.397647 176.705681
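
The overlap test behind a lookup like this can be written in one line: two intervals overlap when each one begins before the other ends. The sketch below illustrates that idea and is not the package's implementation; overlapping is a hypothetical helper, LogEntry is a stand-in namedtuple, and the handling of entries that merely touch the query boundary is an assumption that may differ from get_logs().

```python
from collections import namedtuple

# Stand-in for illustration; not the package's LogEntry class.
LogEntry = namedtuple('LogEntry', ['entry', 'beg', 'end'])


def overlapping(entries, beg, end):
    """Return the entries that share some stretch of time with [beg, end]."""
    return [e for e in entries if e.beg < end and e.end > beg]


# First four log entries of s0101a, copied from the output above.
log = [
    LogEntry('<VOICE=modal>', 0.0, 61.142603),
    LogEntry('<VOICE=creaky>', 61.142603, 61.397647),
    LogEntry('<VOICE=modal>', 61.397647, 176.705681),
    LogEntry('<VOICE=creaky>', 176.705681, 177.442715),
]

for entry in overlapping(log, 60.0, 62.0):
    print(entry.entry, entry.beg, entry.end)
```

For the interval from 60 to 62 seconds, this selects the same three entries as the get_logs() call above.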

Txt

The txt attribute holds a list of speaker turns without timestamps, read from the .txt file in the track.

In [19]:

track.txt[1]

Out[19]:

'okay <IVER>'

Wav

If a Speaker instance is created with load_wavs=True, each Track will also have a wav attribute that stores a Wave_read instance.

In [20]:

speaker = buckeye.Speaker.from_zip('speakers/s01.zip', load_wavs=True)
track = speaker[0]

track.wav

Out[20]:

<wave.Wave_read at 0x81fd4a8>

You can extract sound clips from the wav file with the clip_wav() method of each Track:

In [21]:

track.clip_wav('myclip.wav', 60.0, 62.0)

This will create a wav file in the current directory called myclip.wav which contains the sound between 60 and 62 seconds in the track audio.
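
For readers curious about what clipping involves, the same kind of operation can be sketched with the standard library's wave module. This is not the package's clip_wav() code. The clip function is a hypothetical stand-alone version, and the example first writes a synthetic three-second tone so that it runs without the corpus audio.

```python
import math
import os
import struct
import tempfile
import wave


def clip(source_path, target_path, beg, end):
    """Copy the audio between beg and end (in seconds) into a new wav file."""
    with wave.open(source_path, 'rb') as src:
        rate = src.getframerate()
        first = int(beg * rate)           # index of the first frame to keep
        count = int(end * rate) - first   # number of frames to keep
        src.setpos(first)
        frames = src.readframes(count)
        channels = src.getnchannels()
        width = src.getsampwidth()
    with wave.open(target_path, 'wb') as dst:
        dst.setnchannels(channels)
        dst.setsampwidth(width)
        dst.setframerate(rate)
        dst.writeframes(frames)


# Write a synthetic 3-second, 16-bit mono tone to stand in for a track's audio.
rate = 8000
samples = [int(10000 * math.sin(2 * math.pi * 440 * n / rate))
           for n in range(3 * rate)]
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, 'tone.wav')
with wave.open(source, 'wb') as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(rate)
    out.writeframes(struct.pack('<{}h'.format(len(samples)), *samples))

target = os.path.join(workdir, 'myclip.wav')
clip(source, target, 1.0, 2.0)

with wave.open(target, 'rb') as clipped:
    print(clipped.getnframes() / float(clipped.getframerate()))  # 1.0
```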

Corpus generator

The corpus() generator function is a convenience for iterating through all of the speaker archives together. Put all forty speaker archives in one directory, such as a directory named speakers. Create a new generator with this directory as an argument.

In [22]:

corpus = buckeye.corpus('speakers/')

The generator will yield the Speaker instances in numerical order.

In [23]:

for speaker in corpus:
    print(speaker.name, end=' ')

s01 s02 s03 s04 s05 s06 s07 s08 s09 s10 s11 s12 s13 s14 s15 s16 s17 s18 s19 s20 s21 s22 s23 s24 s25 s26 s27 s28 s29 s30 s31 s32 s33 s34 s35 s36 s37 s38 s39 s40
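
The pattern of walking through a directory of archives in numerical order can be sketched with a plain generator. This is an illustration and not the package's corpus() function: speaker_archives is a hypothetical name, and the file-name pattern assumes that the archives are called s01.zip through s40.zip. The example creates empty placeholder files in a temporary directory so that it runs without the corpus.

```python
import os
import re
import tempfile

# Assumed archive naming: s01.zip, s02.zip, and so on.
ARCHIVE = re.compile(r'^s(\d{2})\.zip$')


def speaker_archives(directory):
    """Yield paths to the speaker archives in numerical order, one at a time."""
    names = [n for n in os.listdir(directory) if ARCHIVE.match(n)]
    for name in sorted(names, key=lambda n: int(ARCHIVE.match(n).group(1))):
        yield os.path.join(directory, name)


# Empty placeholder files stand in for the real archives.
demo_dir = tempfile.mkdtemp()
for filename in ('s10.zip', 's02.zip', 's01.zip', 'notes.txt'):
    open(os.path.join(demo_dir, filename), 'w').close()

for path in speaker_archives(demo_dir):
    print(os.path.basename(path))
```

Files that do not match the pattern, such as notes.txt here, are skipped.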

If you're using a corpus() generator, you can set load_wavs to True and it will be passed down to every Track instance, so that all of the wav files will be loaded.

In [24]:

corpus = buckeye.corpus('speakers/', load_wavs=True)