This package is for iterating through the Buckeye Corpus annotations in Python. It uses the annotation timestamps to cross-reference the .words, .phones, and .log files, and can be used to extract sound clips from the .wav files. It is available on PyPI under the name buckeye and is tested to work with Python 2.7 and 3.6.
This document contains a short tutorial for using the package. The source for the package is on GitHub. The docstrings in buckeye.py, containers.py, and utterance.py have more detail about usage.
If you need to create Praat TextGrids based on the Buckeye Corpus, the tgre or textgrid packages may be helpful. You can also read the unzipped corpus files directly into Praat with the Praat menu option Open -> Read from special tier file -> Read IntervalTier from Xwaves.
The corpus files are organized into one zipped archive per speaker. The corpus contains interviews with 40 speakers, and each speaker is assigned a code-name, from s01 through s40. Each speaker's interview is divided into about six tracks. For example, the first track for the first speaker is called s0101a. Each track contains three annotation files (e.g., s0101a.words, s0101a.phones, and s0101a.log), each of which holds the time-alignments for one annotation tier. Each track also includes a .wav file with the audio, plus a .txt file containing a list of speaker turns without time-alignments.
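The naming scheme described above can be sketched in a few lines. This is a minimal illustration, not part of the buckeye package: the helper name describe_track is hypothetical, and reading the two digits and the letter after the speaker code as a "session" and a "part" is an assumption made only for this example.

```python
import re

# Hypothetical helper, not part of the buckeye package.
# Assumes track names look like 's0101a': speaker code, two digits, one letter.
TRACK_NAME = re.compile(r'^(s\d{2})(\d{2})([a-z])$')
EXTENSIONS = ('.words', '.phones', '.log', '.txt', '.wav')


def describe_track(track_name):
    """Split a track name and list the five files expected for that track."""
    match = TRACK_NAME.match(track_name)
    if match is None:
        raise ValueError('unexpected track name: %r' % track_name)
    # 'session' and 'part' are labels chosen for this sketch only.
    speaker, session, part = match.groups()
    files = [track_name + ext for ext in EXTENSIONS]
    return speaker, session, part, files


speaker_code, session, part, files = describe_track('s0101a')
print(speaker_code)  # s01
print(files)
```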
A Speaker instance is created by calling the from_zip method with the path to one of the zipped speaker archives, which can be downloaded from the corpus website.
import buckeye
speaker = buckeye.Speaker.from_zip('speakers/s01.zip')
speaker
Speaker("s01")
This will open and process the annotations in each of the sub-archives inside the speaker archive (the tracks, such as s0101a and s0101b). If the optional load_wavs argument is set to True when creating a Speaker instance, the wav files associated with each track will also be loaded into memory:
speaker = buckeye.Speaker.from_zip('speakers/s01.zip', load_wavs=True)
speaker
Speaker("s01")
Otherwise, only the annotations are loaded.
Each Speaker instance has the speaker's code-name, sex, age, and interviewer sex available as attributes.
print(speaker.name)
print(speaker.sex) # f for female, m for male
print(speaker.age) # o for old, y for young
print(speaker.interviewer) # f for female, m for male
s01
f
y
f
The tracks can be accessed by iterating through the Speaker instance. There is more detail about accessing the annotations below under the heading Track.
for track in speaker:
    print(track.name)
s0101a
s0101b
s0102a
s0102b
s0103a
The tracks can also be accessed as a list through the tracks attribute.
print(speaker.tracks)
[Track("s0101a"), Track("s0101b"), Track("s0102a"), Track("s0102b"), Track("s0103a")]
Each speaker has about six tracks. A single track can be retrieved by indexing the Speaker instance.
speaker
Speaker("s01")
track = speaker[0]
track
Track("s0101a")
The annotations and recordings for each track are stored in the words, phones, log, txt, and wav attributes. See below for more information on each attribute.
If you don't want to load all of the tracks for a speaker, there are two ways to load tracks individually.
First, you can call the Track.from_zip method with the path to a zipped track archive.
another_track = buckeye.Track.from_zip('s0303b.zip')
another_track
Track("s0303b")
Second, if you're working with the original uncompressed files, you can create a track without using the from_zip method by passing filepaths for the five track files (ending in .words, .phones, .log, .txt, and .wav), plus the name of the track, as arguments. For example:
another_track = buckeye.Track(name='s0303b',
                              words='tracks/s0303b/s0303b.words',
                              phones='tracks/s0303b/s0303b.phones',
                              log='tracks/s0303b/s0303b.log',
                              txt='tracks/s0303b/s0303b.txt',
                              wav='tracks/s0303b/s0303b.wav')  # wav is optional
another_track
Track("s0303b")
The words attribute stores a list of Word and Pause instances, created from the .words file.
track.words[:10]
[Pause('{B_TRANS}', 0.0, 0.102385), Pause('<SIL>', 0.102385, 4.275744), Pause('<NOISE>', 4.275744, 8.513518), Pause('<IVER>', 8.513518, 32.216575), Word('okay', 32.216575, 32.622045, ['ow', 'k', 'ey'], ['k', 'ay'], 'NN'), Pause('<IVER>', 32.622045, 37.129002), Pause('<VOCNOISE>', 37.129002, 38.123014), Pause('<IVER>', 38.123014, 44.617996), Word('um', 44.617996, 44.946848, ['ah', 'm'], ['ah', 'm'], 'UH'), Pause('<SIL>', 44.946848, 45.355708)]
Word instances have these nine attributes:
orthography - the word's spelling
beg - the timestamp when the word begins (relative to the start of the track), in seconds
end - the timestamp when the word ends
dur - the duration of the word
phonemic - the canonical transcription
phonetic - the close transcription
pos - the word's part of speech
misaligned - marked as True if the word has a negative duration, or if the phonetic transcription doesn't match what's in the .phones file
phones - a list of references to Phone instances that have the labels and timestamps for the phonetic transcription

word = track.words[4]
print(word.orthography)
print(word.beg)
print(word.end)
print(word.dur)
print(word.phonemic)
print(word.phonetic)
print(word.pos)
print(word.misaligned)
okay
32.216575
32.622045
0.4054700000000011
['ow', 'k', 'ey']
['k', 'ay']
NN
False
The phones are retrieved based on the timestamps for the word and for the entries in the .phones file.
word.phones
[Phone('k', 32.216575, 32.376593), Phone('ay', 32.376593, 32.622045)]
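The timestamp cross-referencing and the misaligned flag can be pictured with a small sketch. This is not the package's implementation: the namedtuples below are stand-ins for the real Word and Phone classes, the helper names phones_for and is_misaligned are hypothetical, and the exact matching rule (here, a phone must lie entirely within the word's time span) is an assumption. The timestamps are taken from the output above.

```python
from collections import namedtuple

# Stand-in containers for illustration; the package's own classes differ.
Phone = namedtuple('Phone', 'seg beg end')
Word = namedtuple('Word', 'orthography beg end phonetic')


def phones_for(word, phones):
    """Return the phones whose intervals lie within the word's time span."""
    return [p for p in phones if p.beg >= word.beg and p.end <= word.end]


def is_misaligned(word, phones):
    """Flag a negative duration, or phone labels that differ from the
    word's close (phonetic) transcription."""
    if word.end - word.beg < 0:
        return True
    return [p.seg for p in phones_for(word, phones)] != word.phonetic


phones = [
    Phone('IVER', 8.513763, 32.216575),
    Phone('k', 32.216575, 32.376593),
    Phone('ay', 32.376593, 32.622045),
    Phone('IVER', 32.622045, 37.129002),
]
okay = Word('okay', 32.216575, 32.622045, ['k', 'ay'])

print([p.seg for p in phones_for(okay, phones)])  # ['k', 'ay']
print(is_misaligned(okay, phones))                # False
```

A real implementation might allow a small tolerance when comparing timestamps; the strict comparison here is only meant to show the idea.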
Phones have four attributes:
seg - the pseudo-ARPABET transcription of the phone
beg - the timestamp when the phone begins (relative to the start of the track), in seconds
end - the timestamp when the phone ends
dur - duration

for phone in word.phones:
    print(phone.seg, phone.beg, phone.end, phone.dur)
k 32.216575 32.376593 0.16001800000000088
ay 32.376593 32.622045 0.24545200000000023
Many of the annotations are things like <SIL> (silence) or <IVER> (the interviewer's turn). These are stored as Pause instances, rather than Word instances. Pause instances have six attributes:
entry - the kind of Pause, e.g. <SIL>
beg - when the pause begins
end - when the pause ends
dur - duration
misaligned - marked as True if the Pause has a negative duration
phones - a list of references to Phone instances that are associated with this Pause, e.g. one or more SIL tokens

pause = track.words[1]
print(pause.entry, pause.beg, pause.end, pause.dur, pause.misaligned)
<SIL> 0.102385 4.275744 4.1733590000000005 False
pause.phones
[Phone('SIL', 0.102385, 4.275744)]
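Because a words list mixes Word and Pause instances, analyses of speech usually need to skip the pauses. The sketch below shows one way to do that, under stated assumptions: the namedtuples are stand-ins for the real classes, mean_word_duration is a hypothetical helper, and telling the two types apart by the presence of an orthography attribute relies on the attribute lists given above rather than on anything the package promises.

```python
from collections import namedtuple

# Stand-ins that carry only the attributes this sketch needs.
Word = namedtuple('Word', 'orthography beg end misaligned')
Pause = namedtuple('Pause', 'entry beg end misaligned')


def mean_word_duration(items):
    """Average duration of the well-aligned words, ignoring pauses.

    Returns None when there are no usable words.
    """
    durations = [item.end - item.beg for item in items
                 if hasattr(item, 'orthography') and not item.misaligned]
    if not durations:
        return None
    return sum(durations) / len(durations)


# Timestamps copied from the tutorial output above.
items = [
    Pause('<SIL>', 0.102385, 4.275744, False),
    Word('okay', 32.216575, 32.622045, False),
    Pause('<IVER>', 32.622045, 37.129002, False),
    Word('um', 44.617996, 44.946848, False),
]
print(mean_word_duration(items))
```

With the real package, the same function could presumably be given track.words directly, since Word and Pause instances both provide beg, end, and misaligned.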
The phones in a track can also be accessed directly through the phones attribute of a Track instance.
for phone in track.phones[:10]:
    print(phone.seg, phone.beg, phone.end, phone.dur)
{B_TRANS} 0.0 0.102385 0.102385
SIL 0.102385 4.275744 4.1733590000000005
NOISE 4.275744 8.513763 4.238019
IVER 8.513763 32.216575 23.702811999999998
k 32.216575 32.376593 0.16001800000000088
ay 32.376593 32.622045 0.24545200000000023
IVER 32.622045 37.129002 4.506957
VOCNOISE 37.129002 38.123014 0.9940119999999979
IVER 38.123014 44.617996 6.494982
ah 44.617996 44.820731 0.2027350000000041
The list of entries in the Track's .log file can be accessed through the log attribute, which stores a list of the LogEntry instances for the Track. LogEntry instances have entry, beg, end, and dur attributes.
for log in track.log:
    print(log.entry, log.beg, log.end, log.dur)
<VOICE=modal> 0.0 61.142603 61.142603
<VOICE=creaky> 61.142603 61.397647 0.25504399999999805
<VOICE=modal> 61.397647 176.705681 115.30803399999999
<VOICE=creaky> 176.705681 177.442715 0.7370339999999942
<VOICE=modal> 177.442715 208.458474 31.015759000000003
<VOICE=creaky> 208.458474 208.998197 0.5397230000000093
<IVER_overlap-start> 208.998197 218.326046 9.327848999999986
<IVER_overlap-end> 218.326046 218.619639 0.29359300000001554
<IVER_overlap-start> 218.619639 281.4126 62.79296099999999
<IVER_overlap-end> 281.4126 282.015381 0.6027809999999931
<VOICE=modal> 282.015381 283.01414 0.9987590000000068
<VOICE=creaky> 283.01414 283.342991 0.328850999999986
<IVER_overlap-start> 283.342991 286.3691 3.0261090000000195
<IVER_overlap-end> 286.3691 286.587431 0.21833099999997785
<IVER_overlap-start> 286.587431 358.243781 71.65635000000003
<IVER_overlap-end> 358.243781 358.766553 0.5227719999999749
<VOICE=modal> 358.766553 570.891209 212.12465600000002
<VOICE=creaky> 570.891209 570.988848 0.09763899999995829
<IVER_overlap-start> 570.988848 595.595736 24.606888000000026
<IVER_overlap-end> 595.595736 596.178854 0.5831180000000131
You can call the get_logs() method of a Track to retrieve the log entries that overlap with the given timestamps. For example, the log entries that overlap with the interval from 60 seconds to 62 seconds can be found like this:
logs = track.get_logs(60.0, 62.0)
for log in logs:
    print(log.entry, log.beg, log.end)
<VOICE=modal> 0.0 61.142603
<VOICE=creaky> 61.142603 61.397647
<VOICE=modal> 61.397647 176.705681
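The overlap lookup can be understood as a standard interval-intersection test. The following is a minimal sketch of that idea, not the code behind get_logs(): the LogEntry namedtuple is a stand-in, the function name overlapping is hypothetical, and whether the real method counts intervals that merely touch at a boundary is not something this example settles.

```python
from collections import namedtuple

# Stand-in for the package's LogEntry class.
LogEntry = namedtuple('LogEntry', 'entry beg end')


def overlapping(entries, beg, end):
    """Return entries whose interval shares some time with [beg, end]."""
    # Two intervals overlap when each one starts before the other ends.
    return [e for e in entries if e.beg < end and e.end > beg]


# The first four log entries from the tutorial output above.
log = [
    LogEntry('<VOICE=modal>', 0.0, 61.142603),
    LogEntry('<VOICE=creaky>', 61.142603, 61.397647),
    LogEntry('<VOICE=modal>', 61.397647, 176.705681),
    LogEntry('<VOICE=creaky>', 176.705681, 177.442715),
]

for e in overlapping(log, 60.0, 62.0):
    print(e.entry, e.beg, e.end)
```

On this sample the sketch selects the same three entries that get_logs(60.0, 62.0) returns in the tutorial.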
The txt attribute holds a list of speaker turns without timestamps, read from the .txt file in the track.
track.txt[1]
'okay <IVER>'
If a Speaker instance is created with load_wavs=True, each Track will also have a wav attribute that stores a Wave_read instance.
speaker = buckeye.Speaker.from_zip('speakers/s01.zip', load_wavs=True)
track = speaker[0]
track.wav
<wave.Wave_read at 0x81fd4a8>
You can extract sound clips from the wav file with the clip_wav() method of each Track:
track.clip_wav('myclip.wav', 60.0, 62.0)
This will create a wav file called myclip.wav in the current directory, containing the audio between 60 and 62 seconds in the track.
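For readers curious about what clipping a time range involves, here is a minimal sketch using only the standard-library wave module (Python 3). It is not the package's clip_wav() implementation: the function name clip and the synthetic silent recording are assumptions made for the example, and it converts seconds to frames by simple truncation.

```python
import io
import wave


def clip(source, target, beg, end):
    """Copy the audio between beg and end (in seconds) from source to target.

    source and target may be file paths or file-like objects.
    """
    with wave.open(source, 'rb') as src:
        rate = src.getframerate()
        start_frame = int(beg * rate)
        n_frames = int((end - beg) * rate)
        src.setpos(start_frame)            # jump to the first frame of the clip
        frames = src.readframes(n_frames)  # read only the requested span
        with wave.open(target, 'wb') as dst:
            # Reuse the source format; the frame count in the header is
            # corrected automatically when the output is closed.
            dst.setparams(src.getparams())
            dst.writeframes(frames)


# Build a synthetic 3-second silent recording (mono, 16-bit, 8000 Hz) in memory.
source = io.BytesIO()
with wave.open(source, 'wb') as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(8000)
    w.writeframes(b'\x00\x00' * 8000 * 3)
source.seek(0)

target = io.BytesIO()
clip(source, target, 1.0, 2.0)
target.seek(0)
with wave.open(target, 'rb') as clipped:
    n_clipped = clipped.getnframes()
print(n_clipped)  # 8000
```

One second of audio at 8000 Hz is 8000 frames, which is what the clipped file contains.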
The corpus() generator function is a convenience for iterating through all of the speaker archives together. Put all forty speaker archives in one directory, such as a directory named speakers. Create a new generator with this directory as an argument.
corpus = buckeye.corpus('speakers/')
The generator will yield the Speaker instances in numerical order.
for speaker in corpus:
    print(speaker.name, end=' ')
s01 s02 s03 s04 s05 s06 s07 s08 s09 s10 s11 s12 s13 s14 s15 s16 s17 s18 s19 s20 s21 s22 s23 s24 s25 s26 s27 s28 s29 s30 s31 s32 s33 s34 s35 s36 s37 s38 s39 s40
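The ordering follows naturally from the zero-padded archive names: sorting s01.zip through s40.zip alphabetically is the same as sorting them numerically. The sketch below illustrates that with a hypothetical generator, iter_speaker_archives, which only yields archive paths; it is an assumption about how such a directory could be walked, not a description of how corpus() is written.

```python
import glob
import os
import tempfile


def iter_speaker_archives(directory):
    """Yield paths such as s01.zip, s02.zip, ... in sorted order."""
    pattern = os.path.join(directory, 's[0-9][0-9].zip')
    # Zero-padded names make alphabetical order equal to numerical order.
    for path in sorted(glob.glob(pattern)):
        yield path


# Demonstration with empty placeholder files in a temporary directory.
demo_dir = tempfile.mkdtemp()
for name in ('s10.zip', 's02.zip', 's01.zip', 'notes.txt'):
    open(os.path.join(demo_dir, name), 'w').close()

names = [os.path.basename(p) for p in iter_speaker_archives(demo_dir)]
print(names)  # ['s01.zip', 's02.zip', 's10.zip']
```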
If you're using a corpus() generator, you can set load_wavs to True and it will be passed down to every Track instance, so that all of the wav files will be loaded.
corpus = buckeye.corpus('speakers/', load_wavs=True)