Quick look at NY Phil concert program data

Eamonn Bell, Columbia University <[email protected]>


Work in progress. Here are some questions worth asking of this data; the notebook below makes at least partial progress on a few of them.

  • What composers tend to get programmed together?
  • Where did the orchestra play?
  • What does 'composer discovery' look like? Can we spot faddish composers by the shape of their performance frequency?
  • Who played with whom over the course of the existence of the orchestra? The social network of performers.
  • What are the significant differences between tour and subscription concert programs in general?
  • What conductors prefer which works?
  • What are the genres of the most-programmed works?
  • What time do concerts tend to start at?

Acknowledgements

Thanks to https://github.com/bmcfee for the parsing code. The dataset this notebook is based on was released under CC0 1.0 Universal.

In [3]:
import collections
import glob

import folium
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from IPython.display import HTML, Image
from lxml import etree, objectify
from pprint import pprint
In [4]:
%matplotlib inline
In [3]:
!git clone https://github.com/nyphilarchive/PerformanceHistory.git
Cloning into 'PerformanceHistory'...
remote: Counting objects: 350, done.
remote: Total 350 (delta 0), reused 0 (delta 0), pack-reused 350
Receiving objects: 100% (350/350), 12.53 MiB | 7.66 MiB/s, done.
Resolving deltas: 100% (253/253), done.
Checking connectivity... done.
In [13]:
# Author: https://github.com/bmcfee/nycphil (Brian McFee)

def parse_programs(programs):
    # Parse every <program> element under the document root.
    return [parse_program(x) for x in programs]

def parse_program(program):
    # Recurse into the structured children; keep everything else as text.
    dispatch = dict(concertInfo=parse_concertInfo,
                    worksInfo=parse_worksInfo)
    data = dict()

    for child in program.getchildren():
        if child.tag in dispatch:
            data[child.tag] = dispatch[child.tag](child)
        else:
            data[child.tag] = child.text

    return data

def parse_concertInfo(concertInfo):
    # concertInfo is flat: one text field per child (eventType, Date, ...).
    data = dict()

    for child in concertInfo.getchildren():
        data[child.tag] = child.text

    return data

def parse_worksInfo(worksInfo):
    # worksInfo is just a list of <work> elements.
    data = list()

    for child in worksInfo.getchildren():
        data.append(parse_work(child))

    return data

def parse_work(work):
    # A work is flat apart from its optional <soloists> list.
    dispatch = dict(soloists=parse_soloists)
    data = dict()

    for child in work.getchildren():
        if child.tag in dispatch:
            data[child.tag] = dispatch[child.tag](child)
        else:
            data[child.tag] = child.text

    return data

def parse_soloists(soloists):
    data = list()
    for child in soloists.getchildren():
        data.append(parse_soloist(child))
    return data

def parse_soloist(soloist):
    data = dict()

    for child in soloist.getchildren():
        data[child.tag] = child.text

    return data

def flatten(d):
    # Explode one parsed program into one flat record per work, replicating
    # the concert- and program-level metadata onto each record.
    works = d.pop('worksInfo', [])
    concertInfo = d.pop('concertInfo', {})

    out = []
    for w in works:
        out.append(concertInfo.copy())

        # Pull the soloists' names up into one tab-separated field. Dirty.
        soloists = w.get('soloists', None)

        if soloists is not None:
            soloists_names = [s.get('soloistName') for s in soloists
                              if s.get('soloistName') is not None]
            out[-1]['soloists_tsv'] = '\t'.join(soloists_names)

        w.pop('soloists', None)
        out[-1].update(d)
        out[-1].update(w)

    return out
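
# To make flatten's behaviour concrete, a minimal sketch with hypothetical
# input (not real data):
#
#   example = {'programID': '0001',
#              'concertInfo': {'Date': '1900-01-01T05:00:00Z',
#                              'Venue': 'Carnegie Hall'},
#              'worksInfo': [{'workTitle': 'SYMPHONY NO. 1',
#                             'soloists': [{'soloistName': 'Doe, Jane'}]},
#                            {'workTitle': 'OVERTURE'}]}
#
# flatten(example) yields one flat dict per work, each carrying the concert
# and program metadata:
#
#   [{'Date': '1900-01-01T05:00:00Z', 'Venue': 'Carnegie Hall',
#     'programID': '0001', 'workTitle': 'SYMPHONY NO. 1',
#     'soloists_tsv': 'Doe, Jane'},
#    {'Date': '1900-01-01T05:00:00Z', 'Venue': 'Carnegie Hall',
#     'programID': '0001', 'workTitle': 'OVERTURE'}]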

def load_programs():
    # We need a recovering parser to handle badly escaped &'s in the XML
    parser = etree.XMLParser(recover=True)

    fd = []
    globbed = sorted(glob.glob('./PerformanceHistory/Programs/*.xml'))

    for xmlfile in globbed:
        obj = objectify.parse(xmlfile, parser=parser)
        programs = parse_programs(obj.getroot())
        for prog in programs:
            # Leftover debugging: trace the suspect program examined below
            if prog['programID'] == '11451':
                print(prog['programID'])
            fd.extend(flatten(prog))
    df = pd.DataFrame.from_records(fd)
    df['oldDate'] = df['Date']
    df['Date'] = pd.to_datetime(df['Date'])
    # Drop any structured columns that survived flattening
    del df['worksInfo']
    del df['work']
    del df['concertInfo']

    return df
In [14]:
df = load_programs()
In [15]:
df.head()
Out[15]:
Date Location Time Venue composerName conductorName eventType id interval orchestra program programID season soloists_tsv workTitle oldDate
0 1842-12-07 05:00:00 Manhattan, NY 8:00PM Apollo Rooms Beethoven, Ludwig van Hill, Ureli Corelli Subscription Season 38e072a7-8fc9-4f9a-8eac-3957905c0002 NaN New York Philharmonic NaN 3853 1842-43 NaN SYMPHONY NO. 5 IN C MINOR, OP.67 1842-12-07T05:00:00Z
1 1842-12-07 05:00:00 Manhattan, NY 8:00PM Apollo Rooms Weber, Carl Maria Von Timm, Henry C. Subscription Season 38e072a7-8fc9-4f9a-8eac-3957905c0002 NaN New York Philharmonic NaN 3853 1842-43 Otto, Antoinette OBERON 1842-12-07T05:00:00Z
2 1842-12-07 05:00:00 Manhattan, NY 8:00PM Apollo Rooms Hummel, Johann NaN Subscription Season 38e072a7-8fc9-4f9a-8eac-3957905c0002 NaN New York Philharmonic NaN 3853 1842-43 Scharfenberg, William\tHill, Ureli Corelli\tDe... QUINTET, PIANO, D MINOR, OP. 74 1842-12-07T05:00:00Z
3 1842-12-07 05:00:00 Manhattan, NY 8:00PM Apollo Rooms NaN NaN Subscription Season 38e072a7-8fc9-4f9a-8eac-3957905c0002 Intermission New York Philharmonic NaN 3853 1842-43 NaN NaN 1842-12-07T05:00:00Z
4 1842-12-07 05:00:00 Manhattan, NY 8:00PM Apollo Rooms Weber, Carl Maria Von Etienne, Denis G. Subscription Season 38e072a7-8fc9-4f9a-8eac-3957905c0002 NaN New York Philharmonic NaN 3853 1842-43 NaN OBERON 1842-12-07T05:00:00Z

Number of works performed by composers over time

In [16]:
# Sample the most-programmed composers of all time
# (NB: the [1:5] slice takes ranks 2 through 5, skipping the top composer)

sample_list = list(df.composerName.value_counts()[1:5].index)
In [17]:
sample = df[df.composerName.isin(sample_list)]
In [18]:
all_works = df.groupby(df['Date'].map(lambda x: x.year)).count()
yearly_counts = all_works['id']  # number of works performed per year
In [19]:
yearly_counts.describe()
Out[19]:
count     157.000000
mean      558.694268
std       567.675746
min        18.000000
25%       130.000000
50%       430.000000
75%       646.000000
max      2484.000000
Name: id, dtype: float64

Raw counts

In [20]:
yearly_counts.plot()
plt.title('# works performed in the NY Phil Program Archives')
Out[20]:
<matplotlib.text.Text at 0x110fd0350>
In [21]:
all_programs = df.groupby(df['Date'].map(lambda x:x.year)).programID.nunique()
all_programs.plot()
Out[21]:
<matplotlib.axes.AxesSubplot at 0x110f0bb50>

What's the deal with 1956? There's a bunch of phoney data somewhere: that peak of roughly 900 programs should be spread out over the surrounding decade.
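One way to see what's inflating 1956 (a sketch, not part of the original run): check which season labels the 1956-dated programs carry. If the spike were genuine, nearly all of them would read 1955-56 or 1956-57.

# Sketch: season labels attached to programs dated 1956
in_1956 = df[df.Date.map(lambda x: x.year) == 1956]
in_1956.groupby('season').programID.nunique()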

In [22]:
all_programs.ix[1950:1970].plot()
Out[22]:
<matplotlib.axes.AxesSubplot at 0x110e5dad0>
In [23]:
# NB: attribute assignment like this does NOT create a column (note that
# 'Year' never shows up in df.dtypes below); df['Year'] = ... would.
df.Year = df.Date.map(lambda x: x.year)
In [24]:
df.dtypes
Out[24]:
Date             datetime64[ns]
Location                 object
Time                     object
Venue                    object
composerName             object
conductorName            object
eventType                object
id                       object
interval                 object
orchestra                object
program                  object
programID                object
season                   object
soloists_tsv             object
workTitle                object
oldDate                  object
dtype: object
In [25]:
# See if the issue is with the datetime parsing from before. Seems like it's not.

df['Sanity'] = (df.Date.map(lambda x:str(x)) == df.oldDate.map(lambda x:str(x).replace('Z', '').replace('T', ' ')))
df.Sanity.value_counts()
Out[25]:
True    87715
dtype: int64
In [26]:
del df['Sanity']
In [27]:
# Now check whether the application of x.year is OK.
# First do a kludgy year parse on the string rep of the datetime, i.e. oldDate

oldDate_example = '1842-12-07T05:00:00Z'

def year_from_oldDate(oldDate):
    return int(oldDate.split('-')[0])

assert year_from_oldDate(oldDate_example) == 1842
In [28]:
df['oldYear'] = df['oldDate'].map(year_from_oldDate)
In [29]:
# Same spike! 

all_programs = df.groupby(df['oldYear']).programID.nunique()
all_programs.plot()
Out[29]:
<matplotlib.axes.AxesSubplot at 0x110519bd0>

Here's one I know is messed up. See the season field.

In [33]:
df[df.programID == '11451']
Out[33]:
Date Location Time Venue composerName conductorName eventType id interval orchestra program programID season soloists_tsv workTitle oldDate oldYear
40633 1956-07-12 04:00:00 Manhattan, NY 8:30PM Lewisohn Stadium NaN NaN Stadium Concert c7f0aeae-8de8-4f42-a040-f5a7e624311b NaN Stadium-NY Philharmonic NaN 11451 1962-63 NaN NaN 1956-07-12T04:00:00Z 1956

Here's the salient part of the original .xml. I've no idea what the hell is going on: the Date in the XML (1963-07-25) doesn't even match the DataFrame row above (1956-07-12). One guess is that this program carries several concertInfo blocks (one per performance date), and parse_program, which stores children in a dict, silently keeps only one of them.

In [38]:
!cat './PerformanceHistory/Programs/1955-56_TO_1962-63.xml' | grep -C 10 '11451'
            </work>
            <work>
                <composerName>Falla,  Manuel  de</composerName>
                <workTitle>THREE-CORNERED HAT (EL SOMBRERO DE TRES PICOS), SUITE NO. 2</workTitle>
                <conductorName>Rudel, Julius</conductorName>
            </work>
        </worksInfo>
    </program>
    <program>
        <id>c7f0aeae-8de8-4f42-a040-f5a7e624311b</id>
        <programID>11451</programID>
        <orchestra>Stadium-NY Philharmonic</orchestra>
        <season>1962-63</season>
        <concertInfo>
            <eventType>Stadium Concert</eventType>
            <Location>Manhattan, NY</Location>
            <Venue>Lewisohn Stadium</Venue>
            <Date>1963-07-25T04:00:00Z</Date>
            <Time>8:30PM</Time>
        </concertInfo>
        <worksInfo>
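We can flag every row like this one systematically (a sketch; it assumes season strings always look like '1962-63'): compare each concert's year against the two-year span its season label implies.

# Sketch: flag rows whose concert year falls outside the span implied
# by the season string (assumed format: '1962-63')
def season_start_year(season):
    try:
        return int(season.split('-')[0])
    except (AttributeError, ValueError):
        return None

starts = df.season.map(season_start_year).astype(float)
mismatched = df[(df.oldYear < starts) | (df.oldYear > starts + 1)]
mismatched.programID.nunique()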
In [22]:
for composer in sample_list:
    one_composer = df[df.composerName == composer]
    aggregate = one_composer.groupby(one_composer['Date'].map(lambda x:x.year)).count()
    composer_counts = pd.Series(aggregate['id'], index=aggregate.index, name=composer)
    composer_counts.plot(legend=True, label=composer, alpha=0.7)

plt.ylabel('Number of works performed that year')
plt.xlabel('Year of performance')
Out[22]:
<matplotlib.text.Text at 0x10bf6b390>

As a proportion of all works played that year

In [54]:
for name in sample_list:
    one_composer = df[df.composerName == name]
    aggregate = one_composer.groupby(one_composer['Date'].map(lambda x:x.year)).count()
    composer_counts = pd.Series(aggregate['id'], index=aggregate.index, name=name)
    composer_counts_prop = composer_counts.divide(yearly_counts) * 100
    composer_counts_prop.plot(legend=True, label=name)

plt.ylabel('% of works performed that year')
plt.xlabel('Year of performance')
Out[54]:
<matplotlib.text.Text at 0x109fcbad0>

Discovery of new composers

In [20]:
df.composerName.value_counts()[50:60]
Out[20]:
Rubinstein,  Anton         162
Ives,  Charles             157
Goldmark,  Karl            156
Dukas,  Paul               154
Respighi,  Ottorino        145
Massenet,  Jules           142
Barber,  Samuel            142
Hadley,  Henry  Kimball    141
Borodin,  Alexander        137
Chabrier,  Emmanuel        130
dtype: int64

Who on earth is Hadley, Henry Kimball?

In [61]:
def composer_counts_by_name(name):
    composer = df[df.composerName == name]
    aggregate = composer.groupby(composer['Date'].map(lambda x:x.year)).count()
    annual_composer_counts = pd.Series(aggregate['id'], index=aggregate.index, name=name)
    return annual_composer_counts
    
def plot_composer_by_name(name):
    composer_counts_by_name(name).plot(legend=True, label=name)
In [62]:
plot_composer_by_name('Milhaud,  Darius')
plot_composer_by_name('Gould,  Morton')
In [63]:
plot_composer_by_name('Ravel,  Maurice')
In [64]:
plot_composer_by_name('Hadley,  Henry  Kimball')
In [113]:
hadley = df[df.composerName == 'Hadley,  Henry  Kimball']
hadley.groupby([df.workTitle], sort=True).count()['id'].order(ascending=False).head(10)
Out[113]:
workTitle
CULPRIT FAY RHAPSODY, OP. 62                                   60
SALOME, OP. 55                                                 18
IN BOHEMIA                                                     10
OCEAN, THE, OP. 99                                              8
CHINESE SKETCHES - STREETS OF PEKIN                             6
SYMPHONY NO. 3, B MINOR, OP. 60                                 5
SILHOUETTES, OP.77 (ARR. Roberts)                               5
SYMPHONY NO. 4, D MINOR, OP. 64, "NORTH, EAST, SOUTH, WEST"     5
LUCIFER, OP. 66                                                 4
SYMPHONY NO. 2, F MINOR, OP.30 (FOUR SEASONS)                   3
Name: id, dtype: int64
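
Hadley looks like exactly the kind of faddish composer the opening questions ask about. One rough way to quantify this (a sketch, not part of the original analysis): the share of a composer's performances that falls in his or her single busiest decade.

# Sketch: crude 'faddishness' -- the fraction of a composer's performances
# concentrated in the single busiest decade (1.0 means a pure fad)
def faddishness(name):
    counts = composer_counts_by_name(name)
    decades = counts.index.map(lambda y: (y // 10) * 10)
    by_decade = counts.groupby(decades).sum()
    return by_decade.max() / float(by_decade.sum())

faddishness('Hadley,  Henry  Kimball')  # expect a value near 1
faddishness('Ravel,  Maurice')          # expect something lower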

The social network of NY Philharmonic Soloists

In [115]:
soloists = df.soloists_tsv[df.soloists_tsv.notnull()]
In [116]:
soloist_list = list(soloists)
In [117]:
len([s for s in soloists if ('\t' in s) and (';' in s)])
Out[117]:
0
In [118]:
tab_separated = [s.split('\t') for s in soloists if '\t' in s]
semicolon_separated = [s.split(';') for s in soloists if ';' in s]
In [119]:
soloists_split = tab_separated + semicolon_separated
In [120]:
len(soloists_split)
Out[120]:
5250
In [121]:
from itertools import combinations

played_with_pairs = []

for collection in soloists_split:
    for pair in combinations(collection, 2):
        played_with_pairs.append(pair)
In [122]:
from collections import Counter
cnt = Counter(played_with_pairs)
In [123]:
top_ten_thou = cnt.most_common(n=10000)
In [124]:
with open('edges.txt', 'w') as f:
    f.write('source;target;weight\n')
    for edge, weight in top_ten_thou:
        try:
            f.write('{};{}\n'.format(';'.join(edge), weight))
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Skip pairs whose names won't encode cleanly; this is why
            # wc reports fewer than 10,000 lines below
            continue
In [125]:
!wc edges.txt
    9783   32217  333024 edges.txt
In [126]:
!head edges.txt
source;target;weight
Smith, Philip;Alessi, Joseph;209
Smith, Philip;Myers, Philip;179
Myers, Philip;Alessi, Joseph;163
Smith, Philip;Sullivan, Robert;162
Sullivan, Robert;Alessi, Joseph;161
Alessi, Joseph;Deck, Warren;142
Smith, Philip;Deck, Warren;141
Sullivan, Robert;Myers, Philip;140
Myers, Philip;Deck, Warren;130

Process edges.txt with something like networkx or Gephi to get a picture of the soloist network, something like this:

[pending]
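
In the meantime, a minimal sketch with networkx (assuming it is installed), reading back the edge list written above:

import networkx as nx

# Sketch: load the weighted soloist co-performance graph
G = nx.Graph()
with open('edges.txt') as f:
    next(f)  # skip the header row
    for line in f:
        parts = line.strip().split(';')
        if len(parts) != 3:
            continue
        source, target, weight = parts
        G.add_edge(source, target, weight=int(weight))

# The best-connected soloists by total co-performance weight
degrees = dict(G.degree(weight='weight'))
sorted(degrees.items(), key=lambda kv: kv[1], reverse=True)[:10]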


NY Phil on Tour

In [26]:
df.Location.value_counts().head(10)
Out[26]:
Manhattan, NY       48285
Philadelphia, PA     1217
Brooklyn, NY         1096
Washington, DC        441
Greenvale, NY         277
Baltimore, MD         272
Bronx, NY             202
Princeton, NJ         183
Newark, NJ            174
Pittsburgh, PA        168
dtype: int64
In [60]:
def get_state(location_str):
    # Return the two-letter US state code from 'City, ST'; None otherwise.
    parts = location_str.split(', ')
    if len(parts) != 2:
        return None
    elif len(parts[1]) == 2:
        return parts[1].strip()
    else:
        return None
        
def test_get_state():
    assert get_state('Manhattan, NY') == 'NY'
    assert get_state('Dublin, IRELAND') is None
    assert get_state('foobar,,') is None
    return True

def run_tests():
    assert test_get_state()
    return True

assert run_tests()
In [28]:
df['State'] = df.Location.apply(get_state)
In [29]:
in_usa = df[df.State.notnull()]
# Chain off in_usa: df.State != 'NY' alone would also keep the non-US
# rows, where State is null
out_of_state = in_usa[in_usa.State != 'NY']
In [30]:
decade_state = out_of_state.groupby([(out_of_state.Date.apply(lambda x: x.year)//10)*10,
                                     out_of_state.State]).count()
In [31]:
decade_state.head(10)
Out[31]:
Date Location Time Venue composerName conductorName eventType id interval orchestra program programID season soloists_tsv workTitle
Date State
1890 MA 5 5 5 5 5 5 5 5 0 5 0 5 5 1 5
1900 DC 2 2 2 2 2 2 2 2 0 2 0 2 2 0 2
IA 1 1 1 1 1 1 1 1 0 1 0 1 1 0 1
IL 8 8 8 8 8 8 8 8 0 8 0 8 8 2 8
IN 2 2 2 2 2 2 2 2 0 2 0 2 2 0 2
KS 1 1 1 1 1 1 1 1 0 1 0 1 1 0 1
KY 2 2 2 2 2 2 2 2 0 2 0 2 2 0 2
MI 3 3 3 3 3 3 3 3 0 3 0 3 3 0 3
MN 4 4 4 4 4 4 4 4 0 4 0 4 4 0 4
MO 3 3 3 3 3 3 3 3 0 3 0 3 3 2 3
In [32]:
nineties = decade_state.loc[1990]
# Move index to column for use in plotting package later
nineties.reset_index(level=0, inplace=True)
In [33]:
twenties = decade_state.loc[1920]
twenties.reset_index(level=0, inplace=True)

Use folium for choropleth visualization

In [ ]:
!wget https://raw.githubusercontent.com/python-visualization/folium/master/examples/us-states.json
In [34]:
def inline_map(m, width=650, height=500):
    """Take a folium Map instance and embed its HTML in an iframe."""
    m._build_map()
    srcdoc = m.HTML.replace('"', '&quot;')
    embed = HTML('<iframe srcdoc="{}" '
                 'style="width: {}px; height: {}px; '
                 'border: none"></iframe>'.format(srcdoc, width, height))
    return embed
In [35]:
def state_concert_counts(state_data):
    state_geo = r'us-states.json'
    
    f = folium.Map(location=[48, -102], zoom_start=3, max_zoom=4, min_zoom=3)

    f.geo_json(geo_path=state_geo, data=state_data,
                 data_out='data.json',           
                 columns=['State', 'programID'],
                 key_on='feature.id',
                 fill_color='YlGn', fill_opacity=0.7, line_opacity=0.2,
                 legend_name='Concerts played')
    return inline_map(f)
In [36]:
state_concert_counts(nineties)
Out[36]:
In [37]:
state_concert_counts(twenties)
Out[37]:

For those of you following along on e.g. GitHub or nbviewer, the maps above won't render because they depend on locally hosted .json files. Screenshots below:

In [38]:
Image('1920s.png')
Out[38]:
In [39]:
Image('1990s.png')
Out[39]: