This short notebook shows a way to set the value of one column in a CSV file, that satisfies multiple conditions, by extracting information from another column using regular expressions. The pandas library is the best tool I know for programmatically working with CSV files. It offers many methods for modifying columns and supports different data types.
While working on ukealong.com, a website with more than 1,000 play along videos for the ukulele, I faced such a task. The meta data for the videos, which includes artists and track names is maintained in a CSV file. The source data comes from several YouTube channels, that create these play along videos. Let's look at a small subset of the data, that includes 10 records and only the relevant columns for demonstrating the column setting.
%load_ext signature
import re
import pandas as pd
df = pd.read_csv('data/ukealong-column-setting-sample.csv')
df
artists | track | title | channel_id | |
---|---|---|---|---|
0 | NaN | NaN | "Something" (The Beatles) Ukulele Play-Along! | UCDglnz22aXMzpug5HbD1bCA |
1 | NaN | NaN | Arthur Theme Song Ukulele Play-Along! | UCDglnz22aXMzpug5HbD1bCA |
2 | NaN | NaN | "Black Magic Woman" (Santana) Ukulele Play-Along! | UCDglnz22aXMzpug5HbD1bCA |
3 | Israel Kamakawiwo'ole | White Sandy Beach of Hawaii | "White Sandy Beach of Hawaii" (Braddah IZ) Uku... | UCDglnz22aXMzpug5HbD1bCA |
4 | NaN | What Will We Do With a Drunken Sailor | What Will We Do With a Drunken Sailor | UC6jZYLoYuV1CxY4Stzm6mgg |
5 | NaN | Frere Jacques | Frere Jacques | UCZjDV_1UEbVsAQA_q9tyTWw |
6 | NaN | NaN | "What's Up?" (4 Non Blondes) Ukulele Play Along! | UCDglnz22aXMzpug5HbD1bCA |
7 | NaN | NaN | "Lucy in the Sky with Diamonds" (The Beatles) ... | UCDglnz22aXMzpug5HbD1bCA |
8 | NaN | NaN | Lava Ukulele Play-along // Cynthia Lin (Chords... | UCD2q6i-C0ZLJUK-VCp49TJA |
9 | Weezer | Island In The Sun | "Island In The Sun" Ukulele Play-Along! | UCbQn9nS2_W-dsmnr3h_Rpvg |
After importing videos from this playlist by Ukulele Underground, I had more than 100 records with empty artists
and track
columns. I updated some of them manually and thought there must be a better way. Most of their YouTube titles follow the format "TRACK" (ARTIST) Ukulele Play-Along!, so the relevant information could be extracted using regular expressions. I only wanted to update records where the artists
and track
columns were not already set, which led to the following condition:
condition = (df['channel_id'] == 'UCDglnz22aXMzpug5HbD1bCA') & df['artists'].isna() & df['track'].isna()
The code cell below shows the two statements for extracting the information from the title
column, which contains the title used on YouTube, using two regular expression. This can be done comfortably using the pandas.Series.str.extract method. The expand
parameter has to be set to False
so a Series
object is returned.
df.loc[condition, 'artists'] = df.title.str.extract(r'\((.+?)\)', expand=False)
df.loc[condition, 'track'] = df.title.str.extract(r'^"(.+?)"', expand=False)
df
artists | track | title | channel_id | |
---|---|---|---|---|
0 | The Beatles | Something | "Something" (The Beatles) Ukulele Play-Along! | UCDglnz22aXMzpug5HbD1bCA |
1 | NaN | NaN | Arthur Theme Song Ukulele Play-Along! | UCDglnz22aXMzpug5HbD1bCA |
2 | Santana | Black Magic Woman | "Black Magic Woman" (Santana) Ukulele Play-Along! | UCDglnz22aXMzpug5HbD1bCA |
3 | Israel Kamakawiwo'ole | White Sandy Beach of Hawaii | "White Sandy Beach of Hawaii" (Braddah IZ) Uku... | UCDglnz22aXMzpug5HbD1bCA |
4 | NaN | What Will We Do With a Drunken Sailor | What Will We Do With a Drunken Sailor | UC6jZYLoYuV1CxY4Stzm6mgg |
5 | NaN | Frere Jacques | Frere Jacques | UCZjDV_1UEbVsAQA_q9tyTWw |
6 | 4 Non Blondes | What's Up? | "What's Up?" (4 Non Blondes) Ukulele Play Along! | UCDglnz22aXMzpug5HbD1bCA |
7 | The Beatles | Lucy in the Sky with Diamonds | "Lucy in the Sky with Diamonds" (The Beatles) ... | UCDglnz22aXMzpug5HbD1bCA |
8 | NaN | NaN | Lava Ukulele Play-along // Cynthia Lin (Chords... | UCD2q6i-C0ZLJUK-VCp49TJA |
9 | Weezer | Island In The Sun | "Island In The Sun" Ukulele Play-Along! | UCbQn9nS2_W-dsmnr3h_Rpvg |
After running this code on the actual CSV file, most of the Ukulele Underground records showed the correct information in the two columns.
In the first published version of this notebook I imported the standard library re module and defined two custom functions using regular expressions to extract the information. Thanks to redditor otterom, who made me aware of the extract
method, the code is now more succinct and straightforward.
Quite frankly, writing the initial version of the code probably took longer than updating 100 records manually would have, let alone documenting the process in this notebook und updating it. On the other hand I can integrate the code in the process of adding new videos. Moreover, I learned something new, now have a reference for similar tasks in the future and it hopefully helps other people who face a similar problem.
Last but not least, if you play the ukulele or want to learn it, I hope you'll dig ukealong.com, share the site with your friends and show the creators of the play along videos some love by subscribing to their channels and liking their videos on YouTube. Playing music is really a joyful way to spend your time and helps to clear the mind in these times of uncertainty.
%signature