Author: Gertjan van den Burg
In this note we'll show some examples of using CleverCSV, a package for handling messy CSV files. We'll start with a motivating example and then show some other files where CleverCSV shines. CleverCSV was developed as part of a research project on automating data wrangling. It achieves an accuracy of 97% on over 9300 real-world CSV files and improves the accuracy on messy files by 21% over standard tools.
Handy links:
Alice is a data scientist who would like to analyse the movie ratings on IMDB for movies of different genres. She found a dataset shared by a user on Kaggle that contains information of over 14,000 movies. Great!
The data is stored in a CSV file, which is a very common data format for sharing tabular data. The first few lines of the file look like this:
fn,tid,title,wordsInTitle,url,imdbRating,ratingCount,duration,year,type,nrOfWins,nrOfNominations,nrOfPhotos,nrOfNewsArticles,nrOfUserReviews,nrOfGenre,Action,Adult,Adventure,Animation,Biography,Comedy,Crime,Documentary,Drama,Family,Fantasy,FilmNoir,GameShow,History,Horror,Music,Musical,Mystery,News,RealityTV,Romance,SciFi,Short,Sport,TalkShow,Thriller,War,Western
titles01/tt0012349,tt0012349,Der Vagabund und das Kind (1921),der vagabund und das kind,http://www.imdb.com/title/tt0012349/,8.4,40550,3240,1921,video.movie,1,0,19,96,85,3,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
titles01/tt0015864,tt0015864,Goldrausch (1925),goldrausch,http://www.imdb.com/title/tt0015864/,8.3,45319,5700,1925,video.movie,2,1,35,110,122,3,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
titles01/tt0017136,tt0017136,Metropolis (1927),metropolis,http://www.imdb.com/title/tt0017136/,8.4,81007,9180,1927,video.movie,3,4,67,428,376,2,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
titles01/tt0017925,tt0017925,Der General (1926),der general,http://www.imdb.com/title/tt0017925/,8.3,37521,6420,1926,video.movie,1,1,53,123,219,3,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
titles01/tt0021749,tt0021749,Lichter der Großstadt (1931),lichter der gro stadt,http://www.imdb.com/title/tt0021749/,8.7,70057,5220,1931,video.movie,2,0,38,187,186,3,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
Seems pretty standard, let's load it with Pandas!
%xmode Minimal
import pandas as pd
df = pd.read_csv('./data/imdb.csv')
Exception reporting mode: Minimal
ParserError: Error tokenizing data. C error: Expected 44 fields in line 66, saw 46
Oh, that doesn't work. Maybe there's something wrong with the file? Let's try opening it with the Python CSV reader:
import csv
with open('./data/imdb.csv', 'r', newline='') as fid:
dialect = csv.Sniffer().sniff(fid.read())
print("Detected delimiter = %r, quotechar = %r" % (dialect.delimiter, dialect.quotechar))
fid.seek(0)
reader = csv.reader(fid, dialect=dialect)
rows = list(reader)
print("Loaded %i rows." % len(rows))
Detected delimiter = ' ', quotechar = "'" Loaded 13928 rows.
Huh, that's strange, Python thinks the space is the delimiter and loads 13928 rows, but the file should contain 14,762 rows according to the documentation. What's going on here?
It turns out that on the 65th line of the file, there's a movie with the title
Dr. Seltsam\, oder wie ich lernte\, die Bombe zu lieben (1964)
(the German
version of Dr. Strangelove). The title has commas in it, that are escaped
using the \
character! Why are CSV files so hard? 😑
CleverCSV to the rescue!
CleverCSV detects the dialect of CSV files much more accurately than existing approaches, and it is therefore robust against these kinds of format variations. It even has a wrapper that works with DataFrames!
from clevercsv import read_dataframe
df = read_dataframe('./data/imdb.csv')
df
fn | tid | title | wordsInTitle | url | imdbRating | ratingCount | duration | year | type | ... | News | RealityTV | Romance | SciFi | Short | Sport | TalkShow | Thriller | War | Western | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | titles01/tt0012349 | tt0012349 | Der Vagabund und das Kind (1921) | der vagabund und das kind | http://www.imdb.com/title/tt0012349/ | 8.4 | 40550.0 | 3240.0 | 1921.0 | video.movie | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | titles01/tt0015864 | tt0015864 | Goldrausch (1925) | goldrausch | http://www.imdb.com/title/tt0015864/ | 8.3 | 45319.0 | 5700.0 | 1925.0 | video.movie | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | titles01/tt0017136 | tt0017136 | Metropolis (1927) | metropolis | http://www.imdb.com/title/tt0017136/ | 8.4 | 81007.0 | 9180.0 | 1927.0 | video.movie | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | titles01/tt0017925 | tt0017925 | Der General (1926) | der general | http://www.imdb.com/title/tt0017925/ | 8.3 | 37521.0 | 6420.0 | 1926.0 | video.movie | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | titles01/tt0021749 | tt0021749 | Lichter der Großstadt (1931) | lichter der gro stadt | http://www.imdb.com/title/tt0021749/ | 8.7 | 70057.0 | 5220.0 | 1931.0 | video.movie | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
14756 | titles04/index.html.9989 | tt0672488 | "Peep Show" Sectioning (TV Episode 2005) | peep show sectioning tv episode | http://www.imdb.com/title/tt0672488/ | 7.7 | 135.0 | 1440.0 | 2005.0 | video.episode | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
14757 | titles04/index.html.9992 | tt0675644 | "Playhouse 90" The Miracle Worker (TV Episode ... | playhouse the miracle worker tv episode | http://www.imdb.com/title/tt0675644/ | 7.3 | 8.0 | 5400.0 | 1957.0 | video.episode | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
14758 | titles04/index.html.9994 | tt0679222 | "Private Screenings" Robert Mitchum and Jane R... | private screenings robert mitchum and jane rus... | http://www.imdb.com/title/tt0679222/ | 7.0 | 20.0 | 3600.0 | 1996.0 | video.episode | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
14759 | titles04/index.html.9995 | tt0680064 | "Providence" All the King's Men (TV Episode 2002) | providence all the king s men tv episode | http://www.imdb.com/title/tt0680064/ | NaN | NaN | 3600.0 | 2002.0 | video.episode | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
14760 | titles04/index.html.9997 | tt0681024 | "QI" Adam (TV Episode 2003) | qi adam tv episode | http://www.imdb.com/title/tt0681024/ | 7.6 | 89.0 | 1800.0 | 2003.0 | video.episode | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
14761 rows × 44 columns
Hooray! 🎉
How does it work? CleverCSV searches the space of all possible dialects of a file, and computes a data consistency measure that quantifies how much the resulting table "looks like real data". The consistency measure combines patterns of row lengths in the parsing result and the data type of the resulting cells. This mimicks how a human would identify the dialect. If you're wondering why this problem is hard, it's because every dialect will give you some table, but not necessarily the correct one. More details can be found in the paper.
We'll compare CleverCSV to the built-in Python CSV module and to Pandas and show how these are not as robust as CleverCSV. Note that Pandas always uses the comma as separator, unless it is forced to autodetect the dialect, in which case it uses the Python Sniffer on the first line (we don't show that here). These files are of course selected for this tutorial, because it wouldn't be very interesting to show files where all methods are correct.
Some files come from the UK's open government data portal (see the repo for sources), whereas others come from MIT-licensed GitHub repositories (the URLs point directly to the source files).
We'll define some functions for easy comparisons.
import csv
import clevercsv
import io
import os
import requests
import pandas as pd
from termcolor import colored
from IPython.display import display
def page(url):
""" Get the content of a webpage using requests, assuming UTF-8 encoding """
page = requests.get(url)
content = page.content.decode('utf-8')
return content
def head(content, num=10):
""" Preview a CSV file """
print('--- File Preview ---')
for i, line in enumerate(io.StringIO(content, newline=None)):
print(line, end='')
if i == num - 1:
break
print('\n---')
def sniff_url(content):
""" Utility to run the python Sniffer on a CSV file at a URL """
try:
dialect = csv.Sniffer().sniff(content)
print("CSV Sniffer detected: delimiter = %r, quotechar = %r" % (dialect.delimiter,
dialect.quotechar))
except csv.Error as err:
print(colored("No result from the Python CSV Sniffer", "red"))
print(colored("Error was: %s" % err, "red"))
def detect_url(content, verbose=True):
""" Utility to run the CleverCSV detector on a CSV file at a URL """
# We have designed CleverCSV to be a drop-in replacement for the CSV module
try:
dialect = clevercsv.Sniffer().sniff(content, verbose=verbose)
print("CleverCSV detected: delimiter = %r, quotechar = %r" % (dialect.delimiter,
dialect.quotechar))
except clevercsv.Error:
print(colored("No result from CleverCSV", "red"))
def pandas_url(content):
""" Wrapper around pandas.read_csv(). """
buf = io.StringIO(content)
print(
"Pandas uses: delimiter = %r, quotechar = %r"
% (',', '"')
)
try:
df = pd.read_csv(buf)
display(df.head())
except pd.errors.ParserError:
print(colored("ParserError from pandas.", "red"))
def compare(input_, verbose=False, n_preview=10):
if os.path.exists(input_):
enc = clevercsv.utils.get_encoding(input_)
content = open(input_, 'r', newline='', encoding=enc).read()
else:
content = page(input_)
head(content, num=n_preview)
print("\n1. Running Python Sniffer")
sniff_url(content)
print("\n2. Running Pandas")
pandas_url(content)
print("\n3. Running CleverCSV")
detect_url(content, verbose=verbose)
compare('./data/airedale.csv', n_preview=5)
--- File Preview --- Department Family,Entity,Payment Date,Expense Type,Expense Area,Supplier,Transaction No.,Amount DEPARTMENT OF HEALTH,AIREDALE NHS FOUNDATION TRUST,16/07/2010,COMPUTER SOFTWARE / LICENSE FEES,INFORMATION MANAGEMENT & TECHNOLOGY,ACCENTURE PACS,3003126885,"43,774.58" DEPARTMENT OF HEALTH,AIREDALE NHS FOUNDATION TRUST,16/07/2010,COMPUTER SOFTWARE / LICENSE FEES,INFORMATION MANAGEMENT & TECHNOLOGY,ACCENTURE PACS,3003126885,"43,774.58" DEPARTMENT OF HEALTH,AIREDALE NHS FOUNDATION TRUST,16/07/2010,COMPUTER SOFTWARE / LICENSE FEES,INFORMATION MANAGEMENT & TECHNOLOGY,ACCENTURE PACS,3003126885,"7,660.55" DEPARTMENT OF HEALTH,AIREDALE NHS FOUNDATION TRUST,16/07/2010,COMPUTER SOFTWARE / LICENSE FEES,INFORMATION MANAGEMENT & TECHNOLOGY,ACCENTURE PACS,3003126885,"7,660.55" --- 1. Running Python Sniffer CSV Sniffer detected: delimiter = '.', quotechar = '"' 2. Running Pandas Pandas uses: delimiter = ',', quotechar = '"'
Department Family | Entity | Payment Date | Expense Type | Expense Area | Supplier | Transaction No. | Amount | |
---|---|---|---|---|---|---|---|---|
0 | DEPARTMENT OF HEALTH | AIREDALE NHS FOUNDATION TRUST | 16/07/2010 | COMPUTER SOFTWARE / LICENSE FEES | INFORMATION MANAGEMENT & TECHNOLOGY | ACCENTURE PACS | 3003126885 | 43,774.58 |
1 | DEPARTMENT OF HEALTH | AIREDALE NHS FOUNDATION TRUST | 16/07/2010 | COMPUTER SOFTWARE / LICENSE FEES | INFORMATION MANAGEMENT & TECHNOLOGY | ACCENTURE PACS | 3003126885 | 43,774.58 |
2 | DEPARTMENT OF HEALTH | AIREDALE NHS FOUNDATION TRUST | 16/07/2010 | COMPUTER SOFTWARE / LICENSE FEES | INFORMATION MANAGEMENT & TECHNOLOGY | ACCENTURE PACS | 3003126885 | 7,660.55 |
3 | DEPARTMENT OF HEALTH | AIREDALE NHS FOUNDATION TRUST | 16/07/2010 | COMPUTER SOFTWARE / LICENSE FEES | INFORMATION MANAGEMENT & TECHNOLOGY | ACCENTURE PACS | 3003126885 | 7,660.55 |
4 | DEPARTMENT OF HEALTH | AIREDALE NHS FOUNDATION TRUST | 16/07/2010 | COMPUTER SOFTWARE / LICENSE FEES | INFORMATION MANAGEMENT & TECHNOLOGY | ACCENTURE PACS | 3003129243 | 42,022.79 |
3. Running CleverCSV CleverCSV detected: delimiter = ',', quotechar = '"'
You'll notice that Python Sniffer says .
is the delimiter, Pandas is
correct because the file uses the default comma as separator, and CleverCSV
detects the dialect correctly as well.
compare('./data/milk.csv', n_preview=5)
--- File Preview --- UK Availability Disposals and Production of Milk and Milk Products Year AVAILABILITY of RAW MILK UK Milk Production million litres AVAILABILITY of RAW MILK Imports million litres AVAILABILITY of RAW MILK Total Available million litres DISPOSALS of RAW MILK For Manufacture For Liquid Milk (1) million litres DISPOSALS of RAW MILK For Manufacture Total million litres DISPOSALS of RAW MILK For Manufacture Condensed Milk (3) million litres DISPOSALS of RAW MILK For Manufacture Milk Powders (3)(4) million litres DISPOSALS of RAW MILK For Manufacture Butter million cream litres DISPOSALS of RAW MILK For Manufacture Cheese million litres DISPOSALS of RAW MILK For Manufacture million cream litres DISPOSALS of RAW MILK For ManufactureYoghurt million litres DISPOSALS of RAW MILK For Manufacture Other Products (5) million litres DISPOSALS of RAW MILK For Manufacture Exports million litres DISPOSALS of RAW MILK For Manufacture Stock change & wastage million litres WHOLESALE PRODUCTION (2) Liquid Milk (1) million litres WHOLESALE PRODUCTION (2) Condensed Milk (3) th. tonnes WHOLESALE PRODUCTION (2) Milk Powders (3) (4) th. tonnes WHOLESALE PRODUCTION (2) Butter (6) th. tonnes WHOLESALE PRODUCTION (2) Cheese (6) th. tonnes INTERVENTION STOCKS Skimmed Milk Powder th. tonnes INTERVENTION STOCKS Skimmed Milk Powder th. tonnes 1987 14718.2 14718.2 6813.1 7857 698.5 2872.9 373.1 2868.4 169.5 874.7 36 12.1 6576.3 180.3 288.4 182.3 270.2 1988 14398.8 14398.8 6858.3 7480.6 809.4 2311.9 301.8 3178.4 190.9 688.3 46.9 13 6605.9 182.8 240.1 146.7 299.9 1989 14186 14186 6859.4 7257.5 794.8 2313.5 287.4 3069.7 226.4 565.7 57.7 11.3 6623 207.3 227.9 139.5 283.1 1990 14465.7 14465.7 6892.4 7488.9 737.5 2482.5 308.8 3289.1 248.7 422.3 74.5 9.9 6654.8 203.8 236 151.2 315.1 --- 1. Running Python Sniffer CSV Sniffer detected: delimiter = ' ', quotechar = '"' 2. Running Pandas Pandas uses: delimiter = ',', quotechar = '"'
UK Availability Disposals and Production of Milk and Milk Products Year\t\tAVAILABILITY of RAW MILK UK Milk Production million litres\tAVAILABILITY of RAW MILK Imports million litres\tAVAILABILITY of RAW MILK Total Available million litres\tDISPOSALS of RAW MILK For Manufacture For Liquid Milk (1) million litres\tDISPOSALS of RAW MILK For Manufacture Total million litres\tDISPOSALS of RAW MILK For Manufacture Condensed Milk (3) million litres\tDISPOSALS of RAW MILK For Manufacture Milk Powders (3)(4) million litres\tDISPOSALS of RAW MILK For Manufacture Butter million cream litres\tDISPOSALS of RAW MILK For Manufacture Cheese million litres\tDISPOSALS of RAW MILK For Manufacture million cream litres\tDISPOSALS of RAW MILK For ManufactureYoghurt million litres\tDISPOSALS of RAW MILK For Manufacture Other Products (5) million litres\tDISPOSALS of RAW MILK For Manufacture Exports million litres\tDISPOSALS of RAW MILK For Manufacture Stock change & wastage million litres\tWHOLESALE PRODUCTION (2) Liquid Milk (1) million litres\tWHOLESALE PRODUCTION (2) Condensed Milk (3) th. tonnes\tWHOLESALE PRODUCTION (2) Milk Powders (3) (4) th. tonnes\tWHOLESALE PRODUCTION (2) Butter (6) th. tonnes\tWHOLESALE PRODUCTION (2) Cheese (6) th. tonnes\tINTERVENTION STOCKS Skimmed Milk Powder th. tonnes\tINTERVENTION STOCKS Skimmed Milk Powder th. tonnes | |
---|---|
0 | 1987\t\t14718.2\t\t14718.2\t6813.1\t7857\t698.... |
1 | 1988\t\t14398.8\t\t14398.8\t6858.3\t7480.6\t80... |
2 | 1989\t\t14186\t\t14186\t6859.4\t7257.5\t794.8\... |
3 | 1990\t\t14465.7\t\t14465.7\t6892.4\t7488.9\t73... |
4 | 1991\t\t13992\t\t13992\t6892.8\t7021.9\t706.7\... |
3. Running CleverCSV CleverCSV detected: delimiter = '\t', quotechar = '"'
Sniffer and Pandas are incorrect here, but CleverCSV gets it right.
The Python Sniffer gives no result for this file, and Pandas fails because it
checks for a rectangular table shape. Note that the text in the comments says
that the file uses |
as separator, even though it actually uses ,
!
compare("https://raw.githubusercontent.com/queq/just-stuff/c1b8714664cc674e1fc685bd957eac548d636a43/pov/TopFixed/build/project_r_pad.csv", n_preview=30)
--- File Preview --- #Release 14.4 - par P.49d (lin) #Copyright (c) 1995-2012 Xilinx, Inc. All rights reserved. #Thu Jun 20 07:23:42 2013 # ## NOTE: This file is designed to be imported into a spreadsheet program # such as Microsoft Excel for viewing, printing and sorting. The | # character is used as the data field separator. This file is also designed # to support parsing. # #INPUT FILE: project.ncd #OUTPUT FILE: project_r_pad.csv #PART TYPE: xc3s500e #SPEED GRADE: -5 #PACKAGE: fg320 # # Pinout by Pin Number: # # -----,-----,-----,-----,-----,-----,-----,-----,-----,-----,-----,-----,-----,-----,-----, Pin Number,Signal Name,Pin Usage,Pin Name,Direction,IO Standard,IO Bank Number,Drive (mA),Slew Rate,Termination,IOB Delay,Voltage,Constraint,IO Register,Signal Integrity, A1,,,GND,,,,,,,,,,,, A2,,,TDI,,,,,,,,,,,, A3,,IBUF,IP,UNUSED,,0,,,,,,,,, A4,,DIFFM,IO_L24P_0,UNUSED,,0,,,,,,,,, A5,,DIFFMI,IP_L22P_0,UNUSED,,0,,,,,,,,, A6,,DIFFS,IO_L20N_0,UNUSED,,0,,,,,,,,, A7,,IBUF,IP,UNUSED,,0,,,,,,,,, A8,,IOB,IO,UNUSED,,0,,,,,,,,, A9,,,VCCO_0,,,0,,,,,2.50,,,, --- 1. Running Python Sniffer No result from the Python CSV Sniffer Error was: Could not determine delimiter 2. Running Pandas Pandas uses: delimiter = ',', quotechar = '"' ParserError from pandas. 3. Running CleverCSV CleverCSV detected: delimiter = ',', quotechar = ''
compare("https://raw.githubusercontent.com/grezesf/Research/17b1e829d1d4b8954661270bd8b099e74bb45ce7/Reservoirs/Task0_Replication/code/preprocessing/factors.csv")
--- File Preview --- log(E);de_log(E);dede_log(E);cepstrum;de_cepstrum;dede_cepstrum 0.27;1.77;4.97;0.1;0.61;1.75; 1.75;1.25;1.00;1.25;0.50;0.25; --- 1. Running Python Sniffer No result from the Python CSV Sniffer Error was: Could not determine delimiter 2. Running Pandas Pandas uses: delimiter = ',', quotechar = '"'
log(E);de_log(E);dede_log(E);cepstrum;de_cepstrum;dede_cepstrum | |
---|---|
0 | 0.27;1.77;4.97;0.1;0.61;1.75; |
1 | 1.75;1.25;1.00;1.25;0.50;0.25; |
3. Running CleverCSV CleverCSV detected: delimiter = ';', quotechar = ''
Sniffer fails outright, Pandas is incorrect because it assumes comma.
compare("https://raw.githubusercontent.com/HAYASAKA-Ryosuke/TodenGraphDay/8f052219d037edabebd488e5f6dc2ddbe8367dc1/juyo-j.csv")
--- File Preview --- 2013/12/12 23:20 UPDATE ピーク時供給力(万kW),時間帯,供給力情報更新日,供給力情報更新時刻 4849,17:00〜18:00,12/12,8:30 予想最大電力(万kW),時間帯,予想最大電力情報更新日,予想最大電力情報更新時刻 4210,17:00〜18:00,12/12,8:30 DATE,TIME,当日実績(万kW),予測値(万kW) 2013/12/12,0:00,3098,0 2013/12/12,1:00,2948,0 --- 1. Running Python Sniffer CSV Sniffer detected: delimiter = '\r', quotechar = '"' 2. Running Pandas Pandas uses: delimiter = ',', quotechar = '"'
2013/12/12 23:20 UPDATE | |||
---|---|---|---|
ピーク時供給力(万kW) | 時間帯 | 供給力情報更新日 | 供給力情報更新時刻 |
4849 | 17:00〜18:00 | 12/12 | 8:30 |
予想最大電力(万kW) | 時間帯 | 予想最大電力情報更新日 | 予想最大電力情報更新時刻 |
4210 | 17:00〜18:00 | 12/12 | 8:30 |
DATE | TIME | 当日実績(万kW) | 予測値(万kW) |
3. Running CleverCSV CleverCSV detected: delimiter = ',', quotechar = ''
Sniffer says \r
(carriage return) is the delimiter!
We hope you find CleverCSV useful! The package is still in beta, so if you encounter any issues or files where CleverCSV fails, please leave a comment on GitHub!