CSV dialect detection with CleverCSV

Author: Gertjan van den Burg

In this note we'll show some examples of using CleverCSV, a package for handling messy CSV files. We'll start with a motivating example and then show some other files where CleverCSV shines. CleverCSV was developed as part of a research project on automating data wrangling. It achieves an accuracy of 97% on over 9300 real-world CSV files and improves the accuracy on messy files by 21% over standard tools.

Handy links:

IMDB Movie data

Alice is a data scientist who would like to analyse the movie ratings on IMDB for movies of different genres. She found a dataset shared by a user on Kaggle that contains information of over 14,000 movies. Great!

The data is stored in a CSV file, which is a very common data format for sharing tabular data. The first few lines of the file look like this:

fn,tid,title,wordsInTitle,url,imdbRating,ratingCount,duration,year,type,nrOfWins,nrOfNominations,nrOfPhotos,nrOfNewsArticles,nrOfUserReviews,nrOfGenre,Action,Adult,Adventure,Animation,Biography,Comedy,Crime,Documentary,Drama,Family,Fantasy,FilmNoir,GameShow,History,Horror,Music,Musical,Mystery,News,RealityTV,Romance,SciFi,Short,Sport,TalkShow,Thriller,War,Western
titles01/tt0012349,tt0012349,Der Vagabund und das Kind (1921),der vagabund und das kind,http://www.imdb.com/title/tt0012349/,8.4,40550,3240,1921,video.movie,1,0,19,96,85,3,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
titles01/tt0015864,tt0015864,Goldrausch (1925),goldrausch,http://www.imdb.com/title/tt0015864/,8.3,45319,5700,1925,video.movie,2,1,35,110,122,3,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
titles01/tt0017136,tt0017136,Metropolis (1927),metropolis,http://www.imdb.com/title/tt0017136/,8.4,81007,9180,1927,video.movie,3,4,67,428,376,2,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
titles01/tt0017925,tt0017925,Der General (1926),der general,http://www.imdb.com/title/tt0017925/,8.3,37521,6420,1926,video.movie,1,1,53,123,219,3,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
titles01/tt0021749,tt0021749,Lichter der Großstadt (1931),lichter der gro stadt,http://www.imdb.com/title/tt0021749/,8.7,70057,5220,1931,video.movie,2,0,38,187,186,3,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0

Seems pretty standard, let's load it with Pandas!

In [1]:
%xmode Minimal
import pandas as pd
df = pd.read_csv('./data/imdb.csv')
Exception reporting mode: Minimal
ParserError: Error tokenizing data. C error: Expected 44 fields in line 66, saw 46

Oh, that doesn't work. Maybe there's something wrong with the file? Let's try opening it with the Python CSV reader:

In [2]:
import csv
with open('./data/imdb.csv', 'r', newline='') as fid:
    dialect = csv.Sniffer().sniff(fid.read())
    print("Detected delimiter = %r, quotechar = %r" % (dialect.delimiter, dialect.quotechar))
    fid.seek(0)
    reader = csv.reader(fid, dialect=dialect)
    rows = list(reader)

print("Loaded %i rows." % len(rows))
Detected delimiter = ' ', quotechar = "'"
Loaded 13928 rows.

Huh, that's strange, Python thinks the space is the delimiter and loads 13928 rows, but the file should contain 14,762 rows according to the documentation. What's going on here?

It turns out that on the 65th line of the file, there's a movie with the title Dr. Seltsam\, oder wie ich lernte\, die Bombe zu lieben (1964) (the German version of Dr. Strangelove). The title has commas in it, that are escaped using the \ character! Why are CSV files so hard? 😑

CleverCSV to the rescue!

CleverCSV detects the dialect of CSV files much more accurately than existing approaches, and it is therefore robust against these kinds of format variations. It even has a wrapper that works with DataFrames!

In [3]:
from clevercsv import read_dataframe

df = read_dataframe('./data/imdb.csv')
df
Out[3]:
fn tid title wordsInTitle url imdbRating ratingCount duration year type ... News RealityTV Romance SciFi Short Sport TalkShow Thriller War Western
0 titles01/tt0012349 tt0012349 Der Vagabund und das Kind (1921) der vagabund und das kind http://www.imdb.com/title/tt0012349/ 8.4 40550.0 3240.0 1921.0 video.movie ... 0 0 0 0 0 0 0 0 0 0
1 titles01/tt0015864 tt0015864 Goldrausch (1925) goldrausch http://www.imdb.com/title/tt0015864/ 8.3 45319.0 5700.0 1925.0 video.movie ... 0 0 0 0 0 0 0 0 0 0
2 titles01/tt0017136 tt0017136 Metropolis (1927) metropolis http://www.imdb.com/title/tt0017136/ 8.4 81007.0 9180.0 1927.0 video.movie ... 0 0 0 1 0 0 0 0 0 0
3 titles01/tt0017925 tt0017925 Der General (1926) der general http://www.imdb.com/title/tt0017925/ 8.3 37521.0 6420.0 1926.0 video.movie ... 0 0 0 0 0 0 0 0 0 0
4 titles01/tt0021749 tt0021749 Lichter der Großstadt (1931) lichter der gro stadt http://www.imdb.com/title/tt0021749/ 8.7 70057.0 5220.0 1931.0 video.movie ... 0 0 1 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
14756 titles04/index.html.9989 tt0672488 "Peep Show" Sectioning (TV Episode 2005) peep show sectioning tv episode http://www.imdb.com/title/tt0672488/ 7.7 135.0 1440.0 2005.0 video.episode ... 0 0 0 0 0 0 0 0 0 0
14757 titles04/index.html.9992 tt0675644 "Playhouse 90" The Miracle Worker (TV Episode ... playhouse the miracle worker tv episode http://www.imdb.com/title/tt0675644/ 7.3 8.0 5400.0 1957.0 video.episode ... 0 0 0 0 0 0 0 0 0 0
14758 titles04/index.html.9994 tt0679222 "Private Screenings" Robert Mitchum and Jane R... private screenings robert mitchum and jane rus... http://www.imdb.com/title/tt0679222/ 7.0 20.0 3600.0 1996.0 video.episode ... 0 0 0 0 0 0 1 0 0 0
14759 titles04/index.html.9995 tt0680064 "Providence" All the King's Men (TV Episode 2002) providence all the king s men tv episode http://www.imdb.com/title/tt0680064/ NaN NaN 3600.0 2002.0 video.episode ... 0 0 0 0 0 0 0 0 0 0
14760 titles04/index.html.9997 tt0681024 "QI" Adam (TV Episode 2003) qi adam tv episode http://www.imdb.com/title/tt0681024/ 7.6 89.0 1800.0 2003.0 video.episode ... 0 0 0 0 0 0 0 0 0 0

14761 rows × 44 columns

Hooray! 🎉

How does it work? CleverCSV searches the space of all possible dialects of a file, and computes a data consistency measure that quantifies how much the resulting table "looks like real data". The consistency measure combines patterns of row lengths in the parsing result and the data type of the resulting cells. This mimicks how a human would identify the dialect. If you're wondering why this problem is hard, it's because every dialect will give you some table, but not necessarily the correct one. More details can be found in the paper.

Other Examples

We'll compare CleverCSV to the built-in Python CSV module and to Pandas and show how these are not as robust as CleverCSV. Note that Pandas always uses the comma as separator, unless it is forced to autodetect the dialect, in which case it uses the Python Sniffer on the first line (we don't show that here). These files are of course selected for this tutorial, because it wouldn't be very interesting to show files where all methods are correct.

Some files come from the UK's open government data portal (see the repo for sources), whereas others come from MIT-licensed GitHub repositories (the URLs point directly to the source files).

We'll define some functions for easy comparisons.

In [4]:
import csv
import clevercsv
import io
import os
import requests
import pandas as pd

from termcolor import colored
from IPython.display import display

def page(url):
    """ Get the content of a webpage using requests, assuming UTF-8 encoding """
    page = requests.get(url)
    content = page.content.decode('utf-8')
    return content

def head(content, num=10):
    """ Preview a CSV file """
    print('--- File Preview ---')
    for i, line in enumerate(io.StringIO(content, newline=None)):
        print(line, end='')
        if i == num - 1:
            break
    print('\n---')

def sniff_url(content):
    """ Utility to run the python Sniffer on a CSV file at a URL """
    try:
        dialect = csv.Sniffer().sniff(content)
        print("CSV Sniffer detected: delimiter = %r, quotechar = %r" % (dialect.delimiter,
                                                                        dialect.quotechar))
    except csv.Error as err:
        print(colored("No result from the Python CSV Sniffer", "red"))
        print(colored("Error was: %s" % err, "red"))

def detect_url(content, verbose=True):
    """ Utility to run the CleverCSV detector on a CSV file at a URL """
    # We have designed CleverCSV to be a drop-in replacement for the CSV module
    try:
        dialect = clevercsv.Sniffer().sniff(content, verbose=verbose)
        print("CleverCSV detected: delimiter = %r, quotechar = %r" % (dialect.delimiter, 
                                                                      dialect.quotechar))
    except clevercsv.Error:
        print(colored("No result from CleverCSV", "red"))

def pandas_url(content):
    """ Wrapper around pandas.read_csv(). """
    buf = io.StringIO(content)
    print(
        "Pandas uses: delimiter = %r, quotechar = %r"
        % (',', '"')
    )
    try:
        df = pd.read_csv(buf)
        display(df.head())
    except pd.errors.ParserError:
        print(colored("ParserError from pandas.", "red"))


def compare(input_, verbose=False, n_preview=10):
    if os.path.exists(input_):
      enc = clevercsv.utils.get_encoding(input_)
      content = open(input_, 'r', newline='', encoding=enc).read()
    else:
      content = page(input_)
    head(content, num=n_preview)
    print("\n1. Running Python Sniffer")
    sniff_url(content)
    print("\n2. Running Pandas")
    pandas_url(content)
    print("\n3. Running CleverCSV")
    detect_url(content, verbose=verbose)

Numbers with comma for decimal point

In [5]:
compare('./data/airedale.csv', n_preview=5)
--- File Preview ---
Department Family,Entity,Payment Date,Expense Type,Expense Area,Supplier,Transaction No.,Amount
DEPARTMENT OF HEALTH,AIREDALE NHS FOUNDATION TRUST,16/07/2010,COMPUTER SOFTWARE / LICENSE  FEES,INFORMATION MANAGEMENT & TECHNOLOGY,ACCENTURE PACS,3003126885,"43,774.58"
DEPARTMENT OF HEALTH,AIREDALE NHS FOUNDATION TRUST,16/07/2010,COMPUTER SOFTWARE / LICENSE  FEES,INFORMATION MANAGEMENT & TECHNOLOGY,ACCENTURE PACS,3003126885,"43,774.58"
DEPARTMENT OF HEALTH,AIREDALE NHS FOUNDATION TRUST,16/07/2010,COMPUTER SOFTWARE / LICENSE  FEES,INFORMATION MANAGEMENT & TECHNOLOGY,ACCENTURE PACS,3003126885,"7,660.55"
DEPARTMENT OF HEALTH,AIREDALE NHS FOUNDATION TRUST,16/07/2010,COMPUTER SOFTWARE / LICENSE  FEES,INFORMATION MANAGEMENT & TECHNOLOGY,ACCENTURE PACS,3003126885,"7,660.55"

---

1. Running Python Sniffer
CSV Sniffer detected: delimiter = '.', quotechar = '"'

2. Running Pandas
Pandas uses: delimiter = ',', quotechar = '"'
Department Family Entity Payment Date Expense Type Expense Area Supplier Transaction No. Amount
0 DEPARTMENT OF HEALTH AIREDALE NHS FOUNDATION TRUST 16/07/2010 COMPUTER SOFTWARE / LICENSE FEES INFORMATION MANAGEMENT & TECHNOLOGY ACCENTURE PACS 3003126885 43,774.58
1 DEPARTMENT OF HEALTH AIREDALE NHS FOUNDATION TRUST 16/07/2010 COMPUTER SOFTWARE / LICENSE FEES INFORMATION MANAGEMENT & TECHNOLOGY ACCENTURE PACS 3003126885 43,774.58
2 DEPARTMENT OF HEALTH AIREDALE NHS FOUNDATION TRUST 16/07/2010 COMPUTER SOFTWARE / LICENSE FEES INFORMATION MANAGEMENT & TECHNOLOGY ACCENTURE PACS 3003126885 7,660.55
3 DEPARTMENT OF HEALTH AIREDALE NHS FOUNDATION TRUST 16/07/2010 COMPUTER SOFTWARE / LICENSE FEES INFORMATION MANAGEMENT & TECHNOLOGY ACCENTURE PACS 3003126885 7,660.55
4 DEPARTMENT OF HEALTH AIREDALE NHS FOUNDATION TRUST 16/07/2010 COMPUTER SOFTWARE / LICENSE FEES INFORMATION MANAGEMENT & TECHNOLOGY ACCENTURE PACS 3003129243 42,022.79
3. Running CleverCSV
CleverCSV detected: delimiter = ',', quotechar = '"'

You'll notice that Python Sniffer says . is the delimiter, Pandas is correct because the file uses the default comma as separator, and CleverCSV detects the dialect correctly as well.

Tab-separated

In [6]:
compare('./data/milk.csv', n_preview=5)
--- File Preview ---
UK Availability  Disposals and Production of Milk and Milk Products Year		AVAILABILITY of RAW MILK UK Milk   Production million litres	AVAILABILITY of RAW MILK Imports  million litres	AVAILABILITY of RAW MILK Total Available million litres	DISPOSALS of RAW MILK For Manufacture For Liquid Milk (1) million litres	DISPOSALS of RAW MILK For Manufacture Total million litres	DISPOSALS of RAW MILK For Manufacture Condensed Milk (3) million litres	DISPOSALS of RAW MILK For Manufacture Milk Powders (3)(4) million litres	DISPOSALS of RAW MILK For Manufacture Butter million cream litres	DISPOSALS of RAW MILK For Manufacture Cheese million litres	DISPOSALS of RAW MILK For Manufacture million cream litres	DISPOSALS of RAW MILK For ManufactureYoghurt million litres	DISPOSALS of RAW MILK For Manufacture Other Products (5) million litres	DISPOSALS of RAW MILK For Manufacture Exports million litres	DISPOSALS of RAW MILK For Manufacture  Stock change & wastage million litres	WHOLESALE PRODUCTION (2) Liquid Milk (1) million litres	WHOLESALE PRODUCTION (2) Condensed Milk (3) th. tonnes	WHOLESALE PRODUCTION (2) Milk Powders (3) (4) th. tonnes	WHOLESALE PRODUCTION (2) Butter (6) th. tonnes	WHOLESALE PRODUCTION (2) Cheese (6)  th. tonnes	INTERVENTION STOCKS Skimmed Milk Powder th. tonnes	INTERVENTION STOCKS Skimmed Milk Powder th. tonnes
1987		14718.2		14718.2	6813.1	7857	698.5	2872.9	373.1	2868.4	169.5		874.7	36	12.1	6576.3	180.3	288.4	182.3	270.2		
1988		14398.8		14398.8	6858.3	7480.6	809.4	2311.9	301.8	3178.4	190.9		688.3	46.9	13	6605.9	182.8	240.1	146.7	299.9		
1989		14186		14186	6859.4	7257.5	794.8	2313.5	287.4	3069.7	226.4		565.7	57.7	11.3	6623	207.3	227.9	139.5	283.1		
1990		14465.7		14465.7	6892.4	7488.9	737.5	2482.5	308.8	3289.1	248.7		422.3	74.5	9.9	6654.8	203.8	236	151.2	315.1		

---

1. Running Python Sniffer
CSV Sniffer detected: delimiter = ' ', quotechar = '"'

2. Running Pandas
Pandas uses: delimiter = ',', quotechar = '"'
UK Availability Disposals and Production of Milk and Milk Products Year\t\tAVAILABILITY of RAW MILK UK Milk Production million litres\tAVAILABILITY of RAW MILK Imports million litres\tAVAILABILITY of RAW MILK Total Available million litres\tDISPOSALS of RAW MILK For Manufacture For Liquid Milk (1) million litres\tDISPOSALS of RAW MILK For Manufacture Total million litres\tDISPOSALS of RAW MILK For Manufacture Condensed Milk (3) million litres\tDISPOSALS of RAW MILK For Manufacture Milk Powders (3)(4) million litres\tDISPOSALS of RAW MILK For Manufacture Butter million cream litres\tDISPOSALS of RAW MILK For Manufacture Cheese million litres\tDISPOSALS of RAW MILK For Manufacture million cream litres\tDISPOSALS of RAW MILK For ManufactureYoghurt million litres\tDISPOSALS of RAW MILK For Manufacture Other Products (5) million litres\tDISPOSALS of RAW MILK For Manufacture Exports million litres\tDISPOSALS of RAW MILK For Manufacture Stock change & wastage million litres\tWHOLESALE PRODUCTION (2) Liquid Milk (1) million litres\tWHOLESALE PRODUCTION (2) Condensed Milk (3) th. tonnes\tWHOLESALE PRODUCTION (2) Milk Powders (3) (4) th. tonnes\tWHOLESALE PRODUCTION (2) Butter (6) th. tonnes\tWHOLESALE PRODUCTION (2) Cheese (6) th. tonnes\tINTERVENTION STOCKS Skimmed Milk Powder th. tonnes\tINTERVENTION STOCKS Skimmed Milk Powder th. tonnes
0 1987\t\t14718.2\t\t14718.2\t6813.1\t7857\t698....
1 1988\t\t14398.8\t\t14398.8\t6858.3\t7480.6\t80...
2 1989\t\t14186\t\t14186\t6859.4\t7257.5\t794.8\...
3 1990\t\t14465.7\t\t14465.7\t6892.4\t7488.9\t73...
4 1991\t\t13992\t\t13992\t6892.8\t7021.9\t706.7\...
3. Running CleverCSV
CleverCSV detected: delimiter = '\t', quotechar = '"'

Sniffer and Pandas are incorrect here, but CleverCSV gets it right.

File with comments

The Python Sniffer gives no result for this file, and Pandas fails because it checks for a rectangular table shape. Note that the text in the comments says that the file uses | as separator, even though it actually uses ,!

In [7]:
compare("https://raw.githubusercontent.com/queq/just-stuff/c1b8714664cc674e1fc685bd957eac548d636a43/pov/TopFixed/build/project_r_pad.csv", n_preview=30)
--- File Preview ---
#Release 14.4 - par P.49d (lin)
#Copyright (c) 1995-2012 Xilinx, Inc.  All rights reserved.

#Thu Jun 20 07:23:42 2013

#
## NOTE: This file is designed to be imported into a spreadsheet program
# such as Microsoft Excel for viewing, printing and sorting. The |
# character is used as the data field separator. This file is also designed
# to support parsing.
#
#INPUT FILE:       project.ncd
#OUTPUT FILE:      project_r_pad.csv
#PART TYPE:        xc3s500e
#SPEED GRADE:      -5
#PACKAGE:          fg320
#
# Pinout by Pin Number:
# 
# -----,-----,-----,-----,-----,-----,-----,-----,-----,-----,-----,-----,-----,-----,-----,
Pin Number,Signal Name,Pin Usage,Pin Name,Direction,IO Standard,IO Bank Number,Drive (mA),Slew Rate,Termination,IOB Delay,Voltage,Constraint,IO Register,Signal Integrity,
A1,,,GND,,,,,,,,,,,,
A2,,,TDI,,,,,,,,,,,,
A3,,IBUF,IP,UNUSED,,0,,,,,,,,,
A4,,DIFFM,IO_L24P_0,UNUSED,,0,,,,,,,,,
A5,,DIFFMI,IP_L22P_0,UNUSED,,0,,,,,,,,,
A6,,DIFFS,IO_L20N_0,UNUSED,,0,,,,,,,,,
A7,,IBUF,IP,UNUSED,,0,,,,,,,,,
A8,,IOB,IO,UNUSED,,0,,,,,,,,,
A9,,,VCCO_0,,,0,,,,,2.50,,,,

---

1. Running Python Sniffer
No result from the Python CSV Sniffer
Error was: Could not determine delimiter

2. Running Pandas
Pandas uses: delimiter = ',', quotechar = '"'
ParserError from pandas.

3. Running CleverCSV
CleverCSV detected: delimiter = ',', quotechar = ''

Semi-colon separated

In [8]:
compare("https://raw.githubusercontent.com/grezesf/Research/17b1e829d1d4b8954661270bd8b099e74bb45ce7/Reservoirs/Task0_Replication/code/preprocessing/factors.csv")
--- File Preview ---
log(E);de_log(E);dede_log(E);cepstrum;de_cepstrum;dede_cepstrum
0.27;1.77;4.97;0.1;0.61;1.75;
1.75;1.25;1.00;1.25;0.50;0.25;
---

1. Running Python Sniffer
No result from the Python CSV Sniffer
Error was: Could not determine delimiter

2. Running Pandas
Pandas uses: delimiter = ',', quotechar = '"'
log(E);de_log(E);dede_log(E);cepstrum;de_cepstrum;dede_cepstrum
0 0.27;1.77;4.97;0.1;0.61;1.75;
1 1.75;1.25;1.00;1.25;0.50;0.25;
3. Running CleverCSV
CleverCSV detected: delimiter = ';', quotechar = ''

Sniffer fails outright, Pandas is incorrect because it assumes comma.

File with multiple tables

In [9]:
compare("https://raw.githubusercontent.com/HAYASAKA-Ryosuke/TodenGraphDay/8f052219d037edabebd488e5f6dc2ddbe8367dc1/juyo-j.csv")
--- File Preview ---
2013/12/12 23:20 UPDATE
ピーク時供給力(万kW),時間帯,供給力情報更新日,供給力情報更新時刻
4849,17:00〜18:00,12/12,8:30

予想最大電力(万kW),時間帯,予想最大電力情報更新日,予想最大電力情報更新時刻
4210,17:00〜18:00,12/12,8:30

DATE,TIME,当日実績(万kW),予測値(万kW)
2013/12/12,0:00,3098,0
2013/12/12,1:00,2948,0

---

1. Running Python Sniffer
CSV Sniffer detected: delimiter = '\r', quotechar = '"'

2. Running Pandas
Pandas uses: delimiter = ',', quotechar = '"'
2013/12/12 23:20 UPDATE
ピーク時供給力(万kW) 時間帯 供給力情報更新日 供給力情報更新時刻
4849 17:00〜18:00 12/12 8:30
予想最大電力(万kW) 時間帯 予想最大電力情報更新日 予想最大電力情報更新時刻
4210 17:00〜18:00 12/12 8:30
DATE TIME 当日実績(万kW) 予測値(万kW)
3. Running CleverCSV
CleverCSV detected: delimiter = ',', quotechar = ''

Sniffer says \r (carriage return) is the delimiter!

Conclusion

We hope you find CleverCSV useful! The package is still in beta, so if you encounter any issues or files where CleverCSV fails, please leave a comment on GitHub!