In this analysis, the Gunning Fog Index was used to measure the readability of EDMW and Reddit comments.

The Gunning Fog formula generates a grade level, typically between 0 and 20. The formula estimates the years of formal education the reader requires to understand the text on first reading.

For example, a text with a grade level of 6 should be easily readable by someone educated to 6th grade in the US school system, i.e. an 11 to 12 year old.

Text to be read by the general public should aim for a grade level of around 8. Text above a score of 17 should be taken to have graduate level readability.

The formula for Gunning Fog is 0.4 × [(words / sentences) + 100 × (complex words / words)], where complex words are those containing three or more syllables.
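
As a quick check of the arithmetic, the formula can be written as a small function that works from raw counts. This is only a sketch: `gunning_fog` is a hypothetical helper, not part of textatistic, and the counts below are made up for illustration.

```python
def gunning_fog(words: int, sentences: int, complex_words: int) -> float:
    """Gunning Fog grade level computed from raw counts."""
    if words == 0 or sentences == 0:
        raise ValueError("need at least one word and one sentence")
    average_sentence_length = words / sentences
    percent_complex = 100 * complex_words / words
    return 0.4 * (average_sentence_length + percent_complex)


# Made-up passage: 120 words, 8 sentences, 6 words of three or more syllables
example_score = gunning_fog(words=120, sentences=8, complex_words=6)
print(round(example_score, 2))
```

Here the average sentence length is 15 words and 5% of the words are complex, so the score is 0.4 × (15 + 5) = 8, roughly the level suggested above for the general public.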

The textatistic module was used to derive the Gunning Fog Index:

http://www.erinhengel.com/software/textatistic/

In [1]:
import re
import pandas as pd
import numpy as np
from textatistic import Textatistic
In [2]:
edmw = pd.read_csv('/Users/nus/Desktop/Project_Forum/edmw_data_clean.csv')
reddit = pd.read_csv('/Users/nus/Desktop/Project_Forum/reddit_data_clean.csv')
In [3]:
reddit_comment = reddit['Comment']
edmw_comment = edmw['Comment']
In [4]:
# Make sure every comment ends with terminal punctuation
def fullstop(x):
    text = str(x)
    # endswith() is also safe for empty strings, where text[-1] would raise IndexError
    if text.endswith(('.', '?', '!')):
        return text
    return text + '.'
In [5]:
reddit_comment = reddit_comment.apply(fullstop)
edmw_comment = edmw_comment.apply(fullstop)
In [6]:
# Collapse repeated punctuation (e.g. '???', '...') into a single full stop,
# otherwise each mark would be counted as a separate sentence
reddit_comment = reddit_comment.str.replace(r'[.?!]{2,}', '.', regex=True)
edmw_comment = edmw_comment.str.replace(r'[.?!]{2,}', '.', regex=True)
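
The two cleaning steps can be checked on a few made-up comments. The sketch below uses plain `re` instead of pandas so that it runs on its own; `normalise_comment` is a hypothetical name and the sample strings are invented.

```python
import re


def normalise_comment(comment: str) -> str:
    """Append a full stop if needed, then collapse repeated punctuation."""
    text = str(comment)
    if not text.endswith(('.', '?', '!')):
        text += '.'
    # Same pattern as the pandas call: two or more of . ? ! become one full stop
    return re.sub(r'[.?!]{2,}', '.', text)


samples = ['wah so hot', 'really???', 'ok... sure']
cleaned = [normalise_comment(s) for s in samples]
print(cleaned)
```

Note that a run such as '???' becomes a full stop, not a question mark. This does not matter for the Gunning Fog Index, which only counts sentences.
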
In [7]:
reddit_comment = reddit_comment.to_numpy()
edmw_comment = edmw_comment.to_numpy()
In [8]:
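# Combine consecutive comments into texts of at least 100 words.
# Note: the comment that arrives once the buffer is full is not added to any text,
# and comments are joined without a separating space.
reddit_text = ''
reddit_combined_comments = []
for x in reddit_comment:
    if len(re.findall(r'\w+', reddit_text)) < 100:
        reddit_text = reddit_text + str(x)
    else:
        reddit_combined_comments.append(reddit_text)
        reddit_text = ''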
        
len(reddit_combined_comments)
Out[8]:
204955
In [9]:
# Same procedure for the EDMW comments
edmw_text = ''
edmw_combined_comments = []
for x in edmw_comment:
    if len(re.findall(r'\w+', edmw_text)) < 100:
        edmw_text = edmw_text + str(x)
    else:
        edmw_combined_comments.append(edmw_text)
        edmw_text = ''
        
len(edmw_combined_comments)
Out[9]:
288474
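
The two loops above skip the comment that arrives once the buffer has reached 100 words, and they join comments without a space. A possible variant that keeps every comment and joins with spaces is sketched below. `combine_comments` is a hypothetical helper that was not used for the results in this notebook, and it would give different counts from those reported above.

```python
import re


def combine_comments(comments, min_words=100):
    """Join consecutive comments into chunks of at least min_words words."""
    chunks = []
    current = []
    word_count = 0
    for comment in comments:
        text = str(comment)
        current.append(text)
        word_count += len(re.findall(r'\w+', text))
        if word_count >= min_words:
            chunks.append(' '.join(current))
            current, word_count = [], 0
    # A trailing remainder shorter than min_words is discarded
    return chunks


# Toy input with a small threshold so the behaviour is easy to follow
toy_comments = ['one two three.', 'four five.', 'six seven eight nine.', 'ten.']
toy_chunks = combine_comments(toy_comments, min_words=5)
print(toy_chunks)
```

Keeping a running word count also avoids re-tokenising the whole buffer on every iteration.
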
In [10]:
reddit_df = pd.DataFrame({'Combined Comments':reddit_combined_comments})
edmw_df = pd.DataFrame({'Combined Comments':edmw_combined_comments})
In [11]:
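# Compute the Gunning Fog Index for each combined text
reddit_gunningfog = []
for count, x in enumerate(reddit_combined_comments, start=1):
    print(count, end='\r')  # progress counter
    try:
        score = Textatistic(str(x)).gunningfog_score
        reddit_gunningfog.append(score)
    except Exception:
        # Mark texts that Textatistic cannot score; these rows are dropped later
        reddit_gunningfog.append('error')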
204955
In [28]:
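# Same scoring loop for the EDMW texts
edmw_gunningfog = []
for count, x in enumerate(edmw_combined_comments, start=1):
    print(count, end='\r')  # progress counter
    try:
        score = Textatistic(str(x)).gunningfog_score
        edmw_gunningfog.append(score)
    except Exception:
        # Mark texts that Textatistic cannot score; these rows are dropped later
        edmw_gunningfog.append('error')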
288474
In [30]:
reddit_df['Gunning Fog Index'] = reddit_gunningfog
edmw_df['Gunning Fog Index'] = edmw_gunningfog
In [35]:
reddit_df.drop(reddit_df[reddit_df['Gunning Fog Index']=='error'].index,inplace = True)
edmw_df.drop(edmw_df[edmw_df['Gunning Fog Index']=='error'].index,inplace = True)
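
Because the score column mixes floats with the string 'error', pandas stores it with the object dtype. One alternative, sketched here on a toy frame with made-up scores, is to coerce the column to numeric and drop the resulting missing values. This is an assumption about how the step could be written, not what was run above.

```python
import pandas as pd

# Toy frame with the same column names; the scores are invented
toy_df = pd.DataFrame({
    'Combined Comments': ['first chunk.', 'second chunk.', 'third chunk.'],
    'Gunning Fog Index': [7.2, 'error', 5.8],
})

# Non-numeric entries such as 'error' become NaN, then those rows are removed
toy_df['Gunning Fog Index'] = pd.to_numeric(toy_df['Gunning Fog Index'], errors='coerce')
toy_df = toy_df.dropna(subset=['Gunning Fog Index'])

print(toy_df['Gunning Fog Index'].dtype)
print(toy_df['Gunning Fog Index'].mean())
```

The column then has a float dtype, so later aggregations do not have to work on Python objects.
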
In [38]:
edmw_df.to_csv('~/Desktop/Project_Forum/edmw_readibility.csv')
reddit_df.to_csv('~/Desktop/Project_Forum/reddit_readibility.csv')
In [43]:
np.mean(reddit_df['Gunning Fog Index'])
Out[43]:
7.605431178459541
In [44]:
np.mean(edmw_df['Gunning Fog Index'])
Out[44]:
6.250669099489326

Limitations:

  1. This metric works best on pure English text, but comments from EDMW and Reddit contain occasional Chinese characters (especially on EDMW), emoticons, text emojis :) and other unconventionally structured content ¯_(ツ)_/¯. The comments were cleaned as far as possible, but some irregular comments remain and will affect the scores.
  1. Because comments are unstructured and informal, readability scores (Flesch, Flesch-Kincaid, Dale-Chall, etc.) vary substantially between Python readability modules (Textatistic, Readability, etc.) and online readability calculators, depending on how each is implemented. This is probably because the tools count sentences and syllables differently, and emoticons and symbols in comments complicate this further. After testing several modules and online calculators, the Gunning Fog Index appeared to be the most consistent of the metrics and so perhaps the most suitable for this context.
  1. Although it is regarded as an accurate readability formula, the Gunning Fog Index has flaws. For example, it ignores the fact that not all multi-syllable words are difficult.
  1. The Gunning Fog Index considers a text readable if it has short sentences with few or no multi-syllable words. Reddit therefore has a higher mean score (less readable) than EDMW, because Reddit comments have longer, complete sentences and more sophisticated English vocabulary. However, this may not reflect reality. The much heavier use of in-group lingo and of multiple languages and dialects on EDMW is a considerable barrier to entry. A newcomer might not find EDMW comments 'readable' at all, even though they are generally shorter than Reddit comments.
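
To illustrate why different tools can disagree on informal text, the sketch below counts sentences in one invented comment using two simple heuristics. Neither is claimed to be what Textatistic or any other module actually does; they only show how sensitive the words-per-sentence term is to the counting rule.

```python
import re

# Invented forum-style comment with an ellipsis, '?!' and an emoticon
comment = "lol ok... wait what?! sian :) i go first"

word_count = len(re.findall(r'\w+', comment))

# Heuristic A: every '.', '?' or '!' ends a sentence
count_a = len(re.findall(r'[.?!]', comment))

# Heuristic B: a run of terminal punctuation ends a single sentence
count_b = len(re.findall(r'[.?!]+', comment))

print(word_count, count_a, count_b)
print(word_count / count_a, word_count / count_b)
```

The same eight words give an average sentence length of 1.6 under heuristic A and 4.0 under heuristic B, and the unpunctuated tail is not counted as a sentence by either.
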