Exploring the 2013 BRFSS dataset with R¶

About the BRFSS Survay¶

The Behavioral Risk Factor Surveillance System (BRFSS) is a system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. By collecting behavioral health risk data at the state and local level, BRFSS has become a powerful tool for targeting and building health promotion activities. As a result, BRFSS users have increasingly demanded more data and asked for more questions on the survey. Source

In this data exploration exercise, We'll examin the data reported in the BRFSS of 2013 and look for interesting (maybe even important) things.

In [1]:

load('brfss2013.RDATA')
# realy hard work here to supress annoying messages
suppressWarnings(library(ggplot2))
suppressPackageStartupMessages(suppressWarnings(library(dplyr)))

# setup printing options
options(repr.plot.width = 4)
options(repr.plot.height = 2.5)

Topic 1: Health, alcohol and vegetables¶

How is consumption of alcohol and vegetable consumption related to general health? This is interesting to see if people have awareness of the effect of nutritious habit on health.

Variables: genhlth, _vegesum, _drnkmo4

The variable "genhlth" which is an hierarchical categorical variable which is defined as ones general health. We start by looking at it and plot its distribution:

In [2]:

table(brfss2013$genhlth)

Excellent Very good      Good      Fair      Poor 
    85482    159076    150555     66726     27951

In [3]:

ggplot(brfss2013, aes(genhlth)) + geom_bar() + xlab('Health')

Most people describe their health as being "Very good" or "Good".

We now make a new variable "q1_data" by selecting the variables of interest, filter exceptions and modify the X_vegsum variable remove to implied decimal points:

In [4]:

q1_data = brfss2013 %>%
        select(genhlth, X_vegesum, X_drnkmo4) %>%
        filter(X_vegesum < 1000 & X_drnkmo4 < 100) %>%
        mutate(vegetables = X_vegesum / 100)

Let's look at the data of daily vegetables consumption, using histogram:

In [5]:

ggplot(q1_data, aes(vegetables)) + 
geom_histogram(bins=30) + 
xlab('Vegetables consumed per day')

The distribution is right-skewed normal. The median quantity will provide good measure which is robust with respect to its right tail.

Now will examine the drinking data, using a histogram:

In [6]:

ggplot(q1_data, aes(X_drnkmo4)) + 
geom_histogram(bins=30) + 
xlab('Drinks per month')

This distribution is not approximately normal. However, a lot of people report not drinking alcohol at all - we will use this finding and instead of looking at the mean or median of this non-normal distribution, we will look at the percent of non-drinkers as an indication of drinking habits.

The following table lists the median number of vegetables consumed per-day and the percent of non-drinking population grouped by health groups:

In [7]:

q1_data %>%
    group_by("General Health" = genhlth) %>%
    summarise("vegetables per day" = median(vegetables, na.rm=TRUE),
              "%non-drinkers" = mean(X_drnkmo4==0, na.rm=TRUE))

General Health	vegetables per day	%non-drinkers
Excellent	1.93	0.3997699
Very good	1.72	0.4216613
Good	1.57	0.5479147
Fair	1.43	0.6835390
Poor	1.37	0.7918899
NA	1.50	0.6670609

The table shows that vegetable consumption is correlated with good health: healthier people eat more vegetables and drink more alcoholic beverages. However, we'd expect non-health people to eat more vegetables!

The conclusion is that there is not enough awareness for the important of vegetables consumption as a means of promoting good-health.

Topic 2: Days of poor health and their geographic distibution¶

Loss of workdays due to bad health can have negative impact on the economy. Which states lose the most? Is the number of poor health related on the health insurance coverage of the population? This question can have important consequences on federal funding on health care, as a means of improving the economy.

Variables: poorhlth, X_state, hlthpln1

Let's generate a compact data frame with the variables of interest:

In [8]:

q2_data = brfss2013 %>%
        filter(poorhlth < 30) %>%
        select(as.numeric(poorhlth), X_state, hlthpln1)

Let's start by looking at the number of "poor health" days 30 days prior to survay:

In [9]:

ggplot(q2_data, aes(poorhlth)) + geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Most people report no days of poor-health. However, significant numbers report at least one day of poor health. Let us calculate the percentage of people who report at least one day of poor-health:

In [10]:

1 - mean(q2_data$poorhlth==0)

0.374669671660259

That's 37% of the population. Let us calculate the average number of poor-health days in each US State:

In [11]:

mean_phlth_days = q2_data %>%
    group_by(state = X_state) %>%
    summarise(mean_days = mean(poorhlth, na.rm=TRUE))

Let's look at the distribution of the mean poor-health days in all the states:

In [12]:

ggplot(mean_phlth_days, aes(mean_days)) + geom_histogram(bins=10)

This distribution is roughly normal right skewed with a long tail towards the higher values. Which countries suffer from highest number of poor health? (top-5 only)

In [13]:

mean_phlth_days %>%
    arrange(desc(mean_days)) %>%
    head()

state	mean_days
Alabama	3.537448
Tennessee	3.511111
West Virginia	3.475074
Florida	3.392019
Kentucky	3.357873
Arkansas	3.323927

Since the whole table is too long for efficient presentation, we can also plot the distribution on a map:

In [27]:

states_map = map_data("state")
ggplot(mean_phlth_days, aes(map_id = tolower(state))) +
    geom_map(aes(fill = mean_days), map = states_map) +
    expand_limits(x = states_map$long, y = states_map$lat) +
    scale_fill_gradient(low = "blue", high = "yellow", guide = "colourbar", name = "Days") + 
    ggtitle('Number of poor-health days')

The map shows the south-east region suffers the most from poor health, while center-north the least poor-health days.

It is also important to know if poor health is correlated with lack of health coverage. The "hlthpln1" variable indicates health insurance coverage. Let's calculate the percentage of non-covered in each state:

In [15]:

q2_health_cover = q2_data %>%
    group_by(state = X_state) %>%
    summarise(non_cov = mean(hlthpln1=='No', na.rm=TRUE)
              /mean(hlthpln1=='Yes', na.rm=TRUE))

q2_health_cover %>%
    arrange(desc(non_cov)) %>%
    head()

state	non_cov
Texas	0.2703601
Guam	0.2511278
Georgia	0.2229602
Nevada	0.2073413
Mississippi	0.2035166
North Carolina	0.2011531

In [26]:

states_map = map_data("state")
ggplot(q2_health_cover, aes(map_id = tolower(state))) +
    geom_map(aes(fill = non_cov), map = states_map) +
    expand_limits(x = states_map$long, y = states_map$lat) +
    scale_fill_gradient(low = "blue", high = "yellow", guide = "colourbar", name = "%Non covered") +
    ggtitle('Population with no health insurance coverage')

It can be seen in the table and map that there is no significant overlap/correlation between the states of high number of poor health days and the states of high percent of no health coverage.

Topic 3: Sleep time and emplyment status¶

How employment statuses (employed, retired etc.) is related to sleep time. This is important since insufficient sleep is related to various health harzards, such as car accidents or work injuries.

Variables: sleptim1, employ1

Let us start our exploration by looking at the summary statistics of the 'sleptim1' variable which represent the average hours a person sleeps at night:

In [17]:

summary(brfss2013$sleptim1)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   6.000   7.000   7.052   8.000 450.000    7387

Most people (median) sleep about 7 hours a night, while 95% of people sleep between 6 - 8 hours. Since the median value is close to the mean value, we speculate that the distribution of sleptim1 is approximately normal. Let is look at the histogram:

In [18]:

brfss2013 %>%
    filter(sleptim1 < 30) %>%
    ggplot(aes(sleptim1)) + geom_histogram(bins=25) + 
    labs(x = "Sleep time (hours)")

As expected, the histogram is approximately normal. There are significant outliers that stretch all the way to 15 hours of sleep or more.

From all the data available, we'll look only at the sleep time of people who are either retired, self-employed or work for wage, since these groups are the larger. The homemaker population will not be considered, since the differences might be influenced by gender. A new dataframe is generated, and summery statistics is shown in the table:

In [19]:

q3_data = brfss2013 %>%
    select(sleptim1, employ1) %>%
    mutate(employ1 = as.numeric(employ1)) %>%
    filter(employ1 == 1 | employ1 == 2 | employ1 == 7 & sleptim1 < 30) %>%
    mutate(employ1 = ifelse(employ1 == 1, 'Wage', ifelse(employ1==2, 'Self-employed', 'Retired')))

In [20]:

q3_data %>%
    group_by('Employment status' = employ1) %>%
    summarise('mean sleep time' = mean(sleptim1, na.rm=TRUE),
              'median sleep time' = median(sleptim1, na.rm=TRUE),
              'standard deviation' = sd(sleptim1, na.rm=TRUE))

Employment status	mean sleep time	median sleep time	standard deviation
Retired	7.345644	7	1.468565
Self-employed	7.081790	7	1.260557
Wage	6.890475	7	1.212602

The table shows that retired people sleep more than working people. Self-employed people sleep very slightly more than people who work for wage.

Let us compare the box plot for the "sleptim1" variable for the various groups:

In [21]:

q3_data %>%
    ggplot(aes(employ1, sleptim1)) + 
    geom_boxplot() +
    labs(x = "Employment status", 
         y = "Average sleep time")

Warning message:
"Removed 1609 rows containing non-finite values (stat_boxplot)."

The distribution of the average sleep time is very similar between all employment groups, with the median, mean 1st and 2nd quartile very like each other. We conclude that the self-employed and wage groups sleep approximately the same time, although the common sense is that self-employed people work harder.

In [ ]: