The Behavioral Risk Factor Surveillance System (BRFSS) is a system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. By collecting behavioral health risk data at the state and local level, BRFSS has become a powerful tool for targeting and building health promotion activities. As a result, BRFSS users have increasingly demanded more data and asked for more questions on the survey. Source
In this data exploration exercise, We'll examin the data reported in the BRFSS of 2013 and look for interesting (maybe even important) things.
load('brfss2013.RDATA')
# realy hard work here to supress annoying messages
suppressWarnings(library(ggplot2))
suppressPackageStartupMessages(suppressWarnings(library(dplyr)))
# setup printing options
options(repr.plot.width = 4)
options(repr.plot.height = 2.5)
How is consumption of alcohol and vegetable consumption related to general health? This is interesting to see if people have awareness of the effect of nutritious habit on health.
Variables: genhlth, _vegesum, _drnkmo4
The variable "genhlth" which is an hierarchical categorical variable which is defined as ones general health. We start by looking at it and plot its distribution:
table(brfss2013$genhlth)
Excellent Very good Good Fair Poor 85482 159076 150555 66726 27951
ggplot(brfss2013, aes(genhlth)) + geom_bar() + xlab('Health')
Most people describe their health as being "Very good" or "Good".
We now make a new variable "q1_data" by selecting the variables of interest, filter exceptions and modify the X_vegsum variable remove to implied decimal points:
q1_data = brfss2013 %>%
select(genhlth, X_vegesum, X_drnkmo4) %>%
filter(X_vegesum < 1000 & X_drnkmo4 < 100) %>%
mutate(vegetables = X_vegesum / 100)
Let's look at the data of daily vegetables consumption, using histogram:
ggplot(q1_data, aes(vegetables)) +
geom_histogram(bins=30) +
xlab('Vegetables consumed per day')
The distribution is right-skewed normal. The median quantity will provide good measure which is robust with respect to its right tail.
Now will examine the drinking data, using a histogram:
ggplot(q1_data, aes(X_drnkmo4)) +
geom_histogram(bins=30) +
xlab('Drinks per month')
This distribution is not approximately normal. However, a lot of people report not drinking alcohol at all - we will use this finding and instead of looking at the mean or median of this non-normal distribution, we will look at the percent of non-drinkers as an indication of drinking habits.
The following table lists the median number of vegetables consumed per-day and the percent of non-drinking population grouped by health groups:
q1_data %>%
group_by("General Health" = genhlth) %>%
summarise("vegetables per day" = median(vegetables, na.rm=TRUE),
"%non-drinkers" = mean(X_drnkmo4==0, na.rm=TRUE))
General Health | vegetables per day | %non-drinkers |
---|---|---|
Excellent | 1.93 | 0.3997699 |
Very good | 1.72 | 0.4216613 |
Good | 1.57 | 0.5479147 |
Fair | 1.43 | 0.6835390 |
Poor | 1.37 | 0.7918899 |
NA | 1.50 | 0.6670609 |
The table shows that vegetable consumption is correlated with good health: healthier people eat more vegetables and drink more alcoholic beverages. However, we'd expect non-health people to eat more vegetables!
The conclusion is that there is not enough awareness for the important of vegetables consumption as a means of promoting good-health.
Loss of workdays due to bad health can have negative impact on the economy. Which states lose the most? Is the number of poor health related on the health insurance coverage of the population? This question can have important consequences on federal funding on health care, as a means of improving the economy.
Variables: poorhlth, X_state, hlthpln1
Let's generate a compact data frame with the variables of interest:
q2_data = brfss2013 %>%
filter(poorhlth < 30) %>%
select(as.numeric(poorhlth), X_state, hlthpln1)
Let's start by looking at the number of "poor health" days 30 days prior to survay:
ggplot(q2_data, aes(poorhlth)) + geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Most people report no days of poor-health. However, significant numbers report at least one day of poor health. Let us calculate the percentage of people who report at least one day of poor-health:
1 - mean(q2_data$poorhlth==0)
That's 37% of the population. Let us calculate the average number of poor-health days in each US State:
mean_phlth_days = q2_data %>%
group_by(state = X_state) %>%
summarise(mean_days = mean(poorhlth, na.rm=TRUE))
Let's look at the distribution of the mean poor-health days in all the states:
ggplot(mean_phlth_days, aes(mean_days)) + geom_histogram(bins=10)
This distribution is roughly normal right skewed with a long tail towards the higher values. Which countries suffer from highest number of poor health? (top-5 only)
mean_phlth_days %>%
arrange(desc(mean_days)) %>%
head()
state | mean_days |
---|---|
Alabama | 3.537448 |
Tennessee | 3.511111 |
West Virginia | 3.475074 |
Florida | 3.392019 |
Kentucky | 3.357873 |
Arkansas | 3.323927 |
Since the whole table is too long for efficient presentation, we can also plot the distribution on a map:
states_map = map_data("state")
ggplot(mean_phlth_days, aes(map_id = tolower(state))) +
geom_map(aes(fill = mean_days), map = states_map) +
expand_limits(x = states_map$long, y = states_map$lat) +
scale_fill_gradient(low = "blue", high = "yellow", guide = "colourbar", name = "Days") +
ggtitle('Number of poor-health days')
The map shows the south-east region suffers the most from poor health, while center-north the least poor-health days.
It is also important to know if poor health is correlated with lack of health coverage. The "hlthpln1" variable indicates health insurance coverage. Let's calculate the percentage of non-covered in each state:
q2_health_cover = q2_data %>%
group_by(state = X_state) %>%
summarise(non_cov = mean(hlthpln1=='No', na.rm=TRUE)
/mean(hlthpln1=='Yes', na.rm=TRUE))
q2_health_cover %>%
arrange(desc(non_cov)) %>%
head()
state | non_cov |
---|---|
Texas | 0.2703601 |
Guam | 0.2511278 |
Georgia | 0.2229602 |
Nevada | 0.2073413 |
Mississippi | 0.2035166 |
North Carolina | 0.2011531 |
states_map = map_data("state")
ggplot(q2_health_cover, aes(map_id = tolower(state))) +
geom_map(aes(fill = non_cov), map = states_map) +
expand_limits(x = states_map$long, y = states_map$lat) +
scale_fill_gradient(low = "blue", high = "yellow", guide = "colourbar", name = "%Non covered") +
ggtitle('Population with no health insurance coverage')
It can be seen in the table and map that there is no significant overlap/correlation between the states of high number of poor health days and the states of high percent of no health coverage.
How employment statuses (employed, retired etc.) is related to sleep time. This is important since insufficient sleep is related to various health harzards, such as car accidents or work injuries.
Variables: sleptim1, employ1
Let us start our exploration by looking at the summary statistics of the 'sleptim1' variable which represent the average hours a person sleeps at night:
summary(brfss2013$sleptim1)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's 0.000 6.000 7.000 7.052 8.000 450.000 7387
Most people (median) sleep about 7 hours a night, while 95% of people sleep between 6 - 8 hours. Since the median value is close to the mean value, we speculate that the distribution of sleptim1 is approximately normal. Let is look at the histogram:
brfss2013 %>%
filter(sleptim1 < 30) %>%
ggplot(aes(sleptim1)) + geom_histogram(bins=25) +
labs(x = "Sleep time (hours)")
As expected, the histogram is approximately normal. There are significant outliers that stretch all the way to 15 hours of sleep or more.
From all the data available, we'll look only at the sleep time of people who are either retired, self-employed or work for wage, since these groups are the larger. The homemaker population will not be considered, since the differences might be influenced by gender. A new dataframe is generated, and summery statistics is shown in the table:
q3_data = brfss2013 %>%
select(sleptim1, employ1) %>%
mutate(employ1 = as.numeric(employ1)) %>%
filter(employ1 == 1 | employ1 == 2 | employ1 == 7 & sleptim1 < 30) %>%
mutate(employ1 = ifelse(employ1 == 1, 'Wage', ifelse(employ1==2, 'Self-employed', 'Retired')))
q3_data %>%
group_by('Employment status' = employ1) %>%
summarise('mean sleep time' = mean(sleptim1, na.rm=TRUE),
'median sleep time' = median(sleptim1, na.rm=TRUE),
'standard deviation' = sd(sleptim1, na.rm=TRUE))
Employment status | mean sleep time | median sleep time | standard deviation |
---|---|---|---|
Retired | 7.345644 | 7 | 1.468565 |
Self-employed | 7.081790 | 7 | 1.260557 |
Wage | 6.890475 | 7 | 1.212602 |
The table shows that retired people sleep more than working people. Self-employed people sleep very slightly more than people who work for wage.
Let us compare the box plot for the "sleptim1" variable for the various groups:
q3_data %>%
ggplot(aes(employ1, sleptim1)) +
geom_boxplot() +
labs(x = "Employment status",
y = "Average sleep time")
Warning message: "Removed 1609 rows containing non-finite values (stat_boxplot)."
The distribution of the average sleep time is very similar between all employment groups, with the median, mean 1st and 2nd quartile very like each other. We conclude that the self-employed and wage groups sleep approximately the same time, although the common sense is that self-employed people work harder.