Project - Probability and Data

This is my final project for the course “Introduction to Probability and Data”. It covers the analysis of a survey data set, including a discussion of the data set itself, and then answers three research questions (which I chose myself) through:

  • Exploratory Data Analysis
  • Summary Statistics and Plotting
  • Interpretation of Findings

The Project file starts here…

Exploring the BRFSS data

Part 1: Data

The BRFSS (Behavioral Risk Factor Surveillance System) data is collected as an observational health-related study of the adult (18 years or older) population. It is conducted through phone interviews, and only within the US and US territories. People are randomly selected for the phone interview via random digit dialing.

As the data comes from an observational study (random assignment was not used), we can’t infer causality from the results of an analysis. However, we can investigate associations between variables, keeping in mind that other factors not accounted for might be influencing our observations (correlation does not imply causation). The results of any analysis are not representative of the health-related behaviours of the world population: the sample only includes US states and territories, so findings can only be generalized to the adult population of these states and territories. In addition, when analysing a subset of this data (e.g. one state or a few states together), the results are unlikely to be representative of other states or territories in this data set, as the sample covers a geographically diverse region.

The potential biases in this sample are convenience bias and non-response bias. Only people who could be reached by phone were questioned (convenience bias). Furthermore, only people who were available to go through the questionnaire and were not otherwise busy (e.g. working) could complete the survey (non-response bias).

Overall, a weighting (raking) to known proportions of age, sex, education, marital status, phone ownership etc. was applied to the data to reduce bias in the sample.
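To give an idea of what raking does, here is a minimal sketch of iterative proportional fitting on a made-up sample with two variables. The function name `rake`, the toy rows and the target proportions are all assumptions for illustration; this is not the procedure or the margins used for the BRFSS weights.

```python
def rake(rows, targets, iterations=50):
    """Adjust one weight per row so weighted shares match the target proportions.

    rows:    list of dicts, e.g. {"sex": "M", "age": "young"}
    targets: {variable: {category: target proportion}}, each summing to 1
    """
    weights = [1.0] * len(rows)
    for _ in range(iterations):
        for var, target in targets.items():
            total = sum(weights)
            # Current weighted total of each category of this variable
            current = {}
            for w, row in zip(weights, rows):
                current[row[var]] = current.get(row[var], 0.0) + w
            # Rescale so each category's weighted share equals its target
            weights = [
                w * target[row[var]] * total / current[row[var]]
                for w, row in zip(weights, rows)
            ]
    return weights


def weighted_share(rows, weights, var, category):
    """Weighted proportion of rows whose `var` equals `category`."""
    hit = sum(w for w, row in zip(weights, rows) if row[var] == category)
    return hit / sum(weights)


# Toy sample (invented): young males and old females are over-represented
rows = (
    [{"sex": "M", "age": "young"}] * 3
    + [{"sex": "M", "age": "old"}]
    + [{"sex": "F", "age": "young"}]
    + [{"sex": "F", "age": "old"}] * 3
)
# Hypothetical known population proportions
targets = {"sex": {"M": 0.5, "F": 0.5}, "age": {"young": 0.4, "old": 0.6}}

weights = rake(rows, targets)
print(weighted_share(rows, weights, "sex", "M"))
print(weighted_share(rows, weights, "age", "young"))
```

Each pass matches one variable's margins at a time, and repeating the passes brings all margins close to their targets at once.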


Part 2: Research questions

Research question 1: Do people with different health status sleep the same number of hours per day?

Sleep plays an important role in overall health and well-being. While we sleep, our brain and body can rest and recover. Therefore, getting enough sleep is important for overall mental and physical health. To investigate the importance of sleep, we will look at whether there is a relationship between sleep and general health of individuals in the data set. More specifically, we will investigate the relationship between the time people sleep (variable sleptim1) and general health (variable genhlth).

Research question 2: Are people who sleep a normal or abnormal amount, across good and poor health status, equally represented in income groups?

As we have seen from question 1, health status and sleeping behaviour appear to be associated. It is also commonly reported that people with higher income tend to have better overall health, as they can afford health services more easily. Therefore, I am interested to find out whether people with higher income in turn show more normal sleeping behaviour than people with lower income.

Research question 3: Is the drinking behaviour the same amongst males and females in different income brackets?

In question 2, income did not seem to affect sleeping behaviour amongst people of varying health groups. I am interested to see whether income is associated with other health-related behaviour. We know that heavy alcohol consumption has a detrimental effect on overall health. Therefore, I am interested to find out whether drinking behaviour varies with income. Moreover, I will check whether males and females show the same or different drinking behaviour when compared between income brackets.


Part 3: Exploratory data analysis

Research question 1:

Do people with different health status sleep the same number of hours per day?

To answer this question, we will calculate sleep time summary measures and compare them across people with varying general health status. The health status groups are: Excellent, Very good, Good, Fair and Poor health. The amount people sleep in each category of general health status (variable: genhlth) is shown in the table below. Mean, standard deviation, Median and IQR of sleep time (variable: sleptim1) were calculated.

## # A tibble: 5 x 5
##   genhlth    Mean   Std Median   IQR
##   <fct>     <dbl> <dbl>  <dbl> <dbl>
## 1 Excellent  7.19  1.21      7     2
## 2 Very good  7.10  1.21      7     2
## 3 Good       7.04  1.44      7     2
## 4 Fair       6.90  1.81      7     2
## 5 Poor       6.74  2.39      6     3

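In the above table we can see that sleep time varies with general health status. More specifically, mean sleep time decreases steadily from 7.19 hours for people with excellent health to 6.74 hours for people with poor health. The median is 7 hours in every group except poor health, where it is 6 hours. The spread also grows as health worsens (standard deviation 1.21 for excellent vs. 2.39 for poor health). It appears that general health status and sleep time are associated.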
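The grouped summary measures can be sketched as follows. The original table looks like R tibble output; this is only a small Python illustration of the same calculation. The helper `sleep_summary` and the toy records are made up for the example and are not BRFSS values.

```python
from collections import defaultdict
from statistics import mean, median, quantiles, stdev

# Toy (genhlth, sleptim1) pairs, invented for illustration only
records = [
    ("Excellent", 7), ("Excellent", 8), ("Excellent", 7), ("Excellent", 6),
    ("Poor", 5), ("Poor", 9), ("Poor", 6), ("Poor", 4),
    ("Good", None),  # missing sleep time, dropped below
]


def sleep_summary(rows):
    """Return {health status: summary measures} using non-missing sleep times."""
    groups = defaultdict(list)
    for health, hours in rows:
        if health is not None and hours is not None:
            groups[health].append(hours)

    summary = {}
    for health, hours in groups.items():
        # Quartiles with linear interpolation between data points
        q1, _, q3 = quantiles(hours, n=4, method="inclusive")
        summary[health] = {
            "mean": mean(hours),
            "std": stdev(hours),  # sample standard deviation
            "median": median(hours),
            "iqr": q3 - q1,
        }
    return summary


summary = sleep_summary(records)
for health, stats in summary.items():
    print(health, stats)
```

Rows with a missing health status or sleep time are skipped before summarising, so a group with no usable values simply does not appear in the result.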

Next, we create a bar graph of mean sleep time across the general health status groups to illustrate our finding.

This bar graph shows the mean sleep time (y-axis) for each general health status group (x-axis). We can see a step-wise decrease of sleep time across groups from excellent health towards poor health.

Therefore, the answer to research question 1 is: No, people with different general health status do not sleep the same number of hours per day. The worse the general health status, the less people sleep on average.

Research question 2:

Are people who sleep a normal or abnormal amount, across good and poor health status, equally represented in income groups?

To answer this question, we will calculate the number of people who sleep a normal amount (7-9 hours) vs. an abnormal amount (any other sleep time) for each general health group and compare them within two income brackets: “Above $75 000” and “Below $75 000”.

First we create a categorical variable of normal sleepers and other (using variable sleptim1). Then we create a categorical variable of above and below $75k income (using variable income2). Finally we create a summary table grouping by income, general health and sleep status. Further we calculate the proportion of people with normal sleep in each group.

To begin with, we create two new categorical sleep and income variables.
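The two recoding steps could look roughly like this. It is a sketch, not the original code: the function names are made up, and the label of the top income2 category is an assumption that should be checked against the levels in the data.

```python
# Assumption: this is how the top category of income2 is labelled
TOP_BRACKET = "$75,000 or more"


def sleep_status(hours):
    """Label 7-9 hours of sleep as 'Normal', any other amount as 'Other'."""
    if hours is None:
        return None  # keep missing values missing
    return "Normal" if 7 <= hours <= 9 else "Other"


def income_status(income2):
    """Collapse the income2 categories into two brackets."""
    if income2 is None:
        return None
    return "Above 75k" if income2 == TOP_BRACKET else "Below 75k"


for hours in (5, 7, 9, 12, None):
    print(hours, sleep_status(hours))
```

Missing values are passed through as missing rather than counted as “Other” or “Below 75k”, so they can be excluded later.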

Then we generate a table grouping by income bracket, then health status and calculate the proportion of normal sleepers per health status group.

## # A tibble: 10 x 3
## # Groups:   incomestatus [2]
##    incomestatus genhlth   Normal_sleep
##    <fct>        <fct>            <dbl>
##  1 Above 75k    Excellent        0.250
##  2 Above 75k    Very good        0.238
##  3 Above 75k    Good             0.220
##  4 Above 75k    Fair             0.184
##  5 Above 75k    Poor             0.142
##  6 Below 75k    Excellent        0.231
##  7 Below 75k    Very good        0.227
##  8 Below 75k    Good             0.202
##  9 Below 75k    Fair             0.165
## 10 Below 75k    Poor             0.122

In the above table we can see that the proportion of normal sleepers varies between health status groups within each income group. In both income brackets, about a quarter of the people with excellent health are normal sleepers (Above 75k: 25%; Below 75k: 23%). Moreover, the worse the health status of a group, the fewer people get a normal amount of sleep. This holds for both income groups. However, when comparing the same health status group, the proportion of normal sleepers is similar in the two income brackets (e.g. Excellent health: 25% above 75k vs. 23% below 75k; Poor health: 14% above 75k vs. 12% below 75k).
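A grouped proportion of this kind can be sketched as below. The sketch assumes the share is taken within each income and health group; the denominator used for the original table is not shown, so the original numbers may have been computed differently. The function `normal_share` and the toy rows are invented for illustration.

```python
from collections import Counter

# Toy (incomestatus, genhlth, sleep status) rows, invented for illustration
rows = [
    ("Above 75k", "Excellent", "Normal"),
    ("Above 75k", "Excellent", "Other"),
    ("Above 75k", "Excellent", "Normal"),
    ("Above 75k", "Excellent", "Normal"),
    ("Below 75k", "Poor", "Other"),
    ("Below 75k", "Poor", "Normal"),
    ("Below 75k", "Poor", None),  # missing sleep status, skipped
]


def normal_share(rows):
    """Share of 'Normal' sleepers within each (income, health) group."""
    totals = Counter()
    normal = Counter()
    for income, health, sleep in rows:
        if income is None or health is None or sleep is None:
            continue
        totals[(income, health)] += 1
        if sleep == "Normal":
            normal[(income, health)] += 1
    return {group: normal[group] / totals[group] for group in totals}


shares = normal_share(rows)
for group, share in shares.items():
    print(group, share)
```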

Next, we create a bar graph to illustrate the findings.

This bar graph shows the general health status (x-axis) and the proportion of normal and other sleepers (y-axis). One panel is shown for respondents with an income above $75 000 (left) and one for those below $75 000 (right). We can see that in each health group there are fewer normal sleepers than people who sleep either less or more. While the proportion of normal sleepers decreases stepwise from excellent towards poor health status, there appears to be no difference in the share of normal/other sleepers between income brackets.

Therefore, the answer to research question 2 is: Yes, the proportion of normal sleepers within each general health status group is similar in both income brackets, so people with normal/abnormal sleep appear to be about equally represented in income groups.

Research question 3:

Is the drinking behaviour the same amongst males and females in different income brackets?

To answer this question, we will calculate the number of people who show moderate drinking (0-2 standard drinks/day) vs. above moderate drinking (more than 2 standard drinks/day) behaviour for males and females and compare them within two income brackets: “Above $75 000” and “Below $75 000”.

To begin with, we create a new categorical alcohol consumption variable.
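As a rough sketch, the recoding could look like this. The function name and the argument `drinks_per_day` are hypothetical; the name of the alcohol variable used in the original analysis is not given here.

```python
def drinking_status(drinks_per_day):
    """Label 0-2 standard drinks/day as 'Moderate', more than 2 as 'Above Moderate'."""
    if drinks_per_day is None:
        return None  # keep missing values missing
    return "Moderate" if drinks_per_day <= 2 else "Above Moderate"


for drinks in (0, 2, 3, None):
    print(drinks, drinking_status(drinks))
```

Exactly 2 drinks per day counts as moderate, matching the 0-2 range defined above.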

Then we generate a table grouping by income bracket, then sex, and calculate the proportion of moderate drinkers per group.

## # A tibble: 4 x 3
## # Groups:   incomestatus [2]
##   incomestatus sex    Moderate_Alcohol
##   <fct>        <fct>             <dbl>
## 1 Above 75k    Male              0.722
## 2 Above 75k    Female            0.881
## 3 Below 75k    Male              0.621
## 4 Below 75k    Female            0.817

In the above table we can see that the percentage of moderate drinkers is higher among females (~88% and 82%) than among males (72% and 62%) in both income groups. Furthermore, the proportion of moderate drinkers is higher in the “Above 75k” than in the “Below 75k” income bracket. It seems that people with lower income, irrespective of sex, are more likely to drink more than 2 standard drinks per day than people with higher income, although moderate drinkers remain the majority in every group.

Next, we create a bar graph to illustrate the findings.

This bar graph shows sex (x-axis) and the proportion of “Moderate” and “Above Moderate” alcohol consumption (y-axis). One panel is shown for respondents with an income above $75 000 (left) and one for those below $75 000 (right). We can see that there are overall more moderate than above moderate drinkers. Irrespective of income bracket, males show a higher percentage of above moderate drinking behaviour than females. Between income brackets, people with lower income show a higher percentage of above moderate drinking behaviour than people with higher income.

Therefore, the answer to research question 3 is: No, the drinking behaviour amongst males and females is not the same between income brackets. For both sexes, the share of moderate drinkers is larger in the higher income bracket, and in both brackets a larger share of females than males drink moderately.

Written on January 19, 2020