Project - Inferential Statistics
This is my final project for the course “Inferential Statistics”. It analyzes a survey data set in R, and the project report was created using R Markdown. This project covers:
- Discussion on Data Set
- Exploratory Data Analysis
- Summary Statistics and Plotting
- Inference via Hypothesis testing
- Interpretation of Findings
The Project file starts here…
Statistical inference with the GSS data
Part 1: Data
Part 2: Research question
Is confidence in TV the same in 2012 as it was in 1983?
In recent years, the media has been increasingly criticized for biased news reporting. Information sources such as television, the press and the internet are having their legitimacy questioned. For example, the current US president Donald Trump, who popularized the phrase “fake news” by accusing TV networks of reporting false stories, has openly condemned specific TV outlets. As information platforms such as TV seemingly lose credibility in the public eye, I’m interested to find out what US residents think of TV as an institution and whether confidence in TV in 2012 is the same as it was almost 30 years earlier (1983).
The GSS survey asks the following question regarding confidence in TV:
As far as the people running TV are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them?
We will use the categorical data from this survey question asked in the years 2012 and 1983 to answer our research question.
Part 3: Exploratory data analysis
According to the GSS codebook, the variable corresponding to the question on confidence in TV is called “contv”.
str(gss$contv)
## Factor w/ 3 levels "A Great Deal",..: NA NA NA NA NA NA NA NA NA NA ...
As we can see, it is a variable of type Factor with 3 levels corresponding to the responses “A Great Deal”, “Only Some” and “Hardly Any”. There are also NA values, which we will remove to simplify the analysis.
First, we create a subset of data only including years 2012 and 1983 and we remove NA values from the contv variable.
# Create dataset through filtering by year and exclude NA values
gss_83_12_contv <- gss %>% filter(year == 2012 | year == 1983, !is.na(contv))
# convert year to a factor for use in plotting
gss_83_12_contv$year <- as.factor(gss_83_12_contv$year)
Next, we look at the responses to the question about confidence in TV.
# Using the newly created dataset we look at number of responses for each answer
gss_83_12_contv %>%
  group_by(year, contv) %>%
  summarise(count = n())
## # A tibble: 6 x 3
## # Groups: year [2]
## year contv count
## <fct> <fct> <int>
## 1 1983 A Great Deal 199
## 2 1983 Only Some 918
## 3 1983 Hardly Any 449
## 4 2012 A Great Deal 139
## 5 2012 Only Some 630
## 6 2012 Hardly Any 540
In the above table we can see the sample year in the first column, the response in the second column and the number of responses in the third column. In both years, “A Great Deal” of confidence in TV is the answer with the fewest responses, and “Only Some” is the most common answer. The count for “Only Some” differs considerably between the years, 918 (1983) vs. 630 (2012), while “Hardly Any” received 449 (1983) and 540 (2012) responses. Because the two samples differ in size (1,566 respondents in 1983 vs. 1,309 in 2012), these raw counts are not directly comparable.
To get a better understanding of the relative share of respondents in each answer category, we calculate the proportion of answers.
# Group data by year and summarise the proportion of the total for each answer
gss_83_12_contv %>%
  group_by(year) %>%
  summarise(A_great_deal = sum(contv == "A Great Deal") / n(),
            Only_some = sum(contv == "Only Some") / n(),
            Hardly_any = sum(contv == "Hardly Any") / n())
## # A tibble: 2 x 4
## year A_great_deal Only_some Hardly_any
## <fct> <dbl> <dbl> <dbl>
## 1 1983 0.127 0.586 0.287
## 2 2012 0.106 0.481 0.413
We can see in the above table that, compared to 1983, a smaller proportion of people in 2012 have “A Great Deal” (1983: 12.7%, 2012: 10.6%) or “Only Some” (1983: 58.6%, 2012: 48.1%) confidence in TV. On the other hand, a larger proportion of people in 2012 have “Hardly Any” confidence in TV (1983: 28.7%, 2012: 41.3%).
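As a cross-check on these proportions, here is a minimal base R sketch that starts from the counts in the summary table rather than from the raw survey data. The object names `counts` and `props` are illustrative and not part of the original report.

```r
# Counts copied from the summary table above (rows: year, columns: response).
# `counts` and `props` are hypothetical names used only for this sketch.
counts <- matrix(
  c(199, 918, 449,
    139, 630, 540),
  nrow = 2, byrow = TRUE,
  dimnames = list(year  = c("1983", "2012"),
                  contv = c("A Great Deal", "Only Some", "Hardly Any"))
)

# margin = 1 divides every cell by its row (year) total,
# so each row of `props` sums to 1
props <- prop.table(counts, margin = 1)
round(props, 3)
```

Working from the count matrix keeps the check independent of dplyr, which makes it easy to rerun if only the published table is at hand.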
To illustrate this, we create a bar graph comparing the proportion of responses between the years 1983 and 2012.
# Create a bar graph with year on the x-axis and fill with the contv variable
p1 <- ggplot(data = gss_83_12_contv, mapping = aes(x = year, fill = contv)) + geom_bar(position = "fill")
p1 + labs(title = "Confidence in TV 1983 vs. 2012", x = "Year", y = "Proportion", fill = "Response")
The above bar graph shows the sample year on the x-axis and the proportion of answers on the y-axis. As can be seen, the proportion of people saying they have “Hardly Any” confidence in TV is higher in 2012 than in 1983, whereas the proportions of people saying they have “Only Some” or “A Great Deal” of confidence are lower.
Part 4: Inference
To test whether the distribution of answers on confidence in TV differs between the years 1983 and 2012, we will use a Chi-square test of independence, as we are dealing with two categorical variables (year and confidence in TV), one of which (confidence in TV) has more than 2 levels.
The Chi-square conditions for inference are:
- Independence: random sampling/assignment; if sampling without replacement, n < 10% of the population; each case contributes to only one cell in the table
- Sample size: each cell has at least 5 expected cases
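GSS used random sampling and sampled less than 10% of the US population. Each case contributes to only one answer. The sample size condition refers to expected rather than observed counts; with the smallest observed count in the above tables being 139, the expected count in every cell is far above 5 (the smallest is roughly 154). Since the conditions for inference are all met, we can go ahead and perform the Chi-square test.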
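The sample size condition can be checked explicitly. The sketch below computes the expected count for each cell under independence (row total times column total, divided by the grand total) from the observed counts reported earlier. The names `observed` and `expected` are assumptions made for this illustration.

```r
# Observed counts from the tables above; `observed` and `expected` are
# hypothetical names introduced for this sketch
observed <- matrix(
  c(139, 630, 540,
    199, 918, 449),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("2012", "1983"),
                  c("A Great Deal", "Only Some", "Hardly Any"))
)

# Expected count per cell under independence:
# (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 1)

# TRUE when every cell has at least 5 expected cases
all(expected >= 5)
```

The same values are available as `chisq.test(observed)$expected`, so this is mainly useful for seeing where they come from.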
# To use the Chi-square function in R we need to create a matrix with the responses for each year
# First we create individual data sets for each sample year
gss_12_contv <- gss_83_12_contv %>% filter(year == "2012")
gss_83_contv <- gss_83_12_contv %>% filter(year == "1983")
# Then we calculate the summary for variable contv for each year and use rbind to create a matrix
contv <- rbind(summary(gss_12_contv$contv), summary(gss_83_contv$contv))
# Set the rownames to the appropriate year
rownames(contv) <- c("2012", "1983")
# Display the created matrix to check it is correct
contv
## A Great Deal Only Some Hardly Any
## 2012 139 630 540
## 1983 199 918 449
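As an alternative sketch, base R's `table()` can cross-tabulate the two factors in one step instead of filtering by year and binding the summaries. So that the example runs on its own, it rebuilds a toy data frame from the published counts; `toy`, `levels_contv` and `contv_table` are hypothetical names, and `toy` merely stands in for `gss_83_12_contv`.

```r
# Toy stand-in for gss_83_12_contv, rebuilt from the published counts
# (an assumption for illustration; the real data has many more columns)
levels_contv <- c("A Great Deal", "Only Some", "Hardly Any")
toy <- data.frame(
  year  = factor(rep(c("2012", "1983"), times = c(1309, 1566)),
                 levels = c("2012", "1983")),
  contv = factor(c(rep(levels_contv, times = c(139, 630, 540)),
                   rep(levels_contv, times = c(199, 918, 449))),
                 levels = levels_contv)
)

# Cross-tabulate year (rows) against response (columns)
contv_table <- table(toy$year, toy$contv)
contv_table
```

On the real data the rows would follow the factor levels of `year`, so 1983 would probably come first; the row order does not affect the Chi-square test.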
Next, we can use the above created matrix as an input for the Chi-square function.
Our hypotheses are:
H0: There is no difference in the distributions of answers on confidence in TV between 1983 and 2012
HA: There is a difference in the distributions of answers on confidence in TV between 1983 and 2012
# Run the Chi-square test of independence on the matrix created above
chisq.test(contv)
##
## Pearson's Chi-squared test
##
## data: contv
## X-squared = 50.032, df = 2, p-value = 1.367e-11
As can be seen above, a Chi-square statistic of ~50 at df = 2 results in a very small p-value (1.367e-11). Hence, we reject the null hypothesis in favour of the alternative and conclude that the distribution of answers on confidence in TV among US residents differs between the years 1983 and 2012.
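To make the reported statistic less of a black box, the following sketch recomputes it by hand from the observed counts: it sums (observed - expected)^2 / expected over all cells and looks up the upper tail of the Chi-square distribution with (rows - 1) x (columns - 1) degrees of freedom. The variable names are illustrative, and the sketch assumes no continuity correction, which matches the default behaviour of `chisq.test()` for tables larger than 2 x 2.

```r
# Observed counts, as in the matrix passed to the Chi-square test
observed <- matrix(
  c(139, 630, 540,
    199, 918, 449),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("2012", "1983"),
                  c("A Great Deal", "Only Some", "Hardly Any"))
)

# Expected counts under the null hypothesis of independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson's Chi-square statistic: sum over cells of (O - E)^2 / E
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# p-value: probability of a statistic at least this large under H0
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

c(statistic = chi_sq, df = df, p_value = p_value)
```

The resulting values should agree with the test output shown above, up to rounding.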