Project - Inferential Statistics

This is my final project for the course “Inferential Statistics”. It covers the analysis of a survey data set in R, and the project report was created using R Markdown. This project covers:

  • Discussion on Data Set
  • Exploratory Data Analysis
  • Summary Statistics and Plotting
  • Inference via Hypothesis testing
  • Interpretation of Findings

The Project file starts here…

Statistical inference with the GSS data

Part 1: Data

General Social Survey (GSS) data properties

The GSS is an ongoing survey collecting data on social characteristics and attitudes of people living in the United States. Data have been collected since 1972, first annually and then, from 1994 onwards, only in even-numbered years. No data were collected in certain years due to lack of funding. In general, samples of ~1500 people were collected once a year and, from 1994 onwards, twice per survey year.

Data collection

Random sampling was used to select households, and one adult per household was interviewed. Most of the GSS data was collected via personal interview. Computer-assisted interviewing has been used since 2002 and, when necessary, interviews were conducted by telephone. Since 1988 many core survey questions have been asked of a random two thirds of the sample. From 1972 to 2004 the survey was restricted to English speakers; after that, the target population was English or Spanish speakers.

Implications for inference/generalizability

The GSS collects data on adult (18+) US residents, so inference cannot be generalized to the world population; however, we can use the GSS data to compare the US to other nations. The GSS is an observational study, and hence any findings of relationships between variables should be reported as associations and not interpreted as causal relationships.

Depending on the specific sampling year, data collection was potentially biased. Nonresponse bias: In early years (1972-2002) data collection was affected by nonresponse bias, meaning that certain subpopulations potentially did not respond and were therefore not included in the survey. Oversampling: In some years specific subpopulations (e.g. Black respondents) were overrepresented in the sample. Undercoverage: Only one person per household is interviewed, and therefore people living in large households are underrepresented in samples. Additionally, before 1975 households did not have equal probability of being included in the sample.

To correct for these biases, weights provided by GSS can be used to adjust response data.


Part 2: Research question

Is confidence in TV the same in 2012 as it was in 1983?

In recent years, the media has been increasingly criticized for biased reporting of news. Information sources such as television, the press and the internet are questioned for their legitimacy. For example, the current US president Donald Trump, who popularized the well-known phrase “fake news” by accusing TV networks of reporting false stories, has openly condemned specific TV outlets. As information platforms such as TV are seemingly losing their credibility in the public eye, I’m interested to find out what US residents think of TV as an institution and whether confidence in TV is the same in 2012 as it was almost 30 years earlier (1983).

The GSS survey asks the following question regarding confidence in TV:

As far as the people running TV are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them?

We will use the categorical data from this survey question asked in the years 2012 and 1983 to answer our research question.


Part 3: Exploratory data analysis

According to the GSS codebook, the variable corresponding to the question on confidence in TV is called “contv”.

##  Factor w/ 3 levels "A Great Deal",..: NA NA NA NA NA NA NA NA NA NA ...

As we can see, it is a variable of type Factor with 3 levels corresponding to the responses “A Great Deal”, “Only Some” and “Hardly Any”. There are also NA values, which we will remove to simplify the analysis.

First, we create a subset of data only including years 2012 and 1983 and we remove NA values from the contv variable.
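The report's own code for this step is not shown, so the following is only a minimal base R sketch of the idea. The data frame `toy_gss` is a made-up stand-in for the GSS data (the real data set has far more rows and columns), and the names `keep`, `tv` and `counts` are chosen here for illustration.

```r
# Hypothetical toy stand-in for the GSS data frame (assumption, not the real data).
toy_gss <- data.frame(
  year  = c(1983, 1983, 1990, 2012, 2012, 2012),
  contv = factor(
    c("Only Some", NA, "Hardly Any", "A Great Deal", "Hardly Any", NA),
    levels = c("A Great Deal", "Only Some", "Hardly Any")
  )
)

# Keep only the two survey years of interest and drop missing responses.
keep <- toy_gss$year %in% c(1983, 2012) & !is.na(toy_gss$contv)
tv <- toy_gss[keep, ]

# Cross-tabulate year against response to get the number of answers per category.
counts <- table(tv$year, tv$contv)
print(counts)
```

Filtering on both conditions in one logical vector keeps the subsetting and the NA removal in a single, easily checked step.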

Next, we look at the responses to the question about confidence in TV.

## # A tibble: 6 x 3
## # Groups:   year [2]
##   year  contv        count
##   <fct> <fct>        <int>
## 1 1983  A Great Deal   199
## 2 1983  Only Some      918
## 3 1983  Hardly Any     449
## 4 2012  A Great Deal   139
## 5 2012  Only Some      630
## 6 2012  Hardly Any     540

In the above table we can see the sample year in the first column, the response in the second column and the number of responses in the third column. In both years, the least frequent answer is having “A Great Deal” of confidence in TV. “Hardly Any” confidence shows a broadly similar number of responses with 449 (1983) and 540 (2012). In both years, most people answer that they have “Only Some” confidence in TV, and for this answer we can see a large difference in the number of responses between the years: 918 (1983) vs. 630 (2012).

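Because the two samples differ in size (1566 respondents in 1983 vs. 1309 in 2012), raw counts are not directly comparable. To get a better understanding of the relative number of people in each answer category, we calculate the proportion of answers within each year.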

## # A tibble: 2 x 4
##   year  A_great_deal Only_some Hardly_any
##   <fct>        <dbl>     <dbl>      <dbl>
## 1 1983         0.127     0.586      0.287
## 2 2012         0.106     0.481      0.413

We can see in the above table that, compared to 1983, in 2012 fewer people have “A Great Deal” (1983: 12.7%, 2012: 10.6%) and “Only Some” (1983: 58.6%, 2012: 48.1%) confidence in TV. On the other hand, in 2012 a larger proportion of people have “Hardly Any” confidence in TV (1983: 28.7%, 2012: 41.3%).
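As a rough check, the proportions can be recomputed from the counts reported in the table of responses. This is a base R sketch rather than the report's original code; the object names `counts` and `props` are assumptions made for this example.

```r
# Counts copied from the table of responses (rows: year, columns: answer).
counts <- matrix(
  c(199, 918, 449,
    139, 630, 540),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    year  = c("1983", "2012"),
    contv = c("A Great Deal", "Only Some", "Hardly Any")
  )
)

# Row-wise proportions: within each year the three answers sum to 1.
props <- prop.table(counts, margin = 1)
print(round(props, 3))
```

Using `margin = 1` normalizes within each year, which is what makes the two samples comparable despite their different sizes.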

To illustrate this, we create a bar graph comparing the proportion of responses between the years 1983 and 2012.

The above bar graph shows the sample year on the x-axis and the proportion of answers on the y-axis. As can be seen, the proportion of people saying they have “Hardly Any” confidence in TV is higher in 2012 than in 1983, whereas the proportion of people who report having “Only Some” or “A Great Deal” of confidence is lower.
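The plotting code is not part of the rendered report, so the layout below (grouped bars drawn with base R `barplot()`) is only one plausible way to draw such a graph and may differ from the original figure. The proportions are the rounded values from the table; `props` and `bar_mid` are names chosen for this sketch.

```r
# Rounded proportions taken from the proportion table (rows: year).
props <- matrix(
  c(0.127, 0.586, 0.287,
    0.106, 0.481, 0.413),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("1983", "2012"),
    c("A Great Deal", "Only Some", "Hardly Any")
  )
)

# barplot() draws one group of bars per column of its input,
# so the matrix is transposed to get one group per year.
bar_mid <- barplot(
  t(props),
  beside = TRUE,
  xlab = "Year",
  ylab = "Proportion of responses",
  ylim = c(0, 0.7),
  legend.text = TRUE,
  args.legend = list(x = "topright", title = "Confidence in TV")
)
```

When run non-interactively, R writes the figure to its default graphics device; `bar_mid` holds the bar midpoints in case labels need to be added.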


Part 4: Inference

To test whether the distribution of answers on confidence in TV differs between the years 1983 and 2012, we will use a Chi-square test of independence, as we are dealing with two categorical variables (year and confidence in TV), one of which (confidence in TV) has more than 2 levels.

The Chi-square conditions for inference are:

  • Independence: random sampling/assignment; if sampling without replacement, n < 10% of the population; each case contributes to only one cell in the table
  • Sample size: each cell has at least 5 expected cases

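The GSS used random sampling and sampled less than 10% of the US population. Each case contributes to only one answer. The sample size condition refers to expected rather than observed counts: with every observed cell count in the above tables being at least 139, the smallest expected count under independence is about 154, far above 5. Since the conditions for inference are all met, we can go ahead and perform the Chi-square test.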
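The sample size condition can be checked directly by computing the expected counts under independence (row total times column total, divided by the grand total). The sketch below uses the counts from the exploratory tables; `observed` and `expected` are names introduced here for illustration.

```r
# Observed counts from the exploratory tables (rows: year, columns: answer).
observed <- matrix(
  c(139, 630, 540,
    199, 918, 449),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("2012", "1983"),
    c("A Great Deal", "Only Some", "Hardly Any")
  )
)

# Expected count per cell under independence:
# (row total * column total) / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
print(round(expected, 1))

# The condition holds if every expected count is at least 5.
print(all(expected >= 5))
```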

##      A Great Deal Only Some Hardly Any
## 2012          139       630        540
## 1983          199       918        449

Next, we can use the matrix created above as input for the Chi-square function.

Our hypotheses are:

H0: There is no difference in the distributions of answers on confidence in TV between 1983 and 2012

HA: There is a difference in the distributions of answers on confidence in TV between 1983 and 2012

## 
##  Pearson's Chi-squared test
## 
## data:  contv
## X-squared = 50.032, df = 2, p-value = 1.367e-11

As can be seen above, a Chi-square statistic of ~50 at df = 2 results in a very small p-value (1.367e-11), far below the usual 5% significance level. Hence, we reject the null hypothesis in favour of the alternative and conclude that the distribution of answers given by US residents on confidence in TV differs between the years 1983 and 2012.
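For readers who want to reproduce the result, the following sketch rebuilds the matrix from the printed counts and passes it to base R's `chisq.test()`. It is assumed, not confirmed by the report, that the original analysis called the function in this way; the object names `contv` and `result` are chosen to mirror the printed output.

```r
# Contingency table as printed in the report (rows: year, columns: answer).
contv <- matrix(
  c(139, 630, 540,
    199, 918, 449),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("2012", "1983"),
    c("A Great Deal", "Only Some", "Hardly Any")
  )
)

# Pearson's Chi-squared test of independence between year and answer.
result <- chisq.test(contv)
print(result)

# Degrees of freedom: (rows - 1) * (columns - 1).
df_manual <- (nrow(contv) - 1) * (ncol(contv) - 1)

# For df = 2 the upper-tail p-value has the closed form exp(-X^2 / 2),
# which gives a quick sanity check on the reported p-value.
p_manual <- exp(-unname(result$statistic) / 2)
print(p_manual)
```

The statistic, degrees of freedom and p-value can then be compared with the values reported in the output above.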

Written on January 22, 2020