R Markdown
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(scales)
data("HairEyeColor")
The Hair x Eye table comes from a survey of students at the University of Delaware reported by Snee (1974). The split by Sex was added by Friendly (1992a) for didactic purposes.
This data set is useful for illustrating various techniques for the analysis of contingency tables, such as the standard chi-squared test or, more generally, log-linear modelling, and graphical methods such as mosaic plots, sieve diagrams or association plots.
Source http://euclid.psych.yorku.ca/ftp/sas/vcd/catdata/haireye.sas
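As a minimal sketch of the chi-squared test mentioned above, the three-way table can be collapsed over Sex and the resulting Hair by Eye table tested for independence. The object names `hair_eye` and `hair_eye_test` are illustrative and are not part of the original analysis.

```r
data("HairEyeColor")

# Collapse the 3-way table over Sex to get the 4 x 4 Hair-by-Eye table
hair_eye <- margin.table(HairEyeColor, c(1, 2))

# Pearson chi-squared test of independence between hair and eye color
hair_eye_test <- chisq.test(hair_eye)
hair_eye_test

# Expected counts under independence, handy for checking the approximation
round(hair_eye_test$expected, 1)
```

A small p-value here would indicate that hair color and eye color are associated in this sample, which is the pattern that mosaic or association plots display graphically.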
Exploring the data with the str() and summary() functions
str(HairEyeColor)
## 'table' num [1:4, 1:4, 1:2] 32 53 10 3 11 50 10 30 10 25 ...
## - attr(*, "dimnames")=List of 3
## ..$ Hair: chr [1:4] "Black" "Brown" "Red" "Blond"
## ..$ Eye : chr [1:4] "Brown" "Blue" "Hazel" "Green"
## ..$ Sex : chr [1:2] "Male" "Female"
summary(HairEyeColor)
## Number of cases in table: 592
## Number of factors: 3
## Test for independence of all factors:
## Chisq = 164.92, df = 24, p-value = 5.321e-23
## Chi-squared approximation may be incorrect
head(HairEyeColor)
## , , Sex = Male
##
## Eye
## Hair Brown Blue Hazel Green
## Black 32 11 10 3
## Brown 53 50 25 15
## Red 10 10 7 7
## Blond 3 30 5 8
##
## , , Sex = Female
##
## Eye
## Hair Brown Blue Hazel Green
## Black 36 9 5 2
## Brown 66 34 29 14
## Red 16 7 7 7
## Blond 4 64 5 8
data.df<- as.data.frame(HairEyeColor)
str(data.df)
## 'data.frame': 32 obs. of 4 variables:
## $ Hair: Factor w/ 4 levels "Black","Brown",..: 1 2 3 4 1 2 3 4 1 2 ...
## $ Eye : Factor w/ 4 levels "Brown","Blue",..: 1 1 1 1 2 2 2 2 3 3 ...
## $ Sex : Factor w/ 2 levels "Male","Female": 1 1 1 1 1 1 1 1 1 1 ...
## $ Freq: num 32 53 10 3 11 50 10 30 10 25 ...
Contingency Tables
With categorical variables, we usually want the frequency of each category, and contingency tables show these frequencies. For example, we may want the total count of female and male participants.
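One possible way to get those totals is a one-way table built with `xtabs()`, summing `Freq` over hair and eye color. The name `sex_counts` is an assumption made for this sketch.

```r
data("HairEyeColor")

# One-way contingency table: total participants by Sex,
# summing the Freq column over all hair and eye colors
sex_counts <- xtabs(Freq ~ Sex, data = as.data.frame(HairEyeColor))
sex_counts
```

This should agree with the row totals of any two-way table that includes Sex.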
To collapse the data to sex and eye color, we can build a two-way table of the two variables and then compute proportion tables from it.
gendereyemix<-xtabs(Freq~Sex+Eye,data.frame(HairEyeColor))
prop.table(gendereyemix, 1) # share of each eye color within each sex (rows sum to 1)
## Eye
## Sex Brown Blue Hazel Green
## Male 0.35125448 0.36200717 0.16845878 0.11827957
## Female 0.38977636 0.36421725 0.14696486 0.09904153
# share of men and women within each eye color (columns sum to 1)
prop.table(gendereyemix, 2)
## Eye
## Sex Brown Blue Hazel Green
## Male 0.4454545 0.4697674 0.5053763 0.5156250
## Female 0.5545455 0.5302326 0.4946237 0.4843750
# Number of men and women in the mix
margin.table(gendereyemix, 1)
## Sex
## Male Female
## 279 313
# Number of participants per eye color
margin.table(gendereyemix, 2)
## Eye
## Brown Blue Hazel Green
## 220 215 93 64
qplot(data = data.df, Eye, Freq, geom="boxplot", color=Sex)
Brown and blue are the most common eye colors for both males and females.
qplot(data = data.df, Hair, Freq, geom="boxplot", color=Sex)
Brown is the most common hair color for both males and females.
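Because `data.df` holds one row per Hair/Eye/Sex cell, the boxplots above summarise the spread of the cell counts rather than the totals. A bar chart of the summed counts is one alternative way to compare categories directly. The sketch below uses `ggplot()` instead of `qplot()`; the names `hair_df` and `hair_bars` are illustrative only.

```r
library(ggplot2)
data("HairEyeColor")

# Sum the counts over eye color to get one row per Hair/Sex combination
hair_df <- as.data.frame(margin.table(HairEyeColor, c(1, 3)))

# Side-by-side bars: total number of students per hair color, split by sex
hair_bars <- ggplot(hair_df, aes(x = Hair, y = Freq, fill = Sex)) +
  geom_col(position = "dodge") +
  labs(y = "Number of students")
hair_bars
```

The same pattern applies to eye color by replacing `c(1, 3)` with `c(2, 3)` and mapping `Eye` to the x axis.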
Suppose we are interested in the percentage of all participants who are blue-eyed males and blue-eyed females.
B_M <- data.df %>% select(Eye, Sex, Freq) %>% filter(Sex == "Male" & Eye == "Blue") %>% summarise(Male_Blue = sum(Freq))
B_F <- data.df %>% select(Eye, Sex, Freq) %>% filter(Sex == "Female" & Eye == "Blue") %>% summarise(Female_Blue = sum(Freq))
TOT <- data.df %>% summarise(TotH = sum(Freq))
male_blue <- B_M / TOT * 100
female_blue <- B_F / TOT * 100
male_blue
## Male_Blue
## 1 17.06081
female_blue
## Female_Blue
## 1 19.25676
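The same percentages can also be read from a proportion table, which may be a shorter route than filtering and summing by hand. This is a sketch with illustrative names (`sex_eye`, `pct_of_all`, `pct_within_sex`). It also shows the within-sex percentage, which answers a different question: what share of males (or females) have blue eyes.

```r
data("HairEyeColor")

# Two-way table of counts by Sex and Eye
sex_eye <- xtabs(Freq ~ Sex + Eye, data = as.data.frame(HairEyeColor))

# Percent of ALL participants in each Sex/Eye cell (whole table sums to 100)
pct_of_all <- prop.table(sex_eye) * 100
pct_of_all["Male", "Blue"]
pct_of_all["Female", "Blue"]

# Percent WITHIN each sex (each row sums to 100)
pct_within_sex <- prop.table(sex_eye, 1) * 100
pct_within_sex[, "Blue"]
```

The first two values should match `male_blue` and `female_blue` above, since both divide the cell count by the overall total of 592.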
Density plot of different eye colors
qplot(data = data.df, Eye, geom = "density", fill = Eye, alpha = I(0.6))
You can find the html file in RPubs