R Coding Exercise

Setting Up the Exercise

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(dslabs)

#look at help file for gapminder data
#help(gapminder)

#get an overview of data structure
str(gapminder)
'data.frame':   10545 obs. of  9 variables:
 $ country         : Factor w/ 185 levels "Albania","Algeria",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ year            : int  1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
 $ infant_mortality: num  115.4 148.2 208 NA 59.9 ...
 $ life_expectancy : num  62.9 47.5 36 63 65.4 ...
 $ fertility       : num  6.19 7.65 7.32 4.43 3.11 4.55 4.82 3.45 2.7 5.57 ...
 $ population      : num  1636054 11124892 5270844 54681 20619075 ...
 $ gdp             : num  NA 1.38e+10 NA NA 1.08e+11 ...
 $ continent       : Factor w/ 5 levels "Africa","Americas",..: 4 1 1 2 2 3 2 5 4 3 ...
 $ region          : Factor w/ 22 levels "Australia and New Zealand",..: 19 11 10 2 15 21 2 1 22 21 ...
#get a summary of data
summary(gapminder)
                country           year      infant_mortality life_expectancy
 Albania            :   57   Min.   :1960   Min.   :  1.50   Min.   :13.20  
 Algeria            :   57   1st Qu.:1974   1st Qu.: 16.00   1st Qu.:57.50  
 Angola             :   57   Median :1988   Median : 41.50   Median :67.54  
 Antigua and Barbuda:   57   Mean   :1988   Mean   : 55.31   Mean   :64.81  
 Argentina          :   57   3rd Qu.:2002   3rd Qu.: 85.10   3rd Qu.:73.00  
 Armenia            :   57   Max.   :2016   Max.   :276.90   Max.   :83.90  
 (Other)            :10203                  NA's   :1453                    
   fertility       population             gdp               continent   
 Min.   :0.840   Min.   :3.124e+04   Min.   :4.040e+07   Africa  :2907  
 1st Qu.:2.200   1st Qu.:1.333e+06   1st Qu.:1.846e+09   Americas:2052  
 Median :3.750   Median :5.009e+06   Median :7.794e+09   Asia    :2679  
 Mean   :4.084   Mean   :2.701e+07   Mean   :1.480e+11   Europe  :2223  
 3rd Qu.:6.000   3rd Qu.:1.523e+07   3rd Qu.:5.540e+10   Oceania : 684  
 Max.   :9.220   Max.   :1.376e+09   Max.   :1.174e+13                  
 NA's   :187     NA's   :185         NA's   :2972                       
             region    
 Western Asia   :1026  
 Eastern Africa : 912  
 Western Africa : 912  
 Caribbean      : 741  
 South America  : 684  
 Southern Europe: 684  
 (Other)        :5586  
#determine the type of object gapminder is
class(gapminder)
[1] "data.frame"
#get better view of the data
glimpse(gapminder)
Rows: 10,545
Columns: 9
$ country          <fct> "Albania", "Algeria", "Angola", "Antigua and Barbuda"…
$ year             <int> 1960, 1960, 1960, 1960, 1960, 1960, 1960, 1960, 1960,…
$ infant_mortality <dbl> 115.40, 148.20, 208.00, NA, 59.87, NA, NA, 20.30, 37.…
$ life_expectancy  <dbl> 62.87, 47.50, 35.98, 62.97, 65.39, 66.86, 65.66, 70.8…
$ fertility        <dbl> 6.19, 7.65, 7.32, 4.43, 3.11, 4.55, 4.82, 3.45, 2.70,…
$ population       <dbl> 1636054, 11124892, 5270844, 54681, 20619075, 1867396,…
$ gdp              <dbl> NA, 13828152297, NA, NA, 108322326649, NA, NA, 966778…
$ continent        <fct> Europe, Africa, Africa, Americas, Americas, Asia, Ame…
$ region           <fct> Southern Europe, Northern Africa, Middle Africa, Cari…

Examining Africa Data

africadata<- gapminder %>%
  filter(continent == "Africa")

str(africadata)
'data.frame':   2907 obs. of  9 variables:
 $ country         : Factor w/ 185 levels "Albania","Algeria",..: 2 3 18 22 26 27 29 31 32 33 ...
 $ year            : int  1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
 $ infant_mortality: num  148 208 187 116 161 ...
 $ life_expectancy : num  47.5 36 38.3 50.3 35.2 ...
 $ fertility       : num  7.65 7.32 6.28 6.62 6.29 6.95 5.65 6.89 5.84 6.25 ...
 $ population      : num  11124892 5270844 2431620 524029 4829291 ...
 $ gdp             : num  1.38e+10 NA 6.22e+08 1.24e+08 5.97e+08 ...
 $ continent       : Factor w/ 5 levels "Africa","Americas",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ region          : Factor w/ 22 levels "Australia and New Zealand",..: 11 10 20 17 20 5 10 20 10 10 ...
summary(africadata)
         country          year      infant_mortality life_expectancy
 Algeria     :  57   Min.   :1960   Min.   : 11.40   Min.   :13.20  
 Angola      :  57   1st Qu.:1974   1st Qu.: 62.20   1st Qu.:48.23  
 Benin       :  57   Median :1988   Median : 93.40   Median :53.98  
 Botswana    :  57   Mean   :1988   Mean   : 95.12   Mean   :54.38  
 Burkina Faso:  57   3rd Qu.:2002   3rd Qu.:124.70   3rd Qu.:60.10  
 Burundi     :  57   Max.   :2016   Max.   :237.40   Max.   :77.60  
 (Other)     :2565                  NA's   :226                     
   fertility       population             gdp               continent   
 Min.   :1.500   Min.   :    41538   Min.   :4.659e+07   Africa  :2907  
 1st Qu.:5.160   1st Qu.:  1605232   1st Qu.:8.373e+08   Americas:   0  
 Median :6.160   Median :  5570982   Median :2.448e+09   Asia    :   0  
 Mean   :5.851   Mean   : 12235961   Mean   :9.346e+09   Europe  :   0  
 3rd Qu.:6.860   3rd Qu.: 13888152   3rd Qu.:6.552e+09   Oceania :   0  
 Max.   :8.450   Max.   :182201962   Max.   :1.935e+11                  
 NA's   :51      NA's   :51          NA's   :637                        
                       region   
 Eastern Africa           :912  
 Western Africa           :912  
 Middle Africa            :456  
 Northern Africa          :342  
 Southern Africa          :285  
 Australia and New Zealand:  0  
 (Other)                  :  0  
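
The summary above still lists the other continents and regions with counts of 0, because filtering rows does not remove unused factor levels. A minimal sketch of dropping them with base R's `droplevels()`; the name `africadata_clean` is illustrative and not part of the original exercise.

```r
library(dplyr)
library(dslabs)

africadata <- gapminder %>%
  filter(continent == "Africa")

# droplevels() removes factor levels that no longer occur in the data
africadata_clean <- droplevels(africadata)  # hypothetical name

nlevels(africadata$continent)        # 5: all original continents are still levels
nlevels(africadata_clean$continent)  # 1: only "Africa" remains
nlevels(africadata_clean$region)     # 5: the five African regions
```

Dropping the levels is optional here; it mainly shortens `summary()` output and plot legends.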

Infant Mortality and Life Expectancy

mort_expect <- africadata %>%
  select(infant_mortality, life_expectancy, region) #kept region for sake of visualization

#examine data 
str(mort_expect)
'data.frame':   2907 obs. of  3 variables:
 $ infant_mortality: num  148 208 187 116 161 ...
 $ life_expectancy : num  47.5 36 38.3 50.3 35.2 ...
 $ region          : Factor w/ 22 levels "Australia and New Zealand",..: 11 10 20 17 20 5 10 20 10 10 ...
summary(mort_expect)
 infant_mortality life_expectancy                       region   
 Min.   : 11.40   Min.   :13.20   Eastern Africa           :912  
 1st Qu.: 62.20   1st Qu.:48.23   Western Africa           :912  
 Median : 93.40   Median :53.98   Middle Africa            :456  
 Mean   : 95.12   Mean   :54.38   Northern Africa          :342  
 3rd Qu.:124.70   3rd Qu.:60.10   Southern Africa          :285  
 Max.   :237.40   Max.   :77.60   Australia and New Zealand:  0  
 NA's   :226                      (Other)                  :  0  
#Plot
ggplot()+
  geom_point(aes(x=infant_mortality, y=life_expectancy, color=region), data=mort_expect)+ #color by region to make trends easier to see, since there are many countries
  xlab("Infant Mortality Rate")+ ylab("Life Expectancy (Yrs)")+
  theme_bw()
Warning: Removed 226 rows containing missing values (`geom_point()`).

Population and Life Expectancy

pop_expect <- africadata %>%
  select(population, life_expectancy, region) #kept region for same reason as prior plot

#examine data 
str(pop_expect)
'data.frame':   2907 obs. of  3 variables:
 $ population     : num  11124892 5270844 2431620 524029 4829291 ...
 $ life_expectancy: num  47.5 36 38.3 50.3 35.2 ...
 $ region         : Factor w/ 22 levels "Australia and New Zealand",..: 11 10 20 17 20 5 10 20 10 10 ...
summary(pop_expect)
   population        life_expectancy                       region   
 Min.   :    41538   Min.   :13.20   Eastern Africa           :912  
 1st Qu.:  1605232   1st Qu.:48.23   Western Africa           :912  
 Median :  5570982   Median :53.98   Middle Africa            :456  
 Mean   : 12235961   Mean   :54.38   Northern Africa          :342  
 3rd Qu.: 13888152   3rd Qu.:60.10   Southern Africa          :285  
 Max.   :182201962   Max.   :77.60   Australia and New Zealand:  0  
 NA's   :51                          (Other)                  :  0  
#Plot
ggplot()+
  geom_point(aes(x=log(population), y=life_expectancy, color=region), data=pop_expect)+
  theme_bw()+ xlab("Log Population")+ ylab("Life Expectancy (Yrs)")
Warning: Removed 51 rows containing missing values (`geom_point()`).

Select Years

Which years have missing data for infant mortality?

mort_expect_years <- africadata %>%
  select(infant_mortality, life_expectancy, region, year)%>% #include years
  filter(is.na(infant_mortality)) #look for missing values

head(mort_expect_years) #print
  infant_mortality life_expectancy         region year
1               NA           50.12 Western Africa 1960
2               NA           40.95  Middle Africa 1960
3               NA           45.77 Eastern Africa 1960
4               NA           37.69  Middle Africa 1960
5               NA           39.03 Eastern Africa 1960
6               NA           38.83  Middle Africa 1960
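
`head()` prints only the first six rows with missing values, so it does not list every affected year. A minimal sketch that counts missing infant mortality values per year, assuming the same Africa subset; `missing_by_year` and `n_missing` are illustrative names not used in the original.

```r
library(dplyr)
library(dslabs)

# Count rows with missing infant mortality for each year in the Africa data
missing_by_year <- gapminder %>%
  filter(continent == "Africa", is.na(infant_mortality)) %>%
  count(year, name = "n_missing")

missing_by_year

# The per-year counts should add up to the 226 NA's reported by summary()
sum(missing_by_year$n_missing)
```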

Select Data for 2000

Filter all Africa data for 2000

africadata2000<-africadata%>%
  filter(year == 2000)

#Double check code 
str(africadata2000)
'data.frame':   51 obs. of  9 variables:
 $ country         : Factor w/ 185 levels "Albania","Algeria",..: 2 3 18 22 26 27 29 31 32 33 ...
 $ year            : int  2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
 $ infant_mortality: num  33.9 128.3 89.3 52.4 96.2 ...
 $ life_expectancy : num  73.3 52.3 57.2 47.6 52.6 46.7 54.3 68.4 45.3 51.5 ...
 $ fertility       : num  2.51 6.84 5.98 3.41 6.59 7.06 5.62 3.7 5.45 7.35 ...
 $ population      : num  31183658 15058638 6949366 1736579 11607944 ...
 $ gdp             : num  5.48e+10 9.13e+09 2.25e+09 5.63e+09 2.61e+09 ...
 $ continent       : Factor w/ 5 levels "Africa","Americas",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ region          : Factor w/ 22 levels "Australia and New Zealand",..: 11 10 20 17 20 5 10 20 10 10 ...
summary(africadata2000)
         country        year      infant_mortality life_expectancy
 Algeria     : 1   Min.   :2000   Min.   : 12.30   Min.   :37.60  
 Angola      : 1   1st Qu.:2000   1st Qu.: 60.80   1st Qu.:51.75  
 Benin       : 1   Median :2000   Median : 80.30   Median :54.30  
 Botswana    : 1   Mean   :2000   Mean   : 78.93   Mean   :56.36  
 Burkina Faso: 1   3rd Qu.:2000   3rd Qu.:103.30   3rd Qu.:60.00  
 Burundi     : 1   Max.   :2000   Max.   :143.30   Max.   :75.00  
 (Other)     :45                                                  
   fertility       population             gdp               continent 
 Min.   :1.990   Min.   :    81154   Min.   :2.019e+08   Africa  :51  
 1st Qu.:4.150   1st Qu.:  2304687   1st Qu.:1.274e+09   Americas: 0  
 Median :5.550   Median :  8799165   Median :3.238e+09   Asia    : 0  
 Mean   :5.156   Mean   : 15659800   Mean   :1.155e+10   Europe  : 0  
 3rd Qu.:5.960   3rd Qu.: 17391242   3rd Qu.:8.654e+09   Oceania : 0  
 Max.   :7.730   Max.   :122876723   Max.   :1.329e+11                
                                                                      
                       region  
 Eastern Africa           :16  
 Western Africa           :16  
 Middle Africa            : 8  
 Northern Africa          : 6  
 Southern Africa          : 5  
 Australia and New Zealand: 0  
 (Other)                  : 0  

Visualize the data for 2000

#Plots
##Infant Mortality
ggplot()+
    geom_smooth(aes(x=infant_mortality, y=life_expectancy), color = "gray80", alpha = 0.1, data=africadata2000)+
  geom_point(aes(x=infant_mortality, y=life_expectancy, color=region), data=africadata2000)+
  labs(title = "Infant Mortality vs Life Expectancy", subtitle = "Year 2000") + 
  xlab("Infant Mortality")+ ylab("Life Expectancy (Yrs)") + 
  theme_bw() 
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

##Population
ggplot()+
    geom_smooth(aes(x=log(population), y=life_expectancy), color = "gray80", alpha = 0.1, data=africadata2000)+
  geom_point(aes(x=log(population), y=life_expectancy, color=region), data=africadata2000)+
  labs(title="Population vs. Life Expectancy", subtitle="Year 2000") + 
  xlab("Log Population")+ ylab("Life Expectancy (Yrs)") +
  theme_bw()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Quantify the Data

Fitting infant mortality as a predictor of life expectancy

fit1<- lm(life_expectancy ~ infant_mortality, data=africadata2000)
summary(fit1)

Call:
lm(formula = life_expectancy ~ infant_mortality, data = africadata2000)

Residuals:
     Min       1Q   Median       3Q      Max 
-22.6651  -3.7087   0.9914   4.0408   8.6817 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      71.29331    2.42611  29.386  < 2e-16 ***
infant_mortality -0.18916    0.02869  -6.594 2.83e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.221 on 49 degrees of freedom
Multiple R-squared:  0.4701,    Adjusted R-squared:  0.4593 
F-statistic: 43.48 on 1 and 49 DF,  p-value: 2.826e-08

The regression supports infant mortality as a predictor of life expectancy in 2000: each additional infant death per 1,000 live births is associated with a 0.19-year decrease in average life expectancy (p=2.83e-8), and the model explains about 47% of the variance in life expectancy (R-squared = 0.47). The intercept, 71.3 years, is the predicted life expectancy at an infant mortality rate of 0, which lies outside the observed range (the minimum in 2000 is 12.3).
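
To make the coefficients concrete, here is a small sketch that uses `predict()` on the same model. The infant mortality values 0, 50, and 100 are arbitrary illustration points, and `new_obs` is a hypothetical name.

```r
library(dplyr)
library(dslabs)

africadata2000 <- gapminder %>%
  filter(continent == "Africa", year == 2000)

fit1 <- lm(life_expectancy ~ infant_mortality, data = africadata2000)

# Illustrative infant mortality rates (deaths per 1,000 live births)
new_obs <- data.frame(infant_mortality = c(0, 50, 100))

# Each prediction is intercept + slope * infant_mortality.
# With the reported coefficients: 71.29 - 0.18916 * 50 is about 61.8 years,
# and 71.29 - 0.18916 * 100 is about 52.4 years.
predict(fit1, newdata = new_obs)
```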

Fitting population as a predictor of life expectancy

fit2<- lm(population ~ infant_mortality, data=africadata2000)
summary(fit2)

Call:
lm(formula = population ~ infant_mortality, data = africadata2000)

Residuals:
      Min        1Q    Median        3Q       Max 
-16307667 -12769228  -7828854    733380 105710100 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)      12063474    8682734   1.389    0.171
infant_mortality    45564     102671   0.444    0.659

Residual standard error: 22260000 on 49 degrees of freedom
Multiple R-squared:  0.004003,  Adjusted R-squared:  -0.01632 
F-statistic: 0.1969 on 1 and 49 DF,  p-value: 0.6592

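The model above regresses population on infant mortality rather than life expectancy on population. For this model, there is no evidence of a linear association between population and infant mortality (p=0.66, R-squared = 0.004).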
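
A minimal sketch of the model named in this subsection's heading, life expectancy regressed on population, assuming the same `africadata2000` subset. The log-transformed variant mirrors the x-axis of the plots and is an added assumption, not something the original fit; `fit_pop` and `fit_logpop` are hypothetical names. No output is shown because the original did not run these models.

```r
library(dplyr)
library(dslabs)

africadata2000 <- gapminder %>%
  filter(continent == "Africa", year == 2000)

# Model named in the heading: life expectancy is the outcome, population the predictor
fit_pop <- lm(life_expectancy ~ population, data = africadata2000)
summary(fit_pop)

# Assumed variant: log-transformed population, matching the x-axis of the plots
fit_logpop <- lm(life_expectancy ~ log(population), data = africadata2000)
summary(fit_logpop)
```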

This section added by Annabella Hines

First, I wanted to look at the full gapminder dataset for the year 2000. I created a boxplot to compare the distribution of life expectancy across continents.

##create an object of the gapminder data for the year 2000
continent<- gapminder %>% filter(year==2000)
#A boxplot of the continent data viewing life expectancy by continent
ggplot(data=continent, aes(x=continent, y=life_expectancy, color=continent))+geom_boxplot()+xlab("Continent")+ylab("Life Expectancy")

The life expectancy distributions look fairly comparable across continents, except for Africa, which has a lower overall distribution, a lower median, and more outliers.
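
The visual comparison can be backed with a numeric summary. The sketch below computes the median and interquartile range of life expectancy per continent for 2000; `gap2000` and `life_by_continent` are illustrative names, not objects from the original.

```r
library(dplyr)
library(dslabs)

gap2000 <- gapminder %>%
  filter(year == 2000)

# Median and spread of life expectancy per continent, lowest median first
life_by_continent <- gap2000 %>%
  group_by(continent) %>%
  summarise(
    n_countries = n(),
    median_life = median(life_expectancy, na.rm = TRUE),
    iqr_life    = IQR(life_expectancy, na.rm = TRUE)
  ) %>%
  arrange(median_life)

life_by_continent
```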

Next, I compared fertility to infant mortality grouped by continent for the year 2000 to see if there were any noticeable trends.

##created a scatterplot of fertility and infant mortality color coded by continent
ggplot(data=continent, aes(x=fertility, y=infant_mortality, color=continent))+geom_point()+ylab("Infant Mortality")+ xlab("Fertility")
Warning: Removed 7 rows containing missing values (`geom_point()`).

There seems to be a positive correlation between fertility and infant mortality, with Europe having low values in each and Africa having the highest.
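
One way to put a number on this impression is a Pearson correlation. This is a sketch, assuming a linear relationship is a reasonable summary; `gap2000` and `r_fert_mort` are illustrative names.

```r
library(dslabs)

gap2000 <- subset(gapminder, year == 2000)

# use = "complete.obs" drops the rows with missing infant mortality or fertility
r_fert_mort <- cor(gap2000$fertility, gap2000$infant_mortality,
                   use = "complete.obs")
r_fert_mort

# cor.test() adds a confidence interval and p-value for the same correlation
cor.test(gap2000$fertility, gap2000$infant_mortality)
```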

In the next section I wanted to explore how the life expectancy changed across the years for the different regions in Africa.

#create an object out of africadata with year, life expectancy, and region
africaregions<- africadata %>% select(year, life_expectancy, region)
#create plot showing year vs. life expectancy color coded by region
ggplot(data=africaregions, aes(x=year, y=life_expectancy, color=region))+geom_smooth()+ylab("Life Expectancy")+xlab("Year")
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The plot shows a generally positive relationship between year and life expectancy, so I fit a linear model to check this observation.

#Fit life expectancy against year for the africadata
fit3<-lm(life_expectancy~year, data=africadata)
summary(fit3)

Call:
lm(formula = life_expectancy ~ year, data = africadata)

Residuals:
    Min      1Q  Median      3Q     Max 
-43.133  -5.197  -0.555   4.332  18.368 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -592.32739   16.12341  -36.74   <2e-16 ***
year           0.32531    0.00811   40.11   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.194 on 2905 degrees of freedom
Multiple R-squared:  0.3564,    Adjusted R-squared:  0.3562 
F-statistic:  1609 on 1 and 2905 DF,  p-value: < 2.2e-16

According to the model output above, year and life expectancy for the African countries are positively associated: life expectancy rises by about 0.33 years per calendar year, and the slope is significant at the 0.05 level (p < 2e-16).
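
The slope is easier to read per decade, and a confidence interval shows its uncertainty. A minimal sketch under the same model; `slope` is an illustrative name. Note that this fit pools repeated yearly observations from each country, so the interval treats them as independent, which is a simplifying assumption.

```r
library(dplyr)
library(dslabs)

africadata <- gapminder %>%
  filter(continent == "Africa")

fit3 <- lm(life_expectancy ~ year, data = africadata)

# Change in life expectancy per decade: with the reported slope of 0.32531,
# this is roughly 3.25 years
slope <- coef(fit3)[["year"]]
slope * 10

# 95% confidence interval for the yearly slope
confint(fit3, "year", level = 0.95)
```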

#Load broom and present lm output in a table
library(broom)
tidy(fit3)
# A tibble: 2 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept) -592.     16.1         -36.7 5.24e-243
2 year           0.325   0.00811      40.1 2.37e-280