Data Analysis Exercise - Smoking and Tobacco Use

About the Dataset

This dataset is the CDC Behavioral Risk Factor Surveillance System (BRFSS) Smoking and Tobacco Use data from 1995 to 2010. It reports the rate at which each state (and US territory) exhibits different smoking habits. The rates are weighted to population characteristics so that populations of different sizes can be compared. It includes the variables “Smoke everyday”, “Smoke some days”, “Former smoker”, and “Never smoked”. I chose it because it was complete and relatively clean, and because it spans many years. That gives it far more than the 53 rows I saw in multiple other data sets; I wanted something larger than one entry per state.

This data has also been used to create data visualizations published on the CDC website, if you’re interested in looking further into it.

Processing the Data

Reading in the Data

Tobacco Use - Smoking Data for 1995-2010

A note: I commented out the code chunk so that only the data table is displayed.

Rows: 876 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): State, Location 1
dbl (5): Year, Smoke everyday, Smoke some days, Former smoker, Never smoked

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 10 × 7
    Year State          `Smoke everyday` Smoke some da…¹ Forme…² Never…³ Locat…⁴
   <dbl> <chr>                     <dbl>           <dbl>   <dbl>   <dbl> <chr>  
 1  1996 Puerto Rico                 9.4             5.1    16      69.5 "Puert…
 2  2005 Virgin Islands              5.3             2.8    12.8    79.1 "Virgi…
 3  2005 Puerto Rico                 7.9             5.2    16.9    70   "Puert…
 4  2002 Virgin Islands              7               2.4    12.1    78.5 "Virgi…
 5  2003 Guam                       26.3             7.8    14.3    51.7  <NA>  
 6  2000 Oregon                     15.3             5.4    28.3    50.9 "Orego…
 7  2002 New Mexico                 15               6.2    26      52.8 "New M…
 8  1996 Indiana                    24.9             3.7    23.7    47.6 "India…
 9  2006 Louisiana                  17.7             5.7    20.1    56.4 "Louis…
10  2002 Georgia                    17.7             5.6    22.4    54.3 "Georg…
# … with abbreviated variable names ¹​`Smoke some days`, ²​`Former smoker`,
#   ³​`Never smoked`, ⁴​`Location 1`

We can see that there are two potential location variables, one which also includes latitude and longitude. I want to see if these locations are different, and separate the coordinates from the location. I’m also not sure what the coordinates mean – they may be the midpoint of the state, the capital, the data collection center, etc.

A quick Google search for the first set of coordinates, those for Oregon, showed that they point to what looks like the state's midpoint. The coordinates for Indiana (my home state!) and Georgia support this: they land in the heart of downtown Indianapolis, in the center of the state, and in Macon, respectively.

We can also see 4 entries, for Guam and the Virgin Islands in 2009-2010, that contain only coordinates. Because they are not structured the same way as the rest of the data, we need to make sure they get moved to the correct columns.

Cleaning the Data

#Separate location from coordinates
tobacco_clean1<-separate(tobacco, col = `Location 1`, into = c("Location", "LatLong"), sep = "\n")
Warning: Expected 2 pieces. Missing pieces filled with `NA` in 27 rows [1, 2,
3, 4, 41, 42, 56, 121, 192, 204, 253, 293, 300, 366, 409, 424, 430, 447, 490,
562, ...].

The warning flags 27 rows where the field did not split into two pieces. I’m not too worried about those; I mainly want to investigate and standardize the data that does exist.
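If the warning is a distraction, `separate()` accepts a `fill` argument that pads missing pieces silently. Here is a minimal sketch on a made-up two-row table (the `toy` object and its values are illustrative, not the real data):

```r
library(tidyr)
library(tibble)

# Toy stand-in for the `Location 1` column: one full entry, one with a name only
toy <- tibble(
  State = c("Oregon", "Puerto Rico"),
  `Location 1` = c("Oregon\n(44.5, -120.1)", "Puerto Rico")
)

# fill = "right" pads the missing right-hand piece with NA and skips the warning
toy_split <- separate(toy, col = `Location 1`,
                      into = c("Location", "LatLong"),
                      sep = "\n", fill = "right")
toy_split
```

The result is the same as the default behavior; only the warning goes away, so this is best used once the affected rows have been looked at.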

#account for data with different structures
tobacco_clean1$LatLong<-ifelse(tobacco_clean1$Year %in% c(2009, 2010) & 
                               tobacco_clean1$State %in% c("Virgin Islands", "Guam"), 
                                  tobacco_clean1$Location,
                                  tobacco_clean1$LatLong)

tobacco_clean1$Location<-ifelse(tobacco_clean1$Year %in% c(2009, 2010) & 
                               tobacco_clean1$State %in% c("Virgin Islands", "Guam"), 
                                 NA,
                                 tobacco_clean1$Location)
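The same swap can be sketched with `dplyr::mutate()`, which makes the order dependence easier to see: `LatLong` has to be filled from `Location` before `Location` is blanked. The `toy` table and the helper column `coords_only` are hypothetical names used only for this illustration.

```r
library(dplyr)
library(tibble)

# Toy rows: one coordinates-only entry (Guam 2009) and one regular entry
toy <- tibble(
  Year     = c(2009, 2008),
  State    = c("Guam", "Oregon"),
  Location = c("(13.4, 144.8)", "Oregon"),
  LatLong  = c(NA, "(44.5, -120.1)")
)

toy_fixed <- toy %>%
  mutate(
    # flag the rows whose Location actually holds coordinates
    coords_only = Year %in% c(2009, 2010) &
      State %in% c("Virgin Islands", "Guam"),
    # copy the coordinates over first...
    LatLong  = if_else(coords_only, Location, LatLong),
    # ...then clear the misplaced value
    Location = if_else(coords_only, NA_character_, Location)
  ) %>%
  select(-coords_only)
toy_fixed
```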


#separate the coordinates from each other
tobacco_clean2<-separate(tobacco_clean1, col = LatLong, into = c("Lat", "Long"), sep = ",")
#drop the closing parenthesis from Long
tobacco_clean2$Long<-str_sub(tobacco_clean2$Long, 1, str_length(tobacco_clean2$Long)-1)
#drop only the opening parenthesis from Lat (an end of -2 would also cut its last digit)
tobacco_clean2$Lat<-str_sub(tobacco_clean2$Lat, 2, -1)
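At this point `Lat` and `Long` are still character columns, while the summary later in the document reports them as numeric, so a conversion presumably happens in a step that is not shown. One way it could be done in base R, assuming the coordinates look like `"(lat, long)"` (the values below are made up):

```r
# Made-up coordinate strings in the "(lat, long)" shape, plus a missing entry
lat_long <- c("(44.5, -120.1)", NA)

# Strip parentheses and spaces, then split on the comma
parts <- strsplit(gsub("[() ]", "", lat_long), ",")

# Take the first and second piece of each entry and convert to numeric
lat  <- as.numeric(sapply(parts, `[`, 1))
long <- as.numeric(sapply(parts, `[`, 2))

data.frame(lat, long)
```

Missing entries stay `NA` throughout, so no rows are lost.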

#look at similarities/differences in Location and State
different_locs<-tobacco_clean2%>%
  filter(State != Location) 
different_locs #none!
# A tibble: 0 × 9
# … with 9 variables: Year <dbl>, State <chr>, Smoke everyday <dbl>,
#   Smoke some days <dbl>, Former smoker <dbl>, Never smoked <dbl>,
#   Location <chr>, Lat <chr>, Long <chr>

Great! Since there are no mismatched entries, we can remove one of the duplicate columns.

tobacco_clean_F<-tobacco_clean2%>%
  select(-Location)
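One caveat on the `State != Location` check: `filter()` keeps only rows where the condition is `TRUE`, so rows with a missing `Location` are dropped from the comparison rather than reported. A small sketch on made-up rows (the "Indianapolis" mismatch is invented purely to show the behavior):

```r
library(dplyr)
library(tibble)

toy <- tibble(
  State    = c("Oregon", "Guam", "Indiana"),
  Location = c("Oregon", NA, "Indianapolis")
)

# The Guam comparison evaluates to NA, so only the Indiana row comes back
mismatch_strict <- filter(toy, State != Location)

# Asking for missing values explicitly surfaces the Guam row too
mismatch_or_missing <- filter(toy, is.na(Location) | State != Location)

mismatch_strict
mismatch_or_missing
```

For the real data this does not change the conclusion, since `State` is the column being kept, but it explains why an empty result does not by itself mean every row had a matching `Location`.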

Wide to Tall

Next, since each rate is in its own column, the data may be difficult to analyze and compare, so I want to convert it from wide to tall format.

tobacco_tall<-gather(tobacco_clean_F, SmokeAmount, Rate, `Smoke everyday`:`Never smoked`)
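`gather()` still works, but tidyr now documents it as superseded by `pivot_longer()`. As a sketch of the equivalent call, here it is on a two-row table built from values shown in the preview above, using only two of the four rate columns for brevity:

```r
library(tidyr)
library(tibble)

# Two rows copied from the data preview, with two of the four rate columns
toy_wide <- tibble(
  Year  = c(2000, 1996),
  State = c("Oregon", "Indiana"),
  `Smoke everyday` = c(15.3, 24.9),
  `Never smoked`   = c(50.9, 47.6)
)

# Column names go into SmokeAmount, cell values into Rate
toy_tall <- pivot_longer(toy_wide,
                         cols = `Smoke everyday`:`Never smoked`,
                         names_to = "SmokeAmount",
                         values_to = "Rate")
toy_tall
```

The row order differs from `gather()` (rows are grouped by original row rather than by column), which does not matter for the grouping and plotting done later.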

Data Table for Tobacco Use - Smoking Data for 1995-2010

This will be our dataset for analysis! It allows us to group by the “SmokeAmount” variable while having a consistent variable of interest “Rate” among all categories.

Data Analysis

Since this data covers 1995 to 2010, I want to first look at these variables over time. Below are plots for each category across all years.

ggplot()+
  geom_line(aes(x=Year, y=Rate, group=State), data=tobacco_tall, alpha = 0.2, color = "blue")+
  facet_wrap(.~SmokeAmount)+
  theme_bw()

This plot is a bit muddy, but we can see the general trends in rates for each category. In most states, never having smoked is the most common category. Former smoker and smoking every day seem fairly comparable, with smoking every day on the decline. Finally, smoking some days is the least common, but there does seem to be a very slight increase since 1995.

This is where my work ends, best of luck to Player 2.

Kelly Hatfield’s Section

Step 1: Viewing the Tobacco R data

summary(tobacco_tall)
      Year         State                Lat             Long        
 Min.   :1995   Length:3504        Min.   :13.40   Min.   :-157.86  
 1st Qu.:1999   Class :character   1st Qu.:35.47   1st Qu.:-106.13  
 Median :2003   Mode  :character   Median :39.49   Median : -89.54  
 Mean   :2003                      Mean   :39.60   Mean   : -92.63  
 3rd Qu.:2007                      3rd Qu.:43.63   3rd Qu.: -78.46  
 Max.   :2010                      Max.   :64.85   Max.   : 144.78  
                                   NA's   :240     NA's   :240      
 SmokeAmount             Rate       
 Length:3504        Min.   : 1.300  
 Class :character   1st Qu.: 7.275  
 Mode  :character   Median :21.000  
                    Mean   :24.994  
                    3rd Qu.:34.925  
                    Max.   :83.700  
                                    

Step 2: See how many years and states are represented.

table(tobacco_tall$SmokeAmount)

  Former smoker    Never smoked  Smoke everyday Smoke some days 
            876             876             876             876 

#keep only the "Never smoked" rows, then the year 2000 within those
tobacco_tall_NS = subset(tobacco_tall, SmokeAmount == "Never smoked")
tobacco_tall_NS_2000=subset(tobacco_tall_NS, Year == 2000)

table(tobacco_tall_NS$State)

                                 Alabama 
                                      16 
                                  Alaska 
                                      16 
                                 Arizona 
                                      16 
                                Arkansas 
                                      16 
                              California 
                                      16 
                                Colorado 
                                      16 
                             Connecticut 
                                      16 
                                Delaware 
                                      16 
                    District of Columbia 
                                      15 
                                 Florida 
                                      16 
                                 Georgia 
                                      16 
                                    Guam 
                                       7 
                                  Hawaii 
                                      15 
                                   Idaho 
                                      16 
                                Illinois 
                                      16 
                                 Indiana 
                                      16 
                                    Iowa 
                                      16 
                                  Kansas 
                                      16 
                                Kentucky 
                                      16 
                               Louisiana 
                                      16 
                                   Maine 
                                      16 
                                Maryland 
                                      16 
                           Massachusetts 
                                      16 
                                Michigan 
                                      16 
                               Minnesota 
                                      16 
                             Mississippi 
                                      16 
                                Missouri 
                                      16 
                                 Montana 
                                      16 
              Nationwide (States and DC) 
                                      16 
Nationwide (States, DC, and Territories) 
                                      16 
                                Nebraska 
                                      16 
                                  Nevada 
                                      16 
                           New Hampshire 
                                      16 
                              New Jersey 
                                      16 
                              New Mexico 
                                      16 
                                New York 
                                      16 
                          North Carolina 
                                      16 
                            North Dakota 
                                      16 
                                    Ohio 
                                      16 
                                Oklahoma 
                                      16 
                                  Oregon 
                                      16 
                            Pennsylvania 
                                      16 
                             Puerto Rico 
                                      15 
                            Rhode Island 
                                      16 
                          South Carolina 
                                      16 
                            South Dakota 
                                      16 
                               Tennessee 
                                      16 
                                   Texas 
                                      16 
                                    Utah 
                                      14 
                                 Vermont 
                                      16 
                          Virgin Islands 
                                      10 
                                Virginia 
                                      16 
                              Washington 
                                      16 
                           West Virginia 
                                      16 
                               Wisconsin 
                                      16 
                                 Wyoming 
                                      16 

table(tobacco_tall_NS$Year)

1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 
  51   53   54   54   54   54   56   56   56   54   55   55   56   56   56   56 
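The State table also shows two "Nationwide" aggregate rows sitting alongside the individual states and territories, so they are counted in any per-year summary of `tobacco_tall_NS`. If state-level summaries are wanted, one option is to filter them out first. A minimal sketch on made-up rows (the nationwide rates here are invented; `states_only` is a hypothetical name):

```r
library(dplyr)
library(tibble)

toy <- tibble(
  State = c("Oregon", "Indiana",
            "Nationwide (States and DC)",
            "Nationwide (States, DC, and Territories)"),
  Year  = 2000,
  Rate  = c(50.9, 47.6, 50, 51)  # nationwide values are placeholders
)

# Drop any row whose State label starts with "Nationwide"
states_only <- filter(toy, !startsWith(State, "Nationwide"))
states_only
```

Whether to drop these rows is a judgment call; the analysis in this document keeps them.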

Step 3: Create a table of average rate of Never Smokers by year

#Base R version: mean Rate by Year (overwritten by the dplyr version below)
results <- aggregate(tobacco_tall_NS$Rate, list(tobacco_tall_NS$Year), FUN=mean)

#dplyr version: mean Rate by Year
results <- tobacco_tall_NS %>%
  group_by(Year)%>%
  summarise_at(vars(Rate), list('Average'=mean))

library(knitr)

knitr::kable(results, caption = "Average Percent of Never Smokers by Year", digits=2)

Average Percent of Never Smokers by Year
Year Average
1995 52.08
1996 52.37
1997 52.98
1998 52.84
1999 53.39
2000 52.93
2001 52.67
2002 52.69
2003 53.04
2004 55.55
2005 54.87
2006 55.52
2007 55.83
2008 56.54
2009 56.89
2010 57.54
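`summarise_at()` is marked as superseded in current dplyr, and for a single column a plain `summarise()` does the same job. A sketch with made-up rates, assuming the same `Year` and `Rate` column names:

```r
library(dplyr)
library(tibble)

# Made-up rates for two years
toy <- tibble(
  Year = c(1995, 1995, 1996),
  Rate = c(50, 54, 53)
)

# One row per year with the mean rate; na.rm guards against missing rates
toy_results <- toy %>%
  group_by(Year) %>%
  summarise(Average = mean(Rate, na.rm = TRUE))
toy_results
```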

Step 4: Create box plots and spaghetti plots showing rates of % Never smoked for states by year

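#box plot: distribution of the never-smoked rate across states, by year
ggplot(tobacco_tall_NS, aes(x=factor(Year), y=Rate)) + geom_boxplot() + ylim(0,100)

#spaghetti plot: one line per state over time
ggplot(tobacco_tall_NS, aes(x=Year, y=Rate, group=State)) + geom_line() + ylim(0,100)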

Step 5: Print the top 5 states with the highest percentage of never smokers in 2000

library(dplyr)

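#keep the columns of interest for 2000 (despite the name, this is not sorted)
sorted_data <- subset(tobacco_tall_NS_2000,select=c(State,Year,Rate))
#top_n() keeps ties, so more than 5 rows can come back
sorted_data2 <- top_n(sorted_data,5,Rate) 
print(sorted_data2)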
# A tibble: 6 × 3
  State                 Year  Rate
  <chr>                <dbl> <dbl>
1 Arizona               2000  59.7
2 Minnesota             2000  57.1
3 Utah                  2000  68.8
4 District of Columbia  2000  57.1
5 Oklahoma              2000  57.4
6 Puerto Rico           2000  71.6
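The output has six rows rather than five because Minnesota and the District of Columbia tie at 57.1, and `top_n()` keeps ties; the rows also come back in their original order rather than ranked. A sketch of how `slice_max()` could handle both points, using the six rates printed above plus Oregon's 2000 rate from the data preview:

```r
library(dplyr)
library(tibble)

toy <- tibble(
  State = c("Arizona", "Minnesota", "Utah", "District of Columbia",
            "Oklahoma", "Puerto Rico", "Oregon"),
  Rate  = c(59.7, 57.1, 68.8, 57.1, 57.4, 71.6, 50.9)
)

# Ties kept (the default): both states at 57.1 are returned, sorted by Rate
top5_with_ties <- slice_max(toy, Rate, n = 5)

# Ties broken by row order: exactly five rows
top5_strict <- slice_max(toy, Rate, n = 5, with_ties = FALSE)

top5_with_ties
top5_strict
```

Keeping ties is arguably the more honest choice here, since there is no principled way to pick one of the two tied states.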