Fitting

Load In Data

load("../../fluanalysis/data/clean_symptoms.rds")

Model Fitting

Goals:

  • Load the cleaned data.

  • Fit a linear model to the continuous outcome (Body temperature) using only the main predictor of interest.

  • Fit another linear model to the continuous outcome using all (important) predictors of interest.

  • Compare results between the linear model with just the main predictor and the one with all predictors.

  • Fit a logistic model to the categorical outcome (Nausea) using only the main predictor of interest.

  • Fit another logistic model to the categorical outcome using all (important) predictors of interest.

  • Compare results between the logistic model with just the main predictor and the one with all predictors.

Continuous + Runny Nose

# Linear model: body temperature as a function of runny nose only
lm_fit <- 
  linear_reg() %>% 
  set_engine("lm") %>%
  fit(BodyTemp ~ RunnyNose, data = select_sympact)
tidy(lm_fit)
# A tibble: 2 × 5
  term         estimate std.error statistic p.value
  <chr>           <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)    99.1      0.0819   1210.   0      
2 RunnyNoseYes   -0.293    0.0971     -3.01 0.00268

Continuous + Everything

Predictors: swollen lymph nodes, chills/sweats, subjective fever, myalgia (all Yes/No), and weakness (severity scale)

# Linear model: body temperature as a function of five symptom predictors
lm_fit2 <- 
  linear_reg() %>% 
  set_engine("lm") %>%
  fit(BodyTemp ~ SwollenLymphNodes + ChillsSweats + SubjectiveFever + MyalgiaYN + Weakness,
      data = select_sympact)
tidy(lm_fit2)
# A tibble: 8 × 5
  term                 estimate std.error statistic   p.value
  <chr>                   <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)           98.4       0.197    499.    0        
2 SwollenLymphNodesYes  -0.120     0.0880    -1.36  0.174    
3 ChillsSweatsYes        0.157     0.126      1.24  0.214    
4 SubjectiveFeverYes     0.443     0.102      4.34  0.0000164
5 MyalgiaYNYes           0.0748    0.152      0.491 0.623    
6 WeaknessMild           0.107     0.190      0.562 0.574    
7 WeaknessModerate       0.0972    0.191      0.510 0.610    
8 WeaknessSevere         0.335     0.212      1.57  0.116    

Categorical + Runny Nose

# Logistic model: nausea as a function of runny nose only
lm_fit3 <- 
  logistic_reg() %>% 
  set_engine("glm") %>%
  fit(Nausea ~ RunnyNose, data = select_sympact)
tidy(lm_fit3)
# A tibble: 2 × 5
  term         estimate std.error statistic    p.value
  <chr>           <dbl>     <dbl>     <dbl>      <dbl>
1 (Intercept)   -0.658      0.145    -4.53  0.00000589
2 RunnyNoseYes   0.0605     0.172     0.353 0.724     

Categorical + Everything

Predictors: abdominal pain, chest pain, insomnia, vision, and vomiting (all Yes/No)

# Logistic model: nausea as a function of five symptom predictors
lm_fit4 <- 
  logistic_reg() %>% 
  set_engine("glm") %>%
  fit(Nausea ~ AbPain + ChestPain + Insomnia + Vision + Vomit,
      data = select_sympact)
tidy(lm_fit4)
# A tibble: 6 × 5
  term         estimate std.error statistic  p.value
  <chr>           <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    -1.21      0.145    -8.36  6.22e-17
2 AbPainYes       1.24      0.255     4.87  1.09e- 6
3 ChestPainYes    0.170     0.183     0.928 3.53e- 1
4 InsomniaYes     0.193     0.172     1.12  2.61e- 1
5 VisionYes       0.588     0.510     1.15  2.49e- 1
6 VomitYes        2.27      0.314     7.24  4.57e-13

Comparisons

For both outcomes of interest, using the more complex model is probably more advantageous, especially since the dataset contains 32 candidate variables. It is limiting to claim that a single variable has stronger predictive value than several used jointly (even though the predictors here were chosen according to my own biases as the analyst).
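
One way to put numbers on that comparison is to look at overall fit statistics side by side. The sketch below assumes the fitted objects above are still in the environment and uses parsnip's extract_fit_engine() to pull out the underlying lm/glm objects so that glance() returns adjusted R-squared, AIC, and deviance.

# Overall fit statistics for the two linear models (adjusted R-squared, AIC, ...)
glance(extract_fit_engine(lm_fit))   # BodyTemp ~ RunnyNose
glance(extract_fit_engine(lm_fit2))  # BodyTemp ~ five symptom predictors

# Overall fit statistics for the two logistic models (AIC, residual deviance, ...)
glance(extract_fit_engine(lm_fit3))  # Nausea ~ RunnyNose
glance(extract_fit_engine(lm_fit4))  # Nausea ~ five symptom predictors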

One drawback of using multiple variables is that, in both the categorical and continuous models, the predictors were a mix of yes/no responses and Likert-style scales. For a more in-depth analysis these should be handled more carefully, either by using only the yes/no variables or by finding a way to account for the differing factor levels across predictors. Applying the fitted models with the predict() function would also highlight differences between the models, and it is convenient that tidymodels builds this in rather than requiring several extra steps with individual modeling packages. I was not able to explore it fully before this exercise's deadline, but a rough sketch of what that step could look like follows below.
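
This sketch scores in-sample predictions from the larger linear and logistic models; it assumes the fitted objects above are still available and uses the rmse() and accuracy() metrics from yardstick (loaded with tidymodels). Without a train/test split these are optimistic in-sample numbers rather than a real evaluation.

# In-sample predictions from the multi-predictor linear model, scored with RMSE
preds_temp <- select_sympact %>% 
  bind_cols(predict(lm_fit2, new_data = select_sympact))
rmse(preds_temp, truth = BodyTemp, estimate = .pred)

# In-sample class predictions from the multi-predictor logistic model, scored with accuracy
preds_nausea <- select_sympact %>% 
  bind_cols(predict(lm_fit4, new_data = select_sympact))
accuracy(preds_nausea, truth = Nausea, estimate = .pred_class)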

I would love to play around with tidymodels more, as well as with KNN and Bayesian models. Exploring more of the tidymodels output and different model types is very interesting to me. I think I'm beginning to understand the tidymodels framework, and I'm intrigued by its implications beyond linear and logistic regression; a small illustration of that is sketched below.
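
This is only a hypothetical sketch, not something run for this exercise: it swaps the linear specification for parsnip's nearest_neighbor() to fit a K-nearest-neighbors regression on body temperature, and it assumes the kknn engine package is installed.

# Hypothetical: KNN regression for body temperature with the same five predictors
knn_fit <- 
  nearest_neighbor(neighbors = 5) %>% 
  set_engine("kknn") %>% 
  set_mode("regression") %>% 
  fit(BodyTemp ~ SwollenLymphNodes + ChillsSweats + SubjectiveFever + MyalgiaYN + Weakness,
      data = select_sympact)
predict(knn_fit, new_data = select_sympact)  # returns a .pred column, as with lm_fit2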

Unfortunately, given the deadline, taking full advantage of this exercise hasn't been possible, but over the next few weeks I'm very interested in learning more about machine learning and its methods in R. Over spring break I'll likely revisit this exercise to better understand tidymodels and its capabilities, both for this class and for my own research. If I discover any new insights, I can add an addendum to this code.

Thank you!