# Fitting

## Load In Data

```{r}
library(tidymodels)
# An .rds file holds a single object, so readRDS() (with assignment) is the
# right loader here; load() is for .rda/.RData files.
select_sympact <- readRDS("../../fluanalysis/data/clean_symptoms.rds")
```

## Model Fitting

Goals:
- Loads cleaned data.
- Fits a linear model to the continuous outcome (body temperature) using only the main predictor of interest.
- Fits another linear model to the continuous outcome using all (important) predictors of interest.
- Compares the results of the linear model with just the main predictor and the model with all predictors.
- Fits a logistic model to the categorical outcome (nausea) using only the main predictor of interest.
- Fits another logistic model to the categorical outcome using all (important) predictors of interest.
- Compares the results of the logistic model with just the main predictor and the model with all predictors.
## Continuous + Runny Nose
```{r}
lm_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(BodyTemp ~ RunnyNose, data = select_sympact)
tidy(lm_fit)
```
```
# A tibble: 2 × 5
  term         estimate std.error statistic p.value
  <chr>           <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)    99.1      0.0819   1210.   0      
2 RunnyNoseYes   -0.293    0.0971     -3.01 0.00268
```
## Continuous + Everything

Swollen lymph nodes, chills, subjective fever, myalgia, and weakness (Y/N)
```{r}
lm_fit2 <- linear_reg() %>%
  set_engine("lm") %>%
  fit(BodyTemp ~ SwollenLymphNodes + ChillsSweats + SubjectiveFever +
        MyalgiaYN + Weakness,
      data = select_sympact)
tidy(lm_fit2)
```
```
# A tibble: 8 × 5
  term                 estimate std.error statistic   p.value
  <chr>                   <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)           98.4       0.197    499.    0        
2 SwollenLymphNodesYes  -0.120     0.0880    -1.36  0.174    
3 ChillsSweatsYes        0.157     0.126      1.24  0.214    
4 SubjectiveFeverYes     0.443     0.102      4.34  0.0000164
5 MyalgiaYNYes           0.0748    0.152      0.491 0.623    
6 WeaknessMild           0.107     0.190      0.562 0.574    
7 WeaknessModerate       0.0972    0.191      0.510 0.610    
8 WeaknessSevere         0.335     0.212      1.57  0.116    
```
## Categorical + Runny Nose
```{r}
lm_fit3 <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(Nausea ~ RunnyNose, data = select_sympact)
tidy(lm_fit3)
```
```
# A tibble: 2 × 5
  term         estimate std.error statistic    p.value
  <chr>           <dbl>     <dbl>     <dbl>      <dbl>
1 (Intercept)   -0.658      0.145    -4.53  0.00000589
2 RunnyNoseYes   0.0605     0.172     0.353 0.724     
```
## Categorical + Everything

Abdomen pain, chest pain, insomnia, vision, and vomit
```{r}
lm_fit4 <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(Nausea ~ AbPain + ChestPain + Insomnia + Vision + Vomit,
      data = select_sympact)
tidy(lm_fit4)
```
```
# A tibble: 6 × 5
  term         estimate std.error statistic  p.value
  <chr>           <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    -1.21      0.145    -8.36  6.22e-17
2 AbPainYes       1.24      0.255     4.87  1.09e- 6
3 ChestPainYes    0.170     0.183     0.928 3.53e- 1
4 InsomniaYes     0.193     0.172     1.12  2.61e- 1
5 VisionYes       0.588     0.510     1.15  2.49e- 1
6 VomitYes        2.27      0.314     7.24  4.57e-13
```
## Comparisons
For both of our outcomes of interest, a more complex model may be advantageous, especially since the dataset contains 32 variables of interest. It is limiting to claim that one variable has stronger predictive power than several used jointly (even though the predictors here were chosen based on my own biases as the analyst).
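One way to make the simple-versus-complex comparison concrete is to put one-row fit summaries side by side. As a sketch (assuming the `lm_fit` and `lm_fit2` objects from above are still in the environment, and that `broom`'s methods for parsnip fits are available via tidymodels), `glance()` reports R-squared, AIC, and related statistics that can be compared across the two linear models:

```{r}
# One-row summaries of each linear model; lower AIC / higher adj. R-squared
# would favor that model (a rough heuristic, not a formal test)
bind_rows(
  glance(lm_fit)  %>% mutate(model = "RunnyNose only", .before = 1),
  glance(lm_fit2) %>% mutate(model = "All predictors", .before = 1)
)
```

Because the two models are not nested in a simple way, an information criterion like AIC is a more appropriate quick comparison than eyeballing individual p-values.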
One drawback of using multiple variables is that, in both the categorical and continuous models, the predictors were split between yes/no and Likert scales. For more in-depth analyses this should be handled more carefully - either by using only the yes/no variables, or by finding a way to moderate the effect of the different factor levels across predictors. Applying the models' findings with the predict() function would also highlight drawbacks of the different models, and it's great that tidymodels has this capability built in rather than requiring five extra steps with individual model packages. I hate to say that I'm not able to try it out at the moment due to this exercise's deadline.
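For reference, that prediction step would look roughly like the following sketch (assuming `select_sympact` and the fits above are available; the `.pred`, `.pred_class`, and `.pred_*` column names are tidymodels' defaults):

```{r}
# Continuous model: predict() on a parsnip fit returns a tibble with .pred
preds_cont <- predict(lm_fit2, new_data = select_sympact)

# Logistic model: default type gives a .pred_class column of Yes/No labels;
# type = "prob" gives class probabilities instead
preds_cat <- predict(lm_fit4, new_data = select_sympact)
probs_cat <- predict(lm_fit4, new_data = select_sympact, type = "prob")

# Put observed and predicted side by side for a quick look
head(bind_cols(select_sympact["Nausea"], preds_cat, probs_cat))
```

From there, yardstick functions (also part of tidymodels) could score each model on the same data, which is the systematic version of the comparison described above.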
I would love to play around with tidymodels more, as well as with KNN and Bayesian models. Exploring more of the tidymodels output, along with different model types, is super interesting to me. I think I'm beginning to understand the tidymodels framework, and I'm intrigued by the implications it has beyond regression and logistic models.
Unfortunately, due to the nature of deadlines, I haven't been able to take advantage of this exercise in its entirety, but over the next few weeks I'm very interested in learning more about machine learning and its methods through R. Over spring break I'll likely revisit this exercise to better understand tidymodels and its capabilities - both for this class and for my own research. If I discover any new insights, I can add an addendum to this code.
Thank you!