library(dplyr)  # provides %>%, select(), and filter() used below
SympAct_Any_Pos <- readRDS("../../fluanalysis/data/SympAct_Any_Pos.Rda")
Wrangling
Read In the Data
I had some difficulty “loading” the file as described in the instructions for this exercise because of a “magic number X” error, but the readRDS function worked fine!
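One plausible explanation for the error, offered as an assumption rather than a diagnosis of this particular file: load() expects a file written by save(), while readRDS() expects a single object written by saveRDS(). The base-R sketch below uses a toy data frame and a temporary file (both hypothetical, not the flu data) to show that mismatch.

```r
# Toy stand-in for the real dataset (hypothetical values)
toy <- data.frame(BodyTemp = c(98.3, 100.4), Nausea = c("No", "Yes"))

# Write a single object with saveRDS() to a temporary file
rds_path <- tempfile(fileext = ".Rda")
saveRDS(toy, rds_path)

# readRDS() returns the object, so we choose the name it is assigned to
toy_back <- readRDS(rds_path)

# load() on the same file fails, because the file was not written by save()
load_result <- tryCatch(
  {
    suppressWarnings(load(rds_path, envir = new.env()))
    "loaded"
  },
  error = function(e) "load() failed"
)
print(load_result)
```

The file extension does not decide which reader works; what matters is which function wrote the file.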
Clean the Data
Alrighty, now it’s time to clean this up for the exercise. The first step is to remove all variables that have Score, Total, FluA, FluB, DxName, or Activity in their names, along with Unique.Visit.
select_sympact <- SympAct_Any_Pos %>%
  select(-c(contains("Score"), contains("Total"), contains("FluA"), contains("FluB"), contains("DxName"), contains("Activity"), Unique.Visit))
Great, this takes us from 63 variables down to 32, each describing a symptom or measurement the patient may or may not be experiencing. Now we have a lovely dataset we can use for future analysis.
Since we are looking at Body Temperature and Nausea as our main outcome variables, let’s investigate them quickly and make sure they are good to go for our next exploratory and analysis steps.
select_sympact %>%
  filter(is.na(BodyTemp))
SwollenLymphNodes ChestCongestion ChillsSweats NasalCongestion CoughYN Sneeze
1 No Yes Yes Yes Yes Yes
2 Yes Yes Yes No Yes Yes
3 No No Yes No Yes Yes
4 Yes No No Yes Yes Yes
5 No No Yes No Yes No
Fatigue SubjectiveFever Headache Weakness WeaknessYN CoughIntensity CoughYN2
1 Yes Yes Yes Moderate Yes Moderate Yes
2 Yes Yes Yes Moderate Yes Moderate Yes
3 Yes Yes Yes Severe Yes Moderate Yes
4 Yes Yes Yes Mild Yes Mild Yes
5 Yes Yes Yes Moderate Yes Mild Yes
Myalgia MyalgiaYN RunnyNose AbPain ChestPain Diarrhea EyePn Insomnia
1 Moderate Yes Yes No No No No Yes
2 Severe Yes Yes Yes Yes No No Yes
3 Severe Yes Yes Yes No No No No
4 Mild Yes Yes No No No No Yes
5 Moderate Yes Yes No No No No Yes
ItchyEye Nausea EarPn Hearing Pharyngitis Breathless ToothPn Vision Vomit
1 No No No No No Yes No No No
2 Yes Yes No No Yes Yes No No No
3 No Yes No No Yes No No No No
4 Yes No No No Yes No No No No
5 Yes Yes No No No Yes Yes No Yes
Wheeze BodyTemp
1 Yes NA
2 No NA
3 No NA
4 No NA
5 No NA
select_sympact %>%
  filter(is.na(Nausea))
[1] SwollenLymphNodes ChestCongestion ChillsSweats NasalCongestion
[5] CoughYN Sneeze Fatigue SubjectiveFever
[9] Headache Weakness WeaknessYN CoughIntensity
[13] CoughYN2 Myalgia MyalgiaYN RunnyNose
[17] AbPain ChestPain Diarrhea EyePn
[21] Insomnia ItchyEye Nausea EarPn
[25] Hearing Pharyngitis Breathless ToothPn
[29] Vision Vomit Wheeze BodyTemp
<0 rows> (or 0-length row.names)
Something to keep in mind going forward: we have five missing values for BodyTemp but none for Nausea. This is good to know as we dive into EDA and model fitting.
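The same check can be run across every column at once. This is a minimal base-R sketch on a toy data frame (the values and the name toy are hypothetical, standing in for select_sympact):

```r
# Toy data frame standing in for select_sympact (hypothetical values)
toy <- data.frame(
  BodyTemp = c(98.3, NA, 100.4, NA),
  Nausea   = c("No", "Yes", "Yes", "No")
)

# is.na() gives a logical matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only the columns that actually contain missing values
print(na_counts[na_counts > 0])
```

Applied to the real data, a summary like this would surface missing values in any other variable without filtering each one by hand.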
Continued Data Cleaning for Models
We need to adjust the variables included as we continue with model creation, so that redundant predictors do not confuse the model. Specifically, for symptoms recorded both with multiple severity levels and as yes/no, we remove the yes/no versions.
no_YN_symptoms <- select_sympact %>%
  select(-c(CoughYN, CoughYN2, WeaknessYN, MyalgiaYN))
Now that only the leveled versions are left, we need to code the yes/no categorical variables as unordered factors and the severity variables (CoughIntensity, Myalgia, and Weakness) as ordered factors.
severity <- c("None", "Mild", "Moderate", "Severe")
no_YN_symptoms$CoughIntensity <- factor(no_YN_symptoms$CoughIntensity, levels = severity, ordered = TRUE)
no_YN_symptoms$Myalgia <- factor(no_YN_symptoms$Myalgia, levels = severity, ordered = TRUE)
no_YN_symptoms$Weakness <- factor(no_YN_symptoms$Weakness, levels = severity, ordered = TRUE)
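To see what ordered = TRUE changes, here is a small base-R sketch on a made-up vector of responses (cough and its values are hypothetical; only severity comes from the code above):

```r
severity <- c("None", "Mild", "Moderate", "Severe")

# Hypothetical responses, deliberately out of order
cough <- c("Severe", "None", "Mild", "Moderate", "Mild")

ordered_cough <- factor(cough, levels = severity, ordered = TRUE)
plain_cough   <- factor(cough)  # default: unordered, levels sorted alphabetically

print(levels(plain_cough))    # alphabetical order
print(levels(ordered_cough))  # follows `severity`

# Comparisons such as >= are only meaningful for ordered factors
at_least_moderate <- ordered_cough >= "Moderate"
print(sum(at_least_moderate))
```

One thing to watch for: any value not listed in levels becomes NA, so a spelling mismatch in the data would silently create missing values.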
Check for unbalanced variables. We’re removing any Yes/No variables with fewer than 50 occurrences of “Yes,” as they are not expected to be helpful here.
length(which(no_YN_symptoms$Hearing == "Yes"))
[1] 30
length(which(no_YN_symptoms$Vision == "Yes"))
[1] 19
balanced_symptoms <- no_YN_symptoms %>%
  select(-c(Hearing, Vision))
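Rather than checking variables one at a time, the “Yes” counts can be tabulated for every column in one pass. The sketch below is base R on a toy data frame; toy, min_yes, and the values are hypothetical, and the threshold is lowered to 2 so the small example has something to drop.

```r
# Toy data frame standing in for no_YN_symptoms (hypothetical values)
toy <- data.frame(
  Hearing  = c("No", "No", "Yes", "No"),
  Vision   = c("No", "No", "No", "No"),
  Fatigue  = c("Yes", "Yes", "Yes", "No"),
  BodyTemp = c(98.3, 99.1, 100.4, 98.9)
)

min_yes <- 2  # hypothetical threshold; the text above uses 50

# Count "Yes" per column; na.rm = TRUE ignores missing values
yes_counts <- sapply(toy, function(col) sum(col == "Yes", na.rm = TRUE))

# Yes/No columns are those whose non-missing values are all "Yes" or "No"
is_yes_no <- sapply(toy, function(col) {
  all(na.omit(as.character(col)) %in% c("Yes", "No"))
})

# Flag Yes/No columns with too few "Yes" reports, then drop them
sparse <- names(toy)[is_yes_no & yes_counts < min_yes]
balanced_toy <- toy[, !(names(toy) %in% sparse), drop = FALSE]

print(sparse)
print(names(balanced_toy))
```

The is_yes_no guard keeps numeric columns such as BodyTemp and the severity factors from being dropped just because they never contain “Yes.”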
Save Necessary Files
save(select_sympact, balanced_symptoms, file = "../../fluanalysis/data/clean_symptoms.RData")
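When the file is read back later, load() restores the objects under the names they were saved with. A minimal base-R sketch with two toy objects and a temporary path (all hypothetical stand-ins for the objects and file above):

```r
# Two toy objects standing in for select_sympact and balanced_symptoms
first_toy  <- data.frame(x = 1:3)
second_toy <- data.frame(y = c("a", "b"))

rdata_path <- tempfile(fileext = ".RData")
save(first_toy, second_toy, file = rdata_path)

# Load into a fresh environment so nothing in the workspace is overwritten
restored <- new.env()
loaded_names <- load(rdata_path, envir = restored)

print(loaded_names)                              # names of the restored objects
print(identical(restored$first_toy, first_toy))  # round trip preserved the data
```

Loading into a separate environment is optional, but it makes it obvious which objects came from the file.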