Wrangling

Read In the Data

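I had some trouble “loading” the data as described in the exercise instructions: load() failed with a “magic number X” error. That error typically means the file was written with saveRDS() rather than save(), so readRDS() is the right function to use, and it worked fine!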

library(dplyr)  # supplies %>%, select(), and filter() used below
SympAct_Any_Pos <- readRDS("../../fluanalysis/data/SympAct_Any_Pos.Rda")
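
The “magic number” error is what R reports when load() is pointed at a file written by saveRDS(). Here is a minimal sketch of that mismatch, using a throwaway temp file and a made-up data frame (toy and path are illustrative names, not part of the exercise):

```r
# Toy object and temp file; both are hypothetical stand-ins for the real data.
toy <- data.frame(id = 1:3, temp = c(98.6, 99.1, 100.4))
path <- tempfile(fileext = ".Rda")

# saveRDS() serializes a single object without storing its name.
saveRDS(toy, path)

# load() expects the save() format, so it fails on this file.
load_result <- suppressWarnings(try(load(path), silent = TRUE))
load_failed <- inherits(load_result, "try-error")

# readRDS() returns the object, which can be assigned to any name.
restored <- readRDS(path)
```

The file extension does not determine the format: a file named .Rda can hold either kind of content, so the reading function has to match the writing function.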

Clean the Data

Alrighty, now it’s time to clean this up for the exercise. The first step is to remove every variable with Score, Total, FluA, FluB, DxName, or Activity in its name, along with Unique.Visit.

select_sympact <- SympAct_Any_Pos %>%
  select(-c(contains("Score"), contains("Total"), contains("FluA"), contains("FluB"),
            contains("DxName"), contains("Activity"), Unique.Visit))

Great, this takes us from 63 variables down to 32: 31 symptom variables recording what the patient may or may not be experiencing, plus BodyTemp. Now we have a lovely dataset we can use for future analysis.
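
To see how the name-based removal behaves, here is a small sketch on a made-up data frame. The column names CoughScore and TotalSymp are invented to mimic the naming patterns; only Nausea, BodyTemp, and Unique.Visit come from the real data.

```r
library(dplyr)

# Made-up columns that mimic the naming patterns in the real data.
toy <- data.frame(
  CoughScore   = 2,
  TotalSymp    = 7,
  Nausea       = "No",
  BodyTemp     = 98.6,
  Unique.Visit = "a1"
)

# contains() matches a literal substring and ignores case by default.
kept <- toy %>%
  select(-c(contains("Score"), contains("Total"), Unique.Visit))

names(kept)
```

Because contains() is case-insensitive by default, a pattern like "Total" would also match a lowercase "total" inside a longer name. Passing ignore.case = FALSE makes the match case-sensitive if that matters.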

Since we are looking at Body Temperature and Nausea as our main outcome variables, let’s investigate them quickly and make sure they are good to go for our next exploratory and analysis steps.

select_sympact%>%
  filter(is.na(BodyTemp))
  SwollenLymphNodes ChestCongestion ChillsSweats NasalCongestion CoughYN Sneeze
1                No             Yes          Yes             Yes     Yes    Yes
2               Yes             Yes          Yes              No     Yes    Yes
3                No              No          Yes              No     Yes    Yes
4               Yes              No           No             Yes     Yes    Yes
5                No              No          Yes              No     Yes     No
  Fatigue SubjectiveFever Headache Weakness WeaknessYN CoughIntensity CoughYN2
1     Yes             Yes      Yes Moderate        Yes       Moderate      Yes
2     Yes             Yes      Yes Moderate        Yes       Moderate      Yes
3     Yes             Yes      Yes   Severe        Yes       Moderate      Yes
4     Yes             Yes      Yes     Mild        Yes           Mild      Yes
5     Yes             Yes      Yes Moderate        Yes           Mild      Yes
   Myalgia MyalgiaYN RunnyNose AbPain ChestPain Diarrhea EyePn Insomnia
1 Moderate       Yes       Yes     No        No       No    No      Yes
2   Severe       Yes       Yes    Yes       Yes       No    No      Yes
3   Severe       Yes       Yes    Yes        No       No    No       No
4     Mild       Yes       Yes     No        No       No    No      Yes
5 Moderate       Yes       Yes     No        No       No    No      Yes
  ItchyEye Nausea EarPn Hearing Pharyngitis Breathless ToothPn Vision Vomit
1       No     No    No      No          No        Yes      No     No    No
2      Yes    Yes    No      No         Yes        Yes      No     No    No
3       No    Yes    No      No         Yes         No      No     No    No
4      Yes     No    No      No         Yes         No      No     No    No
5      Yes    Yes    No      No          No        Yes     Yes     No   Yes
  Wheeze BodyTemp
1    Yes       NA
2     No       NA
3     No       NA
4     No       NA
5     No       NA
select_sympact%>%
  filter(is.na(Nausea))
 [1] SwollenLymphNodes ChestCongestion   ChillsSweats      NasalCongestion  
 [5] CoughYN           Sneeze            Fatigue           SubjectiveFever  
 [9] Headache          Weakness          WeaknessYN        CoughIntensity   
[13] CoughYN2          Myalgia           MyalgiaYN         RunnyNose        
[17] AbPain            ChestPain         Diarrhea          EyePn            
[21] Insomnia          ItchyEye          Nausea            EarPn            
[25] Hearing           Pharyngitis       Breathless        ToothPn          
[29] Vision            Vomit             Wheeze            BodyTemp         
<0 rows> (or 0-length row.names)

Something to keep in mind going forward: BodyTemp has five missing values, while Nausea has none. This is good to know as we dive into EDA and model fitting.
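
Filtering one column at a time works for two outcomes, but the same check can be run across every column at once. A minimal base R sketch on made-up values (toy is illustrative, not the real data):

```r
# Made-up data with two missing temperatures.
toy <- data.frame(
  Nausea   = c("No", "Yes", "Yes", "No"),
  BodyTemp = c(98.6, NA, 100.2, NA)
)

# is.na() gives a logical matrix; colSums() counts TRUE values per column.
na_counts <- colSums(is.na(toy))
na_counts
```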

Continued Data Cleaning for Models

We need to adjust which variables are included before we continue with model creation, so that redundant predictors do not confuse the model. Specifically, for symptoms recorded both as a severity level and as yes/no, we remove the yes/no versions.

no_YN_symptoms<-select_sympact%>%
  select(-c(CoughYN, CoughYN2, WeaknessYN, MyalgiaYN))
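
Before dropping a yes/no column, it can be worth confirming that it really carries no extra information. The sketch below cross-tabulates a severity variable against its yes/no twin on made-up data. It assumes that "None" is meant to correspond to "No" and every other level to "Yes", which is an assumption about how the data were coded, not something the exercise states.

```r
# Made-up rows; the None <-> No mapping is an assumption for illustration.
toy <- data.frame(
  Weakness   = c("None", "Mild", "Severe", "None", "Moderate"),
  WeaknessYN = c("No", "Yes", "Yes", "No", "Yes")
)

# Cross-tabulation: each severity level should fall in a single yes/no column.
xt <- table(toy$Weakness, toy$WeaknessYN)
print(xt)

# The yes/no column is redundant if it equals (severity != "None") in every row.
is_redundant <- all((toy$Weakness != "None") == (toy$WeaknessYN == "Yes"))
```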

Now that only the leveled versions are left, we need the yes/no categorical variables coded as unordered factors and the three severity variables (CoughIntensity, Myalgia, and Weakness) coded as ordered factors.

severity<-c("None", "Mild" ,"Moderate", "Severe")
no_YN_symptoms$CoughIntensity<-factor(no_YN_symptoms$CoughIntensity, levels = severity, ordered = TRUE)
no_YN_symptoms$Myalgia<-factor(no_YN_symptoms$Myalgia, levels = severity, ordered = TRUE)
no_YN_symptoms$Weakness<-factor(no_YN_symptoms$Weakness, levels = severity, ordered = TRUE)
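
As a quick illustration of what ordered = TRUE buys us, here is a sketch on a made-up vector. The misspelled value "Sever" is deliberate, to show what happens to anything not listed in levels.

```r
severity <- c("None", "Mild", "Moderate", "Severe")

# "Sever" is a deliberate typo to show what happens to unlisted values.
raw <- c("Mild", "Severe", "None", "Sever")
x <- factor(raw, levels = severity, ordered = TRUE)

# Ordered factors support comparisons that follow the level order.
below_moderate <- x < "Moderate"

# Values not found in levels are silently converted to NA.
n_unmatched <- sum(is.na(x))
```

Since unmatched values become NA without a warning, comparing the NA count before and after the conversion is a cheap way to catch spelling or capitalization differences in the levels.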

Check for unbalanced variables. We’re removing any yes/no variables with fewer than 50 “Yes” reports, as they are not expected to be helpful here.

length(which(no_YN_symptoms$Hearing == "Yes"))
[1] 30
length(which(no_YN_symptoms$Vision == "Yes"))
[1] 19
balanced_symptoms<-no_YN_symptoms%>%
 select(-c(Hearing, Vision))
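
Rather than checking columns one by one, the “Yes” counts can be computed for all yes/no columns in a single pass. This is a sketch on made-up data with a small, hypothetical threshold (min_yes); on the real data it should be applied only to the yes/no columns, since severity variables and BodyTemp never contain "Yes" and would be dropped by mistake.

```r
library(dplyr)

# Toy data; the values are made up for illustration.
toy <- data.frame(
  Hearing = c("Yes", "No", "No", "No"),
  Nausea  = c("Yes", "Yes", "No", "Yes"),
  Fatigue = c("Yes", "Yes", "Yes", "No"),
  stringsAsFactors = TRUE
)

min_yes <- 2  # hypothetical threshold; the text uses 50 on the real data

# Count "Yes" responses per column (NA values are ignored).
yes_counts <- sapply(toy, function(col) sum(col == "Yes", na.rm = TRUE))

# Names of columns with fewer than min_yes "Yes" responses.
rare <- names(yes_counts)[yes_counts < min_yes]

# Drop the rare columns by name.
balanced_toy <- toy %>% select(-all_of(rare))
```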

Save Necessary Files

save(select_sympact, balanced_symptoms, file = "../../fluanalysis/data/clean_symptoms.RData")
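
Unlike saveRDS(), save() stores objects together with their names, so the file has to be read back with load(). A minimal sketch with throwaway objects and a temp file (a, b, path, and env are illustrative names):

```r
# Throwaway objects standing in for the cleaned data frames.
a <- 1:3
b <- c("x", "y")
path <- tempfile(fileext = ".RData")

# save() stores the objects together with their names.
save(a, b, file = path)

# Loading into a fresh environment avoids overwriting workspace objects.
env <- new.env()
loaded_names <- load(path, envir = env)
```

load() returns the names of the restored objects, which is a handy way to see what a file contains before using it.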