Predicting Depression: Analyzing Demographic and Lifestyle Factors using Logistic Regression

3 min readMay 8, 2024

We aim to analyze depression status based on demographic and lifestyle factors using the NHANES dataset. We start by preprocessing the data, creating a binary outcome variable for depression status, fitting a logistic regression model, and evaluating its performance.

Part A: Data Preprocessing

We install and load the NHANES package to access the dataset. Subsequently, we subset the NHANES dataset to include individuals aged 30 and above, as this age group is often associated with higher rates of depression. We select specific variables relevant to our analysis, such as depression status, gender, age, education level, socioeconomic status indicators, and lifestyle factors. Any rows containing missing values are removed from the dataset to ensure data quality.

install.packages("NHANES")
library(NHANES)

my.data.2 <- subset(NHANES, Age >= 30)
my.data.2 <- select(my.data.2, Depressed, Gender, Age, Education, Poverty, HomeOwn, Work,  PhysActive, SmokeNow, SleepTrouble)
my.data.2 <- na.omit(my.data.2)

Part B: Model Fitting and Evaluation

We create a binary outcome variable, Depressed_Yes, based on the Depressed variable, where individuals classified as ‘Several’ or ‘Most’ depressed are assigned a value of 1, while others are assigned 0.

my.data.2$Depressed_Yes <- ifelse(my.data.2$Depressed == 'Several' | my.data.2$Depressed == 'Most', 1, 0)

A logistic regression model (model.2b) is fitted to predict Depressed_Yes using selected predictors, including gender, age, education, poverty status, homeownership, employment status, physical activity, smoking habits, and sleep quality.

model.2b <- glm(Depressed_Yes ~ Gender + Age + Education + Poverty + HomeOwn + Work + PhysActive + SmokeNow + SleepTrouble,
                data = my.data.2)

We summarize the model using the summary function to examine the coefficients and statistical significance of predictors.

summary(model.2b)

The model’s predictive performance is evaluated by calculating the predicted probabilities of depression (Depressed_Yes_prob) and classifying individuals as depressed or not based on a probability threshold of 0.5.

my.data.2$Depressed_Yes_prob <- predict(model.2b, type = "response")
my.data.2$Depressed_Yes_hat <- ifelse(my.data.2$Depressed_Yes_prob > 0.5, 1, 0)

Finally, we compute the accuracy of our classification model by comparing predicted depression status to the actual values.

accuracy <- mean(my.data.2$Depressed_Yes == my.data.2$Depressed_Yes_hat)
print(accuracy)

Conclusion

This article demonstrates the process of building and evaluating a logistic regression model to predict depression status based on demographic and lifestyle factors. By preprocessing the NHANES dataset and fitting the model, we gain insights into the relationship between various predictors and depression. Additionally, evaluating the model’s accuracy allows us to assess its effectiveness in identifying individuals at risk of depression.

Predicting Depression: Analyzing Demographic and Lifestyle Factors using Logistic Regression

Part A: Data Preprocessing

Part B: Model Fitting and Evaluation

Conclusion

Sign up to discover human stories that deepen your understanding of the world.

Free

Membership

Written by Idrissa Tankari

No responses yet