Just this year, the World Health Organization reported that 264 million people worldwide are affected by depression. Mental illness is a major issue in our society today. Out of curiosity, I decided to explore the relationship between personality and the likelihood of experiencing symptoms of depression. Through this project, I wanted to understand which dimensions of personality are most indicative of depression, in order to better understand what can be done to reduce cases of depression.
The DASS (Depression, Anxiety, and Stress Scale) is a questionnaire made up of 42 self-report items which assess severity levels of symptoms associated with depression, anxiety, and stress. The 42 items are divided into 3 subgroups of 14 questions, with each subgroup assessing a specific condition (depression, anxiety, stress), and each item contains a 4-point Likert scale with which a participant will rate how much they agree with a particular statement. The DASS is a widely used research metric as it is recognized for its reliability in assessing each of these negative conditions.
The TIPI (Ten Item Personality Inventory) is a 10-item measure of the Big 5 Personality Traits (Openness to Experience, Conscientiousness, Extraversion, Agreeableness, and Neuroticism). Each item uses a 7-point Likert scale on which participants select how much they agree with a statement about their personality, with 1 being “Disagree Strongly” and 7 being “Agree Strongly.” The TIPI is intended only as a quick evaluation of personality, and testing has shown it to provide moderate reliability.
The dataset used is an online collection of 39,775 responses to the DASS found on openpsychometrics.org. In addition to responses to the DASS, participants were also given the option of answering the TIPI as well as other demographic questions such as age, religion, ethnicity, sex, etc. Note that in my dataset, participants answering the questionnaire could opt to not respond to the TIPI items, in which case there would be a 0 assigned to that item in the dataset.
I removed all columns except those pertaining to the TIPI and DASS responses because my research focus centered on the relationship between personality and depression. Summing up the 14 responses referring to questions in the DASS assessing severity of depressive symptoms, I calculated a “depression score” for each participant.
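As a minimal sketch of this step, the snippet below sums the 14 depression items and keeps only the TIPI columns. The column names ("Q3A", "TIPI1", and so on) are assumptions about how the dataset labels its answers, the item numbers follow the standard DASS-42 depression subscale, and the two-row table is made up purely for illustration.

```python
import pandas as pd

# Item numbers of the DASS-42 depression subscale (standard scoring key).
DEPRESSION_ITEMS = [3, 5, 10, 13, 16, 17, 21, 24, 26, 31, 34, 37, 38, 42]

# Assumed column naming: "Q<n>A" for DASS answers, "TIPI<n>" for TIPI answers.
DEPRESSION_COLS = [f"Q{i}A" for i in DEPRESSION_ITEMS]
TIPI_COLS = [f"TIPI{i}" for i in range(1, 11)]


def add_depression_score(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the TIPI columns and add the summed depression score."""
    out = df[TIPI_COLS].copy()
    out["depression_score"] = df[DEPRESSION_COLS].sum(axis=1)
    return out


# Made-up data: two respondents who gave the same answer to every item.
toy = pd.DataFrame(
    {
        **{f"Q{i}A": [1, 4] for i in range(1, 43)},
        **{f"TIPI{i}": [3, 5] for i in range(1, 11)},
    }
)
scored = add_depression_score(toy)
print(scored["depression_score"].tolist())
```

If the answers are coded 1 through 4, the summed score ranges from 14 to 56; a different coding would shift that range.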
One-Hot Encoding is a data science technique used to convert categorical variables to numerical inputs acceptable for various machine learning algorithms. Through a one-hot encoding, categorical variables are each converted into a vector of 0's and 1's, with the location of the 1 within the vector corresponding to each unique value in the specified column. I utilized the sklearn OneHotEncoder to encode each of my TIPI responses.
Dividing a dataset into two separate datasets, a training and a testing set, is a crucial step of the data science cycle. The training set is used for developing a model, and the testing set is used to evaluate the accuracy and overall success of the constructed model. The testing set essentially assesses if the model is able to work effectively on new data. I used sklearn to split my dataset into training and testing sets with an 80:20 ratio.
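The 80:20 split can be sketched as follows. The arrays are random stand-ins for the real features and scores, and the random_state value is an arbitrary choice made only so the example is reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(1, 8, size=(100, 10))  # stand-in for 10 TIPI answers (1-7)
y = rng.integers(14, 57, size=100)      # stand-in for depression scores

# test_size=0.2 holds out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```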
To better understand my dataset, I generated visualizations that graph, for each item on the TIPI, the mean depression score at each response value.
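The numbers behind such a plot can be computed with a group-by, as in the sketch below. The six rows are made up, and the helper name mean_score_by_response and the column name depression_score are hypothetical.

```python
import pandas as pd

# Made-up data: answers to two TIPI items plus a depression score.
df = pd.DataFrame(
    {
        "TIPI4": [1, 1, 4, 4, 7, 7],
        "TIPI9": [7, 7, 4, 4, 1, 1],
        "depression_score": [16, 20, 28, 32, 44, 48],
    }
)


def mean_score_by_response(df, item_cols, target="depression_score"):
    """Map each item to a Series of mean target values, indexed by response."""
    return {col: df.groupby(col)[target].mean() for col in item_cols}


means = mean_score_by_response(df, ["TIPI4", "TIPI9"])
print(means["TIPI4"])
```

Each resulting Series (response value on the index, mean score as the value) can then be drawn as one line of the chart.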
This visualization offers several important insights. First, the graph shows that the mean depression score for each item follows a relatively linear trend across response values, which suggests that a Linear Regression model may be appropriate. Second, the graph reveals which questions are more discriminating for predicting depression score: the lines with steeper slopes should correspond to questions that are more important for predictive purposes. Based on the graph, ‘TIPI4’ and ‘TIPI9’ both appear to have steeper slopes than the other items of the TIPI. Both items are associated with the ‘Neuroticism’ dimension of the Big 5 Personality Traits, which suggests that ‘Neuroticism’ may be highly correlated with depression score.
I utilized a Random Forest Regressor, built from the sklearn library, to assess feature importance. The Random Forest Regressor was able to find the weight of each TIPI response in predicting my dependent variable, the “depression score”:
I summed up the feature importances of the questions corresponding to each Big 5 Personality trait and sorted them in descending order:
Very clearly, neuroticism, associated with TIPI4 and TIPI9, is the leading trait indicative of more severe depressive symptoms. Neuroticism does not have a clear-cut definition, though it can be described as emotional instability and a long-term occupation of a negative state. Based on this generalized definition, it seems fairly intuitive that this would be the personality trait most strongly correlated with the severity of depressive symptoms.
The second personality trait which stands out is Extraversion, which relates to the capacity to derive energy from others. People who are more extraverted tend to be more outgoing and enjoy large social gatherings. These behaviors also provide an intuitive explanation for why Extraversion has greater importance: humans are social creatures. A lot of research has been conducted on happiness, and many studies have concluded that social connection and strong relationships are important in fostering happiness for both introverts and extraverts. As more extraverted people are generally more outgoing, they are able to form more social connections, which may explain a negative correlation between ‘Extraversion’ and depression score.
The Random Forest Regressor ultimately achieved an r2 value of 0.28 on the testing data, meaning that the model explained only about 28% of the variance in depression score, with a root mean squared error of 10.33. In addition, the model's r2 value on the training set was 0.86, indicating that it may have overfit. I decided to look at other models to improve these metrics.
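For reference, both metrics can be computed directly from their definitions. The sketch below uses four made-up scores and predictions, and the helper name r2_and_rmse is hypothetical.

```python
import numpy as np


def r2_and_rmse(y_true, y_pred):
    """Return (r2, rmse) computed from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual_ss = np.sum((y_true - y_pred) ** 2)       # unexplained variation
    total_ss = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    r2 = 1.0 - residual_ss / total_ss
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return r2, rmse


# Made-up scores and predictions.
r2, rmse = r2_and_rmse([20, 30, 40, 50], [25, 30, 40, 45])
print(round(r2, 3), round(rmse, 3))
```

An r2 of 0.28 therefore means the residual sum of squares is still 72% of the total, and the RMSE is expressed in the same units as the depression score itself.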
Based on my Exploratory Data Analysis, the trends in the data appeared rather linear, so Linear Regression seemed to be an appropriate model to use. The r2 values of the Linear Regression model on the training and testing sets are about the same, showing that there is not much overfitting happening. In addition, the model's r2 value on the test set is much higher than that of the Random Forest Regressor, and its RMSE is lower. The Linear Regression performed much better than the Random Forest Regressor. Though there are many arguments surrounding a ‘satisfactory’ r2 value for social science research, an r2 value of 0.1 is generally accepted as adequate (Falk and Miller, 1992). In light of this baseline, my Linear Regression model appears to explain the data relatively well.
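The train-versus-test comparison described here can be sketched as follows. The data is synthetic, the model is fit on raw numeric answers rather than one-hot columns to keep the example short, and the helper name report is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 10 answers on a 1-7 scale and a noisy linear target.
rng = np.random.default_rng(0)
X = rng.integers(1, 8, size=(1000, 10)).astype(float)
y = 35 + 2.0 * X[:, 3] - 2.0 * X[:, 8] + rng.normal(0, 8, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = LinearRegression().fit(X_train, y_train)


def report(model, X, y):
    """Return (r2, rmse) of the model's predictions on the given data."""
    pred = model.predict(X)
    return r2_score(y, pred), float(np.sqrt(mean_squared_error(y, pred)))


train_r2, train_rmse = report(model, X_train, y_train)
test_r2, test_rmse = report(model, X_test, y_test)
print(f"train r2={train_r2:.2f} rmse={train_rmse:.2f}")
print(f"test  r2={test_r2:.2f} rmse={test_rmse:.2f}")
```

A small gap between the training and testing r2 values is the sign of limited overfitting; a large gap, like the 0.86 versus 0.28 seen with the Random Forest, points the other way.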
For learning purposes, I created an interactive widget that lets users answer each question of the TIPI with integer sliders, while a progress bar displays the depression score predicted by my Linear Regression model. After clicking on this link, the widget can be accessed by opening “Widget.ipynb” and running all the cells of the Jupyter Notebook.
The results of my research were definitely very interesting, and personally insightful. Through the use of a Random Forest Regressor, I found that ‘Neuroticism’ and ‘Extraversion’ played significant roles in predicting the severity of depressive symptoms, with greater ‘Neuroticism’ predicting greater severity and greater ‘Extraversion’ predicting lesser severity. In addition, I developed a Linear Regression model with an r2 value of 0.37 indicating that personality plays a significant role in predicting an individual’s depression score.
My research had several limitations, however. First, the TIPI is meant to be used as a quick and easy assessment of personality with moderate reliability. It was not meant to be used for rigorous data analysis. For future work, it would be interesting to explore personality assessments deemed more reliable and to see whether that data could improve my regression model. In addition, social science research in general is very complicated given the variation and unpredictability of human behavior. Although my Linear Regression model offered a relatively high r2 value, it could not be used in a clinical setting. Depression is a very intricate issue that depends on many different factors. In my project I tried to reduce those factors to personality, but clearly there is much more that comes into play. Other factors I may decide to look into include age, education level, sexuality, and country.
Ultimately, I pursued this project out of curiosity, and I definitely learned a lot about social science research and depression. I believe that more work can be done to understand which practices can minimize ‘Neurotic’ tendencies or help individuals connect with others and strengthen ‘Extraverted’ tendencies. However, at the end of the day, I hope my project can start conversations and contribute to destigmatizing depression in our society.
“Depression.” World Health Organization, World Health Organization, 30 Jan. 2020, www.who.int/news-room/fact-sheets/detail/depression.
Falk, R. Frank., and Nancy B. Miller. A Primer for Soft Modeling. University of Akron Press, 1992.
Spring 2020
Steven Chen