We start by loading the libraries used throughout this project.
import warnings # suppress warning messages
warnings.filterwarnings("ignore")
# matplotlib for plotting
import matplotlib.pyplot as plt
%matplotlib inline
# Set matplotlib sizes
plt.rc('font', size=20)
plt.rc('axes', titlesize=20)
plt.rc('axes', labelsize=15)
plt.rc('xtick', labelsize=10)
plt.rc('ytick', labelsize=10)
plt.rc('legend', fontsize=15)
plt.rc('figure', titlesize=20)
# sklearn metrics
import sklearn.metrics as metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
# Load some libraries
import pandas as pd
from pandas import DataFrame
import numpy as np
from numpy import array
import re
import os
from nltk.tokenize import RegexpTokenizer
from string import punctuation
from nltk.tokenize import sent_tokenize, word_tokenize
# keras packages
import keras
from keras import backend as K
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D
from keras.layers import GlobalAveragePooling1D
from keras.layers import LSTM
from keras.layers import GRU
from keras.layers import Dropout
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from keras.preprocessing.sequence import pad_sequences
# Make directory to save the models
directory = os.path.dirname('../model/')
if not os.path.exists(directory):
    os.makedirs(directory)
# Make directory to save the figures
directory = os.path.dirname('../figure/')
if not os.path.exists(directory):
    os.makedirs(directory)
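On Python 3, the same can be done in a single call with os.makedirs(directory, exist_ok=True), which skips the explicit existence check.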
# load the scraped and labeled data
df = pd.read_csv("../data/data.csv")
df.head() # print first 5 rows of df
Before building the models, we need to preprocess the text of the Reddit comments. All comments will be converted to lowercase. Irrelevant content such as subreddit names, HTML links, numbers, extra punctuation, and whitespace will also be removed. Sentences will not be trimmed for length, and stop words will be kept, since both affect the sentiment and semantics of the comments.
def clean_text(document):
    """
    The clean_text function preprocesses the text of a document/comment
    Parameters
    ----------
    document: the raw text
    Returns
    ----------
    tokens: a list of preprocessed tokens
    """
    document = ' '.join([word.lower() for word in word_tokenize(document)]) # lowercase the text
    tokens = word_tokenize(document) # tokenize the document
    for i in range(0, len(tokens)):
        # remove whitespace
        tokens[i] = tokens[i].strip()
        # remove html links
        tokens[i] = re.sub(r'\S*http\S*', '', tokens[i]) # remove links with http
        tokens[i] = re.sub(r'\S*\.org\S*', '', tokens[i]) # remove links with .org
        tokens[i] = re.sub(r'\S*\.com\S*', '', tokens[i]) # remove links with .com
        # remove subreddit titles (e.g. /r/food)
        tokens[i] = re.sub(r'\S*\/r\/\S*', '', tokens[i])
        # remove non-alphabetic characters
        tokens[i] = re.sub("[^a-zA-Z]+", "", tokens[i])
        tokens[i] = tokens[i].strip() # remove whitespace
    # remove all empty strings from the list
    while "" in tokens:
        tokens.remove("")
    return tokens
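As a quick sanity check, we can run clean_text on a made-up comment (the example sentence below is hypothetical, not from the dataset):
# hypothetical example comment, not taken from the dataset
example = "I posted on /r/depression yesterday. See https://example.com for 100 tips!"
print(clean_text(example))
# expected output is roughly: ['i', 'posted', 'on', 'yesterday', 'see', 'for', 'tips']
# (the exact tokens depend on how word_tokenize splits the URL)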
# call clean_text on df
# for each row in df
for i in range(0, len(df)):
    # use clean_text on the document/text stored in the content column
    clean = clean_text(df.loc[i, "content"])
    # join the tokens back together with whitespace
    df.loc[i, "clean_content"] = ' '.join(clean)
df = df.dropna() # remove null data due to some deleted comments
df = df[df["clean_content"] != ''] # remove blank comments
df.to_csv('../data/preprocessed_data.csv', index=False) # save the preprocessed data
# select label and clean_content columns
df = df[["label","clean_content"]]
df.reset_index(drop=True,inplace=True) # reset the df index
# print the number of subjects that are diagnosed with depression
print("The number of subjects with depression is:", df[df["label"]==1].shape[0])
# print the number of subjects that are not diagnosed with depression
print("The number of subjects without depression is:", df[df["label"]==0].shape[0])
The ratio of depression to non-depression subjects is approximately 50:50, which shows that our dataset is balanced for modeling.
Before implementing word embedding models, we need to create a vocabulary of the unique words in the Reddit comments. Words that appear only once will be removed to tidy the data, since they contribute little to the prediction.
from collections import Counter
words = [] # create a list to store words in the data
for i in range(0, len(df)):
    tokens = df.loc[i, "clean_content"].split() # split on whitespace since the text is already preprocessed
    for token in tokens:
        words.append(token) # append the word to the words list
vocab = Counter(words) # count the occurrence of each word
tokens = [k for k, c in vocab.items() if c >= 2] # keep only words that appear at least twice
vocab = set(tokens) # the vocabulary of unique words, stored as a set for fast membership checks
# The vocabulary size is the total number of words in our vocabulary, plus one for unknown words
vocab_size = len(vocab) + 1
vocab_size
for i in range(0, len(df)): # remove the words that are not in the vocab
    tokens = df.loc[i, "clean_content"].split() # split on whitespace
    # keep only the words in the vocab and re-join them with whitespace
    df.loc[i, "clean_content"] = ' '.join([token for token in tokens if token in vocab])
from sklearn.model_selection import train_test_split
# Divide the data into training (70%) and testing (30%)
df_train_valid, df_test = train_test_split(df, train_size=0.7, random_state=42, stratify=df["label"])
# Divide the training data into training (70%) and validation (30%)
df_train, df_valid = train_test_split(df_train_valid, train_size=0.7, random_state=42, stratify=df_train_valid["label"])
df_train.reset_index(drop=True,inplace=True) # reset index
df_valid.reset_index(drop=True,inplace=True)
df_test.reset_index(drop=True,inplace=True)
# print training data dimensions
df_train.shape
# print validation data dimensions
df_valid.shape
# print testing data dimensions
df_test.shape
train_docs = [] # a list to store comments from the training set
valid_docs = [] # a list to store comments from the validation set
test_docs = [] # a list to store comments from the testing set
for i in range(0, len(df_train)):
    text = df_train.loc[i, "clean_content"] # select the comment in each row
    train_docs.append(text) # append each comment to the list of documents
for i in range(0, len(df_valid)):
    text = df_valid.loc[i, "clean_content"]
    valid_docs.append(text)
for i in range(0, len(df_test)):
    text = df_test.loc[i, "clean_content"]
    test_docs.append(text)
docs = train_docs + valid_docs + test_docs
# get the max-length of the documents
max_length = max([len(document.split()) for document in docs])
# print max_length
max_length
# create the tokenizer
tokenizer = Tokenizer()
# fit the tokenizer on the documents
tokenizer.fit_on_texts(docs)
# get the vocabulary of words after fitting the Tokenizer()
# it contains the same unique words (and has the same length) as the vocab above, but ordered by word frequency
tokenizer_vocab = tokenizer.word_index
def defineXY(df, docs):
    """
    defineXY converts the texts into numeric vectors using the Tokenizer class in the Keras API
    Parameters
    ----------
    df: the dataframe containing the labels
    docs: the list of preprocessed comments
    Returns
    ----------
    X_label: an array of vectorized comments
    y_label: an array of target labels
    """
    # convert comments into sequences of integer word indices
    encoded_docs = tokenizer.texts_to_sequences(docs)
    # pad each vectorized comment to max_length so that they all have the same length
    # since a recurrent network predicts the next element from the outputs of previous ones,
    # we pad at the front (padding='pre') so the RNN does not end on a run of zeros,
    # which would otherwise hurt its predictions
    X_label = pad_sequences(encoded_docs, maxlen=max_length, padding='pre')
    # save the label for each comment into an array
    df_label = df["label"]
    y_label = array([df_label[i] for i in range(0, len(df_label))]) # save target labels into an array
    return X_label, y_label
# call defineXY on the training, validation and testing data
Xtrain, ytrain = defineXY(df_train, train_docs)
Xvalid, yvalid = defineXY(df_valid, valid_docs)
Xtest, ytest = defineXY(df_test, test_docs)
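To make the padding behaviour concrete, here is a tiny illustration (the toy sequences are made up, not from the data):
# hypothetical toy sequences illustrating 'pre' vs 'post' padding
toy = [[3, 7, 2], [5]]
print(pad_sequences(toy, maxlen=4, padding='pre'))   # zeros added at the front: [[0 3 7 2], [0 0 0 5]]
print(pad_sequences(toy, maxlen=4, padding='post'))  # zeros added at the end:   [[3 7 2 0], [5 0 0 0]]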
We will train a Word2Vec model on our corpus and use it as a pre-trained embedding layer.
sentences = [] # create a list to store sentences of tokens
for item in docs: # for each comment in docs
    tokens = item.split() # split the comment into tokens
    sentences.append(tokens) # append the list of tokens to sentences
from gensim.models import Word2Vec
# train a Word2Vec model on the sentences
# the embedding dimension is set to 100, the output dimension used throughout this project
# only words that appear at least 2 times are counted, matching the vocabulary filter above
model_word2vec = Word2Vec(sentences, size=100, window=5, workers=8, min_count=2)
print(model_word2vec) # check the Word2Vec model details
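As an optional sanity check on the learned embeddings (this step is not part of the original pipeline, and it assumes the word 'depressed' survived preprocessing and the min_count filter), gensim can list its nearest neighbours:
# hypothetical check: print the 5 most similar words to 'depressed', if it is in the vocabulary
if 'depressed' in model_word2vec.wv.vocab:
    print(model_word2vec.wv.most_similar('depressed', topn=5))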
# the length of the vocabulary when using Tokenizer in Keras API
print('The number of words in vocabulary using Keras embedding layer is:', len(tokenizer_vocab))
# checking the length of the vocab when using Word2Vec
print('The number of words in vocabulary using Word2Vec is:', len(model_word2vec.wv.vocab))
The vocabulary lengths from the Word2Vec model and the Keras Tokenizer are the same, so there are no mistakes in our word embedding process.
We will save the Word2Vec embedding model to a file. This saves time, since we do not have to re-train the Word2Vec model whenever we make changes to the rest of the pipeline.
from os import listdir
# save model in ASCII (word2vec) format
filename = '../word2vec_embedding/embedding_word2vec.txt'
model_word2vec.wv.save_word2vec_format(filename, binary=False)
def load_embedding(filename):
    """
    load_embedding loads the saved Word2Vec embedding model
    Parameters
    ----------
    filename: the name of the embedding file
    Returns
    ----------
    embedding: a dictionary mapping words to their vectors as numpy arrays
    """
    # load the embedding into memory, skipping the first line (the header)
    file = open(filename, 'r')
    lines = file.readlines()[1:]
    file.close()
    # create a map of words to vectors
    embedding = dict()
    for line in lines:
        parts = line.split()
        # key is the string word, value is the numpy array for the vector
        embedding[parts[0]] = np.asarray(parts[1:], dtype='float32')
    return embedding
raw_embedding = load_embedding('../word2vec_embedding/embedding_word2vec.txt')
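As an aside (not part of the original pipeline), gensim can load the same text-format file directly, which would be an alternative to the helper above:
# alternative loading sketch using gensim's KeyedVectors, equivalent to load_embedding above
from gensim.models import KeyedVectors
wv_check = KeyedVectors.load_word2vec_format('../word2vec_embedding/embedding_word2vec.txt', binary=False)
print(len(wv_check.vocab))  # should match the Word2Vec vocabulary size printed above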
We need to map the words and vectors from raw_embedding onto the word indices in tokenizer_vocab so that the embedding weights are in the correct order.
# create a weight matrix for the Embedding layer from a loaded embedding
def get_weight_matrix(embedding, vocab_size):
    """
    get_weight_matrix re-arranges the raw_embedding dictionary into the order given by the tokenizer_vocab
    Parameters
    ----------
    embedding: the embedding dictionary
    vocab_size: the length of the vocabulary plus 1 for unknown words
    Returns
    ----------
    weight_matrix: the array of word vectors in the correct order
    """
    # define the weight matrix dimensions, filled with zeros
    weight_matrix = np.zeros((vocab_size, 100))
    # step through the vocab, storing vectors using the Tokenizer's integer mapping
    for word, i in tokenizer_vocab.items(): # for each word and its index i in the tokenizer_vocab
        if i >= vocab_size: # skip indices outside the weight matrix
            continue
        vector = embedding.get(word) # look up the word in the embedding dictionary
        if vector is not None: # words not found keep a zero vector
            weight_matrix[i] = vector # store the vector of the word at row i
    return weight_matrix
# get vectors in the right order
embedding_vectors = get_weight_matrix(raw_embedding, vocab_size)
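A quick shape check (an extra step, not in the original code) confirms the matrix lines up with the Embedding layer we are about to build:
# the weight matrix should have one 100-dimensional row per vocabulary index (including index 0 for unknown words)
print(embedding_vectors.shape)  # expected: (vocab_size, 100)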
In this part, we will build two kinds of neural network models: Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN).
from keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau
# ModelCheckpoint callback
model_checkpoint_cb_CNN_keras = ModelCheckpoint( # for CNN using Keras API Tokenizer
filepath = "../model/CNN_keras.h5",
save_best_only=True)
model_checkpoint_cb_RNN_keras = ModelCheckpoint( # for RNN using Keras API Tokenizer
filepath = "../model/RNN_keras.h5",
save_best_only=True)
model_checkpoint_cb_CNN_word2vec = ModelCheckpoint( # for CNN using word2vec
filepath = "../model/CNN_word2vec.h5",
save_best_only=True)
model_checkpoint_cb_RNN_word2vec = ModelCheckpoint( # for RNN using word2vec
filepath = "../model/RNN_word2vec.h5",
save_best_only=True)
# EarlyStopping callback
early_stopping_cb = EarlyStopping(
patience=5, # stopping after 5 epochs without improvement
restore_best_weights=True)
# ReduceLROnPlateau callback
reduce_lr_on_plateau_cb = ReduceLROnPlateau(
verbose = 1,
factor=0.1, # reducing the learning rate by 10 times
patience=2) # after 2 epochs without improvement in validation loss
We will build a simple CNN architecture using a Conv1D layer. The GlobalAveragePooling1D layer is used as an alternative to the Flatten - Fully Connected (FC) - Dropout paradigm. In future work, we could add more convolutional layers to obtain better results.
# create a sequential model
model = Sequential()
# add the embedding layer
model.add(Embedding(vocab_size, 100, input_length=max_length))
# add the convolutional layer
model.add(Conv1D(filters=32, kernel_size=8, padding="same", activation='relu'))
# add the GAP layer
model.add(GlobalAveragePooling1D())
# add a fully connected layer with the activation function as relu
model.add(Dense(10, activation='relu'))
# add the output layer
# since this is a binary prediction (0 or 1),
# sigmoid is the activation function and the dimensionality of the output space is 1
model.add(Dense(1, activation='sigmoid'))
# print the model summary
model.summary()
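For reference, a sketch of the Flatten - FC - Dropout head mentioned above (a hypothetical variant shown for comparison only; it is not trained below):
# hypothetical alternative head using Flatten - Dense - Dropout instead of GlobalAveragePooling1D
model_flatten = Sequential()
model_flatten.add(Embedding(vocab_size, 100, input_length=max_length))
model_flatten.add(Conv1D(filters=32, kernel_size=8, padding="same", activation='relu'))
model_flatten.add(Flatten())                      # flatten the feature maps into one long vector
model_flatten.add(Dense(10, activation='relu'))   # fully connected layer
model_flatten.add(Dropout(0.2))                   # dropout to reduce overfitting
model_flatten.add(Dense(1, activation='sigmoid')) # binary output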
# compile network
model.compile(loss='binary_crossentropy', optimizer=keras.optimizers.Adam(learning_rate=10 ** -3), metrics=['accuracy'])
# fit network
history_CNN_keras = model.fit(Xtrain, ytrain, epochs=100, validation_data=(Xvalid, yvalid),
callbacks=[model_checkpoint_cb_CNN_keras, early_stopping_cb, reduce_lr_on_plateau_cb])
pd.DataFrame(history_CNN_keras.history).plot(figsize=(8, 5))
# Save and show the figure
plt.tight_layout()
plt.title('Learning Curve')
plt.xlabel('epochs')
plt.savefig('../figure/learning_curve_CNN_keras.pdf')
plt.show()
# Load the model
model = keras.models.load_model("../model/CNN_keras.h5")
# evaluating the model
loss, accuracy = model.evaluate(Xtest, ytest, verbose = 0)
# print loss and accuracy
print("loss:", loss)
print("accuracy:", accuracy)
# predict probabilities for test set
yhat_probs = model.predict(Xtest, verbose=0) # 2d array
# predict crisp classes for test set
yhat_classes = model.predict_classes(Xtest, verbose=0) # 2d array
# reduce to 1d array
yhat_probs = yhat_probs[:, 0]
yhat_classes = yhat_classes[:, 0]
print("Confusion Matrix")
pd.DataFrame(
confusion_matrix(ytest, yhat_classes, labels=[0,1]),
index=['True : {:}'.format(x) for x in [0,1]],
columns=['Pred : {:}'.format(x) for x in [0,1]])
# getting false positive rate, true positive rate
fpr, tpr, threshold = metrics.roc_curve(ytest, yhat_probs)
# roc auc score
auc = roc_auc_score(ytest, yhat_probs)
# plot ROC curve
plt.figure(figsize=(8,5))
plt.tight_layout()
plt.title('ROC Curve')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.4f' % auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('../figure/ROC_curve_CNN_keras.pdf')
plt.show()
The high AUC score of 0.977 shows that the CNN model with the Keras embedding layer is outstanding at discrimination.
# precision tp / (tp + fp)
precision = precision_score(ytest, yhat_classes)
print('Precision: %f' % precision)
# recall: tp / (tp + fn)
recall = recall_score(ytest, yhat_classes)
print('Recall: %f' % recall)
# f1: 2 tp / (2 tp + fp + fn)
f1 = f1_score(ytest, yhat_classes)
print('F1 score: %f' % f1)
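The same evaluation steps are repeated for each of the remaining models. For readability they could be wrapped in a helper like the following sketch (a hypothetical refactor; the sections below keep the explicit per-model code):
def evaluate_classifier(model, Xtest, ytest):
    """Hypothetical helper bundling the evaluation steps used repeatedly below."""
    yhat_probs = model.predict(Xtest, verbose=0)[:, 0]  # predicted probabilities as a 1d array
    yhat_classes = (yhat_probs > 0.5).astype(int)       # threshold at 0.5 for crisp classes
    print("Accuracy :", accuracy_score(ytest, yhat_classes))
    print("Precision:", precision_score(ytest, yhat_classes))
    print("Recall   :", recall_score(ytest, yhat_classes))
    print("F1 score :", f1_score(ytest, yhat_classes))
    print("ROC AUC  :", roc_auc_score(ytest, yhat_probs))
    print(confusion_matrix(ytest, yhat_classes, labels=[0, 1]))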
We will build a simple RNN architecture with a Long Short-Term Memory (LSTM) layer.
# define model
# create a sequential model
model = Sequential()
# add the embedding layer
model.add(Embedding(vocab_size, 100, input_length=max_length))
# add the LSTM layer
model.add(LSTM(100))
# add drop out to prevent overfitting
model.add(Dropout(0.2))
# add the output layer
model.add(Dense(1, activation='sigmoid'))
# print model summary
model.summary()
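GRU is imported above but not used elsewhere; a GRU-based variant would simply swap the recurrent layer (a hypothetical alternative, not trained below):
# hypothetical GRU variant of the same architecture
model_gru = Sequential()
model_gru.add(Embedding(vocab_size, 100, input_length=max_length))
model_gru.add(GRU(100))      # GRU layer in place of the LSTM
model_gru.add(Dropout(0.2))  # dropout to reduce overfitting
model_gru.add(Dense(1, activation='sigmoid'))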
model.compile(loss='binary_crossentropy', optimizer=keras.optimizers.Adam(learning_rate=10 ** -3), metrics=['accuracy'])
history_RNN_keras = model.fit(Xtrain, ytrain, validation_data=(Xvalid, yvalid), epochs=100,
callbacks=[model_checkpoint_cb_RNN_keras, early_stopping_cb, reduce_lr_on_plateau_cb])
pd.DataFrame(history_RNN_keras.history).plot(figsize=(8, 5))
# Save and show the figure
plt.tight_layout()
plt.title('Learning Curve')
plt.xlabel('epochs')
plt.savefig('../figure/learning_curve_RNN_keras.pdf')
plt.show()
# Load the model
model = keras.models.load_model("../model/RNN_keras.h5")
# evaluating the model
loss, accuracy = model.evaluate(Xtest, ytest, verbose = 0)
# print loss and accuracy
print("loss:", loss)
print("accuracy:", accuracy)
# predict probabilities for test set
yhat_probs = model.predict(Xtest, verbose=0) # 2d array
# predict crisp classes for test set
yhat_classes = model.predict_classes(Xtest, verbose=0) # 2d array
# reduce to 1d array
yhat_probs = yhat_probs[:, 0]
yhat_classes = yhat_classes[:, 0]
print("Confusion Matrix")
pd.DataFrame(
confusion_matrix(ytest, yhat_classes, labels=[0,1]),
index=['True : {:}'.format(x) for x in [0,1]],
columns=['Pred : {:}'.format(x) for x in [0,1]])
# getting false positive rate, true positive rate
fpr, tpr, threshold = metrics.roc_curve(ytest, yhat_probs)
# roc auc score
auc = roc_auc_score(ytest, yhat_probs)
# plot ROC curve
plt.figure(figsize=(8,5))
plt.tight_layout()
plt.title('ROC Curve')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.4f' % auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('../figure/ROC_curve_RNN_keras.pdf')
plt.show()
With a high AUC score of 0.9642, the RNN model with the Keras embedding layer also provides outstanding discrimination.
# precision tp / (tp + fp)
precision = precision_score(ytest, yhat_classes)
print('Precision: %f' % precision)
# recall: tp / (tp + fn)
recall = recall_score(ytest, yhat_classes)
print('Recall: %f' % recall)
# f1: 2 tp / (2 tp + fp + fn)
f1 = f1_score(ytest, yhat_classes)
print('F1 score: %f' % f1)
To use the Word2Vec embedding, we add the pre-trained embedding layer at the beginning of the architecture. To ensure that the network does not adapt the pre-learned vectors during training, we freeze this pre-trained Word2Vec layer.
# create the pre-trained embedding layer by using the embedding_vectors as the weights
# trainable is set as False to freeze the pre-trained layer
embedding_layer = Embedding(vocab_size, 100, embeddings_initializer=keras.initializers.Constant(embedding_vectors), input_length=max_length, trainable=False)
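A quick way to confirm the layer is frozen (an extra check, not in the original code):
# the embedding layer's weights should not be updated during training
print(embedding_layer.trainable)  # expected: False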
# create a sequential model
model = Sequential()
# add the embedding layer
model.add(embedding_layer)
# add the convolutional layer
model.add(Conv1D(filters=128, kernel_size=5, padding="same", activation='relu'))
# add the GAP layer
model.add(GlobalAveragePooling1D())
# add a fully connected layer with the activation function as relu
model.add(Dense(10, activation='relu'))
# add the output layer
# since this is a binary prediction (0 or 1),
# sigmoid is the activation function and the dimensionality of the output space is 1
model.add(Dense(1, activation='sigmoid'))
# print the model summary
model.summary()
# compile network
model.compile(loss='binary_crossentropy', optimizer=keras.optimizers.Adam(learning_rate=10 ** -3), metrics=['accuracy'])
# fit network
history_CNN_word2vec = model.fit(Xtrain, ytrain, epochs=100, validation_data=(Xvalid, yvalid),
callbacks=[model_checkpoint_cb_CNN_word2vec, early_stopping_cb, reduce_lr_on_plateau_cb])
pd.DataFrame(history_CNN_word2vec.history).plot(figsize=(8, 5))
# Save and show the figure
plt.tight_layout()
plt.title('Learning Curve')
plt.xlabel('epochs')
plt.savefig('../figure/learning_curve_CNN_word2vec.pdf')
plt.show()
# Load the model
model = keras.models.load_model("../model/CNN_word2vec.h5")
# evaluating the model
loss, accuracy = model.evaluate(Xtest, ytest, verbose = 0)
# print loss and accuracy
print("loss:", loss)
print("accuracy:", accuracy)
# predict probabilities for test set
yhat_probs = model.predict(Xtest, verbose=0) # 2d array
# predict crisp classes for test set
yhat_classes = model.predict_classes(Xtest, verbose=0) # 2d array
# reduce to 1d array
yhat_probs = yhat_probs[:, 0]
yhat_classes = yhat_classes[:, 0]
print("Confusion Matrix")
pd.DataFrame(
confusion_matrix(ytest, yhat_classes, labels=[0,1]),
index=['True : {:}'.format(x) for x in [0,1]],
columns=['Pred : {:}'.format(x) for x in [0,1]])
# getting false positive rate, true positive rate
fpr, tpr, threshold = metrics.roc_curve(ytest, yhat_probs)
# roc auc score
auc = roc_auc_score(ytest, yhat_probs)
# plot ROC curve
plt.figure(figsize=(8,5))
plt.tight_layout()
plt.title('ROC Curve')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.4f' % auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('../figure/ROC_curve_CNN_word2vec.pdf')
plt.show()
With a high AUC score of 0.9432, the CNN model with the pre-trained Word2Vec embedding also has outstanding discrimination.
# precision tp / (tp + fp)
precision = precision_score(ytest, yhat_classes)
print('Precision: %f' % precision)
# recall: tp / (tp + fn)
recall = recall_score(ytest, yhat_classes)
print('Recall: %f' % recall)
# f1: 2 tp / (2 tp + fp + fn)
f1 = f1_score(ytest, yhat_classes)
print('F1 score: %f' % f1)
# define model
# create a sequential model
model = Sequential()
# add the embedding layer
model.add(embedding_layer)
# add the LSTM layer
model.add(LSTM(100))
model.add(Dropout(0.2))
# add the output layer
model.add(Dense(1, activation='sigmoid'))
# print model summary
model.summary()
# compile network
model.compile(loss='binary_crossentropy', optimizer=keras.optimizers.Adam(learning_rate=10 ** -3), metrics=['accuracy'])
# fit network
history_RNN_word2vec = model.fit(Xtrain, ytrain, epochs=100, validation_data=(Xvalid, yvalid),
callbacks=[model_checkpoint_cb_RNN_word2vec, early_stopping_cb, reduce_lr_on_plateau_cb])
pd.DataFrame(history_RNN_word2vec.history).plot(figsize=(8, 5))
# Save and show the figure
plt.tight_layout()
plt.title('Learning Curve')
plt.xlabel('epochs')
plt.savefig('../figure/learning_curve_RNN_word2vec.pdf')
plt.show()
# Load the model
model = keras.models.load_model("../model/RNN_word2vec.h5")
# evaluating the model
loss, accuracy = model.evaluate(Xtest, ytest, verbose = 0)
# print loss and accuracy
print("loss:", loss)
print("accuracy:", accuracy)
# predict probabilities for test set
yhat_probs = model.predict(Xtest, verbose=0) # 2d array
# predict crisp classes for test set
yhat_classes = model.predict_classes(Xtest, verbose=0) # 2d array
# reduce to 1d array
yhat_probs = yhat_probs[:, 0]
yhat_classes = yhat_classes[:, 0]
print("Confusion Matrix")
pd.DataFrame(
confusion_matrix(ytest, yhat_classes, labels=[0,1]),
index=['True : {:}'.format(x) for x in [0,1]],
columns=['Pred : {:}'.format(x) for x in [0,1]])
# getting false positive rate, true positive rate
fpr, tpr, threshold = metrics.roc_curve(ytest, yhat_probs)
# roc auc score
auc = roc_auc_score(ytest, yhat_probs)
# plot ROC curve
plt.figure(figsize=(8,5))
plt.tight_layout()
plt.title('ROC Curve')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.4f' % auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('../figure/ROC_curve_RNN_word2vec.pdf')
plt.show()
The AUC score of 0.9396 suggests that the RNN model with the pre-trained Word2Vec embedding also has outstanding discrimination.
# precision tp / (tp + fp)
precision = precision_score(ytest, yhat_classes)
print('Precision: %f' % precision)
# recall: tp / (tp + fn)
recall = recall_score(ytest, yhat_classes)
print('Recall: %f' % recall)
# f1: 2 tp / (2 tp + fp + fn)
f1 = f1_score(ytest, yhat_classes)
print('F1 score: %f' % f1)
We conclude with a summary of the project and directions for future improvement.
In this project, our group presents a method for the early detection of depression using Reddit comments. Two neural network models, a CNN and an RNN, are built to detect subjects with depression. For additional insight, we use both the Keras embedding layer and a pre-trained Word2Vec model to create the input embedding layer of the networks. The results show that the Keras embedding layer performs better than Word2Vec; the reason may be that Word2Vec is less effective at handling informal and ungrammatical text.
According to the model evaluation, the CNN model with the Keras embedding layer has the best performance. The reason that the CNN achieves better classification accuracy than the RNN may be that the dataset does not contain many long, text-heavy comments, so the extraction of sequential information matters less. The RNN model is expected to perform better on formal and longer writing such as personal experience stories and mental health reports.
For future improvement, more data containing both formal and informal writing could be collected. An ensemble of CNN and RNN could be constructed to recognize both local and sequential features effectively. Different word embedding models such as GloVe and fastText could be pre-trained and plugged into the embedding layer to extend the scope of the project.