Predicting Altruism with Reddit Posts: An NLP Project

What makes certain language easier to sympathize with?

Using a dataset from /r/Random_Acts_Of_Pizza, a subreddit where users can ask other users for free pizza, I was interested in understanding why certain posts earned a pizza while others did not.

Exploring the data

 
[Figure: average request length by outcome]

Length of Request

I was interested in the character length of requests. On average, requests that received pizza were 109 characters longer than those that did not.

Perhaps successful requests included more detail, which helped readers sympathize with the original poster (OP).
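A minimal sketch of that comparison, assuming the dataset is loaded into a pandas DataFrame named pizza (the name used in the snippets below) with the request_text and requester_received_pizza columns:

import pandas as pd

# Character length of each request
pizza['request_length'] = pizza['request_text'].str.len()

# Average length for requests that did and did not receive pizza
print(pizza.groupby('requester_received_pizza')['request_length'].mean())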

 
[Figure: average compound (VADER) sentiment by outcome]

Sentiment Analysis

Overall sentiment for requests that received pizza was slightly higher than for those that did not. This suggests that posts with more positive sentiment may be more likely to garner responses.

The graph above shows average compound sentiment using VADER sentiment analysis. I chose VADER because it handles internet slang and emojis, both of which are common in Reddit posts.
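A hedged sketch of the sentiment step, assuming the vaderSentiment package and the same pizza DataFrame as above:

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# polarity_scores returns neg/neu/pos scores plus a normalized
# 'compound' score in [-1, 1]; keep the compound score as a feature
pizza['compound'] = pizza['request_text'].apply(
    lambda text: analyzer.polarity_scores(text)['compound'])

# Average compound sentiment for each outcome
print(pizza.groupby('requester_received_pizza')['compound'].mean())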

 

Topic Modeling with NMF

[Figures: top NMF topics for requests that received pizza vs. those that did not]
 

Using NMF, I was able to extract topics from requests with both outcomes. For the most part, the topics were consistent: both groups mentioned “family” most frequently.

However, requests that received pizza tended to talk about illness, while those that did not tended to talk more generally about “bills”. Requests that received pizza also mentioned “money” far less often than those that did not.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF

# Collect the text of every request that received pizza
requests = yespizza['request_text'].tolist()

# Bag-of-words counts, factored into 5 NMF topics
vectorizer = CountVectorizer()
doc_word = vectorizer.fit_transform(requests)
nmf_model = NMF(n_components=5)
doc_topic = nmf_model.fit_transform(doc_word)

# Print the top 10 words in each topic
display_topics(nmf_model, vectorizer.get_feature_names_out(), 10)
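The display_topics helper is not defined in the post; a common version of it (an assumption on my part) prints the highest-weighted words in each NMF component:

# Hypothetical helper; the original post does not show its definition
def display_topics(model, feature_names, num_top_words):
    for topic_idx, topic in enumerate(model.components_):
        top_words = [feature_names[i]
                     for i in topic.argsort()[:-num_top_words - 1:-1]]
        print("Topic {}: {}".format(topic_idx, ", ".join(top_words)))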

K-means clustering

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Tf-idf representation of every request
tfidf = TfidfVectorizer(stop_words='english')
X_1 = pizza['request_text'].tolist()
tfidfDF = tfidf.fit_transform(X_1)

# Group the requests into 4 clusters
model = KMeans(n_clusters=4)
model.fit(tfidfDF)

def predictCluster(request_text):
    # Vectorize a single request and return its cluster label
    return int(model.predict(tfidf.transform([request_text]))[0])

pizza['kmeanscluster'] = pizza['request_text'].apply(predictCluster)

I also wanted to cluster the requests with k-means so that I could eventually assign a cluster to new requests.

Using tf-idf weights and 4 clusters (chosen with the elbow method; see the sketch below), the model assigned every request a cluster label, which I could later use as a feature in a logistic regression to classify new requests.
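A sketch of the elbow method used to choose k; the range of candidate values here is an illustrative assumption:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit k-means for a range of k and record the within-cluster
# sum of squares (inertia) for each fit
inertias = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, random_state=42).fit(tfidfDF)
    inertias.append(km.inertia_)

plt.plot(ks, inertias, marker='o')
plt.xlabel('k (number of clusters)')
plt.ylabel('inertia')
plt.show()  # pick the "elbow" where the curve flattens; here, k = 4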

Predicting Altruism with Logistic Regression

[Figure: logistic regression results]
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Feature columns and target
X = pizza.iloc[:, 2:]
y = pizza['requester_received_pizza']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Weight the "received pizza" class twice as heavily as the other
logit = LogisticRegression(class_weight={1: 2, 0: 1}, solver='liblinear')
logit.fit(X_train, y_train)
y_pred = logit.predict(X_test)

print("Training: {:6.2f}%".format(100 * logit.score(X_train, y_train)))
print("Test set: {:6.2f}%".format(100 * logit.score(X_test, y_test)))
print("Accuracy test: {:6.2f}%".format(100 * accuracy_score(y_test, y_pred)))
print("Cross Validation: {:6.2f}%".format(100 * np.mean(cross_val_score(logit, X, y))))

Putting it all together

I was interested in predicting whether future requests would receive pizza (1) or not (0). As features, I used the compound sentiment score, the k-means cluster the request fell into, and the character length of the request.

Using a logistic regression with class weights of 2 for received pizza and 1 for did not receive pizza, my model performed as follows:

  • Training: 73.79%

  • Test set: 73.02%

  • Accuracy test: 73.02%

  • Cross Validation: 73.52%

[Figure: the three model features (compound sentiment, k-means cluster, request length)]
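Putting the pieces together, here is a hedged end-to-end sketch for scoring a brand-new request. It assumes the fitted analyzer, tfidf, model, and logit objects from the snippets above, and that the three features appear in the same order as the training columns (pizza.iloc[:, 2:]):

import numpy as np

def will_receive_pizza(request_text):
    # Build the three features described above; the feature order is
    # an assumption and must match the columns used to train logit
    compound = analyzer.polarity_scores(request_text)['compound']
    cluster = int(model.predict(tfidf.transform([request_text]))[0])
    length = len(request_text)
    return int(logit.predict(np.array([[compound, cluster, length]]))[0])

# 1 = predicted to receive pizza, 0 = predicted not to
print(will_receive_pizza("My family has been sick all week and we could really use a pizza."))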