Predicting Altruism with Reddit Posts: An NLP Project
What makes certain language easier to sympathize with?
Using a dataset from /r/Random_Acts_Of_Pizza, a subreddit where users can ask other users for free pizza, I was interested in understanding why certain posts tended to earn a pizza while others did not.

Exploring the data
Length of Request
I was interested in the character length of requests. Requests that did receive pizza, on average, were 109 characters longer than those that did not receive pizza.
Perhaps requests that did receive pizza included more detail that helped users sympathize with the original poster (OP).
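This comparison is straightforward to compute from the dataset itself. Here is a minimal sketch, assuming a pandas DataFrame named pizza with the request_text and requester_received_pizza columns used throughout this post:

import pandas as pd

# Assumes `pizza` has 'request_text' and 'requester_received_pizza' columns
pizza['request_length'] = pizza['request_text'].str.len()

# Mean character length for each outcome (received pizza vs. not)
print(pizza.groupby('requester_received_pizza')['request_length'].mean())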
Sentiment Analysis
Overall sentiment for requests that did receive pizza was slightly higher than for those that did not. This suggests that posts with more positive sentiment may be more likely to garner responses.
The graph above shows average compound sentiment using VADER sentiment analysis. I chose VADER because it handles internet slang and emojis, which appear often in Reddit posts.
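For reference, the compound scores can be produced with the vaderSentiment package; this is a minimal sketch assuming the same pizza DataFrame as above:

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# Compound score ranges from -1 (most negative) to +1 (most positive)
pizza['compound'] = pizza['request_text'].apply(
    lambda text: analyzer.polarity_scores(text)['compound'])

print(pizza.groupby('requester_received_pizza')['compound'].mean())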
Topic Modeling with NMF
Using NMF, I extracted topics from requests with both outcomes. For the most part, the topics were fairly consistent: in both groups, "family" was the most prominent topic.
However, requests that received pizza tended to talk about illness, while those that did not tended to talk more generically about "bills." Requests that received pizza also mentioned "money" far less often than those that did not.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF

# Collect the request text for posts that received pizza
requests = []
for x in range(0, len(yespizza)):
    requests.append(yespizza['request_text'][x])

# Bag-of-words counts, factored into 5 topics
vectorizer = CountVectorizer()
doc_word = vectorizer.fit_transform(requests)
nmf_model = NMF(5)
doc_topic = nmf_model.fit_transform(doc_word)

# Show the top 10 words per topic
display_topics(nmf_model, vectorizer.get_feature_names(), 10)
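The display_topics helper isn't defined in the snippet above; a minimal version of such a helper (my sketch, not necessarily the exact one used in the project) could look like this:

def display_topics(model, feature_names, no_top_words):
    # Print the highest-weighted words for each NMF topic
    for topic_idx, topic in enumerate(model.components_):
        top_words = [feature_names[i]
                     for i in topic.argsort()[:-no_top_words - 1:-1]]
        print("Topic {}: {}".format(topic_idx, ", ".join(top_words)))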
K-means clustering
Similarly, I wanted to cluster the requests with k-means so that I could eventually assign clusters to new requests.
Using tf-idf weights and 4 clusters (chosen with the elbow method), the model assigned every request a cluster label, which I could later use as a feature in a logistic regression on new request observations.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Tf-idf weights over the request text
tfidf = TfidfVectorizer(stop_words='english')
X_1 = ["".join(review) for review in pizza['request_text'].values]
tfidfDF = tfidf.fit_transform(X_1)

model = KMeans(n_clusters=4)
model.fit(tfidfDF)

def predictCluster(request_text):
    # Vectorize a single request and return its cluster label
    return int(model.predict(tfidf.transform([request_text]))[0])

pizza['kmeanscluster'] = pizza['request_text'].apply(lambda x: predictCluster(x))
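The elbow method itself isn't shown in the post; a quick sketch of how k could be chosen (fitting k-means over a range of k, plotting inertia, and looking for the bend) might look like this, assuming the tfidfDF matrix from above:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Within-cluster sum of squares (inertia) for each candidate k;
# the "elbow" in the curve suggests a reasonable cluster count
inertias = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, random_state=42).fit(tfidfDF)
    inertias.append(km.inertia_)

plt.plot(ks, inertias, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia')
plt.show()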
Predicting Altruism with Logistic Regression
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Features (sentiment, cluster, length) and target
X = pizza.iloc[:, 2:]
y = pizza['requester_received_pizza']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Weight the "received pizza" class twice as heavily
logit = LogisticRegression(class_weight={1: 2, 0: 1}, solver='liblinear')
logit.fit(X_train, y_train)
y_pred = logit.predict(X_test)

print("Training: {:6.2f}%".format(100 * logit.score(X_train, y_train)))
print("Test set: {:6.2f}%".format(100 * logit.score(X_test, y_test)))
print("Accuracy test: {:6.2f}%".format(100 * accuracy_score(y_test, y_pred)))
print("Cross Validate: {:6.2f}%".format(100 * np.mean(cross_val_score(logit, X, y))))
Putting it all together
I wanted to predict whether a future request would receive pizza (1) or not (0). As features, I used the compound sentiment score, the k-means cluster the request fell into, and the character length of the request.
Using a logistic regression with class weights of 2 for received pizza and 1 for did not receive pizza, the model performed as follows:
Training: 73.79%
Test set: 73.02%
Accuracy test: 73.02%
Cross Validation: 73.52%
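To close the loop, here is how a new request could be scored with the pieces built above. This is a sketch: analyzer, predictCluster, and logit are the objects from earlier snippets, new_request is a made-up example, and it assumes the feature matrix contained exactly these three features in this order.

# Hypothetical new request
new_request = "My family is sick and we can't afford dinner tonight..."

# Build the same three features used in training:
# compound sentiment, k-means cluster, and character length
features = [[analyzer.polarity_scores(new_request)['compound'],
             predictCluster(new_request),
             len(new_request)]]

print(logit.predict(features))        # 1 = likely to receive pizza
print(logit.predict_proba(features))  # class probabilities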