Final Project - Group 5¶
Introduction¶
In today’s world, social media plays a huge role in how we communicate, share ideas, and interact with content. Every day, certain posts suddenly gain massive attention and spread across social media platforms like TikTok, Instagram, Twitter, and YouTube. Our project explores this phenomenon by examining engagement data across several platforms and building models that help predict the virality of a post.
Our question focuses on which characteristics of a post are most associated with high engagement, and whether we can use those characteristics to predict if a post will go viral. We are using a dataset from Kaggle called “Viral Social Media Trends & Engagement Analysis” to answer this question. It includes 5,000 social media posts and provides details such as the platform, the hashtag used, the type of content, the region, and engagement metrics such as views, likes, shares, and comments. It also includes an engagement level of low, medium, or high.
This project explores the dataset to find patterns and trends that can help us answer the question. Our approach includes cleaning the data, classifying posts by engagement level, and clustering posts to see which kinds tend to go viral. This project provides useful insights for content creators, marketers, and anyone looking to understand how viral posts work.
Our Data¶
The dataset used for this project is “Viral Social Media Trends & Engagement Analysis” from Kaggle. It contains 5,000 unique entries representing posts from platforms such as TikTok, Instagram, Twitter, and YouTube. We chose this dataset because it captures key characteristics of viral social media posts, including user engagement and attributes of each post, which help us answer our question.
Each row in the dataset represents one post and includes ten features: a Post_ID, the platform it was posted on, the hashtag used, the content type (video, tweet, reel, or livestream), and the region where it was posted. Additionally, the dataset includes engagement metrics such as views, likes, shares, and comments. Another useful column is the engagement level, which labels each post’s engagement as either low, medium, or high; this can help us train our models for classification. Looking at the dataset as a whole, YouTube was the most represented platform, the most common content types were livestreams and standard posts, and the engagement levels were evenly distributed between low, medium, and high.
Overall, the dataset helps us explore what makes a post go viral. It includes data from many different platforms, content types, and engagement metrics that can aid us when we look for patterns and relationships using classification and clustering techniques.
We will begin by importing our dataset and assigning it to a pandas DataFrame variable named 'trends', which allows for easy manipulation and analysis of the data. Running the head() function on the dataset gives us a sneak peek at what we are working with, such as spotting missing values or inconvenient variable types.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
trends = pd.read_csv('Viral_Social_Media_Trends.csv')
trends.head()
| | Post_ID | Platform | Hashtag | Content_Type | Region | Views | Likes | Shares | Comments | Engagement_Level |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Post_1 | TikTok | #Challenge | Video | UK | 4163464 | 339431 | 53135 | 19346 | High |
| 1 | Post_2 | Instagram | #Education | Shorts | India | 4155940 | 215240 | 65860 | 27239 | Medium |
| 2 | Post_3 | Twitter | #Challenge | Video | Brazil | 3666211 | 327143 | 39423 | 36223 | Medium |
| 3 | Post_4 | YouTube | #Education | Shorts | Australia | 917951 | 127125 | 11687 | 36806 | Low |
| 4 | Post_5 | TikTok | #Dance | Post | Brazil | 64866 | 171361 | 69581 | 6376 | Medium |
Data Visualizations for Data Understanding¶
# Visualization 1: Distribution of posts across different platforms
plt.figure(figsize=(10, 6))
sns.countplot(data=trends, x='Platform', palette='viridis', hue='Platform')
plt.title('Distribution of Posts Across Different Platforms')
plt.xlabel('Platform')
plt.ylabel('Number of Posts')
plt.show()
# Visualization 2: Distribution of engagement levels
plt.figure(figsize=(10, 6))
sns.countplot(data=trends, x='Engagement_Level', palette='viridis', hue='Engagement_Level')
plt.title('Distribution of Engagement Levels')
plt.xlabel('Engagement Level')
plt.ylabel('Number of Posts')
plt.show()
# Visualization 3: Average views, likes, shares, and comments by platform
metrics = ['Views', 'Likes', 'Shares', 'Comments']
for metric in metrics:
plt.figure(figsize=(10, 6))
sns.barplot(data=trends, x='Platform', y=metric, palette='viridis', hue='Platform')
plt.title(f'Average {metric} by Platform')
plt.xlabel('Platform')
plt.ylabel(f'Average {metric}')
plt.show()
# Visualization 4: Average views, likes, shares, and comments by engagement level
for metric in metrics:
plt.figure(figsize=(10, 6))
sns.barplot(data=trends, x='Engagement_Level', y=metric, palette='viridis', hue='Engagement_Level')
plt.title(f'Average {metric} by Engagement Level')
plt.xlabel('Engagement Level')
plt.ylabel(f'Average {metric}')
plt.show()
# Visualization 5: Correlation heatmap of engagement metrics
plt.figure(figsize=(10, 6))
sns.heatmap(trends[metrics].corr(), annot=True, cmap='viridis', vmin=-1, vmax=1)
plt.title('Correlation Heatmap of Engagement Metrics')
plt.show()
Pre-processing¶
Before using our data within our models, we need to transform features that have null or categorical values. From running the info() function, we can see that our dataset has no missing values and four numerical columns; the other six columns are categorical.
trends.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Post_ID           5000 non-null   object
 1   Platform          5000 non-null   object
 2   Hashtag           5000 non-null   object
 3   Content_Type      5000 non-null   object
 4   Region            5000 non-null   object
 5   Views             5000 non-null   int64 
 6   Likes             5000 non-null   int64 
 7   Shares            5000 non-null   int64 
 8   Comments          5000 non-null   int64 
 9   Engagement_Level  5000 non-null   object
dtypes: int64(4), object(6)
memory usage: 390.8+ KB
Below, we use one-hot encoding to turn our categorical values into multiple binary columns. Our models cannot process categorical strings directly, so one-hot encoding translates them into a numeric form. We also drop the 'Post_ID' column because it does not carry any meaningful information for predicting engagement. Finally, we convert our target column from text ('High', 'Medium', and 'Low') to numerical values (2, 1, 0).
Once these pre-processing steps are complete, our dataset expands from 10 columns to 29. This way our model can identify the different categorical values without misinterpreting the data.
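As a small illustration of what one-hot encoding does (on hypothetical toy data, not our actual dataset), get_dummies turns each remaining category into its own True/False column, and drop_first removes the first category since it is implied when all the others are False:

```python
import pandas as pd

# Toy example: a single categorical column with three platforms
toy = pd.DataFrame({'Platform': ['TikTok', 'YouTube', 'TikTok', 'Twitter']})

# drop_first drops the first category alphabetically (TikTok here),
# which is encoded implicitly as "all other columns False"
encoded = pd.get_dummies(toy, columns=['Platform'], drop_first=True)
print(list(encoded.columns))  # ['Platform_Twitter', 'Platform_YouTube']
```

The same mechanism applied to our four categorical columns is what expands the dataset to 29 columns.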
trends = pd.get_dummies(trends, columns=['Platform', 'Content_Type', 'Region', 'Hashtag'], drop_first=True)
trends = trends.drop(columns=['Post_ID'])
trends['Engagement_Level'] = trends['Engagement_Level'].map({'High': 2, 'Medium':1, 'Low':0})
trends.head()
| | Views | Likes | Shares | Comments | Engagement_Level | Platform_TikTok | Platform_Twitter | Platform_YouTube | Content_Type_Post | Content_Type_Reel | ... | Region_USA | Hashtag_#Comedy | Hashtag_#Dance | Hashtag_#Education | Hashtag_#Fashion | Hashtag_#Fitness | Hashtag_#Gaming | Hashtag_#Music | Hashtag_#Tech | Hashtag_#Viral |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4163464 | 339431 | 53135 | 19346 | 2 | True | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | 4155940 | 215240 | 65860 | 27239 | 1 | False | False | False | False | False | ... | False | False | False | True | False | False | False | False | False | False |
| 2 | 3666211 | 327143 | 39423 | 36223 | 1 | False | True | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | 917951 | 127125 | 11687 | 36806 | 0 | False | False | True | False | False | ... | False | False | False | True | False | False | False | False | False | False |
| 4 | 64866 | 171361 | 69581 | 6376 | 1 | True | False | False | True | False | ... | False | False | True | False | False | False | False | False | False | False |
5 rows × 29 columns
Modeling¶
Classification Predicting Engagement Level¶
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
Before we run our model, we prepare our data by separating the features from the target variable. Specifically, we drop the 'Engagement_Level' column from the dataset to create our feature matrix 'X' and store our target labels in 'y'. We then split the data into training and testing sets using an 80-20 split. Setting the random state to 42 makes sure that the split is reproducible.
We will use a Random Forest Classifier, a method that builds multiple decision trees and merges their predictions to get a more accurate result.
X = trends.drop(columns=['Engagement_Level'])
y = trends['Engagement_Level']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 42)
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
Feature Importance and New Feature Engineering¶
To gain more insights, we examined the feature importances provided by the Random Forest model. We found that the most important features were:
- Likes
- Shares
- Views
- Comments
Taking this into account, we predict that engagement might be better captured not just by individual counts but by how interactions relate to the views of a post.
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
importances.sort_values(ascending=False).head(10).plot(kind='barh')
plt.title('Top 10 Important Features')
plt.show()
Therefore, we created a new feature called 'Engagement_Rate', with the calculation shown below. The idea is that posts with high engagement relative to their number of views might correlate with a higher engagement level and possibly give our model a clearer, more precise predictive ability.
trends['Engagement_Rate'] = (trends['Likes'] + trends['Shares'] + trends['Comments']) / trends['Views']
X = trends.drop(columns=['Engagement_Level'])
y = trends['Engagement_Level']
X_train_Random2, X_test_Random2, y_train_Random2, y_test_Random2 = train_test_split(X, y, test_size=0.2, random_state = 42)
rf = RandomForestClassifier()
rf.fit(X_train_Random2, y_train_Random2)
y_pred_Random2 = rf.predict(X_test_Random2)
Decision Tree¶
Decision trees are a popular choice for classification because of their interpretability. For this project, we chose this model because it works well for classifying a post into categories while also revealing which features contribute most to high engagement. Overall, we saw that views, shares, comments, and likes contributed most to the engagement level.
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
importances= pd.Series(dt.feature_importances_, index=X_train.columns)
importances.sort_values(ascending=False).head(10).plot(kind='barh')
plt.title('Top 10 Important Features - Decision Tree')
plt.show()
Feature Importance: Insights from the Decision Tree¶
After evaluating the decision tree's results, it seems that these fields contributed the most:
- Views: Higher view counts generally correlate with increased engagement, reflecting broader reach and exposure.
- Shares: Posts with more shares tend to have higher engagement, indicating active user participation in spreading content.
- Comments: Increased comments signify direct interaction and discussion around a post, contributing to higher engagement.
- Likes: More likes on a post also contributed to higher engagement.
These findings emphasize the importance of creating content that attracts views, encourages sharing, and fosters active discussions to maximize social media engagement.
KNN Classification¶
K-Nearest Neighbors (KNN) was chosen for this classification task because it is simple, easy to interpret, and effective as a baseline model. KNN does not assume any specific data distribution and can naturally handle multiclass classification problems like engagement levels.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
X = trends.drop(columns=['Engagement_Level'])
y = trends['Engagement_Level']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # Added random_state for reproducibility
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_prediction = knn.predict(X_test)
Evaluation¶
To evaluate our models’ performance in classifying post engagement levels (low, medium, high), we used common classification metrics: accuracy, precision, recall, and F1-score. Our goal was to identify patterns in social media posts that could help predict virality, so we focused on how well our models could classify each engagement level rather than just overall accuracy.
The Random Forest Classifier¶
print("Accuracy:", accuracy_score(y_test, y_pred))
Accuracy: 0.333
To evaluate the model's performance, we will use accuracy as the metric. This model produced an accuracy between 0.319 and 0.333 depending on the run. Since 0.333 is roughly what random guessing would achieve across three evenly distributed classes, this leaves a lot of room for improvement.
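To make the "random guessing" baseline concrete, scikit-learn's DummyClassifier can quantify it. The sketch below uses synthetic, evenly distributed labels as a stand-in for Engagement_Level (not our actual dataset), where the chance baseline is exactly one third:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in: 4 dummy features and perfectly balanced 3-class labels
rng = np.random.default_rng(42)
X = rng.normal(size=(600, 4))
y = np.repeat([0, 1, 2], 200)

# A dummy model that always predicts one class; on balanced classes its
# accuracy equals the chance level of 1/3
dummy = DummyClassifier(strategy='most_frequent').fit(X, y)
baseline = accuracy_score(y, dummy.predict(X))
print(round(baseline, 3))  # 0.333
```

Any real model should be judged against this floor rather than against 0.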
To analyze the results a bit more, we will use a confusion matrix. The rows correspond to the true labels and the columns correspond to the predicted labels. Ideally, we want high numbers on the diagonal (correct predictions) and low numbers elsewhere.
In the following matrix, we can see the model struggles to correctly classify the engagement levels. Since the misclassifications are frequent across all classes, this suggests that the classes are difficult for the model to distinguish based on the original features.
confm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8,6))
sns.heatmap(confm, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
Random Forest - Post Feature Engineering¶
print("Accuracy Score:", accuracy_score(y_test_Random2, y_pred_Random2))
Accuracy Score: 0.318
After creating the 'Engagement_Rate' feature and retraining the random forest model using the same process, the resulting accuracy improved slightly, from an initial range of 0.319-0.333 to a new range of 0.32-0.36.
Although the improvement was small, it suggests that the new Engagement_Rate feature helped the model capture some additional information that the individual features alone did not fully express.
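Since single train/test splits gave accuracies that varied from run to run, cross-validation would give a steadier estimate of that range. A minimal sketch, using make_classification as a synthetic stand-in for our real feature matrix and target:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic 3-class data standing in for our X and y
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_classes=3, random_state=42)

# 5-fold cross-validation: each fold serves once as the test set, and the
# mean accuracy is less sensitive to any one lucky or unlucky split
rf = RandomForestClassifier(random_state=42)
scores = cross_val_score(rf, X, y, cv=5, scoring='accuracy')
print(scores.mean().round(3), scores.std().round(3))
```

Reporting the mean and standard deviation across folds would replace the "0.32-0.36 depending on the run" style of range with a single, reproducible figure.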
# Confusion matrix for the Random Forest model retrained after feature engineering
confm = confusion_matrix(y_test_Random2, y_pred_Random2)
plt.figure(figsize=(8,6))
sns.heatmap(confm, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
# Classification report for the Random Forest model retrained after feature engineering
report = classification_report(y_test_Random2, y_pred_Random2, output_dict=True)
report_df = pd.DataFrame(report).transpose()
plt.figure(figsize=(10,6))
sns.heatmap(report_df.iloc[:-1, :-1], annot=True, cmap='YlGnBu', fmt='.2f')
plt.title('Classification Report')
plt.show()
Decision Tree¶
print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred_dt))
Decision Tree Accuracy: 0.334
The accuracy of the decision tree came out to roughly 0.32-0.33 depending on the run. For this three-class problem, that is about the level of random guessing (≈0.333), comparable to the Random Forest before feature engineering.
report_df = pd.DataFrame(classification_report(y_test, y_pred_dt, output_dict=True)).T
report_df = report_df.drop(columns=['support'], errors='ignore')
plt.figure(figsize=(8, 6))
sns.heatmap(report_df, annot=True, cmap="Blues", fmt=".2f")
plt.title("Classification Report Heatmap - Decision Tree")
plt.show()
The classification report presents key performance metrics of the Decision Tree, helping us understand its effectiveness at predicting Engagement_Level.
The next portion is the evaluation of the confusion matrix, a 3x3 table that compares the actual engagement labels to the predicted labels. The decision tree shows confusion across all three engagement levels, frequently mistaking High engagement posts for Low.
confm_dt = confusion_matrix(y_test, y_pred_dt)
plt.figure(figsize=(8, 6))
sns.heatmap(confm_dt, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Decision Tree Confusion Matrix')
plt.show()
KNN¶
After training the model with k=5 neighbors, it achieved an accuracy of approximately 0.307. This relatively low score suggests that the features used may not strongly predict engagement; improvements such as feature scaling, hyperparameter tuning, or using more neighbors could help.
print(f"Accuracy score: {accuracy_score(y_test, y_prediction)}")
Accuracy score: 0.307
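Feature scaling matters especially for KNN, which is distance-based: a feature like Views (in the millions) would dominate the Euclidean distance over Comments (in the thousands). A minimal sketch of the usual fix, a StandardScaler plus KNN pipeline, on synthetic stand-in data rather than our actual dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data in place of our real X and y
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           n_classes=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling first keeps large-magnitude features from dominating the
# distances KNN relies on; the pipeline applies it to train and test alike
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_tr, y_tr)
print(round(scaled_knn.score(X_te, y_te), 3))
```

Wrapping the scaler in a pipeline also avoids leaking test-set statistics into training, since the scaler is fit only on the training fold.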
To better understand how the choice of k affects model performance, we plotted the accuracy of the K-Nearest Neighbors (KNN) classifier for different values of k ranging from 1 to 20. The graph shows that accuracy fluctuates across different k values, but generally trends upward as k increases. Lower k values tend to result in more variance and slightly lower accuracy, while higher k values produce more stable and higher accuracy scores, peaking around k = 20 with an accuracy slightly above 34%. This suggests that for this dataset, using a higher number of neighbors leads to better generalization, although overall accuracy remains relatively modest.
k_values = range(1, 21)
accuracy_scores = []
for k in k_values:
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train, y_train)
y_pred_k = knn.predict(X_test)
accuracy_scores.append(accuracy_score(y_test, y_pred_k))
plt.figure(figsize=(8, 5))
plt.plot(k_values, accuracy_scores, marker='o', linestyle='dashed', color='b')
plt.xlabel("Number of Neighbors (k)")
plt.ylabel("Accuracy Score")
plt.title("KNN Accuracy vs. k-Values")
plt.xticks(k_values)
plt.show()
The classification report presents key performance metrics of the KNN model, helping us understand its effectiveness at predicting Engagement_Level.
report_df = pd.DataFrame(classification_report(y_test, y_prediction, output_dict=True)).T
report_df = report_df.drop(columns=['support'], errors='ignore')
plt.figure(figsize=(8, 6))
sns.heatmap(report_df, annot=True, cmap="Blues", fmt=".2f")
plt.title("Classification Report Heatmap - KNN test")
plt.show()
The next portion is the evaluation of the confusion matrix, a 3x3 table that compares the actual engagement labels to the predicted labels. As with the other models, KNN misclassifies posts across all three engagement levels, particularly confusing Medium and High.
confm_knn = confusion_matrix(y_test, y_prediction)
plt.figure(figsize=(8, 6))
sns.heatmap(confm_knn, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('KNN Confusion Matrix')
plt.show()
Model Comparison¶
This bar chart compares the performance of the Random Forest and Decision Tree models across four evaluation metrics: Accuracy, Precision, Recall, and F1-score. Both models achieved very similar results across all metrics, with Random Forest slightly outperforming Decision Tree in each category. This suggests that while both models handle the classification task comparably, Random Forest provides a small but consistent improvement in predictive performance, likely due to its ability to reduce overfitting by averaging multiple decision trees.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import matplotlib.pyplot as plt
import numpy as np
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-score']
rf_scores = [accuracy_score(y_test, y_pred),
precision_score(y_test, y_pred, average='weighted'),
recall_score(y_test, y_pred, average='weighted'),
f1_score(y_test, y_pred, average='weighted')]
dt_scores = [accuracy_score(y_test, y_pred_dt),
precision_score(y_test, y_pred_dt, average='weighted'),
recall_score(y_test, y_pred_dt, average='weighted'),
f1_score(y_test, y_pred_dt, average='weighted')]
x = np.arange(len(metrics))
width = 0.35
fig, ax = plt.subplots(figsize=(10, 6))
rects1 = ax.bar(x - width/2, rf_scores, width, label='Random Forest')
rects2 = ax.bar(x + width/2, dt_scores, width, label='Decision Tree')
ax.set_ylabel('Scores')
ax.set_title('Model Comparison')
ax.set_xticks(x)
ax.set_xticklabels(metrics)
ax.legend()
fig.tight_layout()
plt.show()
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns  # Seaborn for styling
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-score']
rf_scores = [accuracy_score(y_test, y_pred),
precision_score(y_test, y_pred, average='weighted'),
recall_score(y_test, y_pred, average='weighted'),
f1_score(y_test, y_pred, average='weighted')]
dt_scores = [accuracy_score(y_test, y_pred_dt),
precision_score(y_test, y_pred_dt, average='weighted'),
recall_score(y_test, y_pred_dt, average='weighted'),
f1_score(y_test, y_pred_dt, average='weighted')]
# Long-form frame so seaborn draws grouped (side-by-side) bars via hue;
# overlaying two barplot calls at the same x positions would hide one model
scores_df = pd.DataFrame({
    'Metric': metrics * 2,
    'Score': rf_scores + dt_scores,
    'Model': ['Random Forest'] * 4 + ['Decision Tree'] * 4,
})
# Set Seaborn style
sns.set_style("whitegrid")
fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(data=scores_df, x='Metric', y='Score', hue='Model',
            palette=['skyblue', 'salmon'], ax=ax)
ax.set_ylabel('Scores', fontsize=12)
ax.set_title('Model Comparison', fontsize=14, fontweight='bold')
ax.legend(fontsize=10)
# Remove spines
sns.despine()
# Add data labels above each bar
for p in ax.patches:
    ax.annotate(f'{p.get_height():.2f}', (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', fontsize=10, color='black', xytext=(0, 5),
                textcoords='offset points')
# Adjust layout for better spacing
plt.tight_layout()
plt.show()
Overview:¶
The random forest classifier consistently performed best across metrics, with an accuracy score reaching up to 0.36 after feature engineering. Introducing a new feature, Engagement_Rate (calculated as the sum of likes, shares, and comments divided by views), slightly improved performance, suggesting it helped capture information that individual features missed.
Here’s a breakdown of results:
Random Forest (post-feature engineering):
- Accuracy: ~0.333–0.36
- Strength: Captured complex feature interactions and minimized overfitting
- Weakness: Still struggled to accurately classify medium and high engagement posts

Decision Tree:
- Accuracy: ~0.325
- Strength: Easy to interpret and revealed key features like views, shares, comments, and likes
- Weakness: Lower overall performance, more prone to overfitting

K-Nearest Neighbors (k=5):
- Accuracy: ~0.307
- Strength: Simple and good baseline
- Weakness: Sensitive to k value; didn’t perform well with noisy or unscaled data
The confusion matrices and classification reports revealed that all models struggled to distinguish between medium and high engagement levels, likely due to overlapping characteristics. In contrast, low engagement posts were easier to classify correctly.
Answering our original question, “Can we predict whether a social media post will be highly engaging based on its metrics?”, our models suggest partial predictability. Engagement is influenced by more than just quantitative metrics; factors like posting time, influencer status, and platform algorithm trends (which weren’t available in our dataset) also likely play a major role.
Future improvements could include:
- Applying feature scaling (especially for KNN)
- Using ensemble models or gradient boosting
- Collecting more diverse data, including text or image features from the posts
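The gradient-boosting suggestion above could be sketched with scikit-learn's GradientBoostingClassifier. The example below runs on synthetic stand-in data (make_classification, not our engagement dataset), so it only illustrates the technique, not our actual results:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the engagement dataset
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Boosting builds trees sequentially, each one correcting the errors of the
# ensemble so far, in contrast to the independently grown trees of a random forest
gb = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)
print(round(gb.score(X_te, y_te), 3))
```

Whether boosting would beat the Random Forest on our data is an open question; it tends to help on tabular problems but needs tuning of learning rate and tree depth.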
Conclusion¶
This project gave us a deeper understanding of what makes social media posts go viral, but it also showed how difficult it is to predict virality. We found that engagement metrics like likes, shares, comments, and views play a big role in determining how well a post performs. However, we struggled to accurately predict posts with medium or high engagement, even with different models like Random Forest, Decision Tree, and K-Nearest Neighbors. Random Forest gave us the best results, though the accuracy wasn’t super high, which showed that working with data on viral trends is tricky. We also learned that adding features like ‘Engagement_Rate’ helped improve our model slightly, proving that good feature engineering is key. While we gained some useful insights, predicting virality is still really hard due to the many unpredictable factors that influence trends. Having more data or adding factors like timing or influencer status might help us improve predictions.
Storytelling¶
During the course of this project, we examined various factors to analyze the underlying patterns in viral trends across social media platforms, including TikTok, Instagram, Twitter, and YouTube. Our group focused on post engagement, content type, and region to uncover the key factors that contribute to a trend going viral. We aimed to develop a model that accurately predicts whether a post will go viral, helping content creators and marketers optimize their strategies. Our primary question was: What characteristics of a post are most strongly associated with high engagement, and can we predict whether a post will go viral based on these factors? To address this, we conducted classification and clustering analyses to understand the underlying patterns in the data.
Using classification techniques, we achieved an accuracy score of approximately 32.9% with a Random Forest Classifier. While modest, this score slightly outperformed the Decision Tree Classifier, which achieved an accuracy of 32.1%. Random Forest outperformed Decision Tree by approximately 0.8%, showing that ensemble methods and smarter feature engineering can lead to more reliable predictions — even when working with highly chaotic data like viral trends.
The relatively low overall accuracy reflects the inherent complexity and randomness of social media virality. Factors such as algorithm behavior, pop culture moments, trending topics, world events, and user behavior patterns all play critical roles — variables that are difficult to capture through basic post features alone. Although our model was able to learn some patterns (particularly distinguishing low-engagement posts), it struggled to reliably differentiate between medium and high engagement levels. This indicates that future modeling efforts would benefit from richer feature engineering — such as including sentiment analysis, posting time, influencer status, and trending hashtag usage — to better capture the nuanced realities of viral success.
Examining the confusion matrix for the Random Forest model, we observed that it correctly predicted Low Engagement posts more consistently than other categories. However, there was still significant confusion between Medium and High Engagement levels, with the model often misclassifying Medium posts as either Low or High, and vice versa. This suggests that while Low Engagement posts are easier to identify, Medium and High posts have more overlapping features, making them harder to separate.
In comparison, the Decision Tree confusion matrix showed heavier confusion across all engagement levels, particularly mistaking High Engagement posts as Low. While both models struggled to clearly separate Medium and High Engagement posts, Random Forest achieved more reliable and stable performance overall, reaffirming its advantage as a more robust and effective classifier.
An analysis of feature importance for both models revealed that shares, likes, comments, and views were the most influential factors in predicting engagement levels, while content type and platform had significantly less impact. This highlights that direct user interactions are more critical to a post's success than the medium or platform where it is posted. Notably, both models identified shares as the single most important feature, reinforcing the idea that active user participation in promoting content significantly drives engagement.
To improve the model's predictive performance, we engineered a new variable called ‘Engagement_Rate’, capturing the proportion of likes, shares, and comments relative to views. By introducing this normalized measure of post interaction, the model was able to slightly improve its performance, demonstrating the importance of thoughtful feature engineering when working with engagement data.
We also explored the K-Nearest Neighbors (KNN) algorithm, adjusting k-values from 1 to 20 to find the best fit. At low k-values (1–5), KNN exhibited chaotic behavior — accuracy swung wildly around 30–32%, showing the dangers of trusting too few neighbors. It was like predicting viral trends by asking a handful of random people. As k increased beyond 14, KNN stabilized, with accuracy climbing toward 34% at k=20. Larger k-values helped the model smooth out noise and find better patterns. However, even at its peak, KNN couldn’t outperform Random Forest. This experience taught us that proximity alone isn't enough — predicting virality requires understanding broader patterns like momentum, timing, and culture, not just surface-level similarity.
In the end, while the K-Nearest Neighbors (KNN) algorithm improved with more neighbors, ensemble models proved to be the stronger and more reliable choice for navigating the unpredictable waves of social media. The features most associated with high engagement include shares, likes, comments, and views for both models. Predicting whether a post will go viral is partially achievable, especially for low-engagement posts. However, further analysis on predicting medium and high engagement was challenging due to the random and complex nature of virality.
Impact¶
The impact of this project is that it offers meaningful insights into the patterns and features that drive the virality of various social media posts. Identifying which post characteristics, such as content types and hashtags, are linked to higher or lower engagement can guide social media creators to optimize their content for maximum engagement, which can be a great tool for influencers and promotional marketers. The findings can also shed light on what makes posts go viral across different social media platforms, so we can decipher the different trends that may be going on in each. However, the data available may encourage the manipulation of content just purely for engagement purposes, which can lead to a spread of low-quality or misleading content. Heavily focusing on our algorithms' predictions for creating content may also reduce the authenticity and creativity of content online that we truly appreciate. Balancing data-driven insights with ethical and inspired content creation can mitigate these risks and help encourage this data to be used in a positive light to help creators and marketers manage engagement for their content on their platforms in our fast-moving digital landscape.