Final Project - Group 5¶

Introduction¶

In today’s world, social media plays a huge role in how we communicate, share ideas, and interact with content. Every day, certain posts suddenly gain massive attention and spread across platforms like TikTok, Instagram, Twitter, and YouTube. Our project explores this phenomenon by examining engagement data across several platforms to determine what makes a social media post go viral, and by building models that can help predict the virality of a post.

Our question focuses on what characteristics of a post are most associated with high engagement, and whether we can use those characteristics to predict if a post will go viral. We are using a dataset from Kaggle called “Viral Social Media Trends & Engagement Analysis” to answer this question. It includes 5,000 social media posts and provides details like the platform being used, the hashtag used, the type of content, the region, and engagement metrics such as views, likes, shares, and comments. It also includes an engagement level of low, medium, or high.

This project explores the dataset to find patterns and trends that can help us answer the question. Our approach includes data cleaning, exploratory visualization, and classification modeling to see which kinds of posts tend to go viral. This project provides useful insights for content creators, marketers, and anyone looking to understand how viral posts work.


Our Data¶

The dataset used for this project is “Viral Social Media Trends & Engagement Analysis” from Kaggle. It contains 5,000 unique entries representing posts from platforms such as TikTok, Instagram, Twitter, and YouTube. We chose this dataset because it includes key attributes of viral social media posts, including user engagement metrics and post characteristics, which help us answer our question.

Each row in the dataset represents one post and includes ten features: a post ID, the platform it was posted on, the hashtag used, the content type (video, tweet, reel, or livestream), and the region where it was posted. Additionally, the dataset includes engagement metrics like views, likes, shares, and comments. Another useful column is the engagement level, which labels each post’s engagement as either low, medium, or high; this can help us train our models for classification. Looking at the dataset as a whole, YouTube was the most represented platform, the most common content types were livestreams and standard posts, and the engagement levels were evenly distributed between low, medium, and high.

Overall, the dataset helps us explore what makes a post go viral. It includes data from many different platforms, content types, and engagement metrics that can aid us when we look for patterns and relationships using classification techniques.


We will begin by importing our dataset and assigning it to the pandas DataFrame variable 'trends'. This allows for easy manipulation and analysis of the data. Then, by running head() on our dataset, we get a sneak peek at what we are working with, such as noting missing values or inconvenient variable types.

In [246]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [247]:
trends = pd.read_csv('Viral_Social_Media_Trends.csv')
In [248]:
trends.head()
Out[248]:
Post_ID Platform Hashtag Content_Type Region Views Likes Shares Comments Engagement_Level
0 Post_1 TikTok #Challenge Video UK 4163464 339431 53135 19346 High
1 Post_2 Instagram #Education Shorts India 4155940 215240 65860 27239 Medium
2 Post_3 Twitter #Challenge Video Brazil 3666211 327143 39423 36223 Medium
3 Post_4 YouTube #Education Shorts Australia 917951 127125 11687 36806 Low
4 Post_5 TikTok #Dance Post Brazil 64866 171361 69581 6376 Medium

Data Visualizations for Data Understanding¶

In [249]:
# Visualization 1: Distribution of posts across different platforms
plt.figure(figsize=(10, 6))
sns.countplot(data=trends, x='Platform', palette='viridis', hue='Platform')
plt.title('Distribution of Posts Across Different Platforms')
plt.xlabel('Platform')
plt.ylabel('Number of Posts')
plt.show()
In [250]:
# Visualization 2: Distribution of engagement levels
plt.figure(figsize=(10, 6))
sns.countplot(data=trends, x='Engagement_Level', palette='viridis', hue='Engagement_Level')
plt.title('Distribution of Engagement Levels')
plt.xlabel('Engagement Level')
plt.ylabel('Number of Posts')
plt.show()
In [251]:
# Visualization 3: Average views, likes, shares, and comments by platform
metrics = ['Views', 'Likes', 'Shares', 'Comments']
for metric in metrics:
  plt.figure(figsize=(10, 6))
  sns.barplot(data=trends, x='Platform', y=metric, palette='viridis' , hue='Platform')
  plt.title(f'Average {metric} by Platform')
  plt.xlabel('Platform')
  plt.ylabel(f'Average {metric}')
  plt.show()
In [252]:
# Visualization 4: Average views, likes, shares, and comments by engagement level
for metric in metrics:
  plt.figure(figsize=(10, 6))
  sns.barplot(data=trends, x='Engagement_Level', y=metric, palette='viridis', hue='Engagement_Level')
  plt.title(f'Average {metric} by Engagement Level')
  plt.xlabel('Engagement Level')
  plt.ylabel(f'Average {metric}')
  plt.show()
In [253]:
# Visualization 5: Correlation heatmap of engagement metrics
plt.figure(figsize=(10, 6))
sns.heatmap(trends[metrics].corr(), annot=True, cmap='viridis', vmin=-1, vmax=1)
plt.title('Correlation Heatmap of Engagement Metrics')
plt.show()

Pre-processing¶

Before using our data within our models, we need to transform features that may have null or categorical values. From running the info() function, we can see that our dataset has no missing values and four numerical columns; the other six columns are categorical.

In [254]:
trends.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Post_ID           5000 non-null   object
 1   Platform          5000 non-null   object
 2   Hashtag           5000 non-null   object
 3   Content_Type      5000 non-null   object
 4   Region            5000 non-null   object
 5   Views             5000 non-null   int64 
 6   Likes             5000 non-null   int64 
 7   Shares            5000 non-null   int64 
 8   Comments          5000 non-null   int64 
 9   Engagement_Level  5000 non-null   object
dtypes: int64(4), object(6)
memory usage: 390.8+ KB

Below, we will use one-hot encoding to turn our categorical values into multiple binary columns. We do this because our model cannot process categorical text directly, so one-hot encoding translates it into numeric form. We also drop the 'Post_ID' column because it does not carry any meaningful information for predicting engagement. Finally, we convert our target column from text ('High', 'Medium', and 'Low') to numerical values (2, 1, 0).

Once we completed these pre-processing steps, our dataset has expanded from 10 columns to 29. This way our model can now identify the different values without misunderstanding the data.

In [255]:
trends = pd.get_dummies(trends, columns=['Platform', 'Content_Type', 'Region', 'Hashtag'], drop_first=True)

trends = trends.drop(columns=['Post_ID'])

trends['Engagement_Level'] = trends['Engagement_Level'].map({'High': 2, 'Medium':1, 'Low':0})
In [256]:
trends.head()
Out[256]:
Views Likes Shares Comments Engagement_Level Platform_TikTok Platform_Twitter Platform_YouTube Content_Type_Post Content_Type_Reel ... Region_USA Hashtag_#Comedy Hashtag_#Dance Hashtag_#Education Hashtag_#Fashion Hashtag_#Fitness Hashtag_#Gaming Hashtag_#Music Hashtag_#Tech Hashtag_#Viral
0 4163464 339431 53135 19346 2 True False False False False ... False False False False False False False False False False
1 4155940 215240 65860 27239 1 False False False False False ... False False False True False False False False False False
2 3666211 327143 39423 36223 1 False True False False False ... False False False False False False False False False False
3 917951 127125 11687 36806 0 False False True False False ... False False False True False False False False False False
4 64866 171361 69581 6376 1 True False False True False ... False False True False False False False False False False

5 rows × 29 columns

Modeling¶

Classification Predicting Engagement Level¶

In [257]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

Before we run our model, we will prepare our data by separating the features from the target variable. Specifically, we drop the 'Engagement_Level' column from the dataset to create our feature matrix 'X' and store our target labels in 'y'. We then split the data into training and testing sets using an 80-20 split. Setting the random state to 42 ensures that the split is reproducible.

We will use a Random Forest Classifier, a method that builds multiple decision trees and merges their predictions to get a more accurate result.

In [258]:
X = trends.drop(columns=['Engagement_Level'])
y = trends['Engagement_Level']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 42)

rf = RandomForestClassifier()
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)

Feature Importance and New Feature Engineering¶

To gain more insight, we examined the feature importances provided by the Random Forest model. We found that the most important features were

  • Likes
  • Shares
  • Views
  • Comments

Taking this into account, we predict that engagement might be better captured not just by individual counts but by how interactions relate to the views of a post.

In [259]:
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
importances.sort_values(ascending=False).head(10).plot(kind='barh')
plt.title('Top 10 Important Features')
plt.show()

Therefore, we created a new feature called 'Engagement_Rate', with the calculation shown below. The idea is that posts with high engagement relative to their number of views might correlate with a higher engagement level and give our model clearer, more precise predictive ability.

In [260]:
trends['Engagement_Rate'] = (trends['Likes'] + trends['Shares'] + trends['Comments']) / trends['Views']
In [261]:
X = trends.drop(columns=['Engagement_Level'])
y = trends['Engagement_Level']

X_train_Random2, X_test_Random2, y_train_Random2, y_test_Random2 = train_test_split(X, y, test_size=0.2, random_state = 42)

rf = RandomForestClassifier()
rf.fit(X_train_Random2, y_train_Random2)

y_pred_Random2 = rf.predict(X_test_Random2)

Decision Tree¶

Decision trees are a popular choice for classification because of their interpretability. For this project, we chose this model because it works well for classifying posts into categories while also revealing which features contribute most to high engagement. Overall, we saw that views, shares, comments, and likes contributed most to engagement level.

In [262]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

y_pred_dt = dt.predict(X_test)
In [263]:
importances= pd.Series(dt.feature_importances_, index=X_train.columns)
importances.sort_values(ascending=False).head(10).plot(kind='barh')
plt.title('Top 10 Important Features - Decision Tree')
plt.show()

Feature Importance: Insights from the Decision Tree¶

After evaluating the decision tree results, it seems that these fields contributed the most:

  1. Views: Higher view counts generally correlate with increased engagement, reflecting broader reach and exposure.
  2. Shares: Posts with more shares tend to have higher engagement, indicating active user participation in spreading content.
  3. Comments: Increased comments signify direct interaction and discussion around a post, contributing to higher engagement.
  4. Likes: Increased likes on a post also contributed to higher engagement.

These findings emphasize the importance of creating content that attracts views, encourages sharing, and fosters active discussions to maximize social media engagement.

KNN Classification¶

K-Nearest Neighbors (KNN) was chosen for this classification task because it is simple, easy to interpret, and effective as a baseline model. KNN does not assume any specific data distribution and can naturally handle multiclass classification problems like engagement levels.

In [264]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap


X = trends.drop(columns=['Engagement_Level'])
y = trends['Engagement_Level']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # Added random_state for reproducibility

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
Out[264]:
KNeighborsClassifier()
In [265]:
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_prediction = knn.predict(X_test)

Evaluation¶

To evaluate our models’ performance in classifying post engagement levels (low, medium, high), we used common classification metrics: accuracy, precision, recall, and F1-score. Our goal was to identify patterns in social media posts that could help predict virality, so we focused on how well our models could classify each engagement level rather than just overall accuracy.
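As a quick illustration of how these metrics are defined for a three-class problem, the sketch below computes them by hand on toy labels (purely illustrative; the labels are made up and not drawn from our dataset):

```python
# Toy 3-class labels (0 = Low, 1 = Medium, 2 = High) -- made up for
# illustration, not taken from our dataset.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Accuracy: the fraction of posts labeled correctly overall
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Precision, recall, and F1 are computed per class from the counts of
# true positives (tp), false positives (fp), and false negatives (fn)
for c in (0, 1, 2):
    tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
    fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
    fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    print(f"class {c}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

print(f"accuracy={accuracy:.2f}")
```

scikit-learn's classification_report, used in this section, reports exactly these per-class values plus their averages.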

The Random Forest Classifier¶

In [266]:
print("Accuracy:", accuracy_score(y_test, y_pred))
Accuracy: 0.333

To evaluate the model's performance we will use accuracy as the metric. This model produced an accuracy between 0.319 and 0.333 depending on the run. Since the three engagement levels are evenly distributed, random guessing would also score about 0.333, so there is a lot of room for improvement.

To analyze the results a bit more, we will use a confusion matrix. The rows correspond to the true labels and the columns correspond to the predicted labels. Ideally, we want high numbers on the diagonal (correct predictions) and low numbers elsewhere.

In the following matrix, we can see the model struggles to correctly classify the engagement levels. Since the misclassifications are frequent across all classes, this suggests that the classes are difficult for the model to distinguish based on the original features.
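As a tiny illustration (toy labels, not our data), a 3x3 confusion matrix can be built by counting (true, predicted) pairs; row i, column j counts posts of true class i predicted as class j:

```python
# Toy labels for three engagement classes (0 = Low, 1 = Medium, 2 = High);
# made up for illustration, not taken from our dataset.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 2, 2, 2]

# Row = true class, column = predicted class
cm = [[0] * 3 for _ in range(3)]
for t, p in zip(y_true, y_pred):
    cm[t][p] += 1

for row in cm:
    print(row)

# The diagonal entries cm[i][i] count correct predictions; everything
# off the diagonal is a misclassification.
```

sklearn.metrics.confusion_matrix, used in the cells below, produces the same array.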

In [267]:
confm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(8,6))
sns.heatmap(confm, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

Random Forest - Post Feature Engineering¶

In [268]:
print("Accuracy Score:", accuracy_score(y_test_Random2, y_pred_Random2))
Accuracy Score: 0.318

After creating the 'Engagement_Rate' feature and retraining the random forest model using the same process, the resulting accuracy improved slightly, from an initial range of 0.319-0.333 to a new range of 0.32-0.36 depending on the run.

Although the improvement was small, this suggests that the new 'Engagement_Rate' feature helped the model capture some additional information that the individual features alone did not fully express.

In [269]:
# This is from the Random Forest model from when we ran it again after feature engineering
confm = confusion_matrix(y_test_Random2, y_pred_Random2)


plt.figure(figsize=(8,6))
sns.heatmap(confm, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
In [270]:
# This is from the Random Forest model from when we ran it again after feature engineering

report = classification_report(y_test_Random2, y_pred_Random2, output_dict=True)
report_df = pd.DataFrame(report).transpose()

plt.figure(figsize=(10,6))
sns.heatmap(report_df.iloc[:-1, :-1], annot=True, cmap='YlGnBu', fmt='.2f')
plt.title('Classification Report')
plt.show()

Decision Tree¶

In [271]:
print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred_dt))
Decision Tree Accuracy: 0.334

The accuracy of the decision tree was around 0.32-0.33 depending on the run. Since random guessing among three balanced classes would also score about 0.333, the model performs at roughly chance level here.

In [272]:
report_df = pd.DataFrame(classification_report(y_test, y_pred_dt, output_dict=True)).T

report_df = report_df.drop(columns=['support'], errors='ignore')

plt.figure(figsize=(8, 6))
sns.heatmap(report_df, annot=True, cmap="Blues", fmt=".2f")
plt.title("Classification Report Heatmap - Decision Tree")
plt.show()

The classification report presents key performance metrics of the Decision Tree, helping us understand its effectiveness at predicting Engagement Level.

The next portion is the evaluation of the confusion matrix, a 3x3 table that compares the actual engagement labels to the predicted labels. The matrix shows how often each engagement level is confused with the others; as discussed later in the report, the decision tree shows heavy confusion across all levels, particularly mistaking high-engagement posts for low.

In [273]:
confm_dt = confusion_matrix(y_test, y_pred_dt)
plt.figure(figsize=(8, 6))
sns.heatmap(confm_dt, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Decision Tree Confusion Matrix')
plt.show()

KNN¶

After training the model with k=5 neighbors, it achieved an accuracy of approximately 0.307. This relatively low score suggests that the features used may not strongly predict engagement, and that improvements like feature scaling, hyperparameter tuning, or using more neighbors could help.

In [274]:
print(f"Accuracy score: {accuracy_score(y_test, y_prediction)}")
Accuracy score: 0.307
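One improvement worth sketching is feature scaling: KNN relies on Euclidean distance, so a raw Views column in the millions will drown out the 0/1 dummy columns. Below is a minimal sketch using a scikit-learn Pipeline with StandardScaler on synthetic stand-in data (the names X_demo and y_demo are ours, not from the project):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in: one large-magnitude column (like Views) alongside
# small binary columns (like the one-hot dummies)
rng = np.random.default_rng(42)
X_demo = np.column_stack([
    rng.integers(1_000, 5_000_000, 300),   # raw view counts
    rng.integers(0, 2, (300, 3)),          # one-hot style columns
])
y_demo = rng.integers(0, 3, 300)           # 3 engagement classes

# The scaler gives every column zero mean and unit variance, so no
# single feature dominates the distance computation
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=5)),
])
pipe.fit(X_demo, y_demo)
acc = pipe.score(X_demo, y_demo)
print("demo accuracy:", round(acc, 3))
```

On our actual split, the same pipeline would simply replace the bare KNeighborsClassifier used above.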

To better understand how the choice of k affects model performance, we plotted the accuracy of the K-Nearest Neighbors (KNN) classifier for different values of k ranging from 1 to 20. The graph shows that accuracy fluctuates across different k values, but generally trends upward as k increases. Lower k values tend to result in more variance and slightly lower accuracy, while higher k values produce more stable and higher accuracy scores, peaking around k = 20 with an accuracy slightly above 34%. This suggests that for this dataset, using a higher number of neighbors leads to better generalization, although overall accuracy remains relatively modest.

In [275]:
k_values = range(1, 21)
accuracy_scores = []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred_k = knn.predict(X_test)
    accuracy_scores.append(accuracy_score(y_test, y_pred_k))


plt.figure(figsize=(8, 5))
plt.plot(k_values, accuracy_scores, marker='o', linestyle='dashed', color='b')
plt.xlabel("Number of Neighbors (k)")
plt.ylabel("Accuracy Score")
plt.title("KNN Accuracy vs. k-Values")
plt.xticks(k_values)
plt.show()

The classification report presents key performance metrics of the KNN model, helping us understand its effectiveness at predicting Engagement Level.

In [276]:
report_df = pd.DataFrame(classification_report(y_test, y_prediction, output_dict=True)).T

report_df = report_df.drop(columns=['support'], errors='ignore')

plt.figure(figsize=(8, 6))
sns.heatmap(report_df, annot=True, cmap="Blues", fmt=".2f")
plt.title("Classification Report Heatmap -  KNN test")
plt.show()

The next portion is the evaluation of the confusion matrix, a 3x3 table that compares the actual engagement labels to the KNN model's predicted labels, showing how often each engagement level is confused with the others.

In [277]:
confm_knn = confusion_matrix(y_test, y_prediction)
plt.figure(figsize=(8, 6))
sns.heatmap(confm_knn, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('KNN Confusion Matrix')
plt.show()

Model Comparison¶

This bar chart compares the performance of the Random Forest and Decision Tree models across four evaluation metrics: Accuracy, Precision, Recall, and F1-score. Both models achieved very similar results across all metrics, with Random Forest slightly outperforming Decision Tree in each category. This suggests that while both models handle the classification task comparably, Random Forest provides a small but consistent improvement in predictive performance, likely due to its ability to reduce overfitting by averaging multiple decision trees.

In [278]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import matplotlib.pyplot as plt
import numpy as np


metrics = ['Accuracy', 'Precision', 'Recall', 'F1-score']
rf_scores = [accuracy_score(y_test, y_pred),
            precision_score(y_test, y_pred, average='weighted'),
            recall_score(y_test, y_pred, average='weighted'),
            f1_score(y_test, y_pred, average='weighted')]
dt_scores = [accuracy_score(y_test, y_pred_dt),
            precision_score(y_test, y_pred_dt, average='weighted'),
            recall_score(y_test, y_pred_dt, average='weighted'),
            f1_score(y_test, y_pred_dt, average='weighted')]

x = np.arange(len(metrics))
width = 0.35

fig, ax = plt.subplots(figsize=(10, 6))
rects1 = ax.bar(x - width/2, rf_scores, width, label='Random Forest')
rects2 = ax.bar(x + width/2, dt_scores, width, label='Decision Tree')

ax.set_ylabel('Scores')
ax.set_title('Model Comparison')
ax.set_xticks(x)
ax.set_xticklabels(metrics)
ax.legend()

fig.tight_layout()
plt.show()
In [279]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns  # Seaborn for styling


metrics = ['Accuracy', 'Precision', 'Recall', 'F1-score']
rf_scores = [accuracy_score(y_test, y_pred),
            precision_score(y_test, y_pred, average='weighted'),
            recall_score(y_test, y_pred, average='weighted'),
            f1_score(y_test, y_pred, average='weighted')]
dt_scores = [accuracy_score(y_test, y_pred_dt),
            precision_score(y_test, y_pred_dt, average='weighted'),
            recall_score(y_test, y_pred_dt, average='weighted'),
            f1_score(y_test, y_pred_dt, average='weighted')]

# Long-format DataFrame so Seaborn draws the two models as grouped
# (side-by-side) bars instead of overlapping them
scores_df = pd.DataFrame({
    'Metric': metrics * 2,
    'Score': rf_scores + dt_scores,
    'Model': ['Random Forest'] * len(metrics) + ['Decision Tree'] * len(metrics),
})

# Set Seaborn style
sns.set_style("whitegrid")

fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(data=scores_df, x='Metric', y='Score', hue='Model',
            palette=['skyblue', 'salmon'], ax=ax)

ax.set_ylabel('Scores', fontsize=12)
ax.set_title('Model Comparison', fontsize=14, fontweight='bold')
ax.legend(fontsize=10)

# Remove spines
sns.despine()

# Add data labels above each bar
for p in ax.patches:
    ax.annotate(f'{p.get_height():.2f}', (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', fontsize=10, color='black', xytext=(0, 5),
                textcoords='offset points')

# Adjust layout for better spacing
plt.tight_layout()
plt.show()

Overview:¶

The random forest classifier consistently performed best across metrics, with an accuracy score reaching up to 0.36 after feature engineering. Introducing a new feature, Engagement_Rate (calculated as the sum of likes, shares, and comments divided by views), slightly improved performance, suggesting it helped capture information that individual features missed.

Here’s a breakdown of results:

Random Forest (post-feature engineering):

  • Accuracy: ~0.333–0.36

  • Strength: Captured complex feature interactions and minimized overfitting

  • Weakness: Still struggled to accurately classify medium and high engagement posts

Decision Tree:

  • Accuracy: ~0.325

  • Strength: Easy to interpret and revealed key features like views, shares, comments, and likes

  • Weakness: Lower overall performance, more prone to overfitting

K-Nearest Neighbors (k=5):

  • Accuracy: ~0.307

  • Strength: Simple and good baseline

  • Weakness: Sensitive to k value; didn’t perform well with noisy or unscaled data

The confusion matrices and classification reports revealed that all models struggled to distinguish between medium and high engagement levels, likely due to overlapping characteristics. In contrast, low engagement posts were easier to classify correctly.

Answering our original question: “Can we predict whether a social media post will be highly engaging based on its metrics?”, our models suggest partial predictability. Engagement is influenced by more than just quantitative metrics, factors like content type, timing, and platform trends (which weren’t available in our dataset) also likely play a major role.

Future improvements could include:

  • Applying feature scaling (especially for KNN)

  • Using ensemble models or gradient boosting

  • Collecting more diverse data, including text or image features from the posts
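To make the gradient-boosting suggestion concrete, here is a minimal sketch using scikit-learn's GradientBoostingClassifier on synthetic stand-in data (X_demo, y_demo, and the signal construction are our own illustrative assumptions, not the Kaggle data):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in features with a weak signal in the first two columns
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 5))
y_demo = ((X_demo[:, 0] + X_demo[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)
          + (X_demo[:, 0] > 1).astype(int))   # yields classes 0, 1, 2

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Boosting fits shallow trees sequentially, each one correcting the
# errors of the ensemble so far
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
gb.fit(X_tr, y_tr)
acc = gb.score(X_te, y_te)
print("demo accuracy:", round(acc, 3))
```

On our dataset, this classifier would slot into the same train/test workflow we used for the Random Forest.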

Conclusion¶

This project gave us a deeper understanding of what makes social media posts go viral, but it also showed how difficult it is to predict virality. We found that engagement metrics like likes, shares, comments, and views play a big role in determining how well a post performs. However, we struggled to accurately predict posts with medium or high engagement, even with different models like Random Forest, Decision Tree, and K-Nearest Neighbors. Random Forest gave us the best results, though the accuracy wasn’t super high, which showed that working with data on viral trends is tricky. We also learned that adding features like ‘Engagement_Rate’ helped improve our model slightly, proving that good feature engineering is key. While we gained some useful insights, predicting virality is still really hard due to the many unpredictable factors that influence trends. Having more data or adding factors like timing or influencer status might help us improve predictions.

Storytelling¶

During the course of this project, we examined various factors to analyze the underlying patterns in viral trends across social media platforms, including Twitter and Instagram. Our group focused on post engagement, content type, and region to uncover the key factors that contribute to a trend going viral. We aimed to develop a model that accurately predicts whether a post will go viral, helping content creators and marketers optimize their strategies. Our primary question was: What characteristics of a post are most strongly associated with high engagement, and can we predict whether a post will go viral based on these factors? To address this, we conducted classification analyses to understand the underlying patterns in the data.

Using classification techniques, we achieved an accuracy score of approximately 32.9% with a Random Forest Classifier. While modest, this score slightly outperformed the Decision Tree Classifier, which achieved an accuracy of 32.1%. Random Forest outperformed Decision Tree by approximately 0.8%, showing that ensemble methods and smarter feature engineering can lead to more reliable predictions — even when working with highly chaotic data like viral trends.

The relatively low overall accuracy reflects the inherent complexity and randomness of social media virality. Factors such as algorithm behavior, pop culture moments, trending topics, world events, and user behavior patterns all play critical roles — variables that are difficult to capture through basic post features alone. Although our model was able to learn some patterns (particularly distinguishing low-engagement posts), it struggled to reliably differentiate between medium and high engagement levels. This indicates that future modeling efforts would benefit from richer feature engineering — such as including sentiment analysis, posting time, influencer status, and trending hashtag usage — to better capture the nuanced realities of viral success.

Examining the confusion matrix for the Random Forest model, we observed that it correctly predicted Low Engagement posts more consistently than other categories. However, there was still significant confusion between Medium and High Engagement levels, with the model often misclassifying Medium posts as either Low or High, and vice versa. This suggests that while Low Engagement posts are easier to identify, Medium and High posts have more overlapping features, making them harder to separate.

In comparison, the Decision Tree confusion matrix showed heavier confusion across all engagement levels, particularly mistaking High Engagement posts as Low. While both models struggled to clearly separate Medium and High Engagement posts, Random Forest achieved more reliable and stable performance overall, reaffirming its advantage as a more robust and effective classifier.

An analysis of feature importance for both models revealed that shares, likes, comments, and views were the most influential factors in predicting engagement levels, while content type and platform had significantly less impact. This highlights that direct user interactions are more critical to a post's success than the medium or platform where it is posted. Notably, both models identified shares as the single most important feature, reinforcing the idea that active user participation in promoting content significantly drives engagement.

To improve the model's predictive performance, we engineered a new variable called ‘Engagement_Rate’, capturing the proportion of likes, shares, and comments relative to views. By introducing this normalized measure of post interaction, the model was able to slightly improve its performance, demonstrating the importance of thoughtful feature engineering when working with engagement data.

We also explored the K-Nearest Neighbors (KNN) algorithm, adjusting k-values from 1 to 20 to find the best fit. At low k-values (1–5), KNN exhibited chaotic behavior — accuracy swung wildly around 30–32%, showing the dangers of trusting too few neighbors. It was like predicting viral trends by asking a handful of random people. As k increased beyond 14, KNN stabilized, with accuracy climbing toward 34% at k=20. Larger k-values helped the model smooth out noise and find better patterns. However, even at its peak, KNN couldn’t outperform Random Forest. This experience taught us that proximity alone isn't enough — predicting virality requires understanding broader patterns like momentum, timing, and culture, not just surface-level similarity.

In the end, while the K-Nearest Neighbors (KNN) algorithm improved with more neighbors, ensemble models proved to be the stronger and more reliable choice for navigating the unpredictable waves of social media. The features most associated with high engagement include shares, likes, comments, and views for both models. Predicting whether a post will go viral is partially achievable, especially for low-engagement posts. However, further analysis on predicting medium and high engagement was challenging due to the random and complex nature of virality.

Impact¶

The impact of this project is that it offers meaningful insights into the patterns and features that drive the virality of various social media posts. Identifying which post characteristics, such as content types and hashtags, are linked to higher or lower engagement can guide social media creators to optimize their content for maximum engagement, which can be a great tool for influencers and promotional marketers. The findings can also shed light on what makes posts go viral across different social media platforms, so we can decipher the different trends that may be going on in each. However, the data available may encourage the manipulation of content just purely for engagement purposes, which can lead to a spread of low-quality or misleading content. Heavily focusing on our algorithms' predictions for creating content may also reduce the authenticity and creativity of content online that we truly appreciate. Balancing data-driven insights with ethical and inspired content creation can mitigate these risks and help encourage this data to be used in a positive light to help creators and marketers manage engagement for their content on their platforms in our fast-moving digital landscape.