Project 4 College Majors and Clustering¶
The objective of this project is to work with a dataset of college majors and their corresponding statistics. The goal is to perform clustering on the data to identify groups of similar majors based on their characteristics. The project involves data preparation, clustering, and visualization of the results.
The dataset used in this project is the FiveThirtyEight College Majors Dataset on Kaggle, which contains employment and salary statistics for various college majors. For the scope of this project, I will specifically use the all-ages.csv file to cluster college majors based on their economic and employment metrics, grouping similar fields to see which ones have common outcomes across all age groups.
The dataset includes the following columns:
- Major_code: The code assigned to the major.
- Major: The name of the major.
- Major_category: The category of the major.
- Total: The total number of people with the major.
- Employed: The number of people employed in the major.
- Employed_full_time_year_round: The number of people employed full-time year-round in the major.
- Unemployed: The number of people unemployed in the major.
- Unemployment_rate: The share of the major's labor force that is unemployed, Unemployed / (Employed + Unemployed).
- Median: The median salary of people in the major.
- P25th: The 25th percentile salary of people in the major.
- P75th: The 75th percentile salary of people in the major.
Overall Problem¶
The overall problem is to identify clusters of college majors based on their employment and salary statistics. By clustering the majors, we can gain insights into which majors have similar outcomes and characteristics. This can help students and educators understand the job market and make informed decisions about their education and career paths.
The overall questions to be solved are:
- What are the clusters of college majors based on their employment and salary statistics? (overall question)
- How do the clusters relate to each other?
- What insights can be gained from the clustering analysis?
- How can the clusters be visualized to aid in understanding the data?
Importing Libraries and Dependencies¶
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import scipy.cluster.hierarchy as shc
from sklearn.decomposition import PCA
Loading the Dataset¶
df = pd.read_csv('data/all-ages.csv')
print(df)
Major_code Major \
0 1100 GENERAL AGRICULTURE
1 1101 AGRICULTURE PRODUCTION AND MANAGEMENT
2 1102 AGRICULTURAL ECONOMICS
3 1103 ANIMAL SCIENCES
4 1104 FOOD SCIENCE
.. ... ...
168 6211 HOSPITALITY MANAGEMENT
169 6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS
170 6299 MISCELLANEOUS BUSINESS & MEDICAL ADMINISTRATION
171 6402 HISTORY
172 6403 UNITED STATES HISTORY
Major_category Total Employed \
0 Agriculture & Natural Resources 128148 90245
1 Agriculture & Natural Resources 95326 76865
2 Agriculture & Natural Resources 33955 26321
3 Agriculture & Natural Resources 103549 81177
4 Agriculture & Natural Resources 24280 17281
.. ... ... ...
168 Business 200854 163393
169 Business 156673 134478
170 Business 102753 77471
171 Humanities & Liberal Arts 712509 478416
172 Humanities & Liberal Arts 17746 11887
Employed_full_time_year_round Unemployed Unemployment_rate Median \
0 74078 2423 0.026147 50000
1 64240 2266 0.028636 54000
2 22810 821 0.030248 63000
3 64937 3619 0.042679 46000
4 12722 894 0.049188 62000
.. ... ... ... ...
168 122499 8862 0.051447 49000
169 118249 6186 0.043977 72000
170 61603 4308 0.052679 53000
171 354163 33725 0.065851 50000
172 8204 943 0.073500 50000
P25th P75th
0 34000 80000.0
1 36000 80000.0
2 40000 98000.0
3 30000 72000.0
4 38500 90000.0
.. ... ...
168 33000 70000.0
169 50000 100000.0
170 36000 83000.0
171 35000 80000.0
172 39000 81000.0
[173 rows x 11 columns]
Data Preparation and Preprocessing¶
Before analyzing the data, we need to clean and preprocess it. This includes handling missing values, selecting the relevant features for clustering, and scaling the data. The first step is to check each column for missing values. The next step is to select the features needed for clustering and build the clustering DataFrame, dropping any rows with missing values (the check below shows there are none). The data is then scaled with StandardScaler, which standardizes each feature to zero mean and unit variance so that features measured in large units, such as Total, do not dominate the distance calculations. Lastly, the scaled data is converted back into a DataFrame for further analysis.
missing_values = df.isnull().sum()
print("Missing values in each column:")
print(missing_values)
features = ["Employed", "Unemployed", "Total", "Employed_full_time_year_round", "Median", "P25th", "P75th",]
df_cluster = df[features].dropna()
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df_cluster)
df_scaled = pd.DataFrame(scaled_data, columns=features)
df_cluster.head()
Missing values in each column:
Major_code                       0
Major                            0
Major_category                   0
Total                            0
Employed                         0
Employed_full_time_year_round    0
Unemployed                       0
Unemployment_rate                0
Median                           0
P25th                            0
P75th                            0
dtype: int64
| | Employed | Unemployed | Total | Employed_full_time_year_round | Median | P25th | P75th |
|---|---|---|---|---|---|---|---|
| 0 | 90245 | 2423 | 128148 | 74078 | 50000 | 34000 | 80000.0 |
| 1 | 76865 | 2266 | 95326 | 64240 | 54000 | 36000 | 80000.0 |
| 2 | 26321 | 821 | 33955 | 22810 | 63000 | 40000 | 98000.0 |
| 3 | 81177 | 3619 | 103549 | 64937 | 46000 | 30000 | 72000.0 |
| 4 | 17281 | 894 | 24280 | 12722 | 62000 | 38500 | 90000.0 |
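To make the scaling step concrete, the following minimal sketch checks on a small toy table that StandardScaler is equivalent to subtracting each column's mean and dividing by its population standard deviation. The toy values and the names `toy`, `scaled`, and `manual` are hypothetical and are not taken from all-ages.csv.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy values (hypothetical, not taken from all-ages.csv).
toy = pd.DataFrame({
    "Total": [1000.0, 5000.0, 20000.0, 100000.0],
    "Median": [40000.0, 50000.0, 60000.0, 70000.0],
})

scaled = StandardScaler().fit_transform(toy)

# StandardScaler uses the population standard deviation (ddof=0).
manual = (toy - toy.mean()) / toy.std(ddof=0)

print("matches manual z-score:", np.allclose(scaled, manual.to_numpy()))
print("column means:", scaled.mean(axis=0).round(6))
print("column stds: ", scaled.std(axis=0).round(6))
```

After scaling, both columns have mean 0 and standard deviation 1, so a head-count column no longer outweighs a salary column in distance-based methods such as K-Means.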
Explanation of Clustering¶
Clustering is a form of unsupervised machine learning that groups similar data points together based on their features. The goal of clustering is to identify natural groupings within the dataset.
There are several types of clustering algorithms. The methods we will focus on for this dataset are:
- K-Means Clustering: The algorithm partitions the data into a predefined number of clusters, K. It iteratively assigns each data point to the nearest cluster center and then updates the centers based on the means of the data points within each cluster.
- Agglomerative Clustering: This is a type of hierarchical clustering that starts with each data point as its own cluster and then iteratively merges the closest clusters until a stopping criterion is met. The result is a hierarchy of clusters that can be visualized using a dendrogram.
- PCA: Principal Component Analysis is not a clustering algorithm but a dimensionality reduction technique that transforms the data into a new coordinate system, where the axes (principal components) are ordered by the amount of variance they explain. PCA is often used to visualize high-dimensional data in lower dimensions (e.g., 2D or 3D) while preserving as much information as possible, and it is used here to visualize the clusters.
Overall, clustering helps us understand the structure of the data, identify natural patterns, and make informed decisions based on the groupings. In this context, it can reveal which college majors have similar employment and salary statistics.
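The assign-then-update loop described for K-Means can be sketched in plain NumPy. This is an illustrative sketch only, not the scikit-learn implementation used in the rest of the notebook; the function name `kmeans_sketch` and the two synthetic blobs are assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm: assign points, then recompute centers."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest center for every point.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: mean of each cluster (keep the old center if a cluster is empty).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving, so the assignment is stable
        centers = new_centers
    return labels, centers

# Two well-separated synthetic blobs (hypothetical data).
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(30, 2)),
    rng.normal(loc=(5, 5), scale=0.3, size=(30, 2)),
])
labels, centers = kmeans_sketch(X, k=2)
print("points per cluster:", np.bincount(labels, minlength=2))
```

scikit-learn's KMeans adds refinements such as smarter initialization and multiple restarts, which is why the notebook relies on it rather than on a hand-written loop.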
The Data¶
Data Understanding and Visualization¶
Before diving into immediate clustering, we will create some base visualizations to understand the data better.
Distribution of Scaled Features¶
The first step is to visualize the distribution of the scaled features to understand their characteristics. This helps identify potential outliers or skewness in the data. The histograms show the distribution of each feature in the dataset. Overall, the scaled features appear to have broadly similar distribution shapes.
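# DataFrame.hist creates its own figure, so a separate plt.figure() call would only leave an empty figure behind.
df_scaled.hist(bins=20, figsize=(12, 8), layout=(3, 3))
# suptitle labels the whole grid; plt.title would only label the last subplot.
plt.suptitle("Distribution of Scaled Features")
plt.tight_layout()
plt.show()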
Correlation Matrix¶
The heatmap below shows the correlation matrix, which summarizes the pairwise relationships between the features in the dataset. A key observation from the correlation matrix is:
- There is a strong positive correlation between Employed and Total, indicating that majors with more employed individuals tend to have a higher total number of individuals overall.
corr_matrix = df_scaled.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", square=True)
plt.title("Correlation Matrix of College Majors Features (Scaled)")
plt.show()
K-Means Clustering¶
The first clustering algorithm used is K-Means. This section determines the optimal number of clusters using the elbow method. The elbow plot helps identify the point where adding more clusters no longer produces a significant reduction in the sum of squared errors (SSE). Based on the visualization, the optimal number of clusters appears to be around k = 4, as the SSE starts to plateau after that point.
K-Means was chosen as the first clustering algorithm due to its simplicity and its efficiency, including on large datasets. It is widely used for clustering tasks and provides a good starting point for understanding the data, and the elbow plot built from its SSE values gives an estimate of the optimal number of clusters.
sse = []
K_range = range(2, 10)
for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(df_scaled)
    sse.append(kmeans.inertia_)
plt.figure(figsize=(6, 4))
plt.plot(K_range, sse, marker='x')
plt.title("K-Means Elbow Plot")
plt.xlabel("Number of Clusters (k)")
plt.ylabel("SSE (Sum of Squared Errors)")
plt.show()
K-Means Clustering with Optimal Clusters¶
The next step is to run K-Means with k = 4, the number of clusters suggested by the elbow plot. The silhouette score is then calculated to evaluate the quality of the clustering. It ranges from -1 to 1, where a higher score indicates better-defined clusters. The cluster labels are then added to the clustering DataFrame (df_cluster) for further analysis. The score, printed below, is 0.499, which indicates a fair clustering of the dataset.
k_opt = 4
kmeans = KMeans(n_clusters=k_opt, random_state=42)
kmeans_labels = kmeans.fit_predict(df_scaled)
kmeans_silhouette = silhouette_score(df_scaled, kmeans_labels)
print(f"K-Means Silhouette Score (k={k_opt}): {kmeans_silhouette:.3f}")
df_cluster['KMeans_Cluster'] = kmeans_labels
K-Means Silhouette Score (k=4): 0.499
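The elbow plot is read by eye, so it can help to cross-check the choice of k with the silhouette score over the same range. The sketch below does this on synthetic data from `make_blobs`, which stands in for `df_scaled`; the data and the names `scores` and `best_k` are assumptions made for illustration, and `n_init=10` is set explicitly because the default differs between scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for df_scaled (hypothetical data with 4 blobs).
X, _ = make_blobs(n_samples=200, centers=4, cluster_std=0.6, random_state=42)

scores = {}
for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest silhouette score is one candidate for the "optimal" k.
best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("best k by silhouette:", best_k)
```

If the silhouette-based choice and the elbow agree, that is some reassurance about k; if they disagree, both candidate values are worth inspecting.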
Truncated Dendrogram¶
This part visualizes hierarchical clustering using a dendrogram, a tree-like diagram that records the sequence of merges or splits. The dendrogram below is truncated to show only the last 12 merged clusters, which simplifies the plot and makes it easier to read and interpret.
The dendrogram is created using Ward's method, which at each step merges the pair of clusters that gives the smallest increase in within-cluster variance. Each leaf on the x-axis is a cluster, with a label in parentheses giving the number of majors it contains, and the y-axis represents the distance, or dissimilarity, between clusters.
An observation made from this dendrogram is that there are 4 main clusters that can be identified, which aligns with the K-Means clustering results. The dendrogram also shows the hierarchical structure of the clusters, which can be useful for understanding the relationships between different clusters.
plt.figure(figsize=(12, 6))
shc.dendrogram(
    shc.linkage(df_scaled, method='ward'),
    truncate_mode='lastp',
    p=12,
    leaf_rotation=90,
    leaf_font_size=12,
    show_contracted=True
)
plt.title("Truncated Dendrogram - Last 12")
plt.xlabel("Cluster Size")
plt.ylabel("Distance")
plt.show()
Full Dendrogram with Distance Threshold¶
Here we compare against the full dendrogram with a distance threshold of 6. The red dashed line marks the threshold, and branches that merge below it are colored as separate clusters.
The full dendrogram provides a more detailed view of the clustering structure. The clusters formed at this threshold align with the K-Means clustering results, which supports the choice of four clusters.
plt.figure(figsize=(12, 6))
ddata = shc.linkage(df_scaled, method='ward')
shc.dendrogram(
    ddata,
    color_threshold=6,
    leaf_rotation=90,
    leaf_font_size=12,
    show_contracted=True
)
plt.axhline(y=6, color='r', linestyle='--', label="Threshold")
plt.title("Dendrogram with Distance Threshold")
plt.xlabel("Sample Index")
plt.ylabel("Distance")
plt.legend()
plt.show()
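Reading cluster counts off a dendrogram is a visual judgment. One way to check it numerically is to cut the linkage with `scipy.cluster.hierarchy.fcluster` and compare the resulting labels with K-Means using the adjusted Rand index. The sketch below uses synthetic blobs rather than the majors data, so the threshold of 6 and the resulting numbers are illustrative assumptions only.

```python
import numpy as np
import scipy.cluster.hierarchy as shc
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for df_scaled (hypothetical data).
X, _ = make_blobs(n_samples=150, centers=4, cluster_std=0.6, random_state=42)

Z = shc.linkage(X, method="ward")

# Cut the tree at a fixed height, like the dashed threshold line ...
labels_by_height = shc.fcluster(Z, t=6, criterion="distance")
# ... or ask directly for at most a fixed number of clusters.
labels_by_count = shc.fcluster(Z, t=4, criterion="maxclust")

kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

# ARI is 1.0 for identical partitions and close to 0 for unrelated ones.
ari = adjusted_rand_score(kmeans_labels, labels_by_count)
print("clusters below height 6:", len(np.unique(labels_by_height)))
print("ARI (Ward with 4 clusters vs K-Means):", round(ari, 3))
```

Applied to the scaled majors data, the same two calls would give the number of clusters at the chosen threshold and a single agreement score between the hierarchical and K-Means partitions.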
Salary Box Plot¶
The box plot below shows the distribution of median salaries for each cluster, including the median, quartiles, and potential outliers. It helps to identify the differences in median salaries between the clusters.
For context:
- Cluster 0: Highest median salary
- Cluster 1: Moderate median salary
- Cluster 2: Moderate median salary with some outliers
- Cluster 3: Lowest median salary
The box plot also shows that Cluster 2 has a wider range of median salaries, indicating more variability within that cluster.
plt.figure(figsize=(12, 8))
sns.boxplot(x='KMeans_Cluster', y='Median', data=df_cluster)
plt.title("Box Plot of Median Salary by Cluster")
plt.xlabel("Cluster")
plt.ylabel("Median Salary")
plt.show()
PCA (Principal Component Analysis)¶
PCA is used to reduce the dimensionality of the dataset while preserving as much variance as possible. The scatter plots in the following sections show the first two principal components (PC1 and PC2) of the dataset, and the explained variance ratio of each component is printed below the code. Plotting the data in this lower-dimensional space makes the clustering results easier to visualize.
PCA was chosen as the dimensionality reduction technique due to its simplicity and effectiveness in reducing the dimensionality of the dataset while preserving the variance. PCA is widely used in data analysis and visualization, making it a fair choice for this project.
The first two principal components together explain about 97.0% of the variance in the data (56.3% for PC1 and 40.7% for PC2), which is a good amount of variance to capture for visualization purposes.
pca = PCA(n_components=2)
pca_result = pca.fit_transform(df_scaled)
pca_df = pd.DataFrame(data=pca_result, columns=['PC1', 'PC2'])
print(f"Explained Variance Ratio: PC1 = {pca.explained_variance_ratio_[0]*100:.1f}%, PC2 = {pca.explained_variance_ratio_[1]*100:.1f}%")
Explained Variance Ratio: PC1 = 56.3%, PC2 = 40.7%
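To interpret what PC1 and PC2 represent, it can help to inspect the component loadings, which show how strongly each original feature contributes to each component. The sketch below builds a small synthetic table with two size-like and two salary-like columns; the data and the column choices are hypothetical and only mimic the structure of the majors features.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
size = rng.lognormal(mean=10, sigma=1, size=100)    # hypothetical head counts
pay = rng.normal(loc=55000, scale=10000, size=100)  # hypothetical salaries
toy = pd.DataFrame({
    "Total": size,
    "Employed": size * 0.8 + rng.normal(0, 500, 100),
    "Median": pay,
    "P75th": pay * 1.4 + rng.normal(0, 2000, 100),
})

X = StandardScaler().fit_transform(toy)
pca = PCA(n_components=2).fit(X)

# Rows are original features, columns are components.
loadings = pd.DataFrame(pca.components_.T, index=toy.columns, columns=["PC1", "PC2"])
cumulative = np.cumsum(pca.explained_variance_ratio_)

print(loadings.round(2))
print("cumulative explained variance:", cumulative.round(3))
```

In a table like this one, the loadings typically separate into a size-related component and a salary-related component, which is one plausible reading of why two components can capture most of the variance.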
PCA Scatter Plot¶
The PCA scatter plot shows the first two principal components of the dataset, with the explained variance ratio of each component given in the plot title. This first plot is not colored by cluster, so it shows the overall shape of the data in the reduced space.
The scatter plot shows that most of the points lie toward the left of the plot (low PC1 values), suggesting that most of the majors have lower employment and salary statistics.
plt.figure(figsize=(8, 6))
plt.scatter(pca_df['PC1'], pca_df['PC2'], alpha=0.7)
plt.title(f"PCA Scatter Plot (PC1 vs PC2)\nExplained Variance: {pca.explained_variance_ratio_[0]*100:.2f}% and {pca.explained_variance_ratio_[1]*100:.2f}%")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.grid(True)
plt.show()
PCA Scatter Plot with reduced PCA data¶
This scatter plot shows the same two principal components, with the points color-coded by their K-Means cluster. It helps to visualize the clustering results in a lower-dimensional space and to evaluate whether reducing the data to two dimensions has a drastic effect. From this scatter plot, the clusters appear somewhat separated, and it does not seem that any significant information was lost by reducing the data to two principal components.
plt.figure(figsize=(10, 6))
sns.scatterplot(x=pca_df['PC1'], y=pca_df['PC2'], hue=df_cluster['KMeans_Cluster'], palette='viridis', alpha=0.7)
plt.title("Clustered Scatter Plot - PCA Reduced Data")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.legend(title='Cluster')
plt.grid(True)
plt.show()
Heatmap of Cluster Centers¶
The heatmap shows the center of each cluster, converted back to the original units, with one row per cluster and one column per feature. This makes it easier to compare the clusters feature by feature. Cluster 0 has the highest salary values, while Cluster 3 has the largest head counts.
The overall observations from the heatmap include:
- Cluster 3 has the largest values for Total, Employed, and Employed_full_time_year_round, suggesting it groups the majors with the most graduates.
- Cluster 2 also has fairly large totals, indicating a second group of high-enrollment majors.
- Clusters 0 and 1 have the lowest values for the head-count features (Total, Employed, Unemployed, and Employed_full_time_year_round), indicating that they group smaller majors.
- The salary statistics (Median, P25th, P75th) are highest in Cluster 0.
- Clusters 2 and 3 both show high Employed and Unemployed counts, as expected for large majors.
cluster_centers = scaler.inverse_transform(kmeans.cluster_centers_)
cluster_centers_df = pd.DataFrame(cluster_centers, columns=features)
plt.figure(figsize=(12, 8))
sns.heatmap(cluster_centers_df, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Heatmap of Cluster Centers")
plt.xlabel("Features")
plt.ylabel("Cluster")
plt.show()
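Because the centers are plotted in their original units, a single color scale is dominated by whichever features have the largest magnitudes. A possible complement, sketched below on synthetic data, is to look at the centers in standardized units as well, where every column is on a comparable scale. The toy data and the names `centers_z` and `centers_raw` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Two hypothetical groups: small high-salary majors and large lower-salary majors.
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "Total": np.concatenate([rng.normal(20000, 2000, 20), rng.normal(900000, 50000, 20)]),
    "Median": np.concatenate([rng.normal(70000, 3000, 20), rng.normal(50000, 3000, 20)]),
})

scaler = StandardScaler()
X = scaler.fit_transform(toy)
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# Centers in z-score units: comparable across columns, good for a shared color scale.
centers_z = pd.DataFrame(km.cluster_centers_, columns=toy.columns)
# Centers in original units: easier to read as actual head counts and salaries.
centers_raw = pd.DataFrame(scaler.inverse_transform(km.cluster_centers_), columns=toy.columns)

print(centers_z.round(2))
print(centers_raw.round(0))
```

Passing a table like `centers_z` to `sns.heatmap` would color each feature relative to its own spread, while `centers_raw` could still be used for the annotations.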
Information Regarding the Majors¶
This step addresses the key question of which majors fall into which cluster. The majors are grouped by cluster and printed below, which helps to understand the characteristics of each cluster. Clusters 0 and 1 contain the most majors, while Clusters 2 and 3 contain the fewest (Cluster 3 has only five).
To display the majors in each cluster, the Major and Major_category columns were added back to the feature set, the numeric features were scaled again with the scaler, and K-Means was rerun with the same optimal k. The majors were then grouped by their K-Means cluster and listed.
features = ["Major", "Employed", "Unemployed", "Total", "Employed_full_time_year_round", "Median", "P25th", "P75th", "Major_category"]
df_cluster2 = df[features].dropna()
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df_cluster2.drop(columns=['Major', 'Major_category']))
df_scaled = pd.DataFrame(scaled_data, columns=features[1:-1])
df_scaled['Major'] = df_cluster2['Major'].values
k_opt = 4
kmeans = KMeans(n_clusters=k_opt, random_state=42)
kmeans_labels = kmeans.fit_predict(df_scaled.drop(columns=['Major']))
df_cluster2['KMeans_Cluster'] = kmeans_labels
clusters = df_cluster2.groupby('KMeans_Cluster')['Major'].apply(list).reset_index()
for index, row in clusters.iterrows():
    print(f"Cluster {row['KMeans_Cluster']}:")
    for major in row['Major']:
        print(f" - {major}")
    print("\n")
Cluster 0:
 - COMPUTER AND INFORMATION SYSTEMS
 - INFORMATION SCIENCES
 - GENERAL ENGINEERING
 - AEROSPACE ENGINEERING
 - ARCHITECTURAL ENGINEERING
 - CHEMICAL ENGINEERING
 - CIVIL ENGINEERING
 - COMPUTER ENGINEERING
 - ELECTRICAL ENGINEERING
 - ENGINEERING MECHANICS PHYSICS AND SCIENCE
 - ENVIRONMENTAL ENGINEERING
 - GEOLOGICAL AND GEOPHYSICAL ENGINEERING
 - INDUSTRIAL AND MANUFACTURING ENGINEERING
 - MATERIALS ENGINEERING AND MATERIALS SCIENCE
 - MECHANICAL ENGINEERING
 - METALLURGICAL ENGINEERING
 - MINING AND MINERAL ENGINEERING
 - NAVAL ARCHITECTURE AND MARINE ENGINEERING
 - NUCLEAR ENGINEERING
 - PETROLEUM ENGINEERING
 - MISCELLANEOUS ENGINEERING
 - ENGINEERING AND INDUSTRIAL MANAGEMENT
 - ELECTRICAL ENGINEERING TECHNOLOGY
 - INDUSTRIAL PRODUCTION TECHNOLOGIES
 - MATHEMATICS
 - APPLIED MATHEMATICS
 - STATISTICS AND DECISION SCIENCE
 - MATHEMATICS AND COMPUTER SCIENCE
 - ASTRONOMY AND ASTROPHYSICS
 - GEOLOGY AND EARTH SCIENCE
 - PHYSICS
 - MATERIALS SCIENCE
 - CONSTRUCTION SERVICES
 - TRANSPORTATION SCIENCES AND TECHNOLOGIES
 - PHARMACY PHARMACEUTICAL SCIENCES AND ADMINISTRATION
 - ACTUARIAL SCIENCE
 - OPERATIONS LOGISTICS AND E-COMMERCE
 - BUSINESS ECONOMICS
 - MANAGEMENT INFORMATION SYSTEMS AND STATISTICS

Cluster 1:
 - GENERAL AGRICULTURE
 - AGRICULTURE PRODUCTION AND MANAGEMENT
 - AGRICULTURAL ECONOMICS
 - ANIMAL SCIENCES
 - FOOD SCIENCE
 - PLANT SCIENCE AND AGRONOMY
 - SOIL SCIENCE
 - MISCELLANEOUS AGRICULTURE
 - ENVIRONMENTAL SCIENCE
 - FORESTRY
 - NATURAL RESOURCES MANAGEMENT
 - ARCHITECTURE
 - AREA ETHNIC AND CIVILIZATION STUDIES
 - JOURNALISM
 - MASS MEDIA
 - ADVERTISING AND PUBLIC RELATIONS
 - COMMUNICATION TECHNOLOGIES
 - COMPUTER PROGRAMMING AND DATA PROCESSING
 - COMPUTER ADMINISTRATION MANAGEMENT AND SECURITY
 - COMPUTER NETWORKING AND TELECOMMUNICATIONS
 - COSMETOLOGY SERVICES AND CULINARY ARTS
 - EDUCATIONAL ADMINISTRATION AND SUPERVISION
 - SCHOOL STUDENT COUNSELING
 - MATHEMATICS TEACHER EDUCATION
 - PHYSICAL AND HEALTH EDUCATION TEACHING
 - EARLY CHILDHOOD EDUCATION
 - SCIENCE AND COMPUTER TEACHER EDUCATION
 - SECONDARY TEACHER EDUCATION
 - SPECIAL NEEDS EDUCATION
 - SOCIAL SCIENCE OR HISTORY TEACHER EDUCATION
 - TEACHER EDUCATION: MULTIPLE LEVELS
 - LANGUAGE AND DRAMA EDUCATION
 - ART AND MUSIC EDUCATION
 - MISCELLANEOUS EDUCATION
 - BIOLOGICAL ENGINEERING
 - BIOMEDICAL ENGINEERING
 - ENGINEERING TECHNOLOGIES
 - MECHANICAL ENGINEERING RELATED TECHNOLOGIES
 - MISCELLANEOUS ENGINEERING TECHNOLOGIES
 - LINGUISTICS AND COMPARATIVE LANGUAGE AND LITERATURE
 - FRENCH GERMAN LATIN AND OTHER COMMON FOREIGN LANGUAGE STUDIES
 - OTHER FOREIGN LANGUAGES
 - FAMILY AND CONSUMER SCIENCES
 - COURT REPORTING
 - PRE-LAW AND LEGAL STUDIES
 - COMPOSITION AND RHETORIC
 - HUMANITIES
 - LIBRARY SCIENCE
 - BIOCHEMICAL SCIENCES
 - BOTANY
 - MOLECULAR BIOLOGY
 - ECOLOGY
 - GENETICS
 - MICROBIOLOGY
 - PHARMACOLOGY
 - PHYSIOLOGY
 - ZOOLOGY
 - NEUROSCIENCE
 - MISCELLANEOUS BIOLOGY
 - MILITARY TECHNOLOGIES
 - MULTI/INTERDISCIPLINARY STUDIES
 - INTERCULTURAL AND INTERNATIONAL STUDIES
 - NUTRITION SCIENCES
 - COGNITIVE SCIENCE AND BIOPSYCHOLOGY
 - INTERDISCIPLINARY SOCIAL SCIENCES
 - PHYSICAL FITNESS PARKS RECREATION AND LEISURE
 - PHILOSOPHY AND RELIGIOUS STUDIES
 - THEOLOGY AND RELIGIOUS VOCATIONS
 - PHYSICAL SCIENCES
 - ATMOSPHERIC SCIENCES AND METEOROLOGY
 - CHEMISTRY
 - GEOSCIENCES
 - OCEANOGRAPHY
 - MULTI-DISCIPLINARY OR GENERAL SCIENCE
 - NUCLEAR, INDUSTRIAL RADIOLOGY, AND BIOLOGICAL TECHNOLOGIES
 - EDUCATIONAL PSYCHOLOGY
 - CLINICAL PSYCHOLOGY
 - COUNSELING PSYCHOLOGY
 - INDUSTRIAL AND ORGANIZATIONAL PSYCHOLOGY
 - SOCIAL PSYCHOLOGY
 - MISCELLANEOUS PSYCHOLOGY
 - PUBLIC ADMINISTRATION
 - PUBLIC POLICY
 - HUMAN SERVICES AND COMMUNITY ORGANIZATION
 - SOCIAL WORK
 - GENERAL SOCIAL SCIENCES
 - ANTHROPOLOGY AND ARCHEOLOGY
 - CRIMINOLOGY
 - GEOGRAPHY
 - INTERNATIONAL RELATIONS
 - MISCELLANEOUS SOCIAL SCIENCES
 - ELECTRICAL, MECHANICAL, AND PRECISION TECHNOLOGIES AND PRODUCTION
 - DRAMA AND THEATER ARTS
 - MUSIC
 - VISUAL AND PERFORMING ARTS
 - FILM VIDEO AND PHOTOGRAPHIC ARTS
 - ART HISTORY AND CRITICISM
 - STUDIO ARTS
 - MISCELLANEOUS FINE ARTS
 - GENERAL MEDICAL AND HEALTH SERVICES
 - COMMUNICATION DISORDERS SCIENCES AND SERVICES
 - HEALTH AND MEDICAL ADMINISTRATIVE SERVICES
 - MEDICAL ASSISTING SERVICES
 - MEDICAL TECHNOLOGIES TECHNICIANS
 - HEALTH AND MEDICAL PREPARATORY PROGRAMS
 - TREATMENT THERAPY PROFESSIONS
 - COMMUNITY AND PUBLIC HEALTH
 - MISCELLANEOUS HEALTH MEDICAL PROFESSIONS
 - HUMAN RESOURCES AND PERSONNEL MANAGEMENT
 - INTERNATIONAL BUSINESS
 - HOSPITALITY MANAGEMENT
 - MISCELLANEOUS BUSINESS & MEDICAL ADMINISTRATION
 - UNITED STATES HISTORY

Cluster 2:
 - COMMUNICATIONS
 - COMPUTER SCIENCE
 - GENERAL EDUCATION
 - ELEMENTARY EDUCATION
 - ENGLISH LANGUAGE AND LITERATURE
 - LIBERAL ARTS
 - BIOLOGY
 - CRIMINAL JUSTICE AND FIRE PROTECTION
 - ECONOMICS
 - POLITICAL SCIENCE AND GOVERNMENT
 - SOCIOLOGY
 - FINE ARTS
 - COMMERCIAL ART AND GRAPHIC DESIGN
 - MARKETING AND MARKETING RESEARCH
 - FINANCE
 - HISTORY

Cluster 3:
 - PSYCHOLOGY
 - NURSING
 - GENERAL BUSINESS
 - ACCOUNTING
 - BUSINESS MANAGEMENT AND ADMINISTRATION
Most Common Majors within Each Cluster¶
Now that we have the majors for each cluster, it is useful to know which major is the most common, that is, has the largest population (Total), within each cluster. Because each major appears only once in the dataset, counting occurrences with value_counts() cannot distinguish between majors, so the code ranks majors by Total instead. The bar plot below shows the largest major in each cluster and its Total, which helps to identify the differences between the clusters.
# Each major appears once, so pick the row with the largest Total in each cluster.
idx = df_cluster2.groupby('KMeans_Cluster')['Total'].idxmax()
most_common_majors = df_cluster2.loc[idx, ['KMeans_Cluster', 'Major', 'Total']]
plt.figure(figsize=(12, 6))
sns.barplot(x='Total', y='Major', hue='KMeans_Cluster', data=most_common_majors, dodge=False)
plt.title("Most Common Majors within Each Cluster")
plt.xlabel("Total")
plt.ylabel("Most Common Major")
plt.legend(title='Cluster')
plt.show()
Distribution of Majors within Each Category Group by Cluster¶
The bar plot below shows the number of majors in each major category, split by cluster. It helps to identify how the clusters differ in the kinds of majors they contain. Most Engineering majors fall in Cluster 0, the cluster with the highest salary statistics, while most Arts majors fall in Cluster 1, as the cluster listing above also shows. This suggests that engineering majors tend to have stronger salary outcomes than arts majors, which is consistent with the overall trends in the data.
plt.figure(figsize=(14, 8))
sns.countplot(y='Major_category', hue='KMeans_Cluster', data=df_cluster2, palette='viridis', order=df_cluster2['Major_category'].value_counts().index)
plt.title("Distribution of Majors within Each Category Group by Cluster")
plt.xlabel("Count")
plt.ylabel("Major Category")
plt.legend(title='Cluster')
plt.show()
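The same category-by-cluster breakdown can be read as a table with `pd.crosstab`, which gives exact counts and row shares alongside the plot. The small table below is hypothetical and only mirrors the two column names used above.

```python
import pandas as pd

# Hypothetical rows, one per major, with its category and assigned cluster.
toy = pd.DataFrame({
    "Major_category": ["Engineering", "Engineering", "Engineering", "Arts", "Arts", "Business"],
    "KMeans_Cluster": [0, 0, 1, 1, 1, 3],
})

# Number of majors per (category, cluster) pair.
counts = pd.crosstab(toy["Major_category"], toy["KMeans_Cluster"])
# Share of each category's majors that falls in each cluster (rows sum to 1).
shares = pd.crosstab(toy["Major_category"], toy["KMeans_Cluster"], normalize="index")

print(counts)
print(shares.round(2))
```

Row shares make statements such as "most majors in a category fall in one cluster" easy to verify, since the largest value in each row identifies the dominant cluster for that category.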
Clustering Analysis¶
Regarding clustering analysis overall, the K-Means clustering algorithm performed well in identifying the clusters within the dataset. The elbow method and silhouette score helped to determine the optimal number of clusters, and the dendrogram provided a hierarchical view of the clusters. The PCA scatter plot helped to visualize the clustering results in a lower-dimensional space, and the heatmap of cluster centers provided a visual representation of the differences between the clusters.
The majors in Cluster 0 (mostly engineering, mathematics, and physical sciences) have the highest salary statistics. Cluster 3 (Psychology, Nursing, General Business, Accounting, and Business Management and Administration) groups the largest majors by head count, and Cluster 2 (for example Computer Science, Elementary Education, Biology, and Finance) groups other large majors with a wider spread of median salaries. Cluster 1 holds the remaining smaller majors, including most arts, humanities, and education fields. This suggests that engineering and other quantitative majors tend to have higher salaries than arts and humanities majors, and that the clusters appear to be separated by the size of a major as well as by pay.
Regarding my original question, the clustering helped a great deal in identifying which majors belong to which cluster and what outcome profile each cluster represents. Four distinct groups based on employment and salary statistics were identified. The clusters are related to each other while remaining distinct. For example, Clusters 2 and 3 share high head counts, Clusters 0 and 1 contain smaller majors, and Cluster 0 stands apart for its high salaries. This indicates that the clustering analysis was successful in identifying the clusters and their characteristics. As for insights, it seems that outcomes can be compared across groups of college majors based on how they perform on employment and salary statistics.
Impact¶
The clustering analysis of college majors based on employment and salary statistics can have several potential impacts:
Social Impact¶
This project can support decision-making: by providing insights into the economic outcomes of different majors, it can help students make informed decisions about their education and career paths. This can lead to better alignment between education and workforce needs. Additionally, the analysis can help identify potential disparities in employment and salary outcomes among different majors, which can inform discussions about equity and access to education.
Ethical Impact¶
Regarding the ethical impact, the project raises several important considerations:
- Bias and Fairness: Some conclusions from this dataset could reinforce existing biases or stereotypes about certain majors or fields of study. For example, majors historically associated with lower employment rates or salaries may be unfairly stigmatized, leading to negative perceptions of those fields.
- Access to Information: The analysis can provide valuable information to students and educators, but it is important to ensure that this information is accessible to all individuals, regardless of their background or socioeconomic status. This can help level the playing field and give all students the opportunity to make informed decisions.
Economic Impact¶
Regarding the economic impact, the clustering analysis can help identify which majors have better employment and salary outcomes, which can help align education with the demands of the job market. Additionally, the analysis can inform educational institutions and policymakers about the economic outcomes of different majors, which can help guide decisions about program offerings and funding.
Negative Impacts¶
- The analysis focuses primarily on economic outcomes, which may lead to overemphasis on certain majors at the expense of others. This can discourage students from pursuing majors that may not have the highest economic returns but are still valuable for personal growth and development.
- The focus on economic outcomes may also overshadow other important factors, such as overall job satisfaction, work-life balance, and other reasons people might choose a certain field.
Overall, while the clustering analysis provides valuable insights into the mainly economic outcomes of different college majors, it is important to consider the broader social, ethical, and economic implications of the analysis and ensure that the findings are used responsibly and ethically.
References¶
- FiveThirtyEight. (n.d.). FiveThirtyEight College Majors Dataset. Kaggle. Retrieved from https://www.kaggle.com/datasets/fivethirtyeight/fivethirtyeight-college-majors-dataset
- scikit-learn. (n.d.). Clustering. Retrieved from https://scikit-learn.org/stable/modules/clustering.html
- Displayr. (n.d.). What is Dendrogram?. Retrieved from https://www.displayr.com/what-is-dendrogram/
- Penn State Eberly College of Science. (n.d.). Ward's method. Retrieved from https://online.stat.psu.edu/stat505/lesson/14/14.7
- scikit-learn. (n.d.). Silhouette Score. Retrieved from https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
- scikit-learn. (n.d.). PCA. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
- scikit-learn. (n.d.). StandardScaler. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
- scikit-learn. (n.d.). Agglomerative Clustering. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
- scikit-learn. (n.d.). KMeans. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
Code¶
The notebook file can be found at the following link: https://github.com/ToyaOkey/Project4