Star Classification and Prediction#

Author: Andrew Heider

Course Project, UC Irvine, Math 10, F23


My project uses a six-class star dataset to classify stars with machine learning approaches. The dataset includes several features of each star: Absolute Temperature, Relative Luminosity, Relative Radius, Absolute Magnitude, Star Color, Spectral Class, and Star Type. The purpose of the dataset is to show that stars follow a recognizable pattern, specifically the Hertzsprung-Russell Diagram, so that a star can be classified by plotting its features on that diagram. The main goals of this project are to visualize the relationship between temperature and luminosity within the dataset and to compare the Random Forest and Logistic Regression models at predicting star type from temperature and luminosity.

Star Data Section#

import pandas as pd

star_data = pd.read_csv("6 class csv.csv")
star_data = star_data.dropna(axis = 1)
     Temperature (K)  Luminosity(L/Lo)  Radius(R/Ro)  Absolute magnitude(Mv)  Star type  Star color  Spectral Class
0               3068          0.002400        0.1700                   16.12          0         Red               M
1               3042          0.000500        0.1542                   16.60          0         Red               M
2               2600          0.000300        0.1020                   18.70          0         Red               M
3               2800          0.000200        0.1600                   16.65          0         Red               M
4               1939          0.000138        0.1030                   20.06          0         Red               M
...              ...               ...           ...                     ...        ...         ...             ...
235            38940     374830.000000     1356.0000                   -9.93          5        Blue               O
236            30839     834042.000000     1194.0000                  -10.63          5        Blue               O
237             8829     537493.000000     1423.0000                  -10.73          5       White               A
238             9235     404940.000000     1112.0000                  -11.23          5       White               A
239            37882     294903.000000     1783.0000                   -7.80          5        Blue               O

240 rows × 7 columns

       Temperature (K)  Luminosity(L/Lo)  Radius(R/Ro)  Absolute magnitude(Mv)   Star type
count       240.000000        240.000000    240.000000              240.000000  240.000000
mean      10497.462500     107188.361635    237.157781                4.382396    2.500000
std        9552.425037     179432.244940    517.155763               10.532512    1.711394
min        1939.000000          0.000080      0.008400              -11.920000    0.000000
25%        3344.250000          0.000865      0.102750               -6.232500    1.000000
50%        5776.000000          0.070500      0.762500                8.313000    2.500000
75%       15055.500000     198050.000000     42.750000               13.697500    4.000000
max       40000.000000     849420.000000   1948.500000               20.060000    5.000000

Graphical Relationship of Luminosity and Temperature for Spectral Class#

import altair as alt

star_chart = alt.Chart(star_data).mark_circle().encode(
    x = alt.X('Luminosity(L/Lo):Q', scale = alt.Scale(type = 'log', domain = [0.00001, 10000000])),
    y = 'Temperature (K):Q',
    color = "Spectral Class:N"
)
star_chart

This graph shows the relationship between temperature and luminosity for the different spectral classes in the dataset. Compared with the Hertzsprung-Russell Diagram, it is very similar in nature. We can see the cluster of stars that curves upward, which is the main sequence. Then there is a cluster on the right side above the curve that corresponds to the different Supergiant star types. Finally, another cluster on the left side of the graph indicates where the white dwarfs are.
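The chart above uses a logarithmic axis for luminosity. A quick check with the minimum and maximum luminosity from the summary statistics shows why a linear axis would squash most of the stars against zero. This is only an illustrative sketch, and the variable names are my own.

```python
import math

# Minimum and maximum luminosity (L/Lo) as reported by the summary statistics.
lum_min = 0.00008
lum_max = 849420.0

# Number of orders of magnitude (powers of ten) between the two extremes.
decades = math.log10(lum_max / lum_min)
print(f"Luminosity spans about {decades:.1f} orders of magnitude")
```

With a range this wide, a log scale gives each power of ten the same amount of space on the axis, so the faint dwarfs and the bright giants are both visible.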

star_data.columns = ['temperature', 'luminosity', 
              'radius', 'absolute_magnitude', 
              'star_type', 'star_color', 'spectral_class']

I am renaming the columns of the dataset so that they are easier to type.

# Note: this first call replaces 'Blue' with itself, so it changes nothing;
# the trailing-space variant 'Blue ' is still present in the output below.
star_data.replace('Blue', 'Blue', inplace = True)
star_data.replace('white', 'White', inplace = True)
star_data.replace(['Blue White', 'Blue white', 'Blue white ', 'Blue-white'], 'Blue-White', inplace= True)
star_data.replace('yellowish', 'Yellowish', inplace= True)
star_data.replace('yellow-white', 'Yellow-White', inplace= True)

Some of the color labels in the dataset are spelled or capitalized inconsistently, so I correct them to a consistent set of names.
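As an alternative to listing every variant by hand, the labels could be normalized with a small helper. This is a sketch under my own assumptions, not the method used in this project: `normalize_color` is a hypothetical function, and the sample labels are the variants named in the replace calls above plus the 'Blue ' variant with a trailing space.

```python
import pandas as pd

def normalize_color(label: str) -> str:
    """Trim whitespace, unify separators, and capitalize each word of a color label."""
    # Treat hyphens as spaces and collapse repeated whitespace.
    words = label.strip().replace("-", " ").split()
    # Re-join with hyphens so 'Blue white ' and 'Blue-white' end up identical.
    return "-".join(word.capitalize() for word in words)

raw = pd.Series(["Blue", "Blue ", "Blue White", "Blue white ",
                 "Blue-white", "white", "yellow-white", "yellowish"])
normalized = raw.map(normalize_color)
print(sorted(normalized.unique()))
```

One design consequence to be aware of: this helper would also rewrite multi-word labels such as 'Pale yellow orange' into a hyphenated form, so it changes more labels than the explicit replacements do.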

array(['Red', 'Blue-White', 'White', 'Yellowish White',
       'Pale yellow orange', 'Blue', 'Whitish', 'Yellow-White', 'Orange',
       'White-Yellow', 'Blue ', 'Yellowish', 'Orange-Red'], dtype=object)
count_star_color = pd.DataFrame({'Count': star_data['star_color'].value_counts(ascending= True)})

This bar graph shows the number of stars in the dataset for each star color.

import matplotlib.pyplot as plt

plt.figure(figsize= (10, 8))
bars = plt.barh(count_star_color.index, count_star_color['Count'])

for index, bar in enumerate(bars):
  plt.text(bar.get_width() + 1, index - 0.1, bar.get_width())

plt.title('Count of stars per star color type', fontsize= 16)
plt.ylabel('Star color')
plt.xlim([0, 140])

Star Temperature Behavior#

Here I gather the mean, median, and standard deviation of temperature for each star color, to help visualize how temperature varies across the different star colors in the dataset.

# Select the numeric column before aggregating so the text columns are not averaged
star_data.groupby('star_color')['temperature'].agg(['mean', 'median', 'std'])
                            mean   median          std
Blue                21955.763636  19923.0  9536.927627
Blue                19860.000000  19860.0          NaN
Blue-White          16659.951220  14100.0  6560.050964
Orange               4018.000000   4018.0   380.423448
Orange-Red           5112.000000   5112.0          NaN
Pale yellow orange   7230.000000   7230.0          NaN
Red                  3291.785714   3324.0   490.780761
White                9720.300000   8879.5  2563.623176
White-Yellow         7100.000000   7100.0          NaN
Whitish              8876.000000   8876.0  1165.311975
Yellow-White         6466.250000   6158.0  1049.074524
Yellowish            4527.666667   4526.0   451.502307
Yellowish White     10826.666667  11790.0  2773.451520
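Several rows in the table above show NaN for the standard deviation. That happens when a color group contains a single star, because pandas computes the sample standard deviation (dividing by n - 1), which is undefined for one value. The small sketch below reproduces this with the temperatures of the two Orange stars and the single Orange-Red star from the dataset; the `sample` frame is my own construction for illustration.

```python
import math
import pandas as pd

# Temperatures (K) of the two "Orange" stars and the single "Orange-Red" star.
sample = pd.DataFrame({
    "star_color": ["Orange", "Orange", "Orange-Red"],
    "temperature": [3749, 4287, 5112],
})

# Same aggregation as in the project, applied to the three-row sample.
stats = sample.groupby("star_color")["temperature"].agg(["mean", "median", "std"])
print(stats)

# A group of one star has no sample standard deviation.
print(math.isnan(stats.loc["Orange-Red", "std"]))
```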
red_star = star_data.query('star_color == "Red"')
blue_star = star_data.query('star_color == "Blue"')
blue_white_star = star_data.query('star_color == "Blue-White"')
white_star = star_data.query('star_color == "White"')
yellow_white_star = star_data.query('star_color == "Yellow-White"')

# Histograms of temperature per star color.
# y = 'count()' is assumed here: the original y encoding was cut off.
chart1 = alt.Chart(red_star).mark_bar().encode(
    x = alt.X('temperature', bin=alt.Bin(maxbins=30)),
    y = 'count()'
)

chart2 = alt.Chart(blue_star).mark_bar().encode(
    x = alt.X('temperature', bin=alt.Bin(maxbins=30)),
    y = 'count()'
)

chart3 = alt.Chart(blue_white_star).mark_bar().encode(
    x = alt.X('temperature', bin=alt.Bin(maxbins=30)),
    y = 'count()'
)

chart4 = alt.Chart(white_star).mark_bar().encode(
    x = alt.X('temperature', bin=alt.Bin(maxbins=30)),
    y = 'count()'
)

chart5 = alt.Chart(yellow_white_star).mark_bar().encode(
    x = alt.X('temperature', bin=alt.Bin(maxbins=30)),
    y = 'count()'
)

boxplot = alt.Chart(star_data).mark_boxplot().encode(
    x=alt.X('star_color:N', title='Star color'),
    y=alt.Y('temperature:Q', title='Temperature (K)'),
    color=alt.Color('star_color:N', title='Star color')
).properties(
    title="Stars' temperatures boxplot related to star color type"
)


After reviewing the graphs, we can see that blue stars tend to have a higher temperature.

Star Luminosity Behavior#

Here I repeat the process used for temperature, analyzing the mean, median, and standard deviation of luminosity within the dataset. I then visualize how luminosity varies across the different star colors.

# Select the numeric column before aggregating so the text columns are not averaged
star_data.groupby('star_color')['luminosity'].agg(['mean', 'median', 'std'])
                             mean         median            std
Blue                254519.727561  224780.000000  217261.615190
Blue                     0.001100       0.001100            NaN
Blue-White           68904.416521     788.000000  185475.774192
Orange              590000.000000  590000.000000   56568.542495
Orange-Red               0.630000       0.630000            NaN
Pale yellow orange       0.000080       0.000080            NaN
Red                  60526.787893       0.002550  104282.874572
White                94243.300229       0.000335  201123.823736
White-Yellow             0.000290       0.000290            NaN
Whitish                 41.350000      41.350000      46.174073
Yellow-White             2.940750       1.353500       3.515042
Yellowish                0.198333       0.153000       0.141553
Yellowish White          0.000115       0.000110       0.000033
red_star = star_data.query('star_color == "Red"')
blue_star = star_data.query('star_color == "Blue"')
blue_white_star = star_data.query('star_color == "Blue-White"')
yellow_white_star = star_data.query('star_color == "Yellow-White"')

# Histograms of luminosity per star color.
# y = 'count()' is assumed here: the original y encoding was cut off.
chart1 = alt.Chart(red_star).mark_bar().encode(
    x = alt.X('luminosity:Q', bin=alt.Bin(maxbins=30)),
    y = 'count()'
)

chart2 = alt.Chart(blue_star).mark_bar().encode(
    x = alt.X('luminosity:Q', bin=alt.Bin(maxbins=30)),
    y = 'count()'
)

chart3 = alt.Chart(blue_white_star).mark_bar().encode(
    x = alt.X('luminosity:Q', bin=alt.Bin(maxbins=30)),
    y = 'count()'
)

chart4 = alt.Chart(yellow_white_star).mark_bar().encode(
    x = alt.X('luminosity:Q', bin=alt.Bin(maxbins=30)),
    y = 'count()'
)

boxplot = alt.Chart(star_data).mark_boxplot().encode(
    x=alt.X('star_color:N', title='Star color'),
    y=alt.Y('luminosity:Q', title='Luminosity(L/Lo)'),
    color=alt.Color('star_color:N', title='Star color')
).properties(
    title="Stars' luminosity boxplot related to star color type"
)


After analyzing the boxplot, we can see that orange stars tend to have greater luminosity. There are relatively few stars of this kind in the dataset, and they are classified as type 5 (Hypergiants). Querying the orange stars shows the numerical values associated with their luminosity.

star_data.query('star_color == "Orange"')
temperature luminosity radius absolute_magnitude star_type star_color spectral_class
53 3749 550000.0 1648.0 -8.05 5 Orange M
172 4287 630000.0 1315.0 -9.20 5 Orange K

Stars with star color of Blue also tend to have higher temperatures as seen from the temperature boxplot. Using query, we can view the temperatures associated with stars of Blue color.

star_data.query('star_color == "Blue"')
temperature luminosity radius absolute_magnitude star_type star_color spectral_class
30 39000 204000.00000 10.60000 -4.700 3 Blue O
49 33750 220000.00000 26.00000 -6.100 4 Blue B
99 36108 198000.00000 10.20000 -4.400 3 Blue O
100 33300 240000.00000 12.00000 -6.500 4 Blue B
101 40000 813000.00000 14.00000 -6.230 4 Blue O
102 23000 127000.00000 36.00000 -5.760 4 Blue O
103 17120 235000.00000 83.00000 -6.890 4 Blue O
104 11096 112000.00000 12.00000 -5.910 4 Blue O
105 14245 231000.00000 42.00000 -6.120 4 Blue O
106 24630 363000.00000 63.00000 -5.830 4 Blue O
107 12893 184000.00000 36.00000 -6.340 4 Blue O
108 24345 142000.00000 57.00000 -6.240 4 Blue O
109 33421 352000.00000 67.00000 -5.790 4 Blue O
141 21020 0.00150 0.01120 11.520 2 Blue B
142 18290 0.00130 0.00934 12.780 2 Blue B
159 37800 202900.00000 6.86000 -4.560 3 Blue O
160 25390 223000.00000 57.00000 -5.920 4 Blue O
161 11567 251000.00000 36.00000 -6.245 4 Blue O
162 12675 452000.00000 83.00000 -5.620 4 Blue O
163 5752 245000.00000 97.00000 -6.630 4 Blue O
164 8927 239000.00000 35.00000 -7.340 4 Blue O
165 7282 131000.00000 24.00000 -7.220 4 Blue O
166 19923 152000.00000 73.00000 -5.690 4 Blue O
167 26373 198000.00000 39.00000 -5.830 4 Blue O
168 17383 342900.00000 30.00000 -6.090 4 Blue O
169 9373 424520.00000 24.00000 -5.990 4 Blue O
173 26000 316000.00000 1679.00000 -9.100 5 Blue B
176 18000 200000.00000 1045.00000 -8.300 5 Blue O
200 16790 0.00140 0.01210 12.870 2 Blue B
201 15680 0.00122 0.01140 11.920 2 Blue B
202 14982 0.00118 0.01130 12.230 2 Blue B
203 13340 0.00109 0.01160 12.900 2 Blue B
204 18340 0.00134 0.01240 11.220 2 Blue B
205 19920 0.00156 0.01420 11.340 2 Blue B
206 24020 0.00159 0.01270 10.550 2 Blue B
207 23092 0.00132 0.01040 10.180 2 Blue B
208 17920 0.00111 0.01060 11.660 2 Blue B
209 19360 0.00125 0.00998 11.620 2 Blue B
214 34190 198200.00000 6.39000 -4.570 3 Blue O
215 32460 173800.00000 6.23700 -4.360 3 Blue O
220 23678 244290.00000 35.00000 -6.270 4 Blue O
221 12749 332520.00000 76.00000 -7.020 4 Blue O
222 9383 342940.00000 98.00000 -6.980 4 Blue O
223 23440 537430.00000 81.00000 -5.975 4 Blue O
224 16787 246730.00000 62.00000 -6.350 4 Blue O
225 18734 224780.00000 46.00000 -7.450 4 Blue O
226 9892 593900.00000 80.00000 -7.262 4 Blue O
227 10930 783930.00000 25.00000 -6.224 4 Blue O
228 23095 347820.00000 86.00000 -5.905 4 Blue O
229 21738 748890.00000 92.00000 -7.346 4 Blue O
231 38234 272830.00000 1356.00000 -9.290 5 Blue O
232 32489 648430.00000 1948.50000 -10.840 5 Blue O
235 38940 374830.00000 1356.00000 -9.930 5 Blue O
236 30839 834042.00000 1194.00000 -10.630 5 Blue O
239 37882 294903.00000 1783.00000 -7.800 5 Blue O

Heat Map#

The following part is a heat map of the correlation coefficients between the numeric features of the dataset (temperature, luminosity, radius, absolute magnitude, and star type). Each cell is color coded by how strongly the two features are correlated, which helps identify the features that rise and fall together.

import numpy as np
import seaborn as sns

heat_map = np.corrcoef([star_data['temperature'], star_data['luminosity'], star_data['radius'],
                        star_data['absolute_magnitude'], star_data['star_type']])

sns.heatmap(heat_map, annot=True,
            xticklabels=['temperature', 'luminosity', 'radius',
                         'absolute_magnitude', 'star_type'],
            yticklabels=['temperature', 'luminosity', 'radius',
                         'absolute_magnitude', 'star_type'])
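The matrix passed to the heat map above is built with np.corrcoef, which returns a bare array, so the axis labels have to be typed out by hand. As a sketch of an alternative, pandas' `DataFrame.corr()` computes the same Pearson coefficients and keeps the column names attached. The `sample` frame below holds only the ten rows shown at the head and tail of the dataset, so its coefficients will not match the full 240-row heat map.

```python
import numpy as np
import pandas as pd

# Ten rows copied from the head and tail of the dataset shown in the Star Data Section.
sample = pd.DataFrame({
    "temperature": [3068, 3042, 2600, 2800, 1939, 38940, 30839, 8829, 9235, 37882],
    "luminosity": [0.0024, 0.0005, 0.0003, 0.0002, 0.000138,
                   374830.0, 834042.0, 537493.0, 404940.0, 294903.0],
    "radius": [0.17, 0.1542, 0.102, 0.16, 0.103,
               1356.0, 1194.0, 1423.0, 1112.0, 1783.0],
    "absolute_magnitude": [16.12, 16.60, 18.70, 16.65, 20.06,
                           -9.93, -10.63, -10.73, -11.23, -7.80],
    "star_type": [0, 0, 0, 0, 0, 5, 5, 5, 5, 5],
})

# Pearson correlation by default; rows and columns are labeled by feature name.
corr = sample.corr()
print(corr.round(2))

# The values agree with np.corrcoef applied to the same columns.
print(np.allclose(corr.to_numpy(), np.corrcoef(sample.to_numpy().T)))
```

The resulting frame can be passed straight to `sns.heatmap(corr, annot=True)`, which picks up the tick labels from the index and columns.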

Random Forest Vs Logistic Regression for Star Type Classification#

In this section I compare Random Forest and Logistic Regression to see which is the better model for predicting star types. Both models use temperature and luminosity as the input features. For each model I report the accuracy and a classification report to help analyze its performance.

Random Forest#

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, log_loss

# Features and target: predict star type from temperature and luminosity
X = star_data[['temperature', 'luminosity']]
y = star_data['star_type']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 21)

# Shallow forest (max_depth=2) fitted on the training split
rf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=21)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

print("Random Forest Performance:")
print("Accuracy:", accuracy_score(y_test, y_pred_rf))
print("Classification Report:\n", classification_report(y_test, y_pred_rf))
Random Forest Performance:
Accuracy: 0.7777777777777778
Classification Report:
               precision    recall  f1-score   support

           0       0.85      0.92      0.88        12
           1       0.90      0.82      0.86        11
           2       1.00      0.88      0.93         8
           3       0.92      0.86      0.89        14
           4       0.58      0.73      0.65        15
           5       0.60      0.50      0.55        12

    accuracy                           0.78        72
   macro avg       0.81      0.78      0.79        72
weighted avg       0.79      0.78      0.78        72

Using the Random Forest method, I achieved an accuracy of 77.8% on the test set, which is reasonable for a model that sees only two features. Because the trees are limited to max_depth=2, the model has little room to overfit, although confirming this would require comparing against the training accuracy. In the classification report, star types 0 through 3 have high recall and f1-scores (0.86 to 0.93 for f1), so the model reliably associates temperature and luminosity with those types. Types 4 and 5 are weaker, with f1-scores of 0.65 and 0.55, and most of the errors involve these two types.
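The f1-score in the classification report is the harmonic mean of precision and recall, so it is pulled toward the lower of the two. As a quick check, the sketch below recomputes it from the rounded precision and recall printed for star types 4 and 5; `f1_from` is a helper name I made up for this illustration.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs for star types 4 and 5 from the Random Forest report.
reported = {4: (0.58, 0.73), 5: (0.60, 0.50)}

for star_type, (precision, recall) in reported.items():
    print(star_type, round(f1_from(precision, recall), 2))
```

Because the inputs are already rounded to two decimals, a recomputed value can occasionally differ from the report in the last digit, but for these two classes it matches the 0.65 and 0.55 shown above.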

Logistic Regression#

from sklearn.linear_model import LogisticRegression

X1 = star_data[['temperature', 'luminosity']]
y1 = star_data['star_type']

clf = LogisticRegression(max_iter=4000)

X_train, X_test, y_train, y_test = train_test_split(X1,y1, test_size=0.3, random_state=42)
clf.fit(X_train, y_train)
y_pred_logistic = clf.predict(X_test)
print("Logistic Regression Performance:")
print("Accuracy:", accuracy_score(y_test, y_pred_logistic))
print("Classification Report:\n", classification_report(y_test, y_pred_logistic))
Logistic Regression Performance:
Accuracy: 0.4861111111111111
Classification Report:
               precision    recall  f1-score   support

           0       0.00      0.00      0.00        14
           1       0.00      0.00      0.00        12
           2       0.26      1.00      0.42        10
           3       0.88      0.70      0.78        10
           4       0.57      0.73      0.64        11
           5       0.83      0.67      0.74        15

    accuracy                           0.49        72
   macro avg       0.42      0.52      0.43        72
weighted avg       0.42      0.49      0.42        72

Using the Logistic Regression method, I achieved an accuracy of only 48.6%, which is low for a six-class problem with roughly balanced classes. Looking at the classification report, the model never correctly predicts star types 0 and 1 (precision and recall of 0.00), while type 2 has a recall of 1.00 but a precision of only 0.26, which indicates that many stars of other types are being labeled as type 2. The f1-scores are lower overall, meaning that this model did a poorer job of predicting each star type from temperature and luminosity.
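Logistic regression is sensitive to the scale of its inputs, and here temperature is in the thousands while luminosity spans roughly ten orders of magnitude. One thing that might help, and this is an assumption that was not tested in this project, is to log-transform and standardize the features before fitting. The sketch below shows the mechanics with scikit-learn's `make_pipeline`, `FunctionTransformer`, and `StandardScaler` on synthetic stand-in data, not on the star dataset, so the accuracy it prints says nothing about the real data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Synthetic stand-in data (NOT the star dataset): three made-up classes whose
# "temperature" and "luminosity" live on very different scales.
rng = np.random.default_rng(0)
n_per_class = 60
temperature = np.concatenate(
    [rng.normal(loc, 300, n_per_class) for loc in (3000, 6000, 12000)])
luminosity = np.concatenate(
    [10 ** rng.normal(exponent, 0.5, n_per_class) for exponent in (-3, 0, 4)])
X = np.column_stack([temperature, luminosity])
y = np.repeat([0, 1, 2], n_per_class)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# log10 compresses the wide dynamic range; StandardScaler then gives each
# feature zero mean and unit variance before the logistic regression sees it.
model = make_pipeline(
    FunctionTransformer(np.log10),
    StandardScaler(),
    LogisticRegression(max_iter=4000),
)
model.fit(X_train, y_train)
acc = model.score(X_test, y_test)
print("Test accuracy on synthetic data:", acc)
```

Putting the transforms inside the pipeline means they are fitted on the training split only and then reused on the test split, which avoids leaking test information into the scaling.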


After running both models, I can conclude that the Random Forest model is much better than the Logistic Regression model at predicting star types from temperature and luminosity. The Random Forest reached 77.8% accuracy against 48.6% for Logistic Regression, and its f1-scores in the classification report are higher for most star types, which means its predictions are more accurate. Neither result on its own shows how much overfitting occurred, since only test-set performance was measured, and the two models were evaluated on different random splits (random_state 21 and 42), so the comparison is indicative rather than exact.
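A more direct comparison would fit both models on the same train/test split and report training accuracy next to test accuracy, since a large gap between the two is the usual sign of overfitting. The sketch below shows one way to set that up. It uses synthetic data from scikit-learn's `make_blobs` (240 samples, 2 features, 6 classes, chosen only to mirror the shape of the star dataset), so the numbers it prints are not results for the real data.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the star dataset): 240 points, 2 features, 6 classes.
X, y = make_blobs(n_samples=240, centers=6, n_features=2, random_state=0)

# One shared split, so both models are scored on exactly the same test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, max_depth=2, random_state=21),
    "Logistic Regression": LogisticRegression(max_iter=4000),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    results[name] = (train_acc, test_acc)
    print(f"{name}: train accuracy {train_acc:.3f}, test accuracy {test_acc:.3f}")
```

The same loop could be run on `star_data[['temperature', 'luminosity']]` and `star_data['star_type']` in place of the synthetic `X` and `y`.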


