Predicting Stars, Galaxies & Quasars with Random Forest Classifiers in Python

The output of df.info() is shown below:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 10000 entries, 0 to 9999
    Data columns (total 18 columns):
    objid        10000 non-null float64
    ra           10000 non-null float64
    dec          10000 non-null float64
    u            10000 non-null float64
    g            10000 non-null float64
    r            10000 non-null float64
    i            10000 non-null float64
    z            10000 non-null float64
    run          10000 non-null int64
    rerun        10000 non-null int64
    camcol       10000 non-null int64
    field        10000 non-null int64
    specobjid    10000 non-null float64
    class        10000 non-null object
    redshift     10000 non-null float64
    plate        10000 non-null int64
    mjd          10000 non-null int64
    fiberid      10000 non-null int64
    dtypes: float64(10), int64(7), object(1)
    memory usage: 1.4+ MB

None of the entries are NaN, as expected of a well-maintained dataset. Cleaning is not necessary.

Unique Entries

The nunique() method returns a Series with the number of unique entries in each column.

    df.nunique().to_frame().transpose()

Occurrences of each Astronomical Entity

I then ran value_counts() on the class column.

    occurrences = df['class'].value_counts().to_frame().rename(index=str, columns={'class': 'Occurrences'})
    occurrences

We see that the majority of the entries are either galaxies or stars. Only 8.5% of the entries are classified as quasars.

Density Distribution Plots

Using kernel density estimation (KDE), I plotted smooth density distributions of the various features, with one curve per class.

    featuredf = df.drop(['class', 'objid'], axis=1)
    featurecols = list(featuredf)
    astrObjs = df['class'].unique()
    colours = ['indigo', '#FF69B4', 'cyan']

    plt.figure(figsize=(15, 10))
    for i in range(len(featurecols)):
        # One subplot per feature, one KDE curve per class
        plt.subplot(4, 4, i+1)
        for j in range(len(astrObjs)):
            sns.distplot(df[df['class'] == astrObjs[j]][featurecols[i]],
                         hist=False, kde=True, color=colours[j],
                         kde_kws={'shade': True, 'linewidth': 3},
                         label=astrObjs[j])
        plt.legend()
        plt.title('Density Plot')
        plt.xlabel(featurecols[i])
        plt.ylabel('Density')
    plt.tight_layout()

Filter band densities are also plotted for each class.

    filterbands = pd.concat([df.iloc[:, 3:8], df['class']], axis=1)
    # featurecols2 was undefined in the original listing; it is taken here
    # to be the five filter-band columns (u, g, r, i, z)
    featurecols2 = list(filterbands.drop('class', axis=1))

    plt.figure(figsize=(15, 5))
    plt.suptitle('Density Plots')
    sns.set_style("ticks")
    for i in range(len(astrObjs)):
        # One subplot per class, one KDE curve per filter band
        plt.subplot(1, 3, i+1)
        for j in range(len(featurecols2)):
            sns.distplot(df[df['class'] == astrObjs[i]][featurecols2[j]],
                         hist=False, kde=True,
                         kde_kws={'shade': True, 'linewidth': 3},
                         label=featurecols2[j])
        plt.legend()
        plt.xlabel(astrObjs[i])
        plt.ylabel('Density')
    plt.tight_layout()

Additional Visualisations

For completeness, I include a 3D plot, identical to that of the original notebook. The original intention seems to be determining if a linear kernel for the SVM works (correct me if I’m wrong please). There was a lot of clustering at the bottom, so I took the log of the redshift (ignoring the errors) to make the visualisation clearer.

    from mpl_toolkits.mplot3d import Axes3D

    fig = plt.figure(figsize=(5, 5))
    ax = Axes3D(fig)
    for obj in astrObjs:
        luminous = df[df['class'] == obj]
        ax.scatter(luminous['ra'], luminous['dec'], np.log10(luminous['redshift']))
    ax.set_xlabel('ra')
    ax.set_ylabel('dec')
    ax.set_zlabel('log redshift')
    ax.view_init(elev=0, azim=45)
    plt.show()

Building the Random Forest Classifier

Training and Test Set Split

    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier

    # features and labels are not defined in this section; they are assumed to
    # hold the feature columns and the class column respectively
    x_train, x_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.3, random_state=123, stratify=labels)
    clf = RandomForestClassifier()

Hyperparameter Optimisation

For hyperparameter tuning, I found this and this rather handy. We begin by instantiating a random forest and looking at the default values of the available hyperparameters. Pretty-printing the output of get_params():

    from pprint import pprint
    pprint(clf.get_params())

This gave:

    {'bootstrap': True,
     'class_weight': None,
     'criterion': 'gini',
     'max_depth': None,
     'max_features': 'auto',
     'max_leaf_nodes': None,
     'min_impurity_decrease': 0.0,
     'min_impurity_split': None,
     'min_samples_leaf': 1,
     'min_samples_split': 2,
     'min_weight_fraction_leaf': 0.0,
     'n_estimators': 10,
     'n_jobs': None,
     'oob_score': False,
     'random_state': None,
     'verbose': 0,
     'warm_start': False}

The hyperparameters which I decided to focus on are:

- n_estimators (number of trees in the forest)
- max_features (max. no. of features considered when splitting a node, usu. < no. of features in dataset)
- max_depth (max. no. of levels in each decision tree)
- min_samples_split (min. no. of data points in a node before the node is split)
- min_samples_leaf (min. no. of data points allowed in a leaf node)
- criterion (function used to measure the quality of a split, either Gini impurity or entropy)

Tuning Using Random Search

To narrow down my search, I first ran a Randomised Search Cross-Validation. Here, I performed a random search of parameters using 10-fold cross-validation (cv = 10), across 100 different combinations (n_iter = 100), and with all available cores concurrently (n_jobs = -1). Random search samples combinations of hyperparameter values at random instead of iterating across every possible combination. Recall that a higher n_iter tries more combinations, while a higher cv gives a less noisy score for each combination, which reduces the chance of picking hyperparameters that overfit one particular split. Both increase the running time.

    from sklearn.model_selection import RandomizedSearchCV

    hyperparameters = {'max_features': [None, 'auto', 'sqrt', 'log2'],
                       'max_depth': [None, 1, 5, 10, 15, 20],
                       'min_samples_leaf': [1, 2, 4],
                       'min_samples_split': [2, 5, 10],
                       'n_estimators': [int(x) for x in np.linspace(start=10, stop=100, num=10)],
                       'criterion': ['gini', 'entropy']}
    rf_random = RandomizedSearchCV(clf, hyperparameters, n_iter=100, cv=10,
                                   verbose=2, random_state=123, n_jobs=-1)
    rf_random.fit(x_train, y_train)

With verbose = 2, a lot of log output comes up. To obtain the best parameters, I called:

    rf_random.best_params_

This gave:

    {'n_estimators': 100,
     'min_samples_split': 5,
     'min_samples_leaf': 2,
     'max_features': None,
     'max_depth': 15,
     'criterion': 'entropy'}

Tuning Using Grid Search

I could now specify a narrower range of hyperparameters to concentrate on. GridSearchCV tries every combination in the grid, which makes it well suited to fine-tuning around the values found by the random search.

    from sklearn.model_selection import GridSearchCV

    hyperparameters = {'max_features': [None],
                       'max_depth': [14, 15, 16],
                       'min_samples_leaf': [1, 2, 3],
                       'min_samples_split': [4, 5, 6],
                       'n_estimators': [90, 100, 110],
                       'criterion': ['entropy']}
    rf_grid = GridSearchCV(clf, hyperparameters, cv=10, n_jobs=-1, verbose=2)
    rf_grid.fit(x_train, y_train)

This took me roughly 50 minutes.
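The two-stage tuning described above (a coarse random search, then a grid search centred on its best result) can be tried end to end in a few seconds on toy data. The following is a minimal sketch, not the code used in this post: the synthetic dataset from make_classification, the small search spaces, and the helper neighbours() are all assumptions made for illustration, and the fold counts are reduced so that it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split

# Synthetic three-class data standing in for the survey features (assumption).
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y)

# Stage 1: random search samples n_iter combinations from a wide space.
coarse_space = {
    'max_depth': [None, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'n_estimators': [10, 20, 30],
}
coarse = RandomizedSearchCV(RandomForestClassifier(random_state=123),
                            coarse_space, n_iter=5, cv=3, random_state=123)
coarse.fit(X_train, y_train)
best = coarse.best_params_


def neighbours(value, step, minimum=1):
    """Hypothetical helper: sorted unique values value-step, value, value+step,
    with the lower one clipped to `minimum`."""
    return sorted({max(minimum, value - step), value, value + step})


# Stage 2: grid search tries every combination in a narrow grid around `best`.
fine_space = {
    'max_depth': ([None] if best['max_depth'] is None
                  else neighbours(best['max_depth'], 1)),
    'min_samples_leaf': neighbours(best['min_samples_leaf'], 1),
    'n_estimators': neighbours(best['n_estimators'], 5),
}
fine = GridSearchCV(RandomForestClassifier(random_state=123), fine_space, cv=3)
fine.fit(X_train, y_train)

# GridSearchCV refits the best combination on the full training set by default,
# so the fitted search object can score the held-out test set directly.
print(fine.best_params_)
print('test accuracy:', fine.score(X_test, y_test))
```

Since the narrow grid always contains the random search's best combination, the grid search's best cross-validated score on the same folds should not be worse than that combination's score; whether it translates into better test accuracy depends on the data.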
