Optimize Data Science Models with Feature Engineering

My name (“Pauline”) is old-fashioned, without much room for abbreviation.

I assumed the following list of time-based features and manually derived them from the SSA data; these are the “manual” features.

Then I applied principal component analysis (PCA) to create a set of “automated” features.

Finally, cluster analysis results are compared using these automated, manual, and combined features.

Clustering is a way to group objects with similar attributes together.

Data science methods have a solid track record of pooling together similar things.

For example, product recommendation algorithms identify people like you by purchase history and/or demographics, or determine which products are most commonly purchased together (e.g. ketchup with fries).

List of features to find names similar to “Pauline”:

Similar names are not at or near their peak of popularity (peak detection).
Similar names are relatively obscure in the United States (quantity and acceleration).
Similar names are not too unique (quantity).
Similar names are subjectively pleasant (not quantifiable).

The data set, Python code, and analysis are available in a public and interactive Kaggle notebook: https://www.kaggle.com/paulinechow/baby-names-optimize-w-feature-engineering

What features will generate the “best” short list of baby names?

Baby Name Metrics: peak detection, acceleration, and rank

SSA’s name data tracks the frequency of names used from 1910 to 2017.

Name frequencies are aggregated and grouped by year and gender.

In the notebook, sections 1 to 3 are general checks of data, including sampling of the data, test statistics, size, and shape.

The notebook walks through creating categorical metrics based on peak popularity, year over year change, and appearance in the top 500 most popular names in the last 3 years.

Section 4 of the notebook goes through the steps to create, combine, and analyze these metrics.

(1) Peak popularity detection

Each name has highs (“peaks”) and lows (“valleys”), both relative to its own history and relative to all names, and these provide important information.

For instance, the names “Bertha” and “Jenny” reached peak popularity in the 1920s and 1970s, respectively, and have since decreased in popularity.

Bertha has had a steady decline since its peak, while Jenny and Jennifer were strong contenders from the 1940s to the 1970s before their steady descent.

for names of her peers that will be popular in the lifetime of my child.

A hypothesis is that a name is unlikely to become popular again if it peaked significantly, relative to the population, before the current cohort and is currently on a decelerating trend.

Peak detection combined with knowing the current acceleration or YoY (year over year) change can help narrow down names that meet the current requirements.

Peak detection also returns information to create metrics.

Peak detection is used in digital signal processing and speech recognition to find local minima and maxima in both fixed and real-time data.

With names, peak detection can contextualize names with respect to events, people, and culture.

Here, peak detection is leveraged to determine any peak(s) within the last 5, 10, 15, 20, and 25 years, which are saved as categorical features in the data set.
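
As a minimal sketch of how such categorical features might be built, assume the peak years for a name are already known. The function name `recent_peak_flags`, the column naming, and the sample peak years are hypothetical and only for illustration; they are not taken from the notebook.

```python
def recent_peak_flags(peak_years, current_year, windows=(5, 10, 15, 20, 25)):
    """Flag, for each window N, whether any peak falls within the last N years."""
    return {
        # 1 if at least one peak year lies in (current_year - N, current_year]
        f"peak_last_{n}y": int(any(current_year - n < y <= current_year for y in peak_years))
        for n in windows
    }


# Hypothetical name with peaks in 1921 and 2005, evaluated as of 2017
flags = recent_peak_flags([1921, 2005], current_year=2017)
print(flags)
```

Each flag becomes one 0/1 column per name, so a name that peaked 12 years ago is flagged for the 15, 20, and 25 year windows but not for the 5 and 10 year ones.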

In this notebook, peaks and valleys are detected with a simple method and a more complex one.

(a) The most straightforward approach to peak detection is to calculate sign changes between consecutive periods.

A sign change between two periods from positive to negative would denote a decrease from a peak.

The simple algorithm returns the indices of decreases compared with the previous element.

The input data for the peak_detection_simple function is a list of values, such as yearly or 5-year rolling averages.

The calculations for this list are completed before the function returns indices.

The simple peak detection function below computes differences between consecutive time periods and returns the number of sign changes.

A sign change is defined as a movement from positive to negative; the method does not account for the magnitude of a peak or the length of time spent at it.

This simple method lacks the ability to look at the big picture trend of a name.
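
A rough sketch of the sign-change idea follows. It assumes the function takes a plain list of values; the body is my own illustration rather than the notebook's code, and the sample counts are made up. This version returns the index of each peak itself (the last value before a decrease), and the number of sign changes is simply the length of that list.

```python
import numpy as np


def peak_detection_simple(values):
    """Return indices where the series turns from rising to falling."""
    diffs = np.diff(np.asarray(values, dtype=float))  # change between consecutive periods
    signs = np.sign(diffs)
    # Position i is a peak when the step into i is positive and the step out of i is negative.
    turns = (signs[:-1] > 0) & (signs[1:] < 0)
    return (np.where(turns)[0] + 1).tolist()


# Made-up yearly counts with two local highs
counts = [10, 12, 15, 13, 11, 14, 18, 17]
peaks = peak_detection_simple(counts)
print(peaks, len(peaks))
```

Note that flat stretches (a difference of zero) are not treated as peaks in this sketch, which is one of the limitations of the simple approach.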

Questions arise from the results of the simple function:

What if a name spends more time at a “peak”?
Should we aggregate similar peaks that are close together?
Which fluctuations from positive to negative around a peak are significant?
Is there a threshold for the slope of the incline to or descent from a peak?

(b) SciPy is an open-source scientific computing package that provides built-in functions for mathematics, science, and engineering.

The package provides functions for identifying peaks and, with additional parameters, can differentiate between them further; see scipy.signal.find_peaks.

The SciPy find_peaks function provides options to bound the absolute height of peaks (height), set a minimum vertical distance to neighboring samples (threshold) and a minimum horizontal distance between peaks (distance), and require a minimum relative strength of a peak (prominence).
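
To make these parameters concrete, here is a small sketch using scipy.signal.find_peaks on made-up counts. The parameter values (height=10, distance=2, prominence=5) and the year range are assumptions chosen for the example, not the settings used in the notebook.

```python
import numpy as np
from scipy.signal import find_peaks

# Made-up yearly counts for one name, 2006 to 2017
years = np.arange(2006, 2018)
counts = np.array([5, 8, 30, 9, 7, 10, 9, 6, 12, 45, 20, 8])

# height: minimum absolute value of a peak
# distance: minimum number of samples between neighboring peaks
# prominence: how far a peak stands out from the surrounding baseline
peaks, props = find_peaks(counts, height=10, distance=2, prominence=5)

print(years[peaks])          # years in which a qualifying peak occurs
print(props["prominences"])  # relative strength of each peak
```

The small bump of 10 in 2011 is a local maximum, but its prominence is too low to pass the filter, which is exactly the kind of noise the simple sign-change method cannot ignore.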

(2) Acceleration or YoY change

Year-over-year (or over any time period) metrics are standard in analytics and reporting contexts.

The longer the time period compared, the more seasonality factors are normalized in the outputs.

In pandas, calculating the percentage change over x years creates a proxy for the acceleration rate over the last x years.

A number of features are created for the names based on various periods of time.
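
A minimal sketch of this calculation uses the pandas `pct_change` method with its `periods` argument. The counts and the column names below are invented for illustration.

```python
import pandas as pd

# Made-up yearly counts for a single name
counts = pd.Series(
    [100, 110, 121, 133, 160],
    index=[2013, 2014, 2015, 2016, 2017],
    name="Pauline",
)

# Percentage change over 1 and 3 years as proxies for acceleration
features = pd.DataFrame(
    {f"pct_change_{x}y": counts.pct_change(periods=x) for x in (1, 3)}
)
print(features)
```

The first x rows of each column are NaN because there is no earlier year to compare against, so those rows need to be dropped or filled before clustering.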

(3) Top 500 ranked name indicator

A categorical variable is created to flag if a name was ranked in the top 500 list over the last 3 years.

Using an indicator avoids the excessive weight that actual ranks would carry, even if the ranks were scaled.

This indicator can be changed in other ways: aggregating over more or fewer years, collecting all top X names for every year, or ranking names grouped by state.

A list of the top 500 names over the last 3 years will come in handy later for filtering names from the final list.
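
One way such an indicator could be sketched in pandas is shown below. The column names (`name`, `year`, `count`), the function name `top_n_indicator`, and the toy data are assumptions for the example. I also assume a name is flagged if it ranks in the top N in any of the last 3 years, which is one possible reading of the rule.

```python
import pandas as pd


def top_n_indicator(df, top_n=500, last_years=3):
    """Return a 0/1 Series per name: ranked in the top N in any of the last years."""
    latest = df["year"].max()
    recent = df[df["year"] > latest - last_years].copy()
    # Rank names within each year, highest count first
    recent["rank"] = recent.groupby("year")["count"].rank(method="min", ascending=False)
    top_names = list(set(recent.loc[recent["rank"] <= top_n, "name"]))
    names = pd.Index(df["name"].unique(), name="name")
    return pd.Series(names.isin(top_names).astype(int), index=names, name=f"top_{top_n}")


# Toy data; top_n=2 stands in for 500 so the example stays small
df = pd.DataFrame(
    {
        "name": ["Pauline", "Emma", "Olivia", "Emma", "Olivia", "Pauline",
                 "Emma", "Olivia", "Pauline", "Emma", "Ava", "Olivia", "Pauline"],
        "year": [2014, 2014, 2014, 2015, 2015, 2015,
                 2016, 2016, 2016, 2017, 2017, 2017, 2017],
        "count": [900, 800, 700, 1000, 900, 50,
                  950, 980, 60, 900, 990, 970, 70],
    }
)
indicator = top_n_indicator(df, top_n=2, last_years=3)
print(indicator)
```

In the toy data, Pauline ranks first in 2014, but that year falls outside the 3-year window, so the flag stays at 0.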

Features from Principal Component Analysis (PCA)

Name popularity is hard to predict solely based on frequency.

A pattern is not necessarily discernible because inspiration is random.

Parents may be influenced by Disney movies, public figures, or private events in their lives.

A study that uses baby names as indicators of cultural traits in the United States shows that new names are being invented rather than reused from past generations.

Cross-Correlations of Baby Names

Instead of manually extracting metrics, entire trends can be decomposed into features that explain the variance of each trend.

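PCA transforms the name trends into a smaller set of uncorrelated components, ordered by how much variance each one explains. The math behind PCA is explained here.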

Running PCA with 25 components, the results show that 3 and 10 components cumulatively explain 80% and 99% of the variance in the data, respectively.

A threshold for cumulative variance can be set before running PCA, especially if the data set is very large.

Computation time can be saved by leveraging PCA as a dimension reduction technique.
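
The cumulative-variance check can be sketched with scikit-learn as follows. The data here is synthetic (200 made-up "names" built from three underlying trend shapes plus noise), so the numbers it prints will not match the 80% and 99% figures from the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for the name-by-year matrix: 200 names x 108 years
t = np.linspace(0, 1, 108)
patterns = np.vstack([t, np.sin(2 * np.pi * t), np.exp(-((t - 0.5) ** 2) / 0.02)])
X = rng.normal(size=(200, 3)) @ patterns + 0.05 * rng.normal(size=(200, 108))

# Fit 25 components and inspect the cumulative explained variance
pca = PCA(n_components=25, svd_solver="full").fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_needed = int(np.searchsorted(cumulative, 0.80) + 1)
print(n_needed, cumulative[:5])

# Alternatively, set the variance threshold up front and let PCA pick the count
pca_80 = PCA(n_components=0.80, svd_solver="full").fit(X)
features = pca_80.transform(X)  # the "automated" features, one row per name
print(pca_80.n_components_, features.shape)
```

Passing a float between 0 and 1 as `n_components` (with the full SVD solver) keeps just enough components to exceed that share of the variance, which is the threshold approach described above.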

Alternatively, the number of components chosen for subsequent analysis can be dependent on the outcome of the final model.

This means that a model can be further optimized through its data inputs; in this case, the input is the number of components.

Section 5 of the notebook transforms the features and implements cluster analysis for 4 and 10 components.

In this analysis, accounting for more of the variance between names partitions the names into clusters more cleanly.

The more components used here, the better the final cluster silhouette scores.

Optimizing the results for quality of clustering is most aligned with the desired outcome.

Finding clusters with k-means

I chose k-means cluster analysis as a more generalized method of grouping names.

K-means clustering is used to group names (observations) into n clusters, where each name is allocated to the cluster with the nearest mean.

The quality of clusters can be measured with the silhouette score, which ranges from -1 to +1 and compares how cohesive an observation is within its own cluster with how separated it is from other clusters.

The number of clusters for the manual, automatic, and combined data sets was selected based on the count of names in the same cluster as Pauline.
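
A minimal sketch of k-means with silhouette scoring in scikit-learn is shown below, run on synthetic two-dimensional data rather than the name features. Here the number of clusters is picked by the best silhouette score purely for illustration; in the analysis itself it was chosen by the size of the cluster containing Pauline. Row 0 plays the role of the target name, which is an assumption of the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic features: three well-separated groups of 50 observations each
centers = np.array([[0, 0], [5, 5], [0, 5]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Score a range of cluster counts
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))

# Names that share a cluster with the target observation (row 0 here)
labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)
target_idx = 0
cluster_mates = np.where(labels == labels[target_idx])[0]
print(len(cluster_mates))
```

The length of `cluster_mates` corresponds to the size of the short list: the fewer observations that land in the target's cluster, the shorter the list of candidate names.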

Results: Baby Name Lists

The baby name list results were promising, since we are able to go from thousands of names to fewer than 200.

Lists generated by clustering with manual and automatic features contain 155 and 57 names, respectively.

There are 37 names shared by both lists.

PCA features identified a shorter list of names that, from the graph below, follow the assumptions laid out originally.

At the same time, only using automatically created features subjectively “misses” potential names.

A blended dataset, aggregating manual and automatic features, produced the shortest list with 47 names.

A lesson from this comparison is that feature engineering can be both art and science.

Cluster analysis will produce groupings that meet requirements, and the more requirements there are, the more restrictive the groupings become.

Automatic features are able to mirror trend similarities over time more accurately.

Manual features will capture internalized rules or assumptions but are not guaranteed to remove noisy results.

When relying solely on automatic features, creativity may be lost.

Blending the data and adding more features filters names further instead of infusing creativity into the list.

The blended data produced a list with less noise, while the manual features alone did not do enough to meet the requirements.

Below is a sample of names generated using different sets of features.

The full lists from both sets of features can be downloaded from the Jupyter notebook.

Below are 10 randomly selected names from the lists:

Further questions that you can ask about the data:

What other metrics can we derive from the SSA data set?
Do similar trends and insights hold for male names in the SSA data set?
What attributes of cluster analysis can be optimized to find similar names? This analysis only explores the number of clusters for k-means analysis.
What if connectivity-based or distribution-based clustering is used instead of centroid-based clustering?

Read other data science posts on www.fountainofdata.com.

Kaggle notebook available here.

Post uploaded with the help of gist.github.com.
