Can data science help you pick a baby name?

Christopher Doughty · Jan 13

Finding the perfect name for your baby can be a challenge.

It is hard to find a name that both you and your partner like and that also fits with your surname.

There is then the added complication of name popularity, i.e. whether you pick something common or unique.

When trying to wade through all of these aspects, could data science make the name selection process easier?

To investigate the topic, I gathered some data on baby names.

Data for first names and last names was compiled from the Office of National Statistics (ONS) and National Records of Scotland (NRScotland) [1].

My final dataset contained 31,697 first names and 6,453 last names.

The first name data was partitioned by year for the number of babies born between 2000 and 2017.

TLDR: the online application created for this project can be found here.

Pronunciation prediction

Something people prefer to avoid is a first name that rhymes with the last.

First and last names with matching rhythm do not feel as natural when you say them aloud, e.g. Hank Banks.

Another component of pronunciation that can affect the flow of a name is when both the first and the last names end with the same sound, e.g. Emilie Kelly.

In many cases, an algorithm cannot identify the ending sound of a word using the letters alone; it needs to know the pronunciation.

To score words based on their similar rhythm, I needed to get training data on how to pronounce words.

This data came from the CMU Pronouncing Dictionary [2], which, after cleaning, contained 117,414 unique words.

The CMU dictionary contains words and their ARPAbet phonetic composition.

For words not present in the dictionary, I predicted how they might be pronounced.

This can be achieved using a seq2seq model, also called an encoder-decoder model.

Preprocessing the existing dictionary required tokenising the unique set of alphabet characters and ARPAbet codes.

This was achieved using a LabelEncoder and created a set of 81 unique tokens.

An example of the LabelEncoder converting the characters to numbers is shown below using my first name:

Source: ['C', 'H', 'R', 'I', 'S', 'T', 'O', 'P', 'H', 'E', 'R']
Target: ['K', 'R', 'IH1', 'S', 'T', 'AH0', 'F', 'ER0']

#### TOKENISING ####
Source: [21 37 62 39 63 65 53 60 37 25 62]
Target: [48 62 41 63 65 8 35 29]

Evaluating the input CMU data, I found that the longest source vector was 34 elements and the longest target vector was 32 elements.

These maximums seemed sensible, so the source and target vectors were standardised in length using a padding element ‘<PAD>’.
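The tokenise-and-pad step can be sketched with a plain dictionary in place of sklearn's LabelEncoder. Note the token ids below are illustrative, not the article's actual 81-token mapping, and the helper names are my own:

```python
PAD = '<PAD>'

def build_vocab(sequences):
    """Map each symbol to an integer id; reserve 0 for <PAD>."""
    symbols = sorted({s for seq in sequences for s in seq})
    return {PAD: 0, **{s: i + 1 for i, s in enumerate(symbols)}}

def encode(seq, vocab, max_len):
    """Tokenise a sequence and pad it to a fixed length with the <PAD> id."""
    ids = [vocab[s] for s in seq]
    return ids + [vocab[PAD]] * (max_len - len(ids))

source = ['C', 'H', 'R', 'I', 'S', 'T', 'O', 'P', 'H', 'E', 'R']
vocab = build_vocab([source])
print(encode(source, vocab, 14))  # 11 token ids followed by 3 padding zeros
```

In the real pipeline the vocabulary would be built once over all source letters and target ARPAbet codes together, giving the 81 unique tokens mentioned above.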

The seq2seq model was constructed with Keras using two LSTM (Long Short Term Memory) networks.
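The article does not include the model code, but a two-LSTM encoder-decoder of this kind might look like the minimal Keras sketch below. The hidden size (256) and the use of teacher forcing are my assumptions; only the token count (81 tokens plus padding) comes from the article:

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_TOKENS = 82   # the article's 81 unique tokens plus <PAD>
LATENT_DIM = 256  # assumption: the hidden size is not stated in the article

# Encoder: embed the source (letter) sequence and keep the final LSTM state
enc_in = keras.Input(shape=(None,))
enc_emb = layers.Embedding(NUM_TOKENS, LATENT_DIM, mask_zero=True)(enc_in)
_, state_h, state_c = layers.LSTM(LATENT_DIM, return_state=True)(enc_emb)

# Decoder: generate the target (ARPAbet) sequence from the encoder state
dec_in = keras.Input(shape=(None,))
dec_emb = layers.Embedding(NUM_TOKENS, LATENT_DIM, mask_zero=True)(dec_in)
dec_seq = layers.LSTM(LATENT_DIM, return_sequences=True)(
    dec_emb, initial_state=[state_h, state_c])
probs = layers.Dense(NUM_TOKENS, activation="softmax")(dec_seq)

model = keras.Model([enc_in, dec_in], probs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

During training the decoder input is the target sequence shifted right by one position; at inference time the decoder is run one step at a time, feeding each predicted phoneme back in.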

The model was trained for 35 epochs, with a final accuracy of 95.6% and a loss of 0.147.

Testing the final model with some example first names not present in the original dictionary produced promising results.

LEXI  L EH1 K S IY
AYLA  EY1 L AH0
MYA   M IY1 AH0

The model was then applied to the first names and last names data [1].

It was used to compute the pronunciation of 25,036 out of 31,697 first names and 739 out of 6,452 last names.

Rhyming similarity

Having the phonetics for all first names and last names meant that I could now score the similarity between names.

I achieved this by calculating the Levenshtein distance between each pair of words.

This method describes the minimal number of insertions, deletions, or substitutions required to transform one string into another.

The greater the Levenshtein distance, the less similar the two strings; a distance of 0 means that both strings are identical.

The code to create a matrix of the changes required to make both strings the same is outlined below.

import numpy as np

# Get the length of each string
lenx = len(str1) + 1
leny = len(str2) + 1

# Create a matrix of zeros
matrx = np.zeros((lenx, leny))

# Index the first row and column
matrx[:, 0] = [x for x in range(lenx)]
matrx[0] = [y for y in range(leny)]

# Loop through each value in the matrix
for x in range(1, lenx):
    for y in range(1, leny):
        # If the two string characters are the same
        if str1[x-1] == str2[y-1]:
            matrx[x, y] = matrx[x-1, y-1]
        else:
            matrx[x, y] = min(matrx[x-1, y] + 1,
                              matrx[x-1, y-1] + 1,
                              matrx[x, y-1] + 1)

Testing the code on the similarity of the words CONTAIN and TODAY to the word OBTAIN produced the following output matrices.

The number of operations can be used to calculate a similarity score using the following equation: (1 − operations / max_string_length) × 100.

This technique calculates that compared to OBTAIN, CONTAIN is 71% similar whereas TODAY is only 17% similar.
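Wrapping the matrix code above into functions gives a complete scorer (the function names are mine); it reproduces the 71% and 17% figures quoted:

```python
import numpy as np

def levenshtein(str1, str2):
    """Edit distance via the dynamic-programming matrix described above."""
    lenx, leny = len(str1) + 1, len(str2) + 1
    matrx = np.zeros((lenx, leny))
    matrx[:, 0] = range(lenx)
    matrx[0] = range(leny)
    for x in range(1, lenx):
        for y in range(1, leny):
            if str1[x - 1] == str2[y - 1]:
                matrx[x, y] = matrx[x - 1, y - 1]
            else:
                matrx[x, y] = min(matrx[x - 1, y] + 1,
                                  matrx[x - 1, y - 1] + 1,
                                  matrx[x, y - 1] + 1)
    return int(matrx[lenx - 1, leny - 1])

def similarity(str1, str2):
    """(1 - operations / max_string_length) * 100, as in the text."""
    ops = levenshtein(str1, str2)
    return (1 - ops / max(len(str1), len(str2))) * 100

print(round(similarity("CONTAIN", "OBTAIN")))  # 71
print(round(similarity("TODAY", "OBTAIN")))    # 17
```

The same functions work unchanged on lists of phoneme tokens, which is how the scoring is applied to the predicted pronunciations below.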

Levenshtein distance calculated between different words. The similarity between the words is higher on the left (two alterations) compared to the right (five alterations).

How this is used on our data is slightly different.

Rather than apply this method to the last name directly and score first names based on their similarity, it is applied to the phonetics to score names that have similar sounds or rhythm to the last name.

For example, if we use the last name SMITH and search for the most and least similar-sounding boy's names, the analysis produces the following output:

# TOP SCORING SIMILARITY
SETH    75.0
KIT     62.5
SZYMON  50.0

# LEAST SCORING SIMILARITY
ZACHARY  8.33
HUSSAIN  9.09
PARKER  10.0

Popularity trends

At this point, we have a working method for comparing phonetic similarity, removing similar-sounding components and scoring first and last names.

We would want to consider the current trends of the first names in our data.

Would we like to pick a name that is popular, increasing in popularity, decreasing in popularity, stable, or quite rare in the data?

The data I had allowed me to look at the trend changes for each name according to its year-on-year changes from 2000 to 2017.

The following chart shows the change in the top 10 boys and girls names in the UK between 2012 and 2017.

It shows that there is plenty of movement.

Top 10 boys and girls names between the years 2012 and 2017; lines show rank changes within the top 10.

To label each first name with a 'popularity profile', I ran linear regression across small segments of 4-year data for each name.

The regression coefficients for each of the segments were combined to create a trend array.

An array that looks like this [+, +, +, +, +] indicates that the name is growing over time, and one like this [-, -, -, -, -] indicates that the name is in decline.

This allows a profile to be created for each name with regard to its array.

The seven profiles I put together are explained below:

'CONTINUED DECLINE'   : Most recently declining, +1 previous
'DECLINE STABILISING' : Most recently stable, +2 decline
'RECENT DECLINE'      : Most recently declining, 0 previous
'CURRENTLY STABLE'    : Most recently stable, +1 previous
'GROWTH STABILISING'  : Most recently stable, +2 growth
'RECENT BOOM'         : Most recently growing, 0 previous
'GROWING BOOM'        : Most recently growing, +1 previous

To match the profiles to names, I plotted the top 100 female baby names in 2017 and coloured them based on their profile (below figure).
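The segment-regression step can be sketched as below. The window size, the non-overlapping segmentation, and the 'stable' threshold of ±0.5 are my assumptions, not the article's exact values:

```python
import numpy as np

def trend_array(counts, window=4):
    """Slope sign ('+', '-', or '0' for stable) for each consecutive
    window of yearly birth counts, via a per-segment linear fit."""
    signs = []
    for start in range(0, len(counts) - window + 1, window):
        seg = counts[start:start + window]
        slope = np.polyfit(range(window), seg, 1)[0]
        signs.append('+' if slope > 0.5 else '-' if slope < -0.5 else '0')
    return signs

# A name whose yearly counts rise steadily gives an all-'+' array
rising = [100, 150, 210, 260, 300, 370, 420, 480]
print(trend_array(rising))  # ['+', '+']
```

A profile label then just pattern-matches the array, e.g. a final '-' preceded by another '-' maps to 'CONTINUED DECLINE', a final '0' preceded by two '+' segments to 'GROWTH STABILISING', and so on.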

Top 100 ranked female names of 2017 and their historical profiles.

An interesting finding from this figure is that a large number of names in the top 100 are in decline.

This can be explained by looking at the number of unique names for each year (below figure), as the diversity in name choices increased between 2000 and 2012.

Count of unique male and female baby names by year.

A few more interesting findings…

A deeper exploration of the data revealed a few more interesting insights.

One of these was the frequency count of babies based on the length of first names.

It shows that most first names will fall between four and eight letters long.

Frequency plot of the length of first names.

Another was the use of a hyphen in names.

We can observe that this has become increasingly popular over the past 17 years.

Number of babies with hyphenated first names by year.

Final thoughts and application

When combining the pronunciation similarity model with the surname and popularity data, the output is a very powerful approach to reducing thousands of names down to a couple of hundred.

This is the main benefit of a tool like this: it cuts down a dataset that is too large to read through, while giving some historical and geographical context on particular names.

I have pushed the scoring model and trend profile data to GitHub for anyone interested in testing a lightweight version: https://penguinstrikes.github.io/content/baby_names/index.html.

In conclusion, a model for a problem like this will not be able to provide the 'perfect' name, nor should it.

There are too many personal attributes involved in a task like this to provide a definitive solution.

However, in comparison to some of the websites and literature available to help soon-to-be parents pick a baby name, a data-driven approach, in my opinion, is far more useful.

References

[1] ONS data available under the Open Government Licence v3.0
[2] http://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict/cmudict-0.7b
