Calculating the Semantic Brand Score with Python

Brand Intelligence in the Era of Big Data

Andrea Fronzetti Colladon

The Semantic Brand Score (SBS) is a novel metric designed to assess the importance of one or more brands, in different contexts and whenever it is possible to analyze textual data, even big data.

The advantage with respect to some traditional measures is that the SBS does not rely on surveys administered to small samples of consumers.

The measure can be calculated on any source of text documents, such as newspaper articles, emails, tweets, posts on online forums, blogs and social media.

The idea is to capture insights and honest signals through the analysis of big textual data.

Spontaneous expressions of consumers, or other brand stakeholders, can be collected from the places where they normally appear, for example a travel forum when studying the importance of museum brands.

This has the advantage of reducing the biases induced by the use of questionnaires, where interviewees know that they are being observed.

The SBS can also be adapted to different languages and used to study the importance of specific words, or sets of words, that are not necessarily ‘brands’.

By ‘brand’ one can also mean the name of a politician, or a set of words that represents a concept (for example, the concept of “innovation” or a corporate core value).

The measure was used to evaluate the transition dynamics that occur when a new brand replaces an old one [1].

The Semantic Brand Score is also useful to relate the importance of a brand to that of its competitors, or to analyze importance time trends of a single brand.

In some applications, the scores obtained proved to be useful for forecasting purposes; for example, a link has been found between brand importance of political candidates in online press and election outcomes [3, 4].

Three Dimensions of Brand Importance

The SBS measures brand importance, which is at the basis of brand equity [1].

Indeed the metric was partially inspired by well-known conceptualizations of brand equity and by the constructs of brand image and brand awareness (see for example the work of Keller) [2].

Brand importance is measured along 3 dimensions: prevalence, diversity and connectivity.

Prevalence measures the frequency of use of the brand name, i.e. the number of times a brand is directly mentioned.

Diversity measures the diversity of the words associated with the brand.

Connectivity represents the brand’s ability to bridge connections between other words or groups of words (sometimes seen as discourse topics).

More information about the SBS can be found on this website [5], on Wikipedia, or reading this paper [1].

In this article I will not spend too much time on the metric, as my focus is to describe the main steps for calculating it using Python 3.

Data Collection and Text Pre-processing

The calculation of the Semantic Brand Score requires combining methods and tools of text mining and social network analysis.

Figure 1 illustrates the main preliminary steps, which comprise data collection, text pre-processing and construction of word co-occurrence networks.

Figure 1 — From Texts to Networks

For this introductory tutorial, we can assume that relevant textual data has already been collected and organized in a text file, where each new line is a different document.

I will just insert two imaginary brands (‘BrandA’ and ‘BrandB’) into random English text.
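As a minimal sketch of this setup (the file name sbs_texts.txt is only an example, not part of the original code), the corpus can be loaded into a list of document strings as follows:

#Read the corpus: one document per line (file name is illustrative)
with open("sbs_texts.txt", "r", encoding="utf8") as f:
    texts = [line.strip() for line in f if line.strip()]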

Using Python 3 to Calculate the Semantic Brand Score

I imported the random text file in Python as a list of text documents (texts), which are processed to remove punctuation, stop-words and special characters.

Words are lowercased and split into tokens, thus obtaining a new texts variable, which is a list of lists.

More complex operations of text preprocessing are always possible (such as the removal of html tags or ‘#’), for which I recommend reading one of many tutorials on Natural Language Processing in Python.
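For instance, two such extra cleaning steps could look like the sketch below, applied to the raw document strings before the pipeline that follows; the regular expression is only illustrative:

#Illustrative extra cleaning (optional): strip HTML tags and '#' symbols
import re
texts = [re.sub(r"<[^>]+>", " ", t) for t in texts]   #remove HTML tags
texts = [t.replace("#", " ") for t in texts]          #drop the hashtag symbol, keeping the word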

The stopwords list is taken from the NLTK package.

Lastly, word affixes are removed through Snowball stemming.

##Import re, string and nltk, and download stop-words
import re
import nltk
import string
from nltk.stem.snowball import SnowballStemmer

#Define stopwords
nltk.download("stopwords")
stopw = nltk.corpus.stopwords.words('english')

#Define brands (lowercase)
brands = ['branda', 'brandb']

# texts is a list of strings, one for each document analyzed.
#Convert to lowercase
texts = [t.lower() for t in texts]

#Remove words that start with HTTP
texts = [re.sub(r"http\S+", " ", t) for t in texts]

#Remove words that start with WWW
texts = [re.sub(r"www\S+", " ", t) for t in texts]

#Remove punctuation
regex = re.compile('[%s]' % re.escape(string.punctuation))
texts = [regex.sub(' ', t) for t in texts]

#Remove words made of single letters
texts = [re.sub(r'\b\w{1}\b', ' ', t) for t in texts]

#Remove stopwords
pattern = re.compile(r'\b(' + r'|'.join(stopw) + r')\b\s*')
texts = [pattern.sub(' ', t) for t in texts]

#Remove additional whitespaces
texts = [re.sub(' +', ' ', t) for t in texts]

#Tokenize text documents (becomes a list of lists)
texts = [t.split() for t in texts]

#Snowball Stemming
stemmer = SnowballStemmer("english")
texts = [[stemmer.stem(w) if w not in brands else w for w in t] for t in texts]

During text preprocessing we should pay attention not to lose useful information.

Smileys such as :-), which are made of punctuation, can be very important if we calculate sentiment.
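One possible workaround, shown only as a sketch and not part of the pipeline above, is to extract emoticons before punctuation is stripped; here raw_texts is an assumed variable holding the original, untouched documents, and the regular expression is merely illustrative:

#Optional sketch: save emoticons before removing punctuation
#(raw_texts is assumed to hold the original documents)
emoticon_pattern = re.compile(r"[:;=8][\-o\*']?[\)\]\(\[dDpP/\\]")
emoticons = [emoticon_pattern.findall(t) for t in raw_texts]   #one list of emoticons per document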

We can now proceed with the calculation of prevalence, which counts the frequency of occurrence of each brand name — subsequently standardized considering the scores of all the words in the texts.

My choice of standardization here is to subtract the mean and divide by the standard deviation.

Other approaches are also possible [1].

This step is important to compare measures carried out considering different time frames or sets of documents (e.g. brand importance on Twitter in April and in May).

Normalization of absolute scores is necessary before summing prevalence, diversity and connectivity to obtain the Semantic Brand Score.

#PREVALENCE
#Import Counter and Numpy
from collections import Counter
import numpy as np

#Create a dictionary with frequency counts for each word
countPR = Counter()
for t in texts:
    countPR.update(Counter(t))

#Calculate average score and standard deviation
avgPR = np.mean(list(countPR.values()))
stdPR = np.std(list(countPR.values()))

#Calculate standardized Prevalence for each brand
PREVALENCE = {}
for brand in brands:
    PR_brand = (countPR[brand] - avgPR) / stdPR
    PREVALENCE[brand] = PR_brand
    print("Prevalence", brand, PR_brand)

The next and most important step is to transform texts (a list of lists of tokens) into a social network where nodes are words and links are weighted according to the number of co-occurrences between each pair of words.

In this step we have to define a co-occurrence range, i.e. a maximum distance between co-occurring words (here set to 3).

In addition, we might want to remove links which represent negligible co-occurrences, for example those of weight = 1.

Sometimes it can also be useful to remove isolates, if these are not brands.

#Import Networkx
import networkx as nx

#Choose a co-occurrence range
co_range = 3

#Create an undirected Network Graph
G = nx.Graph()

#Each word is a network node
nodes = set([item for sublist in texts for item in sublist])
G.add_nodes_from(nodes)

#Add links based on co-occurrences
for doc in texts:
    w_list = []
    length = len(doc)
    for k, w in enumerate(doc):
        #Define range, based on document length
        if (k + co_range) >= length:
            superior = length
        else:
            superior = k + co_range + 1
        #Create the list of co-occurring words
        if k < length - 1:
            for i in range(k + 1, superior):
                linked_word = doc[i].split()
                w_list = w_list + linked_word
        #If the list is not empty, create the network links
        if w_list:
            for p in w_list:
                if G.has_edge(w, p):
                    G[w][p]['weight'] += 1
                else:
                    G.add_edge(w, p, weight=1)
        w_list = []

#Remove negligible co-occurrences based on a filter
link_filter = 2
#Create a new Graph which has only links above
#the minimum co-occurrence threshold
G_filtered = nx.Graph()
G_filtered.add_nodes_from(G)
for u, v, data in G.edges(data=True):
    if data['weight'] >= link_filter:
        G_filtered.add_edge(u, v, weight=data['weight'])

#Optional removal of isolates
isolates = set(nx.isolates(G_filtered))
isolates -= set(brands)
G_filtered.remove_nodes_from(isolates)

#Check the resulting graph (for small test graphs)
G_filtered.nodes()
G_filtered.edges(data=True)

Having determined the co-occurrence network, we can now calculate diversity and connectivity, which are the degree centrality and the betweenness centrality of a brand node.

We standardize these values as we did with prevalence.

#DIVERSITY
DIVERSITY_sequence = dict(nx.degree(G_filtered))

#Calculate average score and standard deviation
avgDI = np.mean(list(DIVERSITY_sequence.values()))
stdDI = np.std(list(DIVERSITY_sequence.values()))

#Calculate standardized Diversity for each brand
DIVERSITY = {}
for brand in brands:
    DI_brand = (DIVERSITY_sequence[brand] - avgDI) / stdDI
    DIVERSITY[brand] = DI_brand
    print("Diversity", brand, DI_brand)

If we calculate connectivity as weighted betweenness centrality, we first have to define inverse weights, as weights are treated by Networkx as distances (which is the opposite of our case).

#Define inverse weights
for u, v, data in G_filtered.edges(data=True):
    if 'weight' in data and data['weight'] != 0:
        data['inverse'] = 1 / data['weight']
    else:
        data['inverse'] = 1

#CONNECTIVITY
CONNECTIVITY_sequence = nx.betweenness_centrality(G_filtered, normalized=False, weight='inverse')

#Calculate average score and standard deviation
avgCO = np.mean(list(CONNECTIVITY_sequence.values()))
stdCO = np.std(list(CONNECTIVITY_sequence.values()))

#Calculate standardized Connectivity for each brand
CONNECTIVITY = {}
for brand in brands:
    CO_brand = (CONNECTIVITY_sequence[brand] - avgCO) / stdCO
    CONNECTIVITY[brand] = CO_brand
    print("Connectivity", brand, CO_brand)

The Semantic Brand Score of each brand is finally obtained by summing the standardized values of prevalence, diversity and connectivity.

Different approaches are also possible, such as taking the geometric mean of unstandardized coefficients.
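A rough sketch of that geometric-mean alternative, reusing the unstandardized values computed above, is shown here before the standard additive version that follows; note that a zero on any dimension would zero the whole product:

#Alternative (sketch): geometric mean of the unstandardized coefficients
SBS_geom = {}
for brand in brands:
    raw = [countPR[brand], DIVERSITY_sequence[brand], CONNECTIVITY_sequence[brand]]
    SBS_geom[brand] = np.prod(raw) ** (1 / 3)
    print("SBS (geometric mean)", brand, SBS_geom[brand])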

#Obtain the Semantic Brand Score of each brand
SBS = {}
for brand in brands:
    SBS[brand] = PREVALENCE[brand] + DIVERSITY[brand] + CONNECTIVITY[brand]
    print("SBS", brand, SBS[brand])

Analytics Demo

This link points to a short demo of the analyses that can be carried out, once the SBS has been calculated.

Word co-occurrence networks can additionally be used to study textual brand associations, in order to infer unique and shared brand characteristics.
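A minimal sketch of how such associations could be extracted from the filtered network built above, simply ranking each brand's neighbors by link weight:

#Sketch: top word associations of each brand, by co-occurrence weight
for brand in brands:
    neighbors = G_filtered[brand]
    top = sorted(neighbors.items(), key=lambda x: x[1]['weight'], reverse=True)[:10]
    print(brand, [(w, d['weight']) for w, d in top])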

The calculation of brand sentiment can also complement the analysis.
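One simple, purely illustrative option is to average a lexicon-based polarity score, such as NLTK's VADER, over the documents that mention a brand; as in the earlier sketch, raw_texts is an assumed variable holding the original, non-stemmed documents, and VADER works best on English social-media text:

#Sketch: average VADER sentiment of documents mentioning each brand
#(raw_texts is assumed to hold the original documents)
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download("vader_lexicon")
sia = SentimentIntensityAnalyzer()
for brand in brands:
    scores = [sia.polarity_scores(t)["compound"] for t in raw_texts if brand in t.lower()]
    if scores:
        print("Sentiment", brand, np.mean(scores))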

Conclusions

This article provided a brief introduction to the Semantic Brand Score and a short tutorial for its simplified calculation using Python 3.

While learning the basics, we should remember that there are many choices to be made, each of which can influence the results.

For example, one could choose different weighting schemes, or normalization approaches, to combine the 3 dimensions into a single score.

Particular attention should be paid to the selection of an appropriate word co-occurrence range.

Moreover, different techniques can be used to prune those links which supposedly represent negligible co-occurrences.

Lastly, the final code will be much more complex if the calculation is carried out on big data.

Metrics such as betweenness centrality have a high computational complexity on large graphs.
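One way to mitigate this before switching libraries, offered here only as a sketch, is to approximate betweenness centrality by sampling pivot nodes, which Networkx supports through the k parameter; the resulting scores are estimates rather than exact values:

#Sketch: approximate betweenness on large graphs by sampling pivot nodes
k_sample = min(500, len(G_filtered))
CONNECTIVITY_sequence = nx.betweenness_centrality(G_filtered, k=k_sample,
                                                  normalized=False, weight='inverse',
                                                  seed=1)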

Graph-Tool is a library that helped me a lot, as its performance is significantly better than that of Networkx.

In some cases, complexity can be reduced working on the initial dataset.

With online news, for example, one could choose to analyze just their title and first paragraph instead of their full content.

As a self-taught Python programmer, I would appreciate any comment or suggestion you might have about the metric and its efficient calculation.

Please feel free to contact me at any time.

References

[1] The Semantic Brand Score. Fronzetti Colladon, 2018.

[2] Conceptualizing, Measuring, and Managing Customer-Based Brand Equity. Keller, 1993.

[3] Semantic Brand Score page on Wikipedia.

[4] Forecasting Election Results Using the Semantic Brand Score. Fronzetti Colladon, 2019. XXXIX Sunbelt Conference of the International Network for Social Network Analysis, Montréal, Québec, Canada.

[5] Semanticbrandscore.com, the metric website, with updated links and information.
