Community Forums Meets Data Science

Can we help with your next project?” In addition, these members could be invited to higher-level discussions (e.g., the future roadmap of the project or company) rather than code Q&A conversations.

OTHER BEHAVIORAL ANALYSIS TOPICS

There is much more insight that can be gleaned from member login information.

For example, an analysis can be done on the 90% of registrants who never posted at all, based on their login time stamps.

These quiet “lurkers” who log in regularly can then be separated from the truly inactive, and their behavior analyzed to encourage higher levels of engagement.
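The separation described above can be sketched in a few lines. This is a minimal illustration using only the standard library; the record layout, the 90-day window, and the three-login threshold are assumptions chosen for the example, not values taken from the Liferay data.

```python
from datetime import datetime, timedelta

def classify_non_posters(logins_by_member, now, window_days=90, min_logins=3):
    """Label each non-posting member as 'lurker' or 'inactive'.

    logins_by_member: dict mapping a member id to a list of login datetimes.
    A member counts as a lurker when they logged in at least `min_logins`
    times during the last `window_days` days (both thresholds are assumptions).
    """
    cutoff = now - timedelta(days=window_days)
    labels = {}
    for member, logins in logins_by_member.items():
        recent = [t for t in logins if t >= cutoff]
        labels[member] = "lurker" if len(recent) >= min_logins else "inactive"
    return labels

# Hypothetical sample data: member ids and login times are made up.
now = datetime(2019, 6, 1)
sample = {
    "member_a": [now - timedelta(days=d) for d in (2, 9, 16, 30)],
    "member_b": [now - timedelta(days=400)],
    "member_c": [],
}
labels = classify_non_posters(sample, now)
print(labels)
```

In practice the thresholds would be tuned against the community's own login patterns before acting on the labels.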

Other interesting information can be gleaned by combining the forum records with other information sources, like Jira or CRM data, enabling a 360-degree view of the forum members’ relevant activities across platforms.

It may also be insightful to analyze popular posting times in the community, so that the marketing and communications team can identify the best times to post in the forums. (Keep in mind that international communities and time zones can make this analysis tricky.)
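One way to handle the time zone issue is to count posts by the hour in each poster's local time rather than in server time. The sketch below assumes every post carries a UTC timestamp and that a per-member UTC offset is available (a hypothetical profile field). Fixed offsets ignore daylight saving time, so this is a rough approximation.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

def posts_by_local_hour(posts):
    """Count posts per hour of day in each poster's local time.

    posts: iterable of (utc_timestamp, utc_offset_hours) pairs, where the
    offset is an assumed per-member attribute (e.g., from a profile setting).
    """
    counts = Counter()
    for utc_time, offset_hours in posts:
        local = utc_time.astimezone(timezone(timedelta(hours=offset_hours)))
        counts[local.hour] += 1
    return counts

# Hypothetical posts: timestamps and offsets are made up for illustration.
posts = [
    (datetime(2019, 3, 4, 14, 5, tzinfo=timezone.utc), -5),   # 09:05 local
    (datetime(2019, 3, 4, 8, 30, tzinfo=timezone.utc), 1),    # 09:30 local
    (datetime(2019, 3, 4, 23, 45, tzinfo=timezone.utc), 2),   # 01:45 local, next day
]
hourly = posts_by_local_hour(posts)
print(hourly.most_common(1))
```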

Finally, it might be interesting to understand the impact of initiatives.

For example, did member activity increase after the release of a new version of the software?

Analysis of community members’ activity on the forums can help improve the customer journey and member engagement.

The next level of insight can be obtained by analyzing the actual content of the forum’s posts.


USING NATURAL LANGUAGE PROCESSING (NLP) TO UNDERSTAND THE COMMUNITY

WORD CLOUDS: UNDERSTANDING TOPICS OF CONVERSATION

The actual content posted by a single user over time can be used to build a Member Profile Word Cloud, and the data that comprises the word cloud can be used to radically improve the member’s experience in the community.

For example, posts on the topics identified through the word cloud can appear in the member’s activity feed, making the feed more relevant and engaging.

For the word cloud below, I used a standard stop list (a list of words to be ignored by the program), which preserved words such as “login” and “community.”

To de-emphasize words that are common across the whole corpus (“login” is a common word in the Liferay forums), I could have applied “term frequency–inverse document frequency” (TF-IDF).
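To make the TF-IDF idea concrete, here is a minimal standard-library sketch of the plain, unsmoothed variant (libraries such as scikit-learn apply smoothing by default, so their numbers will differ). The mini-corpus is made up. The point is only that a term appearing in every document, like “login” here, scores zero.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: score} dict per document.

    documents: list of token lists (one list per member or per post).
    tf is the term's share of the document; idf is log(N / df), so a term
    that appears in every document scores 0.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document

    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical mini-corpus; the tokens are made up for illustration.
docs = [
    ["login", "portlet", "portlet", "theme"],
    ["login", "database", "upgrade"],
    ["login", "portlet", "jsp"],
]
scores = tf_idf(docs)
print(sorted(scores[0].items(), key=lambda kv: kv[1], reverse=True))
```

Feeding these scores, rather than raw counts, into the word cloud would shrink corpus-wide words without having to add them to the stop list by hand.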

Word cloud for a single member who posted 887 messages:

    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    # `result` is assumed to hold the member's combined post text
    text = result

    def generate_wordcloud(text):
        # Build the cloud, skipping a small custom set of stop words
        wordcloud = WordCloud(
            relative_scaling=0.5,
            stopwords={'re', 'to', 'of', 'for', 'the', 'is', 'Liferay',
                       'in', 'and', 'on', 'from', 'with'}
        ).generate(text)
        plt.imshow(wordcloud)
        plt.axis("off")
        plt.show()

    generate_wordcloud(text)

WORD COUNTS AND TRIGRAMS: IDENTIFYING THE MAIN TOPICS GENERATING DISCUSSION

What is the entire community talking about, and how are members interacting?

The chart below contains the word counts from the subject lines of every discussion in two Liferay forum categories (“Announcements” and “Development”).

After tokenizing the words in the subject, we can make a list of the top 10 words in each category:

    import nltk
    import pandas as pd

    # `words` and `stoplist` come from the tokenizing step
    devWords = [word for word in words if word not in stoplist]
    fdist_dev = nltk.FreqDist(devWords)
    dev_common = fdist_dev.most_common(10)

We can then build a data frame to compare the common words across categories:

    # `announce_common` is presumably built the same way for Announcements
    dfWord = pd.DataFrame({
        'Development': dev_common,
        'Announcements': announce_common
    })
    dfWord.index += 1  # rank from 1 instead of 0
    dfWord

The “!” in the Announcements category is more common for professional communities that are celebrating successes (e.g., “Congrats on the sale!”).

In the case of the Liferay forums, the dialogue resembles a typical “information dissemination” community (with keywords such as “released” and “available”).

The “?” and the word “how” that appear in the Development category are typical for this community, where the primary use case is members asking other members technical or product-related questions.

This “Q & A” content structure can be confirmed by looking at the most common trigrams (three-word combinations) below.

Development Category Trigrams:

    tgs = nltk.trigrams(words)
    fdistT = nltk.FreqDist(tgs)
    fdistT.most_common(5)

NOTES ABOUT THE DATA AND NLP ANALYSIS

For this type of analysis, make sure characters such as “?” and “!” are considered (i.e., not included on stop lists).

The announcement category is much smaller than the development category, and thus the word counts are much lower (but the relative ranking is what is most important).

On GitHub, I show the steps to convert thousands of subject lines into lists of words (chunk, tokenize, etc.).
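As a rough illustration of that conversion (not the code from the GitHub repository), the sketch below tokenizes subject lines with a regular expression that keeps “?” and “!” as tokens and keeps version-like numbers such as 6.2 in one piece. The pattern, the sample subjects, and the small stop list are all assumptions made for the example.

```python
import re
from collections import Counter

# Version-like numbers such as 6.1, words (with an optional apostrophe
# suffix), or a single "?" / "!". This pattern is an illustrative assumption.
TOKEN_PATTERN = re.compile(r"\d+(?:\.\d+)+|[A-Za-z0-9]+(?:'[A-Za-z]+)?|[?!]")

def tokenize_subjects(subjects):
    """Lowercase each subject line and split it into tokens."""
    words = []
    for subject in subjects:
        words.extend(TOKEN_PATTERN.findall(subject.lower()))
    return words

# Hypothetical subject lines, made up for illustration.
subjects = [
    "How to deploy a custom portlet?",
    "Liferay 6.2 released!",
    "How do I upgrade from 6.1 to 6.2?",
]
stoplist = {"to", "a", "do", "i", "from"}  # note: "?" and "!" are not listed
words = [w for w in tokenize_subjects(subjects) if w not in stoplist]
print(Counter(words).most_common(3))
```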

KEYWORD ANALYSIS: HOW NLP CAN ENABLE GREATER EFFICIENCY IN COMMUNITY MANAGEMENT

NLP can also help with routine, but essential, community management tasks.

For example, a common challenge in technical forums is getting good answers to all of the questions asked by community members (thus fostering trust and engagement within the community).

A typical approach used by Liferay is to assign experts to cover each forum category.

However, in Liferay’s case, nearly 40% of the messages fall in one category, “Development”, and it is impractical to assign experts to cover this entire catch-all category.

To tackle this issue, we can look for key themes in some of the discussions in this category that can be isolated to form smaller, more manageable discussion categories.

Keyword analysis is very helpful here. Single characters and short words add little in this case, so I filtered for longer words:

    long_words = [word for word in words if len(word) > 2 and word not in stoplist]
    fdistLong = nltk.FreqDist(long_words)
    fdistLong.most_common(50)

It is clear that there are some good candidates here for topics that can be separated out from the crowded “Development” category:

1. Portlet
2. JSP
3. …
4. Theme
5. Database

Bigrams (two-word combinations) reveal other good potential topics, including “service builder”, “custom portlet”, and “document library.”

It is also helpful to understand popular topics and how they trend over time.

For example, in the dispersion plot below, it is apparent that the community talked intensely about “6.1” (presumably, a version of Liferay) as it was coming out, and then the discussions quieted down.

There was also an overlap of discussion around “6.1” and “6.2” as “6.2” was released.

The code and dispersion plot for “6.1” and “6.2” appear below.

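    # `words` is assumed to be the token list from the steps above
    mytext = nltk.Text(words)
    mytext.dispersion_plot(["6.1", "6.2"])

The above functions and visualizations are just the tip of the iceberg; there is much more that can be discovered by analyzing user behavior and the content of posts.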

Alas, meaningful analysis with the latest AI tools, especially for “unstructured” analysis (where the system doesn’t know up front what it’s looking for), requires a lot of data.

Even so, relatively basic yet thoughtful analysis of small data sets can significantly improve community management.

Understanding the members’ behavior and their content can augment community managers’ ability to effectively serve their communities, increasing their communities’ relevance and by extension, engagement, membership, and impact.

¹Note: No names or personal details appear in the sample data.

The original forum postings and member profiles are public, so this is an extra cautionary step.

