Data Science & NLP for an effective content strategy

Harnessing the power of Process excellence, Analytics & NLP to design the content strategy for a data science knowledge portal

Sowmya Vivek, May 4

Background

The demand for skill-sets in data science and AI has been exponentially increasing over the past few years.

However, the supply of skilled data scientists is not increasing at the same pace, leading to a big gap between demand and supply.

In addition to structured training in the various disciplines of data science, all-round knowledge sharing and enhancement, fueled by data science knowledge portals, will help to a large extent in bridging the demand-supply gap for data scientists.

The success and effectiveness of a knowledge portal depend on the range of content that it can deliver at the best content quality in the shortest possible time.

This in turn depends on identifying suitable content contributors and automating the overall content curation process.

Overarching approach of the content strategy

The overarching approach for designing the content strategy is based on harnessing the power of process excellence, analytics & NLP to ensure content & contributor quality by the following:

[Figure: Overarching approach to content strategy]

Building blocks of the Content Strategy

The following will be the building blocks of the content strategy:

[Figure: Building blocks of content strategy]

Building a content matrix with proposed content types

The portal will have an even mix of various content types, and a multi-dimensional content matrix will be developed.

The criteria for arriving at the content types will correspond to each dimension of the multi-dimensional content matrix.

The criteria proposed are:

[Figure: Content matrix and content types]

Content hierarchy & flow of viewership in the content chain

The content hierarchy is meant to enable a seamless flow of audience from the simplest form of content, which needs a relatively short attention span, to the most complex content types, which need much more attention and interest.

[Figure: Content hierarchy]

The first level of the content hierarchy would include content pieces such as a daily dose of a concept in analytics or short YouTube videos on concepts, which would demand very little reading time from the reader.

These would help in drawing the initial audience for the portal and could serve as leads to the more intense content types that would need a greater reading time and higher span of attention from the reader.

The next level in the hierarchy would be blogs which could be based on the first level and could be of any of the criteria proposed above — for various levels of audience and on various concepts of data science & ML.

These will demand greater attention than the first level and will usually emerge from one of the level-one content types.

A selected collection of blogs on a particular topic could then be combined into an e-book. The criteria for selecting articles for the collection could be either similarity of topic or a good reception from the audience, shown by a high read ratio and viewership.

The next, and highest, level of content type could be extensive reference works on broad areas in Data Science, with extensive articles, podcasts and industry developments on each area, which could be re-purposed as training content for that area.

This way, the content hierarchy would create a chain of viewership with one level leading the reader to the next and so on — thereby creating a seamless viewership funnel.

The content chain based on the content hierarchy

The content chain is inspired by the food chain in an ecosystem and shows the ideal flow of traffic from one level of the content hierarchy to another.

Proposed flow of traffic across the content chain

The forethought behind creating the content hierarchy is to start with content requiring the least span of attention and gradually build up the trajectory across more complex content types needing a higher span of attention.

The flowchart below illustrates the proposed flow of traffic across several levels of content types with colors representative of the level of content hierarchy.

[Figure: Flow of traffic across the content chain]

Setting up content metrics

The objective of designing the content metrics is to evaluate the effectiveness of the content both in terms of content quality and audience preferences.

This will provide a sense of direction to the editorial team on how to steer the content & contributor selection.

Number of repeated views vs Number of views — This indicates how many users have read the content more than once, which means that the content piece is referred to frequently and a higher ratio here could mean that the content piece could qualify as a reference material.

Number of clicks & Clicks to view ratio — This is a direct measure of the content quality, since it indicates how many viewers who have clicked on the content have read the complete content.

Number of likes — This is a measure of viewership for the content.

Number of highlights & citations: Similar to the first metric, this is a measure of how often the content piece is referred to by other authors.

It also gives a measure of the other sources that directed traffic to the content piece.

Bounce rate — This is a typical metric from web analytics which indicates the % of viewers who navigate away from the content piece as soon as they enter.

This is an indication of low reader engagement.

Sentiment scores: Sentiment analysis of the reader comments will provide insights into the most liked and disliked aspects of a content piece.

Impact factor: Finally, a comprehensive score could be developed for the content piece by combining several of the metrics mentioned above, as a measure of content quality and of how much the piece engages the reader.
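A composite score of this kind could be sketched as a weighted sum of normalised metrics. The sketch below is only an illustration under stated assumptions: the field names, the weights, the cap of 10 citations and the [-1, 1] sentiment range are all hypothetical choices, not values proposed in this strategy.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ContentStats:
    """Raw counters for one content piece (field names are illustrative)."""
    clicks: int            # readers who opened the piece
    full_reads: int        # readers who read the complete piece
    views: int             # total views
    repeat_views: int      # views by readers who had already seen the piece
    bounces: int           # readers who left as soon as they entered
    likes: int
    citations: int
    mean_sentiment: float  # average comment sentiment, assumed to lie in [-1, 1]


def safe_ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is 0."""
    return numerator / denominator if denominator else 0.0


# Hypothetical weights; they sum to 1 so the score stays within [0, 1].
WEIGHTS = {
    "read_ratio": 0.30,
    "repeat_ratio": 0.20,
    "retention": 0.20,
    "like_rate": 0.10,
    "citation_score": 0.10,
    "sentiment": 0.10,
}


def impact_factor(s: ContentStats) -> float:
    """Combine the individual metrics into a single score between 0 and 1."""
    components = {
        # Share of readers who clicked and then read the complete piece.
        "read_ratio": safe_ratio(s.full_reads, s.clicks),
        # Share of views that are repeat views (reference-material signal).
        "repeat_ratio": safe_ratio(s.repeat_views, s.views),
        # Complement of the bounce rate.
        "retention": safe_ratio(s.clicks - s.bounces, s.clicks),
        "like_rate": min(1.0, safe_ratio(s.likes, s.views)),
        # Assumption: 10 or more citations earns the full citation score.
        "citation_score": min(1.0, s.citations / 10),
        # Rescale sentiment from [-1, 1] to [0, 1].
        "sentiment": (s.mean_sentiment + 1.0) / 2.0,
    }
    return sum(WEIGHTS[name] * value for name, value in components.items())
```

In practice the editorial team would tune the weights against its own notion of a successful piece; the weighted sum is simply the easiest form to explain and audit.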

On-boarding contributors

The key to high-quality content capable of engaging viewers is to have a set of passionate and knowledgeable contributors who can create content masterpieces.

The core principle of hiring contributors is to identify candidates who have an equally balanced passion for writing as well as data science.

The hiring process should consider the following aspects while looking for prospective contributors:

- Motivational fit to write
- Alignment of contributors’ passion with the vision of the knowledge portal
- Domain expertise
- Projected propensity of retention
- Good mix of contributors with diverse demography & domain expertise

The contributors are sourced based on a combined automated and manual search, and typically will include:

- Industry experts / celebrities in various domains of data science
- Current writers who have their blogs or write for other publications
- Data science enthusiasts who want to start writing
- Student writers

Semi-automated approach for Contributor selection & on-boarding (except Industry experts / celebrities)

The approach illustrated below will help in creating a well synchronised and documented process flow for creating a pipeline of contributors and capturing information about the contributors at every milestone of the selection process.

The semi-automated approach will also ensure that the entire selection process is scalable as the number of contributors and the publication size increases without introducing a bottleneck of dependency on people.

Work allotment for contributors

Once the contributors are selected based on the above selection process, it is important to ensure that the contributors have a constant flow of work without being under-occupied or over-occupied.

As the publication grows in size, a structured and automated approach of work allotment is again crucial to maintain scalability.
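One way to picture such automated allotment is a greedy matcher that hands each content request to a contributor whose domain fits and who still has spare capacity. This is a minimal sketch under assumptions: the class names, the fields and the "most urgent request first, least-loaded contributor wins" rule are hypothetical and stand in for whatever profile matching the portal would actually use.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Contributor:
    name: str
    domains: set[str]                  # areas the contributor can write about
    capacity: int                      # pieces the contributor can take on
    assigned: list[str] = field(default_factory=list)


@dataclass
class ContentRequest:
    title: str
    domain: str
    deadline_days: int                 # days left until the publication deadline


def allot(requests: list[ContentRequest],
          contributors: list[Contributor]) -> tuple[dict[str, str], list[str]]:
    """Assign each request to the least-loaded contributor with a matching domain.

    Returns (assignments, unassigned): a title -> contributor-name mapping and
    the titles that no contributor could take.
    """
    assignments: dict[str, str] = {}
    unassigned: list[str] = []
    # Most urgent requests are placed first.
    for req in sorted(requests, key=lambda r: r.deadline_days):
        candidates = [
            c for c in contributors
            if req.domain in c.domains and len(c.assigned) < c.capacity
        ]
        if not candidates:
            unassigned.append(req.title)
            continue
        # Pick the contributor with the lowest relative load.
        chosen = min(candidates, key=lambda c: len(c.assigned) / c.capacity)
        chosen.assigned.append(req.title)
        assignments[req.title] = chosen.name
    return assignments, unassigned
```

Returning the unassigned titles explicitly keeps the process measurable: a growing list is a signal that more contributors are needed in a domain.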

The process of work allotment should be based on the following factors:

- Matching content to be created with domain and capacity of contributors
- Setting up an automated allotment system based on profile matching so that content is auto-assigned to contributors without any wait time
- Creating a pipeline for contributors to work on based on contributor availability and publication deadline
- Making the assignment process transparent and measurable

Content Quality

Content created by the contributors should be evaluated on the fly using a multi-layered approach.

This will be a combination of rule-based algorithms, NLP and a manual editorial layer, with a feedback mechanism to the rule-based algorithms to maintain consistent content quality.

This will ensure that maintaining content quality does not become a bottleneck that hampers the publication cycle time.

The top three layers are automated NLP-based layers that will considerably filter content before it is passed on to the manual filter, reducing manual effort and human dependency.

Further, the feedback from the manual layer will be used to enhance the learning of the NLP layers, strengthening the automated quality layer day by day.
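The layered filtering described above could be sketched as a chain of automated checks that a draft must pass before it reaches an editor. The three checks here (minimum length, average sentence length, word-overlap duplicate detection) and their thresholds are illustrative assumptions and simple rule-based stand-ins; the actual NLP layers are not specified in this strategy.

```python
from __future__ import annotations

import re
from typing import Callable, Optional

# Each layer returns None when the draft passes, or a rejection reason.
Layer = Callable[[str], Optional[str]]


def length_layer(min_words: int = 300) -> Layer:
    """Reject drafts that are shorter than min_words."""
    def check(text: str) -> Optional[str]:
        n = len(text.split())
        return None if n >= min_words else f"too short ({n} words)"
    return check


def sentence_length_layer(max_avg_words: float = 30.0) -> Layer:
    """Reject drafts whose average sentence length hurts readability."""
    def check(text: str) -> Optional[str]:
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        if not sentences:
            return "no sentences found"
        avg = sum(len(s.split()) for s in sentences) / len(sentences)
        if avg <= max_avg_words:
            return None
        return f"sentences too long (avg {avg:.1f} words)"
    return check


def duplicate_layer(published: list[str], max_overlap: float = 0.8) -> Layer:
    """Reject drafts whose vocabulary overlaps heavily with published content."""
    def words(text: str) -> set[str]:
        return set(re.findall(r"[a-z']+", text.lower()))

    def check(text: str) -> Optional[str]:
        draft = words(text)
        for other in published:
            ref = words(other)
            union = draft | ref
            # Jaccard similarity between the two word sets.
            if union and len(draft & ref) / len(union) > max_overlap:
                return "near-duplicate of published content"
        return None
    return check


def run_filters(text: str, layers: list[Layer]) -> tuple[str, Optional[str]]:
    """Run the automated layers in order; survivors go to manual review."""
    for layer in layers:
        reason = layer(text)
        if reason is not None:
            return "rejected", reason
    return "manual_review", None
```

The feedback loop is not implemented in this sketch. One simple form of it would be to log each editor decision next to the layer results and periodically adjust thresholds such as `min_words` or `max_overlap` from that log.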

[Figure: Multi-layer content quality filtering]

Handling reader review

Any business venture is successful only if it can listen to its customers and work diligently on customer feedback.

The same holds for the portal if it is to evolve into an audience-centric portal driven by the voice of the customer.

Reading reader reviews and responding to reader questions is important to sustain this aspect.

This could be achieved again with a combination of an NLP layer and a manual layer to ensure that this is scalable.


1. Text mining of reader reviews to flag highly critical reviews or unhappy audience, and an automated escalation mechanism so that such reviews are brought to the attention of relevant stakeholders.

2. A chatbot or Q&A layer based on RNN set up to answer reader questions. The questions that cannot be answered by the chatbot can be forwarded to a human agent and the answer can be fed back to the chatbot for learning.
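Both mechanisms can be illustrated with a small sketch. Two substitutions are made for brevity and should be read as assumptions: a hypothetical word list replaces a trained text-mining model for flagging reviews, and a lookup table replaces the RNN-based chatbot, so only the escalation and human-feedback flow is shown.

```python
from __future__ import annotations

import re
from typing import Optional

# Hypothetical lexicon; a production system would use a trained sentiment model.
NEGATIVE_TERMS = {
    "wrong", "misleading", "useless", "terrible",
    "plagiarized", "outdated", "confusing",
}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def needs_escalation(review: str, threshold: float = 0.15) -> bool:
    """Flag a review when its share of negative terms exceeds the threshold."""
    tokens = tokenize(review)
    if not tokens:
        return False
    negative = sum(token in NEGATIVE_TERMS for token in tokens)
    return negative / len(tokens) > threshold


class QABot:
    """Lookup-table stand-in for the RNN-based chatbot."""

    def __init__(self) -> None:
        self.answers: dict[str, str] = {}
        self.pending: list[str] = []   # questions forwarded to a human agent

    @staticmethod
    def _key(question: str) -> str:
        # Normalise case and punctuation so similar phrasings share a key.
        return " ".join(tokenize(question))

    def ask(self, question: str) -> Optional[str]:
        """Return a known answer, or queue the question for a human agent."""
        answer = self.answers.get(self._key(question))
        if answer is None:
            self.pending.append(question)
        return answer

    def learn(self, question: str, answer: str) -> None:
        """Store a human agent's answer so the bot can reuse it later."""
        self.answers[self._key(question)] = answer
        if question in self.pending:
            self.pending.remove(question)
```

A flagged review would be routed to the relevant stakeholders, and the `pending` list plays the role of the hand-off queue to the human agent.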

Proposed high level process flow

Considering all the aspects discussed in the overall content strategy, the following high-level process flow that incorporates all aspects of the content strategy is proposed:

Concluding remarks

The content strategy that evolves based on the various aspects of process optimization, NLP & data science concepts will lead to an optimized & reliable content flow for a knowledge portal.

Finally, such a strategy also strengthens the knowledge portal’s ideals in terms of “practicing what is preached”.
