Machine Learning Versus The News

Mathematically, we may be tempted to think that to know the truth in its unvarnished and untarnished essence, we must read every article that covers the events of the story.

Somehow we would then average away all the noise and be left with a well-informed and unbiased view on the underlying events and their implications.

But there are two problems:

A disproportionate part of the media might speak with one voice, and thus the noise won’t average away to reveal some notion of the truth.

It would, anyway, require immense discipline for us to read every article that’s published.

Such discipline, while laudable, is not a pragmatist’s preferred approach to life in the modern world — a place where the uninformed may drown in a sea of information.

Knowing that we lack this discipline, this pragmatist set out to find an algorithmic solution.

Broadly speaking, the algorithm must first identify the set of articles from across the media that cover the story in which we are interested.

Given the sheer number of articles that are published, this is non-trivial, but entirely solvable.

Next, that set of articles is analyzed to determine:

The extent to which they include a diverse set of perspectives.

Where, amongst the various perspectives, any particular article ranks.

Armed with this information we can:

Understand whether the article we are reading is written from an extreme perspective.

Propose another article to read which will give a more nuanced understanding of the story.

Score the neutrality of the aggregate coverage.

CLUSTERING

The first step corresponds loosely to clustering.

As with many aspects of Natural Language Processing, extracting useful meaning is something of a craft.

In this instance, we have nothing more than a set of words for each article and want to group the articles into those which cover the same story.

Ultimately each article in the corpus is replaced with a TF-IDF vector, which is effectively an abstract numerical representation of its relative information content.

(More on this later.) If we are careful with our pre-processing and parameter calibration, then near-neighbors in the vector space can reliably be judged to be articles pertaining to the same story.

The game becomes one of performing a set of pre-processing steps from the NLP arsenal.

These include the following (a rough sketch of a few of these steps appears after the list):

Removing corrupt articles — even a cursory look reveals that many articles in the corpus are not helpful.

They could be corrupt, they could be summary articles containing a sentence on each of many stories, etc.

For sure, these articles are not ones we recommend someone read in order to better understand a story.

Focussing on the first part of the article — a journalistic article is normally structured such that the first few sentences (the “lead” or “lede”) specify the Who, What, Why, When, Where and How of the story.

The terms contained in this part of an article are more likely to be shared among articles reporting the same story.

Synonym substitutions — some entities can be referred to in multiple ways.

If two related articles refer to the entity in different ways, they are less likely to be correctly paired.

To mitigate this risk, common synonyms are united.

For example, Federal Bureau of Investigation, FBI and F.B.I. are all translated to FBI.

Parts-of-speech restrictions — different parts of speech impart differing amounts of information about the actual topic of an article.

For example, proper nouns are likely to be most significant, verbs will contain some information, and adjectives typically very little.

Thus stripping out the less informative parts of speech helps increase the robustness of the article grouping.

Of course, this necessitates analyzing the article algorithmically to assign a role to each word.

Lemmatization — the vectorization doesn’t intrinsically understand the relationships between words.

For example, the same root word can be expressed in different tenses, etc.

So elect, elected, elects, election, electing, etc.

may be referring to the same thing.

Lemmatization is the process of replacing words with their root.

This has the advantage of increasing the chance that two related articles are brought closer together in vector space.

N-grams — consecutive words are grouped and used as combined terms in the vectorization.

The objective is to take advantage of the fact that if two terms occur adjacently in two articles, those articles are more likely to be related than if the terms appear in separate parts of each article.

For example, an article that mentions Hillary Clinton meeting Bill Gates might pair with an article on Bill Clinton; with bigrams included, the Hillary Clinton article will not contain the term “Bill Clinton” and hence is less likely to be wrongly paired.

Stop-word suppression — this step simply removes any words contained in a defined list of those which are too common to convey useful information.

This reduces the dimensionality of the vector, but may not significantly change the clustering since the IDF scores of such common terms will be very low.

The aim is to generate vectors focusing as much as possible on the terms that convey information pertinent to the story, those terms being the ones that are most likely to be shared with other articles covering the same story — and less likely to appear in articles on a different story.

These steps can be performed, with varying degrees of success, in many of the common NLP libraries, such as NLTK and spaCy.
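As a rough illustration, a minimal sketch of a few of those steps using spaCy might look like the following. The synonym map, the number of lead sentences kept, and the parts of speech retained are illustrative assumptions rather than the calibrated values used in the project.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Illustrative synonym map; the real list would be far larger.
SYNONYMS = {
    "Federal Bureau of Investigation": "FBI",
    "F.B.I.": "FBI",
}

def preprocess(text, lead_sentences=5, keep_pos=("PROPN", "NOUN", "VERB")):
    """Return a cleaned string of lemmas ready for TF-IDF vectorization."""
    # Unite common synonyms so related articles share the same term.
    for phrase, canonical in SYNONYMS.items():
        text = text.replace(phrase, canonical)

    doc = nlp(text)

    # Focus on the first part of the article (the lead).
    lead = list(doc.sents)[:lead_sentences]

    # Keep only the more informative parts of speech, drop stop words
    # and punctuation, and lemmatize what remains.
    tokens = [
        token.lemma_.lower()
        for sentence in lead
        for token in sentence
        if token.pos_ in keep_pos and not token.is_stop and token.is_alpha
    ]
    return " ".join(tokens)
```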

Once this is done, each article is converted to a TF-IDF vector using sklearn.

TF-IDF vectorization is a popular method for topic mining.

It represents each document in the corpus with a single vector.

That vector contains a value for each term in the combined dictionary.

The values themselves are high when that document’s topic relates to the term — and low, otherwise.

Hence words which appear in many documents are regarded as conveying little information for discriminating topic.

And words which appear many times in a document are regarded as being more significant for discriminating the topic.

The TF-IDF value is then effectively the product of how many times a term appears in the article, and how much topic-discriminating power that term has in the corpus of all provided documents.
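A minimal sketch of that vectorization with scikit-learn is shown below; the ngram_range and min_df settings are illustrative defaults, not the calibrated parameters from the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def vectorize(cleaned_articles):
    """Convert preprocessed article strings into a TF-IDF matrix.

    Rows are articles; columns are terms in the combined dictionary.
    """
    vectorizer = TfidfVectorizer(
        ngram_range=(1, 2),  # unigrams plus bigrams
        min_df=2,            # illustrative: ignore terms seen in only one article
    )
    return vectorizer.fit_transform(cleaned_articles), vectorizer
```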

If we take the news articles published across a set of media outlets on one specific day in 2016, the vectorization yields a series of extremely high-dimensional vectors.

Those high-dimensional vectors are what the processing actually operates on, but to visualize what’s going on it’s necessary to apply some standard mathematical techniques and map them to an approximation in two dimensions.

That gives us a graph such as this:

Story clusters inferred from news articles of 2016–09–01

Each point in the graph corresponds to an individual article.

The articles in the color-coded clusters are the ones the algorithm has determined are on the same topic.

(To increase reliability, that determination necessitated replacing the usual Euclidean distance with a new measure of affinity. Details are provided in the full project report.)
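The custom affinity measure is only defined in the report, so as a stand-in the sketch below clusters with DBSCAN on cosine distance and projects to two dimensions with t-SNE; both choices, and the eps and min_samples values, are assumptions made purely for illustration.

```python
from sklearn.cluster import DBSCAN
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def cluster_and_plot(tfidf):
    """Group articles into stories and draw a 2-D approximation of the clusters."""
    # Cosine distance stands in for the project's bespoke affinity measure;
    # eps and min_samples are illustrative, not calibrated values.
    labels = DBSCAN(eps=0.6, min_samples=2, metric="cosine").fit_predict(tfidf)

    # Map the high-dimensional TF-IDF vectors down to two dimensions
    # purely for visualization.
    coords = TSNE(n_components=2, metric="cosine", init="random").fit_transform(
        tfidf.toarray()
    )

    plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=8, cmap="tab20")
    plt.title("Story clusters (2-D approximation)")
    plt.show()
    return labels
```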

Note the large mass of blue points in the central area.

Each of these turns out to be the only article reporting on its particular story.

Because each is the sole article on its story, its vector contains only low TF-IDF values, and hence these points cluster around the origin in the centre of the graph.

The power of this transformation can be impressive.

What at first glance appears to be a single pink point for “Penn State” turns out to be a series of three points from nearly identical vectors that are only visible when zooming in several times.

A quick look at the beginning of the text of those articles shows that while they pertain to the same story, they are certainly not identical lists of words.

Related news articles on the Joe Paterno anniversary

The story of the Venezuela protests similarly turns out to contain three extremely close articles:

Related news articles on the Venezuela protests

What’s clear in both these cases is that the algorithm is succeeding in isolating the topic across thousands of articles, and the clustering is doing a good job of determining which articles are close enough to be designated as covering the same story.

While it’s immediately obvious to the human eye that these articles are related, once one takes a step back and thinks of thousands of news articles, each with a series of many words, each word conveying differing degrees of information, then one can begin to see that the mind must be doing some pretty clever processing itself in order to arrive at its conclusion.

So, now that we’ve found this set of articles, what can we do with it?

RANKING THE ARTICLES

In order to rank the articles, we need to find an axis against which to measure them.

Given the assumption that the articles reflect the same set of events, it should be possible to draw upon the work of sentiment analysis in order to evaluate and rank them.

There are several steps to this process:

Restore the original version of the article (from before lemmatization, etc).

Convert each article into a list of sentences (using NLTK).

Restrict the analysis to the first n sentences of each article. (A comprehensive grid search found that an optimal value for discerning sentiment was around 10 sentences per article.)

Compute the sentiment of each article.

This is done with the user’s choice of Stanford CoreNLP, Google Cloud Platform’s NLP library, or the NLTK implementation of VADER (a minimal sketch using VADER appears after this list).

Translate the results onto a -1 to +1 scale, based on the meaning of each library’s native scale.

Further scale the results to account for how much of that full range each library actually returns in practice.

For Google, an analysis of several thousand news articles suggested that the appropriate scaling was to divide by 0.86. (This was a smaller adjustment than for the other libraries, and probably reflects the fact that each library was trained/calibrated on differing types of data, with Google’s being trained on the broadest.)

Finally, the scaled standardized sentiment values of the articles comprising a story are taken as inputs to compute the Neutrality Score for the aggregate coverage of the story.
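To make the sentiment step concrete, here is a minimal sketch using the NLTK/VADER option. The scaling divisor is left as a parameter: the 0.86 figure quoted above applies to the Google library, and each library would need its own calibration.

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.tokenize import sent_tokenize

nltk.download("vader_lexicon", quiet=True)
nltk.download("punkt", quiet=True)

analyzer = SentimentIntensityAnalyzer()

def article_sentiment(text, n_sentences=10, scale=1.0):
    """Score the first n sentences of an article on a -1 to +1 scale.

    `scale` stands in for the per-library calibration described above;
    the right divisor for VADER would need its own analysis.
    """
    lead = " ".join(sent_tokenize(text)[:n_sentences])
    # VADER's compound score is already expressed on a -1..+1 scale.
    compound = analyzer.polarity_scores(lead)["compound"]
    return max(-1.0, min(1.0, compound / scale))
```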

The Neutrality Score is computed using a new formula that was designed specifically for this project, in which count(x, y) counts the number of articles having sentiment within the range from x to y.

This concept is very loosely inspired by Wikipedia’s editorial guidelines, in particular its definition of a “Neutral Point of View” to mean the inclusion of all verifiable points of view without undue weight being assigned to any of them.

The formula is discussed in considerable detail in the full technical report for the project.

PART 3: SOME RESULTS

The challenge when working with subjective data is finding an objective means for validating the results.

For this project the basic approach was to take the articles of a story, compute each article’s score, then compute the neutrality of the coverage.

These values were then ordered and correlated with the relative positions of the publishers of the articles on the Pew scale mentioned previously (see this report).
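The post does not name the exact correlation statistic, but as a hedged sketch, a rank correlation such as Spearman’s (via SciPy) would be one natural way to compare the two orderings; the inputs are assumed to be aligned per outlet.

```python
from scipy.stats import spearmanr

def coverage_vs_pew(article_scores, pew_positions):
    """Rank-correlate per-outlet sentiment for one story with the outlets'
    positions on the Pew scale. Both sequences must be aligned by outlet.
    """
    rho, p_value = spearmanr(article_scores, pew_positions)
    return rho, p_value
```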

Clearly this is not a perfect comparison — for example, the OpEd pieces in the New York Times may or may not align with the paper’s general editorial slant.

But if we take one story, the reporting on Trump’s meeting with the president of Mexico, the results seem to suggest that the further to the left the media organization stands, the more negatively it reported on the meeting.

Breitbart’s report was the most enthusiastic.

CNN was numerically close to neutral.

Reuters was perfectly neutral.

And the Guardian vied with the New York Times for the most negative coverage.

More cases are reviewed in the technical report.

PART 4: A FEW CONCLUSIONS

The project applied many different concepts to try to answer the original question of how to find and propose an article which covers the same story differently. The following conclusions can be taken away:

The clustering of the articles into stories appears to work very well, but the testing would need to be expanded in order to validate this perception.

Time and resources are, however, always the main constraints for this type of work.

The Neutrality Score produces values that appear meaningful, but it will require further enhancement to become truly robust.

To really get to the core of how valid the results are it may be necessary to consider radically different approaches to testing.

An imaginative decomposition of the problem might yield a task which can be put to the Mechanical Turk.

It would be interesting to track the evolution of a story over time.

With a few modifications to the algorithm this should prove doable.

My main conclusion is that journalistic articles are inherently complex — and measuring their sentiment is non-trivial.

It may be worth considering the sentiment of the article relative to the sentiment of the underlying story/facts.

This could give a clearer indication of any bias in the report.

Additionally, the sentiment of reported speech might need to be treated differently to the sentiment of the remainder of the article.

And some thought should be given to the future possibility of news organizations writing their articles in ways that game the score, perhaps by burying the lead.

This, too, could potentially be circumvented by flagging cases where the sentiment of the lead deviates significantly from the sentiment of the rest of the article.
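A minimal sketch of that flagging idea, again using VADER; the 0.5 deviation threshold is an arbitrary illustrative value, not one derived from the project.

```python
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.tokenize import sent_tokenize

analyzer = SentimentIntensityAnalyzer()

def lead_deviates_from_body(text, n_lead=10, threshold=0.5):
    """Flag articles whose lead sentiment diverges sharply from the rest.

    The threshold is an arbitrary illustrative value.
    """
    sentences = sent_tokenize(text)
    lead = " ".join(sentences[:n_lead])
    body = " ".join(sentences[n_lead:])
    if not body:
        return False
    lead_score = analyzer.polarity_scores(lead)["compound"]
    body_score = analyzer.polarity_scores(body)["compound"]
    return abs(lead_score - body_score) > threshold
```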

PART 5: ALL THE DETAILS

This post is lengthy, but not sufficient to get fully into the details.

For that, and for many more examples, here’s one last reminder of the links to the technical report and the source repository:

The technical report for the project is available here.

The underlying Python code, including a couple of Jupyter notebooks, can be found on GitHub.

