A large-scale Ipsos Public Affairs survey found that 23% of respondents use Facebook as a major source of news, and 83% of those believed the fake news they read on the platform to be authentic.
The right to be informed is essential: it allows us to make informed decisions every day, whether on a small scale (e.g., doing your weekly groceries) or on a large scale (e.g., during federal elections). However, this very information is not always reliable. Content created on digital platforms can be manipulated very easily, and its dissemination can also be exploited with malicious intent to spread misinformation. This is especially true with the proliferation of online media, where news and stories propagate fast and are often based on user-generated content, with little time and few resources for the information to be carefully cross-checked within the platform it spreads through. What makes misinformation so effective is that it exploits inherent characteristics of human nature, such as confirmation bias, latching onto the smallest seed of doubt and amplifying it with falsehoods and confusion.
Recently, the fact-checking world has been in somewhat of a crisis. Websites like Snopes and PolitiFact have generally focused on verifying specific claims and statements, which is admirable but also extremely tedious; by the time a claim has been reviewed, verified and debunked, there is a good chance it has already travelled across the world and back again. Recent advances in AI now make it possible to harness machine learning to classify news articles and complete tasks that would take humans much longer.
The Traditional Approach
Since the upsurge of fake news, the traditional approach to misinformation detection has combined Natural Language Processing (NLP) techniques with Machine/Deep Learning models to classify claims or news articles according to their authenticity, or lack thereof.
For the NLP portion, such approaches generally distance themselves from any real application of linguistics-driven algorithms and look solely at the content.
A few of the more common steps applied to the text content (in our case the claim/news statement), for example using Python’s Natural Language Toolkit (NLTK) library, would include:
Pre-Processing: Removing stopwords (common function words such as an, is, a, the, etc.), stemming/lemmatization (reducing words to their root forms, e.g. “walking” and “walked” are reduced to “walk”) and other standard data-cleaning processes
Feature Engineering: Through feature engineering, the text (still unstructured after pre-processing) is converted into a numeric vector which can then be fed into the training of the machine learning model. A few of the common NLP featurisation techniques, sketched together with the pre-processing step in the code example after this list, are:
- Bag of Words: This method counts the frequency of each unique word in a text, where each unique word becomes a feature/dimension and its count becomes the vector weight.
- TF-IDF (Term Frequency – Inverse Document Frequency): Unlike Bag of Words, TF-IDF considers each text in the context of all other texts in the training corpus. TF-IDF is calculated by multiplying the TF, or Term Frequency, which measures the frequency or probability of a word within a text, with the IDF, or Inverse Document Frequency, which measures how important that term is with respect to the whole corpus, or collection of texts used for training.
- Word2Vec: Word2Vec groups the vectors of similar words together in vector space, capturing the dependence of one word on another; that is, it detects similarities mathematically. It creates vectors that are distributed numerical representations of word features, such as the context of individual words, learned by training over a large amount of data. Given a sufficient volume of training data, usage and contexts, Word2Vec can make highly accurate guesses about a word’s contextual meaning based on its past appearances in the training data. Those guesses can be used to establish a word’s correlation with, and dependence on, other words (e.g. establishing that “male” is to “man” as “female” is to “woman”).
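As a concrete illustration, here is a minimal sketch of the pre-processing and featurisation steps above, using NLTK for cleaning, scikit-learn for Bag of Words and TF-IDF, and gensim for Word2Vec. The two-sentence corpus and all parameter values are purely illustrative, not a production pipeline.

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import Word2Vec

nltk.download("stopwords", quiet=True)

# Tiny illustrative corpus: one sober claim, one sensational one
corpus = [
    "Scientists confirmed the vaccine was tested in large clinical trials.",
    "Shocking miracle cure that the government does not want you to see!",
]

# Pre-processing: lowercase, keep alphabetic tokens, drop stopwords, stem to root forms
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
tokenised = [
    [stemmer.stem(tok) for tok in re.findall(r"[a-z]+", doc.lower()) if tok not in stop_words]
    for doc in corpus
]
cleaned = [" ".join(tokens) for tokens in tokenised]

# Bag of Words: each unique word is a dimension, its count is the vector weight
bow_matrix = CountVectorizer().fit_transform(cleaned)

# TF-IDF: term frequency re-weighted by how rare the term is across the corpus
tfidf_matrix = TfidfVectorizer().fit_transform(cleaned)

# Word2Vec: dense vectors placing words with similar contexts close together
w2v = Word2Vec(sentences=tokenised, vector_size=50, window=3, min_count=1)

print(bow_matrix.shape, tfidf_matrix.shape, w2v.wv[tokenised[0][0]].shape)
```

The resulting matrices or word vectors would then be fed into whatever classifier is chosen, such as logistic regression, a random forest or a neural network.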
Shortfalls
While classification models built on the aforementioned NLP techniques have been able to obtain fairly good accuracy scores, ranging from 70% to 99%, there are some major limitations:
- Usually, a massive training dataset is required, and that training data needs to be human-vetted.
- The training data has to be fairly well balanced between fake and authentic news, or else the model’s bias and variance can creep up fairly quickly (a common mitigation is sketched in the code after this list).
- The training data should be fairly homogeneous. As technology has evolved across generations, so have our modes of expression and our vocabulary, where a word can carry a completely different meaning in a given context even decades apart. This makes it almost impossible to create a homogeneous training set. Add to that the fact that parody or satire-based articles, which are often difficult even for humans to detect, can further complicate the training of these models.
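To make the balance point concrete, here is a hedged sketch of two standard mitigations, stratified splitting and class re-weighting, assuming a hypothetical labelled pandas DataFrame df with "text" and "label" columns; the column names and parameters are illustrative and not tied to any particular dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def train_with_imbalance_guardrails(df):
    """Train a simple fake-news classifier while keeping class imbalance in check."""
    # Inspect the class distribution before trusting any accuracy figure
    print(df["label"].value_counts(normalize=True))

    # stratify keeps the fake/authentic ratio identical in the train and test splits
    X_train, X_test, y_train, y_test = train_test_split(
        df["text"], df["label"], test_size=0.2, stratify=df["label"], random_state=42
    )

    vectoriser = TfidfVectorizer()
    X_train_vec = vectoriser.fit_transform(X_train)
    X_test_vec = vectoriser.transform(X_test)

    # class_weight="balanced" re-weights errors so the minority class is not ignored
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    clf.fit(X_train_vec, y_train)
    return clf, clf.score(X_test_vec, y_test)
```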
The Source and Linguistic-Based Approach
Researchers at the Computer Science and Artificial Intelligence Lab (CSAIL) at MIT and the Qatar Computing Research Institute (QCRI) claim that a better approach is to supplement the information gained from the textual content with information about the sources of the news themselves. Rather than relying solely on the textual elements of an article as the main input for the machine/deep learning models, this technique examines the propagation patterns as the news spreads through digital and social media. By assessing user interactions, comments and temporal patterns in the spread of such stories, as well as the reputation of the platforms/websites and of the users who share and engage with the content, a clearer picture of authenticity can be obtained. This would involve deploying a layer of web scraping and digital media forensics prior to the NLP and machine learning models. If a website has published fake news before, there is a high probability that it will do so again, whether because of its own lack of quality control over user-generated content or because of pure malicious intent. In reality, though, it becomes increasingly difficult to trace the original source after a story has undergone a chain of evolutions, being shared and possibly modified across a series of platforms, groups and users.
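One simple way to picture this idea is to append a source-level signal to the usual text features before training. The sketch below does exactly that with a hypothetical source_reliability lookup (for instance, derived from a site’s fact-checking history); the scores, domain names and feature layout are illustrative assumptions, not the CSAIL/QCRI method.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy articles, each carrying both its text and the domain that published it
articles = [
    {"text": "Officials release audited election results.", "source": "example-news.com"},
    {"text": "Secret cabal controls the weather, insiders say!", "source": "totally-real-truth.net"},
]

# Hypothetical reputation table: 1.0 = consistently reliable, 0.0 = repeat offender
source_reliability = {"example-news.com": 0.9, "totally-real-truth.net": 0.1}

# Text features, exactly as in the traditional approach
tfidf = TfidfVectorizer()
text_features = tfidf.fit_transform([a["text"] for a in articles])

# Source feature: one reputation score per article (0.5 when the source is unknown)
reputation = csr_matrix(
    np.array([[source_reliability.get(a["source"], 0.5)] for a in articles])
)

# Final feature matrix: textual content plus a source-reputation column
X = hstack([text_features, reputation])
print(X.shape)
```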
The teams at CSAIL and QCRI also advise that one of the more reliable methods of detecting fake news is to delve into the common linguistic characteristics across a source’s stories, including sentiment, complexity and structure. For example, fake-news outlets were found to be more likely to use language that is subjective, hyperbolic and emotional in order to sensationalise headlines. Additionally, these linguistic characteristics must be continually updated to stay relevant with the ever-changing vocabulary and verbal/textual expressions of today’s dynamic and diverse global population.
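As a rough illustration of such linguistic signals, the sketch below extracts sentiment, subjectivity and a couple of crude surface cues from two made-up headlines, using NLTK’s VADER analyser and TextBlob; the headlines and the choice of features are assumptions for illustration, not the researchers’ actual feature set.

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from textblob import TextBlob

nltk.download("vader_lexicon", quiet=True)

headlines = [
    "Central bank raises interest rates by 0.25 percentage points",
    "You WON'T BELIEVE the outrageous secret they are hiding from you!!!",
]

sia = SentimentIntensityAnalyzer()
for text in headlines:
    sentiment = sia.polarity_scores(text)["compound"]      # -1 (negative) to +1 (positive)
    subjectivity = TextBlob(text).sentiment.subjectivity   # 0 (objective) to 1 (subjective)
    words = text.split()
    avg_word_len = sum(len(w) for w in words) / len(words) # crude complexity proxy
    exclamations = text.count("!")                          # crude hyperbole proxy
    print(f"{text[:40]!r} sentiment={sentiment:+.2f} subjectivity={subjectivity:.2f} "
          f"avg_word_len={avg_word_len:.1f} exclamations={exclamations}")
```

Aggregating such scores over all of a source’s stories, rather than over a single headline, is what turns them into a profile of the outlet itself.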
The Challenges That Remain
In today’s “Post-Truth” era, where individuals are more likely to accept an argument based on their emotions and beliefs rather than on its factual foundations, the challenge of monitoring misinformation becomes one step more difficult: we have to question the very definition of fake news and whether there can always be a clear line separating what is “fake” from what is not.
The tech and social media boom was supposed to usher in a new era for all of humanity, bringing an unparalleled level of knowledge transfer and connectivity across the globe — and in many ways it truly has. However, the generation and propagation of misinformation has been one of the largest antitheses of this dream. That being said, let us not forget that even a decade on from the boom of social media, we are still in our digital infancy and there is a whole lot more to be discovered and innovated.
On that note, AI has now learned to CREATE fake news as well. One such development is “Neural Fake News”, which uses neural-network-based deep learning models to replicate human-like language, and that definitely warrants a more in-depth discussion for another day.