Researchers in the Netherlands have developed a new machine learning method capable of distinguishing sponsored or paid content within news platforms with greater than 90% accuracy, in response to the increasing interest of advertisers in news formats that are difficult to distinguish from “real” journalistic output.
The authors note that while more reputable publications, which can more easily dictate terms to advertisers, make reasonable efforts to distinguish “affiliate content” from mainstream news and analysis, standards are slowly but inexorably shifting towards greater integration between editorial and commercial teams within an outlet, which they perceive as an alarming and negative trend.
‘The ability to disguise content, willingly or unwillingly, and the probability that advertorials are not recognized as such even if properly labelled is significant. Marketers call it native [advertising] for a reason.’
The work was carried out as part of a broader study of network news culture at Amsterdam’s ACED Reverb Channel, which focuses on the data-driven analysis of evolving journalistic trends.
Acquiring Data
To develop the source data for the project, the authors used 1,000 articles and 1,000 advertorials from four Dutch media outlets and classified them according to their textual characteristics. Because the data set was relatively modest, the authors avoided large-scale approaches like BERT, and instead evaluated the effectiveness of more classic machine learning frameworks, including Support Vector Machine (SVM), LinearSVC, Decision Tree, Random Forest, K-Nearest Neighbor (KNN), Stochastic Gradient Descent (SGD), and Naive Bayes.
The Reverb Channel corpus was able to provide the required 1,000 “direct” articles, but the authors had to extract the advertorials directly from the four Dutch websites studied. The resulting data is available on GitHub in limited form (for copyright reasons), along with some of the Python code used to retrieve and evaluate the data.
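As a rough illustration of the “classic” end of that spectrum, the sketch below implements a tiny multinomial Naive Bayes text classifier using only the Python standard library. It is not the authors' code: the toy sentences, the labels and every function name are invented for illustration, and a real experiment would more likely rely on an established machine learning library.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def train_naive_bayes(docs, labels):
    """Collect per-class word counts, per-class document counts and the vocabulary."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, class_counts, vocab


def predict(text, model):
    """Return the class with the highest log posterior, using Laplace smoothing."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in tokenize(text):
            if token in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data, invented for illustration only.
train_docs = [
    "council votes on new housing budget",
    "police investigate harbour fire",
    "minister resigns after budget dispute",
    "discover our exclusive offer and save today",
    "our partner helps your business save money",
    "exclusive deal from our partner this week",
]
train_labels = ["editorial"] * 3 + ["sponsored"] * 3

model = train_naive_bayes(train_docs, train_labels)
prediction = predict("save money with this exclusive offer", model)
```

With only six training sentences this says nothing about real-world accuracy; it simply shows how a word-frequency classifier of this kind assigns a label to unseen text.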
The four publications examined were the politically conservative Nu.nl, the more progressive Telegraaf, NRC and the business magazine De Ondernemer. Each publication was represented equally in the data.
It was necessary to identify and exclude possible “leaks” in the research lexicon (words that could appear in both types of content with little difference in their frequency and usage) in order to establish clear patterns for genuine editorial content and sponsored content.
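One way to picture this filtering step is sketched below, under the assumption that a “leak” is a word occurring in both classes at a similar relative frequency. The `max_ratio` threshold, the function names and the sample sentences are all hypothetical, and the authors' actual procedure may differ.

```python
from collections import Counter


def relative_frequencies(docs):
    """Map each word to its share of all tokens in the given documents."""
    counts = Counter(word for doc in docs for word in doc.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def find_shared_words(editorial_docs, sponsored_docs, max_ratio=1.5):
    """Return words used in both classes at a similar relative frequency.

    max_ratio is a hypothetical threshold: a word is flagged when its higher
    relative frequency is at most max_ratio times its lower one.
    """
    editorial = relative_frequencies(editorial_docs)
    sponsored = relative_frequencies(sponsored_docs)
    shared = set()
    for word in editorial.keys() & sponsored.keys():
        high = max(editorial[word], sponsored[word])
        low = min(editorial[word], sponsored[word])
        if high / low <= max_ratio:
            shared.add(word)
    return shared


# Hypothetical toy sentences, invented for illustration only.
editorial_docs = ["the council approved the budget", "the mayor opened the bridge"]
sponsored_docs = ["the best deal of the year", "save on the model"]

excluded = find_shared_words(editorial_docs, sponsored_docs)
```

Words returned by `find_shared_words` would then be dropped from the feature set before training, so that the classifier relies only on vocabulary that actually separates the two classes.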
Results
Of all the methods tested, the best results were obtained with SVM, LinearSVC, Random Forest, and SGD, so the researchers used SVM in subsequent analyses.
The best-performing approach to classifying the corpus exceeded 90% accuracy, although the researchers find that clear classification becomes more difficult in B2B-oriented publications, where the lexical overlap between regular content and “sponsored” content is considerable, perhaps because the native style of business language is already more subjective than general reporting and analysis conventions, and can more easily obscure an agenda.
Is Sponsored Content ‘Fake News’?
The authors’ research suggests that their project is novel in the field of news content analysis. Frameworks capable of identifying sponsored content could pave the way for year-on-year tracking of the balance between objective journalism and the increasing prevalence of “native advertising”, which today appears in almost the same context across most publications, using the same visual cues (CSS stylesheets and other formatting) as general content.
In a sense, the frequent lack of obvious context for sponsored content becomes part of the study of fake news. Even though most publishers recognize the need to separate “church and state” and to provide readers with clear divisions between paid and organically generated content, the realities of the post-print news scene and the increasing reliance on advertisers have turned the de-emphasis of sponsored indicators into an art of user interface psychology. Sometimes the rewards of running sponsored content are tempting enough to risk a major optics disaster.
In 2015, the social media and competitive benchmarking platform Quintly offered an artificial intelligence-based detection method to determine whether a Facebook post was sponsored, and claimed an accuracy rate of 96%. It has also been argued that the declaration of sponsored content could “contribute to the deception”.
In 2017, MediaShift, an organization that studies the intersection between media and technology, observed the increasing extent to which the New York Times monetizes its business through its branded content studio, T Brand Studio, and asserted that transparency around sponsored content is steadily declining, with the implied result that readers will not be able to easily tell whether or not content was organically generated.
In 2020, another research initiative in the Netherlands developed machine learning classifiers to automatically identify state-funded Russian news appearing on Serbian news platforms. Additionally, in 2019 Forbes’ Media Content Solutions was estimated to account for 40% of total revenue through BrandVoice, the content studio launched by the publisher in 2010.