Text Mining Uncovers Millions of ‘Fake Comments’ Sent to FCC
According to a November 29, 2017 report by Reuters News Agency, “More than half of the 21.7 million public comments submitted to the U.S. Federal Communications Commission about net neutrality this year used temporary or duplicate email addresses and appeared to include false or misleading information, the Pew Research Center said.”
From April 27 to Aug. 30 the public was able to submit comments to the FCC on the topic, online or by e-mail. The Reuters article noted, “Of those, 57 percent used either duplicate e-mail addresses or temporary e-mail addresses, while many individual names appeared thousands of times in the submissions, Pew said. For example, ‘Pat M’ was listed on 5,910 submissions, and the e-mail address firstname.lastname@example.org was used in 1,002 comments. TV host John Oliver supported keeping net neutrality on his HBO talk show.”
Pew did not say how many of the comments supported or opposed the FCC’s proposal. “They found that only six percent of submitted comments were unique while the rest had been submitted multiple times, in some cases, hundreds of thousands of times,” the authors stated. “Thousands of identical comments were submitted in the same second on at least five occasions. On July 19 at precisely 2:57:15 p.m. ET, 475,482 comments were submitted, Pew said, adding that almost all were in favor of net neutrality.”
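The same-second bursts described above are the kind of pattern a simple tally can surface. The following is a minimal Python sketch, not Pew's actual method: the function name `find_bursts`, the `threshold` parameter, and the sample timestamps are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from datetime import datetime


def find_bursts(timestamps, threshold):
    """Return {second: count} for every second with at least `threshold` submissions."""
    # Truncate each timestamp to whole seconds so sub-second differences collapse.
    per_second = Counter(ts.replace(microsecond=0) for ts in timestamps)
    return {sec: n for sec, n in per_second.items() if n >= threshold}


# Hypothetical sample data, not taken from the FCC docket.
sample = [
    datetime(2017, 7, 19, 14, 57, 15, 100000),
    datetime(2017, 7, 19, 14, 57, 15, 900000),
    datetime(2017, 7, 19, 14, 57, 15),
    datetime(2017, 7, 19, 14, 57, 16),
]
bursts = find_bursts(sample, threshold=3)
print(bursts)
```

On a real docket the threshold would presumably be set far higher, since a handful of comments arriving in the same second is unremarkable.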
In the same vein, data scientist Jeff Kao used a similar dataset and got a similar result. Writing on the blog Hackernoon, Kao reports, “I used natural language processing techniques to analyze net neutrality comments submitted to the FCC from April-October 2017, and the results were disturbing. My research found at least 1.3 million fake pro-repeal comments, with suspicions about many more. In fact, the sum of fake pro-repeal comments in the proceeding may number in the millions.”
He continues, “It was clear from the start that the data was going to be duplicative and messy. If I wanted to do the analysis without having to set up the tools and infrastructure typically used for ‘big data,’ I needed to break down the 22-plus million comments and more than 60GB worth of text data and metadata into smaller pieces.
“Thus, I tallied up the many duplicate comments and arrived at 2,955,182 unique comments and their respective duplicate counts. I then mapped each comment into semantic space vectors and ran some clustering algorithms on the meaning of the comments. This method identified nearly 150 clusters of comment submission texts of various sizes.”
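The two steps Kao describes, tallying duplicates and then clustering what remains, can be sketched roughly in Python. This is an illustration under stated assumptions and not Kao's code: he reports using semantic space vectors and clustering algorithms, whereas this sketch swaps in plain bag-of-words vectors, cosine similarity, and a greedy single-pass grouping. All function names, the 0.6 threshold, and the sample comments are hypothetical.

```python
import math
import re
from collections import Counter


def tally_duplicates(comments):
    """Collapse exact duplicates (after case and whitespace normalization) into counts."""
    return Counter(" ".join(c.lower().split()) for c in comments)


def bag_of_words(text):
    """Word-count vector; a crude stand-in for a semantic embedding."""
    return Counter(re.findall(r"[a-z']+", text))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def greedy_cluster(texts, threshold=0.6):
    """Put each text in the first cluster whose seed it resembles, else start a new one."""
    clusters = []  # list of (seed_vector, member_texts)
    for text in texts:
        vec = bag_of_words(text)
        for seed, members in clusters:
            if cosine(vec, seed) >= threshold:
                members.append(text)
                break
        else:
            clusters.append((vec, [text]))
    return [members for _, members in clusters]


# Hypothetical comments, for illustration only.
comments = [
    "I support net neutrality.",
    "i support   net neutrality.",
    "I strongly support net neutrality.",
    "Please repeal the Title II rules.",
]
unique = tally_duplicates(comments)
clusters = greedy_cluster(list(unique))
print(len(unique), "unique comments in", len(clusters), "clusters")
```

Deduplicating first keeps the clustering step small, which matches the motivation Kao gives for avoiding "big data" infrastructure.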
After clustering comment categories and removing duplicates, Kao found “less than 800,000 of the 22 million comments submitted to the FCC (3-4 percent) could be considered truly unique.”
Frontline Systems Releases XLMiner® SDK V2018 for High-Performance Predictive Analytics
Frontline Systems has released XLMiner SDK V2018, a next-generation version of its Software Development Kit for data mining, text mining, forecasting, and predictive analytics. XLMiner SDK offers application developers working in C++, C#, Java, Python or R a powerful, high-level API for quickly creating applications that use predictive analytics. Developers can register for a free account at https://www.solver.com, then download and install a fully functional version of XLMiner SDK with a free 15-day trial license.
“Data mining and machine learning software has proliferated, but there’s a difference between common libraries and truly robust, high-performance software – especially if you’re working in C++, C# or Java,” said Daniel Fylstra, Frontline’s President and CEO. “XLMiner SDK is a toolkit that developers can count on to build commercial-grade applications.”
Full Support for Popular Programming Languages
XLMiner SDK provides full API support for five popular programming languages: C++11 or later, C# 4.0 or later, Java 8, Python 2.7 or 3.6 (both are supported), and R 3.4. In Microsoft Visual Studio and RStudio, developers will benefit from automatic recognition and “command completion” for XLMiner’s objects, properties and methods. The new SDK is also ready for REPL (Read-Eval-Print Loop) style execution with C# Interactive.
XLMiner SDK’s R support uses R-native types, including R’s own data frame type; hence it can be used easily with a wide range of R packages from CRAN. XLMiner SDK provides its own “R package” that can be loaded with one command from R itself, or from an IDE such as RStudio.
For C++, C# and Java developers, XLMiner SDK should be especially welcome, since quality data mining tools have been hard to find for these popular languages. But even R and Python developers will find that XLMiner SDK offers a far better integrated, comprehensive data mining and text mining toolkit.
Support for Popular Databases and Files, Text, and Big Data
The SDK also handles unstructured text data, and provides stemming, term normalization, vocabulary reduction, creation of a term-document matrix, and concept extraction with latent semantic indexing. It even has built-in facilities to draw a statistically representative sample from an Apache Spark Big Data cluster, running a Frontline-supplied component on one of the cluster nodes.
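To make the text-processing terms above concrete, here is a minimal pure-Python sketch of term normalization, stemming, and a term-document matrix. It does not use or depict the XLMiner SDK API; the function names, the stopword list, and the crude suffix-stripping rule are all assumptions made for illustration, and a production system would use a proper stemmer.

```python
import re
from collections import Counter

# Hypothetical stopword list used for vocabulary reduction.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "are"}


def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(doc):
    """Lowercase, split into words, drop stopwords, then stem."""
    words = re.findall(r"[a-z]+", doc.lower())
    return [crude_stem(w) for w in words if w not in STOPWORDS]


def term_document_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] counts term i in document j."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix


docs = ["The models are trained", "Training a model"]
vocab, matrix = term_document_matrix(docs)
print(vocab)
print(matrix)
```

Latent semantic indexing would then apply a truncated singular value decomposition to a matrix like this one; that step is omitted here because it needs a linear algebra library.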
Model Export in PMML, and Export/Import in JSON
The new SDK release can export a wide range of trained/fitted models in industry-standard PMML (Predictive Modeling Markup Language) format. Supported exports include data transformations, linear and logistic regression, decision trees, neural networks, and k-nearest neighbors for both classification and prediction, as well as discriminant analysis, naïve Bayes, time series models, association rules, and even ensembles with boosting, bagging, and random forest methods. Few other products provide such extensive PMML support.
XLMiner SDK also provides its own JSON serialization format, more general than PMML, for its full range of objects (DataFrames, Estimators and Models) and properties.
Faster and More Robust Algorithms
Statistical and machine learning algorithms in XLMiner SDK are optimized for performance on current Intel-compatible processors. In the new release, the naïve Bayes algorithm is much faster and less memory-intensive, while k-nearest neighbors is an order of magnitude faster in k-parameter tuning, and handles distance matrices that would exceed available memory in other software.
Category Reduction and Missing Data Handling algorithms are also extended for multivariate use, with new “missing value options” for different data types, and One-Hot-Encoding is faster and enhanced for categorical variables. The new release even offers Vector and Matrix objects that enable developers to write high-level “linear algebra expressions” with high-performance, parallel multi-core execution.
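For readers unfamiliar with one-hot encoding and missing-value handling, the idea can be sketched in a few lines of plain Python. This is an illustration only and does not reflect how XLMiner SDK implements or exposes these features; the function name and the `"<missing>"` placeholder label are hypothetical, and the sketch assumes all non-missing values are strings.

```python
def one_hot_encode(values, missing_label="<missing>"):
    """Map categorical values to 0/1 indicator rows; None becomes its own category."""
    filled = [missing_label if v is None else v for v in values]
    categories = sorted(set(filled))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in filled:
        row = [0] * len(categories)
        row[index[v]] = 1  # exactly one indicator is set per observation
        rows.append(row)
    return categories, rows


categories, rows = one_hot_encode(["red", "blue", None, "red"])
print(categories)
print(rows)
```

Treating a missing value as its own category is only one possible option; imputing the most frequent category or dropping the record are common alternatives.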