Can you work with Big Data in Excel? From the barrage of recent news, white papers, and sales calls about Big Data, you would think not. Better think again.
If you’ve had the experience of waiting in line to board a plane at your local airport only to have it announced that the departure would be late because the incoming flight was delayed, join a very big club. Flight on-time performance of the major airlines has been a growing problem as more aircraft are in the sky, more people are in the boarding queue, and more security causes slowdowns throughout the process. Finding the causes, and relief from them, requires exploring a great deal of data: the kind now known as Big Data.
A common theme expressed in multiple white papers has been that widely used spreadsheets can’t handle Big Data and advanced analytics. Companies need to move to new tools—specifically those that the vendors behind the white papers offer. The implicit claim is that the benefits outweigh the expense and steep learning curve.
But what if you can work with Big Data in a program like Excel? It turns out that you actually can, using the classic tool of statistically representative sampling. A blog post from Frontline Systems shows how you can do this with XLMiner, also known as Analytic Solver Data Mining.
The project in question studies the airline data sets used in an online tutorial by HortonWorks, one of the best-known Big Data firms. The 120-million-record data set covers all commercial flights within the United States from October 1987 to April 2008: 29 commercial airlines and 3,376 airports, including 3.2 million cancelled flights and 25 million flights at least 15 minutes late.
Data by the Million
The post shows how you can quickly summarize results across 120 million records with point-and-click operations, rather than writing lambda function code in Python, by performing an aggregation query against the data in Frontline’s Apache Spark Big Data cluster on Amazon Web Services. The authors obtained the average delays for 341 airports, aggregated over the 22-year period, as an Excel data table.
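The heart of that aggregation is a group-by/average over flight records. As a minimal sketch (not Frontline’s actual Spark query, and using made-up records), the same computation looks like this in plain Python:

```python
from collections import defaultdict

# Hypothetical sample records: (origin airport code, departure delay in minutes).
flights = [
    ("ORD", 20), ("ORD", 40), ("FMN", 90),
    ("FMN", 110), ("DEN", 5), ("DEN", 15),
]

# Group-by/average: the same kind of aggregation the Spark query
# performs across 120 million records, here over six.
totals = defaultdict(lambda: [0, 0])  # airport -> [sum of delays, flight count]
for airport, delay in flights:
    totals[airport][0] += delay
    totals[airport][1] += 1

avg_delay = {airport: s / n for airport, (s, n) in totals.items()}
print(avg_delay)
```

At scale, the cluster does exactly this per-airport summing and counting in parallel, then returns only the small table of averages to Excel.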
The result? Farmington, N.M., appears to have the longest delays. That data table was then used to create a visualization in Microsoft Power Map.
Where the HortonWorks tutorial simplified matters by restricting the data to flights originating from Chicago’s O’Hare airport, Frontline’s analysis covers all 3,376 airports using a simple menu selection: Get Data – Big Data – Sample. This allows you to draw a statistically representative random sample of about 100,000 records from the Apache Spark cluster. Like HortonWorks, the post partitions the data into a training set from 2007 flights and a validation set from 2008 flights.
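The sample-then-split workflow can be sketched in a few lines of Python. This is an illustration with synthetic records, not XLMiner’s implementation; the menu selection performs the equivalent draw against the Spark cluster rather than an in-memory list:

```python
import random

random.seed(0)  # reproducible illustration

# Hypothetical flight records: (year, scheduled hour, was-delayed flag).
population = [(2007 if i % 2 else 2008, i % 24, i % 3 == 0)
              for i in range(1_000_000)]

# A simple random sample of ~100,000 records is statistically
# representative of the full population.
sample = random.sample(population, 100_000)

# Partition by year, as in the post: 2007 flights for training,
# 2008 flights for validation.
train = [r for r in sample if r[0] == 2007]
valid = [r for r in sample if r[0] == 2008]
print(len(train), len(valid))
```

Splitting on year rather than at random means the model is validated on flights it could not have seen during training, which mimics real forecasting.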
Applying logistic regression over 100,000 records to obtain a binary classifier, using data about each flight to predict whether it was delayed, takes a fraction of a second in XLMiner. Comparing results with Iteration 1 of the HortonWorks study, the XLMiner model has essentially equivalent Recall (the share of actual delays correctly identified, 0.64) and Accuracy (59 percent), even though it used less than 0.1 percent of the total data in the data set.
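To make the two metrics concrete, here is how Recall and Accuracy are computed from a classifier’s output. The labels and predictions below are made up for illustration (1 = delayed):

```python
# Hypothetical actual outcomes and classifier predictions (1 = delayed).
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 1, 0, 1, 0, 0, 0, 1]

# True positives: delayed flights the model flagged as delayed.
tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
# False negatives: delayed flights the model missed.
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
correct = sum(a == p for a, p in zip(actual, predicted))

recall = tp / (tp + fn)            # share of real delays the model caught
accuracy = correct / len(actual)   # share of all flights classified correctly
print(recall, accuracy)
```

The point of the comparison in the post is that both numbers hold up even when the model is trained on a tiny representative fraction of the data.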
The post goes further, using Feature Selection to ask, “Are all the variables in the airline data set really important? Which ones provide useful information about the possible delay of a flight?” A quick visualization shows that the scheduled departure and scheduled arrival times have the strongest correspondence with departure delays, according to the Welch test. This leads to a model that confirms frequent airline travelers’ anecdotal experience: time of day really matters. An 8-variable logistic regression model has Recall=0.64 and Accuracy=59 percent, while a simple 1-variable model using time of day alone has Recall=0.63 and Accuracy=59 percent.
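The Welch test behind that feature ranking compares the mean of a variable (here, scheduled hour) between delayed and on-time flights without assuming equal variances. A minimal sketch with invented data, not the post’s actual computation:

```python
import math
from statistics import mean, variance

# Hypothetical scheduled-departure hours for delayed vs. on-time flights.
delayed_hours = [17, 18, 19, 18, 20, 17, 19, 21]
on_time_hours = [8, 9, 7, 10, 9, 8, 11, 9]

def welch_t(x, y):
    """Welch's t-statistic: difference of means scaled by the
    combined standard error, without assuming equal variances."""
    return (mean(x) - mean(y)) / math.sqrt(
        variance(x) / len(x) + variance(y) / len(y))

t = welch_t(delayed_hours, on_time_hours)
print(t)  # a large |t| suggests the variable separates the two groups well
```

A variable whose distribution differs sharply between delayed and on-time flights (a large |t|) carries predictive information, which is why time of day rises to the top of the ranking.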
While there are obviously some drawbacks to pushing Excel or any other program beyond its design parameters, exploring those parameters with care can show how to use common systems in uncommon ways. Don’t think you can’t handle Big Data until you try.