Consider what happens when you place an order with an Internet retail site such as Amazon.com.
You search for a product and narrow your preferences until the site finds an exact match from a sprawling inventory. The site analyzes a shipping system in which millions of items are in motion at once. Then it calculates costs and estimates a delivery time. Money transfers from your bank to the vendor’s, and information about your order goes to a warehouse. There, computers find the item’s physical location and track it as it’s pulled, packaged and delivered to your address.
“Each step along the way represents a huge data analysis problem,” says Ted Ralphs, associate professor of industrial and systems engineering (ISE). “I know how the systems work beneath the surface, and the whole process still boggles my mind.”
The use of information on a mind-boggling scale is now commonplace in the emerging world of what’s become known as Big Data. Many people have become aware of the challenges, opportunities and, to some, threats of large-magnitude data analysis through debates about its role in national security and marketing.
But Big Analytics—using algorithms to sift through massive data sets in order to find patterns, reveal hidden correlations, make predictions and improve function as efficiently as possible—has the potential to dramatically enhance virtually every field of science, engineering, business and the humanities.
“Astronomical surveys of the heavens, weather measurements, gauges of loads on bridges, brain scans, genome mapping, social media—all of these activities generate massive amounts of data that are far beyond what any human could possibly take in,” says Jeff Heflin, associate professor of computer science and engineering (CSE).
“The question is, how can we make all this data useful?”

Harnessing the data explosion
In one sense, Big Data has been around a long time. Businesses, for example, have long grappled with the logistics of filling orders and making deliveries. But several converging trends have enabled the tools for analyzing increasingly large and complex data sets to become more powerful than ever before. Lehigh researchers are using these new capabilities to tackle the challenges of big analytics on a number of levels, from boosting the precision of searches across vast amounts of information to evaluating the quality of results, predicting online behavior and making the most efficient use of computing power for faster and more valuable calculations.
Among the trends: “Computing and data storage are getting cheaper, and networking is getting better, so it’s becoming cost-effective to capture and analyze lots of data in ways we couldn’t imagine 10 years ago,” says CSE associate professor Brian Davison.
At the same time, computing and hardware advances increasingly allow analytical tasks to be distributed among multiple processing cores, machines or networks. “Big, interesting problems can’t be calculated on a single machine,” Davison says. “You need clusters of machines, which may be of different types.” This approach, known as heterogeneous computing, divides large problems into smaller ones that different computers can solve concurrently, integrating their results.
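The divide-and-integrate structure Davison describes can be sketched in a few lines. This is a minimal illustration, assuming a simple sum stands in for the “big problem”; a real cluster would distribute the chunks across many machines of different types rather than one machine’s thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # Each worker solves one small piece of the larger problem.
    return sum(chunk)

def distributed_sum(data, workers=4):
    # Split the problem, solve the pieces concurrently, integrate the results.
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(partial_sum, chunks)
    return sum(partials)

print(distributed_sum(list(range(1000))))  # 499500, same answer as sum(range(1000))
```

The key property is that the split-solve-integrate shape stays the same whether the workers are threads, processes or whole machines.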
Data itself is becoming more heterogeneous as well. In the past, digitized data was largely structured—meaning it followed well-defined database formats such as columns and rows that tended to be shared across platforms. Today, there’s an explosion of unstructured data in widely different forms. “Data might take the form of sound, images, video or information in Excel or Word,” Heflin says—as well as website links, Facebook posts and online social connections.
Even within a given format such as text, data elements can vastly differ. Variables range from the lengths of files to the meanings attached to specific words. (Does “fracking” refer to hydraulic fracturing or an expletive used in the futuristic world of Battlestar Galactica?)
“Putting data together for analysis creates challenges in matching schemas,” or data structures, says Heflin. “Once you’ve made matches, you need to come up with ways to query efficiently across different databases so you’re not wasting time and resources reviewing every possible source, including those that couldn’t possibly have the answer.”
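A minimal sketch of the source-pruning Heflin describes, using hypothetical mini-databases: the query skips any source whose schema couldn’t possibly hold the answer, rather than scanning everything.

```python
# Hypothetical mini-sources: each has a schema (its field names) and rows.
sources = {
    "customers_db": {"schema": {"name", "zip"}, "rows": [{"name": "Ada", "zip": "18015"}]},
    "orders_db":    {"schema": {"order_id", "zip"}, "rows": [{"order_id": 7, "zip": "18015"}]},
    "weather_db":   {"schema": {"city", "temp"}, "rows": [{"city": "Bethlehem", "temp": 71}]},
}

def query(field, value):
    # Only consult sources whose schema contains the field being asked about.
    hits = []
    for name, src in sources.items():
        if field not in src["schema"]:
            continue  # weather_db is never scanned for a "zip" query
        hits += [row for row in src["rows"] if row.get(field) == value]
    return hits

print(query("zip", "18015"))  # finds matches in customers_db and orders_db only
```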
Such challenges have led to advances in machine learning, in which algorithms automate the discovery of patterns and trends so that computers can make predictions and draw accurate conclusions based on mathematical probabilities.
“One of the best examples of applying these analytical techniques on huge amounts of data is the development of the smart grid,” Ralphs says. Already the focus of a research cluster at Lehigh called Integrated Networks for Electricity, the smart grid would overlay the nation’s antiquated electrical system with a layer of information technology. By constantly monitoring itself and allowing real-time communication between utilities, users and other key elements of the electrical system, the smart grid would more easily integrate renewable but fluctuating sources of energy such as solar and wind. It would also offer consumers information about electrical demand and at-the-moment energy costs so they could buy electricity cheaper during off-peak hours—running the washing machine at night, for example, instead of during the hottest hours of a summer day.
“The system will need to crunch huge amounts of data—sometimes within minutes—to make predictions about a vast number of variables,” Ralphs says. These include what appliances people will be using at any given time, how much sun will shine and wind will blow to generate electricity, which mainstay power sources such as nuclear reactors will need to be online, and what actions will be needed if demand and supply start to become unbalanced.

Searching for gold
Crunching large data sets to make correlations and predictions is already fueling big business. “The classic example is stores looking at purchases that tend to go together,” Heflin says. Walmart, for example, mined its massive records of customer purchases and discovered that as hurricanes approach, people stock up not only on flashlights but Pop-Tarts. Placing the non-perishable pastry at the front of stores near other hurricane supplies boosted sales.
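The pattern-mining idea can be illustrated with a toy version of the market-basket analysis Heflin describes. The baskets below are invented for illustration; a retailer would run the same pair-counting over millions of transaction records.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transaction log; each set is one shopping basket.
baskets = [
    {"flashlight", "pop-tarts", "batteries"},
    {"flashlight", "pop-tarts"},
    {"bread", "milk"},
    {"flashlight", "pop-tarts", "water"},
]

# Count how often each pair of items appears in the same basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(1))  # [(('flashlight', 'pop-tarts'), 3)]
```

Pairs that co-occur far more often than chance would predict are the non-obvious correlations worth acting on.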
“The goal is to find patterns that aren’t obvious,” Heflin says. “When people open a wedding registry, it’s easy to guess they’re interested in honeymoon travel. More interesting and valuable would be to know that people from a specific ZIP code are most likely to take cruises. You can learn that from looking at massive amounts of data.”
Heflin specializes in panning informational gold from the vast Internet data stream through the Semantic Web, an effort to promote common data formats that allow machines to process information more intelligently. One challenge is evaluating the credibility of the information that a search turns up—“because it’s well known that the Internet is 100 percent factual,” Heflin jokes.
Among other things, Heflin is interested in providing tools that allow people to uniformly explore massive data pulled from different sources of linked data, finding both novel patterns and inconsistencies. In 2012, a system that he and a team of students built won the Billion Triple Challenge, a competition at the International Semantic Web Conference in Boston.
“A triple is a digitized fact with a subject, predicate and object,” Heflin says. “Our interface allowed us to browse about 30 billion triples in about 300,000 categories of data to reveal interesting phenomena, connections and errors.”
For example, the system discovered that a popular online source of structured data had confused British computer scientist Michael Guy with American rock musician Michael Guy. “They had the same name, but different birthdates,” Heflin says. “Somewhere along the line a simplistic automated system assumed they were the same person due to the common name, and then produced triples about the scientist that used the same computer identifier as the triples about the musician.” Such data sleuthing adds new tools that could help produce more fruitful search results.
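A toy version of that consistency check can be sketched with triples as plain tuples. The birthdates below are invented for illustration; the check simply flags any identifier carrying two different values for a property, such as birth date, that one person can have only one of.

```python
# Triples: (subject, predicate, object). Here the identifier "Michael_Guy" has
# wrongly been used for two different people (dates are made up for illustration).
triples = [
    ("Michael_Guy", "occupation", "computer scientist"),
    ("Michael_Guy", "birthDate", "1943-04-01"),
    ("Michael_Guy", "occupation", "rock musician"),
    ("Michael_Guy", "birthDate", "1959-10-15"),
]

def conflicts(triples, single_valued={"birthDate"}):
    # A single-valued predicate should have one value per subject;
    # a second, different value signals a likely identity mix-up.
    seen, problems = {}, []
    for s, p, o in triples:
        if p in single_valued:
            if (s, p) in seen and seen[(s, p)] != o:
                problems.append((s, p, seen[(s, p)], o))
            seen[(s, p)] = o
    return problems

print(conflicts(triples))
# [('Michael_Guy', 'birthDate', '1943-04-01', '1959-10-15')]
```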
Fruitful searching could also entail finding articles that not only are accurate but also offer a particular perspective—as revealed by algorithms that detect various forms of bias. Davison is developing such a model with funding from a Lehigh grant.
“People have different worldviews when they write, particularly news,” he says. “It might be a political or religious bias, a tendency to use big words, a fondness for sports analogies or a tone that’s negative or sensational. Our goal is to recognize what the biases are so we can customize content to what a viewer really wants.”
The algorithm would be able to score any article on a variety of perspectives and recommend similar articles—or alternatives. When researching a topic, you could use the system to seek a balanced view. That would offer an escape from what’s been called “the filter bubble”—a phenomenon in which your prior browsing history automatically personalizes and narrows the information that a search reveals to you without your realizing it.
“Scoring biases could reinforce the filter bubble,” Davison says. “But we give people a choice.”
Parsing information with big analytics can help refine social media as well. In a paper titled “Predicting Popular Messages in Twitter,” Davison and graduate students Liangjie Hong and Ovidiu Dan analyzed more than 10 million tweets to determine why posts are retweeted.
“We found that Joe Random can do everything right and still not be retweeted,” Davison says. “What determines retweeting are factors like whether you have a track record of posting worthwhile content, who follows you and whether the topic is already popular.”
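Those findings suggest a simple scoring sketch. The features and weights below are invented for illustration and are not the paper’s actual model; they merely show how track record, audience and topic popularity can outweigh the merits of any single message.

```python
# Toy score over the kinds of signals the study found mattered.
def retweet_score(track_record, followers, topic_popularity):
    # Each feature is scaled 0-1; the weights are made up for illustration.
    weights = {"track_record": 0.5, "followers": 0.3, "topic": 0.2}
    return (weights["track_record"] * track_record
            + weights["followers"] * followers
            + weights["topic"] * topic_popularity)

# "Joe Random": a well-crafted post, but no track record, few followers, niche topic.
joe = retweet_score(0.1, 0.05, 0.2)
# An established account posting on an already-popular topic.
veteran = retweet_score(0.9, 0.8, 0.9)
print(joe < veteran)  # True
```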
The paper won the Best Paper Award at the 2011 World Wide Web Conference in Hyderabad, India. Hong is now a researcher at Yahoo! Labs.
In a follow-up project, Davison, Hong and graduate student Aziz S. Doumith tackled a more complicated question: Who will retweet what? “This model predicts individual interests and behavior based on what you’ve done in the past,” Davison says. “We find this more interesting as the ultimate goal is to understand your interests and provide more of the information you really want.”
Their paper—an award finalist at the 2013 International Web Search and Data Mining Conference in Rome—suggests that such models could help sort through an overwhelming gush of social media information and pluck out messages you’ll actually care about. Predicting retweets is also extremely valuable to companies.
“If someone has a million followers, it’s obvious that people are paying attention,” Davison says. “But if companies know that a person with a thousand followers today will have a half million next year, they can establish a relationship before that person becomes big.”

The optimizing balance
Answering even relatively limited questions can consume enormous computing power if the scope of analysis is broad enough. “We know how to write a program that will allow a computer to choose the perfect chess move in any situation,” says Heflin, “but it would take many lifetimes of the universe for it to decide on that move.” However, the more limited goal of creating a program that can beat any human is more attainable.
“A large part of computer science is figuring out when it’s time to back off and try for an algorithm that approximates what you’re looking for, then figuring out when it’s good enough to give you the quality you need,” Heflin says.
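A classic illustration of that trade-off is the knapsack problem: an exhaustive search is guaranteed optimal but its cost doubles with every item, while a greedy approximation answers almost instantly and is often good enough. The items below are a small invented instance, not from the article.

```python
from itertools import combinations

# Items: (value, weight). Which subset fits the weight limit with the most value?
items = [(60, 10), (100, 20), (120, 30)]
limit = 50

def exact_best(items, limit):
    # Exhaustive search: guaranteed optimal, but the number of subsets
    # doubles with every item added.
    best = 0
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            if sum(w for _, w in combo) <= limit:
                best = max(best, sum(v for v, _ in combo))
    return best

def greedy_approx(items, limit):
    # Approximation: grab the best value-per-weight items first.
    # Fast and often "good enough," but not guaranteed optimal.
    total_v = total_w = 0
    for v, w in sorted(items, key=lambda it: it[0] / it[1], reverse=True):
        if total_w + w <= limit:
            total_v += v
            total_w += w
    return total_v

print(exact_best(items, limit), greedy_approx(items, limit))  # 220 160
```

Here the greedy answer (160) falls short of the optimum (220); deciding when that gap is acceptable is exactly the judgment Heflin describes.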
Such optimization is a driving force behind analytics, and its methodologies often draw on game theory.
“When you have multiple decision-makers—that is, players—each solving their own problems while also affecting each other, you’re in the game theory world,” Ralphs says.
One game theory model assumes that groups of players are working against each other. “This is the kind of analysis the Defense Department uses in counterterrorism,” says Ralphs. “We’ve had a small grant from the Army to work on these kinds of problems.”
A second model assumes that players are cooperating. Delivery operations—like UPS bringing your order from Amazon.com—are a prime example.
“Delivery operations represent a hierarchy of different subsystems,” says Ralphs, “ranging from a single truck driving around to fleets of trucks, warehouses, regional warehousing operations and on up.” Improvements in optimization, machine learning and data storage in recent years have enabled constant feedback among subsystems so that efficiencies are shared throughout the whole system almost as soon as they occur.
Ralphs develops software for COIN-OR (short for Computational Infrastructure for Operations Research), a nonprofit, open-source research foundation, and chairs its Technical Leadership Council.
“We develop tools that other people then use for specific applications,” he says. “I’m passionate about these ground-level software methodologies because they have broad applications.”
In medicine, for example, integrated analytics allow a treatment known as intensity-modulated radiation therapy to deliver optimal amounts of radiation to cancer tumors while avoiding healthy tissue. Making the system work entails using data from previous outcomes to predict effects, analyzing imagery to locate boundaries between healthy and cancerous tissue, and optimizing the function of the machine in real time as it delivers bursts of radiation in precisely calibrated doses to different areas of a patient’s tumor.
“Big analytics is an inherently multidisciplinary field because big problems need a lot of different perspectives to be properly understood,” Heflin says. The amount of data, how fast it arrives, the form it takes and how accurate it is—sometimes called the “four Vs” of volume, velocity, variety and veracity—“are each research areas in themselves,” he adds.
This invites collaboration among researchers seeking the big picture. Heflin and Davison, for example, are exploring ways to combine their respective interests in structured and unstructured data. “If you only focus on one or the other, you’re missing half the world,” Davison says.
But the tools to mine and analyze data can also be applied to disciplines outside of science and engineering. Heflin has worked with Edward Whitley, associate professor of English in Lehigh’s College of Arts and Sciences and author of The American Literature Scholar in the Digital Age, to develop tools for analyzing a trove of digital data related to the poet Walt Whitman and other “Bohemians” who congregated in New York City’s Greenwich Village in the mid-19th century.
“It’s a work in progress, but we hope to look at the relationships and social structures that contributed to literary processes in this influential community,” Heflin says.
Collaborations like this promise to become more common, says Ralphs, as the future of Big Analytics unfolds.
“Some people want to develop analytic techniques,” he says, “while others want to use them. These people need to find each other.
“If it hasn’t already, big analytics or big data will touch almost every aspect of life.”