Advanced algorithms aim to optimally extract information in large data sets


August Schiess, CSL

When solving complex scientific problems, researchers sometimes encounter what is called the curse of dimensionality: having so much data that they’re unable to efficiently analyze it.

Acquiring large data sets can also be expensive and time consuming, so it's important to only gather what is necessary. To help researchers make the most of their data, Assistant Professor Yihong Wu  and Professor Tsachy Weissman of Stanford University were awarded $500,000 from the National Science Foundation to identify optimal algorithms that efficiently extract the relevant information from large data sets.

Yihong Wu
Yihong Wu
With advanced technology and data acquisition methods, data sets can be larger than they ever have been. It may seem that a surplus of data would be advantageous to researchers, but analyzing large data sets, called high-dimensional data, is challenging.

Wu's algorithms will cut the number of data needed significantly, especially when measuring entropy, which is the amount of relevant information contained in the data, measured in bits.

"If the goal is to learn a high-dimensional feature consisting of a million coordinates, the conventional wisdom would tell you that you need roughly one million samples to analyze. But suppose we want to estimate the entropy—which is only one number—then it is possible to get away with maybe only 200,000 samples,” said Wu, who is also a faculty member in the Coordinated Science Laboratory. “But the challenge in our work is that the entropy depends on these one million coordinates in a very complicated way, making an optimal algorithm difficult to create."

Developing a smarter algorithm will not only help scientists extract the most relevant data, but if they have a specific goal, the algorithm can initially collect less data to fulfill that goal, saving both time and resources.

The algorithm developed by the Illinois team also comes with a theoretical guarantee of optimality—it cannot be significantly improved by other schemes. One of the inspirations for solving this problem originates from what is called the “species problem” of “estimating the unseen.”

This concept first appeared in a paper in 1943 by Ronald Fisher and tells the story of a group of biologists who went to Malay Archipelago to investigate the butterfly population. They recorded how many times they saw several species of butterflies, with the intention of identifying how many distinct species there were. If they observed 21 different species, for example, they would know there are at least 21 different types of butterflies; but what if there were a lot more they didn’t see? If they stayed longer, they could have seen more, but they could only afford to stay for so long.

The question that remains: How can they make the most efficient use of their limited data to extrapolate how many species there really are?

"This problem of estimating the number of unseens turns out to be very difficult. A lot of deep mathematical results are needed to tackle it,” Wu said. “But the algorithm we developed for entropy can be easily tweaked to yield an optimal estimator for the unseens. As you can imagine, this problem has reincarnations in many disciplines such as ecology, computer science, and linguistics. A breakthrough in efficient algorithms will have far-reaching consequences.”

Additionally, Wu is hoping to use these algorithms in machine learning, which is the science of getting computers to act without being explicitly programmed.

“Many machine learning algorithms rely on the accurate estimation of information," Wu said. "Our improved estimators can lead to better performance and serve as building blocks for machine learning tasks on a high-dimensional model."