Dick Muntz: The Data Miner
Back when Dick Muntz was an undergraduate in the early 1960s, the choice of computer classes listed in the fall schedule was limited to a single course dedicated to the programming language known as Fortran. Years later, when Muntz arrived at UCLA on July 1, 1969, he was one of a handful of instructors in the computer science department -- which had come into being that very day. Now chair of that department, Muntz has dedicated the past three decades to working out problems generated by what's known in the business as "data intensive computing."

As Muntz explains it, this can be divided into two categories: on the one hand are computations in which the amount of data is manageable and can be stored easily on disks, but the computations themselves can run for days and generate an enormous amount of data as output; on the other are situations that demand enormous amounts of data to begin with and involve difficulties ranging from how to move the information back and forth from main memory to secondary storage, to making sense of the data once it's retrieved, to extracting the knowledge buried amidst the bytes.

Prompted by the data explosion, three years ago Muntz and his colleagues founded the Data Mining Laboratory (DML). They were sponsored primarily by NASA, which needed help dealing with the extraordinary amount of Earth-science data the agency was collecting from satellites and remote sensors to create reliable climate models. The Data Mining Laboratory will help NASA try to make sense of data related to such problems as global warming and the ozone hole, but, at the same time, it will have broad applications throughout the world of science and computing.

"The cost of data storage is extremely low," explains Muntz. "It decreases at about 60 percent per year and has been going like that for the last decade. So we're able to keep more data online than ever before, and the sources of that data are more easily accessible and more economical.
"Take electronic commerce, for example. Even before the Internet, scanners at checkout counters could keep track of everything from the efficiency of the checkout clerk to what's selling and who's buying. The amount of pure information available online is extraordinary. As it becomes more economical to keep it, it brings up the question of making use of that information to everyone's advantage. In business, science, medicine and elsewhere, it can affect everything we do." Researchers in fields such as molecular biology and Earth science like to say that their fields have gone from "data poor to data rich environments."
"You can imagine 50 years ago that somebody trying to predict the weather had very little data to work with," says Muntz. Today, they have satellite, marine and meteorological reports from all over the world and from every kind of institution, from civil aeronautics to agricultural monitors to military weather stations. "It's become a horrendous problem," insists Muntz. "Many times, businesses -- even NASA -- simply get the data, put it on tapes and send it to some storage depot down in someone's basement. They don't have the capability to do anything with it."

There's no single magic bullet to solve the problem of "knowledge discovery" in this flood of data. Each discipline, each data set, each particular problem, comes with its own unique dynamic that requires its own unique solution. "This is an emerging discipline, and there's a lot of theory being developed," Muntz says. "But it's very confusing because there are so many disciplines involved: statistics, visualization, artificial intelligence, machine learning, pattern recognition, database sciences. If you go to a data-mining conference these days, people come from all these fields, and each has a somewhat different take on the problem."

Muntz and his colleagues at the Data Mining Laboratory are presently working with scientists at Pasadena's Jet Propulsion Laboratory and Scripps Institution of Oceanography in La Jolla to help solve data-intensive problems. "We're trying to put tools in their hands to help them describe the patterns they're looking for," explains Muntz. In addition, DML is looking at systems-level aspects of storing and retrieving data and standards, at distributed processing in order to make the best use of emerging technologies to access and process data faster, and at optimizing the algorithms used to make sense of the data.
Of all the fields, data mining is the most dependent on interdisciplinary cross-breeding; the open question is how, in, say, five years, the various techniques and technologies being used to make sense of large data sets will come together. "Right now, there is a huge variety of things people are trying, and there will be advances in all of them," notes Muntz. "But what we need to understand is how they all come together. You can see that happening already. The theories on data mining are becoming more comprehensive rather than fragmented and ad hoc. But we still have a long way to go."