EnviGP
Improving Genetic Programming for
the Environment and Other Applications
Sara Silva
Project Summary
Genetic Programming (GP) is the youngest paradigm inside the artificial intelligence research area called evolutionary computation, and consists on the automated learning of computer programs. Theoretically, GP can solve any problem whose candidate solutions can be measured and compared, making it a widely applicable technique. GP often yields results that are not merely academically interesting, but competitive with the work developed by humans.
Remote Sensing (RS) is the science of obtaining information about objects or phenomena without being in direct contact with them. It is based on the detection and measurement of electromagnetic energy such as light, heat, and radio waves. The success of RS applications depends on the ability to derive useful information from the raw data, and machine learning has been widely used in the attempt to automate some of the related tasks.
Despite being a powerful and versatile technique, GP has not yet been intensively used in RS applications. Although some work has been published, this area of research remains mostly unexplored for two reasons: practitioners of the field do not know about GP and its capabilities; computer scientists do not have the opportunity to access the RS data, or the background to fully understand the problems involved. Prior collaborations between the team members have already paved the way for bridging the gap between the two worlds.
Because it is a young and complex paradigm, the practical use of GP still poses a few challenges. The first major problem faced by most researchers and users of GP is the emergent phenomena of bloat. Because GP uses a variable-length representation for its solutions, the programs are allowed to grow during the evolutionary process. Although code growth is a natural result of the search for better solutions, there is usually an excessive proliferation of redundant code that increases the size of programs without correspondingly improving their fitness. This is a serious problem that can actually stagnate the evolutionary process, and it has been the subject of numerous studies since the beginnings of GP. In recent years there has been a new surge of interest in the bloat problem, and some recent publications concern a new theory explaining its emergence and describe some of the most successful bloat control techniques developed so far. The team members have actively contributed to both, and are now eager to go beyond the benchmark problems and validate the new findings with real-world data.
Typical of most supervised learning algorithms, another problem faced by GP is overfitting. During the optimization process the algorithm tries to improve the proposed solution as much as possible. If given enough degrees of freedom it will become specialized on the data used for learning, while losing the ability to generalize to different data sets. Few studies have addressed the overfitting problem in the specific context of GP, with some of the most recent ideas being proposed by a team member. There are a few contradictory results regarding the connection between code parsimony and overfitting, but none that ever explored the direct link between bloat and overfitting. It has been generally believed that the two phenomena are strongly related, but a very recent study by the team has provided surprising evidence that bloat can occur without overfitting and overfitting can occur without bloat, with the most unlikely scenario being the occurrence of both.
Another handicap of most GP systems is the inadequacy to deal with multiclass classification problems, a kind so common in RS. This problem derives not from limitations of the algorithm, but from the particular representation used for the solutions. Adapting GP to better deal with multiclass problems has never been an active field of research. The team has addressed this problem on the context of a RS application.
In this project we will develop and test new approaches to the bloat and overfitting problems in GP, while studying the relationship between the two, and adapt GP for improved efficiency in multiclass classification problems. Going beyond the benchmark tests, we will be using RS and other types of data in the newly emerged land use / land cover change science domain, solving problems that have a major impact at the environmental level. Besides actually providing new models and solutions to the field, the achievement of these goals will ultimately produce a powerful general-purpose tool that can be used by practitioners of many diverse areas of research besides land science and beyond RS data.
This project brings together a team of specialists from both land and computer sciences whose knowledge and experience covers all the topics addressed, spanning through a range of biologically inspired algorithms and real-world applications much broader than the simple pairing of GP and RS.
Participants, Funding, etc
PTDC/EIA-CCO/103363/2008
Sara Silva