|
Abstract:
We investigate the following data mining problem from
Computational Chemistry: From a large data set of compounds, find
those that bind to a target molecule in as few iterations of
biological testing as possible. In each iteration a comparatively
small batch of compounds is screened for binding to the target.
We apply active learning techniques for selecting the successive
batches.
One selection strategy picks unlabeled examples closest to the
maximum margin hyperplane. Another produces many weight vectors
by running perceptrons over multiple permutations of the data.
Each weight vector votes with its ± prediction and we pick
the unlabeled examples for which the prediction is most evenly
split between + and -. For a third selection strategy, note that
each unlabeled example bisects the version space of consistent
weight vectors. We estimate the volume on both sides of the split
by bouncing a billiard through the version space and select
unlabeled examples that cause the most even split of the version
space.
We demonstrate on two data sets provided by DuPont
Pharmaceuticals that all three selection strategies perform
comparably well and are much better than selecting random batches
for testing.
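
As an illustration of the second selection strategy described above (perceptrons run over multiple permutations of the labeled data, each voting with its ± prediction), the following is a minimal sketch, not the authors' implementation. The function names `train_perceptron` and `select_batch`, the parameters `epochs`, `n_perms`, and `seed`, the omission of a bias term, and the treatment of a zero score as a negative vote are all assumptions made for this example.

```python
import random


def train_perceptron(X, y, order, epochs=10):
    """Run a plain perceptron (no bias term; an assumption of this sketch)
    over the labeled examples, visiting them in the given order."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        mistakes = 0
        for i in order:
            score = sum(wj * xj for wj, xj in zip(w, X[i]))
            if y[i] * score <= 0:  # misclassified or on the boundary
                w = [wj + y[i] * xj for wj, xj in zip(w, X[i])]
                mistakes += 1
        if mistakes == 0:  # consistent with all labeled data, stop early
            break
    return w


def select_batch(X_lab, y_lab, X_unlab, batch_size, n_perms=50, seed=0):
    """Return indices of the unlabeled examples whose committee vote
    is most evenly split between + and -."""
    rng = random.Random(seed)
    committee = []
    for _ in range(n_perms):
        order = list(range(len(X_lab)))
        rng.shuffle(order)  # one permutation of the labeled data
        committee.append(train_perceptron(X_lab, y_lab, order))

    def vote_imbalance(x):
        # Each weight vector votes +1 or -1; a zero score counts as -1 here.
        votes = 0
        for w in committee:
            score = sum(wj * xj for wj, xj in zip(w, x))
            votes += 1 if score > 0 else -1
        return abs(votes)  # 0 means a perfectly even split

    ranked = sorted(range(len(X_unlab)), key=lambda i: vote_imbalance(X_unlab[i]))
    return ranked[:batch_size]
```

In an active learning loop, the returned indices would be the next batch sent for testing; the newly labeled examples would then be added to the labeled set before the next call.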
|