Decision Branches

Novel Machine Learning Model For Efficient Search

The vast amounts of data collected in various domains pose great challenges to modern data exploration and analysis. To find “interesting” objects in large databases, users typically define a query using positive and negative example objects and train a classification model to identify the objects of interest in the entire data catalog. However, this approach requires a scan of all the data to apply the classification model to each instance in the data catalog, making it prohibitively expensive to employ in large-scale databases serving many users and queries interactively. In this work, we propose a novel framework for such search-by-classification scenarios that allows users to interactively search for target objects by specifying queries through a small set of positive and negative examples. Unlike previous approaches, our framework can rapidly answer such queries at low cost without scanning the entire database. Our framework is based on an index-aware construction scheme for decision trees and random forests that transforms the inference phase of these classification models into a set of range queries, which in turn can be efficiently executed by leveraging multidimensional indexing structures. Our experiments show that queries over large data catalogs with hundreds of millions of objects can be processed in a few seconds using a single server, compared to hours needed by classical scanning-based approaches.

Our search-by-classification framework accelerates data retrieval by utilizing an index-aware classifier and pre-built indexes. The classifier leverages efficient range queries on the indexes, eliminating the need for time-consuming full data scans.
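To make the range-query idea concrete, here is a minimal sketch (not the paper's implementation) of how hyperbox inference can be answered by a multidimensional index instead of a full scan. It assumes the third-party rtree package (libspatialindex bindings), and the catalog size and box coordinates are purely illustrative.

```python
# Minimal sketch: serving hyperbox inference with a spatial index
# instead of scanning every object in the catalog.
# Assumes the third-party `rtree` package is installed.
import numpy as np
from rtree import index

rng = np.random.default_rng(0)
catalog = rng.random((10_000, 2))            # 2-D feature vectors of the data catalog

# Build the multidimensional index once, offline.
idx = index.Index()
for i, (x, y) in enumerate(catalog):
    idx.insert(i, (x, y, x, y))              # points stored as degenerate boxes

# A trained model is a set of hyperboxes; retrieving the "positive" objects
# then reduces to one range query per box -- no scan over `catalog` is needed.
hyperboxes = [(0.10, 0.20, 0.15, 0.30),      # (min_x, min_y, max_x, max_y), illustrative values
              (0.70, 0.65, 0.80, 0.75)]

positives = set()
for box in hyperboxes:
    positives.update(idx.intersection(box))  # ids of catalog objects inside the box

print(f"{len(positives)} objects returned without scanning the catalog")
```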

The construction algorithm of our machine learning model consists of two stages. The first stage is the initialization phase, in which minimal hyperboxes are constructed, each grown from a randomly chosen positive training sample. The second stage is the expansion phase, in which the box boundaries are expanded to cover more positive training samples by minimizing an impurity measure, similar to the splitting criterion of a decision tree. The final model is a set of hyperboxes that can be used to classify unseen data points (a simplified sketch follows the phase summary below).

Initialization phase. The hyperboxes are grown from a random positive training sample.
Expansion phase. The boundaries of the hyperboxes are expanded to cover more positive training samples.
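
The following is a simplified, self-contained sketch of these two phases, not the paper's exact algorithm: the expansion step here greedily absorbs additional positive samples as long as an illustrative Gini impurity criterion does not worsen, and all function and variable names are hypothetical.

```python
# Simplified sketch of the two-stage hyperbox construction described above.
import numpy as np

def gini(labels):
    """Gini impurity of the labels falling inside a box."""
    if len(labels) == 0:
        return 0.0
    p = np.mean(labels)
    return 2.0 * p * (1.0 - p)

def in_box(X, lower, upper):
    """Boolean mask of the samples lying inside the axis-aligned box."""
    return np.all((X >= lower) & (X <= upper), axis=1)

def build_hyperbox(X, y, rng):
    """Initialization + expansion for a single hyperbox."""
    positives = np.where(y == 1)[0]
    seed = rng.choice(positives)

    # Initialization phase: a minimal (degenerate) box around one positive sample.
    lower = X[seed].copy()
    upper = X[seed].copy()

    # Expansion phase: greedily pull in more positives while impurity does not grow.
    for j in positives:
        new_lower = np.minimum(lower, X[j])
        new_upper = np.maximum(upper, X[j])
        old_imp = gini(y[in_box(X, lower, upper)])
        new_imp = gini(y[in_box(X, new_lower, new_upper)])
        if new_imp <= old_imp:
            lower, upper = new_lower, new_upper
    return lower, upper

rng = np.random.default_rng(0)
X = rng.random((500, 2))
y = (np.linalg.norm(X - 0.5, axis=1) < 0.2).astype(int)   # toy positive region

lower, upper = build_hyperbox(X, y, rng)
covered = in_box(X, lower, upper)
print("box:", lower, upper, "| positives covered:", int(y[covered].sum()))
```

In the full model, several such hyperboxes are built, and the resulting boxes are exactly the range queries that the pre-built multidimensional index can answer at query time.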

The code for this project can be found in the following GitHub repository.

References

2023

  1. Fast Search-by-Classification for Large-Scale Databases Using Index-Aware Decision Trees and Random Forests
    In Proceedings of the VLDB Endowment (CORE Rank A*), 2023