Robust and Hierarchy-Aware Classification
Advisor
Fieguth, Paul
Publisher
University of Waterloo
Abstract
The BIOSCAN project, led by the International Barcode of Life (iBOL) Consortium, is an international, multi-year, and multidisciplinary effort seeking to catalogue all multicellular life on Earth by 2045 to enable the global-scale study of changes in biodiversity, species interactions, and species dynamics. Access to this information has the potential to inform strategies to mitigate the damaging ecological effects of climate change. In the near term, the goal is to catalogue all insects. Each sample is imaged, genetically barcoded, and taxonomically classified by domain experts, a time- and resource-intensive process that is becoming increasingly impractical as collection rates surpass five million samples annually. Relieving this bottleneck is a central motivation for the research in this thesis.
This thesis presents several contributions motivated by the challenges of the BIOSCAN project. Over five million insect samples were organized into a machine-learning-ready dataset, and a deep neural network classifier was developed to establish a baseline for image-to-taxonomy classification performance. To mitigate the harmful impacts of mislabelled samples in training data, a study of neural network architecture robustness was conducted alongside the development of two novel loss functions: Blurry loss and Piecewise-zero loss. Blurry loss de-weights and reverses the gradient of samples likely to be mislabelled, while Piecewise-zero loss disregards these samples. These improvements strengthen model robustness and enhance label error detection, enabling the referral of suspicious samples for expert review and correction. Additional work investigates the hierarchical structure of biological data and its integration into classification models, specifically through hyperbolic neural networks, and measures the benefits of doing so relative to conventional architectures. Finally, this thesis explores aligning image, genetic, and taxonomic representations in a hierarchy-aware manner to improve retrieval across modalities.
The contributions of this thesis advance the application of machine learning to facilitate the ongoing global-scale cataloguing of insect life. As challenges such as label errors, hierarchical structures in data, and incomplete annotations are present across many domains, the contributions are valuable to both the machine learning community and the global network of BIOSCAN collaborators.