An Approximate Anytime Hierarchical Clustering Earth Mover’s Distance (AAHC-EMD)
Publisher
The British University in Dubai (BUiD)
Abstract
Flow Cytometry (FC) is a crucial tool for analysing fluid samples such as blood, where added biomarkers help highlight abnormalities or diseases. It produces numerical datasets that are analysed to assess similarities or dissimilarities between samples. However, applying machine learning to hierarchical FC datasets is challenging because of their two-level structure: the top level contains blood cells, while the bottom level holds the attributes of each cell. In this study, each sample was reduced from 30,000 cells to 2,500 and from 8 attributes to 4 to keep running time manageable. Even after this reduction, the analysis still had to handle 10,000 attribute values at the lower level (2,500 cells × 4 attributes), which is computationally impractical. Dimensionality reduction could help, but it risks losing critical information. This thesis instead proposes treating each sample as a cluster configuration and comparing samples with Earth Mover’s Distance (EMD), which is robust to instrumental drift but computationally expensive. To reduce that cost, the solution employs an Approximate Anytime Hierarchical Clustering-based EMD lower bound (AAHC-EMD) algorithm, which calculates similarity by measuring the distance between cluster centroids instead of between individual cells. The method ranks the testing samples against each query at each stage of the hierarchy, reducing computational time by 48% to 89%, and achieves 100% ranking accuracy when given more time. The same approach also identifies the Best-Fit testing sample for each query from a list of 20 testing samples, with 100% accuracy and a 72.5% time saving compared to traditional EMD. This approach improves diagnostic precision and computational efficiency in analysing complex hierarchical datasets such as FC.
Keywords: Anytime Algorithm, Earth Mover’s Distance, Flow Cytometry, Hierarchical Clustering, k-means, Lower bound.
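The idea of bounding EMD from below with centroid distances can be illustrated with a small sketch. It is not the AAHC-EMD algorithm itself, whose details are not given in the abstract. It assumes a single centroid per sample, equally weighted cells, and Euclidean ground distance, in which case the distance between the two centroids is known never to exceed the exact EMD. All function names and data here are hypothetical, and the brute-force exact EMD is only usable on toy inputs.

```python
from itertools import permutations
from math import dist


def centroid(points):
    """Mean of a list of equally weighted points (assumed, not from the thesis)."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def centroid_lower_bound(a, b):
    """Distance between sample centroids: a lower bound on EMD for equal-mass samples."""
    return dist(centroid(a), centroid(b))


def exact_emd_bruteforce(a, b):
    """Exact EMD for two equal-size, uniformly weighted point sets.

    With uniform weights an optimal flow is a one-to-one assignment,
    so every permutation is tried. Only viable for a handful of points.
    """
    n = len(a)
    best = min(
        sum(dist(a[i], b[j]) for i, j in enumerate(perm))
        for perm in permutations(range(n))
    )
    return best / n


def best_fit(query, candidates):
    """Return (index, emd) of the closest candidate, pruning with the lower bound."""
    bounds = [centroid_lower_bound(query, c) for c in candidates]
    order = sorted(range(len(candidates)), key=lambda i: bounds[i])
    best_i, best_d = None, float("inf")
    for i in order:
        if bounds[i] >= best_d:
            break  # all remaining candidates have an even larger bound
        d = exact_emd_bruteforce(query, candidates[i])
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d


# Toy data: three "cells" with two attributes each (hypothetical values).
query = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
far = [(2.0, 2.0), (3.0, 2.0), (2.0, 3.0)]
near = [(0.1, 0.0), (1.0, 0.1), (0.0, 1.1)]
index, distance = best_fit(query, [far, near])
```

In this sketch the cheap bound is computed for every candidate, while the expensive exact EMD is computed only for candidates whose bound is still below the best distance found so far. Refining the bound with more centroids per sample, as a clustering hierarchy would allow, is the kind of time-for-accuracy trade an anytime method makes.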