The development of machine learning (ML) models for deployment into the clinical setting requires a substantial amount of high-quality labelled data. Generating high-quality labelled datasets for complicated physiologic waveform data, such as electrocardiogram (ECG) data, requires the time of clinical domain experts. Because that time is usually scarce, labelling often becomes the rate-limiting step of ML projects. This is the challenge we’re addressing.
Our lab members, Danny Eytan, Dmitrii Shubin, and Sebastian Goodfellow, and collaborators, Minfan Zhang and Daniel Ehrmann, are developing an annotation web app for physiologic waveform data, utilizing an automated human-in-the-loop label proposal system. We are thrilled to announce that MASc candidate Dmitrii Shubin recently presented a research poster at the Machine Learning in Healthcare (MLHC) conference, describing our time-efficient labelling framework.
For our method, we first pre-train a deep neural network (DNN) in an unsupervised manner (self-supervised contrastive learning in the latest version of the algorithm), using our large-scale waveform database AtriumDB. For context, AtriumDB now contains over a million patient hours of data for over 7800 children, making it the largest database of its kind in the world. Next, based on similarity criteria applied to feature embeddings generated by the DNN, we ask the expert to annotate the most diverse segments of the waveform record. For a set of test patients, after the expert labelled the initial 0.3% of their data, our system was able to make accurate label proposals (macro F1 score above 80%), minimizing the data that needed to be manually labelled by the annotation expert. The overall impact of this method is that we can make the best use of our expert annotators' time and generate representative training datasets sooner.
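To make the selection and proposal steps more concrete, here is a minimal Python sketch. It is not our actual implementation: it assumes the embeddings are plain numeric vectors, uses greedy farthest-point selection as a stand-in for the similarity criteria, and uses a nearest-labelled-neighbour rule as a stand-in for the label proposal system. The function names, the toy embeddings, and the labels "normal" and "artifact" are all hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_diverse(embeddings, budget):
    """Greedy farthest-point selection (an assumed diversity criterion).

    Returns the indices of up to `budget` segments, each chosen to be as far
    as possible from the segments already selected.
    """
    if not embeddings or budget <= 0:
        return []
    selected = [0]  # arbitrary starting segment
    # Distance from every segment to its closest selected segment.
    min_dist = [euclidean(e, embeddings[0]) for e in embeddings]
    while len(selected) < min(budget, len(embeddings)):
        nxt = max(range(len(embeddings)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, e in enumerate(embeddings):
            min_dist[i] = min(min_dist[i], euclidean(e, embeddings[nxt]))
    return selected


def propose_labels(embeddings, labelled):
    """Propose a label for every segment from its nearest labelled segment.

    `labelled` maps segment index -> label supplied by the expert.
    """
    proposals = []
    for e in embeddings:
        nearest = min(labelled, key=lambda i: euclidean(e, embeddings[i]))
        proposals.append(labelled[nearest])
    return proposals


# Toy 2-D embeddings forming two well-separated groups (hypothetical data).
toy_embeddings = [
    (0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
    (10.0, 10.0), (10.1, 9.9), (9.8, 10.2),
]

# Step 1: pick the segments to show to the expert.
to_annotate = select_diverse(toy_embeddings, budget=2)

# Step 2: the expert labels only those segments (labels invented here).
expert_labels = dict(zip(to_annotate, ["normal", "artifact"]))

# Step 3: propose labels for the whole record from the few expert labels.
print(to_annotate)
print(propose_labels(toy_embeddings, expert_labels))
```

In this toy case, two expert labels are enough to cover both groups of segments. On real waveform data the embeddings would come from the pre-trained DNN, and the quality of the proposals depends on how well those embeddings separate the classes.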
YouTube video presentation: https://youtu.be/yNgqf4Vgh9E