Smoclust: synthetic minority oversampling based on stream clustering for evolving data streams
Authors
Chun Wai Chiu
Leandro L. Minku
Abstract
Many real-world data stream applications suffer not only from concept drift but also from class imbalance. Yet very few existing studies have investigated this joint challenge. Data difficulty factors, which have been shown to be key challenges in class imbalanced data streams, are not taken into account by existing approaches when learning from class imbalanced data streams. In this work, we propose a drift adaptable oversampling strategy that synthesises minority class examples based on stream clustering. The motivation is that stream clustering methods continuously update themselves to reflect the characteristics of the current underlying concept, including data difficulty factors. This property can potentially be used to compress past information without explicitly caching data in memory. Based on the compressed information, synthetic examples can be created within the region that recently generated new minority class examples. Experiments with artificial and real-world data streams show that the proposed approach can handle concept drift involving different minority class decompositions better than existing approaches, especially when the data stream is severely class imbalanced and presents high proportions of safe and borderline minority class examples.
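The idea of compressing recent minority class examples into cluster summaries and sampling synthetic examples from them can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class names `MicroCluster` and `MinoritySummary`, the exponential `decay`, the fixed `radius` rule for opening a new cluster, the `min_weight` pruning threshold and the per-dimension Gaussian sampling are all assumptions chosen for illustration only.

```python
import math
import random


class MicroCluster:
    """Decayed summary (weight, linear sum, squared sum) of minority examples."""

    def __init__(self, dim):
        self.weight = 0.0
        self.linear_sum = [0.0] * dim
        self.squared_sum = [0.0] * dim

    def fade(self, decay):
        # Shrink old statistics so the summary follows the current concept.
        self.weight *= decay
        self.linear_sum = [s * decay for s in self.linear_sum]
        self.squared_sum = [s * decay for s in self.squared_sum]

    def insert(self, x):
        self.weight += 1.0
        self.linear_sum = [s + v for s, v in zip(self.linear_sum, x)]
        self.squared_sum = [s + v * v for s, v in zip(self.squared_sum, x)]

    def centre(self):
        return [s / self.weight for s in self.linear_sum]

    def std(self):
        # Clamp at zero to absorb floating-point error in the variance.
        return [max(q / self.weight - c * c, 0.0) ** 0.5
                for q, c in zip(self.squared_sum, self.centre())]


class MinoritySummary:
    """Hypothetical container of micro-clusters for the minority class."""

    def __init__(self, dim, radius=1.0, decay=0.99, min_weight=1e-3):
        self.dim, self.radius = dim, radius
        self.decay, self.min_weight = decay, min_weight
        self.clusters = []

    def learn(self, x):
        # Fade every cluster, then drop those that no longer receive examples.
        for c in self.clusters:
            c.fade(self.decay)
        self.clusters = [c for c in self.clusters if c.weight >= self.min_weight]
        # Absorb x into the nearest cluster, or open a new one if it is too far.
        nearest = min(self.clusters,
                      key=lambda c: math.dist(c.centre(), x), default=None)
        if nearest is None or math.dist(nearest.centre(), x) > self.radius:
            nearest = MicroCluster(self.dim)
            self.clusters.append(nearest)
        nearest.insert(x)

    def synthesise(self, rng):
        # Pick a cluster in proportion to its decayed weight, so regions that
        # recently produced minority examples are favoured, then sample near it.
        weights = [c.weight for c in self.clusters]
        chosen = rng.choices(self.clusters, weights=weights, k=1)[0]
        return [rng.gauss(m, s) for m, s in zip(chosen.centre(), chosen.std())]


rng = random.Random(0)
summary = MinoritySummary(dim=2)
for _ in range(100):
    summary.learn([rng.gauss(0.0, 0.3), rng.gauss(0.0, 0.3)])
synthetic = summary.synthesise(rng)  # one synthetic minority class example
```

In this sketch no past example is stored: each cluster keeps only three running statistics, and clusters that stop receiving minority class examples fade out, so synthetic examples follow the region of the current concept.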
Citation
Chiu, C. W., & Minku, L. L. (2024). Smoclust: synthetic minority oversampling based on stream clustering for evolving data streams. Machine Learning, 113(7), 4671-4721. https://doi.org/10.1007/s10994-023-06420-y
Field | Value |
---|---|
Journal Article Type | Article |
Acceptance Date | Oct 3, 2023 |
Online Publication Date | Dec 18, 2023 |
Publication Date | Jul 1, 2024 |
Deposit Date | Jun 13, 2024 |
Journal | Machine Learning |
Print ISSN | 0885-6125 |
Publisher | Springer Verlag |
Peer Reviewed | Peer Reviewed |
Volume | 113 |
Issue | 7 |
Pages | 4671-4721 |
DOI | https://doi.org/10.1007/s10994-023-06420-y |
Keywords | Concept drift, Data difficulty factors, Class imbalance, Data streams, Synthetic data, Stream clustering |
Public URL | https://keele-repository.worktribe.com/output/847521 |
Publisher URL | https://link.springer.com/article/10.1007/s10994-023-06420-y |