An Experimental Study of the Response of Machine Learning Algorithms to Data Labeling Errors
Author(s):
Viacheslav Anatolievich Diuk
Doctor of Technical Sciences,
Principal Researcher at the Institute for Transport
Problems of the Russian Academy of Sciences (IPT RAS)
13, 12th Line V.O., St. Petersburg, 199178, Russia
v_duke@mail.ru
Abstract:
Data labeling is widely regarded as the most critical element
in building AI systems based on machine learning methods.
At the same time, labeling is often inaccurate, particularly
when it is crowdsourced. This article complements known
approaches to this problem by studying how several popular
machine learning methods respond to inaccurate data labeling:
the naive Bayes classifier, a three-layer perceptron,
the k-nearest neighbors method (KNN), decision trees,
random forest, logistic regression, and the support vector
machine (SVM). We trained each algorithm on copies of
specially generated data containing different proportions
of labeling errors and then tested it on accurately labeled
data. In experiments on data simulating both a simple and
a complex structure of two classes of multidimensional
objects, the classification accuracy of the KNN and SVM
models proved to depend comparatively weakly on labeling
errors in the training sample. Under inaccurate data
labeling, the KNN algorithm is preferable: it is less
complex, has fewer tunable parameters, makes no a priori
assumptions about the data structure, is robust to anomalous
outliers, and is interpretable. The method also has
significant potential for further theoretical and practical
development through the construction of context-dependent
local metrics.
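The experimental protocol described above can be sketched in code. The snippet below is a minimal, hypothetical illustration (not the authors' actual setup): it generates two synthetic Gaussian classes, flips a chosen fraction of training labels to simulate labeling errors, and measures the test accuracy of a simple KNN classifier on accurately labeled data. The data generator, the choice of k = 5, and the noise levels are assumptions made for the sketch.

```python
import random
import math

def make_blob(center, n, spread=1.0, rng=None):
    # Hypothetical generator: n points drawn from a Gaussian around `center`.
    return [[rng.gauss(c, spread) for c in center] for _ in range(n)]

def flip_labels(y, fraction, rng):
    # Simulate labeling errors: flip `fraction` of the binary labels at random.
    y = list(y)
    for i in rng.sample(range(len(y)), int(fraction * len(y))):
        y[i] = 1 - y[i]
    return y

def knn_predict(train_X, train_y, x, k=5):
    # Majority vote among the k nearest training points (Euclidean distance).
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = [train_y[i] for i in order[:k]]
    return max(set(votes), key=votes.count)

rng = random.Random(0)
# Two well-separated classes of two-dimensional objects (a "simple" structure).
X = make_blob([0, 0], 200, rng=rng) + make_blob([4, 4], 200, rng=rng)
y = [0] * 200 + [1] * 200
test_X = make_blob([0, 0], 50, rng=rng) + make_blob([4, 4], 50, rng=rng)
test_y = [0] * 50 + [1] * 50

# Train on copies with increasing label-error proportions; test on clean labels.
for noise in (0.0, 0.1, 0.3):
    y_noisy = flip_labels(y, noise, rng)
    acc = sum(knn_predict(X, y_noisy, x) == t
              for x, t in zip(test_X, test_y)) / len(test_y)
    print(f"label noise {noise:.0%}: test accuracy {acc:.2f}")
```

Because KNN classifies by a local majority vote, isolated mislabeled points are usually outvoted by their correctly labeled neighbors, which is consistent with the weak dependence on labeling errors reported in the abstract.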
Keywords:
- artificial intelligence
- context-dependent local metrics
- data labeling errors
- machine learning