ISSN 1817-2172, Reg. El. No. FS77-39410, VAK (Higher Attestation Commission)

Differential Equations and Control Processes
(Differencialnie Uravnenia i Protsesy Upravlenia)

An Experimental Study of the Machine Learning Algorithms Response to Data Labelling Errors

Author(s):

Viacheslav Anatolievich Diuk

Doctor of Technical Sciences,
Principal Researcher of the Institute for Transport
Problems of the Russian Academy of Sciences (IPT RAS)
Russia, 199178, St. Petersburg, 12-th Line VO, 13.

v_duke@mail.ru

Abstract:

Authoritative sources hold that data labeling is now the most important step in building AI systems based on machine learning. At the same time, inaccurate labeling is a serious problem, particularly when labels are obtained by crowdsourcing. This article complements the known approaches to the problem by studying how several popular machine learning methods respond to inaccurate labeling: the naive Bayes classifier, a three-layer perceptron, the k-nearest neighbors method (KNN), decision trees, random forest, logistic regression, and the support vector machine (SVM). We trained the algorithms on copies of specially generated data with different proportions of labeling errors and then tested them on accurately labeled data. In experiments on data simulating both a simple and a complex structure of two classes of multidimensional objects, the accuracy of the KNN and SVM classification models showed a relatively weak dependence on labeling errors in the training sample. Under inaccurate labeling, the KNN algorithm is preferable: it is simpler, has fewer adjustable parameters, makes no a priori assumptions about the data structure, is resistant to anomalous outliers, and is interpretable. In addition, the method has significant potential for further theoretical and practical development through the construction of context-dependent local metrics.
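The experimental protocol described in the abstract (train on copies of the data with a controlled share of flipped labels, test on accurately labeled data) can be sketched as follows. This is only an illustrative sketch and not the author's setup: the Gaussian data generator, the sample sizes, the class shift, the value of k, and all function names are assumptions introduced here, and only KNN is shown.

```python
import random
from collections import Counter

def make_data(n_per_class, dim, shift, rng):
    """Two Gaussian classes (assumed generator); class 1 is shifted by `shift` on every axis."""
    data = []
    for label in (0, 1):
        for _ in range(n_per_class):
            point = [rng.gauss(label * shift, 1.0) for _ in range(dim)]
            data.append((point, label))
    return data

def flip_labels(data, error_rate, rng):
    """Return a copy of `data` in which round(error_rate * n) randomly chosen labels are flipped."""
    n_flip = round(error_rate * len(data))
    flipped = set(rng.sample(range(len(data)), n_flip))
    return [(x, 1 - y) if i in flipped else (x, y) for i, (x, y) in enumerate(data)]

def knn_predict(train, point, k):
    """Majority vote among the k nearest training points (squared Euclidean distance)."""
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], point)),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def accuracy(train, test, k):
    """Share of test points whose predicted label matches the true label."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)

def run_experiment(error_rates=(0.0, 0.1, 0.2, 0.3), k=15, seed=0):
    """Train KNN on noisy copies of one training set; always test on clean labels."""
    rng = random.Random(seed)
    clean_train = make_data(200, 5, 1.5, rng)  # assumed sizes and separation
    clean_test = make_data(100, 5, 1.5, rng)
    results = {}
    for rate in error_rates:
        noisy_train = flip_labels(clean_train, rate, rng)
        results[rate] = accuracy(noisy_train, clean_test, k)
    return results

if __name__ == "__main__":
    for rate, acc in run_experiment().items():
        print(f"label error rate {rate:.0%}: test accuracy {acc:.3f}")
```

Comparing the printed accuracies across error rates gives a rough picture of how sensitive one classifier is to label noise. The same loop could be repeated for the other classifiers named in the abstract.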

References:

  1. Diuk V. A. Logicheskie metody mashinnogo obucheniya (instrumental'nye sredstva i prakticheskie primery) [Machine Learning Logic Methods (Tools and Practical Examples)]. St. Petersburg: Vuzizdat Publ. - 2020. - 248 p.
  2. https://www.researchandmarkets.com/reports/5415416 (accessed 05.06.2022)
  3. Roh Y., Heo G., Whang S. A Survey on Data Collection for Machine Learning: A Big Data - AI Integration Perspective. IEEE Trans. Knowl. Data Eng. - 2019. - No. 33. - P. 1328-1347
  4. CloudFactory. The Ultimate Guide to Data Labeling for Machine Learning. https://www.cloudfactory.com/data-labeling-guide (accessed 05.06.2022)
  5. Cognilytica. Data Preparation and Labeling for AI 2020. https://www.cognilytica.com/document/data-preparation-labeling-for-ai-2020/ (accessed 05.06.2022)
  6. A Chat with Andrew on MLOps: From Model-centric to Data-centric AI. - 2021. - https://youtu.be/06-AZXmwHjo (accessed 05.06.2022)
  7. Experian's 2021 Data experience research report. https://www.edq.com/blog/experians-2021-data-experience-research-report (accessed 05.06.2022)
  8. Kaftannikov I. L., Parasich A. V. [Problems of Forming a Training Sample in Machine Learning Problems]. Vestnik YuUrGU. Seriya «Komp'yuternye tekhnologii, upravlenie, radioelektronika». - 2016. - Vol. 16. - P. 16-24. (In Russ.)
  9. Zhou Z.-H. A brief introduction to weakly supervised learning. Natl Sci Rev. - 2018. - Vol. 5. - No. 1. - P. 44-53
  10. Kilgarriff A. Gold standard datasets for evaluating word sense disambiguation programs. Computer Speech and Language. - 1998. - Vol. 12. - No. 3. - P. 453-472
  11. Angluin D., Laird P. Learning from noisy examples. Mach. Learn. - 1988. - Vol. 2. - No. 4. - P. 343-370
  12. Blum A., Kalai A., Wasserman H. Noise-tolerant learning, the parity problem, and the statistical query model. JACM. - 2003. - Vol. 50. - No. 4. - P. 506-519
  13. Gao W., Wang L., Li Y.-F. et al. Risk minimization in the presence of label noise. In 30th AAAI Conference on Artificial Intelligence, Phoenix, AZ. - 2016. - P. 1575-1581
  14. Muhlenbach F., Lallich S., Zighed D. A. Identifying and Handling Mislabelled Instances. Journal of Intelligent Information Systems. - 2004. - No. 22. - P. 89-109
  15. Gilyazev R. A., Turdakov D. Yu. [Active Learning and Crowdsourcing: An Overview of Data Label Optimization Methods]. Trudy ISP RAN. - 2018. - Vol. 30. - No. 2. - P. 215-250. (In Russ.)
  16. Noyunsan C., Katanyukul T., Saikaew K. Performance evaluation of supervised learning algorithms with various training data sizes and missing attributes. Engineering and Applied Science Research. - 2018. - No. 45(3). - P. 221-229
  17. Fernández-Delgado M., Cernadas E., Barro S., Amorim D. Do we need hundreds of classifiers to solve real world classification problems? J Mach Learn Res. - 2014. - No. 15(1). - P. 3133-3181
  18. Hayes B. Top Machine Learning Algorithms, Frameworks, Tools and Products Used by Data Scientists. July 24, 2020. https://customerthink.com/top-machine-learning-algorithms-frameworks-tools-and-products-used-by-data-scientists/ (accessed 05.06.2022)
  19. Eibe Frank, Mark A. Hall, and Ian H. Witten. The WEKA Workbench. Online Appendix for «Data Mining: Practical Machine Learning Tools and Techniques», Morgan Kaufmann, Fourth Edition, - 2016
  20. Platt J. C. Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines. In: Advances in Kernel Methods - Support Vector Learning, ed. by B. Schölkopf, C. J. C. Burges, and A. J. Smola. Cambridge, MA: MIT Press. - 1999. - P. 185-208
  21. Quinlan J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers. - 1993
  22. Cover T., Hart P. Nearest neighbour pattern classification. IEEE Trans. Inform. Theory. - 1967. - Vol. IT-13. - P. 21-27
  23. Duda R. O., Hart P. E. Pattern classification and scene analysis. New York: Wiley. - 1973
  24. Diuk V. A., Bryus F. O., Bogdanov A. V. [Perspective on extensional methods of machine learning]. Informaciya i kosmos. - 2020. - No. 2. - P. 69-76. (In Russ.)
  25. Diuk V. A., Mihov O. M., Bryus F. O. [Extensional Machine Learning Methods]. Trudy Mezhdunarodnoi Nauchno-Tekhnicheskoi Konferenzii «Transport Rossii: problemy i perspektivy - 2019». - 2019. - P. 198-202. (In Russ.)
  26. Dyuk V. A. Context-dependent local metrics and geometrical approach to the problem of knowledge formation. Journal of Computer and Systems Sciences International. - 1996. - Vol. 35. - No. 5. - P. 715-722
