An Experimental Study of the Response of Machine Learning Algorithms to Data Labeling Errors
Author(s):
Viacheslav Anatolievich Diuk
Doctor of Technical Sciences,
Principal Researcher at the Institute for Transport
Problems of the Russian Academy of Sciences (IPT RAS)
13, 12th Line V.O., St. Petersburg, 199178, Russia
v_duke@mail.ru
Abstract:
Data labeling is widely regarded as the most critical element
in building AI systems based on machine learning methods.
At the same time, labeling is often inaccurate, particularly
when it is crowdsourced. This article complements known
approaches to this problem by studying how several popular
machine learning methods respond to inaccurate data labeling:
the naive Bayes classifier, a three-layer perceptron,
the k-nearest neighbors method (KNN), decision trees,
random forest, logistic regression, and the support vector
machine (SVM). We trained each algorithm on copies of
specially generated data containing different proportions
of labeling errors and then tested it on accurately labeled
data. In experiments on data simulating both a simple and
a complex structure of two classes of multidimensional
objects, the classification accuracy of the KNN and SVM
models proved to depend comparatively weakly on labeling
errors in the training sample. Under inaccurate data
labeling, the KNN algorithm is preferable: it is less
complex, has fewer tunable parameters, makes no a priori
assumptions about the data structure, is robust to anomalous
outliers, and is interpretable. The method also has
significant potential for further theoretical and practical
development through the construction of context-dependent
local metrics.
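The experimental protocol described above can be sketched in code. The snippet below is a minimal, hypothetical illustration (not the authors' actual setup): it generates two synthetic Gaussian classes, flips a chosen fraction of training labels to simulate labeling errors, and measures the test accuracy of a simple KNN classifier on accurately labeled data. The data generator, the choice of k = 5, and the noise levels are assumptions made for the sketch.

```python
import random
import math

def make_blob(center, n, spread=1.0, rng=None):
    # Hypothetical generator: n points drawn from a Gaussian around `center`.
    return [[rng.gauss(c, spread) for c in center] for _ in range(n)]

def flip_labels(y, fraction, rng):
    # Simulate labeling errors: flip `fraction` of the binary labels at random.
    y = list(y)
    for i in rng.sample(range(len(y)), int(fraction * len(y))):
        y[i] = 1 - y[i]
    return y

def knn_predict(train_X, train_y, x, k=5):
    # Majority vote among the k nearest training points (Euclidean distance).
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = [train_y[i] for i in order[:k]]
    return max(set(votes), key=votes.count)

rng = random.Random(0)
# Two well-separated classes of two-dimensional objects (a "simple" structure).
X = make_blob([0, 0], 200, rng=rng) + make_blob([4, 4], 200, rng=rng)
y = [0] * 200 + [1] * 200
test_X = make_blob([0, 0], 50, rng=rng) + make_blob([4, 4], 50, rng=rng)
test_y = [0] * 50 + [1] * 50

# Train on copies with increasing label-error proportions; test on clean labels.
for noise in (0.0, 0.1, 0.3):
    y_noisy = flip_labels(y, noise, rng)
    acc = sum(knn_predict(X, y_noisy, x) == t
              for x, t in zip(test_X, test_y)) / len(test_y)
    print(f"label noise {noise:.0%}: test accuracy {acc:.2f}")
```

Because KNN classifies by a local majority vote, isolated mislabeled points are usually outvoted by their correctly labeled neighbors, which is consistent with the weak dependence on labeling errors reported in the abstract.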
Keywords:
- artificial intelligence
- context-dependent local metrics
- data labeling errors
- machine learning