Construction of an Emotional Image of a Person Based on the Analysis of Key Points in Consecutive Frames of a Video Sequence
Author(s):
Dmitry Dmitriyevich Averianov
Researcher, Center for Research and Development, Robert Bosch LLC
PhD student, Department of Applied Cybernetics, Faculty of Mathematics and Mechanics,
St. Petersburg State University (SPBU)
dmitryaverianov@gmail.com
Mikhail Valerievich Zheludev
Ph.D., Senior Researcher, Center for Research and Development, Robert Bosch LLC
St. Petersburg, Marshala Govorova St., 49
mikhail.zheludev@ru.bosch.com
Vladimir Ilyich Kiyaev
Ph.D., Associate Professor, Department of Astronomy, Faculty of Mathematics and Mechanics,
St. Petersburg State University (SPBU)
kiyaev@mail.ru
Abstract:
This work is devoted to the development of an algorithm for classifying human behavior, specifically for detecting whether statements presented in a video file are truthful or deceptive. The video file is analyzed within a time window in which both changes in the micro-movements of the facial muscles and speech features are examined. Facial expressions are represented mathematically as a vector containing the digital information needed to describe the state of the face, which is characterized by the positions of facial landmarks (key points of the nose, eyebrows, eyes, eyelids, etc.). This mimic vector is produced by trained non-linear models. The vector characterizing speech is formed from heuristic characteristics of the audio signal. Temporal aggregation of the two vectors for the final behavior classification is performed by a separate neural network. The paper presents accuracy and speed results for the algorithm, which show that the new approach is competitive with existing methods.
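The fusion scheme described in the abstract can be sketched roughly as follows. All feature names, dimensions, and the specific audio heuristics below are illustrative assumptions, not the authors' implementation: per-frame facial landmark coordinates are flattened into a mimic vector, simple heuristic audio features are computed for the same time window, and the two are concatenated into one feature vector per window for a downstream temporal classifier.

```python
import numpy as np

def mimic_vector(landmarks):
    """Flatten (num_points, 2) facial landmark coordinates into a 1-D vector."""
    return np.asarray(landmarks, dtype=np.float32).ravel()

def speech_vector(signal, sr):
    """Toy heuristic features for one audio window (illustrative stand-ins for
    prosodic measures such as energy, jitter, or pitch): signal energy,
    zero-crossing rate, and a naive autocorrelation-based pitch period."""
    signal = np.asarray(signal, dtype=np.float32)
    energy = float(np.mean(signal ** 2))
    zcr = float(np.mean(np.abs(np.diff(np.sign(signal))) > 0))
    # Autocorrelation for positive lags; the lag of the peak approximates
    # the dominant period of the window (crude pitch-period proxy).
    ac = np.correlate(signal, signal, mode="full")[len(signal):]
    period = (int(np.argmax(ac)) + 1) / sr if ac.size else 0.0
    return np.array([energy, zcr, period], dtype=np.float32)

def window_features(frames_landmarks, audio_window, sr):
    """One fused feature vector per time window: the mean mimic vector over
    the window's frames concatenated with the window's speech vector."""
    mimic = np.mean([mimic_vector(lm) for lm in frames_landmarks], axis=0)
    return np.concatenate([mimic, speech_vector(audio_window, sr)])
```

A sequence of such per-window vectors would then be fed to a separate temporal model (e.g. a recurrent or transformer-based classifier) for the final truthful/deceptive decision.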
Keywords
- audio analysis
- facial landmarks
- lie detector
- machine and deep learning
- speech signal
- transformers
- video analytics
- video classification
References:
- Goupil L. et al. Listeners’ perceptions of the certainty and honesty of a speaker are associated with a common prosodic signature // Nature Communications. 2021. Vol. 12, № 1. P. 861
- Teixeira J. P., Oliveira C., Lopes C. Vocal Acoustic Analysis - Jitter, Shimmer and HNR Parameters // Procedia Technology. 2013. Vol. 9. P. 1112-1122
- Burzo M. et al. Multimodal deception detection // The Handbook of Multimodal-Multisensor Interfaces, Volume 2. 2018
- Chow A., Louie J. Detecting lies via speech patterns. 2017
- Zhang X., Sugano Y., Fritz M., Bulling A. It's Written All over Your Face: Full-Face Appearance-Based Gaze Estimation // IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). 2017. P. 2299
- Kathi M. G., Shaik J. H. Estimating the smile by evaluating the spread of lips // Revue d'Intelligence Artificielle. 2021. Vol. 35, № 2. P. 153-158
- Zhang X., Sugano Y., Fritz M., Bulling A. Appearance-based gaze estimation in the wild // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015. P. 4511
- Bazarevsky V. et al. BlazeFace: Sub-millisecond Neural Face Detection on Mobile GPUs // CoRR. 2019. abs/1907.05047
- He K. et al. Deep Residual Learning for Image Recognition // CVPR 2016. 2016
- Bertasius G. et al. Is Space-Time Attention All You Need for Video Understanding? // ICML 2021. 2021
- Vaswani A. et al. Attention Is All You Need // 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA. 2017
- Gong Y. et al. AST: Audio Spectrogram Transformer // Interspeech 2021. 2021
- Burkhardt F. et al. A Database of German Emotional Speech // Interspeech. 2005. P. 1517-1520
- Zhu Y. et al. TinaFace: Strong but Simple Baseline for Face Detection // arXiv preprint arXiv:2011.13183. 2020
- Tran D. et al. A Closer Look at Spatiotemporal Convolutions for Action Recognition // CVPR 2018. 2018
- Olah C. Understanding LSTM Networks // colah.github.io. 2015
- Alammar J. Visualizing A Neural Machine Translation Model (Mechanics of Seq2Seq Models With Attention)