A MULTIMODAL DEEP LEARNING FRAMEWORK FOR HUMAN BEHAVIOR RECOGNITION AND SYNTHESIS USING CNN-LSTM AND ENSEMBLE MODELS
Vaikunta Pai T, Manjula Mallya M, Ramona Birau, Nethravathi P S, Virgil Popescu, Iuliana Carmen Bărbăcioru, Pramod Vishnu Naik
Abstract. This study integrates deep learning models to represent, analyze, and generate diverse human behaviors, including postures, gestures, facial expressions, physiological signals, and emotional states. By modeling multimodal signals, the research develops a holistic framework for understanding and recreating complex human behaviors, advancing human-computer interaction (HCI) and enabling empathetic, responsive digital experiences. This approach offers transformative applications across healthcare, education, entertainment, security, automotive, and human resources. In healthcare, it supports patient well-being monitoring, while in education, it enables personalized learning experiences. Entertainment benefits from the creation of immersive, emotionally resonant content, and security sectors gain improved threat detection capabilities. In the automotive field, this research can inform advanced driver-assistance systems (ADAS), enhancing vehicle safety, while in human resources, it supports improved team dynamics and productivity. By prioritizing multimodal data integration, the study improves behavior-recognition accuracy and enables efficient processing of large-scale data. These advancements not only elevate HCI by making interactions more natural and intuitive but also support the development of tailored, human-centered applications. This work paves the way for a future where technology authentically replicates the depth of human expression, fostering an empathetic, adaptive digital environment that responds meaningfully to individual needs.
2020 Mathematics Subject Classification: 68T07; 68T37; 68T50
Keywords: Convolutional Neural Networks (CNNs), Human Behavioral Data, Long Short-Term Memory (LSTM), Deep Learning Models, Multimodal Signals, Human Body Language, Human-Centric Applications
References
- M. Adhikari & A. Munusamy, ICovidCare: Intelligent health monitoring framework for COVID-19 using ensemble random forest in edge networks, Internet of Things, 14, (2021), 100385. DOI: https://doi.org/10.1016/j.iot.2021.100385.
- M. A. Akbar, A. A. Khan, S. Mahmood, S. Rafi & S. Demi, Trustworthy artificial intelligence: A decision-making taxonomy of potential challenges, Software: Practice and Experience, 54(9), (2024), 1621--1650. DOI: https://doi.org/10.1002/spe.3216.
- A. Ali, W. Samara, D. Alhaddad, A. Ware & O.A. Saraereh, Human activity and motion pattern recognition within indoor environment using convolutional neural networks clustering and Naive Bayes classification algorithms, Sensors, 22(3), (2022), 1016. DOI: https://doi.org/10.3390/s22031016.
- N. Alruwais & M. Zakariah, Student-engagement detection in classroom using machine learning algorithm, Electronics, 12(3), (2023), 731. DOI: https://doi.org/10.3390/electronics12030731.
- T. Baltrušaitis, C. Ahuja & L.P. Morency, Multimodal machine learning: A survey and taxonomy, IEEE Trans. Pattern Anal. Mach. Intell., 41(2), (2019), 423--443. DOI: https://doi.org/10.1109/TPAMI.2018.2798607.
- P. Bhardwaj, P.K. Gupta, H. Panwar, M.K. Siddiqui, R. Morales-Menendez & A. Bhaik, Application of deep learning on student engagement in e-learning environments, Computers & Electrical Engineering, 93, (2021), 107277. DOI: https://doi.org/10.1016/j.compeleceng.2021.107277.
- F. Caba Heilbron, V. Escorcia, B. Ghanem & J. Carlos Niebles, ActivityNet: A large-scale video benchmark for human activity understanding, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, (2015), 961--970. DOI: https://doi.org/10.1109/CVPR.2015.7298698.
- C.W. Chang, C.Y. Chang & Y.Y. Lin, A hybrid CNN and LSTM-based deep learning model for abnormal behavior detection, Multimedia Tools and Applications, 81(9), (2022), 11825--11843. DOI: https://doi.org/10.1007/s11042-021-11887-9.
- D.C. Ciresan, U. Meier, L.M. Gambardella & J. Schmidhuber, Convolutional neural network committees for handwritten character classification, in Proc. Int. Conf. Document Anal. Recognit., Sep., (2011), 1135--1139. DOI: https://doi.org/10.1109/ICDAR.2011.229.
- P. Datta & R. Rohilla, An autonomous and intelligent hybrid CNN-RNN-LSTM based approach for the detection and classification of abnormalities in brain, Multimedia Tools and Applications, 83(21), (2024), 60627--60653. DOI: https://doi.org/10.1007/s11042-023-17877-3.
- A. Graves, Supervised Sequence Labelling, Springer, 2012, 5--13. https://api.semanticscholar.org/CorpusID:60085539.
- A. Graves, A.R. Mohamed & G. Hinton, Speech recognition with deep recurrent neural networks, in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May, (2013), 6645--6649. DOI: https://doi.org/10.1109/ICASSP.2013.6638947.
- K. Gupta, A. Singh, S.R. Yeduri, M.B. Srinivas & L.R. Cenkeramaddi, Hand gestures recognition using edge computing system based on vision transformer and lightweight CNN, J. Ambient Intell. Humanized Comput., 14(3), (2023), 2601--2615. DOI: https://doi.org/10.1007/s12652-022-04506-4.
- Z. Hua, Z. Wang, X. Xu, X. Kong & H. Song, An effective PoseC3D model for typical action recognition of dairy cows based on skeleton features, Computers and Electronics in Agriculture, 212, (2023), 108152. DOI: https://doi.org/10.1016/j.compag.2023.108152.
- S. Indolia, A.K. Goswami, S.P. Mishra & P. Asopa, Conceptual understanding of convolutional neural network---a deep learning approach, Procedia Computer Science, 132, (2018), 679--688. DOI: https://doi.org/10.1016/j.procs.2018.05.069.
- M.M. Islam, S. Hassan, S. Akter, F.A. Jibon & M. Sahidullah, A comprehensive review of predictive analytics models for mental illness using machine learning algorithms, Healthcare Analytics, 6, (2024), 100350. DOI: https://doi.org/10.1016/j.health.2024.100350.
- N. Jaouedi, N. Boujnah & M.S. Bouhlel, A new hybrid deep learning model for human action recognition, J. King Saud Univ.--Comput. Inf. Sci., 32(4), (2020), 447--453. DOI: https://doi.org/10.1016/j.jksuci.2019.09.004.
- S. Ji, W. Xu, M. Yang & K. Yu, 3D convolutional neural networks for human action recognition, IEEE Trans. Pattern Anal. Mach. Intell., 35(1), (2012), 221--231. http://researchonline.ljmu.ac.uk/id/eprint/9438/.
- A. Khan, A. Sohail, U. Zahoora & A.S. Qureshi, A survey of the recent architectures of deep convolutional neural networks, Artif. Intell. Rev., 53(8), (2020), 5455--5516. DOI: https://doi.org/10.1007/s10462-020-09825-6.
- J. Li, K. Jin, D. Zhou, N. Kubota & Z. Ju, Attention mechanism-based CNN for facial expression recognition, Neurocomputing, 411, (2020), 340--350. DOI: https://doi.org/10.1016/j.neucom.2020.06.014.
- J. Liu et al., NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding, IEEE Trans. Pattern Anal. Mach. Intell., 42(10), (2020), 2684--2701. DOI: https://doi.org/10.1109/TPAMI.2019.2916873.
- Y. Liu, H. Zhang, Y. Zhan, Z. Chen, G. Yin, L. Wei & Z. Chen, Noise-resistant multimodal transformer for emotion recognition, International Journal of Computer Vision, (2024), 1--21. DOI: https://doi.org/10.1007/s11263-024-02304-3.
- F. Luo, S. Khan, B. Jiang & K. Wu, Vision transformers for human activity recognition using WiFi channel state information, IEEE Internet of Things Journal, (2024). DOI: https://doi.org/10.1109/JIOT.2024.3375337.
- S. Marcos-Pablos & F.J. García-Peñalvo, Emotional intelligence in robotics: A scoping review, in New Trends in Disruptive Technologies, Tech Ethics and Artificial Intelligence: The DITTET Collection 1, Springer International Publishing, (2022), 66--75. DOI: https://doi.org/10.1007/978-3-030-87687-6_7.
- S. Mekruksavanich & A. Jitpattanakul, Hybrid convolution neural network with channel attention mechanism for sensor-based human activity recognition, Sci. Rep., 13(1), (2023), 12067. DOI: https://doi.org/10.1038/s41598-023-39080-y.
- A. van den Oord et al., WaveNet: A generative model for raw audio, arXiv preprint arXiv:1609.03499, (2016). DOI: https://doi.org/10.48550/arXiv.1609.03499.
- S. Peng, S. Sun & Y.D. Yao, A survey of modulation classification using deep learning: Signal representation and data preprocessing, IEEE Transactions on Neural Networks and Learning Systems, 33(12), (2021), 7020--7038. DOI: https://doi.org/10.1109/TNNLS.2021.3085433.
- M. Sajjad, S. Zahir, A. Ullah, Z. Akhtar & K. Muhammad, Human behavior understanding in big multimedia data using CNN based facial expression recognition, Mobile Netw. Appl., 25, (2020), 1611--1621. DOI: https://doi.org/10.1007/s11036-019-01366-9.
- W. Sheng, P. Shan, S. Chen, Y. Liu & F. E. Alsaadi, A niching evolutionary algorithm with adaptive negative correlation learning for neural network ensemble, Neurocomputing, 247, (2017), 173--182. DOI: https://doi.org/10.1016/j.neucom.2017.03.055.
- X. Shao, R. Niu, X. Shao, J. Gao, Y. Shi, Z. Jiang & Y. Wang, Application of dual-stream 3D convolutional neural network based on 18F-FDG PET/CT in distinguishing benign and invasive adenocarcinoma in ground-glass lung nodules, EJNMMI Physics, 8, (2021), 1--13. DOI: https://doi.org/10.1186/s40658-021-00423-1.
- P. Tarnowski, M. Kołodziej, A. Majkowski & R.J. Rak, Emotion recognition using facial expressions, Procedia Computer Science, 108, (2017), 1175--1184. DOI: https://doi.org/10.1016/j.procs.2017.05.025.
- T. Vaikunta Pai et al., DKCNN: Improving deep kernel convolutional neural network-based COVID-19 identification from CT images of the chest, Journal of X-Ray Sci. Technol., 32(1), (2024), 1--18. DOI: https://doi.org/10.3233/XST-230424.
- A. Vaswani et al., Attention is all you need, Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), (2017). DOI: https://doi.org/10.48550/arXiv.1706.03762.
- J. Wang, Y. Chen, S. Hao, X. Peng & L. Hu, Deep learning for sensor-based activity recognition: A survey, Pattern Recognit. Lett., 119, (2019), 3--11. DOI: https://doi.org/10.1016/j.patrec.2018.02.010.
- Q. Xu et al., Scene image and human skeleton-based dual-stream human action recognition, Pattern Recognit. Lett., 148, (2021), 136--145. DOI: https://doi.org/10.1016/j.patrec.2021.06.003.
- H. Yang et al., Asymmetric 3D convolutional neural networks for action recognition, Pattern Recognit., 85, (2019), 1--12. DOI: https://doi.org/10.1016/j.patcog.2018.07.028.
- Z. Yu & W.Q. Yan, Human action recognition using deep learning methods, in Proc. 35th Int. Conf. Image and Vision Comput. New Zealand (IVCNZ), Nov., (2020), 1--6. DOI: https://doi.org/10.1109/IVCNZ51579.2020.9290594.
- E.M. Younis, S.M. Zaki, E. Kanjo & E.H. Houssein, Evaluating ensemble learning methods for multimodal emotion recognition using sensor data fusion, Sensors, 22(15), (2022), 5611. DOI: https://doi.org/10.3390/s22155611.
- Q. Zhang & W.Q. Yan, Currency detection and recognition based on deep learning, Proc. 15th IEEE Int. Conf. Advanced Video and Signal Based Surveillance (AVSS), Nov., (2018), 1--6. DOI: https://doi.org/10.1109/AVSS.2018.8639124.
Published electronically: April 09, 2026
