Publications

Browse our research papers published in top-tier conferences and journals in machine learning, computer vision, natural language processing, and AI ethics.

LVLM @ IEEE International Conference on Image Processing 2025
Computer Vision
2025
Aug 2, 2025

Vision-Language Fusion for Real-Time Autonomous Driving: Goal-Centered Cross-Attention of Camera, HD-Map, & Waypoints

Santosh Patapati, Trisanth Srinivasan, Murari Ambati

Autonomous cars need geometric accuracy and semantic understanding to navigate complex environments, yet most stacks handle them separately. We present XYZ-Drive, a single vision-language model that reads a front-camera frame, a 25 m × 25 m overhead map, and the next waypoint, then outputs steering and speed. A lightweight goal-centered cross-attention layer lets waypoint tokens highlight relevant image and map patches, supporting both action and textual explanations, before the fused tokens enter a partially fine-tuned LLaMA-3.2 11B model. On the MD-NEX Outdoor-Driving benchmark, XYZ-Drive attains 95 % success and 0.80 Success weighted by Path Length (SPL), surpassing PhysNav-DG by 15 % and halving collisions, while its single-branch design significantly improves efficiency.
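The core mechanism of the abstract above, waypoint tokens acting as queries over fused image and map patches, can be sketched as plain scaled dot-product cross-attention. This is an illustrative NumPy sketch with assumed shapes and a single unprojected attention head, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def goal_centered_cross_attention(waypoint_tokens, image_patches, map_patches):
    """Waypoint tokens (queries) attend over concatenated image+map patch tokens.

    waypoint_tokens: (n_wp, d); image_patches: (n_img, d); map_patches: (n_map, d)
    Returns fused waypoint features (n_wp, d) and attention weights (n_wp, n_ctx).
    """
    context = np.concatenate([image_patches, map_patches], axis=0)  # (n_ctx, d)
    d = waypoint_tokens.shape[-1]
    scores = waypoint_tokens @ context.T / np.sqrt(d)               # (n_wp, n_ctx)
    weights = softmax(scores, axis=-1)                              # rows sum to 1
    return weights @ context, weights
```

The attention weights double as an interpretability signal: each row shows which image or map patches a waypoint attended to, which is what lets such a layer support textual explanations alongside control outputs.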

IEEE Vehicular Technology Conference 2025
Computer Vision
2025
Aug 2, 2025

Early Goal-Guided Multi-Scale Fusion for Real-Time Vision-Language Driving

Santosh Patapati, Trisanth Srinivasan

Autonomous vehicles must react in milliseconds while reasoning about road geometry and traffic intent to navigate complex situations. We introduce NovaDrive, a single-branch vision-language architecture that processes front-camera images, HD-map tiles, LiDAR depth, and textual waypoints. A lightweight, two-stage cross-attention block first aligns waypoint tokens with the HD map, then refines attention over fine-grained image and depth patches. Coupled with a novel smoothness loss that discourages abrupt steering and speed changes, this design eliminates the need for recurrent memory.
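One plausible form of the smoothness loss mentioned above is a penalty on first differences of the control sequence. This is a minimal sketch under that assumption; the paper's exact loss may weight steering and speed differently:

```python
import numpy as np

def smoothness_loss(controls):
    """Penalize abrupt changes between consecutive control outputs.

    controls: (T, 2) array of [steering, speed] over T timesteps.
    Returns the mean squared first difference (hypothetical formulation).
    """
    diffs = np.diff(controls, axis=0)   # (T-1, 2) change per timestep
    return float(np.mean(diffs ** 2))
```

A constant control sequence incurs zero penalty, while jerky steering or speed changes are quadratically punished, nudging the model toward physically comfortable trajectories.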

ACM Multimedia
NLP & Speech
2025
Jun 26, 2025

A Personalized Embodied Conversational Agent-Based System for Dementia Reminiscence Therapy

Santosh Patapati, Trisanth Srinivasan, Rushil Kukreja

We introduce an Embodied Conversational Agent (ECA)-based system that delivers daily, personalized reminiscence therapy for people living with dementia, whether at home, in assisted living, or in a nursing home. The system (1) automatically transforms caregiver archives (diary excerpts, letters, posts, or custom memories) into a knowledge graph of memories; (2) retrieves those memories at run-time via Graph-based Retrieval Augmented Generation (GraphRAG) to generate relevant prompts for the ECA; (3) conducts natural, multimodal dialogue powered by a fine-tuned LLaMA-3 13B model; and (4) monitors well-being with a quadmodal cross-attention-based model that tracks 25 DSM-5 dimensions of mental health using audio, video, text transcript data, and questionnaire responses.

Human Behavior Analysis for Emotion Understanding @ IJCAI
Computer Vision
2025
Jun 24, 2025

CLIP-MG: Guiding Semantic Attention with Skeletal Pose Features and RGB Data for Micro-Gesture Recognition on the iMiGUE Dataset

Santosh Patapati, Trisanth Srinivasan, Amith Adiraju

Micro-gesture recognition is a challenging task in affective computing due to the subtle, involuntary nature of the gestures and their low movement amplitude. In this paper, we introduce a Pose-Guided Semantics-Aware CLIP-based architecture, or CLIP-MG, a modified CLIP model tailored for micro-gesture classification on the iMiGUE dataset. CLIP-MG integrates human pose (skeleton) information into the CLIP-based recognition pipeline through pose-guided semantic query generation and a gated multi-modal fusion mechanism. The proposed model achieves a Top-1 accuracy of 61.82 %.
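The gated multi-modal fusion mechanism named above can be sketched in its common generic form: a learned sigmoid gate decides, per feature dimension, how much to trust the RGB stream versus the pose stream. Shapes and the single-gate formulation are assumptions for illustration; CLIP-MG's exact fusion may differ:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(rgb_feat, pose_feat, W, b):
    """Gated multimodal fusion of RGB and skeletal-pose features.

    rgb_feat, pose_feat: (d,) feature vectors; W: (d, 2d), b: (d,) gate parameters.
    The gate g in (0, 1) interpolates between the two modalities per dimension.
    """
    g = sigmoid(W @ np.concatenate([rgb_feat, pose_feat]) + b)  # (d,)
    return g * rgb_feat + (1.0 - g) * pose_feat
```

For subtle micro-gestures, such a gate lets the model lean on pose cues when appearance is ambiguous, and on RGB when the skeleton estimate is noisy.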

Interspeech 2025
NLP & Speech
2025
Jun 9, 2025

GenECA: A General-Purpose Framework for Real-Time Adaptive Multimodal Embodied Conversational Agents

Santosh Patapati, Trisanth Srinivasan, Murari Ambati

We present GenECA, a general-purpose framework for real-time multimodal interaction with embodied conversational agents. GenECA captures audio and visual signals from standard devices to analyze nonverbal features such as facial expressions, vocal tone, gaze, and posture. This information is used to generate context-aware dialogue and synchronize the agent's speech with dynamic gestures and backchannel facial animations in real time. GenECA is the first ECA system to deliver context-aware speech and well-timed animations in real time without relying on human operators.

8th IEEE International Conference on Inventive Computation Technologies (ICICT)
NLP
2025
May 23, 2025

Towards Leveraging Semantic Web Technologies for Automated UI Element Annotation

Trisanth Srinivasan

This paper presents a novel Chrome extension for automated web UI element annotation using Semantic Web technologies. The primary objective is to enable Visual Language Models (VLMs) and Large Language Models (LLMs) to quickly comprehend and interact with the web by transforming unstructured visual data into structured semantic information. Our approach integrates natural language processing, vector embeddings, and FAISS for real-time similarity search. In addition to detailing the system architecture and implementation, we propose a custom algorithm for semantic annotation and provide a comparison with traditional methods.
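The embedding-based annotation step above amounts to nearest-neighbor search between UI-element embeddings and semantic-label embeddings. The sketch below uses brute-force cosine similarity in NumPy as a stand-in for the FAISS index; the vectors and labels are hypothetical:

```python
import numpy as np

def annotate_elements(element_vecs, label_vecs, labels):
    """Assign each UI element the semantic label of its nearest embedding.

    element_vecs: (n, d) embeddings of scraped UI elements.
    label_vecs: (m, d) embeddings of candidate semantic labels.
    labels: list of m label strings. Returns one label per element.
    """
    def norm(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    sims = norm(element_vecs) @ norm(label_vecs).T   # (n, m) cosine similarities
    return [labels[i] for i in sims.argmax(axis=1)]
```

In the real extension a FAISS index would replace the dense matrix product so that lookups stay real-time as the label vocabulary grows.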

55th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W)
AI Dependability
2025
May 12, 2025

DURA-CPS: Multi-Role Orchestration System for Dependability Assurance of AI-Enhanced Cyber-Physical Systems

Trisanth Srinivasan, Santosh Patapati, Himani Musku, Idhant Gode, Aditya Arora, Samvit Bhattacharya, Abubakr Nazriev, Sanika Hirave, Zaryab Kanjiani, Srinjoy Ghose

Cyber-Physical Systems (CPS) increasingly depend on advanced AI techniques to operate in critical applications. However, traditional verification and validation methods often struggle to handle the unpredictable and dynamic nature of AI components. In this paper, we introduce DURA-CPS, a novel framework that employs multi-role orchestration to automate the iterative assurance process for AI-powered CPS. By assigning specialized roles (e.g., safety monitoring, security assessment, fault injection, and recovery planning) to dedicated agents within a simulated environment, DURA-CPS continuously evaluates and refines AI behavior against a range of dependability requirements.

Computer Vision and Pattern Recognition Workshops 2025 (CVPRW)
Computer Vision
2025
Apr 29, 2025

PhysNav-DG: A Novel Adaptive Framework for Robust VLM-Sensor Fusion in Navigation Applications

Trisanth Srinivasan, Santosh Patapati

Robust navigation in diverse environments and domains requires both accurate state estimation and transparent decision making. We present PhysNav-DG, a novel framework that integrates classical sensor fusion with the semantic power of vision-language models. Our dual-branch architecture predicts navigation actions from multi-sensor inputs while simultaneously generating detailed chain-of-thought explanations. A modified Adaptive Kalman Filter dynamically adjusts its noise parameters based on environmental context, leveraging several streams of raw sensor data along with semantic insights from models such as LLaMA-3.2 11B and BLIP-2.
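The context-adaptive noise adjustment described above can be illustrated with a scalar Kalman filter whose measurement noise is inflated when the environment degrades sensing (e.g. fog or glare). This is a didactic 1-D sketch with identity dynamics, not the paper's filter:

```python
def adaptive_kalman_step(x, P, z, q, r_base, context_scale):
    """One predict/update step of a 1-D Kalman filter with context-scaled noise.

    x, P: prior state estimate and its variance
    z: new measurement; q: process noise; r_base: nominal measurement noise
    context_scale: > 1 in degraded conditions, inflating R so the filter
    trusts the measurement less (illustrative mechanism).
    """
    # Predict (identity dynamics): uncertainty grows by the process noise
    P = P + q
    # Update with context-adjusted measurement noise
    R = r_base * context_scale
    K = P / (P + R)              # Kalman gain shrinks as R grows
    x = x + K * (z - x)
    P = (1.0 - K) * P
    return x, P
```

With a large context scale the gain shrinks, so a noisy measurement in fog pulls the state estimate less than the same measurement on a clear day.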

arXiv Preprint
NLP
2025
Mar 28, 2025

WebNav: An Intelligent Agent for Voice-Controlled Web Navigation

Trisanth Srinivasan, Santosh Patapati

The increasing reliance on web interfaces presents many challenges for visually impaired users, showcasing the need for more advanced assistive technologies. This paper introduces WebNav, a voice-controlled web navigation agent that leverages a ReAct-inspired architecture and generative AI to address these challenges. WebNav comprises a hierarchical structure: a Digital Navigation Module (DIGNAV) for high-level strategic planning, an Assistant Module for translating abstract commands into executable actions, and an Inference Module for low-level interaction. A key component is a dynamic labeling engine, implemented as a browser extension, that generates real-time labels for interactive elements, creating a mapping between voice commands and Document Object Model (DOM) components.

arXiv Preprint
NLP & Speech
2024
Jul 27, 2024

Integrating Large Language Models into a Tri-Modal Architecture for Automated Depression Classification on the DAIC-WOZ

Santosh Patapati

Major Depressive Disorder (MDD) is a pervasive mental health condition that affects 300 million people worldwide. This work presents a novel, BiLSTM-based tri-modal model-level fusion architecture for the binary classification of depression from clinical interview recordings. The proposed architecture incorporates Mel Frequency Cepstral Coefficients and Facial Action Units, and uses a two-shot-learning-based GPT-4 model to process text data. This is the first work to incorporate large language models into a multi-modal architecture for this task. On both the DAIC-WOZ AVEC 2016 Challenge cross-validation split and the Leave-One-Subject-Out cross-validation split, it surpasses all baseline models and multiple state-of-the-art models.