Publications
Browse our research papers, published in top-tier conferences and journals, spanning machine learning, computer vision, natural language processing, and AI ethics.
Vision-Language Fusion for Real-Time Autonomous Driving: Goal-Centered Cross-Attention of Camera, HD-Map, & Waypoints
Autonomous cars need geometric accuracy and semantic understanding to navigate complex environments, yet most stacks handle them separately. We present XYZ-Drive, a single vision-language model that reads a front-camera frame, a 25 m × 25 m overhead map, and the next waypoint, then outputs steering and speed. A lightweight goal-centered cross-attention layer lets waypoint tokens highlight relevant image and map patches, supporting both action prediction and textual explanations, before the fused tokens enter a partially fine-tuned LLaMA-3.2 11B model. On the MD-NEX Outdoor-Driving benchmark, XYZ-Drive attains a 95 % success rate and 0.80 Success weighted by Path Length (SPL), surpassing PhysNav-DG by 15 % and halving collisions, while improving efficiency by using only a single branch.
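To make the goal-centered cross-attention idea concrete, here is a minimal pure-Python sketch: waypoint tokens act as queries over image/map patch tokens, so patches relevant to the goal dominate the fused representation. This is an illustrative sketch only, not the paper's implementation; token dimensions and names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def goal_centered_cross_attention(waypoint_tokens, patch_tokens):
    """Each waypoint token attends over image/map patch tokens:
    scaled dot-product scores, softmax weights, then a weighted
    sum of the patches as the fused output token."""
    d = len(patch_tokens[0])
    fused = []
    for q in waypoint_tokens:
        scores = [dot(q, k) / math.sqrt(d) for k in patch_tokens]
        weights = softmax(scores)
        fused.append([
            sum(w * patch[i] for w, patch in zip(weights, patch_tokens))
            for i in range(d)
        ])
    return fused
```

With a waypoint query aligned to one patch, the fused token ends up close to that patch, which is exactly the "highlight relevant patches" behavior the abstract describes.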
Early Goal-Guided Multi-Scale Fusion for Real-Time Vision-Language Driving
Autonomous vehicles must react in milliseconds while reasoning about road geometry and traffic intent to navigate complex situations. We introduce NovaDrive, a single-branch vision-language architecture that jointly processes front-camera images, HD-map tiles, LiDAR depth, and textual waypoints. A lightweight, two-stage cross-attention block first aligns waypoint tokens with the HD map, then refines attention over fine-grained image and depth patches. Coupled with a novel smoothness loss that discourages abrupt steering and speed changes, this design eliminates the need for recurrent memory.
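A smoothness loss of this kind can be sketched as a penalty on successive control differences. The following is a hedged illustration, not NovaDrive's actual loss; the exact weighting and terms in the paper may differ.

```python
def smoothness_loss(controls):
    """Penalize abrupt changes between consecutive control commands
    (e.g. (steering, speed) pairs): mean squared difference of
    successive control vectors."""
    if len(controls) < 2:
        return 0.0
    total = 0.0
    for prev, cur in zip(controls, controls[1:]):
        total += sum((c - p) ** 2 for p, c in zip(prev, cur))
    return total / (len(controls) - 1)
```

A constant trajectory incurs zero loss, while jerky steering/speed sequences are penalized quadratically, which is what discourages abrupt changes during training.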
A Personalized Embodied Conversational Agent-Based System for Dementia Reminiscence Therapy
We introduce an Embodied Conversational Agent (ECA)-based system that delivers daily, personalized reminiscence therapy for people living with dementia, whether at home, in assisted-living, or in nursing-home settings. The system (1) automatically transforms caregiver archives (diary excerpts, letters, posts, or custom memories) into a knowledge graph of memories; (2) retrieves those memories at runtime via Graph-based Retrieval Augmented Generation (GraphRAG) to generate relevant prompts for the ECA; (3) conducts natural, multimodal dialogue powered by a fine-tuned LLaMA-3 13B model; and (4) monitors well-being with a quadmodal cross-attention-based model that tracks 25 DSM-5 dimensions of mental health using audio, video, text transcript data, and questionnaire responses.
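The graph-based retrieval step can be illustrated with a toy memory graph: score nodes against the query, then expand the best match through its graph neighborhood so related memories are retrieved together. This is a minimal sketch of the GraphRAG pattern, not the system's implementation; the class and scoring are hypothetical.

```python
class MemoryGraph:
    def __init__(self):
        self.nodes = {}   # memory id -> memory text
        self.edges = {}   # memory id -> set of linked memory ids

    def add_memory(self, mid, text):
        self.nodes[mid] = text
        self.edges.setdefault(mid, set())

    def link(self, a, b):
        self.edges[a].add(b)
        self.edges[b].add(a)

    def retrieve(self, query, hops=1):
        """Score memories by word overlap with the query, then expand
        the best match by `hops` neighbourhoods -- the 'graph' part of
        graph-based retrieval augmented generation."""
        qwords = set(query.lower().split())
        best = max(self.nodes,
                   key=lambda m: len(qwords & set(self.nodes[m].lower().split())))
        frontier, seen = {best}, {best}
        for _ in range(hops):
            frontier = {n for f in frontier for n in self.edges[f]} - seen
            seen |= frontier
        return [self.nodes[m] for m in sorted(seen)]
```

The retrieved texts would then be inserted into the ECA's prompt, so the agent can ground its reminiscence dialogue in the caregiver-provided memories.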
CLIP-MG: Guiding Semantic Attention with Skeletal Pose Features and RGB Data for Micro-Gesture Recognition on the iMiGUE Dataset
Micro-gesture recognition is a challenging task in affective computing due to the subtle, involuntary nature of the gestures and their low movement amplitude. In this paper, we introduce CLIP-MG, a Pose-Guided Semantics-Aware architecture that modifies CLIP for micro-gesture classification on the iMiGUE dataset. CLIP-MG integrates human pose (skeleton) information into the CLIP-based recognition pipeline through pose-guided semantic query generation and a gated multi-modal fusion mechanism. The proposed model achieves a Top-1 accuracy of 61.82 %.
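Gated multi-modal fusion generally means a learned gate decides, per dimension, how much of each modality passes through. A minimal sketch, assuming a simple sigmoid gate over RGB and pose features (the actual CLIP-MG gate is likely more elaborate):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(rgb_feat, pose_feat, gate_weights, gate_bias):
    """Per-dimension gate computed from both modalities decides how
    much of each feature stream passes: fused = g*rgb + (1-g)*pose."""
    fused = []
    for i, (r, p) in enumerate(zip(rgb_feat, pose_feat)):
        g = sigmoid(gate_weights[i][0] * r + gate_weights[i][1] * p + gate_bias[i])
        fused.append(g * r + (1.0 - g) * p)
    return fused
```

When the gate saturates toward 1 the RGB stream dominates; toward 0 the pose stream does, letting the model lean on skeleton cues exactly where appearance is ambiguous.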
GenECA: A General-Purpose Framework for Real-Time Adaptive Multimodal Embodied Conversational Agents
We present GenECA, a general-purpose framework for real-time multimodal interaction with embodied conversational agents. GenECA captures audio and visual signals from standard devices to analyze nonverbal features such as facial expressions, vocal tone, gaze, and posture. This information is used to generate context-aware dialogue and synchronize the agent's speech with dynamic gestures and backchannel facial animations in real time. To our knowledge, GenECA is the first ECA system to deliver context-aware speech and well-timed animations in real time without relying on human operators.
Towards Leveraging Semantic Web Technologies for Automated UI Element Annotation
This paper presents a novel Chrome extension for automated web UI element annotation using Semantic Web technologies. The primary objective is to enable Visual Language Models (VLMs) and Large Language Models (LLMs) to quickly comprehend and interact with the web by transforming unstructured visual data into structured semantic information. Our approach integrates natural language processing, vector embeddings, and FAISS for real-time similarity search. In addition to detailing the system architecture and implementation, we propose a custom algorithm for semantic annotation and provide a comparison with traditional methods.
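The real-time similarity search can be illustrated without FAISS itself: embed the annotated UI elements, then rank them by cosine similarity against a query embedding. This pure-Python sketch stands in for the FAISS index only to show the ranking logic; all names and vectors are hypothetical.

```python
import math

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def nearest_labels(query_vec, index, k=1):
    """index: list of (label, embedding) pairs for annotated UI elements.
    Returns the k labels most cosine-similar to the query embedding."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [label for label, _ in ranked[:k]]
```

In the actual system, FAISS replaces the linear scan so lookups stay fast as the number of annotated elements grows.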
DURA-CPS: Multi-Role Orchestration System for Dependability Assurance of AI-Enhanced Cyber-Physical Systems
Cyber-Physical Systems (CPS) increasingly depend on advanced AI techniques to operate in critical applications. However, traditional verification and validation methods often struggle to handle the unpredictable and dynamic nature of AI components. In this paper, we introduce DURA-CPS, a novel framework that employs multi-role orchestration to automate the iterative assurance process for AI-powered CPS. By assigning specialized roles (e.g., safety monitoring, security assessment, fault injection, and recovery planning) to dedicated agents within a simulated environment, DURA-CPS continuously evaluates and refines AI behavior against a range of dependability requirements.
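The multi-role orchestration loop can be sketched as specialized checks run repeatedly against a shared system state until no role reports an issue. This is a deliberately naive illustration of the iterative assurance pattern, not DURA-CPS itself; the roles, state shape, and remediation step are invented for the example.

```python
def run_assurance_loop(system_state, roles, max_iters=3):
    """roles: name -> check(state) returning an issue or None.
    Each iteration collects findings from every role, remediates them,
    and stops once a pass completes with no findings."""
    log = []
    for _ in range(max_iters):
        findings = []
        for name, check in roles.items():
            issue = check(system_state)
            if issue:
                findings.append((name, issue))
        log.append(findings)
        if not findings:
            break
        # naive remediation: mark each reported issue as resolved
        for name, issue in findings:
            system_state.setdefault("resolved", []).append(issue)
            system_state["issues"].remove(issue)
    return log
```

The log of per-iteration findings is what makes the assurance process auditable: each pass records which role flagged what, and the final empty pass certifies convergence.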
PhysNav-DG: A Novel Adaptive Framework for Robust VLM-Sensor Fusion in Navigation Applications
Robust navigation in diverse environments and domains requires both accurate state estimation and transparent decision making. We present PhysNav-DG, a novel framework that integrates classical sensor fusion with the semantic power of vision-language models. Our dual-branch architecture predicts navigation actions from multi-sensor inputs while simultaneously generating detailed chain-of-thought explanations. A modified Adaptive Kalman Filter dynamically adjusts its noise parameters based on environmental context. The framework leverages several streams of raw sensor data along with semantic insights from models such as LLaMA-3.2 11B and BLIP-2.
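The adaptive-noise idea can be shown in a one-dimensional Kalman step where the measurement noise is scaled by a context factor. This is a textbook sketch of the mechanism, assuming identity dynamics; PhysNav-DG's actual filter and its context signals are more involved.

```python
def adaptive_kalman_step(x, p, z, q, r_base, context_scale):
    """One 1-D Kalman update where the measurement noise r is scaled
    by a context factor (e.g. > 1 in fog or glare, < 1 in clear
    conditions). x: state estimate, p: variance, z: measurement,
    q: process noise, r_base: nominal measurement noise."""
    r = r_base * context_scale
    # predict (identity dynamics)
    p = p + q
    # update
    k = p / (p + r)        # Kalman gain shrinks as r grows
    x = x + k * (z - x)
    p = (1.0 - k) * p
    return x, p
```

Under a high context scale (unreliable sensing) the gain collapses and the estimate barely moves toward the measurement; under a low scale it tracks the measurement closely.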
WebNav: An Intelligent Agent for Voice-Controlled Web Navigation
The increasing reliance on web interfaces presents many challenges for visually impaired users, showcasing the need for more advanced assistive technologies. This paper introduces WebNav, a voice-controlled web navigation agent that leverages a ReAct-inspired architecture and generative AI to address these challenges. WebNav comprises a hierarchical structure: a Digital Navigation Module (DIGNAV) for high-level strategic planning, an Assistant Module for translating abstract commands into executable actions, and an Inference Module for low-level interaction. A key component is a dynamic labeling engine, implemented as a browser extension, that generates real-time labels for interactive elements, creating a mapping between voice commands and Document Object Model (DOM) components.
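The labeling engine's core mapping can be sketched as follows: assign each interactive element a short spoken label and index it by its visible text, then resolve spoken commands against that map. The extension itself runs as browser-side JavaScript; this Python sketch only illustrates the mapping logic, and all selectors and command forms are hypothetical.

```python
def build_label_map(elements):
    """elements: dicts with 'selector' and visible 'text'. Assigns
    numeric spoken labels ('1', '2', ...) and also indexes the
    normalized text, so both 'click 2' and 'click search' resolve."""
    label_map = {}
    for i, el in enumerate(elements, start=1):
        label_map[str(i)] = el["selector"]
        label_map[el["text"].lower().strip()] = el["selector"]
    return label_map

def resolve_command(command, label_map):
    """Match the target of a 'click <target>' command to a DOM selector."""
    target = command.lower().removeprefix("click").strip()
    return label_map.get(target)
```

A resolved selector would then be handed to the Inference Module to perform the actual DOM interaction.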
Integrating Large Language Models into a Tri-Modal Architecture for Automated Depression Classification on the DAIC-WOZ
Major Depressive Disorder (MDD) is a pervasive mental health condition that affects an estimated 300 million people worldwide. This work presents a novel, BiLSTM-based tri-modal model-level fusion architecture for the binary classification of depression from clinical interview recordings. The proposed architecture incorporates Mel Frequency Cepstral Coefficients, Facial Action Units, and a two-shot-prompted GPT-4 model for processing text data. This is the first work to incorporate large language models into a multi-modal architecture for this task. It surpasses all baseline models and multiple state-of-the-art models on both the DAIC-WOZ AVEC 2016 Challenge cross-validation split and the Leave-One-Subject-Out cross-validation split.
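Model-level fusion, at its simplest, concatenates the per-modality feature vectors before a joint classification head. The sketch below reduces that idea to a single linear threshold step; the actual architecture uses BiLSTMs and learned heads, so every name and weight here is illustrative only.

```python
def model_level_fusion(audio_feat, visual_feat, text_feat, weights, bias):
    """Concatenate audio (e.g. MFCC-derived), visual (e.g. Facial
    Action Unit), and text (e.g. LLM-derived) feature vectors, then
    apply one linear + threshold step for a binary decision."""
    fused = audio_feat + visual_feat + text_feat   # list concatenation
    score = sum(w * f for w, f in zip(weights, fused)) + bias
    return 1 if score > 0 else 0
```

The point of fusing at the model level (rather than averaging per-modality decisions) is that the classifier can weigh cross-modal feature interactions jointly.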