X-OmniClaw an edge-native Multimodal Android A @codeKK AndroidOpen Source Website

X-OmniClaw

Introduction: an edge-native Multimodal Android Agent that integrates multimodal perception, memory and action

Tags:

X-OmniClaw is an edge-native Multimodal Android Agent that integrates multimodal perception, memory, and action. It operates independently of virtual environments, functioning directly on physical Android devices. By capturing real-time visual telemetry and executing native touch interactions, it performs cross-app operations through on-device tools.

Omni refers to the integration of three sensing domains: on-screen UI state, real-world visual context, and audio input. X- emphasizes the cross-modal nature of the system, evolving it into a unified perception-to-action framework for reliable task execution.

English · 简体中文 Chinese

2026-04-22 🧠 Execution policy tightened: plain-text replies can return locally; device actions uniformly go through the agent; stronger cross-package ref rebinding and error-location logging.
2026-04-20 🧵 Multi-session parallelism: per-session agent loops, isolated runtime across sessions, precise stop chains—better stability on long tasks.
2026-03-31 ⏰ Scheduled automation: intervals, weekdays, or weekly plans; works screen-on or screen-off.
2026-03-25 🎙️ Speech–vision spine: local speech–vision loop (recording, frames, decision, execution); speech and text share one execution core.
2026-03-14 🛠️ Core runtime refactor: unified device tools (snapshot, actions, launch app, screenshots, etc.) aligned with key OpenClaw runtime capabilities.

📄 Paper

Project page: https://eggplant95.github.io/X-OmniClaw-Page/
Hugging Face Papers: 2605.05765
arXiv: https://arxiv.org/abs/2605.05765
PDF in this repository: X-OmniClaw Technical Report

🧭 Overview

X-OmniClaw local engine architecture

X-OmniClaw is an edge-native multimodal Android Agent deployed on mobile devices. In the following, we summarize its core execution methodology, four-layer system architecture, and three core capabilities in a table.

1. Core methodology: from “chat” to “execution”

To ensure task completion in complex mobile GUI environments, X-OmniClaw organizes every interface interaction as a minimal Observation → Reasoning → Execution loop during device manipulation. It first perceives the current screen interface and the outcome of the previous action, then infers the next optimal operation, and finally invokes atomic Android actions for actual execution. This loop repeats continuously until the task is accomplished or terminated.

Stage	Description
observation	Build a unified observation stack from screenshots, XML metadata, and screen projection. The stack perceives the execution outcome of the previous step and provides decision evidence for subsequent action planning.
reasoning	LLM/VLM interprets the current page, checks the previous action state, retrieves relevant memory, selects skills/tools, and decides whether to answer directly or continue execution.
execution	Dispatch concrete operations through Android atomic actions, including taps, swipes, text input, and app switches.

2. System architecture: four-layer closed loop

At the system level, X-OmniClaw forms a full closed loop through perceive → plan → act → verify. The perception layer aggregates multimodal inputs, the planning layer forms task plans, the execution layer dispatches device actions, and the verification layer checks results and decides whether to continue.

Layer	Component	Role
Perception	Multi-modal Input	ASR, screenshot/recording frames, accessibility tree → unified context.
Planning	Agent Loop	Main agent loop: task decomposition; Kotlin bridge for dispatch.
Execution	Device Scheduler	Snapshots, simulated UI actions, app lifecycle.
Verification	Success Monitor	Post-action checks and loop detection for drift or completion.

3. Three core capabilities

Beyond typical agents, X-OmniClaw strengthens perception depth, memory breadth, and action robustness.

Capability	Focus	Implementation
Omni Perception	Unified multimodal ingress and intent understanding	Integrates UI states, real-world visual contexts, speech inputs, scheduled triggers, floating widgets, and external channels; uses temporal alignment and scene-grounded VLM understanding to convert raw streams into structured intent.
Omni Memory	Multimodal Personalized Memory	Combines working memory for task continuity with long-term personal memory distilled from local multimodal data, enabling personalized multi-turn interactions and memory-grounded automation.
Omni Action	Robust mobile execution and reusable skills	Runs an observation-reasoning-execution loop over hybrid UI evidence and converts user navigation into reusable deeplink/intent-based skills

💡 Key features

On-device executable agent loop: multi-turn tasks use budgeting and loop detection; failures converge and execution continues.
Observable runs and cost: stream steps, thoughts, tool calls, and results; accumulate LLM usage for UI display.
Unified device tools: one surface for UI understanding, tap/type, launch app, screenshots, clipboard—with stability and mis-click guards.
Vision fallback & dual-track decisions: prefer structured understanding; fall back to vision on messy pages.
Speech-to-action: ASR plus screen understanding; align “the frame when push-to-talk fired” with decision input.
Gallery & media workflows: searchable, summarizable, actionable flows (e.g. theme-based photo search, memory tidy, CapCut-style one-tap video).
Deep links & reproducible flows (when available): bookmarks and deep links compress long paths into one-shot commands (“record once, next time one sentence”).
Parallel sessions & controlled stop: isolated sessions; interrupt per session.

🎬 Demos

Three demo tracks / four demos

📷 Demo A1 — Camera-informed execution	📺 Demo A2 — ScreenAvatar execution / screen companion
User “How much is this bottle of water on Taobao?”	User “Let’s start the exercises.”
Behavior • Camera + voice to infer intent • Jump to target app search (e.g. Taobao) • Scroll results, capture prices/volumes	Behavior • Follow the active screen as the primary context • Push-to-talk + screen understanding • Multi-step execution with live feedback
Camera-based item recognition → Camera object recognition.	Screen companion → Multi-step auto execution .

✂️ Demo B — Memory-based one-tap video	📦 Demo C — Instant portal to a Meituan flash-sale page ( Behavior cloning )
User “Find parrot-themed photos and make a one-tap video.”	User “Open Meituan flash deals.”
Behavior • Build a searchable memory index; filter by “parrot” • Stage picks into a temp album (e.g. `A_latest`) • Jump to CapCut one-tap flow, batch select, export/share	Behavior • Record once → reusable bookmark/skill • Later: one sentence to target page • Fallback if launch fails
Theme search → One-tap video.	Record once → One-shot navigation.

🔧 Skills & tools

Skills load from app/src/main/assets/skills/; execution reaches the UI via tools (including device).

🗂️ Bundled skills (10)

Path: app/src/main/assets/skills/<skill>/SKILL.md.

Category	Skill IDs
Search & apps	`app-search`, `taobao-search`
Gallery & media	`gallery-qa`, `gallery-memory`, `capcut-theme-video`, `clipboard-to-shortcut`
Configuration	`model-config`, `channel-config`
Skill management	`skill-creator`
Automation	`scheduled-automation`

🧪 Example utterances (say to the agent)

Skill ID	Example
`app-search`	“Search Xiaohongshu for Beijing travel tips and send me the summary.”
`taobao-search`	“On Taobao, search men’s light sunscreen shirts; recommend 3 by price band and sales.”
`gallery-qa`	“What photos did I take today? Briefly in time order.”
`gallery-memory`	“Sync gallery memory and refresh my profile—scan the latest 20 photos.”
`clipboard-to-shortcut`	“Turn the clipboard URL into a skill named Taobao quick link.”
`channel-config`	“Configure Feishu channel: app id xxx, secret xxx.”
`model-config`	“Add an OpenAI-compatible provider at URL xxx; default model xxx.”
`scheduled-automation`	“Every Wednesday 10:00 open Xiaohongshu, search AI news, summarize.”
`capcut-theme-video`	“Make a one-tap video from today’s landscapes.”
`skill-creator`	“Summarize what just worked as a new skill and write SKILL.md.”

🤖 Models

Recommended: enter API keys in the APK and save; config is written to /sdcard/.xomniclaw/xomniclaw.json—usually no manual JSON editing.

In-app setup (recommended)

Open the app → Model configuration (drawer / settings; wizard may appear on first launch).
Agent model: pick provider → API Key (and Base URL if required) → default model → Save.
Speech STT: open STT from the same area; set transcription API Key, endpoint, model (e.g. SiliconFlow SenseVoice Small); keys may differ from Agent.
Vision VLM: VLM settings for screenshot/UI understanding (must support images); same or different provider as Agent.
After save, disk persists; gateway port etc. live in Settings or related screens.

UI fields map to the JSON below; edit the file directly only for bulk migration, scripting, or debugging.

Built-in providers & example model IDs

Provider ID	Example model ID
`openrouter`	`Qwen 3.6 Flash`
`anthropic`	`claude-opus-4`
`openai`	`gpt-4.1`
`moonshot`	`kimi-k2.5`
`minimax`	`MiniMax-M2.5`
`ollama`	No fixed ID (`/api/tags`)

Config snippet: Agent model

Matches in-app provider / default model.

{
  "models": {
    "providers": {
      "openrouter": {
        "baseUrl": "https://openrouter.ai/api/v1",
        "api": "openai-completions",
        "apiKey": "<OPENROUTER_API_KEY>",
        "models": [
          {
            "id": "qwen/qwen3.6-flash",
            "name": "Qwen 3.6 Flash",
            "contextWindow": 131072,
            "maxTokens": 8192
          }
        ]
      }
    }
  },
  "agents": {
    "defaults": {
      "model": {
        "primary": "openrouter/qwen/qwen3.6-flash"
      }
    }
  }
}

Config snippet: Speech STT

Maps to STT screen. Example: SiliconFlow + FunAudioLLM/SenseVoiceSmall.
SiliconFlow signup: https://cloud.siliconflow.cn.
Free tiers depend on the vendor’s current policy.

{
  "models": {
    "providers": {
      "stt": {
        "baseUrl": "https://api.siliconflow.cn/v1/audio/transcriptions",
        "api": "openai-completions",
        "apiKey": "<SiliconFlow API Key>",
        "models": [
          {
            "id": "FunAudioLLM/SenseVoiceSmall",
            "name": "SenseVoice Small",
            "contextWindow": 1,
            "maxTokens": 1
          }
        ]
      }
    }
  }
}

Config snippet: Vision VLM

Maps to VLM. Example using OpenRouter with an OpenAI-compatible vision model:

{
  "models": {
    "providers": {
      "vlm": {
        "baseUrl": "https://openrouter.ai/api/v1",
        "api": "openai-completions",
        "apiKey": "sk-or-v1-xxx",
        "models": [
          {
            "id": "qwen/qwen3.6-flash",
            "name": "Qwen 3.6 Flash (via OpenRouter)",
            "contextWindow": 200000,
            "maxTokens": 16384
          }
        ]
      }
    }
  }
}

🚀 Quick start

📥 1) Download & install

Latest APK:
https://github.com/OPPO-Mente-Lab/X-OmniClaw/releases/latest

⚙️ 2) First-time setup (prefer in-app forms)

Model configuration: set Agent API Key and a default model with multimodal (image) support; optionally configure STT and VLM separately.
Grant permissions (seven, same as in-app checks): Accessibility, Overlay, Screen capture, Photos, All files access, Camera, Microphone.
Optional: configure Feishu / Discord (and similar) in-app.

📄 3) Config file location (auto-generated)

After saving from the UI:

/sdcard/.xomniclaw/xomniclaw.json

Open this file directly mainly for backup, migration, or debugging.

🛠️ Build from source

📋 Requirements

JDK 17 or newer
Android SDK
Gradle Wrapper (recommended)

🔨 Clone & build (example)

git clone <your GitLab repo URL>
cd X-OmniClaw
cp local.properties.example local.properties
# Edit local.properties: set sdk.dir to your Android SDK
./gradlew :app:assembleDebug

Upstream GitHub mirror (optional):

git clone https://github.com/OPPO-Mente-Lab/X-OmniClaw.git
cd X-OmniClaw

Windows PowerShell example:

$env:JAVA_HOME = "D:\path\to\jdk-17"
$env:ANDROID_HOME = "D:\path\to\android-sdk"
Set-Location "D:\path\to\X-OmniClaw"
Copy-Item local.properties.example local.properties
notepad local.properties
.\gradlew.bat :app:assembleDebug

Example output path:

releases/X-OmniClaw-v<version>-debug.apk

📄 License

Apache License 2.0: commercial use and modifications allowed; retain notices and describe changes.

🙏 Acknowledgments

HermesApp: This project was inspired by HermesApp. Thanks to the HermesApp maintainers and community.

Star History

Apps

Android Developer Tools

Android Developer Tools Pro

About Me

Tools: TimeShining

GitHub: Trinea

Facebook: Dev Tools

JSON Format, Support error correction

MD5/SHA Encode, Support batch

Text Process

CSS Format and Compress