2025-11-19
🖱️ Click with Modifier Keys Support (Linux Only)
The click action now supports modifier keys, enabling keyboard combinations during click operations. This is useful for actions like Shift+Click for multi-select, Control+Click for opening links in new tabs, or other keyboard-mouse combinations.
This feature is currently only available for Linux boxes. Android support is not yet available.
- control: Control key
- shift: Shift key
- alt: Alt key
Example
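A minimal sketch of checking modifier names before issuing a click, assuming only the three keys listed in this entry are accepted. The helper name build_click_request and the payload field names (target, modifierKeys) are hypothetical and are not taken from the Gbox API; consult the API reference for the real request shape.

```python
# Modifier names listed in this changelog entry.
SUPPORTED_MODIFIERS = ("control", "shift", "alt")


def build_click_request(target, modifiers):
    """Return a click payload (hypothetical shape) with validated modifier keys."""
    unknown = [m for m in modifiers if m not in SUPPORTED_MODIFIERS]
    if unknown:
        raise ValueError(f"unsupported modifier keys: {unknown}")
    # Drop duplicates while keeping the caller's order.
    unique = list(dict.fromkeys(modifiers))
    # "target" and "modifierKeys" are illustrative field names, not the documented API.
    return {"target": target, "modifierKeys": unique}
```

For a Shift+Click multi-select, the call would be build_click_request("second file in the list", ["shift"]). Rejecting unknown names locally surfaces typos before any request is sent.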
2025-11-04
🔍 Action Response with Actual Field
All UI actions (Click, Tap, Scroll, Swipe, Drag, Long Press, Touch) now return an actual field with detailed execution information, showing the exact coordinates and parameters used during action execution.
Example
The actual field helps you verify action execution, debug automation, and understand how natural language targets are translated into screen coordinates.
2025-10-23
🤖 Model API for Coordinate Generation
Introducing the gbox-handy-1 model - a specialized AI model that generates precise UI element coordinates from screenshots using natural language. The Model API analyzes screenshots and returns exact coordinates for click, drag, and other UI actions based on descriptive targets.
⏳ Keep Alive on Activity
Automatically extend box expiration time when there’s activity on the box. When keepAlive is set (e.g., “5m”), any operation (UI Action, File System, Browser, Command, Media, or Run Code) ensures that at least the specified duration remains.
Example: If keepAlive is “5m” and the box has 2 minutes remaining, any operation extends it back to 5 minutes.
2025-10-15
🤖 Customizable AI Model Selection for UI Actions
The UI Action system now supports specifying custom AI models through the options.model parameter. This enhancement gives you the flexibility to choose different computer vision models based on your specific needs, performance requirements, or accuracy preferences.
The model parameter is only effective when using natural language-driven UI actions (e.g., click, type, scroll with descriptive targets). It does not apply to coordinate-based or other non-AI-driven interactions.
You can select a specific model (e.g., openai-computer-use) for UI element detection by passing the model parameter in action options.
Example Usage
🔌 Appium Integration Support
Gbox now provides native Appium connection support, enabling seamless integration with the Appium ecosystem for advanced automation workflows. The new appiumURL() method generates a ready-to-use Appium connection URL with optimized default configurations.
Key Features
- One-Click Connection: Get an Appium connection URL with a single method call
- Pre-configured Options: Default capabilities and settings optimized for Gbox Android boxes
- Full Appium Compatibility: Use any Appium client library (WebdriverIO, Appium Python Client, etc.)
- Advanced Automation: Access native Appium features like XML layout inspection, element finding, and complex gestures
Use Cases
- Extract and analyze UI element hierarchies
- Implement complex automation workflows
- Integrate with existing Appium-based testing frameworks
- Debug UI layouts and element properties
Example Usage
2025-10-11
📍 Enhanced Scroll Action with Natural Language Location Support
The scroll action now supports natural language descriptions for the location parameter, allowing you to specify where on the screen to perform the scroll gesture. This enhancement makes it easier to interact with specific areas of the UI, such as scrolling within a particular region or component.
Key Features
- Natural Language Location: Use descriptive text to specify scroll locations (e.g., “screen bottom”, “toolbar area”, “middle of the screen”)
- Flexible Targeting: Perfect for scrolling within specific UI regions or components
- Improved Precision: Better control over scroll behavior in complex layouts
Example Usage
Additional Examples
2025-09-26
🌐 Enhanced Browser API
New browser API with full-screen support and window control options, designed to prevent AI agents from accidentally closing or minimizing browsers.
Key Features & Use Cases
- 🔄 Full-Screen Control: Seamlessly maximize browser windows for immersive experiences
- 🪟 Window Control Management: Hide browser minimize, maximize, and close buttons to prevent AI or automation from accidentally closing browsers
- 🤖 AI Agent Automation: Perfect for AI agents and LLM applications requiring focused browser sessions
2025-09-25
🔍 UI Elements Detection
Automatically detect and extract interactive elements from web pages for precise automation and AI-driven interactions.
Key Features
- Element Detection: Identifies buttons, links, inputs, and other interactive elements
- Rich Metadata: Position, size, text content, and HTML attributes
- Annotated Screenshots: Returns screenshots with visual annotations highlighting detected elements
- AI Ready: Perfect for LLM integration and intelligent automation
Usage Example
2025-09-19
📋 Clipboard Support
New clipboard management capabilities for Android boxes, enabling programmatic control over device clipboard content.
Key Features
- Set/Get Content: Read and write text to device clipboard
- Automation Integration: Seamlessly integrate with UI automation workflows
- Cross-App Data: Share data between applications through clipboard
Usage Example
2025-09-12
📸 Screenshot ScrollCapture
Enhanced screenshot functionality now supports automatic scrolling to capture tall content like long web pages, documents, or chat conversations in a single image.
Key Features
- Automatic Scrolling: Intelligently scrolls through content to capture everything in one screenshot
- Height Control: Configurable maximum height to manage memory usage and file size
- Position Restoration: Optional scroll-back functionality to return to original position
- Memory Optimization: Built-in limits to prevent excessive memory consumption
Configuration Options
- maxHeight: Maximum height in pixels (default: 4000px); limits the total height of captured content
- scrollBack: Whether to scroll back to original position after capture (default: false)
Usage Example
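A minimal sketch of how the two options interact with content height, using the documented defaults (maxHeight 4000, scrollBack false). The class and function names are hypothetical, and the model assumes capture simply stops once maxHeight is reached; it does not call the Gbox API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScrollCaptureOptions:
    # Defaults taken from the configuration options listed in this entry.
    max_height: int = 4000     # maxHeight, in pixels
    scroll_back: bool = False  # scrollBack


def captured_height(content_height, options=ScrollCaptureOptions()):
    """Height of the resulting image, assuming capture stops at max_height."""
    if content_height < 0 or options.max_height <= 0:
        raise ValueError("invalid height")
    return min(content_height, options.max_height)
```

Under this assumption a 12000px page yields a 4000px image with the defaults, and raising max_height trades memory and file size for more captured content.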
Use Cases
- Web Page Documentation: Capture entire web pages for documentation or analysis
- Chat History: Save complete conversation threads
- Long Documents: Capture full document content in one screenshot
- Social Media Feeds: Capture extended social media timelines
- App Content: Document complete app screens or settings pages
2025-09-05
🔄 Rewind Recording
New Rewind functionality that automatically preserves the last 5 minutes of screen recording, allowing you to extract video clips from any time period at any moment.
Key Features
- Automatic Recording: Continuously preserves up to 5 minutes of recent box screen recording
- Flexible Extraction: Extract video clips from any time period (up to 5 minutes maximum)
- Instant Access: No need to pre-start recording - access historical clips anytime
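The 5-minute limit described above can be sketched as a small helper that computes the time window for a clip ending now. The function name is hypothetical, and clamping an over-long request to the buffer size (rather than rejecting it) is an assumption; the service may behave differently.

```python
REWIND_BUFFER_SECONDS = 5 * 60  # the last 5 minutes of recording are preserved


def clip_window(now, requested_seconds):
    """Return (start, end) timestamps in seconds for a clip ending at `now`."""
    if requested_seconds <= 0:
        raise ValueError("requested_seconds must be positive")
    # Assumption: requests longer than the buffer are clamped to 5 minutes.
    duration = min(requested_seconds, REWIND_BUFFER_SECONDS)
    return (now - duration, now)
```

When a script fails, calling clip_window(failure_time, 30) gives the window covering the 30 seconds before the failure.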
Use Cases
- Debug Analysis: Review recent operation recordings when automation scripts encounter issues
- Error Reproduction: Quickly capture screen recordings from before problems occurred for easier troubleshooting
- Operation Logging: Automatically save operation history without manual recording management
- Performance Monitoring: Observe application behavior during specific operations
Performance Considerations
Basic Usage
Advanced Usage
📸 Save Screenshots to Album
Enhanced screenshot functionality now supports saving screenshots directly to the device’s media album, making it easier to organize and access captured images.
Key Features
- Album Integration: Screenshots are automatically saved to the device’s media gallery
- Easy Access: Screenshots can be accessed through the device’s photo app
- Organized Storage: All screenshots are properly categorized in the media library
Usage
2025-08-29
🖼️ Enhanced Screenshot Options
New unified screenshot configuration with options.screenshot parameter for better control over screenshot behavior.
Basic Usage
Key Features
📸 Screenshot Phases
- before: Screenshot before the action
- after: Screenshot after the action
- trace: Screenshot with operation trace
- Default captures all three phases
- delay: Wait time after action before taking final screenshot
- Default: 500ms, Maximum: 30s
- base64: Direct image data (default)
- storageKey: Storage key with presigned URL access
- presignedExpiresIn: Custom expiration for storageKey URLs (default: 30m)
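The delay bounds listed above (default 500ms, maximum 30s) can be checked client-side before a request is made. This is a sketch under assumptions: the function name is hypothetical, and it accepts only whole numbers with an "ms" or "s" suffix, which may be narrower or wider than what the API actually accepts.

```python
import re
from typing import Optional

DEFAULT_DELAY_MS = 500  # documented default
MAX_DELAY_MS = 30_000   # documented maximum (30s)

_UNIT_TO_MS = {"ms": 1, "s": 1000}


def parse_delay_ms(value: Optional[str] = None) -> int:
    """Convert a delay such as "500ms" or "2s" to milliseconds and enforce the maximum."""
    if value is None:
        return DEFAULT_DELAY_MS
    match = re.fullmatch(r"(\d+)(ms|s)", value.strip())
    if match is None:
        raise ValueError(f"unrecognized delay: {value!r}")
    delay = int(match.group(1)) * _UNIT_TO_MS[match.group(2)]
    if delay > MAX_DELAY_MS:
        raise ValueError(f"delay {value!r} exceeds the 30s maximum")
    return delay
```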
Usage Examples
Capture specific phases:
2025-08-25
🎯 Natural Language UI Actions
UI Actions now support natural language descriptions for targets and locations, making automation more intuitive and human-readable.
📏 Action Scale Settings
Added support for scaling UI actions to adjust the size of screenshots and coordinate calculations without changing the actual screen resolution.
Key Features
- Scale Range: 0.1 to 1.0 (10% to 100% of original size)
- Screenshot Scaling: Output screenshots are scaled according to the setting
- Coordinate Scaling: All action coordinates and distances are automatically scaled
- No Resolution Change: The box’s actual screen resolution remains unchanged
Usage Examples
Scale Examples
| Scale Value | Screenshot Size | Coordinate Example |
|---|---|---|
| 1.0 (default) | Full size | Click({x: 100, y: 100}) |
| 0.5 | 50% size | Click({x: 50, y: 50}) equivalent |
| 0.25 | 25% size | Click({x: 25, y: 25}) equivalent |
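The table can be read as a coordinate mapping: a point at (100, 100) on the full-size screen corresponds to (50, 50) when the scale is 0.5. The sketch below reproduces that arithmetic. The function names are hypothetical, and rounding to the nearest pixel is an assumption about how fractional results are handled.

```python
MIN_SCALE, MAX_SCALE = 0.1, 1.0  # documented scale range


def _check_scale(scale):
    if not (MIN_SCALE <= scale <= MAX_SCALE):
        raise ValueError("scale must be between 0.1 and 1.0")


def to_scaled(x, y, scale):
    """Map full-resolution screen coordinates to the scaled coordinate space."""
    _check_scale(scale)
    return (round(x * scale), round(y * scale))


def to_screen(x, y, scale):
    """Map coordinates read from a scaled screenshot back to full resolution."""
    _check_scale(scale)
    return (round(x / scale), round(y / scale))
```

In practice this means coordinates read off a scaled screenshot can be passed to actions as-is while the same scale is in effect; to_screen is only needed when comparing against the box's real resolution.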
🌐 Proxy Configuration
Added proxy configuration support for Android boxes.
2025-08-15
🖱️ UI Actions
- Added: tap and longPress actions for precise coordinate taps and long-press interactions.
- Enhanced: swipe/scroll now support semantic distance values ("tiny" | "short" | "medium" | "long"), so you no longer need to provide pixel values.
New action: tap
New action: longPress
Semantic distance (swipe/scroll)
2025-08-08
🎥 Android Screen Recording
Added screen recording capabilities to Android boxes for creating tutorials and debugging UI interactions.
📱 Android Media Management
Added media file management capabilities for Android devices, including album listing and media operations.
🪝 AI Action Progress Callbacks
Because AI Action execution takes time, a series of callbacks is provided to monitor the progress of the entire operation.
2025-07-29
UI Action Support for StorageKey Output Format
UI Action now supports outputFormat: "storageKey", allowing GBOX.AI to directly store screenshot information. Compared to returning image data directly, StorageKey provides more flexible storage and access options:
- presigned URL: System-generated temporary access link with a default validity period of 30 minutes
- storageKey: A storage key that remains valid throughout the box lifecycle (until the box is deleted)
- Custom presigned URL: Use createPresignedUrl to create presigned URLs with specified expiration times, convenient for returning to LLM models for image analysis
Android Multiple APK Installation Support
Added support for installing multiple APK files using ZIP archives, enabling installation of split APKs and bundle applications.
Installation Modes
Single APK Installation (default):
- Upload and install a single APK file
- Traditional installation method for standalone applications
Multiple APK Installation:
- Upload a ZIP archive containing multiple APK files
- Automatically extracts and installs all APK files in the correct order
- Essential for modern Android apps that use App Bundle distribution
ZIP Archive Structure
When using multiple APK installation, organize your files as follows:
Usage Examples
Single APK Installation:
2025-07-21