2025-10-15
These features require upgrading to TypeScript SDK version 40 or later.

🤖 Customizable AI Model Selection for UI Actions

UI actions now accept a custom AI model through the options.model parameter (e.g., openai-computer-use). This lets you choose the computer vision model used for UI element detection based on your performance or accuracy requirements.
The model parameter only takes effect for natural language-driven UI actions (e.g., click, type, scroll with descriptive targets). It does not apply to coordinate-based or other non-AI-driven interactions.

Example Usage

import GboxSDK from "gbox-sdk";

const gboxSDK = new GboxSDK({
  apiKey: process.env["GBOX_API_KEY"],
});

async function main() {
  const box = await gboxSDK.create({ type: "android" });

  // Use a specific AI model for UI action
  await box.action.click({
    target: "chrome app",
    options: {
      model: "openai-computer-use"
    }
  })
}

main()
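
Because options.model only matters for natural language targets, a small wrapper can keep it off coordinate-based calls. The sketch below is a hypothetical helper, not part of gbox-sdk: ActionParams and withModel are illustrative names, and the parameter shape is assumed from the example above.

```typescript
// Hypothetical types: the shape is assumed from the click example above.
type ActionOptions = { model?: string; [key: string]: unknown };
type ActionParams = {
  target?: string;
  x?: number;
  y?: number;
  options?: ActionOptions;
};

// Attach a model only when the call has a descriptive (natural language) target.
function withModel(params: ActionParams, model: string): ActionParams {
  if (typeof params.target !== "string") {
    // Coordinate-based call: the model would have no effect, so leave it untouched.
    return params;
  }
  return { ...params, options: { ...params.options, model } };
}

// Usage (assumes a box created as in the example above):
// await box.action.click(withModel({ target: "chrome app" }, "openai-computer-use"));
```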

🔌 Appium Integration Support

Gbox now provides native Appium connection support, enabling seamless integration with the Appium ecosystem for advanced automation workflows. The new appiumURL() method generates a ready-to-use Appium connection URL with optimized default configurations.

Key Features

  • One-Click Connection: Get an Appium connection URL with a single method call
  • Pre-configured Options: Default capabilities and settings optimized for Gbox Android boxes
  • Full Appium Compatibility: Use any Appium client library (WebdriverIO, Appium Python Client, etc.)
  • Advanced Automation: Access native Appium features like XML layout inspection, element finding, and complex gestures

Use Cases

  • Extract and analyze UI element hierarchies
  • Implement complex automation workflows
  • Integrate with existing Appium-based testing frameworks
  • Debug UI layouts and element properties

Example Usage

import { remote } from "webdriverio";
import GboxSDK from "gbox-sdk";
import * as fs from "fs";
import * as path from "path";

const gboxSDK = new GboxSDK({
  apiKey: process.env["GBOX_API_KEY"],
});

async function main() {
  const box = await gboxSDK.create({ type: "android" });

  // Generate Appium connection URL and default options
  const { url, defaultOption } = await box.appiumURL();

  console.log("Appium connection URL:", url);

  // Connect to Appium with defaultOption from backend
  console.log("Connecting to Appium server...");
  const ac = await remote(defaultOption);

  console.log("✅ Successfully connected to Appium server");
  console.log("Session ID:", ac.sessionId);

  try {
    // Get current page XML layout
    console.log("Fetching XML layout...");
    const xmlLayout = await ac.getPageSource();
    
    console.log("✅ XML layout fetched successfully!");
    console.log("XML length:", xmlLayout.length, "characters");
    
    // Display a preview of the XML (first 500 characters)
    console.log("\n--- XML Layout Preview (first 500 chars) ---");
    console.log(xmlLayout.substring(0, 500) + "...\n");
    
    // Save XML to file for easier viewing
    const outputDir = path.join(__dirname, "output");
    if (!fs.existsSync(outputDir)) {
      fs.mkdirSync(outputDir, { recursive: true });
    }
    
    const timestamp = new Date().toISOString().replace(/[:.]/g, "-");
    const filename = `layout_${timestamp}.xml`;
    const filepath = path.join(outputDir, filename);
    
    fs.writeFileSync(filepath, xmlLayout, "utf-8");
    console.log(`📁 XML layout saved to: ${filepath}`);
    
    // Optional: Parse and display some basic info
    const elementMatches = xmlLayout.match(/<[\w.-]+/g);
    if (elementMatches) {
      const elementSet = new Set(elementMatches.map(e => e.substring(1)));
      const uniqueElements = Array.from(elementSet);
      console.log("\n--- UI Elements Found ---");
      console.log("Total elements:", elementMatches.length);
      console.log("Unique element types:", uniqueElements.length);
      console.log("Element types:", uniqueElements.slice(0, 10).join(", "), "...");
    }
    
  } catch (error) {
    console.error(
      "❌ Error fetching XML layout:",
      error instanceof Error ? error.message : String(error)
    );
  } finally {
    console.log("\nClosing session...");
    await ac.deleteSession();
    console.log("Session closed.");
  }
}

main()
2025-10-11
These features require upgrading to TypeScript SDK version 38 or later.

📍 Enhanced Scroll Action with Natural Language Location Support

The scroll action now supports natural language descriptions for the location parameter, allowing you to specify where on the screen to perform the scroll gesture. This enhancement makes it easier to interact with specific areas of the UI, such as scrolling within a particular region or component.

Key Features

  • Natural Language Location: Use descriptive text to specify scroll locations (e.g., “screen bottom”, “toolbar area”, “middle of the screen”)
  • Flexible Targeting: Perfect for scrolling within specific UI regions or components
  • Improved Precision: Better control over scroll behavior in complex layouts

Example Usage

import GboxSDK from "gbox-sdk";

const gboxSDK = new GboxSDK({
  apiKey: process.env["GBOX_API_KEY"] // This is the default and can be omitted
});

async function main() {
  const box = await gboxSDK.create({
    type: "android",
  });

  // Scroll up from the bottom of the screen
  const result = await box.action.scroll({
    direction: 'up',
    location: "screen bottom",
  });

  console.info(result);
}

main();

Additional Examples

// Scroll down in the toolbar area
await box.action.scroll({
  direction: 'down',
  location: "toolbar area",
});

// Scroll left in the center of the screen
await box.action.scroll({
  direction: 'left',
  location: "middle of the screen",
});
2025-09-26
These features require upgrading to TypeScript SDK version 37 or later.

🌐 Enhanced Browser API

The new browser API adds full-screen support and window control options that prevent AI agents from accidentally closing or minimizing the browser.

Key Features & Use Cases

  • 🔄 Full-Screen Control: Seamlessly maximize browser windows for immersive experiences
  • 🪟 Window Control Management: Hide browser minimize, maximize, and close buttons to prevent AI or automation from accidentally closing browsers
  • 🤖 AI Agent Automation: Perfect for AI agents and LLM applications requiring focused browser sessions
import GboxSDK from "gbox-sdk";
import { chromium } from "playwright";

const gboxSDK = new GboxSDK({
  apiKey: process.env["GBOX_API_KEY"] // This is the default and can be omitted
});

async function main() {
  const box = await gboxSDK.create({ type: "linux" });

  const result = await box.browser.open({
    maximize: true,
    showControls: false,
  })

  const browser = await chromium.connectOverCDP(result.cdpUrl);

  // Use the browser as usual
  const context = browser.contexts()[0];
  const page = await context.newPage();
  await page.goto("https://example.com");

  // Perform actions on the page
  console.log(await page.title());

  await box.browser.close()
}

main();
2025-09-25
These features require upgrading to TypeScript SDK version 36 or later.

🔍 UI Elements Detection

Automatically detect and extract interactive elements from web pages for precise automation and AI-driven interactions.
Currently supported only on Linux boxes with a browser. Android support is coming soon.

Key Features

  • Element Detection: Identifies buttons, links, inputs, and other interactive elements
  • Rich Metadata: Position, size, text content, and HTML attributes
  • Annotated Screenshots: Returns screenshots with visual annotations highlighting detected elements
  • AI Ready: Perfect for LLM integration and intelligent automation

Usage Example

import GboxSDK from "gbox-sdk";

const gboxSDK = new GboxSDK({
  apiKey: process.env["GBOX_API_KEY"] // This is the default and can be omitted
});

async function main() {
  const box = await gboxSDK.create({ type: "linux" });

  await box.browser.openTab({
    url: "https://gbox.ai",
  });

  const { screenshot, elements } = await box.action.elements.detect({
    screenshot: {
      outputFormat: 'storageKey'
    }
  });

  console.info(`Screenshot: ${JSON.stringify(screenshot, null, 2)}`);

  console.info(`Detected elements length: ${elements.list().length}`);

  // You can send the screenshot to an LLM or Agent to decide which element to click
  // here we just click the first element
  const firstElement = elements.get("1");
  await box.action.click({
    target: firstElement,
  });

  console.info(
    `Clicked element: ${JSON.stringify(firstElement, null, 2)}`
  );
}

main();
2025-09-19
These features require upgrading to TypeScript SDK version 35 or later.

📋 Clipboard Support

New clipboard management capabilities for Android boxes, enabling programmatic control over device clipboard content.

Key Features

  • Set/Get Content: Read and write text to device clipboard
  • Automation Integration: Seamlessly integrate with UI automation workflows
  • Cross-App Data: Share data between applications through clipboard

Usage Example

import GboxSDK from "gbox-sdk";

const gboxSDK = new GboxSDK({
  apiKey: process.env["GBOX_API_KEY"] // This is the default and can be omitted
});

async function main() {
  const box = await gboxSDK.create({ type: "android" });

  // Set clipboard content
  await box.action.clipboard.set("Hello, world!");

  // Get current clipboard content
  const clipboardContent = await box.action.clipboard.get();
  console.log("Clipboard content:", clipboardContent);
}

main();
2025-09-12
These features require upgrading to TypeScript SDK version 34 or later.

📸 Screenshot ScrollCapture

Enhanced screenshot functionality now supports automatic scrolling to capture tall content like long web pages, documents, or chat conversations in a single image.

Key Features

  • Automatic Scrolling: Intelligently scrolls through content to capture everything in one screenshot
  • Height Control: Configurable maximum height to manage memory usage and file size
  • Position Restoration: Optional scroll-back functionality to return to original position
  • Memory Optimization: Built-in limits to prevent excessive memory consumption

Configuration Options

  • maxHeight: Maximum height in pixels (default: 4000px) - limits the total height of captured content
  • scrollBack: Whether to scroll back to original position after capture (default: false)
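
To make the defaults above explicit in calling code, the options can be resolved before they are passed to the screenshot action. This is a minimal sketch: ScrollCaptureOptions and resolveScrollCapture are hypothetical names, and only the default values (4000 and false) come from the list above.

```typescript
// Hypothetical option shape mirroring the two documented fields.
interface ScrollCaptureOptions {
  maxHeight?: number;
  scrollBack?: boolean;
}

// Fill in the documented defaults and reject obviously invalid heights.
function resolveScrollCapture(
  opts: ScrollCaptureOptions = {}
): Required<ScrollCaptureOptions> {
  const maxHeight = opts.maxHeight ?? 4000; // documented default
  if (!Number.isFinite(maxHeight) || maxHeight <= 0) {
    throw new RangeError(`maxHeight must be a positive number, got ${maxHeight}`);
  }
  return { maxHeight, scrollBack: opts.scrollBack ?? false };
}

// Usage (assumes an existing box):
// await box.action.screenshot({ scrollCapture: resolveScrollCapture({ maxHeight: 5000 }) });
```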

Usage Example

const result = await box.action.screenshot({
  scrollCapture: {
    maxHeight: 5000,
    scrollBack: true
  }
})

Use Cases

  • Web Page Documentation: Capture entire web pages for documentation or analysis
  • Chat History: Save complete conversation threads
  • Long Documents: Capture full document content in one screenshot
  • Social Media Feeds: Capture extended social media timelines
  • App Content: Document complete app screens or settings pages
2025-09-05
These features require upgrading to TypeScript SDK version 31 or later.

🔄 Rewind Recording

The new Rewind feature automatically preserves the last 5 minutes of screen recording, so you can extract a clip from any part of that window at any time.

Key Features

  • Automatic Recording: Continuously preserves up to 5 minutes of recent box screen recording
  • Flexible Extraction: Extract video clips from any time period (up to 5 minutes maximum)
  • Instant Access: No need to pre-start recording - access historical clips anytime
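
Since extraction is capped at 5 minutes, it can help to check a requested duration before calling extract. The sketch below is an assumption-heavy illustration: parseDurationMs and fitsRewindWindow are hypothetical helpers, and the accepted format (a number followed by ms, s, m, or h) is inferred from the duration strings used in this changelog, not from the SDK's parser.

```typescript
// Unit multipliers for the duration strings seen in this changelog ("10s", "2m", "500ms").
const UNIT_MS = { ms: 1, s: 1_000, m: 60_000, h: 3_600_000 } as const;
type DurationUnit = keyof typeof UNIT_MS;

// Convert a simple "<number><unit>" string to milliseconds.
function parseDurationMs(value: string): number {
  const match = /^(\d+(?:\.\d+)?)(ms|s|m|h)$/.exec(value.trim());
  if (!match) {
    throw new Error(`Unsupported duration: ${value}`);
  }
  return Number(match[1]) * UNIT_MS[match[2] as DurationUnit];
}

// Rewind keeps at most the last 5 minutes, per the feature list above.
const REWIND_WINDOW_MS = 5 * 60_000;

function fitsRewindWindow(duration: string): boolean {
  const ms = parseDurationMs(duration);
  return ms > 0 && ms <= REWIND_WINDOW_MS;
}

// Usage (assumes an existing box with Rewind enabled):
// if (fitsRewindWindow("2m")) {
//   await box.action.recording.rewind.extract({ duration: "2m" });
// }
```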

Use Cases

  • Debug Analysis: Review recent operation recordings when automation scripts encounter issues
  • Error Reproduction: Quickly capture screen recordings from before problems occurred for easier troubleshooting
  • Operation Logging: Automatically save operation history without manual recording management
  • Performance Monitoring: Observe application behavior during specific operations

Performance Considerations

Enabling Rewind has some performance impact. Enable it only when you need it, and disable it once you are done.

Basic Usage

import GboxSDK from "gbox-sdk";

const gboxSDK = new GboxSDK({
  apiKey: process.env["GBOX_API_KEY"] // This is the default and can be omitted
});

async function main() {
  const box = await gboxSDK.create({
    type: "android",
    config: {
      deviceType: "physical",
    },
  });

  // Enable Rewind functionality
  await box.action.recording.rewind.enable();

  // Perform some operations...
  await box.action.tap({ target: "chrome app" });
  await box.action.swipe({ direction: "up" });

  // Extract the last 10 seconds of recording
  const result = await box.action.recording.rewind.extract({
    duration: "10s"
  });

  console.log("Recording result:", result);
  console.log("Download URL:", result.presignedUrl);

  // Disable Rewind when it is no longer needed to reduce overhead
  // await box.action.recording.rewind.disable();
}

main();

Advanced Usage

// Extract different time periods
const shortClip = await box.action.recording.rewind.extract({
  duration: "5s"  // Last 5 seconds
});

const longClip = await box.action.recording.rewind.extract({
  duration: "2m"  // Last 2 minutes
});

const maxClip = await box.action.recording.rewind.extract({
  duration: "5m"  // Last 5 minutes (maximum)
});

📸 Save Screenshots to Album

Enhanced screenshot functionality now supports saving screenshots directly to the device’s media album, making it easier to organize and access captured images.

Key Features

  • Album Integration: Screenshots are automatically saved to the device’s media gallery
  • Easy Access: Screenshots can be accessed through the device’s photo app
  • Organized Storage: All screenshots are properly categorized in the media library

Usage

import GboxSDK from "gbox-sdk";

const gboxSDK = new GboxSDK({
  apiKey: process.env["GBOX_API_KEY"] // This is the default and can be omitted
});

async function main() {
  const box = await gboxSDK.create({ type: "android" });

  // Take a screenshot and save it to the device album
  await box.action.screenshot({
    saveToAlbum: true,
  });

  // Screenshot is now available in the device's photo gallery
}

main();
2025-08-29
These features require upgrading to TypeScript SDK version 30 or later.

🖼️ Enhanced Screenshot Options

New unified screenshot configuration with options.screenshot parameter for better control over screenshot behavior.

Basic Usage

// Simple boolean
await box.action.click({
  x: 100,
  y: 100,
  options: {
    screenshot: true,
  },
});

// Detailed configuration
await box.action.click({
  x: 100,
  y: 100,
  options: {
    screenshot: {
      outputFormat: "storageKey",
      presignedExpiresIn: "1h",
      delay: "1s",
      phases: ["before", "after"],
    },
  },
});

Key Features

📸 Screenshot Phases
  • before: Screenshot before the action
  • after: Screenshot after the action
  • trace: Screenshot with operation trace
  • Default captures all three phases
⏱️ Configurable Delay
  • delay: Wait time after action before taking final screenshot
  • Default: 500ms, Maximum: 30s
🔄 Output Format
  • base64: Direct image data (default)
  • storageKey: Storage key with presigned URL access
  • presignedExpiresIn: Custom expiration for storageKey URLs (default: 30m)
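
The defaults listed above can be written out as a small normalization step, which makes it clear what a bare true or a partial object expands to. This is only a sketch of the documented behavior: ScreenshotConfig and resolveScreenshot are hypothetical names, and the SDK may resolve these values differently on the server side.

```typescript
type ScreenshotPhase = "before" | "after" | "trace";

// Hypothetical shape mirroring the documented options.screenshot fields.
interface ScreenshotConfig {
  phases?: ScreenshotPhase[];
  delay?: string;
  outputFormat?: "base64" | "storageKey";
  presignedExpiresIn?: string;
}

// Expand boolean or partial settings into a full config; null means disabled.
function resolveScreenshot(
  option: boolean | ScreenshotConfig
): Required<ScreenshotConfig> | null {
  if (option === false) {
    return null;
  }
  const config: ScreenshotConfig = option === true ? {} : option;
  return {
    phases: config.phases ?? ["before", "after", "trace"], // default: all three
    delay: config.delay ?? "500ms", // documented default
    outputFormat: config.outputFormat ?? "base64", // documented default
    presignedExpiresIn: config.presignedExpiresIn ?? "30m", // documented default
  };
}
```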

Usage Examples

Capture specific phases:
await box.action.click({
  target: "login button",
  options: {
    screenshot: {
      phases: ["before", "after"],
    },
  },
});
Custom delay for UI state capture:
await box.action.click({
  target: "submit button",
  options: {
    screenshot: {
      delay: "2s",
      phases: ["after"],
    },
  },
});
Disable screenshots:
await box.action.click({
  target: "button",
  options: {
    screenshot: false,
  },
});
Use the new options.screenshot parameter instead of the old screenshot fields, which will be deprecated in a future version.
2025-08-25
These features require upgrading to TypeScript SDK version 29 or later.

🎯 Natural Language UI Actions

UI Actions now support natural language descriptions for targets and locations, making automation more intuitive and human-readable.
const box = await gboxSDK.create({ type: "android" });

// Tap on app icons or UI elements using natural language
await box.action.tap({
  target: "chrome app",
});

await box.action.click({
  target: "login button",
});

// Swipe with natural language location descriptions
await box.action.swipe({
  direction: "up",
  distance: 300,
  duration: "500ms",
  location: "screen bottom",
});

// Drag and drop using natural language
await box.action.drag({
  start: "Chrome App",
  end: "Trash",
});

// Long press with natural language targets
await box.action.longPress({
  target: "Chrome icon",
  duration: "1s",
});

📏 Action Scale Settings

Added support for scaling UI actions to adjust the size of screenshots and coordinate calculations without changing the actual screen resolution.

Key Features

  • Scale Range: 0.1 to 1.0 (10% to 100% of original size)
  • Screenshot Scaling: Output screenshots are scaled according to the setting
  • Coordinate Scaling: All action coordinates and distances are automatically scaled
  • No Resolution Change: The box’s actual screen resolution remains unchanged

Usage Examples

const box = await gboxSDK.create({ type: "android" });

// Get current action settings
const currentSettings = await box.action.getSettings();
console.log("Current settings:", currentSettings);

// Set scale to 50% for smaller screenshots and scaled coordinates
await box.action.updateSettings({
  scale: 0.5,
});

// With scale = 0.5, x and y are interpreted in the scaled coordinate system (see the table below)
await box.action.click({
  x: 100,
  y: 100,
});

// Reset all settings to default values
await box.action.resetSettings();

// Verify settings have been reset
const resetSettings = await box.action.getSettings();
console.log("Settings after reset:", resetSettings);

Scale Examples

Scale Value      Screenshot Size   Coordinate Example
1.0 (default)    Full size         Click({x: 100, y: 100})
0.5              50% size          Click({x: 50, y: 50}) equivalent
0.25             25% size          Click({x: 25, y: 25}) equivalent
Note: Scale affects both the output screenshot dimensions and the coordinate system for all UI actions, making it useful for optimizing performance and storage when full resolution isn’t needed.
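
The mapping in the table can be expressed as a pair of conversion helpers. This sketch assumes the table's reading, where a full-resolution point is multiplied by the scale to get the coordinate to send. Point, toScaledPoint, and toFullResPoint are hypothetical names, and the rounding is an assumption.

```typescript
interface Point {
  x: number;
  y: number;
}

// The documented scale range is 0.1 to 1.0.
function assertScale(scale: number): void {
  if (!(scale >= 0.1 && scale <= 1.0)) {
    throw new RangeError(`scale must be between 0.1 and 1.0, got ${scale}`);
  }
}

// Full-resolution point -> point in the scaled coordinate system.
function toScaledPoint(fullRes: Point, scale: number): Point {
  assertScale(scale);
  return { x: Math.round(fullRes.x * scale), y: Math.round(fullRes.y * scale) };
}

// Point read off a scaled screenshot -> full-resolution point.
function toFullResPoint(scaled: Point, scale: number): Point {
  assertScale(scale);
  return { x: Math.round(scaled.x / scale), y: Math.round(scaled.y / scale) };
}
```

With scale set to 0.5, toScaledPoint({ x: 100, y: 100 }, 0.5) gives the (50, 50) shown in the second table row.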

🌐 Proxy Configuration

Added proxy configuration support for Android boxes.
const box = await gboxSDK.create({ type: "android" });

// Set proxy
await box.proxy.set({
  host: "127.0.0.1",
  port: 9090,
});

// Get current proxy settings
console.info(await box.proxy.get());

// Clear proxy
await box.proxy.clear();
2025-08-15
These features require upgrading to TypeScript SDK version 27 or later.

🖱️ UI Actions

  • Added: tap and longPress actions for precise coordinate taps and long-press interactions.
  • Enhanced: swipe / scroll now support semantic distance values ("tiny" | "short" | "medium" | "long"), so you no longer need to provide pixel values.

New action: tap

const box = await gboxSDK.create({ type: "android" });

await box.action.tap({
  x: 100,
  y: 100,
});

New action: longPress

const box = await gboxSDK.create({ type: "android" });

await box.action.longPress({
  x: 100,
  y: 100,
});

Semantic distance (swipe/scroll)

const box = await gboxSDK.create({ type: "android" });

await box.action.swipe({
  direction: "up",
  distance: "long",
});

await box.action.scroll({
  direction: "up",
  distance: "long",
});
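
When the distance comes from user input or an LLM, it can be useful to validate it before passing it to swipe or scroll. The following is a sketch with hypothetical names (SemanticDistance, isSemanticDistance, parseDistance). Only the four string values are taken from the notes above.

```typescript
// The four semantic values listed in the release note.
const SEMANTIC_DISTANCES = ["tiny", "short", "medium", "long"] as const;
type SemanticDistance = (typeof SEMANTIC_DISTANCES)[number];
type Distance = number | SemanticDistance;

// Type guard: narrows unknown input to one of the semantic values.
function isSemanticDistance(value: unknown): value is SemanticDistance {
  return (
    typeof value === "string" &&
    (SEMANTIC_DISTANCES as readonly string[]).includes(value)
  );
}

// Accept either a semantic value or a positive pixel count given as text.
function parseDistance(input: string): Distance {
  if (isSemanticDistance(input)) {
    return input;
  }
  const pixels = Number(input);
  if (!Number.isFinite(pixels) || pixels <= 0) {
    throw new Error(`Invalid distance: ${input}`);
  }
  return pixels;
}

// Usage (assumes an existing box):
// await box.action.swipe({ direction: "up", distance: parseDistance("long") });
```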
2025-08-08
These features require upgrading to TypeScript SDK version 26 or later.

🎥 Android Screen Recording

Added screen recording capabilities to Android boxes for creating tutorials and debugging UI interactions.
const box = await gboxSDK.create({ type: "android" });

console.info("start recording");

await box.action.screenRecordingStart();

console.info("swipe up");

await box.action.swipe({
  direction: "up",
});

console.info("sleep 5 seconds...");

// you can do anything you want here
await new Promise((resolve) => setTimeout(resolve, 5000));

const result = await box.action.screenRecordingStop();

// you can download the video from the result
console.info(`recording result: ${JSON.stringify(result, null, 2)}`);

📱 Android Media Management

Added media file management capabilities for Android devices, including album listing and media operations.
const box = await gboxSDK.create({ type: "android" });

const albums = await box.media.listAlbums();

console.info(albums);
For more media API endpoints, see Media API Docs.

🪝 AI Action Progress Callbacks

Because AI Actions take time to execute, the SDK provides callbacks for monitoring progress throughout the operation.
const box = await gboxSDK.create({ type: "android" });

await box.action.ai("open the youtube app", {
  onActionStart: () => {
    console.info("action start");
  },
  onActionEnd: () => {
    console.info("action end");
  },
});
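
One simple use of these callbacks is to measure how long an action takes. The sketch below uses only the onActionStart and onActionEnd names from the example above. createActionTimer is a hypothetical helper, and the injectable clock exists only to make it testable.

```typescript
// Build a callbacks object that records the time between start and end.
function createActionTimer(now: () => number = Date.now) {
  let startedAt: number | undefined;
  let elapsedMs: number | undefined;

  return {
    callbacks: {
      onActionStart: (): void => {
        startedAt = now();
      },
      onActionEnd: (): void => {
        if (startedAt !== undefined) {
          elapsedMs = now() - startedAt;
        }
      },
    },
    // Undefined until both callbacks have fired.
    getElapsedMs: (): number | undefined => elapsedMs,
  };
}

// Usage (assumes an existing box):
// const timer = createActionTimer();
// await box.action.ai("open the youtube app", timer.callbacks);
// console.info("AI action took", timer.getElapsedMs(), "ms");
```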
2025-07-29
These features require upgrading to TypeScript SDK version 25 or later.

UI Action Support for StorageKey Output Format

UI Actions now support outputFormat: "storageKey", which lets GBOX.AI store the screenshot instead of returning the image data directly. This gives you more flexible storage and access options:
  • presigned URL: System-generated temporary access link, valid for 30 minutes by default
  • storageKey: A storage key that remains valid throughout the box lifecycle (until the box is deleted)
  • Custom presigned URL: Use createPresignedUrl to create presigned URLs with a specified expiration time, which is convenient for passing to LLMs for image analysis
const box = await gboxSDK.create({ type: "android" });

const res = await box.action.screenshot({
  outputFormat: "storageKey",
});

console.info(`presigned url: ${res.presignedUrl}`);

// create custom presigned url
const customPresignedUrl = await box.storage.createPresignedUrl({
  storageKey: res.uri,
  expiresIn: "1h",
});

console.info(`custom presigned url: ${customPresignedUrl}`);

Android Multiple APK Installation Support

Added support for installing multiple APK files using ZIP archives, enabling installation of split APKs and bundle applications.

Installation Modes

Single APK Installation (default):
  • Upload and install a single APK file
  • Traditional installation method for standalone applications
Multiple APK Installation (ZIP-based):
  • Upload a ZIP archive containing multiple APK files
  • Automatically extracts and installs all APK files in the correct order
  • Essential for modern Android apps that use App Bundle distribution

ZIP Archive Structure

When using multiple APK installation, organize your files as follows:
app-bundle.zip
└── app-folder/
    ├── base.apk           (base application)
    ├── config.arm64_v8a.apk  (architecture-specific)
    ├── config.en.apk      (language resources)
    └── config.xxxhdpi.apk (density-specific resources)
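
Before uploading, the archive listing can be checked against the layout above. This is a hypothetical pre-flight check, not part of the SDK: checkApkSetEntries is an illustrative name, and it assumes the installer expects a base.apk and only .apk files, as the tree suggests. Reading the entry names from the ZIP is left to whatever archive library you already use.

```typescript
interface ApkSetCheck {
  ok: boolean;
  problems: string[];
}

// Validate a list of ZIP entry paths such as "app-folder/base.apk".
function checkApkSetEntries(entries: string[]): ApkSetCheck {
  // Ignore directory entries such as "app-folder/".
  const files = entries.filter((entry) => !entry.endsWith("/"));
  const problems: string[] = [];
  const baseName = (path: string): string => path.slice(path.lastIndexOf("/") + 1);

  if (files.length === 0) {
    problems.push("archive contains no files");
  }
  for (const file of files) {
    if (!file.toLowerCase().endsWith(".apk")) {
      problems.push(`not an APK: ${file}`);
    }
  }
  // Assumption: a split APK set always includes base.apk, as in the tree above.
  if (!files.some((file) => baseName(file) === "base.apk")) {
    problems.push("missing base.apk");
  }
  return { ok: problems.length === 0, problems };
}
```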

Usage Examples

Single APK Installation:
const box = await gboxSDK.create({ type: "android" });

const app = await box.app.install({
  apk: "/path/to/single-app.apk",
});

await app.open();
Multiple APK Installation (ZIP):
const box = await gboxSDK.create({ type: "android" });

const app = await box.app.install({
  apk: "/path/to/app-bundle.zip",
});

await app.open();
2025-07-21
These features require upgrading to TypeScript SDK version 22 or later.

Android Box Command Support

Added direct command execution support for Android boxes, enabling full system-level control through shell commands.
const box = await gboxSDK.create({ type: "android" });

await box.command({
  commands: ["logcat"],
  onStdout: (data) => {
    console.info("---stdout---");
    console.info(data);
  },
  onStderr: (data) => {
    console.info("---stderr---");
    console.error(data);
  },
});

Linux Box Support for UI Action / AI Action

Added UI automation and AI-powered action capabilities to Linux boxes, enabling desktop application interaction and programmatic UI operations.
const box = await gboxSDK.create({ type: "linux" });

await box.action.click({
  x: 100,
  y: 100,
});

await box.action.ai("Open the Chrome browser");

Linux Box Command / Run Code Streaming Support

Added real-time streaming support for command execution and code running on Linux boxes, providing immediate feedback and live output monitoring.
const box = await gboxSDK.create({ type: "linux" });

const result = await box.command({
  commands: ["for i in {10..1}; do echo $i; echo $i >&2; sleep 1; done"],
  onStdout: (data) => {
    console.info("---stdout---");
    console.info(data);
  },
  onStderr: (data) => {
    console.info("---stderr---");
    console.error(data);
  },
});

console.info(result);

const result2 = await box.runCode({
  code: `
console.log("something xxxx");
console.error("something error")
`,
  language: "typescript",
  onStdout: (data) => {
    console.info("stdout");
    console.info(data);
  },
  onStderr: (data) => {
    console.info("stderr");
    console.error(data);
  },
});

console.info(result2);
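
Streamed output may arrive in chunks that do not line up with line boundaries. The sketch below buffers chunks and emits complete lines. It assumes the data argument passed to onStdout and onStderr is a string, which the examples above suggest but do not confirm. createLineBuffer is a hypothetical helper.

```typescript
// Accumulate streamed chunks and call onLine once per complete line.
function createLineBuffer(onLine: (line: string) => void) {
  let pending = "";

  return {
    // Feed one chunk; emits every line terminated by "\n" so far.
    push(chunk: string): void {
      pending += chunk;
      const lines = pending.split("\n");
      pending = lines.pop() ?? ""; // keep the unterminated tail for later
      for (const line of lines) {
        onLine(line);
      }
    },
    // Emit whatever is left once the stream has ended.
    flush(): void {
      if (pending.length > 0) {
        onLine(pending);
        pending = "";
      }
    },
  };
}

// Usage (assumes an existing Linux box):
// const out = createLineBuffer((line) => console.info("[stdout]", line));
// await box.command({ commands: ["ls -la"], onStdout: (data) => out.push(data) });
// out.flush();
```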

Browser Action Support

Added browser automation capabilities to Linux boxes, enabling programmatic control and interaction with web browsers and tabs.
const box = await gboxSDK.create({ type: "linux" });

const tabs = await box.browser.listTabInfo();

console.info(tabs);