OS-Copilot Autonomous Computer Use Agent: Introduction & Code Insights

Generated by Google’s Imagen 3 via AI Studio. A silhouette of a person stands before a giant, futuristic robot with glowing red eyes. The robot is surrounded by digital screens displaying various data and interfaces, set in a high-tech environment.

Introduction

Imagine a tool that lets your computer be controlled automatically, capable of automating complex tasks and self-improving over time. In this post I cover OS-Copilot, a multi-agent framework that is pretty impressive. It provides a modular and flexible architecture for building generalist agents that can interface with various operating system elements, including the web, code terminals, files, multimedia, and third-party applications, enabling the creation of powerful digital assistants. I also compare it with Anthropic's Computer Use.

Let’s begin!

Overall Architecture

The OS-Copilot framework consists of 3 components:

  1. Planner: Decomposes user requests into simpler subtasks, retrieving relevant information about agent capabilities and operating system information.

  2. Configurator: Configures subtasks for the actor, inspired by the human brain's memory structure (working, declarative, and procedural memory).

  3. Actor: Consists of 2 stages:

    • Execution: Proposes and executes actions (e.g., python code, bash commands) on your system.

    • Self-directed learning: The agent critiques its own output, uses that feedback to repair execution errors, and updates long-term memory so the fix can be retrieved next time (hence "learning"). Through trial and error it acquires new skills, accumulating valuable tools and knowledge and demonstrating the effectiveness of self-directed learning for a general-purpose OS-level agent.

They introduced 3 Agents:

  • FridayAgent: A multi-agent framework with multimodal support that can self-learn and adopt any tool on your system.

  • FridayVision: A lightweight agent that can only open a browser and perform UI activities on your system, similar to Anthropic's Computer Use.

  • LightFriday: A lightweight agent that executes until the task is completed but doesn't incorporate self-directed learning.


FridayAgent

Code Insights

  1. Inner Monologue
  • Captures and stores intermediate representations during agent execution, such as: reasoning, error_type, critique, isRePlan, isTaskCompleted, result

  2. 5 types of subtasks
  • Python, Shell, AppleScript, API, QA / information gathering

  3. Tool execution environments
  • Python (Jupyter), Shell, AppleScript

  4. Tool execution state variables
  • command, error, ls, pwd, result

How does it work?

Architecture

Every agent in the framework has a slightly different architecture while comprising the three components mentioned above.

Flow diagram of how FridayAgent performs for any given task

Why use it?

  • Program with Natural Language Query: Executes tasks automatically, like running a command in your terminal, but phrased in natural language.

  • Fully Autonomous: Autonomously finds a solution, executes actions, and integrates with various data sources or tools to deliver the desired result.

  • Tool and Environment Flexibility: Adapts to new tools and environments, creating and integrating them as needed.

  • Continuous Learning: Evolves through trial and error, storing past experiences as few-shot examples to improve future performance.
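The "continuous learning" point above amounts to storing successful runs and retrieving the most similar ones as few-shot examples for a new task. Here is a toy sketch of that idea; the word-overlap similarity and all names are my own stand-ins, not OS-Copilot's retrieval mechanism.

```python
# Sketch of storing past experiences and retrieving them as few-shot examples.
# Word overlap stands in for real embedding similarity; names are illustrative.

def similarity(a: str, b: str) -> float:
    """Crude word-overlap score (Jaccard) standing in for embedding similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

class ExperienceStore:
    def __init__(self) -> None:
        self.examples: list[tuple[str, str]] = []  # (task, working solution)

    def add(self, task: str, solution: str) -> None:
        self.examples.append((task, solution))

    def few_shot(self, task: str, k: int = 2) -> list[tuple[str, str]]:
        """Return the k stored examples most similar to the new task."""
        ranked = sorted(self.examples, key=lambda e: similarity(task, e[0]),
                        reverse=True)
        return ranked[:k]

store = ExperienceStore()
store.add("create a folder named reports", "os.makedirs('reports')")
store.add("toggle night mode in vscode", "edit settings.json workbench theme")
shots = store.few_shot("create a folder for invoices", k=1)
print(shots[0][1])  # prints: os.makedirs('reports')
```

The retrieved examples are then prepended to the prompt, which is why the agent gets faster on task families it has already solved.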

When to use it?

  • Automate your daily system usage, workflows, and boring repetitive tasks.

  • Build advanced automation that incorporates your data and tools native to your system.

  • Give tasks in natural language rather than following strict, bounded syntax.

  • Delegate research across the web and various tools and have a report presented back to you, saving time.

How to integrate your custom tools?

  • Add Code Tool: Add your tool as Python, bash, or AppleScript code; find the guide here.

  • Integrate API Tool: Integrate existing APIs, or add OpenAPI specs for your API tool.
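Conceptually, adding a code tool means registering a named snippet of one of the supported types so the agent can retrieve and run it later. The sketch below is a hypothetical illustration of that shape; `ToolRepo` and `add_tool` are my assumptions, and the official guide defines the real registration format.

```python
# Hypothetical sketch of a tool repository for custom code tools.
# Class and method names are assumptions, not OS-Copilot's actual API.
import json

class ToolRepo:
    def __init__(self) -> None:
        self.tools: dict[str, dict] = {}

    def add_tool(self, name: str, kind: str, code: str, description: str) -> None:
        # Only the three supported code-tool types from the post.
        assert kind in {"python", "bash", "applescript"}
        self.tools[name] = {"type": kind, "code": code,
                            "description": description}

    def export(self) -> str:
        """Serialize so the agent can persist and later retrieve tools."""
        return json.dumps(self.tools, indent=2)

repo = ToolRepo()
repo.add_tool(
    name="count_files",
    kind="bash",
    code="ls -1 | wc -l",
    description="Count files in the current directory.",
)
print(repo.export())
```

The description field matters in practice: it is what retrieval matches against when the configurator looks for a tool relevant to a subtask.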


Comparison With Competitors

| | OS-Copilot | Anthropic Computer Use |
|---|---|---|
| Vendor | OS-Copilot (Chinese) | Anthropic (US) |
| OS | Linux only | Linux only |
| LLM support | OpenAI, Ollama; easy to write your own API adapter for other vendors | Anthropic, Bedrock, Vertex; easy to write your own API adapter for other vendors |
| Multi-modal support | Yes | Yes |
| Cursor navigation mechanism | OCR + LLM | LLM |
| Planning | Breaks tasks into subtasks | Predicts next action |
| Computer use via | PyAutoGUI library | bash commands |
| Self-improving | Yes | No |
| Human in loop | No | Yes |

How well does OS-Copilot work?

  • When evaluated on the GAIA benchmark, it outperformed other baseline agents, with a pass rate of 64.6% on the private test set. Through self-directed learning, it could master these tasks with a pass rate of 83.3% on the test set.

  • I tried many queries, but it failed several times due to bugs in the latest code; screenshots failed because I set it up on a VM and the screenshot library didn't work there.

  • Passed: creating an Excel sheet and inserting data; creating a folder and writing code.

  • Failed: toggling night mode in VS Code or the OS. While creating a React app, it kept repeating and repairing its solution and finally threw an error when the LLM's context length was exceeded. LOL!

Limitations

  • Not fully tool- or environment-agnostic: It installs dependencies via commands and relies on installed libraries and apps that are compatible with your Python environment and OS.

  • No human in the loop: It may hallucinate or take wrong or unexpected turns without human oversight, may not always adapt to your preferences, and could lead to unintended risks.

  • Latency: Task execution time depends on the LLM's response time, so tasks can be slow to complete.

  • Visual accuracy issues: During computer use, FridayVision may misinterpret objects and coordinates on your display.

  • Security, responsibility, and vulnerability: Run OS-Copilot in trusted environments, such as VMs or containers, and limit access to sensitive data to minimize security risks.

Conclusion

Note: I tried only the non-vision part of the agent because I hosted it on a remote instance. I'll fix this and cover it in the next blog.

  • OS-Copilot is inspired by OpenInterpreter but adds one key feature: it stores tools, iteratively improves them when they fail, and retrieves them next time.

  • All these frameworks failed many times beyond what was demoed, but they are well engineered and can evolve into more robust general agents. Their code implementations are good and suggest best practices for modularizing your code to build such multi-agent frameworks.

So, OS-Copilot worked faster than Anthropic Computer Use for me, since tools were already stored, and it worked similarly for many use cases. I'll drop a comparison with OpenInterpreter as well, so stay tuned!


Thank you for taking the time to read my blog! I hope you found the information about OS-Copilot insightful and helpful. I am excited to bring the vision part to life and unleash the full potential of this amazing tool. Your support and feedback mean the world to me, so feel free to share your thoughts and experiences.

Until next time, happy automating!