Integral Review


OpenAI Operator review: Currently too limited but reasons to be hopeful

As OpenAI describes it, Operator is "an agent that can go to the web to perform tasks for you. Using its own browser, it can look at a webpage and interact with it by typing, clicking, and scrolling."

Here are my thoughts on its usefulness after having tried it.

Operator: ChatGPT in a VM

Before we jump into the results, it's important to understand how Operator works. Operator is ChatGPT with access to a VM that runs a browser. This means the agent has its own virtual computer on which it can do its work. This isolation has an advantage: a mistake by the agent can't mess up your computer. But it's also limiting, as the VM isn't logged in to anything and can't use the software on your machine. For now, the process to log in to websites in Operator's VM is manual and clunky (I couldn't even paste passwords, which is a problem if your password manager generates them for you). I have no doubt this process will quickly improve.

Reviewing Operator as it is Today

So... is Operator a good product? My short answer is: not yet. The results are too unpredictable and too low in quality. It very much is a research preview. Using Operator, you can see the potential and get a glimpse of how powerful such an agent can be, but its reasoning and navigation skills are too weak to provide meaningful value to users.

When Operator Almost Shines

Let's start with an example where Operator almost provided a lot of value: I was interested in attending a classical concert with my wife in Berlin this summer. It's usually a bit of a pain to search for each venue's website, set the dates, and then check which concerts are sold out. So I asked Operator to do this work upfront for me with the following prompt:



After 15 minutes of waiting, here's the response I got back:

And here's the screen recording of Operator's work in the browser:



It's interesting to see how Operator struggles with navigating pages. Because of my request, it had to work with a lot of date pickers, which are often tricky even for humans. Watching Operator navigate is like watching someone who just discovered the internet. I don't, however, see this slowness as a problem. Operator should be used like an assistant: you give it a task and know that you'll eventually hear back. For my request, I would have been fine waiting an hour if it meant getting a better response.

However, names and dates alone weren't enough for me to find the concerts. So I tried again and asked for the URLs. After 33 minutes of work, I got 3 concerts that weren't at all what I was looking for.

An intelligence that's currently too limited

I was curious to try out Operator for product research. To get started, I tried to get pricing and feature information on a list of tools, specifically competitors to the Anki software. I tried multiple times and even had GPT-4 write a detailed prompt to help.


The results were abysmal. After 10 minutes of work, here's the best output I got from Operator:


This is unusable. It's not detailed, contains incorrect information, and some parts don't even make sense. Part of the problem is that Operator didn't even visit each product's website. And how did it compile this list of 10 products (which is actually only 9)? It searched for a list of alternatives and used the first result Bing returned.

This is the lowest-effort, lowest-quality approach I could imagine. It really shows how limited the model's reasoning is.

Looking to the future

Operator very much is a research preview. OpenAI is clear about it. For now, I don't see how it could be used for actual tasks.

It has 3 main problems:

  1. Browser navigation: Navigating feels obvious to us, but the web is actually quite messy. Interfaces aren't always clear, where we can or can't click depends on each website, etc. Operator struggles a lot with this, to the point of failing tasks simply because it doesn't understand how to deal with some forms or navigation.
  2. The VM isn't your computer: I was hoping to have Operator connect to my Google Workspace Gmail account to automatically fetch receipts. I couldn't manage to sign in because of how difficult it is to interact with the VM.
  3. Reasoning: Supposedly, Operator uses GPT-4o under the hood. I struggle to believe it, as it feels like a huge step back. This is important because tasks on the web are more than a simple Google (or in this case Bing) search. If you need to look for alternatives to a certain product, you understand that some websites are going to offer better insights than others. You also know the importance of checking multiple results. Operator doesn't know any of this, which makes its output very low quality.

Thankfully, all of these problems can be vastly improved within a reasonable timeframe, perhaps even quickly. Dealing with browsers is a challenge but not a fundamental obstacle, and reasoning is already vastly better in other models.

These improvements will happen, and I can already see many use cases for a working Operator. I very much look forward to it.

#ai #productivity #review
