Airshots: Screenshots That Understand Context
A picture is worth a thousand tokens. Here is how we added eyes to voice commands.
TL;DR
Learn how Airshots adds visual context to voice commands on Mac. Capture screenshots with your voice queries for better AI understanding of error messages, forms, and app interfaces.
Voice communication has a fundamental limitation: you can describe what you are looking at, but description is inherently slower and less precise than just showing it. When you want help with an error message on your screen, explaining "there is a dialog box with a red icon that says something about permissions and then a long path to a file" takes much longer than simply showing the error.
Airshots is our solution to this problem. It lets you combine voice commands with visual context by capturing what is on your screen and sending it along with your spoken question.
The Problem with Voice-Only Commands
Consider trying to get help with a software error using only voice. You might say: "I am getting an error in the settings panel of this app. It says something like unable to save preferences, and there is a number, maybe a code, I think it starts with E and then some digits."
Even this detailed description leaves out important context. What app are you using? What version? What were you trying to do when the error appeared? What other options are visible in the interface? The AI has to make guesses about all of this context, and those guesses may be wrong.
Now compare that to showing the AI a screenshot of the error. In one image, the AI can see the exact error message and code, the application name and interface, what you were trying to do, the available options and buttons, and any related information visible on screen. All of this context arrives instantly rather than through a lengthy verbal description.
How Airshots Works
Airshots is activated with a simple gesture: double-tap the Right Option key. When you double-tap, Air captures a screenshot of your currently active window and prepares to attach it to your next voice command.
You then speak your question while continuing to hold the key after the second tap. Voice capture works just as it normally does, but now your spoken question has visual context attached.
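To make the gesture concrete, here is a minimal sketch of how a double-tap on the Right Option key could be detected with an AppKit global event monitor. This is not Air's actual implementation: the class name, callbacks, and 0.35-second tap window are assumptions for illustration, and keyCode 61 corresponds to Right Option on standard Apple keyboards.

```swift
import AppKit

/// Minimal sketch of double-tap detection on the Right Option key (keyCode 61).
/// Names, callbacks, and the tap window are illustrative assumptions.
final class AirshotsGestureMonitor {
    private var lastPressTime: TimeInterval = 0
    private let doubleTapWindow: TimeInterval = 0.35
    private var monitor: Any?
    var onDoubleTap: (() -> Void)?   // second tap: capture the active window
    var onRelease: (() -> Void)?     // key released: end of hold-to-speak

    func start() {
        // Modifier keys surface as .flagsChanged events, not .keyDown/.keyUp.
        // Global keyboard monitors typically require the Accessibility permission.
        monitor = NSEvent.addGlobalMonitorForEvents(matching: .flagsChanged) { [weak self] event in
            guard let self, event.keyCode == 61 else { return }
            if event.modifierFlags.contains(.option) {
                // Key pressed: a double-tap is two presses within the time window.
                if event.timestamp - self.lastPressTime < self.doubleTapWindow {
                    self.onDoubleTap?()
                }
                self.lastPressTime = event.timestamp
            } else {
                // Key released: the user has finished holding and speaking.
                self.onRelease?()
            }
        }
    }
}
```

Because the second press and the subsequent hold are the same physical key-down, the double-tap flows directly into hold-to-speak without any extra gesture.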
Before anything is sent, you see a preview of both the screenshot and your transcribed question. You can confirm to proceed, cancel if the screenshot contains something sensitive, or retake if the screenshot missed what you intended.
The entire flow takes about two seconds from double-tap to confirmation. Compare this to the alternative of describing what you see in enough detail for the AI to understand, which might take thirty seconds or more of explanation.
Common Use Cases for Visual Context
Error messages are one of the most powerful use cases. When you encounter an error, you can double-tap and ask "what does this error mean?" or "how do I fix this?" The AI sees the exact error text, the application context, and any relevant details in the interface. It can provide specific guidance rather than generic troubleshooting steps.
Complex forms benefit from visual context. If you are filling out a form and are confused about what a field is asking for, you can show the AI the form and ask for clarification. The AI can see all the field labels, any helper text, and the context of surrounding fields.
Application help becomes more specific with screenshots. Instead of asking "how do I export in this app," you can show the AI which app you are in and where you are in the interface. The AI can reference specific menu items and buttons that it can see in the screenshot.
Document review is faster with visual context. If you want a summary of a document or help understanding a chart, showing it is faster than selecting and copying all the text. The AI can see formatting, layout, and visual elements that would be lost in plain text.
Web content analysis works well with Airshots. You can ask questions about a webpage you are viewing, and the AI can see the design, layout, and content of the page. This is particularly useful for comparing options, understanding complex interfaces, or getting explanations of unfamiliar content.
Privacy and Control with Screenshot Capture
We designed Airshots with privacy as a primary concern. The feature only activates when you explicitly trigger it with a double-tap. There is no passive screen monitoring, no automatic capture, and no background recording of what is on your screen.
Every capture is shown to you before it is sent anywhere. You see exactly what the AI will see. If the screenshot contains something sensitive like passwords, personal information, or confidential documents, you can cancel and the image is discarded.
The screenshot is processed for your immediate query and then deleted, following the same privacy principles as voice data. We do not retain screenshots for training, analysis, or any other purpose.
The screen recording permission required for Airshots is optional. You can use all other Air features without ever granting this permission. It is only requested the first time you try to use the Airshots feature.
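On macOS, this kind of lazy permission request can be expressed with the CoreGraphics preflight and request calls. The sketch below shows the general pattern; the function name is ours, not Air's.

```swift
import CoreGraphics

/// Sketch: request the Screen Recording permission only when a capture is
/// first attempted, rather than at app launch. (Illustrative, not Air's code.)
func ensureScreenCapturePermission() -> Bool {
    // True if the user already granted Screen Recording in a previous session.
    if CGPreflightScreenCaptureAccess() {
        return true
    }
    // Shows the system prompt the first time and returns the current status.
    return CGRequestScreenCaptureAccess()
}
```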
Technical Implementation of Screen Capture
Airshots uses macOS screen capture APIs to grab the contents of the frontmost window. We capture just the window, not the entire screen, to minimize the amount of information collected and reduce the chance of capturing unrelated content.
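As an illustration of window-scoped capture, the sketch below grabs only the frontmost application's main window using the classic CGWindowList calls. We are not describing Air's actual code here: the filtering logic is an assumption, and newer macOS releases deprecate these calls in favor of ScreenCaptureKit.

```swift
import AppKit
import CoreGraphics

/// Illustrative sketch: capture only the frontmost app's main window,
/// not the entire screen.
func captureFrontmostWindow() -> CGImage? {
    guard let frontApp = NSWorkspace.shared.frontmostApplication else { return nil }

    // List on-screen windows front to back and pick the first normal-level
    // (layer 0) window owned by the frontmost application.
    let options: CGWindowListOption = [.optionOnScreenOnly, .excludeDesktopElements]
    guard let windows = CGWindowListCopyWindowInfo(options, kCGNullWindowID)
            as? [[String: Any]] else { return nil }

    guard let front = windows.first(where: { info in
        (info[kCGWindowOwnerPID as String] as? Int) == Int(frontApp.processIdentifier)
            && (info[kCGWindowLayer as String] as? Int) == 0
    }), let number = front[kCGWindowNumber as String] as? Int else { return nil }

    // .boundsIgnoreFraming trims the window shadow so only the content is captured.
    return CGWindowListCreateImage(.null,
                                   .optionIncludingWindow,
                                   CGWindowID(number),
                                   [.boundsIgnoreFraming])
}
```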
The capture happens at screen resolution, which provides enough detail for the AI to read text and understand interface elements. Before transmission, we compress the image to reduce file size without sacrificing legibility.
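A minimal sketch of that last step, assuming a straightforward JPEG re-encode: the 0.7 compression factor is an illustrative value chosen so on-screen text stays readable, not Air's actual setting.

```swift
import AppKit

/// Illustrative sketch: re-encode a captured window image as JPEG before sending.
func encodeForUpload(_ image: CGImage, quality: Double = 0.7) -> Data? {
    let rep = NSBitmapImageRep(cgImage: image)
    return rep.representation(using: .jpeg,
                              properties: [.compressionFactor: quality])
}
```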
The integration with voice input is seamless. The double-tap gesture naturally transitions into the hold-to-speak gesture, so activating Airshots does not require learning a completely new interaction pattern. After the second tap, you simply keep holding and speak your question.
When to Use Airshots vs. Voice-Only Commands
Airshots adds value when what you are seeing is important context for your question. This includes error messages and dialogs, forms and interfaces you need help with, documents and content you want analyzed, and app interfaces where you need specific guidance.
Voice alone is sufficient when your question does not depend on visual context. Sending a message, checking your calendar, creating a note, setting a reminder, and general knowledge questions do not typically need screenshots.
Over time, you will develop an intuition for when visual context helps. The general rule is: if you find yourself about to describe what you are looking at, just show it instead.