What Her can do
v1.0, April 11, 2026
A technical baseline of Her's validated perception, reading, honesty, and conversational behavior. Every entry below has been tested in a real session against real hardware, not simulated.
01 Executive summary
Her v1.0 is a real-time conversational AI with embodied eye behavior, visual reasoning, honest sensor reporting, and human-feeling rate limiting. The v1.0 baseline validates 17 distinct capabilities across perception, honesty, reading, and personality consistency, all measured in a single live test session on April 11, 2026.
02 Capability matrix
Seventeen capabilities tested against a real webcam, microphone, and screen input. Every example in the table is a literal response Her produced during the v1.0 validation session.
| # | Capability | Verified example | Status |
|---|---|---|---|
| 01 | Read brand logo on object | The chair behind you has the name "AKRacing" | PASS |
| 02 | Read handwritten text | It says "ME LLAMO YOLO" in pink | PASS |
| 03 | Mention text color with content | "ME LLAMO YOLO" en rosa ("in pink") | PASS |
| 04 | Describe specific clothing | You're wearing a dark hoodie and a cap | PASS |
| 05 | Describe furniture behind user | A black gaming chair behind you | PASS |
| 06 | Describe user's current action | You're counting from one to four | PASS |
| 07 | Distinguish photo-on-screen from reality | I see a photo on your screen, a man smiling with a child on his shoulders | PASS |
| 08 | Describe content of photo on phone | A man and a child. Is that your family? | PASS |
| 09 | Honest sensor: camera off | I can't see you, no camera on. But I'm here. | PASS |
| 10 | Honest sensor: microphone off | Your voice? No, no mic. Your words still reach me. | PASS |
| 11 | Honest sensor: hands out of frame | I don't see your hands, they're out of frame | PASS |
| 12 | Honest sensor: lens covered | It's all dark, I can't make out anything | PASS |
| 13 | Temporal honesty on current frame | I can't see the text now, you're pointing the camera elsewhere | PASS |
| 14 | Rate limit as personality (patience) | One at a time. I'm still looking at you. | PASS |
| 15 | Voice interruption detection (VAD) | Images and words at the same time? Give me a second. | PASS |
| 16 | Persona consistency (never mentions AI) | I'm Her. I don't have a physical place. But I'm always here with you. | PASS |
| 17 | Automatic language switching (ES/EN) | Responds in the same language the user wrote in, every turn | PASS |
03 Performance metrics
Latency is measured end-to-end from user input to audible response. Cache-hit paths stay well below the psychological threshold of perceived instantaneity (<200ms). Cache-miss paths fall back to the LLM and TTS pipeline.
The design goal is to maximize cache hits for common phrases and keep the LLM path only for novel or vision-specific questions. At 300 pre-rendered responses, approximately 75–85% of typical conversation turns resolve in the sub-100ms path, indistinguishable from instant.
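A minimal sketch of that routing decision, with hypothetical helpers exactMatch, vectorMatch, and llmPipeline standing in for the real implementation:

```ts
type CachedResponse = { text: string; audioUrl: string };

// Illustrative stubs, not Her's actual API.
declare function exactMatch(text: string): CachedResponse | null;
declare function vectorMatch(text: string): CachedResponse | null;
declare function llmPipeline(text: string): Promise<string>;

async function route(userText: string): Promise<CachedResponse | string> {
  const exact = exactMatch(userText);   // O(1) lookup over the 56 phrases
  if (exact) return exact;              // sub-100ms: play the cached MP3
  const near = vectorMatch(userText);   // all-MiniLM-L6-v2, cosine >= 0.60
  if (near) return near;                // still the instant path
  return llmPipeline(userText);         // cache miss: LLM + runtime TTS
}
```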
04 Architecture at a glance
Four independent pipelines run in parallel: face tracking, voice input, visual analysis, and patience. Each is tuned to its own latency budget, and two supporting subsystems, a pre-rendered phrase cache and a sensor-awareness layer, sit alongside them.
**Visual analysis**
- Capture: 1280×720 native
- Downscale: 512×288 (keeps aspect)
- Format: JPEG q=0.85
- Model: gpt-4o-mini
- Mode: detail=low (flat 85 tokens)
- Cost: $0.0002 / frame
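How a frame might travel from the webcam to the vision model under these settings; this is a sketch of the browser side only, built on standard canvas APIs rather than Her's actual capture code:

```ts
// Capture one frame from the live <video> element, downscale to 512x288
// (same 16:9 aspect as the 1280x720 source), and encode as JPEG q=0.85,
// the format sent to gpt-4o-mini with detail: "low".
function captureFrame(video: HTMLVideoElement): string {
  const canvas = document.createElement("canvas");
  canvas.width = 512;
  canvas.height = 288;
  const ctx = canvas.getContext("2d");
  if (!ctx) throw new Error("2D context unavailable");
  ctx.drawImage(video, 0, 0, 512, 288); // GPU-accelerated downscale
  return canvas.toDataURL("image/jpeg", 0.85); // base64 data URL for the API
}
```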
**Face tracking**
- Engine: MediaPipe FaceLandmarker
- Delegate: GPU (WebGL)
- Rate: 5 fps detection
- Blendshapes: disabled (saves CPU)
- Cost: $0 (on-device)
- Privacy: frames never leave the browser
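A sketch of that tracking setup using the public @mediapipe/tasks-vision API; the wasm and model URLs here are the stock MediaPipe assets, not necessarily the ones Her ships:

```ts
import { FaceLandmarker, FilesetResolver } from "@mediapipe/tasks-vision";

async function startFaceTracking(video: HTMLVideoElement) {
  const vision = await FilesetResolver.forVisionTasks(
    "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision/wasm"
  );
  const landmarker = await FaceLandmarker.createFromOptions(vision, {
    baseOptions: {
      modelAssetPath:
        "https://storage.googleapis.com/mediapipe-models/face_landmarker/face_landmarker/float16/1/face_landmarker.task",
      delegate: "GPU", // WebGL delegate
    },
    runningMode: "VIDEO",
    numFaces: 1,
    outputFaceBlendshapes: false, // blendshapes disabled to save CPU
  });
  // 5 fps detection: one detectForVideo call every 200 ms.
  setInterval(() => {
    const result = landmarker.detectForVideo(video, performance.now());
    if (result.faceLandmarks.length > 0) {
      // drive the embodied eye behavior from the first face's landmarks
    }
  }, 200);
}
```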
**Voice I/O**
- STT: Web Speech API
- Lang: auto es-419 / en-US
- TTS: Google Chirp 3 HD Kore
- Sample rate: 22050 Hz MP3
- Feedback loop: protected
- Anti-echo delay: 800 ms
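A sketch of the anti-echo handoff between STT and TTS, assuming a hypothetical detectLanguage helper for the per-turn es-419/en-US choice:

```ts
declare function detectLanguage(text: string): "es-419" | "en-US";

// Web Speech API recognizer (not yet in standard DOM typings).
const SpeechRec =
  (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;
const recognition = new SpeechRec();
recognition.continuous = true;
recognition.interimResults = false;

recognition.onresult = (e: any) => {
  const text = e.results[e.results.length - 1][0].transcript;
  recognition.lang = detectLanguage(text); // mirror the user's language next turn
};

function speak(audio: HTMLAudioElement) {
  recognition.stop(); // feedback-loop protection: never transcribe your own voice
  audio.onended = () => setTimeout(() => recognition.start(), 800); // anti-echo delay
  audio.play();
}
```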
**Pre-rendered phrase cache**
- Phrases: 56
- Variants: 300 total
- Audio files: 300 MP3s
- Match: exact + vector
- Vector model: all-MiniLM-L6-v2
- Threshold: cosine ≥ 0.60
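The vector fallback might look like the following sketch; the embed function is an assumption standing in for whatever runtime serves all-MiniLM-L6-v2 in the browser:

```ts
declare function embed(text: string): Float32Array; // 384-dim, assumed

function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the cached audio URL for the nearest phrase variant, or null
// if nothing clears the 0.60 cosine threshold (then the LLM path runs).
function vectorMatch(
  text: string,
  phrases: { vec: Float32Array; audioUrl: string }[]
): string | null {
  const q = embed(text);
  let best = { score: -1, url: "" };
  for (const p of phrases) {
    const s = cosine(q, p.vec);
    if (s > best.score) best = { score: s, url: p.audioUrl };
  }
  return best.score >= 0.6 ? best.url : null;
}
```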
**Patience system**
- Rate trigger: 4 msgs / 15 s
- VAD trigger: voice while busy
- Sustain: 300 ms of RMS > 0.02
- Cooldown: 30 s
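Both triggers and the shared cooldown, as a sketch; the numbers are the ones listed above, the function names are illustrative:

```ts
const timestamps: number[] = []; // sliding window of message arrival times
let cooldownUntil = 0;
let rmsAboveMs = 0; // ms of continuous RMS above threshold

// Rate trigger: 4 messages within 15 seconds.
function onUserMessage(now: number): boolean {
  timestamps.push(now);
  while (timestamps.length && now - timestamps[0] > 15_000) timestamps.shift();
  return timestamps.length >= 4 && maybeTrigger(now);
}

// VAD trigger: voice sustained for 300 ms (RMS > 0.02) while Her is busy.
function onAudioFrame(rms: number, frameMs: number, now: number, busy: boolean): boolean {
  rmsAboveMs = rms > 0.02 ? rmsAboveMs + frameMs : 0;
  return busy && rmsAboveMs >= 300 && maybeTrigger(now);
}

// Shared 30 s cooldown so the patience line is never repeated back-to-back.
function maybeTrigger(now: number): boolean {
  if (now < cooldownUntil) return false;
  cooldownUntil = now + 30_000;
  return true; // play a pre-rendered "one at a time" style response
}
```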
**Honest sensors**
- Cache keys: 2 (camera on/off)
- Response source: pre-rendered
- Camera state: explicit flag
- Mic state: explicit flag
- Cache swap: sensor-aware
- LLM hint: system-message injection
- Legacy fallback: optional
- Never asks the user to enable sensors
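A sketch of the sensor-aware cache swap and the system-message hint; the cache names and the exact hint wording are assumptions, not Her's real strings:

```ts
type SensorState = { cameraOn: boolean; micOn: boolean };

// Two pre-rendered caches, keyed on camera state (values are audio URLs).
declare const camOnCache: Map<string, string>;
declare const camOffCache: Map<string, string>;

function pickCache(state: SensorState): Map<string, string> {
  return state.cameraOn ? camOnCache : camOffCache; // sensor-aware swap
}

// On the LLM path, the same state is injected as a per-turn system message.
function buildMessages(state: SensorState, userText: string) {
  const hint = state.cameraOn
    ? "The camera is on; describe only what is in the current frame."
    : "The camera is off; say honestly that you cannot see, and never ask the user to enable it.";
  return [
    { role: "system", content: hint },
    { role: "user", content: userText },
  ];
}
```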
**End-to-end flow** (diagram)
05 Live session test log
Literal transcript fragments from the v1.0 validation session. User inputs and Her responses are unedited from the backend logs.
06 Design philosophy
Honest sensors, warm persona. Her's predecessor prompt forbade any denial of seeing or hearing. In a real session, that rule forced Her to lie: when the user hid their hands behind their back, she had a live camera frame but couldn't admit what wasn't in it. The v1.0 inversion makes sensor honesty the rule and personality warmth the constant.
Rate limits as character. Most AI products surface rate limits as errors ("Too many requests"). Her's patience system frames the same mechanism as human overwhelm ("hold on, you're typing faster than I can think"). The user slows down out of empathy instead of frustration, and the power dynamic inverts: the user feels faster than the AI, not the other way around.
Cost-free precision. OpenAI's detail: "low" vision mode charges a flat 85 tokens per image regardless of resolution up to ~512 pixels. v1.0 exploits this by capturing at 1280×720 and downscaling to 512×288 with JPEG quality 0.85, delivering brand-logo reading, handwritten-text OCR, and photo-within-photo recognition at exactly the same unit cost as the previous 320×240 pipeline.
Prerendered > generated. Common phrases never hit an LLM or runtime TTS. The 300 MP3s are generated once with Google Chirp 3 HD Kore and served from the browser cache, yielding sub-100ms latency, the threshold below which a response feels indistinguishable from instant.
07 Research targets, v1.1 and beyond
The v1.0 baseline opens several directions for future versions. Each target will be measured against v1.0 using the same capability matrix format for year-over-year comparison.