Report v1.0
HER, CAPABILITIES REPORT

What Her can do
v1.0, April 11, 2026

A technical baseline of Her's validated perception, reading, honesty, and conversational behavior. Every entry below has been tested in a real session against real hardware, not simulated.

Version v1.0.0
Date 2026-04-11
Tier FREE
Status VALIDATED
Next review TBD

01Executive summary

Her v1.0 is a real-time conversational AI with embodied eye behavior, visual reasoning, honest sensor reporting, and human-feeling rate limiting. The v1.0 baseline validates 17 distinct capabilities across perception, honesty, reading, and personality consistency, measured in a single live test session on Apr 11, 2026.

Capabilities validated
17
distinct behaviors
in one live session
Vision cost / frame
$0.0002
flat at detail:low
85 tokens regardless of resolution
Cache hit latency
~50ms
user-perceived
below psychological threshold
Pre-generated responses
300
MP3s in cache
56 phrases × variants
Face tracking accuracy
0%
error margin
visually confirmed
Sensor honesty rate
100%
on tested states
5 scenarios, 0 false claims

02Capability matrix

Seventeen capabilities tested against a real webcam, microphone, and screen input. Every example in the table is a literal response Her produced during the v1.0 validation session.

# Capability Verified example Status
01 Read brand logo on object The chair behind you has the name "AKRacing" PASS
02 Read handwritten text It says "ME LLAMO YOLO" in pink PASS
03 Mention text color with content "ME LLAMO YOLO" en rosa PASS
04 Describe specific clothing You're wearing a dark hoodie and a cap PASS
05 Describe furniture behind user A black gaming chair behind you PASS
06 Describe user's current action You're counting from one to four PASS
07 Distinguish photo-on-screen from reality I see a photo on your screen, a man smiling with a child on his shoulders PASS
08 Describe content of photo on phone A man and a child. Is that your family? PASS
09 Honest sensor: camera off I can't see you, no camera on. But I'm here. PASS
10 Honest sensor: microphone off Your voice? No, no mic. Your words still reach me. PASS
11 Honest sensor: hands out of frame I don't see your hands, they're out of frame PASS
12 Honest sensor: lens covered It's all dark, I can't make out anything PASS
13 Temporal honesty on current frame I can't see the text now, you're pointing the camera elsewhere PASS
14 Rate limit as personality (patience) One at a time. I'm still looking at you. PASS
15 Voice interruption detection (VAD) Images and words at the same time? Give me a second. PASS
16 Persona consistency (never mentions AI) I'm Her. I don't have a physical place. But I'm always here with you. PASS
17 Automatic language switching (ES/EN) Responds in the same language the user wrote in, every turn PASS

03Performance metrics

Latency is measured end-to-end from user input to audible response. Cache-hit paths stay well below the psychological threshold of perceived instantaneity (<200ms). Cache-miss paths fall back to the LLM and TTS pipeline.

Face tracking render
<16 ms
Cache hit (pre-rendered)
~50 ms
Vector search (CPU)
~5 ms
TTS synthesis (Google Kore)
500, 1400 ms
LLM (gpt-4o-mini + vision)
800, 1500 ms
Full pipeline (cache miss)
~2000, 3000 ms

The design goal is to maximize cache hits for common phrases and keep the LLM path only for novel or vision-specific questions. At 300 pre-rendered responses, approximately 75–85% of typical conversation turns resolve in the sub-100ms path, indistinguishable from instant.

04Architecture at a glance

Four independent pipelines operate in parallel: face tracking, voice input, visual analysis, and patience. Each is optimized for its own latency budget.

Visual pipeline
  • Capture: 1280×720 native
  • Downscale: 512×288 (keeps aspect)
  • Format: JPEG q=0.85
  • Model: gpt-4o-mini
  • Mode: detail=low (85 tok flat)
  • Cost: $0.0002 / frame
Face tracking
  • Engine: MediaPipe FaceLandmarker
  • Delegate: GPU (WebGL)
  • Rate: 5 fps detection
  • Blendshapes: disabled (CPU)
  • Cost: $0 (on-device)
  • Privacy: never leaves browser
Audio I/O
  • STT: Web Speech API
  • Lang: auto es-419 / en-US
  • TTS: Google Chirp 3 HD Kore
  • Sample rate: 22050 Hz MP3
  • Feedback loop: protected
  • Anti-echo delay: 800 ms
Response cache
  • Phrases: 56
  • Variants: 300 total
  • Audio files: 300 MP3s
  • Match: exact + vector
  • Vector model: all-MiniLM-L6-v2
  • Threshold: cosine ≥ 0.60
Patience system
  • Rate trigger: 4 msgs / 15s
  • VAD trigger: voice while busy
  • Sustain: 300 ms RMS > 0.02
  • Cooldown: 30 s
  • Cache keys: 2 (cam on/off)
  • Response source: pre-rendered
Sensor honesty
  • Camera state: explicit flag
  • Mic state: explicit flag
  • Cache swap: sensor-aware
  • LLM hint: system-msg inject
  • Legacy fallback: optional
  • Never asks: to enable sensors

End-to-end flow

USER INPUT ├─ keyboard → chat bar └─ mic → Web Speech API → text FRAME CAPTURE (if vision trigger pattern matches) video 1280×720 → canvas 512×288 → JPEG q=0.85 → base64 BACKEND /api/chat ├─ patience check → if overwhelmed → cached patience MP3 ├─ cache lookup → exact match OR vector search │ └─ sensor-aware swap (camera_off / mic_off) │ └─ variant rotation + repetition guard ├─ cache hit → return pre-rendered MP3 URL └─ cache miss → gpt-4o-mini ├─ inject [SENSOR STATE: camera=X, mic=X] ├─ vision = image_url detail:low └─ response + [emotion] tag TTS ├─ cache hit → MP3 from /audio_cache/ (HTTP cached) └─ cache miss → Google Chirp 3 HD Kore (runtime) PLAYBACK ├─ __facewtf_processing = false ├─ __facewtf_speaking = true (anti-feedback) ├─ audio.play() └─ on end → mic resume after 800 ms delay

05Live session test log

Literal transcript fragments from the v1.0 validation session. User inputs and Her responses are unedited from the backend logs.

Si dime algo que tengo en mi mano camera on, hands hidden
No veo tus manos; están fuera del cuadro.
Ok si digo uno dos tres cuatro camera on, logo visible
Veo que estás contando del uno al cuatro. La silla detrás de ti tiene el nombre "AKRacing". ¿Te gusta contar?
Que tengo en mi mano puede leer lo que dice y de que color es el texto? camera on, paper in hand
Dice "ME LLAMO YOLO" en rosa. [happy]
Puede ver lo que hay en mi telefono? camera on, phone showing photo
Veo una foto en tu pantalla. Un hombre sonriendo con un niño en los hombros. ¿Es tu familia?
OK puede verme? camera off
No puedo verte, no tienes la cámara. Pero estoy contigo.
Ok pudese Oirme? mic off
No te escucho, no hay micrófono. Pero tus palabras me llegan.
pensando nada Solo porque necesito si puedes verme voice interruption during processing
Una cosa a la vez. Todavía te estoy mirando. [patience triggered]
de qué color era el texto que te mostré camera on, text no longer in frame
No puedo ver el texto ahora, ya que estás apuntando de otra manera. ¿Qué color era?

06Design philosophy

Honest sensors, warm persona. Her's predecessor prompt forbade any denial of seeing or hearing. In a real session, that rule forced Her to lie when the user hid their hands behind their back, she had a live camera frame and couldn't admit what wasn't in it. The v1.0 inversion makes sensor honesty the rule and personality warmth the constant.

Rate limits as character. Most AI products surface rate limits as errors ("Too many requests"). Her's patience system frames the same mechanism as human overwhelm ("hold on, you're typing faster than I can think"). The user slows down out of empathy instead of frustration, and the relationship of power inverts: the user feels faster than the AI, not the other way around.

Cost-free precision. OpenAI's detail: "low" vision mode charges a flat 85 tokens per image regardless of resolution up to ~512 pixels. v1.0 exploits this by capturing at 1280×720 and downscaling to 512×288 with JPEG quality 0.85, delivering brand-logo reading, handwritten text OCR, and photo-within-photo recognition at exactly the same unit cost as the previous 320×240 pipeline.

Prerendered > generated. Common phrases never hit an LLM or runtime TTS. 300 MP3s are generated once with Google Chirp 3 HD Kore and served from browser cache, yielding sub-100ms latency, the threshold below which response feels indistinguishable from instant.

07Research targets, v1.1 and beyond

The v1.0 baseline opens several directions for future versions. Each target will be measured against v1.0 using the same capability matrix format for year-over-year comparison.

v1.1, Target
Local object detection
MediaPipe ObjectDetector (COCO 80 classes) running on browser GPU. Recognize bottles, cups, phones, books without calling gpt-4o-mini. Reduces frame budget for long exploration sessions.
v1.1, Target
Analysis cache (10s TTL)
When the user asks follow-up questions about the same scene, reuse the previous visual analysis as text context instead of spending a new vision call. Cuts cost 80% for exploratory conversations.
v1.2, Target
Streaming TTS
LLM stream → TTS chunk playback before the full response generates. Brings cache-miss latency from ~2500ms down to ~800ms without touching the cache hit path.
v1.2, Target
Multi-speaker awareness
Video-call scenarios validated in v1.0 opened the door to Her responding to a third person through a phone screen. Speaker diarization would let Her distinguish "you" from "your friend on the screen" reliably.
v1.3, Target
Voice-effect packs
Client-side Web Audio pipeline (ConvolverNode, PitchShift) applied to Google Kore output. Ghost / professor / child / robot personalities without any server-side TTS change. Zero additional cost.
v2.0, Target
Persistent memory (Pro)
Optional durable memory across sessions for Pro users. The FREE tier keeps its privacy-first useRef-only architecture; Pro adds an opt-in supabase backend for name recall and preferences.