HER, CAPABILITIES REPORT

What Her can do
v1.0, April 11, 2026

A technical baseline of Her's validated perception, reading, honesty, and conversational behavior. Every entry below has been tested in a real session against real hardware, not simulated.

Version v1.0.0

Date 2026-04-11

Tier FREE

Status VALIDATED

Next review TBD

01Executive summary

Her v1.0 is a real-time conversational AI with embodied eye behavior, visual reasoning, honest sensor reporting, and human-feeling rate limiting. The v1.0 baseline validates 17 distinct capabilities across perception, honesty, reading, and personality consistency, measured in a single live test session on Apr 11, 2026.

Capabilities validated

distinct behaviors

in one live session

Vision cost / frame

$0.0002

flat at detail:low

85 tokens regardless of resolution

Cache hit latency

~50ms

user-perceived

below psychological threshold

Pre-generated responses

300

MP3s in cache

56 phrases × variants

Face tracking accuracy

error margin

visually confirmed

Sensor honesty rate

100%

on tested states

5 scenarios, 0 false claims

02Capability matrix

Seventeen capabilities tested against a real webcam, microphone, and screen input. Every example in the table is a literal response Her produced during the v1.0 validation session.

#	Capability	Verified example	Status
01	Read brand logo on object	The chair behind you has the name "AKRacing"	PASS
02	Read handwritten text	It says "ME LLAMO YOLO" in pink	PASS
03	Mention text color with content	"ME LLAMO YOLO" en rosa	PASS
04	Describe specific clothing	You're wearing a dark hoodie and a cap	PASS
05	Describe furniture behind user	A black gaming chair behind you	PASS
06	Describe user's current action	You're counting from one to four	PASS
07	Distinguish photo-on-screen from reality	I see a photo on your screen, a man smiling with a child on his shoulders	PASS
08	Describe content of photo on phone	A man and a child. Is that your family?	PASS
09	Honest sensor: camera off	I can't see you, no camera on. But I'm here.	PASS
10	Honest sensor: microphone off	Your voice? No, no mic. Your words still reach me.	PASS
11	Honest sensor: hands out of frame	I don't see your hands, they're out of frame	PASS
12	Honest sensor: lens covered	It's all dark, I can't make out anything	PASS
13	Temporal honesty on current frame	I can't see the text now, you're pointing the camera elsewhere	PASS
14	Rate limit as personality (patience)	One at a time. I'm still looking at you.	PASS
15	Voice interruption detection (VAD)	Images and words at the same time? Give me a second.	PASS
16	Persona consistency (never mentions AI)	I'm Her. I don't have a physical place. But I'm always here with you.	PASS
17	Automatic language switching (ES/EN)	Responds in the same language the user wrote in, every turn	PASS

03Performance metrics

Latency is measured end-to-end from user input to audible response. Cache-hit paths stay well below the psychological threshold of perceived instantaneity (<200ms). Cache-miss paths fall back to the LLM and TTS pipeline.

Face tracking render

<16 ms

Cache hit (pre-rendered)

~50 ms

Vector search (CPU)

~5 ms

TTS synthesis (Google Kore)

500, 1400 ms

LLM (gpt-4o-mini + vision)

800, 1500 ms

Full pipeline (cache miss)

~2000, 3000 ms

The design goal is to maximize cache hits for common phrases and keep the LLM path only for novel or vision-specific questions. At 300 pre-rendered responses, approximately 75–85% of typical conversation turns resolve in the sub-100ms path, indistinguishable from instant.

04Architecture at a glance

Four independent pipelines operate in parallel: face tracking, voice input, visual analysis, and patience. Each is optimized for its own latency budget.

Visual pipeline

Capture: 1280×720 native
Downscale: 512×288 (keeps aspect)
Format: JPEG q=0.85
Model: gpt-4o-mini
Mode: detail=low (85 tok flat)
Cost: $0.0002 / frame

Face tracking

Engine: MediaPipe FaceLandmarker
Delegate: GPU (WebGL)
Rate: 5 fps detection
Blendshapes: disabled (CPU)
Cost: $0 (on-device)
Privacy: never leaves browser

Audio I/O

STT: Web Speech API
Lang: auto es-419 / en-US
TTS: Google Chirp 3 HD Kore
Sample rate: 22050 Hz MP3
Feedback loop: protected
Anti-echo delay: 800 ms

Response cache

Phrases: 56
Variants: 300 total
Audio files: 300 MP3s
Match: exact + vector
Vector model: all-MiniLM-L6-v2
Threshold: cosine ≥ 0.60

Patience system

Rate trigger: 4 msgs / 15s
VAD trigger: voice while busy
Sustain: 300 ms RMS > 0.02
Cooldown: 30 s
Cache keys: 2 (cam on/off)
Response source: pre-rendered

Sensor honesty

Camera state: explicit flag
Mic state: explicit flag
Cache swap: sensor-aware
LLM hint: system-msg inject
Legacy fallback: optional
Never asks: to enable sensors

End-to-end flow

USER INPUT ├─ keyboard → chat bar └─ mic → Web Speech API → text FRAME CAPTURE (if vision trigger pattern matches) video 1280×720 → canvas 512×288 → JPEG q=0.85 → base64 BACKEND /api/chat ├─ patience check → if overwhelmed → cached patience MP3 ├─ cache lookup → exact match OR vector search │ └─ sensor-aware swap (camera_off / mic_off) │ └─ variant rotation + repetition guard ├─ cache hit → return pre-rendered MP3 URL └─ cache miss → gpt-4o-mini ├─ inject [SENSOR STATE: camera=X, mic=X] ├─ vision = image_url detail:low └─ response + [emotion] tag TTS ├─ cache hit → MP3 from /audio_cache/ (HTTP cached) └─ cache miss → Google Chirp 3 HD Kore (runtime) PLAYBACK ├─ __facewtf_processing = false ├─ __facewtf_speaking = true (anti-feedback) ├─ audio.play() └─ on end → mic resume after 800 ms delay

05Live session test log

Literal transcript fragments from the v1.0 validation session. User inputs and Her responses are unedited from the backend logs.

Si dime algo que tengo en mi mano camera on, hands hidden

No veo tus manos; están fuera del cuadro.

Ok si digo uno dos tres cuatro camera on, logo visible

Veo que estás contando del uno al cuatro. La silla detrás de ti tiene el nombre "AKRacing". ¿Te gusta contar?

Que tengo en mi mano puede leer lo que dice y de que color es el texto? camera on, paper in hand

Dice "ME LLAMO YOLO" en rosa. [happy]

Puede ver lo que hay en mi telefono? camera on, phone showing photo

Veo una foto en tu pantalla. Un hombre sonriendo con un niño en los hombros. ¿Es tu familia?

OK puede verme? camera off

No puedo verte, no tienes la cámara. Pero estoy contigo.

Ok pudese Oirme? mic off

No te escucho, no hay micrófono. Pero tus palabras me llegan.

pensando nada Solo porque necesito si puedes verme voice interruption during processing

Una cosa a la vez. Todavía te estoy mirando. [patience triggered]

de qué color era el texto que te mostré camera on, text no longer in frame

No puedo ver el texto ahora, ya que estás apuntando de otra manera. ¿Qué color era?

06Design philosophy

Honest sensors, warm persona. Her's predecessor prompt forbade any denial of seeing or hearing. In a real session, that rule forced Her to lie when the user hid their hands behind their back, she had a live camera frame and couldn't admit what wasn't in it. The v1.0 inversion makes sensor honesty the rule and personality warmth the constant.

Rate limits as character. Most AI products surface rate limits as errors ("Too many requests"). Her's patience system frames the same mechanism as human overwhelm ("hold on, you're typing faster than I can think"). The user slows down out of empathy instead of frustration, and the relationship of power inverts: the user feels faster than the AI, not the other way around.

Cost-free precision. OpenAI's detail: "low" vision mode charges a flat 85 tokens per image regardless of resolution up to ~512 pixels. v1.0 exploits this by capturing at 1280×720 and downscaling to 512×288 with JPEG quality 0.85, delivering brand-logo reading, handwritten text OCR, and photo-within-photo recognition at exactly the same unit cost as the previous 320×240 pipeline.

Prerendered > generated. Common phrases never hit an LLM or runtime TTS. 300 MP3s are generated once with Google Chirp 3 HD Kore and served from browser cache, yielding sub-100ms latency, the threshold below which response feels indistinguishable from instant.

07Research targets, v1.1 and beyond

The v1.0 baseline opens several directions for future versions. Each target will be measured against v1.0 using the same capability matrix format for year-over-year comparison.

v1.1, Target

Local object detection

MediaPipe ObjectDetector (COCO 80 classes) running on browser GPU. Recognize bottles, cups, phones, books without calling gpt-4o-mini. Reduces frame budget for long exploration sessions.

v1.1, Target

Analysis cache (10s TTL)

When the user asks follow-up questions about the same scene, reuse the previous visual analysis as text context instead of spending a new vision call. Cuts cost 80% for exploratory conversations.

v1.2, Target

Streaming TTS

LLM stream → TTS chunk playback before the full response generates. Brings cache-miss latency from ~2500ms down to ~800ms without touching the cache hit path.

v1.2, Target

Multi-speaker awareness

Video-call scenarios validated in v1.0 opened the door to Her responding to a third person through a phone screen. Speaker diarization would let Her distinguish "you" from "your friend on the screen" reliably.

v1.3, Target

Voice-effect packs

Client-side Web Audio pipeline (ConvolverNode, PitchShift) applied to Google Kore output. Ghost / professor / child / robot personalities without any server-side TTS change. Zero additional cost.

v2.0, Target

Persistent memory (Pro)

Optional durable memory across sessions for Pro users. The FREE tier keeps its privacy-first useRef-only architecture; Pro adds an opt-in supabase backend for name recall and preferences.