This paper was not planned. It emerged from a single question asked during the development of a diagnostic assessment tool: can every person access this?
The answer was no. And when we investigated why, we discovered that the highest accessibility standard in the world — WCAG 2.2 Level AAA — would not have caught the problem.
The Standard Everyone Aspires To
The Web Content Accessibility Guidelines, maintained by the World Wide Web Consortium, are the global benchmark for digital accessibility. They operate across three conformance levels: A (minimum), AA (recommended), and AAA (highest). Most legislation and regulation, including rules under the Americans with Disabilities Act, the European Accessibility Act, and Section 508, reference Level AA as the compliance standard.
Level AAA goes further. It demands 7:1 contrast ratios, sign language interpretation for video, extended audio descriptions, and content that does not require reading ability above lower secondary education level. The W3C itself acknowledges that Level AAA is not achievable for all content.
Most organisations do not attempt it. Those that do are considered exemplary.
We are arguing that it is not enough.
The Assumption at the Heart of Every Standard
WCAG is built on an implicit assumption: the user possesses at least one reliable input modality and at least one reliable output modality. They can see, or they can hear. They can type, or they can speak. They can read, or they can listen. The guidelines ensure that for each barrier, an alternative pathway exists.
This is good design. It has transformed digital accessibility for millions of people. But it contains a structural gap.
What happens when the user cannot see the screen, cannot read text, cannot type, cannot use a mouse, and cannot speak clearly?
WCAG has no answer to this question. Not at Level A. Not at Level AA. Not at Level AAA. The guidelines assume that at least one conventional input and output channel is available. For users where this assumption fails — those with compound disabilities, severe motor impairments, non-verbal communication needs, or combinations of visual, cognitive, and physical limitations — the highest accessibility standard in the world still leaves them outside the door.
How We Found the Gap
NeuroSync Technologies builds tools for organisational transformation. Our flagship diagnostic, the iXform, is a 24-question structural assessment that evaluates whether an organisation is safe to undergo change. It is a commercial product. It generates revenue. It feeds a consulting pipeline.
The company’s founding philosophy is accessibility and the removal of friction. Every product we build asks the same question: what barrier is stopping someone from doing the thing they need to do? Then it removes that barrier.
When we applied this question to our own website, the answer was uncomfortable.
Our site was entirely text-dependent. Every page required the ability to read English. The assessment required the ability to type. Navigation required the ability to see. The contact form required the ability to write. We had built a company on the principle of removing barriers — and our own front door was full of them.
The first fix was obvious
We integrated an audio layer across every page. A neural text-to-speech system using a natural British English voice reads the content of every page by default. Not as an option buried in a settings menu. Not as a widget in the corner. As the default experience. The audio player is the first element a visitor encounters on every page, because the philosophy is not that audio is available — it is that audio is expected.
The second fix was harder
We redesigned our diagnostic assessment to be completable entirely by voice. The system speaks each question aloud, presents the response options verbally, and accepts spoken answers. No reading required. No typing required. No mouse required.
The third fix changed everything
Voice recognition fails for people with speech impairments, heavy accents, non-standard speech patterns, or non-verbal communication. The Web Speech API — the browser-native speech recognition system — requires clear, recognisable words. If you cannot produce those words, voice recognition is not an accessibility solution. It is another barrier wearing different clothes.
So we went further. Our system accepts any repeatable sound as input.
One sound for option one. Two sounds for option two. Three sounds for option three. A grunt, a click, a clap, a tap. The system does not need to understand language. It counts audio events.
When the input is unclear, the system degrades gracefully. It presents each option sequentially: “Is your answer one?” Any sound confirms. Silence declines. The interaction has been reduced to its absolute minimum: a single noise means yes.
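The sound-counting logic described above can be sketched as a small, self-contained function. This is an illustrative sketch, not the shipped code: in the browser the volume samples would come from the Web Audio API (an `AnalyserNode` reading the microphone), but the counting itself needs no browser features, so it is isolated here. The threshold and window values are assumptions chosen for the example.

```javascript
// Sketch: count discrete sound events from a stream of volume samples.
// A "sound" is a rising edge above the threshold; the count within the
// timing window selects the option (two sounds → option two).
function createSoundCounter({ threshold = 0.15, windowMs = 2000 } = {}) {
  let count = 0;
  let inSound = false;
  let firstSoundAt = null;

  return {
    // Feed one volume sample (0..1) with its timestamp in milliseconds.
    sample(volume, nowMs) {
      if (volume >= threshold && !inSound) {
        inSound = true;            // rising edge: a new discrete sound
        count += 1;
        if (firstSoundAt === null) firstSoundAt = nowMs;
      } else if (volume < threshold) {
        inSound = false;           // falling edge: the sound has ended
      }
    },
    // After the timing window closes, the count is the answer.
    result(nowMs) {
      if (firstSoundAt === null) return null;            // silence so far
      if (nowMs - firstSoundAt < windowMs) return null;  // still counting
      return count;
    },
  };
}
```

Because the counter only sees amplitude, a grunt, a clap, or a switch click are all equivalent inputs, which is exactly the point: no language understanding is required.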
This means a person who cannot see, cannot read, cannot write, cannot type, cannot use a mouse, and cannot speak clearly can still complete a 24-question diagnostic assessment using nothing but a repeatable sound.
No existing accessibility standard requires this. No existing accessibility standard even describes it.
What Exists Today
We reviewed the accessibility landscape extensively. The tools, the standards, the research. Here is what we found:
| Capability | WCAG AAA | Current Tools | VANTAGE Platform | Website Deployment |
|---|---|---|---|---|
| Screen reader compatibility | Required | Widely available | Implemented | Implemented |
| Keyboard navigation | Required | Widely available | Implemented | Implemented |
| Audio content by default | Not required | Rare (opt-in only) | Core feature | Live on every page |
| Voice-guided form completion | Not addressed | Does not exist | Core feature | In progress |
| Non-verbal sound as input | Not addressed | Academic research only | Core feature | In progress |
| Graceful binary fallback | Not addressed | AAC devices only | Core feature | In progress |
| Simultaneous multi-modal input | Not addressed | Does not exist | 5 modalities (voice, sign, touch, blink, eye tracking) | In progress |
The academic literature contains relevant work. A 2022 study explored fifteen non-verbal mouth sounds as input for interactive applications, demonstrating that sound event detection can function independently of speech recognition. Research into voice assistants and impaired users has established that conventional voice interfaces fail for people with cognitive, linguistic, or motor impairments due to requirements for clear articulation, specific vocabulary, and precise timing.
Tools like Voiceitt have made progress in recognising non-standard speech patterns, but these are standalone products — not features integrated into commercial websites. The gap between the research and the implementation is vast.
The Architecture
NeuroSync’s flagship product, VANTAGE, is a voice-first AI execution engine that accepts five simultaneous input modalities: voice, sign language recognition (ASL/BSL), touch, blink detection, and eye tracking. It was designed from the ground up as adaptive technology that benefits everyone while solving accessibility challenges that current standards do not address.
What we describe below is the process of applying VANTAGE’s principles to our own website and diagnostic assessment — proving that these capabilities are not theoretical but deployable on any web platform using existing browser APIs.
Audio output
Pre-generated neural text-to-speech audio using a natural British English voice. Every page has an embedded audio player as the first content element. Assessment questions and response options are spoken aloud in sequence. Audio is the default. Text is the fallback.
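As a rough illustration of the sequencing, a helper like the following could assemble the spoken script for one assessment question before it is rendered to audio. The function name and the closing prompt are assumptions for the example, not the shipped script; the production audio is pre-generated rather than composed at runtime.

```javascript
// Illustrative sketch: turn one question and its options into the ordered
// list of utterances the audio layer speaks. Phrasing is an assumption.
function spokenScript(question, options) {
  const lines = [question];
  options.forEach((opt, i) => {
    lines.push(`Option ${i + 1}: ${opt}.`);
  });
  lines.push("Say the option number, or make one sound per option number.");
  return lines;
}
```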
Input detection
Four simultaneous input channels are always active, with no mode selection required. Voice input via the Web Speech API maps spoken words and numbers to response options. Keyboard input maps key presses (1, 2, 3) directly. Touch and click input accepts taps on visible response options. Sound detection via the Web Audio API monitors the microphone for any audio event above a volume threshold, counting discrete sounds within a timing window.
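Because all four channels are active at once, each one can feed a single resolver that maps whatever arrives to an option number. The sketch below shows that shape; the channel names and event fields are illustrative assumptions, and in the real page the "speech" events would come from the Web Speech API and the "sound" events from the counter monitoring the microphone.

```javascript
// Sketch: normalise events from any input channel to an option number,
// or null if the event does not select a valid option.
function resolveInput(event, optionCount) {
  const inRange = (n) => (n >= 1 && n <= optionCount ? n : null);
  switch (event.channel) {
    case "speech": {
      // Transcript from speech recognition: a number word or a digit.
      const words = { one: 1, two: 2, three: 3, four: 4, five: 5 };
      const t = event.transcript.trim().toLowerCase();
      return inRange(words[t] ?? parseInt(t, 10));
    }
    case "key":
      return inRange(parseInt(event.key, 10));   // key presses "1".."9"
    case "touch":
      return inRange(event.option);              // taps carry the option index
    case "sound":
      return inRange(event.count);               // counted non-verbal sounds
    default:
      return null;
  }
}
```

The user never declares which channel they are using; the first event that resolves to a valid option answers the question.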
Graceful degradation
When no clear input is received, the system falls back to sequential binary confirmation. Each option is presented individually. Any sound confirms. Silence declines. The minimum viable interaction is one noise.
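The binary fallback is small enough to sketch in full. This is a minimal model of the pattern, with an assumed API shape: the audio layer speaks `prompt()`, listens for a window, then calls `respond()` with whether any sound was detected.

```javascript
// Sketch: sequential binary confirmation. Each option is offered in turn;
// any sound confirms it, silence advances to the next option.
function binaryFallback(optionCount) {
  let current = 1;
  return {
    // The question the audio layer speaks next (null when exhausted).
    prompt() {
      return current <= optionCount ? `Is your answer ${current}?` : null;
    },
    // heard === true means any sound was detected during the window.
    respond(heard) {
      if (heard) return { answer: current };     // a single noise means yes
      current += 1;                              // silence: offer the next option
      return current <= optionCount
        ? { answer: null }
        : { answer: null, exhausted: true };     // restart or escalate from here
    },
  };
}
```

Nothing in this loop requires vision, literacy, speech, or fine motor control, which is why it serves as the floor the richer channels degrade to.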
No separate “accessible version”
There is no toggle. There is no accessibility menu. There is no alternative version of the site for disabled users. Every visitor encounters the same experience. The audio plays. The inputs listen. The system adapts. This is not an accommodation. It is the architecture.
Why This Matters Beyond Our Website
Approximately 16% of the global population — over 1.3 billion people — lives with a significant disability. Among these, compound disabilities are more common than isolated ones. A person with cerebral palsy may have motor, speech, and visual impairments simultaneously. A stroke survivor may lose fine motor control, clear speech, and reading ability in the same event. An elderly person may experience declining vision, hearing, dexterity, and cognitive function concurrently.
The web accessibility industry has made enormous progress in addressing individual barriers. Screen readers serve the blind. Captions serve the deaf. Keyboard navigation serves those who cannot use a mouse. Voice recognition serves those who cannot type.
But compound barriers remain largely unaddressed. The user who needs all of these accommodations simultaneously — and additionally cannot produce clear speech — finds that each individual solution assumes the availability of another channel that is also impaired.
This is not a niche edge case. It describes millions of people. And the current standards do not serve them.
A Proposal
We are not suggesting that every website needs non-verbal sound detection. We are suggesting that the accessibility standards need to evolve to address compound barriers and non-conventional input methods. Specifically:
Audio-first should be the default, not an afterthought. Content that can be read aloud should be read aloud by default, not hidden behind a button that requires sight to find.
Form completion should not require literacy. Any form that collects information from the public should offer a voice-guided pathway that does not depend on the ability to read or write.
Input methods should degrade gracefully. When speech recognition fails, the system should fall back to simpler interactions — not abandon the user. The binary confirmation pattern (present option, accept sound, advance) is implementable with existing browser APIs at negligible cost.
Multi-modal input should be simultaneous, not sequential. Users should not have to declare their disability in order to receive the appropriate interface. All input channels should be active by default.
None of this requires new technology. The Web Speech API, the Web Audio API, and neural text-to-speech services are mature, widely supported, and affordable. The barrier is not technical. It is philosophical.
The question is not whether we can include everyone. It is whether we have decided to.
What We Are Publishing
We intend to document our full technical implementation and make the architecture available for others to adopt. The voice-guided assessment framework, the multi-modal input system, and the graceful degradation pattern are not proprietary advantages we intend to protect. They are accessibility solutions that should exist everywhere.
If you build websites, build forms, or build digital services that collect information from people, we would welcome the conversation about how to implement these patterns in your context.
The current standards got us here. They are good. They are not finished.
Kirk Harper is the founder of NeuroSync Technologies. The iXform Structural Diagnostic, including its voice-first accessibility layer, is available at neurosync-technologies.ltd. Correspondence: [email protected]
Experience It
The iXform is live. The audio player is on this page. The voice-guided assessment is being deployed. This is not theory — it is practice.
See the iXform
The assessment that should come before everything else