Windows Speech Recognition
Windows Speech Recognition (WSR) is a speech recognition component developed by Microsoft for Windows Vista that enables voice commands to control the desktop user interface; dictate text in electronic documents and email; navigate websites; perform keyboard shortcuts; and operate the mouse cursor. It also supports the creation of custom macros to perform additional or supplementary tasks.
Operating system: Windows Vista and later; Windows Server 2008 and later
WSR is a locally processed speech recognition platform; it does not rely on cloud computing for accuracy, dictation, or recognition, but adapts based on contexts, grammars, speech samples, training sessions, and vocabularies. It provides a personal dictionary that allows users to include or exclude words or expressions from dictation and to optionally record pronunciations to increase recognition accuracy. With Windows Search, it can optionally analyze and collect text in documents and email, as well as handwritten tablet PC input, to contextualize and disambiguate terms and further adapt the recognizer. Custom language models that adapt the recognizer to the specific contexts, phonetics, and terminologies of users in particular occupational fields such as legal or medical are also supported.
WSR was developed to be integrated into Windows Vista, as Windows previously only supported speech recognition features exclusive to applications such as Windows Media Player. Microsoft Office XP introduced speech recognition, but it was mainly limited to Internet Explorer and Office. With the release of Windows Vista, Office 2007 and later versions of Office rely on WSR, replacing the separate Office speech recognition. The majority of integrated applications in Windows Vista can be controlled through speech. WSR is present in Windows 7, Windows 8, Windows 8.1, Windows RT, and Windows 10.
Microsoft was involved in speech recognition and speech synthesis research for many years before WSR. In 1993, Microsoft hired Xuedong Huang from Carnegie Mellon University to lead its speech development efforts; the company's research led to the development of the Speech API introduced in 1994. Speech recognition had also been used in previous Microsoft products. Office XP and Office 2003 provided speech recognition capabilities in Internet Explorer and Office applications; they also enabled limited speech functionality in Windows 98, Windows ME, Windows NT 4.0, and Windows 2000. Windows XP Tablet PC Edition 2002 included speech recognition capabilities with the Tablet PC Input Panel, and the Microsoft Plus! for Windows XP expansion package enabled voice commands to be used in Windows Media Player. However, this required installation of speech recognition as an additional component (with support primarily limited to individual applications); before Windows Vista, Windows did not include extensive or integrated speech recognition capabilities.
At the 2002 Windows Hardware Engineering Conference (WinHEC 2002), Microsoft announced that Windows Vista (then codenamed "Longhorn") would include advances in speech recognition and in features such as microphone array support; these features were part of the company's goal to "provide a consistent quality audio infrastructure for natural (continuous) speech recognition and (discrete) command and control." Bill Gates stated during the 2003 Professional Developers Conference (PDC 2003) that Microsoft would "build speech capabilities into the system -- a big advance for that in 'Longhorn,' in both recognition and synthesis, real-time"; and pre-release builds throughout the development of Windows Vista included a speech engine with training features. A PDC 2003 developer presentation stated that Windows Vista would also include a user interface for microphone feedback and control, and user configuration and training features. Microsoft later clarified the extent to which speech recognition would be integrated when it stated in a pre-release software development kit that "the common speech scenarios, like speech-enabling menus and buttons, will be enabled system-wide."
During WinHEC 2004, Microsoft listed WSR as part of its "Longhorn" mobile PC strategy to improve productivity. At WinHEC 2005, Microsoft emphasized accessibility, new mobility scenarios, and improvements to the speech user experience. Unlike the speech support included in Windows XP, which was integrated with the Tablet PC Input Panel and required switching between separate Commanding and Dictation modes, Windows Vista would introduce a dedicated interface for speech input on the desktop and unify the separate speech modes; users previously could not speak a command after dictating or vice versa without first switching between these two modes. Microsoft also stated that Windows Vista would improve dictation accuracy and support additional languages; a demonstration emphasized email dictation, and a presentation about microphone arrays was also shown. Windows Vista Beta 1 included an integrated speech recognition application. To incentivize company employees to analyze WSR for software glitches and provide feedback during its development, Microsoft offered an opportunity for testers to win a Premium model of the Xbox 360.
During a demonstration by Microsoft on July 27, 2006, before Windows Vista's release to manufacturing (RTM), a notable incident involving WSR occurred: several attempts to dictate led to consecutive output errors, producing the unintended output "Dear aunt, let's set so double the killer delete select all". The incident was a subject of significant derision among analysts and journalists in the audience. Microsoft later revealed that these issues were due to an audio gain glitch that caused the speech recognizer to distort the dictated words; the glitch was fixed before Windows Vista's release.
In early 2007, reports surfaced that WSR might be vulnerable to an attack that could allow attackers to play audio through a computer's speakers, thereby using speech recognition to perform undesired user operations on a target computer; it was the first vulnerability discovered after Windows Vista's general availability. While Microsoft stated that such an attack is theoretically possible, it would have to meet a number of prerequisites to be successful: the target system would have to have the speech recognition feature properly configured and activated; speakers and microphone(s) connected to the targeted system would need to be turned on; and the exploit would require the software to interpret commands without a user noticing—an unlikely scenario as the affected system would perform visible interface operations and produce audible feedback. Mitigating factors include dictation clarity and microphone feedback and placement. Because of User Account Control, an exploit of this nature also would not be able to perform privileged operations for users or protected administrators without explicit consent.
With Windows 7, Microsoft introduced several changes to improve the user experience. The recognizer was updated to use Microsoft UI Automation, substantially enhancing its performance, and the recognition engine now uses the WASAPI audio stack, which enables support for echo cancellation. The document harvester, which optionally analyzes and collects text in email and documents to contextualize and disambiguate user terms, has improved performance and has been updated to run periodically in the background instead of only after recognizer startup. Sleep mode has also seen performance improvements and, to address security issues, Windows 7 introduces a new "voice activation" option (enabled by default) that turns the recognizer off after users speak "stop listening" instead of putting the recognizer to sleep. Windows 7 also introduces an option to submit speech training data to Microsoft to improve future recognizer versions.
Windows 7 introduced an optional dictation scratchpad interface that functions as a temporary document into which users can dictate or type text for insertion into applications that are not compatible with the Text Services Framework. WSR previously provided an "enable dictation everywhere" option in Windows Vista.
Windows 8 and Windows RT
WSR can be used to control the Metro user interface in Windows 8, Windows 8.1, and Windows RT with commands to open the Charms bar ("Press Windows C"); to dictate or display commands in Metro-style apps ("Press Windows Z"); to perform tasks in apps (e.g., "Change to Celsius" in MSN Weather); and to display all installed apps listed by the Start screen ("Apps").
Overview and features
WSR allows a user to control a computer, including the operating system desktop user interface, through voice commands. Applications, including most of those bundled with Windows, can also be controlled through voice commands. By using speech recognition, users can dictate text within documents, email, and forms; control the operating system user interface; perform keyboard shortcuts; and move the mouse cursor.
WSR uses a local speech profile to store information about a user's voice. Accuracy of speech recognition increases through use, which helps the feature adapt to a user's grammar, speech patterns, vocabulary, and word usage. Speech recognition also includes a tutorial to improve accuracy, and can optionally review a user's personal documents—including email—to improve its command and dictation accuracy. Individual speech profiles can be created on a per-user basis, and backups of profiles can be performed via Windows Easy Transfer. WSR supports the following languages: Chinese (Traditional), Chinese (Simplified), English (U.S.), English (U.K.), French, German, Japanese, and Spanish. WSR relies on the Speech API developed by Microsoft, and third-party applications must support the Text Services Framework.
The WSR interface consists of a status area that displays instructions, information about commands (e.g., if a command is not heard by the recognizer), and the status of the recognizer; a voice meter displays visual feedback about volume levels. The status area represents the current state of WSR in a total of three modes, listed below with their respective meanings:
- Listening: The recognizer is active and waiting for user input
- Sleeping: The recognizer will not listen for or respond to commands other than "Start listening"
- Off: The recognizer will not listen or respond to any commands; this mode can be enabled by speaking "Stop listening"
Colors of the recognizer listening mode button denote its various modes of operation: blue when listening; blue-gray when sleeping; gray when turned off; and yellow when the user switches context (e.g., from the desktop to the taskbar) or when a voice command is misinterpreted. The status area can also display custom user information as part of Windows Speech Recognition Macros.
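The three modes and their button colors can be pictured as a small state machine. The following Python sketch is only an illustration of the behavior described above, not WSR's implementation; the names `Mode`, `BUTTON_COLOR`, `next_mode`, and the `stop_target` parameter are hypothetical, and whether "Stop listening" leads to Off or Sleeping is modeled as a setting.

```python
from enum import Enum


class Mode(Enum):
    LISTENING = "listening"
    SLEEPING = "sleeping"
    OFF = "off"


# Button colors for each mode, as listed in the text above.
BUTTON_COLOR = {
    Mode.LISTENING: "blue",
    Mode.SLEEPING: "blue-gray",
    Mode.OFF: "gray",
}


def next_mode(mode: Mode, phrase: str, stop_target: Mode = Mode.OFF) -> Mode:
    """Return the mode that follows `mode` after `phrase` is spoken.

    `stop_target` is a hypothetical setting that chooses whether
    "Stop listening" turns the recognizer off or puts it to sleep.
    """
    spoken = phrase.strip().lower()
    if mode is Mode.LISTENING and spoken == "stop listening":
        return stop_target
    if mode is Mode.SLEEPING and spoken == "start listening":
        return Mode.LISTENING
    # Off ignores all speech; Sleeping ignores everything but "Start listening".
    return mode
```

In this model a recognizer that is Off can only be reactivated by non-speech means, which matches the description that it will not respond to any commands.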
An alternates panel disambiguation interface displays a list of items interpreted as being relevant to a user's spoken word(s); if the word or phrase that a user desired to insert into an application is listed among results, a user can speak the corresponding number of the word or phrase in the results and confirm this choice by speaking "OK" to insert it within the application. The alternates panel will also appear when launching applications or speaking commands that refer to more than one item (e.g., speaking "Start Internet Explorer" may list the web browser and a version of it with browser add-ons disabled). However, an ExactMatchOverPartialMatch Windows Registry entry can limit commands to items with exact names if there is more than one instance included in results.
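The effect of preferring exact names over partial matches can be sketched as follows. This is a simplified Python model under stated assumptions: candidates are matched by case-insensitive substring, which is not necessarily how the recognizer matches names, and the functions `candidates` and `numbered` are hypothetical. Only the setting name comes from the text above.

```python
def candidates(spoken_name: str, items: list[str],
               exact_match_over_partial: bool = False) -> list[str]:
    """Return the items an alternates panel might list for `spoken_name`."""
    wanted = spoken_name.casefold()
    # Assumption: any item whose name contains the spoken words is a candidate.
    partial = [item for item in items if wanted in item.casefold()]
    if exact_match_over_partial:
        exact = [item for item in partial if item.casefold() == wanted]
        if exact:
            return exact  # an exact name wins, so no disambiguation is needed
    return partial


def numbered(options: list[str]) -> dict[int, str]:
    """Number the options so one can be chosen by speaking its number."""
    return {n: option for n, option in enumerate(options, start=1)}
```

With the items "Internet Explorer" and "Internet Explorer (No Add-ons)", the default call returns both entries for numbering, while passing `exact_match_over_partial=True` returns only the first.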
Listed below are common WSR commands. Words in italics indicate a word that can be substituted for a desired item (e.g., the word "direction" in the "scroll direction" command can be substituted with the word "down"). A "start typing" command enables WSR to interpret all dictation commands as keyboard shortcuts.
- Dictation commands: "New line"; "New paragraph"; "Tab"; "Literal word"; "Numeral number"; "Go to word"; "Go after word"; "No space"; "Go to start of sentence"; "Go to end of sentence"; "Go to start of paragraph"; "Go to end of paragraph"; "Go to start of document"; "Go to end of document"; "Go to field name" (e.g., go to address, cc, or subject). Special characters, such as a comma, can be dictated by speaking the name of the special character.
- Navigation commands:
  - Keyboard shortcuts: "Press keyboard key"; "Press ⇧ Shift plus a"; "Press capital b."
    - Keys that can be pressed without first giving the press command include: ← Backspace, Delete, End, ↵ Enter, Home, Page Down, Page Up, and Tab ↹.
  - Mouse commands: "Click"; "Click that"; "Double-click"; "Double-click that"; "Mark"; "Mark that"; "Right-click"; "Right-click that"; Mousegrid.
  - Window management commands: "Close (alternatively maximize, minimize, or restore) window"; "Close that"; "Close name of open application"; "Switch applications"; "Switch to name of open application"; "Scroll direction"; "Scroll direction in number of pages"; "Show desktop"; "Show numbers."
- Speech recognition commands: "Start listening"; "Stop listening"; "Show speech options"; "Open speech dictionary"; "Move speech recognition"; "Minimize speech recognition." In the English language, applicable commands can be shown by speaking "What can I say?" Users can also query the recognizer about tasks in Windows by speaking "How can I task name," which opens related help documentation.
A mousegrid command enables users to control the mouse cursor by overlaying numbers across nine regions on the screen; these regions gradually narrow as a user speaks the number(s) of the region on which to focus until the desired interface element is reached. The regions with which a user can interact are based on commands including "Click number of region," which moves the mouse cursor to the desired region and then clicks it; and "Mark number of region", which allows an item (such as a computer icon) in a region to be selected, which can then be clicked with the previous click command. A user can also simultaneously interact with multiple regions of the mousegrid.
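The narrowing behavior of the mousegrid can be illustrated with a short Python sketch. It assumes a 3×3 grid numbered 1 to 9 from left to right and top to bottom, and that a click lands at the center of the final region; both are assumptions made for illustration, and `narrow` and `target` are hypothetical names.

```python
def narrow(region: tuple[float, float, float, float],
           number: int) -> tuple[float, float, float, float]:
    """Return the sub-region (left, top, width, height) selected by `number`."""
    if not 1 <= number <= 9:
        raise ValueError("region number must be between 1 and 9")
    left, top, width, height = region
    row, col = divmod(number - 1, 3)  # assumed row-major numbering
    cell_w, cell_h = width / 3, height / 3
    return (left + col * cell_w, top + row * cell_h, cell_w, cell_h)


def target(screen: tuple[float, float, float, float],
           numbers: list[int]) -> tuple[float, float]:
    """Return the point reached after speaking each number in turn."""
    region = screen
    for number in numbers:
        region = narrow(region, number)
    left, top, width, height = region
    return (left + width / 2, top + height / 2)
```

Each spoken number shrinks the active area to one ninth of its previous size, so two numbers already narrow a screen to 1/81 of its area.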
Applications and interface elements that do not present identifiable commands can still be controlled by asking the system to overlay numbers on top of them through a show numbers command. Once active, speaking the overlaid number selects that item so a user can open it or perform other operations. Show numbers was designed so that users could interact with items that are not readily identifiable.
WSR enables dictation of text in the operating system and applications. If a dictation mistake occurs it can be corrected by speaking "Correct word" or "Correct that" and the alternates panel will appear and provide suggestions for correction; these suggestions can be selected by speaking the number corresponding to the number of the suggestion in the list and by speaking "OK." If the desired item is not listed among suggestions, a user can speak it so that it might appear. Alternatively, users can speak "Spell it" or "I'll spell it myself" to speak the desired item on a per-letter basis; users can use their personal alphabet or the NATO phonetic alphabet when spelling. Multiple words in a sentence can be corrected simultaneously (for example, if a user speaks "dictating" but the recognizer interprets this word as "the thing," a user can state "correct the thing" to correct both words). In the English language over 100,000 words are recognized by default.
WSR includes a personal dictionary that allows users to include or exclude certain words or expressions from dictation. When a user adds a word beginning with a capital letter to the dictionary, a user can specify whether it should always be capitalized or if capitalization depends on the context in which the word is spoken. Users can also record pronunciations for words added to the dictionary to increase recognition accuracy; words written via a stylus on a tablet PC for the Windows handwriting recognition feature are also stored. Most of the information stored within a dictionary is included as part of a user's speech profile.
WSR supports custom macros through a supplementary application by Microsoft that enables additional natural language commands. As an example of this functionality, an email macro released by Microsoft enables a natural language command where a user can state "send email to contact about subject," which opens Microsoft Outlook to compose a new message with the designated contact and subject automatically inserted. Microsoft has also released sample macros for the speech dictionary, for Windows Media Player, for Microsoft PowerPoint, for speech synthesis, to switch between multiple microphones, to customize various aspects of audio device configuration such as volume levels, and for general natural language queries such as "What is the weather forecast?" "What time is it?" and "What's the date?" Answers to these queries are spoken via a speech synthesizer.
Users and developers can create their own macros that can be based on text transcription and substitution; application execution (with support for command-line arguments); keyboard shortcuts; emulation of existing voice commands; or a combination of these items. XML, JScript and VBScript are supported. Macros can be limited to individual applications if desired and rules for macros can be defined programmatically. For a macro to load, it must be stored in a Speech Macros folder within the current user's Documents directory. All macros are digitally signed by default if a user certificate is available, to ensure that commands are not corrupted or loaded by third-parties; if one is not available, an administrator can create a certificate for use. The macros utility also includes security levels to prohibit unsigned macros from being loaded; to prompt users to sign macros; and to load unsigned macros.
As of 2017, WSR uses Microsoft Speech Recognizer 8.0, which has not been changed since Windows Vista. Mark Hachman, a senior editor of PC World, found it to be 93.6% accurate for dictation without training, a rate lower than that of competing software. According to Microsoft, the rate of accuracy when trained is 99%. Hachman commented that Microsoft does not publicly discuss WSR, attributing this to the 2006 incident during development of Windows Vista, and noted that few users knew that documents could be dictated within Windows before the introduction of Cortana.