Joel Gould NaturallySpeaking Unofficial Information Pages
The Insider's Guide to Dragon NaturallySpeaking by Joel Gould

Vocabulary Builder

Last Modified: March 22, 2000

(Editor's note: This topic was originally written in February of 1998. With version 4.0 of Dragon NaturallySpeaking, Dragon added a new feature to the Vocabulary Builder which allows you to incrementally add additional text when running the Vocabulary Builder a second time instead of starting over (see NatSpeak Version 4.0 for more information). The remainder of the information in this topic is still applicable, however.)

The Vocabulary Builder is designed to improve recognition performance by changing the language model. Language model is a term which applies to the statistics of how words follow other words. In Dragon NaturallySpeaking, the decision about what you actually said is based on both the acoustics, which are trained using the General Training program, and the language model. Dragon NaturallySpeaking comes with a built-in language model which reflects the variety of topics that you could dictate in general English transcription. The Vocabulary Builder allows you to customize the language model to a tighter range of potential discussion topics. When you improve the language model with the Vocabulary Builder, you improve recognition accuracy because Dragon NaturallySpeaking then has a better idea of what you are likely to be talking about when it tries to interpret your voice.
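As a rough illustration of "statistics of how words follow other words", the sketch below counts word pairs in a sample sentence. It is a minimal sketch only: the function name `bigram_counts` and the sample text are made up for illustration, and nothing here describes how Dragon NaturallySpeaking stores its language model internally.

```python
import re
from collections import Counter, defaultdict

def bigram_counts(text):
    """Count how often each word is followed by each other word."""
    words = re.findall(r"[a-z']+", text.lower())
    follows = defaultdict(Counter)
    for previous_word, next_word in zip(words, words[1:]):
        follows[previous_word][next_word] += 1
    return follows

# Hypothetical sample text, standing in for a user's documents.
sample = "the plaintiff agrees that the plaintiff shall pay the defendant"
counts = bigram_counts(sample)

# In this sample, "the" is followed by "plaintiff" twice and "defendant" once.
print(counts["the"].most_common())
```

Text drawn from a narrow range of topics produces counts that are concentrated on a smaller set of word pairs, which is the kind of information a customized language model can exploit.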

To use the Vocabulary Builder, first identify a number of documents on your disk that reflect the way you will be writing. For example, if you are a lawyer who writes contracts, you should find a number of contracts that you have written as examples for the Vocabulary Builder. If you are a journalist, you probably want to find a number of articles which you have written in the past to use in the Vocabulary Builder.

After you identify the documents you want to use, start the Vocabulary Builder from the Tools menu of Dragon NaturallySpeaking. Then list all of the documents which you would like the Vocabulary Builder to consider when building its language model. As a general guide, I like to build language models with at least 100,000 bytes of text, more if possible. This assumes, of course, that you are dealing with plain text documents and not Microsoft Word documents. If you're dealing with Microsoft Word documents, then the files will be larger for the same amount of text. 100,000 bytes of text corresponds to approximately 17,000 words.
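The rule of thumb above (100,000 bytes of plain text is roughly 17,000 words, or about 6 bytes per word) can be turned into a quick check before starting. This is a small sketch; the helper names are invented for illustration, and the estimate only applies to plain text files, not Word documents.

```python
import os

def estimate_words(total_bytes):
    """Estimate word count using the 100,000 bytes ~ 17,000 words guide."""
    return total_bytes * 17_000 // 100_000

def plain_text_bytes(paths):
    """Total size on disk of a list of plain text files."""
    return sum(os.path.getsize(path) for path in paths)

def enough_text(total_bytes, minimum_bytes=100_000):
    """True if the collection meets the suggested 100,000-byte minimum."""
    return total_bytes >= minimum_bytes

print(estimate_words(100_000))   # 17000
print(enough_text(60_000))       # False
```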

After you feed it documents, the Vocabulary Builder parses those documents to extract the word and punctuation information from them. Then the Vocabulary Builder computes some simple statistics about word usage within the documents that you fed it. The next interactive step is when the Vocabulary Builder presents you with a list of new words that it has found. This is one of the most confusing features of the Vocabulary Builder. The Vocabulary Builder compiles a complete list of all of the words which it found in the documents which it parsed. Most of the words which the Vocabulary Builder finds will either already be in your active vocabulary or be in your backup dictionary (the 230,000 words on disk). However, the Vocabulary Builder will often find words which are neither in the active vocabulary nor in the backup dictionary, and these words will be presented to you in a list.

The list which the Vocabulary Builder presents to you is sometimes very confusing. It includes a number of different types of words which you are required to sort through. For example, if there is a spelling mistake in your document, then the Vocabulary Builder will find the word which is misspelled like a spell checker would, and present it to you as a possible new word. In addition, if you used an acronym or abbreviation which the Vocabulary Builder did not find in your dictionary, that will be presented to you as well.

But most people find that the list of new words from the Vocabulary Builder actually looks like a list of capitalized words from their documents. What is happening is that the Vocabulary Builder is looking for candidate new words. If it finds a capitalized word at the beginning of a sentence, then it will assume that the word was capitalized because it was at the beginning of the sentence and not include it in the new word list. However, if the Vocabulary Builder finds a capitalized word in the middle of a sentence, then it presents that capitalized word to you for consideration, on the premise that the capitalization of that word may be interesting.

For example, if you dictate legal documents then the word "plaintiff" is often capitalized. The Vocabulary Builder will detect that you are often capitalizing the word "plaintiff" in your documents and present it to you in the word list as a potential new word. Then, if you decide to add the word "Plaintiff" with a capital P to your vocabulary, the Vocabulary Builder will build statistics of when it should use "Plaintiff" with a capital P and when it should use "plaintiff" with a lowercase p.

Unfortunately, the Vocabulary Builder is easily confused. If you have a lot of book titles, names of controls, or other sentences in which almost every word is capitalized, the Vocabulary Builder thinks that there are a lot of potential new words in those sentences. For that reason, the list of new words which the Vocabulary Builder gives you often looks like a list of all the capitalized words in your document.

Consider, for example, the following text:

The Plaintiff was caught reading the article "How to Spot Capitalized Words" in PC Magazine. He was struck by text which WAS IN ALL UPPERCASE, and which was spelled incorrectly.

The Vocabulary Builder should suggest the new words "Plaintiff" and "Magazine", both of which are interesting. However, the Vocabulary Builder will also find "How", "Spot", "Capitalized" and "Words" and list them as new words. (The words in all uppercase will be ignored.) In this case, you should only add "Plaintiff" and "Magazine" to your vocabulary and ignore the other new words.
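The behavior described above can be approximated with a few simple rules: treat a sentence-initial capital as uninformative, skip words in all uppercase, and flag any remaining word whose exact spelling (including case) is not already known. The sketch below is an assumption-laden imitation, not the Vocabulary Builder's actual algorithm; the function name, the sentence-splitting rule, and the `known_words` set are all hypothetical.

```python
import re

def candidate_new_words(text, known_words):
    """Return candidate new words in order of first appearance."""
    candidates = []
    # Simplification: a sentence ends at ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = re.findall(r"[A-Za-z]+", sentence)
        for position, token in enumerate(tokens):
            if len(token) > 1 and token.isupper():
                continue  # all-uppercase words are ignored
            if position == 0:
                token = token.lower()  # capital explained by sentence start
            if token not in known_words and token not in candidates:
                candidates.append(token)
    return candidates

# Hypothetical dictionary: every word of the example in lowercase only.
known_words = {
    "the", "plaintiff", "was", "caught", "reading", "article", "how", "to",
    "spot", "capitalized", "words", "in", "magazine", "he", "struck", "by",
    "text", "which", "and", "spelled", "incorrectly",
}

example = (
    'The Plaintiff was caught reading the article "How to Spot Capitalized '
    'Words" in PC Magazine. He was struck by text which WAS IN ALL '
    "UPPERCASE, and which was spelled incorrectly."
)

print(candidate_new_words(example, known_words))
# ['Plaintiff', 'How', 'Spot', 'Capitalized', 'Words', 'Magazine']
```

Even this toy version reproduces the problem: the words from the article title are flagged alongside the two capitalized words that are actually worth adding.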

The list of new words from the Vocabulary Builder is intentionally sorted in frequency order. This means that the most common new word is at the beginning of the list and the word which was found least often in your documents is at the end of the list. This allows you to consider only the words at the beginning of the list as potential ones to add to your vocabulary, ignoring words which are very rare and, therefore, listed later.
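Frequency ordering of this kind is easy to picture with a counter. The following is a minimal sketch with made-up occurrence counts; `rank_candidates` is a hypothetical helper, not part of the product.

```python
from collections import Counter

def rank_candidates(occurrences, limit=None):
    """Sort candidate words from most to least frequent."""
    return Counter(occurrences).most_common(limit)

# Hypothetical data: every mid-sentence occurrence of each candidate word.
occurrences = ["Plaintiff"] * 12 + ["Magazine"] * 3 + ["Spot"]

ranked = rank_candidates(occurrences)
print(ranked)  # [('Plaintiff', 12), ('Magazine', 3), ('Spot', 1)]

# Reviewing only the top of the list skips the rare words at the end.
shortlist = [word for word, count in ranked[:2]]
print(shortlist)  # ['Plaintiff', 'Magazine']
```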

Avoid the temptation to select every word in the list. Instead, when using the Vocabulary Builder, I strongly recommend the opposite: select almost no words in the list. You can always go back and add new words later.

Every word that you select will then be added to the Train Words dialog, and the Vocabulary Builder will prompt you to speak each individual word once. This step is optional but recommended. Dragon NaturallySpeaking will guess at a pronunciation for every new word that it sees. However, in some cases the pronunciation which Dragon NaturallySpeaking guesses is not appropriate or accurate. By speaking the word once to the Train Words dialog, you can help Dragon NaturallySpeaking select a better pronunciation and therefore improve the chances of recognizing that new word properly.

Once you have completed training the new words, Dragon NaturallySpeaking will go ahead and build the language model based on the documents it scanned and the new words you selected.

The Vocabulary Builder actually does two things at this point. First, it builds a set of all of the words which Dragon NaturallySpeaking found in your documents, whether they were in the active vocabulary, in the backup dictionary, or in the list of new words which you selected. Let's say, for example, that it found 2,000 words which were already active and 5,000 words which were in the backup dictionary, and you added 10 new words. That is a total of 7,010 words. The Vocabulary Builder will then make sure that all 7,010 words are made active.
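The bookkeeping in this step amounts to a set union. The sketch below uses tiny, made-up word sets to show which document words end up in the "must be active" set; the function name and all of the sample words are assumptions for illustration.

```python
def words_to_activate(document_words, active, backup, selected_new):
    """Collect every document word that should end up active."""
    found_active = document_words & active
    found_backup = (document_words & backup) - active
    found_new = (document_words & selected_new) - active - backup
    return found_active | found_backup | found_new

# Hypothetical miniature vocabularies.
document_words = {"the", "contract", "estoppel", "Plaintiff", "teh"}
active = {"the", "contract"}
backup = {"estoppel"}
selected_new = {"Plaintiff"}  # "teh" was a typo, so it was not selected

result = words_to_activate(document_words, active, backup, selected_new)
print(sorted(result))  # ['Plaintiff', 'contract', 'estoppel', 'the']

# The totals from the example in the text:
print(2_000 + 5_000 + 10)  # 7010
```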

This is trickier than it seems. Every time you add a new word to Dragon NaturallySpeaking, Dragon NaturallySpeaking makes sure that there is room in the 30,000 word active vocabulary. If there is not room in the vocabulary, Dragon NaturallySpeaking discards some of the words currently in the 30,000 word vocabulary. The algorithm used to decide which words to discard is based on usage statistics. Dragon NaturallySpeaking will usually discard words which have not been dictated in a while.

However, when adding lots and lots of words one at a time, it is always possible that by the time you get to the 10,000th word, the first word added becomes a candidate for being discarded. The Vocabulary Builder interacts with Dragon NaturallySpeaking in such a way that the words found in your documents are collected together to make sure that all of them become active. This is one reason that the Vocabulary Builder has a limit on the total number of words which it can consider at one time (25,000 when dealing with a 30,000 word vocabulary).
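To see why adding words as a group matters, consider a toy fixed-size vocabulary that discards the least recently used word when it is full. This is only a sketch under stated assumptions: the class `ActiveVocabulary`, its methods, and the strict least-recently-used rule are invented here, whereas the text says only that the real algorithm is based on usage statistics.

```python
from collections import OrderedDict

class ActiveVocabulary:
    """Toy fixed-size vocabulary; least recently used words go first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.words = OrderedDict()  # least recently used word comes first

    def use(self, word):
        """Add or refresh one word, evicting the oldest word if full."""
        if word in self.words:
            self.words.move_to_end(word)
            return
        if len(self.words) >= self.capacity:
            self.words.popitem(last=False)
        self.words[word] = True

    def add_batch(self, batch):
        """Add a group of words, evicting only words outside the group."""
        batch = list(dict.fromkeys(batch))  # drop duplicates, keep order
        if len(batch) > self.capacity:
            raise ValueError("batch is larger than the active vocabulary")
        protected = set(batch)
        for word in batch:
            if word in self.words:
                self.words.move_to_end(word)
        missing = [word for word in batch if word not in self.words]
        overflow = len(self.words) + len(missing) - self.capacity
        if overflow > 0:
            victims = [w for w in self.words if w not in protected][:overflow]
            for word in victims:
                del self.words[word]
        for word in missing:
            self.words[word] = True

# One at a time: the first new word is evicted by the fourth.
naive = ActiveVocabulary(capacity=3)
for word in ["w1", "w2", "w3", "w4"]:
    naive.use(word)
print("w1" in naive.words)  # False

# As a group: only words outside the group are discarded.
grouped = ActiveVocabulary(capacity=3)
for word in ["a", "b", "c"]:
    grouped.use(word)
grouped.add_batch(["x", "y"])
print(list(grouped.words))  # ['c', 'x', 'y']
```

A cap on the group size, such as the 25,000-word limit mentioned above, leaves some room for words that are not part of the group.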

Once the Vocabulary Builder has made sure that all of the words in your documents are active, it then builds a statistical language model from those words. The statistical language model includes information about how the words were used in your writing.

The purpose of the statistical language model is to predict the words which you will be saying based on the other words which you have dictated before and after. For this reason, running the Vocabulary Builder improves your accuracy more than simply adding to your active vocabulary all of the words which the Vocabulary Builder would add.
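The sketch below illustrates why word-usage statistics help beyond having the words active. Two similar-sounding candidates are both in the vocabulary; the choice between them changes once counts of what follows the previous word are available. Everything here is an assumption for illustration: the scores, the add-one smoothing, and the way the two scores are combined are not taken from Dragon NaturallySpeaking.

```python
import math

def pick_word(previous_word, acoustic_scores, follows, vocabulary_size):
    """Pick the candidate with the best combined log score."""
    seen = follows.get(previous_word, {})
    total = sum(seen.values())
    best_word, best_score = None, -math.inf
    for word, acoustic in acoustic_scores.items():
        # Add-one smoothing: unseen pairs are unlikely but not impossible.
        lm_probability = (seen.get(word, 0) + 1) / (total + vocabulary_size)
        score = math.log(acoustic) + math.log(lm_probability)
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Hypothetical acoustic scores: the wrong word sounds slightly better.
acoustic_scores = {"plaintiff": 0.45, "plaintive": 0.55}

# Words active but no usage statistics: acoustics alone decide.
print(pick_word("the", acoustic_scores, {}, 30_000))  # plaintive

# With statistics from legal documents, the expected word wins.
follows = {"the": {"plaintiff": 40}}
print(pick_word("the", acoustic_scores, follows, 30_000))  # plaintiff
```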

If you run the Vocabulary Builder a second time, then the entire procedure is repeated from scratch. Because you have run the Vocabulary Builder before, any new words which are found in your documents will also be compared against the words you previously added. Therefore, running the Vocabulary Builder twice on the same set of documents should produce a list of new words which is smaller by the words which you previously added to your vocabulary.

However, the rest of the process is identical. The Vocabulary Builder, when run on a new set of documents, will add a new set of words to your active vocabulary, potentially displacing words which you added in previous runs of the Vocabulary Builder. In addition, the Vocabulary Builder will completely replace the topic-specific language model that was built previously with a new one based on the documents you fed it in the current session.

For this reason, you cannot incrementally improve the language model by running the Vocabulary Builder on one or two additional documents. If you have more text which you want Dragon NaturallySpeaking to consider in your language model, you'll have to rerun the Vocabulary Builder with all the previous text you used before plus the new text that you want to consider, in one session.
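The consequence of rebuilding from scratch can be shown with a toy model builder that keeps nothing from earlier runs. This is a sketch of the behavior described here (before the incremental feature mentioned in the editor's note); the function `build_topic_model` and the sample documents are hypothetical.

```python
from collections import Counter

def build_topic_model(documents):
    """Build word-pair counts from scratch; earlier runs are not kept."""
    model = Counter()
    for text in documents:
        words = text.lower().split()
        model.update(zip(words, words[1:]))
    return model

previous_documents = ["the plaintiff shall pay"]
new_documents = ["the defendant shall appear"]

# Running on the new text alone replaces the earlier statistics.
replaced = build_topic_model(new_documents)
print(replaced[("the", "plaintiff")])  # 0

# Running on the old and new text together keeps both.
combined = build_topic_model(previous_documents + new_documents)
print(combined[("the", "plaintiff")], combined[("the", "defendant")])  # 1 1
```

In practice this means keeping the full set of source documents on hand so that they can all be listed again in a later session.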

To allow people to have multiple topics based on running the Vocabulary Builder on different sets of documents, the Deluxe Edition supports multiple topics. Each topic in the Deluxe Edition is a separate vocabulary with its own set of 30,000 active words and its own statistical language model information, produced by running the Vocabulary Builder (called the Topic Builder in the Deluxe Edition) on the topic.

This means that in the Deluxe Edition you can have, for example, a medical topic produced by running the Vocabulary Builder over a number of medical documents. You can also have a legal topic produced by running the Vocabulary Builder over a number of legal documents. Both of these topics are available to a given user, and you can dynamically select which topic you want to use when dictating.

This web page (http://www.synapseadaptive.com/joel/VocabularyBuilder.html) was last modified on March 22, 2000. The contents of this page are (c) Copyright 1998-1999 by Joel Gould. All Rights Reserved. See Copyright Information for more details.