SearchEngine: Eliminating Words


Topics

Removing documents from the word list
Generating a word list
Creating word filter documents
Eliminating a word
Reducing words
Removing words in specific HTML tags
The order of filtering

This chapter discusses the filters available for removing all words in entire files, eliminating useless words such as "and" or "the", reducing words such as "www.javasoft.com", and removing words within specific HTML tags.

Before looking at the various methods of eliminating words, it is necessary to describe what the compiler considers a 'word' to be. The word parser, incorporated into the compiler, parses words according to two separate algorithms.

Any numeric value (0 to 9 or a valid ISO-Latin1 numeric value) followed by other numeric values, or "." or "," is considered to be a number. Trailing "." or "," characters are ignored.

Any letter, followed by letters, numeric values, ".", "-", or "_" is considered to be a word. Trailing ".", "-", or "_" characters are ignored.

If you want a hyphenated word to be split into its components, use the &shy; (&#173;) ampersand entity, also known as a soft hyphen, instead of the hyphen character '-', for example profit&shy;margin.

Values such as "1.0" or "1,000", or even Dewey decimal values such as "1.2.3", are all considered to be numbers. Note, however, that "1..6" is also considered to be a number.

The compiler provides the -xn option, which removes all numbers from the word list.

Values such as "wasn't" are considered to be two separate words: "wasn" and "t". The word parser does not test for the apostrophe, because it would then have to understand single quoted phrases. Since there are no syntactical rules in HTML for #PCDATA (the text within tags), it is impossible to tell when an apostrophe marks the start or end of a single quoted phrase and when it is just an apostrophe. Some people also prefer to use the "`" character to start a single quoted phrase.
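The two parsing rules above can be approximated with a short Python sketch. This is not the compiler's actual parser: the names TOKEN and tokenize are hypothetical, and the letter ranges are an assumption covering ASCII and ISO-Latin1 letters only.

```python
import re

# Assumed letter set: ASCII letters plus the ISO-Latin1 letter ranges.
LETTERS = "A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF"

# A number starts with a digit and continues with digits, "." or ",".
# A word starts with a letter and continues with letters, digits, ".", "-" or "_".
TOKEN = re.compile(
    r"(?P<number>[0-9][0-9.,]*)"
    r"|(?P<word>[" + LETTERS + r"][" + LETTERS + r"0-9._-]*)"
)

def tokenize(text):
    """Return (kind, token) pairs, with trailing punctuation ignored."""
    tokens = []
    for match in TOKEN.finditer(text):
        kind = match.lastgroup
        trailing = ".," if kind == "number" else ".-_"
        tokens.append((kind, match.group().rstrip(trailing)))
    return tokens

print(tokenize("wasn't"))
print(tokenize("Version 1.0, see www.javasoft.com."))
```

Because neither the apostrophe nor the soft hyphen character is part of the word pattern, both act as word separators in this sketch, which matches the behavior described above.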

Removing documents from the word list

A table of contents (TOC) document is an ideal candidate for word removal. Although needed to generate the dependency list, it would be unproductive for the TOC document contents to appear in the word database, since the descriptors (words) in that document invariably link the user to other pages.
In this case, all words within a document can be removed from the word list in the same way as documents are removed from the dependency list, described below.

Removing a specific document from the word list

To remove all words in a specific document from the word list, use the -xwu option, and specify the document's URL path and filename components, for example:

-xwu /www/rational/application/search/doc/TOC.html

Removing multiple documents from the word list

To remove all words in multiple documents from the word list, use the -xwu option, and a filter using the wildcard character '*'. For example:

-xwu */TOC.html

In this example, all words in all URLs ending with /TOC.html will be excluded from the word list.

Another more dangerous example of filtering is:

-xwu /www/extawt/*

In the above example, all words in all URLs beginning with /www/extawt/ will be excluded from the word list.

Finally an even more dangerous example of filtering is:

-xwu */extawt/*

In this example, all words in all URLs containing /extawt/ will be excluded from the word list.

No other combinations of the wildcard character '*' are valid. A filter definition of */extawt/*remove.* results in a (probably useless) filter that removes all words from URLs containing the literal text /extawt/*remove., rather than the probable intention of removing all words from URLs containing both /extawt/ and remove.

The wildcard character '*' can appear at the start of the URL, at the end of the URL, or both; anywhere else it is treated as an ordinary character.
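The wildcard rules can be illustrated with a minimal Python sketch. The function name url_filter_matches is hypothetical, and treating a filter without wildcards as an exact match is an assumption; the compiler's own matching code may differ.

```python
def url_filter_matches(pattern, url):
    """Match a URL against a filter whose '*' is special only at either end."""
    leading = pattern.startswith("*")
    if leading:
        pattern = pattern[1:]
    trailing = pattern.endswith("*")
    if trailing:
        pattern = pattern[:-1]
    # Any '*' still left in pattern is compared as an ordinary character.
    if leading and trailing:
        return pattern in url
    if leading:
        return url.endswith(pattern)
    if trailing:
        return url.startswith(pattern)
    return url == pattern  # assumption: no wildcard means an exact match

print(url_filter_matches("*/TOC.html", "/www/rational/doc/TOC.html"))
print(url_filter_matches("*/extawt/*remove.*", "/www/extawt/remove.html"))
```

The second call shows why the filter */extawt/*remove.* is probably useless: the inner '*' has to appear literally in the URL for the filter to match.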

Generating a word list

Before individual words can be removed, you have to know which words appear in the search database. The compiler provides the -lw filename option, which writes all filtered words, in HTML document format, to the specified file.

The following is an excerpt from the generated word list:

<dl>
<dt>absolute
<dt>accept
<dt>acceptable
<dt>access
<dt>according
<dt>accumulates
<dt>achieve
<dt>achieved
<dt>acronyms
<dt>add
<dt>added
<dt>addition
<dt>address
...
</dl>

Creating word filter documents

Commonly used words, or useless words, can be removed from the database using word lists, which are stored in an HTML document known as a word filter document. The format is the same as that of the parsed documents of the dependency list, so HTML ampersand entities (beginning with '&') can be used to represent ISO-Latin1 characters in ASCII files. The current list of valid ampersand entities is given in the appendix Ampersand entities.

Since the word filter document (see below) and generated word list file are both in HTML format, you can use your favorite text editor to cut and paste words to be removed from the word list to the word filter document.

Eliminating a word

A specific word can be eliminated by simply having the word appear in a word filter document. This is a file in HTML format, which lists the specific words or word filters to be used when removing words. It is a good idea to list them one per line, for readability, and ease of editing. The following is an excerpt from the exclude.english.html file:

<dl>
<dt>a
<dt>able
<dt>about
<dt>above
<dt>accomplish
<dt>accomplished
<dt>accomplishes
<dt>across
<dt>act
<dt>acts
<dt>actual
...
</dl>

Word filter documents are specified using the -xwf option, for example:

-xwf exclude.english.html
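As an illustration of the word filter document format, the following Python sketch reads the <dt> entries from such a document using the standard html.parser module. The names FilterDocParser and load_filter_document are hypothetical, and treating any entry that contains '*' as a word reduction filter is an assumption, not documented compiler behavior.

```python
from html.parser import HTMLParser

class FilterDocParser(HTMLParser):
    """Collect the text that follows each <dt> tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decodes ampersand entities
        self.in_dt = False
        self.entries = []

    def handle_starttag(self, tag, attrs):
        self.in_dt = (tag == "dt")

    def handle_endtag(self, tag):
        self.in_dt = False

    def handle_data(self, data):
        if self.in_dt:
            self.entries.extend(data.split())

def load_filter_document(html_text):
    """Return (exclusion words, reduction filters) found in the document."""
    parser = FilterDocParser()
    parser.feed(html_text)
    parser.close()
    exclusions = [e for e in parser.entries if "*" not in e]
    reductions = [e for e in parser.entries if "*" in e]  # assumed split
    return exclusions, reductions

sample = "<dl>\n<dt>a\n<dt>able\n<dt>*javasoft*\n</dl>\n"
print(load_filter_document(sample))
```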

Reducing words

The compiler also provides for simple though potentially dangerous word reduction filters, which trim or reduce words. Generally, word reduction filters should be avoided, since they can have unexpected side-effects, similar to the filters used for eliminating URLs from the dependency list or word list.

In addition, word reduction filters slow down compilation, since each parsed word (there may be several thousand of them) has to be checked against each filter until a filter matches or all the filters have been checked.

Word reduction filters have the same form as URL filters, except that they are placed in a word filter document instead of being declared on the command line. If a word matches a filter, that word is not eliminated, but reduced and put back into the word list.

For example, after a first compilation, the word list might produce words (taken from the text of links), such as:

ftp.javasoft.com
...
splash.javasoft.com
...
www.javasoft.com

In this case, say you are interested in keeping the javasoft part as a word in the database, and discarding the rest. You can achieve this by creating the word reduction filter (in your word filter document) as follows:

<dl>
<dt>*javasoft*
...
</dl>

You might think that such filters can be used for reducing plurals, or reducing adjectives, but this is not the case. If you create word reduction filters such as:

<dl>
<dt>*s
<dt>*ing
...
</dl>

they will reduce, for example, cards to card and playing to play, but will also reduce miss to mis and king to k. Caveat emptor.
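The side-effects described above are easy to reproduce in a small Python sketch. The reduction rule used here is inferred from the examples in this section and is an assumption: a filter with a wildcard at both ends keeps only its literal part, while a filter with a wildcard at one end strips its literal part from the word. The function name reduce_word is hypothetical.

```python
def reduce_word(word, filters):
    """Apply the first matching reduction filter; otherwise return word as is."""
    for f in filters:
        leading = f.startswith("*")
        trailing = f.endswith("*") and len(f) > 1
        core = f[1:] if leading else f
        core = core[:-1] if trailing else core
        if not core:
            continue  # a filter made only of wildcards is ignored in this sketch
        if leading and trailing:
            if core in word:
                return core  # *javasoft* keeps only "javasoft"
        elif leading:
            if word.endswith(core):
                return word[:-len(core)]  # *s strips a final "s"
        elif trailing:
            if word.startswith(core):
                return word[len(core):]  # assumed by symmetry; no example in the text
    return word

for w in ["www.javasoft.com", "cards", "playing", "miss", "king"]:
    print(w, "->", reduce_word(w, ["*javasoft*", "*s", "*ing"]))
```

The loop stops at the first filter that matches, mirroring the statement that each word is checked against each filter until one matches.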

Removing words in specific HTML tags

The compiler can remove words found in specific tags. There are four such tag groups:

-nt: exclude <TITLE> tagged words.
-nh: exclude <H1..H6> and <CAPTION> tagged words.
-nl: exclude <DT> and <LI> tagged words.
-nb: exclude words not inside the above listed tags.

The order of filtering

The compiler takes the parsed word list and filters it, in the following order, to produce the final word list:

All words are converted to lower case.
If any of the -nb, -nh, -nl, or -nt flags is set, all words excluded by those flags are removed from the list.
If the -xn flag is set, all numbers are removed from the list.
The resulting word list is tested against the word reduction filters; matching words are removed, reduced, and put back into the list.
The resulting word list is tested against the exclusion word lists, and matching words are removed.

This ordering allows for words which were reduced to then be removed.
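The filtering order can be sketched as a simple Python pipeline. Everything here is illustrative: the function filter_words, its parameters, and the tag group names ("title", "heading", "list", "body") are hypothetical stand-ins for the compiler's internal data, and the reduction step is passed in as a caller-supplied function.

```python
def filter_words(parsed, excluded_groups=frozenset(), drop_numbers=False,
                 reduce=lambda w: w, exclusions=frozenset()):
    """parsed is a list of (word, tag_group) pairs; returns the final word list."""
    # 1. Convert all words to lower case.
    words = [(w.lower(), group) for w, group in parsed]
    # 2. Remove words in excluded tag groups (the -nt, -nh, -nl, -nb flags).
    words = [w for w, group in words if group not in excluded_groups]
    # 3. Remove numbers (the -xn flag).
    if drop_numbers:
        words = [w for w in words if not w[:1].isdigit()]
    # 4. Apply the word reduction filters.
    words = [reduce(w) for w in words]
    # 5. Remove words found in the exclusion word lists.
    return [w for w in words if w not in exclusions]

parsed = [("The", "body"), ("WWW.JavaSoft.COM", "body"), ("1.0", "body"),
          ("Search", "title"), ("Compiler", "heading")]
print(filter_words(
    parsed,
    excluded_groups={"title"},
    drop_numbers=True,
    reduce=lambda w: "javasoft" if "javasoft" in w else w,
    exclusions={"the", "javasoft"},
))
```

In this example the word WWW.JavaSoft.COM is first lowercased, then reduced to javasoft, and only then removed by the exclusion list, which is exactly what the ordering makes possible.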

Rational Unified Process