
SearchEngine: Overview


This chapter explains how the SearchEngine works, and describes the iterative process of generating, and later regenerating, the final applet word database.

The purpose of the SearchEngine

The SearchEngine reads one or more HTML files, parses the words within the markup tags, and then parses all linked HTML files. Each word is checked for word removal and word reduction, and the resulting word list for the HTML file is stored internally. When all the linked HTML files have been parsed, the word database is constructed, together with the applet tag for the HTML applet search page.

Though in theory this could be achieved in a single pass, in practice it is usually an iterative process. When compiling a new database, the parser may signal HTML syntax errors, which you may want to correct. There may be linked non-text files that the SearchEngine should be told not to parse, or sections of linked HTML documents that should be excluded. Finally, there may be filenames, acronyms, or other words that you do not want to appear in the database.

To perform the function described above with the command line application, type:

java ruptools.SearchEngine -r search.response -gw search

or

SearchEngine.exe -r search.response -gw search

The SearchEngine options

The SearchEngine has a rather lengthy, but necessary, list of options:
-f    filename   the root HTML filename (required)
-gw   filename   generate Web applet files
-lu   filename   list dependency URLs to filename
-lw   filename   list words to filename
-nt              exclude <TITLE> tagged words from database
-nh              exclude <H1..H6><CAPTION> tagged words from database
-nl              exclude <DT><LI> tagged words from database
-nb              exclude <BODY> tagged words from database
-p    filepath   intermediate data filepath
-r    filename   execute response file
-s               suppress HTML syntax error reporting
-u    url        the WWW URL equivalent of the root HTML document
-xn              exclude numbers from word list
-xu   url        exclude URL from dependency list
-xwf  filename   word exclusion HTML filename
-xwu  url        exclude URL from word list
-l               the file with language-dependent messages
-c               the character set to use when reading input; if this option is used, it must be the first option (default: the local character set)
-h               file containing text to make the output from the application language dependent

Options are separated by white space, so if a filename or URL contains a white space character, you must enclose that parameter in double quotes:

-lu   "/html/Site dependency list"
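As a rough illustration of this quoting rule, the following Python sketch splits an option string on white space while keeping a double-quoted parameter as a single value. The function name split_options is hypothetical, and the regular expression is only an assumption about the behavior described above; it is not the SearchEngine's actual parser.

```python
import re

def split_options(command_line):
    """Split on white space, keeping double-quoted parameters whole (quotes removed)."""
    # Each match is either a "quoted run" or a run of non-white-space characters.
    pairs = re.findall(r'"([^"]*)"|(\S+)', command_line)
    return [quoted or bare for quoted, bare in pairs]

print(split_options('-lu   "/html/Site dependency list" -xn'))
# prints ['-lu', '/html/Site dependency list', '-xn']
```

Without the double quotes, the same path would be split into three separate parameters.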

Dependency options

The following options control how the dependency list is constructed: 
-f   filename   the root HTML filename (required)
-u   url        the WWW URL equivalent of the root HTML document
-xu  url        exclude URL from dependency list

The resulting dependency list can be output to a file using:

-lu  filename    list dependency URLs to filename

The intermediate parsed data files are stored in the directory specified by:

-p   filepath     intermediate data filepath

If this argument is not specified, the current working directory is used.

These options are further explained in the chapter Building the dependency list.
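To make the idea of a dependency list more concrete, here is a minimal Python sketch of how such a list might be gathered: start at the root document, follow its links, and skip any URL matched by an exclusion pattern. Everything in it is an assumption made for illustration: the function build_dependency_list, the in-memory pages mapping that stands in for files on disk, the restriction to <A HREF> links, and the use of shell-style wildcards for -xu patterns. The SearchEngine's real matching rules and link handling are described in the chapter named above.

```python
from collections import deque
from fnmatch import fnmatchcase
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def build_dependency_list(root_url, pages, exclude_patterns=()):
    """Return (found, missing) URLs reachable from root_url.

    pages is a hypothetical mapping of URL -> HTML text.
    exclude_patterns plays the role of the -xu option (wildcards assumed).
    """
    seen = [root_url]
    queue = deque([root_url])
    missing = []
    while queue:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:
            missing.append(url)  # linked, but no such document
            continue
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            target = urljoin(url, href)  # resolve relative links
            if any(fnmatchcase(target, p) for p in exclude_patterns):
                continue  # excluded from the dependency list
            if target not in seen:
                seen.append(target)
                queue.append(target)
    found = [u for u in seen if u not in missing]
    return found, missing
```

Each document is visited once, so pages that link back to each other do not cause an endless loop.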

Word elimination options

The following options control how the word list is constructed:
-nt              exclude <TITLE> tagged words from database
-nh              exclude <H1..H6><CAPTION> tagged words from database
-nl              exclude <DT><LI> tagged words from database
-nb              exclude <BODY> tagged words from database
-xwf  filename   word exclusion HTML filename
-xwu  url        exclude URL from word list
-xn              exclude numbers from word list

The resulting word list can be output to a file using:

-lw   filename   list words to filename

These options are further explained in the chapter Eliminating words.
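The effect of the tag and number options can be pictured with a small Python sketch. It is not the SearchEngine's implementation. It assumes that a word belongs to the nearest enclosing tag among those named in the options, that words are runs of letters and digits compared in lower case, and that an exclusion list is simply a set of words. The names TAG_GROUPS, WordCollector, and collect_words are hypothetical.

```python
import re
from html.parser import HTMLParser

# Tags covered by each option, taken from the option descriptions.
TAG_GROUPS = {
    "-nt": {"title"},
    "-nh": {"h1", "h2", "h3", "h4", "h5", "h6", "caption"},
    "-nl": {"dt", "li"},
    "-nb": {"body"},
}
KNOWN_TAGS = set().union(*TAG_GROUPS.values())

class WordCollector(HTMLParser):
    """Collects words, skipping text owned by an excluded tag."""
    def __init__(self, excluded_tags):
        super().__init__()
        self.excluded_tags = excluded_tags
        self.stack = []   # currently open tags
        self.words = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        # Assumption: the nearest enclosing known tag "owns" the text.
        owner = next((t for t in reversed(self.stack) if t in KNOWN_TAGS), None)
        if owner in self.excluded_tags:
            return
        self.words.extend(w.lower() for w in re.findall(r"[A-Za-z0-9]+", data))

def collect_words(html, options=(), excluded_words=frozenset()):
    """Return the words of one document after applying the given options."""
    excluded_tags = set()
    for option in options:
        excluded_tags |= TAG_GROUPS.get(option, set())
    parser = WordCollector(excluded_tags)
    parser.feed(html)
    parser.close()
    words = parser.words
    if "-xn" in options:
        words = [w for w in words if not w.isdigit()]  # drop pure numbers
    return [w for w in words if w not in excluded_words]
```

For example, calling collect_words with the options -nt and -xn on a page drops its title words and any purely numeric words, while body and heading words are kept.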

Applet generation options

The following option creates the applet tag file and the search database:
-gw   filename   generate Web applet files

This option is explained in the chapter Building the applet database.

Using response files

Since the SearchEngine acts on a series of options, these options can be placed, for convenience, in one or more text files. In addition to reducing keystrokes, these files can also contain comments. The following is an extract from the response file used to build the database for this manual:


Response file for the SearchEngine manual
(where on the hard disk)
-f \www\rational\application\search\search\TOC.html

(where on the World Wide Web)
-u http://www.ruptools.com/rup/rational/application/search/search/TOC.html

Dependency exclusions:
(ignore any links to zip files, java files, and the link to my java page)
-xu *.zip
-xu */javapage.html
-xu *.java

Word count exclusions:
(ignore the search page, and table of contents)
-xwu */docsearch.html
-xwu */TOC.html

Standard word exclusion filters:
(ignore all numbers)
-xn

(standard english language exclusion list)
-xwf exclude.english.html

(specific exclusion list for the manual)
-xwf search.exclude.html

The SearchEngine parses a response file, ignoring every line whose first non-white space character is not a hyphen. Any valid SearchEngine option can appear in a response file; invalid or illegal options produce an error message.

The -r filename option itself can also appear in a response file, so that, for example, you can create standard dependency or word file exclusion filters, which can be used to generate multiple databases.

Each option and its associated parameters must appear on a single, separate line of the response file.
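The rules above can be summarized in a short Python sketch. It is an illustration under stated assumptions rather than the SearchEngine's own code: the function expand_response_file and its read_text argument (a callable that returns the text of a named file) are hypothetical, and the check against a response file that includes itself is an added safeguard that the manual does not mention.

```python
import re

def expand_response_file(name, read_text, _active=()):
    """Return a flat list of (option, parameters) pairs for a response file.

    read_text is a hypothetical callable mapping a filename to its text.
    """
    if name in _active:
        raise ValueError(f"response file includes itself: {name}")
    options = []
    for line in read_text(name).splitlines():
        stripped = line.strip()
        if not stripped.startswith("-"):
            continue  # comment or blank line: ignored
        # One option per line; double-quoted parameters stay whole.
        tokens = [q or b for q, b in re.findall(r'"([^"]*)"|(\S+)', stripped)]
        option, params = tokens[0], tokens[1:]
        if option == "-r" and params:
            # A nested response file is expanded in place.
            options.extend(expand_response_file(params[0], read_text, _active + (name,)))
        else:
            options.append((option, params))
    return options
```

A shared file of exclusion filters can then be referenced with -r from several response files, and its options appear in the flat list at the point where the -r line occurs.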

Understanding the output

The SearchEngine can generate several output files, and it writes HTML syntax error messages to the standard output device.

Command line errors

Command line errors appear on the standard output. Most errors are due to missing or incorrect options, or option parameters.

HTML syntax errors

Syntax errors appear on the standard output. The line and column of each syntax error are also provided. This is described further in The HTML parser; syntax errors.

The dependency list

The dependency list shows all document links, external document links, data links (such as images and applets), and missing links. This is described in detail in the chapter Building the dependency list.

The word list

The word list shows all parsed words in the database, after word removal and word filtering have been carried out. This is described in detail in the chapter Eliminating words.

The applet HTML file

The applet HTML file contains the SearchEngine-generated <APPLET> tag, which you can then cut and paste into your HTML document. This is described in detail in the chapter Personalizing the applet.

Copyright  © 1987 - 2001 Rational Software Corporation

