Topics
This chapter explains how the SearchEngine works, and the iterative process of
generating, and later regenerating, the final applet word database.
The purpose of the SearchEngine
The SearchEngine reads one or more HTML files, parses the words within
the markup tags, and then parses all linked HTML files. Each word
is checked for word removal and word reduction, and the resulting word list for
the HTML file is stored internally. When all the linked HTML
files have been parsed, the word database is constructed, together with the applet
tag for the HTML applet search page.
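The traversal described above can be pictured with a small sketch. It is not the SearchEngine's own code: the class name LinkTraversalSketch, the method visitOrder, and the in-memory link map are all hypothetical, and the breadth-first order is an assumption made only for illustration. It shows one way to start from a root HTML file and visit every linked file exactly once.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not part of the SearchEngine.
class LinkTraversalSketch {

    /**
     * Returns the files reachable from root, each listed once, in
     * breadth-first order. The links map stands in for the links that
     * a real parser would extract from each HTML file.
     */
    static List<String> visitOrder(String root, Map<String, List<String>> links) {
        List<String> order = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(root);
        seen.add(root);
        while (!queue.isEmpty()) {
            String page = queue.remove();
            order.add(page); // a real tool would parse the words of this file here
            for (String target : links.getOrDefault(page, Collections.emptyList())) {
                if (seen.add(target)) { // add() is false when already seen
                    queue.add(target);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("TOC.html", Arrays.asList("a.html", "b.html"));
        links.put("a.html", Arrays.asList("b.html", "TOC.html"));
        System.out.println(visitOrder("TOC.html", links));
    }
}
```

Tracking the files already seen is what keeps circular links, such as a page linking back to the table of contents, from being parsed twice.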
Though in theory this could be achieved on the first pass, in practice it is
usually an iterative process. When compiling a new database, the
parser may signal HTML syntax errors, which you may want to correct. There
may be some non-text files linked, which the SearchEngine should be told not to parse, or
sections of linked HTML documents which should be excluded. Finally, there
may be filenames, acronyms, or other words which you may not wish to have appear in the
database.
To run the command line application, type:
java ruptools.SearchEngine -r search.response -gw search
or
SearchEngine.exe -r search.response -gw search
The SearchEngine options
The SearchEngine has a rather lengthy, but necessary, list of options:
-f filename | the root HTML filename (required) |
-gw filename | generate Web applet files |
-lu filename | list dependency URLs to filename |
-lw filename | list words to filename |
-nt | exclude <TITLE> tagged words from database |
-nh | exclude <H1..H6><CAPTION> tagged words from database |
-nl | exclude <DT><LI> tagged words from database |
-nb | exclude <BODY> tagged words from database |
-p filepath | intermediate data filepath |
-r filename | execute response file |
-s | suppress HTML syntax error reporting |
-u url | the WWW URL equivalent of the root HTML document |
-xn | exclude numbers from word list |
-xu url | exclude URL from dependency list |
-xwf filename | word exclusion HTML filename |
-xwu url | exclude URL from word list |
-l | the file with language dependent messages |
-c | the character set to use when reading input; if this option is used, it has to be the first option (default is the local character set) |
-h | file containing text to make the output from the application language dependent |
Options are separated by white space, so if you have a filename, or URL
which contains a white space character, you must place that parameter in double quotes:
-lu "/html/Site dependency list"
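The quoting rule above can be sketched in a few lines of Java. This is only an illustration of the rule, not the SearchEngine's own parser: the class OptionTokenizerSketch and its tokenize method are hypothetical names, and the sketch assumes that double quotes simply group text and are dropped from the resulting value.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of the SearchEngine.
class OptionTokenizerSketch {

    /** Splits a line on white space, keeping double-quoted text together. */
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        boolean tokenStarted = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes; // quotes group text; they are not kept
                tokenStarted = true;
            } else if (Character.isWhitespace(c) && !inQuotes) {
                if (tokenStarted) {   // white space outside quotes ends a token
                    tokens.add(current.toString());
                    current.setLength(0);
                    tokenStarted = false;
                }
            } else {
                current.append(c);
                tokenStarted = true;
            }
        }
        if (tokenStarted) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [-lu, /html/Site dependency list]
        System.out.println(tokenize("-lu \"/html/Site dependency list\""));
    }
}
```

On a real command line the shell usually performs this grouping before the program starts, so a tokenizer of this kind matters mainly for option lines read from a file.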
Dependency options
The following options control how the dependency list is constructed:
-f filename |
the root HTML filename (required) |
-u url |
the WWW URL equivalent of the root HTML document |
-xu url |
exclude URL from dependency list |
The resulting dependency list can be output to a file using:
-lu filename |
list dependency URLs to filename |
The intermediate parsed data files are stored in the directory specified by:
-p filepath |
intermediate data filepath |
If this argument is not specified, the current working directory is used.
These options are further explained in the chapter Building the dependency list.
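The response file extract in this chapter passes values such as *.zip and */javapage.html to the -xu option. The sketch below shows one way such values could be matched against URLs. It rests on an assumption: that * stands for any run of characters and every other character is literal. The actual matching rules are those given in the chapter Building the dependency list and may differ. The class UrlExclusionSketch and its methods are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of the SearchEngine.
class UrlExclusionSketch {

    /** Converts a pattern such as "*.zip" to a regular expression (assumed semantics). */
    static Pattern compile(String glob) {
        StringBuilder regex = new StringBuilder();
        String[] parts = glob.split("\\*", -1); // -1 keeps trailing empty parts
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                regex.append(".*");             // each * matches any run of characters
            }
            regex.append(Pattern.quote(parts[i])); // everything else is literal
        }
        return Pattern.compile(regex.toString());
    }

    /** Returns true when the whole URL matches at least one exclusion pattern. */
    static boolean isExcluded(String url, List<String> globs) {
        for (String glob : globs) {
            if (compile(glob).matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> exclusions = Arrays.asList("*.zip", "*/javapage.html", "*.java");
        System.out.println(isExcluded("http://example.com/files/tool.zip", exclusions));
        System.out.println(isExcluded("http://example.com/index.html", exclusions));
    }
}
```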
Word elimination options
The following options control how the word list is constructed:
-nt | exclude <TITLE> tagged words from database |
-nh | exclude <H1..H6><CAPTION> tagged words from database |
-nl | exclude <DT><LI> tagged words from database |
-nb | exclude <BODY> tagged words from database |
-xwf filename | word exclusion HTML filename |
-xwu url | exclude URL from word list |
-xn | exclude numbers from word list |
The resulting word list can be output to a file using:
-lw filename | list words to filename |
These options are further explained in the chapter Eliminating words.
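As a rough picture of what word elimination does, the sketch below filters a parsed word list against an exclusion set and, optionally, drops numbers in the manner of -xn. It is a minimal sketch under stated assumptions, not the SearchEngine's implementation: the class WordFilterSketch is a hypothetical name, words are compared case-insensitively, and a "number" is taken to be a word made up entirely of digits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not part of the SearchEngine.
class WordFilterSketch {

    /** Assumption: a number is a non-empty word consisting only of digits. */
    static boolean isNumber(String word) {
        return !word.isEmpty() && word.chars().allMatch(Character::isDigit);
    }

    /**
     * Returns the words that survive elimination, in lower case.
     * excluded must hold lower-case words (assumed case-insensitive matching).
     */
    static List<String> filter(List<String> words, Set<String> excluded,
                               boolean excludeNumbers) {
        List<String> kept = new ArrayList<>();
        for (String word : words) {
            String lower = word.toLowerCase(Locale.ROOT);
            if (excluded.contains(lower)) {
                continue; // listed in an exclusion file
            }
            if (excludeNumbers && isNumber(lower)) {
                continue; // dropped in the manner of -xn
            }
            kept.add(lower);
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> excluded = new HashSet<>(Arrays.asList("the", "and"));
        List<String> words = Arrays.asList("The", "applet", "2001", "and", "H1");
        System.out.println(filter(words, excluded, true));
    }
}
```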
Applet generation options
The following option creates the applet tag file and the search database:
-gw filename | generate Web applet files |
This option is explained in the chapter Building the applet database.
Using response files
Since the SearchEngine acts on a series of options, these options can be placed,
for convenience, in one or more text files. In addition to reducing keystrokes,
these files can also contain comments. The following is an extract from the response
file used to build the database for this manual:
Response file for the SearchEngine manual
(where on the hard disk)
-f \www\rational\application\search\search\TOC.html
(where on the World Wide Web)
-u http://www.ruptools.com/rup/rational/application/search/search/TOC.html
Dependency exclusions:
(ignore any links to zip files, java files, and the link to my java page)
-xu *.zip
-xu */javapage.html
-xu *.java
Word count exclusions:
(ignore the search page, and table of contents)
-xwu */docsearch.html
-xwu */TOC.html
Standard word exclusion filters:
(ignore all numbers)
-xn
(standard english language exclusion list)
-xwf exclude.english.html
(specific exclusion list for the manual)
-xwf search.exclude.html
The SearchEngine parses a response file, ignoring all lines which do not begin
with a hyphen as the first non-white space character. Any valid SearchEngine option can
appear in a response file; invalid or illegal options produce an error message.
The -r filename option itself can also appear in a response file,
so that, for example, you can create standard dependency or word file exclusion filters,
which can be used to generate multiple databases.
Each option and its associated parameters must appear on a single, separate line of the
response file.
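The line rules above (skip any line whose first non-white space character is not a hyphen, and treat each remaining line as one option) can be sketched as follows. The class ResponseFileSketch and the method optionLines are hypothetical names used for illustration; the sketch works on lines already read into memory and does not attempt to validate options or follow nested -r entries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of the SearchEngine.
class ResponseFileSketch {

    /** Keeps only lines whose first non-white space character is a hyphen. */
    static List<String> optionLines(List<String> lines) {
        List<String> options = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.startsWith("-")) {
                options.add(trimmed); // one option and its parameters per line
            }
            // every other line is treated as a comment and ignored
        }
        return options;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Dependency exclusions:",
                "(ignore any links to zip files)",
                "  -xu *.zip",
                "Standard word exclusion filters:",
                "-xn");
        for (String option : optionLines(lines)) {
            System.out.println(option);
        }
    }
}
```

Because comment lines need no special marker, a response file can read like annotated notes, as the extract above does.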
Understanding the output
The SearchEngine can generate several output files, and it writes HTML
syntax error messages to the standard output device.
Command line errors
Command line errors appear on the standard output. Most errors are due to missing or
incorrect options, or option parameters.
HTML syntax errors
Syntax errors appear on the standard output, together with the line and column of
each error. This is described further in The HTML parser; syntax errors.
The dependency list
The dependency list shows all document links, external document links, data links (such
as images, and applets), and missing links. This is described in detail in the chapter Building the dependency list.
The word list
The word list shows all parsed words in the database, after word removal and word
filtering have been carried out. This is described in detail in the chapter Eliminating words.
The applet HTML file
The applet HTML file contains the SearchEngine-generated <APPLET> tag,
which can then be cut and pasted into your HTML document. This is described
in detail in the chapter Personalizing the applet.
Copyright © 1987 - 2001 Rational Software Corporation