SearchEngine: Frequently Asked Questions

Topics

The FAQ index
Questions about the SearchEngine
Questions about the Search applet

This chapter lists common problems encountered when using the SearchEngine, together with their solutions. The questions are divided into two categories: the SearchEngine and the Search applet.

The FAQ index

SearchEngine: files are not being excluded: The SearchEngine is reading files excluded with the -xu flag.
SearchEngine: tags or tag attributes are being stored in the database: The SearchEngine is storing words which look suspiciously like tags or tag attributes.
SearchEngine: keywords in titles and headers are missing: The SearchEngine is not storing words which appear in HTML tags like <TITLE>, <H1..H6>, etc.
SearchEngine: runs fine for a while, then slows down: The SearchEngine parses the first few hundred files, then slows down and starts thrashing the hard disk (accessing it continuously).
SearchEngine: stops with an OutOfMemoryException: The SearchEngine parses the first few hundred files, then displays a long list of error messages, starting with OutOfMemoryException.
SearchEngine: stops with a 'Too many files for the search applet database' message: The SearchEngine parses many hundreds of files, then displays a 'Too many files for the search applet database' message.
Applet: Search button remains gray, or an error message appears: The applet starts up, but after a few seconds, the search button appears grayed out, or an error message is displayed.
Applet: Clicking on a title causes the browser to issue a 'document not found' error: When the user double-clicks on a found document title, the browser issues a 'document not found' error message instead of opening the document.

Questions about the SearchEngine

Files are not being excluded

The SearchEngine is reading files excluded with the -xu flag.

Take care when using the wildcard character '*'.

The wildcard character '*' can appear at the start of the URL, at the end of the URL, or both. Anywhere else, it is treated as an ordinary character.
No other uses of the wildcard character '*' are valid. A filter definition of */extawt/*remove.* results in a (probably useless) filter that ignores all URLs containing the literal text /extawt/*remove., rather than the probable intention of ignoring all URLs that contain both /extawt/ and remove.
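The wildcard rule above can be sketched in a few lines of Java. This is only an illustration of the behavior described in this FAQ, not the SearchEngine source: the class and method names are hypothetical, and treating a filter without any wildcard as an exact match is an assumption.

```java
// Hypothetical sketch of the -xu wildcard rule; not the SearchEngine implementation.
class WildcardFilterSketch {

    /** Returns true if the URL would be excluded by the given filter. */
    static boolean matches(String filter, String url) {
        if (filter.equals("*")) {
            return true; // a lone wildcard matches everything
        }
        boolean leading = filter.startsWith("*");
        boolean trailing = filter.endsWith("*");
        // Only a leading and/or trailing '*' is a wildcard; any other '*' stays literal.
        String core = filter.substring(leading ? 1 : 0,
                                       filter.length() - (trailing ? 1 : 0));
        if (leading && trailing) {
            return url.contains(core);
        }
        if (leading) {
            return url.endsWith(core);
        }
        if (trailing) {
            return url.startsWith(core);
        }
        return url.equals(core); // assumption: no wildcard means exact match
    }

    public static void main(String[] args) {
        // The inner '*' is literal, so an ordinary URL does not match.
        System.out.println(matches("*/extawt/*remove.*", "a/extawt/x/remove.html")); // prints false
        // Only a URL that really contains "/extawt/*remove." matches.
        System.out.println(matches("*/extawt/*remove.*", "a/extawt/*remove.html"));  // prints true
    }
}
```

Under these assumptions, excluding URLs that contain both /extawt/ and remove. is not expressible as a single filter.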

The SearchEngine treats URLs as case-sensitive when filtering.

Some operating systems (such as Windows) treat file names as case-insensitive, but the SearchEngine does not. If, for example, the filter

-xu *.zip

is used, then all files ending in .zip are excluded, but files ending in .ZIP are not. Use both lower case and upper case to filter file extensions:

-xu *.zip
-xu *.ZIP
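When many extensions need to be excluded, the paired filters can be generated rather than typed by hand. The following is a minimal sketch; the class and method names are hypothetical, and it covers only the all-lower-case and all-upper-case spellings, so a mixed-case extension such as .Zip would still need its own filter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper that builds -xu arguments in both cases for each extension.
class ExtensionFilterSketch {

    static List<String> excludeArgs(String... extensions) {
        List<String> args = new ArrayList<>();
        for (String ext : extensions) {
            args.add("-xu");
            args.add("*." + ext.toLowerCase(Locale.ROOT));
            args.add("-xu");
            args.add("*." + ext.toUpperCase(Locale.ROOT));
        }
        return args;
    }

    public static void main(String[] args) {
        // prints: -xu *.zip -xu *.ZIP -xu *.pdf -xu *.PDF
        System.out.println(String.join(" ", excludeArgs("zip", "pdf")));
    }
}
```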

Tags or tag attributes are being stored in the database

The SearchEngine is storing words which look suspiciously like tags or tag attributes.

The HTML documents may indeed contain the tag keywords as text, for example if the document is about HTML: Check the documents for the offending keywords, and determine whether or not they are inside HTML markup. Watch out for incorrectly formed comment syntax.
The HTML document may have syntax errors that caused the SearchEngine to store the words in the body, or to ignore them completely: Check the documents for the offending keywords, and ensure that they are inside the correct HTML markup. Watch out for incorrectly formed comment syntax.
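One kind of incorrectly formed comment can be found mechanically: an opening <!-- with no closing --> after it. The following is a rough sketch of such a check. It is not part of the SearchEngine, the names are hypothetical, and it does not attempt to validate HTML in general.

```java
// Hypothetical checker for unterminated HTML comments; a sketch, not an HTML validator.
class CommentCheckSketch {

    /** Returns the offset of the first "<!--" that has no later "-->", or -1 if every comment closes. */
    static int firstUnterminatedComment(String html) {
        int pos = 0;
        while (true) {
            int open = html.indexOf("<!--", pos);
            if (open < 0) {
                return -1; // no further comments
            }
            int close = html.indexOf("-->", open + 4);
            if (close < 0) {
                return open; // this comment never ends
            }
            pos = close + 3; // continue scanning after the closed comment
        }
    }

    public static void main(String[] args) {
        String good = "<title>T</title><!-- note --><h1>Heading</h1>";
        String bad = "<title>T</title><!-- note <h1>Heading</h1>";
        System.out.println(firstUnterminatedComment(good)); // prints -1
        System.out.println(firstUnterminatedComment(bad));  // prints 16
    }
}
```

In the second string everything after offset 16 is swallowed by the open comment, which is one way a heading keyword could go missing from the database.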

Keywords in titles and headers are missing

The SearchEngine is not storing words which appear in HTML tags like <TITLE>, <H1..H6>, etc.

The HTML document may have syntax errors that caused the SearchEngine to store the words in the body, or to ignore them completely: Check the documents for the offending keywords, and ensure that they are inside the correct HTML markup. Watch out for incorrectly formed comment syntax.

Runs fine for a while, then slows down

The SearchEngine parses the first few hundred files, then slows down and starts thrashing the hard disk (accessing it continuously).

The SearchEngine is running out of virtual memory.

The SearchEngine requires virtual memory of about 1.5 to 2.0 times the size of the documents being parsed. If, say, you have 9 MB of documents, then you will require about 13.5 to 18 MB of virtual memory.

Start the Java interpreter with as much virtual memory as needed using the -mx switch (the default is 16 MB):

java -mx24m ruptools.SearchEngine ...
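The 1.5 to 2.0 rule of thumb can be turned into a quick calculation of a suitable -mx value. The following is a minimal sketch: the class and method names are hypothetical, and rounding up to a whole megabyte and choosing the upper estimate for the switch are assumptions.

```java
// Hypothetical calculator for the virtual memory rule of thumb in this FAQ.
class HeapEstimateSketch {

    static final double LOW_FACTOR = 1.5;  // lower bound stated in the FAQ
    static final double HIGH_FACTOR = 2.0; // upper bound stated in the FAQ

    static long lowEstimateMb(long documentMb) {
        return (long) Math.ceil(documentMb * LOW_FACTOR);
    }

    static long highEstimateMb(long documentMb) {
        return (long) Math.ceil(documentMb * HIGH_FACTOR);
    }

    /** Builds the -mx switch from the upper estimate (an assumption, to stay on the safe side). */
    static String mxSwitch(long documentMb) {
        return "-mx" + highEstimateMb(documentMb) + "m";
    }

    public static void main(String[] args) {
        long documentMb = 10;
        // prints: 15 to 20 MB, for example: java -mx20m ruptools.SearchEngine ...
        System.out.println(lowEstimateMb(documentMb) + " to " + highEstimateMb(documentMb)
                + " MB, for example: java " + mxSwitch(documentMb) + " ruptools.SearchEngine ...");
    }
}
```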

Not enough virtual memory.

Possible solutions are:

Split the files up into sub-groups, and create a database for each.
Remove word groups with the -nb, -nl, and -nh flags (in that order).
Do both: a restricted global search, with complete sub-searches.
Increase the word exclusion list (english.exclude.html is very generic).

Stops with an OutOfMemoryException

The SearchEngine parses the first few hundred files, then displays a long list of error messages, starting with OutOfMemoryException.

The SearchEngine ran out of virtual memory.

Start the Java interpreter with as much virtual memory as needed using the -mx switch (the default is 16 MB):

java -mx24m ruptools.SearchEngine

Not enough virtual memory.

Possible solutions are:

Split the files up into sub-groups, and create a database for each.
Remove word groups with the -nb, -nl, and -nh flags (in that order).
Do both: a restricted global search, with complete sub-searches.
Increase the word exclusion list (english.exclude.html is very generic).

Stops with a 'Too many files for the search applet database' message

The SearchEngine parses many hundreds of files, then displays a 'Too many files for the search applet database' message.

The SearchEngine exceeded the maximum number of files that the applet database can hold: The applet database can hold information on a maximum of 4096 HTML documents.
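It can save time to count the documents before compiling the database. The sketch below walks a directory tree and compares the count with the 4096 limit. The class and method names are hypothetical, and counting only files ending in .htm or .html is an assumption; the SearchEngine may select files differently, for example through its exclusion filters.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.stream.Stream;

// Hypothetical pre-flight check against the 4096 document limit stated in this FAQ.
class DocumentCountSketch {

    static final int MAX_DOCUMENTS = 4096;

    /** Assumption: a document is any file whose name ends in .htm or .html, in any case. */
    static boolean looksLikeHtml(Path file) {
        String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
        return name.endsWith(".htm") || name.endsWith(".html");
    }

    static boolean exceedsLimit(long count) {
        return count > MAX_DOCUMENTS;
    }

    static long countHtml(Path root) throws IOException {
        try (Stream<Path> files = Files.walk(root)) {
            return files.filter(Files::isRegularFile)
                        .filter(DocumentCountSketch::looksLikeHtml)
                        .count();
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        long count = countHtml(root);
        System.out.println(count + " HTML files under " + root);
        if (exceedsLimit(count)) {
            System.out.println("More than " + MAX_DOCUMENTS + " documents: too many for one applet database.");
        }
    }
}
```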

Questions about the Search applet

Search button remains gray, or an error message appears

The applet starts up, but after a few seconds, the search button appears grayed out, or an error message is displayed.

The cause of this problem is that the applet failed to find or load the database.

Check that the file path is correct.

The applet looks in the path formed from the codebase plus the database parameter value. Supposing the applet definition is:

<applet codebase="../classes" archive="Search.zip"
 code="ruptools.Search.class" width=100 height=20>
<param name=database value="docsearch">
</applet>

and assuming the applet file is in the /search directory, then the applet will look for the file at /search/../classes/docsearch.ws or, when reduced, /classes/docsearch.ws.
If this is not the correct location of the database file, then either copy the database to that location, or change the database parameter value.
Remember that the database file must lie within the codebase path of the applet; otherwise some browsers may refuse access to the file, causing the applet to fail.
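The path reduction described above can be reproduced with java.net.URI, which is handy for checking where the applet will look before deploying. This is a sketch under stated assumptions: the page name index.htm is hypothetical, the codebase value ../classes is taken from the path /search/../classes/docsearch.ws given in the text, and the class and method names are illustrative only.

```java
import java.net.URI;

// Hypothetical reconstruction of how the database location is derived; not the applet source.
class DatabasePathSketch {

    /** Resolves codebase against the applet page, then appends the database name plus ".ws". */
    static URI databaseLocation(String appletPage, String codebase, String database) {
        // A codebase is a directory, so make sure it ends with a slash before resolving.
        String dir = (codebase.isEmpty() || codebase.endsWith("/")) ? codebase : codebase + "/";
        return URI.create(appletPage).resolve(dir).resolve(database + ".ws");
    }

    public static void main(String[] args) {
        // prints: /classes/docsearch.ws
        System.out.println(databaseLocation("/search/index.htm", "../classes", "docsearch"));
    }
}
```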

Check the file path for spelling.

On some operating systems (Windows), file names are case-insensitive, whilst on others (Unix) they are case-sensitive. Ensure that the codebase path and database parameter path have the same case as the directories and filename. The database file extension is .ws, in lower case.

Check that the file path is within the codebase path.

As when checking the file path, ensure that the reduced file path lies in the codebase directory or in one of its child directories; otherwise some browsers may refuse access to the file, causing the applet to fail.
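The containment rule can be checked by normalizing both paths and comparing prefixes. The following is a minimal sketch with hypothetical names and example paths; it illustrates the rule as stated here and does not reproduce any particular browser's security check.

```java
import java.net.URI;

// Hypothetical check that a database file lies in the codebase directory or below it.
class CodebaseCheckSketch {

    static boolean isInsideCodebase(String codebaseDir, String databaseFile) {
        String base = URI.create(codebaseDir).normalize().getPath();
        String file = URI.create(databaseFile).normalize().getPath();
        if (!base.endsWith("/")) {
            base = base + "/"; // compare whole directory names, not partial ones
        }
        return file.startsWith(base);
    }

    public static void main(String[] args) {
        System.out.println(isInsideCodebase("/classes/", "/classes/db/docsearch.ws"));      // prints true
        System.out.println(isInsideCodebase("/classes/", "/classes/../data/docsearch.ws")); // prints false
    }
}
```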

Check the database file.

The database file may have become corrupt, or may have been replaced. Recompile the database, copy the file into place, then try running the applet again in the browser or appletviewer.

Clicking on a title causes the browser to issue a 'document not found' error

When the user double-clicks on a found document title, the browser issues a 'document not found' error message instead of opening the document.

The path parameter is probably wrong or missing.

The path parameter is used to correct the database document URL with respect to the search applet HTML file URL.
If, for example, when compiling the database the root file is specified as:

-f /rational/application/search/doc/index.htm

and the root URL as:

-u http://www.ruptools.com/rup/rational/application/search/doc/index.htm

then the root file URL will be stored in the database as:

rational/application/search/doc/index.htm

which corresponds to the identical path in both options:

-f rational/application/search/doc/index.htm
-u http://www.ruptools.com/rup/rational/application/search/doc/index.htm

If we now suppose the search applet HTML file to be at:

/rational/application/search/doc/docsearch.htm

for the local file, or

http://www.ruptools.com/rup/rational/application/search/doc/docsearch.htm

for the Internet URL, then we need to correct the document URL references in the applet database file to move back four directories:

<param name=path value="../../../../">

Now, when the user clicks on a link, the browser will construct the URL as follows:

/rational/application/search/doc/
../../../../rational/application/search/doc/index.htm

for the local file, or

http://www.ruptools.com/rup/rational/application/search/doc/
../../../../rational/application/search/doc/index.htm

for the Internet URL, which reduces to:

/rational/application/search/doc/index.htm

for the local file, or

http://www.ruptools.com/rup/rational/application/search/doc/index.htm

for the Internet URL.
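The same reduction can be reproduced with java.net.URI to confirm a path parameter before publishing. This is a sketch under assumptions: the class and method names are hypothetical, and it assumes the applet simply prepends the path value to the stored document URL and resolves the result against the applet page, as the walkthrough above suggests.

```java
import java.net.URI;

// Hypothetical reconstruction of the path parameter arithmetic; not the applet source.
class PathParamSketch {

    /** One "../" per directory between the database root and the applet page. */
    static String pathParam(String appletPageRelativeToRoot) {
        StringBuilder sb = new StringBuilder();
        for (char c : appletPageRelativeToRoot.toCharArray()) {
            if (c == '/') {
                sb.append("../");
            }
        }
        return sb.toString();
    }

    /** Resolves path value plus stored document URL against the applet page URL. */
    static URI documentUrl(String appletPageUrl, String pathParam, String storedUrl) {
        return URI.create(appletPageUrl).resolve(pathParam + storedUrl);
    }

    public static void main(String[] args) {
        String path = pathParam("rational/application/search/doc/docsearch.htm");
        System.out.println(path); // prints ../../../../
        // prints: http://www.ruptools.com/rup/rational/application/search/doc/index.htm
        System.out.println(documentUrl(
                "http://www.ruptools.com/rup/rational/application/search/doc/docsearch.htm",
                path,
                "rational/application/search/doc/index.htm"));
    }
}
```

The applet page sits four directories below the root used when compiling the database, which is why the path value contains four ../ segments.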

Rational Unified Process