Use of the following software is assumed:
WAIS provides advanced server-side search and retrieval capabilities, including support for binary datatypes and very fast searches of the entire contents of large textual databases.
Use Mosaic as a front-end client and WAIS as a back-end server and you can provide your users with a friendly yet powerful window into your information universe and sophisticated query, retrieval, and indexing capabilities.
Download and install the freeWAIS 0.202 (or later) distribution from the UNC SunSITE FTP server. Installation instructions are in the file INSTALLATION in the freeWAIS distribution.
You can place data files of any type in a WAIS database; possibilities include HTML documents, plaintext documents, GIF images, audio files, and so on. In the following example, we will assume there will at least be HTML documents in the WAIS database, and possibly other types of files as well.

Create a directory (e.g. ~/fluff) and put copies of all the files you wish to place in the database in that directory. Make sure they all have relevant extensions (e.g. ".html" for HTML documents, ".gif" for GIF images) to make life easy for you in the short term.
Create a directory (e.g. ~/localwais/sources) to hold the WAIS index file for your database. This index file will be created automatically by the WAIS indexing program, waisindex, and will be consulted by the WAIS server program, waisserver, when clients ask the WAIS database for query information or specific documents.
Create and run a shell script (call it doindex) that will index all of the files in ~/fluff and place the resulting index file in ~/localwais/sources. The following is such a shell script:
    #!/bin/csh

    # Go to the directory with the documents to be indexed.
    cd ~/fluff

    # Create index, initially with HTML documents.
    waisindex -export -d ~/localwais/sources/marc -T HTML *.html

    # Add plaintext documents to index.
    waisindex -a -d ~/localwais/sources/marc -T TEXT *.txt

    # Add PostScript documents to index -- index contents, why not?
    waisindex -a -d ~/localwais/sources/marc -T PS *.ps

    # The following types are all indexed without contents
    # (thus use of the -nocontents flag).  So all you can
    # do is search on filenames...

    # Add GIF images to index.
    waisindex -a -d ~/localwais/sources/marc -T GIF -nocontents *.gif

    # Add RGB images to index.
    waisindex -a -d ~/localwais/sources/marc -T RGB -nocontents *.rgb

    # Add HDF data files to index.
    waisindex -a -d ~/localwais/sources/marc -T HDF -nocontents *.hdf

    # Add audio files to index.
    waisindex -a -d ~/localwais/sources/marc -T AU -nocontents *.au
The script runs waisindex, the program that looks at files you are adding to a database and adds information about them to the database's index file. The information built up in that index file is used to allow very fast searches to be made across the entire contents of the files in the database.

The first call to waisindex uses the -export flag, which specifies that the database we're creating is to be made available over the network (the actual effect is to make sure that the database has a reasonable name).

Subsequent calls to waisindex use the -a flag to tell the indexer to add to an existing index rather than creating a new index. (The first call to waisindex created a new index.)

The -d ~/localwais/sources/marc arguments to waisindex tell the indexer what the name of the index should be. Since a single WAIS server can serve multiple WAIS indexes (databases), all the indexes are commonly kept in a single directory (in this case, ~/localwais/sources) and each index is given a distinct name (in this case, marc).

Each call to waisindex uses the -T flag to specify the type of the files being indexed at that time.
WAIS types have historically been ad hoc but straightforward -- TEXT for text files, GIF for GIF images, etc. Mosaic recognizes these ad hoc types using a method that the author thinks is actually pretty damn slick -- a WAIS type retrieved as the result of a query is matched to a MIME type as though it were a file extension.

In other words, since a file with extension ".text" is normally considered plaintext (MIME type text/plain) by Mosaic, a WAIS query result of WAIS type TEXT is also considered text/plain.

Similarly, if Mosaic were configured to recognize file extension ".foo" as MIME type application/x-foo, a WAIS query result of WAIS type FOO would also be considered of type application/x-foo.
(Note: At some point in the future, WAIS will start using MIME types directly. Mosaic supports this already: if a WAIS type corresponds to a MIME type that Mosaic understands, then Mosaic will recognize that and act appropriately.)
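The type matching described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Mosaic's actual code: the EXTENSION_TO_MIME table, the function name, and the fallback type are all assumptions made for the example, and the check for a "/" is just one simple way to model a WAIS type that is already a MIME type.

```python
# Hypothetical stand-in for the extension table Mosaic is configured with.
EXTENSION_TO_MIME = {
    "text": "text/plain",
    "html": "text/html",
    "gif": "image/gif",
    "foo": "application/x-foo",
}


def mime_for_wais_type(wais_type, default="application/octet-stream"):
    """Map a WAIS type to a MIME type as though it were a file extension."""
    # Assumption: a WAIS type that already looks like a MIME type is used as-is.
    if "/" in wais_type:
        return wais_type.lower()
    # Treat the WAIS type (e.g. TEXT) like the extension ".text".
    return EXTENSION_TO_MIME.get(wais_type.lower(), default)


for wais_type in ("TEXT", "GIF", "FOO", "text/html"):
    print(wais_type, "->", mime_for_wais_type(wais_type))
```

The point of the design is that nothing WAIS-specific has to be configured: teach the client a new file extension and the matching WAIS type comes along for free.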
The -nocontents flag is used while indexing binary filetypes for which it would make no sense to actually index the contents. (E.g., indexing a GIF file's binary contents would do nothing useful.) Use of the -nocontents flag means that only the filename for each file being indexed is added to the index.

waisindex can be made recursive -- files in subdirectories will be indexed also -- via the -r flag (which we don't use in this example).
To run waisserver -- the WAIS server program -- and therefore make your new index available to Mosaic clients over the network, construct and run a shell script (call it doserve) that looks like this:
    #!/bin/csh

    # Go to the directory containing the WAIS sources.
    cd ~/localwais/sources

    # Start the WAIS server in standalone mode;
    # have it use port 2010.
    waisserver -p 2010 &
The URL for connecting to the server from Mosaic is:

    wais://machine:2010/marc

In this URL, machine is the name of the system on which you are running the WAIS server. 2010 is the port you chose to run the WAIS server on, and marc is the name you gave the WAIS database. When you do a query on your new database, the resulting URL will look like this:

    wais://machine:2010/marc?query

... where query is the search string you enter.
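As a quick illustration, here is a minimal Python sketch that assembles both forms of the URL from their parts. The function name is made up for this example, and the use of quote_plus to encode the search string is an assumption; the tutorial does not say how Mosaic encodes spaces or punctuation in a query.

```python
from urllib.parse import quote_plus


def wais_url(machine, port, database, query=None):
    """Build a wais URL for a database, optionally with a search string."""
    base = f"wais://{machine}:{port}/{database}"
    if query is None:
        return base
    # Assumption: the query is form-encoded; a one-word query is unchanged.
    return f"{base}?{quote_plus(query)}"


print(wais_url("machine", 2010, "marc"))
print(wais_url("machine", 2010, "marc", "crufty"))
```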
A WAIS gateway, in this context, is a server that accepts a query from a Web client via HTTP, issues a query to a WAIS database on behalf of the client, post-processes the results of querying the WAIS database, and returns the information to the Web client (again via HTTP). The purpose of this is to provide access to WAIS databases by clients that do not speak the WAIS protocol natively.

With Mosaic 2.0 and some of the other more advanced Web clients coming along now, the rules are changing, since it is now possible to have the same client capable of accessing both the normal range of Web servers (HTTP, Gopher, FTP, NNTP) as well as WAIS servers, without requiring a gateway at any stage of the information retrieval process.
But, many Web clients still don't have native WAIS support -- two good examples are NCSA Mosaic for the Mac version 1.0 and NCSA Mosaic for Microsoft Windows version 1.0. Those clients still must go through a WAIS gateway, as must any instance of Mosaic for X version 2.0 that isn't compiled with native WAIS support.
The big catch here is that, at the present time, the WAIS gateways available on the network don't do a good job of providing full access to WAIS databases. In particular, access to anything other than plain text files is likely not to work, and multiformat query responses (see below) will not work.
The solution is to write a better WAIS gateway, probably based on the native WAIS support in Mosaic 2.0. We'll probably do that at some point, but it isn't done yet (that I know of).

So what do you do if you want to provide WAIS databases to people using various Web clients, some of which don't support native WAIS?
Web clients without native WAIS access should be set up to automatically use one of the public WAIS gateways (probably either NCSA's or CERN's) to handle wais URLs.

Mosaic for X version 1.2 and earlier did not do this properly, for which we are ashamed, but Mosaic for X 2.0 will do this properly if it's not compiled with direct WAIS support.

What this means is that a wais URL that looks like the following:

    wais://cnidr.org:210/directory-of-servers

... should be automatically converted to a URL that looks something like the following:

    http://www.ncsa.uiuc.edu:8001/cnidr.org:210/directory-of-servers

Note that www.ncsa.uiuc.edu:8001 is the address of the public NCSA WAIS gateway; everything after the first single slash in this URL is exactly the same as in the original wais URL. This should give the gateway all the information it needs to access the specified WAIS database and provide the non-native-WAIS client with the equivalent of direct access, with a minor performance hit.
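The conversion just described is simple enough to sketch in Python. The function and constant names are invented for this example; the gateway address is the NCSA one given above, and nothing here is taken from Mosaic's own source.

```python
# Address of the public NCSA WAIS gateway, as given in the text above.
NCSA_GATEWAY = "www.ncsa.uiuc.edu:8001"


def wais_to_gateway(url, gateway=NCSA_GATEWAY):
    """Rewrite a wais URL as an http URL that goes through a WAIS gateway."""
    prefix = "wais://"
    if not url.startswith(prefix):
        raise ValueError(f"not a wais URL: {url!r}")
    # Everything after "wais://" is carried over unchanged after the gateway.
    return f"http://{gateway}/{url[len(prefix):]}"


print(wais_to_gateway("wais://cnidr.org:210/directory-of-servers"))
```

Because the rest of the URL is passed through untouched, a query string on the original wais URL survives the rewrite as well.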
So, that's a stopgap solution that will provide transparent access to at least text files in WAIS databases by a wide range of Web clients.

One final note: If you happen to be using a Web client that is lacking both native WAIS support and the ability to automatically feed wais URLs through a gateway, your remaining option is to explicitly use the http form of wais URLs as shown above. This is not a good solution and hopefully it will stop being necessary in the very near future.
A big problem here is that, using WAIS as it currently exists as the search and retrieval engine for existing sets of HTML documents, any and all relative links and relative pointers to inlined images in all indexed HTML documents will break.
Why is this? Well, when you retrieve an HTML document from a WAIS server, the URL corresponding to that document will be an encoded WAIS "docid", or document identifier. This docid is not the same thing as the path and filename of the file that you're retrieving. (In fact, it looks like a horribly mangled stream of random and spurious bytes -- its structure and meaning are definitely not transparent at the user level.)
So, when an HTML document contains a relative link or inlined image pointer, the document is pulled over via WAIS, and Mosaic tries to resolve the relative link into an absolute URL by combining it with the URL for the current document ... -- well, it just don't work.
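To make the failure concrete, here is a small Python sketch using the standard urljoin function in place of Mosaic's own resolver. Both base URLs are hypothetical, and the docid in particular is a made-up placeholder; a real docid is an opaque encoded string.

```python
from urllib.parse import urljoin

relative_link = "images/logo.gif"

# Hypothetical base URLs for the same document fetched two different ways.
http_base = "http://machine/docs/page.html"
wais_base = "wais://machine:2010/marc/opaque-docid-bytes"

# Against the http URL, the link resolves next to the document on disk.
http_result = urljoin(http_base, relative_link)

# Against the docid URL, the result is assembled from pieces of the docid
# and is not an identifier the WAIS server ever handed out.
wais_result = urljoin(wais_base, relative_link)

print(http_result)
print(wais_result)
```

The http case works only because the URL path mirrors the directory layout; a docid carries no such layout, so there is nothing sensible for the relative link to be relative to.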
One near-term but generally undesirable solution is to always use absolute URLs for hyperlinks and inlined images in all HTML documents on your server.
The real solution is for HTTP servers (which, of course, commonly use URLs that correspond exactly to directory and file names and therefore allow relative links to freely work) to use WAIS as a search engine only -- and to make sure that URLs given to browsers as the results of searches are exactly normal http URLs.
This is completely technically possible and will be more and more common in the very near future. An experimental WAIS back-end interface that provides this functionality is known to exist for Plexus, and either that interface or something similar will eventually be made available for NCSA httpd (and presumably other HTTP servers). I'll attempt to stay up to date on the progress of these efforts and roll the results of ongoing work into this tutorial.

One more thing: WAIS is evolving towards greater separation of indexing and retrieval. It should eventually be possible to have WAIS itself return arbitrary URLs (matching, say, the actual directory and file names of files it indexes), which would allow relative links to work. This is an intriguing idea because it would mean that you could potentially run an entire standard Web server entirely with a single WAIS server.
(See experimental information on integrating WAIS and HTTP servers.)
This is a useful capability if, for example, you have a set of images, each of which has a corresponding text description. You can set up your WAIS database in such a way that the text descriptions are searched, but appropriate images are given to the user as a result of successful search hits in the text descriptions.

The following describes how to set up a WAIS server to return multiformat responses. We'll assume you're using the doindex script and directory structures as given in the examples above.
Create a directory called ~/multifluff. This is where you'll put all files to be indexed with WAIS's multiple format support.
A condition of freeWAIS's multiformat support is that the various files follow certain file name and extension conventions very closely.

We'll assume, for this example, that you have a set of text files; each text file has either an associated GIF image, an associated PostScript document, or both a GIF image and a PostScript document.

You will give all the text files the extension ".TEXT", all the GIF files the extension ".GIF", and all the PostScript files ".PS". Note use of uppercase.

It is assumed that related files have the same name, with the exception of the extension -- in other words, "foobar.TEXT" and "foobar.GIF" will be considered to be related. "blargh.TEXT" and "blorf.GIF", however, will not.

Place the various text files and associated GIF and PostScript files in ~/multifluff. Be sure they have appropriate filenames and extensions, as described above: filenames match for related files; extensions are ".TEXT", ".GIF", and ".PS".
Add the following lines to the end of your doindex script:

    # Go to the directory containing the files in multiple formats.
    cd ~/multifluff

    # Index *.TEXT and associate *.GIF and *.PS.
    waisindex -a -d ~/localwais/sources/marc -T TEXT -M TEXT,GIF,PS *.TEXT
Note the -M argument to waisindex: the types in the comma-separated list following -M are used by the indexer to determine how to tie different files in ~/multifluff together. A given query will be able to return a matching TEXT file as well as an associated GIF image (if one exists with the same filename and extension ".GIF") and an associated PS document (if one exists with the same filename and extension ".PS").
Example: Here's an example set of files that you might place in ~/multifluff:

    crufty.GIF
    crufty.TEXT
    maybe-marc.GIF
    maybe-marc.PS
    maybe-marc.TEXT
    tarot.PS
    tarot.TEXT

After you index these files as described above, a query on "crufty" should return a hit corresponding to "crufty.TEXT". When you access that hit, Mosaic should tell you that you are at a "Multiple Format Opportunity" and present you with a menu from which you can choose TEXT, GIF, or PS.
Works for me! :-)
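If you want to check your filenames before indexing, the naming convention can be sketched in Python as below. This only models the "same name, different extension" rule from this section; it is not how freeWAIS itself is implemented, and the function name is made up for the example.

```python
import os
from collections import defaultdict


def group_by_name(filenames, types=("TEXT", "GIF", "PS")):
    """Group files that share a name and differ only in a known extension."""
    groups = defaultdict(list)
    for filename in filenames:
        stem, ext = os.path.splitext(filename)
        file_type = ext.lstrip(".")
        # Extensions are case-sensitive here, matching the uppercase rule.
        if file_type in types:
            groups[stem].append(file_type)
    # List each group's formats in the same order as the types argument.
    return {stem: sorted(found, key=types.index) for stem, found in groups.items()}


files = [
    "crufty.GIF", "crufty.TEXT",
    "maybe-marc.GIF", "maybe-marc.PS", "maybe-marc.TEXT",
    "tarot.PS", "tarot.TEXT",
]
for stem, formats in group_by_name(files).items():
    print(stem, formats)
```

A name that shows up without TEXT in its list would have nothing to be searched on, since only the *.TEXT files are handed to waisindex in the script above.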
Important Note: Mosaic for X version 2.0 compiled with direct WAIS support is the only Web client known to actually handle multiformat responses. The modifications we made to the common Web library WAIS code to make this happen should be easy to roll into other clients, but to our knowledge no one has yet done so, and certainly no gateways will be able to handle multiformat responses.
However, this is a quite powerful capability, and if you are able to assume use of Mosaic for X, we certainly suggest you give it a shot and see if it works for you.
waisserver
.
Send mail to mosaic-x@ncsa.uiuc.edu and we'll try to help you.

You can also post questions to the Usenet newsgroup comp.infosystems.www, which the Mosaic authors read.
comp.infosystems.wais
.