Search Engine Back End For a Real-Time Filesharing Infrastructure

Ian Hall-Beyer, LANgistiX

Preliminary Draft, April 16, 2000

Abstract

The recent introduction of Gnutella to the internet community has provided us with but a glimpse of what is to come in terms of real-time search technology. Many problems plague its current incarnation, but since it is in practice a technology demonstrator, it gives us a jumping-off point to improve on the concept and simultaneously provide a solid real-time search infrastructure. The key to making this work is a robust, reliable, and extensible way of providing results and access to the data.

The main objective of this design is to make the process of searching for data and documents as transparent and consistent to the end user as possible, regardless of what type of indexing system the target host uses. The secondary objective is to make the back-end as flexible as possible, so that content providers may use the most efficient method of providing data to the end-user, while protecting data that should not be made public.

Server-Side

Database/Indexer

The first key component of the system is the database itself. An indexer, which actually goes out and digs through the files, populates this database with pointers and descriptors. The indexer should be configurable to overlook specific files or directory structures, and to flag certain entities as requiring authentication, along with the method and ACL to use. In populating the database, it assigns each document a locally unique identifier and some form of content hash. The main objective of indexing documents ahead of time is to reduce system load each time a search comes in.
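
As a rough sketch of the shape such an indexer might take, consider the following Python; the SQLite store, the SHA-1 hash, and the exclusion list are illustrative choices of mine, not requirements of the design:

    import hashlib
    import os
    import sqlite3
    import uuid

    EXCLUDED_DIRS = {".private", "tmp"}   # hypothetical exclusion configuration

    def index_tree(root, db_path="index.db"):
        """Walk a directory tree and populate the pointer/descriptor database."""
        db = sqlite3.connect(db_path)
        db.execute(
            "CREATE TABLE IF NOT EXISTS documents ("
            "doc_id TEXT PRIMARY KEY, "   # locally unique document identifier
            "path TEXT, descriptor TEXT, "
            "hash TEXT, "                 # content hash, used later to group duplicates
            "needs_auth INTEGER)"         # flag for entities requiring authentication
        )
        for dirpath, dirnames, filenames in os.walk(root):
            # honor the configured exclusions for directory structures
            dirnames[:] = [d for d in dirnames if d not in EXCLUDED_DIRS]
            for name in filenames:
                path = os.path.join(dirpath, name)
                with open(path, "rb") as f:
                    digest = hashlib.sha1(f.read()).hexdigest()
                db.execute(
                    "INSERT INTO documents VALUES (?, ?, ?, ?, ?)",
                    (uuid.uuid4().hex, path, name, digest, 0),
                )
        db.commit()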

It should be noted here that the indexer/database combination could also be something OS-based, such as the Indexing Service in Windows 2000 or Sherlock under Mac OS.

Search Engine

Once populated, the database is accessible to the actual search engine. Using something standards-based such as LDAP seems like the best course of action at this point, but with proper design this could realistically be any database, and could evolve over time. Internal to the search engine are multiple parsers for the various query types; one example is a boolean query of the form boolean(rhubarb AND strawberry).
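
A minimal sketch of how the engine might dispatch on query type follows; the parser functions and the keyword fallback are assumptions of mine, not part of the draft:

    def parse_boolean(body):
        # split on AND only; a real parser would handle OR, NOT, and nesting
        return [term.strip() for term in body.split(" AND ")]

    def parse_keyword(body):
        return body.split()

    PARSERS = {"boolean": parse_boolean, "keyword": parse_keyword}

    def parse_query(raw):
        """Dispatch a typed query such as 'boolean(rhubarb AND strawberry)'
        to the matching parser; untyped text falls back to keyword search."""
        if "(" in raw and raw.endswith(")"):
            qtype, body = raw.split("(", 1)
            if qtype in PARSERS:
                return PARSERS[qtype](body[:-1])
        return parse_keyword(raw)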

Once a search returns results, the matches are put into an XML document that is served up by the HTTP server in the system. The mechanics of what happens next are discussed below.
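
The draft does not fix a schema for this document, so the element and attribute names in the following sketch are illustrative only:

    import xml.etree.ElementTree as ET

    def build_results_doc(matches):
        """Serialize matches, given as (doc_id, digest, descriptor) tuples,
        into the XML results document the HTTP server will serve."""
        root = ET.Element("results")
        for doc_id, digest, descriptor in matches:
            item = ET.SubElement(root, "match", id=doc_id, hash=digest)
            item.text = descriptor
        return ET.tostring(root, encoding="unicode")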

Communications Layer

This piece handles all communication between the various hosts on the network; it receives search queries and passes them on to the search engine layer. The actual mechanics of the communication protocol are beyond the scope of this document and will not be discussed.

HTTP Server

The internal HTTP server handles all of the document exchange for the system, from serving up query results to supplying files and documents to the requestor.



How A Search Works

The requesting client's user is searching for a document, such as a paper on ecological pest management. They enter a search query and select one of the query types described above. The client then assigns a locally unique identifier to the search and sends it out through the network, attached to a globally unique identifier for the requesting system. When a system receives the search, it passes it on to the search engine layer, which performs the query.
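
As a sketch of the client side of this step; the field names are my own, and a production client would use the wire format of the underlying protocol rather than a dictionary:

    import itertools
    import uuid

    NODE_GUID = uuid.uuid4().hex        # globally unique identifier for this system
    _search_ids = itertools.count(1)    # source of locally unique search identifiers

    def make_query(text, qtype="boolean"):
        """Attach a locally unique search ID and the node GUID to an outgoing query."""
        return {
            "guid": NODE_GUID,
            "search_id": next(_search_ids),
            "query": f"{qtype}({text})",
        }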

Upon finding a match, the search engine dumps all matches, with their descriptors, hashes, and identifiers, into an XML document that is then served up by the HTTP server. A packet is sent back to the requesting client, telling it that the search was successful and that the results are available at:

http://10.11.12.13:9362/results/GUID/SearchID.xml

The requesting system then retrieves this document (in theory, this can be done securely) and parses it.
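
Continuing the sketch, retrieval and parsing on the requesting side might look like the following, reusing the illustrative element names from the earlier example (plain HTTP is shown, though HTTPS would work the same way):

    import urllib.request
    import xml.etree.ElementTree as ET

    def fetch_results(host, port, guid, search_id):
        """Retrieve and parse the results document advertised in the reply packet."""
        url = f"http://{host}:{port}/results/{guid}/{search_id}.xml"
        with urllib.request.urlopen(url) as resp:
            doc = ET.parse(resp).getroot()
        # return (doc_id, hash, descriptor) for each match element
        return [(m.get("id"), m.get("hash"), m.text) for m in doc.findall("match")]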

Since the requestor's client assigns a unique search ID to each query, the XML documents that result from it can be cached by the client for recall at a later date. As the client retrieves these XML documents from the responding systems, it parses them and assembles them into a page that is displayed to the user. The benefit of using hashes here is that identical hashes can be grouped together, presenting the user with a single entry and a list of the multiple locations from which a particular file or document can be obtained. Alternately, these locations can be kept hidden, and the client will attempt to get the file from the best available source. The client may also group files by type for ease of use.
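
Grouping by hash on the client could be as simple as the following sketch, in which each match is assumed to have been tagged with the host it came from:

    from collections import defaultdict

    def group_by_hash(matches):
        """Group results from many hosts by content hash, so that one file
        offered by several hosts appears once with multiple locations."""
        groups = defaultdict(list)
        for host, doc_id, digest, descriptor in matches:
            groups[digest].append((host, doc_id, descriptor))
        return groups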

When the user selects a file, the request is passed on to the system that hosts it via HTTP or HTTPS, in a form resembling:

http://10.15.192.63:9362/documents/DocumentID

The server then returns the file with the appropriate MIME information. The user's client processes the file according to its MIME type: rendering it, displaying it, saving it, streaming it, and so on. If the document is actually a page on the target host's main web server, the HTTP server in the search mechanism can simply issue a redirect to that document.
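
A minimal sketch of the document-serving side, assuming a hypothetical doc_id-to-path mapping drawn from the index database:

    import mimetypes
    from http.server import BaseHTTPRequestHandler, HTTPServer

    DOCUMENTS = {"42": "/srv/share/pest-management.pdf"}  # hypothetical doc_id -> path map

    class DocumentHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # expect paths of the form /documents/DocumentID
            doc_id = self.path.rsplit("/", 1)[-1]
            path = DOCUMENTS.get(doc_id)
            if path is None:
                self.send_error(404)
                return
            ctype = mimetypes.guess_type(path)[0] or "application/octet-stream"
            with open(path, "rb") as f:
                body = f.read()
            self.send_response(200)
            self.send_header("Content-Type", ctype)   # the client dispatches on this
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    HTTPServer(("", 9362), DocumentHandler).serve_forever()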


Existing infrastructure

In response to some questions I've received: it is theoretically possible to make this work on top of the existing Gnutella protocol with a few slightly unorthodox methods. The first is to encapsulate the typed search in the existing search query field, in a form such as boolean(rhubarb AND strawberry), and then return the XML document containing the search results as a single result field.
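
A sketch of that encapsulation follows; the regular expression and the keyword fallback are my assumptions about how hosts unaware of the convention would treat the field:

    import re

    TYPED = re.compile(r"^(\w+)\((.*)\)$")

    def wrap_query(qtype, body):
        """Encapsulate a typed query in the ordinary Gnutella search field."""
        return f"{qtype}({body})"

    def unwrap_query(field):
        """Recover (type, body); hosts unaware of the convention simply
        see a normal keyword string and match it as such."""
        m = TYPED.match(field)
        return (m.group(1), m.group(2)) if m else ("keyword", field)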

Design Considerations

This design is intended to fulfill the following objectives:

Scalability

This system can represent anything from a lightweight server embedded in a standard client up to large database clusters.

Enhanced Searching

This design tries to maximize users' ability to find what they are looking for by offering a variety of query types, satisfying everyone from the casual user to the power user.

Reduced load on the network

Because only a small result packet is sent back over the network, overall load on the intercommunication layer is greatly reduced. Once the result is returned, all further communication takes place over standard HTTP.

Security

This design also allows for the introduction of a secure communications layer for dealing with the actual search results.


© 2000 Ian Hall-Beyer. All Rights Reserved.
<manuka@nerdherd.net>