Indexed Text Retrieval
This article originally appeared in Doctor Dobb's Journal, November 1995.
With the advent of inexpensive storage devices (disk storage currently costs about $0.50 per megabyte), and the advent of large bodies of free (or inexpensive) data, such as telephone directories, Internet documents, FAQs (Frequently Asked Questions), and others, it has become increasingly necessary and desirable to get this data into local storage, and index it for fast and convenient access.
While the topic of text retrieval can span many articles, I've chosen to focus on one area — telephone number databases. The ideas presented here can be expanded to include other types of text databases and formats quite easily, and, where appropriate, will be mentioned.
Phone Database Requirements
My requirements for the telephone number database are simple: it must be able to support fairly large amounts of data (549 Megabytes are required for all of Canada, for example), it must be able to index by phone number in under one second, and by text keys in a matter of several seconds. I also wanted the search criteria for the database to be as flexible as possible, ie: "find all hospitals in Toronto", or "find all the Smith's on Sunset Boulevard", etc. Disk space was not a consideration, due to the relatively low price per megabyte.
My first requirement, where I need to find all information associated with a telephone number in under one second, stems from the fact that I subscribe to the telephone company's 'caller ID' service. For a nominal monthly fee, they deliver the number of the caller between the first and second ring. I want to be able to get the associated information quickly, before the second ring.
This requirement was actually quite simple to solve. In a classical database, the telephone number would serve as the key into the database, and would perhaps be hashed, and then looked up. In my database, the telephone number represents a pathname, which I can open and do a binary search on. Unix file systems are not terribly efficient at searching through thousands of filenames to find the file that you want to open, so I implemented what's called a 'digit tree' file structure.
In this type of structure, there is a root directory that contains the first digit of the phone number (first digit of the area code, "6" for "613", etc). Under each of those directories are further subdirectories that are named after the subsequent digits in the phone number. For example, the telephone number "6135141212" would be stored in a data file called:
where the last "4" is not a directory, but rather a file. This file contains all of the 613514 numbers, sorted by the last four digits of the phone number (station code). For example, the "4" file could contain (sample only):
1212 Weather Number, 24 hours / day, Ottawa Valley 1213 Liz and Brian's Pizza House, 2666 Temp Street, Ottawa 1443 Pay Phone, 55 Queensway Offramp
The advantages of this particular database organization are numerous. First of all, the file system has to open only a few levels of directory (a reasonably quick operation), and search through just one small file. My experience with the telephone directory for Ontario, Canada indicates that office codes (the "514" in the above example) are typically about half full — so approximately 5000 records. Secondly, a fair amount of data compression is built into this system. Since the individual files are stored in a hierarchical structure, the area code and office code information are implied by the file pathname itself, and do not have to be repeated as data within the file.
The other advantage is that the 613514 entry is still a small, flat ASCII text file -- I can edit it with my favourite editor, without having to go into a specialized (and substantially more limited) 'database editing utility' just to change an entry.
Of course, if you wanted to squeeze some data space out of this scheme, the station code, instead of being represented in 4 character ASCII, followed by a space, could be squeezed into just 14 bits (say 2 bytes for ease, saving 3 bytes).
So far, not rocket science. The implementation of getting your existing telephone database into that particular organization is left as an exercise to the reader — I wrote a small C program in a few hours to do it, which took several hours to run. In fact, strictly speaking, you don't need the database organized that way to be able to use most of the other ideas in this article.
Search by Text
Now, to address the second requirement. In the scheme described so far, the entire database is split out by area code and office code, over a large number of subdirectories and files (I have 8221 files in my database at the time of writing!) Obviously, it is not practical to open each and every file looking for a record that matches a bunch of keys.
Another problem that I have is that not all telephone records have the same data, nor in the same order. As I accumulated telephone directory information, I merged from many different databases (Canadian White Pages CD-ROM, Ottawa/Hull Pay Phone database, private database of acquaintances, etc), all with their own different formats. (The example above for 613514 is typical). This precluded using the traditional database-design trick of creating a lastname index, a firstname index, a street name index, etc.
In fact, the situation I was stuck with was that I had only two pieces of data that made any sense across all of the different formats — a telephone directory number (DN), and some descriptive text. Since the descriptive text is not 'tagged' in any way (ie: firstname, lastname, address), it all has to be considered during a keyword search. This therefore means that each and every word in the database has to be indexed.
This is the concept of Indexed Text Retrieval (ITR).
Indexed Text Retrieval (ITR)
By indexing each and every word, I now have the ability to search based upon the logical AND/OR combination of any given words that I choose. For example, to find all of the hospitals in Ottawa, I just type:
itr hospital ottawa
where "itr" is the name of the ITR program described in this article, and "hospital" and "ottawa" are just two text keywords. The ITR program will print out all phone number records that match the two keywords. (Notice that I've made my version of ITR case insensitive).
Implementation of ITR requires two separate programs — an indexing program, and a lookup program. Source code is provided for both.
The process of indexing each and every word involves opening each and every input file, reading each line (station code and text) and parsing out the words. In my implementation, I skip anything that is not "A" through "Z"; it automatically becomes a word separator. Therefore, for each word, I have the following information: file name (in this case, that's the area code and office code), position (in this case, the station code), and the word itself. I don't care about the word's relative position on the line — I just care that it is associated with a given DN (the implication is that when I specify the search keys, I don't have to specify them in any particular order; "ottawa hospital" is equivalent to "hospital ottawa").
This index is large, and managing this amount of data is in itself a challenge.
The Alpha Tree
However, having had good success with the digit tree approach described above, I decided to experiment with an alpha tree. Using this scheme, to find the keys for 'ottawa', I would simply open the file "o/t/t/a/w/a.db" and find all of the keys. The keys are directly translatable into the telephone database record.
In order to preserve disk space (after all, I don't have infinite disk space!), some decisions had to be made about the extent of the alpha tree. The first decision made was that the maximum depth that I would need is five levels. Depending upon the data that you wish to index, this restriction may be drastically altered to suite your requirements.
To generate the index, I implemented a "bucket sort" algorithm. During analysis, I would take each DN and word, and generate a ten byte entry, consisting of six bytes from the word (null padded), and four bytes representing the DN. Then, I would append this entry to the file with the same name as the first character of the word. For example, given the sample database above, the word 'weather' would be stripped to just five characters ("weath"), and, along with the key (6135141212) would be appended to a file called "w.1". The next word, "Number", would also be stripped to five characters ("numbe"), and along with the key (same key as 'Weather', 6135141212), would be appended to a file called "n.1".
Once all words in all files were processed, each ".1" file would be opened in turn, and analyzed. If there was a sufficient number of records in the file to warrant splitting it up (this is a tunable parameter in the source), then a subdirectory would be created, and the process would repeat, except using the next character of the word as the basis for the filename. For example, the "w.1" file would be analyzed, and the word "weath" would be written to "w/e.1". This process continues until either five levels of directory structure have been generated, or the ".1" file is small enough that it is no longer necessary to split it. As a final step, when all of the files are split apart as necessary, the partial word is stripped out of the files, as it is no longer required, making each record contain only the four byte (telephone number) key.
The final result of the indexer is an alpha-tree structure on disk that has, as files, indices representing phone numbers. This looks like a classical 'hash' function, except that the length of the hash is flexible, depending upon the number of entries. On my disk, I have the following files (extract):
This excerpt shows that there aren't as many words that start with "WEK" as there are that start with "WELC".
An interesting side note is the fact that as the number of levels increases, the theoretical number of files increases. The following table summarizes the theoretical maximum number of files, and the number actually observed on the database that I have.
Consuming 141.2 Mbytes of index for the 549 MB database.
Let's look at the retriever.
The retriever takes the keywords passed on the command line, and attempts to find database records that match.
In the case of specifying only one keyword to search for, the retriever opens the associated keyword file (for example, if the keyword was "welcome", the retriever would open 'w/e/l/c.db'). Then, for each entry in the keyword file, the retriever opens the corresponding database file, performing secondary matching on the database records as they go by. Any records that fully match are printed.
The reason that a secondary match is required is because the ".db" files are unique only as far as their filenames go. For example, the w/e/l/c.db file contains keys for all words that START with "welc", such as "welcome", "welch", etc.
In the case of two or more keywords, the retriever opens all of the specified keyword files, and attempts to find keys that are the same in all of these keyword files. (Since the key files are sorted by key, this is a simple task). Once keys are found that are the same in all files, the actual telephone database file is opened, and a secondary match is performed to check that the keywords are all present in the database entry.
The current retriever as listed provides only an "AND" function — all of the keywords listed on the command line must match in order for the record to be printed. It is a relatively simple matter to add an "OR" capability, and partial key matching.
Extending the Concept
How can we extend this to usefully index arbitrary bodies of text? How can we extend the search capabilities to be more powerful?
Let's revisit the indexer first. Our indexer generated alpha-tree entries that contained the index. The index consisted of a complete telephone number. We interpreted the telephone number in two parts; together, the area code and office code was the "filename" portion, and the station code (last 4 digits) was the "record position" within the specified filename.
By revising the indexer to generate 8 byte indices, divided into 4 bytes for the file name (as a file ID number, DOCNUM), and 4 bytes for the word position within that file, arbitrary text can be indexed. In fact, our sample telephone database indexer is a subset of this approach.
(You can't actually store a full telephone number into 4 bytes, we cheated a little. We created a table of area code and office code pairs, and stored a six digit index into this table, instead of the actual area code and office code. In my database, I have 8221 area code / office code pairs, which means that the index ranges from 0000010000 through to 0082219999, well within the range of 4 bytes. Some string manipulation is then performed on the decimal ASCII representation of the 4 bytes, and the area/office code pair and the station ID are split out, or merged in, depending upon the operation being performed).
The interpretation of the 4 byte file position is left up to the designer. For certain types of database searches, the 4 bytes might represent the byte offset of the word from the start of the file. For other types, it may represent the relative word number in the file. In our telephone database example, it represented the record number.
A truly flexible approach to ITR could be organized as shown:
INDEXER ======= +-----------+ +-----------+| +-----------+|| \ +-----------+||| +---------+ documents | | Front-end |||+ | | to be +-+---->| |---------------->| Indexer |----> alpha-tree indexed | | | analyzer |+ Doc#, | | / | +-----------+ position, +---------+ | /|\ & word | | | | | | \|/ | ---- ---- / \ / \ \----/ \----/ | | | | | |---->| | | | | | \----/ \----/ Filename Doc# to to Doc# filetype database database (DOCNUM) (DOCTYPE)
RETRIEVER ========= Keywords +-----------+ | +-----------+| \|/ +-----------+|| +---------+ +-----------+||| | generic | | Back-end |||+ alpha-tree ------>| |---------->| specific |------->matching entries | matcher | Position | matcher |+ +---------+ +-----------+ | /|\ | D | | O | | C | | N | | U ---- ---- | M / \ / \ | \----/ \----/ | | | | | +-------->| |---->| | | | | | \----/ \----/ Filename Doc# to to Doc# filetype database database (DOCNUM) (DOCTYPE)
In this scheme, any documents being added to the ITR database are assigned a document number in the DOCNUM database. This allows an association to be made between the filename (which can be arbitrarily long and complex, e.g. hypertext links) and a 4 byte document number to be established.
Next, another database is established that allows the association between the document number and a document type (DOCTYPE). The document type is used by the front-end analysis scheme to choose an appropriate analyzer, and by the back-end matching scheme to choose a corresponding matcher.
To index a file, once the DOCNUM and DOCTYPE database entries have been created, every word within the file is processed by the appropriate front-end analyzer, and the document number, word position, and word itself are sent to the indexer. The indexer creates/updates the alpha-tree based upon the word, adding the document number and word position at the appropriate place.
To retrieve a document, a generic matcher is passed the list of keywords that are being searched for, and finds them in the alpha tree. Every time the generic matcher finds a match (based upon the alpha-tree information), it calls the specific matcher for that file type (from the DOCTYPE database) to verify the match. If the match is verified, the appropriate information is printed. Again, depending upon the exact requirements, there may be no need for a back-end matcher, if, for example, the alpha-tree information is complete, rather than partial (the telephone database used a partial tree).
The DOCTYPE database and analyzer/matcher are where a large part of the flexibility of the system comes into play. The analyzer/matcher operate as a pair, so it is entirely up to them to interpret the word position that is returned from the generic matcher. As an example, the DOCTYPE could indicate a postscript file, or a spreadsheet file, or a word processor file, or ...
This would allow the indexing of arbitrary documents.