Improving USENET News Performance
This article originally appeared in Doctor Dobb's Journal, May 1996
Let me start by teasing existing system administrators who have a newsfeed coming into their site. How would you like to be able to run "expire" in a few seconds instead of the usual several hours? Would you like to see your disk have almost no activity while a newsfeed is coming in, or news is being expired, instead of the usual "this PC is about to explode" frenzied disk thrashing?
What is news?
Most everyone today is familiar with Usenet news. Just the mention of newsgroups like "comp.os.qnx", or "alt.fan.dan-quayle", is enough to recognize what I mean. People variously call it "News", or "Internet News", or "Usenet news", or whatever it happens to look like on the system that they log in to. We'll stick with the original name, "Usenet news".
We'll look at how Usenet news works on existing Unix systems. Then, I'll describe how this arrangement can be improved in speed and efficiency by at least an order of magnitude, and then I'll explain the steps required to do this under the QNX operating system.
How does news work?
People from around the world post messages, (called "articles", in news jargon) to the various newsgroups. Their news systems then distribute these articles to neighboring machines. Neighboring machines distribute the articles to their neighbors, and so on, until the article has propagated all the way around the world. Machines check the incoming articles to see if they already have a copy, and quietly delete the article if they do. Amazingly, this all works!
We won't focus on the details of how the machines get the articles, suffice it to say that there is a program on the user's system that is responsible for getting articles from other machines, and storing them.
It is how this program DOES that operation that is the main focus of this discussion.
Let's look at a typical system. This system accepts articles for about 4000 newsgroups (this is by no means a huge system, either!). As each article comes in, the article's "header" portion is scanned, and the news software determines where (ie: into which newsgroup(s)) that article should be stored.
A long time ago, when there wasn't all that much news traffic, it seemed like a good idea to just store one article per file. The newsgroup names got converted into pathnames, and everything was simple.
For example, if I had an article for the newsgroup "comp.os.qnx" that just came in, I would pick the next article number for that newsgroup (say it was 1143), and store the new article in a file called /usr/spool/news/comp/os/qnx/1143. (The /usr/spool/news part is just the name of the directory where all of the incoming news articles live -- it's up to each site to determine which directory that will be, but /usr/spool/news is very common.)
The next article that came in for that newsgroup would go into /usr/spool/news/comp/os/qnx/1144, and so on.
So why is this a problem?
There are a number of reasons why this isn't an ideal solution. Articles can arrive in any order — we're not always going to get all of the articles for comp.os.qnx, then all of the articles for comp.os.rsx11, then comp.os.svr4, etc. This means that as the articles arrive, the news storage program is creating files in an ad-hoc manner, all over the disk. Not only that, it is creating from a few thousand to several hundred thousand files per day, depending upon the size of the feed! (This works out to tens of files per second). This means that the poor disk is getting quite a workout.
Given the volume of news these days, the disk would fill up quickly. So, most news systems have an "expiry policy". This is described in a text file that associates newsgroups with retention periods. For example, how long do we retain articles in "comp.os.qnx", before deleting them? This usually depends upon two major factors — how much disk space you have, and which newsgroups you value. "news.answers", which contains the FAQs (Frequently Asked Questions) for most newsgroups is a valuable source of information — I keep mine around for a month. Articles in the newsgroup "alt.barney.dinosaur.die.die.die", while amusing, don't get to stay around for more than 4 days.
If the disk is to never overflow (experienced sys admins will chuckle knowingly here), the volume of news coming in must equal the volume of news being expired. On a typical news system, the "expiry" processing is done by a batch file that runs periodically, say once or twice a day. With current implementations, the expiry processing takes on the order of a few hours. Sometimes, the expiry processing will take so long that it appears that the machine is doing nothing but expiring!
Therefore, not only are there many files being created in random places on the disk each second, there are also a roughly equal number of files being deleted, from different random places each second.
This is suboptimal.
To make matters worse, the news systems in common use end up copying each article many times over before placing it in its final resting place, and then, when it comes time to expire the article, an additional read of the article takes place. (See diagram 1, "Conventional News Flow").
How can this possibly be made better?
Once the files are in their appropriate places, it is hoped that a large number of users will read a large percentage of the articles that have been stored.
Sadly, even in a large company, the vast majority of news articles are never read. They simply expire.
Rather than using a general purpose filesystem to (suboptimally) manage a news system, why not use our knowledge of how these files are going to be created, read, and then deleted, to create a filesystem uniquely tailored to handling this information?
The one major drawback to coming up with a "new" filing strategy is that there is currently a HUGE body of free software available on the Internet that is intimately aware of the current organization of news articles (ie: the /usr/spool/news structure just described). It would certainly not be a popular idea to tell thousands of developers that there is a new way of doing news, and would they please change all of their programs!
I'm going to describe the operation of a program called KNews (kay- nooz), that takes advantage of some of the features of the QNX operating system to more efficiently handle news articles.
Under KNews, there are a host of cooperating programs, of which we'll only describe the two main ones. One program gets the articles from other machines and stores them, and another program implements a "virtual filesystem" (which I'll describe shortly).
Getting the articles
The first program, the "newsfeeder", knows how to get articles from other machines. There are actually two main variants of the newsfeeder, one that knows how to get articles from NNTP-based hosts, and another that knows how to get articles from UUCP/rnews feeds.
When the newsfeeder gets an article, it looks at the header. In conjunction with the local expiration policy file, the newsfeeder program determines right then and there when this article will expire. The article is then appended to a file that has other articles already in it that expire at that same time. When we put the article into the file, we make a note of the position within the file where we start writing the article, and the length of the article.
Let's assume we were examining article #677, posted to "ott.forsale" that had a posting date of Jan 15th, 1996, at 10:34 GMT. Our expiration period for that particular newsgroup is 4 days. So, we know that the article will expire Jan 19th, 1996, at 10:34 GMT. Therefore, we store the article at the end of a file called "960119.10" (we ignore the ":34", as we have defined our expiry granularity to be 1 hour). We make a note of the position within the file that we stored the article at, the length of the article, and the article number:
ott.forsale, art#677, exp 1996 01 19 10:00, pos 36554, len 495
By the time that we had received a day's worth of news, we would have created some number of files, each containing collections of articles.
The number of files created is a function of the expiration policy and the expiration granularity. For example, with a maximum expiration of 10 days, and an expiration granularity of 6 hours, we would create 40 files (10 * 24 / 6). This is between 3 and 4 orders of magnitude fewer files than conventional means!
Ok, so now all of the articles are in a few big files (we'll call these files "expiry-ordered heap (EOH) files"), stored in a directory. (By convention, this directory is /usr/spool/knews).
What we really want to do is have the operating system present a /usr/spool/news directory structure to all applications. This way, none of the applications have to be modified to work with KNews.
Really what this means is mapping files under /usr/spool/news into portions of our EOH files. (See diagram 2, "mapping /usr/spool/news to EOH files in /usr/spool/knews").
The newsfeeder program already provides all of the information that we need; it tells us the newsgroup name, the article number, the EOH file name, the position within that file, and the length of the article.
Under most operating systems, creating a virtual filesystem that does this kind of mapping is a major undertaking. Under the QNX operating system, it is a few hundred lines of code. This article focuses on just this task.
Taking over /usr/spool/news
We want our fake /usr/spool/news directory structure to walk, talk, and act just as if it was a real disk-based filesystem.
Our virtual filesystem for news (VFsys.knews) should export the illusion of two main object types; subdirectories and files. It should support read-only attempts to access these objects; we don't need the additional complexity associated with writing for our example.
We should be able to:
$ cd /usr/spool/news/comp/os/qnx $ ls 1134 1136 1137 1142 $ cat 1134 <contents of article 1134 show up> $ cd ../vms $ pwd /usr/spool/news/comp/os/vms
Since it is a read-only filesystem, we also require some method of telling VFsys.knews about new articles, by giving it the article number, expiry date, offset, and length.
A Little Background
Let me digress just a little before we consider the meat of the problem. QNX is a microkernel, message-passing operating system, with most services implemented via a message. For example, when an application ("client") opens a file, the application's C library open code places the arguments passed to the open() call into a message, and sends this message to the process that is managing the file system, Fsys ("server").
Fsys decides if the open should succeed or not (based on permissions, existance of the file, etc), and returns a status. Some time later, the client might request a read of 200 bytes. Again, the client's library composes a message which it then sends out to the server, requesting that a read be performed. The server will go out to disk, and return the data to the client. (See reference at end of article for more information about message passing performance).
Fsys isn't the only program that is allowed to receive and process open and read calls, though. Under QNX there is no "sacred" attribute associated with programs like "Fsys", versus user written programs like "hello world".
We can do a really neat trick under QNX that we can't do under other operating systems easily. Under QNX, I can write a program that assumes responsibility for a portion of the pathname space. This functionality is a subset of what is generally termed a "resource manager" under QNX.
What does a resource manager have to do?
Let's look at what has to happen in order to successfully take over /usr/spool/news.
We have to:
- tell someone that we are now responsible for a portion of the pathname space (we register ourselves),
- handle requests from clients, (open, read, close, etc)
- handle other messages from our cooperating news feeder program (new article has arrived, expire, etc).
To implement step 1, the function resmgrInit in resmgr.c calls the qnx_prefix_attach call to associate our program with a particular pathname prefix — in this case "/usr/spool/news". This then causes any process opening files and directories in this portion of the pathname space to have its messages sent to our process, instead of the process managing the actual disk drive.
To implement steps 2 and 3, the function serve in resmgr.c waits for messages from clients (the Receive function call). This is a blocking call — the program does not continue past the Receive call until a message has been received. Once we get a message, we call the appropriate routine from the jump table. We can receive a whole range of messages, each of them corresponding to either some C-function call that a client can execute, or some of our special "news feeder" messages.
Most notably, in terms of "regular" messages, we receive messages corresponding to the open, read, write, readdir, and stat calls.
To implement our virtual filesystem, we just have to handle these calls.
Let's look at these calls in turn. The open call specifies which particular file is being opened. Because we can get requests from multiple clients (ie: three different terminal sessions can all do an "ls" at the same time of some portion of /usr/spool/news), we have to track processing based upon who is sending the message. So, in the open handler you will notice that we associate an "OCB" (Open Control Block) with the particular request, via the "map_ocb" function call. This OCB holds the state information associated with a particular client's use of the virtual filesystem.
There are really two types of requests that we can handle: file requests and directory requests.
File requests deal with the following functions (resmgr.c procedures in brackets): open a file (io_open) read that file (io_read) close that file (io_close)
and directory requests are similar: open a directory (io_handle) read entries out of that directory (io_readdir) close that directory (io_close)
Once we have associated the context with the open call, we then expect that we will receive a read request, followed by a close request. In the close request is where we dissociate the context from the request.
The real work is done in the open (in the case of files) or the read (in the case of directories) functions.
For a file, the open call has to verify that the file actually exists, then, it has to initialize the context for that file.
To determine if the file exists, we look at some internal data structures that VFsys.knews maintains (in database.c). This data is a network of directories and files, similar to what a filesystem manager would actually maintain on disk. (See diagram 3, "Internal Database Organization"). By following the chain of directories (shown as circles), we eventually get to a file entry (shown as a rectangle). The file entry tells us the article number, EOH filename, position, and length.
We then open the EOH file, seek to the starting position for the article within the EOH file, and store the file descriptor into the context block.
Then, whenever the client performs a read, our server just reads bytes from the file descriptor, and transfers them back to the client.
For directories, we have a similar flow. The directory path is evaluated against the internal database structure, and checked for validity. If invalid, an error status is returned to the client. If valid, an "ok" status is returned to the client, and it is expected that the client will send the server some "io_readdir" messages to get the directory entries. We again make use of the context block, but this time we drive the processing by a finite state machine. The context block's state variable, ocb -> state, can have one of the following state values: OCBState_Initial, OCBState_GetFiles, OCBState_GetDirectories, or OCBState_Done.
OCBState_Initial generates the virtual directory entries "." and ".." (that each directory must have), and then transits to either the OCBState_GetFiles or OCBState_GetDirectories state, depending upon what is contained in the internal database structure.
OCBState_GetFiles returns directory entries for the files, one by one from the internal database structure, and OCBState_GetDirectories does the same thing for subdirectories.
That's the basics of handling a virtual filesystem under QNX 4. You will notice that I have other functions in there as well, like an "io_stat" handler (for the client's "stat" function call), and "io_chdir" etc. The operation of these is analogous to the operation of the simple open/read/close and handle/readdir/close messages.
Apart from the standard "resource manager" messages, there are also a number of messages that deal specifically with VFsys.knews itself — we have messages that tell VFsys.knews about a new article that's just been made available, or tell VFsys.knews to expire articles, etc.
Let's look at the handling of the expiry message. When a client sends the expiry message, the Receive call in procedure "serve" in "resmgr.c" gets the message, and the newsmsgsJT jump table ends up calling the procedure newsExpire. The code for newsExpire is in resmgr.c, and basically calls journalExpire and databaseExpire, which do the actual work. "databaseExpire" runs through all of the internal database elements, and finds all of the ones that are expired, and removes them. (We optimized things here for speed by having the database entries sorted by expiration. On my system, a 486/33, with about 10k articles being expired, this took under 1 second. You can run expire once per minute if you wanted to, although a practical limit would be once per hour. I run mine every 3 hours.)