Learning C++ — One year later

Almost exactly a year ago, I decided to learn C++ by writing a VI-like editor. The full article is here.

I spent about 5 months on that project and abandoned it in August 2015; things were not going well. (Well, they were going well, but I had completely screwed up the visual mode and painted myself into a bad corner.)

While I went on to do other projects (some I2C drivers for data acquisition, A C++ library for stock analysis, Real time FFT for a light organ, A music synthesizer emulator, An AI for trading stocks, and so on), I made a committment to do all new development work in C++ only (maintenance would still be done in C as appropriate).

But the editor vexed me. It taunted me. Surely I could write a VI-like editor!

So, in February 2016 I decided to try again. Whereas the previous editor effort had started at the command line end, and then ended up falling flat on the visual end, I decided to start this one from the visual end first. (It's amusing to note that the original VED editor that I used was started at the REGEXP end of things — so neither the command line nor the visual side — but that was around 1991).

What is C++?

But first, a small digression — this post is about learning C++ after all. Going in to the effort a year ago, I thought C++ was all about inheritence and operator overloading! I desperately searched for that in my editor implementation — but couldn't really find any good candidates. Nothing lent itself to inheriting from other things. And very few operators needed to be overloaded.

Instead, I learned about constructors, destructors, STL list and vector types, string operations, lambda functions, templates, stream operations, and so on. I started to learn more about “The C++ Way” of thinking. (Now, to be fair, I had done a reasonable job of “object-oriented” C programming; it's just more elegant in C++.)

Designing the Visual Part of VI

So what I want to talk about here, though, is the design of the visual part of VI. At the heart of it is a state machine, that operates in one of two states — I called them Command and Edit. In Command mode, any keys you hit are either cursor keys or commands (for example, “d” begins the process of deletion, “w” travels to the next word, and so on). In Edit mode, you're entering lines of text into your workspace.

Command mode has two major submodes, I called them Moving and Selecting. Moving is what I described above; you can press cursor keys or movement commands (like the “w” to go to the next word), but you haven't committed to an operation. Once you commit to one (for example, by pressing “d”), then you transit to Selecting mode. On the transition, I need to store the current Loff (“Line Offset” — a class that stores the iterator to a line in the workspace, and an offset within that line). In Selecting mode, I'm looking for a target — so if you hit a cursor key (e.g., the down arrow) or the end of command (a “d” in the case of delete), I mark the current location as another Loff and I now have the beginning and end of the command's range.

Depending on the command, (commands are !, c, d, y, <, and > [Yes, real VI also has v, V, and CTRL-V, mine doesn't yet]), I may delete the range, capture the range, pipe the range to a shell, or do some internal processing on it. Of course, I save the range in both the “"0” buffer, and another buffer if you've specified one with the " command. Then I may need to optionally either insert the output of the command (as in the ! command, for example), or put you into Edit mode.

The ugliest command is along the lines of "B2c5w which says:

  • Be prepared to make a copy of the original part of the line that we're going to destroy to be appended to buffer B,
  • Start at the current cursor location,
  • Push the value 2 onto the repeat stack,
  • Prepare to enter CHANGE mode,
  • Push the value 5 onto the repeat stack,
  • Move forward by the specified number of words,
  • Delete from the initial cursor location to the new cursor location,
  • Capture the deleted portion of the line into the default buffer 0, (and append to buffer B),
  • Place the editor into Edit mode at the original cursor location.

This is typical. Note a few things. The case sensitivity of the “B” — a lowercase value would clear buffer “B” first, whereas an uppercase value allows appending to it. Note the “2” and the “5” — they are actually multiplied together to cause the cursor movement command (the w to move foward by word) to move 10 words ahead. The fun part, of course, is that you need to maintain all this state until the w and then all hell breaks loose!

Of course, half the work is in problem formulation. Once the states are clear, and the operation of the state machine is defined, it becomes a “simple matter of coding.”

Why does a 400MB file take 1.3GB of memory?

One of the impacts of my editor design is that I used an STL list as my primary “holder of lines.” The reason I did this was simple — doing g/^/.m0 (ostensibly to reverse the order of lines) exhibits O(N2) performance on my previous editor (as well as on VIM). So, on a 1M line file, I have to hit ^C after 10 minutes. On my editor, it gets done in a matter of tens of seconds.

Most editors implement their line storage as a plain array, e.g.:

char **lines;

in the most unsophisticated example; a simple, re-allocated array of pointers to chars. In C++, this might be done as:

vector <string> lines;

In mine, it's:

list <Line> lines;

where Line is a class containing a string and some bools.

On my editor, however, opening a test file (all of the titles from Wikipedia, 16M lines @ 400MB or so) takes ONE POINT THREE GIGABYTES of memory. This seemed a little high :-) I was thinking something on the order of 500 - 600 MB (since the base content is 400MB, a few hundred MB for overhead seemed reasonable).

Initially, I thought, “aha! this is the horrible C++ bloat people keep warning me about!” — nope, not so much. Here's the minimum C implementation for a roughly equivalent class hierarchy:

typedef struct
    char        *content;       // 64 bit pointer
    unsigned    nalloc;         // 32 bit number of allocated bytes
    unsigned    size;           // 32 bit string length
}   string_t;                   // total = 16 bytes

typedef struct
    string_t    data;           // 16 bytes
    unsigned    flags;          // 4 bytes
}   Line_t;                     // total = 20 bytes

struct list_s;
typedef struct list_s list_t;

struct list_s
    list_t      *next;          // 64 bit pointer
    list_t      *prev;          // 64 bit pointer
    Line_t      data;           // 20 bytes
};                              // total = 36 bytes

list_t  lines;

So, for 16M lines, we have 16M list_t entries, at 36 bytes per line is 576MB of data, plus the 400MB of content itself is almost a gigabyte. This means, therefore, that the C++ implemention actually has 56 bytes per line instead of my 36 bytes per line in terms of overhead. At this point, I shrug my shoulders and say “Meh.” The flexibilty of using list vs an array style is one of the main reasons I'm writing an editor (so I can get decent performance), and the advantages of using STL/C++ vs hand-rolling it in C are why I'm using C++.

The moral of the story is, multiplying by 16M will quickly add up, and 64-bit software is a main cause of bloat (on a 32-bit machine, each entry would only have taken 24 bytes instead of 36 bytes each, or 2/3rds of the memory).