Beyond VI: Design Decisions in The Last Editor

The Last Editor (TLE) is a VI-compatible text editor being written from scratch in C++17. It started as a reimplementation and became something more: a set of deliberate departures from over forty years of accumulated VI convention, each one grounded in that convention and motivated by a concrete frustration. This article describes those departures, roughly in order of increasing novelty.


1. Short-Form Address Ranges

Standard VI uses 'a,'b to specify a range from mark a to mark b. TLE shortens this to "ab — a double-quote followed by the two mark letters. It reads as "the ab range" and saves two keystrokes (33%) on one of the most common sequences.
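
As a rough illustration, the short form can be expanded back into the classic form before address parsing. This is only a sketch: the function name expand_short_range is hypothetical, and it assumes marks are the lowercase letters a through z.

```cpp
#include <cctype>
#include <optional>
#include <string>

// Hypothetical helper (not TLE's actual code): expands the short form "ab
// into the classic 'a,'b. Returns std::nullopt when the input is not a
// short-form range. Assumes marks are lowercase letters.
std::optional<std::string> expand_short_range(const std::string& addr) {
    if (addr.size() != 3 || addr[0] != '"') return std::nullopt;
    const unsigned char a = static_cast<unsigned char>(addr[1]);
    const unsigned char b = static_cast<unsigned char>(addr[2]);
    if (!std::islower(a) || !std::islower(b)) return std::nullopt;
    return std::string("'") + addr[1] + ",'" + addr[2];
}
```

Anything that is not exactly a double-quote plus two mark letters is left for the normal address parser to handle.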


2. Smarter Search with //

A bare / starts a standard regex search. A double // activates smart mode, which preprocesses the pattern before handing it to the regex engine.

Smart mode works everywhere a pattern appears: in search addresses and in commands such as g, v, and s.


3. Named Yank Buffers as First-Class Citizens

Standard VI has named buffers but the semantics are murky. TLE makes two things explicit:

Buffer spec syntax: the buffer letter comes after the command, not before. d a deletes into buffer a (replacing it). d A appends to buffer A. Capital letter means append; lowercase means replace. This is consistent across all commands.

Inside g//: the buffer is cleared exactly once at the start of the first global command, then each matched line appends. So g/re/y N collects every matching line into buffer N, replacing whatever was there before; g/re1/g/re2/yN clears the buffer on the first g only.
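
The replace/append rule and the clear-once rule can be sketched in a few lines. The class and method names below are hypothetical, and the sketch assumes that a and A name the same underlying storage, as in classic VI.

```cpp
#include <cctype>
#include <map>
#include <string>
#include <vector>

// Hypothetical buffer store; none of these names come from TLE itself.
// Assumption: 'a' and 'A' refer to the same storage, keyed by lowercase.
class BufferStore {
public:
    // Lowercase spec replaces the buffer; uppercase spec appends to it.
    void put(char spec, const std::vector<std::string>& lines) {
        std::vector<std::string>& buf = buffers_[fold(spec)];
        if (std::islower(static_cast<unsigned char>(spec))) buf.clear();
        buf.insert(buf.end(), lines.begin(), lines.end());
    }

    // Called once when the first g// starts: clear, then let matches append.
    void begin_global(char spec) { buffers_[fold(spec)].clear(); }

    const std::vector<std::string>& get(char name) { return buffers_[fold(name)]; }

private:
    static char fold(char c) {
        return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    std::map<char, std::vector<std::string>> buffers_;
};
```

With this shape, g/re/y N amounts to one begin_global('N') followed by one put('N', ...) per matching line.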


4. Alternative Delimiters Everywhere

Standard VI allows s-foo-bar- as an alternative to s/foo/bar/, which is useful when the pattern contains slashes. TLE extends this consistently: any non-alphanumeric character that isn't a reserved operator (,, ;, |, space) can serve as a delimiter, and the rule applies uniformly to searches, to the LHS and RHS of commands, and to target addresses.

The forbidden delimiter list is explicit rather than implicit, which makes the grammar unambiguous and easier to teach.
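
Because the list is explicit, the whole rule fits in one small predicate. The sketch below uses a hypothetical function name and adds one assumption of its own: only printable ASCII punctuation is considered, so control characters are rejected.

```cpp
#include <cctype>

// Hypothetical predicate for the delimiter rule described above.
// Assumption (mine): only printable ASCII punctuation is eligible.
bool is_valid_delimiter(char c) {
    const unsigned char u = static_cast<unsigned char>(c);
    if (!std::ispunct(u)) return false;  // rejects letters, digits, space, control chars
    switch (c) {
        case ',':
        case ';':
        case '|':
            return false;                // reserved operators
        default:
            return true;
    }
}
```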


5. Microcode Architecture Regex

TLE's regex engine operates with fewer native symbols. Think of it as a microcode engine; it processes something that no human would ever write directly.

The only symbols that the microcode regex understands are:

Symbol(s)   Meaning
#...#       repeat range pattern (see below)
$           end of line anchor
( )         grouping / pickup
8           UTF-8 character
< >         UTF-8 anchor (beginning / end)
C           Combination operator
P           Permutation operator
R           Rotation operator
[ ]         character class
^           beginning of line anchor
{ }         identifier anchor (beginning / end)
|           alternation

All other characters are reserved, effectively removing all "non-special" ASCII characters (e.g., 'a', '5') and relegating them to a class expression.

The translations between standard regex form and microcode regex form are performed at the human interface level.

Example

The standard regex pattern "foo" is expressed as three classes: [f][o][o]. When dual-case search is turned on, it becomes [fF][oO][oO].
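
The literal-to-class translation is mechanical enough to sketch. The function name is hypothetical, and the sketch assumes the input is plain letters and digits; characters that are special inside a class (such as ] or ^) would need escaping that is not shown here.

```cpp
#include <cctype>
#include <string>

// Hypothetical translator: wraps each literal character in its own class.
// With dual_case set, letters get both cases inside the class.
// Assumption: the input holds only plain letters and digits.
std::string literal_to_microcode(const std::string& literal, bool dual_case) {
    std::string out;
    for (char c : literal) {
        const unsigned char u = static_cast<unsigned char>(c);
        out += '[';
        out += c;
        if (dual_case && std::isalpha(u)) {
            out += static_cast<char>(std::islower(u) ? std::toupper(u) : std::tolower(u));
        }
        out += ']';
    }
    return out;
}
```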

Example

To put every UTF-8 character on its own line, use s/8/\0\n/g. It picks up a UTF-8 character (microcode 8) and replaces it with whatever was picked up (the \0) followed by a newline, globally.


6. Generalized Repeat Syntax

Standard VI uses the traditional *, +, and ? as quantifiers. In TLE's microcode regex, these are expressed using a unified repeat range pattern expression:

Pattern   Meaning
#,#       zero or more (classic *)
#1,#      one or more (classic +)
#,1#      zero or one (classic ?)
#,7#      0 through 7
#3,#      3 or more
#3#       exactly 3
#3,7#     3 through 7

The human-facing preprocessor translates *, +, ? into this form. The engine itself is cleaner for it — no special cases, one quantifier mechanism.

So, the pattern foo.* is translated into the microcode [f][o][o][-]#,#.
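
One way to read the repeat range is as a (min, max) pair where a missing max means unbounded. The following is a minimal sketch under that reading; Repeat, parse_repeat, and classic_to_repeat are hypothetical names, and overflow of very large counts is not handled.

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical representation: max is empty when the range is unbounded.
struct Repeat {
    unsigned min;
    std::optional<unsigned> max;
};

// Maps a classic quantifier to its repeat range form ("" if not a quantifier).
std::string classic_to_repeat(char q) {
    switch (q) {
        case '*': return "#,#";
        case '+': return "#1,#";
        case '?': return "#,1#";
        default:  return "";
    }
}

// Parses forms such as #,#  #1,#  #,7#  #3#  #3,7#.
std::optional<Repeat> parse_repeat(const std::string& s) {
    if (s.size() < 3 || s.front() != '#' || s.back() != '#') return std::nullopt;
    const std::string body = s.substr(1, s.size() - 2);
    auto number = [](const std::string& t) -> std::optional<unsigned> {
        if (t.empty() || t.find_first_not_of("0123456789") != std::string::npos)
            return std::nullopt;
        return static_cast<unsigned>(std::stoul(t));
    };
    const std::size_t comma = body.find(',');
    if (comma == std::string::npos) {            // #3# means exactly 3
        const auto n = number(body);
        if (!n) return std::nullopt;
        return Repeat{*n, *n};
    }
    const std::string lo = body.substr(0, comma);
    const std::string hi = body.substr(comma + 1);
    Repeat r{0, std::nullopt};                   // defaults: zero to unbounded
    if (!lo.empty()) {
        const auto n = number(lo);
        if (!n) return std::nullopt;
        r.min = *n;
    }
    if (!hi.empty()) {
        const auto n = number(hi);
        if (!n) return std::nullopt;
        r.max = *n;
    }
    return r;
}
```

The point of the single form shows up in the parser: the classic quantifiers need no code path of their own once they have been rewritten.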


7. Command Chaining with |

Standard VI allows sequential ex commands on one line with | in some implementations, but the semantics are inconsistent. TLE adopts | as a first-class streaming pipe operator:

s/foo/bar/|s/baz/qux/|p

Each command operates on the result of the previous. The metaphor is shell pipes. A substitution that creates a new pattern makes that pattern visible to the next substitution. This resolves the classic "I need two substitutions but one pass" problem naturally:

g/anchor/d|.+1,/end/s/^/@@@/

Delete the anchor line; then prefix all lines up to /end/ with @@@. Because the current line (.) is updated correctly after each step, this chains correctly.
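
The streaming idea, where each stage sees the output of the previous one, can be sketched for the substitution case. This is not TLE's implementation: std::regex stands in for TLE's own engine, the names Stage, substitute, and run_chain are hypothetical, and address handling and g are left out.

```cpp
#include <functional>
#include <regex>
#include <string>
#include <vector>

// A stage maps one line to its replacement. Hypothetical names throughout.
using Stage = std::function<std::string(const std::string&)>;

// Builds an s/pattern/replacement/ stage (first match only, like s without g).
// std::regex is used here purely as a stand-in for TLE's regex engine.
Stage substitute(const std::string& pattern, const std::string& replacement) {
    return [re = std::regex(pattern), replacement](const std::string& line) {
        return std::regex_replace(line, re, replacement,
                                  std::regex_constants::format_first_only);
    };
}

// Runs the stages left to right; each one operates on the previous result.
std::string run_chain(const std::vector<Stage>& stages, std::string line) {
    for (const Stage& stage : stages) line = stage(line);
    return line;
}
```

For example, chaining substitute("foo", "baz") with substitute("baz", "qux") turns the line foo into qux, because the second stage sees the text the first one produced.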


8. Cross-Document Commands: G, S, V, Y, and Others

Standard VI operates on the current document. TLE adds G and V as cross-document versions of g and v — they iterate over all open documents, applying the pattern and command to each in turn.

G/TODO/p          -- print all TODO lines across every open file
V/Copyright/d     -- delete all non-copyright lines from every file

Combined with named buffers, this becomes a document-spanning collection tool:

G/TODO/y N        -- collect all TODO lines from all files into buffer N

Furthermore, S works across all documents for substitutions, Y for yanks, and so on.
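
A minimal sketch of the collection case (G/TODO/y N) follows. It assumes a document is simply a list of lines, uses std::regex as a stand-in for TLE's engine, and the names Document and collect_across are hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Assumption: a document is modeled as a plain list of lines.
using Document = std::vector<std::string>;

// Hypothetical sketch of G/pattern/y N: visit every open document in turn
// and append each matching line to one shared buffer.
std::vector<std::string> collect_across(const std::vector<Document>& docs,
                                        const std::string& pattern) {
    const std::regex re(pattern);     // stand-in for TLE's regex engine
    std::vector<std::string> buffer;  // starts empty: cleared once, up front
    for (const Document& doc : docs) {
        for (const std::string& line : doc) {
            if (std::regex_search(line, re)) buffer.push_back(line);
        }
    }
    return buffer;
}
```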

9. Substitution Engine

The traditional VI "right hand side" (RHS) is usually pretty modest; it allows & (meaning "the entire matched part"), \X (meaning "put down pickup X"), and sometimes even \n to allow a new line to be inserted.

TLE supports RHS operators of the general form { source [modifier] [format] }

source

Specifies the pickup number from the left hand side (LHS), just like standard VI. 0 means the entire LHS match (equivalent to &). Several other forms are available, however:

Form     Meaning
a op b   spatial slice between a and b
#        sequence counter
#= n     initialized sequence counter
@n       emit contents of named buffer n
@$n      emit contents of buffer named by capture n

The spatial slice allows you to get at stuff between the pickups, without having to make extra pickup sequences on the LHS. The four slice operators are:

Operator   Meaning
><         end to start
=<         start to start
>=         end to end
==         start to end

Example

An example is in order. Given the string:

My cat is not a dog

and assuming an LHS microcoded pickup of each word via ({[a-zA-Z]#1,#})#1,# (translated as "pickup one or more identifiers consisting of one or more upper or lower case letters"), here's how the operators work:

Expression   Result         Comments
${2><4}      " is "         picked up from after 2 to before 4
${2=<4}      "cat is "      picked up from 2 to before 4
${2>=4}      " is not"      picked up from after 2 to 4
${2==4}      "cat is not"   picked up from 2 to 4

The results are quoted here only to show that the spaces before and after the is are part of the result.
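
The four operators reduce to picking one edge of each pickup. Here is a sketch, assuming each pickup is recorded as a half-open byte range into the matched text and that each repetition of the group gets its own pickup number; Span and slice are hypothetical names.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Assumption: a pickup is stored as a half-open range [begin, end).
struct Span {
    std::size_t begin;
    std::size_t end;
};

// Hypothetical evaluator for "a op b". The first character of op picks the
// edge of pickup a ('>' = its end, '=' = its start); the second character
// picks the edge of pickup b ('<' = its start, '=' = its end).
std::string slice(const std::string& subject, const std::vector<Span>& pickups,
                  std::size_t a, const std::string& op, std::size_t b) {
    if (op.size() != 2) throw std::invalid_argument("slice operator must be two characters");
    const Span& pa = pickups.at(a - 1);  // pickups are numbered from 1
    const Span& pb = pickups.at(b - 1);

    std::size_t from = 0;
    if (op[0] == '>') from = pa.end;
    else if (op[0] == '=') from = pa.begin;
    else throw std::invalid_argument("bad first operator character");

    std::size_t to = 0;
    if (op[1] == '<') to = pb.begin;
    else if (op[1] == '=') to = pb.end;
    else throw std::invalid_argument("bad second operator character");

    if (to < from) return "";            // assumption: inverted ranges yield nothing
    return subject.substr(from, to - from);
}
```

With the six words of "My cat is not a dog" recorded as pickups 1 through 6, the four operators applied to pickups 2 and 4 give the four results in the table.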

The # form (and its initialized version, #= n) generates numbered sequences.

Example

The emit contents of buffer forms allow macro substitutions to be performed cleanly:

g/^INCLUDE ([a-z])$/s/.*/${@$1}/

Expands each INCLUDE x directive to the contents of buffer x.

modifier

The optional modifier is one of u (converts the result to uppercase), l (lowercase), or the mathematical operators +, -, /, %, and *. The upper and lower case modifiers are self-explanatory; the mathematical operators apply only when the pickup is a number, and perform the respective operation on that number, preserving the radix and format as best as possible. This is most useful for moving a list of numbers along by an offset or straddle, for example.
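
To make "preserving the radix and format" concrete, here is a small sketch of the + operator only. It is an illustration under narrow assumptions: the function name add_offset is hypothetical, only decimal and 0x-prefixed hexadecimal are recognized, the hex digit count is kept as a minimum zero-padded width, and the case of hex digits is not preserved.

```cpp
#include <cstdio>
#include <string>

// Hypothetical sketch of the + modifier: add an offset, keep the radix.
// Assumptions: input is decimal or 0x-prefixed hex; hex output is lowercase.
std::string add_offset(const std::string& text, long long offset) {
    const bool hex = text.size() > 2 && text[0] == '0' &&
                     (text[1] == 'x' || text[1] == 'X');
    const long long value = std::stoll(hex ? text.substr(2) : text, nullptr, hex ? 16 : 10);
    const long long result = value + offset;

    char buf[64];
    if (hex) {
        // Keep the original digit count as the minimum width, zero padded.
        std::snprintf(buf, sizeof buf, "0x%0*llx",
                      static_cast<int>(text.size() - 2),
                      static_cast<unsigned long long>(result));
    } else {
        std::snprintf(buf, sizeof buf, "%lld", result);
    }
    return buf;
}
```

Under these assumptions, 0x00ff plus 1 comes back as 0x0100 rather than 256, which is the behavior that makes shifting a table of addresses practical.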

format

The optional format can be used to translate the number, or the number after operation by the modifier, to a given format, and consists of a printf()-style format.


10. C, P, R: Combinatoric Pattern Matching

This is where the regex engine departs most sharply from tradition. Three LHS variadic functions generate alternation trees at compile time:

C(N, arg, arg[, arg]...): N-length selections. C(2, [a],[b],[c]) matches any two of the three elements in any order: ab, ac, ba, bc, ca, cb.

P(arg, arg[, arg]...): all permutations. P([a],[b],[c]) matches abc, acb, bac, bca, cab, and cba. Six alternatives, generated mechanically. Useful for matching function arguments that could appear in any order, or assembly operands that commute.

R(arg, arg[, arg]...): cyclic rotations only. R([a],[b],[c]) matches abc, bca, and cab: the three rotations, not all six permutations. Useful for ring-buffer patterns or rotational symmetry. For example, R([0],[0],[0],[1]) would match all 4-bit binary bit patterns with only one bit on.

These are pure AST expansions: each argument is an AST expression, and the engine never actually sees the C, P, or R; it sees only alternation and concatenation. The size does grow factorially (e.g., the expression P(a, b, c, d, e) generates 120 permutations), but typical patterns used in code tend to have two or three elements, so this is fine. A human would almost never write the expanded form by hand, and doing so would be an error-prone exercise.
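
The expansions can be sketched with strings standing in for AST nodes, so that concatenating strings plays the role of the concatenation node and the returned list plays the role of the alternation. The function names are hypothetical and this is not the engine's actual expansion code.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Each alternative is one concatenation; the list as a whole is the alternation.
// Strings stand in for AST subexpressions in this sketch.
using Alternatives = std::vector<std::string>;

// Recursive helper: extends `cur` with every unused argument, n levels deep.
static void choose(const std::vector<std::string>& args, std::size_t n,
                   std::vector<bool>& used, std::string& cur, Alternatives& out) {
    if (n == 0) {
        out.push_back(cur);
        return;
    }
    for (std::size_t i = 0; i < args.size(); ++i) {
        if (used[i]) continue;
        used[i] = true;
        const std::size_t len = cur.size();
        cur += args[i];
        choose(args, n - 1, used, cur, out);
        cur.resize(len);   // backtrack
        used[i] = false;
    }
}

// C(N, ...): every ordered selection of N distinct arguments.
Alternatives expand_C(std::size_t n, const std::vector<std::string>& args) {
    Alternatives out;
    std::vector<bool> used(args.size(), false);
    std::string cur;
    choose(args, n, used, cur, out);
    return out;
}

// P(...): all permutations, i.e. a selection of every argument.
Alternatives expand_P(const std::vector<std::string>& args) {
    return expand_C(args.size(), args);
}

// R(...): the cyclic rotations only.
Alternatives expand_R(const std::vector<std::string>& args) {
    Alternatives out;
    for (std::size_t shift = 0; shift < args.size(); ++shift) {
        std::string alt;
        for (std::size_t i = 0; i < args.size(); ++i) {
            alt += args[(shift + i) % args.size()];
        }
        out.push_back(alt);
    }
    return out;
}
```

The sketch makes no attempt to remove duplicate alternatives when arguments repeat; whether the real expansion does so is not something this article specifies.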


11. The Listor: A Hybrid Data Structure

Most editors represent the line database as either a flat array (fast random access, slow insertion) or a linked list (fast insertion, slow random access). TLE uses both simultaneously.

The Listor is a two-level block array. Under normal conditions, it behaves like a vector: line N is at blocks[N >> 10][N & 1023]. Random access is O(1).

For bulk operations — reversing a file, deleting every other line, moving a range — the Listor switches to a logical mode: deletions are flagged, insertions are queued as linked lists hanging off each line. No backbone movement happens yet. This mode is threadsafe per line, so a sharded g// can tag thousands of lines concurrently without locks.

When the operation completes, a single O(N) commit pass resolves all logical operations into the physical backbone. The total work for reversing a file is O(N) instead of O(N^2).
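
A heavily simplified sketch of the two modes follows. It is not the real Listor: the class layout and method names are hypothetical, pending insertions are held in a std::vector per line rather than a linked list, the commit rebuilds the backbone into a fresh object rather than compacting in place, and nothing here is made thread-safe.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Simplified, single-threaded sketch of the idea; names are hypothetical.
class Listor {
public:
    static constexpr std::size_t kShift = 10;   // 1024 lines per block
    static constexpr std::size_t kMask = 1023;

    void push_back(std::string line) {
        if ((size_ >> kShift) == blocks_.size()) blocks_.emplace_back();
        blocks_[size_ >> kShift].push_back(Slot{std::move(line), false, {}});
        ++size_;
    }

    // Vector-like access: O(1) through the two-level index.
    const std::string& at(std::size_t n) const { return blocks_[n >> kShift][n & kMask].text; }
    std::size_t size() const { return size_; }

    // Logical mode: record intent only; the backbone does not move.
    void mark_deleted(std::size_t n) { slot(n).deleted = true; }
    void queue_after(std::size_t n, std::string line) { slot(n).pending.push_back(std::move(line)); }

    // Single O(N) pass that turns the logical operations into a new backbone.
    void commit() {
        Listor next;
        for (auto& block : blocks_) {
            for (Slot& s : block) {
                if (!s.deleted) next.push_back(std::move(s.text));
                for (std::string& p : s.pending) next.push_back(std::move(p));
            }
        }
        *this = std::move(next);
    }

private:
    struct Slot {
        std::string text;
        bool deleted;
        std::vector<std::string> pending;  // stand-in for the per-line linked list
    };
    Slot& slot(std::size_t n) { return blocks_[n >> kShift][n & kMask]; }

    std::vector<std::vector<Slot>> blocks_;
    std::size_t size_ = 0;
};
```

Deleting every other line of an N-line file is then N/2 flag writes plus one commit, instead of N/2 separate shifts of the backbone.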

You can reference the original article describing the Listor concept.


12. Memory Field and Multi-Threaded Architecture

TLE views memory as a field (like magnetism, it surrounds us and is infinite). Thus, it allocates memory, but never releases it. This is driven partly by the infinite undo/redo requirement (therefore the memory must persist). Copy-and-paste operations simply use the existing memory for the content, and threads can do lockless allocations from private per-thread memory arenas.

In modern implementations using virtual addressing, this has basically no impact on the system, but greatly benefits code simplicity and processing speed.
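
A per-thread bump allocator is one common way to get lockless allocation of this kind. The sketch below is an assumption about the general shape, not TLE's memory field: the names Arena, store, and tls_arena are hypothetical, only text storage is shown, and unlike the real design the chunks here are freed when the arena object is destroyed at thread exit.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <memory>
#include <string_view>
#include <vector>

// Hypothetical per-thread arena: allocation is a pointer bump, nothing is
// ever handed back individually. Simplification: chunks are released when
// the arena itself is destroyed, whereas TLE never releases memory.
class Arena {
public:
    // Copies text into the arena and returns a stable view of the copy.
    std::string_view store(std::string_view text) {
        if (chunks_.empty() || used_ + text.size() > capacity_) {
            capacity_ = std::max(text.size(), kChunkBytes);
            chunks_.push_back(std::make_unique<char[]>(capacity_));
            used_ = 0;
        }
        char* dest = chunks_.back().get() + used_;
        if (!text.empty()) std::memcpy(dest, text.data(), text.size());
        used_ += text.size();
        return std::string_view(dest, text.size());
    }

private:
    static constexpr std::size_t kChunkBytes = 64 * 1024;
    std::vector<std::unique_ptr<char[]>> chunks_;
    std::size_t capacity_ = 0;
    std::size_t used_ = 0;
};

// One arena per thread, so no locks are needed on the allocation path.
thread_local Arena tls_arena;
```

Since stored text never moves, a yank or paste can hand around the returned view instead of copying the bytes again.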

Where possible, thread sharding is used to distribute processing work. For example, the sharded g/re/d operation assigns a range of addresses to each thread to execute the g/re/ part of the command, marking lines for later deletion. A single commit phase runs once all threads are finished, and collapses the Listor.

Even operations that increase the number of lines can be sharded. For example, %s/./&\n/ to put each character on a separate line adopts the same concept as the sharded delete: assign each thread a range of addresses, and have the thread perform the operation into the Listor's "list side" result list. When all threads are done, run the commit phase and expand the Listor.
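
The mark-then-commit shape of a sharded g/re/d can be sketched with standard threads. This is an illustration only: sharded_delete is a hypothetical name, std::regex stands in for TLE's engine, a plain vector of lines stands in for the Listor, and one byte per line is used for the marks so that threads never write to the same memory location.

```cpp
#include <algorithm>
#include <cstddef>
#include <regex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical sketch of a sharded g/pattern/d over a plain vector of lines.
std::vector<std::string> sharded_delete(const std::vector<std::string>& lines,
                                        const std::string& pattern, unsigned shards) {
    const std::regex re(pattern);                // shared read-only by all threads
    std::vector<char> doomed(lines.size(), 0);   // one byte per line, not vector<bool>
    shards = std::max(1u, shards);
    const std::size_t chunk = (lines.size() + shards - 1) / shards;

    // Phase 1: each thread scans its own address range and only sets marks.
    std::vector<std::thread> workers;
    for (unsigned s = 0; s < shards; ++s) {
        const std::size_t lo = std::min(lines.size(), s * chunk);
        const std::size_t hi = std::min(lines.size(), lo + chunk);
        workers.emplace_back([&, lo, hi] {
            for (std::size_t i = lo; i < hi; ++i) {
                if (std::regex_search(lines[i], re)) doomed[i] = 1;
            }
        });
    }
    for (std::thread& w : workers) w.join();

    // Phase 2: a single commit pass collapses the result.
    std::vector<std::string> kept;
    for (std::size_t i = 0; i < lines.size(); ++i) {
        if (!doomed[i]) kept.push_back(lines[i]);
    }
    return kept;
}
```

No locks are needed in the first phase because the ranges are disjoint and each thread touches only its own marks.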


13. Boolean Tag Algebra (Planned)

This is the most speculative feature; I haven't seen anything like it.

Standard g// and v// tag lines matching a pattern with an anonymous ephemeral marker. TLE extends this with persistent and named markers, and marker modification operators:

g/foo/+A          -- tag lines matching foo with marker A
g/bar/&A          -- retain A only where bar also matches (AND)
g/baz/-A          -- clear A where baz matches
g/qux/^A          -- toggle A where qux matches

Then a boolean expression selects tagged lines:

g{A&B}p           -- print lines tagged both A and B
g{A|!B}d          -- delete lines tagged A or not tagged B
g{A^B}s/old/new/  -- substitute on lines tagged A or B, but not both (XOR)

This replaces nested g//g// with a more readable two-phase model: first classify, then operate. The boolean expression parser {...} allows full logical syntax (&, |, ^, !, and parentheses).
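
Since the feature is still planned, the following is only a sketch of one possible representation: one bitmask per line, with a bit per marker. All names are hypothetical, markers are assumed to be the letters A through Z, pattern matching is assumed to have already produced a per-line match flag, and the {...} expression is supplied as a C++ predicate rather than parsed.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Assumption: one bit per marker, A = bit 0 through Z = bit 25.
using TagSet = std::uint32_t;

constexpr TagSet tag(char name) { return TagSet{1} << (name - 'A'); }

// Applies one classify step (g/re/+A, g/re/&A, g/re/-A, g/re/^A).
// matches[i] says whether line i matched the pattern.
void apply_tag_op(std::vector<TagSet>& tags, const std::vector<bool>& matches,
                  char op, char name) {
    const TagSet bit = tag(name);
    for (std::size_t i = 0; i < tags.size(); ++i) {
        switch (op) {
            case '+': if (matches[i])  tags[i] |= bit;  break;  // set where matched
            case '&': if (!matches[i]) tags[i] &= ~bit; break;  // keep only where matched
            case '-': if (matches[i])  tags[i] &= ~bit; break;  // clear where matched
            case '^': if (matches[i])  tags[i] ^= bit;  break;  // toggle where matched
            default: break;
        }
    }
}

// Operate step: returns the line numbers whose tags satisfy the expression.
std::vector<std::size_t> select_lines(const std::vector<TagSet>& tags,
                                      const std::function<bool(TagSet)>& expr) {
    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < tags.size(); ++i) {
        if (expr(tags[i])) out.push_back(i);
    }
    return out;
}
```

In this model g{A&B}p becomes a selection with a predicate that tests both bits, followed by the print command on the returned lines.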

Implementation in progress.


Summary

The editor is written as a set of independent libraries — libregex, libmemfield, liblistor, libex — each with its own regression suite.