Note (2022): This web page was mostly written in 2019, but now (2022) I have a much better understanding of the program and its big picture, and can basically read it off the book somewhat comfortably. I’ll add a new page with my improved understanding (and what helped); hopefully sometime this year.
This webpage describes my (quixotic) project to “understand” (and maybe explain) the TeX program.
There are a lot of plans, but very little has been done so far. If you’d like to help with (or take over!) any of this, let me know!
The source code of the TeX program, written in a “literate programming” style, has been published as a book:
The main contents of the book (the program itself), other than some introduction and appendices, is also available as a PDF:
texdoc tex
if you have a TeX distribution installed, or else find it online here (3.4 MiB).Apparently a lot of people have successfully read this.
When I tried to read this book (in 2018 or so), I found it not very easy going. Despite the obvious (and obviously correct) reasons for this diffculty—namely, that I am not very good at reading code, and that reading a program is never expected to be very easy anyway—I cannot help thinking that the task could be made easier if the program were to be presented differently.
I can think of three aspects to this:
The program is written in Donald Knuth’s own literate-programming system called WEB (created for writing TeX in), which can be seen as a preprocessing layer on top of a certain dialect of Pascal. The code mostly makes sense to programmers familiar with a C-like language, but there are several subtle differences. Moreover, the typographical conventions of the typeset of output—the use of proportional fonts instead of monospaced fonts, the (today) uncommon style of indentation, putting multiple statements on a line like “paragraphs”, syntax-highlighting using bold/italic fonts rather than using colour, etc.—are not what most of us are used to. There are also the special features/innovations specific to literate programming: the printed book has a mini-index on each two-page spread (TODO: add example image below); in the PDF the references to other sections are hyperlinked. Neither of them has back-references (where all does this symbol appear?), nor does it have any way of distinguishing variables from functions from WEB macros. Also, the context of a LP module (section) would be useful to have: expand/collapse a section in place, see a quick list of what names (symbols) it defines/uses/indexes, etc.
And so on. So fixing all this would require parsing the code and writing a code formatter and code browser.
A more advanced version of this would be to also translate the code from WEB/Pascal to some other more familiar programming language, one in which functions can return values without assigning to a pseudo-variable, for example. Also, the code browser could highlight the non-trivial control-flow in cases of gotos.
Apart from presenting code nicely or differently, which helps in the small, we could try to make it easier to find answers to the questions that one normally asks about a program as a whole. What are the (major?) components of the program, and how do they interact? What is its architecture?
The TeX program is, partly because of limitations of the language that was being targeted at the time, written (at the Pascal level) as a single monolithic program (in a single file) of the form something like:
program TEX;
label
1, 9998, 9999;
const
...;
type
...;
var
...;
procedure p1
...
end;
...
procedure pN
...
end;
begin
...
end;
That is, a single file which contains a bunch of (global) declarations of labels, constants, types and variables, then a bunch of procedures and functions, then finally a “main” begin
…end
block.
There are no “modules” in the sense of separate compilation units. There are hundreds (about 300?) of global variables. Such a program is hard to understand usually, though the Literate Programming paradigm helps a great deal—there are “modules” (chapters and sections) at the WEB level. So what would be good would be to highlight these different major components—hyphenation, macro expansion, line-breaking, math lists, etc.—and show the precise interactions and interfaces between them. If someone wants to understand just one part of TeX in isolation (with the minimal amount required from other parts), they should be able to do so. The hundreds of global variables are actually mostly used in one chapter or so each (apart from some common places like initialization code), so wouldn’t it be cool to automatically associate them to corresponding chapters? Etc.
“An algorithm must be seen to be believed”, wrote DEK near the start of TAOCP. There are limits to how much we can understand a program purely from a static representation of its code, no matter how well-formatted, hyperlinked, contextualized, etc. It would be useful to see what TeX does as it runs, on a few easy and not-so-easy inputs. E.g., what is TeX’s input processor doing, what is its macro expansion doing, its paragraph builder and page builder, etc. TeX does have debug output for much of this (with \tracingall
for example), but all this debug output goes to the same terminal in cryptic form, from which it must be decoded.
Ideally, debug output would be in a clear/verbose format that highlights the workings of each part of TeX individually, and in expandable detail. This could be got by parsing log output or by running TeX in a regular debugger like gdb
, but best would be a custom debugger. It could have separate areas for TeX’s “eyes”, “mouth”, “stomach”, or, in more detail for say TeX’s input stack, macro expansion, token lists, hyphenation, etc., with clear highlighting of control passing between them.
[main loop]
/ \ \ action procedures
get_next, / \
get_token / \ ...
/
/
[Input = mouth]
/ \
/ \
/ \
Overall, these three (A1, A2, A3) amount to writing a code formatter, code browser, sophisticated static analysis, a custom runtime/interpreter/debugger… all for the sake of a single program, which is clearly insane. Hence, the “quixotic” at the top.
While these (A1, A2, A3) above all involve changing how the program is presented, there are other things we can do in parallel either as complement or alternative:
Just get some familiarity and learn about the WEB language, both the semantics of Pascal/WEB, and the typographical conventions of WEAVE (the typeset output) — the cross-references, etc. “How to read a WEB program”, basically.
Instead of attacking the TeX program directly, work up to it, by reading other programs by Knuth or other examples of WEB, or even Knuth’s more recent (CWEB) programs — just to get practice with the style and programming philosophy.
Read other TeX implementations or programs — you don’t have to read Knuth’s WEB if that’s not the easiest to read.
Look at the TeX program historically. Start with the earliest available version or design document (TEXDR.AFT
and TEX.ONE
), match with SAILDART, read TeX78, follow along with errorlog, etc. What were the constraints at the time?
Some general remarks about software engineering, about how Knuth is different, why we should expect this to be hard, etc. Probably doesn’t help, but is fun to do.
Just read the program as it is, understand, take notes from understanding. If I had any sense this is what I’d do, but I’m saving it for the last. Preserving my precious state of ignorance/confusion, because once (if) I get too comfortable I’ll no longer have an idea what the difficulties were.
These then — A1, A2, A3, and B1, B2, B3, B4, B5, B6 — are the 9 kinds of activities that this project involves. I’ll (soon) add more detail on each of these below: what has been done (nearly nothing), and what could be done.
I gave a talk about some of this at TUG 2019. At the time I proposed the talk, I was wildly optimistic about how much would be done by the time of the talk. Turns out it was nothing, so (looking back at it now…) the talk looks like nonsense with nothing of value whatsoever, so I feel guilty and embarrassed about it. On the other hand, the talk was (fortunately for me) not recorded, so maybe I can console myself that it was just a few minutes of lighthearted fun for the participants who were physically present. Here’s a link to the slides.
(WIP)
To simply use TeX, you don’t need to know anything about how the TeX program itself is implemented.
See preface from book. Software engineering. E.g. how to do memory allocation, packed tries, ….
Another reason to read the TeX program: sometimes it’s easier to read the program than to understand The TeXbook.
Of the book and PDF,
Chart of TeX’s code components (dark) and memory regions (light+dark) (page 594), also available here (via this in TUGboat – but note that the middle section is scaled by amount of code as Knuth says in the videos, and not by memory usage: scaling by memory is only for the outer sections).
\escapechar
or \baselineskip
are merely “entrances” (that OP’s terminology) for internally stored parameters. Shows the run loop (with switch) in main_control
, the prefixed_command
procedure that’s called for both of them, and the init_prim
and primitive
procedures that set eqtb
for the primitives.A debugger may help. I started writing something (has a really bad interface and is implemented in a stupid way currently).
\expandafter\expandafter\expandafter\meaning\expandafter\uppercase\expandafter{a}
\end
Pages from the book:
Preface (page v)
Supplementary Bibliography (publications of the TeX project at Stanford University) (pages vi, vii)
How to Read a WEB (pages viii-xiii – temporarily taken down till I can double-check permission issues)
Additionally, to become familiar with the WEB style of writing programs, you may also want to read:
texdoc webman
)You can read Jensen and Wirth (“Pascal user manual and report”), the original tutorial and definition of Pascal.
An interesting document is Kernighan (the K in K&R) on “Why Pascal is Not My Favorite Programming Language”. Kernighan faced many difficulties with Pascal, and WEB in many ways is a solution to those same difficulties (I wrote more on that at the beginning and end of the annotated version of POOLTYPE).
You can try writing a few of your own small programs in Pascal (with and without WEB), as with my 7-page “Hello, world!” program here.
You can try reading some Pascal programs written by others, so that at least the idiosyncrasies that are common to all Pascal programs can be got over. My suggestions are:
Let’s Build a Compiler by Jack Crenshaw. The original is here, but there’s a very convenient reformatted version here.
The P4 Compiler and Interpreter by Steven Pemberton and Martin Daniels.
Note that these two programs are both compilers. TeX itself is written like a compiler, so you’ll find many similarities.
Another personal favourite is the (in-progress after a long hiatus) Let’s Build A Simple Interpreter series by Ruslan Spivak. (No relation, I think, to Michael Spivak the author of AmsTeX and The Joy of TeX.) This is Python code, but in trying to write a Pascal compiler you may learn something about the language. If not, you’ll at least learn about compilers in a fun way.
You may want to work your way up to TeX from smaller programs written by Knuth in the same style: I’ve prepared a List of WEB files. Specifically, a possible reading order, from smallest to largest, is:
Program | Pages | Sections | Fraction of TeX |
---|---|---|---|
POOLTYPE | 7 + 4 | 20 + 2 | ≈ 1.4% to 2% |
GLUE | 8 + 3 | 26 + 1 | ≈ 1.6% to 2% |
DVITYPE | 47 + 7 | 111 + 2 | ≈ 8% to 10% |
TANGLE | 66 + 9 | 187 + 2 | ≈ 13.5% to 14% |
WEAVE | 98 + 12 | 263 + 2 | ≈ 19% to 20.5% |
TEX | 478 + 57 | 1378 + 2 | 100% |
Of these,
POOLTYPE
and DVITYPE
share a lot of their code with TeX (specifically, TeX’s string handling and DVI output sections respectively),
GLUE
shows alternatives to some code in TeX (and is probably best read after understanding the corresponding parts of TeX, even though it was published first),
TANGLE
and WEAVE
are the implementation of WEB, and perhaps worth reading as programs of smaller/intermediate size.
I have separate pages for each of these programs on this site:
TeX — just included for completeness here, mentioned again at the bottom of this page. Also, extensions of TeX: eTeX, pdfTeX, XeTeX.
Totally unrelated to TeX, but you could look at other “literate programs” entirely: Knuth’s CWEB programs, or the (Academy Award winning!) Physically Based Rendering book (see random chapter).
Instead of reading the Pascal source as written by Knuth, there are many other versions you could read. I’ve collected a bunch here.
Many (not all!) are only of historical interest.
The most relevant may be LuaTeX, which is under active development, and started with a manual translation of the Pascal code to C. For (a randomly picked?) example, compare section 426 of the TeX program with this part of LuaTeX — it is in more familiar C style, but there are also more cases. (Compare with pdfTeX and XeTeX.)
A good version to read may be Richard Sandberg’s rsTeX. For example, here’s that same section. See comments at the top of the file.
web2w
is a conversion of TeX from literate programming in Pascal to literate programming in C. See Martin Ruckert’s TUGboat article and website.
After having read these smaller programs and having gained a bit of familiarity with Pascal, before reading the full TeX program I strongly recommend the series of 12 lectures that Knuth gave in 1982 called The Internal Details of TeX82. They are available on YouTube, but I’ve embedded them on this website, with some comments, here.
Also, as you read the program / watch the videos, you can also try solving these exercises from a course DEK gave about TeX:
It is unclear how much this would help, but often the earlier versions of a program are less complex, or at least illuminate how the program got into its current state.
That's the end of this page. If you have any feedback about anything on this site, you can contact me here. To go back to the top-level page (if you are not on it already), click here.