Session 7

Of the videos, this is the seventh session. Topic of this one: The paragraph builder (Breaking Paragraphs Into Lines)

This will get replaced with the video (if the JavaScript runs properly).

The transcript below is autogenerated from YouTube, and needs to be cleaned up.

I did notice because as the author of the system it was always clear to me what where there was a Makarov for a procedure car without looking at the index you have no way to know right now so probably it would be good idea to have a little type of graphic symbol indicating it’s a macro call or something no no that see when you when you call a macro you I might even want to have a indicated by a by a subscript which tells me what’s the module number where the macro was defined that would be another way we know it would say incre sub 3 then you’d know that that was a macro because it has a subscript on procedure named stone so I like to experiment with that it’s a problem that I just thought that it wasn’t aware of until this week now I guess they they’ve started to tape so we saw that people want to make comments so we ought to do it for the for the people out in TV land so well this continuous discussion later this is six down and six to go and our current our current this thing under discussion will be the paragraph maker and a couple of well young we can talk about web later now okay put it in context again we’ll give this old slide and I have another slide that you haven’t seen before that you’ll get to see in a minute so the paragraph it counts for a little bit of code here about what is that 10 100 modules and and if you add hyphenation to that which is which is about an about the same amount of code this this is the thing that takes place when a horizontal list has been finished and it’s time to break it up and and put it into boxes of the age size it’s a self-contained little part of the program that that only gets called twat in two places once it gets called when you have finished an ordinary paragraph any other places when you’ve started a displayed equation in both cases while you take the previous lines and make them up into paragraph there’s one of the things in the state we remember we had the mode in the head and the tail and the space factor there’s another thing in the state called already and this is it this keeps track of how many lines you’ve already set for a paragraph in case of a displayed formula you have to have to resume maybe at line eight or nine the reason is that if paragraphs can be kept hanging indentation and the Hang annotation might say go on for the first 10 lines and so you have to remember how many lines you’ve already done in the in tech you know that you can also play games with the width paragraphs and you can get different shapes something like that and and the specify the part of the paragraph shapes so it’s another reason why we want to know how many lines we’ve we’ve said you could have a displayed equation in the middle of such such paragraphs what what actually go tech actually does is it considers a displayed equation is three lines long and all three lines are equal in size to the mid to the middle of those three lines usually the first and last are blank and the only time that that you have other lines as if you have to drop an equation number or or raise an equation number now there’s tiny little numbers in the right-hand margin here this is indicate this is the glue ratio of the that had to be used in that line if you could read it it would in the first line it would say point 305 say these spaces are a little larger by three-tenths of their stretchability the next one point 867 here’s a mind whoops where the minus yeah the bottom one is a minus point 5 17 and this is in some sense the optimum way to put those words into that shape and we changed the text of you to other texts and we can fit them in in also there was no it wasn’t particularly difficult to to to do this example we didn’t have to really have to fudge with it or anything very much but the question is what kind of algorithm is actually used to make paragraphs and it’s one of the most interesting algorithms in tech probably the most interesting one and it’s due in major part 2michael class whose whose wife you all know because she’s been our host this week now let’s first talk about the case where we don’t have a paragraph shape and all the lines are the same length this is this diagram indicates one of the tests that we have carried out over and over again to to our to our favorite text this is the text taken from Grimm’s fairy tales and the only reason we did this that for some reason somebody put Grimm’s fairy tales online in a sale computer system this is fairy tale number one and the first paragraph of fairy tale number one if you want to read the rest of the fairy tale I have it online and we can look at it later and all the other tales are on there too and we also have withering Heights but I don’t i end up and we have oh goodness what was calling higgins first movie Harold and Maude yeah ok so those are the books that we have online and we think we can do our experiments on paragraphing with off with those things also they’re happy art of computer programming is online well in olden times when wishing still helped one there live the king and so on this is the paragraph and the question is what’s the best way to put it into into a paragraph and tech has it’s a way to define best in terms of concepts of badness and demerits badness is something that depends on that that little glue ratio number that we had on there this one is so smeary I can’t even read it off of the slide but point 7 something on the top line now the the badness is in simple cases is equal to the cube of that number times a hundred so if it’s point seven week you a bit to get point 49 and then what three is it 343 is seven cube 0.34 so 34.3 units of badness because we multiply by 100 afterwards now this hundred of course is an arbitrary scaling factor but then we convert to an integer and this this calculation of badness is actually done but like all the other things that tech the with integers by some precise method that you’re not allowed to deal with so the actual calculation of getting that thing that we didn’t compute the blue ratio with floating point arithmetic we have the true integer number of scaled points that we wanted to expand that line by and we had the true integer number of stretchability that we could do it and those two numbers were computed by a subroutine called badness that will produce probably the number 34 maybe 35 but whatever number it produces every implementation of tech is supposed to produce exactly that same number and it’s pretty much the ratio of the two cubed times 100 and the reason that number was chosen was because it has the right properties for that we were looking for namely it’s pretty it stays close to zero for a while and then but then it starts to rise rapidly after things start to get real bad then once we have badness we add on to that another number called the line penalty which just says that any line has a certain amount of is has something wrong with it just because we don’t want to waste lines and so blind penalty in tech the old version of tech was always fixed at one but in tech 82 you can say the line penalty is 100 and that will make it that will make tech a lot less happy to make a paragraph longer it will try to you know the other line penalty the the the short of the paragraphs will tend to be and then it’s funny if you try to set line penalty negative strange things are going to change things will happen and well I does not use about that for someone anyway there’s there’s a whole there’s a whole bunch of other parameters that can be played with so many that that I doubt if anybody will ever be able to experiment with with very many of them in their lifetime I mean there’s just have many possibilities to try the so you find ones that you like and with those but the line penalty is one of these parameters so we take the badness computer they told you we add the line penalty then we add penalties associated with this particular kind of a break so if some if there’s a break at the end of a line is due to a hyphen then you add on the hyphenation penalty there’s two kinds of those there’s the kind of there was a discretionary hyphen and there’s the kind that was an explicit hyphen like somebody has a dash or a real hyphen in his text where he said nitty-gritty or something and there was a hike in there in anything in that there’s a peasant explicit hyphen that gets one penalty the kind of had to be inserted the hyphenation algorithm that gets another kind of any way that that penalty would be added or if you broke because of an explicit penalty that you said penalty 100 then the penalty is of course 100 now we calculate the demerits of a line by taking the booze badness and the line penalty and the other penalty and add us all together and square it is and that’s demerits and and the more demerits you have the worse you’re off now on the other hand if the penalty is it is negative for example if somebody said break here with penalty minus 100 then we don’t want to subtract 100 and then square it because that’s going to go the wrong way so in that case we compute the square of the badness and we subtract the square the penalty now these are arbitrary farmers but they tend to agree with what what we’re supposed to do namely as the penalty gets larger this demerits cook go up and and they tend to to Bruce the right results so we had to have a way to find a quantitative statement how much does it cost you for a certain a certain effects on the page so so the costs are computed by mathematical formula finally we get to a notion of demerits and the idea is to find the way to break a paragraph so that the total number of the merits is as small as possible now this particular example we can illustrate that problem by this neat little chart here where we where I draw on each line the number of demerits that would be associated with the line from a break at this this dot means at the beginning of the paragraph here is a break after the word a so on the first line which had a point seven something of badness so that would be 34 and so on and after squaring and everything it turned out to be adding line penalty and scoring so on we turn out that was 2357 demerits for that line on the other hand if we’d brought king over to that up to the end of that line it would have fit and it would have been only seventeen hundred and thirty four demerits okay now similarly as what about if you stopped after a and and just considered from dot to live I didn’t even show that on here it’s not because the demerits are so large the reason I didn’t show it is because there’s something that tech calls the the tolerance in which the old version of tech called the J par which says anything that has to stretch too much is not considered at all and so we say that a line break is feasible if there’s a way to get there from the beginning of the paragraph without violating any of this tolerances say that again it’s feasible if you can get there without violating any tolerances now I’ve drawn blue lines for all the feasible breaks in this paragraph so so you notice at the beginning we had a feasible break after and after King but whose would made it too tight lived would have made it too loose and there was a sort of a v-shaped effect that is further on you get into the paragraph more breaks become feasible and if your tolerance was very sloppy this was done with tight tolerance where we had much more rigid space and controls then publishers generally do use softest if they if we gave ourselves a lot of tolerance and we would have we would probably had all all blue line spies after we got to gone for four or five lines into the paragraph in tech 82 if you make the tolerance 10,000 or more everything is everything is feasible and tries everything and wouldn’t who wouldn’t mind going in space oldin all the way to the end think so that’s that there’s a notion of infinitely feasible but the tolerance of it but is is typically 200 that means in units of badness the tolerance is now given in units of badness it used to be given in another other way but tolerance now is given in units of badness so so if I set the time to 200 this means that I’m not going to accept any line whose badness is more than 200 this means it would stretch more than a cube root of two of its of its penis because cube root of two when we cubed it would give us two and then multi 100 to get 200 so feasible breaks are shown here and every time I have a blue line I have a corresponding box down here and so the question is how what’s the best way to get from the beginning of the paragraph to the end in the sense of total geology merits being small as possible this translates into another well-known problem in computer science the shortest path problem what’s the best way to get from here to there in the sense that the total distance should be as short as possible so if week so if we knew all the feasible breakpoints and we knew all the demerits between feasible break points and we apply shortest path algorithm and it gives us the shortest way to go and this one on the top is the shortest way if you so it’s why we come down through off instead of going through king because even though it took more demerits here we’re going to win pretty soon by take picking up only six here where that one has to pick up 4692 in the next line by putting king in there you get stuck on the next line you don’t have as good a chance so in this way tech is able to look use the paragraph as a whole to get the best way to do the breaking and we use this in the first tech in a na s well it was it was successfully used for about two years before we realized there was a bug in it well before I discussed about the bug i should mention that how was the algorithm going to go on we we don’t we don’t simply go through one pass through the paragraph note all the feasible breakpoints and demerits set up this graph and then then start up a shortest path algorithm the whole thing is all done on the fly simultaneously so as we’re going through the paragraph we’re keeping track of as much information as we need to be able to recognize a feasible break point when we come to it and at the same time is that we’re also doing the shortest what the shortest path algorithm would do on on an acyclic graph like this now the broad that we noticed some time after it was illustrated on this on this chart and the one on the right hand side is the way it would have been done by our first algorithm and the slide sort of explains the problem itself if the bug only occurs when the line with changes and here we allocate an example where the line with changes and we had been keeping in our in our diagram for each feasible break point we had been keeping track of the best way to get from the beginning to that break point but we hadn’t been keeping track of the best way to get from the beginning to that break point on line 10 as different from the best way to get to that break point on line 11 but but it might be possible to have the paragraph be feasible both ending at one place at line 10 an inning at line 11 okay so in for example in this case at the end of the word like there’s two yo in the on the left-hand side like fit okay we’re at the end of the of the Indian of the indentation there well it also fits on a line above in fact it’ll probably you can probably squeeze it up some somewhat higher and find a way to do it feasibly we set tolerance very high in this particular example so what happened was in our old algorithm and keeping only the information for a breakpoint and without without considering line number we would say that it was the best way to get from the beginning to like would be well that way and and then when we when we come to the lung this this long equation on the next line and it says now what to do well there’s no way there was no good way to do it because we had lost track of the fact that there was a feasible way to get to this on another we only kept track of the best way but ok I someone you might see the problem man as to how our algorithm didn’t forgot that there was a feasible way also to get like at line 11 with each break point we stored the best way to get there and the line number on which it would be at the best way we should have done for each combination of break point and line number store the best way to get there so the algorithm it’s in the present version of tech has a bug fixed and does the thing on the left I showed this to some printers and they said the spacing is so white on the thing in the left they would still prefer the one on the right well okay anyway it was about so we now those are the only slides I’ve got for you on that question yawn on the last slide you mentioned that you keep a feasible way to get a certain break at a certain line is there way also to figure out a feasible break at arriving at a certain distance down if you don’t know what the line with spacing is going to be on the way yeah this is just this generalization has been developed by in hockey and his PhD thesis and he’s uh he’s generalized this algorithm so that it will deal with shapes putting putting text into into shapes in to me it’s amazing that it could ever be done but Pete but the other day he convinced me that it would work yeah that’s one of the many things that he’s doing in the season not the the thing I told you about the merits isn’t quite complete there’s also the there’s also extra demerits thrown in for let’s see what is it called double hyphen demerits is one of them there’s a parameter in technique to you say how many demerits do you pay for double hyphen to hyphenated lines in a row and so you add to the demerits that I’ve already told you how to calculate this extra penalty for double hyphen there’s another one for final hyphen demerits this this applies if you are if you’re the at the end of the paragraph in the previous the second last line of paragraph was was hyphenated and that’s a separate tunnel a separate calculation added onto demerits this is called the merits rather than penalty because it doesn’t get squared like the like penalties do now there’s also the adjacency demerit and this is when you have a tight line followed by a loose line or vice versa you pay a penalty for incompatible lines in consecutive it’s sort of better if you’re going to have a tight line somewhere then you don’t want to have a loose line right afterwards it’s particularly noticeable when when when is in traumatic switch so there’s there’s a lot there’s a calculation based on a bunch of parameters the user can supply will which then leads to a shortest path calculation where we try to find the shortest path of the respect that we talked about pipelines and loose lines means that we have to we have to improve the algorithm still further along the lines of this this bug I mentioned mainly we have to remember for every break point for every line number and for every class of fitness whether it was tighter or loose on the last one we have to remember what’s the best way to get there that triple of things is really what we have are really the the vertices in our shortest path now that the idea then is going to be to keep a data structure that represents that represents enough information for us to do all that and we do it in terms of the so called passive nodes and active nodes as we’re going through now as we go as we begin at the beginning of the paragraph maybe I ought to put that slide up on the on the screen again first I’ll wave my hands and then I’ll tell you a little more details so how do we actually generate this on the fly we start out and and and there’s one active node one active break point and it’s for the beginning of the paragraph and the list of active breakpoints contains only that thing then when we when whenever we get to a valid break we start searching for the paragraph and every time we get to a space between words here that will be a valid break point and the subroutine called try break is called and try break looks to see looks at every active breakpoint and every active break point if first there is only this one it tries to see or what about the best way to get from the beginning of the paragraph to the current breakpoint via the active breakpoint now the only way wherever we’re ever going to get to so suppose our current break point is called P then we want to get from beginning to act 2 a-2 p where a is some active break point we’ve already recorded the best way to get to a so we can figure out the demerits from a 2p then we know the cost from beginning to a 2p the best possible weight from beginning to 82 p has to be the best possible way to a followed by a 2p his prime they also includes whether it was tight or loose this is this is a slight subtlety Nelson clues at what line number there is but I’ll for simplicity just talk about break points so we look at that number F and and for all a fixing p and for all a that are currently active we take the one that turns out to be best and if none of those is feasible then we fell p away and nothing ever happens but if at least one of them is feasible the best one it’s it’s considered an active break point and it become end and then it will start testing it against other things now if if we find out that the some active break point that’s so far away from P that from a 2p causes the house to exceed the shrink ability so that we could never fit the line from a to pee in a to get the current line then a becomes inactive we take it off the active list I have to mention i hope that works i believe that yacht it must be there must be okay yeah so because a has a certain line number associated with it it would still remain active at other line numbers now the in this case then what’s going to happen we’ll start out with this one active and after we get to a that one in a will be active after king will test King against this one in this one actually and we’ll find that the best one is from the first to King and when we get to whose we’re going to find out that this one should be retired so this one becomes inactive becomes passive and finally when was the curs will find it was she become active and soul occurs okay then the next word after so beautiful a is not going to be no longer active nor king and so eventually will only keep a few active nodes at a time so the search is the search is is the number of rather the time to do this algorithm is proportional to the number breakpoints times the number of active nodes and the number of active nodes tends to be at worse the number of words you can have in a line so it’s so the algorithm runs quite rapidly and it even even when we’re setting unusual paragraph shapes we can we can get the thing as globally optimum for the paragraph as a whole with a rhythm method that is only slightly shorter slightly slower than what you would do if you were just setting a line at a time on the fly as another system let’s see I wanted to make another quick point but what was it flashed through my mom oh it’s just I just wanted to say it’s amazing when you play with this and and you see you correct a typo in the last line of a paragraph and then all of a sudden something else shifts up in the first line after this algorithm goes on it also does amazing things when it goes wrong as we discovered for example the mysterious problem was happening on the current membership list with the American mass society until we found out that it was that we were that it was because of the final hyphen penalty the height the line before the last line was hyphenated and this and wasn’t choosing this alternative as the best one although it looked obviously better to us because we didn’t realize it was the line before the end of the other paragraph day in the current membership list they ended the members entries with was like a for association I for the Siam and and as for the society and so a person named given here and then at the end it says some stars and si I here would be the hyphen but but it would Tek wouldn’t like this this solution because we hadn’t realized that the final hyphen penalty was was was was acting there so it would prefer setting something without hyphenating that word and bringing something over here and look this way and abduct it surprised us so wait so there’s so many blue there’s a bunch of parameters in there and uh and if you look at the code the probably each parameter just appears once in text program but the the the heart of using these parameters is is something that can be learned but isn’t but what takes a little little experience I think it’s fun to watch it now that the I said that we have active nodes and passive nodes active actually we we make up a passive Nord at the same time as we make an active node and then when the node becomes inactive we just we just recycle the active node part the passive nodes are very short for every feasible break point we have a passive node that’s two words long and in fact it could only be one and a half words long if we if we were working on a 16-bit instead of a 32-bit machine active nodes I forget how long they are three or four words long each active node points to the corresponding passive node and also has these values on it like the the the total demerits the best total demerits to get to here and it will contain other things into like the line number the fitness class whether it’s tight or loose and the and this one points to the break point P that that we’re actually trying the break points that we might find our glue nodes or a discretionary nodes or or in our penalty nose and then in special cases math or current node if if you write a math formula and it says say X and somebody has used a maths around option in tech which says that he wants every math pointing to be surrounded by a certain amount of space well then you could break at this at this at this surround here namely if a if of math formula is immediately followed by a space it’s valid to break their and and so that X would be flush right similarly with currents if a current is followed me to be by a space it’s allowed to break their so so those are the kind of legal break points we have and that’s what p represents of something in mem that’s either a glue node or a penalty node or a discretionary node or masa current in these special cases the the difference between the the difference between texts new algorithm tech 82’s algorithm and the one in the present tech is primarily that in the present tech knees were these are both combined in one big ten word node is taking a lot of memory and what we had what we did was when it became inactive we just moved it to a passive list and I just noticed when I was recording it that we weren’t using many of those words anymore so now I just make it up bus with the part that’s that I’m going to need at the end putting immediately on a passive list the passive list points to pee it points to the best previous node from P in other words the best way to get to a was from some other passive node and it points to the previous passive node so that we can at the end recycle all the passive notes to the to the available storage so there’s three pointers in every passive node one and a half wordsworth active nodes would contain the total demerits and they also contain things like the line number and I mentioned now that in the in the existing version of tech we we also kept track of the total length of of everything from the beginning of the paragraph to hear this is the the if you imagine the paragraph all stretched out in one straight line to the given point how far how much was there and then and we would use that to calculate the badness of a line because if we knew the beginning to a in the beginning to be then for a line from A to B we subtract those two numbers all those two numbers were kept and floating-point notation and they include it and and there was also a similar thing for stretch ability and shrink ability that we that we would keep track of in floating-point notation so that we can determine between a and B how much stretch ability there was between the line as well as how much space there was between those two now that it we want to get rid of floating-point arithmetic in tech 82 and also led to other problems in the present version of tech if you ever get this mess error message that said too much stretch for proper line breaking this is because when you when you took the glue up to be and subtracted the glue up to a you got you’ve got infinity minus infinity which which canceled out 20 it didn’t give you very much information as to well because he has the authority pouring arithmetic all the what you need it was was was was wiped out the significant figures had had the number of fills in the paragraph but not the details that you needed for the for line breaking so that was a that wasn’t working properly and the only way to fix it would be to in continue that same algorithm would have been to add make these active nodes very large keeping track of how much glue there was of type fill of type fillable and so on and each of these another 32 bits for each for each active node but still I wouldn’t be good enough because we had a large enough paragraph you would overflow I said the other day well who’s going to be caring about dimensions that get larger than 17 or eight or whatever ever somewhat feed is concerned and it’s true that B minus a would never be 17 feet but from the beginning of the paragraph to a given point might very well get get quite large in fact we’ve had cases where we’ve seen paragraphs that run all the three pages and and you know the we’re trying to find the optimal way to break those pages by the way if you do that it feel not memory in another way probably when you can and you saw the triangular effect of these feasible lines so after you get into about 30 lines you could just arbitrarily say okay break it here this is going to be a good place to break and and get rid of this paragraph but certainly we didn’t want to worry about overflow calculate total length up to here that would be even on reasonable paragraphs this would this would give arithmetic problems so instead of that I’ve got I I keep track only a relative distances and the distance is from the first active node 2 here is that I keep so I keep track with the first active node to hear of for every node and then in my active list I have two kinds of nodes I have active lit and active node another active node and then I have a call a delta node and then another active known let’s say a 1 a 2 a 3 and so on so Delta nodes can also be in the active list now and these are nodes that tell me the differ this is the difference between from from breaking at a 2 2 breaking at a three so when I pass by a delta mode it tells me to two it gives me the right information for relative distances relative to a three well before I had relative distances relative to a two so this is a this is a extra feature with was added to the data structure in order to avoid the problem of multiple precision arithmetic and also making all these active nodes larger but the expense of that was of several more modules of code and several more days of debugging time and yes and you know but the but should run fast now the so the debts or delta nodes are in there for this calculation of that I need to figure out the badness as you go then I’d see if we can put ourselves in the intex shoes it’s trying to determine whether p is a feasible break point it starts at the beginning of the active list and it has in some table already are the length of the line from a 2p from the first node and active list to pee then okay so it can then compute the badness and see if it’s feasible and and then move to a two if it passes a delta node adjust its tables of the distances and go on now the there’s still more details to to take care of when you when you do this because you you have to figure 2 to remember that when when you break at the end of a line the glue after that break disappears and so you really are considering when you want to figure out the distance from a break a day to a break at be what you really have is is the is is be well I think I call it bait alpha of beat Nessie be be minus alpha of a is the distance from now this means the distance from the beginning of the paragraph 2 B 4 B beet beta for before an alpha for after to the to a place after a so if you have a if you have a your paragraph laid out here and here’s be some glue say and here’s a then the length of the line is going to be from the point after a up to the point before be it’s not just the distance to be minus the distance to a it’s the distance to before B minus the distance to after eight and so there’s that little difference in there okay so the Delta node that we put in here is really alpha a2 minus alpha a three cuz we as if because we’re assuming that our that we know and we’re at this we have data of p minus alpha of a two and then when we pass the note we want to get beta of p minus alpha of a three so the correction delta is equal to alpha a 2 minus alpha a 3 that’s that’s the the way you derive these formula so this is the the length of the line and that we that we can compare to the true to the desired length of the of whatever line were we’re worried about if the break is at a discretionary node for example a simple case where it’s a hyphen or not a hyphen then we have to take into account a break point here we have to say that to break arm before the hyphen covers a certain amount of text and the break after the hyphen is a certain amount of text and the meaning of this notation beta and alpha has to be properly defined for discretionary and I went around about three iterations before I finally had that one right now the actual width of the line isn’t exactly this though is another constant has to be thrown in because tech 82 has introduced left skip and right skip which are added to every line of a paragraph this makes things like ragged right setting and centering very much simpler but it also meant that in my calculation I have to add this background information gamma which is the sum of the left skip and right skip and that gives me the true size so with all these qualifications you see the idea is we start out with some fairly simple simple ideas and add one or more complication now after another until you finally get something that can only be described in web I think with any hope of possible success because web allows you to describe one idea at a time and string them together otherwise it’s everything is really a is reached to involve pretty putting it together now okay what more should I say about this except that we want that it is part of the inner loop so we also wanted it to be fast and I calculated that that the the the hyphenation algorithm would not have to be applied very often this is quite quite important to uh to to the efficiency so how do we handle this problem of hyphenation well here’s what this way it goes the the algorithm goes in two passes on the first pass no hyphenation take place and instead you we just keep on trying the algorithm seeing what we can do try to get the best the best result and unless the active list becomes empty now in the diagram that we have there the every time I get to one of the feasible breaks there was always some way some escaped out from below it that’s not always true you can certainly get to a place where where it be feasible to get up to there but nothing feasible after that point and then if you are in a conventional typesetting system you’d be stuck you would have to you have to have a really bad line and here you can imagine packing up and going down another route but we simultaneously are covering all bases as we explore the we keep track of all the active nodes all the all the leads at once but sometimes all of your leads disappear you got nothing active anymore you’re stuck okay then you go to the second pass the second pass tries hyphenating to get out of it and every word is hyphenated every word that it that meet certain conditions is hyphenated the first condition is that it’s got to be at least five letters there is no point in trying to hyphenate a four letter word because tech is not going to hyphenate ever after the first letter or or less than or less than three letters after so so tech will never hyphenate a hyphen something or other and it will never hyphenate something ed the lawyers put three letters after the hyphen and one and two before so that means that you’ve got to have at least five letters so why bother trying to hyphenate a four-letter word if the word starts with an uppercase letter you don’t try to hyphenate it unless the parameter uppercase hyphenation is is set on right now we’re putting it on by default all because it algorithm seems to work a lot much better we thought it would on uppercase words if there’s something if it’s if you’re in the middle of a math formula you don’t try to hyphenate well there are some conditions have to be satisfied but but otherwise everything every word passes this condition this test and we and we have to have to submit it to the hyphenation algorithm well the nice thing was that in in the entire book art of computer programming volume 2 700 pages worth of stuff we could test this on thousands of paragraphs and we found out that the average number of HYFR nations we had to try was 1.2 4 paragraph that was because only one time in 12 did we have to go into the second pass so most the time we could go through in the first pass find a feasible way to do the paragraph without hyphenation even though our tolerance was extremely high so that we had the tolerance so that no really bad spaces with food it would be allowed at all that and the hyphenation then is seems to be kept down to us too not it not really considered in inner loop but if we do have to hyphenate a word and the next lecture will be about hyphenation if we if we do have to hyphenate a word then the hyphenation algorithm just returns the word with discretionary the same thing of the list is replaced by another list that’s that puts in discretionary instead in in place of the word so so then tech proceeds to go through the new list with this discretionary Xin it just as if that was that was given to it by the user on the enterprise the same algorithm on the second pass you might say where do overflow boxes come from and that I can tell you that’s in the second pass we never allow the active list to become empty the active this becomes totally empty we’d be sunk we have no note because at the end of the paragraph we have to go through the passive nodes and well I’ll tell you about the end of a paragraph in a minute but yet we lose all contact with with the paragraph if the active list becomes empty so so there’s a time when we determine the actor that we’re just about to lose the active list and at this point we we leave something active even though it’s not feasible and that’s where over full box becomes created now the test for when that happened I might as well find the module exactly where it is but that was the source of the most difficult bugs in the in the present version of tech and I and I changed that at least four times before getting it right if I had to say that there was 11 really subtle part of tech that was to name the one that looked at was the hardest for me anyway it would have to be that particular one and let’s see if I can find it let’s see Sue’s did you do not lose all active nodes here 744 during the second pass we do not lose all active knows less we lose touch with the line breaks already found coach on here make sure that such a catastrophe doesn’t happen by permitting over full boxes last resort and then it says readers who seek to improve texture therefore think thrice before daring to make any changes here this little test if second pass and minimum demerits equal awful bad and link of our is last active and pre VAR equals active all four of those were added one time I think in chain and replacing some other ones then I changed the badness 20 and I that it it might take me a while to reconstruct exactly all the reasoning behind that one line there but but they the computer of course doesn’t know that it was hard to think of so the computer would just go as fast as ever when it gets there but but but when we’re reading it we we might stop a minute to think why why they do that okay that was the part where the overflow boxes come in to and of course if somebody doesn’t want over four boxes by setting the tolerance to 10,000 you don’t get any you just get under full ones when when there’s nothing when there’s nothing there I found that it for my purposes it was much better to see over full because then I could see where how much room there was I wanted to make a change anyway I didn’t want to leave a glitch in my final output if I if I don’t care about glitches i said tolerance very high but if i do care then i want to see this over full box because it i can look right at it and see how much it will fit otherwise if I if it’s stretched that one and quickly carryover word in the next line takes it you know I have to try to estimate the sum of all those spaces on the previous line and how much they would shrink down and how much room I have it this is a visual way to really see what what kind of correction to make now I mentioned that we got to figure out what to do at the end we come to the end of the paragraph and then they think is to to actually make the boxes so at this point we’ve got a whole mass of whole mess of passive notes like for every blue mark in that line we’ve got one of these little passive mode boxes and these are all strung together by links each one pointing to the previous one just so that at the end I have a pointer passive to the last one and at the end I’ll just recycle all these nodes by going through these links and I’ll get sent to available storage but they also contain pointer to the the break point where they are discretionary or something like that our glue or something and they also contain a pointer to the best previous place to go so this one will say the best one for that was here best one for that was here so let’s go over to this slide for example at the end of a paragraph I the best previous thing from thing must have been her so I would have a backlink from this one back to her and from her it would tell me back to ball and so the pointer you seek out and the way i calculated those was as i was making this the thing from the from the beginning I would keep track for each of these I would write down the minimum totally merits from beginning to here and then the way I got to it because what was the last step in that chain so by going back from these backlinks that’s what tells us how to do the paragraph and so I at the end of the paragraph pipes I got only one active node and it’s this one then I make it pass it when i look at the corresponding pastor know the past node points me back to her that past pointed back to ball and so on I’ll reconstruct all of the setting all of the lines as their as they’re given their so I have to reverse those links so i can do them in the forward order so they go through and i started my best my best passive node I go through and I follow its length you know maybe 10 line so it goes down to the best ninth line 8 line all the way back to the beginning I reverse those links so that I now have the appearance of clairvoyance the big beginning tells me the best way it was to go to this one and these links are now going this way and telling me the right thing to do so the remaining thing to do is take my horizontal list and break it up at these breakpoints these these things will indicate the 10 break points as to where I ought to break the list subtle bug that occurred in here a couple years after the first tech was was apparently working well was that everyone to where we got a weird case where where there would be a break point with some glue and then there would be some penalty after hearsay and then maybe some more glue and this would be wiped out another penalty over here so that what usually when we break at glue then we also cancel off all penalties that immediately follow it but what actually happened one day was that this penalty turned out to be one of my best break one of my best breaks even though this glue was also one of my best breaks and when I when I was going through boxing the whole thing I broke here I said okay wipe out all the stuff after all the nodes after it here and then I come to my next one and it was supposed to go here and I was out in my available space listing and starting to point to nothing and you can imagine all hell would break all heck would break loose at this point so now this dassault so this is one subtlety that has to be what these break points might actually point to knows that are that are wiped out but usually they don’t and usually that means that if it’s glue you wipe out the glue here and any penalties that follow it and then you get to a box or a character that begins the next line this becomes your next line so here we get a box that you send to the H pack routine H pac-13 does like we did before when we made an H box computes the height and depth and so on it will give the error message about an overflow box and so this we build now a vertical list a vertical list containing first line of text and then the baseline skip calculations made between all of these and for each line of text week we could go up another vertical list now the H pack routine as we send it up a sub list of this original paragraph to be boxed H pac-13 not only calculates this to height and depth and width of these box and glue setting and things of the box but it also looks for send enough looks for insert nodes and via just nodes that are in that box and and those those nodes get appended right after the box so it removes them from inside the box and they come out the box that’s how the readjustment things like that get get worked out we also look for penalties for example between the first time the second line a widow line penalty is inserted here so that the so that there won’t be a break between these two lines unless you pay the penalty for widows there’s a penalty for a break after a line and ends with a hyphen so we also keep track of that as to whether or not there’s a hyphenated because we don’t like to break a page these breaks are four page breaks and so don’t like to break a page after a hyphen if we if we if we don’t have to young dan what about whittling lines at the top of the next page what whittling lines at the top of the next page yeah so the second last the widow penalty is also inserted here the the front there’s a final Whittle there’s a there’s a final widow penalty there’s a display widow penalty or another this penalty changes depending on whether it is a displayed equation following or not publishers tend to some pic publisher tend to think it’s much better well I think they all think it’s infinitely better to to start a line with a page with a widow and before display them to start with the display itself and so that is a separate thing and present version of tech that’s a zero but now it’s a special parameter where there’s no reason why we couldn’t makes we have a final we have a pre pre display penalty or something like that display widow penalty it’s called which gets added before a display but otherwise this one gets it there at the beginning and at the end there’s also interline penalty which gets added to all of these guys and that we recommend for footnotes if you don’t want if you want to make people pay a penalty for footnotes you pay the interline penalty plus the widow penalty for here in a footnote and and just the interline penalty in here and anyway those are both those penalties are passed in vertical mode this is now I a bunch of boxes in vertical mode and the page builder has to decide how to break that in two pages the page builder does it one page at a time on the fly and uses a calculation much simpler than the then the paragraph builder because it only has to look at the badness not demerits it only takes the badness two for the current page and finds the place where the badness is and plus penalties is as small as possible so the way this this program is organized then the first section about breaking paragraphs in two lines is section 37 our module 704 and that is basically the try break subroutine and all that all that all the details about how you calculate these Delta nodes and keep track of the maintaining the active list then section 38 breaking paragraphs and Alliance continued and that’s the part that talks about running through the list telling whether or not it’s it’s a legal it’s legal to break here calling try break at those places and sick and then doing the final packaging afterwards the final packaging is done by a separate procedure because again it’s what otherwise been a procedure that that some Pascal’s wouldn’t have been able to handle the hype and now I mentioned that hyphenation is actually as much code as the as parekh as breaking paragraphs and that’s the subject of the next lecture which we which will start at in a half an hour question though our rivers considered all here or they just don’t occur anymore yeah rivers that are not looked at as you can see in this algorithm I don’t have a phone but the experience has been that that with good tolerance is rivers just simply are very rare but rivers definitely show up when you have wide spacing and when you use the looseness like I didn’t mention looseness if somebody says looseness equals one then the algorithm will will figure out the best way to get to the end of a pair and find out how many lines that took maybe it was eight lines and then it will find the best 999 line way so we would run through the list at the end at the end all the active nodes are the only I could notice the end of paragraph node and those are all stopped but those are all marked with how many lines it took to get here and so we can say the best way to get here in nine lines the best way to get here in ten lines and when you when you make the paragraph a little loose then you’ll find the best 10 line way to do it and usually that looks fine but then somebody says looseness free something like that I don’t have a slide here but we can take this paragraph and set it with looseness 0 1 2 and 3 and minus 1 actually you can actually get into one one fewer line and by the time you see looseness two rivers start to appear and and this is this is typical of what of what happens if the but rivers also also really generally go away except there’s another case where where I’ve noticed them we had a we typeset a paragraph we’re out of an old newspaper which was listing the people who had attended a wedding it was mr. and mrs. Smith comma mr. and mrs. Jones comma and so on and you’ve got a lot of written misters lining up at the left-hand margin the tech doesn’t keep track of that that has to be manually broken the there’s also a theory theoretical proof that to eliminate rivers would be np-complete of me and say it would be a very difficult problem to do it in the general case and since we found that it was was not a serious problem anymore we there’s no attempt to do anything about rivers here another guy was it was the actually if somebody’s really believes in this non foggy writing and starts to use only three and four letter words then again you see rivers because at the end if you have all three letter words they’re going to line up but there’s going to be a white space near the H near the margin on both sides of the paragraph got to use a few big words once in a while just to just to kill the rivers oh yeah one more question in the example on the screen there you’ve got routes through that involved hyphenation despite the fact that the best fruit doesn’t use it that’s true there was a penalty associated with this hyphen that flight 3440 was rather large because the penalty was added in there and that probably made this branch on you know I didn’t if if if this one had gotten in trouble then then hyphenation would have been taken but this branch didn’t look too good after we added up all the other things but it did try the hyphen because in this particular paragraph I oh I see because you’re talking about my second pass thing no in this case i was i manually put in all the hyphens up because we’re making a lot of a lot of tests on this on this particular paragraph on very narrow sizes and we wanted to make sure that that we would we would get the truly optimum setting in this one so we imagined that we were going second pass on this otherwise we wouldn’t even look at that one yeah but on hyphenation i didn’t say how i didn’t allow all possible hyphen stuff I after fai didn’t consider hyphenating because i had intended to become rather conservative about that where hyphens should be allowed in my own opinion I believe you should be able to pronounce the fragment of the sentence before looking and if you stopped after fa it would have you might have think it meant father and you would say far then you come the other self a for it you know end up and so I try to I would like to try to avoid hyphenation saw it in places that you couldn’t pronounce the fragment that you see first without looking on the next page or let next line ok let’s meet 330

That's the end of this page. If you have any feedback about anything on this site, you can contact me here. To go back to the top-level page (if you are not on it already), click here.