Of the videos, this is the eighth session. It is about hyphenation, and includes some live demos!
The transcript below is autogenerated from YouTube, and needs to be cleaned up.
[unclear] So I’m in hyperspace. Okay, so hyphenation space is what we’re going into this hour. TeX82 is using a brand-new hyphenation algorithm of Frank Liang, who I notice is in the audience here to correct me in case I make any terrible mistakes. This is going to be described in his thesis, which comes out any day now, I guess. Isn’t that right, Frank? I’m going to first explain to you the way it works, and then we’ll play with it a little bit so that you get some idea about it, and then I’ll show you how it fits into the TeX code.

This, of course, is of special interest to people in foreign countries, so that they’ll be able to use TeX with other languages. However, I have to mention that the present version will only support one language at a time, at least per paragraph. It would be rather difficult otherwise. Remember the way I described hyphenation, saying that in the second pass we’ll go through and hyphenate every word. But what if the paragraph is half in Hebrew and half in English, or something? We only have one hyphenation table that we’re using all the way through. There are other problems with that too, of course, like left-to-right versus right-to-left. But we’re going to assume that we’re using only one language at a time anyway. If somebody wants to extend it to switch between hyphenation tables and so on, well, that’s a whatsit node that could live there in the paragraph; when you pass the whatsit, it could tell TeX to select hyphenation table such-and-such.

Now, the hyphenation is based on the idea of patterns, which is one thing that unifies many of the things in the ad hoc hyphenation algorithm that was used in the present version of TeX. Suppose I have a pattern that’s, for example, "a b 2 d", and I put zeros in between wherever there is no positive number. This would mean that if I’m trying to hyphenate a word
that contains the substring "a b d", I would try to insert a zero before the "a", a zero between "a" and "b", a two between "b" and "d", and so on. I do this for all patterns, and I take the maximum of the numbers that apply. If the final result is odd, it’s a place for hyphenation, and if it’s even, it’s a place where I don’t hyphenate. Zero is even, so I wouldn’t hyphenate there. If I put a 1 here, that would say that’s a good place to hyphenate: "a b 1 d" would say to hyphenate there. But then you look through all the patterns that you did with zeros and ones, and you get some bad cases. So then you can go and say, well, let’s suppress it if it fits this other pattern also; the two will then knock out a one, because it’s even and higher than one and we’re taking the maximum. And then, if we still don’t get enough of the hyphens that we want, we put threes in there, and they’ll override the twos. So this is the genesis of the idea. The neat thing is how well it works with the right patterns.

So let’s take a look at some examples. I guess first of all I could look at the patterns that Frank generated on Sunday. Let’s see, that’s "edit ot5". Now this is page 1 of... oh yeah, 3879. So there are three thousand eight hundred seventy-nine patterns in this particular list, and these patterns seem to work rather well for English. Now, a dot stands for the beginning of the word. So the first pattern is: beginning of the word, "a", "c", "e", and the four means that between the "c" and the "e" you would suppress very strongly; you probably never want a hyphen there in a word that starts with "ac" with the next letter being "e". The next one says that if you’ve got a word that starts with [unclear], you don’t want to hyphenate there, and so on. Okay, these are the patterns. Let’s take a look at line two thousand, so we’re in the middle of it.
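As a rough illustration of the maximum-and-parity rule just described, here is a minimal Python sketch. It is not the packed-trie code that TeX uses; the function names and the toy patterns ("ab1d", "xab2d", ".xab3d") are made up for this example, and it assumes every pattern value is a single digit and that a dot marks a word boundary.

```python
def parse_pattern(pattern):
    """Split a pattern such as 'ab2d' into its letters ('abd') and the
    numbers between them ([0, 0, 2, 0]); a missing digit counts as 0."""
    letters = []
    values = [0]
    for ch in pattern:
        if ch.isdigit():
            values[-1] = int(ch)      # number sitting in the current gap
        else:
            letters.append(ch)
            values.append(0)          # open a new gap after this letter
    return "".join(letters), values


def hyphen_values(word, patterns):
    """Return, for each gap between adjacent letters of word, the
    maximum number contributed by any matching pattern."""
    text = "." + word.lower() + "."   # dots mark the word boundaries
    best = [0] * (len(text) + 1)      # best[i] is the gap before text[i]
    for pattern in patterns:
        letters, values = parse_pattern(pattern)
        start = text.find(letters)
        while start != -1:            # apply the pattern at every match
            for offset, value in enumerate(values):
                best[start + offset] = max(best[start + offset], value)
            start = text.find(letters, start + 1)
    return best[2:len(text) - 1]      # keep only gaps inside the word


def hyphenate(word, patterns):
    """Insert a hyphen at every gap whose final value is odd."""
    gaps = hyphen_values(word, patterns)
    pieces = [word[0]]
    for letter, value in zip(word[1:], gaps):
        if value % 2 == 1:
            pieces.append("-")
        pieces.append(letter)
    return "".join(pieces)


# Toy patterns only: a 1 allows a break, a 2 knocks it out, a 3 restores it.
print(hyphenate("abd", ["ab1d"]))
print(hyphenate("xabd", ["ab1d", "xab2d"]))
print(hyphenate("xabd", ["ab1d", "xab2d", ".xab3d"]))
```

With only the level-1 pattern, the break between "b" and "d" is allowed; adding the level-2 pattern suppresses it in the longer context, and the level-3 pattern overrides that again, which mirrors the layering of ones, twos, and threes in the explanation.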
Now here’s "a l p", saying don’t hyphenate before the pair "lp". But here’s a strong case for hyphenation: "l p a", hyphen, "b". (Okay, yeah, you can move to a seat where it’s easier for you to see the monitor.) "l p a b": if you’ve got a word that has "lpab" in it (I can’t think of any right now; what would be the word for that?), then you want to break there, with a very strong five. So odd numbers are for hyphens, and even numbers are for non-hyphens. Okay, that’s what patterns look like.

Now let’s see what this algorithm can actually do. I had some sample files; there are several of these. First let me show you the one that’s called "except hyphen". These are words that were exceptions to the present TeX hyphenation algorithm, I guess, and so they were put in here to see what would happen to them. The asterisk means that we do find the hyphen. (My monitor isn’t showing anything, but yours are. Okay, here it goes.) This is just a bunch of exceptions. For example, if you try TeX right now, the one you’re using now, it would say "sel-fadjoint"; this was in a collection of words that were a little bit hard to hyphenate. Asterisks show the hyphens that this new algorithm will find, and hyphens are the ones that it didn’t find. It didn’t do well on this one. Let’s try the next one; those tend to be hard words.

The next one is "TeX com". These are the words that were listed in the appendix of the TeX manual as being words that computer scientists were interested in, and we wanted to pick up some of the hyphens. The asterisks here are almost everywhere; I mean that we’ve picked up just about every hyphen on these words. So "algorithm"; "computer" is hyphenated; "declaration" (we missed the hyphen after the "c" there); "subroutine"; okay. So these words all got
pretty well hyphenated. Now, if we want to look at the whole dictionary: this is Webster’s pocket dictionary, the dictionary that was largely the basis for these hyphenation patterns. This file is 101 pages, so each page will give us a few words out of it. Okay, so we start out, and we found the hyphen in "aardvark", but [unclear] we didn’t get, and so on. You see that we pick up almost all of them here.

(Oh, let’s see what’s going on. You can’t see that, uh-huh, because I can see it on my monitor up there. Does this do it? This is screwed up; it’s no good. Okay, well, I don’t want to mess with that. Is there a hack in the editor that will move every line over? Control right-arrow does? Okay, that moves it over as much as you can. Do that again. Okay, computers are wonderful.)

Okay, you see that in this case, a lot of the time after "ac" with a "t", it’s a good place for a hyphen, but in "acuity" it wouldn’t put a hyphen after "ac". Well, let’s get to some harder cases. Let’s see what’s going on here: "adaptations", "adaptable", "adage" [unclear]. You see, sometimes you want a hyphen after the "d", after the second "d", and sometimes not.

Now, the statistics are that the method has actually picked up 88.3 percent of the hyphens in this dictionary, and it made only 27 errors. An error means it put in a hyphen where there shouldn’t have been one. So let’s see if we can find a place where there was an error; a period is shown here at such a place. (Read-only mode. Okay, now I’ll have to shift this over to the right again. What happened? It just did it from that point on. Okay, that’s okay, we don’t care.) I was only interested in this "associate". So it put in a hyphen there on "associate", and that was considered an error. But why is that? It’s because
apparently the dictionary lists that word twice, once with a hyphen and once without a hyphen. Okay, so that’s not such a bad error: one "associate" is the noun and the other "associate" is the verb, and according to the dictionary, in one case you would consider it to be a three-syllable word and in the other case a four-syllable word. So if you really are upset about that, then you would put it into the hyphenation exception dictionary that TeX has. All right, so that’s one of the twenty-seven errors.

So where should I look in this file, so that people have some idea? Some of the harder ones are where we start with "de" or something like that. Let’s take a look at, say, "difficult". First of all I go to page one; that’s the directory to the whole file. By the way, this gives you a fairly random look into the file, [unclear] words, and we were hitting all these pretty well. "Difficult" is going to appear on page 26. Let’s take a look at some of these others to get an idea. "Forger" didn’t get hyphenated there, but who would want it to, anyway? "Genesis": we missed one of them. We don’t really care about those that are two from the end, because TeX isn’t going to hyphenate two letters away from the end anyway.

Okay, I said page 26 was where we could look for that. (I’m sorry, I’m supposed to move this over now. The repeat-substitute command is... oh, I forgot. Okay, there we go.) Now, remember the asterisks are the ones that get picked up, and you’ll notice that it does nicely pick up sometimes "de-s" and in other places "des-", depending on the context, and no errors are present in this particular part.

Now, these patterns were generated by a program that is written in WEB and is going to appear in Frank’s thesis. It would be possible to get
this program running for other languages and for other dictionaries. You have different constraints in the different kinds of texts that you’re setting, even if you’re working with the English language, so you could use different dictionaries for those words. Frank says that the time to run the program was 57 CPU minutes on Sunday; it goes through lots and lots of calculations in order to find good patterns for this purpose. The program is available to use.

Okay, so now we still haven’t gotten to "difficult", but we can see that when it does miss a hyphen, it tends to miss in similar words; they tend to go in clusters. Well, where is "difficult"? "Detain", "detect", "deteriorate": we’re picking up lots of these. "Determinants", "determinable": no wonder it didn’t find that one. (I can’t search for "difficult" because it has these other things in it. Maybe I could search for "diff" star, or "diff" period would be good enough. Okay, I got it.) So "different"; okay, we’re close here. Now "difficult", okay, and "difficulty", but "difficulties" it didn’t see. Curious. If you look above and below it, lots and lots of hyphens are in there, so indeed this one uniform method of patterns is quite successful.

Now let’s try playing with it. Oh, I wanted to show you another set. We took another dictionary, which we got from a commercial firm for newspaper setting, ran it through a previous set of patterns, and looked at all the things that didn’t work in that case. These were words that weren’t in the pocket dictionary, Merriam-Webster’s pocket dictionary, but still were common. A lot of them are modern words, words that have gotten into the language recently, or words of some interest. We wanted to make sure that we didn’t make errors on these words. One
reason: no, they weren’t in the pocket dictionary. So, well, "Afghanistan": we don’t seem to find any hyphens in it. Oh yeah, right. Okay, so these were ones that caused trouble for patterns that were based only on the pocket dictionary. Then we had a much larger database, several hundred thousand words (I don’t remember exactly how many, but a much larger one), against which we could also check that we wouldn’t make any errors. I believe there’s only one error in this whole file now, and I’ll be able to find it in a minute. "Cryptanalysis": this is one that was hyphenated wrong in my article in Scientific American. I remember that one; it came out "crypt-analyzed" or something like that. "Digitize": these are the ones that we were kind of interested in. Actually, this list of exceptions includes some things like "dumbhead" that were kind of interesting, and a surprising number of obscene words. Okay, well, "heuristic", "joystick": we didn’t pick up the hyphen in
[unclear]. Okay, so now what I did was I wrote a program called "TeX hyphen", and I can try executing it now. This just excerpts the hyphenation part of TeX, and then it gives us a chance to type in words, and it’ll show us how the hyphenation algorithm performs. So I’m compiling it now and loading it, and I have to input my test file, "pat.dat". It typed out (you can’t really read it) "please wait". At this point it’s taking all those patterns and computing a very efficient data structure, so that doing the hyphenation is going to be really fast. I’ll explain the data structure in a few minutes.

Okay, now it’s ready, and it said the hyphenation table sizes are 5207 and 174. The 5207 is the number of words in a trie (T-R-I-E) that’s in your TeX WEB listing; this is a data structure we’ll be talking about. So it takes about 5,200 words to store all those patterns in a very efficient way, so that hyphenation will really be quite fast for English words. That’s four bytes per word, so 20 or 21 kilobytes, essentially, was used for that table. The 174 is the number of opcodes I use in another table, where I’m limited to 255; those are the so-called hyphenation ops in the listing.

Okay, now we can type some text in here. I’ll type it over at the right so that you can see what I’m typing, I hope. I’m not looking at an edited file now, so we’re stuck with this problem of something being on the left. I guess what I can do is just put a bunch of nulls at the beginning of the word. So I’ll type "xxxx supercalifragilisticexpialidocious" and see what that hyphenates into. Okay, the x’s came out on the left, so you can see the rest of it. Now note that there’s a different notation here: a hyphen is something that the algorithm finds, and an equals sign is a place where it found an even number greater than
zero, that is, where a hyphen was suppressed by one of the patterns. So where I had a 2 or 4, I get an equals sign here; where I get a 1, 3, or 5, I’ve got a hyphen. If you look at it ("super", "cal", ...), it missed one hyphen, between [unclear], I think; it could have put one there. Anyway, it’s certainly a valid subset of the legal hyphens, but at the end of the word it didn’t get very many. Now this, of course, is a word not in a dictionary, but you see that it will find hyphens in words that are English-like, and it will do pretty well on those words.

I’m willing to take suggestions from the audience if you want to try any word. "Record"? Somebody wants to see if "therapist" comes out as "the-rapist". Okay: "ther-a-pist". Okay. The other one is "weeknight" coming out as "wee-knight"; isn’t that the other one? "Penistone"? Is that P-E-N-N..., P-E-N-I...? Oh yes, okay. (Okay, you broke it. Now wait a second. Oh, actually this would probably work out all right if I didn’t put the x’s in front of it. Of course this wouldn’t be in our database. What was it, two n’s? Okay, no, it did just the same thing.) Now let’s see, Barbara, do you have any ideas? Okay, let’s try a German word: "Wissenschaften". Okay, let’s say it’s "Computerwissenschaften". So you get to see "Computerwissenschaften"; it worked.

Now, somebody said "record", and that, it turned out, was one of the errors in that other file. It found either "rec-ord" or "re-cord", one of the two; and the same thing for "prog-ress" versus "pro-gress", and the same thing for one other case, "pres-ent" and "pre-sent". So they’ve all now been added as exceptions in the hyphenation exception list, seven of them altogether. So if I type "record" now, it will just not do anything to it, because it’s been made an exception where I said don’t touch that word. A funny thing happened in the Palo Alto Times Tribune a couple of weeks ago, or a month or so ago, where there was
a paragraph that was three lines long. In the first line it hyphenated "record" properly, "re-"; the second line happened to end with "record" as "rec-ord"; and on the third line there was a completely wrong hyphenation. (I forget what; I saved the example, and I’ll make a slide out of it.) It must have been done by manual intervention, but then they let the automatic algorithm go on for the rest of the paragraph, and it blew it.

So right now I have seven exceptions in the dictionary. When we generate the patterns, we look for the dots in the resulting file, and we decide whether those dots are real errors or not, in our minds; sometimes there are bugs in the dictionary. The ones that we are upset about (I could have put "associate" in there if I was upset about it) we then make into exceptions. But this percentage, eighty-eight percent found with a very compact table, compares very favorably with any other method that was known to get an approximation for this. What were the corresponding figures for the present TeX algorithm? Frank, do you remember? We pick up about forty percent. I thought it was only in the 20s, but I guess so, right, because we’ve got so many prefixes in there.

"Respite"? "Respite" got no hyphens. It must be in the pocket dictionary; there won’t be any errors for any word that’s in the pocket dictionary, which is a pretty complete dictionary.

Now, I think if I type backslash t here, it’ll go through a database that came right out of the Pascal manual; I typed in every word of five letters or more out of the Pascal manual. I guess you can’t read it now, but I’ll tell you. "Springer" didn’t get hyphenated; "Verlag" did; "Heidelberg" did, and the hyphen before the "g" there isn’t going to bother anybody, because TeX doesn’t use the hyphens at the very end of the word. "Berlin" came out okay; "Pascal", okay; "Manual", okay; "Report", okay; "Kathleen", yes; "Jensen", no hyphens at all;
"Niklaus", that’s fine, I think. No? No? Okay, well, then it wasn’t, but to an American it sounds fine. It’s based on English, so it must be okay. [unclear], okay. What I mean is, our hyphenation patterns are supposed to work for English. You see: "version", "programming", "language", "drafted", "followed", "spirit", "Algol", "languages", "extensive", "development"; "phase" didn’t get hyphenated, that’s good; "became", "operational", "publication", "followed". You know, it’s working: "interest", "development", yep. So it’s all doing fine there. "Booklet", "consists"; "parts" was not hyphenated, I’m glad to see. So all these things it’s done okay. Let’s look at the end; we’ve got some more French words here: "Pierre", "Desjardins", "criticism", "suggestions", "encouragement", "Helmut", "Sandmayr", "Switzerland".

So that’s the way this thing performs, based on these patterns. If you look at the details of the words, it really is something that has promise of working even better in languages other than English; English is the weirder one in this respect, I think, than other languages are. So the success of this looks quite promising for just replacing this by another set of patterns.

All the patterns are being put into basic TeX. In other words, when you start running TeX, it knows no patterns at all, and you have to input those as part of your format. Basic will input a separate file that contains the patterns, so that people can have their own favorite files. Now, \patterns are read only by INITEX, and it does a fairly elaborate computation to get the patterns into an efficient form. But then there’s also the chance for users to put in their own exceptions, and that’s done with \hyphenation; there you list the words that you want to hyphenate specially, in case you notice one. We had a statistician over there who picked up some 20-letter word I’d never heard of before, and it surely wouldn’t be in
this dictionary; it was "hetero"-something-or-other. If this method doesn’t do well on it, then you can certainly put it in.

If you’re curious about which were the seven exceptions, I guess I might as well flash those up on the screen. That’s the pat file. (Oops, okay.) These were the words that were sufficiently strange that it was better to have them as exceptions than to make the patterns check for them, because that would knock out other good patterns. Oh, there’s more than seven. Let’s see: [unclear], "declination", "obligatory", "philanthropic", "present", "presents", "progress", "reciprocity", "record", "records", "retribution", and "rhetoric". In this file I use backslash h instead of \hyphenation, because this program doesn’t recognize control sequences except one-letter ones, just like this backslash t at the bottom here, which is a toggle that switches back and forth between online input and a stored dictionary of stuff. The last pattern was "zzy", so if you said "xyzzy", it would not hyphenate between the two z’s. One, two, three, four, five, six, seven, eight, nine, ten, eleven words, then, were given as exceptions. So basic TeX will include these seven exceptions and the three thousand-some patterns. Any questions on that?

Now let’s see how TeX uses these patterns to do the hyphenation really fast. This idea of a trie is a data structure that I discussed at some length in my book on sorting and searching; another word for it is a digital search tree. But there’s a special kind of packed trie that Frank developed that does the thing beautifully. Let me find the algorithm for you in the manual: hyphenation is section 41, and it starts at module 806, page 283. So if you have a pattern p1 p2 ... well, let’s suppose I have the pattern "abd"; I used that example before. To find this pattern, to see if there’s
anything in the trie for "abd", we set z1 to a, and then z2 would be set to trie_link (which I’ll just call L for brevity) of z1, plus b; and z3 is L(z2) plus d, where these are the internal ASCII codes of the letters. Then z3 is going to be the identification of the pattern. However, I have to have a check that this pattern is legitimate. Let me call that check C: C(z1) should equal a, C(z2) should equal b, and C(z3) should equal d. If any of these conditions fails, then there’s no such pattern in the dictionary, but if all three of these conditions are true, then z3 is a number identifying the pattern "abd".

So the trie is constructed in this fashion, and basically it’s like this. You think of this as some long array, say 5,000 words. We start out by looking in position 97 of the array: we set z1 to 97, look at C(z1), and see if there’s an "a" in there. If so, then good, we have at least one pattern starting with "a". Then we look at L(z1), which tells us another starting place in the trie, and that’s for all patterns starting with "a", you see. L(z1) might be 15, say, so then we add b to 15 and look there, and if there’s a "b" stored there, then there’s been a pattern starting with "ab". That position will be z2; we pick up L(z2), add d to it, and see if there’s a "d" there. If so, then we’ve got an identification of that particular pattern.

Now, what it actually turns out is this, if you look at the trie. First of all, you’ve got to reserve all the locations for the first letters. Suppose everything can occur as the first letter; then we’d fill up 26 places here in this table. But then after "a" maybe there are only seven letters that fit, so we find a place where we can
stick them, where we can pack all seven of these in. Finally we get to things that have only one valid successor, and so it’s easy to stick them in and fill up all the holes and not waste any space. Once you see the idea, the method is extremely fast: we just have to check one thing and fetch, add, check, fetch, add, and we’ve identified whether the pattern occurred.

The actual code for this is, I think, module 810. We’ve got our word stored in the hc array, and we put delimiters before and after the word to mark the beginning and end of the word. Then, for each place within the word, we try to look for a pattern starting there; this loop is the look for it. So z is set to hc[j] (that’s like z1 being set to a), and l is set to j, which is our starting position. While hc[l] is equal to trie_char(z), which is the C that I was talking about, I continue, and I stop as soon as I get a mismatch. By the way, the patterns might be of various lengths; there might be a pattern of length 3 and also a pattern of length 4 starting in a given place. So I also look at z1 and z2 to see if there’s a trie_op stored with it. A trie_op means that we have found a pattern there, and in that case I go through and store maximum values in the hyf table, as I’m supposed to. Usually it’ll be min_quarterword, which means no pattern. Then I increase l, z is set to the next trie_link plus hc[l], and I go through the loop again. So it’s just a few instructions per iteration, and the loop almost always dies out very quickly. At the end, then, we have something that will contain odd numbers or even numbers, depending on whether we should put a hyphen in or not.

Okay, the hyphenation exception table is kept in a hash table, with a technique called ordered hashing that’s of a
little interest too, but I won’t go through it now. First we look up in the exception table to see if we’ve got the word. If that fails, then we go through the pattern-matching process like this. In both cases we’re going to finally store the results in this hyf array, whose entries will be either even or odd depending on whether we want to allow a hyphen there or not. Any questions on this?

Okay, now there has to be a definition of what things in a paragraph we try to hyphenate. Hyphenation is shut off inside a math formula. But we also want to be sure that we can try hyphenating words that are surrounded by quote marks or parentheses or something like that, or followed by punctuation; we don’t want that to interfere with our hyphenation process. So TeX, on the second pass, does the following thing. It starts at a space and then passes along after the space until it gets to the first letter. This could pass over quote marks and parens and things like that, until it gets to a letter. Then, starting at that letter, if the letter is uppercase, it makes sure that we’re supposed to be hyphenating uppercase words. It only looks at the first letter of the word; if the first letter is lowercase and the second letter is uppercase, it will go ahead and try hyphenating in any case.

For this purpose, the definition of a letter does not use the \catcode (to see whether the type is 11 for letter, or 12 for other char). Instead, a letter is something that has a nonzero \lccode (the \uccode or the \lccode, I forget which of the two; I think it’s the \lccode, the lowercase code). If the character has a nonzero \lccode, then it’s considered a letter. So the standard thing is, for example, that the \lccode of lowercase "a" is lowercase "a", and the \lccode of uppercase "A" is lowercase "a". This is the way we convert uppercase to lowercase, through the \lccode table, and it’s intended, of course, for large
alphabets and other foreign alphabets, so that we have an idea as to what it means to convert to lowercase. We convert everything to lowercase to look it up for hyphenation, and then afterwards, once we know the hyphenation positions, we of course put the hyphens into the original word, preserving uppercase and lowercase the way they were. (Then how do you tell if it’s a capital? If the \lccode is not equal to the character itself, it’s uppercase; it’s considered a capital letter.)

Then we go until we get to a non-letter, and then we go on until we get to a sort of a safe point. A non-letter might be a period or a comma or a quotation mark or something, and we go until we reach either glue or a penalty or something like that. Now if, either before this word or after the word, we reach a dangerous thing like a box, then we give up; before the word and after the word, it’s supposed to be innocuous stuff. A discretionary hyphen is considered dangerous: it means that the author has said how he wants this word hyphenated, so we don’t touch that. If we run into a discretionary hyphen anywhere in this process, we assume that we’ve already been given the correct information, and that shuts the whole process off. If we run into a box, that turns it off, and a couple of other things would turn it off; a math formula, for example, would stop it. But a penalty or glue would say we’re at the end of a word. So we really cover all the common cases of text.

There is another slight clarification that has to be made: we’re passing through ligatures here, as well as kerning (you know, kerns are little pieces of space that have been inserted into the word for typographic effect). Let’s talk about those. Well, kerning we just ignore: we bypass the kerning, collect the word that would be there if the kerning were absent, and try to hyphenate it; then we put the kerning back in again.
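To make the letter test and the case handling a little more concrete, here is a small Python sketch. It is only an illustration under simplifying assumptions: it works on a plain string instead of TeX’s list of nodes, it ignores boxes, discretionaries, ligatures, and kerns, and the names `lc_code`, `candidate_word`, and `insert_hyphens` are hypothetical helpers invented for this example, not TeX identifiers.

```python
def lc_code(ch):
    """Toy stand-in for the lowercase-code table: returns the lowercase
    form for letters and an empty (falsy) value for everything else."""
    return ch.lower() if ch.isalpha() else ""


def candidate_word(token, hyphenate_uppercase=True):
    """Find the run of letters inside token.

    Returns (start, end, lowercase_word), or None when the token has no
    letters or starts with a capital while uppercase hyphenation is off.
    """
    start = 0
    while start < len(token) and not lc_code(token[start]):
        start += 1                    # pass over quotes, parentheses, ...
    if start == len(token):
        return None
    first = token[start]
    if lc_code(first) != first and not hyphenate_uppercase:
        return None                   # only the first letter is examined
    end = start
    while end < len(token) and lc_code(token[end]):
        end += 1                      # stop at the first non-letter
    lowered = "".join(lc_code(ch) for ch in token[start:end])
    return start, end, lowered


def insert_hyphens(token, start, break_offsets):
    """Put hyphens back into the original token, keeping its case.
    Each offset counts the letters of the word that precede the break."""
    out = []
    for index, ch in enumerate(token):
        if index > start and index - start in break_offsets:
            out.append("-")
        out.append(ch)
    return "".join(out)


found = candidate_word('("Hello,')
print(found)
print(insert_hyphens('("Hello,', found[0], {3}))
```

The lowercase copy is what would be looked up in the exception table or matched against patterns; the break positions are then applied to the original token so that capitals and surrounding punctuation are preserved.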
I’ll talk about how to put the kerning back in again afterward. Now, ligatures: suppose you made a ligature out of a letter and a non-letter; maybe somebody gave a ligature for a left parenthesis and "c", something like that. Well, I’m sorry, I’m not going to worry about weird cases like this. If a letter is in the middle of a ligature like that, I don’t start hyphenating there; I’ll start hyphenating after it. At the end, similarly, if there’s a letter followed by a non-letter in the ligature, I will chop that letter off and not consider it part of the word being hyphenated. So what we actually have here is that the word being hyphenated is given by whole nodes and not by parts of nodes; it doesn’t include part of a ligature at the beginning or the end. We definitely have a set of nodes in our horizontal list that forms a word that we consider a candidate for hyphenation. This set of nodes, which consists of character nodes, ligature nodes, or kern nodes, is the candidate for hyphenation. We don’t split a ligature between punctuation and letters; it would be weird to try to handle that.

Okay, now we’ve got this word as a candidate for hyphenation. If it has fewer than five letters, we don’t bother with it; we’ve got to have five letters or more. If so, then we look it up in the hyphenation exception table; if it’s not there, we go through the patterns, the even-and-odd business, and we wind up with a bunch of hyphens. Now, I mentioned "difficult" the other day, and what we have to do after that point. Let me give you an even worse example; I’m going to give you one that’s so bad that we might not even get it right, but I think we’ll get it right. It’s in the test program at the end of this document: Appendix A contains a test program, and, like I said the other day, it’s a program that is not intended to do anything useful; it’s
intended to exercise the different parts of TeX. Have I discussed these appendices so far in this class, or was that in conversation outside of class? I don’t remember. I haven’t talked about it yet at all? Okay. Oh, the other day I mentioned that trip.tex was supposed to be a little bit of a pun, but that’s about all I did.

Okay, so trip is this test program; it starts in Appendix A here. I haven’t gotten all the way through it, but I believe that by the time I get all the way to the end of this program (and I’ll probably change it a little bit afterward) there will be a high probability that TeX is near its final bug, and that all the ones afterward will be extremely subtle or related to error recovery. However, I have not tested it all the way to the end yet, and in fact you’ll see that there’s the word \end sitting somewhere in the middle of this file right now, just stopping it where it is. Let me see, right there, it says “temporary ending”, on page 1-3. That’s just where I looked at the hyphenation of a very strange word, BBBBBB, six B’s.

Here in this test program I set the uppercase hyphen parameter to 1, which means I’m going to hyphenate uppercase letters, and I set \hsize to 100 points and the line penalty to a large number. So this is going to give me, on one line of a paragraph, something that it won’t be able to do without breaking the tolerance, because the line will have to stretch too much. So it will go into the second pass and try hyphenating this word BBBBBB, and then it’ll give me an underfull box message that will show me how many hyphens it found. And then I said \end; that’s as far as I got through testing this program. You might just read the rest of it and try to guess what it’s going
to do, but you can’t tell now. Appendix C shows the result of the test program so far. So when you install TeX at your system, you can at least try it on trip.tex and see if it doesn’t give you the same output as you see here in Appendix C. All these files are on the tape that will be distributed, so you can compare with it. At the very end of Appendix C is the underfull box message that it gave us for BBBBBB; it’s here at the bottom of page C9. The short version of it says we’re in font 20 and it hyphenated BB, hyphen, B, hyphen, three B’s: BB-B-BBB. The long version of it, oh gosh, unfortunately says “etc.” there, and so we don’t get to see the discretionary nodes that were inserted for those hyphens. Well, that’s a shame, because they’re rather beautiful, but we’ll work out in class exactly what those would be, and in the final test program I will have arranged things so that that part of the box will print.

Now, why did BBBBBB hyphenate that way? Well, the reason is that I put it in as a hyphenation exception. Previously in that test program I actually told it to hyphenate everywhere: I put in \hyphenation, actually with lowercase letters, b-b-b-b-b-b, a hyphen everywhere. But as I told you, TeX takes back a hyphen in these positions. So the test program is at least working to the extent I did it: it found all six hyphens in that algorithm, but it took them back in those positions. If you really want to get a hyphen there, you could put it in as a discretionary yourself, but otherwise TeX’s hyphenation method won’t do it.

Now, there’s a strange font in use here, called the trip font. The trip font is something that doesn’t exist on any printer; it only exists as a TeX font metric file. Appendix B shows this font and gives the ligature information about it, and this is an example of a PL file. We’ll talk more about the ligatures and things tomorrow.
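
As a rough illustration of the BB-B-BBB result described above, here is a minimal Python sketch, not TeX’s actual code. It assumes a hypothetical exception table holding b-b-b-b-b-b, skips words with fewer than five letters as the lecture says, and then “takes back” hyphens too close to either end of the word. The names and the limits MIN_BEFORE = 2 and MIN_AFTER = 3 are assumptions; the lecture does not state the limits, and these values were chosen only because they reproduce the reported hyphenation.

```python
# Hypothetical exception table: word -> hyphenated form, as in the test program.
EXCEPTIONS = {"bbbbbb": "b-b-b-b-b-b"}

MIN_WORD_LENGTH = 5  # from the lecture: shorter words are not tried
MIN_BEFORE = 2       # assumed: letters required before the first hyphen
MIN_AFTER = 3        # assumed: letters required after the last hyphen


def exception_positions(word):
    """Return the hyphen positions listed for `word`, or None if absent.

    Position p means "a hyphen is allowed after the first p letters".
    """
    entry = EXCEPTIONS.get(word.lower())
    if entry is None:
        return None
    positions = []
    letters_seen = 0
    for ch in entry:
        if ch == "-":
            positions.append(letters_seen)
        else:
            letters_seen += 1
    return positions


def usable_positions(word):
    """Positions that survive after hyphens near the word ends are taken back."""
    if len(word) < MIN_WORD_LENGTH:
        return []
    positions = exception_positions(word)
    if positions is None:
        return []  # a fuller sketch would consult the patterns here
    return [p for p in positions if MIN_BEFORE <= p <= len(word) - MIN_AFTER]


def show(word):
    """Render `word` with a visible hyphen at each usable position."""
    cuts = set(usable_positions(word))
    out = []
    for i, ch in enumerate(word, start=1):
        out.append(ch)
        if i in cuts:
            out.append("-")
    return "".join(out)


print(show("bbbbbb"))  # bb-b-bbb
```

The exception entry lists five candidate positions, and only the two in the middle survive. The sketch deliberately ignores ligatures and kerns, which are where the real work lies.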
But basically, let me just summarize for you. When a B is followed by a B, we convert it into an A. Let me write this down: BB goes into A. When a B is followed by a hyphen, it goes into a C. You realize that ligatures can also involve hyphens, and so when we’re doing our discretionaries, we also have to take account of that. And B followed by a C gets kerning: it goes into B, kern, C. Now, A followed by a B has some kerning associated with it, and A followed by a hyphen also gets some kerning. And in that font it turns out that A followed by A also forms a ligature, but that isn’t going to arise in this example.

So now, when I made up the paragraph for the word BBBBBB, what happened? Well, the BB turned into an A, so this was a ligature A, followed by a kern node there, and then I had another BB, and so on, so it turned into A, kern, A, kern, A. That was what was in the list when the paragraph first started to be worked on. Then the second pass took over, and so we had to hyphenate this word. We looked at this, and the ligature remembered where it came from: it came from six B’s. It looked those up in the hyphenation table, found that there were hyphenation points here, and then we see what we can do.

In order to insert the first hyphen, notice that we have to go back and figure out what the ligatures would be in both cases. In the first case, where there’s no hyphenation, we get an A followed by a kern followed by something else, but in the case that the hyphen is present we would get an A followed by a different kern and then this hyphen. So we have to make up a discretionary node, where the first part corresponds to breaking here at this hyphen and inserting the hyphen. In that case we would put in, I believe, the ligature BB
followed by a kern, followed by a hyphen; and in the other part there would be nothing, I think; and then A followed by a kern in the case that there is no break. Remember, a discretionary has three parts: this comes before the break if there’s a break, this comes after the break, and if there’s no break we use this text.

Now the next case is more interesting, or more complicated, depending on your point of view, and that is: after you’ve decided whether or not to break at this hyphen, what about this hyphen? If you don’t break at the hyphen, we see what’s supposed to turn out as A, kern, A for those last four B’s. If we do break at the hyphen, the whole pattern changes, because these two B’s are then going to combine into a ligature, and this B and the hyphen are going to be a ligature. So the next part is going to make a discretionary: if we take the hyphen, it will come out to a C, and on the next line we’ll have a BB as a ligature followed by a kern followed by a B; but if we don’t take the hyphen, we get A, kern, A, a BB ligature and another BB ligature. All right, now, I think I did this right, but only the computer can really handle these kinds of things well.

In other words, the process of reconstructing after you’ve found a hyphen can be a little complicated in the presence of ligatures. But it’s important to do this, because of words like “difficult”, and because we wanted to keep a general pattern of ligatures, for TeX to do ligatures properly depending on what the user defines. I feel now that ligatures are really the answer to accents in foreign languages. The previous way TeX was doing accents turns off hyphenation, and it also doesn’t give the right keyboard patterns for somebody for whom accents are bread and butter. Now, if we have some text that involves multiple keystrokes to get accented letters, or even if we have some accented letters coming in as a
pair of characters less than 128, which you could imagine doing, then all the hyphenation algorithms and so on will still work fine, because this ligature mechanism has been built in to be quite general.

I see Barbara nodding her head here, and I just recall that in her plan for a Cyrillic font, one of the things for keyboarding is to type, I think, j1 for a kind of soft sign and j2 for a hard sign. These are the Russian symbols; I think they look something like that. I remember how to do them in script, but I don’t remember exactly. But the keyboarding is to type in those characters as j1 and j2, and these are recognized as a ligature which converts into that.

Now you might say, how am I going to give patterns for such things, because ones and twos in a pattern are supposed to mean numbers? Remember, if I said a b 4 d, that 4 isn’t considered part of a ligature with b. Okay, so what if I wanted to give a pattern that said, if the j1 is followed by a j2, then I want a hyphen? That’s never going to happen, of course, in Cyrillic, but how would I give that pattern? The answer is that you just put a 0 here and a 0 there and a 1 there: if it sees a 0, then it assumes a digit after it is like a letter. So there’s an escape for giving the patterns that work, and so the patterns can be described quite generally; at least I thought of that much.

So people working in foreign languages will have a chance to get Frank’s program that will generate hyphenation patterns for them, and they can compare against dictionaries. I think this is one of the more fun programs to play with, with different dictionaries, to see what happens. Any questions on that? Yes?

Have you considered extending that program for spelling verification?

No, I haven’t, but Frank has, and Frank’s thesis
discusses it. It turned out that this packed trie structure is an extremely compact way to represent a spelling checker, so it has two good uses, and it is probably the best method known for quick spelling checking. Good comment. Yes?

How do you handle apostrophes in words?

Apostrophes? Well, they could be put into hyphenation patterns if you wanted to, or else the apostrophe would be the end of the word and we would hyphenate only up to the apostrophe.

Okay, because I think in something like French you have a lot of words that begin with l’ followed by the rest of the word.

Right, yeah, good point. In fact, people from AMS: I just noticed this French place name, do you know this one? The director of the American Mathematical Society is LeVeque, and he capitalizes the V, and I just noticed here was a place name where the e was capitalized. But for these words, yeah, probably in French we would want to strip off the prefix like that, or else just put the words in the dictionary and make up the patterns on that assumption. See, in our English dictionary we put in the words ending with s and ed; all those things are in this dictionary, so that the patterns will be recognized without having to strip off suffixes. So this is probably an analogous thing as a prefix: a common prefix that would be put onto words, and it would probably be best to make that part of the patterns.

Okay, thanks a lot. Tomorrow the four sessions are specifically on system-dependent things, and if someone wants to ask me what I would recommend to read tonight, I would say, well, there’s a couple of good movies on in town. But if you really want to read some of this, then I would go to the index and look up “system dependencies”; that refers you to about 30 or 40 modules. Look at those modules just to see what they say. Those are the entire set of modules that I think you might be especially interested to
modify for your home system.