*TALO’s LANGUAGE TECHNOLOGY
A Note on Hyphenation
tel: +31 35 69 32 801; fax: +31 35 69 75 993; e-mail: [email protected]; http://www.talo.nl/
All rights reserved. Without limiting the rights under copyright reserved above, no part of this production may be reproduced, stored in or introduced into a retrieval system or transmitted, in any form or by any means (electronic, mechanical, photocopying, recording or otherwise), without the prior written permission of both the copyright owner and the above publisher of this article.
The greatest care has been taken in compiling this article. However, no responsibility can be accepted by the publisher or author for the accuracy of the information presented.
ABSTRACT
Most hyphenation programs are based on a linearity principle. According to this linearity principle the chance of an erroneous hyphenation is distributed evenly (= linearly) over the positions of letters in a word. In general, however, the syllables in a word are not distributed evenly over the positions of the letters in a word. In this Note, the linearity methods proposed by Liang and by other researchers are discussed. It is made clear why and how these methods lead to erroneous hyphenations or, sometimes, incorrectly, to no hyphenation at all. In the layout of a page these faulty hyphenation methods cause undesirable white rivers and irregular right margins.
SAMENVATTING (Dutch summary, translated)
Most hyphenation programs are based on a linearity principle. According to this linearity principle the chance of an erroneous hyphenation is distributed evenly (= linearly) over the positions of the letters in a word. In general, however, the syllables in a word are precisely not distributed evenly over the positions of the letters in a word. This Note discusses the linearity methods of Liang and of other researchers and makes clear why and how these methods lead to erroneous hyphenations, or sometimes even to no hyphenation at all. In the layout of a page these deficient hyphenation methods cause undesirable white space and irregular right margins.
1. Hyphenation
“The hyphenation dictionary/algorithm proves to be efficient up to a certain degree (like in any program). Some words are properly hyphenated (or else, all suggested hyphenations are OK). Some others are partially OK. Some are wrong. This is not new and Scribus [...].” (Louis Desjardins, internet communication, 2004). In the additional information some basic errors in hyphenation for French are mentioned, but going after each and every word would be impracticable.
This type of behaviour applies to many hyphenators. Quark XPress 6.1 does not hyphenate French mutes (muettes), at least on the face of it. However, this is not completely true. QX does not hyphenate the last two letters of a word, and therefore many two-letter muettes at the end of a word are kept from being split off (bru~nette), but not muettes as in "prac~tique", which QX hyphenates as "prac~ti~que". Some two-letter word endings are not muettes, e.g. in "coupé", which should be hyphenated as "cou~pé", but Quark XPress 6.1 does not hyphenate this word.
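The effect of never splitting off the last two letters can be illustrated with a small filter. This is a minimal sketch, not Quark XPress's actual code: the function names `apply_right_min` and `show`, the parameter `right_min`, and the hand-picked candidate positions are all assumptions made for illustration only.

```python
from typing import List


def apply_right_min(word: str, breaks: List[int], right_min: int = 3) -> List[int]:
    """Keep only breaks that leave at least `right_min` letters after the hyphen.

    A break index i means "a hyphen may go between word[i-1] and word[i]".
    right_min=3 models "the last two letters are never split off" (assumption).
    """
    return [i for i in breaks if len(word) - i >= right_min]


def show(word: str, breaks: List[int]) -> str:
    """Render the word with '~' at every allowed break position."""
    pieces = []
    last = 0
    for i in sorted(breaks):
        pieces.append(word[last:i])
        last = i
    pieces.append(word[last:])
    return "~".join(pieces)


# Candidate syllable boundaries, hand-picked for illustration (not computed).
CANDIDATES = {
    "brunette": [3, 6],   # bru~net~te
    "practique": [4, 6],  # prac~ti~que (spelling as used in the text)
    "coupé": [3],         # cou~pé
}

for word, breaks in CANDIDATES.items():
    print(show(word, apply_right_min(word, breaks)))
# bru~nette
# prac~ti~que
# coupé
```

The same rule that keeps the mute ending of "brunette" attached also removes the only legitimate break in "coupé", which is the behaviour described above.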
Other hyphenators produce hyphens at both sides of a single consonant, "ge-s-on-de" instead of "ge-son-de" (Afrikaans, Machteld Fick [1]). The model she uses is based on the mechanisms of the human brain and is called a "neural network". Principles of these models might be valuable for visual perception and related pattern recognition, but the theoretical assumptions do not match language. The microstructure of neurons does not necessarily behave identically to the macro level of language itself.

Fig. 1.: An Afrikaans hyphenator doubles hyphens due to instability around a syllable boundary (marked as a rectangle, be-s-ka-ving. instead of be-ska-ving.); and (marked with an underline) misses a hyphen, and inserts other hyphens incorrectly (marked with a circle).

Hyphenation of words in texts has a long history. Plain hyphenated-word lists have been used to hyphenate texts in documents (Troff on the Unices, PageMaker for the pre-press sector).

Liang [2] introduced a linear pattern recognition technique, which was applied to many languages (Boot [3], Tutelaers [4]). Knuth [5] claims that the Liang algorithm can be adapted for other languages by computing new patterns. Daelemans [6] has shown that this is not true, because complete dictionaries of many languages are impossible to construct.

According to Nunn [7], hyphenation programs based on pattern matching are essentially list-based as well. A drawback is that patterns which are correct for the words they were derived from (the database of words which was used to calculate the patterns) may produce incorrectly placed hyphens in words yet unknown to the system. In other words, this linear pattern recognition technique is not generally accurate. Considerable exception lists have been added to increase the accuracy. However, if pattern recognition incorrectly hyphenates in derived words, the exception list will continue to grow with time.

The method to compute Liang’s patterns has been published in detail, and because of its free availability the technology was widely applied in professional programs (QuarkXpress, InDesign, MSWord, etc.).
An aspect Liang and other authors didn’t take into consideration is the set of assumptions underlying the hyphenation method. Incorrect assumptions might violate the structure of the language and therefore lead to bad results. Therefore hyphenation pattern recognition should be based on algorithms that do not violate the underlying assumptions in language.
As will be demonstrated below, the structure of hyphenation of a language is too complex for most hyphenation techniques.
1.1. The morphology of language
A hyphen is a sign (-) used to join words, as in "pick-me-up" and "rock-forming", or to divide the syllables of a word at the end of a line.
The position of the hyphen mark varies, and sometimes the variance of position is considerable.
bi.o.met.rics
bi.om.e.try
bi.o.log.ic.al
bi.ol.o.gist
Even in the English language hyphenation strategies differ between famous dictionary publishers. Hyphenation varies within a language [8, 9, 10]. These variations arise from the preferences of the different Style Manuals. In this article we will look at words with as many syllables as possible, regardless of Style Manuals.
The European languages can be grouped into clusters of languages, which are branches of the Indo-European family. Each of these branches has its own characteristics which determine hyphenation. Some languages do not belong to the Indo-European group but for centuries have been influenced by their neighbours.
Most of the European languages are strongly inflected. The case system adds endings (dative, accusative, locative, etc.) which in English would look like:
house.at
house.to
house.in
house.above
house.below
This type of ending occurs in Finnish and Estonian. These languages add prepositions to the end of the word, and if youngsters (Finns and Estonians) have to learn a foreign language like English they have to memorize the concepts in rows like the above examples.
The Scandinavian languages add the article to the end of the word. For instance "hus" (which means house):
It should be noted that these morphological endings do not necessarily determine hyphenation.
Many languages use compound words. The mechanism to bind words is limited to implicit rules usually set by meaning. Compound words can be long.
German:
Ab.fall.ent.sorgungs.satzung
Agrar.ab.satz.förderungs.durch.führungs.gesetz
Arznei.mittel.aus.gaben.be.grenzungs.gesetz
Donau.schiff.fahrts.gesellschaft
This variety of mechanisms in a language should be accommodated in the mechanism of hyphenation.
1.2. Linear methods
In the 1980s Liang [2] developed an English hyphenator based on linear pattern comparisons. In Dutch the patterns look like:
1. lan4d3r
2. 4lann
3. l4do4p
4. er3t2h
5. 3t2hei
The patterns use digits to mark potential hyphen positions. Odd digits mark acceptable positions, even digits mark unacceptable ones, and higher digits take precedence over lower ones.

A few examples demonstrate the inadequacy of the generated patterns. The first of the five patterns listed above cannot choose between hyphen positions like eiland~raad, calan~drone and flan~dricisme. The second pattern does not match any word in *TALO’s database [11], and it would compete with words like aanbouw~plannen. The third pattern does not differentiate between words such as geld~opname, peul~doppen, meubel~dop, etc. The fourth pattern prefers alert~heid and knoert~hard over hyphenations like kinder~theater. The fifth pattern is more selective and catches all words with .~heid, but it leads to problems with words like Elizabeth~eilanden.

In most cases more than one candidate pattern needs to be considered in order to decide which one to use. The set of patterns is scanned linearly, from one position to the next, running through the word. In some cases the selection criterion leads to the correct solution; in other cases a wrong solution is chosen. As illustrated above, the Liang model requires a competition between patterns, but it is questionable whether competition can resolve every dilemma.

Statistical calculations were applied to generate the patterns, resulting in a set of probabilities. To deal with such a set of probabilities either the ranking or the averaging method can be used:
$i = 0, \ldots, n; \quad j = i, \ldots, i + m; \quad p = 0, \ldots, 9$ ‡
in which i is the linear index in the word, j is the linear index in the pattern, c is a pattern itself, and p is the probability of the pattern. The probability can only take an integer value in 0 ≤ p ≤ 9 (from certainly ’no’ up to certainly ’yes’). Its real value, as it would apply to the language itself, is truncated to a small integer. Therefore the classification of hyphenation weights h used to hyphenate consists of a few 10% steps only. Taking the highest value in a winner-takes-all approach disregards any inhomogeneity in the words to be hyphenated.

The distribution of probability is the level of certainty (or uncertainty) expressed as a function of the position within words. At individual positions the uncertainty "whether or not to hyphenate" could be high. The model is based on probability, which automatically implies the acceptance of errors. Usually those positions with a low uncertainty are accepted and the remaining positions are left unhyphenated.
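The pattern competition described above can be sketched in a few lines. This is only a minimal sketch under stated assumptions: the function names `parse_pattern`, `hyphenate` and `mark` are hypothetical, the pattern list is limited to the five Dutch examples given earlier, and the word-boundary markers and minimum prefix and suffix lengths used by production implementations are left out. At every position between two letters the highest digit of all matching patterns wins (winner-takes-all), and an odd winner permits a hyphen.

```python
import re
from typing import List, Tuple

# The five Dutch example patterns from the text; a real pattern set is far larger.
PATTERNS = ["lan4d3r", "4lann", "l4do4p", "er3t2h", "3t2hei"]


def parse_pattern(pattern: str) -> Tuple[str, List[int]]:
    """Split e.g. 'er3t2h' into letters 'erth' and gap levels [0, 0, 3, 2, 0].

    levels[k] is the digit standing in front of letters[k]; the last entry
    is the digit after the final letter. Missing digits count as 0.
    """
    letters = re.sub(r"\d", "", pattern)
    levels = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            levels[pos] = int(ch)
        else:
            pos += 1
    return letters, levels


def hyphenate(word: str, patterns: List[str]) -> List[int]:
    """Return break indices i (hyphen between word[i-1] and word[i])."""
    scores = [0] * (len(word) + 1)
    for letters, levels in map(parse_pattern, patterns):
        start = word.find(letters)
        while start != -1:
            for k, level in enumerate(levels):
                # Winner-takes-all: keep the highest digit seen at this gap.
                scores[start + k] = max(scores[start + k], level)
            start = word.find(letters, start + 1)
    # Odd digits allow a hyphen; the word edges are never break positions.
    return [i for i in range(1, len(word)) if scores[i] % 2 == 1]


def mark(word: str, breaks: List[int]) -> str:
    """Render the word with '~' at each break index."""
    out = []
    for i, ch in enumerate(word):
        if i in breaks:
            out.append("~")
        out.append(ch)
    return "".join(out)


for w in ("eilandraad", "calandrone"):
    print(mark(w, hyphenate(w, PATTERNS)))
# eiland~raad
# caland~rone
```

With only these five patterns both words receive the same break after "land", which is right for eiland~raad but not for calan~drone; this is the ambiguity noted for the first pattern.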
Two aspects of this model need to be examined:
a) the assumption of linearity
b) the implications of probability
ad a) The assumption of linearity implies that the probability of hyphenation is equally distributed over the word, which would be the case in (see fig. 2):
This is very unlikely to occur.

ad b) The implications of probability tell us that one of the positions is more probable than the other positions, but this statement also tells us that there is a probability of being wrong. Once probability is accepted and the categorisation only runs from 1, ..., 9, it is impossible to get a finer degree of separation (less than 1 unit).
___________
‡ This formula represents a moving average to emphasize the strongest hyphenation weights (h). p represents the unfiltered probability of the pattern c. To speed up the process one might choose a winner-takes-all approach, taking the pattern that is most probably correct. Since the calculated patterns were based on probabilities, errors are likely. The distribution of these errors over the positions within the word is not affected by any selection criterion. The distribution of errors is related to the levels of hyphenation marks in the patterns themselves and to the sample size on which the patterns were calculated.

Fig. 2.: The distribution of errors according to the linear model. At the beginning and end of a word the distribution falls off to zero because hyphenation before the beginning and after the end is not possible. Within the remaining interval (horizontal section) the distribution of errors is equally high.
The integer steps used to describe differences in probability lead to a considerable degree of uncertainty: they differentiate only in steps of 10%. A considerable number of errors can be expected if all probability levels are used in a linear hyphenation model.
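The loss of resolution caused by integer levels can be made concrete with a short sketch. It assumes, for illustration only, that a probability in [0, 1] is truncated to a single digit 0 to 9; the function name `to_level` and the sample values are hypothetical and are not taken from any published pattern set.

```python
def to_level(p: float) -> int:
    """Truncate a probability in [0, 1] to an integer level 0..9 (10% steps)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return min(9, int(p * 10))


# Two clearly different probabilities collapse onto the same level ...
print(to_level(0.61), to_level(0.69))   # 6 6
# ... while two nearly equal ones end up one full level apart.
print(to_level(0.69), to_level(0.71))   # 6 7
```

Any distinction smaller than one level is lost, and a small difference that happens to straddle a level boundary is exaggerated to a full step.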
1.3. The implications of incorrect assumptions
If the linear assumption is not valid, errors in hyphenation will be made. The invalidity of the Liang class of models can be demonstrated in German compounds such as:
The incorrect cases follow a more regular distribution matching the distribution of a linear model. In fact quite often there exists a bimodal or trimodal distribution (compounds consisting of 3 or 4 root words). Accepting a linear distribution instead of a multimodal compound distribution increases the uncertainty at compound boundaries, which increases the variance at the compound boundary. Consequently, less reliable information is available to determine a correct hyphen location. The real effect is that uncertainty rises sharply where correct hyphenation is needed most (see fig. 3).

If the strategy is chosen "to suppress hyphenation when uncertainty is high", many compound boundaries will never be hyphenated. This is likely to result in white rivers running all over the printed page and/or the appearance of a highly irregular right margin.

Fig. 3.: The probability of errors according to the linear model. The inaccuracy is enlarged around the compound boundaries. Therefore more errors are expected to be made at the bump of the curve than at other locations.
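The consequence of the suppression strategy can be sketched as follows. The uncertainty values below are invented purely to mimic the shape described for fig. 3 (higher at compound boundaries); they are not measurements, and the names `FRAGMENTS`, `UNCERTAINTY` and `suppress` are hypothetical. The word is the German compound Donau.schiff.fahrts.gesellschaft used earlier, split further into syllables by hand.

```python
from typing import List

# Hand-split fragments of "Donauschifffahrtsgesellschaft".
FRAGMENTS = ["Do", "nau", "schiff", "fahrts", "ge", "sell", "schaft"]

# UNCERTAINTY[i] belongs to the break between FRAGMENTS[i] and FRAGMENTS[i + 1].
# Invented values: the three compound boundaries (after "nau", "schiff" and
# "fahrts") are given the highest uncertainty.
UNCERTAINTY = [0.2, 0.7, 0.8, 0.7, 0.2, 0.3]


def suppress(fragments: List[str], uncertainty: List[float], limit: float) -> List[str]:
    """Glue fragments together across every break whose uncertainty exceeds `limit`."""
    chunks = [fragments[0]]
    for fragment, u in zip(fragments[1:], uncertainty):
        if u <= limit:
            chunks.append(fragment)      # break is accepted
        else:
            chunks[-1] += fragment       # break is suppressed
    return chunks


chunks = suppress(FRAGMENTS, UNCERTAINTY, limit=0.5)
print("~".join(chunks))              # Do~nauschifffahrtsge~sell~schaft
print(max(len(c) for c in chunks))   # 17
```

Under these assumed values every compound boundary is suppressed, leaving a 17-letter stretch that cannot be broken; in a narrow column such a stretch forces either wide word spacing or a short line.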
1.4. Uncertainty and suppression of hyphenation
Nearly all implementations of Liang’s model [2] use a probability value p to rank hyphen marks, in order to differentiate between certain and uncertain cases. Since the length of an individual pattern is restricted, certainty becomes less than expected, which leads to an unacceptably high number of hyphenation errors. In practice this inconvenience is resolved by "not hyphenating" above a particular uncertainty level. However, this method has its drawbacks. Due to the invalid assumption of linearity, uncertainty of hyphenation decisions rises at the compound boundary (see fig. 3). Consequently, compounds will be hyphenated in fewer cases, resulting again in highly fluctuating white rivers on the page or in an irregular right-hand margin.

Even a neural network like Fick’s model [1] is linear and uses summation to stabilize the calculations, and therefore it is subject to the same type of errors.

A dramatic case of handling uncertainty during hyphenation is given in fig. 4. An open software package derived from Sun’s StarOffice doesn’t treat the text adequately. The text was printed in narrow columns.

Fig. 4.: Hyphenation in the Dutch language as suggested by an open software package. The hyphenator’s result not only is erroneous but also, by keeping uncertain locations unhyphenated, leaves us with an awkward printing image. Words are divided unexpectedly and white rivers flow through columns. Each paragraph is defined by an indent and within a paragraph the lines should be filled up, but they aren’t! (correct hyphenations: beademingsinrichtingen, beaujolaisstreek, bedrijfsinternetvoorzieningen)
2. Conclusion
Current linear hyphenation technology is largely based on the early work of Liang (1983), but results have been disappointing. This approach to hyphenation applies a standard mechanism to all kinds of languages, despite the very different morphologies of these languages.

One of the assumptions used by Liang and others is linearity itself. Liang and others use a comparison method which goes step by step through a word. They assume that the probability of errors is evenly distributed as a function of position within a word. Given the structure of compounds in many languages, it can be made clear that the syllable distribution, and therefore the distribution of hyphens, does not behave linearly. Consequently, this wrong assumption gives rise to errors at the compound boundary.

It has been demonstrated that accepting probability to calculate and to give weights to the patterns implies acceptance of hyphenation errors, or an increase in errors that is greater than it needs to be.

The decision not to hyphenate uncertain cases causes white rivers or irregular right margins and, in the case of narrow columns, unacceptable conflicts which distort the layout.

REFERENCES
[1] Fick, M., Neurale netwerke as moontlike woordafkappingstegniek vir Afrikaans, S.A. Tydskrif vir Natuurwetenskap en Tegnologie, 22, 1, 2003.
[2] Liang, F.M., Word Hy-phen-a-tion by Com-put-er, PhD thesis, Stanford University, 1983.
[3] Boot, M., Taal, tekst, computer, Servire, Katwijk, 1984.
[4] Tutelaers, P., Herziene afbreekpatronen voor het Nederlands, MAPS 1993, pp. 187-190, 1993.
[5] Knuth, D.E., TEX and METAFONT: New Directions in Typesetting, Bedford, MA: Digital Press, 1979.
[6] Daelemans, W., Automatic Hyphenation: Linguistics versus Engineering, in: Worlds Behind Words, F.J. Heyvaert & F. Steurs eds., Leuven University Press, 1989.
[7] Nunn, A., Automatic Hyphenation of Dutch Words based on Linguistic Rules, Van Dale Lexicografie, Utrecht, 1998.
[8] Longman Dictionary of Contemporary English, Longman, Harlow, UK, 1987.
[9] The Oxford Colour Spelling Dictionary, Oxford University Press, New York, 1996.
[10] Webster’s Third New International Dictionary Unabridged, Merriam-Webster, Springfield, USA, 1993.
[11] *TALO’s Language Technology, Hyphenation, Spell checkers, Dictionaries, J.C. Woestenburg, *TALO bv, Bussum, 2002.

Glossary
Accuracy, the quality or state of being correct or precise; in hyphenation, an inverse function related to hyphenation errors.
Assumption, statement that is accepted as true without proof.
Bi-, tri-modal, involving two or three modes, in particular having two or three [...]
Compounds, a concept consisting of more than one root word, e.g., housemaster. In English a compound can be open, closed, or the two root words can be connected with a hyphen, e.g., well-dressed.
Distribution, the way in which something is spread out among a group or [...]
Hyphen, Hyphenation, a sign (-) used to join words to indicate that they have a combined meaning, or a division of a word at the end of a line. Placing hyphens is called hyphenation.
Indo-European, the family of languages spoken in most of Europe and Asia [...]
Linear, arranged along a straight, or nearly straight, line.
Linguistic pattern, a language specific sequence of characters or letters found [...]
Mutes, muettes, a letter not pronounced, in French muettes as "brunette", [...]
Neural Network, a computer system or program modelled on the human [...]
Non-linear, not denoting, or not involving or not arranged in a straight line.
Pattern recognition, the detection of defined patterns to be used in a deci- [...]
Probability, the extent to which an event is probable; the likelihood that an event will happen. Probable events are certain, improbable events are uncertain.
Quarks, Quantum, in physics a subatomic particle, postulated as the building blocks for other particles. A quantum is a discrete, smallest amount of energy. These smallest particles are a metaphor for the smallest particles of meaning.