Institute for Signal and Information Processing
Department of Electrical, Computer, and Systems Engineering

ABSTRACT

Hidden Markov Models and n-gram language modeling have been the dominant approach in continuous speech recognition for almost 15 years. Though successes have been well-documented, fundamental limitations of this paradigm surface at both the acoustic and language modeling ends of the speech recognition problem. Although acoustic models based on linear statistical assumptions have led to steadily improved performance on speech collected in benign environments, they are still sorely lacking on spontaneous data encountered in the field. Similarly, robust parsing of dialogs and unconstrained man-machine communications is a serious problem for today’s [...]

In this session, we attempt to stimulate a discussion on new approaches in statistical modeling. Researchers from both inside and outside of the speech community are invited to present new perspectives on how complex behavior can be modeled in a parsimonious manner. Our panel discussion will attempt to identify and debate a handful of promising new directions in statistical modeling of speech.

1. INTRODUCTION

Currently, the most successful speech recognition systems use detailed models of local context with large numbers of parameters trained in a limited domain. In acoustic modeling, context-dependent hidden Markov models (HMMs) have become a standard approach for handling the variability due to local phonetic context, where local context may be a window of 3-5 phones. In language modeling, trigrams have dominated the field, with major improvements coming from use of higher-order n-grams. For problems where training data has been steadily increasing these models show steady improvement, as demonstrated in the NAB benchmarks, where word error rates of less than 10% have been achieved on an open vocabulary task. However, the same technology has provided minimal advances on less constrained tasks, specifically on the spontaneous conversational speech found in the Switchboard corpus, where performance has not moved much beyond 50% word accuracy in the past two to three years. In addition, commercial researchers often observe error rate increases of factors of 3-4 when [...]

Of course, both trigram language models and hidden Markov acoustic models have made tremendous contributions to progress in speech recognition. They clearly established the value of automatic training algorithms and the usefulness of Markov assumptions in simplifying both recognition and training complexity. In addition, the successes provide important evidence that local context (neighboring phonemes in acoustic modeling and neighboring words in language modeling) provides the most important information for speech recognition. Conditioning on local context seems to be an important attribute of a good statistical model. The question raised here is whether sufficient progress can be made simply by increasing the number of parameters in these models. Adaptation certainly helps improve performance, but it is predicated on reasonably accurate baseline performance. Progress in speech recognition is ultimately limited by the sophistication of statistical models, and current technology is unlikely to provide the capability for computers to really converse with humans. New models are needed to better capture variability at a local level and/or to model trends operating at a higher level.

Nonlinear systems theory has been an active area of research in the last twenty years in the field of dynamics. What is new is that it promises to be an active area of engineering in the next twenty years as a new wave of mathematics begins moving from the laboratory to the field. Much as linear system theory provided tools for scientists to analyze classes of problems previously thought too complicated, nonlinear system theory offers hope of providing tools to unlock the mysteries of a wide range of important biological signals such as speech.

Similarly, computational research in language has spanned decades. In the late 1950s, a hierarchy of grammatical formalisms was defined in an attempt to document the complexity of language. As HMMs were introduced in speech recognition, great excitement was generated by the fact that both acoustic models and language models could be represented as state machines. Researchers were quick to see, however, that this was simply the first step in representing the entire speech recognition problem as a formal language theory problem. HMMs were shown to be equivalent to regular grammars, and shown to simply be one step in a progression towards context-sensitive grammars. Today, we find systems routinely implementing context-free grammars and left regular grammars. True context-sensitive grammars, however, have so far been impractical for speech recognition and understanding applications. In addition, experiments in understanding spontaneous speech (e.g. in the ATIS task) have shown that conventional parsing techniques are ill-suited to processing spontaneous spoken language, and various robust parsing algorithms are now being explored.

In sum, research in both acoustic and language modeling currently benefits from the power of context-sensitive statistics, but both are also limited in not moving beyond the local level. Too many parameters are dedicated to the local structure at the expense of capturing global structure. As has been noted, “Knowing the microscopic laws of how things move still leaves us in the dark as to their larger consequences.” One of the attractions of nonlinear systems is the hope of modeling the coarse behavior of a system in which a detailed analysis is not required, a very common problem in statistical mechanics. Similarly, one of the attractions of grammatical language models is the potential for capturing the higher level structure inherent to language.

It is clear that our current formalisms are not adequate for the difficult recognition tasks at hand. Here, we take a look at some new directions that may offer a means of overcoming limitations of existing statistical models.

2. SESSION OVERVIEW

This session consists of four invited panelists:

Simon Haykin
Tony Robinson, “Some New Uses of EM in Acoustic Modeling”
Ted Briscoe, Cambridge University, “Language Modeling or Statistical Parsing?”
Fred Jelinek, “A Context-Free Headword Language Model”

The panelists were selected to provide perspectives on two key dimensions of the speech understanding problem: acoustic modeling and language modeling. On each of these topics, we have included a speaker drawn from outside the normal speech research community, and a speaker representing a somewhat more mainstream viewpoint.

The first two talks deal with the issue of nonlinear acoustic modeling. The piecewise constant model has been a staple of digital speech processing since the early 1970’s. A multivariate Gaussian model of observation vectors has been employed in hidden Markov model-based speech recognition systems since the early 1980’s. Though neural network-based approaches have been researched since the mid-1980’s, only recently has the performance of such models rivaled conventional technology.

Simon Haykin suggests a new approach to signal modeling based on chaotic signals. His research is representative of a new body of science devoted to the application of nonlinear dynamics to conventional classification problems. Classification of signals into deterministic and stochastic ignores an important class of signals, known as chaotic signals, that are deterministic by nature yet random in appearance. While direct modeling of the speech signal as output from a nonlinear system has not proven to provide enhancements over conventional analyses, recent research suggests these techniques are applicable to the statistical modeling problem that is the core of the acoustic modeling problem. In his talk, Haykin advocates an architecture that employs neural networks to perform the actual prediction/detection task. This is not unlike many hybrid speech recognition systems that now use a combination of hidden Markov models and neural networks.

In a companion talk, Tony Robinson discusses issues in acoustic modeling in the context of connectionist/HMM systems, which use neural networks to estimate posterior distributions for HMM states, and can be thought of as a non-linear extension of current HMM technology. While his general approach is based on non-linear models, the theme of his talk is new applications of one of the most powerful tools behind HMM technology: the expectation-maximization (EM) algorithm. Robinson examines the notion of a hidden component of the process, e.g. the state in an HMM. He explores how this state, which is currently used to capture contextual and temporal phonetic variability, can improve aspects of speech recognition from feature extraction to posterior distribution modeling to channel or outlier (goat) identification.

The subsequent two talks deal with issues related to the language modeling problem, which many people believe will be the source of most of the improvement in speech understanding systems in the next few years. We have seen many attempts at improving recognition through more sophisticated language modeling, from higher-order and variable-order n-grams to the introduction of grammatical structure, but such techniques have generally resulted in modest improvements in performance at the expense of significant increases in complexity. However, new work aimed at combining the advantages of grammatical structure and local context is emerging in various forms of lexicalized grammars. This combined approach may offer the most hope for performance gains, and one theme of the two language modeling talks is lexicalization and its [...]

Ted Briscoe suggests that interpretation of text requires a framework beyond n-gram models, due to a need of such systems to evaluate the relative likelihoods of complex grammatical relationships, which are hierarchical and therefore not easily captured by n-gram models. He advocates the use of statistical parse selection models over stochastic context free grammars, but quickly notes that integration of expectation-maximization (EM) techniques into these formalisms is challenging. Such formalisms do not lend themselves to the treatment of conditional probabilities and maximum likelihood calculations involving ambiguous-in-time candidate partial parses. Research to date in formalisms beyond CFGs has been disappointing. Yet, it is clear that such formalisms are needed to deal with the complex language models required for spontaneous [...]

In a companion talk, Fred Jelinek introduces a lexicalized stochastic context-free language model that takes advantage of a parser to define the phrase structure of a word string. The authors use the notion of a phrase headword to relate non-terminals directly to lexical items, and use the given parse structure to reduce the cost of computing the probability of a word string. The problem of sparse data in parameter estimation is addressed by defining word classes as would be used in a class grammar, but here the classes simply define a smoothing hierarchy. This approach is representative of a growing trend towards lexicalized [...]

Perhaps the most important aspect of this session will be the panel discussion held after the plenary talks. Some of the issues we feel are an outgrowth of the papers presented in [...]

• What will be the impact on the number of parameters required in a speech recognition system based on [...] If more parameters are required, then the benefits of such an approach might vanish for small training sets. If the number of parameters decreases, perhaps sensitivity to speaker or channel will increase.

• If linear models work well for variations within a phone as long as the time span is sufficiently small, might that argue for the use of non-linear models to represent the [...] New and improved spectral estimation techniques have often not proven to provide substantial gains in recognition performance, perhaps leading us to believe the hidden component of the acoustic model is the area needing a better statistical model. However, non-linear techniques are also motivated by articulatory models, where non-linearities are usually placed at the output level rather than embedded within the internal structure [...]

• Are the language models presented here only practical in an n-best rescoring framework, or can we envision using them to reduce the search space? Are there better ways to integrate new language models with acoustic modeling to produce more efficient recognition systems? When recognition performance is poor, or the search space is large, n-best outputs can be quite limiting, forcing one to sift through large numbers of competing hypotheses that look very similar. New language models must find a way to limit the number of hypotheses passed [...]

REFERENCES

S. Young, “Large Vocabulary Continuous Speech Recognition: A Review,” presented at the 1995 IEEE Automatic Speech Recognition Workshop, Snowbird, Utah, [...]

D.S. Pallett, et al., “1994 Benchmark Tests for the ARPA Spoken Language Program,” in Proceedings of the ARPA Spoken Language Systems Technology Workshop, Austin, Texas, USA, January 1995, pp. 5-36.

F. Jelinek, R. Mercer and S. Roukos, “Principles of Lexical Language Modeling for Speech Recognition,” in Readings in Speech Recognition, ed. A. Waibel and K.-F. Lee, Morgan Kaufmann Publishers, 1990.

A summary of recent SWITCHBOARD results is available at the URL: [...]

J.R. Deller, J.G. Proakis, and J.H.L. Hansen, Discrete Time Processing of Speech Signals, MacMillan, New [...]

H.O. Peitgen, H. Jurgens, and D. Saupe, Chaos and Fractals: New Frontiers of Science, Springer-Verlag, [...]

A.P. Dempster, N.M. Laird and D.B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, Vol. 37, [...]

Traditionally, signals have been classified into two basic types: deterministic and stochastic. This classification ignores an important family of signals known as chaotic signals, which are deterministic by nature and yet exhibit many of the characteristics that are normally associated with stochastic signals.
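To make the distinction concrete, the following is a minimal sketch (not taken from the talk) that compares a chaotic series with pseudo-random noise using delay coordinates and a nearest-neighbour check in the spirit of the false-nearest-neighbours method. The logistic map, the function names, and the threshold `ratio=10.0` are illustrative assumptions.

```python
import math
import random


def logistic_map(n, x0=0.3, r=4.0):
    """Deterministic series x[t+1] = r * x[t] * (1 - x[t]); chaotic for r = 4."""
    xs = [x0]
    for _ in range(n - 1):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def delay_embed(series, dim, lag):
    """Delay vectors [x[t], x[t+lag], ..., x[t+(dim-1)*lag]]."""
    n = len(series) - (dim - 1) * lag
    return [[series[t + k * lag] for k in range(dim)] for t in range(n)]


def false_neighbor_fraction(series, dim, lag, ratio=10.0):
    """Fraction of nearest neighbours in `dim` delay coordinates that move apart
    by more than `ratio` times their distance when one more coordinate is added.
    The threshold `ratio` is an assumed, illustrative value."""
    n = len(series) - dim * lag  # points that still exist in dim + 1 coordinates
    vectors = delay_embed(series, dim, lag)[:n]
    false_count = 0
    counted = 0
    for i in range(n):
        best_j, best_d = -1, math.inf
        for j in range(n):  # brute-force nearest-neighbour search
            if j == i:
                continue
            d = math.dist(vectors[i], vectors[j])
            if d < best_d:
                best_j, best_d = j, d
        if best_d == 0.0:
            continue  # identical points carry no information here
        counted += 1
        extra = abs(series[i + dim * lag] - series[best_j + dim * lag])
        if extra / best_d > ratio:
            false_count += 1
    return false_count / counted if counted else 0.0


if __name__ == "__main__":
    chaos = logistic_map(300)
    rng = random.Random(0)
    noise = [rng.random() for _ in range(300)]
    print("chaotic series:", false_neighbor_fraction(chaos, dim=1, lag=1))
    print("random series: ", false_neighbor_fraction(noise, dim=1, lag=1))
```

Both series look irregular when plotted, but for the chaotic one the next sample is a function of the current one, so neighbouring points stay close after one step and few neighbours are flagged as false; for the noise series most are.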
In this talk, we begin by reviewing some important aspects of nonlinear dynamics. This leads naturally into a discussion of chaotic systems, how they arise, physical phenomena that are known to be chaotic, and their practical applications.
The second half of the talk will be devoted to the characterization of chaotic signals and the theory of embedology, with emphasis on time series analysis.
Specifically, we will describe the following notions:
• Attractor dimension, and the correlation dimension
• Minimum embedding dimension, and its estimation using the method of false nearest neighbors
• Lyapunov spectrum, and its estimation
• Recursive prediction, and how to implement it using [...]

SOME NEW USES OF EM IN ACOUSTIC MODELING
This talk will raise some problems with the current techniques used in acoustic modeling and suggest some directions for future research. Firstly, the connectionist/HMM system known as ABBOT will be briefly introduced. The talk will then progress to suggest a series of new and largely untested applications of the EM algorithm in acoustic vector and acoustic model estimation for automatic speech recognition. These topics are under investigation at Cambridge University and it is hoped that they will contribute to the ABBOT system in the future.

It is acknowledged that the current acoustic vectors used in speech recognition systems are a poor representation of the speech signal. This is clear from speech coding work, whereby a standard LPC coder (e.g. LPC10e) may produce unintelligible output in the case of certain speaking styles or mild background noise.

Drawing from speech coding, we can aim to model the parameters of a source-filter model such as LPC. In such a model the source is Gaussian white noise or an impulse train. In conventional applications the LPC filter parameters of voiced speech are estimated assuming the white noise source, but it has been shown that an application of the EM algorithm can provide a maximum likelihood estimate of both the LPC parameters and the excitation parameters (the [...]

Drawing from speech perception, we know that formant locations are important to vowel identity and that formant frequencies are determined by vocal tract length and are speaker dependent. The simplest speaker invariant parameter is a formant ratio. However, we are not close to incorporating this knowledge in current ASR systems, as they generally work in the power spectra or cepstral domain. A start in this direction is to estimate the power spectral density as a Gaussian mixture and then to model the [...]

Currently HMMs model acoustic vector densities. We have shown that statistical models of posterior probabilities can also be used. However, the conversion of the posterior probabilities to likelihoods involves some approximations, which means that, as currently implemented, the training algorithm is not an EM algorithm. These approximations aside, we have shown that we can train on posterior probabilities and that this results in better models over the [...]

It is interesting to consider the connectionist architecture within the EM framework. We consider each unit as estimating an indicator variable which has values of “fire” or “not fire”. We can estimate the MAP probability of firing if we know both the input and the output to the network. Although this work is currently computationally constrained by an exhaustive search, it does propose approximations applicable to large networks or the use of [...]

A recent improvement has been the modeling of context dependent phones. Here we assume an indicator variable not only for the phone class at a given time but also for the phone context given the phone class. We have been able to use connectionist models to estimate this variable, which has resulted in improved speech recognition accuracy and speed [...]

Another promising candidate for acoustic modeling is the hierarchical mixture of experts (HME). This is essentially a decision tree with a probability of branching associated with each node. The EM algorithm may be used to reestimate the parameters of the system. There are several practical aspects of this architecture that need to be addressed before it can be applied to large speech tasks. The HME can either be applied as a static pattern classifier, with a Markov model used to model the dynamics in much the same way as connectionist/HMM hybrid systems, or the dynamics can be directly incorporated.

Finally, an acknowledged problem in speech recognition is that some speakers are much easier to recognize than others. Out of ten speakers in an unlimited vocabulary read speech evaluation, it is not uncommon for the best speaker to have an order of magnitude lower error than the worst speaker. Hence the overall error rate is dominated by a few outliers (the goats). A proposed “EM” solution to this problem is to label every speaker with an indicator variable (sheep or goat) and use the observed recognition rate to estimate the probability of being a goat. By weighting the training set by these probabilities it is expected that the available modeling power can be better used and the expected error rate decreased.

Another viewpoint on the same scenario is that the goatiness factor is determined by the channel conditions. We consider broadcast speech as a major source of acoustic data for future speech systems. By considering the confidence that the decoded speech came from a clean source rather than a dirty source, we hope to filter the unending supply of broadcast speech and train in proportion to the sequential MAP estimate of sheepishness. We hope that this will liberate us from the very significant resources required to construct today’s speech corpora and hence result in significantly better speech systems.
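The sheep/goat labelling can be illustrated with a small EM loop. This is a sketch under stated assumptions, not the procedure used in the talk: each speaker's error count is modelled as binomial, with one word error rate for sheep and another for goats, and the function `em_sheep_goat`, its initialisation, and the example counts are all hypothetical.

```python
import math


def _clamp(p, eps=1e-6):
    """Keep a probability strictly inside (0, 1) so that log() is defined."""
    return min(max(p, eps), 1.0 - eps)


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def _loglik(errors, words, rate):
    """Binomial log-likelihood without the coefficient (it cancels in the posterior)."""
    return errors * math.log(rate) + (words - errors) * math.log(1.0 - rate)


def em_sheep_goat(counts, iters=50):
    """counts: list of (errors, words) per speaker.
    Returns (goat_posteriors, sheep_rate, goat_rate, goat_prior)."""
    rates = [e / w for e, w in counts]
    # Assumed initialisation: best speaker defines sheep, worst defines goats.
    sheep_rate, goat_rate, prior = _clamp(min(rates)), _clamp(max(rates)), 0.5
    post = [0.5] * len(counts)
    for _ in range(iters):
        # E-step: posterior probability that each speaker is a goat.
        post = [
            _sigmoid(
                math.log(prior) + _loglik(e, w, goat_rate)
                - math.log(1.0 - prior) - _loglik(e, w, sheep_rate)
            )
            for e, w in counts
        ]
        # M-step: re-estimate the prior and the two error rates.
        prior = _clamp(sum(post) / len(post))
        goat_words = sum(p * w for p, (e, w) in zip(post, counts))
        sheep_words = sum((1.0 - p) * w for p, (e, w) in zip(post, counts))
        if goat_words > 0:
            goat_rate = _clamp(sum(p * e for p, (e, w) in zip(post, counts)) / goat_words)
        if sheep_words > 0:
            sheep_rate = _clamp(sum((1.0 - p) * e for p, (e, w) in zip(post, counts)) / sheep_words)
    return post, sheep_rate, goat_rate, prior


if __name__ == "__main__":
    # Hypothetical per-speaker (errors, words) counts: eight sheep, two goats.
    counts = [(48, 1000), (52, 1000), (45, 1000), (60, 1000), (55, 1000),
              (41, 1000), (50, 1000), (58, 1000), (350, 1000), (420, 1000)]
    post, sheep_rate, goat_rate, prior = em_sheep_goat(counts)
    for (e, w), p in zip(counts, post):
        print(f"errors={e:4d}/{w}  P(goat)={p:.3f}")
```

The returned posteriors could then serve as per-speaker training weights, which is one reading of the weighting scheme described above.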
Burshtein 1990: “Joint maximum likelihood estimation of Pitch and AR parameters using the EM algorithm”, pages 797-800, Proceedings of ICASSP 90, IEEE.
Zolfaghari and Robinson 1995: Cambridge University Engineering Department Technical Report.
Bourlard and Morgan 1994: Continuous Speech Recognition: A Hybrid Approach, Kluwer Academic Publishers.
Robinson, Hochberg and Renals 1995: “The use of recurrent networks in continuous speech recognition”, chapter 19 in Automatic Speech and Speaker Recognition - Advanced Topics, Editors C. H. Lee, K. K. Paliwal and F. K. Soong, Kluwer Academic Publishers.
[...] retraining of recurrent neural networks”, Advances in Neural Information Processing Systems, Vol. 8, Morgan Kaufmann.
Cook and Robinson 1995: “Training MLPs via the Estimation-Maximisation algorithm”, Proceedings of the IEE conference on Artificial Neural Networks.
Jordan and Jacobs 1994: “Hierarchical mixtures of experts and the EM algorithm”, Neural Computation, Vol. 6, pp. 181-214.
Kershaw, Hochberg and Robinson 1995: “Context-Dependent Modeling in the ABBOT LVSCR System,” presented at the 1995 IEEE Automatic Speech Recognition Workshop, Snowbird, Utah, U.S.A., Dec. 1995.
Waterhouse and Robinson 1995: “Pruning and Growing Hierarchical Mixtures of Experts”, Proceedings of the IEE conference on Artificial Neural Networks.
In language modeling, a corpus of sentences is treated as a set of observed outputs of an unknown stochastic generation model, and the task is to find the model which maximises the probability of the observations. When the model is non-deterministic and contains hidden states, the probability of a sentence is the average probability of each distinct derivation (sequence of hidden states) which could have generated it. For speech recognition, language modeling is an appropriate tool for evaluating the plausibility of different possible continuations (word candidates) for a partially recognized utterance. For speech or text understanding the derivations themselves are crucial to distinguish different interpretations. For example:

(a) [...]
(b) [...]
(c) ?Charlotte’s father gets played with a lot
(d) (S(NP Charlotte) (VP is (VP (V playing with) [...]
(e) (S(NP Charlotte) (VP(VP is (VP playing)) [...]

In (a), there is an ambiguity between an interpretation in which Charlotte is playing accompanied by her (pet) rabbit and one in which her plaything is a (toy) rabbit. In the latter case play with is best analyzed as a phrasal verb and the rabbit as a direct object, predicting for example the possibility of passivization in (b).

However, given the former interpretation, with is best analyzed as introducing an adverbial prepositional phrase modifier of the verb play, predicting the accompaniment interpretation of with (as one possibility). The oddity of (c), in which we are forced to interpret Charlotte’s father as a plaything, is then a consequence of the fact that passivization only applies to direct object noun phrases and [...]

From the perspective of speech recognition, the alternative derivations for (a) are not as important as the relative likelihood with which rabbit may follow play with (though accurate assessment of the likelihood with which a noun denoting a plaything rather than playmate will be followed by is/was/gets/... played with might well require both modeling the distinct derivations and passivization).

For interpretation, the relative likelihood of the distinct derivations is crucial. Furthermore, the derivation must encode hierarchical constituency (i.e. bracketing) to be useful. Thus, for (a) we need to know whether the preposition with combines with play to form a phrasal transitive verb (d) or whether it combines with her rabbit to form a prepositional phrase which in turn combines with the intransitive verb play (e), because recognition of which verb we are dealing with determines the difference of [...]

N-gram models cannot directly encode such differences of hierarchical organization, which is why most stochastic approaches to text understanding have employed stochastic context-free grammars (SCFGs) or feature/unification-based abbreviations of them. Within this framework it is possible to treat the grammar as a language model and return the most likely derivation as the basis for [...]

However, there are a number of problems with this approach. Firstly, SCFGs and their feature-based variants associate a global probability with each grammar rule rather than a set of conditional probabilities that a given rule will apply in different parse contexts. A number of experiments by different groups have independently confirmed that modeling aspects of the parse context improves the accuracy with which the correct derivation is selected by up to 30%. Within the text understanding community, most researchers are now experimenting with statistical parse selection models rather than language models.

Secondly, SCFGs and close variants, unlike n-gram models, are not easily lexicalized, in the sense that the contribution of individual words to the likelihood of a given derivation is not captured. Those models which are lexicalized use word forms rather than word senses to condition the probability of different derivations. It is likely that considerable improvement could be obtained by utilizing broad semantic class information (such as that rabbit can denote an animal or toy) to evaluate the plausibility of different predicate-argument combinations.

Selecting a correct derivation is only one aspect of the problem of robust practical parsing for text or speech understanding. Another, bigger problem is ensuring that the grammar covers the (sub)language. Language models offer one [...] expectation-maximisation (EM) techniques can be utilized not only to find the (locally) optimal probabilities for grammar rules but also to find the optimal set of rules (by, for example, removing rules re-estimated to some “floor” threshold probability). However, it is not clear that EM techniques can be coherently utilized in this way with statistical parse selection models which cannot be interpreted as language models.

In the talk I will discuss the issues of parse selection vs. language models, lexicalization of models, and integrated approaches to statistical rule induction as well as ranking in further detail.
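The difference between scoring a word string and selecting a derivation can be sketched with a toy SCFG. Everything numeric here is hypothetical: the rule probabilities are invented for illustration, and the two bracketings are completed by assumption from the truncated forms (d) and (e), with multi-word leaves treated as single terminals for brevity.

```python
from math import prod

# Hypothetical rule probabilities; each left-hand side sums to 1.
RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Charlotte",)): 0.5,
    ("NP", ("her rabbit",)): 0.5,
    ("VP", ("is", "VP")): 0.4,
    ("VP", ("V", "NP")): 0.2,
    ("VP", ("VP", "PP")): 0.1,
    ("VP", ("playing",)): 0.3,
    ("V", ("playing with",)): 1.0,
    ("PP", ("with", "NP")): 1.0,
}

# Assumed completions of bracketings (d) and (e): (label, child, child, ...).
TREE_D = ("S", ("NP", "Charlotte"),
          ("VP", "is", ("VP", ("V", "playing with"), ("NP", "her rabbit"))))
TREE_E = ("S", ("NP", "Charlotte"),
          ("VP", ("VP", "is", ("VP", "playing")),
           ("PP", "with", ("NP", "her rabbit"))))


def derivation_prob(tree):
    """Product of the probabilities of every rule used in the derivation."""
    label, children = tree[0], tree[1:]
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    p = RULES[(label, rhs)]
    return p * prod(derivation_prob(c) for c in children if isinstance(c, tuple))


def leaves(tree):
    """Terminal yield of a tree, left to right."""
    out = []
    for c in tree[1:]:
        out.extend(leaves(c) if isinstance(c, tuple) else [c])
    return out


if __name__ == "__main__":
    derivations = {"d": TREE_D, "e": TREE_E}
    probs = {name: derivation_prob(t) for name, t in derivations.items()}
    # Language-model view: the string is scored by combining all derivations.
    print("string probability:", sum(probs.values()))
    # Parse-selection view: only the best-scoring derivation is returned.
    print("selected derivation:", max(probs, key=probs.get))
```

Both trees yield the same word string, so a recognizer only needs their combined score, whereas interpretation needs to know which of the two was selected.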
Ciprian Chelba, Anna Corazza, and Frederick Jelinek
Center for Language and Speech Processing
{chelba, corazza, jelinek}

This is an attempt to base a language model on a context-free lexicalized grammar. A language model must be statistical, and therefore simple enough so its parameters can be estimated from data. Since it should reflect message [...]

The lexical non-terminals will be related directly to the words belonging to a vocabulary V and will correspond to the intuitive notion of headwords. Headwords are thought to have inheritance properties, and so the production rules should basically have the form (s denotes the unique sentence non-terminal located at the root of the tree)

[...]

Here v(H) denotes the unique word v which has the same name as the headword H.

Figure 1 illustrates a possible derivation of the sentence “Her step-mother, kissing her again, seemed charmed.” Note that attached to the root node of the tree is the headword corresponding to the main verb seemed of the sentence.

Figure 1. An illustration of a possible derivation of the sentence “Her step-mother, kissing her again, seemed charmed.”

It is interesting to observe that a bigram language model can be considered to have the special headword form

[...]

Comparing the two forms, we see that the general headword language model has essentially the same parameter complexity as the bigram language model. A headword language model whose power could be compared to that of the trigram language model would have productions of the form

$H \mid LB(H) = F \rightarrow HG$
$H \mid RB(H) = F \rightarrow HG$
$H \mid LB(H) = F \rightarrow GH$
$H \mid LB(H) = F \rightarrow v(H)$
$H \mid RB(H) = F \rightarrow v(H)$

where the notation $H \mid LB(H) = F \rightarrow HG$ means that F is the left brother of H and that HG is generated from H. Naturally, $RB(H) = F$ means that F is the right [...]

We use the automatic transformational (AT) parser of Brill to parse a large amount of text and thus provide a basis, in conjunction with headword inheritance rules (1), for the collection of headword production statistics. The availability of an AT parser solves in principle the problems of scarcity of data and of transportability. Any domain with sufficient text data provides sufficient parse data.

A language model is a device that provides to the [...] hypothesis about the string of the preceding k - 1 words. For context free grammars these probabilities can be computed by the method of Jelinek and Lafferty. The headword language model is particularly suited to N-best resolution when we use the hypothesis that the word strings were produced by the process corresponding to the parse specified by the AT parser applied to these strings. The computation of such a probability involves one derivation only and is therefore linear with the length of the sentence.

Clearly, a large amount of parsed data is necessary to adequately estimate headword production probabilities. In fact, many more productions of the types $H \rightarrow HG$ and $H \rightarrow GH$ will have non-zero probabilities than would [...] $P(v(G) \mid v(H))$ [...], since in $H \rightarrow HG$ and $H \rightarrow GH$ the headwords H and G may correspond to words that are quite separated in the text.

Therefore, we must smooth efficiently the relative frequencies obtained in the collection process. Let c(G) denote an appropriate class of the headword G and c(H) denote an appropriate class of the headword H. Then the appropriate [...]

$P(H \rightarrow HG) = \lambda f(H \rightarrow HG) + \ldots (c(H) \rightarrow c(H)c(G))$

We have derived the classes c(H) based on the method of [...]

1. That is, rules of the type: the headword of a simple noun phrase is the last noun; or the headword of a verb phrase is its first verb, etc. Such rules have been derived by linguists for use in the IBM statistical direct parser. Headword inheritance rules can also be derived automatically from data by use of information theoretic principles.

REFERENCES

Henry James: What Maisie Knew, Penguin Books, New [...]

Eric Brill: A Corpus-Based Approach to Language Learning, Ph.D. dissertation, Department of Computer and Information Science, University of Pennsylvania, [...]

F. Jelinek, et al.: Decision tree parsing using a hidden derivation model, in Proceedings of the ARPA Spoken Language Systems Technology Workshop, [...]

F. Jelinek and J. Lafferty: Computation of the probability of initial substring generation by stochastic context free grammars, Computational Linguistics, Vol. 17, [...]

P.F. Brown, et al.: Class-based n-gram models of natural language, Computational Linguistics, Vol. 18, No. 4, pp. 467-480, December 1992.

