【Abstract】 In recent years, variants of a neural network architecture for statistical language modeling have been proposed and successfully applied, e.g. in the language modeling component of speech recognizers. The main advantage of these architectures is that they learn an embedding for words (or other symbols) in a continuous space that helps to smooth the language model and provide good generalization even when the number of training examples is insufficient. However, these models are extremely slow in comparison to the more commonly used n-gram models, both for training and recognition. As an alternative to an importance sampling method proposed to speed up training, we introduce a hierarchical decomposition of the conditional probabilities that yields a speed-up of about 200, both during training and recognition. The hierarchical decomposition is a binary hierarchical clustering constrained by the prior knowledge extracted from the WordNet semantic hierarchy.
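As a rough sketch of the idea (in our own notation, not necessarily the paper's): given a binary tree over the vocabulary $V$ in which each word $v$ is reached by a path of binary decisions $b_1(v), \dots, b_m(v)$ with depth $m \approx \log_2 |V|$, the conditional probability of the next word factorizes as

\[
P(v \mid h) \;=\; \prod_{j=1}^{m} P\big(b_j(v) \mid b_1(v), \dots, b_{j-1}(v),\, h\big),
\]

where $h$ denotes the conditioning context (the preceding words) and each $b_j(v) \in \{0, 1\}$ is one branching decision on the path to $v$. Each factor is a single binary probability, so evaluating $P(v \mid h)$ costs $O(\log |V|)$ binary decisions instead of the $O(|V|)$ normalization of a flat softmax over the whole vocabulary, which is the source of the reported speed-up.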