First virtual steps toward automatic speech processing in Alsace
10 December 2011

Speech Processing Glossary

 Source: http://www.vocapia.com

  • Acoustic model A model describing the probabilistic behavior of the encoding of the linguistic information in a speech signal. LVCSR systems use acoustic units corresponding to phones or phones in context. The predominant approach uses continuous density hidden Markov models (HMMs) to represent context-dependent phones.
  • Acoustic parametrization (or acoustic front-end) see Speech Analysis
  • ASR Accuracy The speech recognition accuracy is defined as 1 - WER (see Word Error Rate below).
  • Automatic Language Recognition Process by which a computer identifies the language being spoken in a speech signal.
  • Automatic Speaker Recognition Process by which a computer identifies the speaker from a speech signal.
  • Automatic Speech Recognition (ASR) Process by which a computer converts a speech signal into a sequence of words.
  • Backoff Mechanism for smoothing the estimates of the probabilities of rare events by relying on less specific models (acoustic or language models); a toy bigram model with backoff is sketched after this glossary.
  • CDHMM Continuous Density HMM (usually based on Gaussian mixtures)
  • Filler word Words like uhm, euh, ...
  • FIR filter A Finite Impulse Response (FIR) filter produces an output that is the weighted sum of the current and past inputs.
  • Frame An acoustic feature vector (usually MFCC) estimated on a 20-30 ms signal window (see also Speech Analysis).
  • Frame Rate Number of frames per second (typically 100).
  • GMM Gaussian Mixture Model (i.e. a 1-state CDHMM); a GMM scoring sketch follows this glossary.
  • HMM Hidden Markov Models (or Probabilistic functions of Markov chains)
  • HMM state Usually a GMM. An HMM contains one or more states, typically 3 states for a phone model.
  • IIR filter An Infinite Impulse Response (IIR) filter produces an output that is the weighted sum of the current and past inputs, and past outputs (a sketch contrasting FIR and IIR filters follows this glossary).
  • Language model A language model captures the regularities in the spoken language and is used by the speech recognizer to estimate the probability of word sequences. One of the most popular methods is the so-called n-gram model, which attempts to capture the syntactic and semantic constraints of the language by estimating the frequencies of sequences of n words.
  • Lattice A word lattice is a weighted acyclic graph where word labels are assigned either to the graph's edges (or links) or to its vertices (or nodes). Acoustic and language model weights are associated with each edge, and a time position is associated with each vertex (a minimal lattice data structure is sketched after this glossary).
  • Lexicon or pronunciation dictionary A word list with pronunciations
  • LVCSR Large Vocabulary Continuous Speech Recognition (large vocabulary means 10k words or more). The size of the recognizer vocabulary affects the processing requirements.
  • MAP estimation (Maximum A Posteriori) A training procedure that attempts to maximize the posterior probability of the model parameters (which are therefore seen as random variables) Pr(M|X,W) (X is the speech signal, W is the word transcription, and M represents the model parameters).
  • MAP decoding A decoding procedure (speech recognition) which attempts to maximize the posterior probability Pr(W|X,M) of the word transcription given the speech signal X and the model M.
  • MLE (Maximum Likelihood Estimation) A training procedure (the estimation of the model parameters) that attempts to maximize the training data likelihood given the model f(X|W,M) (X is the speech signal, W is the word transcription, and M is the model).
  • MMIE (Maximum Mutual Information Estimation) A discriminative training procedure that attempts to maximize the posterior probability of the word transcription Pr(W|X,M) (X is the speech signal, W is the word transcription, and M is the model). This training procedure is also called Conditional Maximum Likelihood Estimation.
  • MFCC Mel Frequency Cepstrum Coefficients. The Mel scale approximates the sensitivity of the human ear. Note that there are many other frequency scales "approximating" the human ear (e.g. the Bark scale).
  • MF-PLP PLP coefficients obtained from a Mel frequency power spectrum (see also MFCC and PLP).
  • MLP A Multi-Layer Perceptron is a class of artificial neural networks. It is a feedforward network mapping some input data to some desired output representation, composed of three or more layers with nonlinear activation functions (usually sigmoids).
  • N-Gram Probabilistic language model based on an (N-1)-order Markov chain
  • N-best Top N hypotheses
  • OOV word Out Of Vocabulary word -- Each OOV word causes more than one recognition error (usually between 1.5 and 2 errors). An obvious way to reduce the error rate due to OOVs is to increase the size of the vocabulary.
  • %OOV Out Of Vocabulary word rate.
  • Perplexity The relevance of a language model is often measured in terms of test set perplexity, defined as pow(Prob(text|language-model), -1/n), where n is the number of words in the test text. The test perplexity depends on both the language being modeled and the model. It gives a combined estimate of how good the model is and how complex the language is (a perplexity sketch follows this glossary).
  • Phone Symbols used to represent the pronunciations in the lexicon
  • Pitch or F0 The pitch is the fundamental frequency of a (periodic or nearly periodic) speech signal. In practice, the pitch period can be obtained from the position of the maximum of the autocorrelation function of the signal. See also degree of voicing, periodicity and harmonicity. (In psychoacoustics the pitch is a subjective auditory attribute)
  • PLP analysis Perceptual Linear Prediction: compute the perceptual power spectral density (Bark scale), perform equal loudness preemphasis and take the cube root of the intensity (intensity-loudness power law), apply the IDFT to get the equivalent of the autocorrelation function, fit an LP model and transform into cepstral coefficients (LPCC analysis).
  • Quinphone (or pentaphone) Phone in context where the context includes the 2 left phones and the 2 right phones
  • Recording channel Means by which the audio signal is recorded (direct microphone, telephone, radio, etc.)
  • Sampling Rate Number of samples per second used to code the speech signal (usually 16000, i.e. 16 kHz for a bandwidth of 8 kHz). Telephone speech is sampled at 8 kHz. 16 kHz is generally regarded as sufficient for speech recognition and synthesis. The audio standards use sample rates of 44.1 kHz (Compact Disc) and 48 kHz (Digital Audio Tape). Note that signals must be filtered prior to sampling, and the maximum frequency that can be represented is half the sampling frequency. In practice a higher sample rate is used to allow for non-ideal filters.
  • Sampling Resolution Number of bits used to code each signal sample. Speech is normally stored in 16 bits. Telephony quality speech is sampled at 8 kHz with a 12 bit dynamic range (stored in 8 bits with a non-linear function, i.e. A-law or μ-law). The dynamic range of the ear is about 20 bits.
  • Spectrogram A spectrogram is a plot of the short-term power of the signal in different frequency bands as a function of time.
  • Speech Analysis Feature vector extraction from a windowed signal (20-30 ms). It is assumed that speech has short time stationarity and that a feature vector representation captures the needed information (depending on the task) for future processing. The most popular features are cepstrum coefficients obtained with a Mel Frequency Cepstral (MFC) analysis or with a Perceptual Linear Prediction (PLP) analysis (a framing sketch follows this glossary).
  • Speech-to-Text Conversion A synonym of Automatic Speech Recognition.
  • Speaker diarization Speaker diarization, also called speaker segmentation and clustering, is the process of partitioning an input audio stream into homogeneous segments according to speaker identity. Speaker partitioning is a useful preprocessing step for an automatic speech transcription system. By clustering segments from the same speaker, the amount of data available for unsupervised speaker adaptation is increased, which can significantly improve the transcription performance. One of the major issues is that the number of speakers is unknown a priori and needs to be automatically determined.
  • Triphone (or Phone in context) A context-dependent HMM phone model (the context usually includes the left and right phones)
  • Voicing The degree of voicing is a measure of the degree to which a signal is periodic (also called periodicity, harmonicity or HNR). In practice, the degree of periodicity can be obtained from the relative height of the maximum of the autocorrelation function of the signal.
  • Word Accuracy The word accuracy (WAcc) is a metric used to evaluate speech recognizers. The percent word accuracy is defined as %WAcc = 100 - %WER. It should be noted that the word accuracy can be negative. The Word Error Rate (WER, see below) is a more commonly used metric and should be preferred to the word accuracy.
  • Word Error Rate The word error rate (WER) is the commonly used metric to evaluate speech recognizers. It is a measure of the average number of word errors taking into account three error types: substitution (the reference word is replaced by another word), insertion (a word is hypothesized that was not in the reference) and deletion (a word in the reference transcription is missed). The word error rate is defined as the sum of these errors divided by the number of reference words. Given this definition the percent word error can be more than 100%. The WER is somewhat proportional to the correction cost. A worked WER computation is sketched after this glossary.
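
The short Python sketches below illustrate a few of the entries above. They are minimal sketches under stated assumptions, not reference implementations, and all coefficients, scores and vocabulary in them are made up for illustration.

First, FIR versus IIR filtering: the FIR output depends only on current and past inputs, so an impulse dies out after the last coefficient, while the IIR output also feeds back past outputs and can ring indefinitely. The coefficients below are arbitrary.

    def fir_filter(x, b):
        # FIR: each output sample is a weighted sum of current and past inputs.
        return [sum(bk * x[n - k] for k, bk in enumerate(b) if n - k >= 0)
                for n in range(len(x))]

    def iir_filter(x, b, a):
        # IIR: y[n] = sum_k b[k]*x[n-k] - sum_m a[m]*y[n-m], with a[0] = 1,
        # i.e. past outputs are fed back into the current output.
        y = []
        for n in range(len(x)):
            acc = sum(bk * x[n - k] for k, bk in enumerate(b) if n - k >= 0)
            acc -= sum(am * y[n - m] for m, am in enumerate(a[1:], 1) if n - m >= 0)
            y.append(acc)
        return y

    impulse = [1.0, 0.0, 0.0, 0.0]
    print(fir_filter(impulse, [0.5, 0.5]))          # finite response: [0.5, 0.5, 0.0, 0.0]
    print(iir_filter(impulse, [1.0], [1.0, -0.5]))  # decaying forever: [1.0, 0.5, 0.25, 0.125]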
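Next, the framing step behind the Frame, Frame Rate and Speech Analysis entries: a 25 ms analysis window shifted by 10 ms gives the typical rate of about 100 frames per second. The window and shift values are the usual defaults, assumed here.

    def frame_signal(samples, sample_rate=16000, window_ms=25, shift_ms=10):
        win = sample_rate * window_ms // 1000   # 400 samples at 16 kHz
        hop = sample_rate * shift_ms // 1000    # 160 samples at 16 kHz
        return [samples[i:i + win] for i in range(0, len(samples) - win + 1, hop)]

    one_second = [0.0] * 16000                  # one second of (silent) 16 kHz audio
    print(len(frame_signal(one_second)))        # 98 full frames, i.e. roughly 100 frames/s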
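A sketch of how an HMM state modeled by a diagonal-covariance GMM scores one acoustic feature vector, using log-sum-exp for numerical stability. The mixture weights, means and variances are invented.

    import math

    def log_gaussian_diag(x, mean, var):
        # Log density of a diagonal-covariance Gaussian.
        return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
                   for xi, m, v in zip(x, mean, var))

    def gmm_log_likelihood(x, weights, means, variances):
        # log sum_i w_i * N(x; mu_i, Sigma_i), computed with log-sum-exp.
        logs = [math.log(w) + log_gaussian_diag(x, m, v)
                for w, m, v in zip(weights, means, variances)]
        top = max(logs)
        return top + math.log(sum(math.exp(lp - top) for lp in logs))

    x = [0.2, -1.1]                             # a (tiny) 2-dimensional feature vector
    print(gmm_log_likelihood(x, [0.6, 0.4],
                             [[0.0, -1.0], [1.0, 1.0]],
                             [[1.0, 1.0], [0.5, 0.5]]))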
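A toy bigram language model with backoff: seen word pairs get a relative-frequency estimate, unseen pairs back off to the less specific unigram model. The constant backoff weight is a crude placeholder; real recognizers use properly normalized schemes such as Katz or Kneser-Ney backoff.

    from collections import Counter

    corpus = "the cat sat on the mat the cat ran".split()
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    total = sum(unigrams.values())

    def bigram_prob(prev, word, backoff_weight=0.4):
        if (prev, word) in bigrams:
            return bigrams[(prev, word)] / unigrams[prev]
        # Back off to the less specific unigram model for unseen pairs.
        return backoff_weight * unigrams[word] / total

    print(bigram_prob("the", "cat"))   # seen bigram: relative frequency 2/3
    print(bigram_prob("mat", "ran"))   # unseen bigram: scaled unigram estimate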
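Test-set perplexity as defined above, computed in log space to avoid underflow on long texts; the per-word probabilities are invented.

    import math

    def perplexity(word_log_probs):
        # pow(P(text|model), -1/n) computed in log space; the list holds the
        # natural-log probability the model assigns to each of the n test words.
        n = len(word_log_probs)
        return math.exp(-sum(word_log_probs) / n)

    print(perplexity([math.log(0.1)] * 4))   # 10.0: as "surprised" as a uniform
                                             # choice among 10 words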
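A minimal word lattice data structure matching the definition above: word labels and acoustic/language model weights on the edges, a time position on each node. The words and scores shown are arbitrary.

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        time: float            # time position in seconds

    @dataclass
    class Edge:
        start: int             # source node index
        end: int               # target node index
        word: str
        acoustic: float        # acoustic model weight (log domain)
        lm: float              # language model weight (log domain)

    @dataclass
    class Lattice:
        nodes: list = field(default_factory=list)
        edges: list = field(default_factory=list)

    lat = Lattice(nodes=[Node(0.00), Node(0.42), Node(0.90)],
                  edges=[Edge(0, 1, "hello", -310.2, -4.1),
                         Edge(0, 1, "hollow", -334.8, -9.7),
                         Edge(1, 2, "world", -290.5, -3.2)])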
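Finally, a worked word error rate computation via the standard edit-distance dynamic program over substitutions, insertions and deletions.

    def wer(reference, hypothesis):
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = edit distance between the first i reference words
        # and the first j hypothesis words.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i                        # i deletions
        for j in range(len(hyp) + 1):
            d[0][j] = j                        # j insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                d[i][j] = min(d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),  # substitution or match
                              d[i - 1][j] + 1,                              # deletion
                              d[i][j - 1] + 1)                              # insertion
        return d[len(ref)][len(hyp)] / len(ref)

    print(wer("the cat sat on the mat", "the cat sat mat"))  # 2 errors / 6 words = 0.333...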