YOSEKS VAS
Voice Analysis System
Platform Features
Phonexia Speaker Identification (SID) uses the power of voice biometry to automatically recognize a speaker by their voice.
Technology
- A calibration tool for even higher accuracy
- 1:1 (verification), 1:n and n:m (identification) comparison possible
- The technology is language-, accent-, text-, and channel- independent
- Uses deep neural networks to generate highly representative voiceprints
- Applies state-of-the-art channel compensation techniques, verified by NIST evaluation
- Compatible with the widest range of audio sources possible (applies channel compensation techniques): GSM/CDMA, 3G, VoIP, landlines, satphones, etc.
Input
- Input format for processing: WAV or RAW (PCM unsigned 8 or 16 bits, IEEE float 32-bit, A-law or Mu-law, ADPCM), FLAC, OPUS; 8 kHz+ sampling (other audio formats automatically converted)
- Recommended speech signal for enrolment: 20+ secs
- Recommended minimum speech signal for identification: 7+ secs
In specific use cases, the time required for speaker enrolment and identification can be much shorter.
Output
- XML/JSON format with all results or results files with a log likelihood ratio (-∞;∞) and/or percentage metric scoring <0-100%>
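The mapping between the log likelihood ratio (-∞;∞) and the percentage metric <0-100%> is not specified here; a minimal sketch, assuming a logistic (sigmoid) calibration, could look like this:

```python
import math

def llr_to_percentage(llr: float) -> float:
    """Map a log likelihood ratio from (-inf, inf) to a <0-100%> score.

    Assumes a plain logistic (sigmoid) mapping; the actual calibration
    used by the SID engine may differ.
    """
    return 100.0 / (1.0 + math.exp(-llr))

# An LLR of 0 means both hypotheses are equally likely, i.e., 50%.
```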
Accuracy and Processing speed
Achieves more than 99% accuracy (0.96% Equal Error Rate based on NIST evaluation data set).
Up to 182× faster than real-time processing on 1 CPU core with the most precise model – for example, a standard 1 CPU core server processes up to 4,368 hours of audio in one day of computing time.
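The throughput figures above follow from the real-time factor by simple arithmetic: hours of audio per day = speed factor × 24 hours × number of cores. A quick sanity check:

```python
def hours_per_day(real_time_factor: float, cpu_cores: int = 1) -> float:
    """Hours of audio processed in one day of computing time."""
    return real_time_factor * 24 * cpu_cores

# 182x faster than real time on 1 core -> 4,368 hours of audio per day.
```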
Speaker Diarization enables segmentation of voices in one mono-channel audio recording.
Technology
- Trained with an emphasis on spontaneous telephone conversation
- The technology is language-, accent-, text-, and channel- independent
- Compatible with the widest range of audio sources possible (applies channel compensation techniques): GSM/CDMA, 3G, VoIP, landlines, satphones, etc.
Input
- Input format for processing: WAV or RAW (PCM unsigned 8 or 16 bits, IEEE float 32-bit, A-law or Mu-law, ADPCM), FLAC, OPUS; 8 kHz+ sampling (other audio formats automatically converted)
Output
- XML/JSON format with all results or results files with segmentation of speech, silence, and technical signals (i.e., elimination of phone lines beeps, DTMF tones, music, etc.)
- Audio file extracted for each speaker
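The per-speaker audio extraction can be reproduced from the segmentation output; the sketch below assumes the segments arrive as (start, end) pairs in seconds, which is an assumption about the output schema, not the documented format:

```python
import wave

def extract_speaker(src_path: str, dst_path: str, segments) -> None:
    """Write the concatenated segments of one speaker to a new WAV file.

    `segments` is a list of (start_sec, end_sec) tuples, e.g. taken from
    the diarization output; the engine's exact output schema is an
    assumption here.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        frames = b""
        for start, end in segments:
            src.setpos(int(start * rate))              # seek to segment start
            frames += src.readframes(int((end - start) * rate))
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)                          # nframes fixed on close
        dst.writeframes(frames)
```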
Processing speed
Approx. 50x faster than real-time processing on 1 CPU core.
I.e., a standard 1 CPU core server processes 1,200 hours of audio in 1 day of computing time.
Language Identification (LID) automatically detects the spoken language or dialect.
Technology
- The technology is text and channel independent
- Applies state-of-the-art channel compensation techniques, verified by NIST evaluation
- Compatible with the widest range of audio sources possible (applies channel compensation techniques): GSM/CDMA, 3G, VoIP, landlines, satphones, etc.
Supported languages
Afan_Oromo, Albanian, Amharic, Arabic, Arabic_Gulf, Arabic_Iraqi, Arabic_Levantine, Arabic_Maghrebi, Arabic_MSA, Azerbaijani, Bangla_Bengali, Bosnian, Burmese, Chinese_Cantonese, Chinese_Dialects, Chinese_Mandarin, Creole, Croatian, Czech, Dari, English_American, English_British, English_Indian, Farsi, French, Georgian, German, Greek, Hausa, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Khmer, Kirundi_Kinyarwanda, Korean, Lao, Macedonian, Ndebele, Pashto, Polish, Portuguese, Punjabi, Russian, Serbian, Shona, Slovak, Somali, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Tibetan, Tigrigna, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese
A user can add new languages to the system without assistance from Phonexia; approx. 20 hours of audio recordings are recommended for training a new language.
Input
- Input format for processing: WAV or RAW (PCM unsigned 8 or 16 bits, IEEE float 32-bit, A-law or Mu-law, ADPCM), FLAC, OPUS; 8 kHz+ sampling (other audio formats automatically converted)
- Recommended minimum speech signal for identification: 7+ secs
Output
- XML/JSON format with all results or results files with a logarithm of probabilities scoring (-∞;0> and/or percentage metric scoring <0-100%>
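Converting the log-of-probability scores to the percentage metric can be sketched as follows, assuming natural-log probabilities that are normalized after exponentiation (the engine's actual calibration may differ):

```python
import math

def logprobs_to_percentages(scores: dict) -> dict:
    """Convert per-language log-probability scores (-inf, 0] to percentages.

    `scores` maps language names to log probabilities; the field layout
    and the natural-log assumption are illustrative, not the documented
    output format.
    """
    probs = {lang: math.exp(lp) for lang, lp in scores.items()}
    total = sum(probs.values())
    return {lang: 100.0 * p / total for lang, p in probs.items()}
```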
Processing speed
Approx. 20x faster than real-time processing on 1 CPU core with the most precise model.
I.e., a standard 1 CPU core server processes 480 hours of audio in 1 day of computing time.
Gender Identification (GID) automatically recognizes the gender of a speaker.
Technology
- Uses the acoustic characteristics of speech
- Speech is converted to frequency spectra and modeled with advanced statistical methods
- The technology is language-, accent-, text-, and channel- independent
- Compatible with the widest range of audio sources possible (applies channel compensation techniques): GSM/CDMA, 3G, VoIP, landlines, satphones, etc.
Input
- Input format for processing: WAV or RAW (PCM unsigned 8 or 16 bits, IEEE float 32-bit, A-law or Mu-law, ADPCM), FLAC, OPUS; 8 kHz+ sampling (other audio formats automatically converted)
- Recommended minimum speech signal for identification: 7+ secs
Output
- XML/JSON format with all results or results files with processed information (scores for male and female)
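Picking the final label from the male/female scores is a simple argmax; the field names below are an assumption, not the documented schema:

```python
def classify_gender(scores: dict) -> str:
    """Return the gender label with the higher score.

    `scores` mimics the engine output, e.g. {"male": 2.1, "female": -1.3};
    the exact field names are an assumption.
    """
    return max(scores, key=scores.get)
```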
Processing speed
Approx. 200x faster than real-time processing on 1 CPU core.
I.e., a standard 1 CPU core server processes 4,800 hours of audio in 1 day of computing time.
Age Estimation (AGE) estimates the age of a speaker from an audio recording.
Technology
- Trained with an emphasis on spontaneous telephone conversation
- The technology is language-, accent-, text-, and channel- independent
- Compatible with the widest range of audio sources possible (applies channel compensation techniques): GSM/CDMA, 3G, VoIP, landlines, satphones, etc.
Input
- Input format for processing: WAV or RAW (PCM unsigned 8 or 16 bits, IEEE float 32-bit, A-law or Mu-law, ADPCM), FLAC, OPUS; 8 kHz+ sampling (other audio formats automatically converted)
Output
- XML/JSON format with all results or results files with age estimates
Processing speed
Up to 182× faster than real-time processing on 1 CPU core with the most precise model – for example, a standard 1 CPU core server processes up to 4,368 hours of audio in one day of computing time.
Speech Transcription (STT) converts speech signals into plain text.
Technology
- Trained with an emphasis on spontaneous telephone conversation
- Based on state-of-the-art techniques for acoustic modeling, including discriminative training and neural network-based features
- Compatible with the widest range of audio sources possible (applies channel compensation techniques): GSM/CDMA, 3G, VoIP, landlines, satphones, etc.
Supported languages
Arabic, Chinese (beta version), Czech, Dutch, English UK, English US, Farsi (beta version), French, German, Italian, Spanish – Lat.Am., Polish, Russian, Slovak
Input
- Input format for processing: WAV or RAW (PCM unsigned 8 or 16 bits, IEEE float 32-bit, A-law or Mu-law, ADPCM), FLAC, OPUS; 8 kHz+ sampling (other audio formats automatically converted)
Output
- XML/JSON format with all results or results files with:
- One-best transcription, i.e., a file with a time-aligned speech transcript (start and end times of each word)
- n-best transcription, i.e., a confusion network with word hypotheses at each moment
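A one-best transcript can be flattened into plain text by sorting the words on their start times; the `word`/`start`/`end` field names below are assumptions about the JSON schema, not the documented format:

```python
def one_best_text(words) -> str:
    """Join a time-aligned one-best transcript into plain text.

    `words` mimics the JSON output: a list of dicts with `word`,
    `start`, and `end` keys (field names are an assumption).
    """
    return " ".join(w["word"] for w in sorted(words, key=lambda w: w["start"]))
```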
Processing speed
The 5th generation is approximately 7x faster than real-time processing on 1 CPU core – for example, a standard 1 CPU core server processes 168 hours of audio in one day of computing time.
The 4th generation is approximately 1.2x faster than real-time processing on 1 CPU core.
Keyword Spotting (KWS) identifies the occurrences of keywords and/or keyphrases in audio recordings.
Technology
- Robust acoustic-based technology, even with noisy recordings
- Keywords are automatically converted into phonemes and searched for
- Compatible with the widest range of audio sources possible (applies channel compensation techniques): GSM/CDMA, 3G, VoIP, landlines, satphones, etc.
Supported languages
Arabic, Chinese (beta version), Croatian, Czech, Dutch, English US, Farsi (beta version), French, German, Hungarian, Italian, Pashto, Polish, Russian, Slovak, Spanish – Lat.Am, Turkish (beta version)
A user can add an unlimited number of keywords to the system, as well as an unlimited number of pronunciation variants for each keyword.
Input
- Input format for processing: WAV or RAW (PCM unsigned 8 or 16 bits, IEEE float 32-bit, A-law or Mu-law, ADPCM), FLAC, OPUS; 8 kHz+ sampling (other audio formats automatically converted)
Output
- XML/JSON format with all results or results files generated with detected keywords (containing the keyword, start/end time, path, probability, etc.)
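A common post-processing step is filtering the detections by their probability; the field names below are assumptions about the JSON schema, not the documented format:

```python
def filter_hits(hits, min_probability: float = 0.8):
    """Keep only keyword detections at or above a probability threshold.

    `hits` mimics the JSON output: a list of dicts with `keyword`,
    `start`, `end`, and `probability` keys (field names are an
    assumption).
    """
    return [h for h in hits if h["probability"] >= min_probability]
```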
Processing speed
The 5th generation is approximately 30x faster than real-time processing on 1 CPU core, i.e. a standard 1 CPU core server processes 720 hours of audio in one day of computing time.
The 4th generation is approximately 10x faster than real-time processing on 1 CPU core.
Voice Activity Detection (VAD) identifies parts of audio recordings with speech content vs. nonspeech content.
Technology
- Trained with an emphasis on spontaneous telephone conversation
- The technology is language-, accent-, text-, and channel- independent
- Compatible with the widest range of audio sources possible (applies channel compensation techniques): GSM/CDMA, 3G, VoIP, landlines, satphones, etc.
Input
- Input format for processing: WAV or RAW (PCM unsigned 8 or 16 bits, IEEE float 32-bit, A-law or Mu-law, ADPCM), FLAC, OPUS; 8 kHz+ sampling (other audio formats automatically converted)
Output
- XML/JSON format with all results or results files with labels (speech vs. nonspeech segments)
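The speech/nonspeech labels make it easy to compute, for example, the fraction of a recording that contains speech; the (start, end) segment format below is an assumption about the output, not the documented schema:

```python
def speech_ratio(segments, total_duration: float) -> float:
    """Fraction of a recording labeled as speech.

    `segments` is a list of (start_sec, end_sec) speech intervals taken
    from the VAD output (format assumed); `total_duration` is the full
    recording length in seconds.
    """
    speech = sum(end - start for start, end in segments)
    return speech / total_duration
```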
Processing speed
Approx. 150x faster than real-time processing on 1 CPU core.
I.e., a standard 1 CPU core server processes 3,600 hours of audio in 1 day of computing time.
Speech Quality Estimator (SQE) measures the quality parameters of the speech in an audio recording.
Technology
- The technology is language-, accent-, text-, and channel- independent
- Compatible with the widest range of audio sources possible (applies channel compensation techniques): GSM/CDMA, 3G, VoIP, landlines, etc.
Input
- Input format for processing: WAV or RAW (PCM unsigned 8 or 16 bits, IEEE float 32-bit, A-law or Mu-law, ADPCM), FLAC, OPUS; 8 kHz+ sampling (other audio formats automatically converted)
Output
- XML/JSON format with all results or results files with:
- Global score, i.e., a percentage expression of audio quality (range <0;100>); by default, the global score is calculated from the waveform_n_bits and waveform_snr variables
- Detailed outputs, i.e., clipped signal, amplitude, sample values, sampling frequency, SNR, technical signal, encoding, etc.
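The waveform_snr metric is a signal-to-noise ratio; SQE's exact computation is not documented here, but a naive SNR estimate from separate signal and noise samples illustrates the idea:

```python
import math

def snr_db(signal, noise) -> float:
    """Naive SNR estimate in dB from separate signal and noise samples.

    Illustrative only; the SQE engine's waveform_snr computation is
    not documented here and may differ.
    """
    p_signal = sum(s * s for s in signal) / len(signal)  # mean signal power
    p_noise = sum(n * n for n in noise) / len(noise)     # mean noise power
    return 10.0 * math.log10(p_signal / p_noise)
```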
Processing speed
Approx. 2,000x faster than real-time processing on 1 CPU core.
I.e., a standard 1 CPU core server processes 48,000 hours of audio in 1 day of computing time.