Training an acoustic model for a speech-to-text engine2019 Community Moderator ElectionNetwork design for extracting MFCC features from audio samplespython - What is the format of the WAV file for a Text to Speech Neural Network?Input and output feature shapes in CNN for speech recognitionHow to resume training of a model?Text understanding and mappingIs machine learning a viable tool to map accent from speech onto text/syllables?Representing output label for character level speec recognition using RNNApproximate Text compression by training model?How to efficiently store large amounts of speech to text dataWhat to use in setting up a Speech to Text engine in production?
Can one be a co-translator of a book, if he does not know the language that the book is translated into?
Can a virus destroy the BIOS of a modern computer?
Can a rocket refuel on Mars from water?
How to show the equivalence between the regularized regression and their constraint formulas using KKT
Why can't we play rap on piano?
How can I make my BBEG immortal short of making them a Lich or Vampire?
Is the Joker left-handed?
How can I prevent hyper evolved versions of regular creatures from wiping out their cousins?
1960's book about a plague that kills all white people
Has there ever been an airliner design involving reducing generator load by installing solar panels?
I'm flying to France today and my passport expires in less than 2 months
Do I have a twin with permutated remainders?
Why "Having chlorophyll without photosynthesis is actually very dangerous" and "like living with a bomb"?
Today is the Center
90's TV series where a boy goes to another dimension through portal near power lines
Why is consensus so controversial in Britain?
Why is the 'in' operator throwing an error with a string literal instead of logging false?
Brothers & sisters
Is it possible to download Internet Explorer on my Mac running OS X El Capitan?
How could indestructible materials be used in power generation?
I Accidentally Deleted a Stock Terminal Theme
Did converts (ger tzedek) in ancient Israel own land?
What to put in ESTA if staying in US for a few days before going on to Canada
Is "remove commented out code" correct English?
Training an acoustic model for a speech-to-text engine
2019 Community Moderator ElectionNetwork design for extracting MFCC features from audio samplespython - What is the format of the WAV file for a Text to Speech Neural Network?Input and output feature shapes in CNN for speech recognitionHow to resume training of a model?Text understanding and mappingIs machine learning a viable tool to map accent from speech onto text/syllables?Representing output label for character level speec recognition using RNNApproximate Text compression by training model?How to efficiently store large amounts of speech to text dataWhat to use in setting up a Speech to Text engine in production?
$begingroup$
What are the steps for training an acoustic model? The format of the data (the audio) includes its length and other characteristics. If anyone could provide a simple example of how to train an acoustic model, it would be greatly appreciated.
neural-network nlp speech-to-text
$endgroup$
add a comment |
$begingroup$
What are the steps for training an acoustic model? The format of the data (the audio) includes its length and other characteristics. If anyone could provide a simple example of how to train an acoustic model, it would be greatly appreciated.
neural-network nlp speech-to-text
$endgroup$
add a comment |
$begingroup$
What are the steps for training an acoustic model? The format of the data (the audio) includes its length and other characteristics. If anyone could provide a simple example of how to train an acoustic model, it would be greatly appreciated.
neural-network nlp speech-to-text
$endgroup$
What are the steps for training an acoustic model? The format of the data (the audio) includes its length and other characteristics. If anyone could provide a simple example of how to train an acoustic model, it would be greatly appreciated.
neural-network nlp speech-to-text
neural-network nlp speech-to-text
edited 4 hours ago
Stephen Rauch
1,52551330
1,52551330
asked 13 hours ago
BlenzusBlenzus
14610
14610
add a comment |
add a comment |
2 Answers
2
active
oldest
votes
$begingroup$
This paper has details on how to prepare Audio data (and merge it with Language Models) for speech-to-text :

http://slazebni.cs.illinois.edu/spring17/lec26_audio.pdf
Slide 16 has very high level description of feature engineering.

Following papers provide more detailed analysis of audio processing for speech to text :
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.701.6802&rep=rep1&type=pdf
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.689.4627&rep=rep1&type=pdf
https://www.ijariit.com/manuscripts/v4i3/V4I3-1646.pdf
$endgroup$
add a comment |
$begingroup$
There are many different ways to do this. It depends a lot on what you want to do. A general dictation algorithm which works for most people who speak the same language? One that works or optimises itself for each person? Just a few keywords to input into a device? And so on.
I'd personally just download some of the open source versions on the internet (https://github.com/topics/speech-to-text), try to understand how they work (literature like the one above can help), and then write something that fits my requirements. That will also help avoid unnecessary complications when trying to interpret sounds. Translating sounds into phonemes, for instance, is a step which for most applications will only add complexity and, with the differences in how people pronounce things, increase errors. A database/table with the sounds of complete syllables or words is more likely to lead to results in many use cases. The database should be optimised each time a user corrects something the computer wrote - by adding the according sounds-word combination, by increasing the priority of the solution which turned out to be correct if there where other possible choices, or by using surrounding words as indicators of the correct choice (dynamic word-pairing).
Be aware that even if your code identifies sounds better than any human, it will still make a lot of mistakes with similar sounding words, or by missing where one word ends and the other begins. The usual way around is word pairs - if you have different possibilities, you choose the ones where words often come together. That may even produce text better than most humans - but still with very funny misinterpretations. If you want to avoid that, I'm afraid you'd have to model a far more complex ai which does not yet exist except maybe in a few labs - making sense of what someone says so it can have informed guesses which of many similar sounding options were actually meant. Not to mention catching any errors by the speaker.
If you only need a very limited vocabulary, you could also let people train the device (which would then work for any language) and maybe learn from experience - like when a word is pronounced differently sometimes. This can be done as easily as just recording a sound for each word. Adding similar sounds to each word when the user doesn't quickly change his mind (an indication that the computer misunderstood) or different sounds when a correction occurs is also an option, and helps improve recognition. The sounds and words could also be averaged out or parsed for differentiating parts, for better hit rates. Such measures would also work for above database.
New contributor
Carl Dombrowski is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
$endgroup$
add a comment |
Your Answer
StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
);
);
, "mathjax-editing");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "557"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f48589%2ftraining-an-acoustic-model-for-a-speech-to-text-engine%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
This paper has details on how to prepare Audio data (and merge it with Language Models) for speech-to-text :

http://slazebni.cs.illinois.edu/spring17/lec26_audio.pdf
Slide 16 has very high level description of feature engineering.

Following papers provide more detailed analysis of audio processing for speech to text :
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.701.6802&rep=rep1&type=pdf
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.689.4627&rep=rep1&type=pdf
https://www.ijariit.com/manuscripts/v4i3/V4I3-1646.pdf
$endgroup$
add a comment |
$begingroup$
This paper has details on how to prepare Audio data (and merge it with Language Models) for speech-to-text :

http://slazebni.cs.illinois.edu/spring17/lec26_audio.pdf
Slide 16 has very high level description of feature engineering.

Following papers provide more detailed analysis of audio processing for speech to text :
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.701.6802&rep=rep1&type=pdf
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.689.4627&rep=rep1&type=pdf
https://www.ijariit.com/manuscripts/v4i3/V4I3-1646.pdf
$endgroup$
add a comment |
$begingroup$
This paper has details on how to prepare Audio data (and merge it with Language Models) for speech-to-text :

http://slazebni.cs.illinois.edu/spring17/lec26_audio.pdf
Slide 16 has very high level description of feature engineering.

Following papers provide more detailed analysis of audio processing for speech to text :
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.701.6802&rep=rep1&type=pdf
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.689.4627&rep=rep1&type=pdf
https://www.ijariit.com/manuscripts/v4i3/V4I3-1646.pdf
$endgroup$
This paper has details on how to prepare Audio data (and merge it with Language Models) for speech-to-text :

http://slazebni.cs.illinois.edu/spring17/lec26_audio.pdf
Slide 16 has very high level description of feature engineering.

Following papers provide more detailed analysis of audio processing for speech to text :
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.701.6802&rep=rep1&type=pdf
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.689.4627&rep=rep1&type=pdf
https://www.ijariit.com/manuscripts/v4i3/V4I3-1646.pdf
answered 12 hours ago
Shamit VermaShamit Verma
1,3791214
1,3791214
add a comment |
add a comment |
$begingroup$
There are many different ways to do this. It depends a lot on what you want to do. A general dictation algorithm which works for most people who speak the same language? One that works or optimises itself for each person? Just a few keywords to input into a device? And so on.
I'd personally just download some of the open source versions on the internet (https://github.com/topics/speech-to-text), try to understand how they work (literature like the one above can help), and then write something that fits my requirements. That will also help avoid unnecessary complications when trying to interpret sounds. Translating sounds into phonemes, for instance, is a step which for most applications will only add complexity and, with the differences in how people pronounce things, increase errors. A database/table with the sounds of complete syllables or words is more likely to lead to results in many use cases. The database should be optimised each time a user corrects something the computer wrote - by adding the according sounds-word combination, by increasing the priority of the solution which turned out to be correct if there where other possible choices, or by using surrounding words as indicators of the correct choice (dynamic word-pairing).
Be aware that even if your code identifies sounds better than any human, it will still make a lot of mistakes with similar sounding words, or by missing where one word ends and the other begins. The usual way around is word pairs - if you have different possibilities, you choose the ones where words often come together. That may even produce text better than most humans - but still with very funny misinterpretations. If you want to avoid that, I'm afraid you'd have to model a far more complex ai which does not yet exist except maybe in a few labs - making sense of what someone says so it can have informed guesses which of many similar sounding options were actually meant. Not to mention catching any errors by the speaker.
If you only need a very limited vocabulary, you could also let people train the device (which would then work for any language) and maybe learn from experience - like when a word is pronounced differently sometimes. This can be done as easily as just recording a sound for each word. Adding similar sounds to each word when the user doesn't quickly change his mind (an indication that the computer misunderstood) or different sounds when a correction occurs is also an option, and helps improve recognition. The sounds and words could also be averaged out or parsed for differentiating parts, for better hit rates. Such measures would also work for above database.
New contributor
Carl Dombrowski is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
$endgroup$
add a comment |
$begingroup$
There are many different ways to do this. It depends a lot on what you want to do. A general dictation algorithm which works for most people who speak the same language? One that works or optimises itself for each person? Just a few keywords to input into a device? And so on.
I'd personally just download some of the open source versions on the internet (https://github.com/topics/speech-to-text), try to understand how they work (literature like the one above can help), and then write something that fits my requirements. That will also help avoid unnecessary complications when trying to interpret sounds. Translating sounds into phonemes, for instance, is a step which for most applications will only add complexity and, with the differences in how people pronounce things, increase errors. A database/table with the sounds of complete syllables or words is more likely to lead to results in many use cases. The database should be optimised each time a user corrects something the computer wrote - by adding the according sounds-word combination, by increasing the priority of the solution which turned out to be correct if there where other possible choices, or by using surrounding words as indicators of the correct choice (dynamic word-pairing).
Be aware that even if your code identifies sounds better than any human, it will still make a lot of mistakes with similar sounding words, or by missing where one word ends and the other begins. The usual way around is word pairs - if you have different possibilities, you choose the ones where words often come together. That may even produce text better than most humans - but still with very funny misinterpretations. If you want to avoid that, I'm afraid you'd have to model a far more complex ai which does not yet exist except maybe in a few labs - making sense of what someone says so it can have informed guesses which of many similar sounding options were actually meant. Not to mention catching any errors by the speaker.
If you only need a very limited vocabulary, you could also let people train the device (which would then work for any language) and maybe learn from experience - like when a word is pronounced differently sometimes. This can be done as easily as just recording a sound for each word. Adding similar sounds to each word when the user doesn't quickly change his mind (an indication that the computer misunderstood) or different sounds when a correction occurs is also an option, and helps improve recognition. The sounds and words could also be averaged out or parsed for differentiating parts, for better hit rates. Such measures would also work for above database.
New contributor
Carl Dombrowski is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
$endgroup$
add a comment |
$begingroup$
There are many different ways to do this. It depends a lot on what you want to do. A general dictation algorithm which works for most people who speak the same language? One that works or optimises itself for each person? Just a few keywords to input into a device? And so on.
I'd personally just download some of the open source versions on the internet (https://github.com/topics/speech-to-text), try to understand how they work (literature like the one above can help), and then write something that fits my requirements. That will also help avoid unnecessary complications when trying to interpret sounds. Translating sounds into phonemes, for instance, is a step which for most applications will only add complexity and, with the differences in how people pronounce things, increase errors. A database/table with the sounds of complete syllables or words is more likely to lead to results in many use cases. The database should be optimised each time a user corrects something the computer wrote - by adding the according sounds-word combination, by increasing the priority of the solution which turned out to be correct if there where other possible choices, or by using surrounding words as indicators of the correct choice (dynamic word-pairing).
Be aware that even if your code identifies sounds better than any human, it will still make a lot of mistakes with similar sounding words, or by missing where one word ends and the other begins. The usual way around is word pairs - if you have different possibilities, you choose the ones where words often come together. That may even produce text better than most humans - but still with very funny misinterpretations. If you want to avoid that, I'm afraid you'd have to model a far more complex ai which does not yet exist except maybe in a few labs - making sense of what someone says so it can have informed guesses which of many similar sounding options were actually meant. Not to mention catching any errors by the speaker.
If you only need a very limited vocabulary, you could also let people train the device (which would then work for any language) and maybe learn from experience - like when a word is pronounced differently sometimes. This can be done as easily as just recording a sound for each word. Adding similar sounds to each word when the user doesn't quickly change his mind (an indication that the computer misunderstood) or different sounds when a correction occurs is also an option, and helps improve recognition. The sounds and words could also be averaged out or parsed for differentiating parts, for better hit rates. Such measures would also work for above database.
New contributor
Carl Dombrowski is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
$endgroup$
There are many different ways to do this. It depends a lot on what you want to do. A general dictation algorithm which works for most people who speak the same language? One that works or optimises itself for each person? Just a few keywords to input into a device? And so on.
I'd personally just download some of the open source versions on the internet (https://github.com/topics/speech-to-text), try to understand how they work (literature like the one above can help), and then write something that fits my requirements. That will also help avoid unnecessary complications when trying to interpret sounds. Translating sounds into phonemes, for instance, is a step which for most applications will only add complexity and, with the differences in how people pronounce things, increase errors. A database/table with the sounds of complete syllables or words is more likely to lead to results in many use cases. The database should be optimised each time a user corrects something the computer wrote - by adding the according sounds-word combination, by increasing the priority of the solution which turned out to be correct if there where other possible choices, or by using surrounding words as indicators of the correct choice (dynamic word-pairing).
Be aware that even if your code identifies sounds better than any human, it will still make a lot of mistakes with similar sounding words, or by missing where one word ends and the other begins. The usual way around is word pairs - if you have different possibilities, you choose the ones where words often come together. That may even produce text better than most humans - but still with very funny misinterpretations. If you want to avoid that, I'm afraid you'd have to model a far more complex ai which does not yet exist except maybe in a few labs - making sense of what someone says so it can have informed guesses which of many similar sounding options were actually meant. Not to mention catching any errors by the speaker.
If you only need a very limited vocabulary, you could also let people train the device (which would then work for any language) and maybe learn from experience - like when a word is pronounced differently sometimes. This can be done as easily as just recording a sound for each word. Adding similar sounds to each word when the user doesn't quickly change his mind (an indication that the computer misunderstood) or different sounds when a correction occurs is also an option, and helps improve recognition. The sounds and words could also be averaged out or parsed for differentiating parts, for better hit rates. Such measures would also work for above database.
New contributor
Carl Dombrowski is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
New contributor
Carl Dombrowski is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
answered 3 hours ago
Carl DombrowskiCarl Dombrowski
1
1
New contributor
Carl Dombrowski is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
New contributor
Carl Dombrowski is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
Carl Dombrowski is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
add a comment |
add a comment |
Thanks for contributing an answer to Data Science Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f48589%2ftraining-an-acoustic-model-for-a-speech-to-text-engine%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown