Isolated word recognition without API

Question

I need to build a small project that can recognize some words.
The problem is that I don't know how to format the data, or in what format to feed it into the input layer of my neural network.
I would appreciate your help.

Posted 4-Apr-14 9:05am

rolandvagyok

Comments

Sergey Alexandrovich Kryukov 4-Apr-14 16:04pm

Are you serious? What does "without API" mean? Doing word recognition from scratch? Then you are too good for us. :-)
—SA

1 solution

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Bill_Hallahan · Answer 1 · 2014-04-13T15:52:00

API means "Application Programming Interface". Whatever your solution, you should have an API.

I believe what you mean is that you do not want to use an off-the-shelf isolated word speech recognition API.

What you are asking about is a significant project that will take some time, and the result will likely work much worse than well-developed existing solutions.

At the highest level, you want to do:

1. Feature extraction
2. Feature post-processing
3. Pattern recognition

For speech recognition, there are usually two phases of pattern recognition: first extract phoneme symbols from the features extracted from the audio stream, and then extract words from the stream of phonemes. This is a non-trivial problem. The best existing solutions for speech recognition were invented by a man named Jim Baker, who wrote a classic paper on using Hidden Markov Models (HMMs) for speech recognition.
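To make the second phase concrete, here is a minimal sketch of turning a recognized phoneme sequence into a word. It is not an HMM and not what any real recognizer does; it simply picks the vocabulary word whose pronunciation has the smallest edit distance to the phonemes that came out of the first phase. The `LEXICON` table, its phoneme symbols, and the function names are all made up for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of phoneme symbols."""
    prev = list(range(len(b) + 1))
    for i, sym_a in enumerate(a, start=1):
        curr = [i]
        for j, sym_b in enumerate(b, start=1):
            cost = 0 if sym_a == sym_b else 1
            curr.append(min(prev[j] + 1,          # delete sym_a
                            curr[j - 1] + 1,      # insert sym_b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


# Hypothetical four-word vocabulary with illustrative phoneme spellings.
LEXICON = {
    "yes": ["Y", "EH", "S"],
    "no": ["N", "OW"],
    "go": ["G", "OW"],
    "stop": ["S", "T", "AA", "P"],
}


def closest_word(phonemes, lexicon=LEXICON):
    """Return the vocabulary word whose pronunciation is nearest to `phonemes`."""
    return min(lexicon, key=lambda word: edit_distance(phonemes, lexicon[word]))


if __name__ == "__main__":
    # A slightly wrong phoneme sequence, as a noisy first phase might produce.
    print(closest_word(["Y", "EH", "Z"]))
```

Note that "no" and "go" differ by a single phoneme in this table, which is exactly the kind of pair that a small vocabulary should avoid if the first phase is unreliable.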

I have read that neural nets have been used with some success, but I believe the best systems, such as Sphinx, the open source Carnegie Mellon recognizer, use HMMs.

A relatively simple solution that can recognize a limited set of sufficiently different words is to take FFTs of overlapping windowed segments of the audio. For example, sample at 16 kHz and compute 256-point Fast Fourier Transforms (FFTs) on windows that overlap by 128 samples. You can find open source FFT source code. Compute the power spectrum, and then train a pattern recognizer to convert frequency spectrum patterns/sequences to phonemes. Then use another pattern recognizer to convert phoneme sequences to words.
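As a rough sketch of that feature extraction step, the code below cuts 16 kHz audio into 256-sample frames that advance by 128 samples, and computes a power spectrum for each frame. Each frame's spectrum is one fixed-length vector of numbers, which is the kind of data you could feed to a pattern recognizer's input. The Hann window, the small pure-Python FFT (used here only to avoid dependencies; an existing FFT library would be much faster), and all names are my own choices for illustration.

```python
import cmath
import math

SAMPLE_RATE = 16000  # Hz
FRAME_SIZE = 256     # FFT length in samples
HOP_SIZE = 128       # step between frames, so adjacent frames overlap by 128 samples


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def power_spectrum_frames(samples):
    """Return one power spectrum (list of FRAME_SIZE // 2 + 1 floats) per frame."""
    # Hann window: an assumption, chosen to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / FRAME_SIZE)
              for i in range(FRAME_SIZE)]
    frames = []
    for start in range(0, len(samples) - FRAME_SIZE + 1, HOP_SIZE):
        frame = [samples[start + i] * window[i] for i in range(FRAME_SIZE)]
        spectrum = fft(frame)
        # Real input gives a symmetric spectrum, so keep bins 0..N/2 only.
        frames.append([abs(c) ** 2 for c in spectrum[:FRAME_SIZE // 2 + 1]])
    return frames


if __name__ == "__main__":
    # One second of a 1 kHz test tone stands in for recorded speech.
    tone = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE)
            for n in range(SAMPLE_RATE)]
    frames = power_spectrum_frames(tone)
    peak_bin = max(range(len(frames[0])), key=lambda k: frames[0][k])
    print(len(frames), "frames,", len(frames[0]), "bins each, peak near",
          peak_bin * SAMPLE_RATE / FRAME_SIZE, "Hz")
```

Each bin covers 16000 / 256 = 62.5 Hz. In practice you would probably take the logarithm of the power values, or group bins into a smaller number of bands, before training on them.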

Such a solution will work for a limited set of carefully chosen isolated words. Speech recognition is a very complex problem, and a more general solution requires more features, massive amounts of training data, and a lot of computing power to process the data. I have read that the most advanced solutions today use a cochlear (inner ear) model rather than an FFT, although FFT solutions can provide good results; I think the more advanced models are just better in noisy environments.

I realize that, if you don't know Digital Signal Processing, what I wrote above will only lead to more questions. A forum answer is too short to teach college courses in signal theory, filtering theory, and pattern recognition. You will have to experiment with these yourself.

A classmate and I made an isolated word recognizer project in college around 1980. We used hardware we built for feature extraction and a KIM-1 microcomputer to recognize specific words, and, not having much computing power, we used zero crossings for the feature extraction. It worked horribly, but it did work after a fashion. With an FFT and a lot of time spent tweaking, you should be able to get something useful today, but it will not compete with something like Sphinx, and even Sphinx, which is excellent, is not nearly as good as the commercial solutions I've seen.
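For comparison, here is roughly what zero-crossing feature extraction looks like in software. This is only a sketch of the general idea, not a reconstruction of what our hardware did; the frame and hop sizes are borrowed from the FFT suggestion above, and the function names are made up.

```python
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs in the frame whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def zero_crossing_features(samples, frame_size=256, hop_size=128):
    """One zero-crossing rate per overlapping frame of the signal."""
    return [
        zero_crossing_rate(samples[start:start + frame_size])
        for start in range(0, len(samples) - frame_size + 1, hop_size)
    ]


if __name__ == "__main__":
    rate = 16000
    # Low and high test tones stand in for voiced and hissy speech sounds.
    low = [math.sin(2 * math.pi * 200 * n / rate) for n in range(1024)]
    high = [math.sin(2 * math.pi * 3000 * n / rate) for n in range(1024)]
    print("low tone: ", zero_crossing_features(low)[0])
    print("high tone:", zero_crossing_features(high)[0])
```

The rate rises with the dominant frequency of the sound, so it can separate something like an "s" from a vowel, but it throws away almost everything else, which is a large part of why it worked so poorly.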