I need to build a small project that can recognize some words.
The problem is that I don't know how to format the data, or in what format to feed it to the input layer of my neural network.
I need your help.
Comments
Sergey Alexandrovich Kryukov 4-Apr-14 16:04pm    
Are you serious? What does it mean, "without API"? Doing word recognition from scratch? Then you are too good for us. :-)
—SA

1 solution

API means "Application Programming Interface". Whatever your solution, you should have an API.

I believe what you mean is that you do not want to use an off-the-shelf isolated word speech recognition API.

What you are asking for is a significant project that will take some time, and the result will likely work much worse than well-developed existing solutions.

At the highest level, you want to do:

1. Feature extraction
2. Feature post-processing
3. Pattern recognition

For speech recognition, pattern recognition usually has two phases: first extract phoneme symbols from the features extracted from the audio stream, then extract words from the stream of phonemes. This is a non-trivial problem. The best existing solutions for speech recognition were invented by a man named Jim Baker, who wrote a classic paper on using Hidden Markov Models (HMMs) for speech recognition.
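To make the second phase concrete, here is a minimal sketch in Python of turning a stream of per-frame phoneme labels into a word. It is only an illustration under stated assumptions: the `LEXICON`, its phoneme spellings, and the function names are hypothetical, and plain edit distance is used as a simple stand-in for a real decoder such as an HMM.

```python
# Hypothetical three-word lexicon; the phoneme spellings are illustrative only.
LEXICON = {
    "yes": ["y", "eh", "s"],
    "no": ["n", "ow"],
    "stop": ["s", "t", "aa", "p"],
}


def collapse_repeats(frame_labels):
    """Merge runs of identical per-frame labels into a single phoneme each."""
    phonemes = []
    for label in frame_labels:
        if not phonemes or phonemes[-1] != label:
            phonemes.append(label)
    return phonemes


def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def recognize_word(frame_labels):
    """Return the lexicon word whose phonemes are closest to the input."""
    phonemes = collapse_repeats(frame_labels)
    return min(LEXICON, key=lambda word: edit_distance(phonemes, LEXICON[word]))


# Usage: the first phase is assumed to emit one phoneme label per audio frame.
word = recognize_word(["y", "y", "eh", "eh", "eh", "s"])
```

This only works when the first phase produces reasonably clean labels. An HMM-based recognizer handles uncertainty in both phases together, which is one reason it does much better.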

I have read that neural nets have been used with some success, but I believe the best systems, such as Sphinx, the open source Carnegie Mellon recognizer, use HMMs.

A relatively simple solution that can recognize a limited set of sufficiently different words is to take FFTs of overlapping windowed segments of the audio. For example, sample at 16 kHz and compute 256-point Fast Fourier Transforms (FFTs) on windows that overlap by 128 samples. You can find open source FFT source code. Compute the power spectrum of each window, and then train a pattern recognizer to convert sequences of frequency spectrum patterns to phonemes. Then use another pattern recognizer to convert phoneme sequences to words.
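The feature extraction step can be sketched roughly as follows in plain Python. The 16 kHz rate, 256-point FFT and 128-sample overlap come from the paragraph above; the Hann window, the function names and the small recursive FFT are assumptions made for illustration, and a real project would more likely use an optimized FFT library.

```python
import cmath
import math

SAMPLE_RATE = 16_000  # Hz
FRAME_SIZE = 256      # samples per FFT
HOP_SIZE = 128        # step between frames, so consecutive frames overlap by 128


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def power_spectrum_frames(samples):
    """Return one power spectrum (129 bins) per overlapping 256-sample frame."""
    # Hann window: an assumption, the text does not name a window function.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / (FRAME_SIZE - 1))
              for i in range(FRAME_SIZE)]
    frames = []
    for start in range(0, len(samples) - FRAME_SIZE + 1, HOP_SIZE):
        chunk = [samples[start + i] * window[i] for i in range(FRAME_SIZE)]
        spectrum = fft(chunk)
        # Real input gives a symmetric spectrum, so keep bins 0..N/2 only.
        frames.append([abs(c) ** 2 for c in spectrum[:FRAME_SIZE // 2 + 1]])
    return frames


# Usage: 0.1 s of a 1000 Hz test tone.
tone = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE) for n in range(1600)]
frames = power_spectrum_frames(tone)
peak_bin = max(range(len(frames[0])), key=lambda k: frames[0][k])
```

Each bin is 16000 / 256 = 62.5 Hz wide, so the 1000 Hz tone peaks in bin 16. Each frame is a fixed-length vector of 129 numbers, which is the kind of format a pattern recognizer (including a neural network input layer) can accept, one frame or a short run of frames at a time.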

Such a solution will work for a limited set of carefully chosen isolated words. Speech recognition is a very complex problem, and a more general solution requires more features, massive training data, and a lot of computing power to process the data. I have read that the most advanced solutions today use a cochlear (ear) model rather than an FFT, although FFT solutions can provide good results; I think the more advanced models are just better in noisy environments.

I realize that, if you don't know Digital Signal Processing, what I wrote above will only lead to more questions. A forum answer is too short to teach college courses in signal theory, filtering theory, and pattern recognition. You will have to experiment with these yourself.

A classmate and I made an isolated word recognizer project in college around 1980. We used hardware we built for feature extraction and a KIM-1 microcomputer to recognize specific words, and since we did not have much computing power, we used zero crossings for the feature extraction. It worked horribly, but it did work after a fashion. With an FFT and a lot of time spent tweaking, you should be able to get something useful today, but it will not compete with something like Sphinx, and even Sphinx, which is excellent, is not nearly as good as the commercial solutions I've seen.
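For comparison, zero-crossing features are very cheap to compute. The sketch below is a software illustration in Python, not a reconstruction of the 1980 hardware; the function names, the 256-sample frame size and the test tones are assumptions.

```python
import math


def zero_crossings(samples):
    """Count sign changes between consecutive samples."""
    count = 0
    for prev, cur in zip(samples, samples[1:]):
        if (prev < 0) != (cur < 0):
            count += 1
    return count


def zero_crossing_features(samples, frame_size=256):
    """Return one zero-crossing count per non-overlapping frame."""
    return [zero_crossings(samples[start:start + frame_size])
            for start in range(0, len(samples) - frame_size + 1, frame_size)]


def make_tone(freq_hz, n_samples, sample_rate=16_000):
    """Sine test tone with a small phase offset to avoid exact-zero samples."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate + 0.1)
            for n in range(n_samples)]


# Usage: a higher-pitched sound crosses zero more often per frame.
low = zero_crossing_features(make_tone(500, 512))
high = zero_crossing_features(make_tone(2000, 512))
```

The count tracks the dominant frequency only crudely, which fits the experience described above: it can separate a few very different sounds but carries far less information than a power spectrum.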
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


