API means "Application Programming Interface". Whatever your solution, you should have an API.
I believe what you mean is that you do not want to use an off-the-shelf isolated word speech recognition API.
What you are asking for is a significant project that will take some time, and the result will likely work much worse than well-developed existing solutions.
At the highest level, you want to do:
1. Feature extraction
2. Feature post-processing
3. Pattern recognition
For speech recognition, there are usually two phases of pattern recognition: first extract phoneme symbols from the features extracted from the audio stream, then extract words from the stream of phonemes. This is a non-trivial problem. Much of the best existing work on speech recognition traces back to Jim Baker, who wrote a classic paper on using Hidden Markov Models (HMMs) for speech recognition.
I have read that neural nets have been used with some success, but I believe the best systems, such as Sphinx, the open source Carnegie Mellon recognizer, use HMMs.
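To make the second phase (phonemes to words) a little more concrete, here is a minimal sketch. It is not an HMM and not how Sphinx does it. It swaps in a much simpler technique: compare the recognized phoneme sequence against a tiny lexicon using edit distance and pick the closest word. The lexicon, its ARPAbet-style phoneme symbols, and the function names are all made up for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of phoneme symbols."""
    prev = list(range(len(b) + 1))
    for i, sa in enumerate(a, 1):
        cur = [i]
        for j, sb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # drop a symbol from a
                           cur[j - 1] + 1,              # insert a symbol from b
                           prev[j - 1] + (sa != sb)))   # match or substitute
        prev = cur
    return prev[-1]


# Hypothetical three-word vocabulary, for illustration only.
LEXICON = {
    "yes": ["Y", "EH", "S"],
    "no": ["N", "OW"],
    "stop": ["S", "T", "AA", "P"],
}


def closest_word(phonemes, lexicon=LEXICON):
    """Return the lexicon word whose phoneme sequence is nearest to the input."""
    return min(lexicon, key=lambda word: edit_distance(phonemes, lexicon[word]))


# A first-stage recognizer that mishears one vowel can still land on the right word.
print(closest_word(["S", "T", "AO", "P"]))
```

This only has a chance of working when the words are few and far apart in phoneme space, which is exactly the "carefully chosen isolated words" case.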
A relatively simple solution that can recognize a limited set of sufficiently different words is to take FFTs of overlapping windowed segments of the audio. For one solution, sample at 16 kHz and compute 256-point Fast Fourier Transforms (FFTs) on windows that overlap by 128 samples. You can find open source FFT source code. Compute the power spectrum, and then train a pattern recognizer to convert frequency spectrum patterns/sequences to phonemes. Then use another pattern recognizer to convert phoneme sequences to words.
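Here is a rough sketch of the feature extraction step just described, in plain Python with no libraries. The 16 kHz rate, the 256-point FFT, and the 128-sample overlap come from the numbers above. The Hann window and the small recursive FFT are my own choices for the sketch; in practice you would use an existing, optimized FFT library.

```python
import cmath
import math

SAMPLE_RATE = 16000  # Hz
FRAME_SIZE = 256     # samples per FFT
HOP = 128            # consecutive frames overlap by 128 samples


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def power_spectra(samples):
    """Return one power spectrum (FRAME_SIZE // 2 + 1 bins) per overlapping frame."""
    # Hann window (an assumption of this sketch) to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / (FRAME_SIZE - 1))
              for i in range(FRAME_SIZE)]
    spectra = []
    for start in range(0, len(samples) - FRAME_SIZE + 1, HOP):
        frame = [samples[start + i] * window[i] for i in range(FRAME_SIZE)]
        spectrum = fft(frame)
        # The input is real, so only the first half of the bins is unique.
        spectra.append([abs(c) ** 2 for c in spectrum[:FRAME_SIZE // 2 + 1]])
    return spectra


# Usage: a 1 kHz test tone should peak at bin 1000 / (16000 / 256) = 16.
tone = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE) for n in range(1024)]
spectra = power_spectra(tone)
peak_bin = max(range(len(spectra[0])), key=lambda k: spectra[0][k])
```

Each bin is 16000 / 256 = 62.5 Hz wide, and the resulting sequence of spectra is what you would feed to the first pattern recognizer.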
Such a solution will work for a limited set of carefully chosen isolated words. Speech recognition is a very complex problem, and a more general solution requires more features, massive training data, and a lot of computing power to process the data. I have read that the most advanced solutions today use a cochlear (inner ear) model rather than an FFT, although FFT solutions can provide good results. I think the more advanced models are just better in noisy environments.
I realize that, if you don't know Digital Signal Processing, what I wrote above will only lead to more questions. This forum is too short to teach college courses in signal theory, filtering theory, and pattern recognition. You will have to play with these yourself.
A classmate and I built an isolated word recognizer as a college project around 1980. We used hardware we built for feature extraction and a KIM-1 microprocessor to recognize specific words, and since we did not have much computer power, we used zero crossings for the feature extraction. It worked horribly, but it did work after a fashion. With an FFT and a lot of time spent tweaking, you should be able to get something useful today, but it will not compete with something like Sphinx, and even Sphinx, which is excellent, is not nearly as good as the commercial solutions I've seen.
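For comparison, zero-crossing features are about as cheap as feature extraction gets. The sketch below is a software illustration only, not a reconstruction of our hardware: it counts sign changes in each frame, and the 256-sample frames with a 128-sample hop are simply borrowed from the FFT suggestion above. The tone helper and its small phase offset exist only to make a clean test signal.

```python
import math


def zero_crossings(frame):
    """Count sign changes between consecutive samples in one frame."""
    count = 0
    for a, b in zip(frame, frame[1:]):
        if (a >= 0) != (b >= 0):
            count += 1
    return count


def zero_crossing_features(samples, frame_size=256, hop=128):
    """Return one zero-crossing count per overlapping frame."""
    return [zero_crossings(samples[start:start + frame_size])
            for start in range(0, len(samples) - frame_size + 1, hop)]


def tone(freq_hz, n_samples, sample_rate=16000, phase=0.1):
    """Test signal; the phase offset keeps samples away from exact zeros."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate + phase)
            for n in range(n_samples)]


# A higher-pitched sound crosses zero more often within the same frame.
low = zero_crossings(tone(250, 256))
high = zero_crossings(tone(1000, 256))
print(low, high)
```

The count roughly tracks the dominant frequency in the frame, which is why it can separate a few very different words and not much more.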