If this means what I think it means, then don't bother.
Speech recognition is difficult enough on its own, and here you want to do it while the singer is putting their voice all through the spectrum, with background noise on top, etc.
Although you only want the word count and not the actual words, in songs two or more words often blend together seamlessly, so if you want it to be accurate it would need some understanding of language.
I'm no pro on audio analysis, but what I would do is start by removing all the frequencies outside of the vocal range, perhaps with an FFT, then try to apply a speech recognition system over that.
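To make the first step a bit more concrete, here is a rough sketch of how the filtering could look in Python with NumPy. The function name `bandpass_fft` and the 80 Hz to 4000 Hz cutoffs are my own assumptions, not established values, so tune them for the singer in question. It only zeroes the FFT bins outside the band, so instruments that overlap the vocal range will still be there, and the speech recognition step is not covered at all.

```python
import numpy as np


def bandpass_fft(samples, sample_rate, low_hz=80.0, high_hz=4000.0):
    """Zero every FFT bin outside [low_hz, high_hz] and transform back.

    low_hz and high_hz are assumed cutoffs for a "vocal" band; adjust as needed.
    """
    samples = np.asarray(samples, dtype=float)
    # Real-input FFT: one complex value per non-negative frequency bin.
    spectrum = np.fft.rfft(samples)
    # Centre frequency (in Hz) of each bin returned by rfft.
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    # Keep only the bins inside the chosen band, zero the rest.
    keep = (freqs >= low_hz) & (freqs <= high_hz)
    spectrum[~keep] = 0
    # Back to the time domain, same length as the input.
    return np.fft.irfft(spectrum, n=len(samples))


if __name__ == "__main__":
    sample_rate = 16000
    t = np.arange(sample_rate) / sample_rate  # one second of audio
    voice = np.sin(2 * np.pi * 440 * t)       # stand-in for the vocal
    hum = np.sin(2 * np.pi * 50 * t)          # low mains-style hum
    hiss = np.sin(2 * np.pi * 6000 * t)       # high-frequency noise
    cleaned = bandpass_fft(voice + hum + hiss, sample_rate)
    print("max deviation from the 440 Hz tone:", np.max(np.abs(cleaned - voice)))
```

Cutting bins off hard like this can cause some ringing on real recordings, so a proper filter design would probably sound better, but it should be enough to see whether the idea helps before feeding the result to a recognizer.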