I'm trying to write a program that can take two audio files (e.g. WAV) as inputs, compare them, and spit out a number that tells you how similar the audio files are.

If someone has done something like this, knows how to go about doing it, or just has some ideas, please let me know. Anything will be greatly appreciated.

Specific questions: What language is suitable? How hard is it to do (how many hours, roughly)? Where can I find a good source of audio libraries/tools?

Thanks!
Comments
Sergey Alexandrovich Kryukov 21-Jan-11 14:56pm    
Excellent question, would make a great project, very complex.
None of the answers approach the goal -- no wonder.
--SA

Since you have listed your platform as iPhone, you have the exciting choice of C, C++ and Objective C programming languages, which are mostly the same thing and can be mixed and matched pretty easily.

This is good, because most open source projects are written in C/C++.

The basic method would be what you'd expect:
Decode music data (WAV is pretty much raw data enclosed in headers)
Analyse the raw data.

I don't know of any decoders for the iPhone, but I have read that LAME[^] works, though it is not optimised for the platform.
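For the WAV side specifically, there is not much to decode: a canonical PCM file is a 44-byte header followed by raw samples. A minimal sketch (my own illustration, not from a library; it assumes plain 16-bit PCM with the standard header layout, and real files can carry extra chunks, so a robust reader should walk the RIFF chunk list):

#include <stdint.h>
#include <cstdio>
#include <vector>

// Minimal reader for a canonical 16-bit PCM WAV file (44-byte header).
// Real files may carry extra chunks ("LIST", "fact", ...), so a robust
// parser should walk the RIFF chunk list instead of assuming fixed offsets.
std::vector<int16_t> ReadWavSamples(const char *path, uint32_t &sampleRate)
{
    std::vector<int16_t> samples;
    FILE *f = std::fopen(path, "rb");
    if (!f) return samples;

    uint8_t header[44];
    if (std::fread(header, 1, 44, f) == 44) {
        // The sample rate lives at byte offset 24, little-endian.
        sampleRate = (uint32_t)header[24] | ((uint32_t)header[25] << 8) |
                     ((uint32_t)header[26] << 16) | ((uint32_t)header[27] << 24);
        int16_t buf[4096];
        size_t n;
        while ((n = std::fread(buf, sizeof(int16_t), 4096, f)) > 0)
            samples.insert(samples.end(), buf, buf + n);
    }
    std::fclose(f);
    return samples;
}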

As for the analysis, it depends on how dynamic you want it to be.
A binary compare would be a bad choice: if you compare two copies of the same song where one starts just 0.1s later than the other, they will hardly match at all.
A better method would be something like beat mapping (at least as a first pass) to find interesting points in the song and try to match them together.
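To make the offset problem concrete: a standard first step (a common signal-processing technique, not something specific to beat mapping) is to cross-correlate the two signals and align them at the best lag before any sample-wise comparison. A naive sketch:

#include <cstddef>
#include <vector>

// Find the lag (in samples) at which signal b best lines up with signal a.
// Naive O(n * maxLag); for long signals you would do this with an FFT-based
// cross-correlation instead.
long BestLag(const std::vector<float> &a, const std::vector<float> &b, long maxLag)
{
    long bestLag = 0;
    double bestScore = -1e300;
    for (long lag = -maxLag; lag <= maxLag; ++lag) {
        double score = 0.0;
        for (size_t i = 0; i < a.size(); ++i) {
            long j = (long)i + lag;
            if (j >= 0 && j < (long)b.size())
                score += (double)a[i] * (double)b[j];
        }
        if (score > bestScore) { bestScore = score; bestLag = lag; }
    }
    return bestLag; // then compare a[i] against b[i + bestLag]
}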

EDIT: I only provided this answer because no one else had yet. I'm no professional in the field and there may be better ways of going about this.
 
Comments
Sergey Alexandrovich Kryukov 21-Jan-11 15:15pm    
All right, my 5. As I say, it is impossible to give a real answer.
Please see what I think in my answer.
(Do you want to change your field of activity or your major at your university :-) ?)
Andrew Brock 21-Jan-11 22:18pm    
Not really. I'm doing Games Technology. It's pretty awesome, and we get all the interesting projects :D
Sergey Alexandrovich Kryukov 21-Jan-11 23:48pm    
All right. Good luck.
--SA
It depends on what you mean by "similar".
A binary compare (even on just the raw data) doesn't work, because the same music played in different performances is not necessarily made of identical "waves".

A method that can give you some success is to try to guess suitable periods (a low-pass filter can help you identify a "beat"); then, for each period, estimate the spectrum, compute the correlation function between homologous spectra, and measure how much it differs from an "impulse" function.

That's not "easy math" (Google may help you, but a lot of theory is needed).
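Just to make the low-pass-filter part concrete, here is about the crudest possible sketch (my own illustration): a moving-average filter over the energy envelope with naive peak picking. The window length and the 2.0 threshold are arbitrary starting points, not tuned values:

#include <cstddef>
#include <vector>

// Crude "beat" detection: low-pass the energy envelope with a moving average
// and flag samples whose instantaneous energy stands well above it. Real
// detectors merge adjacent hits and enforce a refractory period; this does not.
std::vector<size_t> CrudeBeats(const std::vector<float> &samples,
                               size_t window) // e.g. ~50 ms worth of samples
{
    std::vector<size_t> beats;
    double sum = 0.0;
    for (size_t i = 0; i < samples.size(); ++i) {
        double energy = (double)samples[i] * (double)samples[i];
        sum += energy;
        if (i >= window)
            sum -= (double)samples[i - window] * (double)samples[i - window];
        double average = sum / (double)(i < window ? i + 1 : window);
        if (i >= window && energy > 2.0 * average) // 2.0: arbitrary threshold
            beats.push_back(i);
    }
    return beats;
}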

Another empirical method (based on the fact that the "zip" algorithm works by finding repeating patterns) can be the following:
take "a.wav" and zip it (giving a.zip)
take "b.wav" and zip it (giving b.zip)

Now zip a.zip and b.zip together.
If the length of the resulting zip is similar to the sum of the two included zips, nothing can be said.
If the length of the resulting zip is shorter (and similar to the longer of the two component zips), then the two wav files are probably similar.

(The trick is that, if they are similar, they will have similar compression patterns that, when compressed together, will merge; otherwise the patterns will remain distinct, placed one after the other.)
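This is essentially the "normalized compression distance" known from the literature. A minimal sketch with zlib, compressing raw byte buffers instead of producing actual .zip files (same idea, different constants; the function names are mine):

#include <vector>
#include <zlib.h>

// Compressed size of a byte buffer (zlib default level). Assumes non-empty input.
static uLongf CompressedSize(const std::vector<unsigned char> &data)
{
    uLongf destLen = compressBound((uLong)data.size());
    std::vector<unsigned char> dest(destLen);
    compress(&dest[0], &destLen, &data[0], (uLong)data.size());
    return destLen;
}

// Normalized compression distance: close to 0 for near-identical inputs,
// close to 1 for unrelated ones.
// Caveat: zlib's 32 KB history window limits how far apart matching patterns
// may lie, so for whole songs a large-window compressor (bzip2, LZMA) will
// do a much better job of "merging" the two halves.
double Ncd(const std::vector<unsigned char> &a, const std::vector<unsigned char> &b)
{
    uLongf ca = CompressedSize(a);
    uLongf cb = CompressedSize(b);
    std::vector<unsigned char> ab(a);
    ab.insert(ab.end(), b.begin(), b.end());
    uLongf cab = CompressedSize(ab);
    uLongf lo = ca < cb ? ca : cb;
    uLongf hi = ca < cb ? cb : ca;
    return (double)(cab - lo) / (double)hi;
}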

[EDIT]
Oh, I forgot to mention that the real problem is the complexity (not only in the numerical sense ...) of the algorithms.
The programming language, at that point, is just a detail.
(C will probably be the best "number cruncher")
[/EDIT]
 
Comments
[no name] 21-Jan-11 8:18am    
Interesting idea, deserves a 5.
Sergey Alexandrovich Kryukov 21-Jan-11 15:14pm    
My 5. You're absolutely right, especially the first sentence. Please look at what I think in my answer :-)
Manfred Rudolf Bihy 23-Jan-11 12:17pm    
Interesting! 5+
Disclaimer: this is not an answer, because it is impossible to give any definitive answer.

My answer is a (very tentative) draft for a whole research program:

1) Develop conversion of all files into an identical raw format (see the answer by Andrew).
2) Develop Fourier analysis for an arbitrary fragment of audio, see http://en.wikipedia.org/wiki/Fourier_analysis[^], http://en.wikipedia.org/wiki/Fast_Fourier_transform[^].
3) Develop comparison criteria for Fourier images, with weights such as the length of the fragment (a sketch for these two steps follows this list).
4) Develop "tokenization" of the whole audio piece: a way to break the whole audio stream up into several "distinct" fragments, with the requirement that the sub-fragments within a "token" fragment are relatively "close" compared to the nearby fragments. Attention! This is the most difficult part. Prepare to learn fuzzy sets or genetic programming or something like that used in image recognition.
5) You have tokenized audio fragments. Try to match fragments from different audio streams, score good matches with weights.
6) Compare the scores, present result.
7) Develop optional criteria. One criterion may put more weight on high frequencies, another on tempo/duration, etc.
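For steps 2) and 3), a very tentative sketch: a textbook recursive Cooley-Tukey FFT plus cosine similarity of the magnitude spectra. The fragment length must be a power of two here, and a serious implementation would use an iterative FFT, a window function, and much smarter comparison criteria:

#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

const double kPi = 3.14159265358979323846;

// Textbook recursive radix-2 Cooley-Tukey FFT; x.size() must be a power of two.
void Fft(std::vector<std::complex<double> > &x)
{
    const size_t n = x.size();
    if (n < 2) return;
    std::vector<std::complex<double> > even(n / 2), odd(n / 2);
    for (size_t i = 0; i < n / 2; ++i) { even[i] = x[2 * i]; odd[i] = x[2 * i + 1]; }
    Fft(even);
    Fft(odd);
    for (size_t k = 0; k < n / 2; ++k) {
        std::complex<double> t =
            std::polar(1.0, -2.0 * kPi * (double)k / (double)n) * odd[k];
        x[k] = even[k] + t;
        x[k + n / 2] = even[k] - t;
    }
}

// Cosine similarity of two magnitude spectra: ~1.0 means the fragments have
// a very similar spectral shape, ~0.0 means they are unrelated.
double SpectrumSimilarity(std::vector<std::complex<double> > a,
                          std::vector<std::complex<double> > b)
{
    Fft(a);
    Fft(b);
    double dot = 0, na = 0, nb = 0;
    for (size_t i = 0; i < a.size() && i < b.size(); ++i) {
        double ma = std::abs(a[i]), mb = std::abs(b[i]);
        dot += ma * mb;
        na += ma * ma;
        nb += mb * mb;
    }
    return dot / (std::sqrt(na) * std::sqrt(nb) + 1e-12);
}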

Even in the most successful case, I am not expecting good results for performances of, for example, the same musical opus played on different instruments, or, say, sung by different singers.

Even if you get limited success, prepare yourself for major awards in computer science. :)

Now, in real life this is a pretty important problem. I remember an announcement on the Boston Craigslist. A "patient" offered a considerable fee to anyone who would sort tons of his music records, with elimination of duplicates (important!), merging of tags, descriptions, etc. He did not care whether it was done manually or through programming, but hoped for the development of the technology (a pretty naive hope, but...). Now imagine you love music (like I do) -- the job would mean real torture: listening to a lot of... well, different records. If you don't care, chances are you would fail to recognize the tunes... Makes sense, right?

[EDIT]

Please pay attention to the answer by Espen Harlinn: he was able to go much deeper than I did.
See also my comment to that Answer.
 
Comments
Yusuf 21-Jan-11 15:17pm    
Hey, it may not be an answer, but it's a very good lead. +5
Emilio Garavaglia 21-Jan-11 16:08pm    
Wow! very challenging!

But ... hey: the human ear is a spectrum analyzer and the human brain is one of the most sophisticated "correlators"...
The risk is that we just have to ... listen to the songs! :-)
Sergey Alexandrovich Kryukov 21-Jan-11 16:17pm    
Of course you're right. It's cutting-edge computing right now, but matching a human in recognition quality is not even on the horizon.

...or do you advise a human sweatshop behind a Web engine? Also a method, by the way... :-)
Sergey Alexandrovich Kryukov 21-Jan-11 20:15pm    
Speaking of listening to the song: please see another paragraph I added to my answer.
Sergey Alexandrovich Kryukov 21-Jan-11 16:24pm    
Fixed my text. Important: "genetic" (not that I'm an expert in the method).
One approach could be to create an audio fragment recognition/matching system based on Hidden Markov Models[^] (HMM), where the audio is considered a piecewise signal.

This is an elaboration on SAKryukov's answer, which I think may have some relevance to 3), 4) and 5).

Update:
As with most things - it turns out that somebody wrestled with part of the problem before:
Greed is Good: Algorithmic Results for Sparse Approximation[^] allows you to retrieve a sparse signal - an optimized, minimal representation of the input signal. I imagine this could be used to process the inputs to the system, generating better "pieces" for the "dictionary" that could serve as the foundation for the dynamically generated HMM.


This approach can possibly be used to recognize sequences of pieces from the first stream in the second stream.

There is a conceptual similarity to a bottom-up parser where one stream is used to construct the "grammar" - please note the quotes - and you'll have to deal with probable matches defining multiple possible solution paths, then use that information to determine whether you have something you want to consider a match. I envision some kind of state machine/automaton where the branches are evaluated according to how probable the next state is. This state machine is generated based on what can be "learned" from one of the streams and used during matching against the other stream.

HMMs have found their use in speech recognition systems, so maybe they can be used to implement what you are trying to achieve. The hard part will be to formulate one of the input streams as an HMM - "hard" as in I don't think this has been tried before, but other uses of HMMs indicate that this may be a feasible approach.

To get the coefficient vectors for the HMMs, calculate the Fourier transform for a piece of the signal, as proposed by SAKryukov, and then select the most significant coefficients - those that contribute the most to the "shape" of the signal - as input to the HMM.
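A rough sketch of that feature-extraction step alone: frame the signal, FFT each frame, and keep the indices of the k largest magnitude bins as the observation for that frame. Fft here stands for any FFT routine (e.g. the one sketched in SAKryukov's answer above); frameSize and k are arbitrary choices, not recommendations:

#include <complex>
#include <cstddef>
#include <vector>

void Fft(std::vector<std::complex<double> > &x); // e.g. the FFT sketched above

// Indices of the k largest-magnitude bins of one FFT frame (positive
// frequencies only). Simple O(k*n) selection, fine for small k.
std::vector<int> TopBins(const std::vector<std::complex<double> > &frame, int k)
{
    std::vector<double> mag(frame.size() / 2);
    for (size_t i = 0; i < mag.size(); ++i) mag[i] = std::abs(frame[i]);
    std::vector<int> bins;
    for (int pick = 0; pick < k && (size_t)pick < mag.size(); ++pick) {
        size_t best = 0;
        for (size_t i = 1; i < mag.size(); ++i) if (mag[i] > mag[best]) best = i;
        bins.push_back((int)best);
        mag[best] = -1.0; // exclude from further picks
    }
    return bins;
}

// Frame the signal and emit one "observation" per frame: the set of its
// k strongest frequency bins. frameSize must suit the FFT (power of two).
std::vector<std::vector<int> > Observations(const std::vector<float> &samples,
                                            size_t frameSize, int k)
{
    std::vector<std::vector<int> > obs;
    for (size_t start = 0; start + frameSize <= samples.size(); start += frameSize) {
        std::vector<std::complex<double> > frame(frameSize);
        for (size_t i = 0; i < frameSize; ++i) frame[i] = samples[start + i];
        Fft(frame);
        obs.push_back(TopBins(frame, k));
    }
    return obs;
}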

This is so oversimplified, I'm just trying to point you in the direction of a possible solution.

It will probably be quite hard to automatically determine the statistical significance of the decomposed input to the HMM, but I guess something can be learned from research on "part of speech" recognition systems, including how to "train" them.

This is just my 2 cents' worth on one possible approach to the development of SAKryukov's correlation method. This is, as SAKryukov says, research material :)

What language is suitable? C/C++; you'll find some high-quality math/numeric libraries around the net.

How hard is it to do (how many hours, roughly)? Starting from scratch? Well, maybe a couple of years for some really dedicated scientists and developers.

Where can I find a good source of audio library/tools? sourceforge.net gives you a lot of options - but as far as I know, none that will provide more than the "trivial" parts of your solution.

Regards
Espen Harlinn
 
Comments
Sergey Alexandrovich Kryukov 23-Jan-11 12:36pm    
Espen, I think you're much more qualified in this matter than I am (I also remember your added note about wavelets; wavelets might be more relevant to this problem than to a relatively simple oscilloscope application).
My respect.

As to the expected success of the development, this is questionable. The main problem is not even technology but some kind of science (in particular, of human recognition): what criteria of closeness would match human intuition, what criteria make sense for different applications, how humans can be classified into different types by the way they recognize. I have learned a wonderful lesson on how different people are, but this is quite a different story.

(5 of course, who would doubt...)
--SA
Espen Harlinn 23-Jan-11 13:11pm    
Thanks SAKryukov - remember I've taken a look at your resume, I imagine you tackled some pretty complicated scenarios at http://www.inr.ac.ru/ - and other places :) - once upon a time I imagined that doing signal processing and related hardware would be what I would do for a living, turns out I was wrong – but I do remember a few things … Given that anything we present on this issue is a vast oversimplification - the paper http://www.math.princeton.edu/tfbb/spring03/greed-ticam0304.pdf, seems to justify part of the approach you outlined - but obviously in more detail. ( The parts the paper doesn't justify aren't dealt with at all )
Sergey Alexandrovich Kryukov 23-Jan-11 19:01pm    
Great, Espen, thank you for this reference.
I am a physicist, but most of my time is spent on software architecture and implementation and practically all related stuff.
I found a good article regarding your question:
Duplicate songs detector via audio fingerprinting[^]
 