This is a genuinely big task. The short answer is that you can't do it with the strategy you suggest (checking all websites):
Downloading everything that exists right now would take a very long time. Even if you take a download-and-check approach, by the time you have finished, a lot of the content will have changed and will need to be checked again. Worse, more content will have been added than you were able to download in the same time, so the job effectively never finishes.
You therefore need to work smart rather than hard. First, cut the problem domain down: only check sites related to your topic; something like the Google Custom Search API can reduce the number of sites you need to check. Secondly, you could use the same API to search for text taken from the potentially plagiarised article (the more unusual the text, the better), or search on the abstracts. You could add a heuristic approach on top of that to improve performance and results, but that gets complicated. Unfortunately, Google has restricted its API unless you pay, and even with Google you'd be hard pressed to get 100% coverage, or 100% accuracy (especially if the article has been re-worded).
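As a very rough sketch of the search-for-unusual-text idea, assuming you have a Google API key and a Programmable Search Engine ID (both placeholders below), you could pull longer sentences out of the article and run each one as an exact-phrase query against the Custom Search JSON API. The sentence-splitting heuristic here is deliberately naive and only illustrates the shape of the approach; the free tier also limits how many queries you can make per day, so you would want to be selective about which phrases you check.

```python
import requests

API_KEY = "YOUR_API_KEY"          # placeholder: from the Google Cloud console
SEARCH_ENGINE_ID = "YOUR_CX_ID"   # placeholder: from the Programmable Search Engine panel

def phrase_hits(phrase: str) -> int:
    """Return how many indexed pages contain the exact phrase."""
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={
            "key": API_KEY,
            "cx": SEARCH_ENGINE_ID,
            "q": f'"{phrase}"',   # quotes force an exact-phrase match
            "num": 1,             # only the total count matters here
        },
        timeout=10,
    )
    resp.raise_for_status()
    return int(resp.json().get("searchInformation", {}).get("totalResults", 0))

def candidate_phrases(article_text: str, min_words: int = 8):
    """Yield longer sentences from the article; longer, more specific
    phrases are less likely to match other pages by coincidence."""
    for sentence in article_text.split("."):
        words = sentence.split()
        if len(words) >= min_words:
            yield " ".join(words)

article = "...text of the potentially plagiarised article..."
for phrase in candidate_phrases(article):
    if phrase_hits(phrase) > 0:
        print("Possible match for:", phrase)
```

A real checker would pick the rarest phrases first (for example, the sentences with the least common words) rather than querying every long sentence, precisely because of the API quota.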
Finally, you could look at the existing plagiarism checkers (e.g. http://www.duplichecker.com/); that would take the hard work out of your hands entirely, but you'd also lose the interesting part of your project.