This is a genuinely big task. The short answer is that you can't do it with the strategy you suggest (checking all websites):
Downloading everything that exists right now would take a very long time. Even if you take a download-and-check approach, by the time you have finished, a lot of the content will have changed and will need to be checked again. Worse, more content will have been added than you were able to download in the same time, so the job effectively never finishes.
You therefore need to work smart rather than hard. First, cut the problem domain down: only check sites related to your topic; something like the Google Custom Search API can reduce the number of sites you need to check. Secondly, you could use the same API to search for text taken from the potentially plagiarised article (the more unusual the text, the better), or search on the abstracts. You could add a heuristic approach on top of that to improve performance and results, but that gets complicated. Unfortunately, Google has restricted its API unless you pay, and even with Google you'd be hard pressed to get 100% coverage, or 100% accuracy (especially if the article has been re-worded).
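As a very rough sketch of the search-for-unusual-text idea, assuming you have a Google API key and a Programmable Search Engine ID (both placeholders below), you could pull longer sentences out of the article and run each one as an exact-phrase query against the Custom Search JSON API. The sentence-splitting heuristic here is deliberately naive and only illustrates the shape of the approach; the free tier also limits how many queries you can make per day, so you would want to be selective about which phrases you check.

```python
import requests

API_KEY = "YOUR_API_KEY"          # placeholder: from the Google Cloud console
SEARCH_ENGINE_ID = "YOUR_CX_ID"   # placeholder: from the Programmable Search Engine panel

def phrase_hits(phrase: str) -> int:
    """Return how many indexed pages contain the exact phrase."""
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={
            "key": API_KEY,
            "cx": SEARCH_ENGINE_ID,
            "q": f'"{phrase}"',   # quotes force an exact-phrase match
            "num": 1,             # only the total count matters here
        },
        timeout=10,
    )
    resp.raise_for_status()
    return int(resp.json().get("searchInformation", {}).get("totalResults", 0))

def candidate_phrases(article_text: str, min_words: int = 8):
    """Yield longer sentences from the article; longer, more specific
    phrases are less likely to match other pages by coincidence."""
    for sentence in article_text.split("."):
        words = sentence.split()
        if len(words) >= min_words:
            yield " ".join(words)

article = "...text of the potentially plagiarised article..."
for phrase in candidate_phrases(article):
    if phrase_hits(phrase) > 0:
        print("Possible match for:", phrase)
```

A real checker would pick the rarest phrases first (for example, the sentences with the least common words) rather than querying every long sentence, precisely because of the API quota.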
Finally, you could look at the existing plagiarism checkers (e.g. http://www.duplichecker.com/); that would take the hard work out of your hands entirely, but you'd also lose the interesting part of your project.