> Q: How can I completely exclude TurnitinBot from my site?
> To exclude TurnitinBot from all or portions of your site, all you have to do is create a file called robots.txt and put it in the topmost directory of your web site.
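For reference, the minimal robots.txt that the FAQ answer above describes — excluding TurnitinBot from the whole site — would look like this (standard robots.txt syntax; the bot name is the one Turnitin's crawler reports):

```
User-agent: TurnitinBot
Disallow: /
```

Of course, as the rest of this thread discusses, robots.txt is only a request: a crawler has to choose to honor it.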
1. So that case was about the CFAA (Computer Fraud and Abuse Act). So at most it would say that ignoring the robots.txt does not violate the CFAA -- a law that makes some things felonies as "hacking", basically. I agree that ignoring the robots.txt (say if you are Archive Team? [1]) should not be considered a criminal "hacking" felony.
But there can still be other reasons ignoring the robots.txt is against a law -- or cause for a civil tort action. (Most copyright violation is a civil tort action for instance, the CFAA is, again, a law that establishes some felonies with many years of jail time, intended to punish "hackers"). The decision in that case said nothing about anything except the CFAA.
For instance, taking copyrighted content from the public web and re-selling it is probably still going to put you in various kinds of legal trouble -- just not a CFAA violation. It's possible ignoring a robots.txt could put you in other kinds of criminal or civil trouble, depending on the particular circumstances -- just not a CFAA violation. It would be interesting to research what other possible liability there might be. If for instance you caused harm to the site by ignoring the robots.txt (say, an accidental or intentional DOS), I bet there'd at least be cause for civil tort.
2. Even so, even under that case, if that specific case didn't involve a robots.txt (did it?), it's always possible the presence of a robots.txt would result in a different outcome. My sense is probably not though -- that Supreme Court decision referenced by the Ninth Circuit on remand probably does mean ignoring a robots.txt is not a violation of the CFAA. (And again, I say, PHEW, that would have been terrible if it were -- if say someone trying to archive MySpace before it went away could be put in prison for a couple decades for disrespecting the robots.txt).
That case isn't even decided yet, the court only ruled on a preliminary injunction so there's still quite a bit of case left to go before any final decisions are made. For now it's only "likely" that HiQ will prevail (though that means it's pretty likely).
In this case LinkedIn sent HiQ a cease-and-desist letter before they sued and claimed that letter revoked access for the purpose of the CFAA, so not quite the same as a robots.txt, but legally it's probably close enough. If anything it's stronger, because HiQ can't claim they didn't see it.
Good point, thank you, yet another reason it may not mean that!
The 2020 Supreme Court decision in Van Buren v. United States that the Ninth Circuit said they were relying on has been decided though, and does seem pretty pertinent and a (welcome, thank god) nail in the coffin of CFAA overreach. https://techcrunch.com/2021/06/03/supreme-court-hacking-cfaa...
(Although now that I actually look at THAT case... damn! I think that's a WAY better CFAA case than a lot of them! I guess it took a police officer being the defendant to get the currently pro-police Supreme Court to actually say CFAA prosecution was going too far, okeydoke).
But yeah, nothing is for sure... almost ever with the law. But it does seem to be moving in a direction...
Well, it's possible to also return a 403 (forbidden) to any request based off the user agent. Of course, this can be relatively easily circumvented, but then it's also possible to block IP ranges and suchlike. You can return a 403 off of any detectable aspect of the client that you don't like if you so wish.
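The idea above isn't tied to any particular server. As a hedged sketch (the blocklist and names here are illustrative, not anyone's real config), here's the same user-agent-based 403 as a tiny WSGI middleware in Python:

```python
# Sketch: return 403 Forbidden when the User-Agent matches a blocklist.
# The pattern below is illustrative; as noted above, determined clients
# can trivially spoof their User-Agent, so treat this as a polite fence.
import re

BLOCKED_UA = re.compile(r"TurnitinBot|SlySearch|NPBot", re.IGNORECASE)

def block_bad_bots(app):
    """Wrap a WSGI app, rejecting requests from blocklisted user agents."""
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if BLOCKED_UA.search(ua):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return app(environ, start_response)
    return middleware
```

The same check can be done one layer down (web server config, as in the nginx example below in the thread) or one layer up (a CDN rule), which is usually cheaper than doing it in application code.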
I don't know how well this would work with a CDN, but presumably if you pay for the right tier of Cloudflare (or whatever) you can perform similar operations to prevent content being hoovered from there by clients you'd prefer not to serve.
Yep. I 403 turnitin and similar companies via nginx configuration, matching on the User-Agent header:

```
if ($http_user_agent ~* (TurnitinBot|PaperLiBot|idmarch|FairShare|Lightspeedsystems|ZmEu|BPImageWalker|semrushBot|ias_crawler|360spider|copyrightinfringementportal|PetalBot|Adsbot|SlySearch|NPBot)) {
    return 403;
}
```
Legit Huawei IP ranges identifying as Huawei PetalBot were being abusive, definitely not obeying robots.txt, and searching for subsets of content that indicated they were looking to identify political dissidents with no worries about actually indexing the full site. I don't consider it a real search engine.
But yeah, maybe not a good fit for this list of educational and copyright parasites.
I get some of the feeling behind this. But in terms of Turnitin my wife taught in art college and a close friend taught a Masters in Economics, and the amount of plagiarism was ridiculous. Sure, in theory there should be smaller class sizes, teachers should have more time per student, etc. etc., but Turnitin was an extremely helpful tool that meant they could offload the cognitive effort of detecting mechanical reproduction and get into reading the work. Unless there's something about Turnitin that I'm not aware of which tarnishes what they do? (Beyond making money out of already cash-strapped universities, I suppose...)
I'm a student at a university that uses Turnitin. I understand the need for the tools and I absolutely don't have a problem with them using automated tools to check my work for plagiarism.
What I do have a problem with is the fact that, after uploading an assignment, I am required to click a checkbox that says "I agree to Turnitin's end-user license agreement." I should not have to agree to a license for a piece of software that I'm not even using; it's my professor who's using Turnitin's services. And if it's 11:55 PM and I'm trying to submit my assignment, it feels really scummy to suddenly force me to sign a legal contract that I don't even have time to read.
My university used to use Blackboard, which had SafeAssign (if I recall correctly). It was proprietary to Blackboard and the only option (at that time) was for students to choose if papers were included in the permanent reference database. Because I was given the choice, I selected that because I thought it was both considerate toward me and valuable to my scholarly work.
Then my university moved to Canvas and TurnItIn. At first there was no license agreement check box, and all the courses were force-enabled to allow TurnItIn to store student submissions forever.
I raised a lot of hell over that, and the next term there was that same checkbox that I assume you also see.
It always felt very coercive. I hated checking that box. I fought tooth and nail. I had conference calls with the Academic Technologies leadership. They absolutely didn’t understand the objection. They compared it to Office 365 and didn’t understand the point that neither the university nor Microsoft was requiring that I give them a perpetual, virtually limitless license to my content in order to use the service.
I pointed to the university policies which explicitly and very clearly categorized non-compensated student output as the property of the student, who was to retain all rights. I pointed out the conflict of interest that iParadigms brings to the table.
All I ever got in response was the talking points I found on the TurnItIn marketing material. I’d have been OK if they disagreed after an actual discussion, but they weren’t interested.
OK, this objection sort of makes sense to me. Do they have something in their Ts and Cs which says "by submitting your work you consent to us storing your work..."? Presumably people who submit their work to them also benefit to some extent though, because then plagiarisers of your work will be caught?
>When a paper is submitted to Turnitin, it is compared against a vast, secure proprietary database of licensed source material, including millions of periodicals, academic journals, books, and web pages. Turnitin also maintains a separate repository of student papers. Each institution (at the discretion of his or her school administrator) can determine whether or not to include student papers in the repository. We can remove student papers from the standard repository at the request of a school administrator.
Just because something is commonly practiced without choice, doesn't mean that it is legally defensible, which is what the person above was questioning.
Authors generally have rights to their works by default, even students. Those who use copyrighted works for academic purposes do get some exceptions under copyright law... but a commercial service is not that.
Furthermore, student assignments are additionally protected by FERPA.
I'm not sure the answer, but I do think it's a great question.
Well, it might be in your interest to know if someone plagiarizes you. I don't know anything about Turnitin, though I doubt they notify the original copyright holder?
If you provide copies of your work for free on the internet, this is why they get to keep one, just as everyone else? They are probably not allowed to distribute it, though?
I don't think they're talking about posting your work for free on the internet, they mean when you submit an assignment for a class turnitin stores your assignment so they can check future submissions against yours. In this case, you haven't (necessarily) shared it anywhere public, all you did was submit it to your teacher, and now turnitin gets a copy forever.
Presumably because someone with the rights to distribute it, either you or your school, uploaded it to them.
A more interesting question is, if these companies do well and stay in business a long time, won't it become increasingly difficult to write an original paper that isn't flagged for plagiarism? There's only so many ways to describe the effects of the Lend-Lease Act on postwar Europe.
Because we're talking about plagiarism, and because this is so often a point of confusion in internet debates, I want to just point out that plagiarism and copyright are almost unrelated things.
* plagiarism is not generally against the law, although it is a violation of school policies and can get you punished by your school. It's claiming someone else's work as your own. It may or may not involve a copyright violation, it can be plagiarism without involving a copyright violation -- it could be the author gave you permission, or the item isn't in copyright, or it would count as "fair use" -- it's still plagiarism if your teacher/school/honor code panel says it is.
* copyright is about the law, violating a copyright is against the law and can get you civil or criminal penalties. It involves copying work someone else legally owns without permission. It may or may not involve claiming the work as your own, for the most part whether you attribute something properly or claim it for your own is not relevant to whether it is a copyright violation. (I suppose in some edge cases it could be relevant to whether you have a "fair use" defense, but mostly it's not significant in whether something is a copyright violation).
So by that logic I can pirate all content, and as long as I don't claim it's mine it's not violating copyright laws?
U.S. copyright law provides copyright owners with the following exclusive rights:
- Reproduce the work in copies or phonorecords.
- Prepare derivative works based upon the work.
- Distribute copies or phonorecords of the work to the public by sale or other transfer of ownership or by rental, lease, or lending.
- Perform the work publicly if it is a literary, musical, dramatic, or choreographic work; a pantomime; or a motion picture or other audiovisual work.
- Display the work publicly if it is a literary, musical, dramatic, or choreographic work; a pantomime; or a pictorial, graphic, or sculptural work.
- Perform the work publicly by means of a digital audio transmission if the work is a sound recording.
Copyright also provides the owner of copyright the right to authorize others to exercise these exclusive rights, subject to certain statutory limitations.
Fair use is an affirmative defense to copyright violation. In this case, fair use _probably_ applies. First, it's a transformative use, albeit not a typical one. Secondly, TurnItIn's service does not diminish the market for the original work. Institutions pay TurnItIn to detect plagiarism, not to read content of websites it scrapes. Arguably, maybe this reduces the demand from university libraries to buy the works, but since there's still value to have the documents in the library for student access, I'd want some convincing evidence that this is actually happening.
Combined, TurnItIn is offering a service that individual copyright holders are likely uninterested in providing themselves, the use does not reduce the marketability of the copyrighted work, and there is probably not a market for this service that individual copyright holders could monetize. This is a pretty good case for fair use.
Not trying to stir something, genuinely trying to find the line/difference, so this may be a dumb question, but:
By that logic, isn't it the same for Google? They keep a copy of your content in their cache/index (you can even get the indexed page directly from their cache).
If I'm not entirely mistaken, online copyright enforcement has focused on sharing/sending/transmitting unlicensed copyrighted work. It has not enforced downloading against the downloader.
So, sure - download to your heart's content. Which means TurnItIn has little incentive to keep works out of their database - you can't make a claim against them if they don't publicize that your work is in it.
You can opt out of that, also it's published content and caching has a carved out exception. When you use this service at a school the teacher submits your paper to the service and they store it to validate against other students papers as well.
They don't only check against "published" sources.
Not claiming ownership, or explicitly disclaiming ownership (e.g. "I don't own this", "No copyright infringement intended", or giving 'credit' to a source) are not defenses to copyright infringement. These are just urban myths that people repost on the internet with no basis in reality.
The objection likely is due to crawling VLC's website costing the project money, and crawling the site being completely useless for Turnitin...but them not caring and having the resources of a for-profit company vs an open source software NPO.
I used the Nginx Ultimate Bad Bot Blocker on a couple of my side projects, and it is a pretty good project: https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blo... . Apart from that, Cloudflare offers UA blocking and AI-driven bot management too. Most of these bots are for content scraping and then creating search spam results. I am a one-person show, and it hurts both financially and resource-wise on my tiny servers. So I block them.
Why crawl when a sitemap is provided? Honest question.
IME, using a sitemap is much more efficient. For example, HTTP/1.1 pipelining can be used to reduce the number of TCP connections needed.
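To illustrate the sitemap-first approach (this is a sketch; `sitemap_urls` is a hypothetical helper, not any crawler's real code): pull the URL list from sitemap.xml once, then fetch those pages over a single persistent connection, rather than discovering links by recursive crawling. True HTTP/1.1 pipelining is rarely supported by servers or clients these days, but plain keep-alive reuse of one TCP connection gives a similar reduction.

```python
# Sketch: extract the page URLs a sitemap advertises, so a well-behaved
# client can fetch exactly those pages (e.g. over one keep-alive
# connection) instead of spidering the whole site.
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(xml_data):
    """Return the <loc> entries from a sitemaps.org-format sitemap."""
    root = ET.fromstring(xml_data)
    return [loc.text for loc in root.iter(SITEMAP_NS + "loc")]
```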
Is resource exhaustion what draws a public website^1 operator's attention to "bots"? If it is not resource exhaustion, then what is it?
1. For this question, assume "public website" means a website serving public information where there are no legitimate intellectual property rights in the information that can be asserted by the site operator.
Though, you gotta figure it's tough to be a moderator when you have a massive report queue, full of toxic behavior. You come across this post with just a plain "Fuck off", not much more or less toxic than anything else you have to deal with. But~ instead of just getting your backlog done, you instead click into the post, look at the wider context of not just the person they're responding to, but also the post itself.
The bot is clearly crawling enough to be noticed, and consider the site: how much are you plagiarizing from videolan.org? If they're wasting even a small amount of resources, they're worth blocking.
There are arguments on whether plagiarism is a bad thing in an academic context. I'm not nearly qualified enough to make them, but they exist if you want to go looking.
So, I can understand the hate towards copyright enforcement bots, but... did TurnItIn hammer the shit out of VLC's website? Or do VLC's developers just hate the idea of automated enforcement in general?
(I doubt they're pro-plagiarism - not even copyright abolitionists go that far.)
I imagine the objection is that it is consuming bandwidth, electricity, and computing capacity of an open source project as part of a profit-making service, with an extra fuck-you of a) crawling the site making no sense whatsoever for the service and b) the service being of no possible use to VLC or its users
You don't need to be relevant, just being accessible is enough reason for all sorts of bots, legit or sketchy, to shove hundreds of thousands of requests down your throat.
Correct me if I'm wrong: After the recent web scraping ruling[1] it seems that it's perfectly legal to ignore the robots.txt.
[1] https://news.ycombinator.com/item?id=31075396