> Q: How can I completely exclude TurnitinBot from my site?
> To exclude TurnitinBot from all or portions of your site, all you have to do is create a file called robots.txt and put it in the topmost directory of your web site.
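For reference, the minimal robots.txt that the FAQ answer above describes — excluding TurnitinBot from the whole site — would look like this (standard robots.txt syntax; the bot name is the one Turnitin's crawler reports):

```
User-agent: TurnitinBot
Disallow: /
```

Of course, as the rest of this thread discusses, robots.txt is only a request: a crawler has to choose to honor it.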
1. So that case was about the CFAA (Computer Fraud and Abuse Act). So at most it would say that ignoring the robots.txt does not violate the CFAA -- a law that makes some things felonies as "hacking", basically. I agree that ignoring the robots.txt (say if you are Archive Team? [1]) should not be considered a criminal "hacking" felony.
But there can still be other reasons ignoring the robots.txt is against a law -- or cause for a civil tort action. (Most copyright violation is a civil tort action for instance, the CFAA is, again, a law that establishes some felonies with many years of jail time, intended to punish "hackers"). The decision in that case said nothing about anything except the CFAA.
For instance, taking copyrighted content from the public web and re-selling it is probably still going to put you in various kinds of legal trouble -- just not a CFAA violation. It's possible ignoring a robots.txt could put you in other kinds of criminal or civil trouble, depending on the particular circumstances -- just not a CFAA violation. It would be interesting to research what other possible liability there might be. If for instance you caused harm to the site by ignoring the robots.txt (say, an accidental or intentional DOS), I bet there'd at least be cause for civil tort.
2. Even so, even under that case, if that specific case didn't involve a robots.txt (did it?), it's always possible the presence of a robots.txt would result in a different outcome. My sense is probably not though -- that Supreme Court decision referenced by the Ninth Circuit on remand probably does mean ignoring a robots.txt is not a violation of the CFAA. (And again, I say, PHEW, that would have been terrible if it were -- if say someone trying to archive MySpace before it went away could be put in prison for a couple decades for disrespecting the robots.txt).
That case isn't even decided yet, the court only ruled on a preliminary injunction so there's still quite a bit of case left to go before any final decisions are made. For now it's only "likely" that HiQ will prevail (though that means it's pretty likely).
In this case LinkedIn sent HiQ a cease-and-desist letter before they sued and claimed that letter revoked access for the purpose of the CFAA, so not quite the same as a robots.txt, but legally it's probably close enough. If anything it's stronger, because HiQ can't claim they didn't see it.
Good point, thank you, yet another reason it may not mean that!
The 2020 Supreme Court decision in Van Buren v. United States that the Ninth Circuit said they were relying on has been decided though, and does seem pretty pertinent and a (welcome, thank god) nail in the coffin of CFAA overreach. https://techcrunch.com/2021/06/03/supreme-court-hacking-cfaa...
(Although now that I actually look at THAT case... damn! I think that's a WAY better CFAA case than a lot of them! I guess it took a police officer being the defendant to get the currently pro-police Supreme Court to actually say CFAA prosecution was going too far, okeydoke).
But yeah, nothing is for sure... almost ever with the law. But it does seem to be moving in a direction...
Well, it's possible to also return a 403 (forbidden) to any request based off the user agent. Of course, this can be relatively easily circumvented, but then it's also possible to block IP ranges and suchlike. You can return a 403 off of any detectable aspect of the client that you don't like if you so wish.
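The idea above isn't tied to any particular server. As a hedged sketch (the blocklist and names here are illustrative, not anyone's real config), here's the same user-agent-based 403 as a tiny WSGI middleware in Python:

```python
# Sketch: return 403 Forbidden when the User-Agent matches a blocklist.
# The pattern below is illustrative; as noted above, determined clients
# can trivially spoof their User-Agent, so treat this as a polite fence.
import re

BLOCKED_UA = re.compile(r"TurnitinBot|SlySearch|NPBot", re.IGNORECASE)

def block_bad_bots(app):
    """Wrap a WSGI app, rejecting requests from blocklisted user agents."""
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if BLOCKED_UA.search(ua):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return app(environ, start_response)
    return middleware
```

The same check can be done one layer down (web server config, as in the nginx example below in the thread) or one layer up (a CDN rule), which is usually cheaper than doing it in application code.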
I don't know how well this would work with a CDN, but presumably if you pay for the right tier of Cloudflare (or whatever) you can perform similar operations to prevent content being hoovered from there by clients you'd prefer not to serve.
Yep. I 403 turnitin and similar companies via nginx configuration, matching on the User-Agent header:

```
if ($http_user_agent ~* (TurnitinBot|PaperLiBot|idmarch|FairShare|Lightspeedsystems|ZmEu|BPImageWalker|semrushBot|ias_crawler|360spider|copyrightinfringementportal|PetalBot|Adsbot|SlySearch|NPBot)) {
    return 403;
}
```
Legit Huawei IP ranges identifying as Huawei PetalBot were being abusive, definitely not obeying robots.txt, and searching for subsets of content that indicated they were looking to identify political dissidents with no worries about actually indexing the full site. I don't consider it a real search engine.
But yeah, maybe not a good fit for this list of educational and copyright parasites.
I get some of the feeling behind this. But in terms of Turnitin my wife taught in art college and a close friend taught a Masters in Economics, and the amount of plagiarism was ridiculous. Sure, in theory there should be smaller class sizes, teachers should have more time per student, etc. etc., but Turnitin was an extremely helpful tool that meant they could offload the cognitive effort of detecting mechanical reproduction and get into reading the work. Unless there's something about Turnitin that I'm not aware of which tarnishes what they do? (Beyond making money out of already cash-strapped universities, I suppose...)
I'm a student at a university that uses Turnitin. I understand the need for the tools and I absolutely don't have a problem with them using automated tools to check my work for plagiarism.
What I do have a problem with is the fact that, after uploading an assignment, I am required to click a checkbox that says "I agree to Turnitin's end-user license agreement." I should not have to agree to a license for a piece of software that I'm not even using; it's my professor who's using Turnitin's services. And if it's 11:55 PM and I'm trying to submit my assignment, it feels really scummy to suddenly force me to sign a legal contract that I don't even have time to read.
My university used to use Blackboard, which had SafeAssign (if I recall correctly). It was proprietary to Blackboard and the only option (at that time) was for students to choose if papers were included in the permanent reference database. Because I was given the choice, I selected that because I thought it was both considerate toward me and valuable to my scholarly work.
Then my university moved to Canvas and TurnItIn. At first there was no license agreement check box, and all the courses were force-enabled to allow TurnItIn to store student submissions forever.
I raised a lot of hell over that, and the next term there was that same checkbox that I assume you also see.
It always felt very coercive. I hated checking that box. I fought tooth and nail. I had conference calls with the Academic Technologies leadership. They absolutely didn’t understand the objection. They compared it to Office 365 and didn’t understand the point that neither the university nor Microsoft was requiring that I give them a perpetual, virtually limitless license to my content in order to use the service.
I pointed to the university policies which explicitly and very clearly categorized non-compensated student output as the property of the student, who was to retain all rights. I pointed out the conflict of interest that iParadigms brings to the table.
All I ever got in response was the talking points I found on the TurnItIn marketing material. I’d have been OK if they disagreed after an actual discussion, but they weren’t interested.
OK, this objection sort of makes sense to me. Do they have something in their Ts and Cs which says "by submitting your work you consent to us storing your work..."? Presumably people who submit their work to them also benefit to some extent though, because then plagiarisers of your work will be caught?
>When a paper is submitted to Turnitin, it is compared against a vast, secure proprietary database of licensed source material, including millions of periodicals, academic journals, books, and web pages. Turnitin also maintains a separate repository of student papers. Each institution (at the discretion of his or her school administrator) can determine whether or not to include student papers in the repository. We can remove student papers from the standard repository at the request of a school administrator.
Just because something is commonly practiced without choice, doesn't mean that it is legally defensible, which is what the person above was questioning.
Authors generally have rights to their works by default, even students. Those who use copyrighted works for academic purposes do get some exceptions under copyright law... but a commercial service is not that.
Furthermore, student assignments are additionally protected by FERPA.
I'm not sure the answer, but I do think it's a great question.
Well, it might be in your interest to know if someone plagiarizes you. I don't know anything about Turnitin, though I doubt they notify the original copyright holder?
If you provide copies of your work for free on the internet, this is why they get to keep one, just as everyone else? They are probably not allowed to distribute it, though?
I don't think they're talking about posting your work for free on the internet, they mean when you submit an assignment for a class turnitin stores your assignment so they can check future submissions against yours. In this case, you haven't (necessarily) shared it anywhere public, all you did was submit it to your teacher, and now turnitin gets a copy forever.
Presumably because someone with the rights to distribute it, either you or your school, uploaded it to them.
A more interesting question is, if these companies do well and stay in business a long time, won't it become increasingly difficult to write an original paper that isn't flagged for plagiarism? There's only so many ways to describe the effects of the Lend-Lease Act on postwar Europe.
Because we're talking about plagiarism, and because this is so often a point of confusion in internet debates, I want to just point out that plagiarism and copyright are almost unrelated things.
* plagiarism is not generally against the law, although it is a violation of school policies and can get you punished by your school. It's claiming someone else's work as your own. It may or may not involve a copyright violation, it can be plagiarism without involving a copyright violation -- it could be the author gave you permission, or the item isn't in copyright, or it would count as "fair use" -- it's still plagiarism if your teacher/school/honor code panel says it is.
* copyright is about the law, violating a copyright is against the law and can get you civil or criminal penalties. It involves copying work someone else legally owns without permission. It may or may not involve claiming the work as your own, for the most part whether you attribute something properly or claim it for your own is not relevant to whether it is a copyright violation. (I suppose in some edge cases it could be relevant to whether you have a "fair use" defense, but mostly it's not significant in whether something is a copyright violation).
So by that logic I can pirate all content, and as long as I don't claim it's mine it's not violating copyright laws?
U.S. copyright law provides copyright owners with the following exclusive rights:
- Reproduce the work in copies or phonorecords.
- Prepare derivative works based upon the work.
- Distribute copies or phonorecords of the work to the public by sale or other transfer of ownership or by rental, lease, or lending.
- Perform the work publicly if it is a literary, musical, dramatic, or choreographic work; a pantomime; or a motion picture or other audiovisual work.
- Display the work publicly if it is a literary, musical, dramatic, or choreographic work; a pantomime; or a pictorial, graphic, or sculptural work.
- Perform the work publicly by means of a digital audio transmission if the work is a sound recording.
Copyright also provides the owner of copyright the right to authorize others to exercise these exclusive rights, subject to certain statutory limitations.
Fair use is an affirmative defense to copyright violation. In this case, fair use _probably_ applies. First, it's a transformative use, albeit not a typical one. Secondly, TurnItIn's service does not diminish the market for the original work. Institutions pay TurnItIn to detect plagiarism, not to read content of websites it scrapes. Arguably, maybe this reduces the demand from university libraries to buy the works, but since there's still value to have the documents in the library for student access, I'd want some convincing evidence that this is actually happening.
Combined, TurnItIn is offering a service that individual copyright holders are likely uninterested in providing themselves, the use does not reduce the marketability of the copyrighted work, and there is probably not a market for this service that individual copyright holders could monetize. This is a pretty good case for fair use.
Not trying to stir something, genuinely trying to find the line/difference, so this may be a dumb question, but:
By that logic, isn't it the same for Google? They keep a copy of your content in their cache/index (you can even get the indexed page directly from their cache).
If I'm not entirely mistaken, online copyright enforcement has focused on sharing/sending/transmitting unlicensed copyrighted work. It has not enforced downloading against the downloader.
So, sure - download to your heart's content. Which means TurnItIn has little incentive to keep works out of their database - you can't make a claim against them if they don't publicize that your work is in it.
You can opt out of that, also it's published content and caching has a carved out exception. When you use this service at a school the teacher submits your paper to the service and they store it to validate against other students papers as well.
They don't only check against "published" sources.
Not claiming ownership, or explicitly disclaiming ownership (e.g. "I don't own this", "No copyright infringement intended", or giving 'credit' to a source) are not defenses to copyright infringement. These are just urban myths that people repost on the internet with no basis in reality.
The objection likely is due to crawling VLC's website costing the project money, and crawling the site being completely useless for Turnitin...but them not caring and having the resources of a for-profit company vs an open source software NPO.
I used the Nginx Ultimate Bad Bot Blocker on a couple of my side projects, and it is a pretty good project: https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blo... . Apart from that, Cloudflare offers UA blocking and AI-driven bot management too. Most of these bots are for content scraping and then creating search spam results. I am a one-person show, and it hurts both financially and resource-wise on my tiny servers. So I block them.
Why crawl when a sitemap is provided? Honest question.
IME, using a sitemap is much more efficient. For example, HTTP/1.1 pipelining can be used to reduce the number of TCP connections needed.
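To illustrate the sitemap-first approach (this is a sketch; `sitemap_urls` is a hypothetical helper, not any crawler's real code): pull the URL list from sitemap.xml once, then fetch those pages over a single persistent connection, rather than discovering links by recursive crawling. True HTTP/1.1 pipelining is rarely supported by servers or clients these days, but plain keep-alive reuse of one TCP connection gives a similar reduction.

```python
# Sketch: extract the page URLs a sitemap advertises, so a well-behaved
# client can fetch exactly those pages (e.g. over one keep-alive
# connection) instead of spidering the whole site.
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(xml_data):
    """Return the <loc> entries from a sitemaps.org-format sitemap."""
    root = ET.fromstring(xml_data)
    return [loc.text for loc in root.iter(SITEMAP_NS + "loc")]
```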
Is resource exhaustion what draws a public website^1 operator's attention to "bots"? If it is not resource exhaustion, then what is it?
1. For this question, assume "public website" means a website serving public information where there are no legitimate intellectual property rights in the information that can be asserted by the site operator.
Though, you gotta figure it's tough to be a moderator when you have a massive report queue, full of toxic behavior. You come across this post with just a plain "Fuck off", not much more or less toxic than anything else you have to deal with. But~ instead of just getting your backlog done, you instead click into the post, look at the wider context of not just the person they're responding to, but also the post itself.
The bot is clearly crawling enough to be noticed, and consider the site: how much are you plagiarizing from videolan.org? If they're wasting even a small amount of resources, they're worth blocking.
There are arguments on whether plagiarism is a bad thing in an academic context. I'm not nearly qualified enough to make them, but they exist if you want to go looking.
So, I can understand the hate towards copyright enforcement bots, but... did TurnItIn hammer the shit out of VLC's website? Or do VLC's developers just hate the idea of automated enforcement in general?
(I doubt they're pro-plagiarism - not even copyright abolitionists go that far.)
I imagine the objection is that it is consuming bandwidth, electricity, and computing capacity of an open source project as part of a profit-making service, with an extra fuck-you of a) crawling the site making no sense whatsoever for the service and b) the service being of no possible use to VLC or its users
You don't need to be relevant, just being accessible is enough reason for all sorts of bots, legit or sketchy, to shove hundreds of thousands of requests down your throat.
Correct me if I'm wrong: After the recent web scraping ruling[1] it seems that it's perfectly legal to ignore the robots.txt.
[1] https://news.ycombinator.com/item?id=31075396