
Yep. I 403 Turnitin and similar companies via nginx configuration, matching on the User-Agent header:

    if ($http_user_agent ~* (TurnitinBot|PaperLiBot|idmarch|FairShare|Lightspeedsystems|ZmEu|BPImageWalker|semrushBot|ias_crawler|360spider|copyrightinfringementportal|PetalBot|Adsbot|SlySearch|NPBot)) { return 403; }
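
For anyone adapting this, here is a rough sketch of the same block written with a `map`, which keeps the pattern in one place if several server blocks share it. The variable name `$blocked_agent` is made up for illustration, the list is shortened, and the `server` block is a placeholder, so treat it as a starting point rather than a drop-in config.

```nginx
# Goes in the http context. $blocked_agent is a hypothetical variable name.
map $http_user_agent $blocked_agent {
    default 0;
    # Case-insensitive regex match against the User-Agent header
    # (shortened list; extend with the rest of the names as needed).
    "~*(TurnitinBot|PaperLiBot|semrushBot|360spider|PetalBot)" 1;
}

server {
    listen 80;
    server_name example.com;  # placeholder

    # nginx treats "0" and the empty string as false here.
    if ($blocked_agent) {
        return 403;
    }
}
```

Keep in mind that the User-Agent header is set by the client, so this only stops crawlers that identify themselves honestly.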

But my favorite robots.txt is:

    User-agent: Zombies
    Disallow: /brains


Shouldn’t that be…

    User-agent: Zombies
    Disallow: /braaains

?


Why are you blocking PetalBot? It's an actual search engine.


Legit Huawei IP ranges identifying as Huawei PetalBot were being abusive, definitely not obeying robots.txt, and searching for subsets of content in a way that indicated they were looking to identify political dissidents, with no interest in actually indexing the full site. I don't consider it a real search engine.

But yeah, maybe not a good fit for this list of educational and copyright parasites.


Oh yikes, that sounds bad. It is a real search engine though, https://petalsearch.com/



