> These scam sites load megabytes of junk, load slowly, have text interspersed with ads and modals that render right on top of them

Only if you're not Googlebot. The crawler sees a much nicer site.
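
For anyone who hasn't seen the mechanics: cloaking is usually just a branch on the User-Agent header. A minimal sketch in Python/Flask, purely hypothetical; the route, markup, and crawler token list are all made up for illustration:

    # Hypothetical user-agent cloaking, for illustration only.
    from flask import Flask, request

    app = Flask(__name__)

    CRAWLER_TOKENS = ("googlebot", "bingbot")  # assumed markers for this sketch

    @app.route("/article")
    def article():
        ua = request.headers.get("User-Agent", "").lower()
        if any(token in ua for token in CRAWLER_TOKENS):
            # Crawlers get the clean, fast page that ranks well.
            return "<article>Just the content.</article>"
        # Humans get the version buried under ads, modals, and trackers.
        return "<article>Content, plus megabytes of ads and modals.</article>"

    if __name__ == "__main__":
        app.run()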



which should — in theory — get them penalized for cloaking. But obviously it doesn’t. Reinforcing GP’s point.


Google has gotten pretty lenient about that: https://developers.google.com/search/docs/essentials/spam-po...

"If you operate a paywall or a content-gating mechanism, we don't consider this to be cloaking if Google can see the full content of what's behind the paywall just like any person who has access to the gated material and if you follow our Flexible Sampling general guidance."

I wonder if they just gave up


Hypothesis: Search, being Google's oldest product, is no longer prestigious to work in. It's in maintenance mode.


Does Google run other indexers for the purpose of catching cloaking? Are there other strategies that could be used? One of the problems with SO is that most of the valid content is out there and easily available without having to scrape the site, which may make penalizing bad content harder.
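
A crude version of that second strategy is at least easy to sketch: fetch the same URL with and without a Googlebot user agent and compare what comes back. (Python/requests; the user-agent strings, threshold, and size-only comparison are assumptions of the sketch. Real detection would have to render JavaScript, diff DOMs, and cope with cloakers that key off the requesting IP rather than the header.)

    # Fetch the same URL as "Googlebot" and as a regular browser, then compare sizes.
    import requests

    GOOGLEBOT_UA = ("Mozilla/5.0 (compatible; Googlebot/2.1; "
                    "+http://www.google.com/bot.html)")
    BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

    def looks_cloaked(url: str, threshold: float = 0.5) -> bool:
        as_bot = requests.get(url, headers={"User-Agent": GOOGLEBOT_UA}, timeout=10).text
        as_human = requests.get(url, headers={"User-Agent": BROWSER_UA}, timeout=10).text
        smaller, larger = sorted((len(as_bot), len(as_human)))
        # Wildly different payload sizes are a cloaking smell, nothing more.
        return larger > 0 and smaller / larger < threshold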


And the fact that Google is not detecting those is damning (to Google).


Does it even make sense to serve different content to a bot than what a human would see? Isn't the search engine trying to rank content made for humans?


It's an adversarial process. The search engine is, in theory, trying to rank by usefulness to the user, and the site owner is trying to maximize revenue by lying to the search engine. And the user.


I'm generally puzzled by Google's reluctance to do manual intervention in these cases. It's not like this is a secret. Just penalize the whole domain for 60 days every time a prominent site lies to the crawler.


There are very many sites where the content you see as a non-logged-in user is different from what you see if you have in your possession an all-important user cookie.


If Google's support is any indication, Google doesn't like to involve humans in its processes. There probably aren't enough humans to do the manual intervention you propose.


Then maybe the "crawler" should be an actual PC navigating to the page in a real browser, taking a screenshot (or live feed) of it, and processing that with AI.
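
Something close to that is already doable with a headless browser. Rough sketch with Playwright; the URL is a placeholder and the "process it with AI" step is omitted. Whether it scales to a web-sized index is the real question:

    # Render the page the way a human's browser would and capture what they'd see.
    from playwright.sync_api import sync_playwright

    def screenshot(url: str, path: str = "page.png") -> str:
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            page.goto(url, wait_until="networkidle")
            page.screenshot(path=path, full_page=True)
            browser.close()
        return path

    screenshot("https://example.com/some-article")  # placeholder URL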


Eh, Google chose to be identifiable as Googlebot and to obey robots.txt for other reasons of "good citizenship", because not everybody wants to be crawled.


It makes sense if you know your content isn't nice for humans (e.g. full of ads and tracking stuff) but you want it to rank high anyway.


I wonder what I'll see if I change my browser's user agent.
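
Often nothing different, unfortunately. Careful cloakers check the requesting IP rather than trusting the header, using roughly the reverse-plus-forward DNS check that Google itself documents for verifying Googlebot. A sketch of that check (the accepted host suffixes follow my reading of Google's guidance and may be incomplete):

    # Verify a "Googlebot" request the way Google suggests sites do it:
    # reverse DNS must land in googlebot.com / google.com, and the forward
    # lookup of that host must resolve back to the same IP.
    import socket

    def is_real_googlebot(ip: str) -> bool:
        try:
            host, _, _ = socket.gethostbyaddr(ip)            # reverse lookup
            if not host.endswith((".googlebot.com", ".google.com")):
                return False
            return ip in socket.gethostbyname_ex(host)[2]    # forward lookup must match
        except OSError:
            return False

A site doing that will keep serving you the ad-stuffed human version no matter what your browser claims to be.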



