> look at the network tab

The challenge there is automating it, though - usually the REST endpoints require some complex combination of temporary auth token headers that are (intentionally) difficult to generate outside the context of the app itself and that expire pretty quickly.



You can use the application context, while also automatically intercepting requests. Best of both worlds.

puppeteer: https://pptr.dev/api/puppeteer.page.setrequestinterception

playwright: https://playwright.dev/docs/network#network-events
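A minimal sketch of the Puppeteer variant, for illustration (the '/api/' filter and the target URL are hypothetical placeholders):

    import puppeteer from 'puppeteer';

    (async () => {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      await page.setRequestInterception(true);

      // The app generates its own short-lived auth headers; we just
      // read them off the requests it makes.
      page.on('request', (request) => {
        if (request.url().includes('/api/')) {
          console.log('captured headers:', request.headers());
        }
        request.continue(); // pass everything through unmodified
      });

      // Responses come back already past whatever token dance the app does.
      page.on('response', async (response) => {
        if (response.url().includes('/api/')) {
          const body = await response.json().catch(() => null);
          console.log(response.url(), body);
        }
      });

      await page.goto('https://example.com/app', { waitUntil: 'networkidle0' });
      await browser.close();
    })();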


And the article is about using Puppeteer...


In browsers, 'Copy as cURL' is decent enough. Do the request through the command line.

If there are ephemeral cookies, they tend to follow a predictable pattern.
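For illustration, this is roughly the shape of what DevTools emits for an authenticated request (the URL, token, and cookie values are placeholders, not real):

    curl 'https://example.com/api/items' \
      -H 'authorization: Bearer <token copied from DevTools>' \
      -H 'cookie: session=<session cookie copied from DevTools>' \
      --compressed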


Static files are much easier to scrape. It's even easier to scrape a static page than it is to use APIs.


Care to provide some examples? The majority of sites submitted to HN don't even require cookies, let alone tokens in special headers. A site like Twitter is the exception, not the general rule.


As a scraping target, Twitter is closer to the rule than the exception.


Not sure about "scraping targets". I'm referring to websites that can be read without using JavaScript. Few websites submitted to HN use tokens in special headers to discourage users with JS disabled from reading them. Twitter is an exception, and its efforts to annoy users into enabling JavaScript are ineffective anyway:

https://github-wiki-see.page/m/zedeus/nitter/wiki/Instances


That's true... but that was already true.

Whatever method you were using before SPAs to authenticate your scraper (HTTP requests, browser automation), you can use that same method now.
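For instance, a sketch of the browser-automation route, with a hypothetical login page and selectors; the harvested session cookies can then back plain HTTP requests, exactly as a pre-SPA scraper would have reused them:

    import puppeteer from 'puppeteer';

    (async () => {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();

      // Log in through the real UI once (URL and selectors are hypothetical).
      await page.goto('https://example.com/login');
      await page.type('#username', process.env.SCRAPE_USER ?? '');
      await page.type('#password', process.env.SCRAPE_PASS ?? '');
      await Promise.all([
        page.waitForNavigation(),
        page.click('button[type=submit]'),
      ]);

      // The session cookies the app just set can now be replayed
      // from any HTTP client, same as before SPAs.
      const cookies = await page.cookies();
      console.log(cookies);

      await browser.close();
    })();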



