> Turning a natural language specification into web scraping code is exactly the kind of code synthesis that current LLMs can already achieve.
I wish; I have tried. LLMs don't understand the DOM. Unless it is as simple as the address sitting in an element with id="address", an LLM is useless for generating scraping code. You are better off with an element picker and some heuristics to generate a selector / XPath query. There are some specialized models (I found a paper that takes the DOM tree and fits a vector to each node), but I think they are too much effort for too little gain, unless someone integrates them into an open-source scraping library so they are easy to use.
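To illustrate the picker-plus-heuristics approach: once a picker has recorded the chain of ancestors down to the clicked element, a few simple rules (anchor on the nearest id, prefer a class over a bare tag) already yield a usable selector. The data structure and rules here are my own sketch, not any particular library's:

```python
def css_selector(path):
    """Build a CSS selector from an element-picker path (outermost first).

    Each entry is a dict like {'tag': str, 'id': str or None, 'classes': [str]}.
    Heuristics: an id is assumed unique, so the selector restarts from it;
    otherwise the first class narrows the tag; otherwise use the bare tag.
    """
    parts = []
    for node in path:
        if node.get('id'):
            # ids are document-unique, so everything above is redundant
            parts = ['#' + node['id']]
        elif node.get('classes'):
            parts.append(node['tag'] + '.' + node['classes'][0])
        else:
            parts.append(node['tag'])
    return ' > '.join(parts)

# A hypothetical picked path, author-link style:
picked = [
    {'tag': 'div', 'id': 'main', 'classes': []},
    {'tag': 'section', 'id': None, 'classes': ['work-info']},
    {'tag': 'a', 'id': None, 'classes': ['author-name']},
]
print(css_selector(picked))  # #main > section.work-info > a.author-name
```

Real pickers add more rules (skip auto-generated class names, check uniqueness with querySelectorAll), but the core is this kind of walk.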
This has not been my experience, to be honest -- I wrote a pretty complex scraper for extracting portions of a page and structuring their contents into JSON, using ChatGPT-4, about a year ago and it worked pretty well. (But not "well" in the sense that a non-programmer would've been able to do this, if that's your bar.)
I even got it usefully updated when the format changed!
It probably depends on the page. In my case the page was almost half a megabyte, mostly HTML markup junk, and the text was in Japanese: https://kakuyomu.jp/works/16818023214223186311 The task was "write a selector to identify the author". I even tried giving it the author's name; that didn't help.
>You are better off with an element picker and some heuristics to generate a selector / xpath query.
Bummer. I wanted to try my hand at this. There has to be some trick where you can combine an LLM with an element picker to get a really robust solution, right?
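One plausible combination (just a sketch, not a known library): have the LLM propose several candidate selectors, then keep only the one that reproduces the picker's known-good value across sample pages. The `extract` stand-in below simulates running a selector against a parsed page; in practice it would be lxml/BeautifulSoup:

```python
def pick_robust_selector(candidates, samples, extract):
    """Return the first candidate selector that extracts the expected
    value on every sample page, or None if none survive.

    samples: list of (page, expected_value) pairs supplied by the picker.
    extract: function (page, selector) -> extracted text or None.
    """
    for sel in candidates:
        if all(extract(page, want_sel := sel) == want for page, want in samples):
            return sel
    return None

# Toy stand-ins: each "page" maps selector -> text, the way a real
# extract() over parsed HTML would behave on that page.
samples = [
    ({'#author': 'Alice', '.name': 'Alice'}, 'Alice'),
    ({'.name': 'Bob'}, 'Bob'),  # a page where the id variant is absent
]
extract = lambda page, sel: page.get(sel)
candidates = ['#author', '.name']  # e.g. proposed by an LLM
print(pick_robust_selector(candidates, samples, extract))  # .name
```

The LLM handles the fuzzy "what does this field mean" part; the validation loop supplies the robustness the LLM alone lacks, since a selector that only works on one page gets filtered out.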
That produces a list of blocks and "cheats" by using heavily engineered features such as Readability's algorithm. It is suitable for some purposes, I guess. But the paper I am talking about, https://arxiv.org/pdf/2201.10608, uses self-supervised learning.