> Turning a natural language specification into web scraping code is exactly the kind of code synthesis that current LLMs can already achieve.
I wish; I have tried. LLMs don't understand the DOM. Unless it is as simple as the address sitting in an element with id="address", an LLM is useless for generating scraping code. You are better off with an element picker and some heuristics to generate a selector / XPath query. There are some specialized models (I found a paper that takes the DOM tree and fits a vector to each node), but I think they are too much effort for too little gain, unless someone integrates them into an open-source scraping library so they are easy to use.
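To illustrate the picker-plus-heuristics approach: once a picker has recorded the chain of ancestors down to the clicked element, a few simple rules (anchor on the nearest id, prefer a class over a bare tag) already yield a usable selector. The data structure and rules here are my own sketch, not any particular library's:

```python
def css_selector(path):
    """Build a CSS selector from an element-picker path (outermost first).

    Each entry is a dict like {'tag': str, 'id': str or None, 'classes': [str]}.
    Heuristics: an id is assumed unique, so the selector restarts from it;
    otherwise the first class narrows the tag; otherwise use the bare tag.
    """
    parts = []
    for node in path:
        if node.get('id'):
            # ids are document-unique, so everything above is redundant
            parts = ['#' + node['id']]
        elif node.get('classes'):
            parts.append(node['tag'] + '.' + node['classes'][0])
        else:
            parts.append(node['tag'])
    return ' > '.join(parts)

# A hypothetical picked path, author-link style:
picked = [
    {'tag': 'div', 'id': 'main', 'classes': []},
    {'tag': 'section', 'id': None, 'classes': ['work-info']},
    {'tag': 'a', 'id': None, 'classes': ['author-name']},
]
print(css_selector(picked))  # #main > section.work-info > a.author-name
```

Real pickers add more rules (skip auto-generated class names, check uniqueness with querySelectorAll), but the core is this kind of walk.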
This has not been my experience, to be honest -- I wrote a pretty complex scraper for extracting portions of a page and structuring their contents into JSON, using ChatGPT-4, about a year ago and it worked pretty well. (But not "well" in the sense that a non-programmer would've been able to do this, if that's your bar.)
I even got it usefully updated when the format changed!
It probably depends on the page. In my case the page was almost half a megabyte, mostly HTML markup junk, and the text was in Japanese: https://kakuyomu.jp/works/16818023214223186311 The task was "write a selector to identify the author". I even tried giving it the author's name; that didn't help.
>You are better off with an element picker and some heuristics to generate a selector / xpath query.
Bummer. I wanted to try my hand at this. There has to be some trick where you can combine an LLM with an element picker to get a really robust solution, right?
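One plausible combination (just a sketch, not a known library): have the LLM propose several candidate selectors, then keep only the one that reproduces the picker's known-good value across sample pages. The `extract` stand-in below simulates running a selector against a parsed page; in practice it would be lxml/BeautifulSoup:

```python
def pick_robust_selector(candidates, samples, extract):
    """Return the first candidate selector that extracts the expected
    value on every sample page, or None if none survive.

    samples: list of (page, expected_value) pairs supplied by the picker.
    extract: function (page, selector) -> extracted text or None.
    """
    for sel in candidates:
        if all(extract(page, want_sel := sel) == want for page, want in samples):
            return sel
    return None

# Toy stand-ins: each "page" maps selector -> text, the way a real
# extract() over parsed HTML would behave on that page.
samples = [
    ({'#author': 'Alice', '.name': 'Alice'}, 'Alice'),
    ({'.name': 'Bob'}, 'Bob'),  # a page where the id variant is absent
]
extract = lambda page, sel: page.get(sel)
candidates = ['#author', '.name']  # e.g. proposed by an LLM
print(pick_robust_selector(candidates, samples, extract))  # .name
```

The LLM handles the fuzzy "what does this field mean" part; the validation loop supplies the robustness the LLM alone lacks, since a selector that only works on one page gets filtered out.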
That produces a list of blocks and "cheats" by using heavily engineered features such as Readability's algorithm. It is suitable for some purposes, I guess. But the paper I am talking about, https://arxiv.org/pdf/2201.10608, uses self-supervised learning.