It depends on the site, but I haven't found 'lost formatting' to be an issue at all. When I want to do a granular search, I use 'pdftotext' to search the plain text, and when I find a PDF of interest I open it and can go straight back to the web page it was printed from, because the header/footer contains a clickable URL.
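Something like the following is the whole pipeline. A minimal sketch, assuming poppler's pdftotext is on the PATH; the folder name and script name are made up:

    # search_pdfs.py - grep a folder of print-to-PDF captures for a term
    # assumes 'pdftotext' (poppler or xpdf) is installed and on the PATH
    import subprocess, sys
    from pathlib import Path

    term = sys.argv[1].lower()
    for pdf in sorted(Path("~/web-pdfs").expanduser().glob("*.pdf")):
        # 'pdftotext file.pdf -' writes the extracted plain text to stdout
        text = subprocess.run(["pdftotext", str(pdf), "-"],
                              capture_output=True, text=True).stdout
        if term in text.lower():
            print(pdf)

Run it as 'python search_pdfs.py someterm' and it prints the matching files.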
Most of the time, though, the formatting isn't an issue. Some authors produce stuff that doesn't look good as a PDF, even if the content is still there, but that doesn't bug me much.
Ok, so we seem to agree that print-to-PDF loses formatting. I share your interest in and fascination with this (weirdly irksome, edge-casey) problem, but just about any modern browser provides better facilities for saving web pages at higher fidelity than 'print to PDF'. Print to PDF is so easy to beat that you'd have to go out of your way to find a way not to beat it - say, by saving just the 'page source'.
PDF gives me an offline-readable version of the website, and it's pretty compatible with pdftotext as a pipeline tool. I'm not sure formatting is such a huge issue - I've yet to find a web page I can't extract some meaningful info from later on.
As does 'Save as: Web Archive' in Safari or, if you tell it to store offline, 'Add to Reading List'. In Chrome you can save pages as MHTML (.mht/.mhtml). All of these are single-keystroke, better ways to archive a web page locally.
You can extract the full text from these (with whatever tools you like) at better fidelity than you can from a PDF, which is a lossy conversion from the same source. This seems to barely merit debating, unless I'm missing something.
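To make that concrete: an MHTML file is just MIME, so even the standard library can pull the intact markup back out. A minimal sketch, assuming Python; the filename is hypothetical:

    # pull the intact HTML part back out of a saved .mht/.mhtml file
    import email
    from email import policy

    with open("saved-page.mht", "rb") as f:   # hypothetical filename
        msg = email.message_from_binary_file(f, policy=policy.default)

    for part in msg.walk():
        if part.get_content_type() == "text/html":
            print(part.get_content()[:500])   # full markup, nothing lost
            break

From there you can run any HTML-aware tooling you like - no reconstruction heuristics required.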
PDF works just fine.
I'm sure it works for you, and I'm not harbouring any delusions that I'm going to talk you out of your decades-established workflow. But for anyone looking for ways to keep track of web pages, thinking about building tools in this space, etc.: no, PDF is not a good way to archive web pages, either manually or programmatically.
I am not finding this to be true. Pretty much every PDF I have has been usable for extracting the text content - unless the web authors intentionally work to obfuscate or disable this functionality, e.g. by using images to display text content.
>PDF is not a good way to archive web pages, either manually or programmatically.
I disagree entirely with your conclusion - you haven't made a strong argument. 20,000+ fully searchable, indexable, offline-accessible PDF files vs. your opinion so far. I don't see that any of the issues you've stated are insurmountable - in fact, I find reality to be the complete opposite of your stated opinion. Please expand on this if you have the energy.
Probably the shortest version of the point I'm trying to make is that every current browser does a much better job of providing this than printing to PDF does. If you rely on print-to-PDF as a personal web archiving system, you're going to lose data in the most irritating way possible: data you thought you had collected but actually didn't.
I don't understand your point at all. In what way is a browser going to give me information that is unavailable to me unless I'm online? The PDFs of sites I've visited have all the data I need - the stuff I read that prompted me to print to PDF in the first place. I've searched, and I've yet to find a single PDF in my collection that doesn't have the info that prompted me to save it. I understand you believe your point is strong - it's still not being made in a way I can relate to.
PDF conversion throws away almost all of the structure contained in the HTML. Tools like pdftotext then try to reconstruct some of that structure (such as the correct sequence of letters and words) using complex heuristics that don't always work.
They often do so successfully enough, especially if all you want is to grep for words. pdftotext also has a table mode that attempts to reconstruct table-structured content; this is far less successful.
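Concretely, the difference between the modes is just a flag. A sketch, with 'report.pdf' as a stand-in filename; note that '-table' only ships with the xpdf build of pdftotext, while '-layout' is in both the xpdf and poppler builds:

    # compare pdftotext extraction modes on a single PDF
    # default: reading-order heuristics; -layout: keep the physical
    # column/line layout; -table (xpdf build only): keep cells aligned
    import subprocess

    for flags in ([], ["-layout"], ["-table"]):
        out = subprocess.run(
            ["pdftotext", *flags, "report.pdf", "-"],  # '-' = stdout
            capture_output=True, text=True,
        ).stdout
        print("===", " ".join(flags) or "default", "===")
        print(out[:300])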
So depending on how you want to process your stored data, saving as PDF may or may not preserve sufficient information.
If you were storing pages for the purpose of extracting specific properties of the things you're researching (say, product information, or the tree structure of HN threads), then throwing away all that structure makes the information you need much harder or even impossible to reconstruct.
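To make 'throwing away structure' concrete: with the original HTML you can walk the nesting directly, whereas after a PDF round-trip it only survives as visual indentation you'd have to guess back. A sketch against a made-up nested-comment fragment:

    # recover reply nesting straight from the markup
    from html.parser import HTMLParser

    class TreeDump(HTMLParser):
        def __init__(self):
            super().__init__()
            self.depth = 0

        def handle_starttag(self, tag, attrs):
            if tag == "ul":           # each nested list = one reply level
                self.depth += 1

        def handle_endtag(self, tag):
            if tag == "ul":
                self.depth -= 1

        def handle_data(self, data):
            if data.strip():
                print("  " * self.depth + data.strip())

    TreeDump().feed("""
    <ul><li>top comment
      <ul><li>first reply
        <ul><li>reply to the reply</li></ul>
      </li></ul>
    </li><li>second top comment</li></ul>
    """)

This prints each comment indented by its true depth, read straight from the tags rather than guessed from layout.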
If I were storing pages for unknown future purposes, I wouldn't want to throw away any information I might need, and therefore I would never use PDF as an archival format.
But I understand that you store PDF files for a very specific purpose for which lossy PDF conversion happens to be good enough. So that's fine of course.
The only question I have is where I can find the source URL of the stored pages.