It depends on the site, but I haven't found 'lost formatting' to be an issue at all. When I want to do a granular search, I use 'pdftotext' to search the plain text, and when I find a PDF of interest I open it and can go straight back to the web page it was printed from, because the header/footer contains a clickable URL.
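Something like the following is the whole pipeline. A minimal sketch, assuming poppler's pdftotext is on the PATH; the folder name and script name are made up:

    # search_pdfs.py - grep a folder of print-to-PDF captures for a term
    # assumes 'pdftotext' (poppler or xpdf) is installed and on the PATH
    import subprocess, sys
    from pathlib import Path

    term = sys.argv[1].lower()
    for pdf in sorted(Path("~/web-pdfs").expanduser().glob("*.pdf")):
        # 'pdftotext file.pdf -' writes the extracted plain text to stdout
        text = subprocess.run(["pdftotext", str(pdf), "-"],
                              capture_output=True, text=True).stdout
        if term in text.lower():
            print(pdf)

Run it as 'python search_pdfs.py someterm' and it prints the matching files.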
Most of the time, though, the formatting isn't an issue. Some authors produce stuff that doesn't look good as a PDF, even if the content is still there, but that doesn't bug me much.
Ok, so we seem to agree that print-to-PDF loses formatting. I share your interest in and fascination with this (weirdly irksome, edge-casey) problem, but just about any modern browser provides better facilities for saving web pages at higher fidelity than 'print to PDF'. Print to PDF is so easy to beat that you'd have to go out of your way to find a way not to beat it - say, by saving just the 'page source'.
PDF gives me an offline-readable version of the website, and it's pretty compatible with pdftotext as a pipeline tool. I'm not sure formatting is such a huge issue - I've yet to find a web page I can't extract some meaningful info from later on.
As does 'Save as: Web Archive' in Safari or, if you tell it to store offline, 'Add to Reading List'. In Chrome you can save pages as MHTML (.mht/.mhtml). All of these are single-keystroke, better ways to archive a web page locally.
You can extract the full text from these (with whatever tools you like) at better fidelity than you can from a PDF, which is a lossy conversion from the same source. This seems to barely merit debating, unless I'm missing something.
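To make that concrete: an MHTML file is just MIME, so even the standard library can pull the intact markup back out. A minimal sketch, assuming Python; the filename is hypothetical:

    # pull the intact HTML part back out of a saved .mht/.mhtml file
    import email
    from email import policy

    with open("saved-page.mht", "rb") as f:   # hypothetical filename
        msg = email.message_from_binary_file(f, policy=policy.default)

    for part in msg.walk():
        if part.get_content_type() == "text/html":
            print(part.get_content()[:500])   # full markup, nothing lost
            break

From there you can run any HTML-aware tooling you like - no reconstruction heuristics required.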
PDF works just fine.
I'm sure it works for you, and I'm not harbouring any delusions that I'm going to talk you out of your decades-established workflow. But for anyone looking for ways to keep track of web pages, thinking about building tools in this space, etc.: no, PDF is not a good way to archive web pages, either manually or programmatically.
I am not finding this to be true. Pretty much every PDF I have has been usable for extracting the text content - unless the web authors intentionally work to obfuscate or disable this functionality, e.g. by using images to display text content.
>PDF is not a good way to archive web pages, either manually or programmatically.
I disagree entirely with your conclusion - you haven't made a strong argument. 20,000+ fully searchable, indexable, offline-accessible PDF files vs. your opinion so far. I don't see that any of the issues you've stated are insurmountable - in fact, I find reality to be the complete opposite of your stated opinion. Please expand on this if you have the energy.
Probably the shortest version of the point I'm trying to make is that every current browser does a much better job of providing this than printing to PDF does. If you rely on print-to-PDF as a personal web archiving system, you're going to lose data in the most irritating way possible: data you thought you had collected but actually didn't.
I don't understand your point at all. In what way is a browser going to give me information that is unavailable to me unless I'm online? The PDFs of sites I've visited have all the data I need - the stuff I read that prompted me to print to PDF in the first place. I've searched, and I've yet to find a single PDF in my collection that doesn't have the info that prompted me to save it. I understand you believe your point is strong - it's still not being made in a way I can relate to.
PDF conversion throws away almost all of the structure contained in the HTML. Tools like pdftotext then try to reconstruct some of that structure (such as the correct sequence of letters and words) using complex heuristics that don't always work.
They often do so successfully enough, especially if all you want is to grep for words. pdftotext also has a table mode that attempts to reconstruct table-structured content; this is far less successful.
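Concretely, the difference between the modes is just a flag. A sketch, with 'report.pdf' as a stand-in filename; note that '-table' only ships with the xpdf build of pdftotext, while '-layout' is in both the xpdf and poppler builds:

    # compare pdftotext extraction modes on a single PDF
    # default: reading-order heuristics; -layout: keep the physical
    # column/line layout; -table (xpdf build only): keep cells aligned
    import subprocess

    for flags in ([], ["-layout"], ["-table"]):
        out = subprocess.run(
            ["pdftotext", *flags, "report.pdf", "-"],  # '-' = stdout
            capture_output=True, text=True,
        ).stdout
        print("===", " ".join(flags) or "default", "===")
        print(out[:300])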
So depending on how you want to process your stored data, saving as PDF may or may not preserve sufficient information.
If you were storing pages for the purpose of extracting specific properties of the things you're researching (say, product information, or the tree structure of HN threads), then throwing away all that structure makes the information you need much harder or even impossible to reconstruct.
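To make 'throwing away structure' concrete: with the original HTML you can walk the nesting directly, whereas after a PDF round-trip it only survives as visual indentation you'd have to guess back. A sketch against a made-up nested-comment fragment:

    # recover reply nesting straight from the markup
    from html.parser import HTMLParser

    class TreeDump(HTMLParser):
        def __init__(self):
            super().__init__()
            self.depth = 0

        def handle_starttag(self, tag, attrs):
            if tag == "ul":           # each nested list = one reply level
                self.depth += 1

        def handle_endtag(self, tag):
            if tag == "ul":
                self.depth -= 1

        def handle_data(self, data):
            if data.strip():
                print("  " * self.depth + data.strip())

    TreeDump().feed("""
    <ul><li>top comment
      <ul><li>first reply
        <ul><li>reply to the reply</li></ul>
      </li></ul>
    </li><li>second top comment</li></ul>
    """)

This prints each comment indented by its true depth, read straight from the tags rather than guessed from layout.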
If I were storing pages for unknown future purposes, I wouldn't want to throw away any information I might need, and therefore I would never use PDF as an archival format.
But I understand that you store PDF files for a very specific purpose for which lossy PDF conversion happens to be good enough. So that's fine of course.
The only question I have is where I can find the source URL of the stored pages.