
fully-searchable, indexable, accessible-in-offline

Probably the shortest version of the point I'm trying to make is that every current browser does a much better job of providing you this than printing to PDF. If you rely on this as a personal web archiving system, you're going to lose data in the most irritating way - data you thought you collected but actually didn't.



I don't understand your point at all. In what way is a browser going to give me information that is not available to me unless I'm online? PDFs of sites I've visited have all the data I need: the stuff I read that then prompted me to print to PDF. I've searched, and I have yet to find a single PDF in my collection that doesn't have the info that prompted me to save it in the first place. I understand you believe your point is strong, but it's still not being made in a way that I can relate to.


PDF conversion throws away almost all of the structure contained in HTML. Tools like pdf2text then try to reconstruct some of that structure (such as the correct sequence of letters and words) using complex heuristics that don't always work.

They often do that successfully enough, especially if all you want is to grep for words. pdf2text also has a table mode that attempts to reconstruct table-structured content. This is far less successful.

So depending on how you want to process your stored data, saving as PDF may or may not preserve sufficient information.

If you were storing pages for the purpose of extracting specific properties of things you are researching (say product information or the tree structure of HN threads), then throwing away all that structure makes it a lot more difficult or even impossible to reconstruct the information you need.
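To make the thread example concrete, here is a minimal sketch in Python. It assumes a hypothetical, simplified markup in which replies are nested `<ul>`/`<li>` lists; that is not HN's actual HTML, and the names `SAMPLE_HTML`, `ThreadParser`, `parse_thread` and `flatten` are invented for illustration. With the HTML, the reply depth of each comment can be read straight off the nesting. The flattened list is roughly what remains after a conversion that keeps only the visible text.

```python
from html.parser import HTMLParser

# Hypothetical markup: nested lists stand in for a comment thread.
SAMPLE_HTML = """
<ul>
  <li>parent comment
    <ul>
      <li>first reply
        <ul><li>nested reply</li></ul>
      </li>
      <li>second reply</li>
    </ul>
  </li>
</ul>
"""


class ThreadParser(HTMLParser):
    """Collect (depth, text) pairs from nested <ul>/<li> markup."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.comments = []

    def handle_starttag(self, tag, attrs):
        # Each nested <ul> is one level deeper in the thread.
        if tag == "ul":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "ul":
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.comments.append((self.depth, text))


def parse_thread(html):
    """Return the comments with their reply depth, in document order."""
    parser = ThreadParser()
    parser.feed(html)
    parser.close()
    return parser.comments


def flatten(comments):
    """Drop the depth, keeping only the text in reading order."""
    return [text for _, text in comments]


if __name__ == "__main__":
    thread = parse_thread(SAMPLE_HTML)
    for depth, text in thread:
        print("  " * (depth - 1) + text)
    print(flatten(thread))
```

From the flattened list alone there is no way to tell whether "second reply" answers the parent or the nested reply; that relationship existed only in the markup.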

If I were storing pages for unknown future purposes, I wouldn't want to throw away any information I might need, and therefore I would never use PDF as an archival format.

But I understand that you store PDF files for a very specific purpose for which lossy PDF conversion happens to be good enough. So that's fine of course.

The only question I have is where I can find the source URL of the stored pages.


>structure contained in HTML.

As long as I can read the site, I have what I need. Why do I need to read the HTML?

>So depending on how you want to process your stored data, saving as PDF may or may not preserve sufficient information.

As long as I can read it, the PDF is sufficient for my needs.

For everything else, there's wget.


>In what way is a browser going to give me information that is not available to me unless I'm online?

The alternatives I'm talking about are absolutely available offline. I don't understand why you keep arguing against a position I've never taken.



