You need neither. Simply hash both articles, and reference it by hash. Then you ...

doubleunplussed · on Aug 4, 2019

A slight impediment to that is that ArXiv discards PDFs that have not been accessed in a while, and rebuilds them from TeX source if later accessed. The result may not have the same hash - I sometimes even see ArXiv PDFs with today's date in them despite being published a long time ago, because the author used the \today macro. So you would need reproducible builds for the hashes to be valid, or for ArXiv to no longer have the storage concerns thst lead them to this practice. Or you could hash the TeX I suppose.

j-pb · on Aug 4, 2019

Yeah, you should hash the TeX. It's a pity really that PDF has become the dominant publication format, it's just so bad and non machine readable. It's absurd to me that scientific publications haven't switched over to HTML, I mean that format was invented for scientific publication...

cameronbrown · on Aug 4, 2019

I think the consistency of PDF is why it's favoured. It's also completely self contained where HTML has severe issues with long term access.

j-pb · on Aug 4, 2019

Could you elaborate on how you feel that html isn't self contained, and how there are issues with long term access?

captn3m0 · on Aug 4, 2019

References to third party websites that can break. HTML is a living spec, so browsers can decide to break things that work today (as happened with Marquee for eg).

Even if you disallow JS entirely, and stick with just HTML/CSS, it has enough warts to not look and behave consistently over time.

j-pb · on Aug 4, 2019

A link could easily be a URN that identifies the target by its hash, all the protocols for that are already in place e.g. magnet links. PDF doesn't have JS and hyperlinking, so I guess all you'd need would be HTML, even ignoring CSS, which could be tailored to the reader, e.g. Latex style, troff style etc.

Vanilla html with images embedded as data URNs should be pretty darn portable for the forseeable future.

Heres the kicker, it's a text based protocol, and a dead simple one, eben if we should loose all browsers in the big browser war of 2033, its super easy to reverse engineer. PDF not so much.

sjy · on Aug 5, 2019

A subset of HTML + CSS (+ ECMAScript?) could replace PDF for this purpose. However, is there a standard subset and familiar, understandable tools for working with it? In general, using the 'save as' function in a web browser won't produce a document that looks the same 10 years later. Rewriting the source document using a tool like wget can achieve this, but it doesn't always work (eg. what if the content was pulled in asynchronously?), and you need a computer expert to create and explain how the archived format relates to the live content. 'Save as PDF,' despite its technical inferiority, is easy and widely understood.

j-pb · on Aug 5, 2019

So you're telling me that writing vanilla html with images embedded as data links, is more difficult than writing TeX or LaTeX?

kevin_thibedeau · on Aug 5, 2019

PDF does have JS and hyperlinks.

9HZZRfNlpR · on Aug 4, 2019

HTML/CSS is extremely backwards compatible, modern browsers don't have problems of displaying the page differently.

How does pdf solve link rot problem? Pdf is good for print, it's consistant. But fails when display size other than big screen, especially e-ink displays that don't tend to be your standard A4.

sjy · on Aug 5, 2019

PDFs don't solve link rot. But in HTML, it's conventional to rely on links for stylesheets and sometimes even content (images, asynchronous DOM elements), so link rot is a bigger problem.

j-pb · on Aug 5, 2019

Yeah for publishing you don't want content in links, but you can solve that with data URIs that embed images and other data directly into the link [1].

[1]:https://en.m.wikipedia.org/wiki/Data_URI_scheme

dekhn · on Aug 5, 2019

My thesis from 1995 is still online (tex to html): https://compbio.soe.ucsc.edu/papers/thesis_rafael/thesis.htm... however all the image links broke a decade or two ago.

foota · on Aug 4, 2019

Imo tex isn't much more machine readable, depending on what you want to do. Reformatting or lossy conversion to plaintext? Sure. Determining semantics? Good luck.

buboard · on Aug 4, 2019

That sounds like an overkill thing to do nowadays

davrosthedalek · on Aug 4, 2019

The journal version and the arxiv version will never hash to the same value because they are not bit-identical. But you want to link to the peer reviewed version, or one which is semantically identical to the peer reviewed version. So somebody needs to check that the arxiv version is semantically identical to the journal version.

j-pb · on Aug 4, 2019

You should hash the TeX, not the PDF. Alternatively you could have both documents PGP signed by the author with a hash of the original tex, if you want to make sure you get the right "semantically the same but different" version. But tbh that seems to be a slippery slope that I wouldn't want to go on, where do you draw the line for your semantic differences? Imagine you quote something which gets edited out, suddenly it looks like you quote nonsense while it's the original references fault.

davrosthedalek · on Aug 4, 2019

There is no TeX source for the journal version. The point is that you don't want to trust the author to verify that the peer-reviewed+accepted version is the same as the arxiv version, and that it will not be changed. That's why people generally cite the journal version. Because it's immutable.

j-pb · on Aug 4, 2019

Journal versions are simply not immutable because they are referenced by name, not by content. I regularly see a good percentage of dead or wrong DOI, and I've hunted my fair share of papers that were supposedly released in a journal, but that only ever existed in preprint.

Arxiv already accepts latex and compiles it for you, we should expect the same from journals and ask them to publish the hash of the document they received.

davrosthedalek · on Aug 4, 2019

Journal versions are reference by journal name, volume, year, page number, indexing a hard copy version you can find in a library. Seems pretty immutable to me.

The journals I published in all accepted latex. But they convert it to use their layouting software. The last correction steps are typically done only in this version, and the author has to backport them into their tex code. Why should the journal have any interest in making the arxiv version more attractive?

j-pb · on Aug 5, 2019

Even if we ignore reprints, editorial series that rearrange papers (and make a paper citable more than one way), and proceedings (which often don't properly distinguish between papers, but use author + proceeding).

Science simply doesn't operate on journal published papers most of the time. The paper mills run so hot that you regularly cite preprints, that get exchanged between authors directly. It happens regularly that the proof is supposedly in the "full paper" only that the "full paper" was never published.

buboard · on Aug 4, 2019

Why would the author not be trusted? Why do they stand to gain? Arxiv can make the final version immutable too

davrosthedalek · on Aug 4, 2019

Essentially the same reason we need peer review in the first place. Many authors have strong but wrong opinions. But even without malice:Some don't care that the arxiv version is slightly different from the paper.

buboard · on Aug 4, 2019

I dont see why anyone would put different content in the two papers since its so trivial to be ridiculed for that. I dont think arxiv has resources to review if the preprints are the same as the final, and it seems an overkill thing to do .

afiori · on Aug 4, 2019

Also in many cases there is a final round of modifications done by the publisher that you are not free to distribute. For journal paper I was told that sometimes you cannot even publish the corrected version after rebuttal.

buboard · on Aug 4, 2019

it's not the same file - just the same, final text proof. It will be different from the final formatting in the journal.

I dont think authors have incentive to abuse the system. Just upload the final proof of your manuscript to arxiv, click "final version" , and this lets people know that this is the same article as in the journal.

DOIs are ubiquitous and they would serve the purpose of redirecting to the free pdfs rather then the journal site. This can be applied to existing articles retroactively. Plus, many bibliography styles include the DOI which makes the reference easier to use

Symbiote · on Aug 4, 2019

DOIs are controlled by the journal publisher, so I don't see why they would be willing to change their target.

buboard · on Aug 4, 2019

yeah good point. Maybe doi.org could resolve them differently? In any case it s the reference identifier that could connect the two documents