> The goal was to come up with a good regular expression to validate URLs in user input, and not to match any URL that browsers can handle (as per the URL Standard).
WTF? What is "validation" supposed to be good for if it doesn't actually validate what it claims to? Exactly this mentality of making up your own rules instead of implementing the standard is what causes all these interoperability nightmares! If you claim to accept URLs, then accept URLs, all of them, and reject non-URLs, all of them. There is no reason to do anything else, other than laziness maybe, and even then you are lying if you claim to be validating URLs: you are not. If you say you accept a URL and I paste a URL, your software is broken if it then rejects that URL as invalid.
This does not apply to intentionally selecting only the subset of URLs that is applicable in a given context, of course: if the URL is to be retrieved by an HTTP client, it's perfectly fine to reject non-HTTP URLs. But any kind of "nobody is going to use that anyhow" is not a good reason. In particular, that kind of rejection most certainly should not happen in the parser, as the parser usually works at the wrong level of abstraction and is therefore likely to give inconsistent results.
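To make that concrete, here is a sketch (my own, not from the linked article; `accept_for_http_client` is a made-up name) of subsetting done above the parser: run a general-purpose URL parser first, then apply the context-specific scheme policy as a separate step.

```python
# Sketch: restrict to HTTP(S) URLs by parsing first, then filtering
# by scheme, instead of baking the restriction into a validation regex.
from urllib.parse import urlparse

def accept_for_http_client(url: str) -> bool:
    """Accept any URL the parser understands, then filter by scheme."""
    parts = urlparse(url)
    # The subset decision lives here, above the parser.
    return parts.scheme in ("http", "https") and bool(parts.netloc)

print(accept_for_http_client("https://example.com/path"))  # True
print(accept_for_http_client("ftp://example.com/file"))    # False: wrong scheme, not "invalid URL"
```

The point is that a rejected `ftp:` URL is rejected as out of scope, not mislabeled as "not a URL".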
> The RFC does not reflect reality either (which, ironically, is what you seem to be complaining about).
Well, or reality does not match the RFC?
> If you’re looking for a spec-compliant solution, the spec to follow is http://url.spec.whatwg.org/.
A spec for a formal language that doesn't contain a grammar? The world is getting crazier every day ...
> That doesn’t mean there aren’t any situations in which I need/want to blacklist some technically valid URL constructs.
Yeah, but blocking IPv4 literals from certain address ranges seems like a stupid idea nonetheless. Good software should accept any input that is meaningful to it and not a security problem. And as I said above, such rejection most certainly should not happen in the parser.
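If you really do need to block certain address ranges (say, to guard against SSRF), that check belongs in a policy layer after parsing, not inside the parser. A hedged Python sketch (`host_is_blocked` is a hypothetical helper of mine):

```python
# Sketch: address-range policy applied after parsing, as a separate
# step, rather than wired into the URL parser itself.
import ipaddress
from urllib.parse import urlparse

def host_is_blocked(url: str) -> bool:
    host = urlparse(url).hostname
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return False  # not an IP literal; policy for hostnames goes elsewhere
    return addr.is_private or addr.is_loopback

print(host_is_blocked("http://10.0.0.5/"))       # True
print(host_is_blocked("http://93.184.216.34/"))  # False
```

Because the parser stays general, every component sees the same parse result; only the policy layer differs per context.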
> Doesn’t matter – if there’s a discrepancy between what a document says and what implementors do, that document is but a work of fiction.
Yes and no. When there is a de-facto standard that just doesn't happen to match the published standard, yeah, sure. Otherwise, bug compatibility is a terrible idea and should be avoided as much as possible; many security problems have resulted from it.
> This is not a parser.
Well, even worse then. Manually integrating semantics from higher layers into parsing machinery (which is what that regex is, never mind that it doesn't capture any of the syntactic elements within the parsing automaton) is both extremely error-prone and terrible for maintainability.
edit:
For the fun of it, I just had a look at the "winning entry" (diegoperini). Unsurprisingly, it's broken. It was trivial to find URLs that it rejects even though you most certainly don't intend to reject them, for exactly the reasons pointed out above.
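I won't reproduce that regex here, but the failure mode is easy to demonstrate with a toy regex of my own that makes the same kind of hard-coded assumption (here: a host must contain a dot). It rejects URLs that a standards-based parser handles without complaint.

```python
# Toy regex (my own, NOT the linked entry) with a baked-in assumption:
# the host must be dotted. Compare against urllib's parser.
import re
from urllib.parse import urlparse

TOY = re.compile(r"^https?://[^\s/]+\.[^\s/]+\S*$")  # requires a dotted host

for url in ("http://localhost/", "http://example.museum/", "http://[::1]/"):
    # urlparse extracts a hostname from all three; the toy regex
    # rejects "localhost" and the IPv6 literal.
    print(url, bool(TOY.match(url)), urlparse(url).hostname)
```

Any regex built from such assumptions will fail the same way on some class of perfectly valid URLs.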