It is incorrect to "normalize" // in HTTP URL paths
Posted by pabs3 6 hours ago
Comments
Comment by Bender 3 minutes ago
To generalize by saying "incorrect" is incorrect. The correct answer is that it depends on the requirements in the given implementation. Saying such things will just lead to endless arguing.
Comment by echoangle 2 hours ago
> nginx with merge_slashes
How can it be wrong if it is server-side? If the server wants to treat those paths equally, it can if it wants to.
It would only be wrong if a client does it and requests a different URL than the user entered, right?
Comment by leni536 1 hour ago
It matters where the normalization happens, and server-side behavior is out-of-scope of these identifier RFCs.
Comment by OoooooooO 1 hour ago
> Therefore, collapsing // to / in HTTP URL path segments is not correct normalization. It produces a different, non-equivalent identifier unless the origin explicitly defines those two paths as equivalent.
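A minimal sketch of why the two paths are different identifiers, assuming Python's `urllib.parse` as the parser (it does not implement any particular URL standard exactly) and a hypothetical helper named `path_segments`:

```python
from urllib.parse import urlsplit


def path_segments(url: str) -> list[str]:
    """Split the URL path on "/" without dropping empty segments."""
    path = urlsplit(url).path
    # Drop only the single leading "/" that starts an absolute path.
    if path.startswith("/"):
        path = path[1:]
    return path.split("/")


print(path_segments("https://example.com/foo//bar"))  # ['foo', '', 'bar']
print(path_segments("https://example.com/foo/bar"))   # ['foo', 'bar']
```

The first path has three segments, one of them empty, so it only names the same resource if the origin decides it does.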
Comment by cxr 36 minutes ago
Comment by echoangle 12 minutes ago
And at least according to this, the default setting is off so nginx actually is compliant unless you manually make it not be:
https://www.oreilly.com/library/view/nginx-http-server/97817...
Comment by MattJ100 3 hours ago
It gets worse if you are mapping URLs to a filesystem (e.g. for serving files). Even though they look similar, URL paths have different capabilities and rules than filesystems, and different filesystems also vary. This is also an example of that (I don't think most filesystems support empty directory names).
Comment by bryden_cruz 54 minutes ago
Comment by jeroenhd 41 minutes ago
If you're proxying to another server that just assumes relative paths and doesn't do any kind of validation, I guess an extra / might cause reading files outside of the expected area? That'd be an extremely weird and awful setup that I don't think makes any sense in the context of Spring Boot.
Comment by PunchyHamster 3 hours ago
Nothing on web is "correct", deal with it
Comment by dale_glass 2 hours ago
Because maybe you use S3, which treats `foo/bar.txt` and `foo//bar.txt` as entirely separate things. Because to S3, directories don't exist and those are literally the exact names of the keys under which data is stored.
So you have script A concatenate "foo" + "/bar" and script B concatenate "foo/" + "/bar", and suddenly you have a weird problem.
I can't imagine a real use case where you'd think this is desirable.
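The mismatch described above can be reproduced with a rough sketch in which a plain dict stands in for a flat key-value object store. No real S3 client is used here, and the variable names are made up for illustration:

```python
# A plain dict stands in for a flat object store: keys are opaque strings,
# and "/" has no special meaning. This is an assumption for illustration,
# not a real S3 client.
store: dict[str, bytes] = {}

key_a = "foo" + "/bar"    # script A
key_b = "foo/" + "/bar"   # script B

store[key_a] = b"written by script A"
store[key_b] = b"written by script B"

print(key_a)        # foo/bar
print(key_b)        # foo//bar
print(len(store))   # 2
```

Both writes succeed, and the store ends up with two unrelated objects instead of one.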
Comment by Mordisquitos 1 hour ago
Not S3, but here's a literal real use case: the entry for the Iraqw word /ameeni (woman) in Wiktionary.
https://en.wiktionary.org/wiki//ameeni
If for whatever reason your S3 keys contained English words and their translations separated by a slash, you would have a real problem if one of your scripts were to concatenate woman, / and /ameeni as woman/ameeni instead of woman//ameeni in the English/Iraqw case.
Comment by zarzavat 19 minutes ago
Can they not just use a 3 like in Arabic?
Comment by kstrauser 1 hour ago
woman/%2Fameeni
Consider what would happen if the language allowed trailing slashes. What would this path mean if ameeni/ happened to be a valid word? ameeni//ameeni
One of those would get the slash, but it’s not clear which. W3C says:
> The slash ("/", ASCII 2F hex) character is reserved for the delimiting of substrings whose relationship is hierarchical.
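A short sketch of the percent-encoding approach, assuming Python's `urllib.parse.quote` with `safe=""` so that a slash inside a segment is escaped. The helper names `build_path` and `split_path` are hypothetical:

```python
from urllib.parse import quote, unquote


def build_path(*segments: str) -> str:
    """Join segments with "/", escaping any "/" inside a segment as %2F."""
    # safe="" tells quote() to escape "/" as well, which it leaves alone by default.
    return "/".join(quote(segment, safe="") for segment in segments)


def split_path(path: str) -> list[str]:
    """Reverse build_path: split on the delimiter first, then decode each segment."""
    return [unquote(segment) for segment in path.split("/")]


print(build_path("woman", "/ameeni"))    # woman/%2Fameeni
print(build_path("ameeni/", "ameeni"))   # ameeni%2F/ameeni
print(build_path("ameeni", "/ameeni"))   # ameeni/%2Fameeni
print(split_path("woman/%2Fameeni"))     # ['woman', '/ameeni']
```

Splitting before decoding is what removes the ambiguity: the two readings of ameeni//ameeni become two different strings.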
Comment by realitylabs 35 minutes ago
Comment by secondcoming 2 hours ago
Comment by leni536 1 hour ago
Of course you shouldn't assume that in a client. If you are implementing against an API don't deviate regarding // and trailing / from the API documentation.
Comment by sfeng 3 hours ago
Comment by domenicd 1 hour ago
- URL parsing/normalization; and
- Mapping URLs to resources (e.g. file paths or database entries) to be served from the server, and whether you ever map two distinct URLs to the same resource (either via redirects or just serving the same content).
The former has a good spec these days: https://url.spec.whatwg.org/ tells you precisely how to turn a string (e.g., sent over the network via HTTP requests) into a normalized data structure [1] of (scheme, username, password, host, port, path, query, fragment). The article is correct insofar as the spec's path (which is a list of strings, for HTTP URLs) can contain empty string segments.
But the latter is much more wild-west, and I don't know of any attempt being made to standardize it. There are tons of possible choices you can make here:
- Should `https://example.com/foo//bar` serve the same resource as `https://example.com/foo/bar`? (What the article focuses on.)
- `https://example.com/foo/` vs. `https://example.com/foo`
- `https://example.com/foo/` vs. `https://example.com/FOO`
- `https://example.com/foo` vs. `https://example.com/fo%6f%` vs. `https://example.com/fo%6F%`
- `https://example.com/foo%2Fbar` vs. `https://example.com/foo/bar`
- `https://example.com/foo/` vs. `https://example.com/foo.html`
Note that some things are normalized during parsing, e.g. `/foo\bar` -> `/foo/bar`, and `/foo/baz/../bar` -> `/foo/bar`. But for paths, very few.
Relatedly:
- For hosts, many more things are normalized during parsing. (This makes some sense, for security reasons.)
- For query, very little is normalized during parsing. But unlike for pathname, there is a standardized format and parser, application/x-www-form-urlencoded [2], that can be used to go further and canonicalize from the raw query string into a list of (name, value) string pairs.
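As a rough illustration of turning a raw query string into (name, value) pairs, here is a sketch using Python's `urllib.parse.parse_qsl`. That is Python's own form-urlencoded parser, and it is an assumption that it behaves like the parser in the URL Standard for these simple inputs:

```python
from urllib.parse import parse_qsl

# keep_blank_values=True keeps pairs whose value is empty;
# by default parse_qsl drops them.
pairs = parse_qsl("a=1&a=2&b=hello+world%21&c=", keep_blank_values=True)

print(pairs)  # [('a', '1'), ('a', '2'), ('b', 'hello world!'), ('c', '')]
```

The result is an ordered list rather than a dict, so repeated names such as `a` are preserved.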
Some discussions on the topic of path normalization, especially in terms of mapping the filesystem, in the URL Standard repo:
- https://github.com/whatwg/url/issues/552
- https://github.com/whatwg/url/issues/606
- https://github.com/whatwg/url/issues/565
- https://github.com/whatwg/url/issues/729
-----
[1]: https://url.spec.whatwg.org/#url-representation
[2]: https://url.spec.whatwg.org/#application/x-www-form-urlencod...
Comment by mjs01 3 hours ago
Comment by PunchyHamster 3 hours ago
Comment by renewiltord 2 hours ago
Comment by janmarsal 2 hours ago
Comment by leni536 2 hours ago
Comment by tremon 22 minutes ago
Neither has much to do with / normalization, which applies to the path part of a valid uri.
Comment by stanac 2 hours ago
Comment by WesolyKubeczek 3 hours ago
Not doing it is like punishing people for not using Oxford commas, or entering an hour-long debate each time someone writes “would of” instead of “would have”. It grinds my gears too, but I have different hills to die on.
Comment by bazoom42 3 hours ago
Comment by PunchyHamster 3 hours ago
Comment by jeroenhd 2 hours ago
Plenty of websites rewrite paths like /a/b/c/d into a backend service call like /?w=a&x=b&y=c&z=d. In that scheme, /a//c/d would rewrite to /?w=a&x=&y=c&z=d, something entirely distinct from /a/c/d, which works out to /?w=a&x=c&y=d.
It's not the application's fault that the people attempting to configure web server URLs don't know how web server URLs work.
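A minimal sketch of such a positional rewrite, assuming the parameter names w, x, y, z from the example above and a hypothetical function named `rewrite`:

```python
from urllib.parse import urlencode

# Parameter names follow the example above; a real rewrite rule would define its own.
PARAM_NAMES = ["w", "x", "y", "z"]


def rewrite(path: str) -> str:
    """Map each path segment, by position, to a query parameter."""
    # Remove only the single leading "/" so that empty segments keep their position.
    if path.startswith("/"):
        path = path[1:]
    segments = path.split("/")
    return "/?" + urlencode(list(zip(PARAM_NAMES, segments)))


print(rewrite("/a/b/c/d"))  # /?w=a&x=b&y=c&z=d
print(rewrite("/a//c/d"))   # /?w=a&x=&y=c&z=d
print(rewrite("/a/c/d"))    # /?w=a&x=c&y=d
```

Collapsing the double slash before this step would silently shift every later parameter one position to the left.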
Comment by bazoom42 2 hours ago
Comment by Etheryte 3 hours ago
Comment by j16sdiz 3 hours ago
Comment by jeroenhd 2 hours ago
Note that you can include custom normalization rules (like collapsing slashes, tolower()ing the entire path, removing the query part of the URL), but that's not part of the standard. If you're doing anything extra, the risk of breaking stuff is on you.
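One way to keep such extra rules explicit is to make each of them opt-in. The sketch below uses a hypothetical function `custom_normalize` built on Python's `urllib.parse`; none of the flags correspond to a standard:

```python
import re
from urllib.parse import urlsplit, urlunsplit


def custom_normalize(
    url: str,
    *,
    collapse_slashes: bool = False,
    lowercase_path: bool = False,
    drop_query: bool = False,
) -> str:
    """Apply opt-in, application-specific rewrites; with no flags the URL is unchanged."""
    parts = urlsplit(url)
    path, query = parts.path, parts.query
    if collapse_slashes:
        # Replace every run of two or more slashes in the path with one.
        path = re.sub(r"/{2,}", "/", path)
    if lowercase_path:
        path = path.lower()
    if drop_query:
        query = ""
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


url = "https://example.com/Foo//Bar?x=1"
print(custom_normalize(url))
# https://example.com/Foo//Bar?x=1
print(custom_normalize(url, collapse_slashes=True, lowercase_path=True, drop_query=True))
# https://example.com/foo/bar
```

Defaulting every flag to off means that any two URLs treated as equal are equal because the application asked for it.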
Comment by LeonTing8090 2 hours ago