Dan's Web Tips:

URLs

URLs (Uniform Resource Locators) are the standardized means of addressing pages on the Web. There are two basic types of URLs: absolute and relative. Each has its place in the links on your Web sites.

(As an aside, "URL" can be pronounced like "earl" or like "You Are 'Ell". This makes it hard to decide whether to write "a URL" or "an URL"; which is correct depends on how you expect it to be pronounced. I decided on "a URL" for this document.)

NOTE: These days, it's fashionable among Web purists to say "URI" (Uniform Resource Identifier) instead of "URL". Technically, a URI (presumably pronounced like the name of the psychic known for bending spoons) is any short string leading to a resource that is acceptable for use on the Web, while a URL is a specific kind of URI that identifies a specific protocol for retrieving the resource. URNs (Uniform Resource Names), presumably pronounced like a "Grecian Urn" (What's a Grecian Urn? About 50 drachmae.), are yet another kind of URI that isn't a URL, intended to provide a more stable method of addressing a resource that wouldn't be dependent on specific protocols or network addresses -- several URN schemes are defined now, but browsers are slow to implement them. An Internet draft (no longer online where I linked to it before) proposed a few more additions to this family -- URPs, URTs, and URVs.

YET ANOTHER NOTE: In the above acronyms, the "U" is sometimes construed as standing for "Universal" rather than "Uniform".

Absolute URLs

Definition: Absolute URLs specify the location of a Web page in full, and work identically no matter where in the world you are.

Absolute URLs have the following form:

https://www.dan.info/webtips/images.html#ALT

The first part, separated by a colon (:) from the rest of the URL, is the protocol, usually http for HyperText Transfer Protocol, though other protocols such as ftp and gopher are sometimes used. For secure-server sites using an encrypted protocol, https is used as the protocol identifier.

Next comes the hostname (domain name or IP address), preceded by a double slash (//). It seems to be a common misconception that the colon and double slash are an inseparable delimiter terminating the protocol -- for instance, the Mozilla team posted an online document regarding their implementation of irc:// URLs. Actually, the colon is the terminator of the protocol section, and the double slash is used to introduce a hostname or other site identifier (varying somewhat by protocol, with some less-common protocols taking things other than domain names in this section) and is absent in URIs lacking a hostname like mailto: and news: URLs.

After that is the directory path to the Web page you're accessing, with forward slashes (/) separating directory levels (not backslashes (\) like in DOS/Windows systems).

Pedantic Note: Actually, as many purists will tell you, it's not true that the "path" portion of a URL is necessarily a directory path. Servers can be configured to interpret a URL path any way they like, which might not necessarily correspond to any actual subdirectory tree. Sites generated dynamically from databases may use URL paths that have nothing to do with directory structures. However, most Web servers do use URLs corresponding to the file structure, so that's what I'll assume for this document.

Finally, optionally, there is a "fragment identifier" separated by a pound (#) sign from the rest of the URL, indicating that the link is to an anchor within a document (if this is omitted, the link is to the top of the page). (Technically, the fragment identifier isn't actually part of the URL, but an addendum to it, because it isn't sent to the server; it's used by the browser to go to the appropriate part of the retrieved page once it is loaded.)
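To see these pieces pulled apart, here's a minimal sketch using Python's standard urllib.parse module. The choice of Python is just my assumption for illustration; any reasonable URL-parsing library should give the same breakdown.

```python
from urllib.parse import urlsplit

# The example absolute URL from above.
parts = urlsplit("https://www.dan.info/webtips/images.html#ALT")

print(parts.scheme)    # the protocol: https
print(parts.netloc)    # the hostname: www.dan.info
print(parts.path)      # the path: /webtips/images.html
print(parts.fragment)  # the fragment identifier: ALT
```

Note that the parser hands back the colon-terminated protocol and the double-slash-introduced hostname as separate pieces, which matches the description above.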

There are a few special protocols with URLs of differing syntax. mailto: is followed by an e-mail address to create a link allowing users to send mail to that address. news: is followed by the name of a newsgroup (e.g., comp.infosystems.www.authoring.html) to let the user follow the link to see the newsgroup's messages (if the user's browser is configured to access a news server). Neither of these URL types has slashes (single or double) in it; the syntax looks like mailto:webmaster@webtips.dan.info, not mailto://webmaster@webtips.dan.info/; developers used to the more common http: syntax often put extra slashes in these URLs and cause them to fail. (More information on mailto: URLs is in my page on e-mail.)
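As a rough illustration of why the extra slashes matter, here's a small sketch (again assuming Python's standard urllib.parse, which is my choice and not something mail programs necessarily use) showing how a generic parser reads the two forms.

```python
from urllib.parse import urlsplit

good = urlsplit("mailto:webmaster@webtips.dan.info")
bad = urlsplit("mailto://webmaster@webtips.dan.info/")

# Correct form: there is no hostname section; the address is the rest of the URL.
print(repr(good.netloc), repr(good.path))

# With stray slashes, the address is parsed as a site identifier instead,
# and the "path" left over is just a lone slash.
print(repr(bad.netloc), repr(bad.path))
```

How a given mail client reacts to the second form will vary, but the parse shows the address has landed in the wrong part of the URL.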

Note that you can't leave out the protocol and use www.somewhere.com as a link URL without the http://. This syntax works when you're typing in a URL in most browsers, but in a link within your Web site it will be interpreted as a relative URL to a file named "www.somewhere.com" in the current directory.
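Here's a quick sketch of that failure, using Python's urllib.parse.urljoin to resolve a link the way a browser would. The page address http://www.yoursite.com/stuff/one.html is a made-up location for the page containing the link.

```python
from urllib.parse import urljoin

# Hypothetical address of the page that contains the link.
current_page = "http://www.yoursite.com/stuff/one.html"

# Link written without the protocol: treated as a file name in the current directory.
broken = urljoin(current_page, "www.somewhere.com")
print(broken)  # http://www.yoursite.com/stuff/www.somewhere.com

# Link written with the protocol: goes where you meant it to.
working = urljoin(current_page, "http://www.somewhere.com")
print(working)  # http://www.somewhere.com
```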

Are URLs case sensitive?

Technically, yes. You should always be consistent in your use of upper or lower case in your URLs. Even in cases where the upper and lower case versions go to the same resource, you're imposing an unnecessary burden on browsers that need to retrieve and cache two copies of the same thing if they go to two variants of the same URL.

As for whether you can vary the case and still get the same resource, this depends. The protocol and hostname are not case sensitive, so you can write https://www.dan.info/ or HTTPS://WWW.DAN.INFO/ and they'll work identically. However, the directory and filenames may be case sensitive depending on what operating system the server is running under (UNIX is case-sensitive, while Windows isn't). Fragment names are case-sensitive. So be careful to match the directory, file, and anchor names in your links to the case of the actual files and anchors.
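If you want to compare or tidy up URLs in a script, one possible approach is to lower-case only the parts that are safe to change. The helper below, normalize_case, is a hypothetical name of my own, and the mixed-case path in the example is invented; it's a sketch, not a complete URL normalizer (it drops any user name or password in the URL, for instance).

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_case(url):
    """Lower-case the protocol and hostname; leave path and fragment alone."""
    parts = urlsplit(url)          # urlsplit already lower-cases the scheme
    host = parts.hostname or ""    # .hostname is the lower-cased host name
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))

print(normalize_case("HTTPS://WWW.DAN.INFO/WebTips/Images.html#ALT"))
# https://www.dan.info/WebTips/Images.html#ALT
```

The path and fragment are deliberately left untouched, since changing their case may point to a different resource or a nonexistent anchor.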

Can I include spaces in my URLs?

No, the space is not a legal character in URLs. Spaces, and a number of other special characters, must be encoded by using a percent sign (%) followed by a two character hexadecimal number giving the character's position in the ASCII encoding, at least in the case of characters that are part of ASCII (#0-127). Other Unicode characters get more complicated; while in the "old days" you could sometimes find them encoded in URLs using the code values corresponding to their position in ISO-8859-1 or other such 8-bit encodings, at present UTF-8 is the standard, requiring multi-byte encodings for non-ASCII characters, consisting of several consecutive sequences of a percent sign and two hex digits. At any rate, a space is represented as %20.
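For the curious, here's a small sketch of the encoding using the quote and unquote functions from Python's standard urllib.parse, which use UTF-8 by default. The file names are invented for the example.

```python
from urllib.parse import quote, unquote

# A space becomes a percent sign plus its two-digit hex ASCII code.
spaced = quote("my file.html")
print(spaced)  # my%20file.html

# A non-ASCII character becomes several percent-escapes, one per UTF-8 byte.
accented = quote("café.html")
print(accented)  # caf%C3%A9.html

# Decoding reverses the process.
print(unquote(spaced))  # my file.html
```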

Some Web servers might have file systems that allow documents with names containing spaces, but if you use files with such names, their URLs will contain %20, which is rather ugly. So it's best to avoid such names and stick to safer characters like letters, numbers, dashes, and underscores. Mac users in particular tend to create directory structures including spaces, producing awkward URLs.

Relative URLs

Definition: Relative URLs are context-sensitive, giving a path with respect to your current location.

There are several types of relative URL.

A URL with no slashes, like "junk.html", references another page in the same directory as the current page. So if you're currently at "http://www.yoursite.com/stuff/one.html" and encounter the relative URL "two.html", this is addressing the page "http://www.yoursite.com/stuff/two.html".
A URL with no leading slashes, but slashes within, references a subdirectory beneath the current one. "subdir/test.html", encountered from the same page as the above example, would reference "http://www.yoursite.com/stuff/subdir/test.html".
A URL with double dots at the start, like "../another.html", references the parent directory of the current one. This URL, accessed from the same page as the above examples, would lead to "http://www.yoursite.com/another.html". Double dots can be repeated, like "../../grandparent.html", to go up additional levels, or combined with subdirectory references like "../sister/" to go to a sibling directory.
A URL with a single dot at the start, like "./stuff.html", references another file in the same directory, just like a URL with no slashes. It's better to use the form of URL without the dot and slash, since there are a few old browsers and indexing robots that don't seem to understand this syntax properly, and end up expanding the URLs into bizarre things like "http://www.yoursite.com/././stuff/../junk/", which work (with most servers), but look weird in your access logs. Double dots produce this effect too, but they're too useful to give up, whereas the single dot is unnecessary (except in the special case of linking back to the index of the current directory, where "./" is the best URL, as described elsewhere).
A URL with a slash at the start, like "/dir1/dir2/stuff.html", references a page at a path starting from the root of the server. To be more precise, it starts at the root of the domain name you're in. Be careful using this if your site is in a virtual domain on an Internet provider's system. If you have a domain yoursite.com which points at the directory /sites/yours/ within the ISP's domain provider.com, then your page silly/stuff.html can be reached via two different URLs: http://www.yoursite.com/silly/stuff.html and http://www.provider.com/sites/yours/silly/stuff.html. Maybe you had your site up for a long time before getting your own domain so your users are regularly coming in via both addresses. In this case, a URL like "/silly/morestuff.html" can be interpreted as "http://www.yoursite.com/silly/morestuff.html" or "http://www.provider.com/silly/morestuff.html" depending on which domain the user is in. Thus, you should avoid this form of URL if there's any doubt about how the user is accessing your site.
In an uncommon but legal URL form, a URL with a double slash at the start, like "//www.yoursite.com/stuff.html", keeps only the protocol identifier from the current URL and gets the full sitename and path from the new URL. I actually found a use for this form recently, in a piece of HTML code that was being accessed under both the secure https: protocol and the nonsecure http: protocol, and under more than one domain name. I wanted to access a particular graphic in all cases, using a protocol (secure or nonsecure) matching that with which the main page was accessed. Using relative URLs of the forms given above would require the graphic to be placed in all the different domains; and using an absolute URL would force the protocol to be specified. I deftly avoided these problems by using a double-slashed relative URL.
Finally, a URL beginning with a pound sign (#) specifies a link to a fragment identifier (anchor) in the current page.
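The resolution rules in the list above can be tried out with a short sketch. It uses urljoin from Python's standard urllib.parse as a stand-in for what a browser does; the anchor name "top" is made up for the last case.

```python
from urllib.parse import urljoin

# The page we are "currently at" in the examples above.
BASE = "http://www.yoursite.com/stuff/one.html"

relative_urls = [
    "two.html",                       # same directory
    "subdir/test.html",               # subdirectory
    "../another.html",                # parent directory
    "./stuff.html",                   # same directory, single-dot form
    "/dir1/dir2/stuff.html",          # from the root of the server
    "//www.yoursite.com/stuff.html",  # keeps only the protocol
    "#top",                           # anchor in the current page
]

resolved = {rel: urljoin(BASE, rel) for rel in relative_urls}
for rel, absolute in resolved.items():
    print(f"{rel:32} -> {absolute}")

# The virtual-domain pitfall: one root-relative URL, two different results.
via_own_domain = urljoin("http://www.yoursite.com/silly/stuff.html",
                         "/silly/morestuff.html")
via_provider = urljoin("http://www.provider.com/sites/yours/silly/stuff.html",
                       "/silly/morestuff.html")
print(via_own_domain)
print(via_provider)
```

The last two lines show the same link landing on different servers depending on which address the visitor came in through.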

Which Type of URL Should You Use?

TIP: Use absolute URLs when linking to a different site, and relative URLs when linking within your site.

Within your site, it's best to use relative URLs, because this will allow you to move the entire site to a different location without having to change all the internal links. Avoid the forms of relative URL starting with slashes, as they are relative only to the root of the server and will become incorrect if you move to a different place in the full directory tree. However, the forms without leading slashes will work identically no matter where the site is relocated.

Use absolute URLs when linking to other sites. You may wish to consider even some other pages you created yourself to be "other sites" for this purpose, if they're part of a completely different logical grouping from the current site and there's a chance one set of your pages will be relocated while the other stays put. So, if you have two sites, at http://www.yoursite.com/literature/ and http://www.yoursite.com/music/, and you think you might eventually move the latter to http://www.yourmusic.com/, then any link from the music site to the literature site should use the full URL instead of a relative URL like "../literature/", which would stop working after this site move.
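To make the site-move scenario concrete, here's a sketch that resolves the same "../literature/" link before and after the hypothetical move, using Python's urllib.parse as a stand-in for a browser. The index.html page names are my own assumption; the directories and domains come from the example above.

```python
from urllib.parse import urljoin, urlsplit

LINK = "../literature/"

# Before the move: the music pages live beside the literature pages.
before = urljoin("http://www.yoursite.com/music/index.html", LINK)

# After the move: the music pages sit at the root of their own domain.
after = urljoin("http://www.yourmusic.com/index.html", LINK)

print(before)                   # still points at the literature site
print(urlsplit(after).netloc)   # now points at the wrong host

# An absolute URL gives the same answer from either location.
absolute = "http://www.yoursite.com/literature/"
print(urljoin("http://www.yourmusic.com/index.html", absolute))
```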

The long and short of it

Whatever sort of URLs you use, I'd prefer that you kept them short, if you can. It's annoying to attempt to put a URL in a plain-text e-mail message and have it wrap to the next line because it's over 80 characters long. People also like to "tweet" URLs on Twitter, with its strict character limit. It's trendy these days to excessively elongate URLs to cram keywords in them for search engines; blog and news sites especially like to do this. So where you might have otherwise had a URL like http://example.net/articles/urls.html you end up with http://example.net/articles/web_development_strategies/how_to_use_overly_lengthy_urls_for_seo.html. I wish you wouldn't do that.

New For 2011: Hash-Bang URLs!

Web developers have constantly come up with new ways to reinvent old wheels, often in manners that break functionality, accessibility, or logical structure. A new instance of this, as of early 2011, is the so-called "hash-bang" URL. These are URLs that contain the sequence "#!", a number sign followed by an exclamation point ("hash" and "bang" in geek jargon). Usually, some important stuff indicating exactly which page in the site is being referenced is contained in the part of the URL following this sequence, like http://www.example.org/#!12345/test_page/, and such URLs are usually introduced after a redesign from an earlier version of the site that used URLs without it, like http://www.example.org/12345/test_page/.

The problem with this is that everything following the number/hash sign is, by the URL specs, a fragment identifier. It is not sent to the server when the URL is requested; rather, it is held by the browser to use after the document is retrieved in order to move to a specific spot in the document. However, the fragment identifier of the current URL is also accessible to client-side scripts such as JavaScript, so it can be put to use by sites using such scripting (e.g., the so-called "AJAX" sites, Asynchronous JavaScript And XML). The result is that instead of the URL, as sent to the server in the initial request, including all necessary path and parameter information to allow the specific desired page to be requested, the URL retrieves only a blank page containing scripts that, in turn, take the fragment identifier and use it to make additional server requests to get the correct data to display. If you have disabled scripting in your browser, you get nothing but a blank page.
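A small sketch may make the difference plain. The function request_target below is a hypothetical helper of mine that approximates the part of a URL a browser actually puts in its request to the server (the path plus any query string, never the fragment); it's a simplification, written with Python's standard urllib.parse.

```python
from urllib.parse import urlsplit

def request_target(url):
    """Approximate what the server is asked for: path and query, no fragment."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target

# Old-style URL: the server is told exactly which page is wanted.
print(request_target("http://www.example.org/12345/test_page/"))    # /12345/test_page/

# Hash-bang URL: the server only ever sees a request for the site root.
print(request_target("http://www.example.org/#!12345/test_page/"))  # /
```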

Until recently, the very important Googlebot (and other search engine indexers) also got nothing but a blank page, which tended to dissuade developers from using such techniques as being very bad for another of their trendy buzzwords, SEO (Search Engine Optimization). However, Google recently "kludged up" a "standard" whereby their indexer would translate a "#!" URL like the one above to http://www.example.org/?_escaped_fragment_=12345/test_page/, which the server can in turn be programmed to respond to with search-engine-friendly content (which might differ from what normal users get, so Google then has to police this for "spamdexing" tricks).
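The translation described above can be sketched roughly as follows. The function name escaped_fragment_url is my own invention, and this simplified version assumes the fragment contains no characters needing extra escaping; it only reproduces the example above and isn't a full implementation of Google's scheme.

```python
from urllib.parse import urlsplit, urlunsplit

def escaped_fragment_url(url):
    """Rewrite a "#!" URL into the _escaped_fragment_ form (simplified sketch)."""
    parts = urlsplit(url)
    if not parts.fragment.startswith("!"):
        return url  # not a hash-bang URL; leave it alone
    param = "_escaped_fragment_=" + parts.fragment[1:]
    # Append to any existing query string rather than replacing it.
    query = parts.query + "&" + param if parts.query else param
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))

print(escaped_fragment_url("http://www.example.org/#!12345/test_page/"))
# http://www.example.org/?_escaped_fragment_=12345/test_page/
```

The point to notice is that the information moves from the fragment, which the server never sees, into the query string, which it does.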

More on this controversial technique is in this article. It's one more in the two-decade-long series of "holy wars" between "purists" complaining that their logical structures are being broken, and "bleeding-edge" developers claiming that the new techniques allow much more exciting and dynamic development. However, some new stuff in HTML 5.0 may make all of this obsolete.

The latest whiz-bang feature related to all of this is the HTML 5.0 feature to let client-side scripts rewrite the URL bar to make the current location reflect changes in the page that were actually made dynamically without a full server-side page load. This is explained in this tutorial. It looks like, at least for browsers that support this feature, you can do a lot to return the Web to the old-fashioned virtue of having pages at definite URLs that can be linked, bookmarked, and seen in the browser bar, while using modern Web 2.0+ snazziness. On the other hand, won't scammers/spammers/phishers have a field day with sites that defraud users with fake URL locations? Can you just stuff "irs.gov" or "citibank.com" into the URL bar when the pages are really coming from "scammersite.ru"? (Apparently you can't; when I tried it under Firefox 10.0, it did nothing and put a security error in the error console. I guess it only lets you set a new address in the same domain. But I bet hackers/crackers/phishers are working hard at probing the limits and looking for loopholes to get around such security restrictions.)