This is one of a series of pages that present the basic principles of using HTML to create a user-friendly web site. This page describes the use of tags to produce typographical variants such as italic or bold type, and the use of entities to produce special characters like accented letters (e.g. é), special symbols (e.g. £) and symbols that cannot be typed normally because they have special meanings in HTML (&, < and >).
Principles of friendly web pages
The structure of an HTML document
Specifying the typography
The <em> (emphatic) and <strong> (strongly emphatic) tags
The <i> (italic) and <b> (bold) tags
Accented letters and other special characters
Refinements: using additional features
Further refinements: improving accessibility
Trouble-shooting: why doesn’t it come out as expected?
There are two basic approaches to typographical control. It is best to assume that the readers of your pages have set their browsers the way they want them, so that if you use logical tags to indicate which words or passages fulfil certain kinds of logical functions you can trust that the user’s browser will interpret your indications in a way that satisfies the user (not necessarily you). Unfortunately, however, many web authors take the view that they know better than any user which bits of their pages ought to be in italics or bold, and so they specify these in the HTML with physical tags. The first approach is far closer to the spirit in which HTML was conceived, and it works even if the user is not using a visual browser at all – for example a browser for blind people, which will know how to speak differently for emphasis but cannot represent italics in speech. In practice physical tags remain widely used, often for reasons that have not been clearly thought out.
I always try to use logical tags when it is a matter of indicating the logical structure of the page (i.e. which bits need emphasis or fulfil defined functions like indicating someone’s address), but physical tags where there are generally recognized conventions to be followed: e.g. biological convention dictates that names of species such as Escherichia coli should be italicized, and it is common practice in mathematics to represent symbols for matrices in bold.
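As a minimal sketch of this mixed approach, the fragment below uses logical tags where the text needs emphasis and physical tags where a typographic convention applies. The sentences and the matrix symbol A are invented purely for illustration.

```html
<!-- Logical tags mark the function of the text; the browser decides how to show it -->
<p>Samples <em>must</em> be kept cold. <strong>Do not freeze them.</strong></p>

<!-- Physical tags follow an established typographic convention -->
<p>The strain of <i>Escherichia coli</i> used is described below.</p>
<p>The matrix <b>A</b> is symmetric.</p>
```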
The commonest logical tags are <em> and <strong>, which will be described shortly (after a brief note about punctuation), but others also exist.
In ordinary typography, i.e. on the printed page, it is conventional that a point of punctuation is in the same style as the text that immediately precedes it. This means, for example, that if a sentence ends with a word in italics then the full stop should be in italics as well. You may well wonder why this matters, given that an italic full stop in isolation looks much the same as a roman full stop. However, computer software tends to balance the space around an italic passage rather badly, and some web browsers do it particularly badly. Compare the following cases, in which the first line shows the HTML code and the second shows the way your browser handles it:
roman, <i>italic</i>, <b>bold</b>.
The above code produces this on your browser: roman, italic, bold.
With the following code, however,
roman, <i>italic,</i> <b>bold.</b>
the result is: roman, italic, bold.
The balance will usually be better if the comma is within the range of the italic element, and, depending on your browser, it may be very much better.
The <em> and <strong> tags are used to add different degrees of emphasis to words or sentences in your page. Most browsers represent emphatic sections as italic and strongly emphatic sections as bold. You can tell what the browser you are using at this moment does by seeing whether text marked as emphatic is shown in the same style as italic, and whether text marked as strongly emphatic is shown in the same style as bold.
It is not good practice to combine these tags, as the results are not predictable: <strong><em> and <em><strong> may be interpreted by some browsers as equivalent to one another, but they may also be treated as <em> or as <strong>, or as just plain wrong and so displayed as normal text. If you decide to ignore this advice you should still nest the tags correctly, closing the inner one before the outer: <em><strong> ... </em></strong> is definitely an error and may have unexpected results.
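The difference between correct and incorrect nesting can be sketched as follows. The word Warning is just placeholder text, and, as noted above, even the correctly nested form may not be displayed as you expect.

```html
<!-- Correctly nested: the inner tag is closed before the outer one -->
<p><em><strong>Warning</strong></em></p>

<!-- Incorrectly nested: the closing tags overlap. Do not write this. -->
<p><em><strong>Warning</em></strong></p>
```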
The range of a typographic tag like <em> should be wholly contained within the range of a layout tag such as <p> or <h2>. This means that if you want to extend an emphasized passage beyond the end of one paragraph and into the next you should close it at the end of one paragraph and reopen it at the beginning of the next.
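A short sketch of an emphasized passage that runs across two paragraphs, with invented text, might look like this:

```html
<!-- The emphasis is closed before the end of the first paragraph... -->
<p>Some ordinary text. <em>The emphasized passage starts here.</em></p>

<!-- ...and reopened at the start of the next one -->
<p><em>It continues here.</em> Ordinary text then resumes.</p>
```

Writing a single <em> that opens in the first paragraph and closes in the second would leave the element overlapping the <p> tags, which is the same kind of nesting error as before.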
The <i> and <b> tags allow you to specify italic or bold explicitly. They have some recommended uses that have nothing to do with emphasis. For example, it is common typographic practice to print words that are in a different language from the rest in italics, and similarly with words that are being considered just as words. The sentence the word and is often written as & is difficult to make sense of as it stands, but it becomes clear when the word and is set in italics. (If you want to avoid italics you can of course put the word in question in quotation marks: the word "and" is often written as &.) It is likewise usual to put algebraic symbols in italics, and something like y = a + bx is much easier to recognize as a piece of algebra if the symbols are italicized than if it is written entirely in roman type.
There are several other logical tags apart from <em> and <strong> already described. Most of these need not concern us for a basic HTML file, but there are two others that are useful even in a first document. These are as follows (listed together with the two already mentioned):
| Tag pair | Meaning/Use | Usual interpretation |
| --- | --- | --- |
| <em> ... </em> | Emphatic | italic |
| <strong> ... </strong> | Strongly emphatic | bold |
| <cite> ... </cite> | Citations (references to sources) | italic |
| <address> ... </address> | Addresses | italic |
It is obvious – and would be even if this list did not illustrate it – that we can imagine many more logically different kinds of text than there are different typographical styles available to the browser; it is therefore inevitable that the same physical style must be used to represent more than one logical style. You may ask what is the point in using <em>, <cite> and <address> tags in an HTML document if they are all going to look the same in the browser. There are two answers. First, even if present-day browsers treat them all alike the browsers of the future may distinguish between them, and it does no harm to include information in the HTML that future browsers may be able to use. Second, when you are revising your HTML – for example because some addresses have changed – it is useful to be able to pick out the <address> tags rapidly without having to wade through a sea of <i> tags.
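The following sketch shows the two additional tags in use. The book title, the name and the address are all invented, and <cite> is used here for a reference to a source, which is my reading of its intended purpose.

```html
<!-- <cite> marks a reference to a source -->
<p>The method follows <cite>A Hypothetical Handbook of Enzymology</cite>.</p>

<!-- <address> marks contact details; it stands on its own, not inside <p> -->
<address>
A. N. Author, 1 Example Street, Anytown
</address>
```

Most browsers will show both in italics, but a search for <address> in the HTML file finds the contact details at once.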
If a page contains accented letters (é, ö, Å, ü, etc.) or certain mathematical symbols like > and <, then they should not be typed just like that in the HTML file. Typing them directly should work correctly with most characters if your server supplies correct information about the coding used and if the user’s browser interprets this correctly, but although you can ensure that the first requirement is satisfied there isn’t much you can do about the second. The essential point is that it will not work in general: some characters will be replaced by quite different special characters, whereas others, most notably <, will be misinterpreted by the browser as HTML codes and may produce quite bizarre results. To avoid this, you need to replace such characters by entities in the HTML, which consist of the sequence &code;, where & and ; define where the entity begins and ends, and code specifies which character to insert.
For most characters alternative numerical and mnemonic code sequences exist. For example, both &#176; and &deg; define the same character, the degree sign °. However, the mnemonic versions are much easier to remember, and for the common accented letters they take consistent forms. Thus é is written as &eacute; and the other letters with acute accents are expressed analogously; í is &iacute;, and so on. These codes are case-sensitive: &eacute; cannot be written &eAcute;, for example. All of the accented letters normally available can be listed quite concisely, together with the most important other entities:
| Examples | Entities | Analogous cases | Used in |
| --- | --- | --- | --- |
| é É | &eacute; &Eacute; | á í ó ú Á Í Ó Ú | French, Spanish, Portuguese |
| è È | &egrave; &Egrave; | à ì ò ù À Ì Ò Ù | French, Italian, Portuguese |
| ê Ê | &ecirc; &Ecirc; | â î ô û Â Î Ô Û | French, Portuguese |
| ä Ä | &auml; &Auml; | ë ï ö ü ÿ Ë Ï Ö Ü | German, Swedish |
| ç Ç | &ccedil; &Ccedil; | | French, Portuguese |
| ñ Ñ | &ntilde; &Ntilde; | ã õ Ã Õ | Spanish, Portuguese |
| å Å | &aring; &Aring; | | Swedish, Danish, Norwegian |
| ø Ø | &oslash; &Oslash; | | Danish, Norwegian |
| ß | &szlig; | | German |
| æ Æ | &aelig; &AElig; | | Danish, Norwegian |
| & (space) | &amp; &nbsp; | | General |
| < > | &lt; &gt; | | Mathematics |
| ° µ | &deg; &micro; | | Science |
| £ ¢ ¥ | &pound; &cent; &yen; | | Finance |
| α β | &alpha; &beta; | γ δ ε ζ ... | Mathematics, chemistry |
The character listed as (space) is a non-breaking space, i.e. a fixed-width space at which a line will not be broken. So, for example, if you want to ensure that a quantity like 3 cm does not get broken between the 3 and the cm at the end of a line you can write it as 3&nbsp;cm. Although most browsers display a bare & (with white space following it) as an ampersand, this is not correct HTML, and &amp; should be used instead. Some HTML authors use an entity &quot; for a quotation mark " if it does not occur within a tag; however, this is not necessary unless you want to include a " within some text that is already enclosed between a pair of quotation marks within a tag. No corresponding problems arise with the semicolon (fortunately!), as this cannot be misinterpreted as the end of an entity unless there has been an opening &.
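To bring these points together, here is a short sketch of entities in use. The sentences and the image file name sign.gif are invented for illustration; only the entities themselves are standard.

```html
<!-- Accented letters written as mnemonic entities -->
<p>Caf&eacute;, na&iuml;ve, Gen&egrave;ve, &Aring;ngstr&ouml;m</p>

<!-- A non-breaking space keeps a number and its unit on the same line -->
<p>The tube is 3&nbsp;cm long and is kept at 4&nbsp;&deg;C.</p>

<!-- Characters that have special meanings in HTML -->
<p>Fish &amp; chips; if a &lt; b then b &gt; a.</p>

<!-- &quot; is needed only inside a quoted attribute value -->
<p><img src="sign.gif" alt="A sign reading &quot;Closed&quot;"></p>
```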
A specific question that arises with & is that it frequently occurs as part of a URL, so you may need to include it in a link. For example, if you enter ampersand as a search term in a search engine you are likely to be taken to a page with a URL something like http://www.google.com/search?q=ampersand&btnG=Google+Search. If you need to include this as a link, you may wonder whether to write this:
<a href="http://www.google.com/search?q=ampersand&btnG=Google+Search">
or this:
<a href="http://www.google.com/search?q=ampersand&amp;btnG=Google+Search">
The answer is that you must write the second in order to have valid HTML. The browser is then responsible for converting &amp; to & before sending it to the remote server as a request.
Another point arises if you want to use curly (or "smart") quotation marks (as in a printed book) instead of straight ones ". The first point to bear in mind is that straight quotation marks are normally just as clear for the reader as curly ones, so you gain little by using curly ones. However, if you think it matters you should use <q> as an opening quotation mark and </q> as a closing one. A modern browser will convert these to the appropriate curly quotation marks, and an older one will either ignore them or show both as straight quotation marks. What you should not do is just put the curly quotation marks in your HTML. These may look fine on your local system, but they will be converted to garbage characters on some others. The <q> tags can be nested within one another, so it is legitimate to write something like this:
<q>A quotation can contain <q>another quotation</q> within it.</q>
On your system this code is displayed like this: A quotation can contain another quotation within it.
You cannot assume that the less common characters exist on all systems and will always be reproduced as you expect: if not, they may be replaced by ? or by the entity itself. The lower-case accented letters used in Western European languages are likely to be safe (as long as your readers are working on a system designed for one of these languages), but the capital letters may not be. Moreover, you cannot safely generalize from the examples above to accented letters in other languages; for example, you cannot assume that &cacute; will put an acute accent on a letter c or that &scedil; will put a cedilla under a letter s.
If you only need to use accented letters occasionally – for example if your page is in English or Dutch but you need to include a few foreign words such as names of people or places – then the simplest thing is just to type in each entity when you come to it. However, if your whole page is in another language or for some other reason you need to include a large number of accented letters, it is rather cumbersome and error-prone to deal with each case separately. An easier method is just to write the page as you normally would in a word-processor, typing é rather than &eacute;, etc.; later on, after checking it carefully for errors in this form, you can use the find-and-replace function of your text editor to replace all instances of é by &eacute;, etc.
Not all current browsers interpret the entities for Greek letters (&alpha; etc.) correctly, and so they are not as useful as they might be. Some web authors try to get around this by defining little images, but these rarely look right except on exactly the system used by the author; others try to specify use of a symbol font that contains the characters wanted, but this also does not work reliably. Incidentally, these Greek entities are not intended for representing text in the Greek language. If you want to prepare a web page in Greek, or one that contains more than occasional isolated Greek letters, you need to refer to a more specialized source of information.
A complete table has been compiled by Martin Ramsch.