HTML Working Group                                        T. Berners-Lee
INTERNET-DRAFT                                               D. Connolly
<draft-ietf-html-spec-03.txt>                                    MIT/W3C
Expires November 1, 1995                                   April 1, 1995


                   HyperText Markup Language -- 2.0


Status of this Memo

   This document is an Internet-Draft. Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas,
   and its working groups. Note that other groups may also distribute
   working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time. It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   To learn the current status of any Internet-Draft, please check the
   "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow
   Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
   munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or
   ftp.isi.edu (US West Coast).

   Distribution of this document is unlimited. Please send comments to
   the HTML working group (HTML-WG) of the Internet Engineering Task
   Force (IETF) at <html-wg@oclc.org>. Discussions of the group are
   archived at <URL:http://www.acl.lanl.gov/HTML_WG/archives.html>.

Abstract

   The HyperText Markup Language (HTML) is a simple markup language
   used to represent hypertext documents that are portable from one
   platform to another. HTML documents are SGML documents with generic
   semantics that are appropriate for representing information from a
   wide range of applications. HTML markup can represent hypertext
   news, mail, documentation, and hypermedia; menus of options;
   database query results; simple structured documents with graphics;
   and hypertext views of existing bodies of information.

   HTML has been in use by the World Wide Web (WWW) global information
   initiative since 1990. This specification roughly corresponds to
   the capabilities of HTML in common use prior to June 1994. It is
   defined as an application of ISO Standard 8879:1986 Information
   Processing Text and Office Systems; Standard Generalized Markup
   Language (SGML).

   The "text/html; version=2.0" Internet Media Type (RFC 1590) and
   MIME Content Type (RFC 1521) is defined by this specification.

Contents

(@@ I'll regenerate this eventually)


1.  Introduction

1.1  Purpose

   The HyperText Markup Language (HTML) is a simple markup language
   used to create hypertext documents that are portable from one
   platform to another. HTML documents are SGML documents with generic
   semantics that are appropriate for representing information from a
   wide range of applications. HTML has been in use by the World-Wide
   Web (WWW) global information initiative since 1990. This
   specification, referred to as "HTML 2.0", corresponds to the
   capabilities of HTML in common use prior to June 1994.

   This specification defines HTML as an application of ISO Standard
   8879:1986 Information Processing Text and Office Systems; Standard
   Generalized Markup Language (SGML). SGML provides a formal
   definition of the HTML syntax in the form of a Document Type
   Definition (DTD).

   This specification also defines HTML as an Internet Media Type [7]
   and MIME Content Type [4] called "text/html", or
   "text/html; version=2.0". As such, it defines the semantics of the
   HTML syntax and how that syntax should be interpreted by user agents.

1.2  Levels of Conformance

   Version 2.0 of HTML introduces a distinction between levels of
   conformance:

   Level 0
       Indicates the minimum conformance level. When writing Level 0
       documents, authors can be confident that the rendering at
       different sites will reflect their intent.

   Level 1
       Includes Level 0 features plus features such as highlighting and
       images.

   Level 2
       Includes all Level 0 and Level 1 features, plus forms.

1.3  Terminology

   The HTML specification uses these words with precise meanings:

   attribute
       A name/value pair: part of an element which is often used
       to specify a characteristic quality of the element, other than
       type or content.

   character
       An atom of information, for example a letter or a number.
       Graphic characters have associated glyphs, whereas control
       characters have associated processing semantics.

   character encoding
       A mapping from sequences of octets to sequences of characters
       from a character repertoire; that is, a sequence of octets and a
       character encoding determines a sequence of characters.

   character number
       A number that determines a character, as per some character set.

   character repertoire
       A finite set of characters. The range of the mapping defined
       by a character set.

   character set
       A mapping of a subset of the integers onto a character
       repertoire. That is, for some set of integers (usually of
       the form {0, 1, 2, ..., N} ), a character set and an integer
       in that set determine a character. Conversely, a character
       and a character set determine the character's number (or,
       in rare cases, a few character numbers).

   conforming HTML user agent
       A user agent that conforms to this specification in its
       treatment of the Internet Media Type "text/html; version=2.0".

   document type definition (DTD)
       A DTD is a collection of declarations (entity, element,
       attribute, link, map, etc.) in SGML syntax that defines the
       components and structures available for a class (type) of
       documents.

   element
       A component of the hierarchical structure defined by the
       document type definition; it is identified in a document
       instance by descriptive markup, usually a start-tag and an end-
       tag.

   entity
       Data with an associated notation or interpretation; for
       example, a sequence of octets associated with an Internet Media
       Type.

   message entity
       A head and body. The head is a collection of name/value fields,
       and the body is a sequence of octets. The head defines the
       content type and content transfer encoding of the body.

   SGML document
       A set of entities, including the document entity, which is
       a text entity that conforms to the grammar specified in the SGML
       standard.

   HTML document
       An SGML document conforming to the HTML document type definition.

   HTTP
       The Hypertext Transfer Protocol [3] is the primary application-
       level protocol for the transfer of documents via the World-Wide
       Web.

   (document) instance
       The document itself including the actual content with the actual
       markup. Can be a single document or part of a document instance
       set that follows the DTD.

   markup
       Text added to the data of a document to convey information about
       it. There are four different kinds of markup: descriptive markup
       (tags), references, markup declarations, and processing
       instructions.

   MIME
       The Multipurpose Internet Mail Extensions [4] provide the
       ability to transfer non-textual data, such as graphics, audio
       and fax, via Internet mail.

   minimally conforming HTML user agent
       A user agent that conforms to this specification in its
       treatment of the Internet Media Type "text/html; level=0;
       version=2.0".

   SGML
       Standard Generalized Markup Language [12] (see also [9] and [6])
       is a system for describing document types and markup languages
       to represent them.

   tag
       Markup that delimits an element. A tag includes a name which
       refers to an element declaration in the DTD, and may include
       attributes.

   text entity
       A finite sequence of characters. A text entity typically takes
       the form of a sequence of octets with some associated character
       encoding, transmitted over the network or stored in a file.

   user agent
       A component of a distributed system that presents an interface
       and processes requests on behalf of a user; for example, a WWW
       browser or a mail user agent.

   URI
       A Universal Resource Identifier [1] is a formatted string that
       serves as an identifier for a resource, typically on the
       Internet. URIs are used in HTML to identify the destination of
       hypertext links, the source of in-line images, and the object of
       form actions. URIs in common use include Uniform Resource
       Locators (URLs) [2] and Relative URLs [5].

   WWW
       The World-Wide Web is a hypertext-based, distributed information
       system created by researchers at CERN in Switzerland. Users may
       create, edit or browse hypertext documents.
       <URL:http://www.w3.org/>

1.4  Imperatives

   may
       The implementation is not obliged to follow this in any way.

   must
       If this is not followed, the implementation does not conform to
       this specification.

   shall
       If this is not followed, the implementation does not conform to
       this specification.

   should
       If this is not followed, though the implementation officially
       conforms to the specification, undesirable results may occur in
       practice.

   typical
       Typical rendering is described for many elements. This is not a
       mandatory part of the specification but is given as guidance for
       designers and to help explain the uses for which the elements
       were intended.

2.  HTML as an Application of SGML

   HTML is an application of ISO Standard 8879:1986 -- Standard
   Generalized Markup Language (SGML) [12]. SGML is a system for
   defining structured document types and markup languages to
   represent instances of those document types. The SGML declaration
   for HTML and the HTML document type definitions (DTDs) are provided
   in Section 12.

   The term "HTML" refers to both the document type defined here and
   the markup language for representing instances of this document
   type.

2.1  SGML Documents

   An HTML document is an SGML document; that is, a set of entities,
   including the document entity, which is a text entity that conforms
   to the grammar specified in the SGML standard. The first production
   of that grammar separates an SGML document into three parts: an
   SGML declaration, a prologue, and an instance.

   For the purposes of this specification, the prologue is a DTD. This
   DTD describes another grammar: the start symbol is given in the
   doctype declaration; the terminals are data characters and tags,
   and the productions are determined by the element declarations. The
   instance must conform to the DTD, that is, it must be in the
   language defined by this grammar.

   The SGML declaration determines the lexicon of the grammar. It
   specifies the document character set, which determines a character
   repertoire that contains all characters used in all text entities
   in the document, and the character numbers associated with those
   characters.

   The SGML declaration also specifies the syntax character set of the
   document, and a few other parameters that bind the abstract syntax
   of SGML to a concrete syntax. This concrete syntax determines how
   each text entity is mapped to a sequence of terminals in the grammar
   of the prologue.

   For example, consider the following document:

    <!DOCTYPE html PUBLIC "-//IETF//DTD HTML 2.0//EN">
    <title>Parsing Example</title>
    <p>Some text. <em>&#42;wow&#42;</em>

   By application convention, the SGML declaration is the one given in
   section 13.2. Hence the document character set is ISO-8859-1(@@)
   and the markup "&#42;" represents an asterisk character.

   The instance is regarded as the following sequence of terminals:

    TITLE start tag
    data characters: "Parsing Example"
    TITLE end tag
    P start tag
    data characters: "Some text. "
    EM start tag
    data characters: "*wow*"
    EM end tag
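
   The sequence of terminals above can be reproduced with a short
   Python sketch. It is only an illustration, not a conforming SGML
   parser: the function name and regular expressions are assumptions
   made here, and attributes, comments, and tag omission are ignored.

```python
import re

# Hypothetical helper patterns: a start or end tag, or a run of data.
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)[^>]*>|([^<]+)")
CHAR_REF = re.compile(r"&#(\d+);?")


def terminals(instance):
    """Return a list of (kind, value) terminals for a markup string."""
    result = []
    for match in TOKEN.finditer(instance):
        slash, name, data = match.groups()
        if name is not None:
            kind = "end tag" if slash else "start tag"
            # Element names are not case sensitive, so normalize them.
            result.append((kind, name.upper()))
        else:
            # Numeric character references map to characters by number.
            # chr() agrees with ISO-8859-1 for numbers 0 through 255.
            text = CHAR_REF.sub(lambda m: chr(int(m.group(1))), data)
            result.append(("data", text))
    return result
```

   Calling terminals() on the TITLE and P lines of the example yields
   the eight terminals listed above, with "&#42;wow&#42;" delivered as
   the data characters "*wow*".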

   The start symbol of the DTD grammar is HTML, and the productions
   are given in the public text identified by "-//IETF//DTD HTML
   2.0//EN" (Section 13.3). Hence the terminals above parse as:

   HTML
    |
    \-HEAD,     BODY
       |         |
       \-TITLE   \-P
           |       |
           |       \-<P>,"Some text. ",EM
           |                            |
           |                            \-<EM>,"*wow*",</EM>
           \-<TITLE>,"Parsing Example",</TITLE>

2.2  HTML Lexical Syntax

   The syntax character set for all HTML documents is ISO-646 (@@ full
   name). A minimally conforming HTML user agent must support the SGML
   declaration in section 13@@, which specifies ISO Latin 1 (@@full
   name) as the document character set; it may support other SGML
   declarations, in particular, SGML declarations with other document
   character sets.

   A complete discussion of the mapping of a sequence of characters to
   a sequence of tags and data is left to the SGML standard. This
   section is only a summary.

2.2.1 Data Characters

   Any sequence of characters that does not constitute markup (see
   "Delimiter Recognition," section @@@ of the SGML standard) is
   mapped directly to a string of data characters. Some markup also
   maps to data character strings. Numeric character references also
   map to single-character strings, via the document character
   set. Each reference to one of the general entities defined in the
   HTML DTD also maps to a single-character string.

   For example,

       abc&lt;def    => "abc","<","def"
       abc&#60;def   => "abc","<","def"

   Note that the terminating semicolon is only necessary when the
   character following the reference would otherwise be recognized as
   part of the reference. It may be omitted when, for example, a
   space follows:

       abc &lt def     => "abc ","<"," def"
       abc &#60 def    => "abc ","<"," def"

   And note that an ampersand is only recognized as markup when it
   is followed by a letter, or by a "#" and a digit:

       abc & lt def    => "abc & lt def"
       abc & 60 def    => "abc & 60 def"
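
   The mappings in the examples above can be sketched in Python as
   follows. This is a minimal illustration under stated assumptions:
   the ENTITIES table holds only a small subset of the general
   entities in the HTML DTD, the function name is invented here, and
   undeclared entity names are simply left as data characters.

```python
import re

# Assumed subset of the general entities declared in the HTML DTD.
ENTITIES = {"lt": "<", "gt": ">", "amp": "&", "quot": '"'}

# "&" followed by a letter starts an entity reference; "&#" followed by
# a digit starts a numeric character reference. The ";" is optional.
REFERENCE = re.compile(r"&(?:#(\d+)|([A-Za-z][A-Za-z0-9.-]*));?")


def decode_references(text):
    """Replace character and entity references with data characters."""
    def replace(match):
        number, name = match.groups()
        if number is not None:
            return chr(int(number))
        # Entity names are case sensitive; unknown names stay as data.
        return ENTITIES.get(name, match.group(0))
    return REFERENCE.sub(replace, text)
```

   An ampersand followed by a space matches neither alternative, so
   "abc & lt def" passes through unchanged.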

   A useful technique for translating plain text to HTML is to replace
   each '<', '&', and '>' by an entity reference or numeric character
   reference as follows:

                 ENTITY      NUMERIC
       CHARACTER REFERENCE   CHAR REF     CHARACTER DESCRIPTION
         &       &amp;       &#38;        Ampersand
         <       &lt;        &#60;        Less than
         >       &gt;        &#62;        Greater than


       Note: There are SGML features, CDATA and RCDATA, to allow
       most "<", ">", and "&" characters to be entered without the
       use of entity references. Because these features tend to be
       used and implemented inconsistently, and because they
       require 8bit characters to represent non-ASCII characters,
       they are not used in this version of the HTML DTD.
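
   The translation technique in the table above might be written in
   Python as shown here. The function name is an assumption made for
   this sketch.

```python
def escape_text(text):
    """Escape plain text for use as HTML data characters."""
    # Replace "&" first so the ampersands introduced by the other
    # replacements are not escaped a second time.
    return (text.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;"))
```

   Python's standard library provides html.escape(text, quote=False),
   which performs the same three replacements.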

2.2.2 Tags

   Tags define the start and end of headings, paragraphs, lists,
   character highlighting and links. Most HTML elements are identified
   in a document as a start tag, which gives the element name and
   attributes, followed by the content, followed by the end tag. Start
   tags are delimited by "<" and ">"; end tags are delimited by "</"
   and ">". An example is:

       <H1>This is a Heading</H1>

   Some elements only have a start tag without an end tag. For
   example, to create a line break, you use the <BR> tag.
   Additionally, the end tags of some other elements, such as
   Paragraph (<P>), List Item (<LI>), Definition Term (<DT>), and
   Definition Description (<DD>) elements, may be omitted.

   The content of an element is a sequence of characters and nested
   elements. Some elements, such as anchors, cannot be nested. Anchors
   and character highlighting may be put inside other constructs. See
   the HTML DTD for full details.

       Note: The SGML declaration for HTML specifies SHORTTAG YES,
       which means that there are other valid syntaxes for tags,
       such as NET tags, "<EM/.../"; empty start tags, "<>"; and
       empty end tags, "</>". Until support for these idioms is
       widely deployed, their use is strongly discouraged.

2.2.3 Names

   A name consists of a letter followed by up to 71 letters, digits,
   periods, or hyphens. Element names are not case sensitive, but
   entity names are. For example, <BLOCKQUOTE>, <BlockQuote>, and
   <blockquote> are equivalent, whereas &amp; is different from &AMP;.

   In a start tag, the element name must immediately follow the tag
   open delimiter "<".
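
   As an illustration, the rules for names might be checked in Python
   as follows. The function names are assumptions made for this
   sketch.

```python
import re

# A letter followed by up to 71 letters, digits, periods, or hyphens.
NAME = re.compile(r"[A-Za-z][A-Za-z0-9.-]{0,71}")


def is_name(candidate):
    """Return True if candidate satisfies the syntax for a name."""
    return NAME.fullmatch(candidate) is not None


def same_element(a, b):
    """Compare element names, which are not case sensitive."""
    return a.upper() == b.upper()
```

   Entity names, being case sensitive, would be compared with plain
   string equality instead.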

2.2.4 Attributes

   In a start tag, white space and attributes are allowed between the
   element name and the closing delimiter. An attribute typically
   consists of an attribute name, an equal sign, and a value (although
   some attributes may be just a value). White space is allowed around
   the equal sign.

   The value of the attribute may be either:

      o A string literal, delimited by single quotes or double quotes
        and not containing any occurrences of the delimiting character.

      o A name token (a sequence of letters, digits, periods, or
        hyphens)

   In this example, A is the element name, HREF is the attribute name,
   and http://host/dir/file.html is the attribute value:

       <A HREF="http://host/dir/file.html">

       Note: Some historical implementations consider any occurrence
       of the ">" character to signal the end of a tag. For
       compatibility with such implementations, when ">" appears in an
       attribute value, it should be represented with a numeric
       character reference, such as in: <IMG SRC="eq1.jpg" alt="a
       &#62; b">

   A useful technique for computing an attribute value literal for a
   given string is to replace each quote, ampersand, and white space
   character by an entity reference or numeric character reference as
   follows:

                 ENTITY      NUMERIC
       CHARACTER REFERENCE   CHAR REF     CHARACTER DESCRIPTION
         TAB                 &#9;         Tab
         LF                  &#10;        Line Feed
         CR                  &#13;        Carriage Return
                             &#32;        Space
         "       &quot;      &#34;        Quotation mark
         &       &amp;       &#38;        Ampersand

   For example:

       <IMG SRC="image.jpg" alt="First &quot;real&quot; example">

       Note: Some historical implementations allow any character
       except space or ">" in a name token. Attribute values need to
       be quoted only if they do not satisfy the syntax for a name
       token.

   Note that the SGML declaration in section 13.3 limits the length of
   an attribute value to 1024 characters.
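
   The technique for computing an attribute value literal, described
   earlier in this section, might be sketched in Python as follows.
   The function name is an assumption. Two choices are made here that
   the text leaves open: ordinary spaces are kept as they are, as in
   the IMG example, and ">" is replaced for compatibility with
   historical implementations.

```python
# Replacements taken from the table in this section, plus ">".
# "&" must come first so later references are not escaped again.
REPLACEMENTS = [
    ("&", "&amp;"),
    ('"', "&quot;"),
    ("\t", "&#9;"),
    ("\n", "&#10;"),
    ("\r", "&#13;"),
    (">", "&#62;"),
]


def attribute_literal(value):
    """Return value as a double-quoted attribute value literal."""
    for char, reference in REPLACEMENTS:
        value = value.replace(char, reference)
    return '"' + value + '"'
```

   The sketch does not enforce the 1024 character limit noted above.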

   Attributes with a declared value of NAME, such as ISMAP and
   COMPACT, may be written using a minimized syntax. The markup:

       <UL COMPACT="compact">

   can be written using a minimized syntax:

       <UL COMPACT>

       Note: Some historical implementations only understand the
       minimized syntax.

2.2.5 Comments

   To include comments in an HTML document that will be eliminated in
   the mapping to terminals, surround them with "<!--" and
   "-->". After the comment delimiter, all text up to the next
   occurrence of "-->" is ignored.  Hence comments cannot be
   nested. White space is allowed between the closing "--" and ">",
   but not between the opening "<!" and "--".

   For example:

       <HEAD>
       <TITLE>HTML Guide: Recommended Usage</TITLE>
       <!-- $Id$ -->
       </HEAD>

       Note: Some historical HTML implementations incorrectly consider
       any ">" character to be the termination of a comment.
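
   The comment rules above can be approximated in Python with a
   single regular expression. This is a sketch of the simplified
   description given in this section, not of full SGML comment
   declarations; the function name is an assumption.

```python
import re

# "<!--", any text, "--", optional white space, then ">".
COMMENT = re.compile(r"<!--.*?--\s*>", re.DOTALL)


def strip_comments(markup):
    """Remove comments, as happens in the mapping to terminals."""
    return COMMENT.sub("", markup)
```

   A ">" inside the comment text does not end the comment, and white
   space between "<!" and "--" prevents recognition altogether.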

2.3  Example HTML Document

       <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
       <HTML>
       <!-- Here's a good place to put a comment. -->
       <HEAD>
       <TITLE>Structural Example</TITLE>
       </HEAD><BODY>
       <H1>First Header</H1>
       <P>This is a paragraph in the example HTML file. Keep in mind
       that the title does not appear in the document text, but that
       the header (defined by H1) does.</P>
       <OL>
       <LI>First item in an ordered list.
       <LI>Second item in an ordered list.
         <UL COMPACT>
         <LI> Note that lists can be nested;
         <LI> Whitespace may be used to assist in reading the
              HTML source.
         </UL>
       <LI>Third item in an ordered list.
       </OL>

       <P>This is an additional paragraph. Technically, end tags are
       not required for paragraphs, although they are allowed. You can
       include character highlighting in a paragraph. <EM>This sentence
       of the paragraph is emphasized.</EM> Note that the &lt;/P&gt;
       end tag has been omitted.
       <P>
       <IMG SRC ="triangle.xbm" alt="Warning:">
       Be sure to read these <b>bold instructions</b>.
       </BODY></HTML>

3.  HTML as an Internet Media Type

   An HTML user agent allows users to interact with resources which
   have HTML representations. At a minimum, it must allow users to
   examine and navigate the content of HTML Level 0 documents. Level 1
   HTML user agents must be able to preserve all formatting
   distinctions represented in an HTML Level 1 document, and be able
   to simultaneously present resources referred to by IMG elements
   (they may ignore some formatting distinctions or IMG resources at
   the request of the user). Fully conforming HTML user agents, that
   is, Level 2 HTML user agents, must support form entry and
   submission.

3.1  text/html media type

   This specification defines the Internet Media Type [7] (formerly
   referred to as the MIME Content Type [4]) called "text/html". The
   following is to be registered with IANA [8].

       Media Type name:          text

       Media subtype name:       html

       Required parameters:      none

       Optional parameters:      level, version, charset

       Encoding considerations:  any encoding is allowed

       Security considerations:  [Section 14]

   The optional parameters are defined as follows:

   Level
       The level parameter specifies the feature set used in the
       document. The level is an integer number, implying that any
       features of the same or a lower level may be present in the
       document. Levels 0, 1 and 2 are defined by this specification.
       Level 2 is the default.

   Version
       To help avoid future compatibility problems, the version
       parameter may be used to give the version number of the
       specification to which the document conforms. The version number
       appears at the front of this document and within the public
       identifier of the HTML DTD. This specification defines
       version 2.0.

   Charset
       The charset parameter (as defined in section 7.1.1 of RFC 1521
       [4]) may be given to specify the encoding used to represent the
       HTML document as a sequence of octets. The default value is
       outside the scope of this specification; but for example, the
       default is US-ASCII in the context of MIME mail, and ISO-8859-1
       in the context of HTTP.
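
   As an illustration, a receiver might extract the optional
   parameters from a content type value as sketched below in Python.
   The function name and the dictionary layout are assumptions, and
   the parsing is simplified: quoted parameter values that contain
   ";" are not handled.

```python
def parse_html_media_type(content_type):
    """Return the level, version, and charset of a text/html type."""
    media_type, *parameters = content_type.split(";")
    if media_type.strip().lower() != "text/html":
        raise ValueError("not a text/html media type")
    # Level 2 is the default; no default is assumed for the others,
    # since the charset default depends on the transport.
    result = {"level": 2, "version": None, "charset": None}
    for parameter in parameters:
        name, _, value = parameter.partition("=")
        name = name.strip().lower()
        value = value.strip().strip('"')
        if name == "level":
            result["level"] = int(value)
        elif name in ("version", "charset"):
            result[name] = value
    return result
```

   A caller would fill in a missing charset from the context, for
   example US-ASCII for MIME mail or ISO-8859-1 for HTTP.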

3.2  HTML Document Representation

   A message entity with a content type of "text/html" represents an HTML
   document, consisting of a single text entity. The charset parameter
   (whether implicit or explicit) identifies a character encoding. The
   text entity consists of the characters determined by this character
   encoding and the octets of the body of the message entity.

   The SGML declaration of the document is a function of the charset
   parameter. If the charset parameter is US-ASCII or ISO-8859-1, the
   SGML declaration in section 13@@ applies. Other charset parameter
   values are reserved for future use.

   NOTE: A generalized convention for mapping charset parameter values
   to SGML declarations is expected to be specified in a future
   version of this specification.

3.2.1  Conventional Handling of Undeclared Markup Errors

   To facilitate experimentation and interoperability between
   implementations of various versions of HTML, the installed base of
   HTML user agents supports a superset of the HTML 2.0 language by
   reducing it to HTML 2.0: markup in the form of a start tag or end
   tag whose generic identifier is not declared is mapped to nothing
   during tokenization. Undeclared attributes are treated similarly.
   The entire attribute specification of an unknown attribute (i.e.,
   the unknown attribute and its value, if any) should be ignored.  On
   the other hand, references to undeclared entities should be treated
   as data characters.

   For example:

     <div class=chapter><h1>foo</h1><p>...</div>
          => <H1>,"foo",</H1>,<P>,"..."

     xxx <P ID=z23> yyy
          => "xxx ",<P>," yyy"

     Let &alpha; and &beta; be finite sets.
          => "Let &alpha; and &beta; be finite sets."

   Support for notifying the user of such errors is encouraged.

   Information providers should keep in mind that this convention is
   not binding: unspecified behavior may result, as such markup is
   not conforming to this specification.
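
   The convention described in this section might be sketched in
   Python as follows. The DECLARED table is a hypothetical, heavily
   abridged stand-in for the declarations in the HTML DTD, and the
   function name is an assumption. The tag pattern is simplified and
   does not handle ">" inside a quoted attribute value.

```python
import re

# Hypothetical, abridged declarations: element name -> attribute names.
DECLARED = {"H1": set(), "P": set(), "A": {"HREF", "NAME"}}

TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)([^>]*)>")
ATTRIBUTE = re.compile(
    r"""\s+([A-Za-z][A-Za-z0-9.-]*)(\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+))?""")


def reduce_to_declared(markup):
    """Drop undeclared tags and undeclared attribute specifications."""
    def replace_tag(match):
        slash, name, rest = match.groups()
        allowed = DECLARED.get(name.upper())
        if allowed is None:
            return ""  # undeclared generic identifier: mapped to nothing
        # Keep only the attribute specifications that are declared.
        kept = "".join(
            attr.group(0) for attr in ATTRIBUTE.finditer(rest)
            if attr.group(1).upper() in allowed)
        return "<" + slash + name + kept + ">"
    return TAG.sub(replace_tag, markup)
```

   References to undeclared entities need no handling in this sketch,
   because they are already left in place as data characters.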


3.2.2  Conventional Representation of Newlines

   SGML specifies that a text entity is a sequence of records, each
   beginning with a record start character and ending with a record
   end character (characters numbered 10 and 13 respectively).

   MIME specifies that a body of type text/* is a sequence of lines,
   each terminated by CRLF, that is, octets 13, 10.

   In practice, HTML documents are frequently represented and
   transmitted using an end of line convention that depends on the
   conventions of the source of the document; frequently, that
   representation consists of CR only, LF only, or a CR LF
   combination. Hence the decoding of the octets will often result in
   a text entity with some missing record start and record end
   characters.

   Since there is no ambiguity, HTML user agents are encouraged to
   infer the missing record start and end characters.

   An HTML user agent should treat end of line in any of its
   variations as a word space in all contexts except
   preformatted text. Within preformatted text, an HTML user agent
   should expect to treat any of the three common representations of
   end-of-line as starting a new line.
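
   The treatment of end of line described above can be sketched in
   Python as follows. The function names are assumptions made for
   this illustration.

```python
import re

# CR LF must come first so that it counts as one end of line, not two.
END_OF_LINE = re.compile(r"\r\n|\r|\n")


def split_lines(text):
    """Split preformatted text on any end-of-line representation."""
    return END_OF_LINE.split(text)


def as_word_spaces(text):
    """Outside preformatted text, treat each end of line as a space."""
    return END_OF_LINE.sub(" ", text)
```

   Listing the two-character alternative first is the one design
   point that matters: otherwise a CR LF pair would be seen as two
   line ends.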

4.  Document Structure

   To identify information as an HTML document conforming to this
   specification, each document should start with the prologue:

       <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">

       Note: If the body of a text/html body part does not begin
       with a document type declaration, an HTML user agent should
       infer the above document type declaration.

   HTML user agents are required to support only the above document
   type declaration and the following document type declarations;
   support for any others is optional.

       <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0 Level 0//EN">
       <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0 Level 1//EN">
       <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0 Level 2//EN">

   In particular, they may support other formal public identifiers, or
   other document types altogether. They may support an internal declaration
   subset with supplemental entity, element, and other markup
   declarations, or they may not.
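
   As an illustration, a user agent might determine the public
   identifier of a document, inferring the default when no document
   type declaration is present, as sketched below in Python. The
   function and constant names are assumptions, and the pattern
   covers only declarations of the form shown in this section.

```python
import re

DEFAULT_PUBLIC_ID = "-//IETF//DTD HTML 2.0//EN"
REQUIRED_PUBLIC_IDS = {
    DEFAULT_PUBLIC_ID,
    "-//IETF//DTD HTML 2.0 Level 0//EN",
    "-//IETF//DTD HTML 2.0 Level 1//EN",
    "-//IETF//DTD HTML 2.0 Level 2//EN",
}

DOCTYPE = re.compile(r'\s*<!DOCTYPE\s+HTML\s+PUBLIC\s+"([^"]*)"',
                     re.IGNORECASE)


def public_identifier(body):
    """Return the declared public identifier, or the inferred default."""
    match = DOCTYPE.match(body)
    return match.group(1) if match else DEFAULT_PUBLIC_ID


def must_be_supported(body):
    """Return True if support for the document's DTD is required."""
    return public_identifier(body) in REQUIRED_PUBLIC_IDS
```

   A False result from must_be_supported() does not mean the document
   is rejected, only that support for its document type is optional.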

   HTML documents may contain an <HTML> tag at the beginning
   (immediately after the prologue) and an </HTML> tag at the end.

   The HTML document element is organized as a head and a body, much
   like a memo or a mail message. Within the head, you can specify the
   title and other information about the document. Within the body,
   you can structure text into paragraphs and lists, as well as
   highlight phrases and create links, using HTML elements.

       Note: Technically, the start and end tags for HTML, Head,
       and Body elements are omissible; however, this is not
       recommended since the head/body structure allows an
       implementation to determine certain properties of a
       document, such as the title, without parsing the entire
       document.

4.1  HTML Document Element

   <HTML> ... </HTML>                                           Level 0

   The HTML element contains the Head and Body elements.

4.2  Head

   <HEAD> ... </HEAD>                                           Level 0

   The head of an HTML document is an unordered collection of
   information about the document. The Title element is required and
   appears between the <HEAD> and </HEAD> tags:

       <HEAD>
       <TITLE>Introduction to HTML</TITLE>
       </HEAD>

4.3  Body

   <BODY> ... </BODY>                                           Level 0

   The Body element identifies the body component of an HTML document.
   Specifically, the body of a document may contain links, text, and
   formatting information within <BODY> and </BODY> tags.

5.  Document Metainformation Elements

5.1  Title

   <TITLE> ... </TITLE>                                         Level 0

   Every HTML document must contain a Title element. The title should
   identify the contents of the document in a global context, and may
   be used in history lists and as a label for the window displaying
   the document. Unlike headings, titles are not rendered in the text
   of a document itself.

   The Title element must occur within the head of the document, and
   must not contain anchors, paragraph tags, or highlighting. Only one
   title is allowed in a document.

       Note: The length of a title is not limited; however, long
       titles may be truncated in some applications. To minimize
       this possibility, titles should be fewer than 64 characters.
       Also keep in mind that a short title, such as Introduction,
       may be meaningless out of context. An example of a
       meaningful title might be "Introduction to HTML Elements."

5.2  Base

   <BASE>                                                       Level 0

   The Base element allows the URL [2] of the document itself to be
   recorded in situations in which the document may be read out of
   context. URLs within the document may be in a "partial" form
   relative to this base address [5].

   The Base element has one attribute, HREF, which identifies the
   absolute base URL.
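
   As an illustration of how a partial URL might be combined with the
   base address, the sketch below uses urljoin from the Python
   standard library. The URLs are hypothetical, and urljoin follows
   the library's own resolution rules, which may differ from [5] in
   edge cases.

```python
from urllib.parse import urljoin


def resolve(base_href: str, partial_url: str) -> str:
    """Resolve a partial URL against the HREF of the Base element."""
    return urljoin(base_href, partial_url)


# Hypothetical base address, as it might appear in <BASE HREF="...">.
base = "http://www.example.org/docs/intro.html"
print(resolve(base, "figures/fig1.gif"))
print(resolve(base, "/index.html"))
```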

5.3  Isindex

   <ISINDEX>                                                    Level 0

   The Isindex element tells the user agent that the document is an
   index. This means that the reader may request a keyword search on
   the resource by adding a question mark to the end of the document
   address, followed by a list of keywords separated by plus signs.

   The Isindex element is usually generated by the network server from
   which the document was obtained via a URI. The server must have a
   search engine that supports this feature for the resource. If the
   document URI is unknown to the user agent, the Isindex element
   must be ignored.
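
   The search address described above could be built as in the
   following sketch. The function name and the example address are
   hypothetical, and percent-escaping each keyword with quote is an
   assumption that this section does not itself require.

```python
from urllib.parse import quote


def isindex_query(document_address: str, keywords: list[str]) -> str:
    """Append a question mark and the keywords, separated by plus
    signs, to the document address."""
    # Assumption: reserved characters inside a keyword are escaped.
    return document_address + "?" + "+".join(quote(k, safe="") for k in keywords)


print(isindex_query("http://www.example.org/index", ["html", "base"]))
```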

5.4  Link

   <LINK>                                                       Level 0

   The Link element indicates a relationship between the document and
   some other object. A document may have any number of Link elements.

   The Link element is empty (does not have a closing tag), but takes
   the same attributes as the Anchor element.

   Typical uses are to indicate authorship, related indexes and
   glossaries, older or more recent versions, etc. Links can indicate
   a static tree structure in which the document was authored by
   pointing to a "parent" and "next" and "previous" document, for
   example.

   Servers may also allow links to be added by those who do not have
   the right to alter the body of a document.

5.5  Meta

   <META>                                                       Level 0

   The META element is used within the HEAD element to embed document
   metainformation not defined by other HTML elements. META elements
   can be extracted by servers and/or clients for use in identifying,
   indexing, and cataloging specialized document metainformation.

   Although it is generally preferable to use named elements which
   have well-defined semantics for each type of metainformation (e.g.
   TITLE), the META element is provided for situations where strict
   SGML parsing is necessary and the local DTD is not extensible. HTML
   user agents may use the META element's content if they recognize
   and understand the semantics identified by the NAME or HTTP-EQUIV
   attributes, and may treat the content as metainformation (and not
   render it) even when they do not recognize the name.

   In addition, HTTP servers may wish to read the content of the
   document HEAD to generate header fields corresponding to any
   elements defining a value for the attribute HTTP-EQUIV. Note,
   however, that the method by which the server extracts document
   metainformation is not part of this specification, nor can it be
   assumed by authors that any given server will be capable of
   extracting it. The META element only provides an extensible
   mechanism for identifying and embedding document metainformation --
   how it may be used is up to the individual server implementation
   and the HTML user agent.

   Attributes of the META element:

   HTTP-EQUIV
       This attribute binds the element to an HTTP header field. It
       means that if you know the semantics of the HTTP header field
       named by this attribute, then you can process the contents based
       on a well-defined syntactic mapping, whether or not your DTD
       tells you anything about it. HTTP header field names are not
       case sensitive. If not present, the attribute NAME should be
       used to identify this metainformation and the content should not
       be used within an HTTP response header.

   NAME
       Metainformation name. If the NAME attribute is not present, the
       name can be assumed to be equal to the value of HTTP-EQUIV.

   CONTENT
       The metainformation content to be associated with the given
       name. If multiple META elements are provided with the same name,
       their combined contents--concatenated as a comma-separated list--
       is the value associated with that name.

   Examples

   If the document contains:

       <META HTTP-EQUIV="Expires"
             CONTENT="Tue, 04 Dec 1993 21:29:02 GMT">
       <meta http-equiv="Keywords" CONTENT="Fred, Barney">
       <META HTTP-EQUIV="Reply-to"
             content="fielding@ics.uci.edu (Roy Fielding)">

   then the server (if so configured) may include the following
   headers:

       Expires: Tue, 04 Dec 1993 21:29:02 GMT
       Keywords: Fred, Barney
       Reply-to: fielding@ics.uci.edu (Roy Fielding)

   as part of the HTTP response to a GET or HEAD request for that
   document.

   When the HTTP-EQUIV attribute is not present, the server should not
   generate an HTTP response header for the metainformation; e.g.,

       <META NAME="IndexType" CONTENT="Service">

   would never generate an HTTP response header, but would still allow
   HTML user agents to identify and make use of that metainformation.

   The Meta element should never be used to define information that
   should be associated with an existing HTML element. An example of
   an inappropriate use of the Meta element is:

       <META NAME="Title" CONTENT="The Etymology of Dunsel">

   Do not set HTTP-EQUIV to the name of a response header that should
   normally be generated only by the HTTP server. Example names that
   are inappropriate include "Server", "Date", and "Last-modified" --
   the exact list of inappropriate names is dependent on the
   particular server implementation. We recommend that servers ignore
   any META elements which specify HTTP-equivalents which are equal
   (case-insensitively) to their own reserved response headers.
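
   The following sketch shows one way a server might derive header
   fields from META elements under the rules above. It is not part
   of this specification: the class and function names, the RESERVED
   set, and the use of ", " as the list separator are all
   assumptions, and Python's html.parser is a lenient parser, not an
   SGML parser.

```python
from html.parser import HTMLParser

# Hypothetical reserved list; the real list depends on the server.
RESERVED = {"server", "date", "last-modified"}


class MetaCollector(HTMLParser):
    """Collect HTTP-EQUIV/CONTENT pairs, matching names case-insensitively."""

    def __init__(self):
        super().__init__()
        # lower-cased field name -> (name as first written, list of contents)
        self.headers = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)  # attribute names arrive lower-cased
        name = attributes.get("http-equiv")
        if name is None or name.lower() in RESERVED:
            return  # NAME-only or reserved: no response header
        entry = self.headers.setdefault(name.lower(), (name, []))
        entry[1].append(attributes.get("content") or "")


def http_equiv_headers(document: str) -> list[str]:
    """Return one header line per distinct HTTP-EQUIV name, with
    repeated names combined into a comma-separated list."""
    collector = MetaCollector()
    collector.feed(document)
    collector.close()
    return [f"{name}: {', '.join(values)}"
            for name, values in collector.headers.values()]


sample = (
    '<META HTTP-EQUIV="Keywords" CONTENT="Fred">\n'
    '<meta http-equiv="keywords" content="Barney">\n'
    '<META NAME="IndexType" CONTENT="Service">\n'
)
print(http_equiv_headers(sample))
```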

5.6  Nextid

   <NEXTID>                                                     Level 0

   The Nextid element is a parameter read and generated by text
   editing software to create unique identifiers. This tag takes a
   single attribute, N, whose value is the next document-wide
   alphanumeric identifier to be allocated, of the form z123:

       <NEXTID N=Z27>

   When modifying a document, existing anchor identifiers should not
   be reused, as these identifiers may be referenced by other
   documents. Human writers of HTML usually use mnemonic alphabetical
   identifiers.

   HTML user agents may ignore the Nextid element. Support for the
   Nextid element does not impact HTML user agents in any way.

6.  Data Characters

   An HTML user agent should present the body of an HTML document as
   a collection of typeset paragraphs and preformatted text. Except
   for the PRE element, each block structuring element is regarded as
   a paragraph by taking the data characters in its content and the
   content of its descendant elements, concatenating them, and
   splitting the result into words, separated by space, tab, or
   record end characters (and perhaps hyphen characters). The
   sequence of words is typeset as a paragraph by breaking it into
   lines.
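
   The paragraph model above can be illustrated with a short sketch.
   It assumes the data characters have already been extracted and
   concatenated, treats CR and LF as record end characters, and uses
   a fixed line width in characters. The function name and the
   default width are hypothetical; textwrap's default setting also
   allows breaks after hyphens.

```python
import re
import textwrap


def typeset_paragraph(data: str, width: int = 40) -> list[str]:
    """Split data characters into words and break them into lines."""
    # Words are separated by space, tab, or record end (CR, LF here).
    words = [w for w in re.split(r"[ \t\r\n]+", data) if w]
    return textwrap.wrap(" ".join(words), width=width)


for line in typeset_paragraph("Pease\tporridge\r\nhot,  pease porridge cold", width=20):
    print(line)
```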

6.1  The ISO Latin 1 Character Repertoire

   Conforming HTML user agents are required to support the US-ASCII
   [10] or ISO-8859-1 [11] character encodings, and the @@fullname ISO
   Latin 1 document character set.

   The character repertoire shared by these two is known as Latin
   Alphabet No. 1, or simply Latin-1.  Latin-1 includes characters
   from most Western European languages, as well as a number of
   control characters.  Latin-1 also includes a non-breaking space, a
   soft hyphen indicator, 93 graphical characters, 8 unassigned
   characters, and 25 control characters.

       Note: Use of the non-breaking space and soft hyphen indicator
       characters is discouraged because support for them is not
       widely deployed.

   In SGML applications, the use of control characters is limited in
   order to maximize the chance of successful interchange over
   heterogeneous networks and operating systems. In HTML, only three
   control characters are allowed: Horizontal Tab (HT, encoded as 9
   decimal in US-ASCII and ISO-8859-1), Carriage Return, and Line Feed.
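
   A checker for this restriction might look like the sketch below.
   The code positions for Line Feed (10) and Carriage Return (13) are
   the usual US-ASCII values. Treating positions 0-31 and 127-159 as
   the control characters is an assumption of this sketch, and the
   function name is hypothetical.

```python
ALLOWED_CONTROLS = {9, 10, 13}  # Horizontal Tab, Line Feed, Carriage Return


def disallowed_controls(text: str) -> list[int]:
    """Return the positions of control characters that are not allowed."""
    positions = []
    for index, char in enumerate(text):
        code = ord(char)
        # Assumption: 0-31 and 127-159 are the control positions.
        is_control = code < 32 or 127 <= code <= 159
        if is_control and code not in ALLOWED_CONTROLS:
            positions.append(index)
    return positions


print(disallowed_controls("a\tb\r\nc\x0cd"))
```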

   The HTML DTD references the Added Latin 1 entity set, to allow
   mnemonic representation of Latin 1 characters using only the widely
   supported ASCII character repertoire. For example:

       Kurt G&ouml;del was a famous logician and mathematician.

   See Section 13.2 for a table of the "Added Latin 1" entities, and
   Section 13.3 for a table of the characters of ISO-8859-1.
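
   To illustrate the entity mechanism, the sketch below expands the
   example sentence using html.unescape from the Python standard
   library. That function uses a larger entity table than the Added
   Latin 1 set, so it is only an approximation of what an HTML 2.0
   user agent would do.

```python
import html

source = "Kurt G&ouml;del was a famous logician and mathematician."
expanded = html.unescape(source)

# The expanded text is representable in ISO-8859-1, where
# o-with-diaeresis is the single byte 0xF6.
encoded = expanded.encode("iso-8859-1")

print(expanded)
print(encoded)
```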

7.  Data Elements

7.1  Line Break

   <BR>                                                         Level 0

   The Line Break element specifies a line break in a paragraph or
   preformatted text section. The new line should be indented the
   same as line-wrapped text.

   Example of use:

       <P> Pease porridge hot<BR>
       Pease porridge cold<BR>
       Pease porridge in the pot<BR>
       Nine days old.

7.2  Horizontal Rule

   <HR>                                                         Level 0

   A Horizontal Rule element is a divider between sections of text,
   rendered as a full-width horizontal rule or an equivalent graphic.

   Example of use:

       <BODY>
       ...
       <HR>
       <ADDRESS>February 8, 1995, CERN</ADDRESS>
       </BODY>

7.3  Image

   <IMG>                                                        Level 0

   The Image element is used to incorporate in-line graphics
   (typically icons or small graphics) into an HTML document. This
   element cannot be used for embedding other HTML text.