HTML Working Group T. Berners-Lee
INTERNET-DRAFT D. Connolly
<draft-ietf-html-spec-03.txt> MIT/W3C
Expires November 1, 1995 April 1, 1995
HyperText Markup Language -- 2.0
Status of this Memo
This document is an Internet-Draft. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas,
and its working groups. Note that other groups may also distribute
working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other
documents at any time. It is inappropriate to use Internet-Drafts
as reference material or to cite them other than as "work in
progress."
To learn the current status of any Internet-Draft, please check the
"1id-abstracts.txt" listing contained in the Internet-Drafts Shadow
Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or
ftp.isi.edu (US West Coast).
Distribution of this document is unlimited. Please send comments to
the HTML working group (HTML-WG) of the Internet Engineering Task
Force (IETF) at <html-wg@oclc.org>. Discussions of the group are
archived at <URL:http://www.acl.lanl.gov/HTML_WG/archives.html>.
Abstract
The HyperText Markup Language (HTML) is a simple markup language
used to represent hypertext documents that are portable from one
platform to another. HTML documents are SGML documents with generic
semantics that are appropriate for representing information from a
wide range of applications. HTML markup can represent hypertext
news, mail, documentation, and hypermedia; menus of options;
database query results; simple structured documents with graphics;
and hypertext views of existing bodies of information.
HTML has been in use by the World Wide Web (WWW) global information
initiative since 1990. This specification roughly corresponds to
the capabilities of HTML in common use prior to June 1994. It is
defined as an application of ISO Standard 8879:1986 Information
Processing Text and Office Systems; Standard Generalized Markup
Language (SGML).
The "text/html; version=2.0" Internet Media Type (RFC 1590) and
MIME Content Type (RFC 1521) is defined by this specification.
Contents
(@@ I'll regenerate this eventually)
1. Introduction
1.1 Purpose
The HyperText Markup Language (HTML) is a simple markup language
used to create hypertext documents that are portable from one
platform to another. HTML documents are SGML documents with generic
semantics that are appropriate for representing information from a
wide range of applications. HTML has been in use by the World-Wide
Web (WWW) global information initiative since 1990. This
specification corresponds to the capabilities of HTML in common use
prior to June 1994 and referred to as "HTML 2.0".
This specification defines HTML as an application of ISO Standard
8879:1986 Information Processing Text and Office Systems; Standard
Generalized Markup Language (SGML). SGML provides a formal
definition of the HTML syntax in the form of a Document Type
Definition (DTD).
This specification also defines HTML as an Internet Media Type [7]
and MIME Content Type [4] called "text/html", or
"text/html; version=2.0". As such, it defines the semantics of the
HTML syntax and how that syntax should be interpreted by user agents.
1.2 Levels of Conformance
Version 2.0 of HTML introduces a distinction between levels of
conformance:
Level 0
Indicates the minimum conformance level. When writing Level 0
documents, authors can be confident that the rendering at
different sites will reflect their intent.
Level 1
Includes Level 0 features plus features such as highlighting and
images.
Level 2
Includes all Level 0 and Level 1 features, plus forms.
1.3 Terminology
The HTML specification uses these words with precise meanings:
attribute
A name/value pair: part of an element which is often used
to specify a characteristic quality of the element, other than
type or content.
character
An atom of information, for example a letter or a number.
Graphic characters have associated glyphs, whereas control
characters have associated processing semantics.
character encoding
A mapping from sequences of octets to sequences of characters
from a character repertoire; that is, a sequence of octets and a
character encoding determines a sequence of characters.
character number
A number that determines a character, as per some character set.
character repertoire
A finite set of characters. The range of the mapping defined
by a character set.
character set
A mapping of a subset of the integers onto a character
repertoire. That is, for some set of integers (usually of
the form {0, 1, 2, ..., N} ), a character set and an integer
in that set determine a character. Conversely, a character
and a character set determine the character's number (or,
in rare cases, a few character numbers).
conforming HTML user agent
A user agent that conforms to this specification in its
treatment of the Internet Media Type "text/html; version=2.0".
document type definition (DTD)
A DTD is a collection of declarations (entity, element,
attribute, link, map, etc.) in SGML syntax that defines the
components and structures available for a class (type) of
documents.
element
A component of the hierarchical structure defined by the
document type definition; it is identified in a document
instance by descriptive markup, usually a start-tag and an end-
tag.
entity
Data with an associated notation or interpretation; for
example, a sequence of octets associated with an Internet Media
Type.
message entity
A head and a body. The head is a collection of name/value fields,
and the body is a sequence of octets. The head defines the
content type and content transfer encoding of the body.
SGML document
A set of entities, including the document entity, which is
a text entity that conforms to the grammar specified in the SGML
standard.
HTML document
An SGML document conforming to the HTML document type definition.
HTTP
The Hypertext Transfer Protocol [3] is the primary application-
level protocol for the transfer of documents via the World-Wide
Web.
(document) instance
The document itself including the actual content with the actual
markup. Can be a single document or part of a document instance
set that follows the DTD.
markup
Text added to the data of a document to convey information about
it. There are four different kinds of markup: descriptive markup
(tags), references, markup declarations, and processing
instructions.
MIME
The Multipurpose Internet Mail Extensions [4] provide the
ability to transfer non-textual data, such as graphics, audio
and fax, via Internet mail.
minimally conforming HTML user agent
A user agent that conforms to this specification in its
treatment of the Internet Media Type "text/html; level=0;
version=2.0".
SGML
Standard Generalized Markup Language [12] (see also [9] and [6])
is a system for describing document types and markup languages
to represent them.
tag
Markup that delimits an element. A tag includes a name which
refers to an element declaration in the DTD, and may include
attributes.
text entity
A finite sequence of characters. A text entity typically takes
the form of a sequence of octets with some associated character
encoding, transmitted over the network or stored in a file.
user agent
A component of a distributed system that presents an interface
and processes requests on behalf of a user; for example, a WWW
browser or a mail user agent.
URI
A Universal Resource Identifier [1] is a formatted string that
serves as an identifier for a resource, typically on the
Internet. URIs are used in HTML to identify the destination of
hypertext links, the source of in-line images, and the object of
form actions. URIs in common use include Uniform Resource
Locators (URLs) [2] and Relative URLs [5].
WWW
The World-Wide Web is a hypertext-based, distributed information
system created by researchers at CERN in Switzerland. Users may
create, edit or browse hypertext documents.
<URL:http://www.w3.org/>
1.4 Imperatives
may
The implementation is not obliged to follow this in any way.
must
If this is not followed, the implementation does not conform to
this specification.
shall
If this is not followed, the implementation does not conform to
this specification.
should
If this is not followed, though the implementation officially
conforms to the specification, undesirable results may occur in
practice.
typical
Typical rendering is described for many elements. This is not a
mandatory part of the specification but is given as guidance for
designers and to help explain the uses for which the elements
were intended.
2. HTML as an Application of SGML
HTML is an application of ISO Standard 8879:1986 -- Standard
Generalized Markup Language (SGML) [12]. SGML is a system for
defining structured document types and markup languages to
represent instances of those document types. The SGML declaration
for HTML and the HTML document type definitions (DTDs) are provided
in Section 12.
The term "HTML" refers to both the document type defined here and
the markup language for representing instances of this document
type.
2.1 SGML Documents
An HTML document is an SGML document; that is, a set of entities,
including the document entity, which is a text entity that conforms
to the grammar specified in the SGML standard. The first production
of that grammar separates an SGML document into three parts: an
SGML declaration, a prologue, and an instance.
For the purposes of this specification, the prologue is a DTD. This
DTD describes another grammar: the start symbol is given in the
doctype declaration; the terminals are data characters and tags,
and the productions are determined by the element declarations. The
instance must conform to the DTD, that is, it must be in the
language defined by this grammar.
The SGML declaration determines the lexicon of the grammar. It
specifies the document character set, which determines a character
repertoire that contains all characters used in all text entities
in the document, and the character numbers associated with those
characters.
The SGML declaration also specifies the syntax character set of the
document, and a few other parameters that bind the abstract syntax
of SGML to a concrete syntax. This concrete syntax determines how
each text entity is mapped to a sequence of terminals in the grammar
of the prologue.
For example, consider the following document:
<!DOCTYPE html PUBLIC "-//IETF//DTD HTML 2.0//EN">
<title>Parsing Example</title>
<p>Some text. <em>&#42;wow&#42;</em>
By application convention, the SGML declaration is the one given in
section 13.2. Hence the document character set is ISO-8859-1(@@)
and the markup "&#42;" represents an asterisk character.
The instance is regarded as the following sequence of terminals:
TITLE start tag
data characters: "Parsing Example"
TITLE end tag
P start tag
data characters: "Some text. "
EM start tag
data characters: "*wow*"
EM end tag
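The mapping from a text entity to terminals can be approximated with an off-the-shelf tokenizer. The sketch below uses Python's html.parser module as a stand-in; it is a lenient HTML tokenizer, not an SGML parser, and the names TerminalLister, list_terminals, and EXAMPLE are illustrative only. White space between elements is skipped to keep the output comparable to the listing above.

```python
from html.parser import HTMLParser


class TerminalLister(HTMLParser):
    """Collects start tags, end tags, and data as a flat list of terminals."""

    def __init__(self):
        # convert_charrefs=True resolves references such as &#42; to data.
        super().__init__(convert_charrefs=True)
        self.terminals = []

    def handle_starttag(self, tag, attrs):
        self.terminals.append(("start", tag.upper()))

    def handle_endtag(self, tag):
        self.terminals.append(("end", tag.upper()))

    def handle_data(self, data):
        # Simplification: drop white-space-only data between elements.
        if data.strip():
            self.terminals.append(("data", data))


def list_terminals(document):
    parser = TerminalLister()
    parser.feed(document)
    parser.close()
    return parser.terminals


EXAMPLE = (
    '<!DOCTYPE html PUBLIC "-//IETF//DTD HTML 2.0//EN">\n'
    "<title>Parsing Example</title>\n"
    "<p>Some text. <em>&#42;wow&#42;</em>"
)

for terminal in list_terminals(EXAMPLE):
    print(terminal)
```

The document type declaration is consumed by the tokenizer and does not appear among the terminals.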
The start symbol of the DTD grammar is HTML, and the productions
are given in the public text identified by "-//IETF//DTD HTML
2.0//EN" (Section 13.3). Hence the terminals above parse as:
HTML
 |
 \-HEAD, BODY
   |        |
   \-TITLE  \-P
     |        |
     |        \-<P>,"Some text. ",EM
     |                            |
     |                            \-<EM>,"*wow*",</EM>
     \-<TITLE>,"Parsing Example",</TITLE>
2.2 HTML Lexical Syntax
The syntax character set for all HTML documents is ISO-646 (@@ full
name). A minimally conforming HTML user agent must support the SGML
declaration in section 13@@, which specifies ISO Latin 1 (@@full
name) as the document character set; it may support other SGML
declarations, in particular, SGML declarations with other document
character sets.
A complete discussion of the mapping of a sequence of characters to
a sequence of tags and data is left to the SGML standard. This
section is only a summary.
2.2.1 Data Characters
Any sequence of characters that do not constitute markup (see
"Delimiter Recognition," section @@@ of the SGML standard) are
mapped directly to strings of data characters. Some markup also
maps to data character strings. Numeric character references also
map to single-character strings, via the document character
set. Each reference to one of the general entities defined in the
HTML DTD also maps to a single-character string.
For example,
abc&lt;def => "abc","<","def"
abc&#60;def => "abc","<","def"
Note that the terminating semicolon is only necessary when the
character following the reference would otherwise be recognized as
markup:
abc &lt def => "abc ","<"," def"
abc &#60 def => "abc ","<"," def"
And note that an ampersand is only recognized as markup when it
is followed by a letter or number:
abc & lt def => "abc & lt def"
abc & 60 def => "abc & 60 def"
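The reference rules above can be sketched in a few lines of Python. This is a rough illustration, not a conforming SGML entity manager: the ENTITIES table lists only four general entities (the HTML DTD defines more), the function name resolve_references is hypothetical, and numeric references are limited to the ISO-8859-1 range, whose character numbers coincide with Unicode code points.

```python
import re

# General entities assumed for this sketch; the HTML DTD defines more.
ENTITIES = {"lt": "<", "gt": ">", "amp": "&", "quot": '"'}

# "&" is treated as markup only when followed by a letter, or by "#" and
# a digit. The terminating semicolon is optional.
REFERENCE = re.compile(r"&(?:#([0-9]+)|([A-Za-z][A-Za-z0-9.-]*));?")


def resolve_references(text):
    def replace(match):
        number, name = match.groups()
        if number is not None:
            value = int(number)
            # Outside the document character set: leave the text untouched.
            return chr(value) if value < 256 else match.group(0)
        # Entity names are case sensitive; unknown names stay as data.
        return ENTITIES.get(name, match.group(0))

    return REFERENCE.sub(replace, text)


print(resolve_references("abc&lt;def"))
print(resolve_references("abc &#60 def"))
print(resolve_references("abc & lt def"))
```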
A useful technique for translating plain text to HTML is to replace
each '<', '&', and '>' by an entity reference or numeric character
reference as follows:
           ENTITY      NUMERIC
CHARACTER  REFERENCE   CHAR REF   CHARACTER DESCRIPTION
    &      &amp;       &#38;      Ampersand
    <      &lt;        &#60;      Less than
    >      &gt;        &#62;      Greater than
Note: There are SGML features, CDATA and RCDATA, to allow
most "<", ">", and "&" characters to be entered without the
use of entity references. Because these features tend to be
used and implemented inconsistently, and because they
require 8bit characters to represent non-ASCII characters,
they are not used in this version of the HTML DTD.
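The plain-text translation technique described above might be written as follows in Python. The function name escape_text is illustrative only; the one detail that matters is replacing "&" first, so that the ampersands introduced by the other replacements are not escaped a second time.

```python
def escape_text(text):
    """Replace '&', '<', and '>' in plain text by entity references."""
    # "&" must be handled first; otherwise "&lt;" would become "&amp;lt;".
    return (
        text.replace("&", "&amp;")
        .replace("<", "&lt;")
        .replace(">", "&gt;")
    )


print(escape_text("if a < b && b > c"))
```

Python's standard library offers the same transformation as html.escape(text, quote=False).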
2.2.1 Tags
Tags define the start and end of headings, paragraphs, lists,
character highlighting and links. Most HTML elements are identified
in a document as a start tag, which gives the element name and
attributes, followed by the content, followed by the end tag. Start
tags are delimited by "<" and ">"; end tags are delimited by "</"
and ">". An example is:
<H1>This is a Heading</H1>
Some elements only have a start tag without an end tag. For
example, to create a line break, you use the <BR> tag.
Additionally, the end tags of some other elements, such as
Paragraph (<P>), List Item (<LI>), Definition Term (<DT>), and
Definition Description (<DD>) elements, may be omitted.
The content of an element is a sequence of characters and nested
elements. Some elements, such as anchors, cannot be nested. Anchors
and character highlighting may be put inside other constructs. See
the HTML DTD for full details.
Note: The SGML declaration for HTML specifies SHORTTAG YES,
which means that there are other valid syntaxes for tags,
such as NET tags, "<EM/.../"; empty start tags, "<>"; and
empty end tags, "</>". Until support for these idioms is
widely deployed, their use is strongly discouraged.
2.2.2 Names
A name consists of a letter followed by up to 71 letters, digits,
periods, or hyphens. Element names are not case sensitive, but
entity names are. For example, <BLOCKQUOTE>, <BlockQuote>, and
<blockquote> are equivalent, whereas &amp; is different from &AMP;.
In a start tag, the element name must immediately follow the tag
open delimiter "<".
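As a small illustration of the name rules, the following Python sketch checks the syntax of a name and compares element names without regard to case. The helper names is_name and same_element_name are hypothetical.

```python
import re

# A letter followed by up to 71 letters, digits, periods, or hyphens.
NAME = re.compile(r"[A-Za-z][A-Za-z0-9.-]{0,71}")


def is_name(candidate):
    return NAME.fullmatch(candidate) is not None


def same_element_name(first, second):
    # Element names are not case sensitive; entity names are, so they
    # must be compared with plain equality instead.
    return first.upper() == second.upper()


print(is_name("BLOCKQUOTE"), is_name("1st"))
print(same_element_name("BlockQuote", "blockquote"))
```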
2.2.3 Attributes
In a start tag, white space and attributes are allowed between the
element name and the closing delimiter. An attribute typically
consists of an attribute name, an equal sign, and a value (although
some attributes may be just a value). White space is allowed around
the equal sign.
The value of the attribute may be either:
o A string literal, delimited by single quotes or double quotes
and not containing any occurrences of the delimiting character.
o A name token (a sequence of letters, digits, periods, or
hyphens)
In this example, A is the element name, HREF is the attribute name,
and http://host/dir/file.html is the attribute value:
<A HREF="http://host/dir/file.html">
Note: Some historical implementations consider any occurrence
of the ">" character to signal the end of a tag. For
compatibility with such implementations, when ">" appears in an
attribute value, it should be represented with a numeric
character reference, such as in: <IMG SRC="eq1.jpg" alt="a
&#62; b">
A useful technique for computing an attribute value literal for a
given string is to replace each quote and space character by an
entity reference or numeric character reference as follows:
           ENTITY      NUMERIC
CHARACTER  REFERENCE   CHAR REF   CHARACTER DESCRIPTION
   TAB                 &#9;       Tab
   LF                  &#10;      Line Feed
   CR                  &#13;      Carriage Return
                       &#32;      Space
    "      &quot;      &#34;      Quotation mark
    &      &amp;       &#38;      Ampersand
For example:
<IMG SRC="image.jpg" alt="First &quot;real&quot; example">
Note: Some historical implementations allow any character
except space or ">" in a name token. Attribute values must
be quoted only if they don't satisfy the syntax for a name
token.
Note that the SGML declaration in section 13.3 limits the length of
an attribute value to 1024 characters.
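A minimal Python sketch of computing an attribute value literal follows. Several choices here are assumptions rather than requirements of this specification: the helper name attribute_literal is hypothetical, ordinary spaces are left as they are (as in the example above), ">" is written as a numeric character reference for the benefit of historical implementations, and the 1024-character limit is applied to the escaped form.

```python
REPLACEMENTS = {
    "&": "&amp;",
    '"': "&quot;",
    ">": "&#62;",   # for implementations that end a tag at any ">"
    "\t": "&#9;",
    "\n": "&#10;",
    "\r": "&#13;",
}

MAX_LITERAL_LENGTH = 1024


def attribute_literal(value):
    """Return value as a double-quoted attribute value literal."""
    # Each character is mapped once, so "&" is never escaped twice.
    escaped = "".join(REPLACEMENTS.get(ch, ch) for ch in value)
    if len(escaped) > MAX_LITERAL_LENGTH:
        raise ValueError("attribute value literal is too long")
    return '"' + escaped + '"'


print("<IMG SRC=" + attribute_literal("image.jpg")
      + " alt=" + attribute_literal('First "real" example') + ">")
```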
Attributes with a declared value of NAME, such as ISMAP and
COMPACT, may be written using a minimized syntax. The markup:
<UL COMPACT="compact">
can be written using a minimized syntax:
<UL COMPACT>
Note: Some historical implementations only understand the
minimized syntax.
2.2.5 Comments
To include comments in an HTML document that will be eliminated in
the mapping to terminals, surround them with "<!--" and
"-->". After the comment delimiter, all text up to the next
occurrence of "-->" is ignored. Hence comments cannot be
nested. White space is allowed between the closing "--" and ">",
but not between the opening "<!" and "--".
For example:
<HEAD>
<TITLE>HTML Guide: Recommended Usage</TITLE>
<!-- $Id$ -->
</HEAD>
Note: Some historical HTML implementations incorrectly consider
any ">" character to be the termination of a comment.
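The comment rules can be expressed as a single regular expression. The Python sketch below is a simplification that handles only the common single-comment form described above; the name strip_comments is illustrative.

```python
import re

# "<!--", then everything up to the next "--" that is followed by
# optional white space and ">". No white space is allowed in "<!--".
COMMENT = re.compile(r"<!--.*?--\s*>", re.DOTALL)


def strip_comments(text):
    return COMMENT.sub("", text)


print(strip_comments("<HEAD><!-- $Id$ --></HEAD>"))
print(strip_comments("a<!-- x > y -- >b"))
```

A ">" inside the comment does not end it, in contrast to the historical behavior mentioned in the note above.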
2.3 Example HTML Document
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<HTML>
<!-- Here's a good place to put a comment. -->
<HEAD>
<TITLE>Structural Example</TITLE>
</HEAD><BODY>
<H1>First Header</H1>
<P>This is a paragraph in the example HTML file. Keep in mind
that the title does not appear in the document text, but that
the header (defined by H1) does.</P>
<OL>
<LI>First item in an ordered list.
<LI>Second item in an ordered list.
<UL COMPACT>
<LI> Note that lists can be nested;
<LI> Whitespace may be used to assist in reading the
HTML source.
</UL>
<LI>Third item in an ordered list.
</OL>
<P>This is an additional paragraph. Technically, end tags are
not required for paragraphs, although they are allowed. You can
include character highlighting in a paragraph. <EM>This sentence
of the paragraph is emphasized.</EM> Note that the &lt;/P&gt;
end tag has been omitted.
<P>
<IMG SRC ="triangle.xbm" alt="Warning:">
Be sure to read these <b>bold instructions</b>.
</BODY></HTML>
3. HTML as an Internet Media Type
An HTML user agent allows users to interact with resources which
have HTML representations. At a minimum, it must allow users to
examine and navigate the content of HTML Level 0 documents. Level 1
HTML user agents must be able to preserve all formatting distinctions
represented in an HTML Level 1 document, and be able to
simultaneously present resources referred to by IMG elements (they
may ignore some formatting distinctions or IMG resources at the
request of the user). Fully conforming HTML user agents, that is,
Level 2 HTML user agents, must support form entry and submission.
3.1 text/html media type
This specification defines the Internet Media Type [7] (formerly
referred to as the MIME Content Type [4]) called "text/html". The
following is to be registered with IANA [8].
Media Type name: text
Media subtype name: html
Required parameters: none
Optional parameters: level, version, charset
Encoding considerations: any encoding is allowed
Security considerations: [Section 14]
The optional parameters are defined as follows:
Level
The level parameter specifies the feature set used in the
document. The level is an integer number, implying that any
features of same or lower level may be present in the document.
Levels 0, 1 and 2 are defined by this specification. Level 2 is
the default.
Version
To help avoid future compatibility problems, the version
parameter may be used to give the version number of the
specification to which the document conforms. The version number
appears at the front of this document and within the public
identifier of the HTML DTD. This specification defines
version 2.0.
Charset
The charset parameter (as defined in section 7.1.1 of RFC 1521
[4]) may be given to specify the encoding used to represent the
HTML document as a sequence of octets. The default value is
outside the scope of this specification; but for example, the
default is US-ASCII in the context of MIME mail, and ISO-8859-1
in the context of HTTP.
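To illustrate how the optional parameters might be read from a content type value, here is a minimal Python sketch. It is deliberately simple: it does not implement the full MIME parameter grammar (for example, a ";" inside a quoted value is not handled), and the function name parse_html_media_type is hypothetical. The only default it applies is level 2, as stated above; no default charset is assumed.

```python
def parse_html_media_type(value):
    """Split a text/html content type value into its parameters."""
    media_type, *params = [part.strip() for part in value.split(";")]
    if media_type.lower() != "text/html":
        raise ValueError("not text/html: " + media_type)
    result = {"level": "2"}  # Level 2 is the default.
    for param in params:
        if not param:
            continue  # tolerate a trailing ";"
        name, _, val = param.partition("=")
        result[name.strip().lower()] = val.strip().strip('"')
    result["level"] = int(result["level"])
    return result


print(parse_html_media_type("text/html; level=1; version=2.0"))
print(parse_html_media_type("text/html; charset=ISO-8859-1"))
```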
3.2 HTML Document Representation
A message entity with a content type of "text/html" represents an HTML
document, consisting of a single text entity. The charset parameter
(whether implicit or explicit) identifies a character encoding. The
text entity consists of the characters determined by this character
encoding and the octets of the body of the message entity.
The SGML declaration of the document is a function of the charset
parameter. If the charset parameter is US-ASCII or ISO-8859-1, the
SGML declaration in section 13@@ applies. Other charset parameter
values are reserved for future use.
NOTE: A generalized convention for mapping charset parameter values
to SGML declarations is expected to be specified in a future
version of this specification.
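The step from the octets of a message body to a text entity can be sketched in Python as follows. The table SUPPORTED and the function decode_body are illustrative names, and rejecting every other charset value is one possible reading of "reserved for future use", not a requirement stated here.

```python
# charset parameter value (lower case) -> Python codec name
SUPPORTED = {"us-ascii": "ascii", "iso-8859-1": "latin-1"}


def decode_body(octets, charset):
    """Map the octets of a message body to a sequence of characters."""
    codec = SUPPORTED.get(charset.lower())
    if codec is None:
        raise ValueError("charset reserved for future use: " + charset)
    return octets.decode(codec)


print(decode_body(b"caf\xe9", "ISO-8859-1"))
```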
3.2.1 Conventional Handling of Undeclared Markup Errors
To facilitate experimentation and interoperability between
implementations of various versions of HTML, the installed base of
HTML user agents supports a superset of the HTML 2.0 language by
reducing it to HTML 2.0: markup in the form of a start tag or end
tag whose generic identifier is not declared is mapped to nothing
during tokenization. Undeclared attributes are treated similarly.
The entire attribute specification of an unknown attribute (i.e.,
the unknown attribute and its value, if any) should be ignored. On
the other hand, references to undeclared entities should be treated
as data characters.
For example:
<div class=chapter><h1>foo</h1><p>...</div>
=> <H1>,"foo",</H1>,<P>,"..."
xxx <P ID=z23> yyy
=> "xxx ",<P>," yyy"
Let &alpha; and &beta; be finite sets.
=> "Let &alpha; and &beta; be finite sets."
Support for notifying the user of such errors is encouraged.
Information providers should keep in mind that this convention is
not binding: unspecified behavior may result, as such markup is
not conforming to this specification.
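The convention for undeclared tags can be illustrated with a short Python sketch. It covers only the tag rule: undeclared attributes are not removed, the DECLARED set is a partial list chosen for this example, and the pattern assumes that no ">" appears inside an attribute value. The name drop_undeclared_tags is hypothetical.

```python
import re

# Partial list of declared element names, for illustration only.
DECLARED = {"HTML", "HEAD", "TITLE", "BODY", "H1", "H2", "P", "EM",
            "A", "IMG", "UL", "OL", "LI", "B"}

# A start tag or end tag; group 1 is the generic identifier.
TAG = re.compile(r"</?([A-Za-z][A-Za-z0-9.-]*)[^>]*>")


def drop_undeclared_tags(text, declared=DECLARED):
    def replace(match):
        # Tags with an undeclared generic identifier map to nothing.
        return match.group(0) if match.group(1).upper() in declared else ""

    return TAG.sub(replace, text)


print(drop_undeclared_tags("<div class=chapter><h1>foo</h1><p>...</div>"))
```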
3.2.2 Conventional Representation of Newlines
SGML specifies that a text entity is a sequence of records, each
beginning with a record start character and ending with a record
end character (characters numbered 10 and 13 respectively).
MIME specifies that a body of type text/* is a sequence of lines,
each terminated by CRLF, that is, octets 13, 10.
In practice, HTML documents are frequently represented and
transmitted using an end of line convention that depends on the
conventions of the source of the document; frequently, that
representation consists of CR only, LF only, or a CR LF
combination. Hence the decoding of the octets will often result in
a text entity with some missing record start and record end
characters.
Since there is no ambiguity, HTML user agents are encouraged to
infer the missing record start and end characters.
An HTML user agent should treat end of line in any of its
variations as a word space in all contexts except
preformatted text. Within preformatted text, an HTML user agent
should expect to treat any of the three common representations of
end-of-line as starting a new line.
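The recommended treatment of the three end-of-line representations might look like this in Python. The function name normalize_newlines and its preformatted flag are illustrative; the sketch does not collapse runs of white space, which a user agent would typically also do when rendering.

```python
import re

# CR LF is listed first so that it is matched as one end of line.
END_OF_LINE = re.compile(r"\r\n|\r|\n")


def normalize_newlines(text, preformatted=False):
    """Treat any end of line as a word space, or as a new line in
    preformatted text."""
    return END_OF_LINE.sub("\n" if preformatted else " ", text)


print(normalize_newlines("one\r\ntwo\rthree\nfour"))
```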
4. Document Structure
To identify information as an HTML document conforming to this
specification, each document should start with the prologue:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
Note: If the body of a text/html body part does not begin
with a document type declaration, an HTML user agent should
infer the above document type declaration.
HTML user agents are required to support the above document type
declaration and the following document type declarations; they are
not required to support any others.
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0 Level 0//EN">
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0 Level 1//EN">
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0 Level 2//EN">
In particular, they may support other formal public identifiers, or
other document types altogether. They may support an internal declaration
subset with supplemental entity, element, and other markup
declarations, or they may not.
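A user agent's handling of the prologue could be sketched as below. This Python fragment only extracts the formal public identifier with a regular expression; it does not parse an internal declaration subset, and the names public_identifier and is_required_doctype are hypothetical.

```python
import re

DEFAULT_ID = "-//IETF//DTD HTML 2.0//EN"
REQUIRED_IDS = {
    DEFAULT_ID,
    "-//IETF//DTD HTML 2.0 Level 0//EN",
    "-//IETF//DTD HTML 2.0 Level 1//EN",
    "-//IETF//DTD HTML 2.0 Level 2//EN",
}

DOCTYPE = re.compile(r'\s*<!DOCTYPE\s+HTML\s+PUBLIC\s+"([^"]*)"',
                     re.IGNORECASE)


def public_identifier(document):
    """Return the declared public identifier, inferring the default
    when the document does not begin with a declaration."""
    match = DOCTYPE.match(document)
    return match.group(1) if match else DEFAULT_ID


def is_required_doctype(document):
    return public_identifier(document) in REQUIRED_IDS


print(public_identifier("<TITLE>No prologue</TITLE>"))
```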
HTML documents may contain an <HTML> tag at the beginning
(immediately after the prologue) and an </HTML> tag at the end.
The HTML document element is organized as a head and a body, much
like a memo or a mail message. Within the head, you can specify the
title and other information about the document. Within the body,
you can structure text into paragraphs and lists, as well as
highlight phrases and create links, using HTML elements.
Note: Technically, the start and end tags for HTML, Head,
and Body elements are omissible; however, this is not
recommended since the head/body structure allows an
implementation to determine certain properties of a
document, such as the title, without parsing the entire
document.
4.1 HTML Document Element
<HTML> ... </HTML> Level 0
The HTML element contains the Head and Body elements.
4.2 Head
<HEAD> ... </HEAD> Level 0
The head of an HTML document is an unordered collection of
information about the document. It requires the Title element
between <HEAD> and </HEAD> tags:
<HEAD>
<TITLE>Introduction to HTML</TITLE>
</HEAD>
4.3 Body
<BODY> ... </BODY> Level 0
The Body element identifies the body component of an HTML document.
Specifically, the body of a document may contain links, text, and
formatting information within <BODY> and </BODY> tags.
5. Document Metainformation Elements
5.1 Title
<TITLE> ... </TITLE> Level 0
Every HTML document must contain a Title element. The title should
identify the contents of the document in a global context, and may
be used in history lists and as a label for the window displaying
the document. Unlike headings, titles are not rendered in the text
of a document itself.
The Title element must occur within the head of the document, and
must not contain anchors, paragraph tags, or highlighting. Only one
title is allowed in a document.
Note: The length of a title is not limited; however, long
titles may be truncated in some applications. To minimize
this possibility, titles should be fewer than 64 characters.
Also keep in mind that a short title, such as Introduction,
may be meaningless out of context. An example of a
meaningful title might be "Introduction to HTML Elements."
5.2 Base
<BASE> Level 0
The Base element allows the URL [2] of the document itself to be
recorded in situations in which the document may be read out of
context. URLs within the document may be in a "partial" form
relative to this base address [5].
The Base element has one attribute, HREF, which identifies the
absolute base URL.
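As an illustration of how a partial URL is combined with the base address, the following Python sketch relies on the standard library function urllib.parse.urljoin. The wrapper name resolve and the example addresses are assumptions made for this example.

```python
from urllib.parse import urljoin


def resolve(base_href, partial_url):
    """Resolve a partial URL against the HREF of the Base element."""
    return urljoin(base_href, partial_url)


BASE = "http://host/dir/file.html"
print(resolve(BASE, "other.html"))
print(resolve(BASE, "../img/a.gif"))
```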
5.3 Isindex
<ISINDEX> Level 0
The Isindex element tells the user agent that the document is an
index. This means that the reader may request a keyword search on
the resource by adding a question mark to the end of the document
address, followed by a list of keywords separated by plus signs.
The Isindex element is usually generated by the network server from
which the document was obtained via a URI. The server must have a
search engine that supports this feature for the resource. If the
document URI is unknown to the user agent, <isindex> must be
ignored.
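The construction of a keyword search address can be sketched as follows in Python. The function name isindex_query is hypothetical, and percent-encoding each keyword with urllib.parse.quote is an assumption of this sketch; the text above specifies only the question mark and the plus signs.

```python
from urllib.parse import quote


def isindex_query(document_address, keywords):
    """Append "?" and the keywords, separated by plus signs."""
    # Assumption: reserved characters inside a keyword are percent-encoded.
    encoded = [quote(keyword, safe="") for keyword in keywords]
    return document_address + "?" + "+".join(encoded)


print(isindex_query("http://host/index", ["markup", "language"]))
```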
5.4 Link
<LINK> Level 0
The Link element indicates a relationship between the document and
some other object. A document may have any number of Link elements.
The Link element is empty (does not have a closing tag), but takes
the same attributes as the Anchor element.
Typical uses are to indicate authorship, related indexes and
glossaries, older or more recent versions, etc. Links can indicate
a static tree structure in which the document was authored by
pointing to a "parent" and "next" and "previous" document, for
example.
Servers may also allow links to be added by those who do not have
the right to alter the body of a document.
5.5 Meta
<META> Level 0
The META element is used within the HEAD element to embed document
metainformation not defined by other HTML elements. META elements
can be extracted by servers and/or clients for use in identifying,
indexing, and cataloging specialized document metainformation.
Although it is generally preferable to use named elements which
have well-defined semantics for each type of metainformation (e.g.
TITLE), the META element is provided for situations where strict
SGML parsing is necessary and the local DTD is not extensible. HTML
user agents may use the META element's content if they recognize
and understand the semantics identified by the NAME or HTTP-EQUIV
attributes, and may treat the content as metainformation (and not
render it) even when they do not recognize the name.
In addition, HTTP servers may wish to read the content of the
document HEAD to generate header fields corresponding to any
elements defining a value for the attribute HTTP-EQUIV. Note,
however, that the method by which the server extracts document
metainformation is not part of this specification, nor can it be
assumed by authors that any given server will be capable of
extracting it. The META element only provides an extensible
mechanism for identifying and embedding document metainformation --
how it may be used is up to the individual server implementation
and the HTML user agent.
Attributes of the META element:
HTTP-EQUIV
This attribute binds the element to an HTTP header field. It
means that if you know the semantics of the HTTP header field
named by this attribute, then you can process the contents based
on a well-defined syntactic mapping, whether or not your DTD
tells you anything about it. HTTP header field names are not
case sensitive. If HTTP-EQUIV is not present, the NAME attribute
should be used to identify this metainformation, and the content
should not be used within an HTTP response header.
NAME
Metainformation name. If the NAME attribute is not present, the
name can be assumed to be equal to the value of HTTP-EQUIV.
CONTENT
The metainformation content to be associated with the given
name. If multiple META elements are provided with the same name,
their combined contents, concatenated as a comma-separated list,
form the value associated with that name.
Examples
If the document contains:
<META HTTP-EQUIV="Expires"
CONTENT="Tue, 04 Dec 1993 21:29:02 GMT">
<meta http-equiv="Keywords" CONTENT="Fred, Barney">
<META HTTP-EQUIV="Reply-to"
content="fielding@ics.uci.edu (Roy Fielding)">
then the server (if so configured) may include the following
headers:
Expires: Tue, 04 Dec 1993 21:29:02 GMT
Keywords: Fred, Barney
Reply-to: fielding@ics.uci.edu (Roy Fielding)
as part of the HTTP response to a GET or HEAD request for that
document.
When the HTTP-EQUIV attribute is not present, the server should not
generate an HTTP response header for the metainformation; e.g.,
<META NAME="IndexType" CONTENT="Service">
would never generate an HTTP response header, but would still allow
HTML user agents to identify and make use of that metainformation.
The Meta element should never be used to define information that
should be associated with an existing HTML element. An example of
an inappropriate use of the Meta element is:
<META NAME="Title" CONTENT="The Etymology of Dunsel">
Do not set HTTP-EQUIV to the name of a response header that should
normally be generated only by the HTTP server. Example names that
are inappropriate include "Server", "Date", and "Last-modified";
the exact list of inappropriate names depends on the particular
server implementation. We recommend that servers ignore any META
elements that specify HTTP equivalents equal (case-insensitively)
to their own reserved response headers.
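One way a server might derive header fields from META elements is sketched below, using Python's standard html.parser module. This specification does not define the extraction method, so the whole approach is an assumption: the class and function names are hypothetical, and the reserved-name set simply reuses the three example names given above.

```python
from html.parser import HTMLParser

# Assumed reserved names for illustration; the real list depends on the server.
RESERVED_HEADERS = {"server", "date", "last-modified"}


class MetaHeaderCollector(HTMLParser):
    """Collect CONTENT values of META elements that carry HTTP-EQUIV."""

    def __init__(self):
        super().__init__()
        # lowercased field name -> (name as first written, list of contents)
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("http-equiv")
        content = attributes.get("content")
        if not name or content is None:
            return  # NAME-only elements never produce a header
        key = name.lower()  # header field names are not case sensitive
        if key in RESERVED_HEADERS:
            return  # leave server-generated headers alone
        self.fields.setdefault(key, (name, []))[1].append(content)


def meta_to_headers(html_text):
    """Return (field name, value) pairs; repeated names are comma-joined."""
    collector = MetaHeaderCollector()
    collector.feed(html_text)
    collector.close()
    return [(name, ", ".join(values)) for name, values in collector.fields.values()]
```

Fields are returned in the order in which their names first appear in the document; META elements that have only a NAME attribute, or that name a reserved header, are skipped.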
5.6 Nextid
<NEXTID> Level 0
The Nextid element is a parameter read and generated by text
editing software to create unique identifiers. This tag takes a
single attribute, N, which is the next document-wide alphanumeric
identifier to be allocated, of the form z123:
<NEXTID N=Z27>
When modifying a document, existing anchor identifiers should not
be reused, as these identifiers may be referenced by other
documents. Human writers of HTML usually use mnemonic alphabetical
identifiers.
HTML user agents may ignore the Nextid element. Support for the
Nextid element does not impact HTML user agents in any way.
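An editing tool might allocate identifiers from the Nextid value along the following lines. This is only a sketch: the function name is hypothetical, and the assumption that an identifier consists of letters followed by decimal digits is inferred from the "z123" form above.

```python
import re


def allocate_identifier(next_id):
    """Return (identifier to use now, new value for the NEXTID N attribute)."""
    # Assumption: identifiers look like letters followed by digits, e.g. Z27.
    match = re.fullmatch(r"([A-Za-z]+)([0-9]+)", next_id)
    if match is None:
        raise ValueError("unexpected identifier form: " + repr(next_id))
    prefix, number = match.groups()
    # Keep the digit width so that z007 is followed by z008.
    following = str(int(number) + 1).zfill(len(number))
    return next_id, prefix + following
```

Because the counter only ever increases, identifiers that were allocated earlier are never handed out again.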
6. Data Characters
An HTML user agent should present the body of an HTML document as
a collection of typeset paragraphs and preformatted text. Except
for the PRE element, each block structuring element is regarded as
a paragraph by taking the data characters in its content and the
content of its descendant elements, concatenating them, and
splitting the result into words, separated by space, tab, or
record end characters (and perhaps hyphen characters). The
sequence of words is typeset as a paragraph by breaking it into
lines.
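The paragraph model described above can be approximated as follows. This is a minimal sketch assuming a fixed-width display: the function names and the default width are hypothetical, and Python's textwrap module stands in for a real typesetter.

```python
import re
import textwrap


def split_words(data):
    """Split data characters into words at space, tab, or record end."""
    return [word for word in re.split(r"[ \t\r\n]+", data) if word]


def typeset_paragraph(data, width=40):
    """Break the sequence of words into lines no wider than `width`."""
    return textwrap.wrap(" ".join(split_words(data)), width=width)
```

By default textwrap.wrap may also break a line after a hyphen inside a word, which corresponds loosely to the "perhaps hyphen characters" case.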
6.1 The ISO Latin 1 Character Repertoire
Conforming HTML user agents are required to support the US-ASCII
[10] or ISO-8859-1 [11] character encodings, and the ISO
Latin 1 document character set.
The character repertoire shared by these two is known as Latin
Alphabet No. 1, or simply Latin-1. Latin-1 includes characters
from most Western European languages, as well as a number of
control characters. Latin-1 also includes a non-breaking space, a
soft hyphen indicator, 93 graphical characters, 8 unassigned
characters, and 25 control characters.
NOTE: Use of the non-breaking space and soft hyphen indicator
characters is discouraged because support for them is not widely
deployed.
In SGML applications, the use of control characters is limited in
order to maximize the chance of successful interchange over
heterogeneous networks and operating systems. In HTML, only three
control characters are allowed: Horizontal Tab (HT, encoded as 9
decimal in US-ASCII and ISO-8859-1), Carriage Return, and Line Feed.
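A checker for the control character restriction might look like the sketch below. The function names are hypothetical, and the set of code positions treated as control characters (0 to 31, 127, and 128 to 159) is an assumption made for this example rather than a definition taken from this specification.

```python
# Horizontal Tab (9), Line Feed (10), and Carriage Return (13) are allowed.
ALLOWED_CONTROLS = {9, 10, 13}


def is_control(code):
    """Assumed definition: code positions 0-31, 127, and 128-159."""
    return code < 32 or 127 <= code <= 159


def disallowed_controls(text):
    """Return (offset, code) for every control character HTML does not allow."""
    return [
        (offset, ord(ch))
        for offset, ch in enumerate(text)
        if is_control(ord(ch)) and ord(ch) not in ALLOWED_CONTROLS
    ]
```

An empty result means the text contains no control characters other than the three that are allowed.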
The HTML DTD references the Added Latin 1 entity set, to allow
mnemonic representation of Latin 1 characters using only the widely
supported ASCII character repertoire. For example:
Kurt G&ouml;del was a famous logician and mathematician.
See Section 13.2 for a table of the "Added Latin 1" entities, and
Section 13.3 for a table of the characters of ISO-8859-1.
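The mapping between Latin 1 characters and their mnemonic entity names can be illustrated with Python's standard html module. This is a sketch only: the helper name is hypothetical, and the table html.entities.codepoint2name is assumed here to be a superset of the Added Latin 1 entity set (it also contains later additions).

```python
import html
from html.entities import codepoint2name


def to_ascii_entities(text):
    """Replace non-ASCII characters with named entities where one is known."""
    pieces = []
    for ch in text:
        code = ord(ch)
        if code > 127 and code in codepoint2name:
            pieces.append("&" + codepoint2name[code] + ";")
        else:
            pieces.append(ch)  # ASCII, or no known entity name: keep as-is
    return "".join(pieces)
```

The reverse direction is available as html.unescape, which turns an entity reference such as the one in G&ouml;del back into the single character it names.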
7. Data Elements
7.1 Line Break
<BR> Level 0
The Line Break element specifies a line break in a paragraph or
preformatted text section. The new line should be indented the
same as line-wrapped text.
Example of use:
<P> Pease porridge hot<BR>
Pease porridge cold<BR>
Pease porridge in the pot<BR>
Nine days old.
7.2 Horizontal Rule
<HR> Level 0
A Horizontal Rule element is a divider between sections of text,
such as a full-width horizontal rule or equivalent graphic.
Example of use:
<BODY>
...
<HR>
<ADDRESS>February 8, 1995, CERN</ADDRESS>
</BODY>
7.3 Image
<IMG> Level 0
The Image element is used to incorporate in-line graphics
(typically icons or small graphics) into an HTML document. This
element cannot be used for embedding other HTML text.