These "discussions," published over the course of 1992, while the second version of HTML (the first to be documented) was in common use, show the directions HTML would take before finally maturing as HTML 1 in mid-1993. Except for the "HTML (extractions)" file, which I include only as an introduction and table of contents, I again include the raw text of each file and a pointer to where it currently resides on the web (in the W3C historical archives), and I also show the file as it should have displayed. In the case of the error files, I have attempted to guess how the errors were meant to be handled if encountered. To keep this file compliant with HTML 4.01 Strict, I have not included the actual errors in this file.
The WWW system uses marked-up text to represent a hypertext document for transmision over the network. The hypertext mark-up language is an SGML format. WWW parsers should ignore tags which they do not understand, and ignore attributes which they do not understand of tags which they do understand.
The following does not form part of the specifciation.
See also
www-talk from September to October 1991: Re: status. Re: X11 BROWSER for WWW
Date: Tue, 29 Oct 91 10:03:11 GMT+0100 From: timbl (Tim Berners-Lee) Message-Id: <9110290903.AA07413@ nxoc01.cern.ch > To: connolly@pixel.convex.com, www-talk Subject: Re: status. Re: X11 BROWSER for WWW Dan, > I've made some tangible progress on the X11 browser, so I though > I'd let you know. > ... > This code is not in any shape to distribute, or even show anybody. > But it works, and it's pretty speedy. That's enough to encourage me > to polish it off. Sounds like great progress! The TCL sounds interesting -- where did you get it? > [If you wan't my stuff, you'll have to be C++ capable. I can't > think in C any more. :-] Don't worry - we can handle C++, although for the line mode browser we wanted portability into places where C++ could not reach. That's why the common code (in WWW/Implementation) is all in C. Believe me, after writing the NeXT browser in Objective-C it was a wrench to conclude that it would have to be deobjectified. > If you could round up some info on exactly what I can expect to see > in an HTML file, and some idea of how you want it formatted [I have > the HTML doc and the LineMode browser, but if you've got time to > give me a little more info...] I'll be ready to tackle that pretty > soon. You ask for info on exactly what you can expect to find in an HTML file, but you've read the two HTML files about HTML. What is missing from there? Here is some discussion about the tags -- where it's not in http://info.cern.ch/hypertext/WWW/MarkUp/Tags.html I have updated that document now. Most of the tags are just style tags: this goes for the headings H1 to H6, the lists UL and OL with list elements LI, the glossary DL with elements DT and DD. <TITLE> ..<TITLE> is designed to be used for putting in the top banner of a window, or using as the window name. It also is what you would use in a history list. It shouldn't be displayed in the text itself, as usually there is a <H1> heading atteh top of the text anyway. 
A difference is that thet title is designed to make sense out of context, whereas the heading is within context. For example, a title might be "Formatting Characters for Printf -- C reference manual" whereas the heading may just be "Formatting characters". The base address tag is not used, nor is highlighting HP1 etc. Anchors are used! The REL attribute is NOT used. <ISINDEX> is sent by servers to indicate that they will accept a search given this document name plus keywords. It turns on a search panel when the document is the main window. An even better implementation would have a keyword field at the bottom of the text window if the document is a searchable index. That would make the document more self-contained as an item in the user's eyes, and reduce screen clutter. <NEXTID> can be ignored by browsers, only needed for editors. <XMP> and <LISTING> are used to indicate inserted literal text. To make life easier for those writing documents (and because we don't have entities in the code yet) they are special in that EVERYTHING is litteral text until the closing tag - so one can use XMP for giving examples of HTML for example. (We really need an escaping method - the next parser will have simpl entities like "<." for "<".) Within XMP or LISTING, newlines are significant (and mean "new line"!) <PLAINTEXT> is used to indicate that the rest of the file is in fact just ASCII. It turns off SGML parsing completely. It's a fudge for the moment, until we have the document format negociation. ______________________________________ Structure of documents: In writing a new generic parser, I wondered whether your text object will store the nested structure of a document. At the moment, the document is a linear sequence of styles: you can't have lists within lists, etc. Ideally, it would be able to handle this - although its more difficult for a human writer to handle when formatting the document. 
I would in fact prefer, instead of <H1>, <H2> etc for headings [those come from the AAP DTD] to have a nestable <SECTION>..</SECTION> element, and a generic <H>..</H> which at any level within the sections would produce the required level of heading. For a browser, it is quite satisfactory to flatten the structure back into a sequence of styles, but for an editor it isn't. Are you going to go for editing capability? Tim PS: Shall I put you on the www-talk list?
In this case, I have included only the raw text of the e-mail itself,
omitting the HTML 4.01 blog threading material at the top and bottom. I
have also corrected the link to point to the actual file intended, though
that file has since been updated and does not show what one would have
seen by following the link in October of 1991. The message shows that
some working draft of the description of HTML was in progress at that
point, which eventually became the oldest description now available,
dated November 13, 1992. Even here one sees Mr. Berners-Lee speculating
about a <SECTION> tag and a simple <H> tag meant to derive its level from
the <SECTION> nesting instead of from a number, as has always been used
for HTML headers. More unusual is the mention of the <OL> tag, which had
been eliminated the previous January and which he only allowed back in
due to pressure from Dan Connolly in 1993. Perhaps he was concerned with
compatibility with those ancient files that had it, or else meant it as a
friendly nod to Dan (to whom he was writing), who may always have
resented its removal from HTML. More unexpected still, he mentions a REL
attribute (presumably of <A>) instead of the TYPE attribute discussed
later, a kind of anticipation of the fact that TYPE would one day be
renamed REL. It is also interesting to note that <ISINDEX> was already
considered a going concern at this point (but only as something inserted
by a smart server), while <MENU> and <DIR> are not mentioned. Clearly
they had not yet been invented. See also what he thinks of <XMP>,
<LISTING>, and <PLAINTEXT>, and his desire even then to "fix" them with
"an escaping method." Finally, he mentions there being only "two HTML
files about HTML," so at this point the larger suite of files dated
November 13, 1992 was still in a much more preliminary form.
Letter_1 -- /Architecture
<TITLE>Letter_1 -- /Architecture</TITLE> <NEXTID 1> <XMP> Date: Thu, 4 Jun 92 00:59:21 +0200 From: jfg@dxcern.cern.ch (Jean Francois Groff) Sender: jfg@dxcern.cern.ch To: barker@www1.cern.ch Subject: forwarded message from Tim Berners-Lee ------- Start of forwarded message ------- Received: by dxmint.cern.ch (dxcern) (5.57/3.14) id AA27986; Wed, 3 Jun 92 16:56:29 +0200 Received: by nxoc01.cern.ch (NeXT-1.0 (From Sendmail 5.52)/NeXT-2.0) id AA08770; Wed, 3 Jun 92 16:55:12 MET DST Message-Id: <9206031455.AA08770@ nxoc01.cern.ch > From: timbl@nxoc01.cern.ch (Tim Berners-Lee) To: connolly@pixel.convex.com Cc: timbl@nxoc01.cern.ch, wei@xcf.berkeley.edu, www-bug@nxoc01.cern.ch Subject: Re: still no DTD, huh? Date: Wed, 3 Jun 92 16:55:12 MET DST Dan, taking your points in order before they pop off the screen. I agree, attribute values ought to be quoted unless they contain only sgml-nice characters. The www browers accept quotes or non-quoted values. It is a bug in the NeXT editor that it exploits this feature. B When we fix the NeXT editor then we will put the quotes in. All other p browsers use the SGML.c parser in the W3 dist which accept quotes. Yes, NEXTID will have to go. NEXTID will be anattibute of the documenmt. We proposed sorry propose 3 dcotypes, HTDOC, HTERR and HTFWD to be described in the DTD. These will be such that any extra tags they define, and structure, will be safeley ignored by old parsers. 3. Minimisation. This is copied from the BOOKMAKER style stuff. Basically, we use <P> as a paragraph separater rather than a paragraph begin or end. It can be regarded as a minimized paragraph element though. Its just that we actually parse it as an empty elemnt with no end tag. That's still valid SGML and you could write it in the DTD that way. <LI> always has an opener and never a closer. The same applies to <DD> and <DT>. Note that we have though made sure that the browser will ignore closers to these, so we could edfine teh DTD with them in and optional. 4. 
YEs, sections appeal to me too. Especially when making big HTML files out of lots of little ones. The effect of <SECTION> .. </SECTION> would be to demote all headings by one inside the section. I would be inclined then to have simpky a <HEADING> tag which would be equivalent to H0 and map onto H1 within a section, or Hn within n sections. The SGML parser can't generate this stuff, but the editors could derive it from the style information. We would have to introduce <SECTION> early on to get a transistion period. Then in HTML3 we would declare H2 etc obsolete. Pei Wei is maybe working on a DTD too and Carl Barker at CERN is defininbg new features of HTML needed by new features in the protocol (things like <BODY NOTATION=postscript> and suchlike). Some of htis is defined in a few "technical notes" linked to a listof technical notes linked to the W3 project page, if you want to see and comment. (Carl: you could take this message in text form and link it in too) Tim ________ Dan's message: >From connolly@pixel.convex.com Wed Jun 3 04:23:34 1992 Return-Path: <connolly@pixel.convex.com> Received: from dxmint.cern.ch by nxoc01.cern.ch (NeXT-1.0 (From Sendmail 5.52)/NeXT-2.0) id AA05562; Wed, 3 Jun 92 04:23:28 MET DST Received: by dxmint.cern.ch (dxcern) (5.57/3.14) id AA27281; Wed, 3 Jun 92 04:21:34 +0200 Received: from pixel.convex.com by convex.convex.com (5.64/1.35) id AA25114; Tue, 2 Jun 92 21:21:17 -0500 Received: from localhost by pixel.convex.com (5.64/1.28) id AA23193; Tue, 2 Jun 92 21:21:15 -0500 Message-Id: <9206030221.AA23193@pixel.convex.com> To: timbl@nxoc01.cern.ch Subject: still no DTD, huh? Date: Tue, 02 Jun 92 21:21:14 CDT From: Dan Connolly <connolly@pixel.convex.com> Status: R by the way... replying to an address you sent me doesn't work... - ------- Forwarded Message ----- Transcript of session follows ----- >>> RCPT To:<timbl@dxmint.cern.ch> <<< 550 <timbl@dxmint.cern.ch>... Addressee unknown 550 timbl@dxmint.cern.ch... 
User unknown ----- Unsent message follows ----- Date: Tue, 26 May 92 17:06:43 +0200 From: connolly (Dan Connolly) Message-Id: <9205261506.AA25934@connie.de.convex.com> To: timbl@dxmint.cern.ch Subject: still no DTD, huh? Cc: connolly@convex.com I just browsed the web, hoping to find a DTD for HTML. No such luck. One nifty part of the Chameleon project is an X windows grammar editor for developing context free grammars. It's a little clunky, but in addition to outputting editable Chameleon grammar files, it can write YACC specifications or !SGML DTD's! Finally! a simple DTD editor! Unfortunately, it doesn't support attributes, and I don't think the DTD's it creates have minimization, but it could certainly save a lot of time in creating a DTD! I'll see if I can prototype something when I get back. More later. Dan - ------- End of Forwarded Message Well, I've been attempting to prototype something with Devegram, the Integrated Chameleon Architecture's (ICA's) grammar editor. I messed around a while and had it write out an SGML DTD to play with. Unfortunately Devegram doesn't support many features of an SGML DTD which would be most convenient to describe HTML. So I've abandoned Devegram in favor of a text editor. But it did help with the initial prototype. Now for the REAL problems: HTML in its present form is very difficult to describe in SGML. I'm not experienced enough to say for sure, but I think it's impossible. The problems are mostly small and lexical in nature, but I'd say it's VERY important to make these changes NOW in order to be able to use SGML processing engines in WWW clients in the future. An SGML document consists of 3 parts: the declaration, the prologue, and the instance. The declaration lays the groundwork -- defines the encoding and interpretation of the character set(s), sets processing limits and bounds, and other lexical stuff. Applications generally use the default SGML declaration given in the standard. 
Each SGML parser has a declaration that declares its feature list and limits. If HTML cannot be described with the default SGML declaration, this will severely limit the usable parsers. (one exception is the NAMELEN limit: many parsers have a value higher than 8) The prologue (sometimes called the DTD, though there may be more than one DOCTYPE in the prologue) gives the structure of the document -- the basic grammar and entities and such. This varies from one application to another, but generally one SGML declaration and prologue is used throughout an application. For example, CALS specifies an SGML declaration and some DTD's. The AAP also has a DTD. The third part is the document instance. This is the part that varies from one document to another within an application domain. I'm trying to use the default SGML declaration and design a DTD such that all HTML files are instances of that DTD. - --- 1--- The first problem I've come accross is that HTML attribute values are not quoted. That is: <A NAME=2 HREF=http://crnvmc.cern.ch./WHO> yields sgmls: SGML error at ../../../WWW/WWW/LineMode/Defaults/default.html, line 8 at ":": Incorrect character in markup; markup terminated I don't know what the exact syntax of an SGML attribute is, but it's not the same as HTML's "everything up to the next space or >" syntax. - --- 2 --- Next, all attributes have names. So I can't figure out a way to parse <NEXTID 10> I could do <NEXTID n=10> - --- 3 --- The biggest problem is the somewhat random use of minimization. I can't seem to make SGML sense of it. More later. I don't have as much time as I thought to explain this. 
- --- 4 --- I'd also like to be able to add a little more structure than just a "big list of tags and text" to the documents like this: <HTML> <TITLE>foo</TITLE> <SECTION> <H1> header </H1> paragraph associated with above header <SUBSECTION> <H2> header </H2> stuff under H2 </SUBSECTION> </SECTION> </HTML> I can _almost_ get the SGML parser to infer the <SECTION> and </SECTION> tags, but not quite. More later. Dan ------- End of forwarded message ------- </XMP>
In this case, since there is nothing but <TITLE> and <XMP> tags, with all the
rest raw text, I have simply shown the raw text. One sees here some speculation
about introducing <SECTION> and <SUBSECTION> tags, and mention of adding an N
attribute to <NEXTID>. It is reassuring to see that they had the same problems
generating a definition in SGML as I have had with the first two versions of
HTML: attribute values that contain certain non-SGML characters (absolute URLs,
for example), the use of a bare number as the parameter of <NEXTID>, and the
question of whether certain tags (<P> and <LI>) would be separator tags or
container tags. At the time they did not seem to think it would do to make them
container tags, since closing tags had not been used, so the challenge was to
define them as separator tags. That problem was eventually solved, only for them
to become container tags by HTML 2. But the ability to define them as separator
tags did nevertheless serve as a precedent for such unary tags as <BR> and <HR>.
The idea of major document subdivisions, as illustrated here with the <SECTION>
and <SUBSECTION> tags, eventually resurfaces with the introduction of <DIV> in
HTML 3 and 3.2, albeit without the peculiar effects on the <Hn> header tags
proposed here. From this it is quite clear that attempts to define HTML in terms
of SGML constructs and lexical syntax were underway at least as early as May of
1992, and with them a clear intention that SGML might form the academic
foundation for HTML.
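The heading scheme Mr. Berners-Lee proposes in the letter above (a generic heading that is "H0" outside any section and "Hn within n sections") can be sketched as follows. This is only an illustration under my own assumptions: the function name heading_levels is hypothetical, the tag spellings <SECTION> and <HEADING> are taken from the letter, and nothing like this is known to have been implemented in the browsers of the time.

```python
import re

# Hypothetical sketch: matches only the two tags the proposal needs.
TOKEN = re.compile(r"<(/?)(SECTION|HEADING)>", re.IGNORECASE)


def heading_levels(markup):
    """Return (level, text) pairs, one per <HEADING>...</HEADING>.

    The level is the number of <SECTION> elements open at that point,
    so a heading outside every section gets level 0 ("H0").
    """
    depth = 0            # how many <SECTION> elements are currently open
    heading_start = None  # offset just past the last unclosed <HEADING>
    found = []
    for match in TOKEN.finditer(markup):
        closing = match.group(1) == "/"
        name = match.group(2).upper()
        if name == "SECTION":
            # A stray </SECTION> is ignored rather than going negative.
            depth = max(depth - 1, 0) if closing else depth + 1
        elif not closing:
            heading_start = match.end()
        elif heading_start is not None:
            text = markup[heading_start:match.start()].strip()
            found.append((depth, text))
            heading_start = None
    return found


sample = (
    "<HEADING>Top</HEADING>"
    "<SECTION><HEADING>A</HEADING>"
    "<SECTION><HEADING>B</HEADING></SECTION>"
    "</SECTION>"
)
print(heading_levels(sample))
```

A flat browser could then map each level back onto the existing numbered heading styles, which is the "flatten the structure back into a sequence of styles" step mentioned in the 1991 letter.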
HTML2 -- /MarkUp
Last Modified 10/6/92 by CTB
Updates To HTML
In order to improve the functionality of the World Wide Web, the HyperText Markup Language must be tidied up, to allow it to be processed by generic SGML engines, and not just the WWW one.
The updated HTML will have a greater structure than the original version, including a header section, separate from the body.
This header section will allow the following tags:
<KEYWORDS>...</KEYWORDS> <TITLE>...</TITLE> <NEXTID ID="NNN"> <ISINDEX>
In the body section, the following tags will be recognised:
<A NAME="XXX" HREF="XXX" TYPE="XXX">...</A> <PLAINTEXT> <LISTING>...</LISTING> <P> <H1>...</H1>, <H2>...</H2>, <H3>...</H3>, <H4>...</H4>, <H5>...<H5>, <H6>...<H6> <ADDRESS>...</ADDRESS> <DL><DT>...<DD>...</DL> <UL><LI>...</UL>
______________________________________________________ CTB
<TITLE>HTML2 -- /MarkUp</TITLE> <NEXTID 1> <ADDRESS>Last Modified 10/6/92 by CTB </ADDRESS> <H1>Updates To HTML</H1>In order to improve the functionality of the World Wide Web, the HyperText Markup Language must be tidied up, to allow it to be processed by generic SGML engines, and not just the WWW one. <P> The updated HTML will have a greater structure than the original version, including a header section, separate from the body.<P> This header section will allow the following tags: <XMP><KEYWORDS>...</KEYWORDS> <TITLE>...</TITLE> <NEXTID ID="NNN"> <ISINDEX> </XMP>In the body section, the following tags will be recognised: <XMP><A NAME="XXX" HREF="XXX" TYPE="XXX">...</A> <PLAINTEXT> <LISTING>...</LISTING> <P> <H1>...</H1>, <H2>...</H2>, <H3>...</H3>, <H4>...</H4>, <H5>...<H5>, <H6>...<H6> <ADDRESS>...</ADDRESS> <DL><DT>...<DD>...</DL> <UL><LI>...</UL> </XMP>______________________________________________________<A NAME=0 HREF=http://info.cern.ch/hypertext/WWW/People.html#11> CTB</A></A>
In the proposal suggested here, the dependence upon SGML is even more explicit. The HTML
tags listed pretty much comprise what was available at that time, with a few ideas
not yet realized. Here one finds the first and only mention of a proposed <KEYWORDS>
header section tag. This function would eventually surface as content of the <META> tag
introduced in HTML 2. The proposal also illustrates two attributes that did not yet
exist, TYPE for <A> and ID for <NEXTID>. Even months later, in November, the TYPE
attribute would still be under discussion as to what sorts of parameter values it might
take, so it obviously was merely a vague proposal at this point. The ID attribute would
be used in some of Dan Connolly's early drafts of a DTD, but N would be its name when
finally introduced. <XMP> is not listed since it was used for showing the tags and its
closing tag would have ruined the listing, but plainly it too was intended. Also missing
are <MENU>, <DIR>, and the <HPn> tags. Apparently these only emerged as this phase of
HTML was drawing to a close.
The following is the oldest surviving DTD, prepared by Dan Connolly,
showing some of the directions being taken. This file was modified in
August, but by November a later version of the DTD no doubt incorporated
the <TYPEWRITER> tag, of which he wrote profusely (and which he even
illustrated with a few working examples) at that time.
<!-- html.dtd - document type declaration subset for HyperText Markup Language as defined by the World Wide Web project. $Id: html.dtd,v 1.1 92/08/19 18:37:58 connolly Exp $ 15 Jul 92 by connolly@convex.com 6 Aug 92 revision: match HTML.c better 18 Aug 92 revision: FrameMaker integration See also: http://info.cern.ch/hypertext/WWW/MarkUp/Tags.html http://info.cern.ch/hypertext/WWW/MarkUp/HTML2.html --> <!-- Character entities --> <!-- I wonder if we could just use numeric character entities, as long as we're just referencing ASCII characters. That is, write D in stead of < --> <!ENTITY lt "<"> <!ENTITY gt ">"> <!ENTITY amp "&"> <!ENTITY bullet "·" -- @@@ NeXT only --> <!-- parameter entities (DTD macros) --> <!ENTITY % a.a "NAME CDATA #IMPLIED TYPE CDATA #IMPLIED HREF CDATA #IMPLIED"> <!ENTITY % a.list "COMPACT CDATA #IMPLIED"> <!ENTITY % heading "H1|H2|H3|H4|H5|H6" > <!ENTITY % list "UL|OL|DIR|MENU"> <!ENTITY % pass "P|#PCDATA" -- aka pass_character --> <!ENTITY % raw "XMP|LISTING"> <!-- PlainText is more than 8 characters, and changing the NAMELEN capacity involves using an SGML declaration different from the default, which is a hassle. Besides: the semantics of PlainText can't be captured by real SGML anyway. If we were willing to muck with NAMELEN, we could use the <PLAINTEXT> tag to mark the _end_ of the SGML document, and treat the rest of the data in the stream using normal plain text conventions. --> <!-- Document structure --> <!ELEMENT HTML O O ((TITLE? & NEXTID? & ISINDEX?), DOCUMENT)> <!ENTITY % body "%heading|%list|DL|%pass|%raw|address"> <!ELEMENT DOCUMENT O O ((%heading), (%body)+) +(A)> <!-- The DOCUMENT element is necessary to avoid mixed content in the HTML element. Mixed content and optional elements don't mix very well. BUT it introduces minimization into the HTML format. Hmm... 
--> <!ELEMENT TITLE - - (#PCDATA)> <!ELEMENT ISINDEX - O EMPTY > <!ELEMENT NEXTID - O EMPTY > <!ATTLIST NEXTID ID NUMBER #REQUIRED> <!-- as noted in Tags.html, the conventional <NEXTID 10> is illegal. Use <NEXTID ID=10> to comply with this DTD. --> <!ELEMENT ADDRESS - O (%pass)+> <!ELEMENT (%heading) - - (%pass)+ --Tags.html says titles should fit on one line, but the browser handles paragraph breaks inside headings gracefully. --> <!ELEMENT (%list) - - ((LI|%pass)+)> <!ATTLIST (%list) %a.list> <!ELEMENT DL - - (DT|DD|%pass)+> <!ATTLIST DL %a.list> <!ELEMENT (LI|DT|DD) - O EMPTY> <!ELEMENT A - - (%body)+> <!ATTLIST A %a.a; > <!ELEMENT P - O EMPTY> <!ELEMENT (%raw) - - CDATA> <!-- BUG: tags.html says that you can put anything but </XMP> in the text of an XMP element. SGML says that ETAGO, "</" ends a CDATA section. -->
This is the very oldest surviving HTML DTD. As its comments reveal, older
versions of the DTD were produced by him as early as mid-July. In this DTD
one sees several of the planned improvements to HTML, such as the TYPE
attribute of <A>, the ID attribute of <NEXTID>, and the character entities
for "<" and ">" and even one (never to be seen again) for a list bullet,
"·" (shows as ˇ), a clear reference to the ISO-8859-1 standard. Even this
early he is pushing for a reacceptance of <OL>, nearly a year before its
eventual return. Elements for <MENU>, <DIR>, and <ISINDEX> are also present
as if they had always been there. Note also the mention of the bug whereby
an <XMP> section would end with any end tag, according to proper SGML, while
HTML user agents do (and should) end it only with its own closing tag.
Notice also how, instead of calling the body <BODY>, it introduces a
different tag named <DOCUMENT>.
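The <XMP> bug noted in the DTD's final comment can be made concrete with a small sketch comparing the two termination rules. The function names are hypothetical, and the SGML rule is simplified under my own assumption that the end-tag-open delimiter "</" counts only when a letter follows it; the DTD comment itself states the rule more loosely.

```python
import re

# Browser rule described in Tags.html: only the element's own closing
# tag ends the literal text.
XMP_CLOSE = re.compile(r"</XMP>", re.IGNORECASE)

# Simplified SGML rule (an assumption of this sketch): any "</" followed
# by a letter ends CDATA content, whichever element it names.
ETAGO = re.compile(r"</[A-Za-z]")


def literal_text_html(text):
    """Literal text as a browser of the time would take it."""
    match = XMP_CLOSE.search(text)
    return text if match is None else text[:match.start()]


def literal_text_sgml(text):
    """Literal text as a generic SGML parser would take it."""
    match = ETAGO.search(text)
    return text if match is None else text[:match.start()]


body = "SGML tags look like <start> and </end>.</XMP>rest of file"
print(repr(literal_text_html(body)))
print(repr(literal_text_sgml(body)))
```

Under the first rule the whole example sentence survives as literal text; under the second it is cut short at "</end>", which is exactly the disagreement the DTD comment flags.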
HyperText Markup: Recommended Usage
These constructs should work even on pretty broken implementations.
Most text elements consist of a start tag, some content, and an end tag. A start tag is an identifier surrouded by angle brackets. An end tag is an open angle bracket, a slash, an identifier, and a close bracket.
An identifier should be a letter followed by up to 7 letters or numbers.
No spaces are allowed between the tag open bracket and the identifier. Space is allowed between the identifier and the close bracket.
Some elements are "empty" and consist of only a start tag.
Paragraphs are separated by the "P" element.
Six levels of headings are supported:
Unordered lists:
Ordered lists:
The address element indicates the author or source of the document.
DWC
Normal text is represented in HTML as parsed character data, #PCDATA. The characters '<', '>', and '&' should be represented as "<", ">", and "&" respectively, lest they be interpreted as markup. Lines should not exceed 72 characters. Line breaks have no significance except to separate words.
Sections of literal text are represented in HTML as replaceable character data. Line breaks are significant, and characters are rendered in a fixed-width font to preserve horizontal formatting.
This is literal text. THIS word should line up under THIS word. There should be exactly three blank lines between here and here.
The '&' character should be represented as "&". The character sequence "</" must be represented as "</". The character sequence "]]>" must represented as "]]>".
SGML tags look like <start> and </end>. The marked section close delimiter looks like ]]>. But ]] is just two close square brackets, and > is just a greater-than sign.
The TITLE element names the document. The content of the TITLE element is just character data, CDATA. It should be less than 72 characters, and it should contain no linebreaks, '<', '>', or '&' characters.
The ISINDEX tag appears at most one time, and it precedes all tags but TITLE and NEXTID.
Some elements have associated named attributes. The values of the attributes of an element are specified in its start tag.
Attribute values are represented as RCDATA surrounded by double quotes. The character '"' must be represented as """ in an attribute value literal. The NEXTID tag appears at most one time, after the title and before the text elements.
<!-- test.html $Id$ --> <TITLE>HyperText Markup: Recommended Usage</TITLE> <H1>Recommended HTML Usage</H1> These constructs should work even on pretty broken implementations. <H2>Text Elements</H2> Most text elements consist of a start tag, some content, <!-- comment foo --> and an end tag. A start tag is an identifier surrouded by angle <? processing instruction > brackets. An end tag is an open angle bracket, a slash, an identifier, and a close bracket. <P> An identifier should be a letter followed by up to 7 letters or numbers. <P> No spaces are allowed between the tag open bracket and the identifier. Space is allowed between the identifier and the close bracket. <P> Some elements are "empty" and consist of only a start tag. <P> Paragraphs are separated by the "P" element. <P> Six levels of headings are supported: <P> <H3>Level three heading</H3> <H4>Level four heading</H4> <H5>five</H5> <H6>six</H6> Unordered lists: <P> <UL> <LI> This is the first item of an unordered list. <LI> This is the second item. It's kinda long, and should wrap around on most screens. <P> <LI> This is the third item. It's only one paragraph, but it's got a paragraph tag at the end.<P> <LI> This is the fourth and final item. </UL> Ordered lists: <P> <oL> <LI> This is the first item of an unordered list. <LI> This is the second item. It's kinda long, and should wrap around on most screens. <LI> This is the third item -- you know, the one with the P element. <P> <LI> This is the fourth and final item. </oL> <DL> <DT> term <DD> definition <DT> another term <dd> and its definition </DL> The address element indicates the author or source of the document. <ADDRESS> DWC <P> connolly@convex.com </ADDRESS> <H2>Normal Text: PCDATA</H2> Normal text is represented in HTML as parsed character data, #PCDATA. The characters '<', '>', and '&' should be represented as "&lt;", "&gt;", and "&amp;" respectively, lest they be interpreted as markup. Lines should not exceed 72 characters. 
Line breaks have no significance except to separate words. <P> <H2>Literal Text: RCDATA</H2> Sections of literal text are represented in HTML as replaceable character data. Line breaks are significant, and characters are rendered in a fixed-width font to preserve horizontal formatting. <P> <XMP> This is literal text. THIS word should line up under THIS word. There should be exactly three blank lines between here and here. </XMP> The '&' character should be represented as "&amp;". The character sequence "</" must be represented as "&lt;/". The character sequence "]]>" must represented as "]]&gt;". <XMP> SGML tags look like <start> and </end>. The marked section close delimiter looks like ]]>. But ]] is just two close square brackets, and > is just a greater-than sign. </XMP> <H2>Document Description Elements</H2> The TITLE element names the document. The content of the TITLE element is just character data, CDATA. It should be less than 72 characters, and it should contain no linebreaks, '<', '>', or '&' characters. <P> The ISINDEX tag appears at most one time, and it precedes all tags but TITLE and NEXTID.<P> <H2>Elements with Attributes</H2> Some elements have associated named attributes. The values of the attributes of an element are specified in its start tag.<P> Attribute values are represented as RCDATA surrounded by double quotes. The character '"' must be represented as "&quot;" in an attribute value literal. The NEXTID tag appears at most one time, after the title and before the text elements.<P>
Dan Connolly hand-wrote this file while Tim Berners-Lee was updating his NeXT
HTML Editor. Notice that it does not use his <TYPEWRITER> tag but does exercise
the character entities. It also appears to be the first use of <OL> since the
tag was done away with in very early 1991.
The following group of files was prepared by Dan Connolly on November 30, 1992. They not only capture the transition HTML was in at that point, but also show clear evidence of having been generated by some HTML editor other than NeXT. By this point the next version of NeXT was already in use by Tim Berners-Lee, yet there is still much here in the way of SGML-friendly enhancements.
Hypertext Markup Language
The World Wide Web project involves the processing of structured hypertext documents by diverse systems around the globe. The hypertext documents are represented as marked up text.
The HyperText Markup Language is defined in terms of the ISO 8879:1986, Standard Generalized Markup Language (SGML). The SGML declaration and document type definition specify the syntax and structure of HTML.
This is intended as an introduction to the language and a guide to implementors. It does not comprise an integral part of the HTML specification.
Text and Markup is an introduction to SGML text and markup as it applies to HTML. It should prepare you to read the DTD.
The following sections describe the HyperText Markup language by example. They are organized in order of complexity, both for the human reader and the SGML processing application.
The libHTML software distribution provides the primitive SGML reading functions that you can use to build a conforming implementation.
This software is written in ANSI C (with some accomodataions for K&R compilers). It supports the lexical constructs demonstrated in HTML Extremes.
<TITLE>Hypertext Markup Language</TITLE> <H1>HyperText Markup Language</H1> <H2>A Language for Transmission of Global Hyperdocuments.</H2> <H3>Abstract</H3> The World Wide Web project involves the processing of structured hypertext documents by diverse systems around the globe. The hypertext documents are represented as marked up text.<P> <H2>Specification</H2> The HyperText Markup Language is defined in terms of the ISO 8879:1986, Standard Generalized Markup Language (SGML). The <A NAME=id8 HREF="html.dtd">SGML declaration and document type definition </A> specify the syntax and structure of HTML. <P> <H2>Implementors' Guide</H2> This is intended as an introduction to the language and a guide to implementors. It does not comprise an integral part of the HTML specification. <P> <H3>Introduction</H3> <A HREF="Text.html">Text and Markup</A> is an introduction to SGML text and markup as it applies to HTML. It should prepare you to read <A NAME=id10 HREF="html.dtd" content-type="text/plain">the DTD</A>. <P> <H3>HTML by Example</H3> The following sections describe the HyperText Markup language by example. They are organized in order of complexity, both for the human reader and the SGML processing application. <P> <DL> <DT><A NAME=id2 HREF="recommended.html">Recommended</A> <DD>Examples of how to write HTML that won't stress the processing software. Some things can't be done this way. <DT><A NAME=id3 HREF="complete.html">Complete</A> <DD>Examples of all the constructs necessary to produce HTML documents. <DT><A NAME=id4 HREF="tolerated.html">Tolerated</A> <DD>Examples of illegal constructs that are supported for historical reasons. <DT><A NAME=id6 HREF="deprecated.html">Deprecated</A> <DD>Some quirks; these are legal SGML, but they are likely to break existing implementations (including the sample). <DT><A NAME=id7 HREF="errors.html">Errors</A> <DD>These are just plain broken. Implementors should use these to bullet-proof their code. 
</DL> <H2>A Partial Implementation</H2> The <A NAME=id11 HREF="libHTML.tar.Z" content-type="application/octet-stream">libHTML software distribution</A> provides the primitive SGML reading functions that you can use to build a conforming implementation.<P> This software is written in ANSI C (with some accomodataions for K&R compilers). It supports the lexical constructs demonstrated in <A NAME=id12 HREF="supported.html">HTML Extremes</A>.
This file has no <NEXTID>, but it does have an attribute that is documented nowhere, namely content-type. When the TYPE attribute resurfaces in HTML 4 and 4.01, it is used in exactly the same manner as content-type is used here. Notice that all the NAME attributes of the <A> tags here begin with the letters "id", so they are no longer bare numbers. That is because in SGML, if NAME is to be truly a name and not merely a number (as NeXT had been generating during the previous phase of HTML), a bare number should no longer be accepted. Even so, the next version of NeXT would continue to generate purely numeric NAME values as before.
HTML Guide: Text and Markup
This part of the HTML reference is an explanation of SGML syntax as it applies to HTML. For lexical issues, the purpose is to take the standard and reduce it from the abstract system that is SGML to a concrete language, HTML. For structural issues, the purpose is to give you enough background to read the DTD.
An HTML document is a hierarchy of elements. Each element has a name, some attributes, and some content. Most elements are represented in the document as a start tag, which gives the name and attributes, followed by the content, followed by the end tag. For example:
<HTML> <TITLE> A sample HTML document </TITLE> <H1> An Example of Structure </H1> Here's a typical paragraph. <P> <UL> <LI> Item one has an <A NAME=anchor> anchor </A> <LI> Here's item two. </UL> </HTML>
Some elements (e.g. P, LI) are "empty." They have no content. They show up as just a start tag.
For the rest of the elements, the content is a sequence of data characters and nested elements. The content must match the element's model group from its declaration in the DTD.
Using the example from above, the content of the UL element is the sequence "LI, #PCDATA, A, LI, #PCDATA". This matches the model group from the UL element declaration: "(#PCDATA|LI|A)+".
An HTML document is like a text file, except that some of the characters are interpreted as markup, rather than document content. The following table lists the special character sequences that separate data from markup in an HTML document.
In the DTD, the symbol PCDATA stands for parsed character data, the normal text characters in an HTML document.
The text consists of a stream of lines. The division into lines has no significance apart from indicating a word end.
All of the SGML delimiters listed in the table of delimitersare recognized in PCDATA.
In the DTD, the symbol CDATA stands for character data, the text without markup in an SGML document. Only the end tag open delimiters is recognized in CDATA.
The characters in an SGML document are organized into a heirarchy of elements by the use of tags. Tags are set off from the data characters by angle brackets: '<' and '>'.
The element name immediately follows "<". Names consist of a letter followed by up to 33 letters, digits, periods, or hyphens. Names are not case sensitive.
Following the element name, whitespace and attributes are allowed. An attribute consists of a name, an equal sign, and a value. Spaces are allowed around the equal sign.
The value is either a token or a literal. A token is up to 34 letters, digits, periods, or dashes. Tokens are case sensitive.
A literal is a string surrounded by single quotes or a string surrounded by double quotes. Entity references are processed inside attribute values as inside PCDATA. The length of an attribute value (after entity processing) is limited to 1024 characters.
Each attribute has a type, which puts constraints on the values it can have. For example, the NAME attribute of the A element is an ID. An ID is a name that must be unique among all IDs in the document.
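The name and token rules above translate directly into patterns. The Python sketch below is an illustration only and is not part of the archived file; the function names are hypothetical, and it assumes a token must contain at least one character, which the text does not say.

```python
import re

# A name is a letter followed by up to 33 letters, digits, periods, or hyphens.
NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9.-]{0,33}")
# A token is up to 34 letters, digits, periods, or dashes (assumed non-empty).
TOKEN_RE = re.compile(r"[A-Za-z0-9.-]{1,34}")


def is_name(text: str) -> bool:
    """True if the whole string is a valid name under the rule above."""
    return NAME_RE.fullmatch(text) is not None


def is_token(text: str) -> bool:
    """True if the whole string is a valid token under the rule above."""
    return TOKEN_RE.fullmatch(text) is not None


def same_name(a: str, b: str) -> bool:
    """Names are not case sensitive, so compare them after upper-casing."""
    return is_name(a) and is_name(b) and a.upper() == b.upper()
```

Under these rules `id8` is both a valid name and a valid token, while a bare number such as `8` is a token but not a name.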
In order to include characters that would otherwise be parsed as markup, you can use entity references refer to some of characters.
An entity reference is an ampersand, followed by a name, followed by a semicolon. No spaces are allowed within an entity reference. For example:
This is how you include a <tag> as data.
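As a rough illustration of entity references (not part of the archived file), the following Python sketch replaces the five markup-related entities that the "Complete MarkUp Set" guide lists for the HTML DTD. The function name is hypothetical, the Latin-1 entity set is left out, and leaving unknown names untouched is a choice made for this sketch rather than anything the guide prescribes.

```python
import re

# The five entities listed in "A Complete MarkUp Set"; Latin-1 names are omitted.
ENTITIES = {"lt": "<", "gt": ">", "amp": "&", "quot": '"', "apos": "'"}

# An ampersand, a name, and a semicolon, with no spaces inside the reference.
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def replace_entities(text: str) -> str:
    """Replace known entity references; leave everything else as written."""
    def substitute(match: re.Match) -> str:
        # Unknown names are returned unchanged (an assumption of this sketch).
        return ENTITIES.get(match.group(1), match.group(0))

    return ENTITY_RE.sub(substitute, text)
```

Because the text is scanned once from left to right, `&amp;lt;` comes out as the five characters `&lt;` rather than being decoded a second time.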
Comment declarations can be used include information aimed at persons and tools that read the document in source form. This information will be ignored when the document is processed by an SGML parser.
Comments begin with the character sequence "<!--" and end with "--", which must be followed by '>'. (Technically, whitespace is allowed between the closing "--" and '>'.) They are only allowed in PCDATA.
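The comment syntax just described can be approximated with a single pattern. This Python sketch is illustrative and not part of the archived file; the function name is hypothetical, and it ignores two things a real SGML parser must handle: comment declarations holding several comments, and contexts such as CDATA where comments are not recognized.

```python
import re

# "<!--", the shortest possible run of text, "--", optional whitespace, ">".
COMMENT_RE = re.compile(r"<!--.*?--\s*>", re.DOTALL)


def strip_comments(text: str) -> str:
    """Remove simple comment declarations from a string of PCDATA."""
    return COMMENT_RE.sub("", text)
```

The `\s*` covers the technicality mentioned above, so `<!-- note -- >` is removed just as `<!-- note -->` is.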
<TITLE>HTML Guide: Text and Markup</TITLE> <H1>Text and Markup</H1> This part of <A NAME=id3 HREF="MarkUp.html">the HTML reference</A> is an explanation of SGML syntax as it applies to HTML. For lexical issues, the purpose is to take the standard and reduce it from the abstract system that is SGML to a concrete language, HTML. For structural issues, the purpose is to give you enough background to read <A NAME=id1 HREF="html.dtd">the DTD</A>. <P> <H2>Structured Text</H2> An HTML document is a hierarchy of elements. Each element has a name, some attributes, and some content. Most elements are represented in the document as a start tag, which gives the name and attributes, followed by the content, followed by the end tag. For example: <P> <TYPEWRITER> <HTML> <TITLE> A sample HTML document </TITLE> <H1> An Example of Structure </H1> Here's a typical paragraph. <P> <UL> <LI> Item one has an <A NAME=anchor> anchor </A> <LI> Here's item two. </UL> </HTML> </TYPEWRITER> Some elements (e.g. P, LI) are "empty." They have no content. They show up as just a start tag. <P> For the rest of the elements, the content is a sequence of data characters and nested elements. The content must match the element's model group from its declaration in <A NAME=id17 HREF="html.dtd">the DTD</A>.<P> Using the example from above, the content of the UL element is the sequence "LI, #PCDATA, A, LI, #PCDATA". This matches the model group from the UL element declaration: "(#PCDATA|LI|A)+". <H2>Parsing Content Into Data and Markup</H2> An HTML document is like a text file, except that some of the characters are interpreted as markup, rather than document content. The following table lists the special character sequences that separate data from markup in an HTML document. <H3><A NAME=delimiters>SGML delimiters</A></H3> <DL> <DT>CRO<DD>Character Reference Open: "&#", when followed by a letter or a digit, signals a character reference. SGML idioms include things like "&#168;" and "&#SPACE;". 
It is not used in HTML. <DT>ERO<DD>Entity Reference Open: "&", when followed by a letter, signals an <A NAME=id2 HREF="#Entities">entity reference</A>. <DT><A NAME=ETAGO>ETAGO</A><DD>End Tag Open: "</", when followed by a letter, signals an <A HREF="#Tags">end tag. </A> <DT>MDO<DD>Markup Declaration Open: "<!", when followed by a letter or "--" or "[", signals one of several SGML markup declarations. The only purpose it serves in HTML is to introduce <A NAME=id11 HREF="#Comments">comments</A>. <DT>MSC<DD>Marked Section Close: "]]", when followed by ">" signals the end of a marked section. While marked sections are not used by HTML, this sequence of characters is recognized and reported as an error by conforming SGML parsers. <DT>PIO<DD>Processing Instruction Open: "<?" signals a processing instruction. It is not used in HTML. <DT>STAGO<DD>Start Tag Open: "<", when followed by a letter, signals a <A HREF="#Tags">start tag</A>. </DL> <H3><A NAME=PCDATA>Normal Text: Parsed Character Data</A></H3> In <A NAME=id9 HREF="html.dtd">the DTD</A>, the symbol PCDATA stands for parsed character data, the normal text characters in an HTML document. <P> The text consists of a stream of lines. The division into lines has no significance apart from indicating a word end.<P> All of the SGML delimiters listed in <A NAME=id16 HREF="#delimiters">the table of delimiters</A>are recognized in PCDATA. <P> <H3><A NAME=CDATA>Raw Text: Character Data</A></H3> In <A NAME=id15 HREF="html.dtd">the DTD</A>, the symbol CDATA stands for character data, the text without markup in an SGML document. Only the end tag open <A NAME=id14 HREF="#delimiters">delimiters</A> is recognized in CDATA. <P> <H2><A NAME=Tags>Tags</A></H2> The characters in an SGML document are organized into a heirarchy of elements by the use of tags. Tags are set off from the data characters by angle brackets: '<' and '>'.<P> <H3>Names</H3> The element name immediately follows "<". 
Names consist of a letter followed by up to 33 letters, digits, periods, or hyphens. Names are not case sensitive.<P> <H3>Attributes</H3> Following the element name, whitespace and attributes are allowed. An attribute consists of a name, an equal sign, and a value. Spaces are allowed around the equal sign.<P> The value is either a token or a literal. A token is up to 34 letters, digits, periods, or dashes. Tokens are case sensitive.<P> A literal is a string surrounded by single quotes or a string surrounded by double quotes. Entity references are processed inside attribute values as inside PCDATA. The length of an attribute value (after entity processing) is limited to 1024 characters.<P> Each attribute has a type, which puts constraints on the values it can have. For example, the NAME attribute of the A element is an ID. An ID is a name that must be unique among all IDs in the document. <H2><A NAME=Entities>Entities</A></H2> In order to include characters that would otherwise be parsed as markup, you can use entity references refer to some of characters.<P> An entity reference is an ampersand, followed by a name, followed by a semicolon. No spaces are allowed within an entity reference. For example:<P> <XMP> This is how you include a &lt;tag&gt; as data. </XMP> <H2><A NAME=Comments>Comments</A></H2> Comment declarations can be used include information aimed at persons and tools that read the document in source form. This information will be ignored when the document is processed by an SGML parser.<P> Comments begin with the character sequence "<!--" and end with "--", which must be followed by '>'. (Technically, whitespace is allowed between the closing "--" and '>'.) They are only allowed in PCDATA.
This file actually uses a <TYPEWRITER> tag, not as a mere demonstration of the tag but in earnest. This tag is clearly the predecessor of the <PRE> tag, which is native to the version of HTML that Tim Berners-Lee had begun using only a few days before Dan Connolly posted these HTML directions discussions. In this particular instance of the <TYPEWRITER> tag, only the opening "<" characters of the enclosed content were replaced with "&lt;". The closing ">" characters were left as is. This is allowable, since it is the sequence of an opening bracket followed by a letter that signifies a tag; see the explanation of the "STAGO" SGML delimiter within this file. Most browsers are expected to be smart enough to recognize that a closing ">" not coming after the start of a tag is only the character itself, and therefore to be displayed as is. But it is still good practice to replace the closing ">" with "&gt;" wherever the closing bracket in a <PRE> section is meant simply to be displayed. For this display, <TYPEWRITER> has been replaced with <PRE> and the closing brackets modified.
Note here also some of the other SGML constructs, such as Markup Declaration Open ("MDO"), Marked Section Close ("MSC"), and Processing Instruction Open ("PIO"). Some of these SGML constructions are occasionally seen in HTML files, though even at this early period their use (apart from comment delimiters) is deprecated.
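The contextual rules in the delimiter table can be sketched in a few lines of Python. This is only an illustration, under the assumption that "letter" means an ASCII letter; the function name is hypothetical and nothing here comes from libHTML. It shows why a lone ">" or a "<" followed by a space is plain data.

```python
import string

LETTERS = set(string.ascii_letters)
DIGITS = set(string.digits)


def delimiter_at(text: str, i: int):
    """Name the SGML delimiter recognized at index i, or return None for data."""
    def char_at(offset: int) -> str:
        # One character at i + offset, or "" past the end of the text.
        return text[i + offset:i + offset + 1]

    if text.startswith("&#", i) and (char_at(2) in LETTERS or char_at(2) in DIGITS):
        return "CRO"    # character reference open
    if text.startswith("&", i) and char_at(1) in LETTERS:
        return "ERO"    # entity reference open
    if text.startswith("</", i) and char_at(2) in LETTERS:
        return "ETAGO"  # end tag open
    if text.startswith("<!", i) and (
        char_at(2) in LETTERS or text.startswith("--", i + 2) or char_at(2) == "["
    ):
        return "MDO"    # markup declaration open
    if text.startswith("]]>", i):
        return "MSC"    # marked section close
    if text.startswith("<?", i):
        return "PIO"    # processing instruction open
    if text.startswith("<", i) and char_at(1) in LETTERS:
        return "STAGO"  # start tag open
    return None
```

Note that ">" never appears as an opening delimiter in the table, which is why the sketch has no case for it outside of "]]>".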
HTML Guide: Recommended Usage
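This part of the HTML Reference shows recommended usage. These constructs are recommended because
They conform to the SGML definition of HTML
They are straightforward to implement
They work on most existing browsers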
This section contains many suggestions, rules of thumb, and the like. Where the suggestions are not equivalent to the DTD, the words "should," "may," etc. are linked to futher explanation.
An HTML document should start with a TITLE element.
If the document is searchable, an ISINDEX element should come next.
After any TITLE and ISINDEX elements comes the BODY, which should start with an H1 element, followed by other elements character data.
See also: tolerated structural errors, severe structural errors.
The TITLE element should identify the document in a fairly wide context. Its content should fit on one line: it should be less than 72 characters with no linebreaks. It should not contain any '<' characters.
The title may be used to identify the node in a history list, to label the window displaying the node, etc. It is not normally displayed in the text of a document itself. Contrast titles with headings .
The presence of the ISINDEX element indicates the document is searchable.
Within the content of these elements, the characters '<', '>', and '&' signal markup in many cases. They should be written as "<", ">", and "&" respectively, to prevent this.
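A minimal Python sketch of that escaping step, offered as an illustration rather than as part of the archived guide; the function name is hypothetical.

```python
def escape_markup(text: str) -> str:
    """Write '&', '<', and '>' as entity references so they are read as data."""
    # "&" must be handled first; otherwise the "&" introduced by "&lt;" and
    # "&gt;" would itself be escaped on a later pass.
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
```

For example, `escape_markup("<tag>")` yields `&lt;tag&gt;`.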
A span of text can be marked as an anchor. Anchors can be used as the source of a hypertext link:
Choose this to view a neighbor document.
... or as the destination:
See also: tolerated errors in anchors, severe errors in anchors.
Headings are used to break the body into sections and subsections. Several levels of headings are defined:
Text that isn't marked up as some other element forms a paragraph.
Normal paragraphs consist of text consisting of words, sentences, and other stuff. Line breaks have no significance except to separate words. This is still the first paragraph of this section.
This is the second paragraph. Paragraphs are separated by P elements. HTML is relatively flat, and paragraph breaks are not allowed inside lists, headers, anchors, etc.
The address element indicates the author or source of the document:
DWC connolly@convex.com
The TYPEWRITER element is used for characters that have already been formatted for a typewriter-like device. Markup is recognized in this element just as in the normal body paragraphs. But after processing tags and entity references, the data is displayed as on a typewriter, rather than using typesetting conventions.
Line breaks are significant, and characters are rendered in a fixed-width font to preserve horizontal formatting.
For example, a portion of a man page might look like:
NOTES cat is able to correctly access files larger that two giga- bytes in size. SEE ALSO cp(1), ex(1), more(1), pr(1), tail(1)
These elements are used when you want to type the characters into the source document and have them show up in the output just like you typed them.
These elements act much like the TYPEWRITER element, but because markup is not recognized in their content, some character sequences can't be represented (SGML end tags, for example.) On the other hand, you don't have to meticulously mark up all the special characters.
You can draw pictures /\ in example elements / \ see: \__/
This is literal text. THIS word should line up under THIS word. There should be exactly three blank lines between here and here.
These elements are the source of the most errors in HTML implementations. They should be used only for simple examples that don't contiain SGML markup constructs.
<TITLE>HTML Guide: Recommended Usage</TITLE> <H1>Recommended HTML Usage</H1> This part of <A HREF="MarkUp.html">the HTML Reference</A> shows recommended usage. These constructs are recommended because <UL> <LI>They conform to the SGML definition of HTML <LI>They are straightforward to implement <LI>They work on most existing browsers </UL> This section contains many suggestions, rules of thumb, and the like. Where the suggestions are not equivalent to <A NAME=id1 HREF="html.dtd">the DTD</A>, the words "should," "may," etc. are linked to futher explanation. <H2>Structure of an HTML document</H2> An HTML document <A NAME=id2 HREF="complete.html#structure">should</A> start with a <A HREF="#TITLE">TITLE element</A>.<P> If the document is searchable, an <A NAME=id3 HREF="#ISINDEX">ISINDEX</A> element should come next.<P> After any TITLE and ISINDEX elements comes the BODY, which <A NAME=id4 HREF="tolerated.html#id1">should</A> start with an H1 element, followed by other elements character data.<P> See also: <A NAME=id13 HREF="tolerated.html#structure">tolerated structural errors</A>, <A NAME=id14 HREF="errors.html#structure">severe structural errors</A>. <H2>Header Elements</H2> <H3><A NAME=TITLE>TITLE</A></H3> The TITLE element should identify the document in a fairly wide context. Its content should fit on one line: it should be less than 72 characters with no linebreaks. It <A NAME=id5 HREF="complete.html#TITLE">should </A> not contain any '<' characters. <P> The title may be used to identify the node in a history list, to label the window displaying the node, etc. It is not normally displayed in the text of a document itself. Contrast titles with <A HREF="#headings">headings </A>. <H3><A NAME=ISINDEX>ISINDEX</A></H3> The presence of the ISINDEX element indicates the document is searchable.<P> <H2>Body Elements</H2> Within the content of these elements, the characters '<', '>', and '&' signal markup <A NAME=id12 HREF="Text.html#PCDATA">in many cases</A>. 
They <A NAME=id6 HREF="supported.html#delimiters">should </A> be written as "&lt;", "&gt;", and "&amp;" respectively, to prevent this. <H3>Anchors</H3> A span of text can be marked as an anchor. Anchors can be used as the source of a hypertext link:<P> Choose <A HREF="tolerated.html">this</A> to view a neighbor document.<P> ... or as the destination: <P> <A NAME="Fred">Fred Flinstone</A><P> See also: <A NAME=id15 HREF="tolerated.html#A">tolerated errors in anchors</A>, <A NAME=id16 HREF="errors.html#a">severe errors in anchors</A>. <H3><A NAME=headings>Headings</A></H3> Headings are used to break the body into sections and subsections. Several levels of headings are defined: <P> <H4>Level four headings are for sub-sub-sub headings</H4> <H3>Paragraphs</H3> Text that isn't marked up as some other element forms a paragraph.<P> Normal paragraphs consist of text consisting of words, sentences, and other stuff. Line breaks have no significance except to separate words. This is still the first paragraph of this section. <P> This is the second paragraph. Paragraphs are separated by P elements. HTML is relatively flat, and paragraph breaks are not allowed inside lists, headers, anchors, etc.<P> <H3>Lists</H3> <UL> <LI>This is the first item of an unordered list. <LI>This is the second item. It's kinda long, and should wrap around on most screens. <LI>This is the third item. It's only one paragraph, but it's got a paragraph tag at the end. <LI>This is the fourth and final item. </UL> <!-- @@ link to unordered lists --> <H3>Glossaries</H3> <DL> <DT>term <DD> definition <DT>another term <DD> and its definition, which is long enough that it should wrap around on most screens. </DL> <H3>Address</H3> The address element indicates the author or source of the document:<P> <ADDRESS>DWC connolly@convex.com</ADDRESS> <H3>TYPEWRITER</H3> The TYPEWRITER element is used for characters that have already been formatted for a typewriter-like device. 
Markup is recognized in this element just as in the normal body paragraphs. But after processing tags and entity references, the data is displayed as on a typewriter, rather than using typesetting conventions. <P> Line breaks are significant, and characters are rendered in a fixed-width font to preserve horizontal formatting.<P> For example, a portion of a man page might look like:<P> <TYPEWRITER> NOTES cat is able to correctly access files larger that two giga- bytes in size. SEE ALSO <A NAME=id7 HREF="man:/1/cp">cp(1)</A>, <A NAME=id8 HREF="man:/1/ex">ex(1)</A>, <A NAME=id9 HREF="man:/1/more">more(1)</A>, <A NAME=id10 HREF="man:/1/pr">pr(1)</A>, <A NAME=id11 HREF="man:/1/tail">tail(1)</A> </TYPEWRITER> <!-- @@ highlighting, character-level elements: bold, italic, etc. --> <H2>Literal Text Elements</H2> <H3>XMP and LISTING</H3> These elements are used when you want to type the characters into the source document and have them show up in the output just like you typed them.<P> These elements act much like the TYPEWRITER element, but because markup is not recognized in their content, some character sequences can't be represented (SGML end tags, for example.) On the other hand, you don't have to meticulously mark up all the special characters. <XMP> You can draw pictures /\ in example elements / \ see: \__/ </XMP> <XMP>This is literal text. THIS word should line up under THIS word. There should be exactly three blank lines between here and here. </XMP> These elements are the source of the most errors in HTML implementations. They should be used only for simple examples that don't contiain SGML markup constructs.
Note here the explanation of <TYPEWRITER>, illustrating its novel ability to include links within its strictly formatted text, thus showcasing its superiority to <XMP> and <LISTING>. Though these latter two tags are still listed here among the "recommended" tags, their use for anything more than small and simple samples is already being discouraged. After all, one can show a closing </PRE> tag within a <PRE> section (by writing its opening bracket as an entity), whereas one cannot show a closing </XMP> tag within an <XMP> section, or a closing </LISTING> tag within a <LISTING> section. Indeed, in some user agents, anything that looks like any sort of closing tag ("ETAGO," End Tag Open) inside either an <XMP> or a <LISTING> section could terminate that section prematurely.
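The difference between the two behaviors can be sketched in Python. The function names are hypothetical and neither is taken from any actual browser: the first follows the strict reading quoted earlier, in which "</" followed by a letter ends the raw text, and the second assumes a lenient user agent that only stops at a literal </XMP>.

```python
import re

# Strict reading: raw text ends at the first "</" that is followed by a letter.
ETAGO_RE = re.compile(r"</(?=[A-Za-z])")
# Lenient reading (an assumption): only a literal </XMP>, in any case, ends it.
XMP_END_RE = re.compile(r"</XMP\s*>", re.IGNORECASE)


def raw_text_strict(source: str, start: int) -> str:
    """Return the XMP content beginning at start, ending at any ETAGO."""
    match = ETAGO_RE.search(source, start)
    end = match.start() if match else len(source)
    return source[start:end]


def raw_text_lenient(source: str, start: int) -> str:
    """Return the XMP content beginning at start, ending only at </XMP>."""
    match = XMP_END_RE.search(source, start)
    end = match.start() if match else len(source)
    return source[start:end]


sample = "<XMP>Write </B> to end bold.</XMP>"
content_start = len("<XMP>")
```

With this sample, the strict reading stops at `</B>` and keeps only `Write `, while the lenient reading keeps the whole sentence.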
HTML Guide: A Complete MarkUp Set
The recommended usage is incomplete; it only includes those constructs that are easy to implement and explain. This section discusses a few more constructs that allow you to do anything that can legally be done. There are constructs beyond these, but they can all be reduced to constructs shown here.
An HTML document is a header part followed by a BODY element.
The header part consists of the TITLE, ISINDEX, and NEXTID elements which each appear zero or one time in any order. (see ISINDEX test, no title test)
The BODY start and end tags may be omitted. They will be inferred by SGML parsers. "Recommended Usage" is an example of this. This entity is an example of explicitly including the BODY tags.
The PLAINTEXT tag signals the end of the HTML text entity, and the beginning of a non-SGML data entity. (The format of the data is governed by the MIME text/plain content type.)
See Also:
The title can have an '<' character, as long as it's not followed by a '/' and a letter. See the section on SGML delimiters in CDATA.
The normal text content of body elements may include several kinds of markup.
A comment that you shouldn't see: For copyrights, RCS keywords, etc.
processing instruction: lkjsdf If you've _got_ to stick TeX macros or something in there, use this. The sample implementation won't even tell you it's there, though.
Entity references are recognized in normal body elements (anyplace #PCDATA appears in the DTD) and attribute value literals. See the Entities section of "Text and Markup" for more details. The HTML DTD defines the following entities for characters that might otherwise be parsed as markup:
The HTML DTD references the public text "ISO 8879:1986//ENTITIES Added Latin 1//EN" to define entities for latin-1 characters, for example Gödel was a famous mathemetician.
In order to include quotes in the value of the content-type attribute, use """ and "'" entity references: link to SGMLS software distribution with fancy content-type attribute
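Section 7.9.3 of the SGML standard states
An attribute value literal is interpreted as an attribute value by replacing references within it, ignoring Ee and RS, and replacing RE or SEPCHAR with SPACE.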
For the SGML-impared, Ee is Entity End (like EOF); RS is '\n'; RE is '\r'; SEPCHAR is '\t' and SPACE is ' '.
Since to date there are no HTML attributes containing newlines or spaces, that is not much of an issue.
@@But replacement of literals is. For one thing, this creates an interaction between the syntax of URLs and SGML syntax. We could resolve this issue by removing '&' from the URL syntax.
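The whitespace half of the rule from section 7.9.3 (RS ignored, RE and SEPCHAR replaced with a space) can be sketched as follows. This Python illustration is not part of the archived file; the function name is hypothetical, the character codes follow the gloss above, and entity replacement and Ee handling are deliberately left out.

```python
def interpret_literal(literal: str) -> str:
    """Apply the RS / RE / SEPCHAR part of the attribute value literal rule."""
    out = []
    for ch in literal:
        if ch == "\n":        # RS is ignored
            continue
        if ch in "\r\t":      # RE or SEPCHAR is replaced with SPACE
            out.append(" ")
        else:
            out.append(ch)
    return "".join(out)
```

A value broken across two lines as `text/` and `plain` with only a `\n` between them therefore comes out as `text/plain`.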
Six levels of headings are defined:
Normal paragraphs consist of text consisting of words, sentences, and other stuff. Line breaks are not significant. This is still the first paragraph of this section.
Here's the second paragraph. It's long. It's only conventional and suggested that lines be less than 72 characters long. It's certainly not specified, defined, or required.
A P tag isn't needed between a paragraph and some other element, like a heading.
These are for things like lists of steps, where the order is significant.
Anything you could put on a typewriter (or an ASCII display device, more precicesly) can be represented in a TYPEWRITER element: Tags: <start> </end> Entity references: < & Tables made from tabs: col 1 col 2 col 3 col 4 1 3 4 2 3 4 1 2 3 4 Plus, you can use hypertext links. Linebreaks _are_ significant. There should be three blank lines from here to here.
The ASCII Horizontal Tab (HT) character should be interpreted as the smallest positive nonzero number of spaces which will leave the number of characters so far on the line as a multiple of 8. Its use is not recommended however.
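The tab rule amounts to advancing to the next multiple of 8 columns. A small Python sketch, offered as an illustration only (the function name is hypothetical and the archived file contains no such code):

```python
def expand_tabs(line: str, width: int = 8) -> str:
    """Expand each tab to the next multiple of width columns on one line."""
    out = []
    column = 0
    for ch in line:
        if ch == "\t":
            # Always at least 1 and at most width spaces.
            spaces = width - (column % width)
            out.append(" " * spaces)
            column += spaces
        else:
            out.append(ch)
            column += 1
    return "".join(out)
```

A tab at column 5 becomes three spaces, and a tab at column 8 becomes a full eight, since the number of spaces must be nonzero.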
Comment declaration as data follows: <!-- this would be a comment in PCDATA. It's data in RCDATA. --> Markup declaration as data follows: <!this would be an markup delcaration, which would be an error in PCDATA. It's data in RCDATA.> Start tag follows: <start> tags are fine! & as long as it's not followed by a letter or '#', it's fine! &# is even ok, unless it's followed by a letter or a number.
Tabs in XMP content:
This is literal text with tabs. THESE words should line up under THESE words.
<TITLE>HTML Guide: A Complete MarkUp Set</TITLE> <BODY><H1><A NAME=top>A Complete Set of Constructs</A></H1> The <A NAME=id1 HREF="recommended.html">recommended usage</A> is incomplete; it only includes those constructs that are easy to implement and explain. This section discusses a few more constructs that allow you to do anything that can legally be done. There are <A NAME=id2 HREF="supported.html">constructs beyond these</A>, but they can all be reduced to constructs shown here. <P> <H2><A NAME=structure>Document Structure</A></H2> An HTML document is a header part followed by a BODY element.<P> The header part consists of the TITLE, ISINDEX, and NEXTID elements which each appear zero or one time in any order. (see <A NAME=id3 HREF="structure1.html">ISINDEX test</A>, <A NAME=id5 HREF="structure2.html">no title test</A>) <P> The BODY start and end tags may be omitted. They will be inferred by SGML parsers. <A NAME=id4 HREF="recommended.html">"Recommended Usage"</A> is an example of this. This entity is an example of explicitly including the BODY tags. <P> The PLAINTEXT tag signals the end of the HTML text entity, and the beginning of a non-SGML data entity. (The format of the data is governed by the MIME text/plain content type.)<P> See Also:<UL> <LI><A HREF="structure3.html">plaintext at the beginning of a document</A> <LI><A HREF="structure4.html">plaintext at the beginning of the body</A> <LI><A HREF="structure5.html">plaintext after the body</A> <LI><A HREF="tolerated.html#id1">tolerated errors in structure</A>. <LI><A HREF="errors.html#structure">severe errors in structure</A>. </UL> <H2>Header Elements</H2> <H3><A NAME=TITLE>TITLE</A></H3> The title can have an '<' character, as long as it's not followed by a '/' and a letter. See <A NAME=id10 HREF="Text.html#CDATA">the section on SGML delimiters in CDATA</A>. 
<H2>Body Elements</H2> The normal text content of body elements may include several kinds of markup.<P> A comment that you shouldn't see: <!-- Your implementation is broken if you see this.--> For copyrights, RCS keywords, etc. <P> processing instruction: <?bold lkjsdf > If you've _got_ to stick TeX macros or something in there, use this. The sample implementation won't even tell you it's there, though.<P> <H3>Entity References</H3> Entity references are recognized in normal body elements (anyplace #PCDATA appears in the DTD) and attribute value literals. See <A NAME=id11 HREF="Text.html#Entities">the Entities section of "Text and Markup"</A> for more details. The HTML DTD defines the following entities for characters that might otherwise be parsed as markup: <P> <H4>HTML Entities</H4> <DL> <DT>Name <DD>Definition <DT>lt<DD>< <DT>gt<DD>> <DT>amp<DD>& <DT>quot"<DD>" <DT>apos<DD>' </DL> <P> <H4>ISO Latin-1 Characters</H4> The HTML DTD references the public text "ISO 8879:1986//ENTITIES Added Latin 1//EN" to define entities for latin-1 characters, for example Gödel was a famous mathemetician. <H2>Anchors</H2> <H3>Order and Apperance of Attributes</H3> <A HREF="#top">name implied</A><P> <A NAME=xyz>HREF implied</A><P> <A HREF="#top" NAME=xyz1>HREF before name</A><P> <H3>Quotes In Attribute Values</H3> In order to include quotes in the value of the content-type attribute, use "&quot;" and "&apos;" entity references: <A NAME=id13 HREF="ftp://ifi.uio.no/pub/SGML/SGMLS/sgmls-0.8.tar" content-type="application/x-tar; name="sgmls-0.8.tar"">link to SGMLS software distribution with fancy content-type attribute</A> <H4>Note: Interpretation of Literals</H4> Section 7.9.3 of the SGML standard states<P> <UL> <LI>An attribute value literal is interpreted as an attribute value by replacing references within it, ignoring Ee and RS, and replacing RE or SEPCHAR with SPACE. 
</UL> For the SGML-impared, Ee is Entity End (like EOF); RS is '\n'; RE is '\r'; SEPCHAR is '\t' and SPACE is ' '.<P> Since to date there are no HTML attributes containing newlines or spaces, that is not much of an issue.<P> @@But replacement of literals is. For one thing, this creates an interaction between the syntax of URLs and SGML syntax. We could resolve this issue by removing '&' from <A HREF="http://info.cern.ch/hypertext/WWW/Addressing/BNF.html#xalpha">the URL syntax</A> .<P> <H3>Headings</H3> Six levels of headings are defined: <P> <H4>Level four heading</H4> <h4>Another level four heading. It's long. It's only conventional and suggested that lines be less than 72 characters long. It's certainly not specified, defined, or required.</h4> <H5>Level five heading</H5> <H6>Level six heading</H6> <H3>Paragraphs</H3> Normal paragraphs consist of text consisting of words, sentences, and other stuff. Line breaks are not significant. This is still the first paragraph of this section. <P> Here's the second paragraph. It's long. It's only conventional and suggested that lines be less than 72 characters long. It's certainly not specified, defined, or required.<P> A P tag isn't needed between a paragraph and some other element, like a heading. <H3>Ordered lists</H3> These are for things like lists of steps, where the order is significant. <OL> <LI>This is the first item of an unordered list. <LI>This is the second item. It's kinda long, and should wrap around on most screens. <LI>This is the third item. <LI>This is the fourth and final item. 
</OL> <h3>Case of names is not significant: different cases</H3> <h3>Case of names is not significant: both lower case</h3> <H3>TYPEWRITER</H3> <TYPEWRITER>Anything you could put on a typewriter (or an ASCII display device, more precicesly) can be represented in a TYPEWRITER element: Tags: <start> </end> Entity references: &lt; &amp; Tables made from tabs: col 1 col 2 col 3 col 4 1 3 4 2 3 4 1 2 3 4 Plus, you can use <A NAME=id14 HREF="recommended.html">hypertext links.</A> Linebreaks _are_ significant. There should be three blank lines from here to here.</TYPEWRITER> The ASCII Horizontal Tab (HT) character should be interpreted as the smallest positive nonzero number of spaces which will leave the number of characters so far on the line as a multiple of 8. Its use is not recommended however.<P> <H2>Literal Text Elements</H2> <XMP> Comment declaration as data follows: <!-- this would be a comment in PCDATA. It's data in RCDATA. --> Markup declaration as data follows: <!this would be an markup delcaration, which would be an error in PCDATA. It's data in RCDATA.> Start tag follows: <start> tags are fine! & as long as it's not followed by a letter or '#', it's fine! &# is even ok, unless it's followed by a letter or a number. </XMP> Tabs in XMP content: <XMP> This is literal text with tabs. THESE words should line up under THESE words. </XMP> </BODY>
Dan Connolly here proposes the restoration of <OL>, not seen anywhere outside
his drafts for a DTD since the very first prototype HTML (dumped in
mid-January 1991). The use of entities, and especially the ISO-8859-1
character entities, is also advocated here, and illustrated with the "ö"
character (the "ö" in "Gödel"), thus at last beginning to take other languages
into account. The specific advantages of <PRE> and <XMP> are each showcased:
the one not only has links but can also (with character entities) show tags,
while the other shows everything in the text, including comments and SGML
markup constructs that would otherwise not show at all. Again we have here an
instance of the content-type attribute of the <A> tag, again being used as
TYPE would be used starting with HTML 4. This file also contains an instance
of a "processing instruction," an SGML construct that can go within an HTML
document. It starts with "<?name", where name is some instruction (usually
formatting) which the parser does not interpret but which is to be detected
and responded to by the user agent. In this example the name is "bold", so I
have treated the remaining contents of the processing instruction (up to, but
not including, its closing ">") as a bolded phrase, something not otherwise
possible in this early version of HTML.
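As a rough illustration of the processing instruction syntax just described ("<?name", then the contents, then a closing ">"), here is a minimal sketch that pulls such instructions out of a piece of markup or hides them. The pattern and both function names are assumptions made for this sketch; they are not taken from libHTML or any other implementation mentioned in these files.

```python
import re

# "<?" opens a processing instruction; the first ">" after it closes it.
# Group 1 is the instruction name, group 2 the remaining contents.
PI_PATTERN = re.compile(r"<\?([^\s>]+)\s*([^>]*)>")


def split_processing_instructions(text):
    """Return a list of (name, contents) pairs for each instruction found."""
    return [(m.group(1), m.group(2).strip()) for m in PI_PATTERN.finditer(text)]


def strip_processing_instructions(text):
    """Remove the instructions entirely, so they are never displayed."""
    return PI_PATTERN.sub("", text)
```

For the fragment "<?bold broken impl >", the first function yields the name "bold" and the contents "broken impl", which is the phrase I have shown in bold.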
HTML Guide: Tolerated Errors
These are illegal according to SGML, but they're so prevalent that they're supported by the sample implementation.
Please stop generating HTML in this style!
The BODY element must start with some element. See: an example document where this rule is broken. Paragraph breaks are not allowed in headers, lists etc. They may be ignored or treated intelligently.
with more than one paragraph
Tags that aren't known to the parser are treated as data by, for example, the MidasWWW-1.0 implementation. They should be ignored. There should be no tags around the word foo: foo.
Note that conforming SGML parsers will treat "&", "<", "</", and "<!" as normal text characters when they are not followed by a letter. HTML producers are discouraged from taking advantage of this feature.
This anchor's name starts with a digit, which is not a name start character.
This anchor's href contains a '#', which is not a name character. It should lead to the NeXT implementation reference below anyway. This anchor's href contains ':' and '/', which are not a name characters. It should lead to the SLAC MidasWWW doc anyway.
The original semantics of the XMP and LISTING elements is not representable in SGML. From Tags used in HTML:
But in section 7.6 of the SGML standard:
The XMP and LISTING elements are deprecated in favor of the TYPEWRITER element.
This example section ends here:
Even though the above ETAGO begins a markup error, this text is in a normal paragraph in conforming implementations.
Just in case the foo close tag above wasn't recognized:
The following systems are known to read and/or write HTML. They all have bugs.
<TITLE>HTML Guide: Tolerated Errors</TITLE> <H1>Tolerating broken HTML writers</H1> These are illegal according to SGML, but they're so prevalent that they're supported by the sample implementation.<P> Please stop generating HTML in this style!<P> <H2><A NAME=id1>Document Structure</A></H2> The BODY element must start with some element. See: <A HREF="error_data_starts_body.html">an example document where this rule is broken</A>. Paragraph breaks are not allowed in headers, lists etc. They may be ignored or treated intelligently. <UL> <LI>a list item<P> with more than one paragraph </UL> <H3>Muti-paragraph<P> heading</H3> <H3>Unknown Tags</H3> Tags that aren't known to the parser are treated as data by, for example, the MidasWWW-1.0 implementation. They should be ignored. There should be no tags around the word foo: <unknown>foo</unknown>. <H2>Body Elements</H2> Note that conforming SGML parsers will treat "&", "<", "</", and "<!" as normal text characters when they are not followed by a letter. HTML producers are discouraged from taking advantage of this feature.<P> <H3><A NAME=a>Anchors</A></H3> <H4>numeric IDs: <A HREF="#NeXT">NeXT</A> and <A HREF="html-mode">html-mode.el</A> </H4> <A NAME=10>This</A> anchor's name starts with a digit, which is not a name start character.<P> <H4>unquoted attribute literals: <A HREF="#NeXT">NeXT</A> and <A HREF="html-mode">html-mode.el</A> </H4> <A HREF=#NeXT>This anchor</A>'s href contains a '#', which is not a name character. It should lead to the NeXT implementation reference below anyway. <A HREF=http://slacvx.slac.stanford.edu:80/midaswww/v10/overview.html>This anchor</A>'s href contains ':' and '/', which are not a name characters. It should lead to the SLAC MidasWWW doc anyway.<P> <H2>Literal Text Elements</H2> <H4>Historical Note</H4> The original semantics of the XMP and LISTING elements is not representable in SGML. 
From <A HREF="http://info.cern.ch/hypertext/WWW/MarkUp/Tags.html">Tags used in HTML</A>: <P> <UL> <LI>The text may contain any ISO Latin printable characters, including the tag opener, so long as it does not contain the closing tag in full. </UL> But in section 7.6 of the SGML standard:<P> <UL> <LI>The content of an element declared to be character data or replaceable character data is terminated only by an etago delimiter-in-context (which need not open a valid end-tag) ... . </UL> The XMP and LISTING elements are deprecated in favor of the TYPEWRITER element. <H4>Non-standard CDATA parsing: LineMode, MidasWWW, etc.</H4> <XMP> This example section ends here: </foo . Even though the above ETAGO begins a markup error, this text is in a normal paragraph in conforming implementations.<P> <XMP> Just in case the foo close tag above wasn't recognized: </XMP> <H2>Known Implementations</H2> The following systems are known to read and/or write HTML. They all have bugs.<P> <DL> <DT>Linemode Browser 1.3c <DD> <DT>MidasWWW 1.0<DD> The MidasWWW parses HTML into its internal data structures, and then offers the option to extract the data and write it to a file. It doesn't get it right all the time. <DT><A NAME=NeXT>NeXT editor</A> <DD>From timbl@info.cern.ch <DT><A NAME=html-mode>html-mode.el</A> <DD>from marca@@@ <DT>Viola<DD> From Pei Wei @ O'Reilly (@@email address). Any known problems? I hear it's going to use <A NAME=id3 HREF="ftp://ftp.ifi.no/pub/text-processing/sgmls-0.8.tar" TYPE="application/x-tar">SGMLs</A>. <DT>www_and_frame<DD> @@Go get <A NAME=id5 HREF="ftp://info.cern.ch/pub/www/src/www_and_frame-0.3.tar.Z">The latest version</A> -- it should be current with this spec. <DT>perl client<DD> Just heard about it. haven't tried it. I don't think it supports entities. </DL>
Despite the recommendation of <XMP> seen in the files above, here its use is
explicitly deprecated in favor of <TYPEWRITER> (the later <PRE>). The fact
that anything that looks like a closing tag would (perhaps should) end the
literal text section is illustrated here as a prime reason that <TYPEWRITER>
should be used instead, since this kind of problem cannot occur there. Note
here the one time TYPE was used instead of content-type, but for the same
purpose, anticipating HTML 4.
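The Tolerated Errors file quotes two different rules for where <XMP> content ends: the original one (only the closing tag in full) and the one in section 7.6 of the SGML standard (any "</" followed by a letter). The difference can be sketched as below. Both function names are hypothetical, and the regular expressions are only my approximation of the two rules, not code from any browser of the time.

```python
import re


def xmp_end_original(text, start, tag="XMP"):
    """Original rule: content runs until the closing tag appears in full."""
    closing = re.compile(r"</%s\s*>" % re.escape(tag), re.IGNORECASE)
    m = closing.search(text, start)
    return m.start() if m else len(text)


def xmp_end_sgml(text, start):
    """SGML rule: content ends at the first "</" followed by a letter,
    whether or not it opens a valid end tag."""
    etago = re.compile(r"</(?=[A-Za-z])")
    m = etago.search(text, start)
    return m.start() if m else len(text)
```

Applied to the "</foo ." example in the file, the SGML rule ends the literal section at "</foo", while the original rule carries on to the real closing tag.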
HTML Guide: Obscure Usage
These SGML constructs are too messy to support even in the sample implementation. But they are implemented by, for example, the SGMLs parser by James Clark. It is in direct conflict with the SGML standard not to support these, but tough cookies.
newline foo. marked sections ignore:
marked sections cdata: hideous stuff: </HTML id=#foo>
<TITLE>HTML Guide: Obscure Usage</TITLE> <H1>Deprecated Usage</H1> These SGML constructs are too messy to support even in the sample implementation. But they are implemented by, for example, the SGMLs parser by James Clark. It is in direct conflict with the SGML standard not to support these, but tough cookies.<P> newline foo. marked sections ignore: <![IGNORE[ hideous stuff: </HTML id=#foo> ]]><P> marked sections cdata: <![CDATA[ hideous stuff: </HTML id=#foo> ]]><P> <dl<dt>untermiated end tag <DD>The start tag for this DL element is not terminated. By virtue of SHORTTAG YES in the SGML declaration, this is legal. </dl>
At this point, however, the only thing "officially" deprecated is the use of
non-HTML SGML constructs (other than <!-- Comments -->).
HTML Guide: Error Tests
These are just plain broken. They're not legal SGML, and I don't know of any implementations that support them.
Here's an anchor with a
paragraph break in it.
Sample anchor ID already in use I think this is a tag, since SHORTTAG YES is in the SGML decl: <>
<TITLE>HTML Guide: Error Tests</TITLE> <H1>Illegal constructs</H1> These are just plain broken. They're not legal SGML, and I don't know of any implementations that support them. <H2><A NAME=structure>Document structure</A></H2> Here's <A NAME=id6 HREF="recommended.html">an anchor with a <P> paragraph break</A> in it. <H3>broken <H4>headers</H3> busted headers</H4> <H2>Body Elements</H2> <H3>Anchors</H3> <A NAME=xyz>Sample anchor</A> <A NAME=xyz HREF="#xyz">ID already in use</A> I think this is a tag, since SHORTTAG YES is in the SGML decl: <> <P> <gggggreeeeeeeeeeeeeeeeaaatbiglongnameofatagthatsreallyjustjunk> </foo junk>
This file is only a demonstration of the degenerate sorts of things one might do if setting out to test the limits of user agents; it is purely a tool for bulletproofing code.
HTML Guide: MarkUp Supported by the Library
These are a little tricky, and might break some quick-and-dirty implementations. But they are parsed correctly by implementations based on libHTML.
These constructs are not recommended.
Character reference (not used in HTML): ' ' and È. And character from data: &, and from markup: &.
And-hash from data: &# and from markup &#.
Less-thans as data: < <1 <-)
Less-than-slash as data: </ </1 </-)
greater-than (pretty much always data): > abc> 0>
comment: The sample implementation groks.
comment w/space between -- and >:
marked section close without mdc: ]]. processing instruction: broken impl The sample implementation treats it as a processing instrcution, so you don't see it.
character references and entity references in attribute value literal
<TITLE>HTML Guide: MarkUp Supported by the Library</TITLE> <H1><A NAME=top>HTML Extremes</A></H1> These are a little tricky, and might break some quick-and-dirty implementations. But they are parsed correctly by implementations based on <A NAME=id1 HREF="libHTML.tar.Z" content-type="application/octet-stream">libHTML</A>.<P> These constructs are not recommended.<P> <H2>Document Structure</H2> <H4 >The tags for this element have spaces in them.</H4 > <h4>Another H4 Just in case it missed the close tag with spaces</H4> <H2>Header Elements</H2> <H2>Body Elements</H2> <H3><A NAME=delimiters>Delimiter Recognition</A></H3> Character reference (not used in HTML): '&#SPACE;' and È. And character from data: &, and from markup: &.<P> And-hash from data: &# and from markup &#.<P> Less-thans as data: < <1 <-)<P> Less-than-slash as data: </ </1 </-)<P> greater-than (pretty much always data): > abc> 0><P> comment: <!-- implementation is broken if this shows up--> The sample implementation groks.<P> comment w/space between -- and >: <!-- implementation is broken if this shows up -- ><P> marked section close without mdc: ]]. processing instruction: <?bold broken impl > The sample implementation treats it as a processing instrcution, so you don't see it.<P> <H2>Anchors</H2> <A HREF = "#top" NAME = xyz2>spaces around '='</A><P> <A HREF='spec.html'>single quoted value</A><P> <A HREF="system:cat&#SPACE;>&#SPACE;file">character references and entity references in attribute value literal</A><P> </BODY>
This file is only a demonstration of the degenerate sorts of things one might
do if setting out to test the limits of user agents; it is purely a tool for
bulletproofing code. And for the last time, there is one more instance of
content-type, this time for the libHTML link. This sample also includes yet
another example of the processing instruction, and again it starts with
<?bold, so again I have bolded its contents.
HTML tests: ISINDEX
This test is part of the complete HTML usage reference. It has to be a separate document because it's an example of document structure.
It contains a NEXTID element mostly for grins.
<TITLE>HTML tests: ISINDEX</TITLE> <NEXTID ID=id1> <ISINDEX> <H1>searchable document</H1> This test is part of <A NAME=id1 HREF="complete.html">the complete HTML usage</A> reference. It has to be a separate document because it's an example of document structure.<P> It contains a NEXTID element mostly for grins.<P>
This is the oldest surviving example of <ISINDEX>, and regrettably it is only
a non-functional demonstration file. It is not known, even roughly, how many
functional <ISINDEX> instances actually existed on the entire web at this
point in time. Probably not many, I would expect. Note here that the attribute
of <NEXTID> is not N but ID, and its value is text that also starts with
"id," consistent with the NAME values seen on nearly all the <A> tags in these
files by Dan Connolly. This example of <NEXTID> is plainly hand-entered, not
generated by any version of the NeXT Editor, and only included, as he puts it,
"mostly for grins." Even so, it is incorrect: its value should be one higher
than the highest NAME value in the text (here id2, since id1 is already in
use). This is clearly an accident, since he illustrates it correctly in
another file.
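The rule described above (a <NEXTID> value one higher than the highest NAME value in the text) is easy to check mechanically. The following is a minimal sketch under my own assumptions: that only the trailing digits of a NAME value count (so "id1", "z41" and "40" are read as 1, 41 and 40), and that NAME values without trailing digits are ignored. The function name is hypothetical and this is not how the NeXT Editor itself was written.

```python
import re

# NAME attribute of an <A> tag, quoted or unquoted.
NAME_ATTR = re.compile(r'<A\b[^>]*?\bNAME\s*=\s*"?([^\s">]+)', re.IGNORECASE)


def next_id_number(html):
    """Return one more than the highest number used in any anchor NAME."""
    numbers = []
    for value in NAME_ATTR.findall(html):
        digits = re.search(r"(\d+)$", value)  # trailing digits only (assumption)
        if digits:
            numbers.append(int(digits.group(1)))
    return max(numbers, default=0) + 1
```

For the ISINDEX file above, whose only anchor is NAME=id1, this gives 2, so the hand-entered ID=id1 would have needed to be id2.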
This test is part of the complete HTML usage reference. It has to be a separate document because it's an example of document structure.
<NEXTID ID=id1> <H1>document sans TITLE</H1> This test is part of <A NAME=id1 HREF="complete.html">the complete HTML usage</A> reference. It has to be a separate document because it's an example of document structure.<P>
This file is only a demonstration of the degenerate sorts of things one might
do if setting out to test the limits of user agents; it is purely a tool for
bulletproofing code. Again, the <NEXTID> value is wrong. It was only included
so as to have something of a "HEADER" section: the only other two tags
available, <ISINDEX> and <TITLE>, were not possible here, since the one would
have had other effects not wanted here and the other was specifically what was
to be absent for demonstration purposes.
You'll get an error if you feed this stuff to the SGML parser. It's not sgml. It's text/plain data.
<PLAINTEXT> You'll get an error if you feed this stuff to the SGML parser. It's not sgml. It's text/plain data.
This is yet another example of the degenerate sort of file one might make only
to test the limits of user agents, purely a tool for bulletproofing code. This
time there is only the <PLAINTEXT> tag, and all is raw text after that. One
might as well have used a plain ASCII text file and given it a .txt suffix.
HTML Tests: PLAINTEXT
You'll get an error if you feed this stuff to the SGML parser. It's not sgml. It's text/plain data.
<TITLE>HTML Tests: PLAINTEXT</TITLE> <PLAINTEXT> You'll get an error if you feed this stuff to the SGML parser. It's not sgml. It's text/plain data.
And here is a file much like the previous <PLAINTEXT> file, with only a
<TITLE> and then the <PLAINTEXT> portion and no body content; again an extreme
and degenerate case, for code bulletproofing only.
HTML Tests: PLAINTEXT
The SGML part of this document terminates at the PLAINTEXT tag. The rest is text/plain data. Don't feed it to the SGML parser.
You'll get an error if you feed this stuff to the SGML parser. It's not sgml. It's text/plain data.
<TITLE>HTML Tests: PLAINTEXT</TITLE> <H1>Plaintext Tests</H1> The SGML part of this document terminates at the PLAINTEXT tag. The rest is text/plain data. Don't feed it to the SGML parser. <PLAINTEXT> You'll get an error if you feed this stuff to the SGML parser. It's not sgml. It's text/plain data.
This is a demonstration of a nominal <PLAINTEXT> file, complete with <TITLE>
for the HEADER portion, a BODY portion started off with an <H1> header and
continuing with some standard BODY text, and finally a <PLAINTEXT> section
running to the end of the file.
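The behaviour these three <PLAINTEXT> files exercise (markup up to the tag, raw text/plain data from there to the end of the file) can be sketched as a simple split. This is only an illustration; the function name and the choice to return None when no <PLAINTEXT> tag is present are my own assumptions.

```python
import re

PLAINTEXT_TAG = re.compile(r"<PLAINTEXT\s*>", re.IGNORECASE)


def split_at_plaintext(document):
    """Split a document into (markup_part, plain_part).

    Everything before the first <PLAINTEXT> tag is markup for the parser;
    everything after it is raw text and must not be parsed at all.
    plain_part is None when the document has no <PLAINTEXT> tag.
    """
    m = PLAINTEXT_TAG.search(document)
    if m is None:
        return document, None
    return document[:m.start()], document[m.end():]
```

Only the first part would be handed to the SGML parser; anything tag-like in the second part is left exactly as written.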
HyperText Markup: Errors
It is an error to start the body of an HTML document with data.
This test is part of the complete HTML usage reference. It has to be a separate document because it's an example of document structure.
<TITLE>HyperText Markup: Errors</TITLE> <NEXTID ID=id2> It is an error to start the body of an HTML document with data. <H1>Error: Data not allowed at this point in BODY element</H1> This test is part of <A NAME=id1 HREF="complete.html">the complete HTML usage</A> reference. It has to be a separate document because it's an example of document structure.<P>
Under the HTML of this era, one of the main cues of the transition from HEADER
information to BODY information was the appearance of a distinctively BODY
tag such as <H1>. Once explicit <HEAD> (or <HEADER>) and <BODY> tags came into
existence, the distinction between HEADER and BODY became much more explicit
and, with it, more flexible, no longer requiring a heading (<H1>). In this
file, the <NEXTID> value is one higher than the highest id value seen in any
NAME in the text, so at last there is a correct example.
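To make the HEADER-to-BODY cue concrete, here is a minimal sketch of how a user agent of this era might have located the start of the BODY without explicit <HEAD> or <BODY> tags. It assumes that <TITLE>, <NEXTID> and <ISINDEX> are the only header tags (the three named in the commentary above) and that any text outside <TITLE> counts as body data. The names and the tag pattern are hypothetical, not taken from any browser source.

```python
import re

# Assumption: the only header-level tags of this era.
HEADER_TAGS = {"TITLE", "NEXTID", "ISINDEX"}
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")


def _data_offset(html, pos, end):
    """Offset of the first non-whitespace character in html[pos:end], or None."""
    gap = html[pos:end]
    stripped = gap.lstrip()
    return pos + len(gap) - len(stripped) if stripped else None


def find_body_start(html):
    """Return (offset, starts_with_data) for the point where BODY content begins."""
    pos = 0
    in_title = False
    for m in TAG.finditer(html):
        if not in_title:
            offset = _data_offset(html, pos, m.start())
            if offset is not None:
                return offset, True  # text outside TITLE: the body starts with data
        name = m.group(2).upper()
        if name not in HEADER_TAGS:
            return m.start(), False  # first non-header tag, such as <H1>
        if name == "TITLE":
            in_title = (m.group(1) == "")  # start tag opens the title, end tag closes it
        pos = m.end()
    offset = None if in_title else _data_offset(html, pos, len(html))
    return (offset, True) if offset is not None else (len(html), False)
```

The second value in the result flags the case the Errors file demonstrates, where the BODY begins with data rather than with an element.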
The following two files were prepared by Tim Berners-Lee using his revised
NeXT HTML Editor. Notice the use of quotation marks for most parameters and
the attribute N of the <NEXTID> tag, and even more important, the use of the
<HEADER> tag (soon replaced with <HEAD>).
Future plans for HTML
The HTML language has been in use in the field since 1990, and several suggestions have been made for improvements. See working notes . A new DTD will be the result.
Much of the HTML actually around has been generated by the NeXTStep editor, which has in fact generated bad HTML. This should not confuse the specification. Some bugs in that output include non-matching open and close tags, and a NEXTID tag which is not SGML. Also, attribute values are not quoted even when they contain characters which require them to be quoted in SGML.
A perl script was written by Dan Connolly to clean up bad HTML.
Also, see Dan's HTML spec (draft) which contains a sort of test suite.
Please mail me mentioning this list if you think of features I have missed out.
A wrapper element for all the document-wide information such as title, document-wide links, etc. Advantage: You know when you have got to the end of it, and can open a window with the required attributes. This is easier than checking for a printable character.
Disadvantage: If mandatory, the size of the minimum document is increased.
A "Body" tag might be useful in the same light, for the rest.
A document-wide link, as distinct from a localized anchor. Mainly useful in conjunction with interesting link types such as related-index, related-glossary, parent, author, print-with, copy-with, etc.
An empty element.
Atributes are as for the anchor element.
A tag giving the dates a document was created, modified and expired is going to be essential for caching systems.
The expiry date-time will allow long cache times for documents such as RFCs, and short or zero caching times for varing data.
<DATE CREATED="920630123067" EXPIRES="920706000000">
(Is there an SGML standard for datetimes? Which standard to use standard? HyTime?)
The HPx elements are not implemented. Some bold/italic/fixed width highlighting is useful, with equivalent representations on single font devices. Three possibilities are
The XMP and LISTING elements have proved essential for putting on line text already formatted assuming a fixed-width character set. Many people have asked for a version which, instead of being oblivious to any embedded elements, added elements, ang and anchors withing the text. Line end would have to be mareked as such (with P) so that marked-up a line could be represented on many lines: the markup could make it too long to send as it was, and very inconvenient.
Note that an editor could always save in this element something which was originally loaded as a raw text section: indeed, the raw text is really only a (very useful!) way of importing text which could also go though a filter to make it valid marked up SGML.
Very often one wants to quote a command in fixed width font, but indented as a quotation, say 40 characters wide rather than 80. Perhaps the width required should be a parameter to the fixed width with anchors element. (Smacks of low-level format!)
Perhaps the OL tag ought to go back in, to distinguish the ordered list from the unordered one. Dan Conolly implements it.
There is a list of link types . We should formalize these, and then people actually could implement them. This corresponds to giving values to the TYPE attribute . This attribute cohis attribute coEL for RELATIONSHIP to avoid confusion between the type of link and the type of object to which it points.
A full set of entities for specical charecters should be defined, picked out of a suitable standard table. This should allow for accented characeters and bullets as a minimum. Representation using regular USASCII stand-ins (such as oe for o umlaut) should be allowed where the full character sets are not available. Editors must preserve entities even when the display has defaulted to a stand-in character combination.
The ability to hide information in an SGML document is useful. The COMMENT entity was introduced for this purpose in the line mode browser as an experiment. It should go in as standard in future. If it can contain anything then it can be used for commenting things out.
Tim BL<HEADER> <TITLE>Future plans for HTML</TITLE> <NEXTID N="42"> </HEADER> <BODY> <H1>HTML directions</H1>The <A NAME=2 HREF="MarkUp.html">HTML language</A> has been in use in the field since 1990, and several suggestions have been made for improvements. See <A NAME=1 HREF="../WorkingNotes/Overview.html">working notes</A> . A new <A NAME=37 HREF="SGML.html#5">DTD</A> will be the result. <H2>Bad HTML</H2>Much of the HTML actually around has been generated by the NeXTStep editor, which has in fact generated bad HTML. This should not confuse the specification. Some bugs in that output include non-matching open and close tags, and a NEXTID tag which is not SGML. Also, attribute values are not quoted even when they contain characters which require them to be quoted in SGML.<P> A<A NAME=38 HREF="../Tools/HTMLGeneration/fix-html.pl"> perl script</A> was written by Dan Connolly to clean up bad HTML. <P> Also, see Dan's <A NAME=z41 HREF="Connolly/MarkUp.html">HTML spec</A> (draft) which contains a sort of test suite. <H2>New features</H2>Please mail me mentioning this list if you think of features I have missed out. <H2>Header</H2>A wrapper element for all the document-wide information such as title, document-wide links, etc. Advantage: You know when you have got to the end of it, and can open a window with the required attributes. This is easier than checking for a printable character.<P> Disadvantage: If mandatory, the size of the minimum document is increased.<P> A "Body" tag might be useful in the same light, for the rest. <H2>Link</H2>A document-wide link, as distinct from a localized anchor. Mainly useful in conjunction with interesting link types such as related-index, related-glossary, parent, author, print-with, copy-with, etc.<P> An empty element.<P> Atributes are as for the anchor element. 
<H2>Dates</H2>A tag giving the dates a document was created, modified and expired is going to be essential for caching systems.<P> The expiry date-time will allow long cache times for documents such as RFCs, and short or zero caching times for varing data.<P> <DATE CREATED="920630123067" EXPIRES="920706000000"><P> (Is there an SGML standard for datetimes? Which standard to use standard? HyTime?) <H2>Highlighting</H2>The HPx elements are not implemented. Some bold/italic/fixed width highlighting is useful, with equivalent representations on single font devices. Three possibilities are <DL> <DT>Numbered HPn tags <DD> These are rather meaningless. In practice, everyone has to remember which is bold and which is italic. <DT>Logical tags. <DD> Dan: "I'd prefer <em>, <tt>, <cite>, ala TeX. Or we could go with the O'Reilly/Hal DocBook tags: <Emphasis>, <OopsChar>, <wordasword>,<CiteBook>,<Subscript>, <Superscript>". A problem is there are never enough of them, so people reuse them on the understanding that they will be bold, etc. <DT>Physical tags: <DD> <Bold>, <italic> etc as in MIME. There would have to be an understanding that equivalent representations could be substituted where bold and italic are not available. </DL> <H2>Base address</H2> <DL> <DT>savedas <DD>Could be a name for the tag to give the address with which the document was saved, so that relative links could be resolved even when a document is found out of context (like mailed). </DL> <H2>Fixed width text with anchors etc</H2>The XMP and LISTING elements have proved essential for putting on line text already formatted assuming a fixed-width character set. Many people have asked for a version which, instead of being oblivious to any embedded elements, added elements, ang and anchors withing the text. 
Line end would have to be mareked as such (with P) so that marked-up a line could be represented on many lines: the markup could make it too long to send as it was, and very inconvenient.<P> Note that an editor could always save in this element something which was originally loaded as a raw text section: indeed, the raw text is really only a (very useful!) way of importing text which could also go though a filter to make it valid marked up SGML. <H2>Fixed width indented</H2>Very often one wants to quote a command in fixed width font, but indented as a quotation, say 40 characters wide rather than 80. Perhaps the width required should be a parameter to the fixed width with anchors element. (Smacks of low-level format!) <H2>Ordered list</H2>Perhaps the OL tag ought to go back in, to distinguish the ordered list from the unordered one. Dan Conolly implements it. <H2>Link types</H2>There is a <A NAME=39 HREF="../DesignIssues/LinkTypes.html">list of link types</A> . We should formalize these, and then people actually could implement them. This corresponds to giving values to the <A NAME=40 HREF="Tags.html#21">TYPE attribute</A> . This attribute cohis attribute coEL for RELATIONSHIP to avoid confusion between the type of link and the type of object to which it points. <H2>Entities</H2>A full set of entities for specical charecters should be defined, picked out of a suitable standard table. This should allow for accented characeters and bullets as a minimum. Representation using regular USASCII stand-ins (such as oe for o umlaut) should be allowed where the full character sets are not available. Editors must preserve entities even when the display has defaulted to a stand-in character combination. <H2>Comments</H2>The ability to hide information in an SGML document is useful. The COMMENT entity was introduced for this purpose in the line mode browser as an experiment. It should go in as standard in future. 
If it can contain anything then it can be used for commenting things out. <ADDRESS><A NAME=0 HREF="http://info.cern.ch./hypertext/TBL_Disclaimer.html">Tim BL</A></A> </ADDRESS></BODY>
This file shows many other new directions that HTML is taking. Tim Berners-Lee
here accuses the NeXT Editor of generating bad HTML, and he is correct:
witness those excess trailing </A> tags seen all over the place, the lack of
quotes around HREF values that include characters that trip up SGML parsing,
the use of a mere number as the "attribute" of <NEXTID>, and so forth. He then
talks about a <HEADER> tag, which this file itself already has, together with
the possibility of a <BODY> tag, which it also has. He next talks about
<LINK>, which would come about with the DTD that Dan Connolly would generate
in January 1993, and proposes a <DATE> tag, which would in time appear as a
possible set of parameters to a new tag introduced in HTML known as <META>. At
last the highlighted phrase tags begin to receive some attention, as he
considers implementing them not simply as <HPn> but by fleshing them out into
the various text formatting tags such as <EM> and <CITE>, as well as some
"physical tags" which would eventually lead to <B> and <I>. He talks about
including a base address (<BASE>) and a new kind of <XMP> tag which would
allow some HTML functions such as links or other embedded elements, and which
would eventually merge with Dan's <TYPEWRITER> tag to become <PRE>. He talks
about restoring <OL> and populating the TYPE attribute of <A> with possible
values, and finally about entities (leading to ISO-8859-1?) and comments,
which were already coming about straight from the SGML. All in all, this is
one of the most prophetic files in this entire suite!
HyperText Design Issues: Link types
See discussion of whether links should be typed .
Descriptive (normal) link types are mainly for the benefit of users and tracing, and graphics representation algorithms. Some link types for example express relationships between the things described by two nodes.
A Is part of B / B includes A
A Made B / B is made by A
A Uses B / B is used by A
A refers to B / B is referred to by A
These have a significance known to the system, and may be treated in special ways. Many of these relate whole nodes, rather than particular anchors within them. (See also multiended links and predicate logic) Suggestions:
The destination is the related index for a search by a user reading this document who asks for an index search function.
A document may have any number of index links, causing several indexes top be searched in a client-defined manner.
The destination of the link is an index which should be used to resiolve glossary queries in the document. (Typically, a double-clik on a word which is not within an anchor).
A document may have any number of glossary links.
The information in the destination node is additional to that in the source node, and may be viewed at the same time. It may be filtered out (as a function of author?).
Annotation is used by one person to write the equivalent of "margin notes" or other criticism on another's document, for example.
Tracing may ignore annotations when generating trees or sequences.
If this link is followed, the node at the end of it is embedded into the display of the source node. This is supported by Guide, but not many other systems. It is used, in effect, by those systems (VAX/notes under Decwindows, Microsoft Word) which allow "Outlining" -- expanding a tree bit by bit.
The browser has a more difficult job to do if this is supported.
This information can be used for protection, and informing authors of interest, for sending mail to authors, etc.
This information can be used for informing readers of changes.
version. This information will probably not be stored as nodes, but be generated from regular diff files. or some other delta method.
<HEADER> <TITLE>HyperText Design Issues: Link types</TITLE> <NEXTID N="5"> </HEADER> <BODY> <H1>Link Types</H1>See <A NAME=3 HREF="Topology.html#4">discussion of whether links should be typed</A> .<P> Descriptive (normal) link types are mainly for the benefit of users and tracing, and graphics representation algorithms. Some link types for example express relationships between the things described by two nodes.<P> A Is part of B / B includes A<P> A Made B / B is made by A<P> A Uses B / B is used by A<P> A refers to B / B is referred to by A <H2>Magic link types</H2>These have a significance known to the system, and may be treated in special ways. Many of these relate whole nodes, rather than particular anchors within them. (See also <A NAME=4 HREF="Topology.html#12">multiended links</A> and predicate logic) Suggestions: <H2>UseIndex</H3>The destination is the related index for a search by a user reading this document who asks for an index search function.<P> A document may have any number of index links, causing several indexes top be searched in a client-defined manner. <H2>UseGlossary</H3>The destination of the link is an index which should be used to resiolve glossary queries in the document. (Typically, a double-clik on a word which is not within an anchor).<P> A document may have any number of glossary links. <H2>Annotation</H3>The information in the destination node is additional to that in the source node, and may be viewed at the same time. It may be filtered out (as a function of author?).<P> Annotation is used by one person to write the equivalent of "margin notes" or other criticism on another's document, for example.<P> <A NAME=2 HREF="TracingLinks.html">Tracing</A> may ignore annotations when generating trees or sequences. <H2>Embedded information</H3>If this link is followed, the node at the end of it is embedded into the display of the source node. This is supported by Guide, but not many other systems. 
It is used, in effect, by those systems (VAX/notes under Decwindows, Microsoft Word) which allow "Outlining" -- expanding a tree bit by bit.<P> The browser has a more difficult job to do if this is supported. <H2>person described by node A is author of node B</H2>This information can be used for protection, and informing authors of interest, for sending mail to authors, etc. <H2>person described by node A is interested in node B</H2>This information can be used for informing readers of changes. <H2><A NAME=1>Node A is in fact a previous version of node B</A></H2> <H2>Node A is in fact a set of differences between B and its previous</H3>version. This information will probably not be stored as nodes, but be generated from regular diff files. or some other delta method.</BODY>
This file describes the suggested values that might be entered into the TYPE
attribute of <A> as obvious enhancements to the system. Looking at how vague
the descriptions are here, it is obvious that TYPE has not actually been used
as yet, hence my omission of TYPE in my listing of tags and attributes for the
first documented version of HTML. In time these relations would lay an early
basis for REL and its counterpart, REV, of <A> and <LINK>.
HTML Guide: Markup Supported by libHTML
These are a little tricky, and might break some quick-and-dirty implementations. But they are parsed correctly by implementations based on libHTML-930106.tar.Z, available from the WWW code archives. These constructs are not recommended.
Character reference: ' ' and È. And character from data: &, and from markup: &.
And-hash from data: and from markup &#.
Less-thans as data: < <1 <-)
Less-than-slash as data: 1 -)
greater-than (pretty much always data): > abc> 0>
comment: The sample implementation groks.
comment w/space between -- and >: