SGML Constructs

Selected Content

Entities — declaration and references

There are two types of entities: general entities and parameter entities. Both are declared in their respective entity declaration, and they are instantiated by means of an entity reference. There are also character references, which take fixed values (viz. the character which they refer to) and thus need no declaration.

General entities can be referenced anywhere in the document, whereas parameter entities may only be referenced inside markup declarations, which in practice mostly means inside the DTD.

Parsing of entities

Entities are normally replaced at each occurrence of their reference, and the replaced text is then parsed again as part of the SGML document. By using specific entity data types, this behavior can be altered. Entities can be stored externally and referred to via an external (public and/or system) identifier.

Entity declaration

General entity:
<!ENTITY name content >
<!ENTITY name PUBLIC "public-id" ["system-id"] >
<!ENTITY name data_type content >
<!ENTITY name brack_type content >
<!ENTITY name PUBLIC "public-id" ["system-id"] extent_type notation >
<!ENTITY name PUBLIC "public-id" ["system-id"] SUBDOC >

If no type is specified (first and second lines), the entity is fully parsed as SGML, i.e. it may contain elements and further entity references. Internal entities may carry a data type (data_type). Bracketed text entities (brack_type) supply only the inner text of a tag or declaration, and the surrounding delimiters are added when the entity is replaced. External entities can have a specific external entity type (extent_type), which requires a notation, or they may be a subdocument.
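
As a rough illustration, the declaration forms above might be filled in as follows. All entity names, identifiers and file names are made up for the example, and the notation gif is assumed to be declared only so that the NDATA entity has something to refer to. The last declaration is a parameter entity, marked by the % sign; it is not among the forms listed above and is shown only for comparison.

```sgml
<!-- Internal general entity, parsed as SGML; referenced as &corp; -->
<!ENTITY corp "Example Corporation" >

<!-- External general entity with a public and a system identifier -->
<!ENTITY chap1 PUBLIC "-//Example//TEXT Chapter One//EN" "chap1.sgm" >

<!-- Internal entity with a data type: the text is not scanned for markup -->
<!ENTITY ltdemo CDATA "<not a tag>" >

<!-- External data entity: needs a declared notation -->
<!NOTATION gif SYSTEM "viewer" >
<!ENTITY logo SYSTEM "logo.gif" NDATA gif >

<!-- Parameter entity: referenced as %inline; inside declarations -->
<!ENTITY % inline "em | strong | code" >
```

In the document instance, &corp; and &chap1; would then be replaced by their text and parsed, while &ltdemo; would appear as literal characters.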

Entity data types

Internal data types (data_type)

CDATA Character Data. The replacement text is inserted as plain character data and is not scanned for markup.
SDATA Specific Character Data. System-specific data that is passed on to the application instead of being parsed; the ISO character entity sets use it for special characters.
PI Processing Instruction. In the reference concrete syntax the reference acts like <?content>.

Bracketed text (brack_type)

STARTTAG Pastes the content as an SGML start tag, i.e. <content> in the reference concrete syntax.
ENDTAG Pastes the content as an SGML end tag, i.e. </content> in the reference concrete syntax.
MS Marked Section. Pastes the content as an SGML marked section. Example: <!ENTITY draft MS "TEMP [ temporary "> is replaced by <![TEMP [ temporary ]]>.
MD Markup Declaration. In the reference concrete syntax the reference acts like <!content>.

External entity types (extent_type)

NDATA Non-SGML Data. Needs a notation reference.
CDATA Character Data. The external text is treated as plain character data and is not scanned for markup. Needs a notation reference.
SDATA Specific Character Data. System-specific character data. Needs a notation reference.
SUBDOC SGML subdocument (only if the SUBDOC feature is enabled).
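
The effect of the internal and bracketed types can be sketched with a few declarations. The entity names below are hypothetical, and the comments describe the replacement in the reference concrete syntax. The SDATA line follows the pattern used by the ISO character entity sets.

```sgml
<!ENTITY secopen STARTTAG "section" >    <!-- &secopen; acts like <section> -->
<!ENTITY secend  ENDTAG   "section" >    <!-- &secend; acts like </section> -->
<!ENTITY pbreak  PI       "page-break" > <!-- &pbreak; acts like <?page-break> -->
<!ENTITY eacute  SDATA    "[eacute]" >   <!-- handed to the system as is -->
<!ENTITY amp     CDATA    "&" >          <!-- inserted as a literal ampersand -->
```

Because the delimiters are supplied by the entity type, a bracketed text entity stays valid even if the concrete syntax uses different delimiter characters.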

Marked Sections

SGML may contain marked sections:

<![ keyword [
...
]]>

keyword may be one of the following: (Note: all examples assume the reference concrete syntax.)

CDATA Character Data. The parser stops recognizing markup until the marked section close (]]>) is found; that is to say, content in a CDATA section remains unparsed.
RCDATA Replaceable Character Data. Same as CDATA, except that general entity references (&...;) and character references (&#...;) are still recognized and replaced.
IGNORE The section is ignored completely.
INCLUDE The section is included normally.
TEMP Temporary. Flags the section as one that is meant to be removed later; it is otherwise processed like an INCLUDE section.

While CDATA and RCDATA are useful for outputting markup ‘source’, the last three keywords allow conditional markup by means of parameter entities: the section is written as <![ %status; [ ... ]]>, and a declaration such as <!ENTITY % status 'IGNORE'> then determines how it is treated.
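
A minimal sketch of both uses, assuming the reference concrete syntax. The parameter entity draft and the element para are invented for the example. Changing the entity text from INCLUDE to IGNORE would drop the first section without touching the document body.

```sgml
<!-- In the DTD: the value is either INCLUDE or IGNORE -->
<!ENTITY % draft "INCLUDE" >

<!-- In the document: kept or dropped depending on %draft; -->
<![ %draft; [
<para>This remark only appears while the draft switch is on.</para>
]]>

<!-- In the document: the tag and the ampersand are shown literally -->
<![ CDATA [
Write <para> to start a paragraph and & to start an entity reference.
]]>
```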

Declared Content

Element types are declared by specifying the element type name and the element’s declared content in the following way:

<!ELEMENT type_name - - declared_content>

Declared content can be one of the following five:

EMPTY Element must not contain anything
ANY Element can contain any declared elements or PCDATA
CDATA Character Data; see the CDATA Marked Section above
RCDATA Replaceable Character Data; see the RCDATA Marked Section above
content-model A parentheses-delimited list of elements or nested content models; may include the #PCDATA primitive content token
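
The five kinds of declared content might look like this in a small DTD fragment. The element names are hypothetical. The two characters after the element name are the tag minimization indicators for the start tag and the end tag, where - means the tag is required and O means it may be omitted.

```sgml
<!ELEMENT br      - O EMPTY >                     <!-- no content, no end tag -->
<!ELEMENT note    - - ANY >                       <!-- any declared element or text -->
<!ELEMENT listing - - CDATA >                     <!-- content is not parsed -->
<!ELEMENT title   - - RCDATA >                    <!-- only references are parsed -->
<!ELEMENT section - - (title, (para | note)*) >   <!-- nested content model -->
<!ELEMENT para    - O (#PCDATA | br)* >           <!-- mixed content -->
```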

Attributes

Each element can have any number of attributes, and the permissible attributes for each element type are declared in an ATTLIST declaration:

<!ATTLIST element_type_name
          attrib_1_name value_type status_keyword
          ...
          attrib_n_name value_type status_keyword >

(Note that SGML, unlike XML, allows for element_type_name to be a list of element type names as in (etn_1|etn_2|...|etn_x).) The value type specifies which characters are permissible in the attribute value, and it must be one of the following:

Attribute value types

list A bracketed list of possible values (name tokens), as in (val1|val2|...)
CDATA Character Data. Any valid SGML characters; however, general entity references and character references are resolved.
ID A unique identifier for the element.
NOTATION This value type name must be followed by a bracketed list of (elsewhere) declared notations, as in NOTATION (not1|not2), and the permissible attribute values are precisely the notation identifiers from that list.
IDREF A reference to an ID, i.e. a name that is the unique identifier of some other element.
IDREFS Space-separated list of IDREFs.
ENTITY A currently declared (data or subdocument) entity name.
ENTITIES Space-separated list of ENTITY-s.
NAME A valid SGML name. (Must start with an alphabetic character but may contain digits.)
NAMES Space-separated list of NAMEs.
NMTOKEN A Name Token, which may only contain valid SGML name characters but may start with either a letter or a digit.
NMTOKENS Space-separated list of NMTOKENs.
NUMBER A number, i.e. a string of digits.
NUMBERS Space-separated list of NUMBERs.
NUTOKEN A Number Token, which may only contain valid SGML name characters but must start with a digit. Useful for dimensional quantities like 5px.
NUTOKENS Space-separated list of NUTOKENs.

The characters that are permissible for an SGML name are declared in the SGML declaration. Note that NAME and NUMBER, and also NUTOKEN, are rather specific, whereas NMTOKEN is a more general data type.
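
To make the value types more concrete, here is a sketch of an attribute list that uses several of them, followed by a start tag that would satisfy it. The element name image and all attribute names are invented for the illustration. The entity logo is assumed to be declared elsewhere as a data entity, and img2 and img3 are assumed to be IDs of other elements.

```sgml
<!ATTLIST image
          id     ID                  #REQUIRED
          align  (left|center|right) center
          width  NUTOKEN             #IMPLIED
          src    ENTITY              #REQUIRED
          alt    CDATA               #IMPLIED
          see    IDREFS              #IMPLIED >

<image id="img1" align="left" width="40em" src="logo"
       alt="Company logo" see="img2 img3">
```

Here 40em is acceptable as a NUTOKEN because it starts with a digit, whereas a value such as em40 would only fit NAME or NMTOKEN.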

Attribute status keywords / occurrence indicators

The status keyword that follows the attribute value type in the attribute declaration indicates whether the attribute is optional, required or has a default value. It can be one of the following:

value The default value for that attribute if no other value is specified. This must be a permissible string of characters. It cannot be specified for ID and NOTATION type attributes. It can be empty ("") only for CDATA type attributes.
#FIXED This reserved name is followed by a value, and the attribute always has this value. If the attribute is specified in the document, it must take this value.
#REQUIRED The attribute value must be specified in the document.
#IMPLIED The attribute is optional and can be omitted.
#CURRENT The attribute must be specified on the first occurrence of the element; on every subsequent occurrence it defaults to the most recently specified value.
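#CONREF Content Reference. The attribute is optional. If it is specified (typically as a reference to some other element, e.g. via an IDREF), the element must be empty and the application is expected to supply its content from the reference; if it is omitted, the element carries its own content. As an example, consider the declaration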
<!ATTLIST figure id ID #REQUIRED >
<!ELEMENT figref - O (#PCDATA) >
<!ATTLIST figref refid IDREF #CONREF >
for which the following two instances of a figref element are permissible: <figref refid="fig1"> or <figref>(see Figure 1)</figref>.
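
The behaviour of #CURRENT and #FIXED is perhaps easiest to see in a short sketch. The element item and its attributes are hypothetical, and the text of each item states what a parser would be expected to report for the lang attribute.

```sgml
<!ELEMENT item - O (#PCDATA) >
<!ATTLIST item
          lang    NAME   #CURRENT
          version CDATA  #FIXED "1.0"
          class   NAME   #IMPLIED >

<item lang="en">First item: lang has to be given here.
<item>Second item: lang is still en.
<item lang="de">Third item: lang is now de.
<item version="1.0">Fourth item: lang is still de.
```

Writing version="2.0" on any item would be an error, since a #FIXED attribute can only be given its declared value.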

Notations

When using external data, it is necessary to provide information about the type of data. This is done by means of a notation, which is declared in a notation declaration and used either in attributes or in data entities. The general usage of notations has been described above; here are some details.

The notation declaration looks like this:

<!NOTATION name SYSTEM ["system-id"] >
<!NOTATION name PUBLIC "public-id" ["system-id"] >

Once declared, a notation may also be given attributes, whose values are set when the notation is used in a data entity. These attributes may not be of type ENTITY, ENTITIES, ID, IDREF, IDREFS or NOTATION, and they may not use the status keywords #CURRENT or #CONREF. The attributes for a notation are declared like this:

<!ATTLIST #NOTATION notation_name
          attrib_name value_type status_keyword >

Again, notation_name may actually be a bracketed list of notation names, and of course there may be multiple lines of attributes. Here is a usage example:

<!-- Two notations handled by the same player -->
<!NOTATION oggfile SYSTEM "audioplayer.exe" >
<!NOTATION mp3file SYSTEM "audioplayer.exe" >
<!-- title is CDATA because its values contain spaces -->
<!ATTLIST #NOTATION (oggfile|mp3file)
          bitrate NUTOKEN #REQUIRED
          title   CDATA   "Audio File" >

<!-- Data entities that set the notation attributes in brackets -->
<!ENTITY opcred SYSTEM "opening_credits.ogg" NDATA oggfile [bitrate="192kbps" title="Opening Credits"] >
<!ENTITY clcred SYSTEM "closing_credits.mp3" NDATA mp3file [bitrate="192kbps" title="Closing Credits"] >
...
&opcred; ... &clcred;