html Element
html :lang=(en) => TEXT
The element contains an HTML document as text (or, in practice, as CDATA). In some cases, the document starts with <html> and ends with </html>; in others the html element is implied. Generally the HTML includes a head element with a CSS stylesheet. The HTML body often begins with <BR>.
The HTML document uses only the following elements:
html
Sometimes, the document is enclosed with <html>…</html>.
br
The HTML body often begins with <BR> and may contain it elsewhere as well.
b
i
u
Styling.
font
The attributes face, color, and size are observed. The value of color takes one of the forms #rrggbb or rgb (r, g, b). The value of size is a number between 1 and 7, inclusive.
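The two color forms and the size range described for font can be checked with a small parser. The following is only a sketch: the function names are hypothetical, and it assumes that rgb components are decimal values from 0 to 255 and that the rgb keyword may appear in any letter case, neither of which is stated above.

```python
import re

# Hypothetical helpers; only the two value forms and the 1..7 range
# come from the description above.
_HEX_COLOR = re.compile(r"#([0-9a-fA-F]{2})([0-9a-fA-F]{2})([0-9a-fA-F]{2})")
_RGB_COLOR = re.compile(
    r"rgb\s*\(\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\)", re.IGNORECASE
)


def parse_font_color(value):
    """Return (r, g, b) for '#rrggbb' or 'rgb (r, g, b)', else None."""
    value = value.strip()
    m = _HEX_COLOR.fullmatch(value)
    if m:
        return tuple(int(part, 16) for part in m.groups())
    m = _RGB_COLOR.fullmatch(value)
    if m:
        rgb = tuple(int(part) for part in m.groups())
        # Assumption: each component must fit in 0..255.
        return rgb if all(c <= 255 for c in rgb) else None
    return None


def parse_font_size(value):
    """Return the size as an int in 1..7, else None."""
    value = value.strip()
    return int(value) if re.fullmatch(r"[1-7]", value) else None
```

Returning None for anything outside the two documented forms lets a caller fall back to a default rather than guess at an unobserved syntax.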
The CSS in the corpus is simple. To understand it, a parser only needs to be able to skip white space, <!--, and -->, and parse style only for p elements. Only the following properties matter:
color
In the form rrggbb, e.g. 000000, with no leading ‘#’.
font-weight
Either bold or normal.
font-style
Either italic or normal.
text-decoration
Either underline or normal.
font-family
A font name, commonly Monospaced or SansSerif.
font-size
Values claim to be in points, e.g. 14pt, but the values are actually in “device-independent pixels” (px), at 96 per inch.
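A parser along the lines described above might look like the following sketch. The names parse_p_style and font_size_in_points are hypothetical, and the code assumes the usual CSS shape of "selector { name: value; ... }" rules, which the text implies but does not spell out. The font-size conversion relies on a true point being 1/72 inch while the stored values are pixels at 96 per inch.

```python
import re

# Properties named in the list above; everything else is ignored.
KNOWN_PROPERTIES = {
    "color", "font-weight", "font-style",
    "text-decoration", "font-family", "font-size",
}


def parse_p_style(css):
    """Collect the known properties from every `p { ... }` rule in css."""
    # Skip the HTML comment markers that may wrap the stylesheet.
    css = css.replace("<!--", " ").replace("-->", " ")
    style = {}
    for selector, body in re.findall(r"([^{}]+)\{([^{}]*)\}", css):
        if selector.strip().lower() != "p":
            continue  # Only style for p elements matters.
        for declaration in body.split(";"):
            name, sep, value = declaration.partition(":")
            name = name.strip().lower()
            if sep and name in KNOWN_PROPERTIES:
                style[name] = value.strip()
    return style


def font_size_in_points(value):
    """Convert a nominal 'NNpt' value (really px at 96/inch) to true points."""
    m = re.fullmatch(r"(\d+(?:\.\d+)?)pt", value.strip())
    if m is None:
        return None
    return float(m.group(1)) * 72 / 96
```

Under this reading, a nominal 14pt corresponds to 14 × 72 / 96 = 10.5 true points.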
This element has the following attributes.
lang
This always contains en in the corpus.