Chilkat.HtmlToXml Class Overview
Chilkat.HtmlToXml converts HTML into well-formed XML so it can be parsed and searched
programmatically. It can convert HTML from strings, files, byte arrays, BinData, or
StringBuilder, and provides options for dropping selected tags, controlling
non-breaking spaces, choosing the XML charset, and saving the result to a file.
What the Class Is Used For
Use Chilkat.HtmlToXml when an application needs to
normalize HTML into XML for reliable parsing, extraction, or transformation. This
is useful when working with HTML pages, fragments, reports, scraped content, email
bodies, or any markup that should be converted into a predictable XML structure.
- Convert HTML to XML: Set Html and call ToXml, or convert a file directly with ConvertFile.
- Load HTML Many Ways: Set HTML from a string, file, byte array, BinData, or StringBuilder.
- Clean the Output: Drop custom tags, specific tag types, or text formatting tags before XML output.
- Control Encoding: Use XmlCharset to choose the charset for generated XML.
Typical Workflow: Convert HTML in Memory
- Create an HtmlToXml object.
- Set the Html property, or load HTML with SetHtmlFromFile, SetHtmlBd, SetHtmlBytes, or SetHtmlSb.
- Optionally configure DropCustomTags, Nbsp, XmlCharset, or tag-dropping rules.
- Call ToXml or Xml to return the XML as a string.
- Use ToXmlSb when the XML should be appended to a StringBuilder.
- Check LastErrorText if conversion fails or behaves unexpectedly.
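The in-memory steps can be sketched in Python as follows. This is a minimal sketch, not Chilkat code: it assumes a converter object that exposes the members named in this document (Html, DropCustomTags, Nbsp, XmlCharset, ToXml, LastErrorText) as attributes and methods. StandInConverter and html_to_xml are hypothetical names, the stand-in performs no real conversion, and treating an empty result as a failure is an assumption of the sketch.

```python
class StandInConverter:
    """Hypothetical stand-in so the sketch runs without Chilkat installed.

    It mimics only member names used in this document and performs no real
    HTML-to-XML conversion.
    """

    def __init__(self):
        self.Html = ""
        self.DropCustomTags = False
        self.Nbsp = 0
        self.XmlCharset = ""
        self.LastErrorText = ""

    def ToXml(self):
        if not self.Html:
            self.LastErrorText = "No HTML source was set."
            return ""
        # Placeholder result; real converter output will look different.
        return "<root>" + self.Html + "</root>"


def html_to_xml(converter, html, charset="utf-8"):
    """Set the input, configure the output, convert, then check the result."""
    converter.Html = html              # set the HTML source first
    converter.DropCustomTags = True    # optional: exclude non-standard tags
    converter.Nbsp = 0                 # optional: &nbsp; becomes a normal space
    converter.XmlCharset = charset     # optional: charset of the generated XML
    xml = converter.ToXml()
    if not xml:
        # Assumption of this sketch: an empty result is treated as a failure.
        raise RuntimeError(converter.LastErrorText)
    return xml


xml = html_to_xml(StandInConverter(), "<p>Hello</p>")
print(xml)
```

With a real binding, only the object passed to html_to_xml would change, provided it exposes members under these names.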
Typical Workflow: Convert File to File
- Create an HtmlToXml object.
- Optionally configure output behavior such as DropCustomTags, Nbsp, XmlCharset, and tag-dropping options.
- Call ConvertFile with the input HTML path and destination XML path.
- Check the boolean return value.
- If the method returns false, inspect LastErrorText.
File-to-file conversion: ConvertFile converts an HTML file directly to a well-formed
XML file suitable for later parsing and data extraction.
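The file-to-file steps can be sketched the same way. The sketch below assumes a converter object whose ConvertFile takes the input HTML path and the destination XML path and returns a boolean, as described above. StandInFileConverter and convert_html_file are hypothetical names, and the stand-in only wraps the file contents in a placeholder element so the example runs without Chilkat installed.

```python
import os
import tempfile


class StandInFileConverter:
    """Hypothetical stand-in; it does not perform real HTML-to-XML conversion."""

    def __init__(self):
        self.LastErrorText = ""

    def ConvertFile(self, in_path, out_path):
        if not os.path.isfile(in_path):
            self.LastErrorText = "Input file not found: " + in_path
            return False
        with open(in_path, encoding="utf-8") as src:
            html = src.read()
        with open(out_path, "w", encoding="utf-8") as dst:
            dst.write("<root>" + html + "</root>")  # placeholder output
        return True


def convert_html_file(converter, html_path, xml_path):
    """Call ConvertFile, check the boolean result, surface LastErrorText."""
    if not converter.ConvertFile(html_path, xml_path):
        raise RuntimeError("ConvertFile failed: " + converter.LastErrorText)
    return xml_path


with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "page.html")
    dst_path = os.path.join(tmp, "page.xml")
    with open(src_path, "w", encoding="utf-8") as f:
        f.write("<p>Hello</p>")
    convert_html_file(StandInFileConverter(), src_path, dst_path)
    print(os.path.exists(dst_path))
```

Raising an exception on a false return is a design choice of the sketch; the point is that the boolean result is never ignored.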
Core Concepts
| Concept | Meaning | Important Members |
| --- | --- | --- |
| HTML Source | The HTML content to be converted. | Html, SetHtmlFromFile, SetHtmlBd, SetHtmlBytes, SetHtmlSb |
| XML Output | The well-formed XML created from the HTML source. | ToXml, Xml, ToXmlSb, ConvertFile |
| Tag Dropping | Removes selected HTML tags from the output XML. | DropCustomTags, DropTagType, UndropTagType, DropTextFormattingTags, UndropTextFormattingTags |
| Text Formatting Tags | Formatting tags such as b, font, i, u, and br. | DropTextFormattingTags, UndropTextFormattingTags |
| Non-Breaking Spaces | Controls how &nbsp; entities are handled in XML output. | Nbsp |
| Output Charset | Character encoding of the generated XML. | XmlCharset |
Properties
| Property | Purpose | Default / Guidance |
| --- | --- | --- |
| Html | Holds the HTML to be converted by ToXml. | Set this directly for string-based conversion, or use one of the SetHtml* methods. |
| DropCustomTags | Drops non-standard HTML tags from the XML output. | Set true when custom or unknown tags should not appear in the converted XML. |
| Nbsp | Controls how &nbsp; entities are handled. | Default is 0, converting &nbsp; to a normal space. |
| XmlCharset | Specifies the charset of the XML to be created. | If empty, the XML is created using the same character encoding as the HTML. Otherwise, the output is converted to the specified charset, such as utf-8 or iso-8859-1. |
| LastErrorText | Diagnostic information for the last method or property access. | Check after failures or unexpected behavior. Diagnostic information may be available regardless of success or failure. |
Non-Breaking Space Handling
| Nbsp Value | Behavior | Use When |
| --- | --- | --- |
| 0 | Converts &nbsp; to a normal ASCII space. | Use for ordinary readable XML/text extraction. This is the default. |
| 1 | Converts &nbsp; to &#160;. | Use when the non-breaking-space character should be preserved as a numeric XML character reference. |
| 2 | Drops &nbsp; entities. | Use when non-breaking spaces should be removed entirely. |
| 3 | Leaves &nbsp; unmodified. | Use when the original entity text should remain unchanged. |
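To make the four Nbsp values concrete, the following standalone Python function applies each behavior to a plain string. It illustrates the table only and is not how Chilkat implements the property. The function name apply_nbsp_mode is hypothetical, and the sketch assumes the numeric reference for mode 1 is written as &#160;.

```python
def apply_nbsp_mode(text, mode):
    """Illustrate the effect of each Nbsp value on the literal entity &nbsp;."""
    if mode == 3:
        return text  # leave the entity text unmodified
    replacements = {
        0: " ",        # normal ASCII space (the default)
        1: "&#160;",   # numeric character reference
        2: "",         # drop the entity entirely
    }
    if mode not in replacements:
        raise ValueError("Nbsp mode must be 0, 1, 2, or 3")
    return text.replace("&nbsp;", replacements[mode])


sample = "price:&nbsp;10"
for mode in (0, 1, 2, 3):
    print(mode, repr(apply_nbsp_mode(sample, mode)))
```

Mode 2 joins the neighboring text with no separator, which matters when the result is later split or compared as words.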
Setting the HTML Source
| Method / Property | Input | Purpose |
| --- | --- | --- |
| Html | String | Directly sets the HTML string to be converted. |
| SetHtmlFromFile | Filename | Loads HTML from a file into the Html property. |
| SetHtmlBd | BinData | Sets the Html property from BinData. |
| SetHtmlBytes | Byte array | Sets the Html property from a byte array. |
| SetHtmlSb | StringBuilder | Sets the Html property from a StringBuilder. |
Converting to XML
| Method | Output | Use When |
| --- | --- | --- |
| ToXml | XML string | Convert the HTML currently stored in the Html property and return the XML as a string. |
| Xml | XML string | Same as ToXml. Provided as an alternate method name. |
| ToXmlSb | StringBuilder | Converts the HTML in Html and appends the XML to a supplied StringBuilder. |
| ConvertFile | XML file | Converts an HTML file directly to a well-formed XML file. |
Parsing workflow: After conversion, the XML output can be loaded into an XML parser so
the original HTML content can be searched and extracted more reliably.
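As an illustration of that parsing step, the sketch below loads a small XML string with Python's standard xml.etree.ElementTree module and extracts every link. The sample string is hand-written for the example and is not actual converter output, so the wrapper elements are an assumption. The function name extract_links is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample standing in for converted output (assumed structure).
sample_xml = (
    "<root><html><body>"
    '<p>See <a href="https://example.com/a">first</a> and '
    '<a href="https://example.com/b">second</a>.</p>'
    "</body></html></root>"
)


def extract_links(xml_text):
    """Return (href, link text) pairs for every <a> element in the document."""
    root = ET.fromstring(xml_text)
    return [(a.get("href"), (a.text or "").strip()) for a in root.iter("a")]


for href, text in extract_links(sample_xml):
    print(href, text)
```

Because the input is well-formed XML, the search does not depend on how the original HTML was indented or nested.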
Tag Dropping and Cleanup
| Member | Effect | Use When |
| --- | --- | --- |
| DropCustomTags | Drops non-standard HTML tags from the output XML. | Use when custom elements should be excluded from the generated XML. |
| DropTagType | Drops a specific tag type from the output XML. | Call once for each tag type to drop. |
| UndropTagType | Prevents a specified tag type from being dropped. | Use to reverse or override a previous drop rule. |
| DropTextFormattingTags | Drops common text formatting tags from the XML output. | Use when formatting-only markup should be removed. |
| UndropTextFormattingTags | Keeps text formatting tags in the XML output. | Important because text formatting tags are dropped by default. |
Default formatting behavior: Text formatting tags are dropped by default. Call
UndropTextFormattingTags if formatting tags such as b, i, font, or br should remain
in the XML output.
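The tag-dropping members can be combined in one configuration step, sketched below in Python. It assumes a converter object that exposes DropCustomTags as an attribute and DropTagType, DropTextFormattingTags, and UndropTextFormattingTags as methods, with DropTagType taking a tag name. RecordingConverter and configure_tags are hypothetical names; the stand-in only records calls, and the tag names script and style are example values.

```python
class RecordingConverter:
    """Hypothetical stand-in that records calls instead of converting HTML."""

    def __init__(self):
        self.DropCustomTags = False
        self.calls = []

    def DropTagType(self, tag_name):
        self.calls.append(("DropTagType", tag_name))

    def DropTextFormattingTags(self):
        self.calls.append(("DropTextFormattingTags", None))

    def UndropTextFormattingTags(self):
        self.calls.append(("UndropTextFormattingTags", None))


def configure_tags(converter, keep_formatting=False, drop_types=(), drop_custom=False):
    """Apply tag-dropping rules before conversion."""
    converter.DropCustomTags = drop_custom
    if keep_formatting:
        # Formatting tags are dropped by default, so keeping them is explicit.
        converter.UndropTextFormattingTags()
    else:
        converter.DropTextFormattingTags()
    for tag_name in drop_types:
        converter.DropTagType(tag_name)  # one call per tag type


converter = RecordingConverter()
configure_tags(converter, keep_formatting=True, drop_types=["script", "style"])
print(converter.calls)
```

Making the formatting choice explicit in both branches keeps the configuration readable even though dropping is the default.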
Text Formatting Tags
The following tag types are considered text formatting tags by DropTextFormattingTags
and UndropTextFormattingTags:
| Category | Tags |
| --- | --- |
| Emphasis and font styling | b, font, i, u, em, strong |
| Size and presentation | big, small, tt, center |
| Line and text effects | br, s, strike, sub, sup |
File Helper Methods
| Method | Purpose | Guidance |
| --- | --- | --- |
| ReadFile | Reads a complete file into a byte array. | Use when binary file data is needed. |
| WriteFile | Saves a byte array to a file. | Use when writing raw bytes. |
| ReadFileToString | Reads a text file into a string. | The srcCharset argument specifies the input charset, such as utf-8 or iso-8859-1. |
| WriteStringToFile | Saves a string to a text file. | The charset argument specifies the output encoding. |
Encoding matters: Use the correct charset when reading HTML text or writing XML text so
characters are interpreted and saved correctly.
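The general point can be illustrated with Python's standard library rather than the Chilkat helpers. The sketch below writes a file in iso-8859-1, reads it back with the matching charset, and saves it as utf-8. The function name transcode_file and the file names are hypothetical.

```python
import os
import tempfile


def transcode_file(src_path, src_charset, dst_path, dst_charset):
    """Read text using the source charset and write it using the target charset."""
    with open(src_path, encoding=src_charset) as src:
        text = src.read()
    with open(dst_path, "w", encoding=dst_charset) as dst:
        dst.write(text)
    return text


html = "<p>caf\u00e9</p>"  # contains one non-ASCII character (e with acute accent)

with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "in.html")
    dst_path = os.path.join(tmp, "out.xml")
    with open(src_path, "wb") as f:
        f.write(html.encode("iso-8859-1"))  # one byte for the accented letter
    text = transcode_file(src_path, "iso-8859-1", dst_path, "utf-8")
    with open(dst_path, "rb") as f:
        utf8_bytes = f.read()               # two bytes for the accented letter

print(text == html, len(utf8_bytes))
```

The same text occupies a different number of bytes in each encoding, which is why the charset used for reading must match the charset the file was written with.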
Method Summary by Category
| Category | Methods / Properties | Purpose |
| --- | --- | --- |
| Set HTML input | Html, SetHtmlFromFile, SetHtmlBd, SetHtmlBytes, SetHtmlSb | Provide the HTML source to be converted. |
| Convert to XML | ToXml, Xml, ToXmlSb, ConvertFile | Convert HTML to XML as a string, append to StringBuilder, or write directly to a file. |
| Control output | DropCustomTags, Nbsp, XmlCharset | Configure custom tag handling, non-breaking-space behavior, and XML charset. |
| Drop or preserve tags | DropTagType, UndropTagType, DropTextFormattingTags, UndropTextFormattingTags | Remove or preserve specific tag types and text formatting tags. |
| File helpers | ReadFile, WriteFile, ReadFileToString, WriteStringToFile | Read and write byte arrays or encoded text files. |
| Diagnostics | LastErrorText | Read diagnostic information after failed or unexpected operations. |
Diagnostics and Troubleshooting
| Problem Area | Member | What to Check |
| --- | --- | --- |
| No XML is produced | Html, SetHtmlFromFile, ToXml, LastErrorText | Confirm the HTML source was set before converting, and inspect diagnostic output if conversion fails. |
| Formatting tags are missing | UndropTextFormattingTags | Text formatting tags are dropped by default. Call this method if they should be preserved. |
| Custom tags appear in the XML but should not | DropCustomTags | Set this property to true to drop non-standard HTML tags. |
| Specific tag type should be removed | DropTagType | Call once for each tag name that should be dropped. |
| Non-breaking spaces are not represented as desired | Nbsp | Choose the value that converts, preserves, drops, or numeric-encodes &nbsp;. |
| Output has incorrect characters | XmlCharset, ReadFileToString, WriteStringToFile | Verify the input charset and the desired XML output charset. |
| Need operation details after failure | LastErrorText | Check diagnostic text after failed or unexpected conversion, file read, file write, or tag handling operations. |
Common Pitfalls
| Pitfall | Better Approach |
| --- | --- |
| Calling ToXml before setting the HTML source. | Set the Html property or call one of the SetHtml* methods first. |
| Expecting formatting tags to be preserved by default. | Call UndropTextFormattingTags if text formatting tags should remain in the XML. |
| Using the wrong charset when reading or writing text files. | Specify the correct charset in ReadFileToString, WriteStringToFile, or XmlCharset. |
| Assuming custom tags are removed automatically. | Set DropCustomTags to true when custom tags should be excluded. |
| Dropping a tag type and forgetting to restore it for later conversions. | Use UndropTagType when a previously dropped tag should be included again. |
| Ignoring diagnostic information after failed conversion. | Check LastErrorText for details. |
Best Practices
| Recommendation | Reason |
| --- | --- |
| Choose the simplest input method for the source you already have. | Use Html for strings, SetHtmlFromFile for files, SetHtmlBd for BinData, and SetHtmlSb for StringBuilder. |
| Use ConvertFile for direct file-to-file conversion. | It avoids manually loading the HTML and manually saving the XML. |
| Call UndropTextFormattingTags when visual markup matters. | Text formatting tags are dropped by default, which is often desirable for extraction but not always for preservation. |
| Set Nbsp deliberately for data extraction workflows. | Non-breaking spaces can affect text comparison and downstream parsing. |
| Set XmlCharset when the output must use a specific encoding. | This avoids ambiguity when the generated XML is stored, transmitted, or parsed by another system. |
| Use tag-dropping methods to simplify the XML before parsing. | Removing unwanted tags can make downstream extraction logic simpler and more stable. |
| Check LastErrorText after failures. | It provides useful diagnostic detail for conversion, input loading, file writing, and tag handling. |
Summary
Chilkat.HtmlToXml converts HTML into well-formed XML for programmatic parsing and data
extraction. It supports direct string conversion, file-to-file conversion, input from
files, byte arrays, BinData, and StringBuilder, a configurable output charset,
non-breaking-space handling, custom tag removal, specific tag dropping, and file
helper methods for encoded text and raw bytes.
The most important practical guidance is to set the HTML source before calling ToXml,
remember that text formatting tags are dropped by default, choose the desired &nbsp;
behavior, set XmlCharset when a specific output encoding is required, and inspect
LastErrorText whenever a conversion or file operation fails.