Chilkat.HtmlToXml Class Overview

Chilkat.HtmlToXml converts HTML into well-formed XML so it can be parsed and searched programmatically. It can convert HTML from strings, files, byte arrays, BinData, or StringBuilder, and provides options for dropping selected tags, controlling non-breaking spaces, choosing the XML charset, and saving the result to a file.

What the Class Is Used For

Use Chilkat.HtmlToXml when an application needs to normalize HTML into XML for reliable parsing, extraction, or transformation. This is useful when working with HTML pages, fragments, reports, scraped content, email bodies, or any markup that should be converted into a predictable XML structure.

Convert HTML to XML: Set Html and call ToXml, or convert a file directly with ConvertFile.
Load HTML Many Ways: Set HTML from a string, file, byte array, BinData, or StringBuilder.
Clean the Output: Drop custom tags, specific tag types, or text formatting tags before XML output.
Control Encoding: Use XmlCharset to choose the charset for generated XML.

Typical Workflow: Convert HTML in Memory

  1. Create an HtmlToXml object.
  2. Set the Html property, or load HTML with SetHtmlFromFile, SetHtmlBd, SetHtmlBytes, or SetHtmlSb.
  3. Optionally configure DropCustomTags, Nbsp, XmlCharset, or tag-dropping rules.
  4. Call ToXml or Xml to return the XML as a string.
  5. Use ToXmlSb when the XML should be appended to a StringBuilder.
  6. Check LastErrorText if conversion fails or behaves unexpectedly.
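
The in-memory steps above can be sketched as a small Python helper. This is a minimal sketch, not the library's own code: it assumes `converter` is any object that exposes the documented members (Html, XmlCharset, ToXml, LastErrorText) with attribute-style access, as some Chilkat language bindings do. The helper name `html_to_xml_string` is hypothetical, and treating an empty result as a failure is an assumption made for this example.

```python
def html_to_xml_string(converter, html, xml_charset="utf-8"):
    """Convert an HTML string to an XML string.

    `converter` is assumed to expose the members named in this document:
    Html, XmlCharset, ToXml() and LastErrorText.
    """
    converter.Html = html                # step 2: set the HTML source
    converter.XmlCharset = xml_charset   # step 3: optional output setting
    xml = converter.ToXml()              # step 4: convert to an XML string
    if not xml:
        # step 6: this sketch treats an empty result as a failure (assumption)
        raise RuntimeError(converter.LastErrorText)
    return xml
```

Raising an exception that carries LastErrorText keeps the diagnostic text next to the failure instead of relying on the caller to remember to read it.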

Typical Workflow: Convert File to File

  1. Create an HtmlToXml object.
  2. Optionally configure output behavior such as DropCustomTags, Nbsp, XmlCharset, and tag-dropping options.
  3. Call ConvertFile with the input HTML path and destination XML path.
  4. Check the boolean return value.
  5. If the method returns false, inspect LastErrorText.

File-to-file conversion: ConvertFile converts an HTML file directly to a well-formed XML file suitable for later parsing and data extraction.
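
A hedged sketch of the file-to-file workflow, again written against any object that exposes ConvertFile and LastErrorText as described above. The argument order (input HTML path, then destination XML path) follows the workflow steps; the helper name `convert_file_checked` is hypothetical.

```python
def convert_file_checked(converter, html_path, xml_path):
    """Convert html_path to xml_path and fail loudly if ConvertFile reports False."""
    ok = converter.ConvertFile(html_path, xml_path)   # boolean success flag
    if not ok:
        # Surface the library's diagnostic text together with the failure.
        raise RuntimeError(f"ConvertFile failed: {converter.LastErrorText}")
    return xml_path
```

Any output options (DropCustomTags, Nbsp, XmlCharset, tag-dropping rules) would be set on the converter before calling this helper.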

Core Concepts

| Concept | Meaning | Important Members |
| --- | --- | --- |
| HTML Source | The HTML content to be converted. | Html, SetHtmlFromFile, SetHtmlBd, SetHtmlBytes, SetHtmlSb |
| XML Output | The well-formed XML created from the HTML source. | ToXml, Xml, ToXmlSb, ConvertFile |
| Tag Dropping | Removes selected HTML tags from the output XML. | DropCustomTags, DropTagType, UndropTagType, DropTextFormattingTags, UndropTextFormattingTags |
| Text Formatting Tags | Formatting tags such as b, font, i, u, and br. | DropTextFormattingTags, UndropTextFormattingTags |
| Non-Breaking Spaces | Controls how &nbsp; entities are handled in XML output. | Nbsp |
| Output Charset | Character encoding of the generated XML. | XmlCharset |

Properties

| Property | Purpose | Default / Guidance |
| --- | --- | --- |
| Html | Holds the HTML to be converted by ToXml. | Set this directly for string-based conversion, or use one of the SetHtml* methods. |
| DropCustomTags | Drops non-standard HTML tags from the XML output. | Set true when custom or unknown tags should not appear in the converted XML. |
| Nbsp | Controls how &nbsp; entities are handled. | Default is 0, converting &nbsp; to a normal space. |
| XmlCharset | Specifies the charset of the XML to be created. | If empty, the XML is created using the same character encoding as the HTML. Otherwise, the output is converted to the specified charset, such as utf-8 or iso-8859-1. |
| LastErrorText | Diagnostic information for the last method or property access. | Check after failures or unexpected behavior. Diagnostic information may be available regardless of success or failure. |

Non-Breaking Space Handling

| Nbsp Value | Behavior | Use When |
| --- | --- | --- |
| 0 | Converts &nbsp; to a normal ASCII space. | Use for ordinary readable XML/text extraction. This is the default. |
| 1 | Converts &nbsp; to &#160;. | Use when the non-breaking-space character should be preserved as a numeric XML character reference. |
| 2 | Drops &nbsp; entities. | Use when non-breaking spaces should be removed entirely. |
| 3 | Leaves &nbsp; unmodified. | Use when the original entity text should remain unchanged. |
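
To make the four Nbsp settings concrete, the following standalone Python model applies each one to a plain string. It is only an illustration of the documented behavior (0 gives a space, 1 gives `&#160;`, 2 drops the entity, 3 leaves it unchanged). It is not how the library implements the setting, and the function name `apply_nbsp_mode` is hypothetical.

```python
NBSP_ENTITY = "&nbsp;"

def apply_nbsp_mode(text, mode):
    """Model the documented Nbsp values on a plain string (illustration only)."""
    replacements = {
        0: " ",          # normal ASCII space (the default)
        1: "&#160;",     # numeric XML character reference
        2: "",           # drop the entity entirely
        3: NBSP_ENTITY,  # leave the entity text unmodified
    }
    if mode not in replacements:
        raise ValueError(f"unsupported Nbsp value: {mode}")
    return text.replace(NBSP_ENTITY, replacements[mode])

sample = "10&nbsp;kg"
for mode in range(4):
    print(mode, repr(apply_nbsp_mode(sample, mode)))
# 0 '10 kg'
# 1 '10&#160;kg'
# 2 '10kg'
# 3 '10&nbsp;kg'
```

Note that mode 3 leaves `&nbsp;` in the output, and `&nbsp;` is not one of XML's predefined entities, so a strict XML parser may reject it unless the entity is declared.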

Setting the HTML Source

| Method / Property | Input | Purpose |
| --- | --- | --- |
| Html | String | Directly sets the HTML string to be converted. |
| SetHtmlFromFile | Filename | Loads HTML from a file into the Html property. |
| SetHtmlBd | BinData | Sets the Html property from BinData. |
| SetHtmlBytes | Byte array | Sets the Html property from a byte array. |
| SetHtmlSb | StringBuilder | Sets the Html property from a StringBuilder. |
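
One way to pick the input member from the kind of data already at hand is a small dispatcher. This is a sketch under assumptions: `converter` is any object exposing Html, SetHtmlFromFile, and SetHtmlBytes as listed above, SetHtmlFromFile is assumed to return a success flag, and the helper `set_html_source` is hypothetical. BinData and StringBuilder inputs are left out because they require the library's own object types.

```python
from pathlib import Path

def set_html_source(converter, source):
    """Route `source` to the matching input member; return True on success."""
    if isinstance(source, Path):
        # File on disk: let the converter load it (return value assumed boolean).
        return bool(converter.SetHtmlFromFile(str(source)))
    if isinstance(source, (bytes, bytearray)):
        # Raw bytes: hand them over without decoding them first.
        converter.SetHtmlBytes(bytes(source))
        return True
    if isinstance(source, str):
        # Already-decoded text: assign the Html property directly.
        converter.Html = source
        return True
    raise TypeError(f"unsupported HTML source type: {type(source).__name__}")
```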

Converting to XML

| Method | Output | Use When |
| --- | --- | --- |
| ToXml | XML string | Convert the HTML currently stored in the Html property and return the XML as a string. |
| Xml | XML string | Same as ToXml. Provided as an alternate method name. |
| ToXmlSb | StringBuilder | Converts the HTML in Html and appends the XML to a supplied StringBuilder. |
| ConvertFile | XML file | Converts an HTML file directly to a well-formed XML file. |

Parsing workflow: After conversion, the XML output can be loaded into an XML parser so the original HTML content can be searched and extracted more reliably.
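
As an illustration of the parsing step, the sketch below loads XML text into Python's standard `xml.etree.ElementTree` parser and extracts link targets. The `xml_text` sample is hand-written for this example and only stands in for converter output; the element names and nesting of real output may differ.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for converted output (assumption, not real output).
xml_text = """<root>
  <html>
    <body>
      <h1>Report</h1>
      <a href="https://example.com/a">First</a>
      <a href="https://example.com/b">Second</a>
    </body>
  </html>
</root>"""

def extract_links(xml_text):
    """Return (href, link text) pairs for every <a> element in document order."""
    root = ET.fromstring(xml_text)
    return [(a.get("href"), (a.text or "").strip()) for a in root.iter("a")]

links = extract_links(xml_text)
print(links)
```

Because the input is well-formed XML, the same approach works with XPath-style queries such as `root.findall(".//a")`, which is harder to do reliably against raw HTML.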

Tag Dropping and Cleanup

| Member | Effect | Use When |
| --- | --- | --- |
| DropCustomTags | Drops non-standard HTML tags from the output XML. | Use when custom elements should be excluded from the generated XML. |
| DropTagType | Drops a specific tag type from the output XML. | Call once for each tag type to drop. |
| UndropTagType | Prevents a specified tag type from being dropped. | Use to reverse or override a previous drop rule. |
| DropTextFormattingTags | Drops common text formatting tags from the XML output. | Use when formatting-only markup should be removed. |
| UndropTextFormattingTags | Keeps text formatting tags in the XML output. | Important because text formatting tags are dropped by default. |

Default formatting behavior: Text formatting tags are dropped by default. Call UndropTextFormattingTags if formatting tags such as b, i, font, or br should remain in the XML output.
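
The tag-dropping members can be combined into one configuration step. The following is a minimal sketch, assuming `converter` exposes DropCustomTags, DropTagType, DropTextFormattingTags, and UndropTextFormattingTags as described above. The helper name `configure_tag_rules` and its parameters are hypothetical.

```python
def configure_tag_rules(converter, drop_tags=(), keep_formatting=False,
                        drop_custom=False):
    """Apply tag-dropping rules to a converter before conversion."""
    converter.DropCustomTags = drop_custom
    if keep_formatting:
        # Formatting tags are dropped by default, so keeping them is explicit.
        converter.UndropTextFormattingTags()
    else:
        # Restores the default in case an earlier call undropped them.
        converter.DropTextFormattingTags()
    for tag in drop_tags:
        converter.DropTagType(tag)   # one call per tag type
```

For example, `configure_tag_rules(converter, drop_tags=["script", "style"], keep_formatting=True)` would keep tags such as b and i while removing script and style elements, provided the converter behaves as documented.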

Text Formatting Tags

The following tag types are considered text formatting tags by DropTextFormattingTags and UndropTextFormattingTags:

| Category | Tags |
| --- | --- |
| Emphasis and font styling | b, font, i, u, em, strong |
| Size and presentation | big, small, tt, center |
| Line and text effects | br, s, strike, sub, sup |

File Helper Methods

| Method | Purpose | Guidance |
| --- | --- | --- |
| ReadFile | Reads a complete file into a byte array. | Use when binary file data is needed. |
| WriteFile | Saves a byte array to a file. | Use when writing raw bytes. |
| ReadFileToString | Reads a text file into a string. | The srcCharset argument specifies the input charset, such as utf-8 or iso-8859-1. |
| WriteStringToFile | Saves a string to a text file. | The charset argument specifies the output encoding. |

Encoding matters: Use the correct charset when reading HTML text or writing XML text so characters are interpreted and saved correctly.
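
The effect of a wrong charset is easy to demonstrate with plain Python file I/O. This sketch does not use the helper methods above; it only shows the general failure mode they are meant to avoid, using a temporary file and a hypothetical `read_text` helper.

```python
import os
import tempfile

def read_text(path, charset):
    """Read a text file with the given charset, replacing undecodable bytes."""
    with open(path, "r", encoding=charset, errors="replace") as fh:
        return fh.read()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "page.html")
    # Store the page as ISO-8859-1, where "é" is the single byte 0xE9.
    with open(path, "wb") as fh:
        fh.write("<p>café</p>".encode("iso-8859-1"))

    right = read_text(path, "iso-8859-1")   # decoded with the matching charset
    wrong = read_text(path, "utf-8")        # 0xE9 is not valid UTF-8 here

print(repr(right))   # '<p>café</p>'
print(repr(wrong))   # '<p>caf\ufffd</p>'
```

The second read silently substitutes U+FFFD for the accented character, which is the kind of damage that later shows up as incorrect characters in the generated XML.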

Method Summary by Category

| Category | Methods / Properties | Purpose |
| --- | --- | --- |
| Set HTML input | Html, SetHtmlFromFile, SetHtmlBd, SetHtmlBytes, SetHtmlSb | Provide the HTML source to be converted. |
| Convert to XML | ToXml, Xml, ToXmlSb, ConvertFile | Convert HTML to XML as a string, append to StringBuilder, or write directly to a file. |
| Control output | DropCustomTags, Nbsp, XmlCharset | Configure custom tag handling, non-breaking-space behavior, and XML charset. |
| Drop or preserve tags | DropTagType, UndropTagType, DropTextFormattingTags, UndropTextFormattingTags | Remove or preserve specific tag types and text formatting tags. |
| File helpers | ReadFile, WriteFile, ReadFileToString, WriteStringToFile | Read and write byte arrays or encoded text files. |
| Diagnostics | LastErrorText | Read diagnostic information after failed or unexpected operations. |

Diagnostics and Troubleshooting

| Problem Area | Member | What to Check |
| --- | --- | --- |
| No XML is produced | Html, SetHtmlFromFile, ToXml, LastErrorText | Confirm the HTML source was set before converting, and inspect diagnostic output if conversion fails. |
| Formatting tags are missing | UndropTextFormattingTags | Text formatting tags are dropped by default. Call this method if they should be preserved. |
| Custom tags appear in the XML but should not | DropCustomTags | Set this property to true to drop non-standard HTML tags. |
| Specific tag type should be removed | DropTagType | Call once for each tag name that should be dropped. |
| Non-breaking spaces are not represented as desired | Nbsp | Choose the value that converts, preserves, drops, or numeric-encodes &nbsp;. |
| Output has incorrect characters | XmlCharset, ReadFileToString, WriteStringToFile | Verify the input charset and the desired XML output charset. |
| Need operation details after failure | LastErrorText | Check diagnostic text after failed or unexpected conversion, file read, file write, or tag handling operations. |

Common Pitfalls

| Pitfall | Better Approach |
| --- | --- |
| Calling ToXml before setting the HTML source. | Set the Html property or call one of the SetHtml* methods first. |
| Expecting formatting tags to be preserved by default. | Call UndropTextFormattingTags if text formatting tags should remain in the XML. |
| Using the wrong charset when reading or writing text files. | Specify the correct charset in ReadFileToString, WriteStringToFile, or XmlCharset. |
| Assuming custom tags are removed automatically. | Set DropCustomTags to true when custom tags should be excluded. |
| Dropping a tag type and forgetting to restore it for later conversions. | Use UndropTagType when a previously dropped tag should be included again. |
| Ignoring diagnostic information after failed conversion. | Check LastErrorText for details. |

Best Practices

| Recommendation | Reason |
| --- | --- |
| Choose the simplest input method for the source you already have. | Use Html for strings, SetHtmlFromFile for files, SetHtmlBytes for byte arrays, SetHtmlBd for BinData, and SetHtmlSb for StringBuilder. |
| Use ConvertFile for direct file-to-file conversion. | It avoids manually loading the HTML and manually saving the XML. |
| Call UndropTextFormattingTags when visual markup matters. | Text formatting tags are dropped by default, which is often desirable for extraction but not always for preservation. |
| Set Nbsp deliberately for data extraction workflows. | Non-breaking spaces can affect text comparison and downstream parsing. |
| Set XmlCharset when the output must use a specific encoding. | This avoids ambiguity when the generated XML is stored, transmitted, or parsed by another system. |
| Use tag-dropping methods to simplify the XML before parsing. | Removing unwanted tags can make downstream extraction logic simpler and more stable. |
| Check LastErrorText after failures. | It provides useful diagnostic detail for conversion, input loading, file writing, and tag handling. |

Summary

Chilkat.HtmlToXml converts HTML into well-formed XML for programmatic parsing and data extraction. It supports direct string conversion and file-to-file conversion; input from files, byte arrays, BinData, and StringBuilder; a configurable output charset; non-breaking-space handling; custom tag removal and specific tag dropping; and file helper methods for encoded text and raw bytes.

The most important practical guidance is to set the HTML source before calling ToXml, remember that text formatting tags are dropped by default, choose the desired &nbsp; behavior, set XmlCharset when a specific output encoding is required, and inspect LastErrorText whenever a conversion or file operation fails.