HtmlToXml C Library Reference

HtmlToXml

Class for converting HTML to well-formed XML for the purpose of programmatically extracting (scraping) information from any HTML page.

Create/Dispose

HCkHtmlToXml CkHtmlToXml_Create(void);

Creates an instance of the CkHtmlToXml object and returns a handle (i.e. a "void *" pointer). The handle is passed in the 1st argument for the functions listed on this page.

void CkHtmlToXml_Dispose(HCkHtmlToXml handle);

Objects created by calling CkHtmlToXml_Create must be freed by calling this function. A memory leak occurs if a handle is not disposed of this way.

C "Properties"

BOOL CkHtmlToXml_getDropCustomTags(HCkHtmlToXml handle);
void CkHtmlToXml_putDropCustomTags(HCkHtmlToXml handle, BOOL newVal);

If set to true, then any non-standard HTML tags will be dropped when converting to XML.

void CkHtmlToXml_getHtml(HCkHtmlToXml handle, HCkString retval);
void CkHtmlToXml_putHtml(HCkHtmlToXml handle, const char *newVal);

The HTML to be converted by the ToXml method. To convert HTML to XML, first set this property to the HTML string and then call ToXml. The ConvertFile method can do file-to-file conversions.

void CkHtmlToXml_getLastErrorHtml(HCkHtmlToXml handle, HCkString retval);

Error information in HTML format for the last method called.

void CkHtmlToXml_getLastErrorText(HCkHtmlToXml handle, HCkString retval);

Error information in plain-text format for the last method called.

void CkHtmlToXml_getLastErrorXml(HCkHtmlToXml handle, HCkString retval);

Error information in XML format for the last method called.

long CkHtmlToXml_getNbsp(HCkHtmlToXml handle);
void CkHtmlToXml_putNbsp(HCkHtmlToXml handle, long newVal);

Determines how &nbsp; HTML entities are handled. The default value, 0, causes &nbsp; entities to be converted to normal space characters (ASCII value 32). If this property is set to 1, &nbsp; entities are converted to &#160;. If set to 2, &nbsp; entities are dropped.
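The three modes can be pictured with a small standalone sketch. It is not the library's implementation and does not call the library: apply_nbsp_mode is a hypothetical helper that only mimics the behavior described above on a plain string, and it assumes that mode 1 emits the numeric reference &#160;.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, NOT part of the Chilkat API. Rewrites every "&nbsp;"
   in src according to the documented Nbsp modes (0 = space, 1 = "&#160;",
   2 = dropped) and stores the NUL-terminated result in dst.
   Returns the output length, or (size_t)-1 if dst is too small. */
size_t apply_nbsp_mode(const char *src, long mode, char *dst, size_t dstSize)
{
    static const char entity[] = "&nbsp;";
    const size_t entityLen = sizeof entity - 1;
    const char *repl;
    size_t replLen;
    size_t out = 0;

    assert(mode >= 0 && mode <= 2);
    repl = (mode == 0) ? " " : (mode == 1) ? "&#160;" : "";
    replLen = strlen(repl);

    while (*src != '\0') {
        if (strncmp(src, entity, entityLen) == 0) {
            /* Keep one byte free for the terminating NUL. */
            if (out + replLen >= dstSize) return (size_t)-1;
            memcpy(dst + out, repl, replLen);
            out += replLen;
            src += entityLen;
        } else {
            if (out + 1 >= dstSize) return (size_t)-1;
            dst[out++] = *src++;
        }
    }
    if (dstSize == 0) return (size_t)-1;
    dst[out] = '\0';
    return out;
}
```

With the real component, the mode would be selected by calling CkHtmlToXml_putNbsp before converting.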

BOOL CkHtmlToXml_getUtf8(HCkHtmlToXml handle);
void CkHtmlToXml_putUtf8(HCkHtmlToXml handle, BOOL newVal);

When set to true, all "const char *" arguments are expected to be utf-8 strings. If set to false, the "const char *" arguments are expected to be ANSI strings.

void CkHtmlToXml_getVersion(HCkHtmlToXml handle, HCkString retval);

The version of the component, such as "1.0.0".

void CkHtmlToXml_getXmlCharset(HCkHtmlToXml handle, HCkString retval);
void CkHtmlToXml_putXmlCharset(HCkHtmlToXml handle, const char *newVal);

The charset, such as "utf-8" or "iso-8859-1", of the XML to be created. If XmlCharset is empty, the XML is created in the same character encoding as the HTML. Otherwise the HTML is converted to XML and the result is converted to this charset.
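The selection rule above amounts to a simple fallback. The sketch below is only an illustration of that rule; effective_xml_charset is a hypothetical helper, not a library function, and the HTML charset is assumed to be already known to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, NOT part of the Chilkat API. Returns the charset the
   output XML would use under the documented rule: an empty (or NULL)
   XmlCharset means "same encoding as the HTML". */
const char *effective_xml_charset(const char *xmlCharset, const char *htmlCharset)
{
    assert(htmlCharset != NULL);
    if (xmlCharset == NULL || xmlCharset[0] == '\0')
        return htmlCharset;   /* keep the HTML's encoding */
    return xmlCharset;        /* convert the XML to the requested charset */
}
```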

C "Methods"

BOOL CkHtmlToXml_ConvertFile(HCkHtmlToXml handle, const char *inHtmlFilename, const char *outXmlFilename);

Converts an HTML file to a well-formed XML file that can be parsed for the purpose of programmatically extracting information.

void CkHtmlToXml_DropTagType(HCkHtmlToXml handle, const char *tagName);

Allows for any specified tag to be dropped from the output XML. To drop more than one tag, call this method once for each tag type to be dropped.

void CkHtmlToXml_DropTextFormattingTags(HCkHtmlToXml handle);

Causes text formatting tags to be dropped from the XML output. Text formatting tags are: b, font, i, u, br, center, em, strong, big, tt, s, small, strike, sub, and sup.
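For callers that want to check a tag name against this list before or after conversion, a lookup might be sketched as follows. is_text_formatting_tag is a hypothetical helper, not part of the library, and the case-insensitive comparison is an assumption based on HTML tag names being case-insensitive.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Compares two strings while ignoring ASCII case. Returns 1 if equal. */
static int equals_ignore_case(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;   /* equal only if both strings ended together */
}

/* Hypothetical helper, NOT part of the Chilkat API. Returns 1 if tagName is
   one of the text formatting tags listed in this reference, otherwise 0. */
int is_text_formatting_tag(const char *tagName)
{
    static const char *const tags[] = {
        "b", "font", "i", "u", "br", "center", "em", "strong",
        "big", "tt", "s", "small", "strike", "sub", "sup"
    };
    size_t i;

    assert(tagName != NULL);
    for (i = 0; i < sizeof tags / sizeof tags[0]; i++) {
        if (equals_ignore_case(tagName, tags[i]))
            return 1;
    }
    return 0;
}
```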

BOOL CkHtmlToXml_IsUnlocked(HCkHtmlToXml handle);

Returns true if the component is already unlocked. Otherwise returns false.

BOOL CkHtmlToXml_ReadFileToString(HCkHtmlToXml handle, const char *filename, const char *srcCharset, HCkString str);

Convenience method for reading a text file into a string. The character encoding of the text file is specified by srcCharset. Valid values, such as "iso-8859-1" or "utf-8", are listed at: List of Charsets.

Returns TRUE for success, FALSE for failure.

BOOL CkHtmlToXml_SaveLastError(HCkHtmlToXml handle, const char *filename);

Saves the last error information to an XML formatted file.

BOOL CkHtmlToXml_SetHtmlFromFile(HCkHtmlToXml handle, const char *filename);

Sets the Html property by loading the HTML from a file.

BOOL CkHtmlToXml_ToXml(HCkHtmlToXml handle, HCkString str);

Converts the HTML in the "Html" property to XML and returns the XML string.

void CkHtmlToXml_UndropTagType(HCkHtmlToXml handle, const char *tagName);

Causes a specified type of tag to NOT be dropped in the output XML.

void CkHtmlToXml_UndropTextFormattingTags(HCkHtmlToXml handle);

Causes text formatting tags to NOT be dropped from the XML output. Text formatting tags are: b, font, i, u, br, center, em, strong, big, tt, s, small, strike, sub, and sup.

Important: Text formatting tags are dropped by default. Call this method to prevent text formatting tags from being dropped.

BOOL CkHtmlToXml_UnlockComponent(HCkHtmlToXml handle, const char *code);

Unlocks the component. An arbitrary unlock code may be passed to automatically begin a 30-day trial.

Returns TRUE for success, FALSE for failure.

BOOL CkHtmlToXml_WriteStringToFile(HCkHtmlToXml handle, const char *str, const char *filename, const char *charset);

Convenience method for saving a string to a file. The character encoding of the output text file is specified by charset (the string is converted to this charset when writing). Valid values, such as "iso-8859-1" or "utf-8", are listed at: List of Charsets.

const char *CkHtmlToXml_html(HCkHtmlToXml handle);

The HTML to be converted by the ToXml method. To convert HTML to XML, first set this property to the HTML string and then call ToXml. The ConvertFile method can do file-to-file conversions.

const char *CkHtmlToXml_lastErrorHtml(HCkHtmlToXml handle);

Error information in HTML format for the last method called.

const char *CkHtmlToXml_lastErrorText(HCkHtmlToXml handle);

Error information in plain-text format for the last method called.

const char *CkHtmlToXml_lastErrorXml(HCkHtmlToXml handle);

Error information in XML format for the last method called.

const char *CkHtmlToXml_readFileToString(HCkHtmlToXml handle, const char *filename, const char *srcCharset);

Convenience method for reading a text file into a string. The character encoding of the text file is specified by srcCharset. Valid values, such as "iso-8859-1" or "utf-8", are listed at: List of Charsets.

Returns the file contents as a string, or NULL on failure.

const char *CkHtmlToXml_toXml(HCkHtmlToXml handle);

Converts the HTML in the "Html" property to XML and returns the XML string.

const char *CkHtmlToXml_version(HCkHtmlToXml handle);

The version of the component, such as "1.0.0".

const char *CkHtmlToXml_xml(HCkHtmlToXml handle);

Converts the HTML in the "Html" property to XML and returns the XML string.

const char *CkHtmlToXml_xmlCharset(HCkHtmlToXml handle);

The charset, such as "utf-8" or "iso-8859-1", of the XML to be created. If XmlCharset is empty, the XML is created in the same character encoding as the HTML. Otherwise the HTML is converted to XML and the result is converted to this charset.