HtmlToXml C Library Reference

HtmlToXml

Class for converting HTML to well-formed XML for the purpose of programmatically extracting (scraping) information from any HTML page.

Create/Dispose

HCkHtmlToXml CkHtmlToXml_Create(void);

Creates an instance of the CkHtmlToXml object and returns a handle (i.e. a "void *" pointer). The handle is passed in the 1st argument for the functions listed on this page.

void CkHtmlToXml_Dispose(HCkHtmlToXml handle);

Objects created by calling CkHtmlToXml_Create must be freed by calling this method. A memory leak occurs if a handle is not disposed by calling this function.

C "Properties"

BOOL CkHtmlToXml_getDropCustomTags(HCkHtmlToXml cHandle);
void CkHtmlToXml_putDropCustomTags(HCkHtmlToXml cHandle, BOOL newVal);

If set to true, then any non-standard HTML tags will be dropped when converting to XML.

void CkHtmlToXml_getHtml(HCkHtmlToXml cHandle, HCkString retval);
void CkHtmlToXml_putHtml(HCkHtmlToXml cHandle, const char *newVal);

The HTML to be converted by the ToXml method. To convert HTML to XML, first set this property to the HTML string and then call ToXml. The ConvertFile method can do file-to-file conversions.

void CkHtmlToXml_getLastErrorHtml(HCkHtmlToXml cHandle, HCkString retval);

Error information in HTML format for the last method called.

void CkHtmlToXml_getLastErrorText(HCkHtmlToXml cHandle, HCkString retval);

Error information in plain-text format for the last method called.

void CkHtmlToXml_getLastErrorXml(HCkHtmlToXml cHandle, HCkString retval);

Error information in XML format for the last method called.

long CkHtmlToXml_getNbsp(HCkHtmlToXml cHandle);
void CkHtmlToXml_putNbsp(HCkHtmlToXml cHandle, long newVal);

Determines how to handle &nbsp; HTML entities. The default value, 0, causes &nbsp; entities to be converted to normal space characters (ASCII value 32). If this property is set to 1, &nbsp; entities are converted to &#160;. If set to 2, &nbsp; entities are dropped. If set to 3, &nbsp; entities are left unmodified.
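
The four Nbsp settings can be easier to compare when applied to the same input. The following is a minimal sketch that models the documented behavior on a plain string; nbsp_demo is a hypothetical helper written for illustration and is not part of the library or a description of how the library implements the property.

```c
#include <stddef.h>
#include <string.h>

/* Illustration only: models the four documented Nbsp settings on a plain
 * string. nbsp_demo is a hypothetical helper, not a library function.
 * Returns 0 on success, -1 for an unknown mode or a too-small buffer. */
int nbsp_demo(const char *in, long mode, char *out, size_t outsz)
{
    static const char ent[] = "&nbsp;";
    const size_t entlen = sizeof ent - 1;
    const char *rep;
    size_t replen, o = 0;

    if (outsz == 0) return -1;

    switch (mode) {
    case 0: rep = " ";      break;  /* normal space, ASCII 32 */
    case 1: rep = "&#160;"; break;  /* numeric character reference */
    case 2: rep = "";       break;  /* dropped */
    case 3: rep = ent;      break;  /* left unmodified */
    default: return -1;
    }
    replen = strlen(rep);

    while (*in) {
        if (strncmp(in, ent, entlen) == 0) {
            /* keep one byte free for the terminating NUL */
            if (o + replen >= outsz) return -1;
            memcpy(out + o, rep, replen);
            o += replen;
            in += entlen;
        } else {
            if (o + 1 >= outsz) return -1;
            out[o++] = *in++;
        }
    }
    out[o] = '\0';
    return 0;
}
```

For the input "a&nbsp;b", this sketch yields "a b" for mode 0, "a&#160;b" for mode 1, "ab" for mode 2, and the input unchanged for mode 3.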

BOOL CkHtmlToXml_getUtf8(HCkHtmlToXml cHandle);
void CkHtmlToXml_putUtf8(HCkHtmlToXml cHandle, BOOL newVal);

When set to true, all "const char *" arguments are expected to be utf-8 strings. If set to false, the "const char *" arguments are expected to be ANSI strings.

void CkHtmlToXml_getVersion(HCkHtmlToXml cHandle, HCkString retval);

The version of the component, such as "1.0.0".

void CkHtmlToXml_getXmlCharset(HCkHtmlToXml cHandle, HCkString retval);
void CkHtmlToXml_putXmlCharset(HCkHtmlToXml cHandle, const char *newVal);

The charset, such as "utf-8" or "iso-8859-1", of the XML to be created. If XmlCharset is empty, the XML is created in the same character encoding as the HTML. Otherwise the HTML is converted to XML and the XML is converted to this charset.

C "Methods"

BOOL CkHtmlToXml_ConvertFile(HCkHtmlToXml cHandle, const char *inHtmlFilename, const char *outXmlFilename);

Converts an HTML file to a well-formed XML file that can be parsed for the purpose of programmatically extracting information.

void CkHtmlToXml_DropTagType(HCkHtmlToXml cHandle, const char *tagName);

Allows for any specified tag to be dropped from the output XML. To drop more than one tag, call this method once for each tag type to be dropped.

void CkHtmlToXml_DropTextFormattingTags(HCkHtmlToXml cHandle);

Causes text formatting tags to be dropped from the XML output. Text formatting tags are: b, font, i, u, br, center, em, strong, big, tt, s, small, strike, sub, and sup.

BOOL CkHtmlToXml_IsUnlocked(HCkHtmlToXml cHandle);

Returns true if the component is already unlocked. Otherwise returns false.

BOOL CkHtmlToXml_ReadFileToString(HCkHtmlToXml cHandle, const char *filename, const char *srcCharset, HCkString outStr);

Convenience method for reading a text file into a string. The character encoding of the text file is specified by srcCharset. Valid values, such as "iso-8859-1" or "utf-8", are listed at: List of Charsets.

Returns TRUE for success, FALSE for failure.

BOOL CkHtmlToXml_SaveLastError(HCkHtmlToXml cHandle, const char *filename);

Saves the last error information to an XML formatted file.

void CkHtmlToXml_SetHtmlBytes(HCkHtmlToXml cHandle, HCkByteData inData);

Sets the Html property from a byte array.

BOOL CkHtmlToXml_SetHtmlFromFile(HCkHtmlToXml cHandle, const char *filename);

Sets the Html property by loading the HTML from a file.

BOOL CkHtmlToXml_ToXml(HCkHtmlToXml cHandle, HCkString outStr);

Converts the HTML in the "Html" property to XML and returns the XML string.

void CkHtmlToXml_UndropTagType(HCkHtmlToXml cHandle, const char *tagName);

Causes a specified type of tag to NOT be dropped in the output XML.

void CkHtmlToXml_UndropTextFormattingTags(HCkHtmlToXml cHandle);

Causes text formatting tags to NOT be dropped from the XML output. Text formatting tags are: b, font, i, u, br, center, em, strong, big, tt, s, small, strike, sub, and sup.

Important: Text formatting tags are dropped by default. Call this method to prevent text formatting tags from being dropped.
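
Because these tags are dropped unless UndropTextFormattingTags is called, code that walks the resulting XML may want to know in advance which element names to expect. The following is a minimal sketch of a lookup over the tag list given above; is_text_formatting_tag is a hypothetical helper, not a library function, and it assumes lowercase tag names with an exact match.

```c
#include <string.h>

/* The text formatting tags listed in this reference. */
static const char *const kTextFormattingTags[] = {
    "b", "font", "i", "u", "br", "center", "em", "strong",
    "big", "tt", "s", "small", "strike", "sub", "sup"
};

/* Hypothetical helper: returns 1 if name is one of the listed text
 * formatting tags (exact, lowercase match), otherwise 0. */
int is_text_formatting_tag(const char *name)
{
    size_t i;
    size_t n = sizeof kTextFormattingTags / sizeof kTextFormattingTags[0];

    for (i = 0; i < n; i++) {
        if (strcmp(name, kTextFormattingTags[i]) == 0)
            return 1;
    }
    return 0;
}
```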

BOOL CkHtmlToXml_UnlockComponent(HCkHtmlToXml cHandle, const char *code);

Unlocks the component. An arbitrary unlock code may be passed to automatically begin a 30-day trial.

Returns TRUE for success, FALSE for failure.

BOOL CkHtmlToXml_WriteStringToFile(HCkHtmlToXml cHandle, const char *str, const char *filename, const char *charset);

Convenience method for saving a string to a file. The character encoding of the output text file is specified by charset (the string is converted to this charset when writing). Valid values, such as "iso-8859-1" or "utf-8", are listed at: List of Charsets.

BOOL CkHtmlToXml_Xml(HCkHtmlToXml cHandle, HCkString outStr);

To be documented soon...

const char *CkHtmlToXml_html(HCkHtmlToXml cHandle);

The HTML to be converted by the ToXml method. To convert HTML to XML, first set this property to the HTML string and then call ToXml. The ConvertFile method can do file-to-file conversions.

Returns a null on failure

const char *CkHtmlToXml_lastErrorHtml(HCkHtmlToXml cHandle);

Error information in HTML format for the last method called.

Returns a null on failure

const char *CkHtmlToXml_lastErrorText(HCkHtmlToXml cHandle);

Error information in plain-text format for the last method called.

Returns a null on failure

const char *CkHtmlToXml_lastErrorXml(HCkHtmlToXml cHandle);

Error information in XML format for the last method called.

Returns a null on failure

const char *CkHtmlToXml_readFileToString(HCkHtmlToXml cHandle, const char *filename, const char *srcCharset);

Convenience method for reading a text file into a string. The character encoding of the text file is specified by srcCharset. Valid values, such as "iso-8859-1" or "utf-8", are listed at: List of Charsets.

Returns a null on failure

const char *CkHtmlToXml_toXml(HCkHtmlToXml cHandle);

Converts the HTML in the "Html" property to XML and returns the XML string.

Returns a null on failure
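
A typical in-memory conversion might look like the following sketch. It uses only functions listed on this page; the header name C_CkHtmlToXml.h, the sample HTML, and the unlock string are assumptions made for illustration. It also assumes the returned string is owned by the object, so the result is used before the handle is disposed.

```c
#include <stdio.h>
#include <C_CkHtmlToXml.h>   /* assumed header name for this C API */

int main(void)
{
    HCkHtmlToXml conv = CkHtmlToXml_Create();
    const char *xml;
    const char *err;

    /* An arbitrary code begins the 30-day trial, per UnlockComponent. */
    if (!CkHtmlToXml_UnlockComponent(conv, "Anything for 30-day trial")) {
        err = CkHtmlToXml_lastErrorText(conv);
        fprintf(stderr, "%s\n", err ? err : "unlock failed");
        CkHtmlToXml_Dispose(conv);
        return 1;
    }

    /* Sample input chosen for illustration only. */
    CkHtmlToXml_putHtml(conv,
        "<html><body><p>Hello<br>world</p></body></html>");
    CkHtmlToXml_putXmlCharset(conv, "utf-8");

    xml = CkHtmlToXml_toXml(conv);
    if (xml == NULL) {
        /* NULL signals failure; the error text explains why. */
        err = CkHtmlToXml_lastErrorText(conv);
        fprintf(stderr, "%s\n", err ? err : "conversion failed");
        CkHtmlToXml_Dispose(conv);
        return 1;
    }

    /* Use the result before disposing the handle. */
    printf("%s\n", xml);

    CkHtmlToXml_Dispose(conv);
    return 0;
}
```

Every exit path calls CkHtmlToXml_Dispose, since an undisposed handle leaks memory.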

const char *CkHtmlToXml_version(HCkHtmlToXml cHandle);

The version of the component, such as "1.0.0".

Returns a null on failure

const char *CkHtmlToXml_xml(HCkHtmlToXml cHandle);

Converts the HTML in the "Html" property to XML and returns the XML string.

Returns a null on failure

const char *CkHtmlToXml_xmlCharset(HCkHtmlToXml cHandle);

The charset, such as "utf-8" or "iso-8859-1", of the XML to be created. If XmlCharset is empty, the XML is created in the same character encoding as the HTML. Otherwise the HTML is converted to XML and the XML is converted to this charset.

Returns a null on failure