CkHtmlToXml Ruby Programming
Reference Documentation

CkHtmlToXml

Class for converting HTML to well-formed XML for the purpose of programmatically extracting (scraping) information from any HTML page.

Properties

# Returns a boolean value
get_DropCustomTags( )

# v is a boolean (input)
put_DropCustomTags( v )

If set to true, then any non-standard HTML tags will be dropped when converting to XML.

# str is a CkString object (output)
get_Html( str )

# html is a string (input)
put_Html( html )

The HTML to be converted by the ToXml method. To convert HTML to XML, first set this property to the HTML string and then call ToXml. The ConvertFile method can do file-to-file conversions.

# str is a CkString object (output)
LastErrorHtml( str )

Error information in HTML format for the last method called.

# str is a CkString object (output)
LastErrorText( str )

Error information in plain-text format for the last method called.

# str is a CkString object (output)
LastErrorXml( str )

Error information in XML format for the last method called.

# Returns an integer value
get_Nbsp( )

# v is an integer (input)
put_Nbsp( v )

Determines how to handle &nbsp; HTML entities. The default value, 0, causes &nbsp; entities to be converted to normal space characters (ASCII value 32). If this property is set to 1, &nbsp; entities are converted to &#160;. If set to 2, &nbsp; entities are dropped.
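The Nbsp modes can be illustrated in plain Ruby. This sketch only mimics the described behaviour on a string and is not the library's implementation; the name apply_nbsp_mode is hypothetical, and it assumes that mode 1 emits the numeric character reference for a non-breaking space.

```ruby
# Illustration only: mimics the documented Nbsp modes on a plain string.
# apply_nbsp_mode is a hypothetical helper, not part of CkHtmlToXml.
def apply_nbsp_mode(text, mode)
  case mode
  when 0 then text.gsub("&nbsp;", " ")      # normal space (ASCII 32)
  when 1 then text.gsub("&nbsp;", "&#160;") # assumed numeric reference
  when 2 then text.gsub("&nbsp;", "")       # entity dropped
  else raise ArgumentError, "unknown Nbsp mode: #{mode}"
  end
end
```

With a real converter, the mode would be selected with put_Nbsp( v ) before calling ToXml.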

# Returns a boolean value
get_Utf8( )

# b is a boolean (input)
put_Utf8( b )

When set to true, all "const char *" arguments are expected to be utf-8 strings. If set to false, the "const char *" arguments are expected to be ANSI strings.

# str is a CkString object (output)
get_Version( str )

The version of the component, such as "1.0.0".

# str is a CkString object (output)
get_XmlCharset( str )

# html is a string (input)
put_XmlCharset( html )

The charset of the XML to be created, such as "utf-8" or "iso-8859-1". If XmlCharset is empty, the XML is created in the same character encoding as the HTML. Otherwise the HTML is converted to XML and the XML is converted to this charset.

Methods

# inHtmlFilename is a string (input)
# outXmlFilename is a string (input)
# Returns a boolean value
ConvertFile( inHtmlFilename, outXmlFilename )

Converts an HTML file to a well-formed XML file that can be parsed for the purpose of programmatically extracting information.
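A minimal sketch of a file-to-file conversion with error reporting. It assumes that converter is an already unlocked CkHtmlToXml instance (typically created with Chilkat::CkHtmlToXml.new after require 'chilkat'); the helper name convert_html_file is hypothetical.

```ruby
# Hypothetical helper: converts one HTML file to XML and raises on failure.
# `converter` is assumed to be an unlocked CkHtmlToXml instance.
def convert_html_file(converter, in_html_filename, out_xml_filename)
  ok = converter.ConvertFile(in_html_filename, out_xml_filename)
  # lastErrorText describes the last method called, so read it right away.
  raise "ConvertFile failed: #{converter.lastErrorText()}" unless ok
  out_xml_filename
end
```

Raising on a false return keeps the error text close to the call that produced it, since the LastError properties are overwritten by the next method call.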

# tagName is a string (input)
DropTagType( tagName )

Allows for any specified tag to be dropped from the output XML.

DropTextFormattingTags( )

Causes text formatting tags to be dropped from the XML output. Text formatting tags are: b, font, i, u, br, center, em, strong, big, tt, s, small, strike, sub, and sup.

# Returns a boolean value
IsUnlocked( )

Returns true if the component is already unlocked. Otherwise returns false.

# filename is a string (input)
# srcCharset is a string (input)
# str is a CkString object (output)
# Returns a boolean value
ReadFileToString( filename, srcCharset, str )

Convenience method for reading a text file into a string. The character encoding of the text file is specified by srcCharset. Valid values, such as "iso-8859-1" or "utf-8", are listed in the List of Charsets.

Returns true for success, false for failure.

# filename is a string (input)
# Returns a boolean value
SaveLastError( filename )

Saves the last error information to an XML formatted file.

# filename is a string (input)
# Returns a boolean value
SetHtmlFromFile( filename )

Sets the Html property by loading the HTML from a file.

# str is a CkString object (output)
ToXml( str )

Converts the HTML in the "Html" property to XML and returns the XML string.

Returns true for success, false for failure.

# tagName is a string (input)
UndropTagType( tagName )

Causes a specified type of tag to NOT be dropped in the output XML.

UndropTextFormattingTags( )

Causes text formatting tags to NOT be dropped from the XML output. Text formatting tags are: b, font, i, u, br, center, em, strong, big, tt, s, small, strike, sub, and sup.

Important: Text formatting tags are dropped by default. Call this method to prevent text formatting tags from being dropped.
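A short sketch of combining the drop and undrop calls before a conversion. It assumes that converter is a CkHtmlToXml instance; the helper name configure_tags and the default tag names "script" and "style" are example values chosen for illustration, not taken from this reference.

```ruby
# Hypothetical helper: keep inline formatting tags such as <b> and <i>,
# but drop every tag type listed in `unwanted` (example values only).
def configure_tags(converter, unwanted = ["script", "style"])
  # Formatting tags are dropped by default, so opt back in explicitly.
  converter.UndropTextFormattingTags()
  unwanted.each { |tag_name| converter.DropTagType(tag_name) }
  converter
end
```

The helper returns the converter so that it can be followed directly by a call such as ToXml or ConvertFile.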

# code is a string (input)
# Returns a boolean value
UnlockComponent( code )

Unlocks the component. An arbitrary unlock code can be passed to automatically begin a 30-day trial.

Returns true for success, false for failure.
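A minimal sketch of unlocking only when needed. It assumes that converter is a CkHtmlToXml instance and that unlock_code is supplied by the caller; the helper name ensure_unlocked is hypothetical.

```ruby
# Hypothetical helper: unlocks the component once and raises if that fails.
def ensure_unlocked(converter, unlock_code)
  # Skip the unlock call when the component is already unlocked.
  return true if converter.IsUnlocked()
  ok = converter.UnlockComponent(unlock_code)
  raise "UnlockComponent failed: #{converter.lastErrorText()}" unless ok
  true
end
```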

# str is a string (input)
# filename is a string (input)
# charset is a string (input)
# Returns a boolean value
WriteStringToFile( str, filename, charset )

Convenience method for saving a string to a file.
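A sketch of pairing this method with readFileToString to re-encode a text file. It assumes that converter is a CkHtmlToXml instance and that the charset argument names the encoding used for the output file, which this reference does not state explicitly; the helper name transcode_text_file is hypothetical.

```ruby
# Hypothetical helper: reads a text file in one charset and writes it in another.
# Assumes `charset` in WriteStringToFile selects the output encoding.
def transcode_text_file(converter, src_filename, src_charset, dest_filename, dest_charset)
  text = converter.readFileToString(src_filename, src_charset)
  ok = converter.WriteStringToFile(text, dest_filename, dest_charset)
  raise "WriteStringToFile failed: #{converter.lastErrorText()}" unless ok
  text
end
```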

# Returns a string
html( )

The HTML to be converted by the ToXml method. To convert HTML to XML, first set this property to the HTML string and then call ToXml. The ConvertFile method can do file-to-file conversions.

# Returns a string
lastErrorHtml( )

Error information in HTML format for the last method called.

# Returns a string
lastErrorText( )

Error information in plain-text format for the last method called.

# Returns a string
lastErrorXml( )

Error information in XML format for the last method called.

# filename is a string (input)
# srcCharset is a string (input)
# Returns a string
readFileToString( filename, srcCharset )

Convenience method for reading a text file into a string. The character encoding of the text file is specified by srcCharset. Valid values, such as "iso-8859-1" or "utf-8", are listed in the List of Charsets.

Returns the file contents as a string.

# Returns a string
toXml( )

Converts the HTML in the "Html" property to XML and returns the XML string.
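A minimal sketch of an in-memory conversion using the string-returning variant. It assumes that converter is an unlocked CkHtmlToXml instance; the helper name html_string_to_xml and the "utf-8" default are illustrative choices.

```ruby
# Hypothetical helper: converts an HTML string to an XML string.
def html_string_to_xml(converter, html, xml_charset = "utf-8")
  converter.put_XmlCharset(xml_charset) # charset of the XML to be created
  converter.put_Html(html)              # the HTML that toXml will convert
  converter.toXml()
end
```

This reference does not say what the string-returning toXml yields on failure, so the sketch returns its result unchanged. When a success flag is needed, the ToXml( str ) variant returns a boolean.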

# Returns a string
version( )

The version of the component, such as "1.0.0".

# Returns a string
xml( )

Converts the HTML in the "Html" property to XML and returns the XML string.

# Returns a string
xmlCharset( )

The charset of the XML to be created, such as "utf-8" or "iso-8859-1". If XmlCharset is empty, the XML is created in the same character encoding as the HTML. Otherwise the HTML is converted to XML and the XML is converted to this charset.