CkHtmlToXml Python Programming Reference Documentation
CkHtmlToXml
Class for converting HTML to well-formed XML for the purpose of programmatically extracting (scraping) information from any HTML page.
Object Creation
obj = chilkat.CkHtmlToXml()
Properties
# Returns a boolean value get_DropCustomTags( )
# v is a boolean (input) put_DropCustomTags( v )
If set to true, then any non-standard HTML tags will be dropped when converting to XML.
# str is a CkString object (output) get_Html( str )
# html is a string (input) put_Html( html )
The HTML to be converted by the ToXml method. To convert HTML to XML, first set this property to the HTML string and then call ToXml. The ConvertFile method can do file-to-file conversions.
# str is a CkString object (output) LastErrorHtml( str )
Error information in HTML format for the last method called.
# str is a CkString object (output) LastErrorText( str )
Error information in plain-text format for the last method called.
# str is a CkString object (output) LastErrorXml( str )
Error information in XML format for the last method called.
# Returns an integer value get_Nbsp( )
# v is an integer (input) put_Nbsp( v )
Determines how to handle &nbsp; HTML entities. The default value, 0, causes &nbsp; entities to be converted to normal space characters (ASCII value 32). If this property is set to 1, then &nbsp; entities are converted to &#160;. If set to 2, then &nbsp; entities are dropped.
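The Nbsp values can be given readable names. This is a minimal sketch, assuming converter is a CkHtmlToXml object created as shown under Object Creation. The constant names and the helper set_nbsp_mode are hypothetical and are not part of the Chilkat API.

```python
# Hypothetical names for the documented Nbsp values (not part of the Chilkat API).
NBSP_TO_SPACE = 0    # default: convert &nbsp; to a normal space (ASCII 32)
NBSP_TO_NUMERIC = 1  # convert &nbsp; to a numeric character reference
NBSP_DROP = 2        # drop &nbsp; entities entirely


def set_nbsp_mode(converter, mode):
    """Set the Nbsp property, rejecting values this reference does not describe."""
    if mode not in (NBSP_TO_SPACE, NBSP_TO_NUMERIC, NBSP_DROP):
        raise ValueError("unsupported Nbsp mode: %r" % (mode,))
    converter.put_Nbsp(mode)
    # Read the property back so the caller can confirm what was stored.
    return converter.get_Nbsp()
```

A call such as set_nbsp_mode(converter, NBSP_DROP) would be made before converting.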
# Returns a boolean value get_Utf8( )
# b is a boolean (input) put_Utf8( b )
When set to true, all "const char *" arguments are expected to be utf-8 strings. If set to false, the "const char *" arguments are expected to be ANSI strings.
# str is a CkString object (output) get_Version( str )
The version of the component, such as "1.0.0".
# str is a CkString object (output) get_XmlCharset( str )
# html is a string (input) put_XmlCharset( html )
The charset, such as "utf-8" or "iso-8859-1", of the XML to be created. If XmlCharset is empty, the XML is created in the same character encoding as the HTML. Otherwise the HTML is converted to XML and the XML is converted to this charset.
Methods
# inHtmlFilename is a string (input) # outXmlFilename is a string (input) # Returns a boolean value ConvertFile( inHtmlFilename, outXmlFilename )
Converts an HTML file to a well-formed XML file that can be parsed for the purpose of programmatically extracting information.
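A file-to-file conversion with an error check might look like the sketch below. It assumes converter is an unlocked CkHtmlToXml object. The helper name convert_html_file is hypothetical, and raising RuntimeError on failure is a design choice of this sketch rather than library behavior.

```python
def convert_html_file(converter, html_path, xml_path):
    """Convert the HTML file at html_path to an XML file at xml_path.

    Raises RuntimeError carrying the component's last error text when
    ConvertFile reports failure.
    """
    ok = converter.ConvertFile(html_path, xml_path)
    if not ok:
        raise RuntimeError(converter.lastErrorText())
    return xml_path
```

The file names passed in, for example "page.html" and "page.xml", are placeholders.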
# tagName is a string (input) DropTagType( tagName )
Allows for any specified tag to be dropped from the output XML. To drop more than one tag, call this method once for each tag type to be dropped.
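Because DropTagType takes one tag per call, dropping several tags is a simple loop. This is a sketch only: the helper name drop_tags is hypothetical, and the tag names in the usage note are illustrative choices.

```python
def drop_tags(converter, tag_names):
    """Call DropTagType once for each tag name; return how many were registered."""
    count = 0
    for name in tag_names:
        converter.DropTagType(name)
        count += 1
    return count
```

For example, drop_tags(converter, ["script", "style"]) would exclude both tag types from the output XML.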
DropTextFormattingTags( )
Causes text formatting tags to be dropped from the XML output. Text formatting tags are: b, font, i, u, br, center, em, strong, big, tt, s, small, strike, sub, and sup.
# Returns a boolean value IsUnlocked( )
Returns True if the component is already unlocked. Otherwise returns False.
# filename is a string (input) # srcCharset is a string (input) # str is a CkString object (output) # Returns a boolean value ReadFileToString( filename, srcCharset, str )
Convenience method for reading a text file into a string. The character encoding of the text file is specified by srcCharset. Valid values, such as "iso-8859-1" or "utf-8", are listed in the List of Charsets. Returns True for success, False for failure.
# filename is a string (input) # Returns a boolean value SaveLastError( filename )
Saves the last error information to an XML formatted file.
# filename is a string (input) # Returns a boolean value SetHtmlFromFile( filename )
Sets the Html property by loading the HTML from a file.
# str is a CkString object (output) ToXml( str )
Converts the HTML in the "Html" property to XML and places the XML string in str. Returns True for success, False for failure.
# tagName is a string (input) UndropTagType( tagName )
Causes a specified type of tag to NOT be dropped in the output XML.
UndropTextFormattingTags( )
Causes text formatting tags to NOT be dropped from the XML output. Text formatting tags are: b, font, i, u, br, center, em, strong, big, tt, s, small, strike, sub, and sup.
Important: Text formatting tags are dropped by default. Call this method to prevent text formatting tags from being dropped.
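Since text formatting tags are dropped by default, code that needs them in the output has to opt in explicitly. The sketch below wraps both calls behind one flag. The helper name set_keep_formatting is hypothetical.

```python
def set_keep_formatting(converter, keep):
    """Keep (True) or drop (False) text formatting tags such as b, i, and font."""
    if keep:
        converter.UndropTextFormattingTags()
    else:
        converter.DropTextFormattingTags()
    return keep
```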
# code is a string (input) # Returns a boolean value UnlockComponent( code )
Unlocks the component. An arbitrary unlock code may be passed to automatically begin a 30-day trial. Returns True for success, False for failure.
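A typical startup check combines IsUnlocked and UnlockComponent. This is a minimal sketch: the helper name ensure_unlocked is hypothetical, and the unlock code is whatever string the caller supplies.

```python
def ensure_unlocked(converter, unlock_code):
    """Unlock the component if needed.

    Returns False when it was already unlocked and True when this call
    unlocked it. Raises RuntimeError with the last error text on failure.
    """
    if converter.IsUnlocked():
        return False
    if not converter.UnlockComponent(unlock_code):
        raise RuntimeError(converter.lastErrorText())
    return True
```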
# stringToWrite is a string (input) # filename is a string (input) # outputCharset is a string (input) # Returns a boolean value WriteStringToFile( stringToWrite, filename, outputCharset )
Convenience method for saving a string to a file. The character encoding of the output text file is specified by outputCharset (the string is converted to this charset when writing). Valid values, such as "iso-8859-1" or "utf-8", are listed in the List of Charsets.
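Saving converted XML with an explicit charset could be sketched as follows. The helper name save_text and its "utf-8" default are hypothetical, and the sketch assumes that a False return value from WriteStringToFile means failure, as with the other boolean methods in this reference.

```python
def save_text(converter, text, path, charset="utf-8"):
    """Write text to path in the given charset using WriteStringToFile."""
    # Assumption: False signals failure, matching the other boolean methods.
    if not converter.WriteStringToFile(text, path, charset):
        raise RuntimeError(converter.lastErrorText())
    return path
```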
# Returns a string html( )
The HTML to be converted by the ToXml method. To convert HTML to XML, first set this property to the HTML string and then call ToXml. The ConvertFile method can do file-to-file conversions.
# Returns a string lastErrorHtml( )
Error information in HTML format for the last method called.
# Returns a string lastErrorText( )
Error information in plain-text format for the last method called.
# Returns a string lastErrorXml( )
Error information in XML format for the last method called.
# filename is a string (input) # srcCharset is a string (input) # Returns a string readFileToString( filename, srcCharset )
Convenience method for reading a text file into a string. The character encoding of the text file is specified by srcCharset. Valid values, such as "iso-8859-1" or "utf-8", are listed in the List of Charsets. Returns the file contents as a string.
# Returns a string toXml( )
Converts the HTML in the "Html" property to XML and returns the XML string.
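An in-memory conversion using the string-returning toXml might look like this sketch. The helper name html_to_xml is hypothetical. Treating an empty or None result as failure is an assumption of this sketch, not documented behavior.

```python
def html_to_xml(converter, html, xml_charset=""):
    """Convert an HTML string to an XML string.

    When xml_charset is empty, the XmlCharset property is left untouched.
    """
    if xml_charset:
        converter.put_XmlCharset(xml_charset)
    converter.put_Html(html)
    xml = converter.toXml()
    # Assumption: an empty or None result indicates that conversion failed.
    if not xml:
        raise RuntimeError(converter.lastErrorText())
    return xml
```

With converter = chilkat.CkHtmlToXml(), a call such as html_to_xml(converter, "<p>Hello</p>", "utf-8") returns the XML text.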
# Returns a string version( )
The version of the component, such as "1.0.0".
# Returns a string xml( )
Converts the HTML in the "Html" property to XML and returns the XML string.
# Returns a string xmlCharset( )
The charset, such as "utf-8" or "iso-8859-1", of the XML to be created. If XmlCharset is empty, the XML is created in the same character encoding as the HTML. Otherwise the HTML is converted to XML and the XML is converted to this charset.