org.openprivacy.reptile
Class RSSContentSerializer

java.lang.Object
  |
  +--org.openprivacy.reptile.RSSContentSerializer

public class RSSContentSerializer
extends java.lang.Object

Handles serializing HTML content to RSS 1.0

Author:
Kevin A. Burton

Field Summary
static java.lang.String COMPONENT_VERSION
           
static boolean DEBUG
          When true we enable debug mode which prints out information about processing.
static boolean INCLUDE_ACCEPTABLE_WITHIN_FIRSTLEVEL
          If true, acceptable elements are also considered first level elements.
static boolean INCLUDE_FORMS
           
static int MAX_DESCRIPTION_LENGTH
           
static int MAX_TITLE_WIDTH
           
static int MIN_CDATA_LENGTH
          Variable used to detect if a piece of text is valid CDATA.
static int MIN_DESCRIPTION_LENGTH
          Keep adding stripped PCDATA to the description until it is at least this width.
static int MIN_JUNK_DATA_PERCENTAGE
          Minimum amount of junk data in PCDATA after which we consider the whole thing junk.
static int MIN_TITLE_WIDTH
           
static int MODE_ANCHOR
          Mode for matching A name sections.
static int MODE_FLEXIBLE
          Flexible parse mode.
static int MODE_MINIMAL
          Minimal parse mode.
static java.lang.String USER_AGENT_STRING
           
 
Constructor Summary
RSSContentSerializer()
          Create a new RSSContentSerializer instance.
 
Method Summary
 java.lang.String cleanseEntities(java.lang.String data)
           
 java.lang.String cleanseHTML(java.lang.String html)
          Cleanse pcdata of junk.
 java.lang.String cleansePCDATA(java.lang.String pcdata)
          Clean up PCDATA so it is better for RSS - remove fonts - remove
 java.lang.String cleanseTitle(java.lang.String title)
          Cleanse a title so that it can be represented correctly.
 java.lang.String delete(java.lang.String begin_regexp, java.lang.String end_regexp, java.lang.String pcdata)
          Delete the region between the two regexps and return the two strings.
 java.lang.String expand(java.lang.String link)
          Expand a link relative to the current site.
 java.lang.String getBase()
          Get the base of this URL.
 java.lang.String getContent()
          Return all the content for this item.
 int getContentStrippedLength()
          Get the length of all the stripped content.
 java.lang.String getDescription()
          Get the value of description.
 java.lang.String getHTML()
          Get the value of html.
 boolean getInitialized()
           
 int getMinRepassContentLength()
          Get the minimum amount of content we need before another repass. This is the minimum amount of content we need to do a second pass with td and br elements.
 int getMode()
          Get the mode we are operating in.
 org.openprivacy.reptile.RSSContentSerializer.PCDATASection[] getPCDATASections()
          Get all PCDATA entries that were found.
 java.lang.String getResource()
          Get the value of resource.
 java.lang.String getResourceAsString()
           
 java.lang.String getRSS()
          Get the resource as an RSS stream with mod_content
 java.lang.String getSite()
          Get the site for this resource.
 java.lang.String getTitle()
          Get the value of title.
 java.lang.String getTitle(java.lang.String description)
          Attempt to pull out the title from the given description
 void init()
          Initialize this if it hasn't been done.
 boolean isAcceptablePCDATA(org.openprivacy.reptile.RSSContentSerializer.PCDATASection section)
          Return true if this is an acceptable PCDATASection.
 boolean isHolderElement(java.lang.String local_name)
          Return true if the given local_name is a holder that can format HTML across a paragraph.
 boolean isJunkContent(java.lang.String content)
          Return true if this is junk content.
static void main(java.lang.String[] args)
          Handle operations from the command line.
 void parse()
          Parse this channel.
 java.lang.String relativize(java.lang.String content)
          Used to fix relative links in HTML content so that everything is expanded.
 void setDescription(java.lang.String description)
          Set the value of description.
 void setHTML(java.lang.String html)
          Set the value of html.
 void setInitialized(boolean initialized)
           
 void setModeMinimal()
          Set minimal mode and all options.
 void setResource(java.lang.String resource)
          Set the value of resource.
 void setTitle(java.lang.String title)
          Set the value of title.
 java.lang.String strip(java.lang.String content)
          Strip all elements from the given content.
 java.lang.String truncate(java.lang.String value, int length)
          Truncate the given value so that
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

COMPONENT_VERSION

public static final java.lang.String COMPONENT_VERSION
See Also:
Constant Field Values

USER_AGENT_STRING

public static final java.lang.String USER_AGENT_STRING
See Also:
Constant Field Values

MIN_TITLE_WIDTH

public static final int MIN_TITLE_WIDTH
See Also:
Constant Field Values

MAX_TITLE_WIDTH

public static final int MAX_TITLE_WIDTH
See Also:
Constant Field Values

MIN_JUNK_DATA_PERCENTAGE

public static final int MIN_JUNK_DATA_PERCENTAGE
Minimum amount of junk data in PCDATA after which we consider the whole thing junk.

See Also:
Constant Field Values

INCLUDE_ACCEPTABLE_WITHIN_FIRSTLEVEL

public static final boolean INCLUDE_ACCEPTABLE_WITHIN_FIRSTLEVEL
If true, acceptable elements are also considered first level elements.

See Also:
Constant Field Values

MODE_MINIMAL

public static final int MODE_MINIMAL
Minimal parse mode. Less chance of failure and false positive.

See Also:
Constant Field Values

MODE_FLEXIBLE

public static final int MODE_FLEXIBLE
Flexible parse mode. More chance of failure and false positive but works with sites that syndicate content with
and tags. Should only be used if MODE_MINIMAL doesn't work.

See Also:
Constant Field Values

MODE_ANCHOR

public static final int MODE_ANCHOR
Mode for matching A name sections.

See Also:
Constant Field Values

MIN_DESCRIPTION_LENGTH

public static final int MIN_DESCRIPTION_LENGTH
Keep adding stripped PCDATA to the description until it is at least this width.

See Also:
Constant Field Values

MAX_DESCRIPTION_LENGTH

public static final int MAX_DESCRIPTION_LENGTH
See Also:
Constant Field Values

MIN_CDATA_LENGTH

public static final int MIN_CDATA_LENGTH
Variable used to detect if a piece of text is valid CDATA. If a piece of text is shorter than MIN_CDATA_LENGTH we do not consider it acceptable.

HISTORY:
- 20 - seems to work on most places but fails in some.
- 50 - required for sites that syndicate links to other articles prior to the real content. The only problem with this larger value is that it skips small valid links like "This is the title", but I believe this is acceptable.

See Also:
Constant Field Values
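
A minimal sketch of the length check described for MIN_CDATA_LENGTH. It assumes the threshold is 50 (the last value in the history above; the actual constant may differ) and that length is measured on trimmed text. The class and method names are illustrative and are not part of RSSContentSerializer.

```java
final class CdataLengthSketch {
    // Assumed value: the history mentions 20 and then 50; the real constant may differ.
    static final int MIN_CDATA_LENGTH = 50;

    // Returns true when the trimmed text has at least MIN_CDATA_LENGTH characters.
    static boolean isLongEnough(String text) {
        return text != null && text.trim().length() >= MIN_CDATA_LENGTH;
    }

    public static void main(String[] args) {
        // A short title-like fragment falls under the threshold and is rejected.
        System.out.println(isLongEnough("This is the title"));
        // A sentence-length run of text clears the threshold.
        System.out.println(isLongEnough(
            "A full paragraph of article text that easily clears the threshold."));
    }
}
```

The trade-off mirrors the history note: a higher threshold filters out link lists that precede the real content, at the cost of dropping short but valid fragments.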

DEBUG

public static final boolean DEBUG
When true we enable debug mode which prints out information about processing. This code is somewhat complex and difficult to understand so this is necessary on URLs that are broken. Keeping this a static final variable should allow the compiler to remove this code.

See Also:
Constant Field Values
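
The remark about the compiler removing debug code refers to a standard Java idiom: when the guard is a compile-time constant, an `if (DEBUG)` block acts as conditional compilation and the compiler can omit it when the constant is false. The sketch below illustrates the idiom only; the class, the method, and the message are hypothetical and are not taken from RSSContentSerializer.

```java
final class DebugFlagSketch {
    // A compile-time constant. With false here, the guarded block below
    // is dead code that the compiler is allowed to leave out.
    static final boolean DEBUG = false;

    // Counts whitespace-separated words, logging only in debug builds.
    static int countWords(String text) {
        String trimmed = text.trim();
        int count = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        if (DEBUG) {
            System.err.println("countWords: " + count + " words");
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countWords("some stripped PCDATA text"));
    }
}
```

Because the flag is fixed at compile time, switching debug output on means changing the constant and recompiling.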

INCLUDE_FORMS

public static final boolean INCLUDE_FORMS
See Also:
Constant Field Values

Constructor Detail

RSSContentSerializer

public RSSContentSerializer()
Create a new RSSContentSerializer instance.

Method Detail

getHTML

public java.lang.String getHTML()
Get the value of html.


setHTML

public void setHTML(java.lang.String html)
Set the value of html.


getResourceAsString

public java.lang.String getResourceAsString()
                                     throws java.lang.Exception
java.lang.Exception

getTitle

public java.lang.String getTitle()
Get the value of title.


setTitle

public void setTitle(java.lang.String title)
Set the value of title.


getDescription

public java.lang.String getDescription()
Get the value of description.


setDescription

public void setDescription(java.lang.String description)
Set the value of description.


getResource

public java.lang.String getResource()
Get the value of resource.


setResource

public void setResource(java.lang.String resource)
Set the value of resource.


init

public void init()
          throws java.lang.Exception
Initialize this if it hasn't been done. All initialization does is fetch the HTML for this serializer.

java.lang.Exception

parse

public void parse()
           throws java.lang.Exception
Parse this channel. This should be called before any other methods that return any data.

java.lang.Exception

getRSS

public java.lang.String getRSS()
                        throws java.lang.Exception
Get the resource as an RSS stream with mod_content

java.lang.Exception

strip

public java.lang.String strip(java.lang.String content)
                       throws java.lang.Exception
Strip all elements from the given content. If we strip everything out and there is no content (only markup) we return null. We also strip out &nbsp; entities, replace them with " ", and then trim the entire string. We also normalize the entire string: duplicate spaces are replaced with a single space, and duplicate \n chars are replaced with a single \n.

java.lang.Exception
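
The behavior described for strip can be approximated with a few regular expressions. This is a rough sketch under stated assumptions, not the class's actual implementation: it treats anything between < and > as markup, and the class name StripSketch is hypothetical.

```java
import java.util.regex.Pattern;

final class StripSketch {
    // Assumption: anything between '<' and '>' counts as an element.
    private static final Pattern TAGS = Pattern.compile("<[^>]*>");

    static String strip(String content) {
        if (content == null) {
            return null;
        }
        // Remove the markup itself.
        String text = TAGS.matcher(content).replaceAll("");
        // Replace non-breaking spaces (entity or character) with plain spaces.
        text = text.replace("&nbsp;", " ").replace('\u00A0', ' ');
        // Collapse runs of spaces and tabs into a single space.
        text = text.replaceAll("[ \\t]+", " ");
        // Collapse runs of newlines (and whitespace around them) into one \n.
        text = text.replaceAll("\\s*\\n\\s*", "\n");
        text = text.trim();
        // Only markup was present: signal that with null, as described above.
        return text.isEmpty() ? null : text;
    }

    public static void main(String[] args) {
        System.out.println(strip("<p>Hello&nbsp;&nbsp;<b>world</b></p>\n\n<br>"));
    }
}
```

A regex-based approach like this is fragile on malformed HTML (for example a > inside an attribute value), so a real implementation may rely on a parser instead.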

relativize

public java.lang.String relativize(java.lang.String content)
                            throws java.lang.Exception
Used to fix relative links in HTML content so that everything is expanded.

java.lang.Exception

main

public static void main(java.lang.String[] args)
Handle operations from the command line.


getPCDATASections

public org.openprivacy.reptile.RSSContentSerializer.PCDATASection[] getPCDATASections()
Get all PCDATA entries that were found.


getContent

public java.lang.String getContent()
Return all the content for this item.


getBase

public java.lang.String getBase()
Get the base of this URL. For example, if we are given http://www.foo.com/directory/index.html we will return http://www.foo.com/directory


getSite

public java.lang.String getSite()
Get the site for this resource. For example, if we are given http://www.foo.com/directory/index.html we will return http://www.foo.com
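
The two examples given for getBase and getSite can be reproduced with java.net.URI. The sketch below is illustrative only: BaseAndSiteSketch and its static methods are hypothetical names, and the real methods take no argument because they work on the serializer's own resource.

```java
import java.net.URI;

final class BaseAndSiteSketch {
    // Scheme plus authority, e.g. http://www.foo.com
    static String site(String url) {
        URI uri = URI.create(url);
        return uri.getScheme() + "://" + uri.getAuthority();
    }

    // The site plus the path up to, but not including, the last '/'.
    static String base(String url) {
        String path = URI.create(url).getPath();
        int lastSlash = path.lastIndexOf('/');
        String directory = lastSlash > 0 ? path.substring(0, lastSlash) : "";
        return site(url) + directory;
    }

    public static void main(String[] args) {
        String url = "http://www.foo.com/directory/index.html";
        System.out.println(base(url)); // http://www.foo.com/directory
        System.out.println(site(url)); // http://www.foo.com
    }
}
```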


expand

public java.lang.String expand(java.lang.String link)
                        throws java.lang.Exception
Expand a link relative to the current site. This takes care of links such as:

/foo.html -> http://site.com/base/foo.html
foo.html -> http://site.com/base/foo.html

Links should *always* be expanded before they are used. Note that all resource URLs will have correct trailing slashes. If the URL does not end with / then it is a file URL and not a directory.

java.lang.Exception
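
The note about trailing slashes matters for any link resolver. The sketch below uses the standard java.net.URI.resolve, which follows generic URI resolution rules, to show how a file URL and a directory URL resolve differently. It is not the class's implementation: ExpandSketch is a hypothetical name, and standard resolution maps a root-relative link such as /foo.html to the site root, which differs from the first mapping listed above.

```java
import java.net.URI;

final class ExpandSketch {
    // Resolve a possibly relative link against the URL of the current resource.
    static String expand(String resource, String link) {
        return URI.create(resource).resolve(link).toString();
    }

    public static void main(String[] args) {
        // Directory URL (trailing slash): the link lands inside the directory.
        System.out.println(expand("http://site.com/base/", "foo.html"));
        // File URL: the last path segment is replaced by the link.
        System.out.println(expand("http://site.com/base/index.html", "foo.html"));
        // No trailing slash: "base" is treated as a file, not a directory.
        System.out.println(expand("http://site.com/base", "foo.html"));
    }
}
```

With this approach, absolute links pass through unchanged, so expanding every link before use is safe.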

isJunkContent

public boolean isJunkContent(java.lang.String content)
                      throws java.lang.Exception
Return true if this is junk content, for example if it only contains links. This works very similarly to #isAcceptablePCDATA, but the main difference is that this is much more picky and tries to avoid false positives at all costs. If isJunkContent did return true on a valid PCDATA section we would not include it, and this would be a bad thing.

java.lang.Exception

getMode

public int getMode()
Get the mode we are operating in.


setModeMinimal

public void setModeMinimal()
Set minimal mode and all options.


cleanseHTML

public java.lang.String cleanseHTML(java.lang.String html)
                             throws java.lang.Exception
Cleanse pcdata of junk. This includes comments, etc.

java.lang.Exception

delete

public java.lang.String delete(java.lang.String begin_regexp,
                               java.lang.String end_regexp,
                               java.lang.String pcdata)
                        throws java.lang.Exception
Delete the region between the two regexps and return the two strings.

java.lang.Exception
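
One plausible reading of delete is "remove everything from a match of begin_regexp through the next match of end_regexp". The sketch below implements that reading with java.util.regex. Removing the delimiters themselves, removing every such region rather than only the first, and leaving unterminated regions untouched are all assumptions, and DeleteRegionSketch is a hypothetical name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DeleteRegionSketch {
    static String delete(String beginRegexp, String endRegexp, String pcdata) {
        Matcher begin = Pattern.compile(beginRegexp).matcher(pcdata);
        Matcher end = Pattern.compile(endRegexp).matcher(pcdata);
        StringBuilder out = new StringBuilder();
        int pos = 0;
        // Find each begin match, then the first end match after it.
        while (begin.find(pos) && end.find(begin.end())) {
            // Keep the text before the region, drop the region itself.
            out.append(pcdata, pos, begin.start());
            if (end.end() == pos) {
                break; // both patterns matched the empty string: avoid looping forever
            }
            pos = end.end();
        }
        // Keep whatever follows the last deleted region.
        out.append(pcdata, pos, pcdata.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // Example: dropping HTML comments, one kind of junk a cleanser might remove.
        System.out.println(delete("<!--", "-->", "a<!-- x -->b<!-- y -->c"));
    }
}
```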

cleansePCDATA

public java.lang.String cleansePCDATA(java.lang.String pcdata)
                               throws java.lang.Exception
Clean up PCDATA so it is better for RSS - remove fonts - remove

java.lang.Exception

getContentStrippedLength

public int getContentStrippedLength()
Get the length of all the stripped content.


isAcceptablePCDATA

public boolean isAcceptablePCDATA(org.openprivacy.reptile.RSSContentSerializer.PCDATASection section)
Return true if this is an acceptable PCDATASection. This is done by analyzing the text and figuring out if we can actually use this.


isHolderElement

public boolean isHolderElement(java.lang.String local_name)
Return true if the given local_name is a holder that can format HTML across a paragraph.


setInitialized

public void setInitialized(boolean initialized)

getInitialized

public boolean getInitialized()

truncate

public java.lang.String truncate(java.lang.String value,
                                 int length)
Truncate the given value so that


getTitle

public java.lang.String getTitle(java.lang.String description)
                          throws java.lang.Exception
Attempt to pull out the title from the given description

java.lang.Exception

getMinRepassContentLength

public int getMinRepassContentLength()
                              throws java.lang.Exception
Get the minimum amount of content we need before another repass. This is the minimum amount of content we need to do a second pass with td and br elements.

java.lang.Exception

cleanseEntities

public java.lang.String cleanseEntities(java.lang.String data)
                                 throws java.lang.Exception
java.lang.Exception

cleanseTitle

public java.lang.String cleanseTitle(java.lang.String title)
                              throws java.lang.Exception
Cleanse a title so that it can be represented correctly.

java.lang.Exception