Overview

The Reptile HTCache is a mechanism for caching remote HTML content locally and making it portable between computer systems. We need it so that remote HTML documents remain available during a network outage, and so that the cached content can potentially be transferred 'out-of-band' over a P2P network to a remote peer.

Most HTML documents reference associated images, JavaScript, stylesheets, etc., and these must be cached as well if a page is to display correctly.


Design

The HTCache mechanism is designed around the Panther proxy and the Torque database system.

The actual HTML text and associated resources are kept on disk in the Panther root directory. An index of all documents is kept in the database, in the HTCACHE and HTCACHE_RESOURCES tables: each document is registered in HTCACHE, and the resources it requires are recorded in HTCACHE_RESOURCES.
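
The relationship between the two tables can be pictured with a small in-memory sketch. This is only an illustration: the class name HtCacheIndexSketch, its methods, and the idea of keying rows by URL are assumptions, and a real implementation would go through Torque rather than plain maps.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the HTCACHE and HTCACHE_RESOURCES
// tables. The keys and fields are assumptions; the real schema is not
// described in this document.
class HtCacheIndexSketch {

    // HTCACHE: one entry per cached document, original URL -> path on disk.
    private final Map<String, String> documents = new LinkedHashMap<>();

    // HTCACHE_RESOURCES: zero or more resource URLs per document.
    private final Map<String, List<String>> resources = new LinkedHashMap<>();

    void registerDocument(String url, String localPath) {
        documents.put(url, localPath);
        resources.putIfAbsent(url, new ArrayList<>());
    }

    void addResource(String documentUrl, String resourceUrl) {
        List<String> list = resources.get(documentUrl);
        if (list == null) {
            // A resource row only makes sense for a registered document.
            throw new IllegalArgumentException("unknown document: " + documentUrl);
        }
        list.add(resourceUrl);
    }

    // Returns null when the document is not registered.
    String localPathOf(String url) {
        return documents.get(url);
    }

    // Returns an empty list when the document is not registered.
    List<String> resourcesOf(String documentUrl) {
        List<String> list = resources.get(documentUrl);
        if (list == null) {
            return Collections.<String>emptyList();
        }
        return Collections.unmodifiableList(list);
    }
}
```

Keeping resources in their own structure mirrors the one-to-many split between the two tables: a document is registered once, and each image, script, or stylesheet it needs becomes a separate entry.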


Storage

All HTCache integration lives in the org.openprivacy.reptile.htcache package. Every time a new article is found, we cache it with the HTCache mechanism.

Finding the associated resources essentially just runs regular expressions across the HTML file and looks for patterns such as:


<img src="hello.png">    
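
A minimal sketch of this kind of scan, using java.util.regex, might look like the following. The class name ResourceScanSketch, the method findResources, and the choice of tags (img, script, link) are assumptions made for illustration; the actual patterns used by HTCache are not listed here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not the real HTCache scanner.
class ResourceScanSketch {

    // Matches src="..." on img/script tags and href="..." on link tags.
    // Deliberately simple: unquoted attribute values and CSS url(...)
    // references are not detected.
    private static final Pattern RESOURCE = Pattern.compile(
        "<(?:img|script)\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']+)[\"']"
        + "|<link\\b[^>]*?\\bhref\\s*=\\s*[\"']([^\"']+)[\"']",
        Pattern.CASE_INSENSITIVE);

    // Returns the resource URLs in the order they appear in the HTML.
    static List<String> findResources(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = RESOURCE.matcher(html);
        while (m.find()) {
            // Group 1 is set for img/script, group 2 for link.
            String url = m.group(1) != null ? m.group(1) : m.group(2);
            found.add(url);
        }
        return found;
    }
}
```

For the example above, `ResourceScanSketch.findResources("<img src=\"hello.png\">")` yields a list containing only `hello.png`.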

        

If any URLs are in the cache and are NOT relative we need to rewrite these so that they are relative. If they used the full path we wouldn't be able to
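
The rewrite of non-relative URLs could be sketched as below. The on-disk layout assumed here (host name followed by the URL path, relative to the cache directory) is purely an assumption for illustration, as are the names RelativeLinkSketch, toCacheRelative, and rewrite.

```java
import java.net.URI;

// Hypothetical helper; the real cache layout is not described here.
class RelativeLinkSketch {

    // Maps an absolute URL with a host to "host/path". Relative URLs and
    // URLs without a host (for example mailto:) are returned unchanged.
    // Query strings are ignored. URI.create throws
    // IllegalArgumentException for malformed input.
    static String toCacheRelative(String url) {
        URI uri = URI.create(url);
        if (!uri.isAbsolute() || uri.getHost() == null) {
            return url;
        }
        String path = uri.getPath() == null ? "" : uri.getPath();
        return uri.getHost() + path;
    }

    // Replaces one quoted occurrence of the absolute URL in the HTML text
    // with its cache-relative form.
    static String rewrite(String html, String absoluteUrl) {
        return html.replace("\"" + absoluteUrl + "\"",
                            "\"" + toCacheRelative(absoluteUrl) + "\"");
    }
}
```

Under these assumptions, `http://example.org/images/hello.png` becomes `example.org/images/hello.png`, while `hello.png` is left as it is.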


Complex pages may break

Any complex page that makes extensive use of JavaScript may break. There is no simple way to work out how the JavaScript would rewrite a page, and what resources it needs, without writing a compiler.


XSLT integration

When complete, we will provide a Xalan extension that determines whether content is in the cache and, if it is, puts a 'Cached' link next to the URL.

This will need isCached and getCachedLocation methods.
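
One possible shape for these two methods is sketched below. Only the method names isCached and getCachedLocation come from the text above; the signatures, the class name HtCacheLookupSketch, the register helper, and the in-memory map standing in for the HTCACHE table are all assumptions. Static methods are used because that is a shape a Java-based XSLT extension can typically call.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical lookup class; not the real extension.
class HtCacheLookupSketch {

    // Stand-in for the HTCACHE index: original URL -> cached location.
    private static final Map<String, String> INDEX = new ConcurrentHashMap<>();

    // Test/demo helper for populating the stand-in index.
    static void register(String url, String cachedLocation) {
        INDEX.put(url, cachedLocation);
    }

    // True when the URL has an entry in the index.
    public static boolean isCached(String url) {
        // ConcurrentHashMap rejects null keys, so guard explicitly.
        return url != null && INDEX.containsKey(url);
    }

    // Location of the cached copy, or an empty string when not cached.
    public static String getCachedLocation(String url) {
        if (url == null) {
            return "";
        }
        String location = INDEX.get(url);
        return location == null ? "" : location;
    }
}
```

Returning an empty string rather than null for a missing entry is a design choice of this sketch: it keeps the result easy to test from a stylesheet, which would call isCached first and emit the 'Cached' link only when it returns true.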


Garbage collection

Reptile nodes that are around for a long time may need to garbage collect content. Most content older than 90 days will be useless.

In some situations you may want to keep content with a high reputation around for longer. We need garbage collection policies based on reputation: if the reputation is good, keep the content in the cache longer; if it is bad, remove it sooner.
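
A reputation-based policy of this kind could be sketched as follows. The 90-day baseline comes from the text above; everything else is an assumption made for illustration, including the reputation scale (0.0 to 1.0), the linear scaling from half to double the baseline, and the names GcPolicySketch, retentionFor, and shouldCollect.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical policy; the scale and the factors are assumptions.
class GcPolicySketch {

    static final Duration BASELINE = Duration.ofDays(90);

    // Retention is half the baseline at reputation 0.0 and double the
    // baseline at reputation 1.0, linear in between. Out-of-range
    // reputations are clamped to [0.0, 1.0].
    static Duration retentionFor(double reputation) {
        double r = Math.max(0.0, Math.min(1.0, reputation));
        double factor = 0.5 + 1.5 * r;
        return Duration.ofDays(Math.round(BASELINE.toDays() * factor));
    }

    // True when the content has outlived its retention period.
    static boolean shouldCollect(Instant cachedAt, double reputation, Instant now) {
        return cachedAt.plus(retentionFor(reputation)).isBefore(now);
    }
}
```

With these numbers, content with the lowest reputation is collected after 45 days and content with the highest reputation after 180 days.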


Task system updates

We will probably need the concept of "run once" tasks. Specifically, once we have updated a piece of content with the HTCache, we don't need to update it again.
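
A "run once" task could be sketched as a thin wrapper around an ordinary Runnable. The class name RunOnceTaskSketch and the hasRun method are hypothetical, and how such a wrapper would plug into the existing task system is left open.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical wrapper; not part of the existing task system.
class RunOnceTaskSketch implements Runnable {

    private final Runnable delegate;
    private final AtomicBoolean started = new AtomicBoolean(false);

    RunOnceTaskSketch(Runnable delegate) {
        this.delegate = delegate;
    }

    @Override
    public void run() {
        // compareAndSet succeeds for exactly one caller, even when the
        // scheduler invokes run() from several threads. Note that a
        // delegate that throws is still counted as having run.
        if (started.compareAndSet(false, true)) {
            delegate.run();
        }
    }

    // True once the first call to run() has begun.
    boolean hasRun() {
        return started.get();
    }
}
```

A scheduler can keep calling run() on its normal cycle; only the first call does any work, and later calls return immediately.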



Copyright © 2001-2003, OpenPrivacy.org