Detecting an Image in 10 Bytes
Posted on 2003-11-23T01:17:15:00-08
Say you're given a resource URL which you believe to be an image, but you're
not quite sure. It would be a waste of bandwidth to download the entire URL
contents if you knew for a *fact* that you only needed the first N bytes to
detect whether it's an image or not.
It turns out you can detect all the major image types with only the first 10 bytes.
Bitmaps (.bmp files)
First two bytes are "BM" (0x42 0x4D)
JPEG (.jpg files)
First two bytes are 0xFF 0xD8, and bytes 7 through 10 are "JFIF" (US-ASCII) in JFIF files
GIF (.gif files)
First four bytes are "GIF8"
PNG (.png files)
First byte is 0x89, followed by "PNG" in bytes 2 through 4
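Putting those checks together, a minimal sketch in Java (the class and method names are made up) might look like this:

    import java.io.IOException;
    import java.io.InputStream;

    public class ImageSniffer {

        // Read at most the first 10 bytes and guess the image type from them.
        // Returns "gif", "png", "jpeg", "bmp", or null if nothing matched.
        public static String sniff(InputStream in) throws IOException {
            byte[] b = new byte[10];
            int n = 0;
            while (n < b.length) {
                int r = in.read(b, n, b.length - n);
                if (r < 0) break;
                n += r;
            }

            if (n >= 4 && b[0] == 'G' && b[1] == 'I' && b[2] == 'F' && b[3] == '8')
                return "gif";
            if (n >= 4 && (b[0] & 0xFF) == 0x89 && b[1] == 'P' && b[2] == 'N' && b[3] == 'G')
                return "png";
            if (n >= 10 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xD8
                    && b[6] == 'J' && b[7] == 'F' && b[8] == 'I' && b[9] == 'F')
                return "jpeg";
            if (n >= 2 && b[0] == 'B' && b[1] == 'M')
                return "bmp";
            return null;
        }
    }

In practice you'd grab those bytes with an HTTP Range request (Range: bytes=0-9) where the server supports it, so the rest of the file never crosses the wire.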
Detecting Width and Height
Lazyweb request:
How many bytes does it take to determine the width and height of the various
image formats? This seems like it should be in the header, but it would be nice
to be able to find this out without having to fetch the entire file.
FeedParser - an RSS Parser API for Java
Posted on 2003-11-10T19:44:24:00-08
I've been working on a new RSS API named FeedParser which will drive the backend
of NewsMonster 2.0. I've been thinking about this for a long time now
(Jetspeed, Reptile, NewsMonster, etc) and I think this API is *very* flexible
and provides a great deal of power to developers.
We are trying to Open Source the API and contribute the code to Apache and
could use some feedback. I've published the Javadoc here. If you want to
participate in the discussion about Open Sourcing this component, I've just
posted to general@jakarta.apache.org.
Design
The main API is very similar to JAXP, TRaX, and SAX, and is designed to be very
flexible. Having been a veteran of the RSS wars, a member of the RSS 1.0 working
group, and an Atom developer, I think this takes into consideration all the major
issues with RSS/Atom feed formats and integration with the Java language.
The core API allows developers to do whatever they want with RSS feeds: parse
all RSS and Atom versions, support all modules, and even serialize RSS feeds
while correctly handling encoding issues.
Functionality
Supports all RSS versions, including Atom. This is more difficult than it
sounds. All versions of RSS differ in some way or another, and with Atom
supporting a different content model, the complexity of building an RSS
application is only going to grow. This is accomplished using a generic
FeedParserListener which provides all the functionality available across all
RSS versions.
Supports FeedParserListener 'plugins' for all RSS modules within all RSS
specifications. Plugins are interfaces which provide support for RSS modules
via different "dispatch methods". For example, the FeedParser supports the RSS
1.0 mod_content extension with a ModContentFeedParserListener, where methods
such as onContentEncoded provide callbacks.
Only the lowest level of RSS support (channel and item metadata) is required
within applications. Additional modules and functionality are provided on an
implementation-specific basis with additional interfaces. For example, if you
want to write a "foo" module and then provide support for it within FeedParser,
you build a FooParserListener, update the parser, then implement
FooParserListener in your code. Any other code which doesn't care about your
"foo" module doesn't have to deal with this added complexity.
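To make the plugin idea concrete, here is a rough sketch of what such a listener module could look like. The method signatures are illustrative only, not the actual FeedParser API:

    // Illustrative sketch only -- not the actual FeedParser interfaces.
    // The core listener every application implements: channel and item metadata.
    interface FeedParserListener {
        void onChannel(String title, String link, String description);
        void onItem(String title, String link, String description);
    }

    // An optional module plugin, e.g. for the RSS 1.0 mod_content extension.
    // Only applications that care about the module implement it.
    interface ModContentFeedParserListener extends FeedParserListener {
        void onContentEncoded(String encodedContent);
    }

    // The parser dispatches to the extra method only when the listener supports it.
    class ModContentDispatchExample {
        static void dispatchContent(FeedParserListener listener, String encoded) {
            if (listener instanceof ModContentFeedParserListener) {
                ((ModContentFeedParserListener) listener).onContentEncoded(encoded);
            }
            // listeners that don't implement the module never see it
        }
    }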
Low memory consumption due to the SAX-like API model. Since we don't build an
in-memory object structure, GC time and memory consumption are reduced. This
also allows us to easily support our plugin model.
Unified support for Dates using Java date support. RSS has ISO-8601 and RFC
822 date strings and we support a unified java.util.Date abstraction.
RSS serialization support. Serializers for all RSS versions (1.0, 2.0, Atom,
etc.) with the same code. Serialization is performed as the opposite of parsing:
your export code creates a new OutputFeedParserListener (which is just a
FeedParserListener) and calls methods on the object directly, which then
serializes RSS to an OutputStream (much nicer than having to use JDOM or another
API directly). This has the added advantage of not having to deal with any
Unicode or other encoding issues. It should just work right out of the box.
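A rough sketch of that "serialization is the reverse of parsing" idea; the class and method names here are illustrative, not the real OutputFeedParserListener API:

    // Illustrative sketch only -- the real OutputFeedParserListener API may differ.
    import java.io.OutputStream;
    import java.io.PrintStream;

    class RssSerializerSketch {

        private final PrintStream out;

        RssSerializerSketch(OutputStream target) {
            this.out = new PrintStream(target);
        }

        // The export code calls these methods directly instead of building a DOM.
        public void onChannel(String title, String link) {
            out.println("<channel><title>" + title + "</title><link>" + link + "</link>");
        }

        public void onItem(String title, String link) {
            out.println("<item><title>" + title + "</title><link>" + link + "</link></item>");
        }

        public void onChannelEnd() {
            out.println("</channel>");
            out.flush();
        }
    }

A real serializer would also escape entities and pick the output encoding; the point is just that export code never has to touch an XML API directly.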
Direct driver -> listener integration without RSS parsing. This feature
allows your RSS integration code to work with custom backends that produce
RSS-style output. For example, if you have an RSS backend that expects RSS (as
an XML encoding), it doesn't make sense for developers to serialize to XML and
then parse it on the same box just for code reuse. FeedParser supports the
ability to call listener methods directly from an RSS producer, which allows
you to avoid a significant performance cost when working with RSS.
Ability to use different low-level XML parsing mechanisms (DOM, SAX, XPATH,
regexp, etc.) without having to rewrite any existing code. This has been very
important in the past and we've seen a lot of discussion of it in the RSS
community. For example, some people want to use regexp-based parsers since they
are more lenient about what they accept (an example would be slightly invalid
XML which would choke a regular parser). (The current parser uses XPATH.)
Alternatives
The Informa package (LGPL) is a DOM-style API. This is a problem for RSS
module support: since it's DOM-based, all new modules will require modification
of the existing DOM bindings (which might break existing applications), and the
code for each new module can't be kept separate.
It's also very RSS-specific, which would need to change for Atom support.
Status
Right now all versions of RSS are supported. The only FeedParser implementation
is a bit slow due to the fact that it uses XPATH. Also, the API is still in
flux and will change before we release.
My current plans are to add Atom support, support more RSS versions, spend more
time on the API, and push out an initial 0.5 release in a month.
The Worst Thing to The Semantic Web
Posted on 2003-11-14T15:29:28:00-08
Dionidium notes:
Clay Shirky's recent bashing of the Semantic Web has generated a lot of
comments, several from defenders of this planned utopia. Kevin A. Burton, one
such supporter, writes the following as a postscript to a rant on Clay's piece
For the record, I would be the first to criticize the SW on a number of issues;
my primary gripe is that Shirky failed to identify any of them.
I was somewhat forced into a devil's advocate position since Shirky's argument
against the SW was so poor.
Does the SW offer some real innovation? Hell yeah. Does it have problems?
Definitely! (But I think we can make forward progress.)
Using TinyURL, a URL-shortening service primarily useful for creating manageable
pointers to replace long URLs in e-mail messages, to reduce the visibility of
Clay's article ignores the not-so-startling truth: weblog indexes help us find
popular articles, not popular ideas. Daypop, for example, is worthwhile because
it shows us not only what the Web likes, but often, what it dislikes.
Kevin's disservice is not only to the articles he non-links, but also himself,
since he's rendering his own site unreachable via services like Technorati,
which link back to comments on popular articles.
It's unavoidable: articles important enough for comments are important enough
for links. Hiding your link doesn't make a site any less popular; it only makes
it harder for the rest of us to know it.
Actually, the truth is that a LOT of services (Technorati, Google, etc.) already
internalize URLs with 302 redirection (if they don't, they probably should), so
my little hack probably wouldn't work.
That said, I like the use of Zero Cost Hyperlinks a lot more.
Stay tuned...
Rich SciFi Syndication (RSS)
Posted on 2003-11-10T19:52:50:00-08
SCI FI Wire has added an RSS (Rich Site Summary or Really Simple Syndication)
feed for readers who want another way to access the news and interviews on the
site. An RSS newsreader will let you read all your favorite online journals or
Weblogs (aka blogs) in one place, instead of having to go to each one
individually.
Cool!
http://scifi.com/scifiwire/art-main.html?2003-11/10/11.00.sfc
Semantic Web of Lies
Posted on 2003-11-09T09:12:16:00-08
http://tinyurl.com/uakb
The W3C's Semantic Web project has been described in many ways over the last few
years: an extension of the current web in which information is given
well-defined meaning, a place where machines can analyze all the data on the
Web, even a Web in which machine reasoning will be ubiquitous and devastatingly
powerful. The problem with descriptions this general, however, is that they
don't answer the obvious question: What is the Semantic Web good for?
The simple answer is this: The Semantic Web is a machine for creating
syllogisms. A syllogism is a form of logic, first described by Aristotle, where
"...certain things being stated, something other than what is stated follows of
necessity from their being so." [Organon]
Shirky makes a few major mistakes in this critique of the Semantic Web which,
for the most part, make the paper irrelevant.
First is the assumption that RDF requires the use of the syllogism for reasoning.
While some researchers are in fact using RDF assertions to build reasoning
agents, there is no requirement for a reasoning backend when deploying RDF-based
systems. In fact, one of the amazing features of RDF is that *any* backend can
be used, which provides researchers with a great deal of flexibility.
Practical and deployed agent systems have to this point failed because the
'database' backend wasn't there to make them practical. A number of compelling
papers have been written that design agents, but a lack of real-world data
has to this point made them irrelevant.
If you want to solve this problem you have to enter the field of Information
Retrieval (IR). At this point you are working within three areas of research:
P2P and distributed systems design, agent systems/AI/reasoning, *and* IR.
The semantic web allows developers to focus on one area without having to resort
to IR techniques.
Shirky also makes the reputation mistake with one of his syllogisms:
Consider the following statements:
- The creator of shirky.com lives in Brooklyn
- People who live in Brooklyn speak with a Brooklyn accent
You could conclude from this pair of assertions that the creator of shirky.com
pronounces it "shoiky.com." This, unlike assertions about my physical location,
is false. It would be easy to shrug this error off as Garbage In, Garbage Out,
but it isn't so simple. The creator of shirky.com does live in Brooklyn, and
some people who live in Brooklyn do speak with a Brooklyn accent, just not all
of them (us).
OK. How about this statement (which is NOT true):
Clay Shirky is the creator of peerfear.org
From which we could then conclude that
"The creator of shirky.com pronounces it as 'shoiky.com.' and has also
created 'poirfear.org'"
Which... last time I checked isn't true.
This is a major problem. How can this be solved?
Reputation systems to the rescue. Any semantic web reasoning system which
doesn't include a reputation system is useless. If the assertion "People who
live in Brooklyn speak with a Brooklyn accent" can't have a reduced priority
within the system, we would fall victim to a great deal of hostile peers (spam).
Another point is that in the above scenario the use of 'accent' should be known
not to imply totality. Humans are smart enough to realize that generalities
about a population don't apply to the *entire* population. There is no reason
an RDF reasoning system should have the same fault.
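As a toy illustration of what "reduced priority" could mean in practice (the class and the scores below are made up, not anything the SW specs define): weight each assertion by the reputation of its source and drop anything below a threshold before reasoning over it.

    import java.util.HashMap;
    import java.util.Map;

    // Toy sketch: weight assertions by the reputation of their source and ignore
    // anything below a threshold instead of treating every triple as a hard fact.
    class ReputationFilterSketch {

        private final Map<String, Double> sourceReputation = new HashMap<String, Double>();

        void setReputation(String source, double score) {
            sourceReputation.put(source, score);
        }

        boolean accept(String source, double minimumScore) {
            Double score = sourceReputation.get(source);
            return score != null && score >= minimumScore;
        }

        public static void main(String[] args) {
            ReputationFilterSketch filter = new ReputationFilterSketch();
            filter.setReputation("http://shirky.com", 0.9);
            filter.setReputation("http://hostile-peer.example.com", 0.1);

            // A sweeping generalization asserted by a low-reputation source gets
            // a reduced priority (here: simply dropped) before reasoning happens.
            System.out.println(filter.accept("http://hostile-peer.example.com", 0.5)); // false
            System.out.println(filter.accept("http://shirky.com", 0.5));               // true
        }
    }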
Here Rothenberg follows the script to a tee, labeling RSS autodiscovery
'simplistic' without entertaining the idea that simplicity may be a
requirement of rapid and broad diffusion. The real lesson of RSS autodiscovery
is that developers can create valuable meta-data without needing any of the
trappings of the Semantic Web. Were the whole effort to be shelved tomorrow,
successes like RSS autodiscovery would not be affected in the slightest.
Which is the main advantage of RSS/Atom. Easy and simple metadata. Heck, even
RDF isn't that hard. The real advantage of RDF is the RDF model. Critics
often think *way* too much about the RDF syntax and reasoning model (both of
which can be changed) without realizing that the RDF magic is in the RDF model.
Here's another random question. What technology does Clay Shirky like? He
doesn't like metadata, doesn't like RDF, doesn't like micropayments, etc.
PS. I don't think Shirky is telling a "Web of Lies" here (as in the title). I
just thought it would be a cool name for this blog entry.
PPS. I used tinyurl.com for the URL to Shirky's article so it wouldn't be
included in PageRank, Daypop, etc. I'm going to start doing this for articles I
find suboptimal. Consider it a negative cert (or lack of approval).
RSS Is Not The Solution To Spam
Posted on 2003-09-02T10:44:30:00-07
With scam artists, spammers and virus writers all using the e-mail inbox as the
main target, it has become a daily nightmare for legitimate online publishers
and marketers to cope with mail filters, blacklists and irate subscribers.
Enter RSS, the XML syndication format that allows publishers to shuttle
content to news aggregators, avoiding the e-mail chaos altogether.
"E-mail is dead, period," declares Chris Pirillo, the Internet entrepreneur who
distributes about 400,000 e-mail newsletters weekly. "I don't care what kind of
legislation goes through, people aren't signing up for newsletters
anymore. People are assuming that every e-mail publisher is a spammer."
http://www.internetnews.com/dev-news/article.php/3070851
RSS is not the solution to the spam problem. The solution to the spam problem
is a distributed trust metric. The major problem here is that this would
require a lot of overhaul to the existing email infrastructure.
I personally think that what will surface is the use of whitelists and RSS for
newsletters like what Chris Pirillo wants to do. Email will evolve into a
role for only personal one-to-one communication, with RSS handling the newsletter
space.
The whitelist will eventually be replaced with a distributed trust metric...
Atom Dinner in SF
Posted on 2003-08-01T00:40:27:00-07
Tonight I went to the Atom dinner in San Francisco - man that was fun!
The whole gang was there!
I asked Mark Pilgrim if the RSS validator supported robots.txt. He said "bite
me" (typical buddhist response I guess). Then I pointed out that I was just
joking and he seemed amused... all in good fun!
I was shocked by how many people flew out at the last minute. Greg Reinacker
was there and so was Joe Gregorio (both from the east coast). I guess Atom
is important!
Joi posted some pictures - really nice guy btw. When I first read his blog
I figured he was just another VC but then I noticed that he was hacking python
IRC bots - most VCs can't even spell IRC.
External New Topics
Posted on 2003-07-29T22:18:26:00-07
I have been giving a lot of thought recently to using blogs to replace email.
We are not that far off if you think about it. All we really need is a deployed
CommentAPI and an RDF/RSS export and discovery of comments around
permalinks.
The one remaining issue that I can see is the ability to start a new topic
within a blog. For example what if I wanted to ask Steve a question about
blogger but keep it public? There really isn't a decent way to do this.
Referrer URIs
One mechanism is just to link to the user's blog and hope they pay attention to
their referrers. One problem is that if a visitor to the page doesn't click on
the link then it won't show up in the user's referrer logs.
This can be solved by doing a HEAD request on the URL and specifying a referrer
yourself. I'm not sure how this would be handled by most log analysis software:
HEAD minimizes bandwidth, but log analysis software might not report it.
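A minimal sketch of the HEAD-with-referrer trick using java.net.HttpURLConnection; both URLs below are placeholders:

    import java.net.HttpURLConnection;
    import java.net.URL;

    // Minimal sketch: issue a HEAD request against the other blog's permalink and
    // set the Referer header (note the HTTP spelling) to point back at the new
    // topic on our own blog. Both URLs here are placeholders.
    public class HeadPing {
        public static void main(String[] args) throws Exception {
            URL target = new URL("http://example.com/blog/some-permalink");
            HttpURLConnection conn = (HttpURLConnection) target.openConnection();
            conn.setRequestMethod("HEAD");
            conn.setRequestProperty("Referer", "http://www.peerfear.org/blog/my-new-topic");
            System.out.println("HEAD returned: " + conn.getResponseCode());
            conn.disconnect();
        }
    }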
Search Cosmos
Systems like Technorati and Feedster support this type of mechanism but are
non-ideal since they are centralized (too much control in the hands of one
company).
Broken Comments
Posted on 2003-07-26T18:14:05:00-07
It appears that the comments system has been broken on this site for a while and
no one complained. Brad was nice enough to notice this and let me know. The
problem was that a comment would fail to post while silently appearing to
succeed. Very annoying.
Badges on Weblogs and Sidebar Syndication
Posted on 2003-07-25T18:31:43:00-07
Neil brings up something I have been thinking about lately:
If you look at a number of weblogs, you'll notice that many have a wide variety
of badges (or buttons), indicating various things they support. Such badges
include RSS, FOAF, Valid HTML, weblog tools, Creative Commons links, and so
on. It's like being a boy scout. Got some RSS? Get an RSS badge. Passed the CSS
validator? Get a Valid CSS badge. Then, people proudly display all their badges
on their web site.
He then goes on to point out that we should be syndicating these badges in a
"file" instead of presenting them on the user's website.
I think he is onto something but doesn't take it to its inevitable conclusion.
Badges and sidebars (bookmarks, blogrolls, etc) should be syndicated. This
could allow aggregators such as NewsMonster to integrate the badges into the
presentation and provide enhanced options to the user.
For example, Joi Ito's blog has sidebar options to subscribe via email, view
photo albums, browse monthly archives, etc. It would be cool to expose this
functionality to aggregators.
Can you read this?
Posted on 2003-07-24T15:44:17:00-07
Someone on Java Blogs obviously had a problem with their RSS generation (I won't
name names). They kept increasing the size of displayed fonts until single
characters were actually larger than my browser window.
Ouch...
URI Scheme as User Interface and Non-Protocol URI Schemes
Posted on 2003-07-24T12:37:42:00-07
The other day in Aggregator Subscription Mechanisms I talked about using a
feed:// or aggregator:// URI scheme to trigger manual subscription for RSS/Atom
feeds.
Over the last few days I have given it some thought and reviewed the IETF and
W3C documentation on the subject, and I believe that this URI scheme would be
RFC 2718 compliant.
There is prior art on the subject with the view-source, mailto, javascript, and
aim URI schemes.
The view-source scheme is probably the closest cousin for a feed URI scheme. In
both cases we have a URI to an HTTP resource contained within a parent URI which
has an associated action. In the case of the view-source scheme we are telling
our browser to load up the source code for the given HTTP URL in a source code
viewer of some sort. In the case of the feed scheme we would tell it to load
the HTTP URL in our aggregator.
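A sketch of the client-side handling: a feed URI just wraps an HTTP URL, and the handler's only job is to recover that URL and hand it to the aggregator. The exact syntax (feed://host/path vs. feed:http://host/path) is an open question, so this accepts both forms:

    // Sketch of client-side handling: a feed URI wraps an HTTP URL, and the
    // handler's only job is to recover that URL and hand it to the aggregator.
    public class FeedUriHandler {

        static String toHttpUrl(String feedUri) {
            if (feedUri.startsWith("feed:http://")) {
                return feedUri.substring("feed:".length());
            }
            if (feedUri.startsWith("feed://")) {
                return "http://" + feedUri.substring("feed://".length());
            }
            throw new IllegalArgumentException("Not a feed URI: " + feedUri);
        }

        public static void main(String[] args) {
            // Both forms resolve to the same subscription target.
            System.out.println(toHttpUrl("feed://www.peerfear.org/rss/index.rss"));
            System.out.println(toHttpUrl("feed:http://www.peerfear.org/rss/index.rss"));
        }
    }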
Section 2.3 (Demonstrated Utility) is probably the most apropos section:
URL schemes should have demonstrated utility. New URL schemes are expensive
things to support. Often they require special code in browsers, proxies,
and/or servers. Having a lot of ways to say the same thing needless
complicates these programs without adding value to the Internet.
In the case of our feed scheme this is clearly demonstrated. There are
thousands of "XML" links on websites right now. People clearly want to
expose their feed URLs.
Another important note: HTML supports 'type' attributes for anchors.
This would allow you to specify the media type within your HTML.
<a href="http://www.peerfear.org/rss/index.rss" type="application/rss+xml">RSS</a>
There are some drawbacks here. If you had the ability to discover the media
type within the URL, we wouldn't have to rely on content producers to set up
their web servers correctly.
Another downside is that I doubt this convention would take off.
Distributed Checking is a Bad Thing
Posted on 2003-07-19T13:43:15:00-07
Shrook now has a feature called "Distributed Checking" which allows each
version of Shrook to update a feed within a shorter interval (sometimes 5
minutes). This allows each Shrook instance to stay up to date without pounding
the Internet connection of the feed host.
This is certainly an honorable goal. Delivering news to users quicker is
something we all want.
To oversimplify: A central server maintains a database of when each channel
was last updated. To keep it up to date, every so often, the server chooses a
computer to check for new items and report back. The frequency of this varies
from every 5 minutes for popular channels, to every half hour for channels
with only one online subscriber, and it tries to use a different computer each
time. At the other end, each copy of Shrook checks in with the server every 5
minutes, and if any of its channels are out of date, it reloads them.
In short, this is a bad idea and I would like to explain why. (Note that since
Shrook doesn't provide source (like NewsMonster) I had to conduct this analysis
from the FAQ.)
First off, this isn't distributed checking, this is centralized checking. The
fact that Shrook is using peers to perform the update check is irrelevant. The
server could easily perform the update check, and in fact this would be an
optimization. Bandwidth for clients would be reduced, and issues like NAT and
proxy caches (which might return stale content) would not affect the update
check.
The major downside is that for popular feeds this will literally become a
Distributed Denial of Service attack once a significant number of peers has been
deployed with distributed checking. Imagine you run a blog like Boing Boing and
you have thousands of RSS readers hitting your site every hour. Now imagine you
update your feed and, within a five-minute interval, thousands of machines
decide to download the same file. This would be a very bad thing: your link
would be saturated with RSS feed downloads.
In the past we have had natural load balancing of feed downloads thanks to the
randomness with which aggregators are started and synchronized. With distributed
checking this load balancing is destroyed.
This might not be the case. The Shrook distributed checking FAQ says "... each
copy of Shrook checks in with the server every 5 minutes, and if any of its
channels are out of date, it reloads them."
It doesn't specify where or how it reloads them. I highly doubt that it
fetches the content from the centralized server as this would probably yield a
hefty bandwidth invoice for the author.
Shrook also doesn't specify whether it supports gzip content encoding. If it
doesn't support this feature (and any RSS reader worth its salt should
advertise this on its webpage, which Shrook does not), then this is a highly
irresponsible change.
Not only would this essentially yield a "DDoS attack" on popular feeds, but it
would be downloading unnecessary content that could have been compressed.
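For contrast, here is a sketch of what a single polite fetch looks like: advertise gzip support and use a conditional GET so that an unchanged feed costs a 304 response and no body (the feed URL is a placeholder):

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.zip.GZIPInputStream;

    // Sketch of a polite feed fetch: advertise gzip support and use a conditional
    // GET (If-Modified-Since) so that an unchanged feed costs a 304 and no body.
    public class PoliteFetch {

        // Timestamp of the last successful fetch; a real aggregator persists this.
        static long lastFetchTime = 0L;

        public static void main(String[] args) throws Exception {
            URL feed = new URL("http://example.com/index.rss");   // placeholder URL
            HttpURLConnection conn = (HttpURLConnection) feed.openConnection();
            conn.setRequestProperty("Accept-Encoding", "gzip");
            conn.setIfModifiedSince(lastFetchTime);

            if (conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) {
                return; // nothing new, nothing downloaded
            }

            InputStream in = conn.getInputStream();
            if ("gzip".equalsIgnoreCase(conn.getContentEncoding())) {
                in = new GZIPInputStream(in);
            }
            // ... parse the feed from 'in' ...
            in.close();
            lastFetchTime = conn.getLastModified();
        }
    }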
Reptile receives $1000 grant from LinuxFund
Posted on Wed Jul 17 2002 09:52 PM
"Projects funded for the cycle ending June 1 were: OpenBIOS, Reptile and
AnonNet. Thanks to all the LinuxFund.org applicants and we look forward to
seeing more ideas from from everyone."
This is really great! LinuxFund has given the Reptile project a $1000
development grant.
I am very excited. Right away I am going to use this money to upgrade the
memory on my development machine to 768M (I have been wanting to do this for a
while). I have plans for the rest of the money but still want to think about
the best way to allocate the funds.
Anyway... I took a picture of the check.
Ideas - Remote RSS Access
Posted on Tue Jul 16 2002 02:33 PM
Remote RSS Access from Reptile...
This is the ability to give Reptile an RSS channel:
http://kerneltrap.org/module.php?mod=node&op=feed
And pull out the RSS from this channel.
There are a number of reasons one would like to fetch this content from Reptile
and not the original site.
We can incorporate the history of the RSS feed so that we are not limited to
the items currently within the RSS document. We can use Reptile's database to
pull out 10, 15, 20, etc. items.
We can include additional Reptile-stored metainfo such as the date an item
was discovered.
Reptile can attempt to automatically discover (and include within the served
RSS) <description> elements for the RSS content even if they are not present in
the original feed.
The only problem is that I didn't want to use an 'ugly' URL to serve the
content. I also didn't want to use POST because one can't bookmark these URL
types.
The solution is just to append the URL onto the end of a Servlet and use
PATH_INFO (via getPathInfo) to pull out the full URL.
So instead of
http://kerneltrap.org/module.php?mod=node&op=feed
One would request:
http://reptile.peerfear.org/reptile/servlet/reptile/http/kerneltrap.org/module.php?mod=node&op=feed
Which I think is much more elegant than the alternative:
http://reptile.peerfear.org/reptile/servlet/reptile?URL=ENCODED_GARBAGE
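A minimal sketch of such a servlet (the class name and the way the feed URL is reconstructed are illustrative):

    import java.io.IOException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // Illustrative sketch: rebuild the original feed URL from PATH_INFO plus the
    // query string, so /servlet/reptile/http/kerneltrap.org/module.php?mod=node&op=feed
    // becomes http://kerneltrap.org/module.php?mod=node&op=feed
    public class RemoteRssServletSketch extends HttpServlet {

        protected void doGet(HttpServletRequest request, HttpServletResponse response)
                throws IOException {

            String pathInfo = request.getPathInfo();   // "/http/kerneltrap.org/module.php"
            String query = request.getQueryString();   // "mod=node&op=feed" (or null)

            String feedUrl = pathInfo.replaceFirst("^/http/", "http://");
            if (query != null) {
                feedUrl += "?" + query;
            }

            // ... look the feed up in Reptile's database and serve the stored RSS ...
            response.setContentType("text/plain");
            response.getWriter().println("Would serve RSS for: " + feedUrl);
        }
    }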
New Reptile website
Posted on Tue Jul 09 2002 11:47 AM
I just reworked the Reptile website so that all the news items I enter on
http://www.peerfear.org are mirrored to http://reptile.openprivacy.org. This
should make it MUCH easier to maintain news between the two sites. (I love
RSS!)
IP banned on Slashdot
Posted on Tue Jul 09 2002 09:05 AM
It looks like I deployed Reptile on peerfear with a 5-minute retry time. It was
hitting slashdot.org every 5 minutes and they IP-banned me (67.112.30.210).
This is really stupid. Slashdot kills a server every 5th time they post a link,
and they get mad because I request a 5k file every 5 minutes. Get a grip, guys!
Anyway... this is partly my fault. I need to make sure that Reptile stable is
deployed with a 60 minute feed refresh time.
New startup infrastructure for Reptile
Posted on Sat Jul 06 2002 12:04 AM
It seems that all our portability issues with Reptile are related to JVM startup.
Issues such as JAVA_HOME, CATALINA_HOME, and bootstrap.jar settings, valid
configuration files, correct parameters passed to Tomcat, etc., all lead to
problems running on different operating systems.
At one point in time Reptile was started via Ant. This was very portable but
was kind of buggy (long story) and required about 3 additional threads and 10M
more of system memory.
We migrated away from this system due to the fact that we wanted Reptile to run
on thinner client machines (around 64M or so). The solution was to reuse the
catalina.sh|bat scripts provided by Tomcat.
This still leaves us with directory and jar issues.
I have now decided to refactor the startup mechanism one more time (hopefully
the last) to yield a higher level of portability.
Reptile will now be started with an org.openprivacy.reptile.Startup class and
shut down with an org.openprivacy.reptile.Shutdown class, with the
reptile-startup.sh (etc.) scripts staying the same.
I think this should provide us with an almost zero-configuration setup
mechanism for Reptile: just download the distribution, uncompress it, and click
on reptile-startup.sh|bat.
SUPPLEMENTAL: // created on Sat Jul 06 2002 03:33 PM
Another nice aspect of this is that we can run services without having them run
within Tomcat. We can also load services without running within the Tomcat
classloader (so they are persistent).
Reptile running on peerfear (early access)
Posted on Tue Jul 02 2002 02:43 PM
I just set up Reptile running under peerfear. This should be running for
a while on port 8050, and when it is stable I will probably try to set it up
running under reptile.peerfear.org.
I still have a few things I need to fix before a 0.6.0 release:
Get JXTA support fully working. I am really close to getting this done but I am
having some problems with discovery.
Export the SOAP service via a Servlet so that developers can use Reptile as a
web service.
Point Reptile to MySQL so that we have better performance and reliability (we
are still using Hypersonic).
Improve the performance of RSS feed updates.
Improve the XSLT so that it is more of a search-style website instead of a P2P
native client UI.
Reptile web service with HTML/SOAP/RSS interfaces.
Posted on Mon Jun 24 2002 03:46 PM
I am pondering setting Reptile up as a permanent service running under peerfear.
It will have an HTML interface for running searches with a web browser, a
SOAP interface for running queries from native code, and RSS interfaces for
running new queries and getting the latest news.
Most of the code base for this is already complete. I just need to deploy it on
the new machine and set up MySQL as the database.
Any thoughts on what the name should be? I was thinking about putting it under
reptile.peerfear.org but this doesn't seem like a good idea as I would like to
put the main website there at some future date. Maybe aggregator.peerfear.org?
Maybe some other name?
SUPPLEMENTAL: // created on Mon Jun 24 2002 04:33 PM
Maybe it should just be home.peerfear.org or reptile-home.peerfear.org?