The V6 Web Engine
Release 1
Bernard Lang
François Rouaix
INRIA Rocquencourt
June 1996
Overview
V6 is to the Web what pipes are to Unix systems: a compositional device
for combining document-processing steps. To be easily
integrated in the Web architecture, V6 is available as a personal proxy.
Relying on a common skeleton architecture and Web related libraries, V6
can be easily configured to support various sets of filters while remaining
portable and browser independent. The filters may act on the requests
emitted by the browser (or other web client) or on the document returned by
a server, or both.
In the current release, the available filters include:
- flexible caching
- request redirection
- HTML filtering (based on NoShit)
- global history
- on-the-fly full text indexing
V6 can be used to support many other navigation aids and Web-related
tools in a uniform, browser independent way. In addition, V6 can also be
used as a traditional http server: this is particularly useful to serve
private files without needing access to the site-wide http server, or to
interface to local, private applications (mail, ...) through the CGI
interface.
Chapter 1 Motivations and rationale
The design of V6 is explained in the position paper presented at the
workshop Programming the Web - in search for APIs, which took place
during the 5th World Wide Web conference (May 1996). The paper is available
on-line at
http://pauillac.inria.fr/~lang/Papers/v6/
Chapter 2 Implementation design
2.1 Overview
Given the organising principles set out in the previous chapter, we postulated
the following requirements for an implementation of a web engine:
- client independence: the engine should not depend on vendor-specific
browser extensions, be it HTML or HTTP.
- modularity: the engine should be highly modular, so that servers,
service components, and filters can be separately implemented and added to an
engine (possibly already running).
- dynamic configuration: the engine components should contain as little
hard-coded information as possible (e.g. compile-time decisions). Components
should be configurable through generic HTML/HTTP communication with the
client: in particular, this means that there should not be any GUI-based
component configuration except through a Web browser.
Even if a component has persistent file-based configuration, this
information should be editable by a client browser.
- performance: the performance of the engine is not critical. The only
requirement is to avoid, as much as possible, locking the whole engine
while responding to a request.
- language-independent components: the engine should support, whenever
feasible, components written in arbitrary languages.
Our implementation choices were the following:
- concurrency: the engine exists as a single process, using threads.
- dynamic linking: the various components are dynamically loaded into
the engine, which provides only the skeleton architecture.
- (almost) nothing hard-coded: the engine by itself is ``empty'', in
the sense that it implements only the library functions.
The V6 engine is written in Objective Caml, which provides concurrency through the
threads library, as well as safe dynamic linking. Some components may
be given as arbitrary binaries, either with CGI-type interface, or with
classical stdio (pipe combinable Unix programs). However, as of version
1, some components still have to be written in Objective Caml, mainly for
performance reasons.
2.2 Architecture of the engine
The design of V6 essentially follows the figure in
chapter 1. The only important difference is the notion of
scheduler and the associated job queue: instead of having a
thread created for each incoming request, dealing with everything from request
reading and filtering to serving, there are two separate worlds in
V6. The first
world contains servers, receiving requests from clients, and responding to
them from an abstract feed object; the second world contains
services, which produce feeds. The way the feeds are created depends
on the nature of the request, be it a proxy request or a request for the
local services of V6. The scheduler and the jobs act as an
isolator between the different components of the proxy, and should simplify
extensions of V6 to support other protocols, notably the future HTTP-NG.
When a server component receives a request from a client (usually a Web
browser), it obtains a job structure from the scheduler, and queues
this job. The job contains the future of the document. The server waits
for this future (a feed) to be available, and then responds to the
client by sending the data obtained from the feed.
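The interplay between servers, the scheduler, jobs and feeds can be sketched
as follows. This is only an illustration in Python, not the Objective Caml
code of V6: the names Scheduler, submit and demo_service are hypothetical,
the feed is reduced to a byte string, and a standard Future stands in for
"the future of the document".

```python
import queue
import threading
from concurrent.futures import Future


class Scheduler:
    """Hands out jobs and runs them on a fixed pool of engine threads."""

    def __init__(self, service, engines=2):
        self.service = service      # callable: request -> feed (bytes here)
        self.jobs = queue.Queue()   # the job queue
        for _ in range(engines):
            threading.Thread(target=self._engine, daemon=True).start()

    def submit(self, request):
        """Called by a server: queue a job, return the future of its feed."""
        feed = Future()
        self.jobs.put((request, feed))
        return feed

    def _engine(self):
        # Each engine thread repeatedly takes a job and produces its feed.
        while True:
            request, feed = self.jobs.get()
            try:
                feed.set_result(self.service(request))
            except Exception as exc:
                feed.set_exception(exc)


def demo_service(request):
    # Hypothetical stand-in for a proxy or local service component.
    return ("document for " + request).encode()


scheduler = Scheduler(demo_service)
feed = scheduler.submit("GET /v6")
body = feed.result(timeout=5)  # the server waits until the feed is available
```

The server side never calls the service directly: it only sees the job it
queued and the feed that eventually comes back, which is the isolation
described above.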
Filters
V6 offers a general mechanism for writing and combining filters
that act either on an incoming request, or on responses and document bodies,
or on both. The filters can be written either as Caml filters,
or as external programs (for filters acting on document bodies).
Several examples of filters are given in the current V6 distribution,
and are described in chapter 3:
- proxy authentication
- cache
- HTML filtering
- GIF deinterlacing
- global history
- on-the-fly full-text indexing of HTML and plain text documents.
A server component has an associated set of filters. Each incoming request
is passed through the set of request filters. Each of these filters
rewrites the HTTP request message, and returns an optional return
filter. The return filter is itself decomposed: the filter first
rewrites the HTTP response message, and returns an optional filter to
be applied to the document body.
This decomposition, although apparently complex, has several advantages:
- a request filter can simply rewrite the request, without
associating a return filter, or it can decide, by looking at the
request, whether a return filter is required (e.g. based on the HTTP
method, the URI of the request, etc.).
- the return filter can decide to insert a filter on the document
body by examining the response (e.g. HTTP return code, content type, ...).
The combination of filters is completely transparent to each filter. Filters
acting on document bodies obey the traditional pipe model of Un*x
operating systems: in the case of external program filters, they receive the
document body on the standard input, and must return the filtered body on
the standard output. In the case of Caml filters, the filter function is
run on a separate thread, and receives as argument a stdio-like
record of functions (one for reading and one for writing).
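The three-stage decomposition (request filter, optional return filter,
optional body filter) can be illustrated with a small Python sketch. It is
not the V6 API: run_filters, tag_request, upper_html and fake_fetch are
hypothetical names, requests and responses are plain dictionaries, and body
filters work on whole byte strings instead of a read/write record.

```python
def run_filters(request_filters, request, fetch):
    """Apply request filters in order, then return filters in reverse order."""
    return_filters = []
    for request_filter in request_filters:
        request, return_filter = request_filter(request)
        if return_filter is not None:
            return_filters.append(return_filter)

    response, body = fetch(request)

    body_filters = []
    for return_filter in reversed(return_filters):
        response, body_filter = return_filter(response)
        if body_filter is not None:
            body_filters.append(body_filter)

    for body_filter in body_filters:
        body = body_filter(body)
    return response, body


def tag_request(request):
    # Rewrites the request; asks for a return filter only for GET requests.
    headers = dict(request["headers"], Via="v6")
    request = dict(request, headers=headers)
    if request["method"] != "GET":
        return request, None
    return request, upper_html


def upper_html(response):
    # Inserts a body filter only for successful HTML responses.
    if response["status"] == 200 and response["content_type"] == "text/html":
        return response, lambda body: body.upper()
    return response, None


def fake_fetch(request):
    # Hypothetical stand-in for the proxy or service that produces the feed.
    return {"status": 200, "content_type": "text/html"}, b"<p>hello</p>"


get = {"method": "GET", "uri": "http://example.org/", "headers": {}}
response, body = run_filters([tag_request], get, fake_fetch)
```

Each stage only sees its own input (request, response, or body), so a filter
never needs to know which other filters are installed.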
2.2.2 Proxy components
Since the main role of V6 is to act as a proxy, there are
naturally proxy components. A proxy component handles requests that
are not addressed to local services (such as files or CGIs). Currently,
V6 only implements HTTP 1.0 proxying (that is, handling requests for URIs
of scheme http:) and proxy proxying (transmitting requests of other
protocol schemes, such as ftp: or gopher:, to a further proxy).
2.2.3 Service components
A service component handles URIs that are addressed to the V6
engine itself. Service components are registered for some path prefix.
When an incoming request is identified as local, V6 chooses the
component that registered the longest path prefix matching the request, and
calls it with the request.
In this architecture, one can have all requests starting with /foo
served by one component, except /foo/bar which is served by another
component.
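A minimal Python sketch of longest-prefix dispatch, assuming that prefixes
are compared as plain strings (the registry dictionary and the
choose_component function are hypothetical, not part of V6):

```python
def choose_component(registry, path):
    """Return the component registered under the longest prefix of path."""
    matching = [prefix for prefix in registry if path.startswith(prefix)]
    if not matching:
        return None
    return registry[max(matching, key=len)]


registry = {"/foo": "foo component", "/foo/bar": "bar component"}
chosen = choose_component(registry, "/foo/bar/index.html")
```

With this registry, /foo/baz goes to the first component while /foo/bar and
everything below it go to the second.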
There are several levels of utilization of V6: the first level is the
installation and configuration of the standard V6 distribution; it is
described in this chapter. The second level is the conception and
implementation of new components; it is described in
chapter 4. The third level is the addition of more library
functions into the V6 core, for use by new components; it is not yet
documented.
3.1 Installation and configuration
Assuming that V6 has been installed on your system (if not, check the
INSTALL file in the distribution), you have to install some V6
configuration files in your $HOME directory. Check the USER
INSTALLATION section in the INSTALL file.
Then, you have to choose the components that will be loaded by your own
instance of V6, and configure each of them.
Here are specific choices that depend on the kind of system you are going to
run V6 on:
- multi-user machine, connected to the outside world: in this
configuration, the main points are the choice of
the server ports (select a port that nobody else will use, say your uid)
and protection. Since the machine is routed, you must take care that all
accesses to V6 are subject to user identification: be sure that pauth.cmo
is in modules.conf, and that all filter sets defined in filters.conf contain
"Proxy authentication". Then check pauth.conf, and edit it to add a
user/password pair (use the utility v6pass to get the encrypted password).
- workstation dedicated to a single user, inside a firewall: in this
configuration, V6 can be used only as a primary proxy. Do not
include http_proxy.cmo in modules.conf. Instead, use proxies.cmo, and edit
proxy.conf so that it contains the proper
host names and port numbers of the regular proxy that your employer is
bound to have installed.
3.1.1 Loaded modules (modules.conf)
The modules.conf file contains the list of modules that will be loaded
by V6 during startup. These modules should reside in $HOME/.v6/modules. The
list of available modules is described in the next
section.
The order in which components are specified in modules.conf is the
order of loading. It is irrelevant except for the servers.cmo
module, which should always be placed last.
3.1.2 Filter sets (filters.conf)
The configuration of filters in V6 is decomposed in two steps.
The first step is the definition of filter sets, in the file filters.conf.
Each filter set is given by a name and a list of regular
expressions. The set is computed by selecting all registered filters whose
names match the given regular expressions.
The order in which the regexps are given is also the order in which the
filters will be applied to an incoming request (and, consequently, the
reverse of the order in which response filters are applied).
The second step is the configuration of each server component (currently
only the http server) to use some filter set.
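The selection rule can be sketched in Python as follows. This is an
illustration only: the actual filters.conf syntax is not shown here, the
patterns use Python's re syntax rather than the regular expressions V6
itself uses, and matching anywhere in the name (re.search) is an assumption.
build_filter_set is a hypothetical helper.

```python
import re


def build_filter_set(patterns, registered):
    """Select registered filter names, ordered by the patterns given."""
    selected = []
    for pattern in patterns:
        for name in registered:
            if re.search(pattern, name) and name not in selected:
                selected.append(name)
    return selected


# Filter names taken from the components described in this manual.
registered = [
    "Cache find",
    "Proxy authentication",
    "Redirect",
    "Global history (NDBM)",
]
request_order = build_filter_set(["^Proxy", "^Redirect", "^Cache"], registered)
response_order = list(reversed(request_order))
```

Here the request passes through authentication, then redirection, then the
cache, and the response travels back through the same filters in the
opposite order.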
3.2 Components Library
HTTP 1.0 server (hserver.cmo)
- Waits for HTTP connections on a given port, forwards requests (possibly
filtered) to the scheduler, and answers the client.
- Distinguishes requests meant for the engine itself from other requests.
- Meant to be used as your standard/default proxy in your web browser.
The configuration file (servers.conf) allows the definition of one or
several HTTP ports on which the server is active. Each port is defined by the
hostname (in case your machine has several names on several networks), the
port number, and the name of the filter set applied to requests received on
that port (see section 3.1.2 for filter set definitions).
It may be useful to have several ports with different filters, in the case
where some filters are specific to given browsers or tools (such as
on-the-fly conversion of documents, caching, etc.).
Be sure to check your browser configuration so that it points to the ports
you specified!
3.2.2 Proxy components
HTTP proxy (http_proxy.cmo)
- Acts as a classical HTTP proxy, forwarding HTTP requests to their
proper server.
No configuration required. Will work only if your machine has default
routing.
Forwarding proxy (proxies.cmo)
- Forwards requests of a given protocol to a further proxy (talks to this
proxy in HTTP 1.0).
For each protocol scheme that should be forwarded, specify the scheme (ftp,
http, etc.), the fully qualified host name of the proxy host,
and the port number.
Since V6 does not support proxying for anything other than http at
this time, you will probably need this component for the other common
protocol schemes. Remember, the main interest of V6 is to have all
documents retrieved from the Web go through a series of filters. Thus, even
if V6 only acts as a forwarding proxy for certain protocols, it may still be
useful to configure your browser so that all requests go
through V6.
3.2.3 Service Components
File system mapping (fs.cmo)
- Like classical HTTP servers, maps a URL to some document file/directory
on the file system.
- supports directories
- does not support directory to index.html remapping, access
control, or other goodies (yet).
In contrast to most other http servers, this component allows several
different file system hierarchies to be served under different path
prefixes, without having to make symbolic links all over the place.
The configuration is quite simple: for each hierarchy, specify the local URL
prefix to be used on the server and the corresponding absolute path on the
filesystem. For example,
doc/foobar /net/software/foobar/documentation
will tell V6 to respond to a request for /doc/foobar/some/path
with the file at /net/software/foobar/documentation/some/path.
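The mapping in this example can be mimicked with a few lines of Python. This
is a sketch under assumptions, not the fs.cmo implementation: map_url_path
and MAPPINGS are hypothetical names, prefixes are matched on whole path
segments, and no protection against ".." segments is attempted.

```python
import posixpath

# URL prefix -> absolute directory, as in the configuration line above.
MAPPINGS = {"doc/foobar": "/net/software/foobar/documentation"}


def map_url_path(url_path, mappings=MAPPINGS):
    """Translate a local URL path into a file system path, or None."""
    path = url_path.lstrip("/")
    best = None
    for prefix in mappings:
        if path == prefix or path.startswith(prefix + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    if best is None:
        return None
    rest = path[len(best):].lstrip("/")
    return posixpath.join(mappings[best], rest) if rest else mappings[best]


target = map_url_path("/doc/foobar/some/path")
```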
CGI frontend (cgi_ctl.cmo)
CGI support is actually in the core V6 engine. This module only provides
an interface to access this feature and to read the configuration file.
- Like classical HTTP servers, maps a URL to some executable
file/directory on the file system.
- Supports (I think) the CGI 1.1 specification.
The configuration file (cgi.conf) allows for specifying either
directories of CGI programs, or single CGI programs. The first form
dir bin ~/bin/cgi
maps requests for /bin/foo to the CGI program ~/bin/cgi/foo.
However, /bin/ga/bu is mapped to ~/bin/cgi/ga, and not to
~/bin/cgi/ga/bu, even if it exists.
The second form
file search ~/bin/ffwsearch
maps requests for /search to the CGI program ~/bin/ffwsearch.
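The two forms can be modelled with the following Python sketch. It is an
illustration, not the V6 code: resolve_cgi is a hypothetical function, and
returning the rest of the path as extra path information (as CGI programs
usually receive it) is an assumption.

```python
def resolve_cgi(path, dirs, files):
    """Return (program, extra_path) for a request path, or None."""
    parts = path.strip("/").split("/")
    if parts[0] in files:
        # "file" form: the first segment names a single CGI program.
        program, rest = files[parts[0]], parts[1:]
    elif parts[0] in dirs and len(parts) > 1:
        # "dir" form: only the segment right after the prefix picks the program.
        program, rest = dirs[parts[0]] + "/" + parts[1], parts[2:]
    else:
        return None
    extra = "/" + "/".join(rest) if rest else ""
    return program, extra


dirs = {"bin": "~/bin/cgi"}           # dir bin ~/bin/cgi
files = {"search": "~/bin/ffwsearch"}  # file search ~/bin/ffwsearch
resolved = resolve_cgi("/bin/ga/bu", dirs, files)
```

The result for /bin/ga/bu names ~/bin/cgi/ga as the program, matching the
rule stated above.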
Proxy authentication (pauth.cmo)
- Proposed HTTP/1.1 proxy authentication.
- Defines a filter named "Proxy authentication"
By inserting this filter, all accesses to the proxy are required to be
authenticated (header Proxy-Authorization). The accepted users are
defined in the configuration file pauth.conf. User names are arbitrary
tokens (no space). Passwords are encoded with MD5. To get the encoding of
a password, use the v6pass program included in the distribution.
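As a rough illustration of such a check, here is a Python sketch. Several
points are assumptions and not taken from V6: that the Basic scheme is used
in the Proxy-Authorization header, that the stored password is the
hexadecimal MD5 digest of the clear password, and the names
check_proxy_auth and users. The exact encoding produced by v6pass is not
documented here and may differ.

```python
import base64
import hashlib


def check_proxy_auth(headers, users):
    """Return True if the request carries credentials for a known user."""
    value = headers.get("Proxy-Authorization", "")
    scheme, _, credentials = value.partition(" ")
    if scheme.lower() != "basic":
        return False
    try:
        decoded = base64.b64decode(credentials).decode()
    except (ValueError, UnicodeDecodeError):
        return False
    user, _, password = decoded.partition(":")
    # Assumed storage format: hex MD5 digest of the clear-text password.
    digest = hashlib.md5(password.encode()).hexdigest()
    return users.get(user) == digest


users = {"alice": hashlib.md5(b"secret").hexdigest()}
token = base64.b64encode(b"alice:secret").decode()
accepted = check_proxy_auth({"Proxy-Authorization": "Basic " + token}, users)
```

A proxy that rejects the request would normally answer with status 407
(Proxy Authentication Required) so that the browser prompts the user.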
Redirection (redirect.cmo)
- redirects requests according to the request URI.
- defines a filter named "Redirect"
The configuration file (redirect.conf) contains redirection
specifications as regular expressions (from the Str library in the Objective Caml distribution). Each redirection is composed of a
regular expression matching the request URI, and a substitution expression
for producing the redirected URI.
The example
Redirect "http://v6\(:80\)?\(.*\)" "\2"
makes v6 a virtual host name for the local services of the V6
engine, so that you can add things like http://v6/services in your
bookmarks independently of where the engine is actually running.
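The effect of this rule can be reproduced in Python. This is a sketch, not
the redirect.cmo code: the pattern is transcribed from Str syntax into
Python's re syntax (groups are written with unescaped parentheses), matching
is assumed to be anchored at the start of the URI, and redirect and RULES
are hypothetical names.

```python
import re

# Same rule as above, written in Python's re syntax.
RULES = [(r"http://v6(:80)?(.*)", r"\2")]


def redirect(uri, rules=RULES):
    """Return the URI rewritten by the first matching rule, else unchanged."""
    for pattern, template in rules:
        match = re.match(pattern, uri)
        if match:
            return match.expand(template)
    return uri


local = redirect("http://v6/services")
```

Both http://v6/services and http://v6:80/services are rewritten to the local
path /services, while URIs for other hosts are left untouched.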
HTML Filtering (a.k.a NoShit) (noshit.cmo)
GIF Deinterlacing (deinterlace.cmo)
Global History (history.cmo)
- stores in a DBM database all URLs that were requested (with GET)
through V6, with the time of last visit
- does not (yet) provide an easy interface for working on the global history
- is used by the indexing filter (see below) to avoid duplicates.
- defines a filter named "Global history (NDBM)"
Cache (cache*.cmo)
- Combination of filters and service components for caching documents.
- The cache is defined by an equality relation on requests. Only GET
requests are compared (equality is false for all other methods). For GET
requests, the normalized URIs are compared (although URI normalisation
doesn't handle hex encodings yet).
- A request with header Pragma: no-cache bypasses the cache.
- A request with header If-modified-since: bypasses the cache.
- The cache stores only responses with a return code matching the user
specified list of cacheable codes.
- There is no limit on the cache size, and no automatic cache flushing
(except for requested documents whose cached version has expired).
- Extensive user interface for searching and manipulating the cache
- defines a filter named "Cache find"
The configuration file cache.conf contains directives for specifying
- codes: the list of HTTP codes that we want to cache.
A good default is positive answers (200) and permanent redirections (301)
codes 200 301
- privacy: the private directive says that your cache is supposed
to be strictly private. Even documents that were protected by HTTP
Authentication will be cached.
- nocache: the nocache directive specifies regular expressions of
URLs that should not be cached (e.g. local documents, local services,
servers on the same site, responses from search engines, ...)
- cookie: an arbitrary string to protect cache modifications from being
triggered by malicious third-party pages.
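The cache rules described in this section (GET only, bypass headers,
cacheable codes, nocache patterns) can be summarised by a small Python
sketch. It is only a model: may_use_cache and may_store are hypothetical
names, URI normalisation is omitted, and the nocache patterns are written in
Python's re syntax.

```python
import re


def may_use_cache(request):
    """True if a request may be answered from the cache."""
    headers = {k.lower(): v for k, v in request.get("headers", {}).items()}
    if request["method"] != "GET":
        return False  # only GET requests are ever considered equal
    if headers.get("pragma", "").lower() == "no-cache":
        return False
    if "if-modified-since" in headers:
        return False
    return True


def may_store(uri, code, codes=(200, 301), nocache=()):
    """True if a response with this return code may be stored for this URI."""
    if code not in codes:
        return False
    return not any(re.search(pattern, uri) for pattern in nocache)


plain = {"method": "GET", "uri": "http://example.org/", "headers": {}}
usable = may_use_cache(plain)
```

The default codes mirror the suggested "codes 200 301" line; a nocache
pattern such as one matching the local virtual host keeps local services out
of the cache.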
We hope to provide reasonable expiration support in the next releases.
The documentation for the cache interface is on-line (/cache).
Indexing Memory (indexing.cmo)
This filter is an experiment in using NLP tools such as incremental
full-text indexing to provide navigation aid. As an alternative to bookmarks,
we suggest that all incoming HTML or text documents should be
indexed on the fly by some full-text indexer. Then, finding a place on the
Web that the user has visited in the past, and that was about some
subject, is just a matter of querying the database with
keywords.
The requirements for this experiment were:
- incremental indexing (because we index documents on the fly)
- no need to keep the documents (we just want to keep a trace)
- low ratio index/document size.
- CGI access for interrogation
A first Web scan revealed surprisingly little freely available software meeting
these requirements, and we chose FFW for the
experiment. Unfortunately, we had to make some patches to the official
distribution, and the FFW license forbids re-distribution of modified
versions.
Thus, the V6 distribution contains only our diffs, and the user has to
get the original FFW distribution from
http://www.nta.no/produkter/ffw/ffw.html
then has to apply the patch and compile the software (C++ required!).
The indexing component requires ffwindex and ffwmerge to be in
the PATH. It builds the database $HOME/.v6/ffwdb/cache (make
sure that the directory $HOME/.v6/ffwdb/ exists).
Then configure and install the CGI script ffwsearch in
one of your binary directories. To offer index querying, check cgi.conf,
and add something like
file search ~/bin/ffwsearch
so that you can issue queries with http://v6/search/cache
The filter defined by this component is named "Indexing Memory".
3.3 Running V6
V6 is started by executing the v6 command. The allowed options are
- -engines <n>: specifies the number of engines (threads) for
processing requests (default is 10). The relation between observable speed
and number of engines is not immediate.
- -modules <file>: specifies an alternate set of modules to be
loaded (default is modules.conf).
- -dir <directory>: specifies an alternate directory for all
V6 configuration files. Also affects components that compute the path
names of their persistent storage files from the V6 root directory.
- -debug: makes V6 quite verbose.
- foo.cmo: specifies additional modules to be loaded during
startup.
V6 logs its transactions on (buffered) standard output. Debugging
messages go to standard error.
To check that V6 started properly, try the URL http://v6/v6
(assuming you kept the redirection rule for the v6 virtual host).
From there, you can access the builtin services (see section 3.7).
3.4 Configuring your browser
In your browser, go to the preference control panel for the network (or
proxies). Set the proper host name and port number for each protocol for
which you want V6 as a filtering proxy.
3.5 Dynamic configuration of V6
In this release, V6 cannot be easily re-configured while running. For
the moment, one has to kill V6, edit the configuration file, and restart
the engine.
3.6 Killing V6
V6 can be killed safely at (almost) any time (using SIGINT or SIGTERM). Most components will checkpoint properly when the program exits.
3.7 Builtin services
3.7.1 Services /services
The list of registered V6 service components can be accessed with the
/services URL. For each component, a short description is given, and when
available, a pointer to the documentation (and further configuration).
3.7.2 Filters /filters
The list of registered V6 filters is available at /filters. A
simple dynamic configuration form is also available, but beware that the
configuration changes are valid only for the current session (they are not
saved to the disk).
3.7.3 Engines /engines
The engines interface provides a simple ps and kill
interface to V6 engines. Working engines may be killed if necessary
(e.g. when stuck on outbound connections).
3.8 Troubleshooting
- V6 will run even if the servers component failed to initialize
properly (unassignable port number, ...). Make sure the port number is
allowed (> 1024, not used by other software).
- to understand what's going on, run V6 with -debug
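Before starting V6, the port constraint can be checked with a short Python
script. This is a generic sketch unrelated to the V6 code base;
port_is_usable is a hypothetical helper that simply tries to bind the port.

```python
import socket


def port_is_usable(host, port):
    """True if port is above 1024 and can currently be bound on host."""
    if port <= 1024:
        return False  # low ports are not allowed for ordinary users
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False  # already used by other software, or not assignable
    return True


privileged = port_is_usable("127.0.0.1", 80)
```

The check is only indicative: another program may still grab the port
between the test and the moment V6 starts.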
Chapter 4 Programmer's Manual
NOT YET AVAILABLE