Moabdar - a newsgroup archiver with a web interface
At the time of this writing, Moabdar (version 0.0.1) is very new, and I will
be very surprised if it doesn't have critical bugs. Please use it anyway and
report any bugs you find; otherwise they cannot be fixed. See BUGS for more
information.
Moabdar is a Perl program that archives newsgroups and provides a web interface
to the archive.
Google Groups (http://groups.google.com/) provides an awesome archive of Usenet
and some other small newsgroup hierarchies. However, some local newsgroups may
not be archived by Google, so Moabdar can be used to create an archive for
these. I created Moabdar in order to archive some newsgroups (around twenty) we
keep at the National Technical University of Athens, which are accessible only
from Greece.
-
Reads a news spool directory (that can be created with inn, leafnode, or
slrnpull) and archives the messages of specific newsgroups.
-
Stores messages in a directory structure similar to that of the news spool
directory, except that all messages are tarred and gzipped by month to achieve
better compression.
-
Provides a web interface to browse messages by newsgroup, month and day, and to
search messages by date, subject and sender.
-
Contains a system to automatically hide parts or all of a message if there are
insufficient copyright permissions.
See BUGS for Moabdar problems and for things it does not currently do.
To run Moabdar, you need:
-
Perl 5 or later. Perl can be found at http://www.perl.com/. Many Unix-like
systems (such as Linux) come with Perl pre-installed. For Microsoft Windows,
the easiest to install is probably ActivePerl, http://www.activeperl.com/.
-
The Perl modules CGI, CGI::Carp, Exporter, File::Path, File::Spec, File::Temp,
and HTML::Template. Most of these are standard, that is, they are
automatically installed with Perl; you are probably missing only CGI, CGI::Carp
and HTML::Template. The easiest way to install them is probably by running the
command
perl -MCPAN -e shell
If this is the first time you are running it, you will have to answer some
configuration questions. Then, you can install a module by typing the command
install modulename
at the CPAN prompt. For example,
install HTML::Template
-
A local news spool. Moabdar does not support NNTP; you must fetch messages
locally using a program such as inn, leafnode, or slrnpull. I use slrnpull. A
problem with slrnpull is that it eliminates cross-posts.
-
GNU tar. Most Linux systems have GNU tar installed. If you have another version
of tar, you might want to modify Moabdar so that it can work with the other
version. Send me the patch!
-
gzip
-
A web server, such as Apache, capable of running scripts.
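Before starting the CPAN shell, it may help to find out which of the modules listed above are actually missing. The following is a minimal sketch, not part of Moabdar; the function name missing_modules is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Modules named in the requirements list above.
my @required = qw(
    CGI CGI::Carp Exporter File::Path File::Spec File::Temp HTML::Template
);

# Return the subset of the given module names that cannot be loaded.
# (Hypothetical helper, not part of Moabdar.)
sub missing_modules {
    my @names = @_;
    my @missing;
    for my $name (@names) {
        # Turn Foo::Bar into Foo/Bar.pm, the form require expects
        # when it is given a string instead of a bareword.
        (my $file = "$name.pm") =~ s{::}{/}g;
        push @missing, $name unless eval { require $file; 1 };
    }
    return @missing;
}

my @missing = missing_modules(@required);
if (@missing) {
    print "Missing: $_\n" for @missing;
} else {
    print "All required modules are installed.\n";
}
```

Each name it reports can then be installed from the CPAN prompt with the install command shown above.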
To install Moabdar:
-
Unpack the tar.gz into the directory where you wish to install Moabdar, such as
/usr/local/moabdar.
-
The moabdarize, moabdar-web.pl, and index.cgi files contain
a line near the top (it is one of the first fifteen lines) that reads
my $BASE_DIR='/usr/local/moabdar';
If you have installed Moabdar in a different directory than
/usr/local/moabdar, modify this line as needed.
-
Create a copy of or link to the index.cgi file and the icons directory in
places accessible through the web. You can either put both in a new directory
and configure your web server to execute index.cgi, or put icons wherever
you like and index.cgi in the web server's cgi directory (in which case you
should change its name).
-
Configure Moabdar by modifying the moabdar.rc file. You will find
instructions inside the file; the complete reference is in CONFIGURATION DIRECTIVES FOR moabdar.rc.
-
Create the archive directory and set its ownership and permissions. A good idea
is for it to be owned by the same user your web server runs as (usually
nobody or www-data or www) and be accessible by the owner only.
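The last step can be sketched in Perl as follows. This is only an illustration: the helper name create_archive_dir is made up, the directory is taken from the command line, and "accessible by the owner only" is interpreted as mode 0700. Changing the owner normally requires root, so the sketch assumes it is run as the user who should own the archive.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Path qw(make_path);

# Create $dir (and any missing parent directories), then restrict $dir
# itself to its owner. Parent directories keep the default permissions.
# (Hypothetical helper, not part of Moabdar.)
sub create_archive_dir {
    my ($dir) = @_;
    make_path($dir);
    chmod 0700, $dir or die "Cannot chmod $dir: $!";
    return $dir;
}

# The directory should be the one named by ARCHIVE_DIRECTORY in moabdar.rc.
my $archive_dir = shift @ARGV;
create_archive_dir($archive_dir) if defined $archive_dir;
```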
After configuration, just run moabdarize to perform the archiving operation.
Make sure that you run moabdarize with permission to write to the archive
directory and to moabdar.rc. See BUGS for a problem with updating
moabdar.rc. You probably want to have cron run moabdarize at regular
intervals.
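As an illustration, a crontab entry along these lines would run moabdarize once a day. It assumes Moabdar is installed in /usr/local/moabdar, the example directory used above. The 03:15 schedule is arbitrary, and the entry should go in the crontab of a user who can write to the archive directory and to moabdar.rc.

```
# min  hour  day-of-month  month  day-of-week  command
15     3     *             *      *            /usr/local/moabdar/moabdarize
```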
moabdar.rc consists of empty lines or lines beginning with #, which are
ignored, and lines of the form KEY=value. The keys are case sensitive, so
specify them in all caps.
The various keys are:
- ICONPATH
-
The URL of the Moabdar icons directory; may be a full path without hostname,
such as /foo/icons, or a path relative to the directory where index.cgi
has been placed, such as icons, or a full URL.
- CACHE
-
The directory where search cache files are stored.
- CACHE_EXPIRY
-
Whenever the archive is accessed from the web, files in the cache directory
older than CACHE_EXPIRY seconds are deleted.
- ACTIVE_FILE
-
The full pathname of the INN-type active file. moabdarize reads that file in
order to determine which messages in the news spool directory are new since the
previous time it was run, and archives these messages. In theory, Moabdar can
work without an active file, in which case it determines the new messages by
looking into the news spool directory; however, Moabdar has never been debugged
that way, so it is very unlikely it will work. If you are willing to debug and
fix it, leave ACTIVE_FILE blank.
- NEWS_SPOOL
-
The news spool directory.
- ARCHIVE_DIRECTORY
-
The archive directory.
- TRIPLE_HEADER_INDEX_FILE
-
- MESSAGE_ID_INDEX_FILE
-
- TAR_FILE
-
These are filenames for files Moabdar creates in various subdirectories of the
archive directory. I don't see any reason why these might have to be changed,
so leave the defaults.
- ARCHIVE_NAME
-
The name used in the web pages as the title of the archive.
- DEFAULT_CHARSET
-
The character set specified in the Content-Type HTTP response header.
- CHECK_COPYRIGHT
-
- AUTHFILE
-
For an explanation of these, see COPYRIGHT OF NEWSGROUP MESSAGES.
- NEWSGROUP
-
- HIGH_MESSAGE
-
- MESSAGE_COUNT
-
For each newsgroup you want to archive, set NEWSGROUP to its name, HIGH_MESSAGE
to 0 and MESSAGE_COUNT to 0; specify as many NEWSGROUP, HIGH_MESSAGE and
MESSAGE_COUNT triplets as there are newsgroups to be archived. moabdarize
will subsequently update moabdar.rc each time it is run in order to keep
track of the archived messages in HIGH_MESSAGE and MESSAGE_COUNT. HIGH_MESSAGE
is the spool system id of the last message archived, and MESSAGE_COUNT is the
largest Moabdar id assigned so far.
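To make the file format concrete, here is a small sketch that parses text in the format described above. It is not Moabdar's own parser. The function name parse_rc, the sample paths, and the newsgroup names are invented for illustration, and the sketch assumes that each NEWSGROUP line is followed by its own HIGH_MESSAGE and MESSAGE_COUNT lines.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse moabdar.rc-style text. Empty lines and lines beginning with # are
# skipped; every other line must have the form KEY=value. Returns a hash
# reference of single-valued settings and an array reference of newsgroup
# triplets. (Hypothetical helper, not Moabdar's own code.)
sub parse_rc {
    my ($text) = @_;
    my (%settings, @groups);
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*$/ or $line =~ /^#/;
        my ($key, $value) = $line =~ /^([A-Z_]+)=(.*)$/
            or die "Malformed line: $line\n";
        if ($key eq 'NEWSGROUP') {
            # Assumption: a NEWSGROUP line starts a new triplet.
            push @groups, { NEWSGROUP => $value };
        } elsif ($key eq 'HIGH_MESSAGE' or $key eq 'MESSAGE_COUNT') {
            die "$key appears before any NEWSGROUP\n" unless @groups;
            $groups[-1]{$key} = $value;
        } else {
            $settings{$key} = $value;
        }
    }
    return (\%settings, \@groups);
}

# Sample configuration; all values are hypothetical.
my $sample = <<'END_RC';
# Two newsgroups, not yet archived
ARCHIVE_NAME=Example newsgroup archive
NEWS_SPOOL=/var/spool/news
ARCHIVE_DIRECTORY=/var/lib/moabdar
CACHE_EXPIRY=3600
CHECK_COPYRIGHT=0

NEWSGROUP=example.announce
HIGH_MESSAGE=0
MESSAGE_COUNT=0
NEWSGROUP=example.discussion
HIGH_MESSAGE=0
MESSAGE_COUNT=0
END_RC

my ($settings, $groups) = parse_rc($sample);
print "Archive: $settings->{ARCHIVE_NAME}\n";
print "Group: $_->{NEWSGROUP} (high $_->{HIGH_MESSAGE})\n" for @$groups;
```

The sample text also shows what a freshly written moabdar.rc might look like before moabdarize has updated the HIGH_MESSAGE and MESSAGE_COUNT values.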
Copyright is the right to grant or deny permission to reproduce a work. The
author of a newsgroup message holds the copyright; that is, the author can grant
or deny permission to reproduce the message. By default, permission is denied;
you are not allowed to reproduce the message unless the author explicitly gives
you permission to do so. When you post to Usenet, you obviously grant implicit
permission for your message to propagate in the Usenet servers, but you don't
grant any other permission. I don't know how Google Groups has got around this
problem, but you must be aware of this issue. See the applicable copyright laws
for details.
Moabdar contains some features to publish only those messages for which
explicit permission has been given. If the CHECK_COPYRIGHT parameter is 0, then
these features are off. To turn them on, set CHECK_COPYRIGHT to 1 and AUTHFILE
to a file where the names of the people who have given permission will be
stored. If you do this, Moabdar will hide parts of the message when displaying
it on the web page. Specifically, Moabdar will:
-
Hide the entire message body if the message contains an X-no-archive header.
-
Hide the entire message body unless a line exists in AUTHFILE that matches a
part of the From: header. Note that the match must be exact and
case-sensitive. For example, if a line Foo Bar exists in AUTHFILE, it will
match From: Foo Bar <foo@bar.com>, but not From: Foo S. Bar <foo@bar.com>.
Thus, if a person appears with many variations of a name, they must all be
included in AUTHFILE.
-
Hide quoted text unless a line exists in AUTHFILE that matches a part of the
line preceding the quoted text, or the line preceding the quoted text is blank
and the previous quoted text in the same message has not been hidden.
-
Hide all quoted text inside quoted text.
Obviously, an administrator must be responsible for manually maintaining
AUTHFILE.
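The first two rules above (the X-no-archive check and the From: match) can be sketched as follows. This is an illustration of the described behaviour, not Moabdar's actual code. The function name body_visible and its data layout are made up, and comparing header names case-insensitively is an assumption.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Decide whether a message body may be shown. $headers is a hash reference
# of header name => value; $authorized is an array reference holding the
# lines of AUTHFILE. (Hypothetical helper, not Moabdar's own code.)
sub body_visible {
    my ($headers, $authorized) = @_;

    # Rule 1: an X-no-archive header hides the body. Header names are
    # compared case-insensitively here, which is an assumption.
    return 0 if grep { lc($_) eq 'x-no-archive' } keys %$headers;

    # Rule 2: some AUTHFILE line must occur verbatim in the From: value.
    my ($from) = map { $headers->{$_} } grep { lc($_) eq 'from' } keys %$headers;
    return 0 unless defined $from;
    for my $name (@$authorized) {
        next if $name eq '';                    # ignore blank lines
        return 1 if index($from, $name) >= 0;   # exact, case-sensitive
    }
    return 0;
}

my @authfile = ('Foo Bar');
for my $from ('Foo Bar <foo@bar.com>', 'Foo S. Bar <foo@bar.com>') {
    my $verdict = body_visible({ From => $from }, \@authfile) ? 'shown' : 'hidden';
    printf "%-26s %s\n", $from, $verdict;
}
```

With the single AUTHFILE line Foo Bar, the first address is reported as shown and the second as hidden, as in the example given above.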
Note that it may be illegal even to maintain the archive, even if it is
unpublished. Check the applicable copyright laws.
Moabdar's bugs are numerous. Here are a few that come to mind:
-
Can't search all archived newsgroups at the same time. You can only search
messages in one newsgroup.
-
Doesn't display threads.
-
Messages with an X-no-archive header should be omitted from the archive rather
than have their bodies hidden when viewing.
-
If you search for messages, then leave the browser open past the cache expiry
time, then try to access some of the search results, the script will die, giving
a message like "Internal server error". It should explain, instead, that the
cache has expired.
-
Each message is logged in three sets of index files; this is probably a useless
waste of disk space; maybe one set of index files is enough.
-
The full pathname of the configuration file is hardwired in the code. Thus, it
is not possible to keep distinct archives with only one copy of the program.
-
If moabdarize archives the messages, but fails to update moabdar.rc (e.g.
due to insufficient permissions), then its state will be inconsistent. Run
moabdarize -r after correcting the problem to rebuild indexes.
-
It is not possible to turn warnings off. When slrnpull is used as the news
system, it eliminates cross-posts. When moabdarize cannot find the eliminated
files, it issues warnings. When moabdarize is run from cron, these warnings
may result in annoying e-mail.
Please see http://moabdar.sourceforge.net/ for a more complete list of bugs.
Also report bugs there. Fix them if you can!
Moabdar was written by Antonios Christofides, <A.Christofides@itia.ntua.gr>,
for archiving of the ntua.* newsgroups kept at news.ntua.gr. It is a direct
offspring of Usenet-Web 1.0.2, a newsgroup archiver created by Benjamin
"Snowhare" Franz. In fact, it is a rewrite of Usenet-Web, with a few extra
features added (namely the copyright restrictions, the storing of the messages
in tar files, and the ability to search in all years at the same time). The
icons used in the web pages are those of Usenet-Web.
- 13 August 2002
-
Moabdar 0.0.1 released.
There are different licenses for the software (including the templates) and for
the icons.
Copyright (C) 2002 Antonios Christofides
Moabdar is free software; you can redistribute it and/or modify it under the
terms of either of the following:
- a)
-
the GNU General Public License version 2, as published by the Free Software
Foundation. You should have received a copy of the GNU General Public License
with this program, in the file GPL.
- b)
-
the Artistic License. You should have received a copy of the Artistic License
along with this program, in the file ARTISTIC.
Copyright (C) 1994-1995 Benjamin Franz
The icons are those found in Usenet-Web, by Benjamin Franz, and I guess that
the following notice, found in some of the files of the Usenet-Web distribution,
applies to them:
The Usenet-Web programs are copyrighted 1994 by Benjamin Franz
(snowhare@netimages.com) and may be freely distributed and
modified so long as no fees are charged.