Andrew Web - Publishing Subsystem Requirements
by Dan McCarriar
1. Introduction
This document describes the requirements for the publishing subsystem of the Andrew Web System. The publishing subsystem is one part of an overall redevelopment of the Andrew Web System, a collection of web services made available to the Carnegie Mellon University community.
The current publishing mechanism varies between the two production web servers. Publishers using the main web site at http://www.cmu.edu/ must transfer their content to a specified AFS directory, and use a telnet-based interface to release their content to a "test" environment, which is accessible at http://www.cmu.edu:8001/. After the content is released to the test environment, the publisher must use the telnet interface again to move the content to the "production" environment. Releases to production are not immediate - the content is copied from AFS to the local disk of the web server via an automated process that runs at 2:12am daily.
On http://www.andrew.cmu.edu/, which hosts content for courses, organizations, and individual users, the publishing model is somewhat different. Publishers still transfer their content to an AFS directory, but then visit a web site to release their content. There is no "test" environment; pages are visible on the production server immediately after they're released. The release mechanism is a simple form, and there are both authenticated and unauthenticated options available.
Outside of the publishing mechanism, the current system does not provide any web-based tools for link validation or managing access to collections. The quotas available for individual collections are relatively small. There is no mechanism for content expiration - for example, an individual user's web pages could remain accessible on the production server long after that user has left the university.
2. Vision Statement
One fundamental need of all web publishers is a way to manage content on a web server. To do this effectively, publishers need not only a way to transfer content to a web server, but also a robust set of tools to support different models of publishing. Individuals and small workgroups may only need space in a file system that is automatically published by a web server. Larger workgroups often require tools to maintain different versions of content and support concurrent authoring, as well as a means to preview content on a web server before deploying it to the public. All publishers need to have an interface to publishing tools such as link validation, and a means to grant individual users access to manage content.
From the service provider's point of view, it is desirable to separate the "production" web servers where content is viewed by the public from the "development" web servers where content authors perform their development and testing. The web servers where content is published should be easily recoverable, and the content on those servers should be able to automatically "expire" on a specified date or when an individual publisher's account expires.
The publishing subsystem is designed to provide Andrew Web System customers a consistent way to manage content on the Andrew Web System's production web servers, www.cmu.edu and www.andrew.cmu.edu. For server administrators, the publishing subsystem will provide a mechanism for easy administration of web server content, configuration management, and a means for rapid recovery of each production web server in the event of a machine failure.
3. System Overview
The publishing subsystem consists of the following components:
- One or more "staging" web servers for uploading and previewing content.
- Two "production" web servers for publicly viewable content. The two servers are www.cmu.edu for 'official' university and department web pages, and www.andrew.cmu.edu for pages related to courses, student organizations, and individual users.
- Web-based tools for web publishers, which facilitate management of content on the staging and production servers, and management of user access to web collections.
- Web-based tools for site administrators, which allow for the same content and access management as the web publisher tools, but on a site-wide basis. The administrative tools also allow for the archival and deletion of collections.
An important concept of the publishing subsystem is that of a web "collection". A collection is simply space in a file system to which one or more web publishers have access to place HTML pages, images, and other web-related content. Each collection maps to a particular URL on a production web server. For example, an individual user's collection might be accessible at:
http://www.andrew.cmu.edu/user/joe
A collection for the Basket Weaving department, which may have many people that participate in publishing the department's web site, could map to:
http://www.cmu.edu/basketweaving
4. Use-cases and User-scenarios
The publishing subsystem is intended for the following classes of users:
- Site administrators use the publishing subsystem to manage all content on the production web servers. Common tasks include deleting and archiving content and managing user access to collections.
- Web publishers use the publishing subsystem to place their content on production web servers, move content between the staging server and the production server, and manage access to their collections.
Publishing Content (Simple)
Wally Webmaster has a very simple web site for his department. He creates and tests his pages on a local computer, and uploads them via FTP to a specified location on a file server. After he uploads his pages, they are immediately available at his department's URL: http://www.cmu.edu/wallysdepartment.
Publishing Content (With Staging Server)
Wanda Webmaster is part of a large workgroup that maintains the university's front page. The front page has to be perfect, as it gives visitors from outside the university their first impression of the university. Wanda uploads her files via kFTP to a specified location on a file server. Her pages are then immediately available on the staging server, where she can have her workgroup proofread the pages and test all of the links before the pages are released to the public. After her workgroup has looked at the new pages, Wanda goes to the publishing page on the staging server, where she authenticates by entering her user name and password. She is presented with a list of the web collections she is permitted to manage, and chooses the "homepage" collection. She notices from the release logs that a colleague has also uploaded some new pages to the homepage collection since she did her upload. Wanda checks with her colleague to ensure that the pages have been proofread, and then clicks the "publish" button to move the new pages in the "homepage" collection from the staging server to the production server, where they will be viewable by the public.
Requesting a New Collection
Paul Professor is teaching a class during the spring semester. Before the semester begins, he goes to the web publishing home page to request that a new web collection be created for his course. At the site, he enters various information about the course, including the name and course number. He also enters the user IDs of his teaching assistants to give them access to manage web pages in the new collection. Finally, he enters a date a few weeks after the end of the semester on which the collection will expire. Rather than have the collection be deleted on that date, he chooses the "archive" option, which will cause his course content to be moved to a private URL where it will be accessible only to the maintainers of the collection.
Managing User Access to Collections
Michelle Manager has just hired a new webmaster for her department. On the webmaster's first day of work, Michelle visits the web publishing home page and enters her user name and password. She only has access to maintain one collection, so she is immediately directed to the management page for her department's collection. She clicks the "Add User" button and enters the user ID of her new webmaster. The new webmaster then has access to manage and release content in the department's collection.
Link Checking
Rhonda Researcher has just hired a new graduate student to maintain her web site. Before turning the site over to the student, she visits the web publishing home page to run the link validation tool on her collection. The tool provides her with a list of three broken links, which she fixes. She then configures the tool to run automatically once per week, and email the student if it finds any broken links. To cover periods of time such as the summer break, when the student might be out of town, she also configures the link validation tool to notify her via email if any links remain broken for longer than two weeks.
Site-Wide Administration
Andy Administrator is responsible for maintaining content on the two production web servers. He regularly goes to an administrative publishing page to manage collections. Today when he logs in to the page, he is notified that the accounts of all of the users with access to publish the "ultimatesoccer" collection, which belongs to a student organization, have been disabled because the students have recently graduated. Since nobody can now access this collection, he researches the organization and finds out that there are no remaining members. Andy decides to remove the collection from the production web servers. Andy then adds a user to the access list for another collection because of a request he received via email.
5. Related Links
6.1 Requirements: General Publishing
100.1 Each collection must have an associated file system space that is accessible via FTP or kFTP.
100.10 The file system space allocated for collections must provide access to some sort of version control system such as RCS or CVS.
100.15 The file system space allocated for collections must, by default, be accessible only to the administrators of the collection and any required services (e.g. the publishing software). Specifically, file system space for web collections must NOT be publicly accessible by default, though web publishers must have the ability to make their space publicly accessible if they choose to do so.
100.20 The file system space allocated for collections must allow quotas of up to 200MB.
100.30 Each collection must map to at least one URL on one of the production web servers. In certain circumstances, a collection must be mapped to more than one URL on one or more of the production web servers. In this case, when a collection is released, it must be published to each associated URL automatically.
100.40 The publishing system must support hosting of domain names other than those of the production servers (e.g. www.bio.cmu.edu).
100.50 Web publishers must be able to choose whether their collections are managed via either the simple or advanced publishing models, described below.
100.55 There must be a simple migration path between the simple and advanced publishing models, involving little or no intervention from the server administrators.
100.60 The file system space allocated for collections should allow much larger quotas (up to 2 GB) for special projects, subject to reasonable usage guidelines and strict content expiration requirements.
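The general publishing requirements above can be illustrated with a small data model. This is only a sketch under stated assumptions: the class and field names (Collection, PublishingModel, special_project, files_publicly_readable) are hypothetical, and treating 2 GB as 2048 MB is an assumption, not something this document specifies.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List

STANDARD_MAX_QUOTA_MB = 200    # upper bound from requirement 100.20
SPECIAL_MAX_QUOTA_MB = 2048    # requirement 100.60; 2 GB taken as 2048 MB (assumption)


class PublishingModel(Enum):
    SIMPLE = "simple"        # content is live immediately
    ADVANCED = "advanced"    # content goes through the staging server first


@dataclass
class Collection:
    """Hypothetical record describing one web collection."""
    name: str
    urls: List[str]                                  # 100.30: one or more production URLs
    quota_mb: int = STANDARD_MAX_QUOTA_MB
    model: PublishingModel = PublishingModel.SIMPLE  # 100.50: chosen by the publisher
    special_project: bool = False                    # 100.60: allows the larger quota
    files_publicly_readable: bool = False            # 100.15: private by default

    def __post_init__(self) -> None:
        if not self.urls:
            raise ValueError("a collection must map to at least one URL")
        limit = SPECIAL_MAX_QUOTA_MB if self.special_project else STANDARD_MAX_QUOTA_MB
        if not 0 < self.quota_mb <= limit:
            raise ValueError("quota must be between 1 and %d MB" % limit)

    def release_targets(self) -> List[str]:
        """Every URL that a release of this collection must be published to."""
        return list(self.urls)
```

For example, Collection(name="basketweaving", urls=["http://www.cmu.edu/basketweaving"]) describes a standard collection with a single production URL, while a collection with several URLs would return all of them from release_targets().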
6.2 Requirements: Simple Publishing Model
200.1 After a web publisher transfers content to the designated file system space for his collection, the content must be immediately viewable at the collection's URL on the appropriate production web server. Unlike the advanced publishing model described below, there is no staging server available for testing content under the simple model. Changes made to content in a publisher's specified file space will be "live" immediately after they are made.
200.10 Web publishers utilizing the simple publishing model should have access to all other content management tools (e.g. logging, link management, etc.).
200.20 Web publishers utilizing the simple publishing model must have access to a standardized, non-configurable report of activity for that collection.
6.3 Requirements: Advanced Publishing Model
300.1 After a web publisher transfers content to the designated file system space for her collection, the content must be immediately viewable at the collection's specified test URL on the staging server.
300.10 When a web publisher chooses to move her content from the staging server to the production server, the process to move that content to the production server must begin immediately. The actual time it will take for the content to appear on the production server will vary based on the size of the collection.
300.20 While content is actively being moved from the staging server to the production server, users viewing the affected collection at the same time must never see a mix of old content and new content.
300.30 Web publishers should have the ability to make their test URL on the staging server accessible only to the administrators of their collection.
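Requirement 300.20 does not prescribe a mechanism. One common way to avoid exposing a mix of old and new content is to copy each release into its own directory and then switch a symbolic link in a single atomic rename. The sketch below assumes a POSIX file system, assumes the web server's document root for the collection is a symbolic link (or does not exist yet), and uses hypothetical names (publish_atomically, release_dir, live_link); it is an illustration, not the mechanism this document specifies.

```python
import os
import uuid


def publish_atomically(release_dir: str, live_link: str) -> None:
    """Point live_link at release_dir in one atomic step.

    Assumes live_link is absent or already a symbolic link, and that the
    fully copied release already exists in release_dir.
    """
    live_link = os.path.abspath(live_link)
    # Create the new link under a temporary name in the same directory so
    # that the final rename stays on one file system.
    tmp_link = os.path.join(os.path.dirname(live_link), ".publish-" + uuid.uuid4().hex)
    os.symlink(os.path.abspath(release_dir), tmp_link)
    try:
        # On POSIX, rename replaces the old link atomically: a reader sees
        # either the old release or the new one, never a mixture.
        os.replace(tmp_link, live_link)
    except OSError:
        os.unlink(tmp_link)
        raise
```

Because the previous release directory is left untouched, the same function could be used to roll back by pointing the link at the older directory again.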
6.4 Requirements: Content Management
400.1 Content belonging to a single individual (e.g. user pages) must be deleted automatically when that individual's account is permanently deactivated or suspended. In the case of a deactivated account, the content need not be republished if the account is restored at some point in the future. In the case of a suspended account, the content should be republished when the account is reinstated.
400.5 Individual publishers must be able to specify a forwarding URL for their web pages before their account is deleted. If a forwarding URL is specified, a redirection page should be automatically generated for the user's pages, and be left in place for the same period of time as mail forwarding is currently offered.
400.10 The content management system must allow web publishers to request new collections via a web interface.
400.20 The content management system must maintain a release log, listing the date and time of each action on a collection (e.g. releases to production, modification of access rights). Each log entry should record the date and time, the action performed, the userid of the actor, whether the action was successful, and an appropriate error message if the action was not successful.
400.30 The content management system must provide web publishers with a tool to perform link validation.
400.40 The link validation tool must have the ability to verify both local links (e.g. relative or absolute links to content on the same server as the collection being validated) and external links (e.g. absolute links to content that is not on the same server as the collection being validated).
400.45 The link validation tool should have the ability to make experimental submissions to HTML forms within a collection, to verify that any CGI or other service called by the form is not returning an error.
400.50 The link validation tool must have the ability to notify multiple web publishers via email if it detects inactive links, and provide an escalation path for notification if inactive links persist for a specified period of time.
400.51 Web publishers must be able to specify the interval at which the link validation tool runs (e.g. daily, weekly, monthly).
400.55 The email escalation path should be single-tiered (e.g. email error reports to the collection webmaster, and then escalate to someone else if they aren't corrected within a week).
400.56 The email addresses for notification by the link validation tool must be limited to the list of email addresses specified in the access list for a collection.
400.60 Web publishers should have the ability to specify an "expiration date" for each collection they manage.
400.70 Web publishers should have the ability to specify whether a collection is deleted or archived (see below) upon expiration.
400.80 The content management system should allow for a content "archival" process, where upon expiration a collection is moved to a new, private URL where it is still available to the collection publishers, but not the general public. Archived content should also be removed from any search index.
400.90 The publishing subsystem should notify content publishers via email at certain intervals before their content is scheduled to expire. All notifications of pending content expiration should also be copied to an electronic bulletin board specified by the server administrator.
400.100 The content deletion and archival process must be automatic, requiring no manual processing by the server administrator.
400.110 When content is archived or deleted, a redirection or notification page should be automatically posted for a specified period of time.
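The distinction between local and external links in requirement 400.40, and the escalation rule in 400.50 and 400.55, can be sketched as follows. The function names, the "skipped" category for non-HTTP links, the hostname-only comparison, and the seven-day default are all assumptions made for illustration; the requirements above leave these details open.

```python
from typing import List
from urllib.parse import urljoin, urlparse


def classify_link(page_url: str, href: str) -> str:
    """Return "local", "external", or "skipped" for a link found on page_url."""
    # Resolve relative links against the page that contains them.
    target = urlparse(urljoin(page_url, href))
    if target.scheme not in ("http", "https"):
        return "skipped"  # e.g. mailto: links are not checked in this sketch
    page_host = (urlparse(page_url).hostname or "").lower()
    target_host = (target.hostname or "").lower()
    # Assumption: "same server" means "same hostname"; ports are ignored.
    return "local" if target_host == page_host else "external"


def recipients(days_broken: int,
               webmasters: List[str],
               escalation_contacts: List[str],
               escalate_after_days: int = 7) -> List[str]:
    """Addresses to notify about a link that has been broken for days_broken days."""
    notify = list(webmasters)
    if days_broken >= escalate_after_days:
        notify += list(escalation_contacts)
    return notify
```

Under requirement 400.56, both address lists would have to be drawn from the collection's access list; that check is left out of the sketch.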
6.5 Requirements: Content Management Tool User Interface
500.1 Web publishers must be able to go to a web page, authenticate, and be presented with a list of collections that they are permitted to publish or manage.
500.5 The main content management page must list the collection name and the URL to which the collection will be published.
500.10 For publishers utilizing the advanced publishing model, the content management tool must present options to move content from the staging server to the production server, restore content from the production server to the staging server, and delete all content from the staging server.
500.20 For all publishers, the content management tool must provide an interface for maintaining collection access lists and configuring the link validation tool.
500.25 The content management tool must provide an interface for publishers to restrict the availability of their content using the currently supported content-restriction method. Specifically, for our current web services, the content tool should provide an interface for publishers to automatically create the appropriate .htaccess file to restrict their content via KWeb.
500.30 The content management tool should display the last ten actions in the release log for the collection being managed.
500.35 The content management tool should provide a mechanism for viewing the entire release log for the collection.
500.40 The main URL of the staging server should host publishing instructions and the content management tools.
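The release log described in requirement 400.20, together with the two views of it required by 500.30 and 500.35, might be modeled along these lines. This is a minimal in-memory sketch; the class names (LogEntry, ReleaseLog) and method names are hypothetical, and a real system would need persistent storage.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class LogEntry:
    """One release-log record with the fields listed in requirement 400.20."""
    timestamp: datetime
    action: str                   # e.g. "release to production"
    userid: str                   # who performed the action
    success: bool
    error: Optional[str] = None   # message recorded only for failed actions


class ReleaseLog:
    """In-memory release log for a single collection."""

    def __init__(self) -> None:
        self._entries: List[LogEntry] = []

    def record(self, action: str, userid: str, success: bool = True,
               error: Optional[str] = None,
               when: Optional[datetime] = None) -> LogEntry:
        entry = LogEntry(when or datetime.now(), action, userid, success, error)
        self._entries.append(entry)
        return entry

    def recent(self, count: int = 10) -> List[LogEntry]:
        """The last `count` actions, newest first (500.30)."""
        if count <= 0:
            return []
        return self._entries[-count:][::-1]

    def all(self) -> List[LogEntry]:
        """The entire log, oldest first (500.35)."""
        return list(self._entries)
```

A management page could call recent() for the default view and all() when the publisher asks for the complete history.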
6.6 Requirements: Server Configuration, Reliability, and Monitoring
600.1 The production web servers and the staging server must be unavailable for no more than 15 minutes each week.
600.10 The production web servers and the staging server must be monitored every five minutes for an answer on port 80, and administrators must be notified immediately if any server does not answer on port 80 two consecutive times.
600.20 To facilitate rapid recovery in the event of a machine or disk failure, content that has been published to the production web servers must be mirrored to a second location.
600.25 All server configuration and support files must be mirrored to a second location to facilitate rapid machine recovery. The list of files to be mirrored should include all configuration files for web-related services, package configuration files, SSL certificates, password files, and srvtab files.
600.30 New versions of web-related software packages must not be automatically installed on the production web servers or the staging server by package or depot.
600.40 The staging server must have a robots.txt file instructing web spiders not to index any content on that server.
600.50 The production web servers and the staging server should be unavailable for no more than 15 minutes each month.
600.60 During periods of scheduled and unscheduled downtime, an application should answer at the hostname of the down server to notify users of the downtime.
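The monitoring rule in requirement 600.10 (probe port 80, notify after two consecutive failures) can be sketched as below. The names answers_on_port and ConsecutiveFailureTracker are hypothetical, the five-second timeout is an assumption, and notifying only once per outage is a design choice of this sketch rather than something the requirement states. Scheduling the probe every five minutes is left to an external scheduler.

```python
import socket
from typing import Dict


def answers_on_port(host: str, port: int = 80, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class ConsecutiveFailureTracker:
    """Decides when administrators should be notified about a server."""

    def __init__(self, threshold: int = 2) -> None:
        self.threshold = threshold            # 600.10: two consecutive failures
        self._failures: Dict[str, int] = {}   # consecutive failures per host

    def record(self, host: str, answered: bool) -> bool:
        """Record one probe result; return True if a notification is due."""
        if answered:
            self._failures[host] = 0
            return False
        self._failures[host] = self._failures.get(host, 0) + 1
        # Notify exactly when the threshold is reached, so that a long
        # outage produces one notification instead of one per probe.
        return self._failures[host] == self.threshold
```

A monitoring job would call tracker.record(host, answers_on_port(host)) for each production and staging server on every run and send a notification whenever the call returns True.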
7. Revision History
Document Revision # | Action Taken, Notes | When? | By Whom?
0.1 | Creation | 02/23/2001 | Dan McCarriar
0.2 | Added motivation paragraph, tweaked and rearranged requirements. | 02/28/2001 | Dan McCarriar
1.0 | Modified section numbering format; added, removed, and changed requirements based on 3/23/2001 meeting. | 04/04/2001 | Dan McCarriar
1.1 | Added requirement 100.15. | 04/10/2001 | Dan McCarriar
dlm@cmu.edu