Apache HTTP Server
Overview of the Apache EBCDIC Port
As of Version 1.3, the Apache HTTP Server
includes a port to (non-ASCII) mainframe machines which use
the EBCDIC character set as their native codeset.
(Initially, that support covered only the Fujitsu-Siemens family of
mainframes running the BS2000/OSD operating system, a mainframe OS
which features an SVR4-derived POSIX subsystem. Later, the two IBM
mainframe operating systems TPF and OS/390 were added.)
The EBCDIC-related directives EBCDICConvert, EBCDICConvertByType, and
EBCDICKludge are available only if the platform's character set is
EBCDIC (currently only on Fujitsu-Siemens' BS2000/OSD and IBM's
OS/390 and TPF operating systems). EBCDIC stands for Extended
Binary-Coded-Decimal Interchange Code and is the codeset used on
mainframe machines, in contrast to ASCII, which is used on almost all
microcomputers today. ASCII (or its extension Latin-1) is the basis
for the HTTP transfer protocol; all EBCDIC-based platforms therefore
need a way to configure the code set conversion rules required between
the EBCDIC-based mainframe host and the HTTP socket protocol.
On an EBCDIC based system, HTML files and other text files are
usually saved encoded in the native EBCDIC code set, while image
files and other binary data are stored with identical encoding as
on ASCII based machines. When the Apache server accesses documents,
it must therefore make a distinction between text files (to be
converted to/from ASCII, depending on the transfer direction)
and binary files (to be delivered unconverted).
Such a distinction can be made based on the assigned MIME type, or
based on the file extension (i.e., files sharing a common file
suffix).
By default, the configuration is symmetric for input and output
(i.e., when a PUT request is executed for a document which was
returned by a previous GET request, then the resulting uploaded
copy should be identical to the original file). However, the
conversion directives allow for specifying different conversions
for input and output.
The directives EBCDICConvert and
EBCDICConvertByType are used to
assign the conversion setting (On or Off) based on file
extensions or MIME types. Each configuration setting can be defined
for input only (e.g., PUT method), output only (e.g., GET method),
or both input and output. By default, the conversion setting is
applied for input and output.
Note that after modifying the conversion settings for a group of
files, it is not sufficient to restart the server. The reason for
this is the fact that a cached copy of a document (in a browser or
proxy cache) will not get revalidated by contents, but only by
date. Since the modification time of the document did not change,
browsers will assume they can reuse the cached copy.
To recover from this situation, you must either clear all cached
copies (browser and proxy cache!), or update the modification time
of the documents (using the touch
command on the server).
Note also that server-parsed documents (CGI scripts, .shtml files,
and other interpreted files like PHP scripts etc.) are not subject to
any input conversion and must therefore be stored in EBCDIC form
on the server side.
In the absence of any EBCDICConvertByType directive, and if no
matching EBCDICConvert was found, Apache falls back to an internal
heuristic which assumes that all documents with MIME types starting
with "text/", "message/" or "multipart/", as well as the MIME type
"application/x-www-form-urlencoded", are text documents
stored in EBCDIC, whereas all other documents are binary files.
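The fallback heuristic can be sketched in C as follows. This is a minimal illustration, not Apache's actual source: the function name default_is_text() is hypothetical, and the check is assumed to be a plain, case-sensitive prefix match on the MIME type string.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not an Apache API): returns 1 if the MIME type
 * belongs to the "text, stored in EBCDIC" class of the fallback
 * heuristic, 0 if the document is treated as binary. */
static int default_is_text(const char *mime)
{
    /* The last entry is also matched as a prefix, so a type carrying
     * parameters (e.g. "; charset=...") still matches. This is an
     * assumption of the sketch. */
    static const char *const prefixes[] = {
        "text/", "message/", "multipart/",
        "application/x-www-form-urlencoded"
    };
    size_t i;

    for (i = 0; i < sizeof(prefixes) / sizeof(prefixes[0]); i++) {
        if (strncmp(mime, prefixes[i], strlen(prefixes[i])) == 0)
            return 1;
    }
    return 0;
}
```

With this sketch, "text/html" and "multipart/byteranges" are classified as EBCDIC text, while "image/gif" and "application/postscript" are classified as binary.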
In order to provide backward compatibility with older versions of
Apache, the EBCDICKludge directive
allows for a less powerful mechanism to control the conversion of
documents to and from EBCDIC.
Note:
The EBCDICKludge directive is deprecated, since its functionality
is superseded by the more powerful
EBCDICConvert and
EBCDICConvertByType
directives.
The directives are applied in the following order:
- First, the configured EBCDICConvert
directives in the current context are evaluated in
configuration file order. As soon as a matching file extension
is found, the search stops and the configured conversion is
applied.
EBCDICConvert settings inherited from parent directories are
tested after the more specific (deeper) directory levels.
- If the EBCDICKludge directive is in effect,
the next step tests for a MIME type of the format
type/x-ascii-subtype. If the
document has such a type, then the
"x-ascii-" substring is removed and the
conversion is set to Off.
- In the next step, the configured
EBCDICConvertByType
directives are evaluated in configuration file order. If
the document has a matching MIME type, the search stops and
the configured conversion is applied.
EBCDICConvertByType settings inherited from parent
directories are tested after the more specific (deeper)
directory levels.
If no EBCDICConvertByType
directive at all exists in the current context, the server
falls back to the simple heuristics which assume that MIME
types starting with "text/", "message/" or "multipart/" (plus
the special type "application/x-www-form-urlencoded" used in
simple POST requests) imply a conversion, while all the rest
is delivered unconverted (i.e., binary).
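The evaluation order above can be sketched as a single lookup function. This is an illustrative model only, not the server's code. All names (conv_rule, resolve_conversion() and so on) are hypothetical. The rule arrays are assumed to be pre-sorted so that entries from deeper directory levels come first, wildcard support is reduced to a trailing "*", and the In/Out direction is left out. The text does not say what happens when EBCDICConvertByType directives exist but none matches, so the sketch reports that case as unspecified instead of guessing.

```c
#include <stddef.h>
#include <string.h>

enum { CONV_UNSPECIFIED = -1, CONV_OFF = 0, CONV_ON = 1 };

/* One configured directive argument: a file suffix or a MIME pattern. */
typedef struct { const char *pattern; int on; } conv_rule;

/* Fallback heuristic: text-like MIME types imply conversion. */
static int default_is_text(const char *mime)
{
    return strncmp(mime, "text/", 5) == 0
        || strncmp(mime, "message/", 8) == 0
        || strncmp(mime, "multipart/", 10) == 0
        || strncmp(mime, "application/x-www-form-urlencoded", 33) == 0;
}

/* Simplified wildcard: a trailing '*' matches any remainder. */
static int mime_matches(const char *pattern, const char *mime)
{
    size_t n = strlen(pattern);
    if (n > 0 && pattern[n - 1] == '*')
        return strncmp(pattern, mime, n - 1) == 0;
    return strcmp(pattern, mime) == 0;
}

static int ext_matches(const char *ext, const char *filename)
{
    size_t le = strlen(ext), lf = strlen(filename);
    return lf >= le && strcmp(filename + lf - le, ext) == 0;
}

/* 'mime' is modified in place when the kludge step strips "x-ascii-". */
static int resolve_conversion(const conv_rule *by_ext, size_t n_ext,
                              const conv_rule *by_type, size_t n_type,
                              int kludge, const char *filename, char *mime)
{
    size_t i;
    char *slash;

    /* Step 1: EBCDICConvert rules, first matching extension wins. */
    for (i = 0; i < n_ext; i++)
        if (ext_matches(by_ext[i].pattern, filename))
            return by_ext[i].on;

    /* Step 2: EBCDICKludge, type/x-ascii-subtype turns conversion off. */
    slash = strchr(mime, '/');
    if (kludge && slash != NULL && strncmp(slash + 1, "x-ascii-", 8) == 0) {
        memmove(slash + 1, slash + 9, strlen(slash + 9) + 1);
        return CONV_OFF;
    }

    /* Step 3: EBCDICConvertByType rules, first matching type wins. */
    for (i = 0; i < n_type; i++)
        if (mime_matches(by_type[i].pattern, mime))
            return by_type[i].on;

    /* Step 4: heuristic, used only if no by-type directive exists at all. */
    if (n_type == 0)
        return default_is_text(mime) ? CONV_ON : CONV_OFF;

    return CONV_UNSPECIFIED;   /* not described in the text above */
}
```

For example, with a hypothetical rule that switches conversion off for the suffix ".ahtml", a file named index.ahtml is delivered unconverted even though its MIME type is text/html, because the extension rules are consulted first.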
Since all Apache input and output is based upon the BUFF data type
and its methods, the easiest solution was to add the actual
conversion to the BUFF handling routines. The conversion must be
settable at any time, so BUFF flags were added which define
whether conversion is currently enabled for a BUFF object.
Two such flags exist: one for data read from the client
(ASCII to EBCDIC conversion) and one for data returned to the
client (EBCDIC to ASCII conversion).
During sending of the header, Apache determines (based on the
returned MIME type for the request) whether conversion should be used
or the document returned unconverted. It uses this decision to
initialize the BUFF flag when the response output begins.
Modules should therefore determine the MIME type for the
current request before initiating the response by calling
ap_send_http_header().
The BUFF flag is modified at
several points in the HTTP protocol:
- set (In and Out) before a request is
received (because the request and the request header
lines are always in ASCII format)
- set/unset (for Input data) when the request body is
received - depending on the content type of the request body
(because the request body may contain ASCII text or a binary file)
- set (for returned Output) before a response
header is sent (because the response header lines are always
in ASCII format)
- set/unset (for returned Output) when the
response body is sent - depending on the content type of the
response body (because the response body may contain text or
a binary file)
Additional transparent transitions may occur for extracting/inserting
the HTTP/1.1 chunking information from/into the input/output body data
stream, and for generating multipart headers for range
requests. (See RFC 2616 and src/main/http_protocol.c for details.)
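The flag transitions listed above can be modelled with two bits and four small state changes. This is a sketch under stated assumptions: the type sketch_buff, the flag names and the function names are all hypothetical stand-ins and do not reproduce the real BUFF structure or its flag constants.

```c
/* Hypothetical flag bits, one per transfer direction. */
enum {
    SKETCH_CONV_IN  = 1,   /* data read from the client: ASCII -> EBCDIC   */
    SKETCH_CONV_OUT = 2    /* data returned to the client: EBCDIC -> ASCII */
};

typedef struct { unsigned flags; } sketch_buff;

static void set_conv(sketch_buff *b, unsigned flag, int on)
{
    if (on)
        b->flags |= flag;
    else
        b->flags &= ~flag;
}

/* The request line and request headers are always ASCII on the wire. */
static void before_request(sketch_buff *b)
{
    set_conv(b, SKETCH_CONV_IN | SKETCH_CONV_OUT, 1);
}

/* The request body is converted only if its content type says "text". */
static void before_request_body(sketch_buff *b, int body_is_text)
{
    set_conv(b, SKETCH_CONV_IN, body_is_text);
}

/* The response header lines are always sent as ASCII. */
static void before_response_header(sketch_buff *b)
{
    set_conv(b, SKETCH_CONV_OUT, 1);
}

/* The response body is converted only if its content type says "text". */
static void before_response_body(sketch_buff *b, int body_is_text)
{
    set_conv(b, SKETCH_CONV_OUT, body_is_text);
}
```

Serving a GIF image would call before_request(), then before_response_header(), then before_response_body() with body_is_text set to 0, so the headers are converted while the image bytes pass through unchanged.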
- The relevant changes in the source are #ifdef'ed into two
categories:
#ifdef CHARSET_EBCDIC
- Code which is needed for any EBCDIC-based machine. This
includes character translations, differences in the
contiguity of the two character sets, and flags which
indicate which parts of the HTTP protocol have to be
converted and which do not.
#ifdef _OSD_POSIX | TPF | OS390
- Code which is needed only for the respective mainframe platform
(Fujitsu-Siemens BS2000/OSD, IBM TPF, or IBM OS/390). This deals
with include file differences and with socket and fork
implementation topics which are only required on that platform.
- The possibility to translate between ASCII and EBCDIC at the
socket level (on BS2000 POSIX, there is a socket option which
supports this) was intentionally not chosen, because
the byte stream at the HTTP protocol level consists of a
mixture of protocol related strings and non-protocol related
raw file data. HTTP protocol strings are always encoded in
ASCII (the GET request, any Header: lines, the chunking
information etc.) whereas the file transfer parts (i.e., GIF
images, CGI output etc.) should usually be just "passed through"
by the server. This separation between "protocol string" and
"raw data" is reflected in the server code by functions like
bgets() or rvputs() for strings, and functions like bwrite()
for binary data. A global translation of everything would
therefore be inadequate.
(In the case of text files, of course, provisions must be made so
that EBCDIC documents are always served in ASCII.)
This port therefore features a built-in protocol level conversion
for the server-internal strings (which the compiler translated to
EBCDIC strings) and thus for all server-generated documents.
- By examining the call hierarchy for the BUFF management
routines, I added an "ebcdic/ascii conversion layer" which
would be crossed on every puts/write/get/gets, and
conversion flags which allowed enabling/disabling the
conversions on-the-fly. Usually, a document crosses this
layer twice from its origin source (a file or CGI output) to
its destination (the requesting client): file ->
Apache, and Apache -> client.
The server can now read the header
lines of a CGI-script output in EBCDIC format, and then find
out that the remainder of the script's output is in ASCII
(like in the case of the output of a WWW Counter program: the
document body contains a GIF image). All header processing is
done in the native EBCDIC format; the server then determines,
based on the type of document being served, whether the
document body (except for the chunking information, of
course) is in ASCII already or must be converted from EBCDIC.
- By default, Apache assumes that documents with the MIME types
"text/*", "message/*", "multipart/*" and "application/x-www-form-urlencoded"
are text documents and are stored as EBCDIC files, whereas all
other files are binary files (and stored in a byte-identical
encoding as on an ASCII machine).
These defaults can be overridden on a by-MIME-type and/or
by-file-extension basis, using the directives
EBCDICConvertByType {On|Off}[={In|Out|InOut}] mimetype [...]
EBCDICConvert {On|Off}[={In|Out|InOut}] fileext [...]
where the mimetype argument may contain wildcards.
- Before adding the flexible conversion, non-text documents were
always served "binary" without conversion.
This seemed to be the most sensible choice for, e.g., GIF/ZIP/AU
file types (it of course requires the user to copy them to the
mainframe host using the "rcp -b" binary switch), but proved to be
inadequate for MIME types like model/vrml,
application/postscript and application/x-javascript.
- Server-parsed files are always assumed to be in the native (i.e.,
EBCDIC) format used on the machine (because they do not cross the
conversion layer when being read), and are converted after processing.
- For CGI output, the CGI script determines whether a conversion is
needed or not: by setting the appropriate Content-Type, text files
can be converted, or GIF output can be passed through unmodified
(depending on the conversion configured in the script's context).
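The conversion layer described in these notes, a translation step that every write crosses and that a flag can switch on or off, can be sketched as a table lookup. This is not the port's actual code: the names are hypothetical, and the table covers only letters, digits and the space character, which have the same codes in all common EBCDIC code pages. A real port would carry complete 256-entry tables for the platform's code page. The three separate letter runs also show the contiguity difference between the two character sets.

```c
#include <stddef.h>

/* Hypothetical flag: convert EBCDIC to ASCII on output. */
enum { SKETCH_CONV_OUT = 1 };

/* Partial EBCDIC -> ASCII table, filled in by init_e2a(). */
static unsigned char e2a[256];

static void init_e2a(void)
{
    /* ASCII values are written numerically so that the sketch does not
     * depend on the character set of the compiling machine. */
    static const struct { unsigned char ebcdic, ascii, count; } runs[] = {
        {0x81, 0x61, 9}, {0x91, 0x6A, 9}, {0xA2, 0x73, 8}, /* a-i j-r s-z */
        {0xC1, 0x41, 9}, {0xD1, 0x4A, 9}, {0xE2, 0x53, 8}, /* A-I J-R S-Z */
        {0xF0, 0x30, 10},                                  /* 0-9         */
        {0x40, 0x20, 1}                                    /* space       */
    };
    size_t i, k;

    for (i = 0; i < 256; i++)
        e2a[i] = 0x3F;              /* ASCII '?' marks unmapped bytes */
    for (i = 0; i < sizeof(runs) / sizeof(runs[0]); i++)
        for (k = 0; k < runs[i].count; k++)
            e2a[runs[i].ebcdic + k] = (unsigned char)(runs[i].ascii + k);
}

/* Copies n bytes from src to dst, translating them only when the
 * output conversion flag is set; binary data passes through as-is. */
static void sketch_write(unsigned flags, const unsigned char *src,
                         unsigned char *dst, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        dst[i] = (flags & SKETCH_CONV_OUT) ? e2a[src[i]] : src[i];
}
```

After init_e2a() has run, writing the EBCDIC bytes 0xC7 0xC9 0xC6 with the flag set yields the ASCII bytes for "GIF", while writing them with the flag cleared leaves them untouched, which is the behaviour wanted for image data.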
Binary Files
When exchanging binary files between the mainframe host and a
Unix machine or Windows PC, be sure to use the ftp "binary"
(TYPE I) command, or use the rcp -b command from the mainframe host
(Unix rcp implementations do not support the -b switch).
Text Documents
The default assumption of the server is that text files
(i.e., all files whose Content-Type: starts with text/) are stored
in the native character set of the host, EBCDIC.
Server Side Included Documents
SSI documents must currently be stored in EBCDIC only. No
provision is made to convert them from ASCII before processing.
The same holds for other interpreted languages, like
mod_perl or mod_php.