Anonymous (nobody@replay.com)
Sat, 29 Aug 1998 08:44:42 +0200
Steganography via Arithmetic Compression
One method of steganography is to hide the embedded message by means of
changing random elements of the cover message. For example, the cover
message may have various words where synonyms can be substituted.
Everywhere the word "couch" appears it could be substituted with
"sofa" and vice versa. Each substitutable word would encode one bit
of the embedded message. The sender and receiver would share the same
list of synonyms and so the receiver could identify all the places
where the sender would have made a substitution, and recover one bit
of the embedded message.
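As a minimal sketch of this first example (the synonym table, the
function names, and the "alphabetically first synonym means bit 0"
convention below are illustrative assumptions, not part of the scheme
as described), each substitutable word carries one bit:

    SYNONYMS = {"couch": "sofa", "sofa": "couch"}

    def embed_bits(words, bits):
        """Rewrite each substitutable word so it carries one message bit."""
        out, i = [], 0
        for w in words:
            if w in SYNONYMS and i < len(bits):
                pair = sorted([w, SYNONYMS[w]])  # shared, deterministic order
                out.append(pair[bits[i]])
                i += 1
            else:
                out.append(w)
        return out

    def extract_bits(words):
        """Recover one bit from every substitutable word that appears."""
        return [sorted([w, SYNONYMS[w]]).index(w)
                for w in words if w in SYNONYMS]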
Another example would be an interactive protocol, like an online poker
game. The dealer must randomly select cards from a deck. Each of
these random choices could be guided by data from the message to be
embedded. A random choice of a card from a deck of 52 could encode
almost six bits of the embedded message.
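For instance, the capacity of one uniformly random draw from a 52-card
deck is log2(52) bits, which a quick check shows is just under six:

    import math

    # Bits of embedded data carried by one uniform choice among 52 cards.
    print(math.log2(52))   # ~5.70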
For cases like the first example, where there are only two equally
likely alternatives each time data from the embedded message is to be
inserted, the process is simple. One bit of the embedded message can
be used to make the decision. In other cases, where there are more
than two alternatives or where they may not be equally likely, it is a
more complicated matter to embed the data such that it can be later
extracted.
Arithmetic compression provides a good algorithm for this process.
This report discusses the use of arithmetic compression to embed data
by means of random alterations to the cover traffic.
The analysis makes certain assumptions:
- The data to be embedded is a completely random bit string. Both 0
and 1 bits should be equally probable, and there should be no
detectable structure to the data. This can be achieved by
encrypting the message, or in some cases by using a good
compression algorithm.
- The sender and receiver of the embedded message have the same model
for where random substitutions will be made, and for the
probabilities which will be used. For the synonym example, both
sides must agree about which synonyms will be used, and both must
agree about the probabilities with which they will be chosen. Some
words might have three choices, for example, with probability 1/3
for each choice.
- The cover message is faithfully transmitted from sender to receiver,
without alteration.
Arithmetic compression is a very efficient compression method. A good
description of arithmetic compression is in Ian H. Witten, Radford
M. Neal, and John G. Cleary, "Arithmetic Coding for Data Compression,"
Communications of the ACM, vol. 30, no. 6, p. 520, June 1987.
Briefly, the data stream to be compressed must consist of a string of
symbols from some alphabet. The compressor must have a model, which is
a way of calculating the probability that each possible symbol will
occur in each context. The model may be fixed, as with letter
frequencies in English, or it may be variable, taking context into
account, such as other letters which preceded the current one.
As one way of organizing the process, the compression "engine" may be
called once for each symbol of the input string. Along with the
symbol, the compressor is provided with a probability table giving the
probabilities of all symbols which might have appeared at that
location. Using this information the compressor gradually outputs the
compressed representation of the input stream.
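The sketch below shows such an engine in miniature, using exact
rational arithmetic for clarity (the class and method names are
illustrative; a practical coder would use fixed-precision integer
arithmetic with renormalization, as described in the paper cited
above):

    from fractions import Fraction

    class ArithmeticEncoder:
        """Toy compression engine: called once per input symbol with a
        probability table mapping every possible symbol at that point
        to its probability (exact Fractions summing to 1)."""

        def __init__(self):
            self.low, self.high = Fraction(0), Fraction(1)

        def encode_symbol(self, symbol, probs):
            # Partition the current interval [low, high) in proportion to
            # the table (in a symbol order shared with the decoder) and
            # keep the sub-interval of the symbol that actually occurred.
            span = self.high - self.low
            cum = Fraction(0)
            for s in sorted(probs):
                p = probs[s]
                if s == symbol:
                    self.high = self.low + span * (cum + p)
                    self.low = self.low + span * cum
                    return
                cum += p
            raise ValueError("symbol not in model")

        def finish(self):
            # Emit just enough bits of a binary fraction lying inside
            # [low, high) for a decoder to retrace every interval split.
            target = (self.low + self.high) / 2
            bits, value, scale = [], Fraction(0), Fraction(1)
            while not bits or value < self.low:
                scale /= 2
                target *= 2
                bit = int(target)
                target -= bit
                bits.append(bit)
                value += bit * scale
            return bits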
A feature of arithmetic compression is that it allows symbols to be
represented with fractional bits. A given symbol does not necessarily
compress to an integer number of bits. One compressed bit may
represent two or more symbols. This allows arithmetic compression to
be extremely efficient.
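A quick way to see the fractional-bit behavior: under an accurate
model a symbol of probability p ideally costs -log2(p) bits, so a
likely symbol costs well under one bit.

    import math

    # Ideal cost in bits of a single symbol with model probability p.
    for p in (1/2, 3/4, 51/52):
        print(p, -math.log2(p))   # 1.0, ~0.415, ~0.028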
Furthermore, if the model is accurate (that is, the input symbols are
actually distributed according to the model's predicted frequencies),
then the output of the compressor will be "perfect". It will encode
all of the entropy of the input message in the most compressed form
possible. This means that the output will be fully random: there will
be no structure in the output and 1's and 0's will be equally probable.
Another relevant property is that arithmetic compression is a lossless
compression scheme. Corresponding to the compression engine is a
decompression engine. This takes the compressed bit stream as input
along with model information, and produces decompressed output which
exactly matches the compressor's original input.
The decompression engine may be structured so that it is called
repeatedly with the model information for the probability distribution
of the possible next symbols. It will use the bits in the compressed
data stream along with this model to select the symbol which had been
seen at that point by the compression engine. Each call returns one
more symbol of the original message.
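A matching sketch of the decompression engine, under the same
illustrative conventions as the encoder above (same shared symbol
ordering, exact Fractions for the probabilities):

    from fractions import Fraction

    class ArithmeticDecoder:
        """Toy decompression engine: holds the compressed bits as a
        binary fraction and, on each call, returns the symbol whose
        sub-interval contains that value."""

        def __init__(self, bits):
            self.value = sum(Fraction(b, 2 ** (i + 1))
                             for i, b in enumerate(bits))
            self.low, self.high = Fraction(0), Fraction(1)

        def decode_symbol(self, probs):
            # Same interval split as the encoder; pick and keep the
            # sub-interval that contains the compressed value.
            span = self.high - self.low
            cum = Fraction(0)
            for s in sorted(probs):
                p = probs[s]
                lo = self.low + span * cum
                hi = self.low + span * (cum + p)
                if lo <= self.value < hi:
                    self.low, self.high = lo, hi
                    return s
                cum += p
            raise ValueError("probabilities must sum to 1")

    # Ordinary round trip under a shared table: compress, then decompress.
    table = {"a": Fraction(1, 4), "b": Fraction(3, 4)}
    enc = ArithmeticEncoder()
    for s in "abba":
        enc.encode_symbol(s, table)
    dec = ArithmeticDecoder(enc.finish())
    print([dec.decode_symbol(table) for _ in "abba"])   # ['a', 'b', 'b', 'a']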
This process of arithmetic compression and decompression can be applied
to solve the problem of embedding data in cover traffic, but in a
rather surprising way. Consider the problem first from the point of
view of the steganographic extraction engine, which must recover the
embedded message from the cover data.
It sees a sequence of input elements where random choices were made.
For each one, it knows which element was chosen, along with the
probability distribution which was used to make that choice. This is
exactly the information needed by an arithmetic compression engine. If
the steganography extractor feeds these random choices along with the
probability distribution into an arithmetic compression engine, it will
output a fully compressed, perfectly random data buffer which encodes
all the information of the random choices.
Now consider what would happen if we could take that random data buffer
and give it to the steganographic embedder as a message to be embedded.
As the embedder sees each element of the cover traffic where a random
selection must be made, it knows the probability distribution to use.
This is the information needed by an arithmetic decompression engine to
select an output symbol. The embedder can use the arithmetic
decompression engine, driven by the random data, to select output
symbols with the probability distribution needed for each selection.
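Continuing the sketch, embedding is just running the decoder over the
message bits, one call per choice point (the message bits and the
three-entry model below are made up for illustration):

    from fractions import Fraction

    message_bits = [1, 0, 1, 1, 0, 1, 0, 0]   # ciphertext bits in practice
    model = [                                  # one table per choice point
        {"couch": Fraction(1, 2), "sofa": Fraction(1, 2)},
        {"big": Fraction(1, 3), "huge": Fraction(1, 3),
         "large": Fraction(1, 3)},
        {"couch": Fraction(1, 2), "sofa": Fraction(1, 2)},
    ]

    embedder = ArithmeticDecoder(message_bits)
    cover_choices = [embedder.decode_symbol(table) for table in model]
    print(cover_choices)   # ['sofa', 'huge', 'couch'] for these bits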
Assuming that the sender and receiver share the same model of the
probabilities for each random choice, the data which is input to the
arithmetic compression engine will match the data which is output from
the arithmetic decompression engine. In effect, the model specifies a
one-to-one mapping between the compressed data and the uncompressed
data.
In the steganography case, we know that the random elements in the
cover traffic are the same at both ends (that is, the receiver sees the
same cover message produced by the sender). If both sides share the
same probability model, the one-to-one mapping implies that the
compressed data will also be the same at both ends. Therefore the
extractor will faithfully produce the same message which was given to
the embedder to hide.
The steganography case is the reverse of the usual way of using
arithmetic compression. Normally the uncompressed data is given to the
compressor, which produces a compressed representation. This is then
given to the decompressor, which restores the original data. In the
steganography case, data is first given to the decompressor which
treats it as compressed data, to create an uncompressed data stream.
This uncompressed stream is then given to the compressor, which
restores the original compressed data. The same one-to-one mapping is
used in each case, but the order of operations is reversed.
The result is that embedding a message is a matter of using arithmetic
decompression on the embedded data, providing at each choice point a
probability distribution for the alternatives. Extracting the original
message is done by using arithmetic compression, providing it the
symbol which appears at each choice point along with the same
probability distribution used by the embedder. The properties of
arithmetic compression will ensure that the choices at each choice
point are made following the specified probability distribution, and
that the extracted message will match the one originally embedded.
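Continuing the same sketch, extraction feeds the observed choices back
through the encoder with identical tables; the bits that the final
interval fully determines form a prefix of the embedded message (the
determined_bits helper is an illustrative addition):

    def determined_bits(low, high):
        """Leading bits shared by every value in [low, high)."""
        bits, half = [], Fraction(1, 2)
        while True:
            if high <= half:
                bits.append(0)
                low, high = low * 2, high * 2
            elif low >= half:
                bits.append(1)
                low, high = low * 2 - 1, high * 2 - 1
            else:
                return bits

    extractor = ArithmeticEncoder()
    for choice, table in zip(cover_choices, model):
        extractor.encode_symbol(choice, table)

    recovered = determined_bits(extractor.low, extractor.high)
    print(recovered)                              # [1, 0, 1]
    assert recovered == message_bits[:len(recovered)]

Only about three and a half bits of capacity exist in these three
choice points, so only the first three message bits are pinned down;
the rest of the buffer plays the role of the random padding discussed
below.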
In effect, the arithmetic decompressor acts as a data-driven random
number generator. It produces random selections from a probability
distribution, but it does so in such a way that it faithfully encodes
the data from the compressed input stream. As long as the compressed
input is fully random, the output selections will follow the correct
probabilities.
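One way to see this in the sketch above is to drive the decoder with
genuinely random bits and check that its selections land on each
symbol with the modeled frequency (the equal-thirds table is again
only an example):

    import secrets
    from fractions import Fraction

    table = {"big": Fraction(1, 3), "huge": Fraction(1, 3),
             "large": Fraction(1, 3)}
    counts = {s: 0 for s in table}
    for _ in range(3000):
        random_bits = [secrets.randbits(1) for _ in range(32)]
        counts[ArithmeticDecoder(random_bits).decode_symbol(table)] += 1
    print(counts)   # each symbol should appear roughly 1000 times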
One other aspect of arithmetic compression is relevant, which is that
the data does not carry an indication of its end point. The
decompressor can continue to run and output more data upon request.
But this is actually a desirable feature of a steganographic system.
The embedded data should not reveal its own length; that information
should be available only once the data has been decrypted by the
recipient. The embedded data should be padded with random bits which
only the intended recipient will know to ignore. This way the
steganalyst sees only random data when he applies the extraction
algorithm, which is exactly what he would see if the cover stream were
produced by a random number generator with no steganographic data being
embedded.
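A final illustrative fragment for the padding point (the target length
of 64 bits is an arbitrary assumption): the embedder extends the
encrypted message bits with random filler so that the extracted stream
reveals nothing about where the real message ends.

    import secrets

    def pad_bits(bits, total_len):
        """Append random filler; only the recipient, after decrypting,
        learns where the real message ends."""
        return bits + [secrets.randbits(1)
                       for _ in range(total_len - len(bits))]

    padded = pad_bits(message_bits, 64)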