August 26, 2008 (Lecture 9)

Error Detection and Correction

Data in transmission is rarely perfectly safe. Transmitting data subjects it to interference. Occasionally the data gets corrupted along the way. Corrupted data is very hard for application software to manage -- it can't reasonably be expected to check every single piece of data for errors before using it. So, we take care of this within the network software.

The basic idea is that we break the data into pieces, known as frames. To each of these frames, we add some additional information -- Error Detection Codes (EDCs) or Error Correction Codes (ECCs). These extra bits will enable us to detect and/or correct most failures.

In either case, sending these extra bits wastes some network time, so we like to keep them to a minimum, thereby spending most of our time sending the actual data. And, ECCs are larger than EDCs, so we use them sparingly. In fact, we generally only use ECCs in storage systems, e.g. disk drives, or in particularly error-prone but high-bandwidth media, e.g. satellites. Otherwise, we use only EDCs, because we can often resend the data on the rare occasions that it becomes corrupted in transmission.

In summary, by adding ECCs to data, we gain the ability to correct some errors. By adding EDCs, we gain the ability to detect some errors -- but we are unable to correct them, so our recourse is to discard the frame.

Visualizing Error Correction and Error Detection

Imagine that we want to send a message over the network. Our message is either a) RUN or b) HIDE. We could very efficiently encode these messages as follows:

But, the problem with this encoding is that if a single bit gets flipped, our message is undetectably corrupted. By flipping just one bit, one valid codeword is morphed into another valid codeword.

Now, let's consider another encoding, as follows:

The 2-bit encoding requires twice as many bits to send the messages -- but offers more protection from corruption. If we, again, assume a single-bit error, we still garble our transmission: either valid codeword, "00" or "11", can become "01" or "10". But, neither can, in the case of a single-bit error, be transformed into the other. So, in this case, we can detect, but not correct, the error. It would take a two-bit error for the error to become an undetectable Byzantine error.

Now, let's consider a 3-bit encoding, which takes three times as much storage as our original encoding and 50% more storage than the 2-bit encoding, which allowed for the detection of single-bit errors:

Given the encoding above, if we encounter a single-bit error, "000" can become "100", "010", or "001", each of which is 2 bits from our other valid codeword, "111". Similarly, should a transmission of "111" encounter a single-bit error, it will become "011", "101", or "110" -- still closer to "111" than "000".

The upshot is this. Since single-bit errors are much more common than multiple-bit errors, we correct a defective codeword by assuming that it is intended to be whatever valid codeword happens to be closest to it as measured in these bit-flips. If what we have doesn't match any codeword, but is equally close to more than one codeword, it is a detectable, but uncorrectable error.
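
To make the "closest valid codeword" rule concrete, here is a minimal Python sketch. It assumes the 3-bit code with the codewords "000" and "111" discussed above; the function name decode_nearest is made up for illustration. The last call assumes a 2-bit code whose codewords are "00" and "11", to show the tie case that is detectable but not correctable.

```python
# Assumed code from the example: two valid 3-bit codewords.
CODEWORDS = ["000", "111"]

def decode_nearest(received, codewords=CODEWORDS):
    """Return the unique closest valid codeword, or None if there is a tie."""
    # Pair each codeword with the number of bit positions where it differs.
    scored = sorted(
        (sum(x != y for x, y in zip(received, c)), c) for c in codewords
    )
    if len(scored) > 1 and scored[0][0] == scored[1][0]:
        return None  # equally close to two codewords: detectable, not correctable
    return scored[0][1]

print(decode_nearest("010"))               # 000
print(decode_nearest("110"))               # 111
print(decode_nearest("01", ["00", "11"]))  # None
```

Note that a valid codeword decodes to itself, since its distance to itself is 0.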

Hamming Distance

In the examples above, I measured the "distance" between two codewords in terms of the number of bits that one would need to flip to convert one codeword into the other. This distance is known as the Hamming Distance.

The Hamming Distance of a code as a whole (as opposed to the distance between two individual codewords) is the minimum Hamming distance between any pair of codewords within the code. This is important, because this pair is the code's weak link with respect to error detection and correction.

This is a useful measure, because as long as the number of bits in error is less than the Hamming Distance, the error will be detected -- the codeword will be invalid. Similarly, if the Hamming distance between the codewords is more than double the number of bits in the error, the defective codeword will be closer to the correct one than to any other.
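
As a rough illustration of these rules, the sketch below computes the Hamming distance of a whole code and derives how many bit errors it can detect or correct. The three codes are the 1-bit, 2-bit, and 3-bit encodings from the earlier discussion, with the 1-bit and 2-bit codewords assumed to be "0"/"1" and "00"/"11". The helper names are made up for illustration.

```python
from itertools import combinations

def hamming_distance(a, b):
    """Number of positions in which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))

def code_distance(codewords):
    """Minimum Hamming distance over every pair of codewords in the code."""
    return min(hamming_distance(a, b) for a, b in combinations(codewords, 2))

def capabilities(codewords):
    d = code_distance(codewords)
    return {
        "distance": d,
        "detects_up_to": d - 1,          # errors of fewer than d bits are detected
        "corrects_up_to": (d - 1) // 2,  # d must be more than double the error size
    }

for code in (["0", "1"], ["00", "11"], ["000", "111"]):
    print(code, capabilities(code))
```

The 1-bit code detects nothing, the 2-bit code detects 1-bit errors, and the 3-bit code can correct 1-bit errors or detect errors of up to 2 bits.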

How Many Check Bits?

Let's consider some code that includes dense codewords. This would, for example, be the case if we just took raw data and chopped it into pieces for transmission or if we enumerated each of several messages. Given this situation, we can add extra bits to increase the Hamming distance between our codewords. But, how many bits do we need to add? And, how do we encode them? Hamming answered both of these questions.

Let's take the first question first. If we have a dense code with m message bits, we will need to add some r check bits to the message to put distance between the code words. So, the total number of bits in the codeword is n, such that n = m + r

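If we do this, each legal codeword will have n illegal codewords within 1 bit of it (flip each of its n bits in turn). To be able to correct a 1-bit error, each of the 2^m legal messages needs those n neighbors plus itself, (n + 1) bit patterns in all, reserved for it alone. That way, a 1-bit error will fall closer to one codeword than to any other.

Since there are only 2^n bit patterns in total, we can express this relationship as below:

  (n + 1) * 2^m <= 2^n

Substituting n = m + r and dividing both sides by 2^m gives the lower limit on the number of check bits:

  (m + r + 1) <= 2^r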
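
The usual form of this bound is (m + r + 1) <= 2^r, where m is the number of message bits and r the number of check bits. The short sketch below simply searches for the smallest r that satisfies it; the function name min_check_bits is made up for illustration.

```python
def min_check_bits(m):
    """Smallest r such that m + r + 1 <= 2**r."""
    r = 0
    while m + r + 1 > 2 ** r:
        r += 1
    return r

for m in (1, 4, 7, 11, 26, 57):
    r = min_check_bits(m)
    print(f"m = {m:2d}  r = {r}  n = {m + r}")
```

For a 7-bit message this gives r = 4 and n = 11, and for a 1-bit message it gives r = 2 and n = 3, which matches the "000"/"111" encoding from earlier.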

Hamming's Code

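Hamming offered an encoding that achieves the lower limit described above. This code allows the correction of 1-bit errors (but not, at the same time, the detection of 2-bit errors). The encoding works as follows: number the bits of the codeword from 1, starting at the left. The bits at the power-of-two positions (1, 2, 4, 8, ...) are check bits, and the remaining positions hold the message bits in order. Each check bit is a parity bit for a group of positions: those whose number, written in binary, includes that power of two. In the example below, each check bit is set so that its group has even parity.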

Hamming Code: Example Encoding

Let's consider encoding the message "1100001". Notice that we'll use the power-of-two bits for the extra check bits.

  _    _    1    _    1    0    0    _    0    0    1
  1    2    3    4    5    6    7    8    9    10   11
  

So, the message that we send over the network is as follows:

  1    0    1    1    1    0    0    1    0    0    1
  1    2    3    4    5    6    7    8    9    10   11
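
Here is a minimal sketch of the encoding in Python. It assumes bit positions are numbered from 1 starting at the left, that the check bit at position p covers every position whose binary representation includes p, and that parity is even; those assumptions reproduce the codeword shown above. The function name hamming_encode is made up for illustration.

```python
def hamming_encode(message):
    """Encode a bit string, placing even-parity check bits at positions 1, 2, 4, 8, ..."""
    m = len(message)
    r = 0
    while m + r + 1 > 2 ** r:  # smallest r with m + r + 1 <= 2**r
        r += 1
    n = m + r
    code = [0] * (n + 1)       # index 0 is unused; positions are 1-based
    data = iter(message)
    for pos in range(1, n + 1):
        if pos & (pos - 1):    # nonzero means pos is not a power of two
            code[pos] = int(next(data))
    p = 1
    while p <= n:
        parity = 0
        for pos in range(1, n + 1):
            if pos & p:        # pos belongs to check bit p's group
                parity ^= code[pos]
        code[p] = parity       # makes the group's parity even
        p <<= 1
    return "".join(str(b) for b in code[1:])

print(hamming_encode("1100001"))  # 10111001001
```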
  

Hamming's Code: Correcting a 1-Bit error

Given the code above, what happens if we encounter a 1-bit error? Consider the following:

  Intended Message:  1    0    1    1    1    0    0   1    0   0   1
  Corrupted Message: 1    0    1    1    1    0    1   1    0   0   1
  

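Visually, we can see that bit-7 is in error. But, how can we determine this algorithmically? We recalculate the parity associated with each of our groups: Bit-1's group (bits 1, 3, 5, 7, 9, and 11), Bit-2's group (bits 2, 3, 6, 7, 10, and 11), Bit-4's group (bits 4, 5, 6, and 7), and Bit-8's group (bits 8, 9, 10, and 11). Each group should have even parity. If we add up the position numbers of the check bits whose groups fail, the sum is the position of the bit in error.

So, let's consider the example above. In the corrupted message, Bit-1's group contains five 1s, so it fails. Bit-2's group contains three 1s, so it fails. Bit-4's group contains three 1s, so it fails. Bit-8's group contains two 1s, so it passes. The failing check bits add up to 1 + 2 + 4 = 7, so we flip bit-7 back to recover the intended message.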
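
The check can be sketched in Python as follows. This assumes the same conventions as the encoding example: positions numbered from 1 at the left, check bits at the power-of-two positions, and even parity. The function names are made up for illustration.

```python
def hamming_syndrome(codeword):
    """Sum of the check-bit positions whose even-parity group fails (0 = no error seen)."""
    n = len(codeword)
    bits = [0] + [int(b) for b in codeword]  # index 0 unused; positions are 1-based
    syndrome = 0
    p = 1
    while p <= n:
        parity = 0
        for pos in range(1, n + 1):
            if pos & p:                      # pos belongs to check bit p's group
                parity ^= bits[pos]
        if parity:                           # odd parity: this group fails
            syndrome += p
        p <<= 1
    return syndrome

def hamming_correct(codeword):
    """Return (corrected codeword, syndrome), assuming at most one bit is wrong."""
    s = hamming_syndrome(codeword)
    if s == 0 or s > len(codeword):
        return codeword, s                   # nothing to fix, or not a 1-bit error
    flipped = "1" if codeword[s - 1] == "0" else "0"
    return codeword[:s - 1] + flipped + codeword[s:], s

print(hamming_correct("10111011001"))  # ('10111001001', 7)
```

If two bits are wrong, the sum still points at some position, so this sketch would "correct" the wrong bit. That is why this code cannot correct 1-bit errors and detect 2-bit errors at the same time.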

It is worth noting that Hamming codes can be precomputed and placed into tables, rather than being computed on the fly.

Cyclic Redundancy Check (CRC)

Perhaps the most popular form of error detection in networks is the Cyclic Redundancy Check (CRC). The mathematics behind CRCs is well beyond the scope of the course. We'll leave it for 21-something. But, let's take a quick look at the mechanics.

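Smart people develop a generator polynomial. These are well-known and standardized. For example:

  CRC-16:     x^16 + x^15 + x^2 + 1
  CRC-CCITT:  x^16 + x^12 + x^5 + 1

These generators are then represented in binary, representing the presence or absence of each power-of-x term as a 1 or 0, respectively, as follows:

  CRC-16:     1 1000 0000 0000 0101
  CRC-CCITT:  1 0001 0000 0010 0001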
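
As a small illustration of this mapping, the sketch below turns a list of exponents into the binary form. The polynomials used here are examples chosen for illustration, not necessarily the ones shown in lecture: x^4 + x + 1 is the small generator used in the division example later in these notes, and x^16 + x^15 + x^2 + 1 is the widely published CRC-16 generator. The function name generator_bits is made up.

```python
def generator_bits(exponents):
    """Binary form of a generator polynomial, highest power first."""
    degree = max(exponents)
    present = set(exponents)
    # One bit per power of x from x^degree down to x^0.
    return "".join("1" if e in present else "0" for e in range(degree, -1, -1))

print(generator_bits([4, 1, 0]))            # 10011
print(len(generator_bits([16, 15, 2, 0])))  # 17
```

A degree-16 generator occupies 17 bits, one for each power from x^16 down to x^0.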

In order to obtain the extra bits to add to our codeword, we append one 0 bit to our message for each bit of the checksum and then divide the result by the generator. The remainder is the checksum, which we send in place of those appended 0s. Our checksum will have one bit fewer than our generator -- note that, for example, the 16-bit generator has 17 coefficients (x^16 down to x^0). Upon receiving the message, we repeat the computation. If the checksum matches, we assume the transmission to be correct; otherwise, we assume it to be incorrect. Common CRCs can detect a mix of single and multi-bit errors.

The only detail worth noting is that the division isn't traditional decimal long division. It is modulo-2 division. This means that both addition and subtraction degenerate to XOR -- the carry-bit goes away.

Let's take a quick look at an example:

                             1 1 0 0 0 0 1 0 1 0
                ________________________________ 
   1 0 0 1 1    |    1 1 0 1 0 1 1 0 1 1 0 0 0 0 
                     1 0 0 1 1
                     ---------
                       1 0 0 1 1
                       1 0 0 1 1 
                       ---------
                                 1 0 1 1 0
                                 1 0 0 1 1
                                 ---------
                                     1 0 1 0 0
                                     1 0 0 1 1 
                                     ---------
                                         1 1 1 0   <------This is the checksum
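
The same division can be sketched in a few lines of Python. This is an illustrative bit-string version, not how production CRC code is written (real implementations typically work a byte at a time with lookup tables). The function names are made up, and the generator is assumed to be at least 2 bits long.

```python
def crc_remainder(message, generator):
    """Modulo-2 long division of message (with zeros appended) by generator."""
    r = len(generator) - 1                    # number of checksum bits
    bits = [int(b) for b in message] + [0] * r
    gen = [int(b) for b in generator]
    for i in range(len(message)):
        if bits[i]:                           # leading bit is 1: "subtract" (XOR) the generator
            for j, g in enumerate(gen):
                bits[i + j] ^= g
    return "".join(str(b) for b in bits[-r:])

def crc_check(frame, generator):
    """Recompute the checksum over the message part and compare with the one received."""
    r = len(generator) - 1
    return crc_remainder(frame[:-r], generator) == frame[-r:]

checksum = crc_remainder("1101011011", "10011")
print(checksum)                                   # 1110
print(crc_check("1101011011" + checksum, "10011"))  # True
print(crc_check("1101011010" + checksum, "10011"))  # False
```

The first call reproduces the worked example: message 1101011011, generator 10011, checksum 1110.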

  

A Second Look At Framing

At the beginning of today's conversation, I mentioned that we carved messages into small chunks known as frames. Now that we've discussed error correction, we can discuss framing a bit further.

Framing is essential for the management of network communication. We need to cut data into pieces in order that we can label it with checksums for error correction and detection. We also need manageable pieces should we need to discard and possibly resend a chunk.

So, what should be the frame size? Well, this depends on the data rate and the error rate. The higher the error rate, the smaller the frame. The reason is simple -- if we have a large frame and have to throw something away, we'll end up throwing out more. If we have an error rate of one bit per million and our frame size is ten million bits, we'll virtually never get to send the frame -- each attempt will have, on average, 10 errors. But, if the frame size is 1,000 bits, almost all of the frames will go through without error.
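
To put rough numbers on this, the sketch below assumes that bit errors are independent, which real links only approximate (errors often come in bursts). Under that assumption a frame of n bits arrives intact with probability (1 - p)^n, where p is the bit error rate. The function names are made up for illustration.

```python
def expected_errors(bit_error_rate, frame_bits):
    """Average number of corrupted bits per frame."""
    return bit_error_rate * frame_bits

def p_frame_ok(bit_error_rate, frame_bits):
    """Probability that every bit of the frame arrives intact (independent errors assumed)."""
    return (1.0 - bit_error_rate) ** frame_bits

for n in (1_000, 10_000_000):
    print(f"{n:>10} bits: {expected_errors(1e-6, n):g} errors expected, "
          f"P(intact) = {p_frame_ok(1e-6, n):.6f}")
```

With one error per million bits, a 1,000-bit frame gets through about 99.9% of the time, while a ten-million-bit frame gets through well under 0.01% of the time.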

So, why not have very small frames, whether we need them or not? Well, frames typically contain non-data. For example, they usually have special "framing bits". These bits make it possible to resynchronize with the beginning of the frame if bits are lost or injected into the stream, such as by a bad buffer in a network switch. Frames might also contain other non-data, such as a frame-number for retransmission, or the protocol number for the higher-level protocol, etc. As a result, smaller frames usually result in a greater percentage of the network time spent sending metadata (non-data overhead), rather than the payload (the user data). More frames means more frame headers, etc.
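
The two pressures can be combined into one back-of-the-envelope figure: the fraction of transmitted bits that end up as payload in an undamaged frame. The sketch below is only a rough model. It assumes independent bit errors, assumes a damaged frame is thrown away entirely, and uses a made-up overhead of 64 bits per frame.

```python
def useful_fraction(payload_bits, overhead_bits, bit_error_rate):
    """Share of transmitted bits that are payload in a frame that arrives intact."""
    frame_bits = payload_bits + overhead_bits
    payload_share = payload_bits / frame_bits          # lost to headers, etc.
    p_intact = (1.0 - bit_error_rate) ** frame_bits    # lost to discarded frames
    return payload_share * p_intact

OVERHEAD = 64  # hypothetical per-frame overhead, in bits
for payload in (10, 1_000, 8_000, 100_000, 10_000_000):
    print(f"{payload:>10} payload bits: {useful_fraction(payload, OVERHEAD, 1e-6):.4f}")
```

Tiny frames lose out to overhead, huge frames lose out to errors, and the best size sits somewhere in between.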

By the way, in class, someone asked what happens if the framing pattern occurs within the payload. The answer is that it usually needs to be escaped.
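
One common way to escape is byte stuffing. The sketch below follows the style PPP uses on byte-oriented links: the flag byte 0x7E (binary 01111110) and the escape byte 0x7D are each replaced by 0x7D followed by the original byte XORed with 0x20. It is a simplified illustration; real PPP can also escape certain control characters, and bit-oriented links use bit stuffing instead. The function names are made up.

```python
FLAG = 0x7E      # frame delimiter, 01111110
ESC = 0x7D       # escape byte
XOR_MASK = 0x20  # escaped bytes are sent XORed with this value

def stuff(payload: bytes) -> bytes:
    """Escape any flag or escape bytes that occur inside the payload."""
    out = bytearray()
    for b in payload:
        if b in (FLAG, ESC):
            out.append(ESC)
            out.append(b ^ XOR_MASK)
        else:
            out.append(b)
    return bytes(out)

def unstuff(stuffed: bytes) -> bytes:
    """Reverse stuff(); assumes the input is well formed (no trailing escape byte)."""
    out = bytearray()
    it = iter(stuffed)
    for b in it:
        if b == ESC:
            out.append(next(it) ^ XOR_MASK)
        else:
            out.append(b)
    return bytes(out)

data = bytes([0x01, 0x7E, 0x7D, 0x02])
print(stuff(data).hex())             # 017d5e7d5d02
print(unstuff(stuff(data)) == data)  # True
```

After stuffing, the flag byte never appears inside the payload, so the receiver can treat any 0x7E it sees as a frame boundary.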

The Point-To-Point (PPP) Protocol Example

Just to show you guys a real-world example, below is the structure of the frame used by the Point-to-Point Protocol (PPP):

  Flag         Address        Control    Protocol    Payload    Checksum      Flag
  8-bits       8-bits         8-bits     8/16-bits   variable   16/32-bits    8-bits
  01111110     11111111       00000011                                        01111110
  

Notice the bit pattern used to delimit the beginning and ending of the chunk of data, i.e., to frame the data.

In the case of PPP, the address bits are almost always all 1s, as PPP is almost never used except as a point-to-point protocol with only two stations, not as a multi-station broadcast protocol. The control bits are usually as shown, but can also be used to number frames for a reliable protocol. As it turns out, the sender and receiver can agree, during the initial negotiation, to leave both of these often-unused fields out of the frames, reducing the non-data overhead.