Sector 0

June 22, 2008

Huffman encoding in F#

Filed under: .NET,F#,Functional Programming,Programming — Tags: , , , — frank @ 13:24

For a while I have been toying with F# – a strict, functional programming language – primarily to get to know the language. I have written a few utility programs and I am working on the problems defined on the Project Euler site. These mathematical problems are perfect for being solved in a functional programming language and it is a great way to work with a new language as well as getting a brush-up on your mathematics.

I have made an implementation of the Huffman compression algorithm in F# to showcase a few of the important things that make functional programming so great to some classes of problems.

Huffman tree generated from the text 'this is an example of a huffman tree'. Source:wikipedia.org

Huffman tree generated from the text ‘this is an example of a huffman tree’. Source:wikipedia.org

I will not go into detail about the Huffman algorithm itself. The following quote is from wikipedia.org about Huffman coding.

Huffman coding uses a specific method for choosing the representation for each symbol, resulting in a prefix-free code (sometimes called “prefix codes”) (that is, the bit string representing some particular symbol is never a prefix of the bit string representing any other symbol) that expresses the most common characters using shorter strings of bits than are used for less common source symbols. Huffman was able to design the most efficient compression method of this type: no other mapping of individual source symbols to unique strings of bits will produce a smaller average output size when the actual symbol frequencies agree with those used to create the code. A method was later found to do this in linear time if input probabilities (also known as weights) are sorted.

The code

The following code is an implementation of the Huffman algorithm in F#. The program does not encode a string into a compressed bit array, but will use the Huffman algorithm to count how many bits there will be in a compressed string. It is easy to see what needs to be altered in order for the algorithm to actually output a compressed array.

Furthermore the code cannot decode a compressed array. This is because I am just showcasing a few topics. It is easy to write a method that uses the Huffman tree to decode an array of bits into the original input string.

Finally the representation of the Huffman codes is not considered when calculating the size of the compressed array. In an application that actually writes the compressed data to a file or a stream, the huffman codes needs to be stored too in order for the decoder to properly decode the data.

#light
module FsHuffmanEncoder
 
// Define a node type using a discriminated union
type Node =
    | InternalNode of int * Node * Node // an internal node with a weight and two subnodes
    | LeafNode of int * byte // a leaf with a weight and a symbol
 
/// Get the weight of a node (how many occurencies are in the input file)
let weight node =
    match node with
    | InternalNode(w,_,_) -> w
    | LeafNode(w,_) -> w
 
/// Creates the initial list of leaf nodes
let createNodes inputValues =
    let getCounts (leafNodes:(int*byte)[]) =
        inputValues |> Array.iter
            (fun b ->   let (w,v) = leafNodes.[(int)b]
                        leafNodes.[(int)b] <- (w+1,v))
        leafNodes
    [|for b in 0uy..255uy -> (0,b)|] |> getCounts
    |> List.of_array
    |> List.map LeafNode
 
/// Create a Huffman tree using the initial list of leaf nodes as basis
let rec createHuffmanTree nodes =
    match nodes |> List.sort (fun node1 node2 -> compare (weight node1) (weight node2)) with
    | first :: second :: rest ->
        let newNode = InternalNode((weight first)+(weight second),first,second)
        createHuffmanTree (newNode::rest)
    | [node] -> node
    | [] -> failwith "Cannot create a huffman tree without input nodes"
 
/// Get a map of (symbol, length-of-huffman-code) pairs used when calculating the theoretical
/// length of the output.
let getHuffmanCodes topNode =
    let rec assignBitPattern node =
        match node with
        | LeafNode(_,v) -> [(v,[])]
        | InternalNode(_,leftNode,rightNode) ->
            let lCodes = assignBitPattern leftNode |> List.map (fun (v,c) -> (v,false::c))
            let rCodes = assignBitPattern rightNode |> List.map (fun (v,c) -> (v,true::c))
            List.append lCodes rCodes
    assignBitPattern topNode |> List.map (fun (v,(c:bool list)) -> (v,List.length c))
    |> Map.of_list
 
/// Calculates the theoretical size of the compressed data (without representing the
/// Huffman tree) in bits, and prints the result to stdout
let encode (input:byte[]) =
    let mapByte huffmanCodes b =
        match Map.tryfind b huffmanCodes with
        | Some huffmanCodeLength -> huffmanCodeLength
        | None -> failwith "Unknown input byte - invalid huffman tree"
    let huffmanCodes = createNodes input |> createHuffmanTree |> getHuffmanCodes
    let size = [|0|] // this is an ugly hack, but it is just for show and tell
    Array.iter (fun b -> size.[0] <- size.[0] + mapByte huffmanCodes b) input
    printfn "Original size      : %d bits" (input.Length*8)
    printfn "Compressed size    : %d bits" size.[0]

The main entry point of the application:

#light
module Main
open FsHuffmanEncoder
open System.IO
 
let fileName = @"c:\uncompressed_text.txt"
encode (File.ReadAllBytes(fileName))

Important lessons in this example

Type definitions with Discriminated union

One of the first things you notice in the code above is the definition of the type Node. This consists of either a LeafNode or an InternalNode type, and these are just names for the tuples that they represent. These named tuples are called discriminators, and can be used in pattern matching like so:

let weight node =
	match node with
	| InternalNode(w,_,_) -> w
	| LeafNode(w,_) -> w

Pattern matching is an important construct in any functional programming language, and is used all over.

Pipelining

The |> “forward pipe” operator is a very important operator in F#, and you should familiarize yourself with it. There is nothing special about it, and a description of it could be “function application in reverse”. Using |> has the advantage of making your code much easier to read and to understand the flow.

The following two lines of code are exactly the same. The first uses the normal way of calling functions, the second line uses a pipe to do the same.

List.iter (fun x -> x*2) myList
myList |> List.iter (fun x -> x*2)

It should be clear that when using big compound statements with a lot of inline function calls in function parameters the pipe operator helps make the code more readable.

Mutable state

Immutable state is one of the core properties of functional programming languages. Immutable state helps programmers write more succint programs as well as more robust programs; if the state is not shared there are no race conditions, no invalid reads/writes of memory.

But some times mutable state can enable you to write some piece of code in a more efficient manner. In the Huffman example I use a mutable array to store the count of each input symbol. This could of course be done using immutable state, but the code would be more complex and the performance more poor.

The lesson here is that although you should strive to use immutable data structures as much as possible, using mutable state is okay – and even preferable – in some situations. And when interfacing with the .NET libraries, or code written in an imperative .NET language, you automatically use shared state.

Integration with other .NET languages

Using existing .NET libraries

Using .NET libraries from F# code is just as easy as from any other language. In the Huffman example I use the class System.IO.File to read the bytes from a file on disk. At the top of the source file the statement open System.IO functions in the same manner as a using statement in C#.

The statement encode (File.ReadAllBytes(fileName)) – which could also be written File.ReadAllBytes(fileName) |> encode using the pipe operator – simply calls the method ReadAllBytes defined in the class File.

Using the encoder from C#

Using types defined in F# in another programming language is just as easy as the other way around. The above example has a module statement. This automatically resolves as a type in C#, and all methods are public static methods. Using the encoder from C# could then look like this:

static void Main(string[] args)
{
	string fileName = @"c:\uncompressed_text.txt";
	FsHuffmanEncoder.encode(File.ReadAllBytes(fileName));
}

Defining types with instance members is just as easy in F# as in any other language.

Recursion

Recursion is another important concept in functional programming. Iterating over a list is something you do a lot in programming, and defining a recursive function to do just that is more efficient than using some imperative construction (which is quite legal in F#).

Tail recursion

But recursion is a double edged sword. If you have programmed recursive methods in an imperative language you have probably also run into one or more stack overflow exceptions.

Each time a method is called the return value is pushed on the execution stack so the runtime knows where to return to when the current method call terminates. If you are not careful in terminating your recursive methods in time an unlimited number of values are pushed onto the execution stack. In reality the stack overflows and the program terminates with a nasty exception. This is also something you must be aware of when defining recursive functions in F#, especially since recursion is a key programming technique in functional programming and is thus used much more intensely.

Fortunately most funtional programming languages – including F# – implements what is known as “tail recursion”. Tail recursion enables you to write recursive functions that will never result in a stack overflow – you are able to write eternally recursive functions. This is done by ensuring that the last expression in a function definition is the recursive call.

The following function will result in a stack overflow exception

let rec badRecursiveFunction n =
	match n with
	| _ when n<9999999999 -> badRecursiveFunction (n+1)
	| _ -> printfn "%d" n //some big number

This function however is tail recursive and will terminate gracefully without putting a stain on the execution stack

let rec tailRecursiveFunction n =
	match n with
	| 9999999999 -> printfn "%d" n //some big number
	| _ -> tailRecursiveFunction (n+1)
All that ends well is tail recursive?

Be careful, though. Making sure that the last expression written in the code is the recursive method is not enough. You have to make sure that it is the last expression evaluated on the stack machine in the runtime.

Consider the following example. The function “sum” takes a list of numbers as input and sums them.

#light
 
let sum numbers =
	| [] -> 0
	| head::tail -> head + sum tail

At first glance this looks tail recursive, but that is not the case. If you apply [3;5] the expression is evaluated like this:

sum [3;5]	= 3 + ( sum [5] )
		= 3 + ( 5 + sum [] )
		= 3 + ( 5 + ( 0 ) )

In order to properly evaluate this expression the values are pushed onto the evaluation stack (this is effectively the result of evaluating the expression ‘sum’) and then the operator ‘+’ is executed. This way ‘sum’ is not the last expression to execute even though it looks like it when reading the code.

So you need to know how the stack machine in the runtime functions in order to write reliable, high performance code. Knowing the language is never enough.

Reference implementation i C#

A somewhat equivalent implementation of the above algorithm written in C# can be seen below.

using System;
using System.Collections.Generic;
using System.Text;
 
namespace CsHuffmanLib
{
    /// 
    /// C# version of the Huffman algorithm
    /// 
    public class CsHuffmanEncoder
    {
        private abstract class Node
        {
            public int frequency = 0;
        }
 
        private class LeafNode : Node
        {
            public byte symbol;
            public bool[] bitPattern;
 
            public LeafNode(byte symbol, int frequency)
            {
                this.frequency = frequency;
                this.symbol = symbol;
            }
        }
 
        private class InternalNode : Node
        {
            public Node leftNode;
            public Node rightNode;
 
            public InternalNode(Node leftNode, Node rightNode)
            {
                this.leftNode = leftNode;
                this.rightNode = rightNode;
                frequency = leftNode.frequency + rightNode.frequency;
            }
        }
 
        public int Encode(byte[] input)
        {
            List leafNodes = GetLeafNodes(input);
            Node topNode = ConstructTree(new List(leafNodes));
            AssignBitPatterns(topNode, new List());
 
            int bitCount = 0;
            foreach (byte b in input)
            {
                //find the matching symbol and store the bit count
                foreach (Node node in leafNodes)
                {
                    LeafNode leafNode = node as LeafNode;
                    if (leafNode.symbol == b)
                    {
                        bitCount += leafNode.bitPattern.Length;
                        break;
                    }
                }
            }
 
            //make sure the number of bits is dividable by 8 before converting to bytes
            return bitCount / 8 + ((bitCount % 8) != 0 ? 1 : 0);
        }
 
        private List GetLeafNodes(byte[] input)
        {
            //initialize the list that is used to count symbol occurences
            List initialLeafList = new List();
            for (int i = 0; i < 256; i++)
                initialLeafList.Add(new LeafNode((byte)i, 0));
 
            //iterate through all inputs to get frequencies
            foreach (byte b in input)
                initialLeafList[(int)b].frequency++;
 
            //remove any symbols that have a frequency of 0
            List finalLeafList = new List();
            foreach (LeafNode leaf in initialLeafList)
                if (leaf.frequency > 0)
                    finalLeafList.Add(leaf);
            return finalLeafList;
        }
 
        private Node ConstructTree(List nodes)
        {
            Node node1, node2;
            while (nodes.Count > 1)
            {
                //find the two smallest nodes (by frequency)
                node1 = FindSmallestNodeByFrequency(nodes);
                node2 = FindSmallestNodeByFrequency(nodes);
                nodes.Add(new InternalNode(node1, node2));
            }
            return nodes[0];
        }
 
        private Node FindSmallestNodeByFrequency(List nodes)
        {
            Node node = nodes[0];
            foreach (Node n in nodes)
                if (n.frequency < node.frequency)
                    node = n;
            nodes.Remove(node);
            return node;
        }
 
        private void AssignBitPatterns(Node node, List bitsSoFar)
        {
            LeafNode leafNode = node as LeafNode;
            if (leafNode != null)
                leafNode.bitPattern = bitsSoFar.ToArray();
            else
            {
                InternalNode internalNode = node as InternalNode;
                bitsSoFar.Add(false); //left node
                AssignBitPatterns(internalNode.leftNode, bitsSoFar);
                bitsSoFar.RemoveAt(bitsSoFar.Count - 1);
                bitsSoFar.Add(true); //right node
                AssignBitPatterns(internalNode.rightNode, bitsSoFar);
                bitsSoFar.RemoveAt(bitsSoFar.Count - 1);
            }
        }
    }
}

2 Comments »

  1. Error 1 Using the generic type ‘System.Collections.Generic.List’ requires ’1′ type arguments C:\Documents and Settings\sys8d\My Documents\Visual Studio 2008\Projects\Huffamn\Huffamn\Program.cs 67 17 Huffamn

    Comment by Yuosef — December 22, 2008 @ 9:13

  2. Im portuguese and I need to implement a huffman decode for jpg compression

    Comment by antonio — June 30, 2010 @ 19:50

RSS feed for comments on this post. TrackBack URL

Leave a comment

Powered by WordPress