Sunday 22 July 2012

More fun with Intel TBB and Boost.Proto (and Boost.Fusion)

In my last post, we developed an EDSL for specifying TBB Flow Graph structure with the help of Boost.Proto. Instead of writing a series of tbb::flow::make_edge calls, we can use a little language to help us be more declarative. In the process, however, we lost some of the runtime efficiency. Instead of just invoking make_edge() a number of times, our solution uses loops and temporary data structures (e.g. std::set) which introduce runtime overhead. In this post, we will look at how to modify what we did last time to create a meta-program that will enable us to remove the unnecessary runtime overhead.

Boost.Fusion

Fusion library sits somewhere between Boost.MPL and STL. Like MPL, it provides heterogeneous data structures but unlike MPL it is able to operate at runtime. It also is a great tool for certain meta-programming tasks. For example, suppose we wanted to generate code that would invoke foo() multiple times, each with a different value (and maybe type). With Fusion, it is as easy as putting those values into a fusion::vector and then calling fusion::for_each() with the vector and a polymorphic functor:
void foo(int);
void foo(double);

struct call_foo {
    template <typename T>
    void operator()(T x) const {
        foo(x);
    }
};

int main() {
    boost::fusion::vector<int, double, int> v(1, 2.3, 4);
    boost::fusion::for_each(v, call_foo());
    return 0;
}

Once the program is compiled with optimizations enabled, all that remains are just foo() invocations (I ran the output generated via -s flag through c++filt to demangle the names):
main:                          # @main
    .cfi_startproc
# BB#0:
    pushq %rax
.Ltmp1:
    .cfi_def_cfa_offset 16
    movl $1, %edi
    callq foo(int)
    movsd .LCPI0_0(%rip), %xmm0
    callq foo(double)
    movl $4, %edi
    callq foo(int)
    xorl %eax, %eax
    popq %rdx
    ret

Since what we are trying to do is generate a bunch of make_edge() calls, this seems like exactly what we need.

Putting it all together

The basic idea is very simple. Replace the use of std::set with boost::fusion::vector when keeping track of senders and receivers in a compound node. We begin by re-defining compound_node to bundle up two fusion vectors. I replaced the use of std::tuple with a struct to make the code more readable and also added a utility function to make the construction easier:
template <typename Senders, typename Receivers>
struct compound_node {
    typedef Senders senders_t;
    typedef Receivers receivers_t;

    senders_t senders;
    receivers_t receivers;
};

template <typename S, typename R>
compound_node<S, R> mk_compound_node(S s, R r) {
    return compound_node<S, R>{s, r};
}

Now it's time to modify the make_compound_node transform that is invoked by Proto everytime it meets a terminal:
struct make_compound_node : proto::callable {
    typedef compound_node<
        fusion::vector<flow::sender<flow::continue_msg> *>,
        fusion::vector<flow::receiver<flow::continue_msg> *>
    > result_type;

    result_type operator()(flow::sender<flow::continue_msg>& s, flow::receiver<flow::continue_msg>& r) {
        return mk_compound_node(
            fusion::vector<flow::sender<flow::continue_msg> *>(&s),
            fusion::vector<flow::receiver<flow::continue_msg> *>(&r)
        );
    }
};

Next up is the join transform that is applied to process the + operator. Recall that all it has to do is create a new compound node by joining the senders and receivers of the left and right hand sides. To join two Fusion vectors, we use fusion::join function. Unfortunately, it returns not a new vector but a view which contains references to the original sequences it was asked to join. Because we pass everything by value and have tons of temporaries, this is a recipe for disaster. So we force Fusion to create a new vector by passing the result of join to fusion::as_vector. The messy part turns out to be the return type. Remember, Proto transforms have to comply with (TR1/C++11) result_of protocol. While previously we got away by using the simple method of typedef'ing a result_type, this time our return type depends on the types of arguments. We have to define a meta-function by specializing the result<Sig> struct. Luckily, Fusion library also supports a uniform way of computing result types of its functions (albeit slightly different from the standard approach). All of its functions have a meta-function with the same name in the fusion::result_of namespace.
struct join : proto::callable {
    template <typename Sig>
    struct result;

    template <typename This, typename LeftNode, typename RightNode>
    struct result<This(LeftNode, RightNode)> {
        typedef typename LeftNode::senders_t left_senders_t;
        typedef typename LeftNode::receivers_t left_receivers_t;
        typedef typename RightNode::senders_t right_senders_t;
        typedef typename RightNode::receivers_t right_receivers_t;

        typedef compound_node<
            typename fusion::result_of::as_vector<
                typename fusion::result_of::join<left_senders_t const, right_senders_t const>::type
            >::type,
            typename fusion::result_of::as_vector<
                typename fusion::result_of::join<left_receivers_t const, right_receivers_t const>::type
            >::type
        > type;
    };

    template <typename LeftNode, typename RightNode>
    typename result<join(LeftNode, RightNode)>::type operator()(LeftNode left, RightNode right) const {
        return mk_compound_node(
                fusion::as_vector(fusion::join(left.senders, right.senders)),
                fusion::as_vector(fusion::join(left.receivers, right.receivers))
            );
    };
};

Finally we get to the meat of the problem -- implementing splice transform where actual make_edge calls are made. Just like in the join transform above, we need to adhere to result_of protocol using struct specialization. We also need to execute the double for loop: the outer loop iterates over the left-hand's senders and the inner one over the right-hand receivers. But of course we'll use fusion::for_each() twice to do the looping. We are lucky that the fusion::vectors contain homogeneous types -- they are all either flow::sender<continue_msg> or flow::receiver<continue_msg>. This makes it possible to use C++11 lambdas and keep everything in one place.
struct splice : proto::callable {
    template <typename Sig>
    struct result;

    template <typename This, typename LeftNode, typename RightNode>
    struct result<This(LeftNode, RightNode)> {
        typedef compound_node<
            typename RightNode::senders_t,
            typename LeftNode::receivers_t
        > type;
    };

    template <typename LeftNode, typename RightNode>
    typename result<splice(LeftNode, RightNode)>::type operator()(LeftNode left, RightNode right) const {
        fusion::for_each(left.senders, [&](flow::sender<flow::continue_msg>* s) {
            fusion::for_each(right.receivers, [=](flow::receiver<flow::continue_msg>* r) {
                flow::make_edge(*s, *r);
            });
        });

        return mk_compound_node(right.senders, left.receivers);
    }
};

Note that no changes were necessary to the grammar nor, of course, main(). To verify that we reached our goal, let's check out the disassembly. Since flow::make_edge is inline function, I temporarily declared a make_edge() in the global namespace with the matching signature and used that instead. And voilĂ :
    leaq    560(%rsp), %rsi
    leaq    344(%rsp), %rdi
.LEHB53:
    call    make_edge(tbb::flow::interface6::sender<tbb::flow::interface6::continue_msg>&, tbb::flow::interface6::receiver<tbb::flow::interface6::continue_msg>&)
    leaq    816(%rsp), %rsi
    leaq    600(%rsp), %rdi
    call    make_edge(tbb::flow::interface6::sender<tbb::flow::interface6::continue_msg>&, tbb::flow::interface6::receiver<tbb::flow::interface6::continue_msg>&)
    leaq    1072(%rsp), %rsi
    leaq    88(%rsp), %rdi
    call    make_edge(tbb::flow::interface6::sender<tbb::flow::interface6::continue_msg>&, tbb::flow::interface6::receiver<tbb::flow::interface6::continue_msg>&)
    leaq    560(%rsp), %rsi
    leaq    88(%rsp), %rdi
    call    make_edge(tbb::flow::interface6::sender<tbb::flow::interface6::continue_msg>&, tbb::flow::interface6::receiver<tbb::flow::interface6::continue_msg>&)
    leaq    48(%rsp), %rsi
    leaq    1336(%rsp), %rdi
    call    make_edge(tbb::flow::interface6::sender<tbb::flow::interface6::continue_msg>&, tbb::flow::interface6::receiver<tbb::flow::interface6::continue_msg>&)
    leaq    304(%rsp), %rsi
    leaq    1336(%rsp), %rdi
    call    make_edge(tbb::flow::interface6::sender<tbb::flow::interface6::continue_msg>&, tbb::flow::interface6::receiver<tbb::flow::interface6::continue_msg>&)

Saturday 14 July 2012

Fun with Intel TBB and Boost.Proto

Intel Threading Building Blocks

As the hardware engineers cram more and more cores into the processors, software engineers are left to ponder about how to best exploit these new capabilities. Intel offers an open source C++ library they call Threading Building Blocks (TBB). Akin to competing solutions (OpenMP, GCD, PPL), the library allows the developer to break up the computation into bite size pieces called tasks. A task also has a successor (or parent) task which makes it possible to express a relationship describing who should run before whom. The library runtime then takes care of scheduling the tasks onto the physical threads. It is, however, error prone and inconvenient to program using the tasks. Thankfully, much of the programming can be expressed via higher level abstractions (especially those involving fork-join parallelism). For example, parallel_for function allows for using multiple threads to execute the iterations of the loop. Another example is parallel_pipeline function which allows for expressing pipes-and-filters kind of parallelism.

TBB also has a facility called a Flow Graph. Designed for those situations when it is best to express the concurrency as a directed graph of dependencies between message emitting nodes, Flow Graph stands almost like a separate library within the library. Let's take a look at an example from TBB's online reference. In this example, every node will print its name after it receives a completion message from the node that it depends on. Here's the dependency graph:

And the corresponding code:

#include <cstdio>
#include "tbb/flow_graph.h"

using namespace tbb::flow;

struct body {
    std::string my_name;
    body( const char *name ) : my_name(name) {}
    void operator()( continue_msg ) const {
        std::printf("%s\n", my_name.c_str());
    }
};

int main() {
    graph g;

    broadcast_node< continue_msg > start;
    continue_node<continue_msg> a( g, body("A"));
    continue_node<continue_msg> b( g, body("B"));
    continue_node<continue_msg> c( g, body("C"));
    continue_node<continue_msg> d( g, body("D"));
    continue_node<continue_msg> e( g, body("E"));

    make_edge( start, a );
    make_edge( start, b );
    make_edge( a, c );
    make_edge( b, c );
    make_edge( c, d );
    make_edge( a, e );

    for (int i = 0; i < 3; ++i ) {
        start.try_put( continue_msg() );
        g.wait_for_all();
    }

    return 0;
}

The trouble with the code above is that the graph definition is not declarative. Looking at the series of make_edge calls, it is very hard to visualize the graph. I wanted to see if the situation could be improved...

Boost.Proto

I first encountered Boost.Proto at the Extraordinary C++ conference in 2007. Eric Niebler, the author of the library, gave a presentation to an awestruck audience where he took C++ metaprogramming to a level not seen before. In 2010, Eric wrote a series (1, 2, 3, 4, 5, 6, 7, 8) of articles on the subject that were published on C++Next blog. Through this wonderful series, I was finally able to start wrapping my head around Boost.Proto. The library is actually designed for library writers, more specifically those that are interested in developing an embeddable DSL inside C++ programs. I am not going try to give a tutorial of Boost.Proto in this post. If you have not yet done so, I highly recommend reading the aforementioned articles. What I hope to do is document my experience of using Boost.Proto for this exercise.

Intel TBB meets Boost.Proto

To make the TBB's flow graph construction more declarative, let's create an EDSL for this task. Ideally the syntax would be as close to the picture of a graph as possible but that turned out to be difficult. Let's start with a simple case of few nodes in a row:


In the EDSL we will write this as a > b > c. Next, let's look at graphs where a node has more than one edge coming out of it or going into it:


 



We will write the left graph as a > (b + c) and the right one as (a + b) > c. Since when I see graphs like this, I tell to myself "a goes to b and c" and "a and b go to c", this syntax seems reasonable. And since in C++ the addition has higher precedence than greater operator, the parenthesis can be dropped: a > b + c and a + b > c. Let's use this syntax to express the graph in the original example: start > (a > e + c) + (b > c > d).

The Grammar

As I started working on defining a grammar for this mini language, I first defined it on paper using EBNF notation much like I would do if I was going to use yacc:
expr ::= expr ">" group | group
group ::= group "+" node | "(" expr ")" | node
node ::= broadcast_node | continue_node

After a little fumbling however, I realized  that Proto grammars can be much simpler. Since Proto expressions are C++ expressions, one does not have to worry about introducing parenthesis into the grammar nor trying to build operator precedence and associativity into it. Thus the grammar actually becomes:

expr ::= expr ">" expr
       | expr "+" expr
       | broadcast_node
       | continue_node

A grammar like this would usually be ambiguous as it would be unclear how parse a > b > c, for example. But since C++ dictates the operator rules for us, there is no such ambiguity with Proto grammars! Written in Proto (transforms replaced with ellipses for now), it becomes:

namespace flow = tbb::flow;
namespace proto = boost::proto;
typedef flow::broadcast_node<flow::continue_msg> bcast_node;
typedef flow::continue_node<flow::continue_msg> cont_node;
typedef proto::literal<bcast_node> bcast;
typedef proto::literal<cont_node> cont;

struct grammar
    : proto::or_<
        proto::when<proto::plus<grammar, grammar>, ...>,
        proto::when<proto::greater<grammar, grammar>, ...>,
        proto::when<bcast, ...>,
        proto::when<cont, ...>
    >
{};

The Transforms

With the grammar in place, it's time to make it actually do stuff. We need to put in semantic actions, or transforms as Proto calls it, next to each grammar rule. In TBB, each node is both a sender and receiver of messages (and is derived from sender<T> and receiver<T>). As we combine the nodes together using the two operators, we produce a compound node which has some of the nodes inside of it acting as its senders and some as receivers. For example, suppose we have an expression (a > b) > (c > d > e) and we have independently build up two compound nodes, (a > b) and (c > d > e). We now join the two using the > operator. The > operator splices together the left-hand's senders with right-hand's receivers. In the diagrams below, the box  represents the compound node, the receivers are colored blue, the senders are colored red and nodes acting as both receivers and senders are colored purple. For the example above, the following diagram illustrates the before and after the splice. Note that while 'a' is sending messages to 'b', it is actually the receiver of the compound node.




The + operator joins the two compound nodes by combining the left-hand's receivers with right-hand's receivers and left-hand's senders with right-hand's senders. Again, suppose for the expression (a > b) + (c + d) we are joining (a > b) with (c + d). The following diagram shows the before and after:



Finally, let's go back to the > operator for the case when there are multiple senders being spliced to multiple receivers. In that case we splice each of the lefthand's senders to each of the right-hand's receivers. Thus, executing > operator on expressions (a + b) and (c + d + e) results in the following graph:


We now have all the pieces to start looking at code. Here are some typedefs:

typedef std::set<flow::sender<flow::continue_msg>*> sender_set;
typedef std::set<flow::receiver<flow::continue_msg>*> receiver_set;
typedef std::tuple<sender_set, receiver_set> compound_node;

Next up is the easiest of the transforms. When we encounter the terminals, broadcast_node or continue_node, we make a compound_node that only has that node inside of it:

struct make_compound_node : proto::callable
    typedef compound_node result_type;

    compound_node operator()(flow::sender<flow::continue_msg>& s, flow::receiver<flow::continue_msg>& r) {
        return compound_node({&s}, {&r});
    }
};

struct grammar
    : proto::or_<
        proto::when<proto::plus<grammar, grammar>, ...>,
        proto::when<proto::greater<grammar, grammar>, ...>,
        proto::when<bcast, make_compound_node(proto::_value, proto::_value)>,
        proto::when<cont, make_compound_node(proto::_value, proto::_value)>
>
{};

Next, we take care of + operator. Remember, we just need to join the two senders' sets into one and do the same for the receivers' sets:

struct join : proto::callable {
    typedef compound_node result_type;
    compound_node operator()(compound_node left, compound_node right) const {
        sender_set& senders = std::get<0>(left);
        senders.insert(std::get<0>(right).begin(), std::get<0>(right).end());

        receiver_set& receivers = std::get<1>(left);
        receivers.insert(std::get<1>(right).begin(), std::get<1>(right).end());

        return compound_node(std::move(senders), std::move(receivers));
    };
};

struct grammar
    : proto::or_<
        proto::when<proto::plus<grammar, grammar>, join(grammar(proto::_left), grammar(proto::_right))>,
        proto::when<proto::greater<grammar, grammar>, ...>,
        proto::when<bcast, make_compound_node(proto::_value, proto::_value)>,
        proto::when<cont, make_compound_node(proto::_value, proto::_value)>
>
{};

And finally the > operator:

struct splice : proto::callable {
    typedef compound_node result_type;

    compound_node operator()(compound_node left, compound_node right) const {
        for( auto s: std::get<0>(left) )
            for( auto r: std::get<1>(right) )
                flow::make_edge(*s, *r);

        sender_set& senders = std::get<0>(right);
        receiver_set& receivers = std::get<1>(left);

        return compound_node(std::move(senders), std::move(receivers));
    }
};

struct grammar
    : proto::or_<
        proto::when<proto::plus<grammar, grammar>, join(grammar(proto::_left), grammar(proto::_right))>,
        proto::when<proto::greater<grammar, grammar>, splice(grammar(proto::_left), grammar(proto::_right))>,
        proto::when<bcast, make_compound_node(proto::_value, proto::_value)>,
        proto::when<cont, make_compound_node(proto::_value, proto::_value)>
>
{};

The only thing that is left is main() demonstrating the usage:

int main() {

    flow::graph g;

    bcast start(g);
    cont a = cont_node( g, body("A"));
    cont b = cont_node( g, body("B"));
    cont c = cont_node( g, body("C"));
    cont d = cont_node( g, body("D"));
    cont e = cont_node( g, body("E"));

    auto expr =
        start > (a > e + c)
              + (b > c > d);

    grammar()(expr);

    proto::value(start).try_put( flow::continue_msg() );
    g.wait_for_all();

    return 0;
}

A big thanks goes to Neil Groves for reviewing this article.