haberman edited this page Jul 6, 2011 · 30 revisions


μpb (or more commonly, “upb”) is an implementation of Protocol Buffers, the serialization format that Google released in mid-2008. The Greek letter mu (μ) is the SI prefix for “micro”, which reflects the goal of keeping upb as small as possible while still providing a great deal of flexibility and functionality.

upb is written in ~5000 sloc of C, and compiles to ~30kb of object code on x86.

Why Protocol Buffers?

Protocol Buffers are an excellent platform for data processing and interchange. They take all the best parts of JSON, XML, and UNIX and leave out all of the bad parts. They can be as easy to use as JSON, as versatile as UNIX pipes, and as explicit as XML Schema, while being more efficient in both CPU and memory. They can be human-readable when you want that (and can even use JSON as their on-the-wire format), and binary and extremely compact when you need that. And all of this can be implemented in a very small library (<50kb of object code).

Using Protocol Buffers, you can define your schema using a simple syntax:

message Person {
  required uint32 age = 1;
  required uint32 birthday = 2;
  enum Gender {
    MALE = 0;
    FEMALE = 1;
  }
  optional Gender gender = 3;
}

You can then create structures of this type in any language that has protocol buffer support. These structures can be very efficiently serialized and deserialized into either binary or text formats.
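
To make the binary format concrete, here is a minimal hand-rolled sketch of how a Person message could be laid out on the wire. It is not upb's API: the function names (`encode_varint`, `encode_varint_field`, `encode_person`) and the sample values are assumptions made for illustration. Each field is written as a tag (field number and wire type) followed by a base-128 varint value.

```python
from typing import Optional

WIRE_VARINT = 0  # wire type used for uint32 and enum fields


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint, low 7 bits first."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_varint_field(field_number: int, value: int) -> bytes:
    """Encode one varint field: the tag, then the value."""
    tag = (field_number << 3) | WIRE_VARINT
    return encode_varint(tag) + encode_varint(value)


def encode_person(age: int, birthday: int, gender: Optional[int] = None) -> bytes:
    """Serialize the Person schema above (hypothetical helper, not part of upb)."""
    out = encode_varint_field(1, age) + encode_varint_field(2, birthday)
    if gender is not None:  # optional field: omitted entirely when unset
        out += encode_varint_field(3, gender)
    return out


# Sample values are arbitrary; gender=1 corresponds to FEMALE.
wire = encode_person(age=30, birthday=300, gender=1)
print(wire.hex())  # 081e10ac021801
```

The whole message takes seven bytes: one tag byte per field, plus one or two bytes for each value.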

  • vs JSON: the protocol buffer schema is explicit, as opposed to JSON where messages can have arbitrary keys that map to any kind of value. Protocol Buffers and JSON can interoperate nicely; you can serialize a Protocol Buffer structure to JSON and parse from JSON. Protocol Buffers (properly implemented) are more efficient in-memory, since they are stored as offset-based structures instead of hash tables. The Protocol Buffers text format is roughly comparable to JSON. The Protocol Buffers binary serialization format is significantly smaller than JSON, but is not human readable.
  • vs XML: Protocol Buffers have a data model that more cleanly maps to programming languages (no special “DOM” API is required). Protocol Buffers are significantly more efficient both on the wire and in memory. Protocol Buffers integrate data types and a schema at the lowest level, instead of layering it on top of XML using technologies like XML Schema. Protocol Buffers are significantly less complex than the XML stack.
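
As a rough illustration of the size difference from JSON, the snippet below compares one plausible JSON rendering of a Person with the equivalent binary encoding. The sample values (age 30, birthday 300, gender FEMALE) are arbitrary, and the binary bytes are written out by hand rather than produced by any library.

```python
import json

# One plausible JSON rendering of a Person, with whitespace stripped.
as_json = json.dumps(
    {"age": 30, "birthday": 300, "gender": "FEMALE"},
    separators=(",", ":"),
).encode("utf-8")

# The same message in the binary wire format, spelled out byte by byte:
#   08 1e     field 1 (age), varint 30
#   10 ac 02  field 2 (birthday), varint 300
#   18 01     field 3 (gender), varint 1 (FEMALE)
binary = bytes.fromhex("081e10ac021801")

print(len(as_json), len(binary))  # 43 7
```

The exact ratio depends heavily on the message: field names dominate the JSON form here, whereas the binary form identifies fields by number.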

Why another Protocol Buffers implementation?

The Google implementation of Protocol Buffers is open source, released under a liberal license (BSD). Other people have written implementations also, such as protobuf-c. Why did I write a completely new implementation from scratch? Why should anybody use my implementation?

High performance without code generation

Most protobuf implementations focus on code generation as their primary means of achieving speed. “Code generation” in this context means using a compiler to translate a .proto file to C or C++ code that is specific to those .proto types. A C or C++ compiler is then used to output machine code that can parse, serialize, or manipulate those types.

Code generation can achieve high speeds, but also has a high cost:

The generated code can be large
descriptor.proto, which can be represented as a 3.5kb protobuf, compiles to >150kb of machine code on x86. If you have a binary that processes lots of message types, this code can really add up.
You have to link in any message types you want to parse
This means you have to decide ahead of time what messages you might possibly want to process, and you pay the size and compile time hit for all of them. Whenever they change, you have to recompile.
There is an extra step in your edit/compile/run cycle
Or worse, if you didn’t have an edit/compile/run cycle before (like with interpreted languages), you do now.
The generated code is inflexible
Generated code achieves its speed by compiling for one very specific configuration. In other words, it takes all your decisions about how you want to parse and fixes them at compile time. This means that the generated code is only good for one very specific purpose. Want to change the set of fields you care about? Recompile. Want to reference the input strings instead of copying them? Recompile. Want to do callback-based parsing instead of parsing into the stock data structures? Recompile.

upb was designed with the belief that protobuf parsing without code generation could achieve speeds comparable to code generation. If this can be achieved, we can avoid the drawbacks of code generation. Programs need only compile the upb core (<50kb of object code), and all .proto files can be loaded at runtime as they are needed.
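
The idea of parsing without generated code can be sketched as a single generic decode loop driven by a field table that is ordinary runtime data. This is only an illustration of the approach, not upb's actual design or API: `decode_varint`, `decode_message`, and `PERSON_FIELDS` are hypothetical names, and the sketch handles varint fields only.

```python
from typing import Dict, Tuple


def decode_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Return (value, new_pos) for the varint that starts at buf[pos]."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]  # raises IndexError on truncated input
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


# The "schema" is plain data. It could just as well be built at runtime
# from a descriptor read off disk; nothing here is compiled per type.
PERSON_FIELDS: Dict[int, str] = {1: "age", 2: "birthday", 3: "gender"}


def decode_message(buf: bytes, fields: Dict[int, str]) -> Dict[str, int]:
    """Generic decoder: the same loop serves any table of varint fields."""
    message: Dict[str, int] = {}
    pos = 0
    while pos < len(buf):
        tag, pos = decode_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x7
        if wire_type != 0:
            raise ValueError("this sketch only handles varint fields")
        value, pos = decode_varint(buf, pos)
        name = fields.get(field_number)
        if name is not None:  # fields missing from the table are dropped
            message[name] = value
    return message


person = decode_message(bytes.fromhex("081e10ac021801"), PERSON_FIELDS)
print(person)  # {'age': 30, 'birthday': 300, 'gender': 1}
```

Supporting a new message type means supplying a different table, not recompiling the decoder.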

Current benchmarks indicate that upb runs at no less than 70% of the speed of the official release of protobufs (with the official release doing code generation and upb dynamically loading .proto types). In some tests, the gap is even smaller.

Even if upb can’t achieve 100% of the speed of code generation in an apples-to-apples comparison, upb can come out ahead by offering the flexibility to perform optimizations that are not easy or practical with current code generation approaches. The most significant examples are:

Skipping fields/submessages you don’t need.
The protobuf format makes it possible to skip submessages very efficiently. If you are only reading a small portion of a large, nested protobuf, you can get the fields you need in orders of magnitude less time than it would take to parse the whole thing.
Lazy parsing of submessages (not implemented yet).
This is a slightly different take on the previous point: submessages are parsed only if and when they are accessed. This can achieve the same speeds as skipping without requiring you to statically analyze the set of fields you need. The downside is that parse errors surface later and unsynchronized reads are no longer thread-safe.
Referencing input string data instead of copying.
If the input contains strings, it is possible to reference them from the input string instead of paying for malloc() and memcpy(). This might be desirable in some cases but not others; a non-code-generation approach lets you decide at runtime.
Callback/Event-based parsing
Event-based parsing (like SAX in XML) can be much more efficient than parsing into a data structure.
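
Three of the optimizations above (skipping, referencing input data, and callback-based parsing) can be combined in one small sketch. It is an illustration under stated assumptions rather than upb's implementation: `parse_events`, its `wanted` set, and the `on_field` callback are hypothetical names. Length-delimited values are delivered as `memoryview` slices of the input, so nothing is copied, and an unwanted submessage is skipped by advancing past its length prefix without looking inside it.

```python
from typing import Callable, Set, Tuple


def decode_varint(view: memoryview, pos: int) -> Tuple[int, int]:
    """Return (value, new_pos) for the varint that starts at view[pos]."""
    result = 0
    shift = 0
    while True:
        byte = view[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse_events(buf: bytes, wanted: Set[int],
                 on_field: Callable[[int, object], None]) -> None:
    """Walk the top-level fields of buf, reporting only the wanted ones."""
    view = memoryview(buf)
    pos = 0
    while pos < len(view):
        tag, pos = decode_varint(view, pos)
        field_number, wire_type = tag >> 3, tag & 0x7
        if wire_type == 0:  # varint
            value, pos = decode_varint(view, pos)
        else:
            if wire_type == 2:  # length-delimited: string, bytes, submessage
                length, pos = decode_varint(view, pos)
            elif wire_type == 1:  # fixed 64-bit
                length = 8
            elif wire_type == 5:  # fixed 32-bit
                length = 4
            else:
                raise ValueError("unsupported wire type %d" % wire_type)
            if pos + length > len(view):
                raise ValueError("truncated input")
            value = view[pos:pos + length]  # references the input, no copy
            pos += length  # skipping never inspects the payload
        if field_number in wanted:
            on_field(field_number, value)


# Field 1: the string "upb". Field 2: a submessage we do not need.
# Field 3: the varint 7.
data = b"\x0a\x03upb" + b"\x12\x03\x08\x96\x01" + b"\x18\x07"
events = []
parse_events(data, {1, 3}, lambda number, value: events.append((number, value)))
print([(n, bytes(v) if isinstance(v, memoryview) else v) for n, v in events])
```

Because the delivered slices point into the input buffer, the caller has to keep that buffer alive for as long as it uses them, which is the trade-off the text describes.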

Support for Dynamic Languages

The dynamic nature of upb is especially useful in the context of dynamic or interpreted languages. upb is specifically designed to be an ideal target for dynamic language extensions.

Protocol Buffers have enormous potential to be useful to users of dynamic languages. They provide a format that languages can use to exchange data in a very efficient way. They provide the efficiency benefits of built-in serialization formats like Python’s “Pickle”, Perl’s “Storable”, Ruby’s “Marshal”, or JavaScript’s “JSON”, but with a more explicit schema and greater interoperability across languages.

Despite this promise, Protocol Buffers haven’t seen much adoption in dynamic languages because the existing implementations aren’t very efficient. upb was designed from the outset to be an ideal foundation for very fast Protocol Buffers implementations in dynamic languages. This is much of the reason upb is focused on making the runtime dynamic and configurable (i.e., no code generation), so that .proto types are easy to load at runtime and flexible in the ways you can process them.

Another important goal is to provide memory-management interfaces that can integrate with the memory managers of dynamic languages. This is no easy task, because each language runtime does memory management differently. Some use reference counting, some use garbage collection, some use a combination, and the interfaces for interacting with the memory managers are different for every runtime. A key goal of upb was to design a memory management scheme that could gracefully integrate with all of these.