Home
Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.
—Antoine de Saint-Exupery
μpb (or more commonly, “upb”) is an implementation of the Protocol Buffers serialization format released by Google in mid-2008. The Greek letter mu (μ) is the SI prefix for “micro”, which reflects the goal of keeping upb as small as possible while providing a great deal of flexibility and functionality.
upb is written in 2300 SLOC of C and compiles to just under 30 KB of object code on x86.
The Google implementation of Protocol Buffers is open source, released under a liberal license (BSD). Others have written implementations as well, such as protobuf-c. Why did I write a completely new implementation from scratch? Why should anybody use my implementation?
I will give two main reasons, besides the goal of minimalism (which has either already won you over or failed to pique your interest):
upb is designed for maximum flexibility. What this means is that it gives you as a programmer more choices about how you want to store and process your data. Specifically:
- upb is fully streaming-capable.
  - This means that your serialized data doesn’t have to be in one big contiguous buffer to start parsing it. If your buffer is scattered across chunks of memory or if you are streaming data off of a disk or network, upb lets you parse as much data as you currently have in your buffer. When you have more data, you can resume parsing.
- upb’s lowest-level parser is event-driven, like SAX.
  - SAX-based parsers are a great fit for some applications. You might want to parse the Protocol Buffer data into your own custom data structure instead of the stock message classes. Or your application might be capable of processing the data in a streaming fashion, in which case you can avoid the malloc/free/memcpy overhead of saving the data into a tree structure.
- upb’s memory management policies are adaptable.
  - Memory management can make or break performance. `malloc()`, `free()`, and `memcpy()` are expensive when overused, especially taking into account the cache effects. Deep in upb’s design is a recognition of this fact, and interfaces that let you optimize for intelligent memory management. For example, upb is capable of making strings reference the original protobuf data (rather than copying), and upb’s memory management interface lets you reuse submessages instead of destroying and reallocating them.
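To make the streaming and event-driven ideas concrete, here is a minimal sketch in C of a resumable parser that fires a callback for each field as soon as it is complete. It is not upb's actual API: `stream_parser`, `stream_parser_feed`, `varint_field_cb`, and `event_log` are hypothetical names invented for illustration, and the sketch only handles varint-encoded fields (wire type 0) to stay short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; this is an illustration, not upb's API. */

/* SAX-style event: called once for every complete varint field. */
typedef void (*varint_field_cb)(void *closure, uint32_t field_number,
                                uint64_t value);

typedef struct {
  uint64_t acc;           /* bits of the varint decoded so far */
  unsigned shift;         /* bit position of the next 7-bit group */
  int have_tag;           /* 0: decoding a tag, 1: decoding a value */
  uint32_t field_number;  /* field number taken from the last tag */
  varint_field_cb on_varint;
  void *closure;
} stream_parser;

static void stream_parser_init(stream_parser *p, varint_field_cb cb,
                               void *closure) {
  assert(cb != NULL);
  p->acc = 0;
  p->shift = 0;
  p->have_tag = 0;
  p->field_number = 0;
  p->on_varint = cb;
  p->closure = closure;
}

/* Feeds one chunk of input. All state lives in *p, so a varint may be split
 * across two chunks. Returns 0 on success, -1 on an overlong varint or a
 * wire type this sketch does not support. */
static int stream_parser_feed(stream_parser *p, const uint8_t *buf,
                              size_t len) {
  for (size_t i = 0; i < len; i++) {
    uint8_t byte = buf[i];
    if (p->shift >= 64) return -1;  /* more than 10 bytes in one varint */
    p->acc |= (uint64_t)(byte & 0x7f) << p->shift;
    p->shift += 7;
    if (byte & 0x80) continue;      /* continuation bit set: need more */

    uint64_t v = p->acc;            /* a complete varint */
    p->acc = 0;
    p->shift = 0;
    if (!p->have_tag) {
      if ((v & 7) != 0) return -1;  /* only wire type 0 is handled here */
      p->field_number = (uint32_t)(v >> 3);
      p->have_tag = 1;
    } else {
      p->on_varint(p->closure, p->field_number, v);
      p->have_tag = 0;
    }
  }
  return 0;
}

/* Example handler: records events into a small fixed-size log instead of
 * building a message tree. */
typedef struct {
  int count;
  uint32_t fields[4];
  uint64_t values[4];
} event_log;

static void log_varint(void *closure, uint32_t field_number, uint64_t value) {
  event_log *events = closure;
  if (events->count < 4) {
    events->fields[events->count] = field_number;
    events->values[events->count] = value;
    events->count++;
  }
}
```

A caller would initialize the parser once with `stream_parser_init(&p, log_varint, &events)` and then call `stream_parser_feed` for each chunk as it arrives from disk or the network. Because the handler decides what to do with each value, nothing is allocated or copied unless the application chooses to do so.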
upb is designed to be a toolbox of paradigms for manipulating protocol buffer data. upb is built in layers, and any of the layers are available for clients to use as they see fit.
In addition, there are (or will be) several different code generation strategies for compiled languages that wish to use generated code.
Protocol Buffers has enormous potential to be useful to users of dynamic languages. It provides a format that languages can use to exchange data very efficiently. It offers the efficiency benefits of built-in serialization formats such as Python’s “Pickle”, Perl’s “Storable”, and Ruby’s “Marshal”, or of JavaScript’s “JSON”, but with a more explicit schema and greater interoperability across languages.
Despite this promise, Protocol Buffers hasn’t seen much adoption in dynamic languages because the existing implementations aren’t very efficient. upb was designed from the outset to be an ideal foundation for very fast Protocol Buffers implementations in dynamic languages.
One key part of this strategy was making the table-driven parsing code path (the method of operation that doesn’t require you to generate and compile C or C++ for each message) as fast as possible. It is inconvenient for users of dynamic languages to have a compile step in their development cycle.
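As a rough illustration of what "table-driven" means here, the sketch below parses a message using a runtime table of field descriptions rather than code generated per message type. It is an assumption-laden example, not upb's implementation: `field_def`, `msg_def`, `table_parse`, and `example_msg` are hypothetical names, only varint fields are supported, and the field lookup is a simple linear scan.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names throughout; this is an illustration, not upb's API. */

typedef enum { FIELD_UINT32, FIELD_UINT64, FIELD_BOOL } field_type;

/* One table row: where a field lives in the in-memory message. */
typedef struct {
  uint32_t number;  /* field number from the schema */
  field_type type;  /* how to store the decoded value */
  size_t offset;    /* byte offset of the member inside the struct */
} field_def;

typedef struct {
  const field_def *fields;
  size_t field_count;
} msg_def;

static const field_def *find_field(const msg_def *m, uint32_t number) {
  for (size_t i = 0; i < m->field_count; i++) {
    if (m->fields[i].number == number) return &m->fields[i];
  }
  return NULL;
}

/* Reads one varint starting at buf[*pos]. Returns 0 on success, -1 if the
 * input is truncated or the varint is longer than 10 bytes. */
static int read_varint(const uint8_t *buf, size_t len, size_t *pos,
                       uint64_t *out) {
  uint64_t result = 0;
  for (unsigned shift = 0; shift < 64; shift += 7) {
    if (*pos >= len) return -1;
    uint8_t byte = buf[(*pos)++];
    result |= (uint64_t)(byte & 0x7f) << shift;
    if (!(byte & 0x80)) {
      *out = result;
      return 0;
    }
  }
  return -1;
}

/* Generic parser: the same code handles any message described by a table. */
static int table_parse(const msg_def *m, const uint8_t *buf, size_t len,
                       void *msg) {
  assert(m != NULL && msg != NULL);
  size_t pos = 0;
  while (pos < len) {
    uint64_t tag, value;
    if (read_varint(buf, len, &pos, &tag) != 0) return -1;
    if ((tag & 7) != 0) return -1;  /* only wire type 0 in this sketch */
    if (read_varint(buf, len, &pos, &value) != 0) return -1;

    const field_def *f = find_field(m, (uint32_t)(tag >> 3));
    if (f == NULL) continue;        /* unknown field: ignore it */

    char *slot = (char *)msg + f->offset;
    switch (f->type) {
      case FIELD_UINT32: {
        uint32_t v = (uint32_t)value;
        memcpy(slot, &v, sizeof v);
        break;
      }
      case FIELD_UINT64:
        memcpy(slot, &value, sizeof value);
        break;
      case FIELD_BOOL: {
        bool v = value != 0;
        memcpy(slot, &v, sizeof v);
        break;
      }
    }
  }
  return 0;
}

/* Example message and its table. A dynamic-language binding could build such
 * a table at runtime from a schema, with no compile step. */
typedef struct {
  uint32_t id;
  uint64_t timestamp;
  bool active;
} example_msg;

static const field_def example_fields[] = {
  {1, FIELD_UINT32, offsetof(example_msg, id)},
  {2, FIELD_UINT64, offsetof(example_msg, timestamp)},
  {3, FIELD_BOOL, offsetof(example_msg, active)},
};

static const msg_def example_def = {
  example_fields, sizeof example_fields / sizeof example_fields[0]
};
```

Calling `table_parse(&example_def, data, len, &msg)` fills an `example_msg` from serialized bytes. Supporting a new message type in this scheme means supplying a new table, not compiling new parsing code.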
Another important part of the strategy is developing memory-management interfaces that can integrate with the memory managers of dynamic languages. This is no easy task, because each language runtime does memory management differently. Some use reference counting, some use garbage collection, some use a combination, and the interfaces for interacting with the memory managers are different for every runtime. A key goal of upb was to design a memory management scheme that could gracefully integrate with all of these.