I’m currently working on a project that relies on Bokeh, and I have some questions about the serialization format used in the library. Specifically:
Is there any detailed documentation available for the serialization format? It would be immensely helpful for understanding its structure and how it evolves over time.
Does the serialization format itself adhere to semantic versioning guidelines? For instance, can I expect breaking changes to the serialization format to only occur with major version releases? Or is it possible for the format to change in ways that break compatibility between minor or patch releases as well?
No, currently the serialization protocol is defined by the implementation. I started this issue in the hope of getting it documented.
We don’t officially support semantic versioning and the protocol itself isn’t versioned, but we do make sure that it stays backwards compatible between minor releases, and any major changes happen in major releases. If new features are added, they are made optional on both sides of the protocol.
For some background, when Bokeh started in ~2012 it was really only aimed at Python developers, so at the time, all of the BokehJS side of things was considered purely private implementation details. These days, we’d love to improve the standalone BokehJS story, but of course volunteer OSS moves however fast it can move. (i.e. we’d love help from any motivated new contributors.)
Anyway, I’m happy to try to provide any practical information or guidance, or to answer specific queries (all of which can hopefully feed into @mateusz’s issue).
For now, a quick brain dump with a few random comments:
The protocols are not explicitly versioned but Python Bokeh and BokehJS versions are currently locked together, and the Python API encodes its version in the generated JSON in the “document wrapper”:
This top-level structure has not changed in as long as I can remember.
BokehJS will emit a warning to the console in the event of a mismatch between the version in the serialized document and itself.
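To make the version check concrete, here is a minimal sketch in plain Python of what a consumer of the JSON could do. It assumes the wrapper is a dict with a top-level `"version"` key next to `"title"` and `"roots"`; the function name `check_document_version` and the example values are hypothetical, and this is not BokehJS's actual code.

```python
import warnings


def check_document_version(doc_json: dict, library_version: str) -> bool:
    """Return True if the document version matches; otherwise warn and return False."""
    # Assumption: the wrapper stores the producing library version under "version".
    doc_version = doc_json.get("version")
    if doc_version != library_version:
        warnings.warn(
            f"document version {doc_version!r} does not match "
            f"library version {library_version!r}"
        )
        return False
    return True


# Hypothetical document wrapper; real documents carry models under "roots".
doc = {"version": "3.4.0", "title": "Example", "roots": []}
```

Like the behavior described above, the sketch only warns on a mismatch and leaves the decision about whether to continue to the caller.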
The roots are the top-level Bokeh models, together with all the things they might reference, that comprise the document. Each object is pretty much a wrapper around one dict of attributes. The wrapper part has a type, name, and id in addition to the attributes:
This wrapper structure has also not changed in very many years. As @mateusz stated, most changes are to the properties of existing models or adding/removing models.
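As a rough illustration of that wrapper shape, the sketch below parses one serialized object into a small dataclass. The key names `"type"`, `"id"`, `"name"`, and `"attributes"` follow the description above, but the exact keys can differ between Bokeh versions; `ObjectWrapper`, `parse_wrapper`, and the sample values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ObjectWrapper:
    """Hypothetical in-memory view of one serialized model."""
    type: str
    id: str
    name: Optional[str] = None
    attributes: Dict[str, Any] = field(default_factory=dict)


def parse_wrapper(raw: Dict[str, Any]) -> ObjectWrapper:
    # "type" and "id" are treated as required; "name" and "attributes" as optional.
    return ObjectWrapper(
        type=raw["type"],
        id=raw["id"],
        name=raw.get("name"),
        attributes=dict(raw.get("attributes", {})),
    )


# Made-up example object, not copied from real Bokeh output.
raw_object = {"type": "Circle", "id": "1001", "attributes": {"x": {"field": "x"}}}
parsed = parse_wrapper(raw_object)
```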
The attributes dict maps attribute names to property-specific values for models:
Things with an id, like "source" above, are actually references to other models. That is how, e.g., two plots may share one actual range object. The other properties here are “dataspecs”, which means things that can have either a single scalar as a value or a reference to a vector of data in a column data source. [1]
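Here is a small sketch of how a reader of the JSON might tell these cases apart and resolve references. It assumes references look like `{"id": ...}` and dataspecs look like `{"value": ...}` or `{"field": ...}`; newer Bokeh versions may add extra keys, and the function names and sample ids are hypothetical.

```python
def classify_attribute(value):
    """Label a serialized attribute value by its (assumed) shape."""
    if isinstance(value, dict):
        if "id" in value:
            return "reference"
        if "field" in value:
            return "dataspec-field"   # a column name in a column data source
        if "value" in value:
            return "dataspec-value"   # a single scalar applied everywhere
    return "plain"


def resolve_references(attributes, objects_by_id):
    """Replace {"id": ...} entries with the object they point to."""
    return {
        name: objects_by_id[value["id"]]
        if classify_attribute(value) == "reference"
        else value
        for name, value in attributes.items()
    }


# Two hypothetical plots referring to the same range object by id.
shared_range = {"type": "Range1d", "start": 0, "end": 10}
objects_by_id = {"r1": shared_range}
plot_a = resolve_references({"x_range": {"id": "r1"}}, objects_by_id)
plot_b = resolve_references({"x_range": {"id": "r1"}}, objects_by_id)
```

After resolution, `plot_a["x_range"]` and `plot_b["x_range"]` are the same Python object, which is the sharing behavior described above.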
Other properties like array have more involved structure. This is how a numpy array shows up:
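A minimal sketch of base64 array encoding, using only the standard library rather than numpy. The key names `"__ndarray__"`, `"dtype"`, and `"shape"` are an assumption about the layout and may differ between Bokeh versions; the helper names are hypothetical, and only one-dimensional float64 data is handled.

```python
import base64
import struct


def encode_float64_array(values):
    """Pack floats as little-endian float64 bytes and base64-encode them."""
    payload = struct.pack(f"<{len(values)}d", *values)
    return {
        "__ndarray__": base64.b64encode(payload).decode("ascii"),
        "dtype": "float64",
        "shape": [len(values)],
    }


def decode_float64_array(encoded):
    """Inverse of encode_float64_array for one-dimensional float64 data."""
    if encoded["dtype"] != "float64":
        raise ValueError(f"unsupported dtype: {encoded['dtype']}")
    (count,) = encoded["shape"]
    payload = base64.b64decode(encoded["__ndarray__"])
    return list(struct.unpack(f"<{count}d", payload))


encoded = encode_float64_array([1.0, 2.5, -3.0])
decoded = decode_float64_array(encoded)
```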
Note that things are different for arrays in the Bokeh server case—instead of being base64-encoded, the data is sent as separate binary payloads over websocket for performance. The websocket layer does actually have some documentation:
The project strictly keeps the “defaults” for properties exactly aligned between the Python and JS sides. The Python API only serializes values that have changed (i.e. that a user has explicitly set).
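The practical consequence is that a reader of the JSON has to fill in defaults itself. A minimal sketch, assuming a hypothetical `DEFAULTS` table shared by both sides (the property names and values are made up for illustration):

```python
# Hypothetical defaults, assumed identical on the sending and receiving side.
DEFAULTS = {
    "Circle": {"line_width": 1, "fill_alpha": 1.0, "line_color": "black"},
}


def serialize_changed(values, explicitly_set):
    """Sender: emit only the properties the user explicitly set."""
    return {name: values[name] for name in explicitly_set}


def apply_defaults(model_type, serialized):
    """Receiver: start from the defaults, then overlay the serialized values."""
    return {**DEFAULTS[model_type], **serialized}


values = {"line_width": 3, "fill_alpha": 1.0, "line_color": "black"}
wire = serialize_changed(values, explicitly_set={"line_width"})
restored = apply_defaults("Circle", wire)
```

This only round-trips correctly because both sides agree on the defaults, which is why keeping them aligned matters.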
[1] If you are familiar with OpenGL / WebGL, think of something that can be either a “uniform” or a “vector”.