Are ndarrays sent base64 encoded?

I’m not sure if I’m overlooking something, but it seems that ndarrays are encoded as a buffer only to be encoded again later as base64. Is that expected?

In the PayloadEncoder ( bokeh/src/bokeh/core/json_encoder.py at dd25b9157afe5e04de35d0f3bfa80715f4a2422b · bokeh/bokeh · GitHub ) the document object will contain the ndarray as a Buffer object. However, as self._buffers is just an empty list (not sure why?), obj.to_base64() gets called on the encoded ndarrays.

Maybe I’m missing something, but to me it seems that the buffers are not set correctly in the Serializer.

def default(self, obj: Any) -> Any:
    if isinstance(obj, Buffer):
        if obj.id in self._buffers: # TODO: and len(obj.data) > self._threshold:
            return obj.ref
        else:
            return obj.to_base64()
    else:
        return super().default(obj)
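
To make that branch easier to follow, here is a minimal stand-alone sketch of the same decision using only the standard library. ToyBuffer and ToyEncoder are hypothetical names made up for this illustration; they are not Bokeh's classes and only mimic the shape of the snippet above.

```python
import base64
import json
from dataclasses import dataclass


@dataclass
class ToyBuffer:
    """Hypothetical stand-in for a binary buffer (not Bokeh's Buffer)."""
    id: str
    data: bytes

    @property
    def ref(self) -> dict:
        # A small reference; the raw bytes would travel separately.
        return {"id": self.id}

    def to_base64(self) -> str:
        # Inline the bytes as text; base64 output is roughly 4/3 of the raw size.
        return base64.b64encode(self.data).decode("ascii")


class ToyEncoder(json.JSONEncoder):
    def __init__(self, buffers: dict, **kwargs):
        super().__init__(**kwargs)
        # Buffers registered for out-of-band (binary) transfer, keyed by id.
        self._buffers = buffers

    def default(self, obj):
        if isinstance(obj, ToyBuffer):
            if obj.id in self._buffers:
                return obj.ref
            return obj.to_base64()
        return super().default(obj)


buf = ToyBuffer(id="b1", data=bytes(range(6)))
doc = {"column": buf}

# Nothing registered: the payload is inlined as base64 text.
inline = ToyEncoder(buffers={}).encode(doc)
# Buffer registered: only a reference ends up in the JSON.
by_ref = ToyEncoder(buffers={"b1": buf.data}).encode(doc)

print(inline)
print(by_ref)
```

With an empty registry every buffer takes the base64 path, which matches the behavior described in the question.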

Full code example (data size can be adjusted as needed).

import numpy as np

from bokeh.io import curdoc, show
from bokeh.layouts import row
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

n_splits = 100
n_split_samples = 20_000

cds_standard = ColumnDataSource(data={"x":[], "y":[]})
cds_multi = ColumnDataSource(data={"x":[], "y":[]})
cds_scatter = ColumnDataSource(data={"x":[], "x2": [], "y":[]})

for i in range(n_splits):
    x_data = np.arange(n_split_samples)
    y_data = np.sin(x_data/100) + x_data/500*np.random.randn()

    cds_standard.stream({"x": x_data, "y": y_data})
    cds_multi.stream({"x": [np.append(x_data, x_data[0])], "y": [np.append(y_data, y_data[0])]})
    cds_scatter.stream({"x": x_data, "x2": -x_data, "y": y_data})


p1 = figure(title="Single line", output_backend="webgl")
p2 = figure(title="Multi line", output_backend="webgl")
p3 = figure(title="Scatter", output_backend="webgl")

p1.line(x="x", y="y", source=cds_standard)
p2.multi_line(xs="x", ys="y", source=cds_multi)
p3.scatter(x="x", y="y", source=cds_scatter, size=1)

curdoc().add_root(row(p1, p2, p3))

It’s been a very long time since I have dug into these details, so I’ll have to try to find some time to do some actual investigation. However, my offhand recollection is that only arrays that are CDS columns get sent directly over the wire using the binary protocol, without encoding. That’s the majority of cases where arrays need serialization and transport, and even more so for cases where “large” arrays might be used. Other places that can accept arrays, e.g. as an argument to stream, are “oddballs” that are not currently covered by the binary protocol.

And that’s because the intention with the stream and patch functions was to afford a way to do small incremental updates, so they were not considered a priority to fit into the binary protocol. I.e. the cost of sending 10 new values for a 500k array is completely insignificant compared with the alternative of resending the whole 500k array, regardless of whether the 10 values are encoded or not.
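
A rough back-of-the-envelope check of that comparison, as a sketch only. It assumes float64 values (8 bytes each), reads “500k” as 500,000 elements, and ignores JSON and message framing overhead; raw_size and base64_size are helper names invented for this example.

```python
import base64

BYTES_PER_FLOAT64 = 8  # assumption: the column holds float64 values


def raw_size(n_values: int) -> int:
    """Size in bytes of n_values float64 numbers sent as raw binary."""
    return n_values * BYTES_PER_FLOAT64


def base64_size(n_values: int) -> int:
    """Size in bytes of the same payload after base64 encoding."""
    # Only the length matters here, so a zero-filled payload is enough.
    return len(base64.b64encode(bytes(raw_size(n_values))))


small_raw = raw_size(10)
small_b64 = base64_size(10)
full_raw = raw_size(500_000)
full_b64 = base64_size(500_000)

print(f"10 values:   {small_raw} B raw, {small_b64} B base64")
print(f"500k values: {full_raw} B raw, {full_b64} B base64")
```

The base64 penalty on a 10-value update is a few dozen bytes, while resending the full column costs megabytes either way, and the encoded form of the full column is about a third larger again.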

Again, I’ll have to actually dive into things to confirm my recollection…

Looks like the initial document pull doesn’t use binary serialization at the moment, which is a huge regression. I reported it in Bokeh protocol doesn't use binary encoding in `pull-doc-reply` message · Issue #14724 · bokeh/bokeh · GitHub. Further document updates do use binary serialization.

Actually I’m not sure this hasn’t always been an issue with pull-doc specifically. At least, I have a vague memory of that being something that needed fixing/improving. I thought there was an issue, but I guess it slipped through the cracks? Thanks @Maxxner for helping us re-visit this.

Given that streaming before the initial render is a no-op (you can add data to the data source directly at that stage), what you are observing is a bug in the initial document pull (pull-doc message), which is now fixed in Use binary protocol in `pull-doc-reply` message by mattpap · Pull Request #14725 · bokeh/bokeh · GitHub. If streaming happens after the initial render, then it will go through the ColumnsStreamed event as part of a patch-doc message, which supports binary transfer well.

the document object will contain the ndarray as a Buffer object. However, as self._buffers is just an empty list (not sure why?), obj.to_base64() gets called on the encoded ndarrays.

self._buffers should be filled with the buffers that are expected to be binary serialized. Which buffers can be binary serialized depends on the buffer (e.g. ndarrays of int64, strings, etc. are serialized as nested lists) and on the method of embedding (currently only the server with websockets supports binary transport). Your particular case is just a bug or, as it appears to be, a missing feature.

ndarrays of int64, strings, etc. are serialized as nested lists

Just elaborating for @Maxxner: it’s specifically only numpy arrays that have corresponding JavaScript typed arrays that can be sent with the binary protocol. So e.g. there is no int64 typed array in JavaScript, which is why int64 numpy arrays fall back to JSON lists.
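
A small sketch of how one might check a column's dtype before handing it to a data source. The mapping table is written by hand for this illustration and is not copied from Bokeh; following the reply above, 64-bit integer dtypes are deliberately left out. has_typed_array and to_int32_if_safe are hypothetical helpers, and whether downcasting is acceptable depends on your data.

```python
import numpy as np

# Illustrative mapping from NumPy dtype names to JavaScript typed arrays.
# Assumption: hand-written for this example, not Bokeh's internal table.
JS_TYPED_ARRAYS = {
    "int8": "Int8Array",
    "uint8": "Uint8Array",
    "int16": "Int16Array",
    "uint16": "Uint16Array",
    "int32": "Int32Array",
    "uint32": "Uint32Array",
    "float32": "Float32Array",
    "float64": "Float64Array",
}


def has_typed_array(array: np.ndarray) -> bool:
    """True if the array's dtype appears in the illustrative table."""
    return array.dtype.name in JS_TYPED_ARRAYS


def to_int32_if_safe(array: np.ndarray) -> np.ndarray:
    """Downcast int64 to int32 only when every value fits; else return as is."""
    if array.dtype == np.int64:
        info = np.iinfo(np.int32)
        if array.size == 0 or (array.min() >= info.min and array.max() <= info.max):
            return array.astype(np.int32)
    return array


x = np.arange(20_000, dtype=np.int64)   # int64, as np.arange gives on many platforms
y = np.sin(x / 100)                     # float64

print(has_typed_array(x), has_typed_array(y))
print(has_typed_array(to_int32_if_safe(x)))
```

In the full example earlier in the thread, x_data comes from np.arange, so on platforms where that defaults to int64 the x column would be a candidate for this kind of check.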