Preventing stream from updating server side source

I am creating a dashboard for displaying live data which I am collecting from multiple sensors.
The data is collected in a dictionary via an update function running in a separate thread, which is started by on_server_loaded() in app_hooks.py, similar to the spectrogram example app.
I am using this dictionary as the data for a ColumnDataSource for displaying the data in the Bokeh app.
I am currently updating the data in the Bokeh graphs via source.data, but this re-sends all the data, which I would like to avoid.
source.stream() allows me to only send new_data to the web page, but it also updates the source (ColumnDataSource), which is the dictionary in which the new data from the sensors is collected by the separate thread.
This causes doubling of every point in the dictionary: every point is first added by the separate thread and then again by the stream method.
Is there a way to prevent the stream method from adding new_data to the source dictionary?
I know I could create a new empty dictionary and use it as a ColumnDataSource, but this would double up the points server side, and even more so if multiple pages are open.
As I am hoping to run this on a Raspberry Pi 3B+ (1 GB RAM), I would like to minimize server-side memory usage.
Is there any way to only send new data to the web page without updating the server-side ColumnDataSource?

What I am currently using:

from bokeh.io import curdoc
from bokeh.models import ColumnDataSource

from .bkh_lib.data_holder import dev_data_global    # dictionary with the sensor data

doc = curdoc()
source = ColumnDataSource(data=dev_data_global['plot_data'])
data_length = len(dev_data_global['plot_data']['x'])

def update_bkh():
    # periodically update the web page data if there are new points
    global data_length
    if len(dev_data_global['plot_data']['x']) != data_length:
        source.data = dev_data_global['plot_data']
        data_length = len(dev_data_global['plot_data']['x'])

doc.add_periodic_callback(update_bkh, update_time)    # update_time: period in ms, defined elsewhere

Stream version:

from bokeh.io import curdoc
from bokeh.models import ColumnDataSource

from .bkh_lib.data_holder import dev_data_global    # dictionary with the sensor data

doc = curdoc()
source = ColumnDataSource(data=dev_data_global['plot_data'])
data_length = len(dev_data_global['plot_data']['x'])

def update_bkh():
    # periodically stream only the new points to the web page
    global data_length
    if len(dev_data_global['plot_data']['x']) != data_length:
        new_data = {
            'x': dev_data_global['plot_data']['x'][data_length:],
            'y': dev_data_global['plot_data']['y'][data_length:],
        }
        source.stream(new_data)
        data_length = len(dev_data_global['plot_data']['x'])

doc.add_periodic_callback(update_bkh, update_time)    # update_time: period in ms, defined elsewhere

There is not. The primary defining feature of the Bokeh server is that it is a tool to keep a set of objects in sync across the Python<->JS runtime boundary. If that’s not what you want, then Bokeh (or at least the Bokeh server) may not be the appropriate tool for your use case. You might look into Bokeh solutions without the Bokeh server, e.g. using ServerSentDataSource or AjaxDataSource from standalone Bokeh output.
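For example, a minimal sketch of the AjaxDataSource route (the URL, port, and polling settings here are illustrative, not from this thread):

from bokeh.models import AjaxDataSource
from bokeh.plotting import figure, show

# The browser polls a plain HTTP endpoint, so nothing is kept in sync
# (or stored) by a Bokeh server. With mode='append' the endpoint must
# return only points it has not sent before, or the client accumulates
# duplicates; mode='replace' avoids that but re-sends everything.
source = AjaxDataSource(data_url='http://localhost:5000/data',
                        polling_interval=1000,    # ms between polls
                        method='GET',
                        mode='append',
                        max_size=100000)          # cap client-side memory

p = figure()
p.line('x', 'y', source=source)
show(p)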

I see two options:

  • Stop writing into the dictionary that you used to create the source. Just make the separate thread call source.stream directly (IIRC it would have to be done via doc.add_periodic_callback as well)
  • Stop sharing the dictionary: before passing the data into the ColumnDataSource constructor, just wrap it in copy.deepcopy(...) (a sketch follows after this list)
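For the second option, something like this (a sketch using the names from your snippets):

import copy

from bokeh.models import ColumnDataSource

from .bkh_lib.data_holder import dev_data_global

# Each document gets its own private copy, so source.stream appends to
# the copy and never touches the shared dictionary the thread fills.
source = ColumnDataSource(data=copy.deepcopy(dev_data_global['plot_data']))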

source.stream updates source.data too. The only intervention here would be to send websocket protocol messages manually.

To me the problem seems to be that Marculius:

  1. Updates source.data via an indirect reference
  2. Calls source.stream with the same data, which results in duplicated data
  3. Wants to avoid duplication of data anywhere

Yes, the question was about preventing source.data changes from being sent over the wire. But the described problem can be solved differently. That is, if I understood it correctly.

I update the underlying dictionary designated as source.data via
source = ColumnDataSource(data=dev_data_global['plot_data'])
in the above example.
Periodically a separate thread adds points to dev_data_global['plot_data'].

While I could use source.stream in that separate thread to update the data,
my understanding is that I would need to run it for every opened document (web page).
This would again lead to point duplication in source.data.
I could be wrong here; I’m not sure how source.stream keeps track of the documents.

copy.deepcopy just results in a new dictionary for every document (web page).
This duplicates all the data for every document, which I want to try to avoid.
Though I still need to test how much of a deal breaker this is.
With 4 devices at an order of 100,000 points each, 2 floats per point, and 8 bytes per float, we get on the order of 6.4 MB of memory usage per copy, which is not so terrible.
This could just be a case of premature optimization.

On the other hand, I wanted to avoid sending all the data at every update because I was experiencing two things:
high CPU usage on the client (web page) side, and “backlogging” (for lack of a proper term; I’m not very familiar with this) once the data got large (~5000 pts).
By backlogging I mean the data displayed was lagging behind the data being sent.
To debug, I added a text display to the web page to keep track of update ticks, and the display would be 100+ ticks behind if I let it run long enough.
The web page even continued to update after the server was shut down, with updates being resolved as they arrived.
I’m guessing this was just too much data being sent for the available network bandwidth.
The high CPU usage turned out to be mostly due to auto-ranging changing the axes at each update.
Setting it to manual helped a LOT.
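In Bokeh terms, setting the ranges manually looks roughly like this (the range values here are just placeholders):

from bokeh.models import Range1d

# Fixed ranges instead of auto-ranging: the client no longer recomputes
# the data bounds on every update.
p.x_range = Range1d(0, 600)
p.y_range = Range1d(-1.5, 1.5)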

I was hoping the source.stream problem could be solved “easily” by somehow disabling it from updating source.data, but I understand now that this wasn’t its intended usage.

For now I will try having a separate dictionary for each document (data duplication), as the other option is proving to be network and potentially CPU intensive, and reevaluate when I get to testing it on the RPi.
If needed, Flask + AjaxDataSource might be the way to proceed.
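Roughly like this on the Flask side (untested; the route, port, and CORS header are placeholders to match an AjaxDataSource polling http://localhost:5000/data):

from flask import Flask, jsonify

from bkh_lib.data_holder import dev_data_global

app = Flask(__name__)

@app.route('/data', methods=['GET'])
def data():
    # Returns the full arrays, so it pairs with mode='replace' on the
    # client; returning only deltas would need per-client bookkeeping.
    resp = jsonify(dev_data_global['plot_data'])
    resp.headers['Access-Control-Allow-Origin'] = '*'    # the Bokeh page polls cross-origin
    return resp

if __name__ == '__main__':
    app.run(port=5000)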

That is correct. Different documents are different in every regard. They cannot share anything, at least not right now.

I agree. :slight_smile:

And that’s exactly why you should not use anything but the stream method, since it sends only the new data. Yes, you will have to call it once per document.
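In outline, something like this per document (a sketch reusing the names from your snippets; in a Bokeh server app each session runs the script with its own document):

import copy

from bokeh.io import curdoc
from bokeh.models import ColumnDataSource

from .bkh_lib.data_holder import dev_data_global

doc = curdoc()    # each open page runs this script with its own document

plot_data = dev_data_global['plot_data']
source = ColumnDataSource(data=copy.deepcopy(plot_data))
sent = len(plot_data['x'])    # per-document cursor into the shared dict

def update_bkh():
    global sent
    n = len(plot_data['x'])    # snapshot first, in case the thread appends meanwhile
    if n != sent:
        source.stream({'x': plot_data['x'][sent:n],
                       'y': plot_data['y'][sent:n]})
        sent = n

doc.add_periodic_callback(update_bkh, update_time)    # update_time as in the question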

Great! Let us know if you see any issues with that approach.