Pre-loading data in Bokeh Server?

julioasotodv · January 17, 2020, 10:50pm

Hi,

I am thinking about developing a simple app using Bokeh Server. This app needs to load a large csv file in order to make some aggregates and plots.

However, for demonstration purposes, imagine I have the following simple code in a main.py file, which emulates loading relatively large data (in this case by creating a random pandas DataFrame):

import numpy as np
import pandas as pd

from bokeh.plotting import figure, curdoc
from bokeh.models import ColumnDataSource

# Creating a large dataframe:
random_df = pd.DataFrame({"a": np.random.randn(70_000_000),
                          "b": np.random.randn(70_000_000)
                          })

# Some random Bokeh figure (only the first 10 rows are plotted):
plot = figure(title="Some chart",
              width=500,
              height=300)
plot.scatter(x=random_df["a"][:10],
             y=random_df["b"][:10],
             size=10,
             fill_color="blue")

curdoc().add_root(plot)

When I serve this with bokeh serve main.py, everything works as expected at first:

Initially, memory usage does not increase, since nothing is happening apart from the Bokeh server initializing.
When I open the web browser and go to the Bokeh server’s IP and port, random_df is created (which obviously increases memory usage) and the figure is displayed in my browser. Everything looks good.
However, for every additional tab I open in the browser pointing to the same Bokeh server, memory usage keeps increasing. I guess that makes sense, since random_df is being created and held in memory once per client. (At first I thought that each incoming session would overwrite the variable, keeping memory usage reasonable, but that does not seem to be the case.)
When I start closing tabs (therefore killing websocket connections), memory usage does not decrease.

In order to make my Bokeh application truly scalable to more than 4 or 5 clients, I believe the way to go is to pre-load the data (in this case, the creation of random_df) just once per Bokeh server process (in my case one, as num-procs=1), and then create the Figure and the rest of the logic on a per-session basis.

However, I can’t find any way of doing this. I thought about using lifecycle hooks (in particular on_server_loaded), but I can’t think of a way of returning the created random_df back to the main.py bokeh application logic.

Am I missing something? Is there a preferred way of pre-loading data in Bokeh server on a per-process basis, instead of per session (client)?

Thank you!

Bryan · January 17, 2020, 11:46pm

Hi @julioasotodv, since you seem to be fairly experienced, I am going to first just point you at a relevant example with minimal commentary, but please come back with any questions.

Please see the spectrogram example:

Specifically, look at server_lifecycle.py and audio.py. The lifecycle module imports the audio module and starts a thread that updates a module-level variable in audio. Then, when any server session is created and the app code is run, it imports audio. A Bokeh server is asynchronous in a single process, so every session sees the same audio module due to the way Python caches module imports. This is one way you can set something up once beforehand that all sessions can access. This simple example only uses the data in audio in a read-only manner from the app code, so no explicit synchronization is added, but obviously you may want to consider mutexes/locks as appropriate to your use case.
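The import-caching behaviour described above can be checked without Bokeh at all. The following is only a minimal sketch: the module name shared_data, its contents, and the simulate_session helper are hypothetical stand-ins (for audio.py and for the per-session app code), not anything taken from the spectrogram example.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path

# Hypothetical shared module, playing the role of audio.py.
# Its body runs only the first time the module is imported in a process.
MODULE_SOURCE = textwrap.dedent("""
    import random
    print("loading shared data (runs once per process)")
    data = [random.random() for _ in range(5)]
""")

# Write the module to a temporary directory and make it importable.
tmp_dir = tempfile.mkdtemp()
Path(tmp_dir, "shared_data.py").write_text(MODULE_SOURCE)
sys.path.insert(0, tmp_dir)
importlib.invalidate_caches()


def simulate_session():
    """Stand-in for the app code that runs once per new session."""
    module = importlib.import_module("shared_data")
    return module.data


first = simulate_session()   # executes the module body
second = simulate_session()  # served from sys.modules, body not re-run

# Both "sessions" hold a reference to the very same object.
print(first is second)
```

Because the second import is answered from sys.modules, the "loading" message is printed only once and both calls return the same list object, which is why data stored at module level is not duplicated per session.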

julioasotodv · January 19, 2020, 10:45pm

Hi @Bryan!

Thank you so much for the tip. It actually works!

However, I decided to modify your example by not creating an explicit thread, so that the data is forced to load before anything else. This way the Tornado server is actually blocked (not accepting incoming connections) until the data is loaded, which suits my case because it would not make sense to have the server running without the data loaded first.
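For readers who want to picture this thread-free variant, here is a rough single-file sketch, not my actual code. In a real directory-format app the three parts would live in separate files; the shared_data module name, the run_session helper, and the list of random numbers standing in for the DataFrame are all assumptions made for illustration. Only the on_server_loaded hook name comes from Bokeh itself.

```python
import random
import sys
import types

# Stand-in for a sibling module file (hypothetical name: shared_data.py)
# whose only content would be `df = None`.
shared_data = types.ModuleType("shared_data")
shared_data.df = None
sys.modules["shared_data"] = shared_data


# --- would live in server_lifecycle.py ---
def on_server_loaded(server_context):
    """Lifecycle hook that Bokeh calls once when the server starts.

    Loading synchronously here (no thread) means the hook does not
    return until the data is ready.
    """
    import shared_data
    # Stand-in for the expensive load, e.g. reading a large CSV.
    shared_data.df = [random.gauss(0, 1) for _ in range(1000)]


# --- would live in main.py, which runs once per session ---
def run_session():
    import shared_data
    # Read-only access to the data that was loaded once per process.
    return shared_data.df[:10]


# The Bokeh server would call the hook itself; here we call it by hand.
on_server_loaded(server_context=None)
session_a = run_session()
session_b = run_session()
print(session_a == session_b)
```

Each simulated session sees the same pre-loaded data, and nothing is reloaded when a new session starts.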

Do you think it is worth writing an entry in Bokeh’s User Guide about this topic? I believe it could be a pretty common use case (trying to visualize large datasets with the smallest memory footprint possible), and I did not even think about looking at the audio example (I basically looked at all the others in order to find a tip…)

If you give your approval, I could come up with some text to put perhaps somewhere in Running a Bokeh server — Bokeh 2.4.2 Documentation

Thank you so much again! The more I use Bokeh (+ the HoloViz ecosystem), the more I like it. In fact, I am thinking about providing further examples for other users (combining low-level Bokeh + hvPlot + Panel to create business-grade dashboards).

Bryan · January 20, 2020, 3:16am

@julioasotodv Thanks for the kind words, glad to hear things are working! FYI, in that example the thread is there in order to have continuously updating data. If you don’t have that need, I agree it’s not necessary in general. Would certainly be happy to have any new docs or examples added!

julioasotodv · January 20, 2020, 11:56pm

@Bryan cool! I will try to come up with an explanation and a PR on GitHub to add it to the docs.