Comes from this GitHub issue.
Hi,
I am thinking about developing a simple app using Bokeh Server. This app needs to load a large CSV file in order to compute some aggregates and plots.
However, for demonstration purposes, imagine I have the following simple code that emulates loading relatively large data (in this case, creating a random pandas DataFrame) in a main.py file:
```python
import numpy as np
import pandas as pd

from bokeh.plotting import Figure, curdoc

# Creating a large dataframe:
random_df = pd.DataFrame({"a": np.random.randn(70_000_000),
                          "b": np.random.randn(70_000_000)})

# Some random Bokeh figure:
figure = Figure(title="Some chart",
                width=500,
                height=300)
figure.scatter(x=random_df["a"][:10],
               y=random_df["b"][:10],
               size=10,
               fill_color="blue")

curdoc().add_root(figure)
```
If I serve this with `bokeh serve main.py`, at first everything works as expected:
- At first there is no memory usage increase, since nothing is going on apart from the Bokeh server initializing.
- When I open the web browser and go to the Bokeh server's IP and port, `random_df` is created (which obviously increases memory usage) and the figure is displayed in my browser. Everything looks good.
- However, for every additional tab I open pointing to the same Bokeh server, memory usage keeps increasing. I guess it makes sense, since `random_df` is created and held in memory once per client (at first I would have thought that any incoming session would overwrite the variable, keeping memory usage reasonable, but that doesn't seem to be the case).
- When I start closing tabs (thereby killing websocket connections), memory usage does not decrease.
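For scale, a quick back-of-envelope calculation of the per-session cost (assuming two float64 columns of 70 million rows each, and ignoring the index) shows why only a handful of sessions is enough to exhaust memory:

```python
# Rough per-session cost of random_df: two float64 columns of
# 70 million rows each (index overhead ignored).
rows, cols, bytes_per_float64 = 70_000_000, 2, 8
per_session_gib = rows * cols * bytes_per_float64 / 1024**3
print(f"~{per_session_gib:.2f} GiB per session")  # → ~1.04 GiB per session
```

So every extra browser tab costs roughly another gigabyte.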
In order to make my Bokeh application truly scalable beyond 4 or 5 clients, I believe the way to go is to pre-load the data (in this case, the `random_df` creation) just once per Bokeh server process (in my case 1, as `num-procs=1`), and then create the figure and the rest of the logic on a per-session basis.
However, I can't find any way of doing this. I thought about using lifecycle hooks (in particular `on_server_loaded`), but I can't think of a way of passing the created `random_df` back to the main.py application logic.
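One workaround I have considered but not verified (shared_data.py and `get_df` are names I made up): since Python caches module imports per process, moving the expensive load into a separate module that main.py imports should mean the DataFrame is built only once, even though Bokeh re-executes main.py for every new session:

```python
# shared_data.py -- hypothetical helper module imported by main.py.
# Python's import system caches modules per process, so even though
# Bokeh re-executes main.py for each new session, this module (and
# the DataFrame it caches) would be created only once per process.
from functools import lru_cache

import numpy as np
import pandas as pd

@lru_cache(maxsize=1)
def get_df() -> pd.DataFrame:
    # Stand-in for loading the real CSV; kept small here.
    return pd.DataFrame({"a": np.random.randn(1_000),
                         "b": np.random.randn(1_000)})
```

main.py would then do `from shared_data import get_df` and `random_df = get_df()`, so every session would share one copy. Is that the intended pattern, or does each session somehow get its own module namespace?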
Am I missing something? Is there a preferred way of pre-loading data in Bokeh server on a per-process basis, instead of per-session (per client)?
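For concreteness, the lifecycle-hook idea I had in mind looks roughly like this (collapsed into one script here for illustration; in a directory-format app, cache.py and server_lifecycle.py would be separate files next to main.py, and the file and variable names are my own invention):

```python
import numpy as np
import pandas as pd

# cache.py: a tiny module whose only job is to hold per-process state
# that both server_lifecycle.py and main.py can import.
data = {}

# server_lifecycle.py: Bokeh calls on_server_loaded once per server
# process, before any sessions are created.
def on_server_loaded(server_context):
    # Load the big DataFrame once; every session reads the same object.
    data["random_df"] = pd.DataFrame({"a": np.random.randn(1_000),
                                      "b": np.random.randn(1_000)})

# main.py (executed once per session) would then just do:
#     from cache import data
#     random_df = data["random_df"]
```

I just don't know whether this is the supported way to hand the pre-loaded data over to the per-session code.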
Thank you!