Scatter-marker time series plots are too slow to interact with

Hi, I am making a Bokeh server app that has several plots sharing the same x_range. Most of the plots are line time series, but a few are scatter time series (date vs. direction). When there are many data points, the scatter plots respond very slowly to tools such as xbox_zoom and xpan, and because all the plots share the same x_range, the other plots respond slowly too.

Can you please suggest an efficient way to plot a scatter with many data points (e.g. 100,000) so that the plots stay responsive?

Thank you :slight_smile:

@Omi It’s possible there are some usage improvements that may help, or it could be that you are up against library limits and need to look at things like holoviz+datashader. It’s not really possible to speculate without actual code, i.e. a complete Minimal Reproducible Example.

Hi @Bryan, thank you for your reply. I would appreciate it if you could tell me whether some usage improvements are needed or whether I am hitting library limits. Please refer to the MRE below, tested on the Bokeh server. I see exactly the same behavior in my Bokeh server app as in this MRE.

You will see that when you:

  1. Use the xpan tool, the plots adjust the x-axis/x_range too slowly, and the same happens with xbox_zoom or other tools.

  2. Drag the highlighted box/selected range (green box) of the rangetool below each main plot, the plots respond too slowly.

Thank you so much.

Ps: Ignore the dummy data part in the following MRE.

import pandas as pd
import numpy as np
import random
from bokeh.plotting import figure
from bokeh.models import Button, ColumnDataSource, RangeTool
from bokeh.layouts import layout
from bokeh.io import curdoc

# =============================================================================
# Dummy data
# =============================================================================
direction = []
for i in range(int(105120/144)):
    x = random.randint(0, 360)
    j = [random.randint(1,2) for i in range(144)]
    for k in j:
        xj = x+k
        direction.append(xj)
        
speed = []
for i in range(int(105120/144)):
    x = random.randint(1, 25)
    j = [random.randint(1,2) for i in range(144)]
    for k in j:
        xj = x+k
        speed.append(xj)
        
ts = [pd.to_datetime('01-01-2022')+pd.Timedelta(x,
                                                "T") for x in np.arange(0,
                                                                        1051200,
                                                                        10)]
    
my_dict = {'Time' : ts,
           'speed1' : speed,
           'speed2' : [i+np.round(random.uniform(1,2),2) for i in speed],
           'direction1' : direction,
           'direction2' : [i+random.randint(1,3) for i in direction]}

my_df = pd.DataFrame.from_dict(my_dict)
my_df.set_index('Time', inplace=True, drop=True)
# =============================================================================

# =============================================================================
# Function to make plot similar to www.docs.bokeh.org/en/2.4.3/docs/gallery/range_tool.html
# =============================================================================
def plot(df, xrange):
    source = ColumnDataSource(data=df)
    
    tools = ['xpan', 'xwheel_zoom', 'xbox_zoom', 'xzoom_in',
              'xzoom_out', 'undo', 'reset']
    
    p = figure(width=768, height=360, x_axis_type='datetime',
                tools=tools, toolbar_location="right", x_range=xrange)
    
    for i,c in zip(df.columns, ['red', 'black']):
        
        # if the columns are direction then scatter/dot plot else line plot
        if i in ['direction1', 'direction2']:
            # marker size is set with size; dot has no width property
            p.dot(x='Time', y=i, size=10, source=source,
                  legend_label=i, color=c)
            p.title.text = "Directions Plot"
        else:
            # line thickness is set with line_width
            p.line(x='Time', y=i, line_width=2, source=source, legend_label=i,
                   color=c)
            p.title.text = "Speeds Plot"
            
    p.legend.click_policy="hide"
    
    # Range tool plot below the main plot
    range_tool = RangeTool(x_range=p.x_range)
    range_tool.overlay.fill_color = "green"
    range_tool.overlay.fill_alpha = 0.35
    
    s = figure(height=120, width=768, x_axis_type="datetime",
               y_axis_type=None, tools='', toolbar_location=None)
    
    for i,c in zip(df.columns, ['red', 'black']):
        # if the columns are direction then scatter/dot plot else line plot
        if i in ['direction1', 'direction2']:
            s.dot(x='Time', y=i, size=10, source=source, color=c)
        else:
            s.line(x='Time', y=i, line_width=2, source=source, color=c)
            
    s.add_tools(range_tool)
    s.toolbar.active_multi = range_tool
    
    return p,s
# =============================================================================

# x_range to show last six weeks of data in plot and in rangetool plot
xrange = (my_df.index.max()-pd.Timedelta(6, "W"), my_df.index.max())

# Speed plot with rangetool plot
p1,s1 = plot(my_df[['speed1', 'speed2']], xrange=xrange)
# Direction plot with range tool plot and having xrange of speed plot (p1)
p2,s2 = plot(my_df[['direction1', 'direction2']], xrange=p1.x_range)

# Button to show plots 
button1 = Button(label="Show plots", button_type="success")
# Call back function to show plots on button click
def bc1():
    lt.children = [p1, s1, p2, s2]
# Button callback
button1.on_click(bc1)

lt = layout(children=[button1])
curdoc().add_root(lt)

@Omi I would expect a single plot with 100k points to perform OK, especially if webgl can be applied, though it is getting to be on the high end of things. Maybe four separate plots start to push things too far. I did turn on webgl, and also simplified to a non-server app by changing the end of the script:

from bokeh.io import show

show(layout(children=[p1, s1, p2, s2]))

I wanted to make sure the overhead was actually the drawing, and not e.g. some unexpectedly large network traffic. I do think I would expect things to perform a little better, especially with webgl turned on. I don’t have any immediate suggestions or alternatives though, so all I can advise is to open a GitHub development discussion about this with the MRE and relevant version details.

With this amount of data the only option is to use the webgl backend, which is simple enough (figure(..., output_backend="webgl")). This brings performance from unusable to quite usable, with a lot of room for improvement.

I did some quick profiling, and with the canvas backend most of the time is spent painting, which is perfectly expected as the canvas is slow (especially for this kind of application). With the webgl backend I can see multiple places in bokehjs where excessive time is spent, which should optimize nicely in the future. Also, the webgl backend itself isn’t fully optimized, and I think we will be able to shuffle more computations to the GPU in the future.

@Omi / @Bryan

I have a very similar setup and I’m working with around 6 billion points on 45 graphs… I have had some luck altering the ColumnDataSource when zoom levels change. Essentially, the idea is to push to the browser only the data that lies within the current view. Here is a working example of the idea. I list some caveats at the end.

import numpy as np
from bokeh.events import RangesUpdate
from bokeh.io import curdoc
from bokeh.layouts import column
from bokeh.models import ColumnDataSource, Range1d
from bokeh.models import RangeSlider
from bokeh.plotting import figure


def event_callback(event):
    # Here you figure out what data should be displayed in the browser and alter the source.
    indices = np.where((x >= event.x0) & (x <= event.x1))
    source.data = {'x': x[indices], 'y': y[indices]}


def update_value(attr, old, current):
    # The slider value has changed, so the range has changed, b/c of the js_link() calls.  
    # However, for some reason, the RangesUpdate callback doesn't happen.  So we "spoof" the event.
    class Event:
        x0 = current[0]
        x1 = current[1]

    event_callback(Event())


# Generate some data
x = np.arange(1000000)
y = np.tile([0, 1], int(len(x) / 2))

# Figure out your initial display range
# With the range, you want to constrain the max zoom level to prevent zooming out so far you get too much data
source = ColumnDataSource(dict(x=x[:150000], y=y[:150000]))
fig = figure(width=800, height=200, x_range=Range1d(0, 150000, bounds=(0, 1000000), max_interval=150000, name="foo"))
fig.step('x', 'y', source=source)

# you will want some kind of slider to move around the data.  RangeTool or RangeSlider may be what you want.
slider = RangeSlider(start=0, end=1000000, value=(0, 150000), width=790)
slider.js_link('value', fig.x_range, 'start', attr_selector=0)
slider.js_link('value', fig.x_range, 'end', attr_selector=1)

#  The slider will adjust the zoom window but this doesn't seem to fire RangesUpdate.  This is the workaround.
slider.on_change("value", update_value)
# RangesUpdate is the event that will let you know the viewing window has changed
fig.on_event(RangesUpdate, event_callback)

curdoc().add_root(column([fig, slider]))

Caveats:

  1. You have to constrain the zoom level. If you let the user zoom out too far, you will be back where you started with too much data to plot.
  2. You will find in some cases there is a visible drawing effect when you are panning or zooming. You can try some things like pushing data past the visible zoom level or caching some region of the data.
  3. You may find that you need to add 1 point further on both ends of the plot to make the glyph render “end to end”. In this example with step, there is no line rendering b/c I’m not adding the values just outside the range.
  4. If you zoom in far enough that 0 or 1 points are in the visible range, you may see no data plotted. Again, you should catch this case and add in the neighboring data on either end.
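Caveats 3 and 4 can be handled in one place. Below is a minimal sketch, assuming x is sorted in ascending order (as it is in the example above). The helper name visible_slice and its pad parameter are hypothetical, not part of Bokeh. It uses np.searchsorted instead of np.where and widens the selection by one point on each side, so the glyph reaches the plot edges and a window containing zero points still returns its neighbours.

```python
import numpy as np


def visible_slice(x, x0, x1, pad=1):
    """Slice of sorted `x` covering [x0, x1], widened by `pad` points per side.

    `visible_slice` and `pad` are illustrative names, not Bokeh API.
    """
    lo = int(np.searchsorted(x, x0, side="left"))   # first index with x >= x0
    hi = int(np.searchsorted(x, x1, side="right"))  # one past last index with x <= x1
    # Widen by `pad` points and clip to the valid index range.
    return slice(max(lo - pad, 0), min(hi + pad, len(x)))


# Same shape of dummy data as in the example above.
x = np.arange(1_000_000)
y = np.tile([0, 1], len(x) // 2)

# A window that contains no sample at all still yields its two neighbours.
window = visible_slice(x, 10.2, 10.8)
new_data = {"x": x[window], "y": y[window]}
```

In event_callback, the dictionary built this way would be assigned to source.data in place of the np.where result. A larger pad is one simple way to try the "push data past the visible zoom level" idea from caveat 2.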

Honestly if you are zooming around in this much data, draw effects and constrained views don’t seem like a serious issue.

A further idea, instead of the RangeSlider, is to render your full data set once with something like datashader, show that image in a small ImageRGBA plot, and put the RangeTool on it, so the user gets a kind of thumbnail view of the entire dataset.
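To make the thumbnail idea concrete without depending on datashader, here is a rough sketch that uses np.histogram2d as a stand-in for the aggregation step. This is not what datashader does internally, only the same general idea of binning points into a fixed-size grid. The function names, grid size, and colour choice are all assumptions made for illustration.

```python
import numpy as np


def rasterize_counts(x, y, width=400, height=60):
    """Bin (x, y) points into a (height, width) grid of point counts.

    Row 0 holds the lowest y bin, column 0 the lowest x bin.
    """
    counts, _, _ = np.histogram2d(y, x, bins=(height, width))
    return counts


def counts_to_rgba(counts):
    """Turn a count grid into a 2D uint32 RGBA image (more points = more opaque)."""
    h, w = counts.shape
    img = np.zeros((h, w), dtype=np.uint32)
    # View the same memory as four uint8 channels per pixel (R, G, B, A).
    rgba = img.view(dtype=np.uint8).reshape((h, w, 4))
    peak = counts.max()
    alpha = np.zeros_like(counts) if peak == 0 else counts / peak
    rgba[..., 2] = 255                             # blue channel
    rgba[..., 3] = (255 * alpha).astype(np.uint8)  # opacity follows density
    return img


# Dummy data, roughly the size discussed in this thread.
rng = np.random.default_rng(0)
x = np.arange(100_000)
y = rng.integers(0, 360, size=x.size)

counts = rasterize_counts(x, y)
thumb = counts_to_rgba(counts)
```

The resulting array stays the same size no matter how many points went in. It could then be drawn on the small figure that hosts the RangeTool with image_rgba(image=[thumb], x=..., y=..., dw=..., dh=...), with the position and extent taken from the data ranges.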