Why does the time to update one column in a CDS increase with the total number of columns?

Crysers · March 30, 2021, 7:20am

Hello,
I started using Bokeh a few months ago and I'm new to this forum. I asked this question on StackOverflow about the different ways to update one column in a CDS and the time each update takes.

I have now investigated a bit further, and it seems to me that the time it takes to update one specific column in a CDS depends not only on the number of rows in that CDS (which would sound very logical to me), but also on the total number of columns in that CDS. This is something I would not have expected.

The following is an MRE that I made to show this behavior.

#Using Bokeh==2.3.0
import pandas as pd
from bokeh.plotting import figure, curdoc, show
from bokeh.models import ColumnDataSource
from bokeh.layouts import layout
import random
import time
import numpy as np

N_test = 3

times_list_cols = []

def update_data(n_cols, n_rows):
    # Prepare a CDS with some random value:
    df = pd.DataFrame({"Value1": [int(random.random()*10) for i in range(n_rows)]})

    # Add n_cols additional columns derived from "Value1"
    # (divide by i + 1 to avoid dividing by zero when i == 0)
    for i in range(n_cols):
        df[f"col_{i}"] = df["Value1"] / (i + 1)

    # Create actual CDS
    source = ColumnDataSource(df)
    
    # Update one column and make N_test time measurements
    new_column = source.data["Value1"]+1
    time_needed=[]
    for test_n in range(N_test):
        t0=time.time()
        source.data['Value1'] = new_column
        time_needed.append(time.time() - t0)
    
    return np.mean(time_needed)

# Check impact of columns and show results
times=[]
N_rows=5000
for n_cols in range(0,200):
    time_needed = update_data(n_cols=n_cols, n_rows=N_rows)
    times.append(time_needed)
    print(f"Time needed for {n_cols} cols and {N_rows} rows: {time_needed}")

plot_rows = figure(plot_width=800, plot_height=250)
plot_rows.line(x=np.arange(0,200), y=times)
plot_rows.yaxis.axis_label = 'Time in s'
plot_rows.xaxis.axis_label = 'Number of Columns in CDS'
plot_rows.title.text = "Increasing the number of COLUMNS in a CDS"

# Check impact of rows and show results
times=[]
N_cols=100
for n_rows in range(0,20000,100):
    time_needed = update_data(n_cols=N_cols, n_rows=n_rows)
    times.append(time_needed)
    print(f"Time needed for {N_cols} cols and {n_rows} rows : {time_needed}")

plot_cols = figure(plot_width=800, plot_height=250)
plot_cols.line(x=np.arange(0,20000,100), y=times)
plot_cols.yaxis.axis_label = 'Time in s'
plot_cols.xaxis.axis_label = 'Number of Rows in CDS'
plot_cols.title.text = "Increasing the number of ROWS in a CDS"

show(layout([[plot_rows],[plot_cols]]))

This results in the following plots (I know there are better ways to measure the time, but I think the linear increase becomes clear):

Now my question is: am I missing something, or is this the correct and intended behavior?
Is there a way to have a fast access/edit regardless of the number of columns?

Thanks for your support and discussion.

Bryan · March 30, 2021, 4:20pm

@Crysers As I mentioned on SO, this is not expected. The expectation is that updating a subset of columns is special-cased to send only the subset that actually changed. But perhaps there has been a regression at some point. Please file a GitHub issue (bug report) with the test case.
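
To make that expectation concrete, here is a minimal toy sketch in plain Python (no Bokeh involved). `ToySource`, `update_full`, `update_subset` and `payload_sizes` are hypothetical names invented for this illustration, and JSON payload size is only a stand-in for update cost. It is not Bokeh's implementation, just a model of the difference between re-sending every column and sending only the one that changed.

```python
import json


class ToySource:
    """Hypothetical stand-in for a column store. NOT Bokeh's implementation."""

    def __init__(self, data):
        self.data = {name: list(values) for name, values in data.items()}

    def update_full(self, name, values):
        # Naive strategy: replace one column, then serialize ALL columns.
        # The payload grows with the total number of columns.
        self.data[name] = list(values)
        return len(json.dumps(self.data))

    def update_subset(self, name, values):
        # Special-cased strategy: serialize only the column that changed.
        # The payload is independent of how many other columns exist.
        self.data[name] = list(values)
        return len(json.dumps({name: self.data[name]}))


def payload_sizes(n_cols, n_rows):
    """Return (full, subset) payload sizes for updating the 'Value1' column."""
    data = {"Value1": list(range(n_rows))}
    for i in range(n_cols):
        data[f"col_{i}"] = list(range(n_rows))
    source = ToySource(data)
    new_column = [v + 1 for v in data["Value1"]]
    full = source.update_full("Value1", new_column)
    subset = source.update_subset("Value1", new_column)
    return full, subset


for n_cols in (0, 10, 100):
    full, subset = payload_sizes(n_cols, n_rows=1000)
    print(f"{n_cols:>3} extra cols: full={full} bytes, subset={subset} bytes")
```

In this toy model the `subset` size stays constant as columns are added, while the `full` size grows roughly linearly. That is the shape of the measurements in the question, which is why a growing update time hints that something is touching every column.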