"Boxplot example" performance

Hello,

While reading the boxplot example, a question arose: why are some whiskers and vbars rendered slightly asymmetrically and in bold?

It turned out that the figure draws many overlapping whiskers and vbars, because the source is the original dataframe df, in which the quantiles (q1, q2 and q3) and the upper and lower bounds are “duplicated” on every row.

If you make qs the data source for the whisker and vbars, and compute the outlier bounds (upper and lower) in qs as well, then the figure becomes prettier.
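The duplication can be seen without Bokeh at all. The following is a minimal sketch using a small made-up dataframe in place of autompg2 (the values are assumptions for illustration only); it compares the number of rows a source would carry in each variant.

```python
import pandas as pd

# Made-up stand-in for the two autompg2 columns used in the example.
df = pd.DataFrame({
    "kind": ["compact"] * 4 + ["suv"] * 4,
    "hwy": [26, 28, 29, 33, 17, 18, 19, 20],
})

# One row of quantiles per kind.
qs = df.groupby("kind").hwy.quantile([0.25, 0.5, 0.75]).unstack().reset_index()
qs.columns = ["kind", "q1", "q2", "q3"]

# Merging back repeats each kind's quantiles on every original row.
merged = pd.merge(df, qs, on="kind", how="left")

print(len(merged))  # rows a source built from the merged frame would carry
print(len(qs))      # rows a source built from qs would carry
```

With the merged frame as the source, every row produces its own box and whisker at the same position, which is where the overdrawing comes from.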

My questions:

  1. Is the original example the generally accepted (idiomatic) way to create a boxplot using bokeh? Or is the “patched” version really better (and so worth opening an issue for)?

  2. Do I understand correctly that inside the bokeh “magic” the memory consumption on whisker and vbar for the original example is higher because of some proxy-objects (wrappers) for each row in the source? Or does bokeh create raster image only, so no additional memory is consumed on the wrapper?

  3. Related to question 2. I view the data sources for the displayed data using

    for i, r in enumerate(p.renderers):
        print(i, r.name)
        print(r.data_source.data)
    

    This shows two vbar renderers and one scatter (the outliers).

    But I did not understand how to view the data source for whisker, which are annotations.


Original example

import pandas as pd

from bokeh.models import ColumnDataSource, Whisker
from bokeh.plotting import figure, show
from bokeh.sampledata.autompg2 import autompg2
from bokeh.transform import factor_cmap

df = autompg2[["class", "hwy"]].rename(columns={"class": "kind"})

kinds = df.kind.unique()

# compute quantiles
qs = df.groupby("kind").hwy.quantile([0.25, 0.5, 0.75])
qs = qs.unstack().reset_index()
qs.columns = ["kind", "q1", "q2", "q3"]
df = pd.merge(df, qs, on="kind", how="left")

# compute IQR outlier bounds
iqr = df.q3 - df.q1
df["upper"] = df.q3 + 1.5*iqr
df["lower"] = df.q1 - 1.5*iqr

source = ColumnDataSource(df)

p = figure(x_range=kinds, tools="", toolbar_location=None,
           title="Highway MPG distribution by vehicle class",
           background_fill_color="#eaefef", y_axis_label="MPG")

# outlier range
whisker = Whisker(base="kind", upper="upper", lower="lower", source=source)
whisker.upper_head.size = whisker.lower_head.size = 20
p.add_layout(whisker)

# quantile boxes
cmap = factor_cmap("kind", "TolRainbow7", kinds)
p.vbar("kind", 0.7, "q2", "q3", source=source, color=cmap, line_color="black")
p.vbar("kind", 0.7, "q1", "q2", source=source, color=cmap, line_color="black")

# outliers
outliers = df[~df.hwy.between(df.lower, df.upper)]
p.scatter("kind", "hwy", source=outliers, size=6, color="black", alpha=0.3)

p.xgrid.grid_line_color = None
p.axis.major_label_text_font_size="14px"
p.axis.axis_label_text_font_size="12px"

show(p)

“Patched” example

import pandas as pd

from bokeh.models import ColumnDataSource, Whisker
from bokeh.plotting import figure, show
from bokeh.sampledata.autompg2 import autompg2
from bokeh.transform import factor_cmap

df = autompg2[["class", "hwy"]].rename(columns={"class": "kind"})

kinds = df.kind.unique()

# compute quantiles
qs = df.groupby("kind").hwy.quantile([0.25, 0.5, 0.75])
qs = qs.unstack().reset_index()
qs.columns = ["kind", "q1", "q2", "q3"]
# Patch 1
#df = pd.merge(df, qs, on="kind", how="left")

# compute IQR outlier bounds
# Patch 2
#iqr = df.q3 - df.q1
#df["upper"] = df.q3 + 1.5*iqr
#df["lower"] = df.q1 - 1.5*iqr
iqr = qs.q3 - qs.q1
qs["upper"] = qs.q3 + 1.5*iqr
qs["lower"] = qs.q1 - 1.5*iqr
df = pd.merge(df, qs, on="kind", how="left")

# Patch 3
#source = ColumnDataSource(df)
source = ColumnDataSource(qs)

p = figure(x_range=kinds, tools="", toolbar_location=None,
           title="Highway MPG distribution by vehicle class",
           background_fill_color="#eaefef", y_axis_label="MPG")

# outlier range
whisker = Whisker(base="kind", upper="upper", lower="lower", source=source)
whisker.upper_head.size = whisker.lower_head.size = 20
p.add_layout(whisker)

# quantile boxes
cmap = factor_cmap("kind", "TolRainbow7", kinds)
p.vbar("kind", 0.7, "q2", "q3", source=source, color=cmap, line_color="black")
p.vbar("kind", 0.7, "q1", "q2", source=source, color=cmap, line_color="black")

# outliers
outliers = df[~df.hwy.between(df.lower, df.upper)]
p.scatter("kind", "hwy", source=outliers, size=6, color="black", alpha=0.3)

p.xgrid.grid_line_color = None
p.axis.major_label_text_font_size="14px"
p.axis.axis_label_text_font_size="12px"

show(p)

Bokeh mostly provides building-block components, rather than schematized charts of any kind. That’s why there is a boxplot example rather than a boxplot function. There are potentially lots of ways to render box plots [1], and generally the only reason to prefer one over another is that it satisfies your needs better in some way. I don’t recall why this example is the way it is. I updated it in a hurry before a release deadline, so it’s probably just not as good as it could be. PRs to improve examples are always welcome.

Do I understand correctly that inside the bokeh “magic” the memory consumption on whisker and vbar for the original example is higher because of some proxy-objects (wrappers) for each row in the source? Or does bokeh create raster image only, so no additional memory is consumed on the wrapper?

Do you mean Python process memory usage, or browser memory usage? Just to clarify up front: the “Python side” of Bokeh does almost nothing except generate a blob of JSON to be consumed by the BokehJS JavaScript runtime in the browser. All of the rendering, eventing, everything, only actually happens in JavaScript.
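One way to get a feel for that JSON blob is to generate it directly. The sketch below uses bokeh.embed.json_item, which is just one of several embedding routes; the helper name payload_size and the row counts are made up for illustration, and the exact payload layout is an internal detail that varies between Bokeh versions.

```python
import json

from bokeh.embed import json_item
from bokeh.plotting import figure


def payload_size(n_rows):
    """Length of the JSON generated for a one-glyph plot with n_rows data rows."""
    p = figure()
    p.scatter(x=list(range(n_rows)), y=[0.5] * n_rows)
    return len(json.dumps(json_item(p)))


small = payload_size(10)
large = payload_size(1000)
print(small, large)  # the payload grows with the number of rows in the source
```

Nothing is rendered here; Python only produces the description that BokehJS would later draw.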

As for rendering specifically, BokehJS supports HTML raster canvas primarily, but also has backends for WebGL and SVG output in the browser (those backends may offer some benefit, e.g. WebGL may be more performant for larger data sets, but may also be lacking some features that the primary HTML canvas implementation affords).
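For reference, the backend is chosen per plot through the output_backend property. A minimal sketch with arbitrary data:

```python
from bokeh.plotting import figure

# "canvas" is the default; "webgl" and "svg" are the alternatives.
p = figure(output_backend="webgl")
p.scatter([1, 2, 3], [4, 5, 6])

print(p.output_backend)
```

Whether a different backend actually helps depends on the glyphs and data sizes involved, so it is worth measuring rather than assuming.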

There are currently no row-based APIs or abstractions in Bokeh. As suggested by the model name, a ColumnDataSource is a columnar data store. And the columns inside are all references to whatever you used to construct the CDS (i.e. Python lists, NumPy arrays, Pandas or Arrow series), not copies of them. There is some small overhead for a CDS, and for Bokeh model representations in general, but it is typically minimal. So more or less, Python memory usage is due to whatever data you yourself load.
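The columnar layout is easy to inspect. In this small sketch (the dataframe values are made up), the .data property maps each column name to a whole column, and there is no per-row object anywhere:

```python
import pandas as pd
from bokeh.models import ColumnDataSource

# Made-up per-kind statistics.
qs = pd.DataFrame({
    "kind": ["compact", "suv"],
    "q1": [26.0, 17.0],
    "q3": [29.0, 19.0],
})
source = ColumnDataSource(qs)

# Each entry is an entire column; the dataframe index is carried as a column too.
for name, column in source.data.items():
    print(name, type(column).__name__, len(column))
```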

OTOH if you create two CDS from the same Pandas DataFrame, they are serialized separately and deserialized into distinct objects in BokehJS. So that would result in double the memory usage in the browser (although the JSON is created in the Python process so it will be reflected there too, until it is garbage collected).
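The practical consequence is to build one CDS and pass it to every glyph that needs it, as the boxplot example does for its two vbar calls. A minimal sketch with made-up values, contrasting a shared source with two separate ones:

```python
import pandas as pd
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

# Made-up per-kind statistics.
qs = pd.DataFrame({
    "kind": ["compact", "suv"],
    "q1": [26.0, 17.0], "q2": [27.0, 17.5], "q3": [29.0, 19.0],
})
p = figure(x_range=list(qs.kind))

# One source shared by both renderers.
shared = ColumnDataSource(qs)
top = p.vbar("kind", 0.7, "q2", "q3", source=shared)
bottom = p.vbar("kind", 0.7, "q1", "q2", source=shared)
print(top.data_source is bottom.data_source)    # True

# Two sources built from the same dataframe are distinct objects.
top2 = p.vbar("kind", 0.7, "q2", "q3", source=ColumnDataSource(qs))
bottom2 = p.vbar("kind", 0.7, "q1", "q2", source=ColumnDataSource(qs))
print(top2.data_source is bottom2.data_source)  # False
```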

But I did not understand how to view the data source for whisker, which are annotations

Only some annotations (DataAnnotations) are configured with a data source. For those, the data source is in the .source property. There is some historical baggage and inconsistencies here, e.g. with .data_source on glyph renderers, that are slowly being worked on.
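As a sketch of the difference (made-up values; looking the whisker up with p.select is just one convenient way to find it, since it is added as a layout annotation and does not appear in p.renderers):

```python
from bokeh.models import ColumnDataSource, Whisker
from bokeh.plotting import figure

source = ColumnDataSource({
    "kind": ["compact", "suv"],
    "lower": [21.5, 14.0],
    "upper": [33.5, 22.0],
})

p = figure(x_range=["compact", "suv"])
p.add_layout(Whisker(base="kind", upper="upper", lower="lower", source=source))
p.scatter("kind", "upper", source=source)

# Glyph renderers expose their data through .data_source ...
for r in p.renderers:
    print(type(r).__name__, r.data_source.data)

# ... while a Whisker exposes it through .source.
whiskers = list(p.select({"type": Whisker}))
for w in whiskers:
    print(type(w).__name__, w.source.data)
```

Keeping a reference to the whisker object when you create it avoids the lookup entirely.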

In the future, more or less, there will only be glyphs. “Annotation” will become an action you can perform using whatever glyphs are handy for your specific use case, rather than a distinct class of objects with a separate type of “Annotation”. That unification has started but is still a ways down the road.


  1. For instance you can see how different old versions of this example were before Whisker existed. You could still create a box plot that way if you wanted, though.
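To illustrate the glyph-only approach, here is a hedged sketch that draws the stems with segment and the caps with thin rect glyphs instead of a Whisker. It is not a copy of the old example, and the statistics are made up:

```python
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

# Made-up per-kind statistics, one row per box.
source = ColumnDataSource({
    "kind": ["compact", "suv"],
    "q1": [26.0, 17.0], "q2": [27.0, 17.5], "q3": [29.0, 19.0],
    "lower": [21.5, 14.0], "upper": [33.5, 22.0],
})

p = figure(x_range=["compact", "suv"], y_axis_label="MPG")

# stems: vertical segments from each box edge out to the bounds
p.segment("kind", "upper", "kind", "q3", source=source, line_color="black")
p.segment("kind", "lower", "kind", "q1", source=source, line_color="black")

# boxes
p.vbar("kind", 0.7, "q2", "q3", source=source, line_color="black")
p.vbar("kind", 0.7, "q1", "q2", source=source, line_color="black")

# caps: short horizontal marks at the bounds, drawn as thin rects
p.rect("kind", "lower", 0.2, 0.01, source=source, line_color="black")
p.rect("kind", "upper", 0.2, 0.01, source=source, line_color="black")
```

Because everything here is a glyph renderer, all six renderers show up in p.renderers and share the same data_source.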


@Bryan I am very grateful for the detailed explanation.

The forum is really “question-driven documentation” :)

@Bryan What about the CDSView object? Does it operate on the ColumnDataSource only on the BokehJS side, without doubling the memory usage but at the cost of extra computation in the browser? Or does it also double the memory when serialized from the source?

CDSView only computes/stores row indices into the columns; it does not store any data itself.
