Nested x-axis categoricals from pandas dataframe without groupby aggregation

Hi, I’m new to Bokeh and the forum.

I am trying to build SPC charts in Bokeh. (I saw a good example in the gallery, but no code available?) My actual use case is (obv) more complicated than this, with typical dataframes having 10k+ obs (rows) and ~70 tracked variables (cols).

Want to have:

Nested x-axis categoricals: does this require a ColumnDataSource? Does it require aggregation? Can I plot these directly from a pandas df? Is there a limit to how many? The error messages appear to indicate a limit of three; why? Tableau, AFAICT, has no limit. Is there a speed hit? Plotting ~8 datapoints takes 1 full second on a brand-new laptop.

Tooltips: does this require a ColumnDataSource? When plotting from a dataframe, I can get tooltips, but the values are “???”, and each data point has ~6 associated “???”'s with it.

Here’s a toy problem based on the docs. Is this the canonical way to do this? Can I declare a CDS without a df.groupby(), and then nest the x-axis categoricals? (Most categorical plots I try end up empty, which is and will be a separate question.)

import pandas as pd
from bokeh.io import output_notebook, reset_output, show
from bokeh.models import ColumnDataSource, FactorRange
from bokeh.plotting import figure

reset_output()
output_notebook()
fruits = ['Apples', 'Pears']
years = ['2015', '2016']

data = {'fruits' : fruits,
        '2015'   : [2, 1],
        '2016'   : [5, 3]}

fruit_df = pd.DataFrame(data).set_index("fruits")
display(fruit_df) # not tidy

tidy_df = fruit_df.reset_index().melt(id_vars=["fruits"], var_name="year")
tidy_df = tidy_df.rename(columns={"fruits":"fruit"})
display(tidy_df) # tidy

# make pandas group
group_cols = ["fruit", "year"]
group = tidy_df.groupby(group_cols)

# declare these variables because later Bokeh arguments are string-based
x_string = "_".join(group_cols)
y_col = "value"

# make CDS of group
source = ColumnDataSource(group)

# make figure
p = figure(plot_height=350, 
           x_range=group, 
           title="Fruit by Year",
           toolbar_location=None, 
           tools="")

# add glyphs? renderers? dunno?
p.circle(x=x_string, 
         y=y_col + "_mean", # why do we need to calculate the mean?
         width=5, 
         source=source,
        )

# show figure
show(p)

Output:

TIA, have spent a couple days on this so would really appreciate getting unstuck.

Hi GoMrPickles! Welcome to the forum!

Generally yes, the ColumnDataSource is going to be your driver for Bokeh models and tools. And there is a limit of 3 on nested categorical axes.
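As a rough illustration of the three-level limit, here is a minimal sketch that builds three-level factors straight from DataFrame columns, with no groupby. The column names (`line`, `shift`, `lot`) and the values are made up for illustration and do not come from this thread; the sketch assumes every level is a string and that each combination appears once.

```python
import pandas as pd
from bokeh.models import ColumnDataSource, FactorRange
from bokeh.plotting import figure

# Hypothetical SPC-style data: one row per (line, shift, lot) combination.
df = pd.DataFrame({
    "line":  ["A", "A", "B", "B"],
    "shift": ["day", "night", "day", "night"],
    "lot":   ["101", "102", "103", "104"],
    "value": [2.1, 2.4, 1.9, 2.2],
})

levels = ["line", "shift", "lot"]  # three levels of nesting is the maximum

# One tuple of strings per row; it serves as the x coordinate of that row.
df["x"] = list(zip(*(df[c].astype(str) for c in levels)))

# Unique factors in order of first appearance define the axis.
factors = list(dict.fromkeys(df["x"]))
x_range = FactorRange(*factors)

source = ColumnDataSource(df)
p = figure(x_range=x_range, title="Value by line / shift / lot")
p.scatter(x="x", y="value", size=10, source=source)
```

Calling `show(p)` afterwards would render the plot; it is left out here so the sketch has no side effects.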

If I’m understanding your question correctly about setting up nested categoricals without a groupby, then this example may be helpful (although based on the fruits in your code, I suspect that you may have seen it already).

If not, it may be more useful to post something closer to your actual code which ends up empty, or your hovertools which aren’t working, and we’d be happy to find and address whatever issues are seen there.

Which gallery plot were you hoping to find the code for? All of the examples should have code in the examples subdirectory of the Bokeh repo, so it’s probably there.


Hi, thanks for the response.

I did indeed see the fruits example, and that’s what I based my toy problem on. However, it does not use a dataframe, and as the rest of my workflow (SQL queries to multiple databases, df merges, calculations, formatting, etc.) is already in pandas, I’d love to stick with it. Otherwise, I suppose I can dump things into a dictionary, then make a CDS from that… but it seems like that defeats the purpose of pandas. I saw bokeh_catplot but haven’t tried it yet. That is the main problem I’m currently dealing with.

The example I saw was in the showcase:

I will post separate questions w.r.t. the other questions I asked, but it does take a while to find toy problems that replicate the issues I get with my actual workflow.

It doesn’t matter - you can create the required ColumnDataSource either way, just pass your data frame directly into its constructor.

You can use Pandas just as before, but you will have to convert the resulting data frame to a ColumnDataSource right before plotting it. You do not need a Python dictionary specifically to work with Bokeh data sources, as I mentioned above.
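For reference, a minimal sketch of passing a data frame straight into the constructor, using the tidy fruit data from the toy problem. As far as I know, each DataFrame column becomes a CDS column of the same name, and the index is added as a column too (named `index` when the index itself is unnamed).

```python
import pandas as pd
from bokeh.models import ColumnDataSource

# Same tidy data as the toy problem, written out directly.
tidy_df = pd.DataFrame({
    "fruit": ["Apples", "Pears", "Apples", "Pears"],
    "year":  ["2015", "2015", "2016", "2016"],
    "value": [2, 1, 5, 3],
})

# No dictionary and no groupby: the DataFrame goes in as-is.
source = ColumnDataSource(tidy_df)

# These names are what glyph methods and tooltips refer to,
# e.g. y="value" or "@fruit".
print(source.column_names)
```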


Hi, thanks for responding. I’m afraid I don’t follow.

Here is the error I get if I use a dataframe without a groupby.

from bokeh.io import output_file, show
from bokeh.models import ColumnDataSource, FactorRange
from bokeh.plotting import figure

reset_output()
output_notebook()

# make list of tuples to use as factors
factors = tidy_df.set_index(["fruit", "year"]).index.tolist()
print(factors)

# attempt to plot without using pd.groupby()
source = ColumnDataSource(tidy_df)

# make figure
p = figure(plot_height=400,
           plot_width=400,
           x_range=FactorRange(*factors), 
           title="Fruit by Year",
)

# add glyphs? renderers? dunno?
p.circle(x=factors, 
         y="value",
         width=5, 
         source=source,
        )

# show figure
show(p)

Output:

[('Apples', '2015'), ('Pears', '2015'), ('Apples', '2016'), ('Pears', '2016')]
    ---------------------------------------------------------------------------
    RuntimeError                              Traceback (most recent call last)
    <ipython-input-554-125b597928c2> in <module>
         24          y="value",
         25          width=5,
    ---> 26          source=source,
         27         )
         28 

~\AppData\Local\Continuum\anaconda3\envs\py37\lib\site-packages\bokeh\plotting\_decorators.py in wrapped(self, *args, **kwargs)
     52             for arg, param in zip(args, sigparams[1:]):
     53                 kwargs[param.name] = arg
---> 54             return create_renderer(glyphclass, self, **kwargs)
     55 
     56         wrapped.__signature__ = Signature(parameters=sigparams)

~\AppData\Local\Continuum\anaconda3\envs\py37\lib\site-packages\bokeh\plotting\_renderer.py in create_renderer(glyphclass, plot, **kwargs)
     92     incompatible_literal_spec_values += _process_sequence_literals(glyphclass, glyph_visuals, source, is_user_source)
     93     if incompatible_literal_spec_values:
---> 94         raise RuntimeError(_GLYPH_SOURCE_MSG % nice_join(incompatible_literal_spec_values, conjuction="and"))
     95 
     96     # handle the nonselection glyph, we always set one

RuntimeError: 

Expected x to reference fields in the supplied data source.

When a 'source' argument is passed to a glyph method, values that are sequences
(like lists or arrays) must come from references to data columns in the source.

For instance, as an example:

    source = ColumnDataSource(data=dict(x=a_list, y=an_array))

    p.circle(x='x', y='y', source=source, ...) # pass column names and a source

Alternatively, *all* data sequences may be provided as literals as long as a
source is *not* provided:

    p.circle(x=a_list, y=an_array, ...)  # pass actual sequences and no source

import pandas as pd
from bokeh.io import show
from bokeh.models import ColumnDataSource, FactorRange
from bokeh.plotting import figure

data = {'fruit': ['Apples', 'Pears'],
        '2015': [2, 1],
        '2016': [5, 3]}

tidy_df = (pd.DataFrame(data)
           .melt(id_vars=["fruit"], var_name="year")
           .assign(fruit_year=lambda df: list(zip(df['fruit'], df['year'])))
           .set_index('fruit_year'))

p = figure(x_range=FactorRange(factors=sorted(tidy_df.index.unique())),
           tooltips=[('Fruit', '@fruit'),
                     ('Year', '@year'),
                     ('Value', '@value')])
cds = ColumnDataSource(tidy_df)
p.circle(x='fruit_year', y='value', source=cds)

show(p)

Regarding the exception, @GoMrPickles can you suggest any improvements? I tried to make it as descriptive and immediately actionable as I knew how. As it states, if you pass any data via a source argument and column name, you must pass all data that way. You cannot “mix and match” column names and lists/arrays in the same glyph call.
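To make the “no mix and match” rule concrete, here is a small sketch of the two accepted forms applied to the fruit data from this thread. It uses `scatter` with `size` instead of `circle` with `width`, and the column name `fruit_year` is just a choice for illustration.

```python
import pandas as pd
from bokeh.models import ColumnDataSource, FactorRange
from bokeh.plotting import figure

tidy_df = pd.DataFrame({
    "fruit": ["Apples", "Pears", "Apples", "Pears"],
    "year":  ["2015", "2015", "2016", "2016"],
    "value": [2, 1, 5, 3],
})
factors = list(zip(tidy_df["fruit"], tidy_df["year"]))

# Form 1: everything by column name, plus a source.
# The factor tuples must first become a column of the source.
source = ColumnDataSource(tidy_df.assign(fruit_year=factors))
p1 = figure(x_range=FactorRange(*factors))
r1 = p1.scatter(x="fruit_year", y="value", size=10, source=source)

# Form 2: everything as literal sequences, and no source argument.
p2 = figure(x_range=FactorRange(*factors))
r2 = p2.scatter(x=factors, y=tidy_df["value"].tolist(), size=10)
```

Form 1 is usually the one to prefer once tooltips are involved, since the hover fields (`@fruit`, `@year`) are looked up by column name in the source.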


Hi @Bryan, unfortunately I do not understand Bokeh well enough to understand your question.

React has (or had, at least) a nice way of handling that. Each exception that it throws itself also comes along with a link that directs to its common errors knowledge base, with a longer description, some examples, some reasoning.

That’s not a bad idea. I don’t know that it’s necessary for every exception, but there are definitely some (like this one) that need more explanation and remediation than is reasonable to include directly in an exception text.

@GoMrPickles For reference, I specifically am wondering how this error message might be improved:

RuntimeError: 

Expected x to reference fields in the supplied data source.

When a 'source' argument is passed to a glyph method, values that are sequences
(like lists or arrays) must come from references to data columns in the source.

For instance, as an example:

    source = ColumnDataSource(data=dict(x=a_list, y=an_array))
   
    p.circle(x='x', y='y', source=source, ...) # pass column names and a source

Alternatively, *all* data sequences may be provided as literals as long as a
source is *not* provided:

    p.circle(x=a_list, y=an_array, ...)  # pass actual sequences and no source 

The line in your code that causes this error (viewable in the traceback) is:

p.circle(x=factors,   # this is a real concrete list/array
         y="value",   # this is a string column name -- can't have both
         width=5, 
         source=source,
)

BTW why can’t we have both? Is it such a bad idea?

Was about to also ask how we can avoid fruit + year + fruit_year duplication, but realized that one can use an expression. Maybe worth adding to the main Handling Categorical Data page as an example.

We used to allow both. But allowing both means that you have to mutate the source argument to add the concrete literal that got passed in. This caused a lot of support activity when people did not understand why columns in their CDS were getting modified out from under them, or when column names were different than their expectations. It was demonstrably too confusing, and it was an easy decision to remove this as an option in that context. Easier and better to be able to state: “we will not modify your CDS, period”.

Hi @p-himik, thank you for this example. It is very helpful. I was not familiar with all of the syntax but I think I understand it. I find the manner in which the index (factors) are created and named to be confusing.

Is it required to sort the FactorRange? In the toy problem, I did not see a difference in output with or without sorting.

No, but the order of the factors specifies the order of the labels on the axis: https://docs.bokeh.org/en/latest/docs/user_guide/categorical.html#sorted
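A quick sketch of what that means in practice, using the factors from the toy problem. The custom ordering (Pears before Apples) is an arbitrary choice for illustration.

```python
from bokeh.models import FactorRange

factors = [("Apples", "2015"), ("Pears", "2015"),
           ("Apples", "2016"), ("Pears", "2016")]

# Plain sorted(): alphabetical by fruit, then by year.
alphabetical = sorted(factors)

# Custom order: Pears first, years ascending within each fruit.
fruit_order = {"Pears": 0, "Apples": 1}
custom = sorted(factors, key=lambda f: (fruit_order[f[0]], f[1]))

# Whichever list is passed here is the left-to-right order on the axis.
x_range = FactorRange(factors=custom)
```

Keeping all factors that share a top-level category next to each other, as both orderings above do, keeps each group together on the axis.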


Why would we need to modify the data source? Just pass the array along. Seems like only VectorSpec.array would need a change, apart from that check.

Because a single CDS is the only mechanism to pass data to glyphs. Even if you pass list/array literals, we just create a CDS with default column names behind the scenes for you. Glyph rendering is already horrendously complex— intersecting selection, non-selection, LoD, and muted rendering, hit testing, and tooltip generation, auto-ranging bounds computation… the one thing that keeps that sane is that one glyph gets data always and only from one single CDS.

And even were we to consider allowing this complexity, ill-defined situations would open up elsewhere. What happens when a user passes x=data, y="x" and wants a hover tooltip for "@x"? You could argue either way, which means 100% that someone will get surprised.
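The implicit CDS mentioned above can be inspected on the renderer that a glyph method returns. A minimal sketch with made-up numbers:

```python
from bokeh.plotting import figure

p = figure()

# Literal sequences and no source: a ColumnDataSource is created behind the scenes.
r = p.scatter(x=[1, 2, 3], y=[4, 5, 6], size=10)

# The implicit source is reachable from the renderer; its columns take
# default names derived from the glyph arguments.
print(r.data_source.column_names)
```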


I think that it’s pretty tough. I was confused on a couple of levels: first, using a ColumnDataSource or a pandas df. Second, whether it was better to use arrays or a CDS (I still don’t know). Third, bugs in my code (I assume) were causing plots to take minutes to plot; converting a DF with ~140k cells took 20 minutes. As I am new to Bokeh, I have no idea if this is normal or not.

In general, for new users, having multiple ways of doing something (with no guidance on which one is recommended) can be somewhere between confusing and hopeless. Throw in deadlines and it doesn’t get better.

Regarding your specific question, I suggest some edits a la:

RuntimeError: 

Expected x to reference fields in the supplied data source.

When a 'source' argument is passed to a glyph method, values that are sequences
(like lists or arrays) must come from references to data columns in the source.

For instance, as an example:

    source = ColumnDataSource(data=dict(x=a_list, y=an_array))
   
    p.circle(x='x', y='y', source=source, ...) # 'x' and 'y' must be valid column names in source

Alternatively, *all* data sequences may be provided as literals as long as a
source is *not* provided:

    p.circle(x=a_list, y=an_array, ...)  # pass actual sequences and no source 

I was also confused by mixing pandas and non-pandas sources; in my experience, as long as some variable references some value, it’s worked out. My experience is lacking, I see. :slight_smile:

Using arrays will just create an implicit ColumnDataSource for you, so neither of the approaches is bad.

And on the contrary - having just one way of doing something that’s not utterly trivial is usually detrimental for people that already know their way around.

Life for new users is made simpler by examples, of which there are plenty: many tens or even a few hundred in the Bokeh source code and documentation, thousands if you count StackOverflow and Discourse.

Not related to Bokeh in any way, but for me it doesn’t seem like a good strategy to start using something entirely new when you have a deadline.
Of course, there are situations when you don’t know anything that would help you solve some task, and there aren’t enough resources to hire help. In that case, that’s what platforms like this one are for. Of course, there can be a delay, sometimes measured in weeks, but that’s just the law of life described in RFC1925, section 2, item 7a. :slight_smile:

Thanks! Looks nice.
One thing though - I would not put it into an error message by itself. It’s hard to format, there’s no syntax highlighting, it’s very clumsy to link to such a description from elsewhere. As I mentioned above, having a link to such a description at some online knowledge base would solve these issues.

Oh, I agree - but I also think it’s important to document the newbie experience, as it’s often forgotten once something has been mastered.

Regarding my choice of Bokeh, it wasn’t my first choice. It was something I wanted to explore “eventually.” I am redoing an existing manufacturing workflow that leans heavily on JMP, Excel, and cut-and-paste. I tried Tableau; for whatever reason, it’s unusably slow with our data sources, and cross-database joins would not even complete dummy data extracts. Tableau can also only co-plot two database sources; I needed at least three. It also can’t write data easily, outside of CSV exports, which returns to the cut-and-paste workflow I’m trying to solve.

I switched to Matplotlib, then Seaborn, but those couldn’t do factor plots. There’s a hack on StackOverflow for doing factor plots in MPL, but I also wanted Tableau-style filters and hover widgets, which I think Bokeh can do. I also didn’t want to lean so heavily on a custom MPL layout function. Widgets might be possible in MPL, but… I decided to bite hard into the Bokeh sandwich. So, after three or four other dead ends, here I am learning Bokeh! Which I have wanted to learn anyway.


Absolutely, and we’re improving bit by bit. :slight_smile: Hence all the questions about how we can improve the error message.

WRT the plotting itself, so far Bokeh is the fastest library that I’ve found, especially when it comes to quite populated plots and interactions. Have been “selling” it to friends ever since.

I don’t have any experience with Tableau, so can’t really confirm anything. But “hover widgets” sounds like something we don’t have. If it’s something like “show an input field or a button when you hover over a data point” then it should be possible to implement even in user space, but it’s not trivial and requires some knowledge of web development in general.

In any case, glad you decided to learn Bokeh!