Glyphs with nonexistent column names (non-continuous ranges!)

loremaster.cerberus · April 20, 2022, 8:42pm

My Goal:

Create a plot with a dictionary’s keys on the y axis and its values (lists specifying a range of numbers) drawn as a series of either discrete or continuous glyphs along the x axis. The measures should be vertically separated by some distance (like you’d see in an hbar). The twist: the ranges are not continuous.

My Problem:

Suppose I have a dictionary:

data = {'a': [0, 1, 2, 3, nan, nan, 6, ..... 99999, 10000],
        'b': [0, 1, nan, 3, nan, 5, 6, ..... 99999, nan]}

I added nan values into each list to ensure that every value has the same length (10000 in this example), because this is a required property for the ColumnDataSource.
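For illustration, the padding step could look something like the minimal sketch below. It assumes each measure starts out as a collection of covered positions; the helper name `pad_with_nan` and its parameters are hypothetical, not part of Bokeh.

```python
import math

def pad_with_nan(covered, length):
    """Return `length` entries: the position itself where covered, NaN elsewhere."""
    covered = set(covered)
    return [float(i) if i in covered else math.nan for i in range(length)]

# Positions 4, 5 and 7 are not covered, so they become NaN placeholders.
row = pad_with_nan([0, 1, 2, 3, 6], 8)
print(len(row))
```

Every list built this way has the same length, which is what the ColumnDataSource needs.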

I thought I might need to specify the names of the keys and values of data for glyph rendering and axis specification:

taxa = list(data.keys())
values = list(data.values())
source = ColumnDataSource(data=data)
p = figure(y_range=taxa, height=250, x_range=(x_min, x_max), title="Coverage by Taxa",
                toolbar_location=None, sizing_mode="stretch_width")

Then, I tried a few different methods of glyph rendering:

p.circle(x='values', y=jitter('taxa', width=0.6, range=p.y_range), source=source, alpha=0.3)

p.hbar_stack(values, y='taxa', height=0.9, color=GnBu3, source=source)

p.line(x='values', y='taxa', source=source, line_width=2)

It may very well be that I am just using one of the above incorrectly, but in examples I’ve found I’ve never seen these non-continuous ranges plotted before via Bokeh.

In every case, I’m not getting anything plotted, and I’m getting the error:
ERROR:bokeh.core.validation.check:E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name.

The figure is getting created though - looks something like this:

In my debugger, I looked inside of source and discovered that the column names do correspond to the keys of the dictionary (what I need for the y-range). But I have no idea about the numerical values - could it be the data attribute?

I want the final graph to look something like this:

Which glyph renderer should I use for this? And what can I specify for ‘x’, exactly?

Thanks.

gmerritt123 · April 21, 2022, 1:26pm

Hey @loremaster.cerberus, you are on the right track and I totally see where you’re getting tripped up. Bokeh can accept (at least) 3 different things for x and y arguments when making line glyphs/renderers → you can feed them arrays, single scalar values, or strings, and you can mix them. So for example,

p.line(x=[0,1,2,3],y=[5,5,5,5])
p.line(x=[0,1,2,3], y= 5)

will both plot a horizontal line from 0-3 at y=5.

Now where it gets interesting, is if you feed it a string, bokeh expects that string to point to a field name in the ColumnDataSource it’s running off of.

So,

banana_src = ColumnDataSource(data={'bananas':[5,5,5,5]})
p.line(x=[0,1,2,3],y='bananas',source=banana_src)

will plot the same horizontal line as above.
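As a quick sanity check, you can list the column names of a source and compare them with the strings you hand to a glyph. This is only a sketch, assuming bokeh is installed; the small data dict and the `requested` list are made up for illustration.

```python
from bokeh.models import ColumnDataSource

# Same shape as the dictionary in the question: the keys become column names.
source = ColumnDataSource(data={'a': [0.0, 1.0, 2.0], 'b': [0.0, 1.0, 2.0]})

# Strings passed to a glyph are looked up among these names.
available = set(source.column_names)

# 'values' and 'taxa' are Python variable names, not columns in the source.
requested = ['values', 'taxa', 'a']
missing = [name for name in requested if name not in available]
print(missing)
```

Any name that ends up in `missing` is the kind of string that triggers the E-1001 (BAD_COLUMN_NAME) message.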

However, you are feeding the line a string arg for y, trying to tell it to plot at that string’s categorical value on the axis. But that’s not how bokeh’s gonna interpret that string: it will go looking for a column with that name. So how do we work around that? Well, the trick is simply that when you assign a categorical range to an axis, it’s basically just a mask being applied to a numerical range → your categorical axis ticks just get plotted at 0.5, 1.5, 2.5 etc. (I think this was to make things like the vbar/hbar convenience functions work nicely). So we can leverage that knowledge → for the y arg, pass the single numeric value we know corresponds to each categorical tick label. See the working example below, and absorb the comments:

# -*- coding: utf-8 -*-
"""
Created on Thu Apr 21 08:40:49 2022

@author: harol
"""

import numpy as np
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource

n=100

#float arrays, so that np.nan can mark the gaps
data = {'a':np.arange(0,n,dtype=float)
        ,'b':np.arange(0,n,dtype=float)}

#replace between n/5 and n/3 randomly picked positions with nan (repeats possible)
for k in data.keys():
    num_nans = np.random.randint(int(n/5),int(n/3))
    nan_inds = [np.random.randint(0,n-1) for _ in range(num_nans)]
    data[k][nan_inds] = np.nan
    
#plotting time
f = figure(y_range=list(data.keys())) #y_range now categorical
colors = ['blue','red']
#now the "trick"
#want to draw a line for each key in data
for i,k in enumerate(data.keys()):
    #bokeh can accept three things for x and y args when you draw a line:
        #1: an array of values
        #2: a singular numeric value
        #3: a string value that is to point to field in a ColumnDataSource (requires a source arg added as well)
    #We want to point to data[k] to return an array of x values
    #And we want to pass a single scalar value to the y field that bokeh will take and apply to along all of the x array
    #the tricky part is that we've set our y_range to categorical fields.. 
    #but we can't pass a string to the y arg because bokeh will think that string is pointing to a field in a columndatasource
    #but the trick is that you can still assign numerical values to the y field
    #when categorical ranges are spec'd, as we have, the axis ticks/labels get placed at 0.5,1.5,2.5 etc.
    #so all we have to do is leverage that and pass i+0.5 to the y arg as our singular scalar value
    f.line(x=data[k],y=i+0.5
           ,legend_label=k,line_color=colors[i]) #colors and labels just for you to see
show(f)
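The i+0.5 trick from the comments can be written as a tiny lookup. This is a plain-Python sketch, assuming a flat list of factors and the default categorical range settings (no extra padding between factors); `factor_positions` is a hypothetical helper name, not a Bokeh function.

```python
def factor_positions(factors):
    """Map each categorical factor to the numeric coordinate of its tick."""
    return {factor: index + 0.5 for index, factor in enumerate(factors)}

# Same factors as the y_range in the example above.
positions = factor_positions(['a', 'b'])
print(positions)  # {'a': 0.5, 'b': 1.5}
```

Looking up `positions[k]` gives the same number as `i+0.5` in the loop, which can read a little clearer once there are many keys.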

loremaster.cerberus · April 21, 2022, 3:45pm

Wow @gmerritt123 - this totally worked for me.
However, I’d like to add some dynamic scaling options for my plots - and I’d need to use a ColumnDataSource for that! I’m not sure whether I’d need to work the enumerate step into the source itself, or if I could specify the x and y values beforehand, make a dict out of them, and then call ColumnDataSource normally.

I’m imagining something like this:

colors = itertools.cycle(inferno(len(taxa)))
for i, k in enumerate(data.keys()):
            node_dict = {'x_values': data[k],
                         'y_values': np.array[i] (or something different here?)}
            source = ColumnDataSource(data=node_dict)
            p.line(source=source, legend_label=k, color=next(colors), line_width=3)

The trick would be filling out the list for y_values, right?

In the example you provided, we are giving Bokeh an array for x and a single float for y; as far as I understand, this is forbidden if the user builds the ColumnDataSource themselves, since every column has to be the same length. I think the crux of the problem is exactly how Bokeh translated your solution into a ColumnDataSource object behind the scenes - which must be happening, correct?

gmerritt123 · April 21, 2022, 6:59pm

I think the crux of the problem is exactly how Bokeh translated your solution into a ColumnDataSource object behind the scenes - which must be happening, correct?

Bingo. The high level api, plotting things by going:

p.line(x=somearray,y=somearray)

etc. is actually doing several things all at once for you, including manufacturing a “behind the scenes” CDS for you as you correctly surmised. You actually do have access to that manufactured CDS though, because that higher level api call actually returns the renderer object too (and adds it to the figure obviously). Try:

renderer = p.line(x=[0,1],y=[0,1])
print(dir(renderer))

You’ll see that the renderer object has a boatload of properties/methods, including data_source. If you drop “renderer.data_source” into the console, you’ll see the CDS that bokeh made for you. “renderer.data_source.data” will return the contents of the CDS in dictionary form. Creating the CDS yourself beforehand gives you way more creative control for customization etc. though, as a) you can assign more than just the main args (i.e. more than just x+y) and b) once you have a variable pointing to the CDS, it suddenly gets a lot easier to pass it into/manipulate it via callbacks, especially CustomJS ones.

What I like to do is organize all my stuff into dictionaries. This helps me structure what I’m doing better and makes for real ease of use passing back and forth into callbacks.

Something like this:

bk_dict = {}
for i,k in enumerate(data.keys()):
    bk_dict[k] = {}
    bk_dict[k]['src'] = ColumnDataSource(data= {'x':data[k],'y' : [i+0.5 for x in range(len(data[k]))]})
    bk_dict[k]['rend'] = f.line(x='x',y='y'
                                ,legend_label=k,line_color=colors[i]
                                ,source=bk_dict[k]['src']) #now pointing to that source
show(f)

Idea of this is that bk_dict will look like this:

{'a':{'src':ColumnDataSource(...) #the CDS driving 'a'
      ,'rend':GlyphRenderer(...)} #the renderer for 'a'
,'b':{... etc.}}

With this, I can pass stuff into CustomJS as args, and even on the python side it helps because it becomes a lot easier to access the things you want to access/manipulate. Now what I outlined above is a bit redundant as

bk_dict['a']['rend'].data_source

would return the exact same cds object as

bk_dict['a']['src']

but I have found that I’m manipulating CDS’s so much, and oftentimes I have multiple renderers running off the same CDS (see docs about linking data, it’s amazing), that it pays to be redundant.
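To make the "multiple renderers off the same CDS" point concrete, here is a minimal sketch, assuming bokeh is installed. The variable names and the small data dict are made up for illustration.

```python
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

# One source, with a constant y of 0.5 so everything sits on the 'a' tick.
src = ColumnDataSource(data={'x': [0, 1, 2, 3], 'y': [0.5, 0.5, 0.5, 0.5]})

p = figure(y_range=['a', 'b'])

# Two different glyphs, both reading their columns from the same source.
line_rend = p.line(x='x', y='y', source=src)
dots_rend = p.scatter(x='x', y='y', size=8, source=src)

# Both renderers hold a reference to the very same CDS object.
shared = line_rend.data_source is src and dots_rend.data_source is src
print(shared)
```

Because both renderers point at one object, updating `src.data` once should be enough to update both glyphs.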

loremaster.cerberus · April 21, 2022, 7:00pm

What a coincidence! I found a solution via list comprehension just now.

for i, k in enumerate(data.keys()):
    x = list(data[k])
    y = [i + 0.5 for _ in x]  # one constant y per x value
    node_dict = {'x': x,
                 'y': y}
    source = ColumnDataSource(data=node_dict)
    p.line(x='x', y='y', source=source, legend_label=k, color=next(colors), line_width=3)

Thank you for your detailed replies! I’m familiar with integrating CustomJS with controls on the html side to get truly dynamic Bokeh plots (not just data filtering - but recreation of a graph with adjusted axes post-filtration), so your help in creating a source opens the door to that stuff.

It’s much appreciated.