bokeh.charts.HeatMap with 38k rows: pretty slow

Ian_Stokes_Rees · November 1, 2016, 5:23am

I’m trying to use bokeh.charts.HeatMap on 38k rows of data. That doesn’t sound like too big a task to me, but it takes several minutes to complete. Is there a better way to do the following 3 steps (broken up to show timings):

from pandas import options, read_csv
from bokeh.charts import HeatMap, bins
fuel = read_csv('../data/fueleconomy/vehicles.csv')
fuel.dropna(subset='highway08 displ'.split(), inplace=True, how='any')

This takes 445 ms. vehicles.csv can be found at https://www.fueleconomy.gov/feg/download.shtml under "Zipped CSV File". Next comes the HeatMap creation step (which takes all the time):

fuel_hp = HeatMap(fuel, x=bins('highway08'), y=bins('displ'), legend=None)

This requires 4 minutes and 24 seconds, after which the plot is displayed with show(fuel_hp) in 82 ms.

My sense was that the actual time to compute a heatmap isn’t that great, so I tried using seaborn.jointplot to generate a static (matplotlib) hexbin map from exactly the same data:

import seaborn as sns
sns.jointplot(fuel.highway08, fuel.displ, kind="hex", color="#4CB391")

This required only 672 ms to create a static version of effectively the same information.

Are there any ways to adjust the Bokeh code to run in, at most, 20 seconds? Fewer bins? As a last resort I can decimate the data by striding to sample it, e.g.

fuel_hp = HeatMap(fuel[::20], x=bins('highway08'), y=bins('displ'), legend=None)

which takes 15 seconds, but if I can avoid that it would be great.

TIA. Ian

Rutger_Kassies · November 1, 2016, 10:39am

Hey,

My guess would be that Bokeh sends a lot of the original data along for interactivity purposes. If you don’t mind losing that, just as with a static Matplotlib image, I’m sure you can get an immense speedup by using NumPy’s histogram2d to generate the heatmap yourself and then using Bokeh’s p.image to display it. If this approach is something you can use but you get stuck in the process, I could create an example.

Regards,
Rutger
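
For readers following along, here is a minimal sketch of the approach suggested above: bin the points with numpy.histogram2d and hand the resulting grid to a single Bokeh image glyph. It is not code from the thread. The helper names bin_2d and heatmap_figure, the bin count of 50, the palette, and the synthetic data standing in for fuel.highway08 and fuel.displ are all assumptions made for illustration.

```python
import numpy as np


def bin_2d(x, y, bins=50):
    """Count points on a regular 2D grid (hypothetical helper)."""
    counts, x_edges, y_edges = np.histogram2d(x, y, bins=bins)
    # histogram2d indexes counts as [x_bin, y_bin]; an image glyph expects
    # rows to run along y, so transpose before plotting.
    return counts.T, x_edges, y_edges


def heatmap_figure(counts, x_edges, y_edges):
    """Draw pre-binned counts as one image glyph (hypothetical helper)."""
    # Imported here so the binning step can be used without Bokeh installed.
    from bokeh.plotting import figure

    p = figure(x_range=(x_edges[0], x_edges[-1]),
               y_range=(y_edges[0], y_edges[-1]))
    p.image(image=[counts],
            x=x_edges[0], y=y_edges[0],
            dw=x_edges[-1] - x_edges[0],
            dh=y_edges[-1] - y_edges[0],
            palette="Viridis256")
    return p


# Synthetic stand-in for the two columns (assumption: 38k rows, rough scale only).
rng = np.random.default_rng(0)
highway = rng.normal(24, 6, 38_000)
displ = rng.normal(3.3, 1.3, 38_000)

counts, x_edges, y_edges = bin_2d(highway, displ, bins=50)
```

With the real data, one would pass fuel.highway08 and fuel.displ to bin_2d and then call show(heatmap_figure(counts, x_edges, y_edges)). Only the 50×50 grid of counts is sent to the browser, so per-row hover information is lost, which is the trade-off described above.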

Bryan · November 1, 2016, 1:41pm

Can you run a profile and provide the results? The reality is that bokeh.charts currently lacks a maintainer. But if a profile showed a clear path to some immediate improvement, we could try to make a quick fix.

Thanks,

Bryan
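
One way to produce such a profile is the standard library’s cProfile. The sketch below is an assumption about how one might do it, not something posted in the thread: profile_call is a hypothetical helper, and slow_stand_in is a placeholder for the HeatMap(...) call so that the example runs on its own.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=15, **kwargs):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call paths come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def slow_stand_in(n):
    # Hypothetical placeholder for the HeatMap(...) call being measured.
    return sum(i * i for i in range(n))


result, report = profile_call(slow_stand_in, 10_000)
print(report)
```

For the case in this thread, the call would look like profile_call(HeatMap, fuel, x=bins('highway08'), y=bins('displ'), legend=None), and the printed report is what could be pasted into a reply.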
