Bokeh not displayed on jupyterlab PySpark kernel

Hi,

I am trying to use Bokeh in JupyterLab with a PySpark kernel, but nothing is displayed.

I deployed an AWS EMR cluster with the JupyterHub prepackaged by AWS.
I manually installed JupyterLab on it (JupyterLab on JupyterHub), along with a few other libraries:

sudo python3 -m pip install ipython
sudo python3 -m pip install jupyterlab jupyterlab-git
sudo python3 -m pip install matplotlib==3.4.3
sudo python3 -m pip install seaborn

I manually installed Hail v0.2.80, which comes with Bokeh 1.4.0 as a requirement.

A few other environment versions, for info:

  • EMR: emr-6.4.0
  • EC2: Amazon Linux 2 AMI
  • Java: java -version 1.8.0
  • Python: python --version 3.7.10
  • Ganglia: 3.7.2
  • Hadoop: hadoop version 3.2.1
  • JupyterHub: 1.4.1
  • Livy: ?
  • Spark UI: 3.1.2
  • Spark: spark-shell 3.1.2
  • Scala: spark-shell 2.12.10

In JupyterLab, I open a new PySpark kernel.
I am able to initialize Spark and load Hail.
I can load some data.
I can display HTML-styled Spark dataframes using

%%pretty
data.to_spark().show()

I can display matplotlib plots using

import matplotlib.pyplot as plt
# Extract data
p_df = data.to_pandas()
# Compute histogram
plt.hist(p_df['info.AN'], bins=10, range=(4500, 5500))
# Use sparkmagic
%matplot plt

So it seems that displaying HTML in PySpark is doable.

But I am not able to plot using Bokeh:

from bokeh.plotting import figure, show, output_notebook
output_notebook()
# prepare some data
x = [1, 2, 3, 4, 5]
y = [6, 7, 2, 4, 5]
# create a new plot with a title and axis labels
p = figure(title="Simple line example", x_axis_label="x", y_axis_label="y")
# add a line renderer with legend and line thickness
p.line(x, y, legend_label="Temp.", line_width=2)
# show the results
show(p)

The code above does not display anything (no "Loading BokehJS…" banner, no error).

I tried to install jupyter-bokeh, but it upgrades Bokeh to v2.4.2 (which is incompatible with Hail) and does not solve the plotting issue…

Please help

Hi @mhebrard. There has never been any official work by the Bokeh project team to support PySpark, and I’m not aware that anyone on the core team uses PySpark at all. I am afraid I can’t offer much guidance beyond:

  • the standard jupyter_bokeh extension probably needs to be installed
  • do also check the browser JavaScript console for any relevant error messages
  • reach out to the PySpark project (Hail?) on their support forums or trackers, since they are the folks with actual expertise in using Bokeh with PySpark
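One way to narrow things down (a sketch, not PySpark-specific): render a figure to standalone HTML with bokeh.embed.file_html, which bypasses the notebook display machinery entirely. If this produces a valid document, Bokeh itself is working, and the problem is in how the kernel publishes rich output to the browser.

```python
from bokeh.embed import file_html
from bokeh.plotting import figure
from bokeh.resources import INLINE

# Build a trivial figure and render it to a standalone HTML document,
# bypassing output_notebook()/show() and the kernel's display machinery.
p = figure(title="render test")
p.line([1, 2, 3], [3, 1, 2])

html = file_html(p, INLINE, "render test")
print(len(html))  # a large document, since BokehJS is embedded inline
```

Writing that string to a file and opening it in a browser sidesteps the notebook entirely, which separates a Bokeh problem from a kernel/display problem.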

Also, FYI

So it seems that displaying HTML in PySpark is doable.

Basic MPL plotting merely displays static image files and does not do any HTML (JavaScript) plotting. That is considerably simpler, since it does not require any Jupyter extensions, etc. in order to execute JS code.

Hi @Bryan. Thanks for the answer.

I get that apparently no one is trying to plot with Bokeh from PySpark. It is a bit frustrating, because the PySpark kernel is the only way to work directly on the cluster, which helps a lot with big distributed data: there is no need to perform the aggregation, save a file, and switch kernels in the middle of the process.

Strangely, Bokeh works fine in Zeppelin. But the Zeppelin notebook format is less convenient than Jupyter's, so I wish to switch back. I am especially after GitHub's support for rendering Jupyter notebooks, which eases reviews.

See below zeppelin snippet

%pyspark
# Imports
from bokeh.io import show, output_notebook
from bokeh.resources import Resources
import bkzep
output_notebook(notebook_type='zeppelin', resources=Resources(mode='inline'))
# Plot
...

Now I have tried to install the jupyter_bokeh extension, but that upgraded the version of Bokeh automatically. Is there a way to pin the Bokeh version while installing the extension?

@mhebrard I don’t actually know; it’s possible @Philipp_Rudiger or @mateusz can chime in with more information. It’s entirely possible that Bokeh 1.4 is not compatible with recent JupyterLab. The authors of the tool you are using should really update their codebase, since they are about to be two entire major releases out of date.
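If it helps, a quick sanity check (just a sketch) from inside the PySpark kernel can at least confirm which versions that kernel actually sees, since sudo pip installs and the kernel environment are easy to get out of sync on EMR:

```python
# Check which package versions this kernel actually imports; on EMR the
# sudo-pip environment and the kernel environment can easily diverge.
import bokeh
print("bokeh:", bokeh.__version__)

try:
    import jupyter_bokeh
    print("jupyter_bokeh:", getattr(jupyter_bokeh, "__version__", "unknown"))
except ImportError:
    print("jupyter_bokeh is not importable from this kernel")
```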

As an aside, you could consider using Dask as an alternative to PySpark. It has a much more Pythonic / Pandas-like API, and also has a very sophisticated built-in cluster performance monitoring and analysis dashboard that is actively maintained.