Bokeh embedded session limitations, race condition or bug?

ISSUE: A Bokeh server application with a moderately sized column data source (10000 samples of 100 variables) works when accessed directly via the standalone server, but fails when embedded in a Flask application using the pull_session()/server_session() mechanism.

The example below shows that the problem depends on the size of the data source used in a plot, even when only two of its columns are rendered.

ENVIRONMENT: macOS Big Sur with Anaconda Python; Python 3.8.5, Bokeh 2.2.3, Flask 1.1.2.
Client browser Google Chrome Version 87.0.4280.141 (Official Build) (x86_64)

ERRORS (Flask terminal log)

% python flask_bokeh.py
 * Serving Flask app "flask_bokeh" (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://127.0.0.1:8000/ (Press CTRL+C to quit)
ERROR:tornado.application:Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7f966006fd00>>, <Task finished name='Task-12' coro=<ClientConnection._next() done, defined at /Users/***/opt/anaconda3/lib/python3.8/site-packages/bokeh/client/connection.py:307> exception=TypeError("_transition_to_disconnected() missing 1 required positional argument: 'dis_state'")>)
Traceback (most recent call last):
  File "/Users/***/opt/anaconda3/lib/python3.8/site-packages/tornado/ioloop.py", line 741, in _run_callback
    ret = callback()
  File "/Users/***/opt/anaconda3/lib/python3.8/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
    future.result()
  File "/Users/***/opt/anaconda3/lib/python3.8/site-packages/bokeh/client/connection.py", line 316, in _next
    await self._state.run(self)
  File "/Users/***/opt/anaconda3/lib/python3.8/site-packages/bokeh/client/states.py", line 139, in run
    return await connection._transition_to_disconnected()
TypeError: _transition_to_disconnected() missing 1 required positional argument: 'dis_state'
ERROR:tornado.application:Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7f969057c280>>, <Task finished name='Task-18' coro=<ClientConnection._next() done, defined at /Users/***/opt/anaconda3/lib/python3.8/site-packages/bokeh/client/connection.py:307> exception=TypeError("_transition_to_disconnected() missing 1 required positional argument: 'dis_state'")>)
Traceback (most recent call last):
  File "/Users/***/opt/anaconda3/lib/python3.8/site-packages/tornado/ioloop.py", line 741, in _run_callback
    ret = callback()
  File "/Users/***/opt/anaconda3/lib/python3.8/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
    future.result()
  File "/Users/***/opt/anaconda3/lib/python3.8/site-packages/bokeh/client/connection.py", line 316, in _next
    await self._state.run(self)
  File "/Users/***/opt/anaconda3/lib/python3.8/site-packages/bokeh/client/states.py", line 139, in run
    return await connection._transition_to_disconnected()
TypeError: _transition_to_disconnected() missing 1 required positional argument: 'dis_state'
ERROR:tornado.application:Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7f9690582430>>, <Task finished name='Task-27' coro=<ClientConnection._next() done, defined at /Users/***/opt/anaconda3/lib/python3.8/site-packages/bokeh/client/connection.py:307> exception=TypeError("_transition_to_disconnected() missing 1 required positional argument: 'dis_state'")>)
Traceback (most recent call last):
  File "/Users/***/opt/anaconda3/lib/python3.8/site-packages/tornado/ioloop.py", line 741, in _run_callback
    ret = callback()
  File "/Users/***/opt/anaconda3/lib/python3.8/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
    future.result()
  File "/Users/***/opt/anaconda3/lib/python3.8/site-packages/bokeh/client/connection.py", line 316, in _next
    await self._state.run(self)
  File "/Users/***/opt/anaconda3/lib/python3.8/site-packages/bokeh/client/states.py", line 139, in run
    return await connection._transition_to_disconnected()
TypeError: _transition_to_disconnected() missing 1 required positional argument: 'dis_state'
127.0.0.1 - - [22/Jan/2021 08:31:01] "GET / HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [22/Jan/2021 08:31:01] "GET / HTTP/1.1" 200 -
ERROR:tornado.application:Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7f9660203310>>, <Task finished name='Task-45' coro=<ClientConnection._next() done, defined at /Users/***/opt/anaconda3/lib/python3.8/site-packages/bokeh/client/connection.py:307> exception=TypeError("_transition_to_disconnected() missing 1 required positional argument: 'dis_state'")>)
Traceback (most recent call last):
  File "/Users/***/opt/anaconda3/lib/python3.8/site-packages/tornado/ioloop.py", line 741, in _run_callback
    ret = callback()
  File "/Users/***/opt/anaconda3/lib/python3.8/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
    future.result()
  File "/Users/***/opt/anaconda3/lib/python3.8/site-packages/bokeh/client/connection.py", line 316, in _next
    await self._state.run(self)
  File "/Users/***/opt/anaconda3/lib/python3.8/site-packages/bokeh/client/states.py", line 139, in run
    return await connection._transition_to_disconnected()
TypeError: _transition_to_disconnected() missing 1 required positional argument: 'dis_state'

INFO (client browser JS Console log)

INFO (bokeh server terminal)

% bokeh serve bkapp.py --port=5006 --allow-websocket-origin=localhost:8000 --allow-websocket-origin=localhost:5006 --unused-session-lifetime=60000 --keep-alive=1000
2021-01-22 08:31:18,041 Starting Bokeh server version 2.2.3 (running on Tornado 6.1)
2021-01-22 08:31:18,042 Keep-alive ping configured every 1000 milliseconds
2021-01-22 08:31:18,042 Unused sessions last for 60000 milliseconds
2021-01-22 08:31:18,042 User authentication hooks NOT provided (default user enabled)
2021-01-22 08:31:18,044 Bokeh app running at: http://localhost:5006/bkapp
2021-01-22 08:31:18,044 Starting Bokeh server with process id: 90205
2021-01-22 08:31:21,185 WebSocket connection opened
2021-01-22 08:31:21,594 ServerConnection created
2021-01-22 08:31:21,767 Failed sending message as connection was closed

The embedded-session problem goes away if the number of variables in the data source is reduced, with all other pieces unchanged. The problem does not occur when accessing the Bokeh server directly, e.g. by navigating to localhost:5006/bkapp.

This is a trivial, minimal example that is uninteresting from an engineering standpoint. However, the motivation is to support a real-world application that performs multivariate time-series visualization and analysis. For brevity and ease of reproduction, I’ve stripped away everything related to uploading large-ish user data, callbacks, analysis functions and interactive plots.

The minimal reproducible example includes the following parts:

bkapp.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
"""
import numpy as np
import pandas as pd

from bokeh.models import ColumnDataSource

from bokeh.plotting import figure
from bokeh.io import curdoc

rows, cols = (10000, 100)
data = pd.DataFrame(data=np.random.randn(rows, cols), columns=['x'+str(i) for i in range(cols)])
source = ColumnDataSource(data=data)

p = figure(x_axis_type='linear', title="Flask-Bokeh Example")
p.scatter('x0', 'x1', source=source)

curdoc().add_root(p)

flask_bokeh.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Flask+bokeh app
"""
from flask import Flask, render_template

from bokeh.client import pull_session
from bokeh.embed import server_session

app = Flask(__name__)

@app.route('/', methods=['GET'])
def index():
    arguments = None
    with pull_session(url='http://localhost:5006/bkapp', arguments=arguments) as _session:
        script = server_session(session_id=_session.id, url='http://localhost:5006/bkapp')
        # index.html reads {{ framework }}, so pass it under that name
        return render_template("index.html", script=script, framework="Flask")

if __name__ == '__main__':
    app.run(port=8000)

index.html

<!doctype html>

<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Embedding a Bokeh Server With {{ framework }}</title>
</head>

<body>
  <div>
    This Bokeh app below is served by a Bokeh server that has been embedded
    in another web app framework. For more information see the section
    <a  target="_blank" href="https://docs.bokeh.org/en/latest/docs/user_guide/server.html#embedding-bokeh-server-as-a-library">Embedding Bokeh Server as a Library</a>
    in the User's Guide.
  </div>
  {{ script|safe }}
</body>
</html>

Most likely you are exceeding Tornado’s default maximum websocket message size and need to increase it. The Bokeh server has a command line option for this, --websocket-max-message-size.
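A minimal sketch of the launch command with that option added, assuming the same app, port and origins as the bokeh serve command shown earlier in the thread. bokeh_serve_argv and as_mib are hypothetical helpers; the option takes a size in bytes. Note that this option sets the limit on messages the Bokeh server receives, not on messages other peers accept from it.

```python
import shlex


def bokeh_serve_argv(app_path, port, origins, max_message_bytes):
    """Build an argv list for `bokeh serve` (hypothetical helper).

    --websocket-max-message-size is given in bytes and raises the
    limit the Bokeh server applies to incoming websocket messages.
    """
    argv = ["bokeh", "serve", app_path, f"--port={port}"]
    for origin in origins:
        argv.append(f"--allow-websocket-origin={origin}")
    argv.append(f"--websocket-max-message-size={max_message_bytes}")
    return argv


def as_mib(nbytes):
    """Format a byte count in MiB, the way the server's startup log does."""
    return f"{nbytes / 1024**2:.2f} MB"


if __name__ == "__main__":
    argv = bokeh_serve_argv("bkapp.py", 5006,
                            ["localhost:8000", "localhost:5006"],
                            2_000_000_000)
    print(shlex.join(argv))       # paste into a terminal, or pass to subprocess
    print(as_mib(2_000_000_000))  # 1907.35 MB
```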

Thanks for the reply.

Is that consistent with the observation that the application works properly when I access the server directly, e.g. by visiting localhost:5006/bkapp, but fails when going through the pull_session/server_session layer of the Flask app?

UPDATE: I confirmed that the behavior persists after increasing the max websocket message size to ~2GB, even though my data is much smaller than that, even accounting for significant overhead.

2021-01-22 10:58:22,384 Starting Bokeh server version 2.2.3 (running on Tornado 6.1)
2021-01-22 10:58:22,386 Keep-alive ping configured every 1000 milliseconds
2021-01-22 10:58:22,386 Unused sessions last for 60000 milliseconds
2021-01-22 10:58:22,386 Torndado websocket_max_message_size set to 2000000000 bytes (1907.35 MB)

Do you have any sort of proxy (e.g. nginx) in the non-localhost case? The proximate problem is the websocket closing with code 1006; unfortunately that code is vague and usually does not offer much information. You might try looking at the console in a different browser. You can also bump the Python and JS log levels to TRACE, which may help localize where in the exchange the problem is happening.
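pull_session runs bokeh.client inside the Flask process, so that side of the exchange logs to the Flask terminal, not to the bokeh serve terminal. A minimal stdlib sketch for turning that process's client logging up, assuming Bokeh's modules log through loggers named after their module path (e.g. bokeh.client.connection) and using a TRACE level value of 9 chosen for illustration. enable_client_tracing is a hypothetical helper you would call near the top of flask_bokeh.py.

```python
import logging

# Assumption: a TRACE level one step below DEBUG; 9 is chosen for illustration.
TRACE = 9
logging.addLevelName(TRACE, "TRACE")


def enable_client_tracing(handler: logging.Handler) -> logging.Logger:
    """Send very verbose records from the bokeh.client package to `handler`.

    Child loggers such as "bokeh.client.connection" inherit the TRACE
    level from this parent and propagate their records to its handler.
    """
    logger = logging.getLogger("bokeh.client")
    logger.setLevel(TRACE)
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    stream = logging.StreamHandler()
    stream.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))
    enable_client_tracing(stream)
    # A record as a child module logger would emit it.
    logging.getLogger("bokeh.client.connection").log(TRACE, "example trace record")
```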

Sorry for the confusion.

Everything here is running on a personal computer using localhost without nginx or any other proxying mechanism to eliminate as many variables as possible.

In one terminal I run the bokeh server. I can access the route to the server via a browser without problems, localhost:5006/bkapp.

In a separate terminal I run the minimal Flask example. When I go through it at localhost:8000, the problem appears as it tries to access the Bokeh server started separately.

I think the 1006 errors seen in the JavaScript console and the messages in the Bokeh server terminal are caused by the crash reported from pull_session/server_session in the Flask app. Superficially, it feels like some sort of race condition is closing/ending the session before all the data are exchanged.

AFAIK code 1006 is always client-originated. From the Flask log, it looks like there may be some issue with the transition to the disconnected state in the server, but that’s happening after the websocket has already been closed (by the browser). Again, I’d suggest experimenting with other browsers and log levels in case that yields more information.

I don’t think this has anything to do with pull_session or server_session. If it did, the browser would never get as far as attempting to start a connection, which it is definitely doing.

Okay. Will do.

For what it’s worth, I don’t always see the messages in the JavaScript console. It seems to depend on the sequence of re-visiting the site in the browser, restarting the apps, etc.

But the crash sequence reported above in the terminal running the flask app with pull_session/server_session is always the same.

At this point all I can really suggest is a GitHub issue.

Results from more controlled tests, i.e. clearing all browser histories, restarting apps, etc.

Identical behavior seen in Google Chrome and Safari browsers.

With the Bokeh Python and JavaScript console log levels set to trace, I see verbose output in the browser’s JavaScript console for the working mode, i.e. going through localhost:5006/bkapp.

When going through the route for the Flask app, the JavaScript console shows nothing at all despite the log level being set to trace. The Bokeh server terminal window (not the Flask app terminal window) shows the following Python logs with the level set to trace.

2021-01-22 11:56:57,203 [pid 92017] 0 clients connected
2021-01-22 11:56:57,203 [pid 92017]   /bkapp has 0 sessions with 0 unused
2021-01-22 11:56:58,200 Running keep alive job
2021-01-22 11:56:59,203 Running keep alive job
2021-01-22 11:57:00,202 Running keep alive job
2021-01-22 11:57:01,202 Running keep alive job
2021-01-22 11:57:01,203 Running session cleanup job
2021-01-22 11:57:02,114 Subprotocol header received
2021-01-22 11:57:02,114 Supplied subprotocol headers: ['bokeh', 'eyJzZXNzaW9uX2lkIjogIjVqaTJRWXZUeGgxUXFnaGhMVk5FREJ1dGxmM0dJaHl6NWw1bnNpM0NacUdQIiwgInNlc3Npb25fZXhwaXJ5IjogMTYxMTMzNDkyMn0']
2021-01-22 11:57:02,115 WebSocket connection opened
2021-01-22 11:57:02,521 Receiver created for Protocol()
2021-01-22 11:57:02,521 ProtocolHandler created for Protocol()
2021-01-22 11:57:02,521 ServerConnection created
2021-01-22 11:57:02,521 Running keep alive job
2021-01-22 11:57:02,523 Sending pull-doc-reply from session '5ji2QYvTxh1QqghhLVNEDButlf3GIhyz5l5nsi3CZqGP'
2021-01-22 11:57:02,649 Failed sending message as connection was closed
2021-01-22 11:57:02,650 WebSocket connection closed: code=None, reason=None
2021-01-22 11:57:03,203 Running keep alive job
2021-01-22 11:57:04,203 Running keep alive job
2021-01-22 11:57:05,202 Running keep alive job
2021-01-22 11:57:06,200 Running keep alive job
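The second entry in the "Supplied subprotocol headers" line is the session token sent alongside the "bokeh" subprotocol (compare subprotocols=["bokeh", self._session.token] in the connection.py diff later in the thread). A small stdlib sketch that decodes it, assuming the token is unsigned, base64url-encoded JSON with its "=" padding stripped, which is what this particular value turns out to be. decode_unsigned_token is a hypothetical helper, not a Bokeh API, and it does not handle signed tokens.

```python
import base64
import json


def decode_unsigned_token(token: str) -> dict:
    """Decode the JSON payload of an unsigned session token (sketch)."""
    # Restore the '=' padding that was stripped from the base64url text.
    padded = token + "=" * (-len(token) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


TOKEN = "eyJzZXNzaW9uX2lkIjogIjVqaTJRWXZUeGgxUXFnaGhMVk5FREJ1dGxmM0dJaHl6NWw1bnNpM0NacUdQIiwgInNlc3Npb25fZXhwaXJ5IjogMTYxMTMzNDkyMn0"

if __name__ == "__main__":
    payload = decode_unsigned_token(TOKEN)
    print(payload["session_id"])      # same id as the pull-doc-reply line
    print(payload["session_expiry"])  # Unix timestamp
```

The decoded session_id matches the session named in the "Sending pull-doc-reply" line, so the connection that closed is the one attached to that session.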

GitHub issue #10863 created.

Just noting that there is an error with transitioning to disconnected, but that’s not really relevant (the disconnection already happened). This is the diff that fixes that tho:

diff --git a/bokeh/client/states.py b/bokeh/client/states.py
index c63dc06fd..74ebb697b 100644
--- a/bokeh/client/states.py
+++ b/bokeh/client/states.py
@@ -136,7 +136,7 @@ class WAITING_FOR_REPLY:
     async def run(self, connection):
         message = await connection._pop_message()
         if message is None:
-            return await connection._transition_to_disconnected()
+            return await connection._transition_to_disconnected(DISCONNECTED(ErrorReason.NETWORK_ERROR))
         elif 'reqid' in message.header and message.header['reqid'] == self.reqid:
             self._reply = message
             return await connection._transition(CONNECTED_AFTER_ACK())

with that change you just get the intentional disconnect message:

ERROR:flask_bokeh:Exception on / [GET]
Traceback (most recent call last):
  File "/Users/bryan/anaconda/envs/dev/lib/python3.7/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/Users/bryan/anaconda/envs/dev/lib/python3.7/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/Users/bryan/anaconda/envs/dev/lib/python3.7/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/Users/bryan/anaconda/envs/dev/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/Users/bryan/anaconda/envs/dev/lib/python3.7/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/Users/bryan/anaconda/envs/dev/lib/python3.7/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "flask_bokeh.py", line 16, in index
    with pull_session(url='http://localhost:5006/bkapp', arguments=arguments) as _session:
  File "/Users/bryan/work/bokeh/bokeh/client/session.py", line 120, in pull_session
    session.pull()
  File "/Users/bryan/work/bokeh/bokeh/client/session.py", line 387, in pull
    self._connection.pull_doc(doc)
  File "/Users/bryan/work/bokeh/bokeh/client/connection.py", line 196, in pull_doc
    raise RuntimeError("Connection to server was lost")
RuntimeError: Connection to server was lost
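To see why the original line in the states.py diff fails, here is a stripped-down, self-contained model of a client state machine. The class and enum names echo the diff but are simplified stand-ins, not Bokeh’s actual classes; the only point is that _transition_to_disconnected requires a state argument, so the bare call raises the TypeError seen in the Flask log, while the patched call records a NETWORK_ERROR disconnect.

```python
import asyncio
from enum import Enum, auto


class ErrorReason(Enum):
    """Stand-in for the reason enum used in the diff."""
    NETWORK_ERROR = auto()


class Disconnected:
    """Terminal state that records why the connection ended."""
    def __init__(self, reason: ErrorReason):
        self.reason = reason


class Connection:
    """Toy connection: replays a fixed list of messages, then yields None."""
    def __init__(self, messages):
        self._messages = list(messages)
        self.state = None

    async def _pop_message(self):
        # None models "the socket closed before a message arrived".
        return self._messages.pop(0) if self._messages else None

    async def _transition_to_disconnected(self, dis_state):
        # dis_state is required: calling this with no argument raises
        # TypeError before the coroutine is even created.
        self.state = dis_state


class WaitingForReply:
    async def run(self, connection):
        message = await connection._pop_message()
        if message is None:
            # Patched behaviour: always pass an explicit disconnected state.
            return await connection._transition_to_disconnected(
                Disconnected(ErrorReason.NETWORK_ERROR))
        return message


if __name__ == "__main__":
    conn = Connection([])
    asyncio.run(WaitingForReply().run(conn))
    print(conn.state.reason)  # ErrorReason.NETWORK_ERROR
```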

@_jm I just had a thought. The option --websocket-max-message-size definitely controls the websocket connection between a Bokeh server and a browser, but I wonder if somehow that is not true for a Bokeh server and a Python process using bokeh.client. I.e. I wonder if some Tornado configuration is needed in the Flask app calling pull_session as well.

Edit: I realize that would not explain why the “bare” app works, but I do note that (10000, 100) times two float64 columns is ~16 MB, which is not far off the server default limit. [1] Perhaps there is some other setting, relevant to bokeh.client usage, that needs to be configured.


  1. And just FYI, if I cast to float32 instead, then things work.
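A rough back-of-envelope check of why float64 fails and float32 works, under two assumptions: the numeric columns travel base64-encoded inside the JSON pull-doc-reply, and the receiving side is Tornado’s websocket_connect with its default max_message_size of 10 MiB. The function names are hypothetical, and the estimate ignores the DataFrame index column, JSON keys and framing, so the real message is somewhat larger than these figures.

```python
# Tornado's default max_message_size for websocket_connect: 10 MiB.
TORNADO_CLIENT_DEFAULT_LIMIT = 10 * 1024 * 1024


def base64_len(nbytes: int) -> int:
    """Length of the padded base64 encoding of `nbytes` raw bytes."""
    return (nbytes + 2) // 3 * 4


def estimate_array_payload(rows: int, cols: int, itemsize: int) -> int:
    """Rough size of the column arrays alone, each base64-encoded."""
    return cols * base64_len(rows * itemsize)


if __name__ == "__main__":
    for label, itemsize in [("float64", 8), ("float32", 4)]:
        size = estimate_array_payload(10_000, 100, itemsize)
        print(f"{label}: ~{size:,} bytes, over client limit: "
              f"{size > TORNADO_CLIENT_DEFAULT_LIMIT}")
    # float64: ~10,666,800 bytes, over client limit: True
    # float32: ~5,333,600 bytes, over client limit: False
```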


Thanks! This makes sense and looks like a promising way forward. I will let you know when I have a chance to explore more.

@_jm this is definitely it. Bokeh uses Tornado’s websocket_connect function, which has a separate (and lower) max message size.

The following diff makes your example work:

diff --git a/bokeh/client/connection.py b/bokeh/client/connection.py
index f32077b94..ee433525a 100644
--- a/bokeh/client/connection.py
+++ b/bokeh/client/connection.py
@@ -267,7 +267,7 @@ class ClientConnection:
         formatted_url = format_url_query_arguments(self._url, self._arguments)
         request = HTTPRequest(formatted_url)
         try:
-            socket = await websocket_connect(request, subprotocols=["bokeh", self._session.token])
+            socket = await websocket_connect(request, subprotocols=["bokeh", self._session.token], max_messag
             self._socket = WebSocketClientConnectionWrapper(socket)
         except HTTPClientError as e:
             await self._transition_to_disconnected(DISCONNECTED(ErrorReason.HTTP_ERROR, e.code, e.message))

I’ll submit a PR tonight to expose this parameter.

Thanks.

As an aside, I am using Dropzone.js with chunking to reliably transmit large data from a user to the python code backend. So, I have large data handling figured out in that context.

The issue of this discourse topic came about when I wanted to see what the limits were with rendering in the browser. I wasn’t expecting the data flow to the browser to be a bottleneck if I’m not plotting all the data in the source, but I should have thought about it more. This is good to know in general when organizing things.

@_jm FYI https://github.com/bokeh/bokeh/pull/10869
