Bokeh server getting killed by memory leak, and sometimes also getting an "extra unexpected referrers!" error

I’m trying to deploy a fairly resource-intensive Panel/Holoviews/Datashader project (a map with overlaid images, with a lot of regridding and reshading going on). From what I found, it gets served by Bokeh, which is why I’m posting here, but let me know if I’m wrong about that.

The app runs on an AWS EC2 Ubuntu server with Python 3.8.5 and the latest version of every module. I start the application with this command:
python -m panel serve app.py --allow-websocket-origin={'mydomain:5006','mydomain:8080','mydomain'}
The plan for the future is to serve this through a Flask app using server_document, hence the extra ports, but currently it just gets embedded as an iframe using :5006. This is how app.py exposes the content:

full_content = pn.Template(template_html)
full_content.add_panel('content', combined_panel_and_holoviews_elements)
full_content.servable()

When I start the script and load the page, I can see a process in htop with the same name as the command above, plus a green thread with the same name, both immediately taking up about 33% of memory and holding onto it. The CPU usage of the process ranges from 0% to 100% depending on the current user interaction. The memory consumption, however, grows constantly. Every time I refresh the browser, it grows by about 3% for a while and then decreases by about 1%, keeping a net gain of about 2%. Subsequent refreshes therefore quickly push the memory consumption of the system to the available maximum (80.6% in this particular case). At first everything just freezes, and finally the script gets killed with a simple “Killed” message.

I have no information on what goes wrong. However, some error messages popped up earlier, while the app was still in development, and went away before I had time to address them directly. They only appeared on rare occasions and looked like this:

bokeh.document.document - ERROR - Module <module 'bokeh_app_0471f2090d064ee1b1e2c9fd42c2fcb3' from '/x/y/app.py'> has extra unexpected referrers! This could indicate a serious memory leak. Extra referrers: [<cell at 0x7f15bc66bc50: module object at 0x7f15bc690470>]

Now, of course, I know that without reproducible code it’s hard to guess what goes wrong, but maybe somebody has some pointers.

  • Are there common pitfalls that cause memory leaks and best practices to avoid them?
  • Is there any documentation on what the Bokeh server actually does, so that I can understand the underlying logic?
  • Are there better ways to investigate this issue than commenting out code piece by piece and hoping that something makes the leak go away, so that I at least have a culprit?
  • If the memory leak error message from earlier is relevant, how should I decipher what “cell” and “module object” it refers to?

If I run the script with memory and stats logging enabled, it prints this on start:

2021-05-21 16:02:12,777 [pid 12306] 0 clients connected
2021-05-21 16:02:12,777 [pid 12306]   /app has 0 sessions with 0 unused
2021-05-21 16:02:12,778 [pid 12306] Memory usage: 100.00 MB (RSS), 291.00 MB (VMS)
2021-05-21 16:02:12,807   uncollected Documents: 0
2021-05-21 16:02:12,832   uncollected Sessions: 0
2021-05-21 16:02:12,869   uncollected Models: 1

But as soon as I actually load the app in a browser, it stops logging to the console.

Also, after loading the app in the browser for the first time after starting the process, I get this again, but not after refreshes:
bokeh.document.document - ERROR - Module <module 'bokeh_app_fa366694b3804dc79e2879edde1ad888' from '/path/app.py'> has extra unexpected referrers! This could indicate a serious memory leak. Extra referrers: [<cell at 0x7f5e2e205f90: module object at 0x7f5e2de4edd0>]

This is an extremely rare/uncommon message to see. I’m not sure I can recall any users ever reporting seeing it before. It’s probably worth a glance at the actual function that raises this error:

bokeh/document.py at branch-2.4 · bokeh/bokeh · GitHub

This function delete_modules is called whenever a Document is destroyed, e.g. because a session is closed and cleaned up. The checks explicitly assert that the modules associated with the Document are only referenced in expected ways/places, so that they can be expected to be garbage collected. So, this message is saying that this is not the case, and something somewhere is holding on to an unexpected reference (which could prevent garbage collection).
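To make the idea of that check concrete, here is a minimal sketch of a referrer check built on the standard-library gc.get_referrers. It is not Bokeh’s actual code: the helper extra_referrers, the module name hypothetical_app, and the registry and stray containers are all invented for illustration.

```python
import gc
import types


def extra_referrers(module, expected):
    """Return the objects that refer to `module` but are not in `expected`.

    Frame objects are ignored because a running function legitimately
    holds the module in its local variables.
    """
    expected_ids = {id(obj) for obj in expected}
    return [
        ref for ref in gc.get_referrers(module)
        if id(ref) not in expected_ids and not isinstance(ref, types.FrameType)
    ]


def demo():
    # Stand-in for the per-session app module (hypothetical name).
    module = types.ModuleType("hypothetical_app")
    # An expected holder, playing the role of the server's own bookkeeping.
    registry = [module]
    clean = extra_referrers(module, expected=[registry])
    # An unexpected holder, e.g. a cache that was never cleared.
    stray = {"cached": module}
    leaky = extra_referrers(module, expected=[registry])
    return clean, leaky, stray
```

Calling demo() returns an empty list for the clean case and a one-element list containing the stray dict for the leaky case, which is the kind of object the “Extra referrers” part of the log message lists.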

Things that could plausibly cause an extra reference:

  • Evidently get_referrers makes fairly loose guarantees. Are you using Python 3.10 or 3.11 preview? There is a very slight chance that Python itself has made some change here (that would only explain seeing a referrers message though, not an actual leak)

  • It’s possible Holoviews/Panel is somehow holding on to the app module in some way cc @Philipp_Rudiger

  • It’s possible Bokeh itself is holding on to a reference it should not (but I am not aware of any changes in this area in quite some time, and have not seen any other issues noted)

  • It’s possible that user code is holding the reference

For the last bullet, offhand the only thing that comes to mind is, say, creating a thread that outlives the session, where the code in the thread is holding on to something it shouldn’t. If your app code is creating threads, then you would need to make sure they are cleaned up properly. But there are almost certainly other possibilities. As you say, it’s very difficult to speculate without an MRE (this would really take detailed investigation).
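As an illustration of the thread scenario, here is a minimal sketch of a background worker that can be shut down when a session ends. The class SessionWorker and its attributes are hypothetical, and the sketch runs without Bokeh. The assumption is that stop would be registered as a session-destroyed callback, for example via curdoc().on_session_destroyed(worker.stop), which is why it accepts an optional session_context argument.

```python
import threading


class SessionWorker:
    """Hypothetical per-session background worker with explicit cleanup."""

    def __init__(self, interval=0.05):
        self._stop_event = threading.Event()
        self._interval = interval
        self.ticks = 0
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def _run(self):
        # Event.wait returns True once stop() sets the event,
        # which ends the loop and lets the thread exit.
        while not self._stop_event.wait(self._interval):
            self.ticks += 1

    def stop(self, session_context=None):
        # Assumed to be called when the session is destroyed.
        self._stop_event.set()
        self._thread.join(timeout=2.0)

    @property
    def alive(self):
        return self._thread.is_alive()
```

Once the thread has exited, it no longer keeps its target function, or anything that function captured, alive on behalf of the closed session.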


Also FYI, re: “cell” here is the relevant Python documentation:

Cell Objects — Python 3.9.5 documentation

I am reasonably familiar with some CPython internals, but this is definitely outside my expertise. The only vague, grasping connection I might try to make is that perhaps something involving decorators or closures is involved. E.g. maybe passing a decorated function or an inner function to a FunctionHandler is problematic. But again, that’s pure speculation; what’s really needed is code to run and reproduce.
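To show how a closure can produce a “cell” referrer like the one in the log, here is a small sketch using only the standard library. The names make_callback, cell_referrers, and hypothetical_app are made up for the example, and it says nothing about where the cell in the real app comes from.

```python
import gc
import types


def make_callback(app_module):
    # `app_module` is used by the inner function, so CPython stores it in
    # a cell object that is reachable through callback.__closure__.
    def callback():
        return app_module.__name__
    return callback


def cell_referrers(obj):
    """Return the closure cells that directly reference `obj`."""
    return [ref for ref in gc.get_referrers(obj) if isinstance(ref, types.CellType)]


def demo():
    # Stand-in for the per-session app module (hypothetical name).
    app_module = types.ModuleType("hypothetical_app")
    callback = make_callback(app_module)
    cells = cell_referrers(app_module)
    return app_module, callback, cells
```

The single cell returned by demo() prints as a cell wrapping a module object, the same shape as the referrer in the error message. As long as the callback stays alive somewhere, the module it captured stays alive too.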


For what it’s worth as a data point, we run a large collection of public-facing resource-intensive Bokeh/Panel/Holoviews/Datashader projects accessible through examples.pyviz.org, and we have monitoring in place to see how often the containers involved have to be restarted. In the past we have found some apps with growing memory usage that led to such restarts, but we don’t currently see any issues like that. I don’t remember the precise fixes @Philipp_Rudiger had to do, but I believe they were scattered around, some in the app code, some in older versions of Panel, etc. Not sure that helps for debugging this case…
