Server-side caching of excessive (provisional) data

ThomasB · February 20, 2020, 10:12am

According to the documentation, Bokeh already provides sophisticated session management between client and server.
The core seems to be the server.session class or the document object, which holds all the information for the browser-side GUI.
However, I was not able to figure out how or where I can store data that are not (yet) supposed to be exchanged with the browser, but still belong to a session.

The situation is that we want to upload rather large data-sets to the server once at the start of a session.
These data should be cached completely on the server but will only be plotted and analysed in smaller parts or fractions of the whole set.
I could do that by setting up my own session handling / caching in parallel to bokeh’s.
But it would be much more convenient to have an accessible component of server.session where the application can cache user data server-side without sending those data in full to the browser each time the page is reloaded or the GUI is manipulated.

Is there a place to store the data in the bokeh session on server-side?

Thanks for the support!

p-himik · February 22, 2020, 2:39pm

we want to upload rather large data-sets to the server once at the start of a session

If you mean a Bokeh session here then it’s pretty simple, albeit not ideal. Right now there’s a 1-to-1 correspondence between a session and a document, and the document is never recreated - it’s always the same object, unless you create a new session.
In your application, when you initialize the document, you can just attach any field to it that’s not taken by Bokeh itself. Such a field won’t be synchronized with clients and will always be available wherever you can access the document.

ThomasB · February 22, 2020, 6:39pm

That sounds exactly like the thing I hoped for. I was afraid of breaking other things if I amended the document.
Thanks for clarification!
I will try it.

rmitchel · February 22, 2020, 11:53pm

Hey, do you mind clarifying something a little more for me? By initialize the document you are referring to in the actual app where for instance you do curdoc().add_root()? Also, in order to set up data on the document and always retrieve it you would have to keep track of the session id to make sure you get the correct document/session, right? Lastly (sorry haha), how do you access this field from the document - I believe in the question Thomas wanted to only store the data and then later (in a separate bokeh app) graph a small subset of that data, so you wouldn’t necessarily want to attach a plot to the main document… I think?

I’m trying to do something very similar where there are a few things that are user-wide and it would be very nice to not have to constantly fetch data (from a db) for different users, but also I don’t necessarily want to actually plot this data - just cache it until later.

Thanks for any help!
Ryan

ThomasB · February 23, 2020, 11:16am

Actually, from my side, I was looking for a solution for a single bokeh session.
Hence, attaching something to the curdoc() document probably solves my problem, as long as it is not synced with the client/browser on each and every change of the current view of the GUI.
I want the user to be able to switch between small subsets of the data for analysis, without transferring the complete data set every time.
Did not try the concept yet, though.

Thomas

p-himik · February 24, 2020, 5:18pm

By initialize the document you are referring to in the actual app where for instance you do curdoc().add_root()?

Yes. Or in a proper application where you have to implement a function with one argument, doc.

you would have to keep track of the session id to make sure you get the correct document/session, right?

No, Bokeh already does that for you.

how do you access this field from the document

curdoc().prefetched_data = pd.DataFrame(...)  # Or some other attribute that's not taken by Bokeh.

...

data = curdoc().prefetched_data
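
To make the idea concrete, here is a minimal sketch of keeping the full data set on the document and sending only one window of it to the browser. Everything named here (`window_of`, `bkapp`, the generated data, the window size of 500) is hypothetical and only for illustration; it assumes the ad-hoc attribute behaves as described above, and it has not been checked against any particular Bokeh version.

```python
def window_of(full, start, size):
    """Return rows [start, start + size) of a column-oriented dict of lists."""
    return {name: values[start:start + size] for name, values in full.items()}


def bkapp(doc):
    # Bokeh is imported here so that window_of() stays usable without it.
    from bokeh.layouts import column
    from bokeh.models import ColumnDataSource, Slider
    from bokeh.plotting import figure

    # Hypothetical stand-in for the large data set uploaded once per session.
    n = 100_000
    doc.prefetched_data = {"x": list(range(n)), "y": [i % 97 for i in range(n)]}

    size = 500
    # Only this small window is placed in a model that is synced to the browser.
    source = ColumnDataSource(data=window_of(doc.prefetched_data, 0, size))

    fig = figure()
    fig.line("x", "y", source=source)

    slider = Slider(start=0, end=n - size, value=0, step=size, title="Window start")

    def on_slide(attr, old, new):
        # Swap in a different window; the full data never leaves the server.
        source.data = window_of(doc.prefetched_data, int(new), size)

    slider.on_change("value", on_slide)
    doc.add_root(column(slider, fig))
```

`bkapp` follows the one-argument function form mentioned above; in a script-style app the same body could run at module level against `curdoc()` instead.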

rmitchel · February 26, 2020, 4:22pm

Ah ok, I’m beginning to understand more -
No, Bokeh already does that for you.
But I think I’m misunderstanding what curdoc fetches exactly - I have a bokeh app embedded into django:

The bokeh app:

def playground_handler(doc: Document):
    def cb(data):
        print(data)

    doc.my_cb = cb

The django url endpoint:

def playground(request: HttpRequest) -> HttpResponse:
    script = server_document(request.build_absolute_uri())

    if request.method == 'POST':
        print("got post")
        name = request.POST.get('test')
        curdoc().my_cb(name)
        return HttpResponse('')

    return render(request, "test/playground.html", dict(script=script))

The post request is successfully received (from a button click on the template page), however the doc that curdoc() returns does not have a session_context on it! Any ideas what is going on here?

Thanks,
Ryan

p-himik · February 26, 2020, 4:35pm

Oh, sorry, not a clue.
I don’t want to touch Django with a ten-foot pole, I hate it with passion.

rmitchel · February 26, 2020, 4:58pm

Haha fair enough, though I’m hoping it’s the same in this instance as with flask (unless you hate flask too)? Or I guess a more generic question is how does bokeh figure out which session to get?

p-himik · February 26, 2020, 5:20pm

Bokeh is built on top of Tornado, and that’s what I use. I’ve never used Flask, can’t really say anything about it.

how does bokeh figure out which session to get?

There’s a map of session IDs to session objects in bokeh.server.contexts.ApplicationContext._sessions.

rmitchel · February 26, 2020, 5:27pm

Ohh ok, so then how are session IDs determined? The issue must be occurring when the curdoc() session id is determined vs the id established by the autoload/websocket connection.

p-himik · February 26, 2020, 5:51pm

Seems like it’s this line in case of Django: bokeh/consumers.py at b19f2c5547024bdc288d02e73fdb65e65991df5f · bokeh/bokeh · GitHub

rmitchel · February 26, 2020, 5:55pm

Hm yeah, I suppose now I need to figure out how to steal that value!

rmitchel · March 2, 2020, 7:06pm

I’ve switched over to a non-django example to test this - for some reason when I set data on the document it’s not remembered.
This is the bokeh app:

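from bokeh.io import curdoc
from bokeh.plotting import figure

data = {"test": "123"}

doc = curdoc()

# ad-hoc attribute set on the server-side document
doc.pre_data = data

fig = figure(tools="tap,save",
             background_fill_color='gray', background_fill_alpha=0.3,
             match_aspect=True, plot_height=500, plot_width=2000)
doc.add_root(fig)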

For rendering this bokeh document I do:

        bokeh_server_url = "http://localhost:5006/example"
        with client.pull_session(url=bokeh_server_url) as session:
            bokeh_id = session.id
            # save session id for later

            server_script = server_session(session_id=session.id, url=bokeh_server_url)
            context = dict(script=server_script, test="ok")
            return render(request, self.template, context)

And then to get the data again (at a later point):

        bokeh_server_url = "http://localhost:5006/example"
        bokeh_id = ...  # obtain saved session id
        with pull_session(session_id=bokeh_id, url=bokeh_server_url) as session:
            doc = session.document
            print(doc.pre_data["test"])  # doc does not have pre_data on it

I’ve removed the django saving logic for brevity, but I have checked that the ids are the same - am I missing something fundamental about how the bokeh server provides the ServerSession?

The resulting doc also does not have a SessionContext, which seems weird.

Bryan · March 2, 2020, 9:43pm

Bokeh server processes do not share any state and are meant to be completely horizontally scalable. So in general this can’t be a reliable operation, because in cases where there are multiple Bokeh server processes (e.g. with --num-procs or behind a load balancer), there is no guarantee that the first network call and subsequent calls land on the same process. In which event, the default behavior of the Bokeh server is simply to create a brand new session and document on demand.
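
As a toy illustration of that point (plain Python, no Bokeh involved; `ToyDoc` and `ToyServerProcess` are made-up names, and the get-or-create logic is only a simplified model of the behavior described above), two processes that each keep a private session map will hand back different documents for the same session id:

```python
import uuid


class ToyDoc:
    """Stand-in for a document; it only carries ad-hoc attributes."""


class ToyServerProcess:
    """Models one server process with its own, unshared session map."""

    def __init__(self, name):
        self.name = name
        self._sessions = {}  # session id -> document, private to this process

    def get_or_create(self, session_id):
        # An unknown id silently gets a brand new document.
        if session_id not in self._sessions:
            self._sessions[session_id] = ToyDoc()
        return self._sessions[session_id]


proc_a = ToyServerProcess("a")
proc_b = ToyServerProcess("b")

session_id = uuid.uuid4().hex
doc = proc_a.get_or_create(session_id)
doc.pre_data = {"test": "123"}

same_process = proc_a.get_or_create(session_id)
other_process = proc_b.get_or_create(session_id)

print(hasattr(same_process, "pre_data"))   # True: same object as before
print(hasattr(other_process, "pre_data"))  # False: fresh document
```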

I think the only way to make this sort of affordance supportable in general is to involve some sort of shared global backing store (e.g. redis, or a cloud filesystem, etc). We are not going to implement anything specific ourselves, however it might make sense to add hooks that users could implement to provide whatever per-session data access and retrieval they need. That would require new development, though.
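
A rough sketch of what such a backing store could look like from the application's side. This is not an existing Bokeh API: `InMemorySessionStore` and its methods are hypothetical, and the dict stands in for redis or a shared filesystem. Values are pickled on purpose, to mimic data crossing a process boundary.

```python
import pickle


class InMemorySessionStore:
    """Hypothetical per-session store; a dict stands in for a shared backend."""

    def __init__(self):
        self._blobs = {}  # (session_id, key) -> pickled bytes

    def store(self, session_id, key, value):
        # Serialize, as a real shared store would require.
        self._blobs[(session_id, key)] = pickle.dumps(value)

    def retrieve(self, session_id, key, default=None):
        blob = self._blobs.get((session_id, key))
        return default if blob is None else pickle.loads(blob)

    def discard_session(self, session_id):
        # Drop everything that belongs to one session, e.g. when it expires.
        for composite_key in [k for k in self._blobs if k[0] == session_id]:
            del self._blobs[composite_key]


store = InMemorySessionStore()
store.store("session-1", "pre_data", {"test": "123"})

restored = store.retrieve("session-1", "pre_data")
missing = store.retrieve("session-2", "pre_data")

store.discard_session("session-1")
after_discard = store.retrieve("session-1", "pre_data")
```

Because lookups go by session id rather than by document object, any process that knows the id could read the data, which is the property the in-process document attribute lacks.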

If you aren’t using multiple processes then I don’t know offhand what the issue might be, but we’ve definitely never demonstrated pull_session in this way, so I don’t think I would consider it supported usage in any case. The only usage we’ve ever demonstrated is one-time “up-front” session customization:

from bokeh.client import pull_session
from bokeh.embed import server_session

# app_url is the URL of the running Bokeh application
with pull_session(url=app_url) as session:

    # update or customize that session
    session.document.roots[0].title.text = "Special Plot Title For A Specific User!"

    # generate a script to load the customized session
    script = server_session(session_id=session.id, url=app_url)

which will also only work with a single Bokeh server process.

FWIW I think the session store hooks are a good idea but I don’t know when I’d personally be able to work on them.

rmitchel · March 3, 2020, 3:30pm

Ohhhh! Makes more sense why it’s not working then haha. I am still very curious as to how the figures etc. make it into the result of pull_session but other customization (also done in the app itself) does not! In any case though, there are a few possibilities that I can think of right now for implementation (I’ll give one a go):

The simplest version is just attaching a list of changes to a document every time the document is pulled, i.e. we don’t completely store the document, just the changes are stored.

So now pull_session has the functionality:

with pull_session(url=app_url) as session:
    session.document.x  # added by pull_session
    session.document.y  # added by pull_session

While it would be fine to just expect the user to mutate all their data from there how they please, it may be nice to enhance pull_session to accept functions that act on various properties of the document. I wouldn’t recommend this as a standalone improvement to pull_session since it’s unnecessarily over-complicating it - but if we’re already enhancing it I think it becomes more viable.

In terms of the hooks, I think it makes sense to also put them in bokeh.client, given its purpose: creating and customizing specific sessions of a Bokeh application running in a Bokeh Server, before passing them to a viewer. Since we don’t want to implement the actual storage functionality ourselves, it’s as simple as providing bokeh.client with a class/object that implements store and retrieve.

class MyChangeManager:

    def store(self, change, property_name):
        'redis or other store details, associate with property_name provided by bokeh'

    def retrieve(self, property_name):
        item = 'retrieve from store by id'
        return item

On our side we implement the generation of unique IDs. The only remaining issue is ‘where do these changes get specified’? Well, ideally they get specified in the bokeh application, whether it be a curdoc() version or a def app(doc):. Therefore, the MyChangeManager needs to also live within the bokeh server and be synced across processes, and then we provide a save method that can be used in the app itself to indicate to bokeh that a property needs to be saved to the store.

Largely spitballing here, overall I don’t think it’s that large of an implementation on the bokeh side, and should only remain complex enough to enable any type of store. Thoughts?

rmitchel · March 4, 2020, 4:45pm

Related server question - I’m noticing that BokehJS opens a WebSocket connection which is handled by ws.py, and that is how the document is loaded (and the session context is established); that document does actually have the added properties on it. On the other hand, if you call pull_session, it uses client.py, connection.py, states.py, etc. and seems to basically mimic what BokehJS would do but client-side, and then the document doesn’t have the session_context attached (or the added properties). What’s going on with that? If I do:

with pull_session(url=app_url) as session:

    # update or customize that session
    session.document.roots[0].title.text = "Special Plot Title For A Specific User!"

    # generate a script to load the customized session
    script = server_session(session_id=session.id, url=app_url)

Then it goes through both paths: ws.py, and the client ‘pretending’ to be the browser in order to customize!
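
One possible explanation, offered as an assumption rather than a confirmed description of Bokeh internals: pull_session runs in the calling process and fills a separate, local document from what the server sends over the websocket, so only the synchronized models (the roots) arrive. Attributes set ad hoc on the server-side document would not be part of that payload. The toy model below (plain Python; `ToyReplicatedDoc` is a made-up class, not Bokeh code) shows the effect:

```python
import json


class ToyReplicatedDoc:
    """Toy document whose roots are the only state that gets serialized."""

    def __init__(self):
        self.roots = []

    def to_json(self):
        # Only the roots go over the wire; ad-hoc attributes are ignored.
        return json.dumps({"roots": self.roots})

    @classmethod
    def from_json(cls, payload):
        doc = cls()
        doc.roots = json.loads(payload)["roots"]
        return doc


server_doc = ToyReplicatedDoc()
server_doc.roots.append({"type": "Figure", "title": "demo"})
server_doc.pre_data = {"test": "123"}  # ad hoc, lives on the server side only

# What a pulling client would rebuild from the serialized form.
client_doc = ToyReplicatedDoc.from_json(server_doc.to_json())

print(client_doc.roots)                 # the figure description arrives
print(hasattr(client_doc, "pre_data"))  # False: never serialized
```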