Adding sklearn predict data to plot

alofgran · December 6, 2019, 3:24am

I’m attempting to create a line plot of historical data, plus a single point (presumably a scatter plot) showing the value predicted by my machine learning (sklearn) model. I plotted the historical data without too much trouble and have since been adding code to plot the predicted value. The plot has a Bokeh select menu that lets me choose the ID number for each item, thus selecting the appropriate plot and model. At the moment, the date used in the prediction is static (set to 03/31/2020); once I’m able to plot the static date, I plan to add a widget for a user-selected date. Consequently, I’ve attempted to write the code so that the prediction is run whenever a new ID number is selected.

Since I’ve added the prediction code, things have broken down. Can anyone tell me where I’m going wrong?

Below is my current code for the plot itself.
The full code (including the regression code and corresponding data) can be found on my GitHub (code: app_test.py; data: pred_data.csv, historical_data.csv, features_created.pkd).

from bokeh.io import curdoc
from bokeh.layouts import column, row
from bokeh.models import ColumnDataSource, Select, DataRange1d, HoverTool
from bokeh.plotting import figure

# Set up (initial) data
historical_data = historical_data.loc[:, ['ndc', 'date', 'nadac_per_unit']]
historical_data = historical_data.sort_values('date')
historical_source = ColumnDataSource(historical_data[historical_data.loc[:, 'ndc']=='781593600'])
#
import datetime as dt
# prediction_data.loc[:, 'date'] = dt.datetime(2020, 3, 31)
prediction_data.loc[:, 'year'] = 2020
prediction_data.loc[:, 'month'] = 3
prediction_data.loc[:, 'day'] = 31
first_prediction = lin_model.predict(prediction_data)
first_prediction = pd.DataFrame(data = {'ndc':first_prediction[0][0], 'predictions':first_prediction[0][1][0]}, index = [0]) #these element slices are correct
first_prediction['date'] = pd.to_datetime(prediction_data[['year', 'month', 'day']], infer_datetime_format=True, errors = 'coerce')
prediction_source = ColumnDataSource(first_prediction[first_prediction.loc[:, 'ndc']=='781593600'])

id_list = list(prediction_data['ndc'].astype(str))

# Set up plot
plot = figure(plot_height=800, plot_width=800, title='Drug Price Over Time',
              x_axis_type = 'datetime',
              tools="crosshair, pan, reset, save, wheel_zoom")
plot.x_range = DataRange1d(range_padding = .01)
plot.add_tools(HoverTool(tooltips=[('Date', '@date{%F}'), ('Price', '@nadac_per_unit')],
                                    formatters = {'date': 'datetime'}))

plot.line('date', 'nadac_per_unit', source=historical_source)
plot.scatter('date', 'predictions', source=prediction_source)

# Set up widgets
id_select = Select(title='drug_id', value='781593600', options=id_list)

# Set up callbacks
def update_data(attrname, old, new):

    #Get the current select value
    curr_id = id_select.value
    # Generate the new data
    new_historical = historical_data[historical_data['ndc']==curr_id]
    new_historical = new_historical.sort_values('date')

    prediction_data = prediction_data[prediction_data.loc[:, 'ndc']==curr_id]
    new_prediction_data = lin_model.predict(prediction_data)
    new_prediction_data = pd.DataFrame(data = {'ndc':new_prediction_data[0][0], 'predictions':new_prediction_data[0][1][0]}, index = [0]) #these element slices are correct
    new_prediction_data['date'] = pd.to_datetime(prediction_data[['year', 'month', 'day']], infer_datetime_format=True, errors = 'coerce')
    new_prediction_source = ColumnDataSource(new_prediction_data)
    # Overwrite current data with new data
    historical_source.data = ColumnDataSource.from_df(new_historical)
    # prediction_source.data = ColumnDataSource.from_df(new_predicted)

# Action when select menu changes
id_select.on_change('value', update_data)

# Set up layouts and add to document
inputs = column(id_select)

curdoc().add_root(row(inputs, plot, width = 1000))
curdoc().title = 'Drug Price Predictor'

Bryan · December 6, 2019, 4:37pm

All the Bokeh parts look reasonable at a glance, so the issue is possibly with the logic in update_data. Have you tried putting in print statements to inspect the state of the data you are manipulating to make sure it is what you expect? The output will be in the same console log where you ran bokeh serve. Alternatively, are there any errors, either in the server log, or the browser’s javascript log?

alofgran · December 6, 2019, 4:49pm

I have in the past, but since some changes have been made, I’m going through that process again now. It appears that there may be a few problems: 1) historical_data is apparently an empty dataframe, and 2) the predicted_data date column value is NaT. I’m not exactly sure why that’s the case, but it gives me a place to start.

To answer your questions: no errors in the console log. The JavaScript console shows a TypeError: this.element is null, and a warning that [bokeh] could not set initial ranges. Presumably these two notices are due to the empty historical_data dataframe?

Bryan · December 6, 2019, 6:24pm

That’s exactly right. By default, Bokeh plots have auto-ranging DataRange1d objects, and that’s exactly how they complain if there is no data to compute ranges on.
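One way an empty frame like this can arise is a dtype mismatch in the filter. The sketch below assumes the ndc column is stored as integers, which is suggested (but not confirmed) by the updated code later in the thread comparing against an int. The DataFrame contents are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the thread's historical_data; the values are made up.
historical_data = pd.DataFrame({
    "ndc": [781593600, 781593600, 2143380],   # stored as integers
    "nadac_per_unit": [0.10, 0.12, 1.50],
})

# Comparing an integer column to a string matches no rows.
as_string = historical_data[historical_data["ndc"] == "781593600"]

# Converting the string first (as with a Select widget's value) matches as expected.
as_int = historical_data[historical_data["ndc"] == int("781593600")]

# Printing the dtype and the row counts is a quick way to spot the mismatch.
print(historical_data["ndc"].dtype)
print(len(as_string), len(as_int))
```

A Select widget's value is always a string, so a conversion of this kind is needed somewhere if the column really is numeric.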

alofgran · December 9, 2019, 10:48pm

Ok, so I finally got the historical data and predictions showing up on the same plot. The select menu works as well, but it’s extremely slow to display the list (probably 45-60 seconds from the time I click on the select menu to the time it displays). There are only about 30-35 options, so I’m not sure of the cause of the problem. My initial suspicion (due to the time required for that operation) was that the predict method was being called each time I selected the menu; however, I don’t think that’s the case.

Is there any glaring problem in this updated code that would result in extended times to produce the select menu?

from bokeh.io import curdoc
from bokeh.layouts import column, row
from bokeh.models import ColumnDataSource, Select, DataRange1d, HoverTool
from bokeh.plotting import figure

# Set up (initial) data
historical_data = historical_data.loc[:, ['ndc', 'date', 'nadac_per_unit']]
hist_temp = historical_data[historical_data.loc[:, 'ndc']==781593600].sort_values('date')
historical_source = ColumnDataSource(data = hist_temp)
#
import datetime as dt
#Get initial prediction
date = dt.datetime.strptime('-'.join(('2020', '3', '31')), '%Y-%m-%d')
new_prediction_data = prediction_data[prediction_data.loc[:, 'ndc']==781593600].copy() #working; .copy() avoids SettingWithCopyWarning on the assignments below
new_prediction_data.loc[:, 'year'] = date.year
new_prediction_data.loc[:, 'month'] = date.month
new_prediction_data.loc[:, 'day'] = date.day
new_prediction_data = lin_model.predict(new_prediction_data)
new_prediction_data = pd.DataFrame(data = {'ndc':new_prediction_data[0][0], 'nadac_per_unit':new_prediction_data[0][1][0]}, index = [0]) #these element slices are correct
new_prediction_data['date'] = pd.to_datetime(date, format='%Y-%m-%d')
new_prediction_data['ndc'] = new_prediction_data['ndc'].astype(float).astype('int64')
new_prediction_data['nadac_per_unit'] = new_prediction_data['nadac_per_unit'].astype('float16')
prediction_source = ColumnDataSource(data=new_prediction_data)

id_list = list(prediction_data['ndc'].astype(str))

# Set up plot
plot = figure(plot_height=800, plot_width=800, title='Drug Price Over Time',
              x_axis_type = 'datetime',
              tools="crosshair, pan, reset, save, wheel_zoom")
plot.xaxis.axis_label = 'Time'
plot.yaxis.axis_label = 'Price ($)'
plot.axis.axis_label_text_font_style = 'bold'
plot.grid.grid_line_alpha = 0.8
plot.x_range = DataRange1d(range_padding = .01)
plot.add_tools(HoverTool(tooltips=[('Date', '@date{%F}'), ('Price', '@nadac_per_unit')],
                                    formatters = {'date': 'datetime'}))

plot.line('date', 'nadac_per_unit', source=historical_source)
plot.scatter('date', 'nadac_per_unit', source=prediction_source, fill_color='red', size=8)

# Set up widgets
id_select = Select(title='drug_id', value='781593600', options=id_list)

# Set up callbacks
def update_data(attrname, old, new):

    #Get the current select value
    curr_id = id_select.value
    # Generate the new data
    new_historical = historical_data[historical_data.loc[:, 'ndc']==int(curr_id)]
    new_historical = new_historical.sort_values('date')

    new_prediction_data = prediction_data[prediction_data.loc[:, 'ndc']==int(curr_id)].copy() #working; .copy() avoids SettingWithCopyWarning on the assignments below
    date = dt.datetime.strptime('-'.join(('2020', '3', '31')), '%Y-%m-%d')
    new_prediction_data.loc[:, 'year'] = date.year
    new_prediction_data.loc[:, 'month'] = date.month
    new_prediction_data.loc[:, 'day'] = date.day
    new_prediction_data = lin_model.predict(new_prediction_data)
    new_prediction_data = pd.DataFrame(data = {'ndc':new_prediction_data[0][0], 'nadac_per_unit':new_prediction_data[0][1][0]}, index = [0]) #these element slices are correct
    new_prediction_data['date'] = pd.to_datetime(date, format='%Y-%m-%d')
    new_prediction_data['ndc'] = new_prediction_data['ndc'].astype(float).astype('int64')

    # Overwrite current data with new data
    historical_source.data = ColumnDataSource.from_df(new_historical)
    prediction_source.data = ColumnDataSource.from_df(new_prediction_data)

# Action when select menu changes
id_select.on_change('value', update_data)

# Set up layouts and add to document
inputs = column(id_select)

curdoc().add_root(row(inputs, plot, width = 1000))
curdoc().title = 'Drug Price Predictor'

Bryan · December 9, 2019, 11:21pm

My initial suspicion (due to the time required for that operation) was that the predict method was being called each time I selected the menu, however, I don’t think that’s the case.

That seems to be exactly the case:

new_prediction_data = lin_model.predict(new_prediction_data)

is inside the update_data callback that is triggered by select box changes.

alofgran · December 9, 2019, 11:50pm

Ah, ok. Thanks for validating. So to solve the time problem, I’d have to figure out how to move that .predict method elsewhere. Presumably putting it in another function wouldn’t fix the problem. As this is my first Scikit-Learn/Bokeh venture, I’m not entirely sure how to resolve the matter…I was hoping that it would be fast enough if I filtered by the current selected ID and then ran the prediction (as opposed to processing all of the IDs each time the select menu is clicked).

Any suggestions?

All in all, I’m still a bit baffled that it takes as long as it does (considering we’re talking about datasets < 1mb).

Bryan · December 10, 2019, 12:58am

Putting it in another function won’t help if the callback just ends up calling that function.

I was hoping that it would be fast enough if I filtered by the current selected ID and then ran the prediction (as opposed to processing all of the IDs each time the select menu is clicked).

I’m afraid I don’t really know enough about the domain, or your intended results, to offer much. I will say that Bokeh is not magic. If you need to do something that takes time to compute, then that time will have to be spent. Usually the two options are “compute everything up front to make interactions later very fast” or “put up with the delay during the interactions”. If you do the latter, you might simply want to give users an indication that “work is being done”. Since that requires a small, possibly non-obvious effort, here’s a useful discussion:

Python Bokeh markup text value can't update - Stack Overflow
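The “compute everything up front” option could look roughly like the sketch below. It uses only the standard library: slow_predict is a hypothetical stand-in for the expensive lin_model.predict call, and the ID list and returned "prices" are made up. The callback only mirrors the signature of a Bokeh on_change callback; it is not wired to a real widget here.

```python
import time

def slow_predict(drug_id):
    """Hypothetical stand-in for an expensive lin_model.predict call."""
    time.sleep(0.05)               # simulate the slow part
    return float(len(drug_id))     # made-up "price", for illustration only

drug_ids = ["781593600", "2143380", "24208291"]  # made-up option list

# Pay the prediction cost once, at startup, for every selectable ID.
predictions = {drug_id: slow_predict(drug_id) for drug_id in drug_ids}

# Stand-in for the data source that the plot would read from.
current = {"price": predictions[drug_ids[0]]}

def update_data(attrname, old, new):
    """Same signature as a Bokeh on_change callback; does only a dict lookup."""
    current["price"] = predictions[new]

start = time.perf_counter()
update_data("value", "781593600", "2143380")
elapsed = time.perf_counter() - start
print(current["price"], f"{elapsed:.6f}s")
```

The trade-off is a slower startup in exchange for fast interactions, which is only practical when the set of inputs (here, the IDs for one fixed date) is known ahead of time.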

All in all, I’m still a bit baffled that it takes as long as it does (considering we’re talking about datasets < 1mb).

I’d still recommend doing some simple print statement profiling with time.time to actually identify the hotspot with certainty.
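A minimal sketch of that kind of print-statement profiling follows. The timed helper and the three labelled stages are illustrative assumptions, and the stage bodies are stand-ins rather than the real DataFrame filter, lin_model.predict call, or data source update.

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(label):
    """Record and print how long the wrapped block took."""
    start = time.time()
    try:
        yield
    finally:
        timings[label] = time.time() - start
        print(f"{label}: {timings[label]:.3f}s")

def update_data(attrname, old, new):
    with timed("filter"):
        rows = [i for i in range(1000) if i % 7 == 0]  # stand-in for the DataFrame filter
    with timed("predict"):
        time.sleep(0.05)                               # stand-in for lin_model.predict
    with timed("update sources"):
        result = sum(rows)                             # stand-in for assigning source.data
    return result

update_data("value", "a", "b")

# The label with the largest elapsed time is the hotspot.
hotspot = max(timings, key=timings.get)
print("slowest stage:", hotspot)
```

In a Bokeh server app, the printed lines would show up in the console where bokeh serve was started.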

samirak93 · December 10, 2019, 4:31pm

I’m currently working on something similar (bokeh + sklearn). I faced the same issue where the predictions were running on every selectbox update (just like yours). So I just moved the sklearn part to a separate function and added a new button so that the prediction function is called only when the button is clicked.

This still doesn’t reduce the time the prediction takes to compute, but it at least stops it from being executed every time the selectbox is updated.