ColumnDataSource confuses Pandas dataframe index with column

Hi guys,

Follow-up to my issue yesterday.

It seems that the ColumnDataSource class takes the index of a DataFrame and turns it into a column, right alongside the DataFrame's actual columns.

This makes pandas DataFrames all but useless for streaming, as the index column created by ColumnDataSource makes Bokeh's internals spit errors.

Here is a very basic example of the problem:

df = pd.read_csv('test.csv')

source = ColumnDataSource.from_df(df)

That is all it takes: 'source' now contains an 'index' column created from the DataFrame's index.

I have tried to circumvent this by changing datatypes and attempting to manage the index directly, but so far no luck.

In particular, my issue is with streaming the CDS, where I get this error:

ValueError('Must stream updates to all existing columns (missing: index)',)

Any ideas around this problem would be hugely appreciated.

Cheers,

Sean

A possible solution could be removing the index column from your data source before you bind it to any glyphs in your plot. You can do this with the ColumnDataSource.remove() method.

import pandas as pd
from bokeh.models import ColumnDataSource

df = pd.DataFrame(dict(x=[1, 2, 3]))
print(df)
#    x
# 0  1
# 1  2
# 2  3

src = ColumnDataSource(df)
print(src.data)
# {'x': [1, 2, 3], 'index': [0, 1, 2]}

src.remove('index')
print(src.data)
# {'x': [1, 2, 3]}
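
A small editorial aside (a sketch of my own, not part of the original reply): instead of removing the column afterwards, you can avoid creating it at all by handing ColumnDataSource a plain dict built from the DataFrame.

import pandas as pd
from bokeh.models import ColumnDataSource

df = pd.DataFrame(dict(x=[1, 2, 3]))

# to_dict(orient='list') returns {'x': [1, 2, 3]} with no index entry,
# so the resulting source never gets an 'index' column in the first place.
src = ColumnDataSource(data=df.to_dict(orient='list'))
print(src.data)
# {'x': [1, 2, 3]}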


Hi Tyler,

Thanks for the solution; it appears to drop the index as required.

However, I now get another error that seems to imply the index column is necessary:

AttributeError: 'ColumnDataSource' object has no attribute 'index'

Here is my full code for reference:

from sqlalchemy import create_engine
from bokeh.io import curdoc
from bokeh.plotting import figure, ColumnDataSource
from bokeh.models import HoverTool, LinearColorMapper
from bokeh.palettes import *
from config import *
import pandas as pd

engine = create_engine(mysql['connection'], pool_size=20, max_overflow=0)

palette = small_palettes['Viridis'][6]
palette2 = small_palettes['Inferno'][10]

source = ColumnDataSource(data={'cid': [],
                                'ip_dst': [],
                                'ip_src': [],
                                'sid': [],
                                'sig_class_name': [],
                                'sig_gid': [],
                                'sig_id': [],
                                'sig_name': [],
                                'sig_priority': [],
                                'sig_rev': [],
                                'sig_sid': [],
                                'signature': [],
                                'tcp_dport': [],
                                'tcp_sport': [],
                                'timestamp': []})

timeseries = figure(title='Alerts by Datetime', x_axis_type='datetime', plot_height=450, plot_width=1000,
                    x_axis_label='Datetime', y_axis_label='Alert Priority', toolbar_location='left',
                    toolbar_sticky=False, logo=None)

timeseries.line(x='timestamp', y='sig_priority', alpha=0.85, line_width=3, color='deepskyblue', line_dash="4 4",
                source=source)

mapper = LinearColorMapper(palette=['blue', 'green', 'yellow'])

timeseries.circle(x='timestamp', y='sig_priority', fill_color={'field': 'sig_priority', 'transform': mapper},
                  source=source, size=5)

hover = HoverTool(tooltips=[("Priority", "@sig_priority"), ("Signature", "@sig_name"),
                            ("Classification", "@sig_class_name"), ("Source IP", "@ip_src"),
                            ("Source Port", "@tcp_sport"), ("Destination IP", "@ip_dst"),
                            ("Destination Port", "@tcp_dport"), ("Timestamp", "@timestamp")])

timeseries.add_tools(hover)

def update():
    df = pd.read_sql_query('SELECT * FROM event LEFT JOIN signature ON event.signature=signature.sig_id '
                           'LEFT JOIN sig_class ON signature.sig_class_id=sig_class.sig_class_id;',
                           engine).drop('sig_class_id', axis=1)
    df2 = pd.read_sql_query('SELECT ip_src,ip_dst FROM iphdr;', engine)
    df3 = pd.read_sql_query('SELECT tcp_sport, tcp_dport FROM tcphdr;', engine)
    df6 = pd.concat([df, df2, df3], axis=1, verify_integrity=True)
    initial = ColumnDataSource(df6)
    initial.remove('index')
    source = ColumnDataSource.from_df(initial)
    source.stream(source, 300)

curdoc().add_root(timeseries)
curdoc().add_periodic_callback(update, 1000)
curdoc().title = "Snort Alerts"

I feel like I've tried everything, but the ColumnDataSource seems to break it every time.

Cheers,

Sean


It looks like the error is being caused by this line:

source = ColumnDataSource.from_df(initial)

My guess is that the error you are getting is because from_df() is expecting a DataFrame object and initial is a ColumnDataSource object.

Also, I’m not certain that the stream method can take a ColumnDataSource object. This page leads me to believe it is expecting a dictionary. Converting to a dictionary might be the easiest solution, as the DataFrame.to_dict() method doesn’t create a dictionary item for the index. To illustrate, below is update() modified to pass a dictionary to source.stream() instead.

def update():
    df = pd.read_sql_query('SELECT * FROM event LEFT JOIN signature ON event.signature=signature.sig_id '
                           'LEFT JOIN sig_class ON signature.sig_class_id=sig_class.sig_class_id;',
                           engine).drop('sig_class_id', axis=1)
    df2 = pd.read_sql_query('SELECT ip_src,ip_dst FROM iphdr;', engine)
    df3 = pd.read_sql_query('SELECT tcp_sport, tcp_dport FROM tcphdr;', engine)
    df6 = pd.concat([df, df2, df3], axis=1, verify_integrity=True)
    new_data = df6.to_dict()
    source.stream(new_data, 30)
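
One caveat worth flagging here (an editorial note, not part of the original reply): DataFrame.to_dict() with its default orient returns nested dicts keyed by the row index, while stream() expects one flat list or array per column, so orient='list' is probably the shape you want. A minimal sketch:

import pandas as pd

df = pd.DataFrame(dict(x=[1, 2, 3]))

print(df.to_dict())
# {'x': {0: 1, 1: 2, 2: 3}}   nested dicts keyed by the index

print(df.to_dict(orient='list'))
# {'x': [1, 2, 3]}            flat lists, the shape stream() expects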

It might need some tweaking to get it working as desired in your code, but I think that should get you past the error. If not, I might need a reproducible example to get a better idea of what is happening.

Regards,
Tyler


Thanks Tyler,

This is exactly what was needed. The .stream() method needed a dictionary, and I was passing it a ColumnDataSource, which was causing the error.

My working code looks like this now.

I start my ColumnDataSource object empty, like this:

source = ColumnDataSource(data={'cid': [],
                                'ip_dst': [],
                                'ip_src': [],
                                'sid': [],
                                'sig_class_name': [],
                                'sig_gid': [],
                                'sig_id': [],
                                'sig_name': [],
                                'sig_priority': [],
                                'sig_rev': [],
                                'sig_sid': [],
                                'signature': [],
                                'tcp_dport': [],
                                'tcp_sport': [],
                                'timestamp': []})

And the update() function for the stream method looks like this:

def update():
    df = pd.read_sql_query('SELECT * FROM event LEFT JOIN signature ON event.signature=signature.sig_id '
                           'LEFT JOIN sig_class ON signature.sig_class_id=sig_class.sig_class_id;',
                           engine).drop('sig_class_id', axis=1)
    df2 = pd.read_sql_query('SELECT ip_src,ip_dst FROM iphdr;', engine)
    df3 = pd.read_sql_query('SELECT tcp_sport, tcp_dport FROM tcphdr;', engine)
    df6 = pd.concat([df, df2, df3], axis=1, verify_integrity=True).to_dict(orient='list')
    source.stream(df6, 300)

curdoc().add_periodic_callback(update, 100)

There is only one small issue: my graph doesn't show all of the existing data, but it does update successfully. I'm not sure whether that is down to the ColumnDataSource starting empty, the update() function not retrieving all of the data as expected, or something else entirely. Regardless, the original issue is solved.
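
One hedged guess at the missing history (an assumption on my part, not something confirmed in the thread): the second argument to stream() is the rollover, and a rollover of 300 discards older rows once the source grows past 300 rows, so raising it or leaving it unset keeps everything. A sketch against the same source and df6 as above:

# rollover=None (the default) keeps every streamed row;
# a numeric rollover keeps only the most recent N rows.
source.stream(df6, rollover=None)
# or cap memory while keeping more history:
source.stream(df6, rollover=10000)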

Thank you for the assistance,

Regards,

Sean
