Default behavior of charts.Line(DataFrame)

The default behavior for a Line chart given a DataFrame data argument takes the first two columns as the x and y. E.g.:

charts.Line(df)

I think Bokeh 0.9 did something more like this (and a list of series certainly seems quite analogous to a DataFrame):

charts.Line([df.iloc[:,0].tolist(), df.iloc[:,1].tolist()])

The second version is also quite similar to what you’d get via a df.plot() call (using Pandas’ matplotlib convenience wrappers).

If folks agree that the second should be the default, I’ll file an issue and investigate how hard a pull request would be, if you’d like me to do that.

For now, a workaround is to construct an argument for y like list(df), or if you’d prefer something more clear, but longer, df.colnames.tolist().

Generally, I’m loving where the Bokeh API is going, though - keep up the good work!

Thanks,

Dav

Ooops, I mean df.columns.tolist() as the more sensible y-value, not colnames (which is not a thing).
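For reference, a minimal sketch of the two spellings of the workaround. The DataFrame and its column names are made up for illustration. Iterating over a DataFrame yields its column labels, so both forms should give the same list, which per the post above is what would be passed as y:

```python
import pandas as pd

# Hypothetical example data; the column names are illustrative only.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

# Iterating over a DataFrame yields its column labels,
# so list(df) is the short spelling of df.columns.tolist().
short_form = list(df)
explicit_form = df.columns.tolist()

print(short_form)
print(explicit_form)
```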

Sorry for the deprecation in the new interface. Not an excuse, but I’ve been focusing on general features, and I will make sure that we handle these kinds of use cases specially as well. In the upcoming charts PR I have submitted, I abstracted out how this assignment works when no column selections are made, so that each chart builder class can define its own custom method. This is more complicated than it may seem because we accept a large number of potential input types, so there is a common class that handles adapting the inputs and interpreting what they mean. I added a special column selector for line charts (which often seem to use data where the column names are a categorical variable) that first selects only the numerical columns, then assigns the numerical column names to y.
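A rough sketch of what such a selector could look like, written against plain pandas rather than the actual Bokeh internals. The function name default_line_y_columns and the sample data are assumptions for illustration, not the code in the PR:

```python
import pandas as pd


def default_line_y_columns(df):
    """Hypothetical helper: return the names of the numeric columns, in order."""
    # Keep only numeric columns, then use their names as the y selection.
    return df.select_dtypes(include="number").columns.tolist()


# Made-up data with one categorical column and two numeric columns.
df = pd.DataFrame(
    {
        "model": ["x", "y", "z"],
        "mpg": [21.0, 22.8, 18.7],
        "disp": [160, 108, 360],
    }
)

y_columns = default_line_y_columns(df)
print(y_columns)
```

With this data, the categorical model column is dropped and only mpg and disp are left as y candidates.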

The default behavior that makes sense for something like a scatter chart is to assume that the inputs were provided in order. So, Chart(df.mpg, df.disp) works the same as Chart(df, x='mpg', y='disp'). This flexible input handling is just made more complicated by supporting Python 2 and 3 input arguments.
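To illustrate the equivalence of the two calling styles, here is a small sketch in plain Python. The function resolve_xy is hypothetical and is not how Bokeh implements it. A dict of lists stands in for the DataFrame:

```python
def resolve_xy(*args, **kwargs):
    """Hypothetical: return (x_values, y_values) from either calling style."""
    x, y = kwargs.get("x"), kwargs.get("y")
    if x is not None and y is not None:
        # Keyword style: the first positional argument is a table,
        # and x / y name the columns to pull out of it.
        table = args[0]
        return list(table[x]), list(table[y])
    # Positional style: inputs are assumed to be given in x, y order.
    x_values, y_values = args[:2]
    return list(x_values), list(y_values)


# A dict of lists stands in for a DataFrame here.
table = {"mpg": [21.0, 22.8, 18.7], "disp": [160, 108, 360]}

by_position = resolve_xy(table["mpg"], table["disp"])
by_name = resolve_xy(table, x="mpg", y="disp")
print(by_position == by_name)
```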

Btw, the thing that can make this kind of difficult to handle is that your data format isn’t the only way that it could exist. You could have all the values in each of your columns in a single column (e.g., miles_per_gallon), and one additional categorical column containing the labels (e.g., model).
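As a sketch of the two layouts, the same made-up values are shown below first with one column per label, then reshaped with pandas melt into a single value column plus a label column. The column names miles_per_gallon and model follow the examples above; the car names and numbers are invented:

```python
import pandas as pd

# "Wide" layout: one column per label (invented data).
wide = pd.DataFrame({"car_a": [21.0, 22.8], "car_b": [18.7, 24.4]})

# "Tall" layout: all values in one column, labels in a categorical column.
tall = wide.melt(var_name="model", value_name="miles_per_gallon")

print(wide)
print(tall)
```

A default that reads each numeric column as its own line suits the first layout, while the second would need grouping by the model column instead.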

···

On Sunday, October 18, 2015 at 4:12:31 PM UTC-5, Dav Clark @ Cal wrote:

Ooops, I mean df.columns.tolist() as the more sensible y-value, not colnames (which is not a thing).

Totally get that, and good to know that you’re still working on making the internals clean. I suppose one will be able to override the way, e.g., an “unexplained” data frame is handled? Though it’s not clear that would be any better than just wrapping in a function…

I’m mostly thinking about introducing relatively bad programmers to Bokeh. So, those initial attempts that succeed or fail are important. So, I think consistency (e.g., with DataFrame.plot, charts.Line([list, of, sequences]), etc.) is more important than getting the default behavior somehow “right.” Intermediate programmers can just adapt things to work how they want.

But my guess is that your main focus is not beginning programmers!

D

···

On Sunday, October 18, 2015 at 6:45:36 PM UTC-7, Nick Roth wrote:

Btw, the thing that can make this kind of difficult to handle is that your data format isn’t the only way that it could exist. You could have all the values in each of your columns in a single column (e.g., miles_per_gallon), and one additional categorical column containing the labels (e.g., model).

The overriding of how it is handled is more for making sure that we can provide advanced users and/or developers the hooks they need to build custom chart types. It wouldn’t be something a general user would be concerned with.

“Right” is subjective, and consistency depends on what you are comparing to (chart to chart, bokeh to ggplot, bokeh to pandas, etc.). On top of that, it depends on the kinds of data a person typically sees. In scientific applications, you might see more “wide”, column-oriented data, while in business uses you will see more “tall”, normalized data sets. It is difficult to make everyone happy with the default behavior, since any choice will leave one group unhappy, which is why the focus is more on the core functionality. In the future, I’ll focus more on introspecting the column types, etc., to make smarter inferences about what is meant.

Originally, we didn’t consider the case of providing no inputs at all, so the place to provide input would be here where the interface was mocked up: https://github.com/bokeh/bokeh/wiki/Bokeh-Days-Working-Document#line.

There was also a discussion on the data types here: https://github.com/bokeh/bokeh/wiki/Bokeh-Days-Working-Document#data-2

···

On Sunday, October 18, 2015 at 9:08:19 PM UTC-5, Dav Clark @ Cal wrote:

Totally get that, and good to know that you’re still working on making the internals clean. I suppose one will be able to override the way, e.g., an “unexplained” data frame is handled? Though it’s not clear that would be any better than just wrapping in a function…

I’m mostly thinking about introducing relatively bad programmers to Bokeh. So, those initial attempts that succeed or fail are important. So, I think consistency (e.g., with DataFrame.plot, charts.Line([list, of, sequences]), etc.) is more important than getting the default behavior somehow “right.” Intermediate programmers can just adapt things to work how they want.

But my guess is that your main focus is not beginning programmers!

D