Data shows up when plotted as a line graph but not as a scatter plot

SVN · April 11, 2020, 10:33pm

I have two columns: one contains dates and the other percentages. All the data renders properly in a line graph, but when I use a scatter plot only the 0 values from the percentages column are plotted. As a new member I can only upload one photo, so I have included the incorrect graph below. Am I misunderstanding how the scatter plot works?

Incorrect Scatter plot:

Bryan · April 11, 2020, 10:36pm

Hi @SVN, in order to help we would really need to see the code.

SVN · April 11, 2020, 10:38pm

Code for scatter plot:

import pandas as pd
from bokeh.io import output_notebook, show
from bokeh.plotting import figure

stats = pd.read_csv("C:/Users/sachi/Desktop/Code Projects/HTML+CSS/Environment-Visualization/data/airpollution.csv", engine='python')

years = stats.iloc[2:30,0]
carbonMonoxide = stats.iloc[2:30,5]

p = figure(plot_width=1200, plot_height=600)
p.xaxis.axis_label = 'Years'
p.yaxis.axis_label = 'Percentage'
p.circle(years, carbonMonoxide, color= 'orange', size = 10)
show(p)

Code for line graph:

import pandas as pd
from bokeh.io import output_notebook, show
from bokeh.plotting import figure

stats = pd.read_csv("C:/Users/sachi/Desktop/Code Projects/HTML+CSS/Environment-Visualization/data/airpollution.csv", engine='python')

years = stats.iloc[2:30,0]
carbonMonoxide = stats.iloc[2:30,5]

p = figure(plot_width=1200, plot_height=600)
p.xaxis.axis_label = 'Years'
p.yaxis.axis_label = 'Percentage'
p.line(years, carbonMonoxide, line_width=2)
show(p)

The line graph plots all the values, which is why I'm confused.

Bryan · April 11, 2020, 10:41pm

Hi @SVN, please also edit your post to use code formatting so that the code is intelligible (either with the </> icon on the editing toolbar, or triple backtick ``` fences around the code blocks).

SVN · April 11, 2020, 10:43pm

Sorry about that, I’ve edited my last post.

SVN · April 11, 2020, 10:51pm

This is what my data contains as well. I’m using the years for one axis and the values in the Carbon column for the other axis.

Bryan · April 11, 2020, 11:01pm

@SVN I don’t have any explanation for this offhand (and have never seen anything like this reported). Can you provide the CSV file or a part of it? What version of Bokeh are you using BTW?

SVN · April 11, 2020, 11:03pm

I’m using 2.0.1 (latest). I tried attaching the CSV, but it’s not a supported format. Is there any other way of attaching it or sending it?

Bryan · April 11, 2020, 11:06pm

Simplest thing is probably to put it in a public gist: https://gist.github.com/

SVN · April 11, 2020, 11:17pm

I’ve added the files to a GitHub repo. airpollution.csv, in the data folder, is the data I’m using. Visualization.ipynb contains the code.

Bryan · April 11, 2020, 11:41pm

@SVN your CSV file has a ton of unrelated junk in it that is preventing proper parsing by Pandas: the first “title” line and the “notes” at the end. None of that should be in a CSV file. Because of this cruft, Pandas is not able to properly interpret the types of your data, e.g.:

In [3]: years
Out[3]:
2     1990
3     1991

<edited>

28    2016
29    2017
Name: Air pollutant emissions, Canada, 1990 to 2017, dtype: object

In [4]: carbonMonoxide
Out[4]:
2       0
3      -2

<edited>

28    -54
29    -54
Name: Unnamed: 5, dtype: object

Notice the dtype is object, instead of the actual numeric type that it should be. This is almost certainly the root cause of what you are observing with Bokeh (I’m surprised it works at all, in any fashion, with object dtypes).
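A quick way to check this diagnosis is to inspect the dtypes and coerce the sliced columns explicitly. The sketch below uses a small made-up CSV (not the actual airpollution.csv) with a title line on top and a note at the bottom; the column names and values are assumptions for illustration only.

```python
import io

import pandas as pd

# Made-up stand-in for a CSV with a title line above the header and a note below the data.
raw = """Air pollutant emissions (made-up sample),
Year,CO
1990,0
1991,-2
1992,-4
Note: values are percentage change from 1990,"""

# The title line is taken as the header, so the real header and the note become data rows.
stats = pd.read_csv(io.StringIO(raw))
print(stats.dtypes)  # neither column is numeric

# Slice out the data rows (as the original iloc code does), then convert to numbers.
years = pd.to_numeric(stats.iloc[1:4, 0])
carbon_monoxide = pd.to_numeric(stats.iloc[1:4, 1])
print(years.dtype, carbon_monoxide.dtype)
```

Converting after the fact works, but it still relies on hard-coded row positions, so cleaning the input is the more robust fix.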

One red flag was also these lines:

years = stats.iloc[2:30,0]
carbonMonoxide = stats.iloc[2:30,5]

You should really never need hacky things like this. A CSV file should contain the data, and optionally the column headers, and nothing else. If you really can’t delete that junk from the file entirely, you will need to find a way to filter it out before Pandas loads it.
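If the file cannot be edited, one possible way to filter it is to keep only the header line and the rows that start with a year before handing the text to Pandas. This is a minimal sketch on made-up data; `clean_csv_text` is a hypothetical helper, and the "Year," header prefix and four-digit-year pattern are assumptions about the layout.

```python
import io
import re

import pandas as pd

# Made-up stand-in for a CSV with a title line above the header and a note below the data.
raw = """Air pollutant emissions (made-up sample),
Year,CO
1990,0
1991,-2
1992,-4
Note: values are percentage change from 1990,"""


def clean_csv_text(text, header_prefix="Year,"):
    """Keep the header line and rows whose first field is a four-digit year (hypothetical helper)."""
    kept = []
    for line in text.splitlines():
        if line.startswith(header_prefix) or re.match(r"\d{4},", line):
            kept.append(line)
    return "\n".join(kept)


# Pandas now sees only the header and the data rows, so it can infer numeric dtypes.
stats = pd.read_csv(io.StringIO(clean_csv_text(raw)))
print(stats.dtypes)
```

With the junk removed, the columns can be selected by name (for example `stats["CO"]`) instead of by position.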

Bryan · April 11, 2020, 11:47pm

And noting: if I delete all the extraneous lines at the top and the bottom of the file, then Pandas is able to properly infer real int64 types for the columns, which means Bokeh is also able to plot just fine, as expected.

SVN · April 11, 2020, 11:56pm

Thanks so much. I retrieved this data from a public government database; I don’t know why they added all that extra stuff to the CSV file. I removed it and it’s working perfectly. Thanks again!

Bryan · April 12, 2020, 12:03am

Ah, welcome to the wonders of data scrubbing!

p-himik · April 12, 2020, 4:10am

FWIW, Pandas can read such CSV files, but you have to explicitly pass the correct skiprows and skipfooter arguments to its read_csv function. Quite a useful feature, given the number of malformed CSV files out there.
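As a rough sketch of that approach, here is read_csv with skiprows and skipfooter on a small made-up CSV. The data and the counts of one title line and one note line are assumptions; the real airpollution.csv would need its own values.

```python
import io

import pandas as pd

# Made-up stand-in for a CSV with a title line above the header and a note below the data.
raw = """Air pollutant emissions (made-up sample),
Year,CO
1990,0
1991,-2
1992,-4
Note: values are percentage change from 1990,"""

stats = pd.read_csv(
    io.StringIO(raw),
    skiprows=1,       # skip the title line above the real header
    skipfooter=1,     # skip the trailing note line
    engine="python",  # skipfooter is not supported by the default C engine
)
print(stats.dtypes)
```

The original code already passes engine='python', so only the two skip arguments would need to be added there.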

SVN · April 12, 2020, 5:14am

Thanks for the heads up, I’ll use that rather than iloc next time.