Concept for visualizing millions of data points

Hi,

My name is Jan Girlich, I'm from Germany and I work in IT security. I
studied IT, previously worked as a C++ developer, and know my way around
Python as well.

I'm currently working on a visualization project with about 15 to 20
million data points, which need to be browsable by panning and zooming.
Trying to solve this is how I found Bokeh.

The data consists of several time series, which should be displayed in
an x-y graph with one axis being time and the other axis representing
the series. For example, x is the time and y is "series 1", "series 2"
and so on. Every x-y coordinate represents one data point with a value
between 0 and 255, drawn as a rectangle colored in a shade of gray
according to its value. When hovering over a coordinate, a box should
show more details.

This is already really slow when simply dumping about 400,000 data
points into a ColumnDataSource, so I'd like to discuss my approach to
this problem:

My idea is to put all the data in a pandas DataFrame and, depending on
the zoom level, trigger a callback on the Bokeh server that finds the
data points falling within each screen pixel and calculates a mean gray
value for that pixel. The goal is to keep the number of elements to
display low (and then maybe use WebGL to display them, so panning is
fast). Do you think this could work? How would you solve this?

Cheers :slight_smile:
Jan
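
The per-pixel averaging described above could look roughly like the following. This is only a minimal sketch of the idea, not Bokeh or Datashader API: the function name `downsample_to_pixels` and its parameters are hypothetical, and it assumes the data is available as three NumPy arrays (timestamps, integer series index, value), for example taken from DataFrame columns.

```python
import numpy as np

def downsample_to_pixels(t, series_idx, values, t_range, n_series, width_px):
    """Mean value per (series row, pixel column) for the visible time range.

    Hypothetical helper: t, series_idx and values are equal-length 1D arrays.
    Returns an array of shape (n_series, width_px); empty cells are NaN.
    """
    t0, t1 = t_range
    # Keep only the points inside the currently visible time window.
    in_view = (t >= t0) & (t < t1)
    t, series_idx, values = t[in_view], series_idx[in_view], values[in_view]

    # Map each timestamp to a pixel column in [0, width_px).
    col = ((t - t0) / (t1 - t0) * width_px).astype(np.int64)
    col = np.clip(col, 0, width_px - 1)

    # One flat bin per (series, column) cell, then sum and count per bin.
    flat = series_idx * width_px + col
    n_bins = n_series * width_px
    sums = np.bincount(flat, weights=values, minlength=n_bins)
    counts = np.bincount(flat, minlength=n_bins)

    # 0/0 yields NaN, which marks cells without any data.
    with np.errstate(invalid="ignore", divide="ignore"):
        mean = sums / counts
    return mean.reshape(n_series, width_px)
```

A range-change callback would call this with the new visible time range and the plot width in pixels, so the number of cells sent to the browser depends on the screen size rather than on the 15 to 20 million points.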

If the data is not changing, consider preprocessing it into subsets at different zoom levels, then storing and displaying those as images.
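
One way to precompute such zoom levels is a simple resolution pyramid along the time axis. The sketch below is an assumption about how this could be done, not something from the thread: `halve` and `build_pyramid` are hypothetical names, and the input is assumed to be a dense 2D array (series x time steps) of gray values.

```python
import numpy as np

def halve(grid):
    """Average adjacent pairs of columns, halving the time resolution."""
    if grid.shape[1] % 2:
        # Odd width: repeat the last column so it pairs with itself.
        grid = np.concatenate([grid, grid[:, -1:]], axis=1)
    return grid.reshape(grid.shape[0], -1, 2).mean(axis=2)

def build_pyramid(grid, min_width=1):
    """Return [full resolution, 1/2, 1/4, ...] down to min_width columns."""
    levels = [np.asarray(grid, dtype=float)]
    while levels[-1].shape[1] > min_width:
        levels.append(halve(levels[-1]))
    return levels
```

At display time the viewer would pick the coarsest level that still has at least one column per screen pixel. Because odd widths are padded, the coarser levels are only approximate means of the original data.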

You received this message because you are subscribed to the Google Groups “Bokeh Discussion - Public” group.

To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].

To post to this group, send email to [email protected].

To view this discussion on the web visit https://groups.google.com/a/continuum.io/d/msgid/bokeh/57626937.3080301%40modzero.ch.

For more options, visit https://groups.google.com/a/continuum.io/d/optout.

Hi Trampas,

On 16.06.2016 at 12:50, Trampas Stern wrote:

If the data is not changing consider processing and creating subset of data
at different zoom levels and then store and display these images.

No, the data needs to remain manipulable. For example, it should be
possible to select a time series and change its color from gray levels
to red levels, and similar things.

Thanks for the suggestion, though.
Jan
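
Recoloring a selected series can be done on the aggregated grid rather than on the raw points. The following is a minimal sketch under assumptions: `colorize` is a hypothetical helper, the grid holds values from 0 to 255 (NaN for empty cells), and "red levels" is taken to mean the value goes into the red channel while green and blue are set to zero.

```python
import numpy as np

def colorize(grid, red_rows=()):
    """Turn a (series x time) grid of 0..255 values into uint8 RGBA.

    Rows listed in red_rows are drawn in shades of red, all others in gray.
    """
    # Empty cells (NaN) become 0; values are clamped to the 0..255 range.
    g = np.clip(np.nan_to_num(grid, nan=0.0), 0, 255).astype(np.uint8)

    rgba = np.empty(g.shape + (4,), dtype=np.uint8)
    rgba[..., 0] = g    # red
    rgba[..., 1] = g    # green
    rgba[..., 2] = g    # blue
    rgba[..., 3] = 255  # fully opaque

    # For the selected series keep only the red channel.
    rows = np.asarray(list(red_rows), dtype=np.intp)
    rgba[rows, :, 1] = 0
    rgba[rows, :, 2] = 0
    return rgba
```

Since only the small per-pixel grid is recolored, changing the color of a series does not require touching the full dataset again.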

I'm on my phone so I can't dig up a reference just now, but what you want is the open source Datashader project, which integrates closely with Bokeh for interactive visualization of hundreds of millions of points.

Bryan


Hi Jan,

At this point it is not really feasible to dump 15-20 million data points to the client browser. There is also no need to do it this way.

The kind of visualization you're talking about is a "heatmap", and it will need to be processed on the server side and sent to the browser as a raster data set. Bokeh can be used to interact with this through the browser.

I think that you should take a look at HoloViews and its Raster elements:

http://holoviews.org/Tutorials/Bokeh_Elements.html#Raster Elements

(Note that this page takes a while to load because of the large number of example graphs on it.)

If you wish to view the time series not as a heatmap but as actual overlapping lines, with dynamic coloration of the overlapping regions, then you should look at the Datashader project, which also integrates with Bokeh: https://anaconda.org/jbednar/tseries/notebook

-Peter

Peter Wang

CTO, Co-founder

Hi Peter,

On 16.06.2016 at 15:05, Peter Wang wrote:

If you wish to view the timeseries not as a heatmap, but as actually
overlapping lines, with dynamic coloration of the overlapping regions, then
you should look at the Datashader project, which also integrates with
Bokeh: https://anaconda.org/jbednar/tseries/notebook

I just watched your (Peter's) talk at
https://continuum-analytics.wistia.com/medias/8zu9idwoym and Datashader
looks like exactly what I'm looking for. I'll try to build a proof of
concept with it on my dataset tomorrow.

Thanks to Bryan for suggesting it as well!
Jan