How to stream a huge file into a CDS or dataframe?

Hi,
I am looking into making a file-reader widget that can stream data from a CSV file into a CDS or a pandas DataFrame; I am running Bokeh as a server. The problem with the current Bokeh FileInput widget is that it reads all of the data into the browser's memory, and I have users with huge files (1 GB), which makes the browser crash.

I have tried to create a JS script using PapaParse, but it is very slow (it reads one row at a time; using PapaParse is not a requirement):

  • Using pandas directly, reading a 1.75 GB CSV with pd.read_csv(): less than 3 min
  • My JS code streaming into the CDS with .stream(): after 16 min it still had not finished and I stopped the script

(I do not know if my JS code is correct; I assume there could be room for improvement as I do not know JS).

When I google this I find some SO hints with respect to the JS code, but I do not understand how to implement them for streaming a file in chunks into either a CDS or a df.
If anyone has time to help or advise on how to do this, I would very much appreciate it. Thanks.

main.py

import os
import pandas as pd
import time
from bokeh.layouts import row
from bokeh.models import Div, CustomJS, Button, ColumnDataSource
from bokeh.io import curdoc


# CDS that the browser-side JS streams the parsed CSV rows into
file_source = ColumnDataSource()

status = Div(text='')
button = Button(label="Select file", button_type="success")

# On click, run the PapaParse-based loader (load_csv.js) in the browser
button.js_on_click(
    CustomJS(
        args=dict(file_source=file_source, status=status),
        code=open(
            os.path.join(
                os.path.dirname(__file__),
                'static/js',
                'load_csv.js'
            )
        ).read()
    )
)


def data_update(attr, old, new):
    # Callback to register when data loading is done and work on the data
    print(old)
    print(new)
    if new != 'Done':
        return

    try:
        df = file_source.to_df()
    except Exception:
        print('Not able to convert data to df')
        return

    print(df.head())


# Use the status div text to figure out when the data load is done
status.on_change('text', data_update)

# If just reading the data directly into a df:
# print('reading data...')
# t0 = time.time()
# df2 = pd.read_csv(file_name)  # 1.75 GB
# t1 = time.time()
# print(t1 - t0)
# 165 secs = 2 min 45 sec
# print(df2.head())

curdoc().add_root(row(button, status))

index.html

{% extends base %}

<!-- goes in head -->
{% block preamble %}
<script src="csv_papa_parse/static/js/papaparse.min.js"></script>
{% endblock %}

<!-- goes in body -->
{% block contents %}
<div>
    <h1>Load CSV data</h1>
    <p>Read data in chunks into CDS using stream method.</p>
</div>
  {{ super() }} 
{% endblock %}

load_csv.js

function getData(file) {
    Papa.parse(file, {
        header: true,
        fastMode: true,
        // step is called once per parsed row, so every single row
        // triggers its own CDS update (this is why it is slow)
        step: function(row) {
            const csvData = {};

            // Reshape the row into Bokeh CDS column format.
            // PapaParse's row.data is a JS object with one value per heading, e.g.
            // {
            //    'x': 0,
            //    'y': 10
            // }
            for (const [key, value] of Object.entries(row.data)) {
                csvData[key] = [value];
            }

            if (file_source.get_length() == null) {
                // First row: initialise the CDS columns
                file_source.data = csvData;
            } else {
                // Subsequent rows: append via stream
                file_source.stream(csvData);
            }
        },
        complete: function(results, file) {
            console.log('Complete');
            // Signal the Python side (status.on_change) that loading is done
            status.text = 'Done';
        }
    });
}

// Create a plain <input type="file"> and open the file picker
const input = document.createElement('input');
input.type = 'file';
status.text = '';

input.onchange = e => {
    const file = e.target.files[0];
    console.log(file.name);
    getData(file);
};

input.click();

test.csv

x,y
0,10
1,11
2,12
3,13
4,14

app layout

stream_csv
|-main.py
|-test.csv
|--/templates
|  |-index.html
|--/static
   |--/js
      |-load_csv.js
      |-papaparse.min.js

If you read a CSV file in chunks and feed the parsed chunks into the stream method, you remove the need to hold the whole CSV text in memory, but you do not remove the need to hold the whole parsed data in memory. That 1 GB of data will still be there. If you want to plot the whole data set just as if it were a small one, you cannot work around that.

In general, such large data sets are loaded via the backend, not the frontend. There, you can either filter the data and update the data source on user interaction so that it never contains unneeded data, or you can use something like Datashader to render a huge set of glyphs as a single image.
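
A minimal sketch of that back-end filtering idea, assuming a small test.csv with x and y columns like the one above; the Select widget and the load_filtered helper are only illustrative, not something from the original app:

import pandas as pd
from bokeh.io import curdoc
from bokeh.layouts import column
from bokeh.models import ColumnDataSource, Select
from bokeh.plotting import figure

# Hypothetical: the huge CSV is read once, server side only
df = pd.read_csv('test.csv')

source = ColumnDataSource(data=dict(x=[], y=[]))

def load_filtered(max_x):
    # Push only the rows the user asked for into the browser-side CDS
    subset = df[df['x'] <= int(max_x)]
    source.data = dict(x=subset['x'], y=subset['y'])

select = Select(title="Max x", value="2", options=["2", "3", "4"])
select.on_change('value', lambda attr, old, new: load_filtered(new))

p = figure()
p.line(x='x', y='y', source=source)

load_filtered(select.value)
curdoc().add_root(column(select, p))

The full dataframe only ever lives in the server process; the browser receives just the filtered slice that ends up in the ColumnDataSource.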

It is OK to store the whole CSV data in the memory of the Bokeh app itself, which runs on the server. I have tested loading the CSV file directly into a dataframe and using the Bokeh app that way, and it works fine. However, I would like, if possible, to have a file selector on the web page instead of a text input box where the user has to enter a directory and filename.

If I need to do it on the backend, how would I go about that? Do I then need Flask?

I also had a large amount of data to load into a graph, but collecting that data also takes a long time, say 5 seconds. So I tried sending the data in 1-second chunks. The plot x-axis was set to 5 seconds. What I found was that when I added 1 second of data, the plot would redraw from time zero, not from where it left off. The ColumnDataSource has append and replace options; I used append, and expected the plot to also append rather than redraw all of the data. Since there were a lot of data points, the time to render each new 1-second chunk (which was redrawing from time zero) was more than 1 second, so the result was a bad experience.
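
For reference, a minimal sketch of the usual server-side pattern for this kind of chunked updating, using CDS.stream() with a rollover inside a periodic callback; the 1-second chunk, 100 samples per chunk and rollover of 500 are made-up numbers for illustration:

import numpy as np
from bokeh.io import curdoc
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

source = ColumnDataSource(data=dict(t=[], y=[]))

p = figure(x_range=(0, 5))  # 5-second window on the x-axis
p.line(x='t', y='y', source=source)

t = 0.0

def send_chunk():
    global t
    # 1 second of data, 100 samples; only these new points travel to the browser
    new_t = np.linspace(t, t + 1.0, 100, endpoint=False)
    source.stream(dict(t=new_t, y=np.sin(new_t)), rollover=500)
    t += 1.0

curdoc().add_root(p)
curdoc().add_periodic_callback(send_chunk, 1000)  # every 1000 ms

stream() only transmits the new points to the browser, and rollover caps how many points the CDS (and therefore the renderer) has to keep around.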

I saw the datashader solution but haven’t tried it yet.

You don’t need Flask, but you cannot use the Bokeh FileInput widget because it loads all of the data in the browser. You will have to use a regular <input type="file"> element and handle its submission manually, maybe with Flask, maybe with Tornado (which is what Bokeh already uses).
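
A rough sketch of the Tornado option, assuming the app is started programmatically with bokeh.server.server.Server so an extra handler can be registered via extra_patterns; the /upload path, the UploadHandler name and the 'file' form field are all made up for the example:

import io

import pandas as pd
from tornado.web import RequestHandler

from bokeh.server.server import Server


def bkapp(doc):
    # the existing Bokeh document setup (widgets, plots, ...) goes here
    pass


class UploadHandler(RequestHandler):
    def post(self):
        # the browser POSTs the chosen file as multipart/form-data under 'file'
        fileinfo = self.request.files['file'][0]
        df = pd.read_csv(io.BytesIO(fileinfo['body']))
        print(df.head())
        self.write('ok')


server = Server(
    {'/': bkapp},
    extra_patterns=[('/upload', UploadHandler)],  # extra Tornado route
)
server.start()

if __name__ == '__main__':
    server.io_loop.start()

The <input type="file"> in the page would then POST the selected file to /upload as multipart/form-data (for example via a small form or fetch()), and the server reads it with pandas. The file still travels over the wire, but it is never base64-encoded into the browser-side document the way FileInput's value is.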