Hi,
I am looking into making a file-reader widget that can stream data from a CSV file into a ColumnDataSource (CDS) or a pandas DataFrame; I am running Bokeh as a server. The problem with the current Bokeh FileInput widget is that it reads all the data into the browser's memory, and I have users with huge files (1 GB), which makes the browser crash.
I have tried writing a JS script using Papa Parse, but it is very slow (it reads one row at a time; using Papa Parse is not a requirement):
- Using pandas directly to read a 1.75 GB CSV with pd.read_csv(): less than 3 min
- My JS code streaming into the CDS with .stream: after 16 min it had not finished and I stopped the script
(I do not know if my JS code is correct; I assume there is room for improvement, as I do not know JS.)
When I google this I do find some SO hints about the JS code, but I do not understand how to apply them to stream a file in chunks into either a CDS or a DataFrame.
If anyone has time to help or advise on how to do this, I would very much appreciate it. Thanks.
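For context, reading a file "in chunks" in the browser usually means slicing the `File` object with `file.slice(start, end)` and decoding each slice to text. The awkward part is that a slice boundary can fall in the middle of a row. Below is a minimal sketch of how that could be handled, carrying the incomplete last line over to the next chunk. `splitCompleteLines` is a hypothetical helper name, and the sketch assumes plain ASCII data with no quoted fields that contain newlines.

```javascript
// Hypothetical helper: join the leftover text from the previous chunk with the
// new chunk, return only the complete lines, and keep the unfinished tail.
// Assumption: rows never contain newlines inside quoted fields.
function splitCompleteLines(carry, chunkText) {
    const text = carry + chunkText;
    const lastNewline = text.lastIndexOf('\n');
    if (lastNewline === -1) {
        // No complete line yet; keep everything for the next chunk.
        return { lines: [], carry: text };
    }
    const lines = text
        .slice(0, lastNewline)
        .split('\n')
        .map(line => line.replace(/\r$/, ''))  // tolerate Windows line endings
        .filter(line => line.length > 0);      // skip blank lines
    return { lines: lines, carry: text.slice(lastNewline + 1) };
}

// Possible wiring in the browser (not run here; chunkSize is an assumed value):
//   let carry = '';
//   for (let offset = 0; offset < file.size; offset += chunkSize) {
//       const text = await file.slice(offset, offset + chunkSize).text();
//       const result = splitCompleteLines(carry, text);
//       carry = result.carry;
//       // result.lines now holds whole CSV rows that can be parsed as a batch
//   }
```

Any text left in `carry` after the last chunk is the final row of a file that does not end with a newline, so it would need to be processed once more at the end.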
main.py
import os
import pandas as pd
import time
from bokeh.layouts import row, column
from bokeh.models import Div, CustomJS, Button, TextInput, ColumnDataSource
from bokeh.io import curdoc

file_source = ColumnDataSource()
status = Div(text='')
button = Button(label="Select file", button_type="success")
button.js_on_click(
    CustomJS(
        args=dict(file_source=file_source, status=status),
        code=open(
            os.path.join(
                os.path.dirname(__file__),
                'static/js',
                'load_csv.js'
            )
        ).read()
    )
)

def data_update(attr, old, new):
    # Callback to register when data loading is done and work on data
    print(old)
    print(new)
    if new != 'Done':
        return
    try:
        df = file_source.to_df()
    except Exception:
        print('Not able to convert data to df')
        return
    print(df.head())

# Use status div text to figure out when data load is done
status.on_change('text', data_update)

# if just reading data directly into df
#print('reading data...')
#t0=time.time()
#df2=pd.read_csv(file_name) # 1.75 GB
#t1=time.time()
#print(t1-t0)
# 165 secs = 2 min 45 sec
#print(df2.head())

curdoc().add_root(row(button, status))
index.html
{% extends base %}
<!-- goes in head -->
{% block preamble %}
<script src="csv_papa_parse/static/js/papaparse.min.js"></script>
{% endblock %}
<!-- goes in body -->
{% block contents %}
<div>
<h1>Load CSV data</h1>
<p>Read data in chunks into CDS using stream method.</p>
</div>
{{ super() }}
{% endblock %}
load_csv.js
function getData(file_name) {
    Papa.parse(file_name, {
        header: true,
        fastMode: true,
        step: function(row) {
            const csvData = {};
            // need to format data to bokeh CDS setup
            // papaparse row.data is JS Object with one value per heading eg
            // {
            //     'x': 0,
            //     'y': 10
            // }
            for (let [key, value] of Object.entries(row.data)) {
                csvData[key] = [value];
            }
            if (file_source.get_length() == null) {
                file_source.data = csvData;
            } else {
                file_source.stream(csvData);
            }
        },
        complete: function(results, file_name) {
            console.log('Complete');
            status.text = 'Done';
        }
    });
}

var data_url;
var input = document.createElement('input');
input.type = 'file';
status.text = '';
input.onchange = e => {
    var file = e.target.files[0];
    data_url = file.name;
    console.log(data_url);
    getData(file);
}
input.click();
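One likely cost in the script above is that it calls `file_source.stream` once per row, so every row triggers its own update. A minimal sketch of collecting rows and streaming them in larger batches might look like the following. `rowsToColumns` and `makeRowBatcher` are hypothetical helper names (not part of Bokeh or Papa Parse), and the default batch size of 10000 is an arbitrary assumption that would need tuning.

```javascript
// Turn an array of row objects ({x: 0, y: 10}) into the column-oriented
// shape a CDS expects ({x: [0, ...], y: [10, ...]}).
function rowsToColumns(rows) {
    const columns = {};
    for (const row of rows) {
        for (const [key, value] of Object.entries(row)) {
            if (!(key in columns)) {
                columns[key] = [];
            }
            columns[key].push(value);
        }
    }
    return columns;
}

// Hypothetical helper: buffer rows and hand them to onBatch(columns)
// every batchSize rows instead of once per row.
function makeRowBatcher(onBatch, batchSize = 10000) {
    let buffer = [];
    function flush() {
        if (buffer.length > 0) {
            onBatch(rowsToColumns(buffer));
            buffer = [];
        }
    }
    function push(row) {
        buffer.push(row);
        if (buffer.length >= batchSize) {
            flush();
        }
    }
    return { push: push, flush: flush };
}

// Possible wiring inside load_csv.js (assumes file_source and status
// come from the CustomJS args, as in the script above):
//   const batcher = makeRowBatcher(cols => {
//       if (file_source.get_length() == null) { file_source.data = cols; }
//       else { file_source.stream(cols); }
//   });
//   step: row => batcher.push(row.data),
//   complete: () => { batcher.flush(); status.text = 'Done'; }
```

The final `flush()` in `complete` matters because the last batch is usually smaller than `batchSize`. This has not been timed against the 1.75 GB file, so whether it closes the gap to `pd.read_csv()` is an open question.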
test.csv
x,y
0,10
1,11
2,12
3,13
4,14
app layout
stream_csv
|-main.py
|-test.csv
|--/templates
|   |-index.html
|--/static
    |--/js
        |-load_csv.js
        |-papaparse.min.js