Working on large datasets -- comparison with Shiny

First of all, like everyone on this forum I generally love Dash. I was excited to see the beta a year ago and I’m glad it’s coming of age. Secondly, this is yet another post on a topic that’s been discussed pretty extensively.

There is a particular type of workflow that we use all the time, which is pretty easy to do in Shiny and (as far as I can tell) very difficult to do in Dash/Flask, because of the inability to store server-side variables. Basically our apps let the user load some large array of data (for this purpose it doesn’t matter if it comes from an uploaded file or from a database query), do some operations on it, display some results, and perhaps let the user download some results. In Shiny this is pretty straightforward, as session variables can be shared across all the callbacks.

There are many threads about how to store pandas dataframes in memory, and answers like Sharing a dataframe between plots work quite nicely for small dataframes. However, for large dataframes (millions of elements), it seems like I’m going to be spending the entire time just reading and writing JSON.

The second standard solution seems to be to not store the data in memory, but rather to fetch the data from the database every time it’s needed (and put it in a database if it was uploaded by the user), and to combine this with caching. This would be great if I didn’t need to manipulate the data at all. But if I want to modify a row or column while still needing the entire dataset for computation, then I’m going to be doing a lot of fetching. Moreover, I think with most caching methods I still need to serialize/deserialize.

Working in a standalone script or in a notebook, I’ve found that pandas/numpy have some major performance advantages over any R data structure. I was really hoping to combine this with Dash to build some of our next-gen apps, but at the moment it seems difficult to take advantage of it. Does anyone have ideas for alternatives I haven’t considered?


Some background: In order for this to work across multiple python processes, we need to store the data somewhere that is accessible to each of the processes. There are 3 places to store this data:
1 - On the disk (e.g. on a file or on a new database)
2 - In a shared memory space like with Redis
3 - In the user’s browser session

For 1 and 2:

  • Easiest to implement with tools like Flask-Caching, see Performance | Dash for Python Documentation | Plotly
  • Data has to be serialized out of python data structures into simpler data structures like strings and numbers (or just JSON) for storage
  • Data that is cached in this way will be available for every future session.
    • If you open up the app in a new browser window (or if a different viewer opens up the app), the app’s callbacks may retrieve the data from the cache instead of computing it fresh.
  • Since data is stored for all sessions, you could run into memory issues (if using e.g. Redis rather than the file system) if you’re caching the output for every single set of input parameters.
    • As such, you need to balance what you cache and what you compute. If querying the raw data is slow (e.g. from SQL), then you could cache the results of the query once and then perform fast computations (e.g. pandas aggregations) on top of that (see the sketch below).
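
For instance, here is a minimal sketch of that pattern with Flask-Caching. The function names, the cache directory, and the simulated slow query (time.sleep) are all hypothetical stand-ins:

import time

import dash
import pandas as pd
from flask_caching import Cache

app = dash.Dash(__name__)
cache = Cache(app.server, config={
    # 'redis' scales across processes; 'filesystem' avoids running Redis
    'CACHE_TYPE': 'filesystem',
    'CACHE_DIR': '/tmp/dash-cache',
})

@cache.memoize()
def query_raw_data(table_name):
    # stand-in for a slow database query; the result is cached
    # and keyed by the function's arguments
    time.sleep(5)
    return pd.DataFrame({'category': ['a', 'a', 'b'], 'value': [1, 2, 3]})

def fast_aggregation(table_name):
    # cheap pandas work, recomputed on each call on top of the cached query
    return query_raw_data(table_name).groupby('category')['value'].sum()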

For 3:

  • Implemented by saving the data as part of Dash’s front-end store through methods explained in Sharing a dataframe between plots
  • Data has to be converted to a string like JSON for storage and transport
  • Data that is cached in this way will only be available in the user’s current session.
    • If you open up a new browser, the app’s callbacks will always compute the data. The data is only cached and transported between callbacks within the session.
    • As such, unlike 1 and 2, this method doesn’t increase the memory footprint of the app.
    • There could be a cost in network transport. If you’re sharing 10MB of data between callbacks, then that data will be transported over the network between each callback.
    • If the network cost is too high, then compute the aggregations upfront and transport those. Your app likely won’t be displaying 10MB of data, it will just be displaying a subset or an aggregation of it.

Reading and writing JSON isn’t that expensive, but sending it over the network could be. To get around this, the last point in the outline above is an option: compute the aggregations upfront and transport those. Your app likely won’t be displaying 10MB of data; it will just be displaying a subset or an aggregation of it.

For example:

import json

import pandas as pd

@app.callback(Output('intermediate-value', 'children'),
              [Input('dropdown', 'value')])
def clean_data(value):
    # an expensive query step
    cleaned_df = your_expensive_clean_or_compute_step(value)

    # a few filter steps that compute the data
    # as it's needed in the future callbacks
    # (the 'fruit' column is just an illustration)
    df_1 = cleaned_df[cleaned_df['fruit'] == 'apples']
    df_2 = cleaned_df[cleaned_df['fruit'] == 'oranges']
    df_3 = cleaned_df[cleaned_df['fruit'] == 'figs']

    # store each frame under a string key and serialize the whole
    # collection to a single JSON string for the hidden div
    return json.dumps({
        'df_1': df_1.to_json(orient='split'),
        'df_2': df_2.to_json(orient='split'),
        'df_3': df_3.to_json(orient='split'),
    })

@app.callback(
    Output('graph-1', 'figure'),
    [Input('intermediate-value', 'children')])
def update_graph_1(jsonified_cleaned_data):
    datasets = json.loads(jsonified_cleaned_data)
    dff = pd.read_json(datasets['df_1'], orient='split')
    return create_figure_1(dff)


@app.callback(
    Output('graph-2', 'figure'),
    [Input('intermediate-value', 'children')])
def update_graph_2(jsonified_cleaned_data):
    datasets = json.loads(jsonified_cleaned_data)
    dff = pd.read_json(datasets['df_2'], orient='split')
    return create_figure_2(dff)


@app.callback(
    Output('graph-3', 'figure'),
    [Input('intermediate-value', 'children')])
def update_graph_3(jsonified_cleaned_data):
    datasets = json.loads(jsonified_cleaned_data)
    dff = pd.read_json(datasets['df_3'], orient='split')
    return create_figure_3(dff)

Your mileage will vary depending on your aggregations and your UI. You could end up reducing a 5M-row dataframe into 3 bar graphs with 100 points each, in which case the transport costs will be really low.
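
As a rough sketch of what that aggregation could look like (the toy frame and the 'fruit'/'amount' columns are made up; in a real app cleaned_df would come from the expensive step):

import json

import pandas as pd

def clean_data_aggregated(cleaned_df):
    # reduce the full frame to a small per-category aggregate before
    # serializing, so only the aggregate crosses the network
    totals = cleaned_df.groupby('fruit')['amount'].sum()
    return json.dumps({'totals': totals.to_json(orient='split')})

df = pd.DataFrame({
    'fruit': ['apples', 'apples', 'oranges', 'figs'],
    'amount': [10, 20, 5, 7],
})
print(clean_data_aggregated(df))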


@chriddyp thanks a lot for the detailed reply. I understand that the type of approach I have with Shiny is incompatible with the parallelization in Dash.

In my case it’s not network issues I’m running into (I’m just running locally for testing), but really the cost of reading the JSON, at least when it’s read all in one big chunk. What you’ve outlined in the last section – splitting up the data into multiple smaller pieces – is a great idea and is the direction I started going in. That approach could take major advantage of the parallelization, but what does become difficult is interacting with the smaller pieces. To put it another way, the last example would be pretty tricky to write if the user needed to specify [apples,oranges,figs] and the list could be long. We discussed this briefly here also: Capture click on subplot. I guess if you fix the number of elements it becomes pretty doable (as per More sliders! (help with button callback example)).
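
For what it’s worth, one hypothetical way to generalize the earlier example to a user-supplied list is to build the intermediate dictionary with a comprehension. The multi-select dropdown id, the 'fruit' column, and the expensive step are placeholders carried over from the example above:

import json

@app.callback(Output('intermediate-value', 'children'),
              [Input('multi-dropdown', 'value')])
def clean_data(selected_categories):
    cleaned_df = your_expensive_clean_or_compute_step(selected_categories)
    # one JSON entry per selected category, however long the list gets
    return json.dumps({
        category: cleaned_df[cleaned_df['fruit'] == category].to_json(orient='split')
        for category in selected_categories
    })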

I thought about this a little bit more today and came up with another alternative solution. This solution:

  • Uses Redis via Flask-Caching for storing “global variables”. This data is accessed through a function whose output is cached and keyed by its input arguments.
  • Uses the hidden div solution to send a signal to the other callbacks when the expensive computation is complete
  • Note that instead of Redis, you could also save this to the file system. See https://flask-caching.readthedocs.io/en/latest/ for more details.

This “signaling” is cool because it allows the expensive computation to only take up one process. Without this type of signaling, each callback could end up computing the expensive computation in parallel, locking 4 processes instead of 1.

This approach also has the advantage that future sessions use the pre-computed value. This will work well for apps that have a small number of inputs.

Here’s what this example looks like. Some things to note:

  • I’ve simulated an expensive process by using a time.sleep(5).
  • When the app loads, it takes 5 seconds to render all 4 graphs
  • The initial computation only blocks 1 process
  • Once the computation is complete, the signal is sent and 4 callbacks are executed in parallel to render the graphs. Each of these callbacks retrieves the data from the “global store”: the redis cache.
  • I’ve set processes=6 in app.run_server so that multiple callbacks can be executed in parallel. In production, this is done with something like $ gunicorn --workers 6 --threads 2 app:server
  • Selecting a value in the dropdown will take less than 5 seconds if it has already been selected in the past. This is because the value is being pulled from the cache.
  • Similarly, reloading the page or opening the app in a new window is also fast because the initial state and the initial expensive computation have already been computed.

Here’s a GIF of this app (too big to show inline): https://user-images.githubusercontent.com/1280389/31468665-bf1b6026-aeac-11e7-9388-d9a5e71d964e.gif

import copy
import dash
from dash.dependencies import Input, Output
import dash_html_components as html
import dash_core_components as dcc
from flask_caching import Cache
import numpy as np
import os
import pandas as pd
import time


app = dash.Dash(__name__)
CACHE_CONFIG = {
    # try 'filesystem' if you don't want to setup redis
    'CACHE_TYPE': 'redis',
    'CACHE_REDIS_URL': os.environ.get('REDIS_URL', 'redis://localhost:6379')
}
cache = Cache()
cache.init_app(app.server, config=CACHE_CONFIG)

N = 100

df = pd.DataFrame({
    'category': (
        (['apples'] * 5 * N) +
        (['oranges'] * 10 * N) +
        (['figs'] * 20 * N) +
        (['pineapples'] * 15 * N)
    )
})
df['x'] = np.random.randn(len(df['category']))
df['y'] = np.random.randn(len(df['category']))

app.layout = html.Div([
    dcc.Dropdown(
        id='dropdown',
        options=[{'label': i, 'value': i} for i in df['category'].unique()],
        value='apples'
    ),
    html.Div([
        html.Div(dcc.Graph(id='graph-1'), className="six columns"),
        html.Div(dcc.Graph(id='graph-2'), className="six columns"),
    ], className="row"),
    html.Div([
        html.Div(dcc.Graph(id='graph-3'), className="six columns"),
        html.Div(dcc.Graph(id='graph-4'), className="six columns"),
    ], className="row"),

    # hidden signal value
    html.Div(id='signal', style={'display': 'none'})
])


# perform expensive computations in this "global store"
# these computations are cached in a globally available
# redis memory store which is available across processes
# and for all time.
@cache.memoize()
def global_store(value):
    # simulate expensive query
    print('Computing value with {}'.format(value))
    time.sleep(5)
    return df[df['category'] == value]


def generate_figure(value, figure):
    fig = copy.deepcopy(figure)
    filtered_dataframe = global_store(value)
    fig['data'][0]['x'] = filtered_dataframe['x']
    fig['data'][0]['y'] = filtered_dataframe['y']
    fig['layout'] = {'margin': {'l': 20, 'r': 10, 'b': 20, 't': 10}}
    return fig


@app.callback(Output('signal', 'children'), [Input('dropdown', 'value')])
def compute_value(value):
    # compute value and send a signal when done
    global_store(value)
    return value


@app.callback(Output('graph-1', 'figure'), [Input('signal', 'children')])
def update_graph_1(value):
    # generate_figure gets data from `global_store`.
    # the data in `global_store` has already been computed
    # by the `compute_value` callback and the result is stored
    # in the global redis cache
    return generate_figure(value, {
        'data': [{
            'type': 'scatter',
            'mode': 'markers',
            'marker': {
                'opacity': 0.5,
                'size': 14,
                'line': {'width': 0.5, 'color': 'darkgrey'}
            }
        }]
    })


@app.callback(Output('graph-2', 'figure'), [Input('signal', 'children')])
def update_graph_2(value):
    return generate_figure(value, {
        'data': [{
            'type': 'scatter',
            'mode': 'lines',
            'line': {'shape': 'spline', 'width': 0.5},
        }]
    })


@app.callback(Output('graph-3', 'figure'), [Input('signal', 'children')])
def update_graph_3(value):
    return generate_figure(value, {
        'data': [{
            'type': 'histogram2d',
        }]
    })


@app.callback(Output('graph-4', 'figure'), [Input('signal', 'children')])
def update_graph_4(value):
    return generate_figure(value, {
        'data': [{
            'type': 'histogram2dcontour',
        }]
    })


# Dash CSS
app.css.append_css({
    "external_url": "https://codepen.io/chriddyp/pen/bWLwgP.css"})
# Loading screen CSS
app.css.append_css({
    "external_url": "https://codepen.io/chriddyp/pen/brPBPO.css"})

if __name__ == '__main__':
    app.run_server(debug=True, processes=6)

@chriddyp thanks, looks interesting – just wanted to say it might be a couple of days before I have a chance to play with this fully.

Am I right to assume that there is a hidden serialization taking place when Redis stores the DataFrame result? If not, how does it work?


Yeah, from their source code, it looks like they pickle it: werkzeug/werkzeug/contrib/cache.py at 40bfd13b4383512160a139aea3721c3746a4d7ae · pallets/werkzeug · GitHub. It may be faster for you to perform the string serialization yourself (e.g. to JSON or CSV). Some benchmarks have shown that JSON serialization is faster than Pickling (Pickle vs JSON - Which is Faster? - Konstantin Kovshenin) but my guess is that it depends on the type and size of the object.

This is surprising to me. I wonder if some methods of pandas dataframe serialization are faster than others (a quick timing sketch follows the list below). Alternatively, converting to CSV and back might make this faster too.

df.to_json(orient='split')
df.to_json(orient='records')
df.to_json(orient='index')
df.to_json(orient='columns')
df.to_json(orient='values')
df.to_json(orient='table')
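
A quick way to check on your own data is a timing loop along these lines (just a sketch; the numbers will vary with frame size and dtypes):

import time

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100000, 10),
                  columns=[str(i) for i in range(10)])

for orient in ['split', 'records', 'index', 'columns', 'values', 'table']:
    start = time.time()
    serialized = df.to_json(orient=orient)
    print(orient, round(time.time() - start, 3), 's,', len(serialized), 'bytes')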

I’ve pulled some of this discussion into a new chapter of the Dash user guide: Part 5. Sharing Data Between Callbacks | Dash for Python Documentation | Plotly. Thanks for kicking off this discussion! :slight_smile:


New guide looks great.

Regarding the best way to serialize the pandas dataframe, I’ve found that a binary encoding gives an order-of-magnitude performance boost when dealing with large tables of numeric values.

I have been using something like this:

import io
import base64
import pandas as pd

def enc(df):
    # serialize the frame to msgpack in an in-memory buffer,
    # then base64-encode so the bytes can travel as a string
    buf = io.BytesIO()
    df.to_msgpack(buf)
    buf.seek(0)
    return base64.b64encode(buf.read())

def dec(s):
    # reverse the steps: base64-decode, then read the msgpack bytes
    return pd.read_msgpack(io.BytesIO(base64.b64decode(s)))
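
And a hypothetical usage sketch pairing those helpers with a local Redis instance:

import pandas as pd
import redis

r = redis.Redis()  # assumes Redis running on localhost:6379

df = pd.DataFrame({'x': [1.0, 2.0], 'y': [3.0, 4.0]})
r.set('my-dataframe', enc(df))            # cache the encoded frame
df_restored = dec(r.get('my-dataframe'))  # restore it in another callback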

That’s good to know! I wonder if it would be possible to use Feather with redis. Feather provides binary DataFrame storage (using Apache Arrow) that’s compatible with both Python and R. For saving DataFrames to disk, I’ve switched to using it, as it gives a noticeable speedup over saving as CSV.
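
For reference, a minimal sketch of the Feather round trip via pandas’ built-in wrappers (this assumes the feather/pyarrow dependency is installed; Feather also requires a default integer index, hence the reset_index):

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.random.randn(1000), 'y': np.random.randn(1000)})

df.reset_index(drop=True).to_feather('cache.feather')  # write binary frame
df2 = pd.read_feather('cache.feather')                 # fast read-back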


@chubukov @nedned I’ve tested a few of these solutions.

I need to store uploaded dataframes for my app, and didn’t want to keep passing them back and forth via hidden divs, so I did some basic testing of server-side solutions. I was a bit surprised at the results! I timed the reading (and writing) of a dataframe of random numbers of size NxN. Read results look like:

[Plot: tabular read times for each storage method, as a function of dataframe size]

The green line is actually under the red line (i.e. feather is very comparable to arrow+redis).

The methods here, in legend order, are:

  • just store as CSV on disk - pd.read_csv()
  • just store as pickle on disk - pd.read_pickle()
  • store as a feather file on disk - feather.read_dataframe()
  • serialise with arrow and push into a redis store - basically redis.Redis().get() and pyarrow.deserialize_pandas() (see the sketch after this list)
  • serialise with msgpack and b64decode as suggested above, and use redis - redis.Redis().get(), base64.b64decode(), and pd.read_msgpack()
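
Here is roughly what the arrow+redis variant looks like (a sketch, assuming pyarrow’s serialize_pandas/deserialize_pandas helpers and a local Redis):

import numpy as np
import pandas as pd
import pyarrow as pa
import redis

df = pd.DataFrame(np.random.randn(1000, 2), columns=['x', 'y'])
r = redis.Redis()  # assumes Redis on localhost:6379

# serialise with Arrow and store the raw bytes in redis
r.set('df', pa.serialize_pandas(df).to_pybytes())

# fetch the bytes back and deserialise into a DataFrame
df2 = pa.deserialize_pandas(r.get('df'))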

So CSV files are unsurprisingly terrible. When you see someone saying “I’ve tried X, and it’s way better than CSV”, they’re right. Don’t write to straight CSV. But that doesn’t tell you which of the alternatives is better.

To my surprise, pickle is out-performing everything! I really thought feather would be better for dataframes. The main disadvantage of pickle is that it’s not an interchange format, but if your app just needs to cache something for its own use, it’s fine, and easy.

Also worth noting that pickle here is reading/writing from disk, and STILL outperforming the serialise into redis (in-memory) solutions. This could easily change on different systems with different memory/disk speeds. I’m running on my Mac laptop.

Note that I didn’t try:

  • pickling and storing in redis (but I assume this will perform well)
  • the Plasma object store (which is designed exactly for this purpose)

My tests of write times looked quite similar, except that the msgpack solution performed very well. For me, read times are more important than write times, so the plot above is the most important.

So what I’m actually doing in my app now is:

  • Pickling raw uploaded data to disk, and having it read in by any callback that wants it (rough sketch after this list)
  • Storing transformed data in redis using flask-caching’s @cache.memoize(), which appears to also use pickle
  • Getting data from disk or from redis by indexing with a unique session ID (as suggested elsewhere by @chriddyp) and, for memoization, the Upload component’s timestamp for the most recent data upload.
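
A rough sketch of the pickle-to-disk part, with a made-up cache directory and session ID scheme:

import os

import pandas as pd

CACHE_DIR = '/tmp/app-uploads'  # hypothetical location

def save_upload(df, session_id):
    # one pickle per browser session; any callback can re-read it
    os.makedirs(CACHE_DIR, exist_ok=True)
    df.to_pickle(os.path.join(CACHE_DIR, '{}.pkl'.format(session_id)))

def load_upload(session_id):
    return pd.read_pickle(os.path.join(CACHE_DIR, '{}.pkl'.format(session_id)))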

Oh wow that’s super interesting!

I also would have thought feather would be more performant than pickling. I’d be curious to know whether using pandas’ built-in support for feather (pandas.read_feather()) would make any difference. It could be interesting to try with text data rather than numeric too; my understanding is that there is more overhead in serialising numeric data than text. I also read somewhere that the pickling approach won’t scale to very large files (e.g. multiple gigabytes), whereas Feather can.

A potentially big issue with pickling is that it can open you up to massive security vulnerabilities, as unpickling essentially involves evaluating arbitrary Python. But if you control the pickling and unpickling, and the pickled objects themselves are never exposed publicly, then it’s probably fine.

There’s also quite a few config parameters for Redis that I think can potentially affect performance, so maybe with some tweaking, the Redis option would do better.


All three graph callbacks have the same input. Is that intentional? How does it work?