Apache Arrow and Dash - Community Thread

Hey Dash Community :wave:

I wanted to create a thread to share tips and tricks for using the new Apache Arrow project with Python, Dash, and Pandas. See Wes McKinney - Apache Arrow and the “10 Things I Hate About pandas” for an introductory essay by the author.

Is anybody using this project yet? In what ways?

I’d also like to collect some simple examples of Arrow with Dash that we can eventually fold into the Dash user guide (if they are compelling!)

Off the top of my head, I can see Arrow being useful in Dash in a couple of ways:

  1. Using Parquet files → PyArrow when caching (to disk) and loading data from callbacks, or when transferring data between callbacks in multi-user Dash apps. Right now, the examples are using JSON, which may or may not be slower and may or may not have issues in data conversion. See Performance | Dash for Python Documentation | Plotly and Capture window/tab closing event - #2 by chriddyp for examples that could be adapted to Arrow with Parquet (a minimal sketch of this pattern follows the list below).

  2. Using PyArrow’s Tables instead of Pandas DataFrames (Pandas Integration — Apache Arrow v14.0.1). In Fast Python Serialization with Ray and Apache Arrow | Apache Arrow, the authors mention:

    Arrow supports zero-copy reads, so objects can naturally be stored in shared memory and used by multiple processes

    Does this apply to multiple WSGI processes? In Dash, a common pattern is to read from a global Pandas DataFrame, which (AFAIK) is copied for each of the workers when the app runs under gunicorn ($ gunicorn app.server --workers 4). If this were an Arrow Table, would it automatically be shared across all workers? How does that work?

  3. Another common pattern in Dash is to share data between callbacks by serializing the data as JSON and storing it in the user’s browser over the network (see Part 4. Sharing Data Between Callbacks | Dash for Python Documentation | Plotly). These examples use JSON. Is there a serialized form of an Arrow Table that would be more efficient (both in serialization/deserialization time and in request size)?

  4. When Dash apps deal with “large data” (data that is too big to fit in RAM), we usually store it in a database or an SQLite file. Is Parquet + Arrow a better solution for this now?
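To make the first question concrete, here is a minimal sketch of what Parquet-based caching between callbacks could look like. The file path, column names, and helper functions are placeholders, not an established Dash pattern:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

CACHE_PATH = 'cached_dataset.parquet'  # hypothetical cache location on disk

def save_to_cache(df):
    # Convert the DataFrame to an Arrow Table and write it out as Parquet
    table = pa.Table.from_pandas(df)
    pq.write_table(table, CACHE_PATH)

def load_from_cache():
    # Read the Parquet file back into an Arrow Table, then into a DataFrame
    return pq.read_table(CACHE_PATH).to_pandas()

# One callback would write the cache, another would read it
save_to_cache(pd.DataFrame({'x': range(10)}))
df = load_from_cache()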

It would be great to fill up this thread with answers to these questions in the form of simple, reproducible examples. And if anyone has any other ideas or examples, please share!

3 Likes

Instead of using CSV files for caching/intermediate storage, I always use Feather, which serialises Pandas DataFrames to/from Apache Arrow. It’s super straightforward to use, and gives you an easy, guaranteed speedup over reading CSV from disk into DataFrames. CSV is obviously still the way to go for an interchange format, but if you’re just caching some data for use in an app, I’m of the opinion that plain CSV is just leaving performance on the table. I got this insight from a slide in a PyconAU talk.

$ pip install feather-format pandas

For writing a DataFrame to a feather file:

import feather
import pandas as pd

df = pd.DataFrame(list(range(10)))
feather.write_dataframe(df, 'df.feather')

For loading a DataFrame from a feather file:

import feather
df = feather.read_dataframe('df.feather')
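
As a rough illustration of how this could slot into a Dash app, here is a hedged sketch of a disk cache for an expensive query, with the Feather file acting as the intermediate store. The file name and the expensive_query helper are placeholders:

import os

import feather
import pandas as pd

CACHE_FILE = 'expensive_query.feather'  # placeholder path

def expensive_query():
    # Stand-in for a slow database query or computation
    return pd.DataFrame({'value': range(100000)})

def get_data():
    # Reuse the Feather file if it already exists; otherwise compute and cache it
    if os.path.exists(CACHE_FILE):
        return feather.read_dataframe(CACHE_FILE)
    df = expensive_query()
    feather.write_dataframe(df, CACHE_FILE)
    return df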
5 Likes

Alternatively, is the “Plasma In-Memory Object Store” a way to do this? See https://arrow.apache.org/docs/python/plasma.html#using-arrow-and-pandas-with-plasma and Plasma In-Memory Object Store | Apache Arrow
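
For reference, here is a minimal sketch of what that could look like, assuming the pyarrow.plasma module (shipped with older pyarrow releases) and a plasma_store process already running on the given socket path:

# Start the store first in a separate process, e.g.:
#   plasma_store -m 1000000000 -s /tmp/plasma   (1 GB of shared memory)
import pandas as pd
import pyarrow.plasma as plasma

client = plasma.connect('/tmp/plasma')

df = pd.DataFrame({'x': range(10)})

# put() copies the object into the shared-memory store and returns an ObjectID
object_id = client.put(df)

# Any process connected to the same store can fetch it by that ObjectID
df_again = client.get(object_id)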

1 Like

Perspective and its Python and Dash components support serializing data in Arrow form. Soon you’ll be able to spin up a Perspective db in Python and in JS and sync them via Arrow, with complete binary compatibility between the C++ backend and the C++/wasm frontend, which is pretty cool.

3 Likes

I wish I had seen this thread before - my contribution is https://github.com/russellromney/brain-plasma which uses the Plasma In-Memory Object Store (needs a shorter official name) to store objects by name in a shared-memory namespace.

@timkpaine perspective looks interesting.

Have these initial questions been answered in the meantime, and could the answers be shared here? I’d be interested to know whether it’s worth investing time into Apache Arrow.

  1. Using Parquet files → PyArrow when caching (to disk) and loading data from callbacks, or when transferring data between callbacks in multi-user Dash apps. Right now, the examples are using JSON, which may or may not be slower and may or may not have issues in data conversion. See https://plot.ly/dash/performance and Capture window/tab closing event for examples that could be adapted to Arrow with Parquet.
  2. Using PyArrow’s Tables instead of Pandas DataFrames (https://arrow.apache.org/docs/python/pandas.html). In http://arrow.apache.org/blog/2017/10/15/fast-python-serialization-with-ray-and-arrow/, the authors mention that “Arrow supports zero-copy reads, so objects can naturally be stored in shared memory and used by multiple processes.” Does this apply to multiple WSGI processes? In Dash, a common pattern is to read from a global Pandas DataFrame, which (AFAIK) is copied for each of the workers when the app runs under gunicorn ($ gunicorn app.server --workers 4). If this were an Arrow Table, would it automatically be shared across all workers? How does that work?
  3. Another common pattern in Dash is to share data between callbacks by serializing the data as JSON and storing it in the user’s browser (see https://plot.ly/dash/sharing-data-between-callbacks). These examples use JSON. Is there a serialized form of an Arrow Table that would be more efficient (both in serialization/deserialization time and in request size)?
  4. When Dash apps deal with “large data” (data that is too big to fit in RAM), we usually store it in a database or an SQLite file. Is Parquet + Arrow a better solution for this now?
Answering these in order:

  1. Parquet is usually faster than any other data format (Feather is close).
  2. A PyArrow Table doesn’t require any serialization, since the data is already in Arrow format, but you pay an I/O cost to transfer it from shared memory to the requesting process.
  3. The clientside JSON store pattern only works if the data is small; otherwise you’re sending the data that is stored in the browser back and forth multiple times. So anything that stays on the server is likely going to be faster than clientside JSON.
  4. Anything using Arrow is usually faster than static files, because all serialization happens in memory. However, if the data is larger than memory, Arrow doesn’t help much, as the data cannot be held in memory. Parquet alone is likely the best bet, in combination with Dask or Vaex (see the sketch after this list). SQLite may be good depending on the operations needed.
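For point 4, a minimal sketch of the Parquet + Dask combination; the file name and column are placeholders:

import dask.dataframe as dd

# Lazily reference a Parquet dataset that may be larger than memory
ddf = dd.read_parquet('large_dataset.parquet')

# Operations are built up lazily and only executed on .compute(),
# so Dask reads the data in chunks instead of loading it all at once
result = ddf[ddf['value'] > 0]['value'].mean().compute()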

brain-plasma (mentioned above) provides a simple API for fast serialization of smaller-than-memory values. Dask and Vaex do a good job otherwise. Just remember that Arrow needs to be held in memory to be most useful. If speed and out-of-core processing are most important, perhaps consider Julia and JuliaDB, which has the fastest load times and out-of-core processing I’ve seen anywhere. It’s a tough problem to solve no matter how you swing it.

3 Likes

Many thanks for your reply @russellthehippo! I will dig deeper into brain-plasma then; I already had a look but didn’t quite know where to start. I will come up with an example workflow soon and maybe we can talk further about improvements and best practices. That would be great!

Good luck, shoot me a message if you have more specific questions I can help with.