As mentioned here I’ve been using Apache Plasma to solve what is still one of Dash’s biggest problems: sharing large data between callbacks, apps, and pages.
Now I’ve roughly formalized some of that functionality in the PyPI package `brain-plasma`. It’s a simple, easy-to-use way to store Python objects, even very large pandas DataFrames or dictionaries, in a shared memory space. This approach offers imperfect but much-improved thread safety, blazing speed relative to reading from disk or Redis, and a super simple, if corny, API. Basically, it uses Plasma to function as the “brain” of your app or other Python project by creating an indexed object namespace in Plasma.
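To illustrate the idea of a name-indexed shared-memory object store, here's a toy sketch. This is NOT the brain-plasma implementation (which sits on Apache Plasma); it mimics the same pattern with the standard library's `multiprocessing.shared_memory` instead, and the `TinyBrain` class and its internal `_index` dict are illustrative inventions:

```python
# Toy sketch of a name-indexed shared-memory object namespace.
# Not brain-plasma itself: it swaps Apache Plasma for the stdlib's
# multiprocessing.shared_memory (Python 3.8+), and TinyBrain is hypothetical.
import pickle
from multiprocessing import shared_memory

class TinyBrain:
    def __init__(self):
        # Maps a human-readable name to (shared-memory block name, payload size).
        self._index = {}

    def learn(self, value, name):
        """Serialize the object and place it in its own shared-memory block."""
        payload = pickle.dumps(value)
        shm = shared_memory.SharedMemory(create=True, size=len(payload))
        shm.buf[:len(payload)] = payload
        self._index[name] = (shm.name, len(payload))
        shm.close()  # detach locally; the block itself lives on

    def recall(self, name):
        """Attach to the named block and deserialize a copy of the object."""
        block_name, size = self._index[name]
        shm = shared_memory.SharedMemory(name=block_name)
        value = pickle.loads(bytes(shm.buf[:size]))
        shm.close()
        return value

    def forget(self, name):
        """Drop the name from the index and free its shared-memory block."""
        block_name, _ = self._index.pop(name)
        shm = shared_memory.SharedMemory(name=block_name)
        shm.close()
        shm.unlink()

    def names(self):
        """List all names currently stored."""
        return list(self._index)

brain = TinyBrain()
brain.learn('my text string', 'txt')
assert brain.recall('txt') == 'my text string'
brain.forget('txt')
```

In the real package, Plasma handles the hard parts this toy skips: the store runs as a separate process, so objects survive across worker processes, and large Arrow-compatible objects are shared zero-copy rather than pickled.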
A `brain_plasma.Brain` can `brain.learn()` new things, `brain.recall()` old factoids, and `brain.forget()` just like I too often do; I can tell my brain to `brain.wake_up()` if it’s been `brain.sleep()`ing; sadly, sometimes it’s just `brain.dead()`. But it can store quite a bit of `brain.knowledge()` and it’s very good at remembering `brain.names()`.
Full basic docs are on GitHub at russellromney/brain-plasma (shared-memory Python object namespace with Apache Plasma; built because of Plotly Dash, useful anywhere).
Basic usage is:

```python
import numpy
import pandas as pd
from brain_plasma import Brain

brain = Brain()

df = pd.DataFrame(numpy.random.randint(0, 100, size=(1000000, 4)))
txt = 'my text string'

# store the data
brain.learn(df, 'df')
brain.learn(txt, 'txt')

# get the data again
txt == brain.recall('txt')
# > True

# delete a name's value
brain.forget('df')

# get all variable names currently available to brain
vars = brain.names()
```
This is still a work in progress in EXTREME ALPHA, i.e. I built it today and it is only tested enough to confirm that the functionality works and is better than what I was using before. So please don’t use this in your production apps until a) the Apache Plasma API is more stable (it’s not), b) this API is more stable, and c) the functionality is hammered out a bit more (probably in v0.15).
I’d love any help, requests, or critiques you have!