General: how to run Python scripts in parallel with a Flask app? (Multiple uses of a single VPS)

Hello there,

My webApp is deployed on a VPS running Ubuntu, with Gunicorn and Nginx. The app runs in a venv, and it’s monitored by supervisorctl. Besides what is needed to keep the webApp running, there is nothing else on the server.

Now, I would like to have a cron task which would execute a Python script every 24h. This script would download a zip file, extract the data, parse and cleanse it, and eventually save the relevant data in the /data directory of my Dash webApp.

First question (maybe completely dumb): what’s the correct way to get scripts executed “in the background”, without them messing with everything else (especially my webApp)?

Second question: what if I did need another script running permanently in the background, like a loop executed every 2 seconds to grab new data? Do I have to set up something to have this running 24/7?

If I had several VPSes I would try every potential solution I could find with Google until I figured out the best one, but given that I only have one VPS, I don’t want to go in that direction. Any hint, or link to a decent explanation, would be warmly welcome :)

Thanks!

P.S.: It seems something similar was discussed here: [Solved] Updating server side app data on a schedule. But given that I would have several crawlers running in parallel, I would like to keep each of them totally independent from the Dash app.

We recommend using Celery for running periodic tasks, and sharing the data between the Celery process & the Dash app via Redis. Here’s an example: https://github.com/plotly/dash-redis-celery-periodic-updates/
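Roughly, the moving parts look like this. This is only a minimal sketch, not the exact code from the linked repo; the task name update_data and the local Redis broker URL are placeholder assumptions:

# tasks.py - minimal sketch of a Celery periodic task, assuming Redis runs locally
from celery import Celery

celery_app = Celery("tasks", broker="redis://localhost:6379/0")

@celery_app.task
def update_data():
    # download the zip, parse/clean it, and write the result where the Dash app reads it
    pass

# run update_data every 24 hours via Celery beat
celery_app.conf.beat_schedule = {
    "daily-update": {"task": "tasks.update_data", "schedule": 60 * 60 * 24},
}

A worker started with something like celery -A tasks worker -B then executes the schedule in the background, independently of the Dash process.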

Thanks for the link, @chriddyp. But why is it better to use Celery than crontab, for instance?

Also, what if one of my scripts is just a loop executed non-stop to get some external data and feed a CSV or SQL db? Should I use Celery for this too?

One of my crawlers is currently running on my PC (instead of on the VPS); it uses Selenium & Tor to get the data (it has to bypass a very low limit of requests/hour/IP), then a parser cleans up all the mess I get before saving the relevant info in a db. I need to move these scripts to the VPS. Does Redis/Celery make this possible? These tasks should run non-stop.
Other crawlers and parsers have to go through PDFs to extract data. I have around 15 scripts which should all be independent.

Some of my scripts run non-stop, and some just once a month; but I don’t know yet if I actually need live data in my Dash app. For the time being it seems enough if supervisorctl just reloads the Dash app once a day to load the updated CSVs.

I’m not sure I understand how Celery helps here.

My app.py is in my FlaskApp folder.
In this folder there is the venv, in which gunicorn, nginx and supervisor are installed.

One of the last steps of the deployment was writing a supervisor config file. From my venv I therefore ran:

sudo nano /etc/supervisor/conf.d/flaskapp.conf

and then in the config file:

[program:flaskapp]
directory=/home/ubuntu/Flask_App
command=/home/ubuntu/Flask_App/venv/bin/gunicorn -w 3 app:app
user=…
autostart=true
autorestart=true
stopasgroup=true
killasgroup=true
stderr_logfile=/var/log/flaskapp/flaskapp.err.log
stdout_logfile=/var/log/flaskapp/flaskapp.out.log

And this webapp works fine.

Now, I would like to run several other scripts on the same VPS, completely independent from the Flask app, running non-stop in the background. Every time these scripts retrieve 500KB of data from external sources, they stop, write or overwrite a CSV in the Data folder of the webApp, and restart.

On the other side, I would execute the command “sudo supervisorctl reload” automatically once a day. This way, the Dash app restarts and automatically reloads the different CSV files it needs.
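If that daily reload goes through cron, it should be a single line in root’s crontab (crontab -e as root); the hour and the path to supervisorctl below are just assumptions for illustration:

0 4 * * * /usr/bin/supervisorctl reload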

So, 2 questions.

  1. If I had to use Celery, I guess I should install it in the venv. And then? Do I have to modify the supervisor config file, too? I’m a bit lost here…

  2. If I don’t need Celery, could I just write Python scripts, put them in new folders in the /home directory, set up a new venv in these folders, and then… then what? Do I have to set up a new supervisor config file for each script, so that if one of these scripts stops running unexpectedly, it restarts? How does this work? I’m a bit confused on this matter, @chriddyp

The more I read about Celery, the less I understand how it would help :thinking:

David, look at RQ. You can send background tasks to an independent worker process and schedule them. I find it much simpler than Celery.

Thank you Russell, I will have a look at RQ then.

But still, why do I need to queue the processes and workers if the goal is to have them running non-stop in parallel?

As far as I understand, a queue enables subtasks to be executed one after the other, while the main app runs in the foreground. So why is Celery or RQ useful for running several never-ending scripts in parallel?

The key is to use rq-scheduler. The queue is just the way you access the worker - unless you have many tasks that will stack up, it functions like an independent process.

For tasks that run every 2 seconds, the concept still works, as long as the tasks are very short. Just schedule them every 2 seconds. Otherwise, have a dcc.Interval trigger your process within the app.
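To make that concrete, scheduling a recurring job with rq-scheduler looks roughly like this. It’s only a sketch; grab_data is a hypothetical function and the Redis connection details are assumptions:

# schedule_jobs.py - rough rq-scheduler sketch, assuming Redis on localhost
from datetime import datetime
from redis import Redis
from rq_scheduler import Scheduler

from mytasks import grab_data  # hypothetical function that fetches and stores new data

scheduler = Scheduler(connection=Redis())

scheduler.schedule(
    scheduled_time=datetime.utcnow(),  # start now
    func=grab_data,
    interval=2,    # seconds between runs
    repeat=None,   # repeat indefinitely
)

You then keep an rq worker process and the rqscheduler process running alongside; those are exactly the kind of long-lived processes supervisor can watch.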

What I have in mind is not exactly a task which would run every 2 seconds, nor periodically;

It’s really like this: I have my Flask app running in its own environment, but, given that the VPS is powerful enough, I would also like to use it for other purposes than just running a web server with a Dash app on it.

Let’s say I want my VPS to crawl data from several external websites and then store the results on the server, in a db that grows every day.

There would just be no relation with the FlaskApp, besides the fact that both things would be running on the same server. (Technically, the FlaskApp could nevertheless access the aforementioned db, if that was useful, and if the db was stored in the data folder of the FlaskApp.)

It’s this very particular thing I’m trying to achieve.

Given that I cannot even find the answers with Google, I believe I might be asking the wrong questions.

Let’s reword it then:
On my PC, if I start the Dash app in Spyder’s console, this console cannot be used for anything else anymore.
So if I need to run another script that just goes to a website to get some random data, then I have to open a new console to run it.
From my perspective, the Flask app and this second script would be running “in parallel”.

Question is, how do I do that on the VPS?

Newbie question, I guess, but still, I don’t know how to structure the folders, files, and config to make this second script restart automatically if it stops running for any reason.

Got it. For a VPS, running a flask app only affects the port (e.g. 5000 or 8050). Otherwise the VPS is totally free to work on other things. Just open up a new terminal (you can do multiple SSH connections at once) and run whatever you’d like.

For example - I have a VPS on DigitalOcean that runs 5 separate Dash apps, a Jupyter server, and several database instances - Elasticsearch, PostgreSQL, MySQL, and MongoDB. Those things are all running and accessible at the same time.

Excellent, thank you @russellthehippo, it looks like you made my day :slight_smile:

And I assume that I can also use Supervisor to monitor these different scripts, so that if one of them stops running for whatever reason, it restarts automatically?

How do you monitor your database instances, for example?

You can use supervisor to monitor Python scripts, though I recommend not running scripts - instead, schedule tasks by running functions with a background worker such as RQ or Celery.
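Either way, keeping a worker (or a plain script) alive is just another supervisor program block next to your flaskapp one; the paths and names here are assumptions:

[program:rq_worker]
directory=/home/ubuntu/crawler
command=/home/ubuntu/crawler/venv/bin/rq worker
autostart=true
autorestart=true
stderr_logfile=/var/log/rq_worker.err.log
stdout_logfile=/var/log/rq_worker.out.log

The same pattern works if you decide to run a plain Python script instead of a worker.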

All I do is run a background task every minute that checks database availability for each instance. If there is a problem I get an email. I’ve never had a problem though.
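The check itself can be a tiny function scheduled like any other task; something along these lines, where the connection details and the mail setup are assumptions rather than my actual code:

# healthcheck.py - rough sketch of a periodic availability check
import smtplib
from email.message import EmailMessage

import psycopg2  # the other databases get the same treatment with their own client libraries

def check_postgres():
    try:
        conn = psycopg2.connect(host="localhost", dbname="postgres",
                                user="monitor", password="secret", connect_timeout=5)
        conn.close()
    except Exception as exc:
        send_alert(f"PostgreSQL check failed: {exc}")

def send_alert(body):
    msg = EmailMessage()
    msg["Subject"] = "VPS database alert"
    msg["From"] = "monitor@example.com"
    msg["To"] = "me@example.com"
    msg.set_content(body)
    with smtplib.SMTP("localhost") as smtp:  # assumes a local mail relay
        smtp.send_message(msg)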