The original article
Scheduling Notebook and Script Runs with GitHub Actions by Naveen Venkatesan Towards Data Science
The following steps are based on the original article. But since it is published a year ago, some information needs to be updated.
- The python version, and GitHub’s built-in action version.
- The weather information task in it does not work now, because the weather API is out of use, so I totally changed the task and python code.
Use Github Actions
- Another solution: Use Flask
Go through the example which pulls weather information every 30 hours.
Setting up the Virtual Environment
python3 -m venv ~/venvs/gh-actions
source ~/venvs/gh-actions/bin/activate
And you can get in the virtual environment after you run the above last line “… activate”, and see (gh-actions) before your terminal prompt.
(gh-actions) (base) kb@ks-MacBook-Air ~ %
If you want to quit, use deactivate
.
pip freeze
can be used to check if this is a clean python environment. If the output of this command is blank, it means there are no dependencies installed on it.
And after you install the libraries you need for the program, by pip install
, you can use pip freeze > requirements.txt
to create a file that can be used to replicate the current virtual environment.
Our own code
The Author’s Example - We don’t use
Get Weather Data and save it to a file and plot it
On 2023-02-03, the weather API in this article is expired. So we will not use it.
We will try - Append Data to File
# Import packages
import pandas as pd
import os
# generate a df
lis = [[x,x*x] for x in range(5)]
df = pd.DataFrame(lis, columns=["number", "square"])
# add time column
df["UTC time"] = pd.Timestamp.now()
print(df)
# read and save the latest file to "latest.csv"
try:
os.rename("square.csv", "latest.csv")
except:
pass
# save the new data to "square.csv"
df.to_csv("square.csv", index=False)
This code create a “squere.csv” file, and if there is already a file with that name exist, the old file’s name is changed to “latest.csv”. All of the code exists in a .ipynb
file, which is a Jupyter notebook file we can run on google Colab or other python environment.
Setting up Github Actions to run the code on a schedule
Define a workflow using a YAML file
creating a file /.github/workflows/update-file.yml
The project structure will be as follows:
GitHub Repo
|
+-- .github
| |
| +-- workflows
| |
| +-- update-file.yml
|
+-- Project Files/Folders
First, name and schedule
In the /.github/workflows/update-file.yml
file, first, we give the name
of our workflow, and then the running schedule
.
name: Update CSV File
on:
schedule:
- cron: '30 0-23/1 * * *'
workflow_dispatch:
The schedule is a [[%cron]] expression, and keep in mind that the time will be in UTC.
workflow_dispatch
will make the jobs be done on the cron
schedule time.
Second, the task lists
List out the specific tasks that will need to be run in our workflow. runs-on: ubuntu-lastest
means we ask for a Ubuntu virtual machine for our task.
jobs:
update_file:
name: update CSV file with the current time
runs-on: ubuntu-lastest
under the update_file
task, there are some steps.
- first, checkout our repository by
actions/checkout@v2
, which is a built-in action.
steps:
- name: Checkout
uses: actions/checkout@v2
- set up Python on the Ubuntu virtual machine
by using another built-in action actions/setup-python@v2
, and we will use Python 3.7 and set pip
as the cache for caching dependencies for quicker subsequent setups.
- name: Setup Python
uses: actions/setup-python@v2
with:
python-version: '3.7'
cache: 'pip'
- install our dependencies from
requirements.txt
we created in step “Setting up the Virtual Environment”
- name: Install Dependencies
run: pip install -r requirements.txt
- convert our Jupyter notebook file
.ipynb
to a script file.py
so we can run it directly with Python rather than google Colab.
The code will run and create a square.csv
file as we mentioned in “We will try - Append Data to File”
- name: Run Script and Update Plot
run: |
jupyter nbconvert --to script square.ipynb
python square.py
- finally, we need to commit and push the changes to the files above to the repository.
- name: Commit and Push Changes
run: |
git config --local user.email "[email protected]"
git config --local user.name "GitHub Actions"
git add square.csv
git commit -m "Updated square.csv on `date` with GitHub Actions"
git push origin master
And the final YAML
file will be like this:
Pay attention to this:
Error : A YAML file cannot contain tabs as indentation.
The file:
name: Update CSV File
on:
schedule:
- cron: '0,30 0-23/1 * * *'
workflow_dispatch:
jobs:
update_file:
name: update CSV file with the current time
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v3
- name: Setup Python
uses: actions/setup-python@v4
with:
python-version: '3.9'
cache: 'pip'
- name: Install Dependencies
run: pip install -r requirements.txt
- name: Run Script and Update Plot
run: |
jupyter nbconvert --to script square.ipynb
python square.py
- name: Commit and Push Changes
run: |
git config --local user.email "[email protected]"
git config --local user.name "GitHub Actions"
git add square.csv
git commit -m "Updated square.csv on `date` with GitHub Actions"
git push origin master
The GitHub repository is here.
Pros and Cons
GitHub provides an easy-to-use solution for doing that. It is a cloud solution, and GitHub can provide a proper environment for you to run the Python code, so we won’t need to worry about a local environment.
But it only supports public repositories for a free account, so if we want to run a private code, we need a paid enterprise account.