Acquire Daily Earthquake Data Using Cron
Schedule Jobs 1: How to use Cron and a Python script to acquire and upload earthquake data to a staging database every day.

Scheduling regularly recurring tasks is a common requirement in many situations: automatic database backups, automated data acquisition and processing in ETL pipelines, automated ML model retraining and deployment, and so on. Depending on the requirements and the nature of the tasks, different tools can be used to achieve the goal. In this tutorial, I am going to demonstrate how to use Cron and a Python script to acquire earthquake data and upload it to a staging database every day.
What is a Cron Job?
Cron (also known as the cron command-line utility) is a job scheduler on Unix-like operating systems. Cron jobs can be scheduled to run at fixed times, dates, or repeating intervals, and cron is best suited for repetitive system maintenance or administration tasks.
Cron is a daemon that starts when the system boots. It checks the crontab files to see what tasks to run and when. Each user of the system can have their own crontab file, and crontab files can also be shared among users or defined at the system level for collaboration or similar purposes. User crontab files are stored under the /var/spool/cron directory, while system-wide crontab files include /etc/crontab and those stored under the /etc/cron.d directory.
$ pwd
/etc
$ ls -l cron*
-rw------- 1 root root 5 Dec 18 2020 cron.allow
-rw------- 1 root root 6 Nov 13 2019 cron.deny
-rw------- 1 root root 255 Nov 13 2019 crontab
cron.d:
total 20
-rw-r--r-- 1 root root 513 Aug 20 09:19 mysched_mystat
-rw-r--r-- 1 root root 51 Dec 3 2020 my-dfmon
-rw-r----- 1 root root 287 Aug 31 11:45 my-drop_caches
-rw-r--r-- 1 root root 0 Aug 30 14:48 my-gsctools
-rw-r--r-- 1 root root 98 May 26 10:22 my-perfmon
-rw-r--r-- 1 root root 329 Aug 31 11:45 my-sa
cron.daily:
total 36
-rwxr-xr-x 1 root root 778 Jun 24 2020 mdadm
-rwxr-xr-x 1 root root 1924 Feb 16 2017 mlocate.cron
-rwxr--r-- 1 root root 983 Jan 7 2020 suse-clean_catman
-rwxr-xr-x 1 root root 1879 Jan 21 2020 suse.de-backup-rc.config
-rwxr-xr-x 1 root root 2059 Sep 11 2014 suse.de-backup-rpmdb
-rwxr-xr-x 1 root root 566 Sep 11 2014 suse.de-check-battery
-rwxr-xr-x 1 root root 371 Sep 11 2014 suse.de-cron-local
-rwxr--r-- 1 root root 1693 Jan 7 2020 suse-do_mandb
cron.hourly:
total 0
cron.monthly:
total 0
cron.weekly:
total 0
$
$ cat cron.allow
root
$ cat cron.deny
guest
$ cat crontab
SHELL=/bin/sh
PATH=/usr/bin:/usr/sbin:/sbin:/bin:/usr/lib/news/bin
MAILTO=root
#
# check scripts in cron.hourly, cron.daily, cron.weekly, and cron.monthly
#
-*/15 * * * * root test -x /usr/lib/cron/run-crons && /usr/lib/cron/run-crons >/dev/null 2>&1
$
Next, I am going to create a new user so that I can schedule Cron jobs as this user.
$ useradd -m -g users demo
$ passwd demo
New password:
BAD PASSWORD: it is too short
Retype new password:
passwd: password updated successfully
$
$ pwd
/home
$ ls -l
total 16
drwxr-xr-x 7 demo users 4096 Oct 29 15:06 demo
drwxr-xr-x 7 dscuser users 4096 Dec 31 2020 dscuser
$
$ vi /etc/sudoers
71 ##
72 ## Runas alias specification
73 ##
74
75 ##
76 ## User privilege specification
77 ##
78 root ALL=(ALL) ALL
79 demo ALL=(ALL) ALL <=== add the new user here
80
81
82 ## Uncomment to allow members of group wheel to execute any command
83 # %wheel ALL=(ALL) ALL
84
"/etc/sudoers" [readonly] 91L, 3433C$ sudo su demo
demo@localhost:/home>
demo@localhost:/home> ls -l
total 16
drwxr-xr-x 7 demo users 4096 Oct 29 15:06 demo
drwxr-xr-x 7 dscuser users 4096 Dec 31 2020 dscuser
demo@localhost:/home> cd demo
demo@localhost:~> ls -l
total 8
drwxr-xr-x 2 demo users 4096 Jun 27 2017 bin
drwxr-xr-x 2 demo users 4096 Dec 18 2020 public_html
demo@localhost:~>
Cron jobs are defined in the crontab file, one job per line. The bottom line of the comment block below gives the format of such a line: five time/date specification fields stating when to execute the (shell) command, followed by one field specifying the command to execute.
# ┌─────────── minute (0 - 59)
# │ ┌─────────── hour (0 - 23)
# │ │ ┌─────────── day of the month (1 - 31)
# │ │ │ ┌─────────── month (1 - 12)
# │ │ │ │ ┌─────────── day of the week (0 - 6) (Sunday to Saturday;
# │ │ │ │ │ 7 is also Sunday on some systems)
# │ │ │ │ │
# │ │ │ │ │
# * * * * * <command to execute>
The time/date specification fields follow the unix-cron format. Whenever the current time and date match the specification, the system executes the specified command.
If “*” is used in a field, it stands for “first-last”. For example, if all five fields are filled with “*”, it means the command will be executed once every minute.
Ranges, lists, and “*/n” can also be used in different fields. One can refer to this page for detailed explanations.
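For example, the following time/date specifications (schedule part only, without the command field; they are illustrative and not taken from the crontab files shown above) cover a few common patterns:
*/15 * * * *     every 15 minutes
0 1 * * *        at 01:00 every day
0 9-17 * * 1-5   at the top of every hour from 9 am to 5 pm, Monday through Friday
0 0 1,15 * *     at midnight on the 1st and 15th of every month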
A crontab file can be listed, edited, and removed using the crontab command.
Usage:
crontab [options] file
crontab [options]
crontab -n [hostname]
Options:
-u <user> define user
-e edit user's crontab
-l list user's crontab
-r delete user's crontab
-i prompt before deleting
-n <host> set host in cluster to run users' crontabs
-c get host in cluster to run users' crontabs
-s selinux context
-x <mask> enable debugging
In the next section, before demonstrating how to schedule the daily job with Cron, I am going to set up a PostgreSQL database to hold the staging earthquake table and show the Python code that acquires the earthquake data and uploads it to the staging database.
Python Script for Acquiring and Uploading Earthquake Data
Here, I use a Docker container to host the PostgreSQL database. For simplicity, I pull the latest PostgreSQL image and run the database from it.
demo@localhost:~> docker pull postgres:latest
latest: Pulling from library/postgres
7d63c13d9b9b: Pull complete
cad0f9d5f5fe: Pull complete
ff74a7a559cb: Pull complete
c43dfd845683: Pull complete
e554331369f5: Pull complete
d25d54a3ac3a: Pull complete
bbc6df00588c: Pull complete
d4deb2e86480: Pull complete
cb59c7cc00aa: Pull complete
80c65de48730: Pull complete
1525521889be: Pull complete
38df9e245e81: Pull complete
79300c1d4f7a: Pull complete
Digest: sha256:db927beee892dd02fbe963559f29a7867708747934812a80f83bff406a0d54fd
Status: Downloaded newer image for postgres:latest
docker.io/library/postgres:latest
demo@localhost:~> docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
postgres latest 317a302c7480 9 days ago 374MB
demo@localhost:~>
demo@localhost:~>
demo@localhost:~> docker volume create postgres-volume
postgres-volume
demo@localhost:~> docker volume ls
DRIVER VOLUME NAME
local postgres-volume
demo@localhost:~>
demo@localhost:~> docker run -d --name=pgdb_staging -p 5432:5432 -v postgres-volume:/home/demo/postgres/data -e POSTGRES_PASSWORD=dbc postgres
a3cfee411b31568559ef041e9abe762e9dceacd1791861d5dbac0609deb3602b
demo@localhost:~> docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
be9bfb767fac postgres "docker-entrypoint.s…" 2 seconds ago Up 1 second 0.0.0.0:5432->5432/tcp, :::5432->5432/tcp pgdb_staging
demo@localhost:~>
Now, we log into the database server and create the database and the user.
demo@localhost:~> psql -h localhost -p 5432 -d postgres -U postgres
Password for user postgres:
psql (14.0)
Type "help" for help.postgres=#
postgres=#
postgres=# CREATE DATABASE demo;
CREATE DATABASE
postgres=# SELECT datname, dattablespace FROM pg_catalog.pg_database;
datname | dattablespace
-----------+---------------
postgres | 1663
template1 | 1663
template0 | 1663
demo | 1663
(4 rows)
postgres=#
postgres=# CREATE USER dbc WITH PASSWORD 'dbc';
CREATE ROLE
postgres=#
postgres=# \du
List of roles
Role name | Attributes | Member of
-----------+------------------------------------------------------------+-----------
dbc | | {}
postgres | Superuser, Create role, Create DB, Replication, Bypass RLS | {}
postgres=#
postgres=# \q
demo@localhost:~> psql -h localhost -p 5432 -d demo -U dbc
Password for user dbc:
psql (14.0)
Type "help" for help.demo=> \dt
Did not find any relations.
demo=>
demo=> \q
demo@localhost:~>
Next, we need the Python script that retrieves the earthquake data and uploads it to the staging database.
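A minimal sketch of what such a script (earthquakes_daily_report.py, the file scheduled in the crontab later) could look like is shown below. It assumes the USGS FDSN event API, queried with requests, as the data source and psycopg2 for the database connection; the helper function names and the exact table schema are illustrative assumptions, so adapt them to your own setup.
# earthquakes_daily_report.py
# Sketch: fetch the past day's earthquakes from the USGS FDSN event API
# and insert them into the "earthquakes" staging table.
import datetime

import psycopg2
import requests

USGS_URL = "https://earthquake.usgs.gov/fdsnws/event/1/query"


def fetch_earthquakes():
    """Query the USGS API for all events of the last 24 hours (GeoJSON)."""
    end = datetime.datetime.now(datetime.timezone.utc)
    start = end - datetime.timedelta(days=1)
    params = {
        "format": "geojson",
        "starttime": start.strftime("%Y-%m-%dT%H:%M:%S"),
        "endtime": end.strftime("%Y-%m-%dT%H:%M:%S"),
    }
    resp = requests.get(USGS_URL, params=params, timeout=60)
    resp.raise_for_status()
    return resp.json()["features"]


def upload(features):
    """Insert the events into the staging table, skipping duplicates."""
    conn = psycopg2.connect(
        host="localhost", port=5432, dbname="demo", user="dbc", password="dbc"
    )
    with conn, conn.cursor() as cur:
        # Table schema inferred from the psql output shown later (assumption).
        cur.execute(
            """CREATE TABLE IF NOT EXISTS earthquakes (
                   id TEXT PRIMARY KEY,
                   time TIMESTAMP,
                   place TEXT,
                   mag REAL,
                   coordinates REAL[],
                   detail TEXT
               )"""
        )
        for f in features:
            props, geom = f["properties"], f["geometry"]
            cur.execute(
                """INSERT INTO earthquakes (id, time, place, mag, coordinates, detail)
                   VALUES (%s, to_timestamp(%s / 1000.0), %s, %s, %s, %s)
                   ON CONFLICT (id) DO NOTHING""",
                (
                    f["id"],
                    props["time"],        # event time in epoch milliseconds
                    props["place"],
                    props["mag"],
                    geom["coordinates"],  # [longitude, latitude, depth]
                    props["detail"],      # URL with the full event details
                ),
            )
    conn.close()


if __name__ == "__main__":
    upload(fetch_earthquakes())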
Scheduling the Cron Job
Now that everything is ready, we can go ahead and schedule the daily job using Cron.
demo@localhost:~>
demo@localhost:~> sudo crontab -u demo -e
no crontab for demo - using an empty one
crontab: installing new crontab
demo@localhost:~>
demo@localhost:~>
demo@localhost:~> sudo crontab -l
no crontab for root
demo@localhost:~> sudo crontab -u demo -l
# cron job (demo): executing the python script for acquiring and uploading earthquake data to staging database
0 1 * * * python /home/demo/earthquakes_daily_report.py
demo@localhost:~>
At 1:00 am every day, the script retrieves the day's earthquake data and uploads it to the staging database. The snippet below shows the data uploaded after the first day.
demo@localhost:~> psql -h localhost -p 5432 -d demo -U dbc
Password for user dbc:
psql (14.0)
Type "help" for help.demo=> \dt
List of relations
Schema | Name | Type | Owner
--------+-------------+-------+-------
public | earthquakes | table | dbc
(1 row)
demo=> SELECT COUNT(*) FROM earthquakes;
count
-------
210
(1 row)
demo=> SELECT * FROM earthquakes WHERE mag > 5;
id | time | place | mag | coordinates | detail
------------+-------------------------+-----------------------------------------------+-----+---------------------------+------------------------------------------------------------------------------------
us7000frfx | 2021-11-04 16:36:53.368 | 298 km SW of Bluff, New Zealand | 5.5 | {165.2157,-48.2606,10} | https://earthquake.usgs.gov/fdsnws/event/1/query?eventid=us7000frfx&format=geojson
us7000frfu | 2021-11-04 16:25:05.135 | 35 km SW of Finschhafen, Papua New Guinea | 5.2 | {147.626,-6.7899,46.39} | https://earthquake.usgs.gov/fdsnws/event/1/query?eventid=us7000frfu&format=geojson
us7000frem | 2021-11-04 15:46:38.965 | Maug Islands region, Northern Mariana Islands | 5.3 | {145.0334,19.0284,564.21} | https://earthquake.usgs.gov/fdsnws/event/1/query?eventid=us7000frem&format=geojson
us7000frcu | 2021-11-04 08:57:06.817 | 65 km ESE of Nikolski, Alaska | 5.2 | {-167.9814,52.7003,35.82} | https://earthquake.usgs.gov/fdsnws/event/1/query?eventid=us7000frcu&format=geojson
us7000fraj | 2021-11-04 02:42:44.037 | 63 km NE of Amahai, Indonesia | 5.7 | {129.3383,-2.9492,18} | https://earthquake.usgs.gov/fdsnws/event/1/query?eventid=us7000fraj&format=geojson
(5 rows)
demo=>
demo=> \q
demo@localhost:~>
demo@localhost:~>
If we no longer want a job to run, we can remove the corresponding line from the crontab file using the edit command, crontab -e. Since we currently have only one job defined, in the snippet below we simply remove the whole crontab file.
demo@localhost:~>
demo@localhost:~> sudo crontab -u demo -r
[sudo] password for demo:
demo@localhost:~> sudo crontab -u demo -l
no crontab for demo
demo@localhost:~>
If we wait until the next day and check the database again, we will see that no new data has been added.
Summary
In this tutorial, I have demonstrated how to schedule recurring tasks using Cron, arguably the simplest and most commonly used tool for scheduling jobs. However, Cron has its limitations. For example, tasks that recur more than once per minute cannot be scheduled. Another limitation is portability: while cron is available in all Unix distributions, it is not available on Windows machines, where an equivalent tool such as schtasks has to be used to achieve the same goal. Other limitations include (but are not limited to) the following: it usually requires root privileges, and it can be inconvenient for scheduling tasks in a distributed environment, where it may become a single point of failure.
In the next tutorial, I am going to demonstrate how to use another very popular job scheduling tool, Celery, to schedule jobs.