Storing/Retrieving Objects Using SQLAlchemy
Storing objects in, and retrieving them from, a relational database is common in many use cases. For example, in ETL pipelines, although the data warehouse usually sits on an RDBMS, NoSQL data sources are not rare. Especially in a staging database, hierarchical data types, such as JSON, XML, or YAML, sometimes need to be stored in the relational database directly. Another example is that we sometimes need to store machine learning models, which usually have a hierarchical structure, in a relational database.
SQLAlchemy is a very popular and powerful ORM (Object Relational Mapping) tool. In SQLAlchemy, hierarchical data can be mapped as a PickleType column, and for PickleType fields, SQLAlchemy serializes and deserializes the values automatically when data is stored into or retrieved from the database. In this tutorial, I demonstrate how a hierarchical data field can be saved into and retrieved from a PostgreSQL database using SQLAlchemy.
Preparing the Database and Creating an Engine
First, we start a Postgres database in a Docker container:
docker run -d --name pgdb_staging -p 5433:5432 -e POSTGRES_PASSWORD=dbc postgres:latest
Next, we create a database named demo on the database server.
postgres=# create database demo;
CREATE DATABASE
postgres=# select datname, dattablespace from pg_catalog.pg_database;
datname | dattablespace
-----------+---------------
postgres | 1663
demo | 1663
template1 | 1663
template0 | 1663
(4 rows)
postgres=#
Now, let's import the packages we need and create a database engine.
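A minimal sketch of the engine setup; the connection string assumes the default postgres superuser and the default psycopg2 driver, and points at the container started above (host port 5433, database demo):

from sqlalchemy import create_engine

# Host port 5433 is mapped to the container's 5432; the password is the one
# set via POSTGRES_PASSWORD. The psycopg2 driver is assumed to be installed.
engine = create_engine("postgresql://postgres:dbc@localhost:5433/demo")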
Creating a Table Using the SQLAlchemy ORM
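Below is a minimal sketch of the ORM model; the class name EarthQuake matches the queries used later, and the columns mirror the table definition shown in psql (the PickleType column is stored as bytea):

from sqlalchemy import Column, Date, Integer, PickleType
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class EarthQuake(Base):
    """One pickled earthquake record per row."""
    __tablename__ = "pickled_earthquakes"

    eq_id = Column(Integer, primary_key=True)  # backed by a sequence in Postgres
    date = Column(Date)                        # date the record was stored
    earthquake = Column(PickleType)            # pickled automatically to bytea

# Create the table in the demo database, and open a session for later use.
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
session = Session()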
After executing the above Python code, the table pickled_earthquakes is created in the database.
postgres=# \c demo
You are now connected to database "demo" as user "postgres".
demo=#
demo=# \dt
List of relations
Schema | Name | Type | Owner
--------+---------------------+-------+----------
public | pickled_earthquakes | table | postgres
(1 row)
demo=#
demo=# \d pickled_earthquakes
Table "public.pickled_earthquakes"
Column | Type | Collation | Nullable | Default
------------+---------+-----------+----------+----------------------------------------------------
eq_id | integer | | not null | nextval('pickled_earthquakes_eq_id_seq'::regclass)
date | date | | |
earthquake | bytea | | |
Indexes:
"pickled_earthquakes_pkey" PRIMARY KEY, btree (eq_id)demo=#
demo=# SELECT COUNT(*) FROM pickled_earthquakes;
count
-------
0
(1 row)
demo=#
By checking the row count of the table, we see that there is no data in it yet. Next, we prepare a function for extracting the earthquake data in JSON format, as sketched below.
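A sketch of one possible extraction helper; the USGS FDSN event API is a real endpoint, but the function name, query window, and magnitude filter here are assumptions for illustration:

import requests

USGS_URL = "https://earthquake.usgs.gov/fdsnws/event/1/query"

def extract_earthquakes(starttime, endtime, minmagnitude=4.0):
    """Fetch earthquake events as a list of GeoJSON feature dicts."""
    params = {
        "format": "geojson",
        "starttime": starttime,
        "endtime": endtime,
        "minmagnitude": minmagnitude,
    }
    response = requests.get(USGS_URL, params=params)
    response.raise_for_status()
    return response.json()["features"]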
Extracting Earthquake Data and Storing It in the Database
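With the helper above, extracting and storing the data can look like the following sketch; the date window is an assumption chosen to cover the sample record shown below, so the exact row count may differ:

from datetime import date

# Each feature dict is assigned to the PickleType column as-is; SQLAlchemy
# pickles it into the bytea column when the session is flushed.
earthquakes = extract_earthquakes("2021-11-10", "2021-11-14")
for eq in earthquakes:
    session.add(EarthQuake(date=date.today(), earthquake=eq))
session.commit()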
After executing the above Python code, the extracted earthquake data, in JSON format, is saved into the database. To make sure we didn't lose any data, we can check the row count, and a randomly selected record from the extracted data, as follows.
>>> len(earthquakes)
238
>>> import pprint
>>> pprint.pprint(earthquakes[123])
{'geometry': {'coordinates': [52.2186, 28.0958, 10], 'type': 'Point'},
'id': 'us7000ftt0',
'properties': {'alert': None,
'cdi': None,
'code': '7000ftt0',
'detail': 'https://earthquake.usgs.gov/fdsnws/event/1/query?eventid=us7000ftt0&format=geojson',
'dmin': 4.753,
'felt': None,
'gap': 104,
'ids': ',us7000ftt0,',
'mag': 4.3,
'magType': 'mb',
'mmi': None,
'net': 'us',
'nst': None,
'place': '88 km NW of Mohr, Iran',
'rms': 0.4,
'sig': 284,
'sources': ',us,',
'status': 'reviewed',
'time': 1636798788435,
'title': 'M 4.3 - 88 km NW of Mohr, Iran',
'tsunami': 0,
'type': 'earthquake',
'types': ',origin,phase-data,',
'tz': None,
'updated': 1636806902040,
'url': 'https://earthquake.usgs.gov/earthquakes/eventpage/us7000ftt0'},
'type': 'Feature'}
By checking the row count in the database, we confirm that no record is lost.
demo=# SELECT COUNT(*) FROM pickled_earthquakes;
count
-------
238
(1 row)
demo=#
As for the data value, since it is stored serialized in the database, we need to retrieve it from Python to see its actual content.
Retrieving Earthquake Data from the Database
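Using the same session and mapped class, we can query a single record; note that eq_id 124 in the database corresponds to index 123 in the extracted list, since eq_id starts at 1. SQLAlchemy unpickles the bytea value back into a dict transparently: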
>>> eq_123 = session.query(EarthQuake).filter(EarthQuake.eq_id==124)
>>> pprint.pprint(eq_123.first().earthquake)
{'geometry': {'coordinates': [52.2186, 28.0958, 10], 'type': 'Point'},
'id': 'us7000ftt0',
'properties': {'alert': None,
'cdi': None,
'code': '7000ftt0',
'detail': 'https://earthquake.usgs.gov/fdsnws/event/1/query?eventid=us7000ftt0&format=geojson',
'dmin': 4.753,
'felt': None,
'gap': 104,
'ids': ',us7000ftt0,',
'mag': 4.3,
'magType': 'mb',
'mmi': None,
'net': 'us',
'nst': None,
'place': '88 km NW of Mohr, Iran',
'rms': 0.4,
'sig': 284,
'sources': ',us,',
'status': 'reviewed',
'time': 1636798788435,
'title': 'M 4.3 - 88 km NW of Mohr, Iran',
'tsunami': 0,
'type': 'earthquake',
'types': ',origin,phase-data,',
'tz': None,
'updated': 1636806902040,
'url': 'https://earthquake.usgs.gov/earthquakes/eventpage/us7000ftt0'},
'type': 'Feature'}
By comparing the earthquake data from the source and from the database, we confirm that the two records are identical.
Conclusion
In this short tutorial, I have demonstrated how to store a JSON data field into a Postgres database and how to retrieve it back from the database. In the example, we have seen that if we use the session.add() and session.query() methods, the serialization and deserialization of the hierarchical data are already taken care of by those methods. Alternatively, if we write SQL statements explicitly and send them to the database engine using the session.execute() method, we have to serialize and deserialize the hierarchical data ourselves, say, by calling pickle.dumps() and pickle.loads() respectively. However, because hand-written SQL statements are prone to security vulnerabilities such as SQL injection, this approach is not recommended, especially in a production environment.
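For completeness, here is a minimal sketch of that manual route, for illustration only; it reuses the table from the example above and uses bound parameters rather than string formatting:

import pickle
from sqlalchemy import text

# Serialize by hand before inserting with an explicit SQL statement.
payload = pickle.dumps(earthquakes[0])
session.execute(
    text("INSERT INTO pickled_earthquakes (date, earthquake) "
         "VALUES (CURRENT_DATE, :blob)"),
    {"blob": payload},
)
session.commit()

# Read the raw bytea back and deserialize by hand; bytes() guards against
# the driver returning a memoryview instead of bytes.
row = session.execute(
    text("SELECT earthquake FROM pickled_earthquakes ORDER BY eq_id DESC LIMIT 1")
).first()
record = pickle.loads(bytes(row[0]))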