Yet Another Omop Loader

Introduction

Yet Another OMOP Loader is a tool designed to load data into an OMOP Common Data Model Postgres database.

The tool is designed to be run at multiple stages of a pipeline, with the following stages.
  1. Create the database schemas and tables

  2. Load the vocabularies

  3. Load the data

  4. Create the primary keys

  5. Create the indicies

  6. Create the foreign keys

Each stage is idempotent so can be re-run to ensure a database is in a known state, the tool will not attempt to recreate objects that already exist or load data into tables which already contain data. Seperate schemas can be specified for data and vocabularies so a single database can hold multiple OMOP datasets which share commmon vocabularies.

To configure your environment ready to use the tool:
  1. Create an empty database and a user with admin access. Set the database name and user name into config.DB_CONN_STR

  2. Download the latest DDL files for postgres from https://github.com/OHDSI/CommonDataModel/tree/main/inst/ddl/5.4

  3. Set config.DDL_FILE, config.CONSTRAINTS_FILE, config.INDICIES_FILE and config.KEYS_FILE to point to the relevant files.

  4. Prepare your data into CSV files. It should be one file per CDM table. If the filename doesn’t match the table name (with a .csv suffix) you can set config.DATA_PATTERN to extact the table name from the file name.

  5. Set config.DAT_PATH to point to your CSV files.

  6. Download a vocab zip file from Athena and set config.VOCABS_ZIP to the path of the zip file.

  7. Set other environment variables as required.

Installation

- Install the script requirements using pip install -r requirements.txt
- If you wish to rebuild the documentation also run pip install -r requirements.docs.txt

Usage

usage: omoploader.py [-h] [-d] [-dr] [-sc] [--omopschema OMOPSCHEMA]
                     [--vocabschema VOCABSCHEMA]
                     {clean,build,vocabs,load,pkeys,index,fkeys,all,reload} ...

Positional Arguments

action

Possible choices: clean, build, vocabs, load, pkeys, index, fkeys, all, reload

Database operation

Named Arguments

-d, --debug

Display debug logging.

Default: False

-dr, --dryrun

Rollback the transaction on completion.

Default: False

-sc, --skipcheck

Skips checking the state of the database before running action.

Default: False

--omopschema

OMOP Schema. Overrides config.DB_OMOP_SCHEMA

--vocabschema

Vocab Schema. Overrides config.DB_VOCAB_SCHEMA

Sub-commands

clean

Removes all objects by deleting the schemas

omoploader.py clean [-h]

build

Builds the CDM Tables

omoploader.py build [-h]

vocabs

Loads the Vocabularies

omoploader.py vocabs [-h]

load

Loads the CSV data

omoploader.py load [-h]

pkeys

Builds the primary keys

omoploader.py pkeys [-h]

index

Builds the indexes

omoploader.py index [-h]

fkeys

Builds the foreign keys

omoploader.py fkeys [-h]

all

Runs all actions except for clean

omoploader.py all [-h]

reload

Reloads the CSV data

omoploader.py reload [-h]

Configuration Settings

config.DDL_FILE

Path to the Postgres DDL file as downloaded from the OHDSI Github. Set from the YAOL_DDL_FILE env var.

config.CONSTRAINTS_FILE

Path to the Postgres constraints sql file as downloaded from the OHDSI Github. Set from the YAOL_CONSTRAINTS_FILE env var.

config.INDICIES_FILE

Path to the Postgres indexes sql file as downloaded from the OHDSI Github. Set from the YAOL_INDICIES_FILE env var.

config.KEYS_FILE

Path to the Postgres primary keys sql file as downloaded from the OHDSI Github. Set from the YAOL_KEYS_FILE env var.

config.DATA_PATH

Path to the CSV files containing omop data. One per table. Set from the YAOL_DATA_PATH env var.

config.DATA_PATTERN

Regular expression to extract the OMOP table name from the CSV file name. Set from the YAOL_DATA_PATTERN env var.

config.DB_CONN_STR

Postgres connection string. Set from the YAOL_DB_CONN_STR env var.

config.DB_OMOP_SCHEMA

Name of the Schema to use for the CDM data tables. Set from the YAOL_DB_OMOP_SCHEMA env var.

config.DB_VOCAB_SCHEMA

Name of the Schema to use for the vocab tables. Set from the YAOL_DB_VOCAB_SCHEMA env var. Defaults to DB_OMOP_SCHEMA

config.DB_RESULTS_SCHEMA

Name of the Schema to use for the Achilles results tables. Set from the YAOL_DB_RESULTS_SCHEMA env var.

config.VOCABS_ZIP

Path to a zip file containg OMOP vocabulary files as downlaoded from Athena. Set from the YAOL_VOCAB_ZIP env var.

Functions

omoploader.add_schema(ddl_file, schema_name, vocab_schema_name)
Reads an OMOP DDL file as downloaded from the OHDSI github and replaces the schema template variable

with the specified schema.

Parameters:
  • ddl_file (str) – The path to the OMOP DDL file.

  • schema_name (str) – The schema name to replace the @cdmDatabaseSchema placeholder with.

  • vocab_schema_name (str) – The schema name to replace the @cdmDatabaseSchema placeholder with for vocab tablees

Returns:

A string containing the contents of the ddl file with the specifed schema set.

Return type:

str

omoploader.build(conn)

Calls build_cdm() with the values of config.DB_OMOP_SCHEMA, config.DDL_FILE and config.DB_RESULTS_SCHEMA

Parameters:

conn (psycopg.connection) – A psycopg connection object to the postgres database

Returns:

None

Return type:

None

omoploader.build_cdm(conn, schema_name, vocab_schema_name, ddl_file, results_schema_name)

Build the OMOP CDM Tables by executing the OMOP DDL file. Does nothing if the already exist (by replacing the CREATE TABLE statements with CREATE TABLE IF NOT EXISTS statements)

Parameters:
  • conn (psycopg.connection) – A psycopg connection object to the postgres database

  • schema_name (str) – The name of the CDM schema to create and build the tables in. This replaces the schema template variable in the DDL file.

  • vocab_schema_name (str) – The name of the CDM schema to create and build the tables in. This replaces the schema template variable in the DDL file for vocab tables.

  • results_schema_name (str) – The name of the results schema to create and build the results the tables in (not currently used)

Returns:

None

Return type:

None

omoploader.build_fkeys(conn, schema_name, vocab_schema_name, constraints_file)

Build the OMOP CDM foreign keys by executing the OMOP Constrains file. Does nothing if they already exist.

Parameters:
  • conn (psycopg.connection) – A psycopg connection object to the postgres database

  • schema_name (str) – The name of the CDM schema. This replaces the schema template varaible in the sql file.

  • vocab_schema_name (str) – The name of the CDM schema. This replaces the schema template varaible in the sql file for vocab tables.

  • constraints_file (str) – The name of the file containing the SQL statements to create the foreign keys.

Returns:

None

Return type:

None

omoploader.build_indicies(conn, schema_name, vocab_schema_name, indices_file)

Build the OMOP CDM Indexes by executing the OMOP Indexes file. Does nothing if they already exist.

Parameters:
  • conn (psycopg.connection) – A psycopg connection object to the postgres database

  • schema_name (str) – The name of the CDM schema. This replaces the schema template varaible in the sql file.

  • vocab_schema_name (str) – The name of the CDM schema. This replaces the schema template varaible in the sql file for vocab tables.

  • indices_file (str) – The name of the file containing the SQL statements to create the indexes.

Returns:

None

Return type:

None

omoploader.build_pkeys(conn, schema_name, vocab_schema_name, pkeys_file)

Build the OMOP CDM Primary Keys by executing the OMOP Primary Keys file. Does nothing if they already exist.

Parameters:
  • conn (psycopg.connection) – A psycopg connection object to the postgres database

  • schema_name (str) – The name of the CDM schema. This replaces the schema template varaible in the sql file.

  • vocab_schema_name (str) – The name of the CDM schema. This replaces the schema template varaible in the sql file for vocab tables.

  • pkeys_file (str) – The name of the file containing the SQL statements to create the Keys.

Returns:

None

Return type:

None

omoploader.build_table_map(data_pattern, data_path)

Reads the files in a folder and uses a regular expression to extract the OMOP table name from the file name.

Parameters:
  • conn – A psycopg connection object to the postgres database

  • data_path (str) – The folder containing data files to be loaded into OMOP tables.

Returns:

A a list of tuples of (file name,omop table name)

Return type:

tuple

omoploader.clean(conn)

Calls drop_cdm() with the values of config.DB_OMOP_SCHEMA, config.DB_VOCAB_SCHEMA and config.DB_RESULTS_SCHEMA

Parameters:

conn (psycopg.connection) – A psycopg connection object to the postgres database

Returns:

None

Return type:

None

omoploader.drop_cdm(conn, schema_name, vocab_schema_name, results_schema_name)

Drops the specifed schemas from the database. Does nothing if they calready exist.

Parameters:
  • conn (psycopg.connection) – A psycopg connection object to the postgres database

  • schema_name (str) – The name of the CDM schema to remove.

  • vocab_schema_name (str) – The name of the vocab schema to remove.

  • results_schema_name (str) – The name of the results schema to remove.

Returns:

None

Return type:

None

omoploader.fkeys(conn, delete_first=False, skip_check=False)

Ensures indexes are created by calling indicies() then calls build_fkeys() with the values config.DB_OMOP_SCHEMA, config.DB_VOCAB_SCHEMA, config.CONSTRAINTS_FILE.

Parameters:
  • conn (bool) – A psycopg connection object to the postgres database

  • delete_first (bool) – Delete contraints and then re-create

  • skip_check – If true, no check is performed on the state of the database first.

Returns:

None

Return type:

None

omoploader.get_args_parser()

Builds a parser to handle the command line arguments.

Returns:

Parser for arguments

Return type:

argparse.ArgumentParser

omoploader.handle_args()

Parses the command line arguments.

Returns:

Namespace containing parsed arguments

Return type:

argparse.Namespace

omoploader.index(conn, delete_first=False, skip_check=False)

Ensures keys are created by calling keys() then calls build_indicies() with the values config.DB_OMOP_SCHEMA, config.DB_VOCAB_SCHEMA, and config.INDICIES_FILE.

Parameters:
  • conn (bool) – A psycopg connection object to the postgres database

  • delete_first (bool) – Delete index and then re-create

  • skip_check – If true, no check is performed on the state of the database first.

Returns:

None

Return type:

None

omoploader.load(conn, delete_first=False, skip_check=False)

Ensures vocabs are loaded by calling vocabs(), builds a table to file map and then calls load_data_csv() with the values of and config.DB_OMOP_SCHEMA.

Parameters:
  • conn (bool) – A psycopg connection object to the postgres database

  • delete_first (bool) – Delete data from each table before loading

  • skip_check – If true, no check is performed on the state of the database first.

Returns:

None

Return type:

None

omoploader.load_data_csv(conn, db_schema, table_map, delete_first=False)

Loads data from CSV files into OMOP tables. Expects one file per table. N.B. This will not load data into any table which already contains data (i.e if count(*)>0).

Parameters:
  • conn (psycopg.connection) – A psycopg connection object to the postgres database

  • schema_name (str) – The name of the CDM schema. This replaces the schema template varaible in the sql file.

  • table_map (list(tuple)) – A list of tuples specifying a fully qualified file path and an OMOP table name to load the data into i.e [(file name,table name)].

  • delete_first (bool) – Delete all rows from table before loading data. Defaults to False. Data will not be loaded to any table contaning data.

Returns:

None

Return type:

None

omoploader.load_vocabs_from_zip(conn, db_schema, zip_file)

Loads OMOP vocabluaries from a zip file as downloaded from Athena. N.B. This does not currently handle vocabs which require a license/post processing (e.g. CPT4).

Parameters:
  • conn (psycopg.connection) – A psycopg connection object to the postgres database

  • schema_name (str) – The name of the CDM schema. This replaces the schema template variable in the sql file for the vocab tables.

  • zip_file (str) – The path to the zip file containing vocab files.

Returns:

None

Return type:

None

omoploader.pkeys(conn, delete_first=False, skip_check=False)

Ensures data is loaded by calling load() then calls build_keys() with the values config.DB_OMOP_SCHEMA, config.DB_VOCAB_SCHEMA, and config.KEYS_FILE.

Parameters:
  • conn (bool) – A psycopg connection object to the postgres database

  • delete_first (bool) – Delete key from each table and then re-create

  • skip_check – If true, no check is performed on the state of the database first.

Returns:

None

Return type:

None

omoploader.run_sql_template(conn, schema_name, vocab_schema_name, template_file)

Executes a file of SQL statements as downloaded from the OHDSI github. The contents of the file are first passed to the add_schema function to replace the schema placeholder.

Parameters:
  • conn (psycopg.connection) – A psycopg connection object to the postgres database

  • schema_name (str) – The schema name to run the statements against.

  • vocab_schema_name (str) – The schema name to run the statements against for vocab tables.

  • template_file – The file of sql statements to run.

Returns:

None

Return type:

None

omoploader.setup_logging(debug)

Sets the logging level. Sets to DEBUG if debug paramter is true, otherwise sets INFO

Parameters:

debug (boolean) – Specifes whether debug should be set.

Returns:

None

Return type:

None

omoploader.vocabs(conn, skip_check=False)

Ensures tables are built by calling build() and then Calls load_vocabs_file_zip() with the values of config.DB_OMOP_SCHEMA, config.VOCABS_ZIP.

Parameters:
  • conn (bool) – A psycopg connection object to the postgres database

  • skip_check – If true, no check is performed on the state of the database first.

Returns:

None

Return type:

None

dbutils.index_exists(conn, index_name)

Checks whether the given index exists.

Parameters:
  • conn (psycopg.connection) – A psycopg connection object to the postgres database

  • index_name (str) – The name of the index to check.

Returns:

True if the index exists

Return type:

bool

dbutils.is_vocab_table(table_name)

Checks whether the given table is a vocabulary table.

Parameters:

table_name (str) – The name of the table to check.

Returns:

True if the table is a vocabulary table

Return type:

bool

dbutils.key_exists(conn, key_name)

Checks whether the given primary or foreign key exists.

Parameters:
  • conn (psycopg.connection) – A psycopg connection object to the postgres database

  • key_name (str) – The name of the key to check.

Returns:

True if the key exists

Return type:

bool

dbutils.schema_exists(conn, schema_name)

Checks whether the given schema exists.

Parameters:
  • conn (psycopg.connection) – A psycopg connection object to the postgres database

  • schema_name (str) – The name of the schema to check.

Returns:

True if the schema exists

Return type:

bool

dbutils.table_is_empty(conn, schema_name, table_name)

Checks whether the given table is empty.

Parameters:
  • conn (psycopg.connection) – A psycopg connection object to the postgres database

  • schema_name (str) – The name of the CDM schema. This replaces the schema template varaible in the sql file.

  • table_name (str) – The name of the table to check.

Returns:

True if the table is empty

Return type:

bool