Python script to migrate genomic data from MySQL DB to parquet files (with added support to upload output files to AWS S3) | GSoC '22

rohitxsh/sql2parquet_py

Project report: https://github.com/rohitxsh/ensembl_lakehouse_ui/blob/main/README.md

Python script to move data from SQL to parquet files | GSoC '22

Recommended: Python 3.9.x
Run the script via

  • Command line:
  1. Set up your AWS keys as explained here: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html#configuration (the configuration files live in ~/.aws/, where ~ is your home directory)
  2. Run the script via python3 -m sql2parquet.main
  • Dockerfile:
  1. Update your AWS keys in .aws/credentials [the .aws directory should be in the same directory as the Dockerfile]
  2. Build the image from the Dockerfile via docker build --tag sql2parquet .
  3. Run the container via docker run -d --name sql2parquet sql2parquet

config.toml schema:

[[databases]]
location = "string, DB server address"
port="string, DB server's port no."
DB_USER="string, username to access the specified DB server"
[[databases.species]]
DBname="string, DB name"
species="string, species scientific name"
[[databases.tables]]
table = "string, table name"
query = '''
multi-line string, SQL query to construct the table; variables defined in vars can be used
'''
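
As a rough illustration of how a file following this schema nests once parsed, here is a minimal sketch. It is not the script's own loader: the host, user, database, table and query values are hypothetical placeholders, the helper name list_jobs is invented for this example, and pairing every species with every table is an assumption about how the entries combine. It uses tomllib (standard library from Python 3.11); on the recommended 3.9.x the third-party tomli package offers the same API.

```python
try:
    import tomllib  # standard library from Python 3.11 onwards
except ModuleNotFoundError:  # older interpreters, e.g. 3.9.x
    import tomli as tomllib  # third-party backport with the same API

# Hypothetical values throughout; only the key names come from the schema.
EXAMPLE_CONFIG = """
[[databases]]
location = "db.example.org"
port = "3306"
DB_USER = "reader"

[[databases.species]]
DBname = "example_core_db"
species = "homo_sapiens"

[[databases.tables]]
table = "gene"
query = '''
SELECT gene_id, stable_id
FROM gene
'''
"""


def list_jobs(config_text):
    """Return one (location, DBname, table) tuple per species/table pair.

    Assumption: every table entry applies to every species entry of the
    same [[databases]] block.
    """
    config = tomllib.loads(config_text)
    jobs = []
    for db in config["databases"]:
        for sp in db.get("species", []):
            for tbl in db.get("tables", []):
                jobs.append((db["location"], sp["DBname"], tbl["table"]))
    return jobs


jobs = list_jobs(EXAMPLE_CONFIG)
print(jobs)
```

Each [[databases.species]] and [[databases.tables]] header appends to an array inside the most recent [[databases]] entry, so one server block can carry several species and several tables.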

Supported DB: MySQL
Engine: PyMySQL


.aws configuration files content for reference:

config
[default]
region=eu-west-2

credentials
[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY
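
Before building the Docker image it can help to confirm that the credentials file has both keys filled in. The following is a minimal sketch using only the standard library; it is not part of the script, the helper name missing_credential_keys is invented for this example, and it only checks that the keys are present, not that AWS accepts them.

```python
import configparser

REQUIRED_KEYS = ("aws_access_key_id", "aws_secret_access_key")


def missing_credential_keys(text, profile="default"):
    """Return the required keys that are absent or empty in the given profile."""
    # Interpolation is disabled so that a '%' in a value is read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    if not parser.has_section(profile):
        return list(REQUIRED_KEYS)
    return [
        key
        for key in REQUIRED_KEYS
        if not parser.get(profile, key, fallback="").strip()
    ]


# Same layout as the reference credentials file above.
sample = """
[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY
"""

print(missing_credential_keys(sample))
```

To check a real file, read it first, for example with pathlib.Path(".aws/credentials").read_text(), and pass the text to the function.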
