Datacrossways is a lightweight, cloud-based data management service. The service supports data upload, storage, data sharing, and fine-grained data access control. It was designed to be easily deployed on Amazon AWS.
Datacrossways launches a data portal using the Flask API and a React frontend and is meant for deployment on Amazon AWS. Users can connect through the React frontend or access resources programmatically by interacting directly with the Datacrossways API. The frontend receives all information from the Datacrossways API.
The API accesses a Postgres database that persists information. The API needs access to some AWS resources and requires limited AWS permissions that are passed via a configuration file. Specifically, the API needs to create an S3 bucket and to upload files to and retrieve files from it.
Contents:
- GoogleOAuth configuration
- Create temporary AWS user
- Create EC2 instance
- Create AWS resources
- Remove AWS resources
- Deploy Datacrossways for production
- Deploy API locally
- Deploy React frontend locally
Decide on a domain (e.g. datacrossways.org) and get a fixed/elastic IP address. These should be the first steps to take. Then follow the instructions below for an easy setup of a Datacrossways instance; the whole process should not take long. To register a new domain you can use the AWS Route53 service. During the setup process, you will need to provide the domain name.
Datacrossways currently uses Google OAuth to manage user logins, so OAuth credentials are a prerequisite for initializing a Datacrossways instance. To set up credentials go to https://console.cloud.google.com/apis/dashboard, where you need an account (create one if you do not already have one).
If you have not done so already, you will first have to configure your OAuth consent screen. Fill out the information such as domain and admin email. In the scopes section, select the first three options. Then `Save and continue`. Next, you can add test users. While the website is still in testing, only the test users can use the OAuth login.
Click on `+ CREATE CREDENTIALS` and select `OAuth client ID`. There, create a new `web application` entry and fill in the `Authorized JavaScript origins` and `Authorized redirect URIs`. Here you can set multiple domains (choose one you want to use and own). For the redirect URI add the following entry: `https://<domain>/api/user/authorize?provider=google`. Then select `CREATE`.
The newly created entry should appear under `OAuth 2.0 Client IDs`. Click `Download OAuth client` and save `Your Client ID` and `Your Client Secret` as a JSON. This file will be used later when deploying the Datacrossways instance, so keep it handy.
Datacrossways also supports Orcid. Log into Orcid and, under `Profile View`, select `Developer Tools`. Here you can configure the required information. The redirect URI should be `<base_url>/api/user/authorize?provider=orcid` (example: https://datacrossways.org/api/user/authorize?provider=orcid). To enable Orcid, add the Orcid account information to the `config.json` in the OAuth section once it is created during the setup process. It should look something like:
"oauth":{
"orcid": {
"client_id": "APP-B2KWOS2DS3DSJJH",
"client_secret": "32120d-0981-7453e-lk982-okas908ahjk23"
},
"google": { ... }
}
Datacrossways requires several AWS resources to be configured before the Datacrossways API and frontend can run. While most of the configuration is automated, some initial steps need to be performed manually. The first step is to create a temporary role with credentials to create the final user credentials and S3 bucket, as well as an RDS database.
This role will only be used to set up the required AWS resources and can be removed again after the setup. Such broad permissions can be a security risk, so you should limit the instance's user rights once the resources are created.
Log into the AWS dashboard at https://aws.amazon.com.
- Navigate to create role under IAM
  - Navigate to IAM
  - Under `Access management` select `Roles` in the left menu
  - Select the `Create role` button
- Create role
  - Select AWS service and use case EC2
  - Select the `Next` button
- Attach permissions
  - In `filter policies` type `EC2FullAccess`, press enter, and check the box
  - In `filter policies` type `IAMFullAccess`, press enter, and check the box
  - In `filter policies` type `AmazonS3FullAccess`, press enter, and check the box
  - In `filter policies` type `AmazonRDSFullAccess`, press enter, and check the box
  - Select the `Next` button
- Add name and description
  - Choose a unique role name
  - Write a description, e.g. `role for datacrossways configuration`
  - Select the `Create role` button
When all is done, the role should look something like this:
Depending on the deployment, this instance can host the Datacrossways API and frontend, or it can be used only to configure the AWS resources (in case the API and frontend are run locally for development). A small, cost-efficient instance should be sufficient for most use cases (`t2.small`). Data traffic bypasses the host server, so it does not require significant hard-disk space. It is recommended to have at least `20GB` to build all Docker images when Datacrossways is deployed on this host.
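Once logged into the instance you can quickly check whether enough disk space is available for the Docker builds, using the GNU coreutils `df` that ships with Ubuntu:

```shell
# Print the available space (in GB) on the root filesystem; at least
# 20GB total disk is recommended for building all Docker images.
df -BG --output=avail / | tail -n 1
```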
Assuming you want to create resources in region `us-east-1`, you can first create an `Elastic IP address`. These IP addresses remain reserved even if you terminate the AWS instance, which is recommended to make sure the domain stays properly linked to your Datacrossways instance. Navigate to https://us-east-1.console.aws.amazon.com/ec2/home?region=us-east-1#Addresses: and select `Allocate Elastic IP address`. Then select `Allocate`. Adding a tag is optional.
Log into the AWS dashboard at https://aws.amazon.com.
- Navigate to the EC2 dashboard
  - Search for the service EC2, which should open the EC2 dashboard
  - Select the `Launch Instance` button, click `Launch Instance`
- Configure the instance
  - Under `Quick Start` select `Ubuntu` (at the time of writing: Ubuntu Server 22.04 LTS (HVM), SSD Volume Type)
  - Under `Instance type` select the desired instance (at least `t2.small` @ $0.023/h or ~$17/month); other good options are the other `t2`/`t3` burstable instances
    - Pricing overview: https://aws.amazon.com/ec2/pricing/on-demand/
  - Under `Key pair` either use an existing key pair or generate a new one. Enter a key pair name and download `.pem` if working on UNIX or `.ppk` when working with Windows and PuTTY. The `pem`/`ppk` file is used to log into the instance once created. Under UNIX the key should be placed into a folder with limited user rights (`chmod 700`) and the key itself restricted (`chmod 600`)
  - Under `Configure storage` set at least `20GB`. Space is mainly needed to build Docker images; if disk space is too small it can result in some minor issues
  - After selecting storage space change `Not encrypted` to `Encrypted`. Then select the `(default) aws/ebs` key under `KMS key` to encrypt the hard drive
  - Optional but highly recommended: under `Network settings` restrict SSH traffic to `My IP`
  - Select `Allow HTTPS traffic from internet`
  - Select `Allow HTTP traffic from internet`
  - Under `Advanced details` select `IAM instance profile` and choose the role created before
  - Select the `Launch Instance` button
- Assuming `us-east-1`, navigate to https://us-east-1.console.aws.amazon.com/ec2/home?region=us-east-1#Instances:instanceState=running, select the newly created instance, then select `Actions`, `Networking`, `Manage IP addresses` and attach the `Elastic IP`
- Select the newly created instance in the table and copy the `Public IPv4 address`. The public IP can also be found in the `Instance Details` tab.
- Under UNIX connect to the instance with `ssh -i pathtokey/key.pem ubuntu@ipaddress`
- Windows users: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html
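The key-handling steps above can be sketched as follows; the folder name is an example, and `key.pem` stands for the key pair downloaded from AWS:

```shell
# Keep the key in a folder only the owner can access (chmod 700)
# and make the key itself readable only by the owner (chmod 600).
mkdir -p ~/.ssh/datacrossways
touch ~/.ssh/datacrossways/key.pem   # placeholder for the downloaded key
chmod 700 ~/.ssh/datacrossways
chmod 600 ~/.ssh/datacrossways/key.pem
# Then connect (replace <public-ipv4-address> with the instance IP):
# ssh -i ~/.ssh/datacrossways/key.pem ubuntu@<public-ipv4-address>
```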
Datacrossways needs to be accessible via a dedicated domain. The easiest way is to register a domain using Route53, an AWS service. First, check if the domain is still available; if it is, you can proceed to checkout.
Then follow the registration instructions to complete the domain registration. The domain will become accessible after some time (usually a couple of minutes). Once the domain is registered you need to link your AWS instance with the domain. Under `hosted zones` (https://us-east-1.console.aws.amazon.com/route53/v2/hostedzones) select the newly created domain and add a new record.
Select `Create Record` and in the following dialogue paste the IP address of the newly created instance into the `Value` field. All other settings should be left unchanged. Make sure the record type is `A - Routes traffic to an IPv4 address and some AWS resources`. Then create the record.
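If you prefer the AWS CLI over the console, the same A record can be created with `aws route53 change-resource-record-sets` using a change batch file; the domain and IP address below are placeholders:

```json
{
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "datacrossways.org",
        "Type": "A",
        "TTL": 300,
        "ResourceRecords": [{ "Value": "3.215.0.10" }]
      }
    }
  ]
}
```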
Now it is time to create the other AWS resources. They encompass a designated user to control S3 access, an S3 bucket with specific configurations, as well as an RDS database to store metadata on stored data objects.
After creating the temporary user and the AWS instance, log into the server. From there, get Datacrossways using git:

```shell
git clone https://github.com/MaayanLab/datacrossways.git
```
Now, assuming you have generated and downloaded the OAuth information described in the section above (GoogleOAuth configuration), copy the JSON into a folder named `~/datacrossways/secrets`. You can create a new file with the information downloaded from the Google Developer Console; the file can be named any way you like. The code below is an example of how you can create this file:

```shell
mkdir ~/datacrossways/secrets
vi ~/datacrossways/secrets/google_oauth.json
```
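As an alternative to pasting the JSON into `vi`, the file can be written with a heredoc; all credential values below are placeholders for the ones in the JSON downloaded from the Google Developer Console:

```shell
mkdir -p ~/datacrossways/secrets
# Placeholder credentials; replace with your downloaded values.
cat > ~/datacrossways/secrets/google_oauth.json <<'EOF'
{
  "web": {
    "client_id": "XXXXXXXXXXXXX-xxxxxxxx.apps.googleusercontent.com",
    "client_secret": "XXXXXXXXXX-xxxxxxxxxxxxx",
    "redirect_uris": ["https://datacrossways.org/api/user/authorize?provider=google"]
  }
}
EOF
# Sanity check: the file must parse as valid JSON.
python3 -m json.tool ~/datacrossways/secrets/google_oauth.json > /dev/null && echo "valid JSON"
```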
Go into the `datacrossways` folder in the home directory and run the command below. It will ask for some required information.

```shell
~/datacrossways/setup.sh
```
Now you can run the AWS configuration script, which will create the resources. Project names should not contain commas, periods, underscores, or spaces. The bucket name is created from the project name as `<project_name>-dxw-vault`, and since bucket names are globally unique, a non-unique project name can lead to errors. So make sure the project name is unique to avoid conflicts with existing resources.
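The naming rules can be checked up front with a small shell sketch (the actual setup script may validate differently; `mydata` is an example name):

```shell
project_name="mydata"   # example project name
# Reject commas, periods, underscores, and spaces; the derived bucket
# name <project_name>-dxw-vault must also be globally unique on S3.
case "$project_name" in
  *,*|*.*|*_*|*" "*) echo "invalid project name: $project_name" ;;
  *) echo "bucket will be named: ${project_name}-dxw-vault" ;;
esac
# prints: bucket will be named: mydata-dxw-vault
```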
Warning: When this is run, all uploaded data is deleted permanently!
To remove previously created resources, run the following command and follow the on-screen instructions:

```shell
python3 ~/datacrossways/aws/aws_remove.py <project_name>
```

This script relies on a config file `~/datacrossways/secrets/aws_config_<project_name>-dxw.json` that is automatically generated when running `aws_setup.py`. The database will take more than a minute to shut down completely; the status can be seen in the RDS section of the AWS console. While the database is still shutting down, its name cannot be reused. Deleting the security group may fail while it is still linked to the RDS instance, which takes time to delete. You can rerun the script after the RDS instance is completely shut down, or remove the security group manually.
In case of an error (e.g. the `aws_config_<project_name>-dxw.json` file gets lost), the resources can easily be removed manually. The resources are in `RDS`, `IAM`, and `S3`. To delete:
- Delete user
  - Go to https://us-east-1.console.aws.amazon.com/iamv2/home#/users
  - Find user `<project_name>-dxw-user`, select the checkbox, and then `Delete` (if the temporary user is still there also remove this user)
- Delete policy
  - Go to https://us-east-1.console.aws.amazon.com/iamv2/home#/policies
  - Type `dxw` in the filter input and hit enter
  - Select the policy and under `Actions` select `Delete`
- Delete RDS database
  - Assuming the database was generated in `us-east-1`, go to https://us-east-1.console.aws.amazon.com/rds/home?region=us-east-1#databases
  - Select `<project_name>-dxw-db` and under `Actions` select `Delete`
- Delete S3 bucket
  - Assuming the bucket was generated in `us-east-1`, go to https://s3.console.aws.amazon.com/s3/buckets?region=us-east-1
  - Search for `dxw` and select `<project_name>-dxw-vault`
  - First select `Empty` and then `Delete`
- Delete security group
  - Assuming the security group was generated in `us-east-1`, go to https://us-east-1.console.aws.amazon.com/ec2/home?region=us-east-1#SecurityGroups
  - Search for `<project_name>` and select `<project_name>-dxw`
  - Under `Actions` select `Delete security groups`
  - The security group can only be deleted once the RDS instance is completely shut down. This process can take more than a minute.
The backend API and React frontend can be deployed on a local computer, mainly for development purposes. They still require the AWS resources, such as the database and S3 bucket configuration. The setup is described in detail here.
Most of the work is already done once the AWS resources have been created. The remaining steps are launching the API and frontend using docker-compose.
For development, the OAuth authentication might be problematic, especially when the frontend is developed on a different server. For this reason, a development flag can be set in the config file. It bypasses all authentication requirements and assumes a generic admin user. To modify the behavior, edit `~/datacrossways/secrets/config.json` and set the field `development` to either `true` or `false`. By default, the development status is `false`.
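Assuming `development` is a top-level field of `config.json` (its exact placement is an assumption here), the relevant fragment would look like:

```json
{
  "development": true
}
```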
Before you continue make sure you log out and back in after running setup.sh.
To start the Datacrossways service run the command below. It will ask for some additional information, namely the domain name and an email required for Let's Encrypt notifications. The domain should be entered without protocol prefix, e.g. `datacrossways.org`.

```shell
~/datacrossways/start.sh
```
The following command will stop the docker containers:

```shell
~/datacrossways/stop.sh
```
Removing the docker containers will not remove any of the persisted data in the database or the S3 bucket. If you want to permanently delete the project, first run:

```shell
cd ~/datacrossways
docker compose down
```

And then remove all the cloud resources following the steps described here.
There are many steps to deploy Datacrossways, and some can cause problems down the road and prevent a successful deployment. This section collects commonly encountered issues:
- **AWS instance is not an Ubuntu instance.** Make sure you select the Ubuntu option when launching an instance.
- **OAuth does not work (redirect not valid).** Make sure the URL uses `https`.
Even though the API and React frontend are running locally, the cloud resources are still required. To create them, please go through the steps described here first. When the S3 bucket has been created with all additional configuration, proceed to deploy the API.
First get the API code using git:

```shell
git clone https://github.com/MaayanLab/datacrossways_api
```
Then navigate to the `datacrossways_api` folder. The API requires a config file `secrets/config.json`. The configuration contains information about:
- Internal URLs (`api`, `frontend`, `redirect`)
- GoogleOAuth client credentials
- Database credentials
- AWS user credentials (Important: these are the credentials of the AWS user that has only read and write access to the newly created S3 bucket, and NOT the temporary user)
The `config.json` file can be created after setting up all AWS resources. For this, run `python3 ~/datacrossways/aws/aws_setup.py <aws_id> <aws_key> <project_name>` and follow the instructions to retrieve the Google OAuth credentials here. The JSON file from the Google Developer Console should be copied into `~/datacrossways/secrets/` (the name of the file is not important, it will be automatically detected).

```shell
python ~/datacrossways/create_config.py <project_name>
```

This will generate a file at `~/datacrossways/secrets/config.json`.
Then run:

```shell
mkdir ~/datacrossways_api/secrets
mv ~/datacrossways/secrets/config.json ~/datacrossways_api/secrets/config.json
cd ~/datacrossways_api
flask run
```

The API should now be up and running. An example `secrets/config.json` looks like this:
```json
{
  "api": {
    "url": "http://localhost:5000"
  },
  "frontend": {
    "url": "http://localhost:3000/"
  },
  "redirect": "http://localhost:5000/",
  "oauth": {
    "google": {
      "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
      "auth_uri": "https://accounts.google.com/o/oauth2/auth",
      "client_id": "XXXXXXXXXXXXX-xxxxxxxxxxxxxxxx.apps.googleusercontent.com",
      "client_secret": "XXXXXXXXXX-xxxxxxxxxxxxx",
      "javascript_origins": [
        "http://localhost:5000"
      ],
      "project_id": "xxxxxxxxxx",
      "redirect_uris": [
        "http://localhost:5000/authorize"
      ],
      "token_uri": "https://oauth2.googleapis.com/token"
    }
  },
  "aws": {
    "aws_id": "xxxxxxxxxxxxxxxx",
    "aws_key": "xxxxxxxxxxxxxx",
    "bucket_name": "unique_bucket_name",
    "region": "us-east-1"
  },
  "database": {
    "user": "xxxxxxx",
    "pass": "xxxxxxxxxxxxx",
    "server": "xxxxxxxxxx.xxxxx.rds.amazonaws.com",
    "port": "5432",
    "name": "xxxxxxxxx"
  }
}
```
The API is a Flask application and can be started using the command `flask run`.
The React frontend depends on the API, so the API should be set up first. Then get the frontend using git:

```shell
git clone https://github.com/MaayanLab/datacrossways_frontend
```

Navigate into the project folder and run `npm install --legacy-peer-deps`. To start the frontend run `npm run dev`. The frontend is currently accessed via the API port at http://localhost:5000.