This project implements data modeling and ETL processing to track users' songplay activity for Sparkify using a relational database.
- Python
- PostgreSQL
There are two sets of source files available for processing: song_data and log_data.
song_data is a set of files containing metadata about songs.
- each file contains metadata about a single song.
- JSON format.
- sample data:
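The sample snippet appears to be missing here; an illustrative record (field names assumed from the typical song_data layout, values hypothetical) might look like:

```json
{
  "num_songs": 1,
  "artist_id": "ARJIE2Y1187B994AB7",
  "artist_latitude": null,
  "artist_longitude": null,
  "artist_location": "",
  "artist_name": "Line Renaud",
  "song_id": "SOUPIRU12A6D4FA1E1",
  "title": "Der Kleine Dompfaff",
  "duration": 152.92036,
  "year": 0
}
```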
log_data is a set of files containing transactional data about the songs played in the app by various users.
- each file contains transactional data for a given day, as a list of JSON records.
- JSON format.
- sample data:
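The sample snippet appears to be missing here; an illustrative record (field names assumed from the typical log_data layout, values hypothetical) might look like:

```json
{
  "artist": "N.E.R.D. FEATURING MALICE",
  "auth": "Logged In",
  "firstName": "Jayden",
  "gender": "M",
  "itemInSession": 0,
  "lastName": "Fox",
  "length": 288.9922,
  "level": "free",
  "location": "New Orleans-Metairie, LA",
  "method": "PUT",
  "page": "NextSong",
  "sessionId": 184,
  "song": "Am I High (Feat. Malice)",
  "status": 200,
  "ts": 1541121934796,
  "userId": "101"
}
```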
Data is modeled as a STAR schema in PostgreSQL tables.
The `SPARKIFYDB` database contains the following tables:
`SONGPLAYS` is the fact table, containing all the songs played by various users of the app.
`TIME`, `SONGS`, `USERS` and `ARTISTS` are dimension tables containing metadata about the songplay data.
Below is the ER diagram which provides more details about the database tables:
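As a quick reference alongside the diagram, a minimal sketch of the fact-table DDL is shown below. Column names and types here are assumptions; the authoritative definitions live in sql_queries.py.

```python
# Hypothetical DDL sketch for the SONGPLAYS fact table.
# The real definition lives in sql_queries.py; columns are assumptions.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""
```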
- songs and artists tables are populated from the song_data source files.
- Data profiling shows that year, artist_location, artist_latitude and artist_longitude are not always present.
- An artist can have multiple songs.
- Some song names contain special characters; the source files are UTF-8 encoded, and the target database uses the same encoding to preserve data accuracy.
- users, time and songplays tables are populated from the log_data source files (except for song_id and artist_id, which are looked up from the songs and artists tables respectively).
- A user can have multiple songplays.
- Pandas DataFrames are used to hold the data from the JSON files.
- Transformations and data cleansing are done as described in the next section before inserting data.
- SQL queries are used to insert data into the tables and are invoked from the Python script using the psycopg2 package for PostgreSQL.
- A few songs do not have a year attribute in the source files; for such records the year attribute is set to NULL before adding them to the database.
- 'ts' contains a timestamp in milliseconds and is converted to a datetime before inserting into the time/songplays tables.
- A few artists do not have location, latitude or longitude attributes in the source files; for such records these fields are set to NULL before adding them to the database.
- Each song_data file contains data about a single song and is processed file by file.
- Each log_data file contains multiple songplay records and is processed file by file.
- Before adding data to the users table, the data from the DataFrame is deduplicated before inserting records. (Although there is an ON CONFLICT clause in the table definition, this is done for faster processing.)
- Data from the log file is sorted by 'ts' so that the level attribute for a user is the latest value when inserted into the tables.
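The millisecond 'ts' conversion mentioned above can be sketched with pandas (the 'ts' value here is an illustrative sample, not real data):

```python
import pandas as pd

# Log files store 'ts' as milliseconds since the Unix epoch.
df = pd.DataFrame({"ts": [1541121934796]})

# Convert to datetime before loading into the time/songplays tables.
df["start_time"] = pd.to_datetime(df["ts"], unit="ms")

print(df["start_time"].iloc[0])  # 2018-11-02 01:25:34.796000
```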
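For the missing latitude/longitude/location values, pandas reads absent JSON fields as NaN; one way to normalize them to NULL before insertion is sketched below (column names and sample values are assumptions):

```python
import numpy as np
import pandas as pd

# Illustrative artist rows; the second has no coordinates in the source.
df = pd.DataFrame({
    "artist_id": ["AR1", "AR2"],
    "artist_latitude": [35.14968, np.nan],
    "artist_longitude": [-90.04892, np.nan],
})

# Replace NaN with None so the driver inserts SQL NULL instead of NaN.
records = df.astype(object).where(df.notnull(), None).to_dict("records")
print(records[1]["artist_latitude"])  # None
```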
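The dedup-and-sort step for the users table can be sketched as follows (column names assumed from the log schema, rows are illustrative):

```python
import pandas as pd

# Two events for the same user; the later one reflects an upgraded level.
log_df = pd.DataFrame({
    "userId": ["101", "101"],
    "firstName": ["Jayden", "Jayden"],
    "level": ["free", "paid"],
    "ts": [1541121934796, 1541122073796],
})

# Sort by 'ts' so the last-seen row carries the user's latest 'level',
# then keep only that latest row per user before inserting.
users = (log_df.sort_values("ts")
               .drop_duplicates(subset="userId", keep="last"))

print(users["level"].tolist())  # ['paid']
```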
create_tables.py and sql_queries.py are used to create the required database schema as per the model and contain the insert queries for the respective tables.
etl.py processes the source JSON files, performs data transformations and cleansing, and invokes the inserts into the PostgreSQL tables.
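A hypothetical sketch of how such an insert might be invoked with psycopg2 is shown below. The query text and column list are assumptions; the real queries live in sql_queries.py. The `%s` placeholders let the driver handle quoting and convert Python None to SQL NULL.

```python
# Hypothetical insert query; the real one lives in sql_queries.py.
songplay_table_insert = """
INSERT INTO songplays (start_time, user_id, level, song_id, artist_id,
                       session_id, location, user_agent)
VALUES (%s, %s, %s, %s, %s, %s, %s, %s);
"""

def insert_songplay(cur, record):
    """Insert one songplay row given an open psycopg2 cursor and a
    tuple of values matching the column list above."""
    cur.execute(songplay_table_insert, record)
```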