Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

drop down to sqlite3 for gpkg read to avoid gdal dependencies #604

Open
jameshalgren opened this issue Feb 22, 2023 · 4 comments
Open

drop down to sqlite3 for gpkg read to avoid gdal dependencies #604

jameshalgren opened this issue Feb 22, 2023 · 4 comments

Comments

@jameshalgren
Copy link
Contributor

jameshalgren commented Feb 22, 2023

def read_geopkg(file_path):
flowpaths = gpd.read_file(file_path, layer="flowpaths")
attributes = gpd.read_file(file_path, layer="flowpath_attributes").drop('geometry', axis=1)
#merge all relevant data into a single dataframe
flowpaths = pd.merge(flowpaths, attributes, on='id')

In most cases (in every case?) for t-route, we are discarding the geometry data and only use the geo-enabled packages for convenience in reading the files where the network connectivity information is stored.

To avoid unnecessary (see note) dependencies, we could update this function to look something like the following:

## TESTED; See note below
import sqlite3
import pandas as pd

def read_geopkg(file_path):
    flowpaths = read_geopkg_layer_sqlite3(file_path, layer="flowpaths")
    attributes = read_geopkg_layer_sqlite3(file_path, layer="flowpath_attributes")
    #merge all relevant data into a single dataframe
    return pd.merge(flowpaths, attributes, on='id')

def read_geopkg_layer_sqlite3(file_path, layer):
    con = sqlite3.connect(file_path)
    return pd.read_sql_query("SELECT * from " + layer, con)
    

Note: More accurately, "troublesome" dependencies. Gdal has been causing some trouble on the arm64 builds and if we can circumvent those issues, we gain (potentially) quite a bit of performance.
Note2: The geopandas read is (I believe) lazy, so this direct sql read with the connection closing will carry a performance hit, potentially, especially for very large flowpath tables.

@jameshalgren
Copy link
Contributor Author

jameshalgren commented Feb 22, 2023

Tested this by using the old function and the new function to read a geopackage for the 03w VPU:

wget -P 03w -c https://nextgen-hydrofabric.s3.amazonaws.com/v1.2/nextgen_03W.gpkg

Then compare the outputs:

d = read_geopkg("03w/nextgen_03W.gpkg")
d2 = read_geopkg_sqlite3("03w/nextgen_03W.gpkg")
d.drop(["geometry"],axis=1) == d2.drop(["geom", "fid_x", "fid_y"],axis=1)
## Returns all True

The only difference is the geometry.

Useful comments on datacarpentry.org here.

@jameshalgren
Copy link
Contributor Author

ping @ZacharyWills

@groutr
Copy link
Contributor

groutr commented Mar 30, 2023

Your sql statement is a special case that has its own convenience function, pd.read_sql_table.

pd.read_sql_table('flowpaths', con, index_col='id')

@jameshalgren
Copy link
Contributor Author

Your sql statement is a special case that has its own convenience function, pd.read_sql_table.

pd.read_sql_table('flowpaths', con, index_col='id')

Python 3.10+

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants