Define standard necessary column names for input data. #24

Open
anujsinha3 opened this issue Jun 20, 2024 · 5 comments
@anujsinha3
Collaborator

anujsinha3 commented Jun 20, 2024

Currently, each column in the CSV file is accessed by integer index. This has the following limitations:

  1. The workflow fails if the columns are in the wrong order, because the input CSV columns are strictly positional with no flexibility.
  2. The source code becomes convoluted and difficult to comprehend.
  3. Code vectorization is difficult, and the increased use of nested loops hurts performance.

We plan to use pandas DataFrames going forward, so we need to standardize the column names that will be part of the input CSV file.
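
For illustration, here is a minimal sketch of name-based access with pandas. The sample rows are made up, and the `elapsed_s` column is a hypothetical example of a vectorized computation; neither comes from the project code.

```python
import io

import pandas as pd

# Hypothetical sample data; the values are made up for illustration.
csv_text = (
    "unix_start_t,user_ID,orig_lat,orig_long,orig_unc\n"
    "1718841600,u1,47.6062,-122.3321,10\n"
    "1718845200,u1,47.6097,-122.3331,25\n"
)

df = pd.read_csv(io.StringIO(csv_text))

# Name-based access does not depend on where the column sits in the file.
latitudes = df["orig_lat"]

# A vectorized operation replaces an explicit loop over rows.
# "elapsed_s" is a hypothetical derived column, not part of the proposal.
df["elapsed_s"] = df["unix_start_t"] - df["unix_start_t"].min()
```

Because every lookup goes through the header, reordering the columns in the CSV would leave this code unchanged.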

Existing column names (please confirm whether these are the standard names, or whether any changes are required):
"unix_start_t",
"user_ID",
"orig_lat",
"orig_long",
"orig_unc",
"stay_lat",
"stay_long",
"stay_unc",
"stay_dur",
"stay_ind",
"human_start_t"

@gracejia513
Collaborator

Hi Anuj, I believe these columns are not part of the input file:
"stay_lat",
"stay_long",
"stay_unc",
"stay_dur",
"stay_ind",
"human_start_t"

However, we can use them as standardized column names for the output file.
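
A rough sketch of how that split could be written down and checked, assuming the remaining five names from the list above are the input columns. The constant and function names here are hypothetical, not taken from the project code.

```python
import pandas as pd

# Input columns: the names from the original list that were not flagged
# as output-only. Treating these five as required is an assumption.
INPUT_COLUMNS = ["unix_start_t", "user_ID", "orig_lat", "orig_long", "orig_unc"]

# Columns suggested in this thread for the output file only.
OUTPUT_ONLY_COLUMNS = [
    "stay_lat", "stay_long", "stay_unc",
    "stay_dur", "stay_ind", "human_start_t",
]


def missing_input_columns(df: pd.DataFrame) -> list:
    """Return the input column names that are absent from df."""
    return [name for name in INPUT_COLUMNS if name not in df.columns]


# Hypothetical frame that lacks the uncertainty column.
sample = pd.DataFrame(
    {"unix_start_t": [1718841600], "user_ID": ["u1"],
     "orig_lat": [47.6062], "orig_long": [-122.3321]}
)
missing = missing_input_columns(sample)
```

Checking for missing names up front lets the workflow report a clear error instead of failing later on a lookup.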

@Anurag19101996

@gracejia513 Please confirm once.

@gracejia513
Collaborator

@anujsinha3 did you have a standardized column for datetime? Is it UNIX_START_T?

@anujsinha3
Collaborator Author

Column names have been standardized to snake_case. The order of the columns does NOT matter in the CSV file.

Column names are case-insensitive, but underscores ('_') are required where shown.

A few examples:

'orig_lat', 'orig_long', 'unix_start_t', 'user_id'
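
One way case-insensitive, order-independent headers could be handled is to lower-case the header row right after loading. This is only a sketch: `normalize_columns` is a hypothetical helper and the sample row is made up, so the actual implementation in the repository may differ.

```python
import io

import pandas as pd


def normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy whose column names are stripped and lower-cased."""
    return df.rename(columns=lambda name: name.strip().lower())


# Hypothetical file with mixed-case headers in a non-standard order.
csv_text = (
    "User_ID,ORIG_LONG,orig_lat,Unix_Start_T\n"
    "u1,-122.3321,47.6062,1718841600\n"
)

df = normalize_columns(pd.read_csv(io.StringIO(csv_text)))

# After normalization the standardized snake_case names always work.
first_lat = df["orig_lat"].iloc[0]
```

Underscores are left untouched, so a header such as `origlat` would still not match `orig_lat`.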

3 participants