From 14e05504a70758e95e3615d2343075a0ec4c503f Mon Sep 17 00:00:00 2001
From: Mani Sarkar
Date: Mon, 12 Oct 2020 21:02:08 +0100
Subject: [PATCH 01/18] Adding Visualization sub-topic to Data preparation guide. Changes to create placeholders for links and contents to come across the other guides

---
 .../collect-and-prepare-data/data-preparation.md | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/content/purgatorio/collect-and-prepare-data/data-preparation.md b/content/purgatorio/collect-and-prepare-data/data-preparation.md
index 2960e0f0..cb9a062a 100644
--- a/content/purgatorio/collect-and-prepare-data/data-preparation.md
+++ b/content/purgatorio/collect-and-prepare-data/data-preparation.md
@@ -1,6 +1,6 @@
---
title: Data Preparation
-author: clone95
+author: clone95, neomatrix369
description: The purpose of this guide is to show you the different preprocessing steps you need to apply to your data before feeding them to Machine Learning models.
---

@@ -29,6 +29,7 @@ The purpose of this guide is to show you the importance of these steps, mostly
- [Data Discretization](#Data-Discretization)
- [Feature Scaling](#Feature-Scaling)
- [Data Cleaning Tools](#Data-Cleaning-Tools)
+- [Visualization](#Visualization)
- [Merge Data Sets and Integration](#Merge-Data-Sets-and-Integration)
- [Sanity Check](#Sanity-Check)
- [Automate These Boring Stuffs!](#Automate-These-Boring-Stuffs!)
@@ -156,6 +157,13 @@ _Best practices and exercises:_ [1](https://www.kaggle.com/jfeng1023/data-cleani
### Data Cleaning Tools
You're not going to hunt tigers without a rifle! You have a ton of tools out there that will help you during the data cleaning process; the one I want to suggest is [this](https://www.analyticsindiamag.com/10-best-data-cleaning-tools-get-data/) open source tool from Google. Check [here](https://www.quora.com/What-are-the-best-open-source-data-cleansing-tools-software-available) for more.

+### Visualization
+
+(visualization during data preparation process: before, during and after)
+.
+.
+.
+
### Merge Data Sets and Integration
Now that you hopefully have been successful in your data cleaning process, you can merge data from different sources to create big [de-normalized](https://www.researchgate.net/post/When_and_why_do_we_need_data_normalization_in_data_mining_algorithms) data tables, ready to be explored and consumed. [This](https://www.quora.com/Is-data-warehouse-normalized-or-denormalized-Why) is why.

From a763b5194fdba518649110a540c08ea635b00e2e Mon Sep 17 00:00:00 2001
From: Mani Sarkar
Date: Sat, 7 Nov 2020 20:31:32 +0000
Subject: [PATCH 02/18] Data preparation: reorganise the topics and sub-topics including the ToC

---
 .../data-preparation.md | 128 +++++++++---------
 1 file changed, 64 insertions(+), 64 deletions(-)

diff --git a/content/purgatorio/collect-and-prepare-data/data-preparation.md b/content/purgatorio/collect-and-prepare-data/data-preparation.md
index cb9a062a..f9810f74 100644
--- a/content/purgatorio/collect-and-prepare-data/data-preparation.md
+++ b/content/purgatorio/collect-and-prepare-data/data-preparation.md
@@ -9,13 +9,6 @@
Real world data is almost always messy or unstructured, and most of a Data Scientist's time is spent on data preprocessing (or data cleaning), before visualizing the data or feeding it to Machine Learning models.
The purpose of this guide is to show you the importance of these steps, mostly about text data, but guides about cleaning each kind of data you can encounter will be listed.
# Index
-- [Data Preprocessing](#Data-Preprocessing)
-- [Don't Joke With Data](#Don't-Joke-With-Data)
-- [Business Questions](#Business-Questions)
-- [Data Profiling](#Data-Profiling)
-- [Who To Leave Behind](#Who-To-Leave-Behind)
-- [Start Small](#Start-small)
-- [The Toolkit](#The-Toolkit)
- [Data Cleaning](#Data-Cleaning)
  - [Get Rid of Extra Spaces](#Get-Rid-of-Extra-Spaces)
  - [Select and Treat All Blank Cells](#Select-and-Treat-All-Blank-Cells)
  - [Convert Values Type](#Convert-Values-Type)
  - [Remove Duplicates](#Remove-Duplicates)
  - [Change Text to Lower/Upper Case](#Change-Text-to-Lower/Upper-Case)
  - [Spell Check](#Spell-Check)
  - [Dealing with Special Characters](#Dealing-with-Special-Characters)
  - [Normalizing Dates](#Normalizing-Dates)
  - [Verification To Enrich Data](#Verification-To-Enrich-Data)
  - [Data Discretization](#Data-Discretization)
  - [Feature Scaling](#Feature-Scaling)
  - [Data Cleaning Tools](#Data-Cleaning-Tools)
+- [Data Preprocessing / Data wrangling / Data manipulation](#Data-Preprocessing)
+- [Data Profiling](#Data-Profiling)
+- [Don't Joke With Data](#Don't-Joke-With-Data)
+- [Business Questions](#Business-Questions)
+- [Who To Leave Behind](#Who-To-Leave-Behind)
+- [Start Small](#Start-small)
+- [The Toolkit](#The-Toolkit)
- [Visualization](#Visualization)
- [Merge Data Sets and Integration](#Merge-Data-Sets-and-Integration)
- [Sanity Check](#Sanity-Check)
- [Automate These Boring Stuffs!](#Automate-These-Boring-Stuffs!)

# Data Preprocessing

Data preprocessing (also known as Data Preparation, but "Preprocessing" sounds more like magic) is the [iterative process](http://www.jsoftware.us/vol12/306-JSW15277.pdf) of gathering, combining, structuring and organizing data so it can be analyzed as part of data visualization, analytics, and machine learning applications.

[Real-world data](https://www.quanticate.com/blog/real-world-data-analysis-in-clinical-trials) is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues.

It's the [core ability](https://blogs.sas.com/content/hiddeninsights/2017/11/30/analytical-data-preparation-important/) of any data scientist or data engineer, and you must _be able to manipulate, clean, and structure_ your data during everyday work (besides expecting that this will take up most of your [daily time](https://www.infoworld.com/article/3228245/the-80-20-data-science-dilemma.html)!).

There are a lot of different data types out there, and they deserve [different treatments](http://blog.appliedinformaticsinc.com/data-mining-challenges-in-data-cleaning/).

As usual, the structure I've planned to get you started consists of having a [general overview](https://searchbusinessanalytics.techtarget.com/definition/data-preparation), and then diving deep into each data processing situation you can encounter.

[Here](https://towardsdatascience.com/data-pre-processing-techniques-you-should-know-8954662716d6) you have a gentle end-to-end panoramic view of the entire process.

**Let's Start!**

### Don't Joke With Data
First, [**data is King**](https://www.edq.com/glossary/data-quality-importance/). In the [data-driven epoch](https://www.venturi-group.com/qa-with-helen-mannion/), having [data quality issues](https://www.ringlead.com/blog/7-common-data-quality-issues/) means losing tremendous amounts of value for a company, in the present and in the future. So, respect your King and care a lot about him. The most immediate way to do this is to plan and [work hard](https://nektardata.com/high-quality-data/) to _produce_ good quality data.
Your goal is to plan a data collection infrastructure that fixes problems beforehand.
This means caring a lot about planning your database schemas well (do I need [third-normal form](https://social.technet.microsoft.com/Forums/Lync/en-US/7bf4ca30-a1bc-415d-97e6-ce0ac3137b53/normalized-3nf-vs-denormalizedstar-schema-data-warehouse-?forum=sqldatawarehousing) or not?), how you collect data from sensors (physical or conceptual), and so on. These are problems if you're building a system up from the ground, but most of the time you're going to face real-world problems that someone wants to solve with [_already available_](https://www.wired.com/insights/2013/05/more-data-more-problems-is-big-data-always-right/) data.

### Business Questions
Asking the [right business questions](https://www.datapine.com/blog/data-analysis-questions/) is hard, but it has the [biggest impact](https://towardsdatascience.com/start-your-data-exploration-with-questions-2f1d42cff29e) on your performance in solving a particular problem. Remember, you want to [solve a problem](http://www.informit.com/articles/article.aspx?p=2271188&seqNum=2), not to create new ones!

### Data Profiling
According to the (cold as ice) [Wikipedia definition](https://en.wikipedia.org/wiki/Data_profiling): "Data profiling is the process of examining the data available from an existing information source (e.g. a database or a file) and collecting statistics and informative data summaries."\
So Wikipedia is subtly suggesting that we take a coffee with the data.

During this informal meeting, ask the data questions like:
- which business problem are you meant to solve? (what is important, and what is not)
- how have you been collected (with noise, missing values...)?
- how many friends of yours are there and where can I find them? (data dimensions and retrieving from storages)

Eventually, you may find the data too quiet, maybe they're just shy! \
Anyway, you're going to [ask these questions to the business user](https://business-analysis-excellence.com/business-requirements-meeting/)!

_Best practices and exercises:_ [1](https://www.iqint.org/idq2013/presentations/downloads/di_loreto_data_profiling_tutorial_monday_am.pdf), [2](https://community.alteryx.com/t5/Alteryx-Designer-Discussions/Data-profiling-tutorials-use-cases-and-exercise/td-p/145347)

### Who To Leave Behind
During the data profiling process, it's common to realize that some of your data are [useless](https://ambisense.net/why-useless-data-is-worse-than-no-data/). Your data may have too much noise or be partial, and most likely you don't need all of them to answer your business problems.
[To drop or not to drop, the Dilemma](https://www.theanalysisfactor.com/outliers-to-drop-or-not-to-drop/).
Each time you're facing a data-related problem, try to understand what data you need and what you don't - that is, for each piece of information, ask yourself (and ask the _business user_):
- How is this data going to help me?
- Is it possible to use them, reducing noise or missing values?
- Considering the benefits/costs of the preparation process versus the business value created, is this data worth it?

### Start Small
It's stupid to handle GBs of data each time you want to try a data preparation step. Just use [small subsets](https://sdtimes.com/bi/data-gets-big-best-practices-data-preparation-scale/) of the data (but take care that they are representative and you catch all the problems). Remember, if you want to experiment with string cleaning, you don't need to launch your script on 10M rows.
-

### The Toolkit
The tools we're gonna use are Python3 and its [Pandas library](https://pandas.pydata.org/), the de-facto standard for manipulating datasets.
The heavy lifting here is done by the [DataFrame class](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html), which comes with a bunch of useful functions for your daily data tasks.
Hopefully, you already know Python; if not, start from there (do the steps I suggest in the ML guide requirements), and then take this [Beginner Pandas tutorial](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html). Don't worry if some ideas are not totally clear now, but try to get the big picture of the common [Pandas operations](https://www.analyticsvidhya.com/blog/2016/01/12-pandas-techniques-python-data-manipulation/).

_Best practices and exercises:_ [1](https://github.com/guipsamora/pandas_exercises), [2](https://www.w3resource.com/python-exercises/pandas/index.php), [3](https://www.machinelearningplus.com/python/101-pandas-exercises-python/), [4](https://www.kaggle.com/pistak/pandas-tutorial-with-interactive-exercises), [5](http://disi.unitn.it/~teso/courses/sciprog/python_pandas_exercises.html)

### Data Cleaning
[Data cleaning](https://en.wikipedia.org/wiki/Data_cleansing) is the general process of taking your data, once you have a clear big picture of them, and carrying out the actual work of replacing characters, dropping incomplete rows, filling missing values and so forth. In the next sections, we'll explore all the common data cleaning situations.
@@ -157,6 +102,62 @@ _Best practices and exercises:_ [1](https://www.kaggle.com/jfeng1023/data-cleani
### Data Cleaning Tools
You're not going to hunt tigers without a rifle! You have a ton of tools out there that will help you during the data cleaning process; the one I want to suggest is [this](https://www.analyticsindiamag.com/10-best-data-cleaning-tools-get-data/) open source tool from Google. Check [here](https://www.quora.com/What-are-the-best-open-source-data-cleansing-tools-software-available) for more.

+# Data Preprocessing / Data wrangling / Data manipulation
+
+Data preprocessing (also known as Data Preparation, but "Preprocessing" sounds more like magic) is the [iterative process](http://www.jsoftware.us/vol12/306-JSW15277.pdf) of gathering, combining, structuring and organizing data so it can be analyzed as part of data visualization, analytics, and machine learning applications.
+
+[Real-world data](https://www.quanticate.com/blog/real-world-data-analysis-in-clinical-trials) is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues.
+
+It's the [core ability](https://blogs.sas.com/content/hiddeninsights/2017/11/30/analytical-data-preparation-important/) of any data scientist or data engineer, and you must _be able to manipulate, clean, and structure_ your data during everyday work (besides expecting that this will take up most of your [daily time](https://www.infoworld.com/article/3228245/the-80-20-data-science-dilemma.html)!).
+
+There are a lot of different data types out there, and they deserve [different treatments](http://blog.appliedinformaticsinc.com/data-mining-challenges-in-data-cleaning/).
+

As usual, the structure I've planned to get you started consists of having a [general overview](https://searchbusinessanalytics.techtarget.com/definition/data-preparation), and then diving deep into each data processing situation you can encounter.

[Here](https://towardsdatascience.com/data-pre-processing-techniques-you-should-know-8954662716d6) you have a gentle end-to-end panoramic view of the entire process.

### Data Profiling
According to the (cold as ice) [Wikipedia definition](https://en.wikipedia.org/wiki/Data_profiling): "Data profiling is the process of examining the data available from an existing information source (e.g. a database or a file) and collecting statistics and informative data summaries."\
So Wikipedia is subtly suggesting that we take a coffee with the data.

During this informal meeting, ask the data questions like:
- which business problem are you meant to solve? (what is important, and what is not)
- how have you been collected (with noise, missing values...)?
- how many friends of yours are there and where can I find them? (data dimensions and retrieving from storages)

Eventually, you may find the data too quiet, maybe they're just shy! \
Anyway, you're going to [ask these questions to the business user](https://business-analysis-excellence.com/business-requirements-meeting/)!

_Best practices and exercises:_ [1](https://www.iqint.org/idq2013/presentations/downloads/di_loreto_data_profiling_tutorial_monday_am.pdf), [2](https://community.alteryx.com/t5/Alteryx-Designer-Discussions/Data-profiling-tutorials-use-cases-and-exercise/td-p/145347)


**Let's Start!**

### Don't Joke With Data
First, [**data is King**](https://www.edq.com/glossary/data-quality-importance/). In the [data-driven epoch](https://www.venturi-group.com/qa-with-helen-mannion/), having [data quality issues](https://www.ringlead.com/blog/7-common-data-quality-issues/) means losing tremendous amounts of value for a company, in the present and in the future. So, respect your King and care a lot about him. The most immediate way to do this is to plan and [work hard](https://nektardata.com/high-quality-data/) to _produce_ good quality data.
Your goal is to plan a data collection infrastructure that fixes problems beforehand. This means caring a lot about planning your database schemas well (do I need [third-normal form](https://social.technet.microsoft.com/Forums/Lync/en-US/7bf4ca30-a1bc-415d-97e6-ce0ac3137b53/normalized-3nf-vs-denormalizedstar-schema-data-warehouse-?forum=sqldatawarehousing) or not?), how you collect data from sensors (physical or conceptual), and so on. These are problems if you're building a system up from the ground, but most of the time you're going to face real-world problems that someone wants to solve with [_already available_](https://www.wired.com/insights/2013/05/more-data-more-problems-is-big-data-always-right/) data.

### Business Questions
Asking the [right business questions](https://www.datapine.com/blog/data-analysis-questions/) is hard, but it has the [biggest impact](https://towardsdatascience.com/start-your-data-exploration-with-questions-2f1d42cff29e) on your performance in solving a particular problem. Remember, you want to [solve a problem](http://www.informit.com/articles/article.aspx?p=2271188&seqNum=2), not to create new ones!

### Who To Leave Behind
During the data profiling process, it's common to realize that some of your data are [useless](https://ambisense.net/why-useless-data-is-worse-than-no-data/).
Your data may have too much noise or be partial, and most likely you don't need all of them to answer your business problems.
[To drop or not to drop, the Dilemma](https://www.theanalysisfactor.com/outliers-to-drop-or-not-to-drop/).
Each time you're facing a data-related problem, try to understand what data you need and what you don't - that is, for each piece of information, ask yourself (and ask the _business user_):
- How is this data going to help me?
- Is it possible to use them, reducing noise or missing values?
- Considering the benefits/costs of the preparation process versus the business value created, is this data worth it?

### Start Small
It's stupid to handle GBs of data each time you want to try a data preparation step. Just use [small subsets](https://sdtimes.com/bi/data-gets-big-best-practices-data-preparation-scale/) of the data (but take care that they are representative and you catch all the problems). Remember, if you want to experiment with string cleaning, you don't need to launch your script on 10M rows.

### The Toolkit
The tools we're gonna use are Python3 and its [Pandas library](https://pandas.pydata.org/), the de-facto standard for manipulating datasets.
The heavy lifting here is done by the [DataFrame class](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html), which comes with a bunch of useful functions for your daily data tasks.
Hopefully, you already know Python; if not, start from there (do the steps I suggest in the ML guide requirements), and then take this [Beginner Pandas tutorial](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html). Don't worry if some ideas are not totally clear now, but try to get the big picture of the common [Pandas operations](https://www.analyticsvidhya.com/blog/2016/01/12-pandas-techniques-python-data-manipulation/).

_Best practices and exercises:_ [1](https://github.com/guipsamora/pandas_exercises), [2](https://www.w3resource.com/python-exercises/pandas/index.php), [3](https://www.machinelearningplus.com/python/101-pandas-exercises-python/), [4](https://www.kaggle.com/pistak/pandas-tutorial-with-interactive-exercises), [5](http://disi.unitn.it/~teso/courses/sciprog/python_pandas_exercises.html)

### Visualization

(visualization during data preparation process: before, during and after)
.
.
.

### Merge Data Sets and Integration
Now that you hopefully have been successful in your data cleaning process, you can merge data from different sources to create big [de-normalized](https://www.researchgate.net/post/When_and_why_do_we_need_data_normalization_in_data_mining_algorithms) data tables, ready to be explored and consumed. [This](https://www.quora.com/Is-data-warehouse-normalized-or-denormalized-Why) is why.

@@ -181,5 +182,4 @@ As I told you at the very beginning, the data preprocessing process can take a l
_Best practices and exercises:_ [1](https://blog.panoply.io/5-data-preparation-tools-1-automated-data-platform), [2](https://www.quora.com/How-do-I-make-an-automated-data-cleaning-in-Python-for-ML-Is-there-a-trick-for-that), [3](https://www.quora.com/Is-there-a-python-package-to-automate-data-preparation-in-machine-learning), [4](https://machinelearningmastery.com/prepare-data-machine-learning-python-scikit-learn/), [5](https://www.analyticsvidhya.com/blog/2018/10/rapidminer-data-preparation-machine-learning/)

### Conclusions
-Now you're ready to take your data and play with them in a variety of ways, and you have a nice panoramic overview of the entire process. You can refer to this page when you clean data, to check if you're not missing some steps.
Remember that probably each situation requires a subset of these steps.
\ No newline at end of file

From 2e6905077fc80069657216656a8f3f3298314049 Mon Sep 17 00:00:00 2001
From: Mani Sarkar
Date: Sat, 7 Nov 2020 21:16:22 +0000
Subject: [PATCH 03/18] Data preparation: moving topics out of Data Cleaning into Data Preprocessing

---
 .../data-preparation.md | 36 +++++++++----------
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/content/purgatorio/collect-and-prepare-data/data-preparation.md b/content/purgatorio/collect-and-prepare-data/data-preparation.md
index f9810f74..a48a1dff 100644
--- a/content/purgatorio/collect-and-prepare-data/data-preparation.md
+++ b/content/purgatorio/collect-and-prepare-data/data-preparation.md
@@ -14,15 +14,15 @@ The purpose of this guide is to show you the importance of these steps, mostly
- [Select and Treat All Blank Cells](#Select-and-Treat-All-Blank-Cells)
- [Convert Values Type](#Convert-Values-Type)
- [Remove Duplicates](#Remove-Duplicates)
-  - [Change Text to Lower/Upper Case](#Change-Text-to-Lower/Upper-Case)
- [Spell Check](#Spell-Check)
- [Dealing with Special Characters](#Dealing-with-Special-Characters)
-  - [Normalizing Dates](#Normalizing-Dates)
- [Verification To Enrich Data](#Verification-To-Enrich-Data)
- [Data Discretization](#Data-Discretization)
-  - [Feature Scaling](#Feature-Scaling)
- [Data Cleaning Tools](#Data-Cleaning-Tools)
- [Data Preprocessing / Data wrangling / Data manipulation](#Data-Preprocessing)
+  - [Change Text to Lower/Upper Case](#Change-Text-to-Lower/Upper-Case)
+  - [Normalizing Dates](#Normalizing-Dates)
+  - [Feature Scaling](#Feature-Scaling)
- [Data Profiling](#Data-Profiling)
- [Don't Joke With Data](#Don't-Joke-With-Data)
- [Business Questions](#Business-Questions)
- [Who To Leave Behind](#Who-To-Leave-Behind)
- [Start Small](#Start-small)
- [The Toolkit](#The-Toolkit)
- [Visualization](#Visualization)
- [Merge Data Sets and Integration](#Merge-Data-Sets-and-Integration)
- [Sanity Check](#Sanity-Check)
- [Automate These Boring Stuffs!](#Automate-These-Boring-Stuffs!)

### Remove Duplicates
You don't want to duplicate data, they are both noise and occupy space! Learn [how to handle them simply](https://www.geeksforgeeks.org/python-pandas-dataframe-drop_duplicates/) with Pandas.

-### Change Text to Lower/Upper Case
-You want to _Capitalize_ names, or maybe make them uniform (some people can enter data with or without capital letters!). Check [here](https://www.geeksforgeeks.org/python-pandas-series-str-lower-upper-and-title/) for the Pandas way to do it.
-
### Spell Check
You want to correct wrong words, for the sake of evenness. Check [here](https://www.tutorialspoint.com/python/python_spelling_check.htm) for a good Python module to do it. Also, this is a good starting point to [implement it](https://stackoverflow.com/questions/46409475/spell-checker-in-pandas).

_Best practices and exercises:_ [1](https://stackoverflow.com/questions/7315114/spell-check-program-in-python), [2](https://norvig.com/spell-correct.html), [3](https://github.com/garytse89/Python-Exercises/tree/master/autoCorrect)

UTF-encoding is the standard to follow, but remember that not everyone follows the rules.

_Best practices and exercises:_ [1](https://www.w3resource.com/python-exercises/python-basic-exercise-92.php), [2](https://stackoverflow.com/questions/22518703/escape-sequences-exercise-in-python?rq=1), [3](https://learnpythonthehardway.org/book/ex2.html)

-### Normalizing Dates
-I think there could be one hundred ways to write down a date. You need to decide your format and make them uniform across your dataset, and [here](https://medium.com/jbennetcodes/dealing-with-datetimes-like-a-pro-in-pandas-b80d3d808a7f) you learn how to do it.
_Best practices and exercises:_ [1](https://www.w3resource.com/python-exercises/python-conditional-exercise-41.php), [2](https://www.w3resource.com/python-exercises/date-time-exercise/), [3](https://www.kaggle.com/anezka/data-cleaning-challenge-parsing-dates)

### Verification to enrich data
Sometimes it can be useful to engineer some data, for example: suppose you're dealing with [e-commerce data](https://www.edataindia.com/why-data-cleansing-is-important/), and you have the prices of each object sold. You may want to add a new column in your dataset, with a label carrying handy information like a Price_level [low, medium, high] based on upper and lower bounds you can decide. This is really simple in Pandas, check [here](https://stackoverflow.com/questions/19913659/pandas-conditional-creation-of-a-series-dataframe-column). Another example is to add a Gender column (M, F) to easily explore data and gain insights in a customers dataset.

_Best practices and exercises:_ [1](http://www.inweb.org.br/w3c/dataenrichment/), [2](https://solutionsreview.com/data-integration/best-practices-for-data-enrichment-after-etl/), [3](http://www.inweb.org.br/w3c/dataenrichment/)

### Data Discretization
Many Machine Learning and Data Analysis methods cannot handle continuous data, and dealing with them can be computationally prohibitive. [Here](https://www.youtube.com/watch?v=TF3_6lwITQg) you find a good video explaining why and how you need to discretize data.

_Best practices and exercises:_ [1](https://www.researchgate.net/post/What_are_the_best_methods_for_discretization_of_continuous_features), [2](https://towardsdatascience.com/discretisation-using-decision-trees-21910483fa4b), [3](https://docs.microsoft.com/en-us/sql/analysis-services/data-mining/discretization-methods-data-mining)

### Feature Scaling
Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step. You can find a serious tutorial about this fundamental step among the exercises below.

_Best practices and exercises:_ [1](https://www.kaggle.com/jfeng1023/data-cleaning-challenge-scale-and-normalize-data), [2](https://www.quora.com/When-should-you-perform-feature-scaling-and-mean-normalization-on-the-given-data-What-are-the-advantages-of-these-techniques), [3](https://www.quora.com/When-do-I-have-to-do-feature-scaling-in-machine-learning)

### Data Cleaning Tools
You're not going to hunt tigers without a rifle! You have a ton of tools out there that will help you during the data cleaning process; the one I want to suggest is [this](https://www.analyticsindiamag.com/10-best-data-cleaning-tools-get-data/) open source tool from Google. Check [here](https://www.quora.com/What-are-the-best-open-source-data-cleansing-tools-software-available) for more.

# Data Preprocessing / Data wrangling / Data manipulation

Data preprocessing (also known as Data Preparation, but "Preprocessing" sounds more like magic) is the [iterative process](http://www.jsoftware.us/vol12/306-JSW15277.pdf) of gathering, combining, structuring and organizing data so it can be analyzed as part of data visualization, analytics, and machine learning applications.

[Real-world data](https://www.quanticate.com/blog/real-world-data-analysis-in-clinical-trials) is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues.

It's the [core ability](https://blogs.sas.com/content/hiddeninsights/2017/11/30/analytical-data-preparation-important/) of any data scientist or data engineer, and you must _be able to manipulate, clean, and structure_ your data during everyday work (besides expecting that this will take up most of your [daily time](https://www.infoworld.com/article/3228245/the-80-20-data-science-dilemma.html)!).

There are a lot of different data types out there, and they deserve [different treatments](http://blog.appliedinformaticsinc.com/data-mining-challenges-in-data-cleaning/).

As usual, the structure I've planned to get you started consists of having a [general overview](https://searchbusinessanalytics.techtarget.com/definition/data-preparation), and then diving deep into each data processing situation you can encounter.

[Here](https://towardsdatascience.com/data-pre-processing-techniques-you-should-know-8954662716d6) you have a gentle end-to-end panoramic view of the entire process.

### Change Text to Lower/Upper Case
You want to _Capitalize_ names, or maybe make them uniform (some people can enter data with or without capital letters!). Check [here](https://www.geeksforgeeks.org/python-pandas-series-str-lower-upper-and-title/) for the Pandas way to do it.

### Normalizing Dates
I think there could be one hundred ways to write down a date.
You need to decide your format and make them uniform across your dataset, and [here](https://medium.com/jbennetcodes/dealing-with-datetimes-like-a-pro-in-pandas-b80d3d808a7f) you learn how to do it.

_Best practices and exercises:_ [1](https://www.w3resource.com/python-exercises/python-conditional-exercise-41.php), [2](https://www.w3resource.com/python-exercises/date-time-exercise/), [3](https://www.kaggle.com/anezka/data-cleaning-challenge-parsing-dates)

### Feature Scaling
Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step. You can find a serious tutorial about this fundamental step among the exercises below.

_Best practices and exercises:_ [1](https://www.kaggle.com/jfeng1023/data-cleaning-challenge-scale-and-normalize-data), [2](https://www.quora.com/When-should-you-perform-feature-scaling-and-mean-normalization-on-the-given-data-What-are-the-advantages-of-these-techniques), [3](https://www.quora.com/When-do-I-have-to-do-feature-scaling-in-machine-learning)

# Data Profiling
According to the (cold as ice) [Wikipedia definition](https://en.wikipedia.org/wiki/Data_profiling): "Data profiling is the process of examining the data available from an existing information source (e.g. a database or a file) and collecting statistics and informative data summaries."\
So Wikipedia is subtly suggesting that we take a coffee with the data.

From 026b7787fefd26404206c74736f5af285f94d2f2 Mon Sep 17 00:00:00 2001
From: Mani Sarkar
Date: Wed, 11 Nov 2020 18:41:45 +0000
Subject: [PATCH 04/18] Reorganising topics and sub-topics under Data Preparation

---
 .../data-preparation.md | 61 +++++++++++++++++++
 1 file changed, 61 insertions(+)

diff --git a/content/purgatorio/collect-and-prepare-data/data-preparation.md b/content/purgatorio/collect-and-prepare-data/data-preparation.md
index a48a1dff..123a0a93 100644
--- a/content/purgatorio/collect-and-prepare-data/data-preparation.md
+++ b/content/purgatorio/collect-and-prepare-data/data-preparation.md
@@ -9,6 +9,10 @@
The purpose of this guide is to show you the importance of these steps, mostly about text data, but guides about cleaning each kind of data you can encounter will be listed.

# Index
- [Business Questions](#Business-Questions)
- [Start Small](#Start-small)
- [Data Preprocessing](#Data-Preprocessing)
- [Data Profiling](#Data-Profiling)
- [Data Cleaning](#Data-Cleaning)
  - [Get Rid of Extra Spaces](#Get-Rid-of-Extra-Spaces)
  - [Select and Treat All Blank Cells](#Select-and-Treat-All-Blank-Cells)
  - [Convert Values Type](#Convert-Values-Type)
  - [Remove Duplicates](#Remove-Duplicates)
  - [Spell Check](#Spell-Check)
  - [Dealing with Special Characters](#Dealing-with-Special-Characters)
  - [Verification To Enrich Data](#Verification-To-Enrich-Data)
  - [Data Discretization](#Data-Discretization)
  - [Data Cleaning Tools](#Data-Cleaning-Tools)
- [Data Preprocessing / Data wrangling / Data manipulation](#Data-Preprocessing)
  - [Change Text to Lower/Upper Case](#Change-Text-to-Lower/Upper-Case)
  - [Normalizing Dates](#Normalizing-Dates)
  - [Feature Scaling](#Feature-Scaling)
- [Data Profiling](#Data-Profiling)
- [Visualization](#Visualization)
- [Merge Data Sets and Integration](#Merge-Data-Sets-and-Integration)
- [Sanity Check](#Sanity-Check)
- [Automate These Boring Stuffs!](#Automate-These-Boring-Stuffs!)
- [Don't Joke With Data](#Don't-Joke-With-Data)
- [Who To Leave Behind](#Who-To-Leave-Behind)
- [The Toolkit](#The-Toolkit)

**Let's Start!**

### Business Questions
Asking the [right business questions](https://www.datapine.com/blog/data-analysis-questions/) is hard, but it has the [biggest impact](https://towardsdatascience.com/start-your-data-exploration-with-questions-2f1d42cff29e) on your performance in solving a particular problem. Remember, you want to [solve a problem](http://www.informit.com/articles/article.aspx?p=2271188&seqNum=2), not to create new ones!

### Start Small
It's stupid to handle GBs of data each time you want to try a data preparation step. Just use [small subsets](https://sdtimes.com/bi/data-gets-big-best-practices-data-preparation-scale/) of the data (but take care that they are representative and you catch all the problems). Remember, if you want to experiment with string cleaning, you don't need to launch your script on 10M rows.

# Data Preprocessing

Data preprocessing (also known as Data Preparation, but "Preprocessing" sounds more like magic) is the [iterative process](http://www.jsoftware.us/vol12/306-JSW15277.pdf) of gathering, combining, structuring and organizing data so it can be analyzed as part of data visualization, analytics, and machine learning applications.

[Real-world data](https://www.quanticate.com/blog/real-world-data-analysis-in-clinical-trials) is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues.

It's the [core ability](https://blogs.sas.com/content/hiddeninsights/2017/11/30/analytical-data-preparation-important/) of any data scientist or data engineer, and you must _be able to manipulate, clean, and structure_ your data during everyday work (besides expecting that this will take up most of your [daily time](https://www.infoworld.com/article/3228245/the-80-20-data-science-dilemma.html)!).

There are a lot of different data types out there, and they deserve [different treatments](http://blog.appliedinformaticsinc.com/data-mining-challenges-in-data-cleaning/).

As usual, the structure I've planned to get you started consists of having a [general overview](https://searchbusinessanalytics.techtarget.com/definition/data-preparation), and then diving deep into each data processing situation you can encounter.

[Here](https://towardsdatascience.com/data-pre-processing-techniques-you-should-know-8954662716d6) you have a gentle end-to-end panoramic view of the entire process.

### Data Profiling
According to the (cold as ice) [Wikipedia definition](https://en.wikipedia.org/wiki/Data_profiling): "Data profiling is the process of examining the data available from an existing information source (e.g. a database or a file) and collecting statistics and informative data summaries."\
So Wikipedia is subtly suggesting that we take a coffee with the data.

During this informal meeting, ask the data questions like:
- which business problem are you meant to solve? (what is important, and what is not)
- how have you been collected (with noise, missing values...)?
- how many friends of yours are there and where can I find them? (data dimensions and retrieving from storages)

Eventually, you may find the data too quiet, maybe they're just shy! \
Anyway, you're going to [ask these questions to the business user](https://business-analysis-excellence.com/business-requirements-meeting/)!
_Best practices and exercises:_ [1](https://www.iqint.org/idq2013/presentations/downloads/di_loreto_data_profiling_tutorial_monday_am.pdf), [2](https://community.alteryx.com/t5/Alteryx-Designer-Discussions/Data-profiling-tutorials-use-cases-and-exercise/td-p/145347)

### Who To Leave Behind
During the data profiling process, it's common to realize that some of your data are [useless](https://ambisense.net/why-useless-data-is-worse-than-no-data/). Your data may have too much noise or be partial, and most likely you don't need all of them to answer your business problems.
[To drop or not to drop, the Dilemma](https://www.theanalysisfactor.com/outliers-to-drop-or-not-to-drop/).
Each time you're facing a data-related problem, try to understand what data you need and what you don't - that is, for each piece of information, ask yourself (and ask the _business user_):
- How is this data going to help me?
- Is it possible to use them, reducing noise or missing values?
- Considering the benefits/costs of the preparation process versus the business value created, is this data worth it?

### The Toolkit
The tools we're gonna use are Python3 and its [Pandas library](https://pandas.pydata.org/), the de-facto standard for manipulating datasets.
The heavy lifting here is done by the [DataFrame class](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html), which comes with a bunch of useful functions for your daily data tasks.
Hopefully, you already know Python; if not, start from there (do the steps I suggest in the ML guide requirements), and then take this [Beginner Pandas tutorial](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html). Don't worry if some ideas are not totally clear now, but try to get the big picture of the common [Pandas operations](https://www.analyticsvidhya.com/blog/2016/01/12-pandas-techniques-python-data-manipulation/).

_Best practices and exercises:_ [1](https://github.com/guipsamora/pandas_exercises), [2](https://www.w3resource.com/python-exercises/pandas/index.php), [3](https://www.machinelearningplus.com/python/101-pandas-exercises-python/), [4](https://www.kaggle.com/pistak/pandas-tutorial-with-interactive-exercises), [5](http://disi.unitn.it/~teso/courses/sciprog/python_pandas_exercises.html)

### Data Cleaning
[Data cleaning](https://en.wikipedia.org/wiki/Data_cleansing) is the general process of taking your data, once you have a clear big picture of them, and carrying out the actual work of replacing characters, dropping incomplete rows, filling missing values and so forth. In the next sections, we'll explore all the common data cleaning situations.

### Automate These Boring Stuffs!
As I told you at the very beginning, the data preprocessing process can take a long time and be very tedious. Because of this, you want to [automate](https://www.youtube.com/watch?v=UZUoH7_mYx4) as much as you can. Also, **automation is married with iteration**, so this is the way you need to plan your data preprocessing pipelines. [Here](https://github.com/mdkearns/automated-data-preprocessing) you find a good command line tool for doing that, but I'm almost sure you'll need to build your own (remember, each problem is unique!); still, this is a good starting point.

### Don't Joke With Data
First, [**data is King**](https://www.edq.com/glossary/data-quality-importance/).
In the [data-driven epoch](https://www.venturi-group.com/qa-with-helen-mannion/), having [data quality issues](https://www.ringlead.com/blog/7-common-data-quality-issues/) means losing tremendous amounts of value for a company, in the present and in the future. So, respect your King and care a lot about him. The most immediate way to do this is to plan and [work hard](https://nektardata.com/high-quality-data/) to _produce_ good quality data.
Your goal is to plan a data collection infrastructure that fixes problems beforehand. This means caring a lot about planning your database schemas well (do I need [third-normal form](https://social.technet.microsoft.com/Forums/Lync/en-US/7bf4ca30-a1bc-415d-97e6-ce0ac3137b53/normalized-3nf-vs-denormalizedstar-schema-data-warehouse-?forum=sqldatawarehousing) or not?), how you collect data from sensors (physical or conceptual), and so on. These are problems if you're building a system up from the ground, but most of the time you're going to face real-world problems that someone wants to solve with [_already available_](https://www.wired.com/insights/2013/05/more-data-more-problems-is-big-data-always-right/) data.

_Best practices and exercises:_ [1](https://blog.panoply.io/5-data-preparation-tools-1-automated-data-platform), [2](https://www.quora.com/How-do-I-make-an-automated-data-cleaning-in-Python-for-ML-Is-there-a-trick-for-that), [3](https://www.quora.com/Is-there-a-python-package-to-automate-data-preparation-in-machine-learning), [4](https://machinelearningmastery.com/prepare-data-machine-learning-python-scikit-learn/), [5](https://www.analyticsvidhya.com/blog/2018/10/rapidminer-data-preparation-machine-learning/)

### Conclusions
Now you're ready to take your data and play with them in a variety of ways, and you have a nice panoramic overview of the entire process. You can refer to this page when you clean data, to check if you're not missing some steps. Remember that probably each situation requires a subset of these steps.

From b5a85c19ec7850c3f7eb386a0ac84b4b7374efdf Mon Sep 17 00:00:00 2001
From: Mani Sarkar
Date: Wed, 11 Nov 2020 20:39:38 +0000
Subject: [PATCH 05/18] Further amendments to various sub-topics in the Data Prep guide

---
 .../data-preparation.md | 198 +++++++-----------
 1 file changed, 80 insertions(+), 118 deletions(-)

diff --git a/content/purgatorio/collect-and-prepare-data/data-preparation.md b/content/purgatorio/collect-and-prepare-data/data-preparation.md
index 123a0a93..0ba6bb58 100644
--- a/content/purgatorio/collect-and-prepare-data/data-preparation.md
+++ b/content/purgatorio/collect-and-prepare-data/data-preparation.md
@@ -5,12 +5,13 @@ description: The purpose of this guide is to show you the different preprocessin
---

# What you will learn
Real world data is almost always messy or unstructured, and most of the time of the Data Scientist is spent in data preprocessing (or data cleaning), before visualizing the data or feeding it to Machine Learning models.

The purpose of this guide is to show you the importance of these steps, mostly about text data, but there will be guides on cleaning different kinds of data you may encounter.
# Index
- [Start Small](#Start-small)
- [Business Questions](#Business-Questions)
- [Data Preprocessing](#Data-Preprocessing)
- [Data Profiling](#Data-Profiling)
- [Data Cleaning](#Data-Cleaning)
  - [Get Rid of Extra Spaces](#Get-Rid-of-Extra-Spaces)
  - [Select and Treat All Blank Cells](#Select-and-Treat-All-Blank-Cells)
  - [Convert Values Type](#Convert-Values-Type)
  - [Remove Duplicates](#Remove-Duplicates)
  - [Spell Check](#Spell-Check)
  - [Grammar Check](#Grammar-Check)
  - [Reshape your data](#Reshape-your-data)
  - [Converting to categorical data type](#Converting-to-categorical-data-type)
  - [Dealing with Special Characters](#Dealing-with-Special-Characters)
  - [Verification To Enrich Data](#Verification-To-Enrich-Data)
  - [Data Discretization](#Data-Discretization)
  - [Data Cleaning Tools](#Data-Cleaning-Tools)
- [Data Preprocessing / Data wrangling / Data manipulation](#Data-Preprocessing)
  - [Change Text to Lower/Upper Case](#Change-Text-to-Lower/Upper-Case)
  - [Normalizing Dates](#Normalizing-Dates)
  - [Feature Scaling](#Feature-Scaling)
  - [Text data](#Text-data)
  - [Data Cleaning Tools](#Data-Cleaning-Tools)
- [Visualization](#Visualization)
- [Merge Data Sets and Integration](#Merge-Data-Sets-and-Integration)
- [Aggregating data (centralising)](#Aggregating-data-centralising)
- [Bias and balance/imbalance](#Bias-and-balance-imbalance)
- [Sanity Check](#Sanity-Check)
- [Automate These Boring Stuffs!](#Automate-These-Boring-Stuffs!)
- [Doing it in real-time](#Doing-it-in-real-time)
- [Don't Joke With Data](#Don't-Joke-With-Data)
- [Who To Leave Behind](#Who-To-Leave-Behind)
- [The Toolkit](#The-Toolkit)
- [Conclusion](#Conclusion)

**Let's Start!**

## Start Small
It's stupid to handle GBs of data each time you want to try a data preparation step. Just use [small subsets](https://sdtimes.com/bi/data-gets-big-best-practices-data-preparation-scale/) of the data instead (but take care that they are representative and you catch all the problems). Remember, if you want to experiment with text cleaning, you don't need to launch your script on 10M rows. Test your pipeline on a small subset or sample of the data to learn whether it works well there before going full-scale.
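As a minimal Pandas sketch of this idea (the file name and sizes here are hypothetical, pick whatever fits your case):

```python
import pandas as pd

# Read only the first 100,000 rows of a (hypothetical) large file,
# instead of loading the whole thing into memory.
df = pd.read_csv("big_dataset.csv", nrows=100_000)

# Or take a reproducible 1% random sample of an already-loaded DataFrame
# to prototype your cleaning steps on.
small = df.sample(frac=0.01, random_state=42)
```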
-# Data Preprocessing +## Business Questions +Asking the [right business questions](https://www.datapine.com/blog/data-analysis-questions/) is hard, but it has the [biggest impact](https://towardsdatascience.com/start-your-data-exploration-with-questions-2f1d42cff29e) on your performance of solving a particular problem. Remember, you want to [solve a problem](http://www.informit.com/articles/article.aspx?p=2271188&seqNum=2), not to create new ones! -Data preprocessing (also known as Data Preparation, but "Preprocessing" sounds more like magic) is the [iterative process](http://www.jsoftware.us/vol12/306-JSW15277.pdf) of gathering, combining, structuring and organizing data so it can be analyzed as part of data visualization, analytics, and machine learning applications. +## Data Preprocessing +Data preprocessing (also known as Data Preparation, but "Preprocessing" sounds more like magic) is the [iterative process](http://www.jsoftware.us/vol12/306-JSW15277.pdf) of gathering, combining, structuring and (re)organizing data so it can be analyzed as part of data visualization, analytics, and machine learning processes. [Real-world data](https://www.quanticate.com/blog/real-world-data-analysis-in-clinical-trials) is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues. @@ -63,37 +66,22 @@ As usual the structure I've planned to get you started consists of having a [gen [Here](https://towardsdatascience.com/data-pre-processing-techniques-you-should-know-8954662716d6) you have a gentle end-to-end panoramic view of the entire process. -### Data Profiling -According to the (cold as ice) [Wikipedia definition](https://en.wikipedia.org/wiki/Data_profiling): "Data profiling is the process of examining the data available from an existing information source (e.g. a database or a file) and collecting statistics and informative data summaries."\ -So Wikipedia is subtly suggesting us to take a coffee with the data. +## Data Profiling +According to the (cold as ice) [Wikipedia definition](https://en.wikipedia.org/wiki/Data_profiling): "Data profiling is the process of examining the data available from an existing information source (e.g. a database or a file) and collecting statistics and informative data summaries." \ +So Wikipedia is subtly suggesting us to have a coffee while working with our data. During this informal meeting, ask the data questions like: - which business problem are you meant to solve? (what is important, and what is not) - how have you been collected (with noise, missing values...)? -- how many friends of yours are there and where can I find them? (data dimensions and retrieving from storages) +- how many friends of yours are there and where can I find them? (data dimensions and retrieving from storages, related sources) -Eventually, you may find the data too much quiet, maybe they're just shy! \ +Eventually, you may find the data to be too quiet, maybe it's just shy! \ Anyway, you're going to [ask these questions to the business user](https://business-analysis-excellence.com/business-requirements-meeting/)! 
_Best practices and exercises:_ [1](https://www.iqint.org/idq2013/presentations/downloads/di_loreto_data_profiling_tutorial_monday_am.pdf), [2](https://community.alteryx.com/t5/Alteryx-Designer-Discussions/Data-profiling-tutorials-use-cases-and-exercise/td-p/145347)

## Data Cleaning
[Data cleaning](https://en.wikipedia.org/wiki/Data_cleansing) is the general process of taking your data, once you have a clear big picture of them, and carrying out the actual work of replacing characters, dropping incomplete rows, filling missing values and so forth. In the next sections, we'll explore all the common data cleaning situations. Also see the [Data Cleaning on awesome-ai-ml-dl](https://github.com/neomatrix369/awesome-ai-ml-dl/blob/master/data/data-preparation.md#data-cleaning) section to learn more about this topic.

### Get Rid of Extra Spaces
One of the first things you want to do is [remove extra spaces](https://stackoverflow.com/questions/43332057/pandas-strip-white-space). Take care! Some spaces can carry information, but it heavily depends on the situation.
For example, in "Complete Name": "Giacomo Ciarlini" it's nice to have the space, so we can later split this into "Name": "Giacomo" and "Surname": "Ciarlini". I want you to notice that in general, apart from recommendation and customization systems, unique identifiers like names or IDs are something you can generally drop. Often, they do not carry information.

_Best practices and exercises:_ [1](https://www.kaggle.com/nirmal51194/data-cleaning-challenge-handling-missing-values), [2](https://stefvanbuuren.name/fimd/missing-data-pattern.html), [3](https://www.ethz.ch/content/dam/ethz/special-interest/math/statistics/sfs/Education/Advanced%20Studies%20in%20Applied%20Statistics/course-material-1719/Multivariate/w10-in-class-exercise-imputation-solution.pdf), [4](http://uc-r.github.io/missing_values)

### Convert Value Types
[Different data types](https://pbpython.com/pandas_dtypes.html) carry different kinds of information, and you need to care about this.
[Here](https://www.geeksforgeeks.org/python-pandas-series-astype-to-convert-data-type-of-series/) is a good tutorial on how to convert value types. Remember that Python has some shortcuts for doing this (executing `str(3)` will give you back the "3" string), but I recommend you learn how to do it with Pandas.

### Remove Duplicates
You don't want to duplicate data, they are noise and occupy space! Learn [how to handle them simply](https://www.geeksforgeeks.org/python-pandas-dataframe-drop_duplicates/) with Pandas.

### Spell Check
You want to correct wrong words, for the sake of evenness. Check [here](https://www.tutorialspoint.com/python/python_spelling_check.htm) for a good Python module to do it. Also, this is a good starting point to [implement it](https://stackoverflow.com/questions/46409475/spell-checker-in-pandas).

This is also useful when you are dealing with text data (columns of text data in a tabular dataset).

_Best practices and exercises:_ [1](https://stackoverflow.com/questions/7315114/spell-check-program-in-python), [2](https://norvig.com/spell-correct.html), [3](https://github.com/garytse89/Python-Exercises/tree/master/autoCorrect)

### Grammar Check
Just like spell check, a grammar check of text data can be of great importance, depending on the NLP task you are about to perform with it.

### Reshape your data
Maybe you're going to feed your data into a neural network or show them in a colorful bars plot. Anyway, you need to transform your data and give them the right shape for your data pipeline. [Here](https://towardsdatascience.com/seven-clean-steps-to-reshape-your-data-with-pandas-or-how-i-use-python-where-excel-fails-62061f86ef9c) is a very good tutorial for this task.
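To give you a taste of what reshaping looks like in Pandas, here is a tiny sketch (the data and column names are invented for illustration):

```python
import pandas as pd

wide = pd.DataFrame({
    "name": ["Anna", "Bob"],
    "2019": [100, 80],
    "2020": [120, 90],
})

# Wide -> long: one row per (name, year) combination, handy for plotting.
long_df = wide.melt(id_vars="name", var_name="year", value_name="sales")

# Long -> wide again, using a pivot table.
wide_again = long_df.pivot_table(index="name", columns="year", values="sales")
```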
_Best practices and exercises:_ [1](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html), [2](https://discuss.codecademy.com/t/faq-data-cleaning-with-pandas-reshaping-your-data/384794).

### Converting to categorical data type
When dealing with numeric or string (alphanumeric) columns which represent categories or multi-class labels, it's best to convert them into the categorical type. This does not just save memory, it also makes the dataframe faster to operate on, and it makes the data analysis step easier to perform. Furthermore, categorical column types maintain, under the hood, a category code per value in the column, which can be used instead of the string equivalents - saving some preprocessing or column transformations.

One additional benefit of doing this is that it helps spot inconsistent namings and replace them with consistent ones; inconsistent labels can lead to incorrect analysis and visualisations, although these can also be spotted during summarisation of categorical data.

Read all about it in the [Pandas docs](https://pandas.pydata.org/docs/) on the [Categorical data type](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html).

### Dealing with Special Characters
UTF-encoding is the standard to follow, but remember that not everyone follows the rules (otherwise, we'd not need [crime predictive analytics](http://scholarworks.sjsu.edu/cgi/viewcontent.cgi?article=1633&context=etd_projects)). You can learn [here](https://stackoverflow.com/questions/45596529/replacing-special-characters-in-pandas-dataframe) how to deal with strange accents or special characters.

_Best practices and exercises:_ [1](https://www.w3resource.com/python-exercises/python-basic-exercise-92.php), [2](https://stackoverflow.com/questions/22518703/escape-sequences-exercise-in-python?rq=1), [3](https://learnpythonthehardway.org/book/ex2.html)

### Verification to enrich data
Sometimes it can be useful to engineer some data, for example: suppose you're dealing with [e-commerce data](https://www.edataindia.com/why-data-cleansing-is-important/), and you have the prices of each object sold. You may want to add a new column in your dataset, with a label carrying handy information like a Price_level [low, medium, high] based on upper and lower bounds you can decide. This is really simple in Pandas, check [here](https://stackoverflow.com/questions/19913659/pandas-conditional-creation-of-a-series-dataframe-column). Another example is to add a Gender column (M, F) to easily explore data and gain insights in a customers dataset. These are also known as bucket values, which are super handy and may even mildly fall under the Feature Engineering category.
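As a small illustrative sketch of the Price_level idea above (the bounds and column names are made up and should really come from the business user):

```python
import pandas as pd

orders = pd.DataFrame({"price": [3.5, 27.0, 140.0, 56.0]})  # toy e-commerce data

# Bucket continuous prices into a handy categorical label;
# the bin edges here are arbitrary, for illustration only.
orders["price_level"] = pd.cut(
    orders["price"],
    bins=[0, 10, 100, float("inf")],
    labels=["low", "medium", "high"],
)
```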
_Best practices and exercises:_ [1](http://www.inweb.org.br/w3c/dataenrichment/), [2](https://solutionsreview.com/data-integration/best-practices-for-data-enrichment-after-etl/), [3](http://www.inweb.org.br/w3c/dataenrichment/) -### Data Discretization +### Data Discretization Many Machine Learning and Data Analysis methods cannot handle continuous data, and dealing with them can be computationally prohibitive. [Here](https://www.youtube.com/watch?v=TF3_6lwITQg) you find a good video explaining why and how you need to discretize data. _Best practices and exercises:_ [1](https://www.researchgate.net/post/What_are_the_best_methods_for_discretization_of_continuous_features), [2](https://towardsdatascience.com/discretisation-using-decision-trees-21910483fa4b), [3](https://docs.microsoft.com/en-us/sql/analysis-services/data-mining/discretization-methods-data-mining) -### Data Cleaning Tools -You're not going to hunt tigers without a rifle! You have a ton of tools out there that will help you during the data cleaning process, the one I want to suggest you is [this](https://www.analyticsindiamag.com/10-best-data-cleaning-tools-get-data/) open source tool from Google. Check [here](https://www.quora.com/What-are-the-best-open-source-data-cleansing-tools-software-available) for more. - -# Data Preprocessing / Data wrangling / Data manipulation - -Data preprocessing (also known as Data Preparation, but "Preprocessing" sounds more like magic) is the [iterative process](http://www.jsoftware.us/vol12/306-JSW15277.pdf) of gathering, combining, structuring and organizing data so it can be analyzed as part of data visualization, analytics, and machine learning applications. - -[Real-world data](https://www.quanticate.com/blog/real-world-data-analysis-in-clinical-trials) is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues. - -It's the [core ability](https://blogs.sas.com/content/hiddeninsights/2017/11/30/analytical-data-preparation-important/) of any data scientist or data engineer, and you must _be able to manipulate, clean, and structure_ your data during the everyday work (besides expecting that this will take the most of your [daily-time](https://www.infoworld.com/article/3228245/the-80-20-data-science-dilemma.html)!). - -There are a lot of different data types out there, and they deserve [different treatments](http://blog.appliedinformaticsinc.com/data-mining-challenges-in-data-cleaning/). +### Feature Scaling +Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step. +[Here](Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.) you find a serious tutorial about this fundamental step. Also known as Normalizing (bring value of numeric column between 0 and 1) or Standardizing (bring value of numeric column between -1 and 1) data, see [Normalization vs Standardization](https://towardsdatascience.com/normalization-vs-standardization-cb8fe15082eb). Normalization is also called min-max approach, see another [example](https://towardsdatascience.com/data-normalization-with-pandas-and-scikit-learn-7c1cc6ed6475). 
-As usual the structure I've planned to get you started consists of having a [general overview](https://searchbusinessanalytics.techtarget.com/definition/data-preparation), and then dive deep into each data processing situation you can encounter.
+_Best practices and exercises:_ [1](https://www.kaggle.com/jfeng1023/data-cleaning-challenge-scale-and-normalize-data), [2](https://www.quora.com/When-should-you-perform-feature-scaling-and-mean-normalization-on-the-given-data-What-are-the-advantages-of-these-techniques), [3](https://www.quora.com/When-do-I-have-to-do-feature-scaling-in-machine-learning)
 
-[Here](https://towardsdatascience.com/data-pre-processing-techniques-you-should-know-8954662716d6) you have a gentle end-to-end panoramic view of the entire process.
+### Text data
+Just as the transformations or preprocessing steps above can be performed on numeric, date or categorical data, text data can be processed in a similar fashion, although text data usually undergoes the regex and string transformations deemed necessary for the NLP tasks it will be used for thereafter.
 
-### Change Text to Lower/Upper Case
-You want to _Capitalize_ names, or maybe make them uniform (some people can enter data with or without capital letters!). Check [here](https://www.geeksforgeeks.org/python-pandas-series-str-lower-upper-and-title/) for the Pandas way to do it.
+### Data Cleaning Tools
+You're not going to hunt tigers without a rifle! You have a ton of tools out there that will help you during the data cleaning process, the one I want to suggest you is [this](https://www.analyticsindiamag.com/10-best-data-cleaning-tools-get-data/) open source tool from Google. Check [here](https://www.quora.com/What-are-the-best-open-source-data-cleansing-tools-software-available) for more.
 
-### Normalizing Dates
-I think there could be one hundred ways to write down a date. You need to decide your format and make them uniform across your dataset, and [here](https://medium.com/jbennetcodes/dealing-with-datetimes-like-a-pro-in-pandas-b80d3d808a7f) you learn how to do it.
+## Visualization
+Visualization of data before and after many of the above steps is vital, to ensure that the balance, bias and shape of the data are maintained and that the transformed or preprocessed data is representative of its original form. Even if we can't control the way such data is going to evolve, we can at least see the before and after effects of a transformation/preprocessing step before proceeding with it. And if we do proceed with it, we know from the visuals what the outcome stands to be (more or less).
 
-_Best practices and exercises:_ [1](https://www.w3resource.com/python-exercises/python-conditional-exercise-41.php), [2](https://www.w3resource.com/python-exercises/date-time-exercise/), [3](https://www.kaggle.com/anezka/data-cleaning-challenge-parsing-dates)
+## Merge Data Sets and Integration
+Now that you hopefully have been successful in your data cleaning process, you can merge data from different sources to create big [de-normalized](https://www.researchgate.net/post/When_and_why_do_we_need_data_normalization_in_data_mining_algorithms) data tables, ready to be explored and consumed. [This](https://www.quora.com/Is-data-warehouse-normalized-or-denormalized-Why) is why.
 
-### Feature Scaling
-Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.
-[Here](Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.) you find a serious tutorial about this fundamental step.
 
-_Best practices and exercises:_ [1](https://www.kaggle.com/jfeng1023/data-cleaning-challenge-scale-and-normalize-data), [2](https://www.quora.com/When-should-you-perform-feature-scaling-and-mean-normalization-on-the-given-data-What-are-the-advantages-of-these-techniques), [3](https://www.quora.com/When-do-I-have-to-do-feature-scaling-in-machine-learning)
+## Aggregating data (centralising)
+Aggregating data or centralising data (or sometimes called normalising data) - even though this topic overlaps with the [Data Collection](https://virgili0.github.io/Virgilio/purgatorio/collect-and-prepare-data/data-collection.html) topic covered in the respective guide. It's good to touch on the topic and be reminded of it briefly. As covered in the [Business Questions](#Business-Questions) when we ask questions about the data, one of them is to find it's source. But it also could give rise to other related data or sources of data that could be relevant to the current task and then be brought in. Which throws light on the data aggregation process - how to bring the different sources of data and convert it into one form before performing any preprocessing or transformations on it. This process itself is sort of a preprocessing or transformations step on its own.
+
+On the other hand, this question could throw light on the sources of data the current raw-data is made up of (and make us aware of the aggregatation process it underwent) before taking it's current form.
 
-# Data Profiling
-According to the (cold as ice) [Wikipedia definition](https://en.wikipedia.org/wiki/Data_profiling): "Data profiling is the process of examining the data available from an existing information source (e.g. a database or a file) and collecting statistics and informative data summaries."\
-So Wikipedia is subtly suggesting us to take a coffee with the data.
+_Best practices and exercises:_ [1](https://www.ssc.wisc.edu/sscc/pubs/sfr-combine.htm), [2](https://rpubs.com/wsundstrom/t_merge), [3](https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html), [4](https://searchbusinessanalytics.techtarget.com/feature/Using-data-merging-and-concatenation-techniques-to-integrate-data), [5](https://www.analyticsvidhya.com/blog/2016/06/9-challenges-data-merging-subsetting-r-python-beginner/)
 
-During this informal meeting, ask the data questions like:
-- which business problem are you meant to solve? (what is important, and what is not)
-- how have you been collected (with noise, missing values...)?
-- how many friends of yours are there and where can I find them? (data dimensions and retrieving from storages)
+## Bias and balance/imbalance
+It is hard to check and know, up front, the current bias of the data, how the data is balanced, or how much imbalance exists in the raw data. To add to that, at each of the above transformation/preprocessing steps we may be introducing new bias, dampening existing bias, or a combination of the two, while we process or transform the raw data.
 
-Eventually, you may find the data too much quiet, maybe they're just shy! \
-Anyway, you're going to [ask these questions to the business user](https://business-analysis-excellence.com/business-requirements-meeting/)!
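Tying into the bias and balance/imbalance point above, a minimal sketch of how you might compare class balance before and after a preprocessing step — the column name and the filter are illustrative assumptions:

```python
import pandas as pd

# A toy, highly imbalanced label column (illustrative)
df = pd.DataFrame({"label": ["fraud"] * 3 + ["ok"] * 97})

# Class balance before a preprocessing step...
before = df["label"].value_counts(normalize=True)

# ...and after (a naive row filter stands in for any transformation)
after = df[df.index % 2 == 0]["label"].value_counts(normalize=True)

# Comparing the two distributions shows whether the step shifted the balance
print(before, after, sep="\n\n")
```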
+## Sanity Check
+You always want to be sure that your data are _exactly_ how you want them to be, and because of this it's a good rule of thumb to apply a sanity check after each complete iteration of the data preprocessing pipeline (i.e. each step we have seen until now).
+Look [here](https://www.trifacta.com/blog/4-key-steps-to-sanity-checking-your-data/) for a good overview. Depending on your case, the sanity check can vary a lot.
 
-_Best practices and exercises:_ [1](https://www.iqint.org/idq2013/presentations/downloads/di_loreto_data_profiling_tutorial_monday_am.pdf), [2](https://community.alteryx.com/t5/Alteryx-Designer-Discussions/Data-profiling-tutorials-use-cases-and-exercise/td-p/145347)
+_Best practices and exercises:_ [1](https://blog.socialcops.com/academy/resources/4-data-checks-clean-data/), [2](https://www.r-bloggers.com/data-sanity-checks-data-proofer-and-r-analogues/), [3](https://www.quora.com/What-is-the-example-of-Sanity-testing-and-smoke-testing)
 
+## Automate These Boring Stuffs!
+As I told you at the very beginning, the data preprocessing process can take a long time and be very tedious. Because of this, you want to [automate](https://www.youtube.com/watch?v=UZUoH7_mYx4) the most you can. Also, **automation is married with iteration**, so this is the way you need to plan your data preprocessing pipelines. [Here](https://github.com/mdkearns/automated-data-preprocessing) you find a good command line tool for doing that, but I'm almost sure you'll need to build your own (remember, each problem is unique!), but this is a good starting point.
 
-**Let's Start!**
+## Doing it in real-time
+Fully connected to the [previous section](#Automate-These-Boring-Stuffs!), automating redundant or repeated tasks makes the workflow repeatable, consistent, efficient and reliable. And given these qualities, it's not far away from being given the task of handling real-world raw data directly from the source or the various sources (centralising or aggregation of data). This takes away the whole manual step from the process and keeps things real and practical -- production ready all the time. In this way you can see all the flavours of data/input and the nuances and edge-cases to handle each time a step fails or gives false positives or false negatives.
 
-### Don't Joke With Data
+## Don't Joke With Data
 First, [**data is King**](https://www.edq.com/glossary/data-quality-importance/). In the [data-driven epoch](https://www.venturi-group.com/qa-with-helen-mannion/), having [data quality issues](https://www.ringlead.com/blog/7-common-data-quality-issues/) means to lose tremendous amounts of value for a company, in the present and in the future. So, respect your King and care a lot about him. The most immediate way to do this is to plan and [work hard](https://nektardata.com/high-quality-data/) to _produce_ good quality data.
 Your goal is to plan a collecting data infrastructure that fixes problems beforehand. This means to care a lot about planning well your database schemas (do I need [third-normal form](https://social.technet.microsoft.com/Forums/Lync/en-US/7bf4ca30-a1bc-415d-97e6-ce0ac3137b53/normalized-3nf-vs-denormalizedstar-schema-data-warehouse-?forum=sqldatawarehousing) or not?), how do you collect data from sensors (physical or conceptual) and so on.
These are problems if you're building a system up from the ground, but most of the time you're going to be facing real-world problems that someone wants to solve with [_already available_](https://www.wired.com/insights/2013/05/more-data-more-problems-is-big-data-always-right/) data.
 
-### Business Questions
-Asking the [right business questions](https://www.datapine.com/blog/data-analysis-questions/) is hard, but it has the [biggest impact](https://towardsdatascience.com/start-your-data-exploration-with-questions-2f1d42cff29e) on your performance of solving a particular problem. Remember, you want to [solve a problem](http://www.informit.com/articles/article.aspx?p=2271188&seqNum=2), not to create new ones!
+_Best practices and exercises:_ [1](https://blog.panoply.io/5-data-preparation-tools-1-automated-data-platform), [2](https://www.quora.com/How-do-I-make-an-automated-data-cleaning-in-Python-for-ML-Is-there-a-trick-for-that), [3](https://www.quora.com/Is-there-a-python-package-to-automate-data-preparation-in-machine-learning), [4](https://machinelearningmastery.com/prepare-data-machine-learning-python-scikit-learn/), [5](https://www.analyticsvidhya.com/blog/2018/10/rapidminer-data-preparation-machine-learning/)
 
-### Who To Leave Behind
+## Who To Leave Behind
 During the data profiling process, it's common to realize that often some of your data are [useless](https://ambisense.net/why-useless-data-is-worse-than-no-data/). Your data may have too much noise or they are partial, and most likely you don't need all of them to answer your business problems. [To drop or not to drop, the Dilemma](https://www.theanalysisfactor.com/outliers-to-drop-or-not-to-drop/).
 Each time you're facing a data related problem, try to understand what data you need and what you don't - that is, for each piece of information, ask yourself (and ask the _business user_):
 
 - How is this data going to help me?
-- Is possible to use them, reducing noise o missing values?
-- Considering the benefits/costs of the preparation process versus the business value created, Is this data worth it?
+- Is it possible to use them, reducing noise or missing values?
+- Considering the benefits/costs of the preparation process versus the business value created, is the effort worth it?
 
-### Start Small
-It's stupid to handle GBs of data each time you want to try a data preparation step. Just use [small subsets](https://sdtimes.com/bi/data-gets-big-best-practices-data-preparation-scale/) of the data (but take care that they are representative and you catch all the problems). Remember, if you want to experiment with string cleaning, you don't need to launch your script on 10M rows.
+## The Toolkit
+The tools we're gonna use are Python3 and its [Pandas library](https://pandas.pydata.org/), the de-facto standard to manipulate datasets. There are a whole lot of other tools that have come out which are either built on top of Pandas or Numpy or independently, see [Data Preparation on awesome-ai-ml-dl](https://github.com/neomatrix369/awesome-ai-ml-dl/blob/master/data/data-preparation.md) for more details.
 
-### The Toolkit
-The tools we're gonna use are Python3 and his [Pandas library](https://pandas.pydata.org/), the de-facto standard to manipulate datasets. The heavy lifting here is done by the [DataFrame class](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html), which comes with a bunch of useful functions for your daily data tasks.
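For a first taste of the DataFrame class mentioned above, a minimal sketch — the toy data is an illustrative assumption:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ada", "Bea"], "age": [36, 41]})

# A few of the everyday methods referred to above
print(df.head())      # peek at the first rows
print(df.describe())  # quick numeric summary
print(df.dtypes)      # column types, useful before any conversion step
```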
-Hopefully, you already know Python, if not start from there (do the steps I suggest you in the ML guide requirements), and then take this [Beginner Pandas tutorial](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html). Don't worry if now some ideas are not totally clear, but try to get the big picture of the common [Pandas operations](https://www.analyticsvidhya.com/blog/2016/01/12-pandas-techniques-python-data-manipulation/). +Hopefully, you already know Python, if not start from there (do the steps I suggest you in the ML guide requirements), and then take this [Beginner Pandas tutorial](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html). Don't worry if by now some ideas are not totally clear, but try to get the big picture of the common [Pandas operations](https://www.analyticsvidhya.com/blog/2016/01/12-pandas-techniques-python-data-manipulation/). _Best practices and exercises:_ [1](https://github.com/guipsamora/pandas_exercises), [2](https://www.w3resource.com/python-exercises/pandas/index.php), [3](https://www.machinelearningplus.com/python/101-pandas-exercises-python/), [4](https://www.kaggle.com/pistak/pandas-tutorial-with-interactive-exercises), [5](http://disi.unitn.it/~teso/courses/sciprog/python_pandas_exercises.html) -### Visualization - -(visualization during data preparation process: before, during and after) -. -. -. - -### Merge Data Sets and Integration -Now that you hopefully have been successful in your data cleaning process, you can merge data from different source to create big [de-normalized](https://www.researchgate.net/post/When_and_why_do_we_need_data_normalization_in_data_mining_algorithms) data tables, ready to be explored and consumed. [This](https://www.quora.com/Is-data-warehouse-normalized-or-denormalized-Why) is why. - -_Best practices and exercises:_ [1](https://www.ssc.wisc.edu/sscc/pubs/sfr-combine.htm), [2](https://rpubs.com/wsundstrom/t_merge), [3](https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html), [4](https://searchbusinessanalytics.techtarget.com/feature/Using-data-merging-and-concatenation-techniques-to-integrate-data), [5](https://www.analyticsvidhya.com/blog/2016/06/9-challenges-data-merging-subsetting-r-python-beginner/) - -### Sanity Check -You always want to be sure that your data are _exactly_ how you want them to be, and because of this is a good rule of thumb to apply a sanity check after each complete iteration of the data preprocessing pipeline (i.e. each step we have seen until now) -Look [here](https://www.trifacta.com/blog/4-key-steps-to-sanity-checking-your-data/) for a good overview. Depending on your case, the sanity check can vary a lot. - -_Best practices and exercises:_ [1](https://blog.socialcops.com/academy/resources/4-data-checks-clean-data/), [2](https://www.r-bloggers.com/data-sanity-checks-data-proofer-and-r-analogues/), [3](https://www.quora.com/What-is-the-example-of-Sanity-testing-and-smoke-testing) - -### Automate These Boring Stuffs! -As I told you at the very beginning, the data preprocessing process can take a long time and be very tedious. Because of this, you want to [automate](https://www.youtube.com/watch?v=UZUoH7_mYx4) the most you can. Also, **automation is married with iteration**, so this is the way you need to plan your data preprocessing pipelines. 
[Here](https://github.com/mdkearns/automated-data-preprocessing) you find a good command line tool for doing that, but I'm almost sure you'll need to build your own (remember, each problem is unique!), but this is a good starting point. - -### Don't Joke With Data -First, [**data is King**](https://www.edq.com/glossary/data-quality-importance/). In the [data-driven epoch](https://www.venturi-group.com/qa-with-helen-mannion/), having [data quality issues](https://www.ringlead.com/blog/7-common-data-quality-issues/) means to lose tremendous amounts of value for a company, in the present and in the future. So, respect your King and care a lot about him. The most immediate way to do this is to plan and [work hard](https://nektardata.com/high-quality-data/) to _produce_ good quality data. -Your goal is to plan a collecting data infrastructure that fixes problems beforehand. This means to care to a lot about planning well your database schemas (do I need [third-normal form](https://social.technet.microsoft.com/Forums/Lync/en-US/7bf4ca30-a1bc-415d-97e6-ce0ac3137b53/normalized-3nf-vs-denormalizedstar-schema-data-warehouse-?forum=sqldatawarehousing) or not?), how do you collect data from sensors (physical or conceptual) and so on. These are problems if you're building a system up from the ground, but most of the times in you're gonna facing real-world problems that someone wants to solve with [_already available_](https://www.wired.com/insights/2013/05/more-data-more-problems-is-big-data-always-right/) data. - -_Best practices and exercises:_ [1](https://blog.panoply.io/5-data-preparation-tools-1-automated-data-platform), [2](https://www.quora.com/How-do-I-make-an-automated-data-cleaning-in-Python-for-ML-Is-there-a-trick-for-that), [3](https://www.quora.com/Is-there-a-python-package-to-automate-data-preparation-in-machine-learning), [4](https://machinelearningmastery.com/prepare-data-machine-learning-python-scikit-learn/), [5](https://www.analyticsvidhya.com/blog/2018/10/rapidminer-data-preparation-machine-learning/) - -### Conclusions -Now you're ready to take your data and play with them in a variety of ways, and you have a nice panoramic overview of the entire process. You can refer to this page when you clean data, to check if you're not missing some steps. Remember that probably each situation requires a subset of these steps. \ No newline at end of file +## Conclusions +Now you're ready to take your data and play with them in a variety of ways, and you have a nice panoramic overview of the entire process. You can refer to this page when you clean data, to check if you're not missing some steps. Remember that probably each situation requires a subset of these steps. 
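To make the "Merge Data Sets and Integration" step above concrete, a minimal sketch with `pandas.merge` — the tables and column names are illustrative assumptions:

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Ada", "Bea"]})
orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [10.0, 5.5, 7.25]})

# A left join keeps every customer and attaches their orders,
# producing the kind of de-normalized table described above
denormalized = customers.merge(orders, on="customer_id", how="left")
print(denormalized)
```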
From b18d2e351a6b23141690f6b75bdeb0a199656883 Mon Sep 17 00:00:00 2001 From: Mani Sarkar Date: Wed, 11 Nov 2020 20:43:53 +0000 Subject: [PATCH 06/18] Fixed broken link in Data collection --- content/purgatorio/collect-and-prepare-data/data-collection.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/purgatorio/collect-and-prepare-data/data-collection.md b/content/purgatorio/collect-and-prepare-data/data-collection.md index ed61080b..d94b716e 100644 --- a/content/purgatorio/collect-and-prepare-data/data-collection.md +++ b/content/purgatorio/collect-and-prepare-data/data-collection.md @@ -170,7 +170,7 @@ Then there is also ethics you do not want to miss out on and [the section to fol With rising concerns over _privacy_ and _bias_, you want to be sure that the data collected does respect the ethics and standards in this field as much as possible. -To help with that as the awareness about things are improving, there are a lot of resources available, one such place to start would be [here](https://github.com/neomatrix369/awesome-ai-ml-dl/blob/naster/README-details.md#ethics--altruistic-motives). One of the resources mentioned there is that of a python package called [Deon](https://pypi.org/project/deon/). Interestingly it has a _digital checklist_ you can consult and see if they apply to what you are about to do. +To help with that as the awareness about things are improving, there are a lot of resources available, one such place to start would be [here](https://github.com/neomatrix369/awesome-ai-ml-dl/blob/master/README-details.md#ethics--altruistic-motives). One of the resources mentioned there is that of a python package called [Deon](https://pypi.org/project/deon/). Interestingly it has a _digital checklist_ you can consult and see if they apply to what you are about to do. #### Interpretability / Explainability From fed7b6c43c7b72ad530f274c8c1caaf305ea9e61 Mon Sep 17 00:00:00 2001 From: Mani Sarkar Date: Wed, 11 Nov 2020 21:16:14 +0000 Subject: [PATCH 07/18] Adding a line about the visualisation guide --- .../collect-and-prepare-data/data-preparation.md | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/content/purgatorio/collect-and-prepare-data/data-preparation.md b/content/purgatorio/collect-and-prepare-data/data-preparation.md index 0ba6bb58..63d9648c 100644 --- a/content/purgatorio/collect-and-prepare-data/data-preparation.md +++ b/content/purgatorio/collect-and-prepare-data/data-preparation.md @@ -106,7 +106,7 @@ _Best practices and exercises:_ [1](https://www.kaggle.com/nirmal51194/data-clea You don't want to duplicate data, they are noise and occupy space! Learn [how to handle them simply](https://www.geeksforgeeks.org/python-pandas-dataframe-drop_duplicates/) with Pandas. ### Spell Check -You want to correct wrong words, for the sake of evenness. Check [here](https://www.tutorialspoint.com/python/python_spelling_check.htm) for a good Python module to do it. Also, this is a good starting point to [implement it](https://stackoverflow.com/questions/46409475/spell-checker-in-pandas). +You want to correct wrong words, for the sake of evenness. Check [here](https://www.tutorialspoint.com/python_text_processing/python_spelling_check.htm) for a good Python module to do it. Also, this is a good starting point to [implement it](https://stackoverflow.com/questions/46409475/spell-checker-in-pandas). This is also useful when you are dealing with text data (columns of text data in a tabular dataset). 
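A minimal spell-checking sketch, assuming the third-party `pyspellchecker` package (`pip install pyspellchecker`) — the linked tutorial may use a different module:

```python
from spellchecker import SpellChecker

spell = SpellChecker()
words = ["recieve", "payment", "adress"]

# unknown() flags words missing from the dictionary,
# correction() proposes the most likely fix for each of them
for word in spell.unknown(words):
    print(word, "->", spell.correction(word))
```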
@@ -135,7 +135,7 @@ _Best practices and exercises:_ [1](https://www.w3resource.com/python-exercises/
 
 ### Verification to enrich data
 Sometimes can be useful to engineer some data, for example: suppose you're dealing with [e-commerce data](https://www.edataindia.com/why-data-cleansing-is-important/), and you have the prices of each object sold. You may want to add a new column in your dataset, with a label carrying handy information like a Price_level [low, medium, high] based on upper and lower bounds you can decide. This is really simple in Pandas, check [here](https://stackoverflow.com/questions/19913659/pandas-conditional-creation-of-a-series-dataframe-column). Another example is to add a Gender column (M, F) to easily explore data and gain insights in a customers dataset. Also known as bucket values, which are super handy -- may even midly fall under the Feature Engineering category.
 
-_Best practices and exercises:_ [1](http://www.inweb.org.br/w3c/dataenrichment/), [2](https://solutionsreview.com/data-integration/best-practices-for-data-enrichment-after-etl/), [3](http://www.inweb.org.br/w3c/dataenrichment/)
+_Best practices and exercises:_ [1](https://web.archive.org/web/20200813205611/http://www.inweb.org.br/w3c/dataenrichment/), [2](https://solutionsreview.com/data-integration/best-practices-for-data-enrichment-after-etl/)
 
 ### Data Discretization
 Many Machine Learning and Data Analysis methods cannot handle continuous data, and dealing with them can be computationally prohibitive. [Here](https://www.youtube.com/watch?v=TF3_6lwITQg) you find a good video explaining why and how you need to discretize data.
 
@@ -157,6 +157,8 @@ You're not going to hunt tigers without a rifle! You have a ton of tools out the
 ## Visualization
 Visualization of data before and after many of the above steps is vital, to ensure that the balance, bias and shape of the data are maintained and that the transformed or preprocessed data is representative of its original form. Even if we can't control the way such data is going to evolve, we can at least see the before and after effects of a transformation/preprocessing step before proceeding with it. And if we do proceed with it, we know from the visuals what the outcome stands to be (more or less).
 
+The specifics of what kinds of visualisations to use will be made available in the Visualisation Guide.
+
 ## Merge Data Sets and Integration
 Now that you hopefully have been successful in your data cleaning process, you can merge data from different sources to create big [de-normalized](https://www.researchgate.net/post/When_and_why_do_we_need_data_normalization_in_data_mining_algorithms) data tables, ready to be explored and consumed. [This](https://www.quora.com/Is-data-warehouse-normalized-or-denormalized-Why) is why.
 
@@ -183,7 +185,7 @@ As I told you at the very beginning, the data preprocessing process can take a l
 
 ## Doing it in real-time
 Fully connected to the [previous section](#Automate-These-Boring-Stuffs!), automating redundant or repeated tasks makes the workflow repeatable, consistent, efficient and reliable. And given these qualities, it's not far away from being given the task of handling real-world raw data directly from the source or the various sources (centralising or aggregation of data). This takes away the whole manual step from the process and keeps things real and practical -- production ready all the time. In this way you can see all the flavours of data/input and the nuances and edge-cases to handle each time a step fails or gives false positives or false negatives.
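As a minimal illustration of the automation idea above, a hand-rolled, repeatable pipeline of cleaning steps — the step functions and data are illustrative, not a prescribed recipe:

```python
import pandas as pd

# Each step is a plain function, so the whole sequence is
# repeatable and easy to re-run on fresh incoming data.
def strip_spaces(df: pd.DataFrame) -> pd.DataFrame:
    # Trim whitespace on text columns only
    return df.apply(lambda col: col.str.strip() if col.dtype == "object" else col)

def drop_dupes(df: pd.DataFrame) -> pd.DataFrame:
    return df.drop_duplicates()

def prepare(df: pd.DataFrame) -> pd.DataFrame:
    for step in (strip_spaces, drop_dupes):
        df = step(df)
    return df

clean = prepare(pd.DataFrame({"name": [" Ada ", " Ada ", "Bea"]}))
print(clean)
```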
## Don't Joke With Data
-First, [**data is King**](https://www.edq.com/glossary/data-quality-importance/). In the [data-driven epoch](https://www.venturi-group.com/qa-with-helen-mannion/), having [data quality issues](https://www.ringlead.com/blog/7-common-data-quality-issues/) means to lose tremendous amounts of value for a company, in the present and in the future. So, respect your King and care a lot about him. The most immediate way to do this is to plan and [work hard](https://nektardata.com/high-quality-data/) to _produce_ good quality data.
+First, [**data is King**](https://www.edq.com/glossary/data-quality-importance/). In the [data-driven epoch](https://www.venturi-group.com/qa-with-helen-mannion/), having [data quality issues](https://www.ringlead.com/blog/7-common-data-quality-issues/) means losing tremendous amounts of value for a company, in the present and in the future. So, respect your King and care a lot about him. The most immediate way to do this is to plan and [work hard](https://web.archive.org/web/20190708202946/https://nektardata.com/high-quality-data/) to _produce_ good quality data.
 Your goal is to plan a collecting data infrastructure that fixes problems beforehand. This means to care a lot about planning well your database schemas (do I need [third-normal form](https://social.technet.microsoft.com/Forums/Lync/en-US/7bf4ca30-a1bc-415d-97e6-ce0ac3137b53/normalized-3nf-vs-denormalizedstar-schema-data-warehouse-?forum=sqldatawarehousing) or not?), how do you collect data from sensors (physical or conceptual) and so on. These are problems if you're building a system up from the ground, but most of the time you're going to be facing real-world problems that someone wants to solve with [_already available_](https://www.wired.com/insights/2013/05/more-data-more-problems-is-big-data-always-right/) data.
 
 _Best practices and exercises:_ [1](https://blog.panoply.io/5-data-preparation-tools-1-automated-data-platform), [2](https://www.quora.com/How-do-I-make-an-automated-data-cleaning-in-Python-for-ML-Is-there-a-trick-for-that), [3](https://www.quora.com/Is-there-a-python-package-to-automate-data-preparation-in-machine-learning), [4](https://machinelearningmastery.com/prepare-data-machine-learning-python-scikit-learn/), [5](https://www.analyticsvidhya.com/blog/2018/10/rapidminer-data-preparation-machine-learning/)
 
 ## Who To Leave Behind
 During the data profiling process, it's common to realize that often some of your data are [useless](https://ambisense.net/why-useless-data-is-worse-than-no-data/). Your data may have too much noise or they are partial, and most likely you don't need all of them to answer your business problems. [To drop or not to drop, the Dilemma](https://www.theanalysisfactor.com/outliers-to-drop-or-not-to-drop/).
 Each time you're facing a data related problem, try to understand what data you need and what you don't - that is, for each piece of information, ask yourself (and ask the _business user_):
 
 - How is this data going to help me?
 - Is it possible to use them, reducing noise or missing values?
 - Considering the benefits/costs of the preparation process versus the business value created, is the effort worth it?
 
 ## The Toolkit
 The tools we're gonna use are Python3 and its [Pandas library](https://pandas.pydata.org/), the de-facto standard to manipulate datasets. There are a whole lot of other tools that have come out which are either built on top of Pandas or Numpy or independently, see [Data Preparation on awesome-ai-ml-dl](https://github.com/neomatrix369/awesome-ai-ml-dl/blob/master/data/data-preparation.md) for more details. The heavy lifting here is done by the [DataFrame class](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html), which comes with a bunch of useful functions for your daily data tasks.
 
-Hopefully, you already know Python, if not start from there (do the steps I suggest you in the ML guide requirements), and then take this [Beginner Pandas tutorial](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html). Don't worry if by now some ideas are not totally clear, but try to get the big picture of the common [Pandas operations](https://www.analyticsvidhya.com/blog/2016/01/12-pandas-techniques-python-data-manipulation/).
+Hopefully, you already know Python, if not start from there (do the steps I suggest you in the ML guide requirements), and then take this [Beginner Pandas tutorial](https://web.archive.org/web/20200719131732/https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html). Don't worry if by now some ideas are not totally clear, but try to get the big picture of the common [Pandas operations](https://www.analyticsvidhya.com/blog/2016/01/12-pandas-techniques-python-data-manipulation/).
+Hopefully, you already know Python, if not start from there (do the steps I suggest you in the ML guide requirements), and then take this [Beginner Pandas tutorial](https://web.archive.org/web/20200719131732/https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html). Don't worry if by now some ideas are not totally clear, but try to get the big picture of the common [Pandas operations](https://www.analyticsvidhya.com/blog/2016/01/12-pandas-techniques-python-data-manipulation/). _Best practices and exercises:_ [1](https://github.com/guipsamora/pandas_exercises), [2](https://www.w3resource.com/python-exercises/pandas/index.php), [3](https://www.machinelearningplus.com/python/101-pandas-exercises-python/), [4](https://www.kaggle.com/pistak/pandas-tutorial-with-interactive-exercises), [5](http://disi.unitn.it/~teso/courses/sciprog/python_pandas_exercises.html) From 819df7fef0313f5a570ccf7704e9e6ef42758a13 Mon Sep 17 00:00:00 2001 From: Mani Sarkar Date: Tue, 1 Dec 2020 20:29:52 +0000 Subject: [PATCH 08/18] Data prepation guide: fixing the Data Preprocessing... topic link in the ToC --- .../purgatorio/collect-and-prepare-data/data-preparation.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/content/purgatorio/collect-and-prepare-data/data-preparation.md b/content/purgatorio/collect-and-prepare-data/data-preparation.md index 63d9648c..f5ca0d8e 100644 --- a/content/purgatorio/collect-and-prepare-data/data-preparation.md +++ b/content/purgatorio/collect-and-prepare-data/data-preparation.md @@ -12,7 +12,7 @@ The purpose of this guide is to show you the importance of these steps, mostly a # Index - [Start Small](#Start-small) - [Business Questions](#Business-Questions) -- [Data Preprocessing](#Data-Preprocessing) +- [Data Preprocessing (Data wrangling / Data manipulation)](#data-preprocessing-data-wrangling--data-manipulation) - [Data Profiling](#Data-Profiling) - [Data Cleaning](#Data-Cleaning) - [Get Rid of Extra Spaces](#Get-Rid-of-Extra-Spaces) @@ -27,7 +27,6 @@ The purpose of this guide is to show you the importance of these steps, mostly a - [Verification To Enrich Data](#Verification-To-Enrich-Data) - [Data Discretization](#Data-Discretization) - [Data Cleaning Tools](#Data-Cleaning-Tools) -- [Data Preprocessing / Data wrangling / Data manipulation](#Data-Preprocessing) - [Change Text to Lower/Upper Case](#Change-Text-to-Lower/Upper-Case) - [Normalizing Dates](#Normalizing-Dates) - [Feature Scaling](#Feature-Scaling) @@ -53,7 +52,7 @@ It's stupid to handle GBs of data each time you want to try a data preparation s ## Business Questions Asking the [right business questions](https://www.datapine.com/blog/data-analysis-questions/) is hard, but it has the [biggest impact](https://towardsdatascience.com/start-your-data-exploration-with-questions-2f1d42cff29e) on your performance of solving a particular problem. Remember, you want to [solve a problem](http://www.informit.com/articles/article.aspx?p=2271188&seqNum=2), not to create new ones! -## Data Preprocessing +## Data Preprocessing (Data wrangling / Data manipulation) Data preprocessing (also known as Data Preparation, but "Preprocessing" sounds more like magic) is the [iterative process](http://www.jsoftware.us/vol12/306-JSW15277.pdf) of gathering, combining, structuring and (re)organizing data so it can be analyzed as part of data visualization, analytics, and machine learning processes. 
[Real-world data](https://www.quanticate.com/blog/real-world-data-analysis-in-clinical-trials) is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues.
 
From f2429db355fed51bb29b4f766f4aa845ff7f7c03 Mon Sep 17 00:00:00 2001
From: Mani Sarkar
Date: Tue, 1 Dec 2020 20:36:59 +0000
Subject: [PATCH 09/18] Data prepation guide: fixing all centralising to
 centralizing

---
 .../collect-and-prepare-data/data-preparation.md          | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/content/purgatorio/collect-and-prepare-data/data-preparation.md b/content/purgatorio/collect-and-prepare-data/data-preparation.md
index f5ca0d8e..eebbe5ec 100644
--- a/content/purgatorio/collect-and-prepare-data/data-preparation.md
+++ b/content/purgatorio/collect-and-prepare-data/data-preparation.md
@@ -34,7 +34,7 @@ The purpose of this guide is to show you the importance of these steps, mostly a
 - [Data Cleaning Tools](#Data-Cleaning-Tools)
 - [Visualization](#Visualization)
 - [Merge Data Sets and Integration](#Merge-Data-Sets-and-Integration)
-- [Aggregating data (centralising)](#Aggregating-data-centralising)
+- [Aggregating data (centralizing)](#Aggregating-data-centralizing)
 - [Bias and balance/imbalance](#Bias-and-balance-imbalance)
 - [Sanity Check](#Sanity-Check)
 - [Automate These Boring Stuffs!](#Automate-These-Boring-Stuffs!)
@@ -161,8 +161,8 @@ The specifics of what kinds of visualisations to use is to be made available in
 ## Merge Data Sets and Integration
 Now that you hopefully have been successful in your data cleaning process, you can merge data from different sources to create big [de-normalized](https://www.researchgate.net/post/When_and_why_do_we_need_data_normalization_in_data_mining_algorithms) data tables, ready to be explored and consumed. [This](https://www.quora.com/Is-data-warehouse-normalized-or-denormalized-Why) is why.
 
-## Aggregating data (centralising)
-Aggregating data or centralising data (or sometimes called normalising data) - even though this topic overlaps with the [Data Collection](https://virgili0.github.io/Virgilio/purgatorio/collect-and-prepare-data/data-collection.html) topic covered in the respective guide. It's good to touch on the topic and be reminded of it briefly. As covered in the [Business Questions](#Business-Questions) when we ask questions about the data, one of them is to find it's source. But it also could give rise to other related data or sources of data that could be relevant to the current task and then be brought in.
+## Aggregating data (centralizing)
+Aggregating data or centralizing data (sometimes also called normalising data) overlaps with the [Data Collection](https://virgili0.github.io/Virgilio/purgatorio/collect-and-prepare-data/data-collection.html) topic covered in the respective guide, but it's good to touch on it and be reminded of it briefly. As covered in [Business Questions](#Business-Questions), when we ask questions about the data, one of them is to find its source. But it could also give rise to other related data or sources of data that could be relevant to the current task and then be brought in.
That throws light on the data aggregation process - how to bring the different sources of data together and convert them into one form before performing any preprocessing or transformations on them. This process is itself a sort of preprocessing or transformation step of its own.
 
 On the other hand, this question could throw light on the sources of data the current raw data is made up of (and make us aware of the aggregation process it underwent) before taking its current form.
 
 _Best practices and exercises:_ [1](https://www.ssc.wisc.edu/sscc/pubs/sfr-combine.htm), [2](https://rpubs.com/wsundstrom/t_merge), [3](https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html), [4](https://searchbusinessanalytics.techtarget.com/feature/Using-data-merging-and-concatenation-techniques-to-integrate-data), [5](https://www.analyticsvidhya.com/blog/2016/06/9-challenges-data-merging-subsetting-r-python-beginner/)
 
@@ -181,7 +181,7 @@ _Best practices and exercises:_ [1](https://blog.socialcops.com/academy/resource
 As I told you at the very beginning, the data preprocessing process can take a long time and be very tedious. Because of this, you want to [automate](https://www.youtube.com/watch?v=UZUoH7_mYx4) the most you can. Also, **automation is married with iteration**, so this is the way you need to plan your data preprocessing pipelines. [Here](https://github.com/mdkearns/automated-data-preprocessing) you find a good command line tool for doing that, but I'm almost sure you'll need to build your own (remember, each problem is unique!), but this is a good starting point.
 
 ## Doing it in real-time
-Fully connected to the [previous section](#Automate-These-Boring-Stuffs!), automating redundant or repeated tasks makes the workflow repeatable, consistent, efficient and reliable. And given these qualities, it's not far away from being given the task of handling real-world raw data directly from the source or the various sources (centralising or aggregation of data). This takes away the whole manual step from the process and keeps things real and practical -- production ready all the time. In this way you can see all the flavours of data/input and the nuances and edge-cases to handle each time a step fails or gives false positives or false negatives.
+Fully connected to the [previous section](#Automate-These-Boring-Stuffs!), automating redundant or repeated tasks makes the workflow repeatable, consistent, efficient and reliable. And given these qualities, it's not far away from being given the task of handling real-world raw data directly from the source or the various sources (centralizing or aggregation of data). This takes away the whole manual step from the process and keeps things practical -- production-ready all the time. In this way you can see all the flavours of data/input and the nuances and edge-cases to handle each time a step fails or gives false positives or false negatives.
 
 ## Don't Joke With Data
 First, [**data is King**](https://www.edq.com/glossary/data-quality-importance/). In the [data-driven epoch](https://www.venturi-group.com/qa-with-helen-mannion/), having [data quality issues](https://www.ringlead.com/blog/7-common-data-quality-issues/) means losing tremendous amounts of value for a company, in the present and in the future. So, respect your King and care a lot about him. The most immediate way to do this is to plan and [work hard](https://web.archive.org/web/20190708202946/https://nektardata.com/high-quality-data/) to _produce_ good quality data.
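A minimal sketch of the centralizing/aggregation step this patch discusses — two hypothetical sources brought into one form with `pandas.concat` (all names are illustrative assumptions):

```python
import pandas as pd

# Two hypothetical sources with slightly different column names
source_a = pd.DataFrame({"CustomerID": [1, 2], "city": ["Rome", "Milan"]})
source_b = pd.DataFrame({"customer_id": [3], "city": ["Turin"]})

# Bring both into one form first (the centralizing step described above)...
source_a = source_a.rename(columns={"CustomerID": "customer_id"})

# ...then stack them into a single table before any further preprocessing
combined = pd.concat([source_a, source_b], ignore_index=True)
print(combined)
```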
From 0b4bc2a664c4e3bb9404db9a9bafacf3a73a7c5a Mon Sep 17 00:00:00 2001
From: Mani Sarkar
Date: Tue, 1 Dec 2020 20:42:23 +0000
Subject: [PATCH 10/18] Data prepation guide: replacing first-person with
 third-person (I -> Virgilio)

---
 .../collect-and-prepare-data/data-preparation.md          | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/content/purgatorio/collect-and-prepare-data/data-preparation.md b/content/purgatorio/collect-and-prepare-data/data-preparation.md
index eebbe5ec..171428b7 100644
--- a/content/purgatorio/collect-and-prepare-data/data-preparation.md
+++ b/content/purgatorio/collect-and-prepare-data/data-preparation.md
@@ -61,7 +61,7 @@ It's the [core ability](https://blogs.sas.com/content/hiddeninsights/2017/11/30/
 
 There are a lot of different data types out there, and they deserve [different treatments](http://blog.appliedinformaticsinc.com/data-mining-challenges-in-data-cleaning/).
 
-As usual the structure I've planned to get you started consists of having a [general overview](https://searchbusinessanalytics.techtarget.com/definition/data-preparation), and then dive deep into each data processing situation you can encounter.
+As usual the structure Virgilio has planned to get you started consists of having a [general overview](https://searchbusinessanalytics.techtarget.com/definition/data-preparation), and then diving deep into each data processing situation you can encounter.
 
 [Here](https://towardsdatascience.com/data-pre-processing-techniques-you-should-know-8954662716d6) you have a gentle end-to-end panoramic view of the entire process.
 
@@ -83,7 +83,7 @@ _Best practices and exercises:_ [1](https://www.iqint.org/idq2013/presentations/
 [Data cleaning](https://en.wikipedia.org/wiki/Data_cleansing) is the general process of taking data, after you have a clear big picture of them, and you need to realize the actual process of replacing characters, dropping incomplete rows, fill missing values and so forth. In the next sections, we'll explore all the common data cleaning situations. Also see [Data Cleaning on awesome-ai-ml-dl](https://github.com/neomatrix369/awesome-ai-ml-dl/blob/master/data/data-preparation.md#data-cleaning) section to learn more about this topic.
 
 ### Get Rid of Extra Spaces
-One of the first things you want to do is [remove extra spaces](https://stackoverflow.com/questions/43332057/pandas-strip-white-space). Take care! Some space can carry information, but it heavily depends on the situation. For example, in "Complete Name": "Giacomo Ciarlini" in nice to have space so we can later split this into "Name": "Giacomo" and "Surname": "Ciarlini". I want you to notice that in general, apart from recommending and suggestion customization systems, unique identifiers like names or IDs are something you can generally drop. Often, they do not carry information.
+One of the first things you want to do is [remove extra spaces](https://stackoverflow.com/questions/43332057/pandas-strip-white-space). Take care! Some space can carry information, but it heavily depends on the situation. For example, in "Complete Name": "Giacomo Ciarlini" it's nice to have the space so we can later split this into "Name": "Giacomo" and "Surname": "Ciarlini". Virgilio wants you to notice that in general, apart from recommendation and customization systems, unique identifiers like names or IDs are something you can generally drop. Often, they do not carry information.
_Bonus tip_: learn how to use [Regex](https://www.analyticsvidhya.com/blog/2015/06/regular-expression-python/) for pattern matching, this is one of the most powerful tools a data person needs to master.
 
 _Best practices and exercises:_ [1](https://www.quora.com/How-do-you-remove-all-whitespace-from-a-Python-string), [2](https://towardsdatascience.com/5-methods-to-remove-the-from-your-data-in-python-and-the-fastest-one-281489382455), [3](https://www.tutorialspoint.com/How-to-remove-all-leading-whitespace-in-string-in-Python)
 
@@ -99,7 +99,7 @@ _Best practices and exercises:_ [1](https://www.kaggle.com/nirmal51194/data-clea
 
 ### Convert Value Types
 [Different data types](https://pbpython.com/pandas_dtypes.html) carry different kinds of information, and you need to care about this.
-[Here](https://www.geeksforgeeks.org/python-pandas-series-astype-to-convert-data-type-of-series/) is a good tutorial on how to convert value types. Remember that Python has some shortcut for doing this (executing `str(3)` will give you back the "3" string) but I recommend you to learn how to do it with Pandas.
+[Here](https://www.geeksforgeeks.org/python-pandas-series-astype-to-convert-data-type-of-series/) is a good tutorial on how to convert value types. Remember that Python has some shortcuts for doing this (executing `str(3)` will give you back the "3" string) but Virgilio recommends learning how to do it with Pandas.
 
 ### Remove Duplicates
 You don't want to duplicate data, they are noise and occupy space! Learn [how to handle them simply](https://www.geeksforgeeks.org/python-pandas-dataframe-drop_duplicates/) with Pandas.
 
@@ -151,7 +151,7 @@ _Best practices and exercises:_ [1](https://www.kaggle.com/jfeng1023/data-cleani
 Just as the transformations or preprocessing steps above can be performed on numeric, date or categorical data, text data can be processed in a similar fashion, although text data usually undergoes the regex and string transformations deemed necessary for the NLP tasks it will be used for thereafter.
 
 ### Data Cleaning Tools
-You're not going to hunt tigers without a rifle! You have a ton of tools out there that will help you during the data cleaning process, the one I want to suggest you is [this](https://www.analyticsindiamag.com/10-best-data-cleaning-tools-get-data/) open source tool from Google. Check [here](https://www.quora.com/What-are-the-best-open-source-data-cleansing-tools-software-available) for more.
+You're not going to hunt tigers without a rifle! You have a ton of tools out there that will help you during the data cleaning process, the one Virgilio wants to suggest is [this](https://www.analyticsindiamag.com/10-best-data-cleaning-tools-get-data/) open source tool from Google. Check [here](https://www.quora.com/What-are-the-best-open-source-data-cleansing-tools-software-available) for more.
 
 ## Visualization
 Visualization of data before and after many of the above steps is vital, to ensure that the balance, bias and shape of the data are maintained and that the transformed or preprocessed data is representative of its original form. Even if we can't control the way such data is going to evolve, we can at least see the before and after effects of a transformation/preprocessing step before proceeding with it. And if we do proceed with it, we know from the visuals what the outcome stands to be (more or less).
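A minimal before/after sketch of the visualization idea above, using matplotlib — the log transform is just an illustrative stand-in for any preprocessing step:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

values = pd.Series(np.random.lognormal(mean=0, sigma=1, size=1_000))
transformed = np.log1p(values)  # stand-in for any transformation step

# Plot the distribution before and after, side by side,
# to eyeball whether the step changed the shape of the data
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
values.hist(ax=ax1, bins=30)
ax1.set_title("before")
transformed.hist(ax=ax2, bins=30)
ax2.set_title("after")
plt.show()
```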
@@ -178,7 +178,7 @@ Look [here](https://www.trifacta.com/blog/4-key-steps-to-sanity-checking-your-da _Best practices and exercises:_ [1](https://blog.socialcops.com/academy/resources/4-data-checks-clean-data/), [2](https://www.r-bloggers.com/data-sanity-checks-data-proofer-and-r-analogues/), [3](https://www.quora.com/What-is-the-example-of-Sanity-testing-and-smoke-testing) ## Automate These Boring Stuffs! -As I told you at the very beginning, the data preprocessing process can take a long time and be very tedious. Because of this, you want to [automate](https://www.youtube.com/watch?v=UZUoH7_mYx4) the most you can. Also, **automation is married with iteration**, so this is the way you need to plan your data preprocessing pipelines. [Here](https://github.com/mdkearns/automated-data-preprocessing) you find a good command line tool for doing that, but I'm almost sure you'll need to build your own (remember, each problem is unique!), but this is a good starting point. +As Virgilio told you at the very beginning, the data preprocessing process can take a long time and be very tedious. Because of this, you want to [automate](https://www.youtube.com/watch?v=UZUoH7_mYx4) the most you can. Also, **automation is married with iteration**, so this is the way you need to plan your data preprocessing pipelines. [Here](https://github.com/mdkearns/automated-data-preprocessing) you find a good command line tool for doing that, but Virgilio is almost sure you'll need to build your own (remember, each problem is unique!), but this is a good starting point. ## Doing it in real-time Fully connected to the [previous section](#Automate-These-Boring-Stuffs!), automating redundant or repeated tasks makes the workflow repeatable, consistent, efficient and reliable. And given these qualities, it's not far away from being given the task of handling real-world raw data directly from the source or the various sources (centralizing or aggregation of data). This takes away the whole manual step from the process and keeps things real and practical -- production ready all the time. In this way you can see all the flavours of data/input and the nuances and edge-cases to handle each time a step fails or gives false positives or false negatives. @@ -201,7 +201,7 @@ Each time you're facing a data related problem, try to understand what data you The tools we're gonna use are Python3 and his [Pandas library](https://pandas.pydata.org/), the de-facto standard to manipulate datasets. There are a whole lot of other tools that have come out which are either built on top of Pandas or Numpy or independently, see [Data Preparation on awesome-ai-ml-dl](https://github.com/neomatrix369/awesome-ai-ml-dl/blob/master/data/data-preparation.md) for more details. The heavy lifting here is done by the [DataFrame class](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html), which comes with a bunch of useful functions for your daily data tasks. -Hopefully, you already know Python, if not start from there (do the steps I suggest you in the ML guide requirements), and then take this [Beginner Pandas tutorial](https://web.archive.org/web/20200719131732/https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html). Don't worry if by now some ideas are not totally clear, but try to get the big picture of the common [Pandas operations](https://www.analyticsvidhya.com/blog/2016/01/12-pandas-techniques-python-data-manipulation/). 
+Hopefully, you already know Python, if not start from there (do the steps Virgilio suggested to you in the ML guide requirements), and then take this [Beginner Pandas tutorial](https://web.archive.org/web/20200719131732/https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html). Don't worry if by now some ideas are not totally clear, but try to get the big picture of the common [Pandas operations](https://www.analyticsvidhya.com/blog/2016/01/12-pandas-techniques-python-data-manipulation/). _Best practices and exercises:_ [1](https://github.com/guipsamora/pandas_exercises), [2](https://www.w3resource.com/python-exercises/pandas/index.php), [3](https://www.machinelearningplus.com/python/101-pandas-exercises-python/), [4](https://www.kaggle.com/pistak/pandas-tutorial-with-interactive-exercises), [5](http://disi.unitn.it/~teso/courses/sciprog/python_pandas_exercises.html) From e911c309de5d88fd98c1d5c79e24f8bdefb4858d Mon Sep 17 00:00:00 2001 From: Mani Sarkar Date: Tue, 1 Dec 2020 20:44:43 +0000 Subject: [PATCH 11/18] Data prepation guide: removing to and correcting the sentence --- content/purgatorio/collect-and-prepare-data/data-preparation.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/purgatorio/collect-and-prepare-data/data-preparation.md b/content/purgatorio/collect-and-prepare-data/data-preparation.md index 171428b7..ff13896e 100644 --- a/content/purgatorio/collect-and-prepare-data/data-preparation.md +++ b/content/purgatorio/collect-and-prepare-data/data-preparation.md @@ -102,7 +102,7 @@ _Best practices and exercises:_ [1](https://www.kaggle.com/nirmal51194/data-clea [Here](https://www.geeksforgeeks.org/python-pandas-series-astype-to-convert-data-type-of-series/) is a good tutorial on how to convert value types. Remember that Python has some shortcut for doing this (executing `str(3)` will give you back the "3" string) but Virgilio recommends you to learn how to do it with Pandas. ### Remove Duplicates -You don't want to duplicate data, they are noise and occupy space! Learn [how to handle them simply](https://www.geeksforgeeks.org/python-pandas-dataframe-drop_duplicates/) with Pandas. +You don't want duplicate data, they may be noisy, redundant and occupy more space! Learn [how to handle them simply](https://www.geeksforgeeks.org/python-pandas-dataframe-drop_duplicates/) with Pandas. ### Spell Check You want to correct wrong words, for the sake of evenness. Check [here](https://www.tutorialspoint.com/python_text_processing/python_spelling_check.htm) for a good Python module to do it. Also, this is a good starting point to [implement it](https://stackoverflow.com/questions/46409475/spell-checker-in-pandas). 
From 33b56fa6a6ac698c454bf9f91556371057af5b42 Mon Sep 17 00:00:00 2001
From: Mani Sarkar
Date: Tue, 1 Dec 2020 20:48:21 +0000
Subject: [PATCH 12/18] Data preparation guide: removing references to FE terminology

---
 content/purgatorio/collect-and-prepare-data/data-preparation.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/purgatorio/collect-and-prepare-data/data-preparation.md b/content/purgatorio/collect-and-prepare-data/data-preparation.md
index ff13896e..980b1f11 100644
--- a/content/purgatorio/collect-and-prepare-data/data-preparation.md
+++ b/content/purgatorio/collect-and-prepare-data/data-preparation.md
@@ -132,7 +132,7 @@ UTF-encoding is the standard to follow, but remember that not everyone follows t

_Best practices and exercises:_ [1](https://www.w3resource.com/python-exercises/python-basic-exercise-92.php), [2](https://stackoverflow.com/questions/22518703/escape-sequences-exercise-in-python?rq=1), [3](https://learnpythonthehardway.org/book/ex2.html)

### Verification to enrich data
-Sometimes can be useful to engineer some data, for example: suppose you're dealing with [e-commerce data](https://www.edataindia.com/why-data-cleansing-is-important/), and you have the prices of each object sold. You may want to add a new column in your dataset, with a label carrying handy information like a Price_level [low, medium, high] based on upper and lower bounds you can decide. This is really simple in Pandas, check [here](https://stackoverflow.com/questions/19913659/pandas-conditional-creation-of-a-series-dataframe-column). Another example is to add a Gender column (M, F) to easily explore data and gain insights in a customers dataset. Also known as bucket values, which are super handy -- may even midly fall under the Feature Engineering category.
+Sometimes it can be useful to engineer some data, for example: suppose you're dealing with [e-commerce data](https://www.edataindia.com/why-data-cleansing-is-important/), and you have the prices of each object sold. You may want to add a new column in your dataset, with a label carrying handy information like a Price_level [low, medium, high] based on upper and lower bounds you can decide. This is really simple in Pandas, check [here](https://stackoverflow.com/questions/19913659/pandas-conditional-creation-of-a-series-dataframe-column). Another example is to add a Gender column (M, F) to easily explore data and gain insights into a customer dataset.
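A minimal sketch of the Price_level idea just described, using `pd.cut`; the bounds and labels are invented for illustration:

```python
import pandas as pd

# Hypothetical e-commerce prices.
df = pd.DataFrame({"price": [3.99, 12.50, 89.00, 45.00, 7.25]})

# Bucket each continuous price into a handy Price_level label,
# based on upper and lower bounds we decide.
df["Price_level"] = pd.cut(
    df["price"],
    bins=[0, 10, 50, float("inf")],
    labels=["low", "medium", "high"],
)

print(df)
```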
_Best practices and exercises:_ [1](https://web.archive.org/web/20200813205611/http://www.inweb.org.br/w3c/dataenrichment/), [2](https://solutionsreview.com/data-integration/best-practices-for-data-enrichment-after-etl/)

From 70c75ca84c574a9ac72bd1b06740a77ac4d39b3b Mon Sep 17 00:00:00 2001
From: Mani Sarkar
Date: Tue, 1 Dec 2020 20:52:44 +0000
Subject: [PATCH 13/18] Data preparation guide: added lines and improved formatting and language

---
 .../purgatorio/collect-and-prepare-data/data-preparation.md | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/content/purgatorio/collect-and-prepare-data/data-preparation.md b/content/purgatorio/collect-and-prepare-data/data-preparation.md
index 980b1f11..1f7a173d 100644
--- a/content/purgatorio/collect-and-prepare-data/data-preparation.md
+++ b/content/purgatorio/collect-and-prepare-data/data-preparation.md
@@ -143,7 +143,11 @@ _Best practices and exercises:_ [1](https://www.researchgate.net/post/What_are_t
### Feature Scaling
Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.
-[Here](Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.) you find a serious tutorial about this fundamental step. Also known as Normalizing (bring value of numeric column between 0 and 1) or Standardizing (bring value of numeric column between -1 and 1) data, see [Normalization vs Standardization](https://towardsdatascience.com/normalization-vs-standardization-cb8fe15082eb). Normalization is also called min-max approach, see another [example](https://towardsdatascience.com/data-normalization-with-pandas-and-scikit-learn-7c1cc6ed6475).
+[Here](Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.) you find a serious tutorial about this fundamental step.
+
+Also known as Normalizing (bring the values of a numeric column between 0 and 1) or Standardizing (bring the values of a numeric column between -1 and 1) data, see [Normalization vs Standardization](https://towardsdatascience.com/normalization-vs-standardization-cb8fe15082eb).
+
+Normalization is also called min-max approach, see another [example](https://towardsdatascience.com/data-normalization-with-pandas-and-scikit-learn-7c1cc6ed6475).
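As a quick sketch of the min-max approach just mentioned (the column is made up), the whole trick in plain Pandas is:

```python
import pandas as pd

# Hypothetical numeric column; min-max normalization rescales it into [0, 1].
df = pd.DataFrame({"height_cm": [150.0, 160.0, 170.0, 190.0]})

col = df["height_cm"]
df["height_scaled"] = (col - col.min()) / (col.max() - col.min())

print(df)
```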
_Best practices and exercises:_ [1](https://www.kaggle.com/jfeng1023/data-cleaning-challenge-scale-and-normalize-data), [2](https://www.quora.com/When-should-you-perform-feature-scaling-and-mean-normalization-on-the-given-data-What-are-the-advantages-of-these-techniques), [3](https://www.quora.com/When-do-I-have-to-do-feature-scaling-in-machine-learning)

From 7c16732b1d89237896331c3eaf3b5925dbd365e7 Mon Sep 17 00:00:00 2001
From: Mani Sarkar
Date: Tue, 1 Dec 2020 21:02:49 +0000
Subject: [PATCH 14/18] Data preparation guide: adding a note about presence of paid/premium product

---
 .../purgatorio/collect-and-prepare-data/data-preparation.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/content/purgatorio/collect-and-prepare-data/data-preparation.md b/content/purgatorio/collect-and-prepare-data/data-preparation.md
index 1f7a173d..3b430893 100644
--- a/content/purgatorio/collect-and-prepare-data/data-preparation.md
+++ b/content/purgatorio/collect-and-prepare-data/data-preparation.md
@@ -77,7 +77,9 @@ During this informal meeting, ask the data questions like:
Eventually, you may find the data to be too quiet, maybe it's just shy! \
Anyway, you're going to [ask these questions to the business user](https://business-analysis-excellence.com/business-requirements-meeting/)!

-_Best practices and exercises:_ [1](https://www.iqint.org/idq2013/presentations/downloads/di_loreto_data_profiling_tutorial_monday_am.pdf), [2](https://community.alteryx.com/t5/Alteryx-Designer-Discussions/Data-profiling-tutorials-use-cases-and-exercise/td-p/145347)
+_Best practices and exercises:_ [1](https://www.iqint.org/idq2013/presentations/downloads/di_loreto_data_profiling_tutorial_monday_am.pdf)++, [2](https://community.alteryx.com/t5/Alteryx-Designer-Discussions/Data-profiling-tutorials-use-cases-and-exercise/td-p/145347)++
+
+> ++ - beware that these resources contain one or more premium or commercial (paid) products; if you are aware of an alternative solution to them, please do share it with us

## Data Cleaning
[Data cleaning](https://en.wikipedia.org/wiki/Data_cleansing) is the general process of taking data, after you have a clear big picture of them, and you need to realize the actual process of replacing characters, dropping incomplete rows, fill missing values and so forth. In the next sections, we'll explore all the common data cleaning situations. Also see [Data Cleaning on awesome-ai-ml-dl](https://github.com/neomatrix369/awesome-ai-ml-dl/blob/master/data/data-preparation.md#data-cleaning) section to learn more about this topic.
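Since the Data Cleaning paragraph above mentions dropping incomplete rows and filling missing values, here is a minimal sketch of those two steps in Pandas; the columns and fill strategies are invented for illustration:

```python
import pandas as pd
import numpy as np

# Toy dataset with missing values.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0],
    "city": ["Rome", "Milan", None],
})

df = df.dropna(how="all")                          # drop rows that are fully empty
df["age"] = df["age"].fillna(df["age"].median())   # numeric gap: fill with the median
df["city"] = df["city"].fillna("unknown")          # categorical gap: fill with a sentinel

print(df)
```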
From 6c00e91d22ba8cd703a8c5e41eb2c3c898fdbefd Mon Sep 17 00:00:00 2001
From: Mani Sarkar
Date: Tue, 1 Dec 2020 21:06:12 +0000
Subject: [PATCH 15/18] Data preparation guide: fixed the description of the Data Cleaning section

---
 .../collect-and-prepare-data/data-preparation.md | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/content/purgatorio/collect-and-prepare-data/data-preparation.md b/content/purgatorio/collect-and-prepare-data/data-preparation.md
index 3b430893..7c85aed5 100644
--- a/content/purgatorio/collect-and-prepare-data/data-preparation.md
+++ b/content/purgatorio/collect-and-prepare-data/data-preparation.md
@@ -82,7 +82,15 @@ _Best practices and exercises:_ [1](https://www.iqint.org/idq2013/presentations/
> ++ - beware that these resources contain one or more premium or commercial (paid) products; if you are aware of an alternative solution to them, please do share it with us

## Data Cleaning
-[Data cleaning](https://en.wikipedia.org/wiki/Data_cleansing) is the general process of taking data, after you have a clear big picture of them, and you need to realize the actual process of replacing characters, dropping incomplete rows, fill missing values and so forth. In the next sections, we'll explore all the common data cleaning situations. Also see [Data Cleaning on awesome-ai-ml-dl](https://github.com/neomatrix369/awesome-ai-ml-dl/blob/master/data/data-preparation.md#data-cleaning) section to learn more about this topic.
+[Data cleaning](https://en.wikipedia.org/wiki/Data_cleansing) is the general process of ensuring that the quality of your data is enough to satisfy the requirements of the problem you want to solve.
+
+For example, it can consist of replacing characters in strings, dropping incomplete rows, filling missing values and so forth. In the next sections, we'll explore all the common data cleaning situations.
+
+While it's hard to state that some steps are strictly required and others aren't, it's clever to know and try as many approaches as possible.
+
+Also see the [Data Cleaning on awesome-ai-ml-dl](https://github.com/neomatrix369/awesome-ai-ml-dl/blob/master/data/data-preparation.md#data-cleaning) section to learn more about this topic.
+
+We will assume the data is tabular; to see more about other types of data, check the related sections of the Inferno.

### Get Rid of Extra Spaces
One of the first things you want to do is [remove extra spaces](https://stackoverflow.com/questions/43332057/pandas-strip-white-space). Take care! Some spaces can carry information, but it heavily depends on the situation. For example, in "Complete Name": "Giacomo Ciarlini" it's nice to have the space, so we can later split this into "Name": "Giacomo" and "Surname": "Ciarlini". Virgilio wants you to notice that in general, apart from recommendation and customization systems, unique identifiers like names or IDs are something you can generally drop. Often, they do not carry information.
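A small sketch of the extra-spaces advice above (toy names): trim the edges, collapse internal runs of spaces, then split the full name into its parts:

```python
import pandas as pd

# Toy column with stray whitespace around and inside the values.
df = pd.DataFrame({"complete_name": ["  Giacomo Ciarlini ", " Andrea   Carli  "]})

# Strip leading/trailing spaces and collapse internal runs of spaces to one.
df["complete_name"] = (
    df["complete_name"].str.strip().str.replace(r"\s+", " ", regex=True)
)

# Split the cleaned value into separate Name and Surname columns.
df[["name", "surname"]] = df["complete_name"].str.split(" ", n=1, expand=True)

print(df)
```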
From d33dde9b8f9436d38beef20962c70b9d60da5 Mon Sep 17 00:00:00 2001
From: Mani Sarkar
Date: Tue, 1 Dec 2020 21:20:47 +0000
Subject: [PATCH 16/18] Data preparation guide: adding the Types of data section (expanding Text Data section)

---
 .../data-preparation.md | 21 ++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/content/purgatorio/collect-and-prepare-data/data-preparation.md b/content/purgatorio/collect-and-prepare-data/data-preparation.md
index 7c85aed5..3bb3ebc8 100644
--- a/content/purgatorio/collect-and-prepare-data/data-preparation.md
+++ b/content/purgatorio/collect-and-prepare-data/data-preparation.md
@@ -30,7 +30,7 @@ The purpose of this guide is to show you the importance of these steps, mostly a
  - [Change Text to Lower/Upper Case](#Change-Text-to-Lower/Upper-Case)
  - [Normalizing Dates](#Normalizing-Dates)
  - [Feature Scaling](#Feature-Scaling)
-  - [Text data](#Text-data)
+  - [Types of data](#types-of-data)
  - [Data Cleaning Tools](#Data-Cleaning-Tools)
  - [Visualization](#Visualization)
  - [Merge Data Sets and Integration](#Merge-Data-Sets-and-Integration)
@@ -161,8 +161,23 @@ _Best practices and exercises:_ [1](https://www.kaggle.com/jfeng1023/data-cleani
-### Text data
-Just like the above, transformations or preprocessings can be performed on numeric, date or categorical data, similarly text data can also be processed in such a fashion. Although text data would undergo regex and string transformation processes deemed necessary for the NLP tasks they would be used for thereafter.
+### Types of data
+
+**Tabular data**
+
+Also known as columnar or spreadsheet-like data, where each column may be a different data type like string, numeric, date, etc. This includes most kinds of data commonly stored in a relational database, or in tab-separated or .csv files.
+
+Such data can then represent categorical, numeric/continuous, time-series data or a mix of all of these in different proportions -- this is the next level of abstraction of such types of data.
+
+**Text data**
+
+Just as transformations or preprocessing can be performed on numeric, date or categorical data, text data can also be processed in such a fashion, although text data would undergo the regex and string transformation processes deemed necessary for the NLP tasks it will be used for thereafter. The end result of such processing could be one or more tabular datasets, which could then be processed further like any other tabular dataset (see the section above).
+
+**Image/Video/Audio/Signal data**
+
+Unlike Tabular or Text data, such data is made up of mostly continuous values. The original data would be in binary format, in the form of directories of files. These files would then be processed and transformed into rows and columns of continuous data, with a minority of categorical or other data types to represent such data; eventually they may be represented in the tabular format for analysis, processing and training purposes. And so these final datasets would go through the same preprocessing as any other tabular data would.
+
+**Note:** _Not to be confused with the term Time-series data.
The concept of time-series is the next level of abstraction of this type of data. Each of the data types above can be covered in more detail in further guides at the **Inferno** or **Paradiso** levels; they are outside the current scope, to keep these concepts brief. To catch a glimpse of some of the specific preprocessing or transformation steps that we can do per type of data, see this [resource](https://www.linkedin.com/posts/shivan-kumar_datascience-machinelearning-deeplearning-activity-6732600618751442944-kNRY)._

### Data Cleaning Tools
You're not going to hunt tigers without a rifle! You have a ton of tools out there that will help you during the data cleaning process; the one Virgilio wants to suggest to you is [this](https://www.analyticsindiamag.com/10-best-data-cleaning-tools-get-data/) open source tool from Google. Check [here](https://www.quora.com/What-are-the-best-open-source-data-cleansing-tools-software-available) for more.

From 918048d05105d825cba3d30da043d3226f5b03e0 Mon Sep 17 00:00:00 2001
From: Mani Sarkar
Date: Thu, 10 Dec 2020 15:55:49 +0000
Subject: [PATCH 17/18] Data preparation guide: correcting the definitions of Normalisation and Standardisation

---
 content/purgatorio/collect-and-prepare-data/data-preparation.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/purgatorio/collect-and-prepare-data/data-preparation.md b/content/purgatorio/collect-and-prepare-data/data-preparation.md
index 3bb3ebc8..f6dcd7d4 100644
--- a/content/purgatorio/collect-and-prepare-data/data-preparation.md
+++ b/content/purgatorio/collect-and-prepare-data/data-preparation.md
@@ -155,7 +155,7 @@ _Best practices and exercises:_ [1](https://www.researchgate.net/post/What_are_t
Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.
[Here](Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.) you find a serious tutorial about this fundamental step.

-Also known as Normalizing (bring the values of a numeric column between 0 and 1) or Standardizing (bring the values of a numeric column between -1 and 1) data, see [Normalization vs Standardization](https://towardsdatascience.com/normalization-vs-standardization-cb8fe15082eb).
+Also known as Normalizing data (bring the values of a numeric column between 0 and 1) or Standardizing data (bring the values of a numeric column between -n and m -- there is a notion that they can be between -1 and 1, but in reality n and m depend on the minimum and maximum values of the original distribution, respectively), see [Normalization vs Standardization](https://towardsdatascience.com/normalization-vs-standardization-cb8fe15082eb).

Normalization is also called min-max approach, see another [example](https://towardsdatascience.com/data-normalization-with-pandas-and-scikit-learn-7c1cc6ed6475).
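To complement the min-max sketch earlier, here is a minimal sketch (with a made-up column) of standardization as redefined in the patch above; note how the resulting range depends on the original distribution rather than fixed bounds:

```python
import pandas as pd

# Hypothetical numeric column; z-score standardization centers it at 0
# with unit variance. The resulting min and max (-n and m) depend on
# the distribution, not on fixed bounds.
df = pd.DataFrame({"income": [1200.0, 1500.0, 2300.0, 8000.0]})

col = df["income"]
df["income_standardized"] = (col - col.mean()) / col.std()

print(df)
```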
From e409ba7d937c6f86389325276dacddb64b3a3cab Mon Sep 17 00:00:00 2001
From: clone1995
Date: Sun, 24 Jan 2021 00:35:28 +0100
Subject: [PATCH 18/18] small fixes and formatting enhancement

---
 .../data-preparation.md | 134 ++++++++++++------
 1 file changed, 94 insertions(+), 40 deletions(-)

diff --git a/content/purgatorio/collect-and-prepare-data/data-preparation.md b/content/purgatorio/collect-and-prepare-data/data-preparation.md
index f6dcd7d4..8f4ccbdf 100644
--- a/content/purgatorio/collect-and-prepare-data/data-preparation.md
+++ b/content/purgatorio/collect-and-prepare-data/data-preparation.md
@@ -5,7 +5,7 @@ description: The purpose of this guide is to show you the different preprocessin
---

# What you will learn
Real-world data is almost always messy or unstructured, and most of the time of the Data Scientist is spent in data preprocessing (also called data cleaning or data wrangling), before visualizing them or feeding them to Machine Learning models.

The purpose of this guide is to show you the importance of these steps, mostly about text data, but there will be guides on cleaning different kinds of data you may encounter.

@@ -47,27 +47,35 @@ The purpose of this guide is to show you the importance of these steps, mostly a
**Let's Start!**

## Start Small
It's not a good idea to load gigabytes of data each time you want to try a data preparation step.

Start with [small subsets](https://sdtimes.com/bi/data-gets-big-best-practices-data-preparation-scale/) of the data instead (but take care that they are representative and you catch all the problems).

Remember, if you want to experiment with text cleaning, you don't need to launch your script on 10M rows. Test your script on a small subset or sample of the data to learn if it works well there before going full-scale.

## Business Questions
Before trying to prepare the data, you want to be sure you have the right objective in mind.

Asking the [right business questions](https://www.datapine.com/blog/data-analysis-questions/) is hard, but it has the [biggest impact](https://towardsdatascience.com/start-your-data-exploration-with-questions-2f1d42cff29e) on your ability to solve a particular problem.

Remember, you want to [solve a problem](http://www.informit.com/articles/article.aspx?p=2271188&seqNum=2), not to create new ones!
## Data Preprocessing (Data wrangling / Data manipulation)
**Data preprocessing** (also known as Data Preparation, but "Preprocessing" sounds more like magic) is the **[iterative process](http://www.jsoftware.us/vol12/306-JSW15277.pdf) of gathering, combining, structuring, and (re)organizing data so it can be analyzed as part of data visualization, analytics, and machine learning processes.**

[Real-world data](https://www.quanticate.com/blog/real-world-data-analysis-in-clinical-trials) is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues.

It's the [core ability](https://blogs.sas.com/content/hiddeninsights/2017/11/30/analytical-data-preparation-important/) of any data scientist or data engineer, and you must _be able to manipulate, clean, and structure_ your data during everyday work (besides expecting that this will take up most of your [daily time](https://www.infoworld.com/article/3228245/the-80-20-data-science-dilemma.html)!).

There are a lot of different data types out there, and they deserve [different treatments](http://blog.appliedinformaticsinc.com/data-mining-challenges-in-data-cleaning/).

As usual, the structure Virgilio has planned to get you started consists of having a [general overview](https://searchbusinessanalytics.techtarget.com/definition/data-preparation), and then diving deep into each data processing situation you can encounter.

[Here](https://towardsdatascience.com/data-pre-processing-techniques-you-should-know-8954662716d6) you have a gentle end-to-end panoramic view of the most common data preparation steps.

## Data Profiling
According to the [Wikipedia definition](https://en.wikipedia.org/wiki/Data_profiling): "Data profiling is the process of examining the data available from an existing information source (e.g. a database or a file) and collecting statistics and informative data summaries."
\
So Wikipedia is subtly suggesting that we have a coffee with our data. :-)

During this informal meeting, ask the data questions like:
- which business problem are you meant to solve? (what is important, and what is not)
- how have you been collected (with noise, missing values...)?
- how many friends of yours are there and where can I find them (data dimensions and retrieving from storages, external databases, APIs)?

Eventually, you may find the data to be too quiet, maybe it's just shy! \
Anyway, you're going to [ask these questions to the business user](https://business-analysis-excellence.com/business-requirements-meeting/)!

Check these tools to quickly make a profile of your data and get a 3,000-foot view of it:

- [Dtale](https://github.com/man-group/dtale)
- [Pandas Profiling](https://github.com/pandas-profiling/pandas-profiling)

_Best practices and exercises:_ [1](https://www.iqint.org/idq2013/presentations/downloads/di_loreto_data_profiling_tutorial_monday_am.pdf)++, [2](https://community.alteryx.com/t5/Alteryx-Designer-Discussions/Data-profiling-tutorials-use-cases-and-exercise/td-p/145347)++

> ++ - beware that these resources contain one or more premium or commercial (paid) products; if you are aware of an alternative solution to them, please do share it with us

## Data Cleaning
[Data cleaning](https://en.wikipedia.org/wiki/Data_cleansing) is the general process of ensuring that the quality of your data is enough to satisfy the requirements of the problem you want to solve.

For example, it can consist of replacing characters in strings, dropping incomplete rows, filling missing values, and so forth. In the next sections, we'll explore all the common data cleaning situations.

While it's hard to state that some steps are strictly required and others aren't, it's clever to know and try as many approaches as possible.

Also, see the [Data Cleaning on awesome-ai-ml-dl](https://github.com/neomatrix369/awesome-ai-ml-dl/blob/master/data/data-preparation.md#data-cleaning) section to learn more about this topic.

We will assume the data is tabular; to see more about other types of data, check the related sections of the Inferno.

### Get Rid of Extra Spaces
One of the first things you want to do is [remove extra spaces](https://stackoverflow.com/questions/43332057/pandas-strip-white-space). Take care! Some spaces can carry information, but it heavily depends on the situation.
For example, in "Complete Name": "Andrea Carli" it's nice to have the space, so we can later split this into "Name": "Andrea" and "Surname": "Carli".

Virgilio wants you to notice that in general, apart from recommendation and customization systems, unique identifiers like names or IDs are something you can generally drop. Often, they do not carry information.

_Bonus tip_: learn how to use [Regex](https://www.analyticsvidhya.com/blog/2015/06/regular-expression-python/) for pattern matching; this is one of the powerful tools each data guy needs to master.

_Best practices and exercises:_ [1](https://www.quora.com/How-do-you-remove-all-whitespace-from-a-Python-string), [2](https://towardsdatascience.com/5-methods-to-remove-the-from-your-data-in-python-and-the-fastest-one-281489382455), [3](https://www.tutorialspoint.com/How-to-remove-all-leading-whitespace-in-string-in-Python)

[Here](https://www.geeksforgeeks.org/python-pandas-series-astype-to-convert-data-type-of-series/) is a good tutorial on how to convert value types. Remember that Python has some shortcuts for doing this (executing `str(3)` will give you back the "3" string), but Virgilio recommends you learn how to do it with Pandas.

### Remove Duplicates
You don't want duplicate data: it may be noisy and redundant, and it occupies more space! Learn [how to handle them simply](https://www.geeksforgeeks.org/python-pandas-dataframe-drop_duplicates/) with Pandas.

### Spell Check
You want to correct wrong words, for the sake of consistency. Check [here](https://www.tutorialspoint.com/python_text_processing/python_spelling_check.htm) for a good Python module to do it. Also, this is a good starting point to [implement it](https://stackoverflow.com/questions/46409475/spell-checker-in-pandas).

This is also useful when you are dealing with text data (columns of text data in a tabular dataset).

_Best practices and exercises:_ [1](https://stackoverflow.com/questions/7315114/spell-check-program-in-python), [2](https://norvig.com/spell-correct.html), [3](https://github.com/garytse89/Python-Exercises/tree/master/autoCorrect)

### Grammar Check
Just like Spell Check, a Grammar check of text data can be of great importance depending on the NLP task you are about to perform with it.

### Reshape your data
Maybe you're going to feed your data into a neural network, or show them in a colorful bars plot. Anyway, you need to transform your data and give them the right shape for your data pipeline. [Here](https://towardsdatascience.com/seven-clean-steps-to-reshape-your-data-with-pandas-or-how-i-use-python-where-excel-fails-62061f86ef9c) is a very good tutorial for this task.

_Best practices and exercises:_ [1](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html), [2](https://discuss.codecademy.com/t/faq-data-cleaning-with-pandas-reshaping-your-data/384794).
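As a small sketch of reshaping (the wide-format data is invented for illustration), `melt` turns columns into rows, and `pivot_table` goes the other way:

```python
import pandas as pd

# Toy wide-format data: one column per year.
df = pd.DataFrame({
    "city": ["Rome", "Milan"],
    "2019": [100, 80],
    "2020": [110, 85],
})

# Wide to long: one row per (city, year) pair.
long_df = df.melt(id_vars="city", var_name="year", value_name="sales")

# Long back to wide.
wide_df = long_df.pivot_table(index="city", columns="year", values="sales")

print(long_df)
print(wide_df)
```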
### Converting to categorical data type
When dealing with numeric or string (alphanumeric) columns that represent categories or multi-class labels, it's best to convert them into the categorical type.

This does not just save memory, it also makes the dataframe faster to operate on, and it makes the data analysis step easier to perform.

Further to that, categorical column types under the hood maintain a category code per value in the column, which can be used instead of the string equivalents, saving some preprocessing or column transformations.

One additional benefit of doing this would be to help spot inconsistent namings and replace them with consistent ones; inconsistent labels can lead to incorrect analysis and visualizations, although they can be spotted during the summarization of categorical data.

Read all about it in the [Pandas docs](https://pandas.pydata.org/docs/) on the [Categorical data type](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html).

### Dealing with Special Characters
UTF-encoding is the standard to follow, but remember that not everyone follows the rules (otherwise, we'd not need [crime predictive analytics](http://scholarworks.sjsu.edu/cgi/viewcontent.cgi?article=1633&context=etd_projects)).

You can learn [here](https://stackoverflow.com/questions/45596529/replacing-special-characters-in-pandas-dataframe) how to deal with strange accents or special characters.

_Best practices and exercises:_ [1](https://www.w3resource.com/python-exercises/python-basic-exercise-92.php), [2](https://stackoverflow.com/questions/22518703/escape-sequences-exercise-in-python?rq=1), [3](https://learnpythonthehardway.org/book/ex2.html)

### Verification to enrich data
Sometimes it can be useful to engineer some data, for example: suppose you're dealing with [e-commerce data](https://www.edataindia.com/why-data-cleansing-is-important/), and you have the prices of each object sold. You may want to add a new column in your dataset, with a label carrying handy information like a Price_level [low, medium, high] based on upper and lower bounds you can decide.

This is simple in Pandas, check [here](https://stackoverflow.com/questions/19913659/pandas-conditional-creation-of-a-series-dataframe-column).
Another example is to add a Gender column (M, F) to easily explore data and gain insights into a customer dataset.

_Best practices and exercises:_ [1](https://web.archive.org/web/20200813205611/http://www.inweb.org.br/w3c/dataenrichment/), [2](https://solutionsreview.com/data-integration/best-practices-for-data-enrichment-after-etl/)

_Best practices and exercises:_ [1](https://www.researchgate.net/post/What_are_t

### Feature Scaling
Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.

Also known as "Normalizing data" (bringing the values of a numeric column between 0 and 1) or "Standardizing data".
See [Normalization vs Standardization](https://towardsdatascience.com/normalization-vs-standardization-cb8fe15082eb).

Normalization is also called the min-max approach; see another [example](https://towardsdatascience.com/data-normalization-with-pandas-and-scikit-learn-7c1cc6ed6475).

_Best practices and exercises:_ [1](https://www.kaggle.com/jfeng1023/data-cleaning-challenge-scale-and-normalize-data), [2](https://www.quora.com/When-should-you-perform-feature-scaling-and-mean-normalization-on-the-given-data-What-are-the-advantages-of-these-techniques), [3](https://www.quora.com/When-do-I-have-to-do-feature-scaling-in-machine-learning)

### Types of data

**Tabular data**

Also known as columnar or spreadsheet-like data, where each column may be a different data type like string, numeric, date, etc.
This includes most kinds of data commonly stored in a relational database, or in tab-separated or .csv files.

Such data can then represent categorical, numeric/continuous, time-series data, or a mix of all of these in different proportions -- this is the next level of abstraction of such types of data.

**Text data**

Just as transformations or preprocessing can be performed on numeric, date, or categorical data, text data can also be processed in such a fashion.

Text data, though, would undergo the regex and string transformation processes deemed necessary for the NLP tasks it will be used for thereafter.

The result of such processing could be one or more tabular datasets, which could then be processed further like any other tabular dataset (see the section above).

**Image/Video/Audio/Signal data**

Unlike Tabular or Text data, such data is made up of mostly continuous values. The original data would be in binary format, in the form of directories of files.

These files would then be processed and transformed into rows and columns of continuous data, with a minority of categorical or other data types to represent such data. Eventually, they may be represented in the tabular format for analysis, processing, and training purposes.

And so these final datasets would go through the same preprocessing as any other tabular data would.

**Note:** _Each of the data types above can be covered in more detail in further guides at the **Inferno** or **Paradiso** levels; they are outside the current scope, to keep these concepts brief.
To catch a glimpse of some of the specific preprocessing or transformation steps that we can do per type of data, see this [resource](https://www.linkedin.com/posts/shivan-kumar_datascience-machinelearning-deeplearning-activity-6732600618751442944-kNRY)._

### Data Cleaning Tools
You're not going to hunt tigers without a rifle! You have a ton of tools out there that will help you during the data cleaning process; the one Virgilio wants to suggest to you is [this](https://www.analyticsindiamag.com/10-best-data-cleaning-tools-get-data/) open-source tool from Google. Check [here](https://www.quora.com/What-are-the-best-open-source-data-cleansing-tools-software-available) for more.

## Visualization
Visualization of data before and after many of the above steps is vital, to ensure the balance, bias, and shape of the data are maintained, and that the transformed or preprocessed data is representative of its original form.

Even if we can't control the way such data is going to evolve, we can at least see the before and after effects of a transformation/preprocessing step before proceeding with it. And if we do proceed with it, we know from the visuals what the outcome stands to be (more or less).

The specifics of what kinds of visualizations to use are to be made available in the Visualisation Guide.

## Merge Data Sets and Integration
Now that you hopefully have been successful in your data cleaning process, you can merge data from different sources to create big [de-normalized](https://www.researchgate.net/post/When_and_why_do_we_need_data_normalization_in_data_mining_algorithms) data tables, ready to be explored and consumed. [This](https://www.quora.com/Is-data-warehouse-normalized-or-denormalized-Why) is why.

## Aggregating data (centralizing)
Aggregating or centralizing data (sometimes called normalizing data) overlaps with the [Data Collection](https://virgili0.github.io/Virgilio/purgatorio/collect-and-prepare-data/data-collection.html) topic covered in the respective guide, but it's good to touch on it here and be reminded of it briefly.
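Before digging deeper, here is a minimal sketch (with invented sources and columns) of bringing different sources into one form with Pandas, which is also the mechanical core of the merging step above:

```python
import pandas as pd

# Two hypothetical sources describing the same kind of records.
source_a = pd.DataFrame({"id": [1, 2], "amount": [10.0, 20.0]})
source_b = pd.DataFrame({"id": [3, 4], "amount": [30.0, 40.0]})

# Centralize them into one table before any preprocessing happens.
combined = pd.concat([source_a, source_b], ignore_index=True)

# Enrich with a lookup table coming from yet another source.
labels = pd.DataFrame({"id": [1, 2, 3, 4], "segment": ["a", "a", "b", "b"]})
combined = combined.merge(labels, on="id", how="left")

print(combined)
```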
As covered in the [Business Questions](#Business-Questions) when we ask questions about the data, one of them is to find it's source. But it also could give rise to other related data or sources of data that could be relevant to the current task and then be brought in. Which throws light on the data aggregation process - how to bring the different sources of data and convert it into one form before performing any preprocessing or transformations on it. This process itself is sort of a preprocessing or transformations step on its own. +Aggregating data or centralizing data (or sometimes called normalizing data) - even though this topic overlaps with the [Data Collection](https://virgili0.github.io/Virgilio/purgatorio/collect-and-prepare-data/data-collection.html) topic covered in the respective guide. It's good to touch on the topic and be reminded of it briefly. -On the other hand, this question could throw light on the sources of data the current raw-data is made up of (and make us aware of the aggregatation process it underwent) before taking it's current form. +As covered in the [Business Questions](#Business-Questions) when we ask questions about the data, one of them is to find its source. But it also could give rise to other related data or sources of data that could be relevant to the current task and then be brought in. + +Which throws light on the data aggregation process - how to bring the different sources of data and convert it into one form before performing any preprocessing or transformations on it. This process itself is sort of a preprocessing or transformations step on its own. + +On the other hand, this question could throw light on the sources of data the current raw-data is made up of (and make us aware of the aggregation process it underwent) before taking its current form. _Best practices and exercises:_ [1](https://www.ssc.wisc.edu/sscc/pubs/sfr-combine.htm), [2](https://rpubs.com/wsundstrom/t_merge), [3](https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html), [4](https://searchbusinessanalytics.techtarget.com/feature/Using-data-merging-and-concatenation-techniques-to-integrate-data), [5](https://www.analyticsvidhya.com/blog/2016/06/9-challenges-data-merging-subsetting-r-python-beginner/) @@ -202,35 +239,52 @@ It is but hard to first check and know the current bias of the data or how the d ## Sanity Check You always want to be sure that your data are _exactly_ how you want them to be, and because of this is a good rule of thumb to apply a sanity check after each complete iteration of the data preprocessing pipeline (i.e. each step we have seen until now). + Look [here](https://www.trifacta.com/blog/4-key-steps-to-sanity-checking-your-data/) for a good overview. Depending on your case, the sanity check can vary a lot. _Best practices and exercises:_ [1](https://blog.socialcops.com/academy/resources/4-data-checks-clean-data/), [2](https://www.r-bloggers.com/data-sanity-checks-data-proofer-and-r-analogues/), [3](https://www.quora.com/What-is-the-example-of-Sanity-testing-and-smoke-testing) ## Automate These Boring Stuffs! -As Virgilio told you at the very beginning, the data preprocessing process can take a long time and be very tedious. Because of this, you want to [automate](https://www.youtube.com/watch?v=UZUoH7_mYx4) the most you can. Also, **automation is married with iteration**, so this is the way you need to plan your data preprocessing pipelines. 
[Here](https://github.com/mdkearns/automated-data-preprocessing) you find a good command-line tool for doing that; Virgilio is almost sure you'll need to build your own (remember, each problem is unique!), but this is a good starting point.

## Doing it in real-time
Fully connected to the [previous section](#Automate-These-Boring-Stuffs!), automating redundant or repeated tasks makes the workflow repeatable, consistent, efficient, and reliable. Given these qualities, such a pipeline is not far from being trusted with real-world raw data coming directly from one or more sources (centralizing or aggregating data).

This takes the manual steps out of the process and keeps things real and practical -- production-ready all the time. In this way, you can see all the flavors of data/input, and the nuances and edge cases to handle, each time a step fails or gives false positives or false negatives.

## Don't Joke With Data
First, [**data is King**](https://www.edq.com/glossary/data-quality-importance/). In the [data-driven epoch](https://www.venturi-group.com/qa-with-helen-mannion/), having [data quality issues](https://www.ringlead.com/blog/7-common-data-quality-issues/) means losing tremendous amounts of value for a company, in the present and the future. So, respect your King and care a lot about him. The most immediate way to do this is to plan and [work hard](https://web.archive.org/web/20190708202946/https://nektardata.com/high-quality-data/) to _produce_ good quality data.
+First, [**data is King**](https://www.edq.com/glossary/data-quality-importance/). In the [data-driven epoch](https://www.venturi-group.com/qa-with-helen-mannion/), having [data quality issues](https://www.ringlead.com/blog/7-common-data-quality-issues/) means to lose tremendous amounts of value for a company, in the present and the future. So, respect your King and care a lot about him. The most immediate way to do this is to plan and [work hard](https://web.archive.org/web/20190708202946/https://nektardata.com/high-quality-data/) to _produce_ good quality data. + + +Your goal is to plan a collecting data infrastructure that fixes problems beforehand. This means to care a lot about planning well your database schemas (do I need [third-normal form](https://social.technet.microsoft.com/Forums/Lync/en-US/7bf4ca30-a1bc-415d-97e6-ce0ac3137b53/normalized-3nf-vs-denormalizedstar-schema-data-warehouse-?forum=sqldatawarehousing) or not?), how do you collect data from sensors (physical or conceptual) and so on. + +These are problems if you're building a system up from the ground, but most of the time in you're gonna facing real-world problems that someone wants to solve with [_already available_](https://www.wired.com/insights/2013/05/more-data-more-problems-is-big-data-always-right/) data. _Best practices and exercises:_ [1](https://blog.panoply.io/5-data-preparation-tools-1-automated-data-platform), [2](https://www.quora.com/How-do-I-make-an-automated-data-cleaning-in-Python-for-ML-Is-there-a-trick-for-that), [3](https://www.quora.com/Is-there-a-python-package-to-automate-data-preparation-in-machine-learning), [4](https://machinelearningmastery.com/prepare-data-machine-learning-python-scikit-learn/), [5](https://www.analyticsvidhya.com/blog/2018/10/rapidminer-data-preparation-machine-learning/) ## Who To Leave Behind During the data profiling process, it's common to realize that often some of your data are [useless](https://ambisense.net/why-useless-data-is-worse-than-no-data/). Your data may have too much noise or they are partial, and most likely you don't all of them to answer your business problems. + + [To drop or not to drop, the Dilemma](https://www.theanalysisfactor.com/outliers-to-drop-or-not-to-drop/). -Each time you're facing a data related problem, try to understand what data you need and what you' don't - that is, for each piece of information, ask yourself (and ask the _business user_): +Each time you're facing a data-related problem, try to understand what data you need and what you' don't - that is, for each piece of information, ask yourself (and ask the _business user_): - How this data is going to help me? - Is possible to use them, reducing noise or missing values? - Considering the benefits/costs of the preparation process versus the business value created, Is the effort worth it? ## The Toolkit -The tools we're gonna use are Python3 and his [Pandas library](https://pandas.pydata.org/), the de-facto standard to manipulate datasets. There are a whole lot of other tools that have come out which are either built on top of Pandas or Numpy or independently, see [Data Preparation on awesome-ai-ml-dl](https://github.com/neomatrix369/awesome-ai-ml-dl/blob/master/data/data-preparation.md) for more details. +The tools we're gonna use are Python3 and his [Pandas library](https://pandas.pydata.org/), the de-facto standard to manipulate datasets. 
There are a whole lot of other tools that have come out which are either built on top of Pandas or Numpy, or developed independently; see [Data Preparation on awesome-ai-ml-dl](https://github.com/neomatrix369/awesome-ai-ml-dl/blob/master/data/data-preparation.md) for more details.

The heavy lifting here is done by the [DataFrame class](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html), which comes with a bunch of useful functions for your daily data tasks.

Hopefully, you already know Python; if not, start from there (do the steps Virgilio suggested to you in the ML guide requirements), and then take this [Beginner Pandas tutorial](https://web.archive.org/web/20200719131732/https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html).

Don't worry if by now some ideas are not clear, but try to get the big picture of the common [Pandas operations](https://www.analyticsvidhya.com/blog/2016/01/12-pandas-techniques-python-data-manipulation/).

_Best practices and exercises:_ [1](https://github.com/guipsamora/pandas_exercises), [2](https://www.w3resource.com/python-exercises/pandas/index.php), [3](https://www.machinelearningplus.com/python/101-pandas-exercises-python/), [4](https://www.kaggle.com/pistak/pandas-tutorial-with-interactive-exercises), [5](http://disi.unitn.it/~teso/courses/sciprog/python_pandas_exercises.html)
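To close, a compact sketch of the everyday DataFrame operations the Toolkit section points to; the file name and the columns are assumptions for illustration only:

```python
import pandas as pd

# Load a dataset (hypothetical file and columns) and try a few common operations.
df = pd.read_csv("customers.csv")

print(df.head())                         # peek at the first rows
print(df.shape)                          # (rows, columns)
print(df["age"].describe())              # summary statistics for one column
print(df[df["age"] > 30])                # boolean filtering
print(df.groupby("city")["age"].mean())  # split-apply-combine
```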