Handling missing data #98

Closed
AtharvaKhare opened this issue Jun 3, 2019 · 14 comments

Comments

@AtharvaKhare
Contributor

This issue is to consolidate related missing-data issues and to discuss potential solutions.

We can use nil to represent missing data in a DataFrame (df) / DataSeries (ds). Any better representation is welcome.

The following methods need to be added:

  1. Initializing df/ds with missing data - see DataFrame can not be initialized with missing values #21
  2. Dropping nil rows in a df, null values in a ds
  3. Detecting and converting nil values from strings - ?, NA, nan, null, nil as values should get converted to nil
  4. Filling missing data with 0/Empty string/Dictionary/Mean-mode-median

Solving this should also resolve #14 and #66, and make it possible to read files (e.g. CSV) that contain missing data.
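As a rough illustration of items 2 and 4, here is a minimal sketch that uses a plain Array with nil for the missing entries. It relies only on core Pharo collection messages; it does not show the DataFrame/DataSeries API, and all variable names are made up for the example.

```smalltalk
"A plain Array stands in for a DataSeries; nil marks a missing value."
| values withoutNils filledWithZero mean filledWithMean |
values := { 3. nil. 5. nil. 10 }.

"Item 2: drop the missing entries."
withoutNils := values reject: [ :each | each isNil ].

"Item 4: replace every missing entry with a fixed value."
filledWithZero := values collect: [ :each | each ifNil: [ 0 ] ].

"Item 4: replace every missing entry with the mean of the present values."
mean := (withoutNils inject: 0 into: [ :total :each | total + each ]) / withoutNils size.
filledWithMean := values collect: [ :each | each ifNil: [ mean ] ].
```

Dropping changes the length of the collection, while filling keeps every position, which matters when several columns have to stay aligned row by row.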

@khinsen

khinsen commented Jun 3, 2019

Using nil looks like an almost obvious choice. I don't see any problem with it.

As for point 3, the values to be converted to nil must be easily configurable because there is so much variety in real-life datasets. As an example, I had a dataset recently that uses '-' in numeric columns to indicate missing values. As another example, NaN is sometimes used for missing data but it can also be used in its strict IEEE-754 sense of "not a number", which is a well-defined floating-point value and not at all missing. Unfortunately, Pharo has not adopted IEEE-754 for its floating-point operations, so it can't handle NaN-containing datasets well, but that's not a reason to confuse it with missing data.

@AtharvaKhare
Contributor Author

@khinsen There is a Float nan. Would using that instead of nil be better (for numbers)?

@Ducasse

Ducasse commented Jun 5, 2019

Indeed, you should check the API, because using nil will force people to check for it.
That checking is OK if it happens in only one place, such as at import time.

@khinsen

khinsen commented Jun 5, 2019

@AtharvaKhare No. Float nan is a valid floating-point value, not an indicator of something missing. But some people do use literal representations of NaN for missing data, so when reading in datasets either conversion (NaN -> Float nan or NaN -> nil) must be possible.
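To make the distinction concrete, here is a small sketch in which nil and Float nan sit in the same collection and are counted separately. The sample data and variable names are invented for illustration; only core Pharo messages (isNil, notNil, isNaN) are used.

```smalltalk
"nil means 'no value recorded'; Float nan is a real floating-point value."
| observations missingCount nanCount |
observations := { 1.5. nil. Float nan. 2.5 }.

"Entries that are genuinely missing."
missingCount := (observations select: [ :each | each isNil ]) size.

"Entries that are present but not a number. The notNil guard is needed
 because nil does not understand isNaN."
nanCount := (observations select: [ :each |
	each notNil and: [ each isNaN ] ]) size.
```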

@AtharvaKhare
Contributor Author

@Ducasse Can you please elaborate?

@khinsen So for initialization, any missing value should become nil.
I'll provide an extra message which would convert 'nan'/'NaN' (String) to Float nan, for reading dataframes from files.

@SergeStinckwich
Member

@khinsen I'm not sure I understand why you say Pharo does not support IEEE-754: http://pharobooks.gforge.inria.fr/PharoByExampleTwo-Eng/latest/Float.pdf

@khinsen

khinsen commented Jun 6, 2019

@AtharvaKhare For reading DataFrames, here's what I'd do to deal with missing values:

  1. Have a method with an explicit argument for field values treated as missing values, e.g.
    DataFrame readFromCsv: 'file.csv' missingValueStrings: #('' '-' 'NA')
  2. After replacing the given strings by nil, convert any remaining nan, NaN etc. in numeric data to Float nan.
  3. For the single-argument DataFrame>>#readFromCsv:, use reasonable default values. '' (blank field) and 'NA' are probably the most common ones.
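The three steps above can be sketched on a single column of raw string fields. This is only an illustration under assumptions: the field values are invented, the conversion is written inline rather than as a DataFrame method, and the readFromCsv:missingValueStrings: selector proposed above is not implemented here.

```smalltalk
"Step 1: the caller decides which strings count as missing."
| missingValueStrings fields converted |
missingValueStrings := #('' '-' 'NA').
fields := #('1.5' '-' 'NaN' '' '4' 'NA').

converted := fields collect: [ :field |
	(missingValueStrings includes: field)
		ifTrue: [ nil ]
		ifFalse: [
			"Step 2: any remaining spelling of NaN becomes Float nan."
			field asLowercase = 'nan'
				ifTrue: [ Float nan ]
				ifFalse: [ field asNumber ] ] ].

"Step 3 would call the same code with defaults such as #('' 'NA')."
```

Because the missing-value check runs first, a dataset that uses 'NaN' to mean "missing" can be handled by adding 'NaN' to missingValueStrings, and the same string then maps to nil instead of Float nan.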

@khinsen

khinsen commented Jun 6, 2019

@SergeStinckwich Pharo implements a subset of IEEE-754. That's probably true of all higher-level languages, but in the context of this discussion (NaN), a missing feature that many languages do implement is an option to turn off exceptions and have operations such as division by zero or sqrt of a negative number return NaN instead. In fact, I haven't yet found a way to produce NaN from a computation in Pharo that does not already have NaN as an argument.
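For illustration, the non-exception behaviour can be approximated by hand for a single operation by catching ZeroDivide and answering Float nan. The safeDivide block below is a hypothetical helper, not something Pharo provides, and it covers only division.

```smalltalk
"Hypothetical helper: answer Float nan where Pharo would signal ZeroDivide."
| safeDivide quotient fallback |
safeDivide := [ :numerator :denominator |
	[ numerator / denominator ]
		on: ZeroDivide
		do: [ :exception | Float nan ] ].

quotient := safeDivide value: 6.0 value: 3.0.
fallback := safeDivide value: 1.0 value: 0.0.
```

Note that this maps every division by zero to NaN, whereas IEEE-754 would give an infinity for 1.0 / 0.0 and NaN only for 0.0 / 0.0, so it is a simplification of the mode described above.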

@SergeStinckwich
Member

OK @khinsen, maybe we should make a list of what is missing. What would be the benefit of turning off exceptions and obtaining NaN?

@khinsen

khinsen commented Jun 7, 2019

The best summary I know of the rationale behind introducing NaN is from lecture notes by William Kahan (who was the principal designer of IEEE-754):

What had been missing from computation but is now supplied by NaNs is an opportunity ( not obligation ) for software ( especially when searching ) to follow an unexceptional path ( no need for exotic control structures ) to a point where an exceptional event can be appraised after the event, when additional evidence may have accrued. Deferred judgments are usually better judgments but not always, alas.

Today the main reason for using no-exception mode is performance when treating large datasets, but there are cases where Kahan's argument still applies, i.e. algorithms where having to deal with an invalid result immediately is not convenient. In Pharo, another motivation may be prototyping algorithms that do use NaNs for whatever reason, including performance when turned into low-level code.

On the other hand, if I wanted to improve number crunching in Pharo, I'd start with more important issues, such as the precision mismatch between Float and FloatArray.

@khinsen

khinsen commented Jun 20, 2019

@AtharvaKhare The methods you added for removing rows/columns that contain missing values look good. That's one popular way to deal with missing data. However, it is often desirable to ignore missing data only for one particular operation, without removing them completely. For example, one might want to compute the average over a column. R has a boolean parameter "ignore missing data" in functions that somehow iterate over the data, which provides just that functionality. I am not sure how this would best be done in Pharo, but it's worth thinking about.
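One possible shape for such an operation is an average that skips nil entries while leaving the column untouched. This is a sketch on a plain Array with invented values; it is not a proposal for the actual DataSeries protocol.

```smalltalk
"Average over the present values only; the data itself is not modified."
| values sum count mean |
values := { 4. nil. 8. nil. 12 }.
sum := 0.
count := 0.
values do: [ :each |
	each ifNotNil: [ :value |
		sum := sum + value.
		count := count + 1 ] ].

"With no present values there is nothing to average, so answer nil."
mean := count = 0
	ifTrue: [ nil ]
	ifFalse: [ sum / count ].
```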

@AtharvaKhare
Contributor Author

AtharvaKhare commented Jun 20, 2019

@khinsen I think we could define a new message such as collectNotNils: or doNotNils: that performs an isNil check on every element, but I am not convinced this is how it should be implemented.
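To make the idea concrete, here is a sketch of what collectNotNils: might do, written as a block so that it runs in a Playground without adding methods. The name and behaviour are assumptions for discussion only.

```smalltalk
"Hypothetical collectNotNils: written as a two-argument block."
| collectNotNils values doubled doubledInPlace |
collectNotNils := [ :collection :aBlock |
	(collection select: [ :each | each notNil ]) collect: aBlock ].

values := { 1. nil. 3 }.

"Variant A: nil entries are skipped, so the result is shorter."
doubled := collectNotNils value: values value: [ :each | each * 2 ].

"Variant B: nil entries stay in place, so positions are preserved."
doubledInPlace := values collect: [ :each |
	each ifNotNil: [ :value | value * 2 ] ].
```

Which variant is preferable depends on whether the result must stay aligned with the other columns of the DataFrame.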

Would love others' thoughts on this.

@AtharvaKhare
Contributor Author

AtharvaKhare commented Jun 20, 2019

Putting PRs here for reference:

  1. Initializing df/ds with missing data - Added support for DataFrame init with missing values #100 merged
  2. Dropping nil rows in a df, null values in a ds - Added ability to remove nil from DataFrame and Series #102 (Additional tests need to be added, will be done after 1 is merged)
  3. Detecting and converting nil values from strings - Added method to convert missing values from files #104 (Tests will be added after JSON support)
  4. Filling missing data with 0/Empty string/Dictionary/Mean-mode-median - Added DataSeries fillNilsWith method #103 (Mean/mode/median will be added after 2 is merged)

@AtharvaKhare
Contributor Author

All the related issues are now fixed. :)


4 participants