Handling missing data #98

AtharvaKhare · 2019-06-03T12:57:09Z

This issue is to consolidate related missing-data issues and to discuss potential solutions.

We can use nil to represent missing data in a DataFrame(df) / DataSeries(ds). Any better representation is welcome.

The following methods need to be added:

1. Initializing a df/ds with missing data - see DataFrame can not be initialized with missing values #21
2. Dropping rows that contain nil in a df, and nil values in a ds
3. Detecting and converting missing-value strings - ?, NA, nan, null, nil as values should get converted to nil
4. Filling missing data with 0/empty string/Dictionary/mean-mode-median
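
To make the filling idea concrete, here is a minimal sketch on a plain Array, using only core collection messages. It is not the DataSeries API (which this issue does not specify); the variable names and the choice of 0 as the fill value are assumptions for illustration only.

```smalltalk
| column filled |
"A column with two missing entries, represented as nil."
column := #(4 nil 7 nil 10).

"ifNil: answers the receiver when it is not nil, otherwise the block value,
 so every nil is replaced by the fill value 0 and other entries are untouched."
filled := column collect: [ :each | each ifNil: [ 0 ] ].
```

The same pattern works for any constant fill value (an empty string, for example); filling with a mean or median would first need that statistic computed over the non-nil entries.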

Solving this should also resolve #14 and #66, and reading files (CSV) with missing data should become possible.


khinsen · 2019-06-03T14:18:02Z

Using nil looks like an almost obvious choice. I don't see any problem with it.

As for point 3, the values to be converted to nil must be easily configurable because there is so much variety in real-life datasets. As an example, I had a dataset recently that uses '-' in numeric columns to indicate missing values. As another example, NaN is sometimes used for missing data but it can also be used in its strict IEEE-754 sense of "not a number", which is a well-defined floating-point value and not at all missing. Unfortunately, Pharo has not adopted IEEE-754 for its floating-point operations, so it can't handle NaN-containing datasets well, but that's not a reason to confuse it with missing data.

AtharvaKhare · 2019-06-05T05:49:03Z

@khinsen There is a Float nan. Would using that instead of nil be better (for numbers)?

Ducasse · 2019-06-05T06:00:39Z

Indeed, you should check the API, because using nil will force people to check for it.
Checking is OK if it happens in only one place, such as at import time.

khinsen · 2019-06-05T09:23:38Z

@AtharvaKhare No. Float nan is a valid floating-point value, not an indicator of something missing. But some people do use literal representations of NaN for missing data, so when reading in datasets, either conversion (NaN -> Float nan or NaN -> nil) must be possible.

AtharvaKhare · 2019-06-06T05:57:24Z

@Ducasse Can you please elaborate?

@khinsen So for initialization, any missing value should become nil.
I'll provide an extra message which would convert 'nan'/'NaN' (String) to Float nan, for reading dataframes from files.

SergeStinckwich · 2019-06-06T09:39:18Z

@khinsen I'm not sure I understand why you say Pharo does not support IEEE-754: http://pharobooks.gforge.inria.fr/PharoByExampleTwo-Eng/latest/Float.pdf

khinsen · 2019-06-06T11:33:13Z

@AtharvaKhare For reading DataFrames, here's what I'd do to deal with missing values:

1. Have a method with an explicit argument for field values treated as missing values, e.g.
   DataFrame readFromCsv: 'file.csv' missingValueStrings: #('' '-' 'NA')
2. After replacing the given strings with nil, convert any remaining nan, Nan etc. in numeric data to Float nan.
3. For single-argument DataFrame>>#readFromCsv use reasonable default values. '' (blank field) and 'NA' are probably the most common ones.
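
The two conversion steps above could look roughly like this for a single column of raw field strings. This is only a sketch with core Pharo messages on a plain Array: the variable names and the list of NaN spellings are assumptions, and it does not show how the actual CSV reader is or should be implemented.

```smalltalk
| missingValueStrings nanStrings rawColumn converted |
"Strings the caller wants treated as missing (configurable)."
missingValueStrings := #('' '-' 'NA').
"Assumed spellings of NaN that should stay numeric."
nanStrings := #('nan' 'NaN' 'Nan').
rawColumn := #('1.5' 'NA' '2.0' '-' 'nan').

converted := rawColumn collect: [ :field |
    (missingValueStrings includes: field)
        ifTrue: [ nil ]                          "step 1: missing -> nil"
        ifFalse: [
            (nanStrings includes: field)
                ifTrue: [ Float nan ]            "step 2: NaN literal -> Float nan"
                ifFalse: [ field asNumber ] ] ]. "everything else parsed as a number"
```

Checking the missing-value strings first means a caller who does want 'NaN' treated as missing can simply add it to missingValueStrings.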

khinsen · 2019-06-06T11:40:01Z

@SergeStinckwich Pharo implements a subset of IEEE-754. That's probably true of all higher-level languages, but in the context of this discussion (NaN), a missing feature that many languages do implement is an option to turn off exceptions and have operations such as division by zero or sqrt of a negative number return NaN instead. In fact, I haven't yet found a way to produce NaN from a computation in Pharo that does not already have NaN as an argument.
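
For illustration, the no-exception behaviour can be emulated by hand for a single operation by catching the exception and answering Float nan. This is a workaround sketch, not a Pharo feature; safeDivide is a hypothetical helper name.

```smalltalk
| safeDivide results |
"Hypothetical helper block: answers Float nan instead of signalling ZeroDivide."
safeDivide := [ :a :b |
    [ a / b ]
        on: ZeroDivide
        do: [ :ex | ex return: Float nan ] ].

"The division by zero in the middle no longer interrupts the iteration."
results := #(2 0 4) collect: [ :divisor | safeDivide value: 1.0 value: divisor ].

"The invalid result can be inspected afterwards."
results select: [ :each | each isNaN ].
```

Wrapping every operation like this is exactly the overhead that a global no-exception mode would avoid.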

SergeStinckwich · 2019-06-06T17:31:03Z

OK @khinsen, maybe we should make a list of what is missing. What would be the benefit of turning off exceptions and obtaining NaN?

khinsen · 2019-06-07T09:31:00Z

The best summary I know of the rationale behind introducing NaN is from lecture notes by William Kahan (who was the principal designer of IEEE-754):

What had been missing from computation but is now supplied by NaNs is an opportunity ( not obligation ) for software ( especially when searching ) to follow an unexceptional path ( no need for exotic control structures ) to a point where an exceptional event can be appraised after the event, when additional evidence may have accrued. Deferred judgments are usually better judgments but not always, alas.

Today the main reason for using no-exception mode is performance when processing large datasets, but there are cases where Kahan's argument still applies, i.e. algorithms where having to deal with an invalid result immediately is not convenient. In Pharo, another motivation may be prototyping algorithms that do use NaNs for whatever reason, including performance when turned into low-level code.

On the other hand, if I wanted to improve number crunching in Pharo, I'd start with more important issues, such as the precision mismatch between Float and FloatArray.

khinsen · 2019-06-20T06:36:24Z

@AtharvaKhare The methods you added for removing rows/columns that contain missing values look good. That's one popular way to deal with missing data. However, it is often desirable to ignore missing data only for one particular operation, without removing them completely. For example, one might want to compute the average over a column. R has a boolean parameter "ignore missing data" in functions that somehow iterate over the data, which provides just that functionality. I am not sure how this would best be done in Pharo, but it's worth thinking about.
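
As a rough sketch of the idea, an average that skips missing values without removing them from the data could be written as follows. It works on a plain Array with core collection messages; it is not an existing DataSeries method, and the variable names are made up for the example.

```smalltalk
| column present mean |
column := #(1 nil 3 nil 5).

"Skip the nils for this one computation; column itself is left unchanged."
present := column reject: [ :each | each isNil ].

"Answer nil when there is nothing to average, otherwise sum / count."
mean := present isEmpty
    ifTrue: [ nil ]
    ifFalse: [ (present inject: 0 into: [ :sum :each | sum + each ]) / present size ].
```

Here mean is computed over the three values that are present, and the original column still contains its nil entries.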

AtharvaKhare · 2019-06-20T16:05:39Z

@khinsen I think we can define a new message like collectNotNils: or doNotNils: which would use an isNil check for every element, but I do not think this is how it should be implemented.

Would love others' thoughts on this.

AtharvaKhare · 2019-06-20T16:10:57Z

Putting PRs here for reference:

1. Initializing df/ds with missing data - Added support for DataFrame init with missing values #100 (merged)
2. Dropping nil rows in a df, null values in a ds - Added ability to remove nil from DataFrame and Series #102 (additional tests need to be added; will be done after 1 is merged)
3. Detecting and converting nil values from strings - Added method to convert missing values from files #104 (tests will be added after JSON support)
4. Filling missing data with 0/Empty string/Dictionary/Mean-mode-median - Added DataSeries fillNilsWith method #103 (mean/mode/median will be added after 2 is merged)

AtharvaKhare · 2019-07-22T11:16:45Z

All the related issues are now fixed. :)

AtharvaKhare closed this as completed Jul 22, 2019
