Question: can np.nan stand in for nan+/-0? #169

Closed
MichaelTiemannOSC opened this issue Jan 2, 2023 · 3 comments

@MichaelTiemannOSC

I'm trying to use uncertainties with Pandas, Pint and Pint-Pandas. Pint-Pandas makes it easy to have quantified values on a per-column basis, and those columns don't interact much (or at least not badly) with other columns.

uncertainties relies on wrappers to do its thing, whereas Pint and Pint-Pandas now make very complete use of ExtensionArrays to interact with Pandas. ExtensionArrays define an na_value for their dtype, which for most numeric types means np.nan.

In my past dealings with uncertainties, the NaN for that has been nan+/-0, which has been fine, except that it now makes for difficult promotion rules. If I have an extension array of quantities (tons of CO2, millions of USD, whatever) with normal float64 magnitudes, the correct na_value for that is np.nan. But if I fill the array with uncertainties as magnitudes, the logical na_value would be nan+/-0. And there's no concept of multiple na_values depending on whether there are uncertainties in the mix.
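
To make the two flavors of missing value concrete, here is a minimal sketch (the names are mine, purely illustrative):

import numpy as np
from uncertainties import ufloat

# A float64-backed column of magnitudes naturally uses np.nan as its na_value...
float_na = np.nan

# ...but an uncertainties-backed column would want nan+/-0 instead.
ufloat_na = ufloat(np.nan, 0)

# The two are not interchangeable: one is a plain float, the other a UFloat,
# so a single na_value per dtype cannot cover both kinds of magnitude.
print(type(float_na), type(ufloat_na))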

One solution is to just bite the bullet and say "if you use uncertainties anywhere, then every dataframe needs to honor them, meaning that the na_value for ANYTHING is nan+/-0 (and all magnitudes must promote to UFloat)." What I'd like to do is to manage that column-by-column.

Is there a world in which np.nan is a fully adequate value for uncertainties, with whatever promotions/substitutions, etc., happening within the wrappers? Or do I need to majorly rethink my approach of layering these various abstractions (uncertainties, quantities, DataFrames) together?

@lebigot
Collaborator

lebigot commented Jan 2, 2023

Thanks for the interesting details.

Now, can you show an example of what you want to do? I'm not fully seeing the problem yet (in part because uncertainties automatically promotes NaN to NaN±0).
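
For reference, a quick sketch of what I mean by promotion: in ordinary arithmetic, uncertainties treats a bare float as an exact value (zero uncertainty), so a plain np.nan effectively enters a calculation as nan+/-0 (the snippet below is illustrative, not taken from your code):

import numpy as np
from uncertainties import ufloat
from uncertainties import unumpy as unp

x = ufloat(1.0, 0.1)
# The bare NaN is treated as exact, so only x contributes uncertainty.
print(x + np.nan)  # nan+/-0.1

# unumpy.isnan handles plain floats and UFloats side by side in an object array.
mixed = np.array([np.nan, ufloat(np.nan, 0), ufloat(1.2, 0.3)], dtype=object)
print(unp.isnan(mixed))  # [ True  True False]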

@MichaelTiemannOSC
Author

MichaelTiemannOSC commented Jan 2, 2023

Here is some sample code. It does not play well with the current version of pint-pandas 0.3, but I have a pull request that does make it work (hgrecco/pint-pandas#140). My latest iteration doesn't show the problems I think will exist due to having multiple na_values, but the extra things the code does to smooth things over feel fragile.

import numpy as np
import uncertainties as un
from uncertainties import unumpy as unp
import pandas as pd
import pint
import pint_pandas

from pint import Quantity as Q_
from pint_pandas import PintArray as PA_
from uncertainties import ufloat

def pp_ser(ser):
    print(f"{ser.name} =\n{ser}")
    print(f"data = {ser.values.data}; np.dtype={ser.values.data.dtype}")

pa1 = pd.Series(PA_([1.1, 1.2, np.nan], dtype='pint[m]'), name='pa1')
pa2 = pd.Series(PA_([1.1, 0, 1.3], dtype='pint[m]'), name='pa2')
pa3 = pa1 / pa2
pa3.name = 'pa3'
# This is the simple case of pure float64 magnitudes
print("Pure float64")
pp_ser(pa3)

upa0 = pd.Series(PA_([ufloat(1.1, 0), ufloat(1.2, 0), np.nan, np.nan], dtype='pint[m]'), name='upa0')
# Test out self-promoting NaNs
pp_ser(upa0)

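# Mix plain float64 magnitudes with a ufloat NaN and a bare np.nan in one array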
upa1 = pd.Series(PA_([1.1, 1.2, ufloat(np.nan, 0), np.nan], dtype='pint[m]'), name='upa1')
pp_ser(upa1)

upa2 = pd.Series(PA_([ufloat(0.0, 0.0), ufloat(0.0, .1), ufloat(0.0, .2), ufloat(0.0, .3)], dtype='pint[m]'), name='upa2')
pp_ser(upa2)

upa3 = pd.Series(PA_([0.01, 0.02, 0.03, 0.04], dtype='pint[m]'), name='upa3')
pp_ser(upa3)

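# Combine float64-backed and uncertainties-backed columns into one DataFrame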
upa4 = pd.concat([upa0, upa1, upa2, upa3], axis=1)

print(f"upa4.dtypes = {upa4.dtypes}")
for col in upa4.columns:
    print(f"upa4[{col}].values.data = {upa4[col].values.data}")
print(f"upa4.iloc[0] = {upa4.iloc[0]}")
print(f"upa4.T.iloc[0] = {upa4.T.iloc[0]}")

@MichaelTiemannOSC
Author

So... writing up my findings for the day: when Pint-Pandas makes the ndarray that holds a PintArray's values, it's really best to allocate for ufloats if there are any ufloats to be seen, or if there are any NaNs, which are the initial values in an "empty" array. If we allocate the ndarray as float64-only too soon, especially when all we see are NaNs, that array cannot later hold ufloats.
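
A minimal illustration of that constraint, using just NumPy and uncertainties (nothing pint-pandas-specific):

import numpy as np
from uncertainties import ufloat

# A float64 ndarray can happily be allocated full of NaNs...
float_backing = np.full(3, np.nan)

# ...but it cannot later accept a UFloat: a UFloat will not silently convert
# to a plain float, so the assignment raises a TypeError instead of dropping
# the uncertainty.
try:
    float_backing[0] = ufloat(1.1, 0.2)
except TypeError as exc:
    print(f"float64 backing rejected the ufloat: {exc}")

# An object-dtype ndarray can hold NaNs and UFloats side by side.
object_backing = np.full(3, np.nan, dtype=object)
object_backing[0] = ufloat(1.1, 0.2)
print(object_backing)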

The cost of allocating object arrays is well known (performance), but I'm seeing general happiness whereby dataframes filled with PintArrays do what is expected. What's cool (?) is that one cannot tell off-hand whether a PintArray with dtype='pint[kg]' is a float64-based or uncertainties-based array. It. Just. Works.
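
If you do want to see the difference, the backing dtype still gives it away; with the pint-pandas branch above, something like this (mirroring the pp_ser helper from the sample code) should show it:

# Both series report dtype='pint[m]', but the backing ndarrays differ:
print(upa3.values.data.dtype)  # float64 magnitudes
print(upa2.values.data.dtype)  # object, because the magnitudes are UFloats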

There are still lots of edge cases to work out, but at the end of this day, I have something that's largely behaving and it's not throwing 10,000+ warning messages about "units stripped" or "casting to float" or whatever. So that's progress.
