MAINT Refactor partial_fit Cython code to be more maintainable #63

Open
adam2392 opened this issue Apr 2, 2024 · 0 comments
adam2392 commented Apr 2, 2024

Some segfaults arose in scikit-tree during the implementation of this PR: neurodata/treeple#249.
Some actionable items came to mind to improve what we have:

Documentation:

```cython
cpdef initialize_node_queue(
    self,
    Tree tree,
    object X,
    const float64_t[:, ::1] y,
    const float64_t[:] sample_weight=None,
    const unsigned char[::1] missing_values_in_feature_mask=None,
):
    """Initialize a list of roots"""
    X, y, sample_weight = self._check_input(X, y, sample_weight)

    # organize samples by decision paths
    paths = tree.decision_path(X)
    cdef intp_t PARENT
    cdef intp_t CHILD
    cdef intp_t i
    false_roots = {}
    X_copy = {}
    y_copy = {}
    for i in range(X.shape[0]):
        # collect depths from the node paths
        depth_i = paths[i].indices.shape[0] - 1
        PARENT = depth_i - 1
        CHILD = depth_i

        # find leaf node's & their parent node's IDs
        if PARENT < 0:
            parent_i = 0
        else:
            parent_i = paths[i].indices[PARENT]
        child_i = paths[i].indices[CHILD]
        left = 0
        if tree.children_left[parent_i] == child_i:
            left = 1  # leaf node is left child

        # organize samples by the leaf they fall into (false root)
        # leaf nodes are marked by parent node and
        # their relative position (left or right child)
        if (parent_i, left) in false_roots:
            false_roots[(parent_i, left)][0] += 1
            X_copy[(parent_i, left)].append(X[i])
            y_copy[(parent_i, left)].append(y[i])
        else:
            false_roots[(parent_i, left)] = [1, depth_i]
            X_copy[(parent_i, left)] = [X[i]]
            y_copy[(parent_i, left)] = [y[i]]

    X_list = []
    y_list = []

    # reorder the samples according to parent node IDs
    for key, value in reversed(sorted(X_copy.items())):
        X_list = X_list + value
        y_list = y_list + y_copy[key]
    cdef object X_new = np.array(X_list)
    cdef cnp.ndarray y_new = np.array(y_list)

    # initialize the splitter using sorted samples
    cdef Splitter splitter = self.splitter
    splitter.init(X_new, y_new, sample_weight, missing_values_in_feature_mask)

    # convert dict to numpy array and store value
    self.initial_roots = np.array(list(false_roots.items()))
```
This should ideally be rewritten, or at the very least given more comments. Right now it is hard to parse what comprises `initial_roots`. Since this is part of the Cython codebase, it is a critical piece: segfaults are time-consuming and difficult to chase down.
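To make the grouping logic easier to follow, here is a hypothetical pure-Python sketch (not the treeple API; the leaf tuples below are made-up sample data) of the core bookkeeping: each leaf is keyed by its parent node ID plus whether it is the left child, and `false_roots` maps that key to `[n_samples_in_leaf, depth]`.

```python
# (parent_id, is_left_child, depth) of the leaf each sample fell into;
# toy values standing in for what decision_path() would produce.
leaves = [
    (2, 1, 3),
    (2, 1, 3),
    (2, 0, 3),
    (0, 1, 1),
]

false_roots = {}
for parent_i, left, depth_i in leaves:
    key = (parent_i, left)
    if key in false_roots:
        false_roots[key][0] += 1         # one more sample in this leaf
    else:
        false_roots[key] = [1, depth_i]  # [sample count, leaf depth]

print(false_roots)
# {(2, 1): [2, 3], (2, 0): [1, 3], (0, 1): [1, 1]}
```

Spelling this out as a docstring or comment block in the Cython source would go a long way toward making `initial_roots` parseable.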

Next, we probably want to include a clear description for developers of the differences between these two branches:

```cython
if initial_roots is None:
    # Recursive partition (without actual recursion)
    splitter.init(X, y, sample_weight, missing_values_in_feature_mask)

    if tree.max_depth <= 10:
        init_capacity = <intp_t>(2 ** (tree.max_depth + 1)) - 1
    else:
        init_capacity = 2047

    tree._resize(init_capacity)
    first = 1
else:
    # convert numpy array back to dict
    false_roots = {}
    for key_value_pair in initial_roots:
        false_roots[tuple(key_value_pair[0])] = key_value_pair[1]

    # reset the root array
    self.initial_roots = None
```
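For reference, the `else` branch is undoing the dict-to-array conversion that `initialize_node_queue` performed. A hedged NumPy sketch of that round trip (with made-up `false_roots` contents) may clarify what shape `initial_roots` actually has:

```python
import numpy as np

# Toy stand-in for false_roots: (parent_id, is_left) -> [count, depth].
false_roots = {(2, 1): [2, 3], (2, 0): [1, 3], (0, 1): [1, 1]}

# Store: dict -> array. Because keys and values are both length-2 int
# sequences, this produces a (n_leaves, 2, 2) integer array.
initial_roots = np.array(list(false_roots.items()))
print(initial_roots.shape)  # (3, 2, 2)

# Restore: array -> dict, as in the else branch above. Note the keys come
# back as tuples of np.int64 and the values as NumPy arrays, not lists.
restored = {}
for key_value_pair in initial_roots:
    restored[tuple(key_value_pair[0])] = key_value_pair[1]

print(set(restored) == set(false_roots))  # True
```

The silent change of value type (list vs. NumPy array) across the round trip is exactly the kind of detail worth documenting, since downstream Cython code may assume one or the other.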

Features, or documentation
Part of the scikit-learn code handles monotonic constraints and `n_constant_features` tracking. At first glance it is not clear whether these are actually tracked across calls. That is: if we run `fit` and then `partial_fit` on two subsets of the data, do the monotonic-constraint state and `n_constant_features` differ from what we would get by running `fit` once on the entire dataset? If they differ, what does that imply?

In an ideal world the state is the same.
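A minimal sketch of why this matters, using a made-up `constant_features` helper rather than anything from sklearn or treeple: a feature can be constant within one batch but not across the full dataset, so constant-feature state accumulated batch-by-batch can diverge from state computed on all the data at once.

```python
import numpy as np

X_full = np.array([[0.0, 1.0],
                   [0.0, 2.0],
                   [1.0, 3.0],
                   [1.0, 4.0]])
batch1, batch2 = X_full[:2], X_full[2:]

def constant_features(X):
    """Indices of features with a single unique value in X (illustrative only)."""
    return [j for j in range(X.shape[1]) if np.unique(X[:, j]).size == 1]

print(constant_features(batch1))  # [0]: feature 0 is constant within batch 1
print(constant_features(X_full))  # []: but not constant over the full data
```

Whether the real `n_constant_features` counter (and the monotonic-constraint bookkeeping) behaves like the first call or the second after `fit` + `partial_fit` is the question we should answer and document.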

@PSSF23 PSSF23 self-assigned this Apr 3, 2024