Tensorflow-IO DAOS Plugin #1603
base: master
Commits on Sep 6, 2021
- 7369e32  Skeleton in Place + Build Correctly (Omar Marzouk)
Commits on Sep 10, 2021
- b055a0c  Merge branch 'FT-dfs-skeleton-OM' into 'devel': Resolve "Skeleton for DFS Plugin" (merge request parallel-programming/bs-daos-tensorflow-io!1)
- 230aadd  Parsing Function Added and Tested Separately (root)
- a9b128b  Merge branch '2-parse-dfs-path' into 'devel': Resolve "Parse DFS Path" (merge request parallel-programming/bs-daos-tensorflow-io!2)
Commits on Sep 12, 2021
- 17a21d7  DAOS library installed as an HTTP archive and linked (root)
- 20900e9  committed by root
- c1970e9  committed by root
- 6441bf0  Merge branch 'FT-daos-lib-integration-OM' into 'devel': Resolve "DAOS Library Integration" (merge request parallel-programming/bs-daos-tensorflow-io!3)
Commits on Sep 15, 2021
- b1c0128  Added Skeleton + Connect/Disconnect Functionality (root)
- e3654b9  Merge branch '4-plugin-skeleton' into 'devel': Resolve "Plugin Skeleton" (merge request parallel-programming/bs-daos-tensorflow-io!4)
- a45d6e4  committed by root
Commits on Sep 16, 2021
- 903b89f  Query + Moving Class and Helpers to header file (root)
Commits on Sep 20, 2021
- 477e81a  Added Path Lookup Functionality (root)
Commits on Sep 21, 2021
- ae6f84a  Support for Multiple Connections (root)
- 9a3f111  Merge branch 'FT-filesystem-ops-OM' into 'devel': Resolve "Filesystem Operations" (merge request parallel-programming/bs-daos-tensorflow-io!5)
Commits on Sep 23, 2021
- 9f29eee  Directory Checking + Creation & Deletion (Single/Recursive) (root)
- 14073c5  Merge branch 'FT-directory-operation-OM' into 'devel': Resolve "Directory Operations" (merge request parallel-programming/bs-daos-tensorflow-io!6)
- 8dd9fbb  committed by Omar Marzouk
- c8e261a  committed by Omar Marzouk
Commits on Sep 25, 2021
- 958b645  Creation of Random Access + Writable + Appendable Operations (Omar Marzouk)
Commits on Sep 26, 2021
- 96647da  committed by Omar Marzouk
Commits on Sep 27, 2021
- 038dfce  Completed FileSystem Operations Table (Omar Marzouk)
- 2f29df0  Merge branch 'FT-file-ops-OM' into 'devel': Resolve "File Operations" (merge request parallel-programming/bs-daos-tensorflow-io!7)
Commits on Sep 28, 2021
- 5b2d864  committed by Omar Marzouk
- b14fc84  Merge branch '10-refactor-of-dfs-plugin-class' into 'devel': Resolve "Refactor of DFS Plugin + Class" (merge request parallel-programming/bs-daos-tensorflow-io!9)
Commits on Sep 29, 2021
- 10b09de  committed by Omar Marzouk
- 26b0f36  Merge branch 'FT-writable-file-ops-OM' into 'devel': Resolve "Writable File Ops" (merge request parallel-programming/bs-daos-tensorflow-io!8)
- 83fd151  committed by Omar Marzouk
- 6810f04  Merge branch 'FT-random-access-ops-OM' into 'devel': Resolve "Random Access File Ops" (merge request parallel-programming/bs-daos-tensorflow-io!10)
Commits on Oct 6, 2021
- ec328c1  Tests Added (Bug in isDirectory and Wildcard Matching) (Omar Marzouk)
Commits on Oct 10, 2021
- cadd6f6  Tests completed and passed; wildcard matching still to be checked (Omar Marzouk)
Commits on Oct 14, 2021
- dc664ae  committed by Omar Marzouk
- 366d63c  Tutorial tested and configured (authored and committed by root)
Commits on Oct 19, 2021
- 8a8cabf  Implementation of Wildcard Matching (Omar Marzouk)
- 3614d0c  Merge branch '13-wildcard-matching' into FT-tutorial-example-OM (Omar Marzouk)
- 866ebd7  Merge branch '13-wildcard-matching' into 'devel': Resolve "Wildcard Matching" (merge request parallel-programming/bs-daos-tensorflow-io!13)
- 36e93cf  Merge branch 'FT-tutorial-example-OM' into 'devel': Resolve "Adding Tutorial Example" (merge request parallel-programming/bs-daos-tensorflow-io!12)
Commits on Oct 27, 2021
- 6ee86d0  committed by Omar Marzouk
- 3745d4f  Merge branch 'FT-rom-region-dummy-OM' into 'devel': Resolve "Read-Only-Memory-Region (Dummy)" (merge request parallel-programming/bs-daos-tensorflow-io!14)
Commits on Dec 5, 2021
- c0ba4c4  Update to DAOS 1.3.106 + Decoupling of DAOS API Init + Handling Pool and Container Labels (Omar Marzouk)
- a764fdf  Adjusted Example + Added Build Documentation + Fixed Indentation (Omar Marzouk)
Commits on Dec 6, 2021
- d5016e5  Refactor + Update Tests + Update Docs (Omar Marzouk)
- eebe0af
- 9b3baf0  committed by Omar Marzouk
Commits on Dec 7, 2021
- 7035255  committed by Omar Marzouk
- 7c1c736  committed by Omar Marzouk
Commits on Dec 8, 2021
- 9d72efd  committed by Omar Marzouk
- b823fa1  Merge branch 'devel' of https://github.com/daos-stack/tensorflow-io-daos (Omar Marzouk)
- a2cc29e  Updating Docs and moving them to docs/ (Omar Marzouk)
Commits on Dec 10, 2021
- 4a8b3ec  committed by Omar Marzouk
- ba9074f  Merge branch 'devel' into FT-unified-name-space-OM (Omar Marzouk)
Commits on Dec 12, 2021
- ea94561  committed by Omar Marzouk
- a8f0a85  Merge branch 'FT-unified-name-space-OM' into 'devel': Resolve "Unified Name Space" (merge request parallel-programming/bs-daos-tensorflow-io!15)
Commits on Dec 15, 2021
- efbb408  Updated Tutorial + Documentation (Omar Marzouk)
- 5027ff7  committed by Omar Marzouk
- 62b092c  committed by Omar Marzouk
- c83c378  committed by Omar Marzouk
Commits on Dec 16, 2021
- b827a7b  committed by Omar Marzouk
Commits on Jan 2, 2022
- de0a601
Commits on Jan 4, 2022
- 7840b77  committed by Omar Marzouk
- 9987dd8
Commits on Jan 5, 2022
- df84926  committed by Omar Marzouk
- 5ef973c  Merge branch 'devel' of https://github.com/daos-stack/tensorflow-io-daos (Omar Marzouk)
Commits on Jan 30, 2022
- 24641ce  Integrating daos build changes (Omar Marzouk)
- 673ffba  committed by Omar Marzouk
- 0cf005c  committed by Omar Marzouk
Commits on Feb 2, 2022
- c6dab2a  committed by Omar Marzouk
Commits on Feb 22, 2022
- 5cf7335  Pool Connection Error Handling (Omar Marzouk)
Commits on Mar 8, 2022
- 0858a52  committed by Omar Marzouk
Commits on Mar 10, 2022
- 9689f99  committed by Omar Marzouk
Commits on Apr 2, 2022
- db74d80  Merge remote-tracking branch 'original-repo/master' into 17-read-ahead-buffering: merging upstream updates (Omar Marzouk)
Commits on Apr 3, 2022
- 7488654  committed by Omar Marzouk
Commits on Apr 4, 2022
- dbf0bce  committed by Omar Marzouk
- 57ebc1e  Merge branch '17-read-ahead-buffering' into 'devel': Resolve "Read-Ahead Buffering" (merge request parallel-programming/bs-daos-tensorflow-io!17)
- e8cf9c0  committed by Omar Marzouk
Commits on May 6, 2022
- e3887d5  Existing File Deletion when Opened in Write Mode (Omar Marzouk)
- 3fd58df  committed by Omar Marzouk
Commits on May 8, 2022
- 7a22a32  committed by Omar Marzouk
Commits on May 10, 2022
- afbc18b  committed by Omar Marzouk
- 79b9d18  committed by Omar Marzouk
Commits on Jun 5, 2022
- fab7f15  Various fixes to the DAOS tensorflow-io plugin (#2):
  General:
  - Asserts were added and enabled after each DAOS event-related call in order to track down internal race conditions in the DAOS client code; see DAOS-10601.
  - The DAOS_FILE structure documented behavior for the 'offset' field, but most of that behavior did not need to be implemented. The 'offset' field was removed while fixing the Append() and Tell() functions, leaving only a single 'file' field in DAOS_FILE, so the DAOS_FILE struct was removed as well.
  - Typos in error messages were corrected.
  File dfs_utils.cc:
  - DFS::Setup(): Code after the Connect() call replaced the detailed error status set in Connect() with a generic TF_NOT_FOUND error carrying no message. This cost several days of debugging to realize that the problem was not a missing object but a failed connection to an unhealthy container. The TF_NOT_FOUND has been removed so that the more detailed error messages in Connect() are reported.
  - ReadBuffer::ReadBuffer(): By setting buffer_offset to ULONG_MAX, an uninitialized buffer will never be matched by CacheHit(), removing the need for a separate 'initialized' variable. The 'valid' variable is no longer needed either; more on that below.
  - ReadBuffer::~ReadBuffer(): daos_event_fini() cannot be called on an event that is still in flight; it fails without doing anything. daos_event_test() must wait for any prior event to complete, otherwise the event delete that follows daos_event_fini() could corrupt the event queue. The reworked WaitEvent() (see below) is now called first to ensure that daos_event_fini() cleans up the event before it is deleted.
  - ReadBuffer::FinalizeEvent(): The same problem exists here as in ~ReadBuffer(): daos_event_fini() cannot be called on an in-flight event. FinalizeEvent() is not actually needed, however; a dfs_file->buffers.clear() call in Cleanup() accomplishes the same thing through the ~ReadBuffer() code, so FinalizeEvent() was removed.
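The wait-before-finalize rule described for ~ReadBuffer() and FinalizeEvent() can be sketched as a tiny state machine. This is an illustrative model only, not DAOS code: EventModel, Launch(), Wait(), and Fini() are invented stand-ins for daos_event_t, the dfs_read() posting, daos_event_test(), and daos_event_fini().

```cpp
#include <cassert>

// Invented stand-in for a daos_event_t lifecycle.
enum class EvState { IDLE, INFLIGHT, DONE };

struct EventModel {
  EvState state = EvState::IDLE;

  void Launch() { state = EvState::INFLIGHT; }  // an async read posted the event

  // Role of the reworked WaitEvent(): block until any outstanding
  // operation completes.
  void Wait() {
    if (state == EvState::INFLIGHT) state = EvState::DONE;
  }

  // Finalizing is only legal once no operation is outstanding; on an
  // in-flight event it fails without doing anything, and deleting the
  // event afterwards could corrupt the event queue.
  bool Fini() {
    if (state == EvState::INFLIGHT) return false;
    state = EvState::IDLE;
    return true;
  }
};
```

Under this model, the destructor fix is simply Wait() followed by Fini(), which then always succeeds.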
  - ReadBuffer::WaitEvent(): A function is needed in several places to wait for any outstanding event to complete, but this routine manipulated 'valid' and so could not be used anywhere else. The 'valid' code was removed so that the routine returns void and can be called from multiple places.
  - ReadBuffer::AbortEvent(): daos_event_abort() does not actually contain any logic to ask the server to abort an in-flight dfs_read() request. It is also buggy: internal DAOS asserts were hit due to daos_event_abort() calls during I/O testing. The code now uses WaitEvent() to simply wait for a prior read to complete before issuing a new one, and AbortEvent() was removed.
  - ReadBuffer::ReadAsync(): Both daos_event_fini() and daos_event_init() must be called on a daos_event_t structure before the event can be reused for another dfs_read() call; these calls have been added. The AbortEvent() call was replaced with a call to WaitEvent(). The errno from a failed dfs_read() call is now saved in the event's ev_error field so that the error is detected and a user cannot accidentally read trash data after a failed dfs_read().
  - ReadBuffer::ReadSync(): No longer used; see below.
  - ReadBuffer::CopyData(): The WaitEvent() call ensures that the thread blocks until any in-flight read request is done. The event->ev_error field is used to detect I/O failure either when the dfs_read() is issued or in the reply, so the 'valid' flag is no longer needed.
  - ReadBuffer::CopyFromCache(): The TF_RandomAccessFile Read() function allows int64_t-sized reads, so the return value here was changed to int64_t. On an I/O error, -1 is returned so that the calling Read() function can easily tell that an I/O error occurred, and a detailed error message is provided so the user can tell what caused it.
  File dfs_filesystem.cc:
  - DFSRandomAccessFile constructor: Added an assert() on the event queue creation.
  - Cleanup(): Replaced the FinalizeEvent() code with a dfs_file->buffers.clear() call, and added asserts on dfs function calls.
  - Read(): The loop "for (auto& read_buf : dfs_file->buffers)" was missing a break statement, so CacheHit() was called 256 times for each curr_offset value; a break was added. Support was added for detecting a read error and returning -1. Since Read() is now a while loop, there is no reason to use ReadSync() specially for the first buffer; the code now uses ReadAsync() for all readahead, and CopyFromCache() blocks until the first buffer's I/O is complete. ReadSync() is now unused and was removed. No reason could be found for the WaitEvent loop guarded by "if (curr_offset >= dfs_file->file_size)", because I/O requests are never started beyond EOF; the loop was removed.
  - DFSWritableFile: Append() had to make a dfs_get_size() call on every append, adding a second round trip to the server per append, which is very expensive. Member functions were added to cache the file size and update it locally as Append() operations complete; since the tensorflow API allows only one writer, local caching is safe. Should an I/O error occur, the actual size of the file becomes unknown; the new member functions account for this and call dfs_get_size() in those situations to reestablish the correct size.
  - Append(): The dfs_file->daos_file.offset field was not updated after an Append() completed successfully, so a subsequent Tell() call returned the size of the file before the last Append() rather than after it. The code now updates the cached file size after each successful Append().
  - RenameFile(): As in the Setup() case, the detailed error statuses set in Connect() were hidden by a generic TF_NOT_FOUND error; the generic error was removed.
  Signed-off-by: Kevan Rehm <[email protected]>
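The cached-file-size scheme described for DFSWritableFile can be sketched as follows. WritableFileModel and its members are hypothetical stand-ins, not the plugin's code: the real plugin calls dfs_get_size() to re-establish the size after an error, whereas this model takes the server-side size as a parameter. The sketch assumes the single-writer invariant noted above.

```cpp
#include <cassert>
#include <cstdint>

// Model of a writable file that tracks its size locally across appends,
// avoiding a size query round trip per Append(); the cache is only
// invalidated when an I/O error makes the real size unknown.
class WritableFileModel {
 public:
  explicit WritableFileModel(uint64_t initial_size)
      : cached_size_(initial_size), size_known_(true) {}

  // Append `n` bytes; `io_ok` simulates whether the write succeeded.
  void Append(uint64_t n, bool io_ok) {
    if (!io_ok) {
      size_known_ = false;  // actual file size is now unknown
      return;
    }
    if (size_known_) cached_size_ += n;  // update locally, no server query
  }

  // Tell() serves from the cache, falling back to a (simulated)
  // size query when the cache was invalidated by an error.
  uint64_t Tell(uint64_t actual_size_on_server) {
    if (!size_known_) {
      cached_size_ = actual_size_on_server;  // re-establish the size
      size_known_ = true;
    }
    return cached_size_;
  }

 private:
  uint64_t cached_size_;
  bool size_known_;
};
```

Caching is safe here only because a single writer exists; with concurrent writers the cached value could silently diverge from the file's real size.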
- 6746017
- b0d5ad2  committed by Omar Marzouk
Commits on Jun 6, 2022
- 1af1181  committed by Omar Marzouk
- b0a7b7d  Adjustments to Reading, Single Event Queue Handle, Paths Caching, and FileSize Caching (Omar Marzouk)
Commits on Jun 22, 2022
- eb90e69  Add support for dynamically loaded DAOS libraries (#4):
  Currently, if the DAOS libraries are not installed on a node, libtensorflow_io_plugins.so fails to load due to unsatisfied externals, and all modular filesystems become unusable, not just DFS. This PR changes the DFS plugin to load the DAOS libraries dynamically, so that the DFS filesystem is available when DAOS is installed while the other modular filesystems remain available when it is not. The checks for the DAOS libraries and the daos_init() call are now done at filesystem registration time, not as part of each function call in the filesystem API. If the libraries are not installed, the DFS filesystem is not registered and no calls into DFS functions ever occur; in this case tensorflow simply reports "File system scheme 'dfs' not implemented" when a "dfs://" path is used. A number of separate functions existed that were each called only once during DFS destruction; these were combined into the DFS destructor for simplicity. Similar recombinations were done to simplify DFS construction.
  Signed-off-by: Kevan Rehm <[email protected]>
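The registration-time probe described here can be sketched with dlopen(). This is a hedged illustration, not the plugin's actual code: the probed name "libdaos.so", MaybeRegisterDfs(), and RegisterDfsFilesystem() are invented, and the real plugin additionally resolves the individual DAOS entry points and calls daos_init() before registering the "dfs" scheme.

```cpp
#include <dlfcn.h>
#include <cassert>

// Tracks whether the (hypothetical) DFS filesystem got registered.
bool dfs_registered = false;

void RegisterDfsFilesystem() { dfs_registered = true; }

// Probe for the DAOS client library at plugin load time. If it is absent,
// skip registering the "dfs" scheme instead of failing the whole plugin
// load; the other modular filesystems then remain usable.
bool MaybeRegisterDfs(const char* libname) {
  void* handle = dlopen(libname, RTLD_NOW | RTLD_GLOBAL);
  if (handle == nullptr) {
    // DAOS not installed: leave "dfs" unregistered. Tensorflow will later
    // report the scheme as not implemented if a dfs:// path is used.
    return false;
  }
  RegisterDfsFilesystem();  // real code would dlsym() entry points here
  return true;
}
```

The handle is deliberately kept open for the process lifetime, as a plugin normally would.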
Omar Marzouk committed
Jun 22, 2022 Configuration menu - View commit details
-
Copy full SHA for 5f54aee - Browse repository at this point
Copy the full SHA 5f54aeeView commit details
Commits on Jul 13, 2022
- Various additional plugin fixes:
  Global changes:
  - The plugin used duplicate definitions of internal DAOS client structures (dfs_obj_t, dfs_t, dfs_entry_t) and created malloc'd copies of those structs in order to access their private fields. Should DAOS modify those structures in a future release, the plugin would break. The dependencies on internal fields have been removed; the DAOS client API is now strictly followed.
  - The path_map and size_map caches used DFS mount-point-relative pathnames as keys. If more than one DFS filesystem is mounted during the same run, the same relative pathname could be in use in both filesystems, and callers retrieving values from the caches could get results for the wrong filesystem. The code now creates path_map and size_map caches per filesystem.
  - A number of fields (connected, daos_fs, pool, container) were stored in the global DFS structure, which meant that whenever a path was presented to the plugin in a different DFS filesystem than the previous path, the current filesystem had to be unmounted and the new one mounted, even though the application could have files open in the filesystem being unmounted. The code now maintains filesystem state per pool/container combination, so any number of DFS filesystems can be mounted simultaneously.
  - None of the code in the DFS Cleanup() function was ever being called; this is a known tensorflow issue, see tensorflow/tensorflow#27535. The workaround is to call Cleanup() via the atexit() function.
  - RenameFile() was enhanced to delete the cached size of the source file and store that cached size for the destination file.
  - The dfsLookUp() routine required the caller to indicate whether or not the object being looked up was a directory. This was necessary because dfs_open() was being used to open the object, and that call requires a different open mode for directories and files.
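The per-filesystem cache change above can be illustrated with a minimal sketch (all names invented): a single global cache keyed only by the mount-relative path lets two mounted filesystems collide on the same key, while keying the caches by pool/container keeps their entries separate.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// One size cache per filesystem, identified here by a "pool/container"
// string, replacing a single global size_map keyed only by relative path.
using SizeCache = std::map<std::string, uint64_t>;
std::map<std::string, SizeCache> per_fs_size_cache;

void CacheSize(const std::string& fs, const std::string& rel, uint64_t size) {
  per_fs_size_cache[fs][rel] = size;
}

uint64_t CachedSize(const std::string& fs, const std::string& rel) {
  // Same relative path under two filesystems resolves to distinct entries.
  return per_fs_size_cache[fs][rel];
}
```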
    However, the caller does not always know the type of the object being looked up, e.g. in PathExists() and IsDir(). If the caller guessed wrong, the dfs_open() call failed with either EINVAL or ENOTDIR, and the caller would map these errors to ENOENT, which is incorrect. The dfs_open() call was replaced with dfs_lookup_rel(), which removes the requirement that the caller know the object's type a priori; the caller can check the type of the object after it has been opened.
  - The dfsLookUp() routine also required all callers to implement three different behaviors depending on the type of object being opened:
    1. For a DFS root directory, a null dfs_obj_t was returned, which the caller had to special-case.
    2. For a non-root directory, a non-null dfs_obj_t was returned which the caller must never release, because the dfs_obj_t is also an entry in the path_map cache; releasing it would cause future requests using that cache entry to fail. There were a few places in the code where this was occurring.
    3. For a non-directory, a non-null dfs_obj_t was returned which the caller must always release when done with it.
    The code was changed so that a DFS root directory returns a non-null dfs_obj_t. Also, whenever a directory in the path_map cache is referenced, dfs_dup() makes a (cheap) copy of the dfs_obj_t to return to the caller, so the cached copy is never used outside the cache. As a result, dfsLookUp() now always returns a non-null dfs_obj_t which must be released when no longer in use. Another advantage of dfs_dup() is that it is then safe to clear a filesystem's path_map cache at any moment; no caller can be using a cached dfs_obj_t at that time.
  - All relative path references in the code have been replaced with references to a dfs_path_t class which encapsulates everything known about a particular DFS path, including the filesystem in which the path resides.
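The dfs_dup() ownership rule above can be modeled with std::shared_ptr (an illustration only, not DAOS code; DirObj, path_map, and LookupDir are invented names). The cache keeps its own handle and every lookup hands the caller an independent copy, so callers always release what they were given and the cache can be cleared at any moment without invalidating outstanding handles.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Stand-in for a dfs_obj_t directory handle.
struct DirObj { std::string name; };

// Stand-in for the per-filesystem path_map cache.
std::map<std::string, std::shared_ptr<DirObj>> path_map;

std::shared_ptr<DirObj> LookupDir(const std::string& path) {
  auto it = path_map.find(path);
  if (it == path_map.end()) {
    it = path_map.emplace(path, std::make_shared<DirObj>(DirObj{path})).first;
  }
  // Return an independent copy of the handle, as dfs_dup() would;
  // the cached copy is never handed out directly.
  return it->second;
}
```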
    Member functions make it easy to update the correct caches for the correct filesystem for each path. There were also many places throughout the code doing string manipulation, e.g. to extract a parent pathname or a basename; that code has been replaced with dfs_path_t member functions so that the actual string manipulation occurs in a single place in the plugin.
  - Setup() now initializes a dfs_path_t instead of global pool, cont, and rel_path variables. It also does some minor lexical normalization of the rel_path member, rather than doing so in multiple places downstream.
  - Code was modified in various places so that 100 of the tests in the tensorflow 'modular_filesystem_test' suite pass. Three failing tests remain: one is an incorrect test, one checks a function not implemented in the plugin, and the third reports failures in TranslateName(), which will be handled in a separate PR.
  - The plugin was coded to use 'dfs://' as the filesystem prefix, but the DAOS client API internally uses 'daos://'. The plugin was changed to use 'daos://' so that pathnames used by one application do not have to be munged in order to also work with tensorflow.
  Per-file changes:
  dfs_utils.h:
  - A per-container class, cont_info_t, was added that maintains all per-filesystem state.
  - A dfs_path_t class was added that maintains all per-file state; it knows which filesystem the file is in, e.g. in order to update the correct cache maps.
  - The global fields connected, daos_fs, pool, container, path_map, and size_map were removed, replaced by the per-filesystem versions.
  - Mounts are now done at the same time as connection to the container; filesystems remain mounted until their containers are disconnected.
  dfs_filesystem.cc:
  - Many functions were made static so that they do not show up in the library's symbol table, avoiding potential conflicts with other plugins.
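A toy splitter for the daos://pool/container/relpath form gives a feel for the components involved. This is a hypothetical sketch with invented names; the real plugin does not hand-parse the URI but delegates validation to duns_resolve_path() with the DUNS_NO_CHECK_PATH flag.

```cpp
#include <cassert>
#include <string>

// Components of a "daos://pool/container/relpath" URI (invented helper).
struct ParsedPath {
  std::string pool, cont, rel;
  bool ok = false;
};

ParsedPath SplitDaosPath(const std::string& uri) {
  ParsedPath p;
  const std::string prefix = "daos://";
  if (uri.compare(0, prefix.size(), prefix) != 0) return p;  // wrong scheme
  std::string rest = uri.substr(prefix.size());
  size_t s1 = rest.find('/');
  if (s1 == std::string::npos || s1 == 0) return p;  // need pool + container
  size_t s2 = rest.find('/', s1 + 1);
  p.pool = rest.substr(0, s1);
  if (s2 == std::string::npos) {
    p.cont = rest.substr(s1 + 1);
    p.rel = "/";                       // no relative part: filesystem root
  } else {
    p.cont = rest.substr(s1 + 1, s2 - s1 - 1);
    p.rel = rest.substr(s2);           // keep the leading '/'
  }
  if (p.cont.empty()) return p;
  p.ok = true;
  return p;
}
```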
* Changed path references to dfs_path_t references throughout.

DFSRandomAccessFile()
* Replaced the dpath string with the dfs_path_t as a constructor parameter so that the per-filesystem size cache can be updated.

DFSWritableFile()
* Replaced the dpath string with the dfs_path_t as a constructor parameter so that the per-filesystem size cache can be updated whenever the file is appended to.

NewWritableFile()
* Changed the file creation mode parameter to include S_IRUSR so that files can be read when the filesystem is mounted via fuse.

NewAppendableFile()
* Changed the file creation mode parameter to include S_IRUSR so that files can be read when the filesystem is mounted via fuse.

PathExists()
* Reworked the code to work with the new dfsLookUp() behavior. The dfsPathExists() call was removed, as it no longer provided anything not already provided by dfsLookUp(). In addition, many errors returned by dfsPathExists() were mapped to TF_NOT_FOUND, which was incorrect. Also, PathExists() can be called for either files or directories, but dfsPathExists() internally called dfsLookUp() with isDirectory = false, so callers that passed in a directory path would get failures.

Stat()
* Used to call dfs_get_size() and then dfs_ostat(), but the file size is available in stbuf, so the dfs_get_size() call was extra overhead and was removed.

FlushCaches()
* Used to call ClearConnections(), which unmounted any filesystem and disconnected from its container and pool while there could be files open for read or write. The ClearConnections() call was removed. Code was added to clear the size caches as well as the directory caches.

dfs_utils.cc:
* New functions were added for clearing the caches of an individual filesystem and the caches of all mounted filesystems.
* There was code in many places for path string manipulation, checking whether an object was a directory, etc. dfs_path_t member functions were created to replace all of these so that each operation is implemented in only one spot in the code.
DFS::~DFS()
* The code to clear the directory cache only released the first entry; there was no code to iterate through the container. It was replaced with function calls that clear all directory and size caches.

Unmount()
* Now done automatically as part of disconnecting a container; a separate function was no longer needed.

ParseDFSPath()
* The code assumed that any path it was given would have both pool and container components; it was unable to handle malformed paths. The code was changed to let duns_resolve_path() validate the path components. There used to be two calls to duns_resolve_path() because DUNS_NO_CHECK_PATH was not set, and so the first call would fail if the pool and container components were labels; duns_resolve_path() only recognizes uuids if DUNS_NO_CHECK_PATH is not set. When pool and/or container labels were used, the duns_resolve_path() code would check against locally mounted filesystems and would hopefully fail. The code then prepended dfs:: and tried again, which would be recognized as a "direct path". Paths which only contained uuids were successfully parsed with the first duns_resolve_path() call. By using the DUNS_NO_CHECK_PATH flag and always including the daos:// prefix, only a single system call is needed.

Setup()
* Reworked to populate a dfs_path_t instead of separate pool, cont, and relpath variables. A filesystem is now automatically mounted as part of connecting to the container, so a separate function was no longer needed.

ClearConnections()
* The code for looping through pools and containers didn't work properly because the subroutines erase their map entries internally, which invalidates the iterators being used in ClearConnections(). The code was changed so that the iterators are reinitialized each time through the loop.
* Code to disconnect all the containers in a pool was moved into the DisconnectPool() function, so that it is not possible to disconnect a pool without first disconnecting all its containers.
dfsDeleteObject()
* Enhanced to clear only the directory cache for the filesystem in which the object existed.
* If the object was a file, the size cache entry for that file is deleted. If a directory was being recursively deleted, the filesystem's size cache is now also cleared.

dfsLookUp() and dfsFindParent()
* As mentioned at the top, the code was rewritten so that cached directory entries are never returned to a caller; instead a dup reference is returned, so that the caller is always given an object reference it must release.

dfsCreateDir()
* Error exit statuses were enhanced in order to pass the tensorflow 'modular_filesystem_test' test suite.

ConnectPool()
* Simplified somewhat, as the pool id_handle_t is no longer needed.

ConnectContainer()
* Simplified somewhat, as the cont id_handle_t is no longer needed.
* Added code to immediately mount any container that is connected.
* Added code to initialize all the per-filesystem state variables.

DisconnectPool()
* Added code to disconnect any containers before disconnecting the pool.

DisconnectContainer()
* Added code to unmount any filesystem before disconnecting its container.

* Added all the dfs_path_t member function implementations.
* Included a few new dsym references for dfs function calls that have been added.

Signed-off-by: Kevan Rehm <[email protected]>
Commit 89c39e3
Commits on Jul 25, 2022
-
Various additional plugin fixes (#6)
Global Changes:
* The plugin was using duplicate definitions of internal DAOS client structures (dfs_obj_t, dfs_t, dfs_entry_t), and would create malloc'd copies of those structs in order to access their private fields. Should DAOS modify those structures in future releases, the plugin would break for those releases. The dependencies on internal fields have been removed; the DAOS client API is now strictly followed.
* The path_map and size_map caches used DFS mount-point-relative pathnames as keys. If more than one DFS filesystem is mounted during the same run, the same relative pathname could be in use in both filesystems, so callers that retrieved values from the caches could get results for the wrong filesystem. The code was changed to create path_map and size_map caches per filesystem.
* A number of fields (connected, daos_fs, pool, container) were stored in the global DFS structure, which meant that any time a path was presented to the plugin that was in a different DFS filesystem than the previous path, the current filesystem would have to be unmounted and the new filesystem mounted, even though the application could have had files open in the filesystem that was unmounted. The code was changed to maintain filesystem state per pool/container combination, so any number of DFS filesystems can now be mounted simultaneously.
* None of the code in the DFS Cleanup() function was ever being called. This is a known tensorflow issue, see tensorflow/tensorflow#27535. The workaround is to call Cleanup() via the atexit() function.
* The RenameFile() function was enhanced to delete the cached size of the source file and store that cached size for the destination file.
* The dfsLookUp() routine required the caller to indicate whether or not the object to be looked up was a directory. This was necessary because dfs_open() was being used to open the object, and that call requires a different open_mode for directories and files.
Commit c53f6a0
Commits on Jul 26, 2022
Commit 66a764a
Commit 83596c8