- Dockerized demo with support for different Hive versions
- Smoother handling of append log on cloud stores
- Introduced a global bloom index that enforces a unique-key constraint across partitions
- CLI commands to analyze workloads, manage compactions
- Migration guide for folks wanting to move datasets to Hudi
- Added Spark Structured Streaming support, with a Hudi sink
- In-built support for filtering duplicates in DeltaStreamer
- Support for plugging in custom transformation in DeltaStreamer
- Better support for non-partitioned Hive tables
- Support hard deletes for Merge on Read storage
- New Slack URL & site URLs
- Added Presto bundle for easier integration
- Tons of bug fixes, reliability improvements
- @bhasudha - Create hoodie-presto bundle jar. fixes #567 #571
- @bhasudha - Close FSDataInputStream for meta file open in HoodiePartitionMetadata. Fixes issue #573 #574
- @yaooqinn - Handle no such element exception in HoodieSparkSqlWriter #576
- @vinothchandar - Update site url in README
- @yaooqinn - typo: bundle jar with unrecognized variables #570
- @bvaradar - Table rollback for inflight compactions MUST not delete instant files at any time to avoid race conditions #565
- @bvaradar - Fix Hoodie Record Reader to work with non-partitioned dataset (ISSUE-561) #569
- @bvaradar - Hoodie Delta Streamer Features : Transformation and Hoodie Incremental Source with Hive integration #485
- @vinothchandar - Updating new slack signup link #566
- @yaooqinn - Using immutable map instead of mutables to generate parameters #559
- @n3nash - Fixing behavior of buffering in Create/Merge handles for invalid/wrong schema records #558
- @n3nash - cleaner should now use commit timeline and not include deltacommits #539
- @n3nash - Adding compaction to HoodieClient example #551
- @n3nash - Filtering partition paths before performing a list status on all partitions #541
- @n3nash - Passing a path filter to avoid including folders under .hoodie directory as partition paths #548
- @n3nash - Enabling hard deletes for MergeOnRead table type #538
- @msridhar - Add .m2 directory to Travis cache #534
- @artem0 - General enhancements #520
- @bvaradar - Ensure Hoodie works for non-partitioned Hive table #515
- @xubo245 - Fix some spelling errors in Hudi #530
- @leletan - feat(SparkDataSource): add structured streaming sink #486
- @n3nash - Serializing the complete payload object instead of serializing just the GenericRecord in HoodieRecordConverter #495
- @n3nash - Returning empty Statuses for an empty spark partition caused due to incorrect bin packing #510
- @bvaradar - Avoid WriteStatus collect() call when committing batch to prevent Driver side OOM errors #512
- @vinothchandar - Explicitly handle lack of append() support during LogWriting #511
- @n3nash - Fixing number of insert buckets to be generated by rounding off to the closest greater integer #500
- @vinothchandar - Enabling auto tuning of insert splits by default #496
- @bvaradar - Useful Hudi CLI commands to debug/analyze production workloads #477
- @bvaradar - Compaction validate, unschedule and repair #481
- @shangxinli - Fix addMetadataFields() to carry over 'props' #484
- @n3nash - Adding documentation for migration guide and COW vs MOR tradeoffs #470
- @leletan - Add additional feature to drop later arriving dups #468
- @bvaradar - Fix regression bug which broke HoodieInputFormat handling of non-hoodie datasets #482
- @vinothchandar - Add --filter-dupes to DeltaStreamer #478
- @bvaradar - A quickstart demo to showcase Hudi functionalities using docker along with support for integration-tests #455
- @bvaradar - Ensure Hoodie metadata folder and files are filtered out when constructing Parquet Data Source #473
- @leletan - Adds HoodieGlobalBloomIndex #438
- Dependencies are now decoupled from CDH and based on apache versions!
- Support for Hive 2 is here!! Use -Dhive11 to build for older Hive versions
- DeltaStreamer tool reworked to make configs simpler, hardened tests, added Confluent Kafka support
- Provide strong consistency for S3 datasets
- Removed dependency on commons lang3, to ease use with different hadoop/spark versions
- Better CLI support and docs for managing async compactions
- New CLI commands to manage datasets
- @vinothchandar - Perform consistency checks during write finalize #464
- @bvaradar - Travis CI tests need to be run in quieter mode (WARN log level) to avoid max log-size errors #465
- @lys0716 - Fix the name of avro schema file in Test #467
- @bvaradar - Hive Sync handling must work for datasets with multi-partition keys #460
- @bvaradar - Explicitly release resources in LogFileReader and TestHoodieClientBase. Fixes Memory allocation errors #463
- @bvaradar - [Release Blocking] Ensure packaging modules create sources/javadoc jars #461
- @vinothchandar - Fix bug with incrementally pulling older data #458
- @saravsars - Updated jcommander version to fix NPE in HoodieDeltaStreamer tool #443
- @n3nash - Removing dependency on apache-commons lang 3, adding necessary classes as needed #444
- @n3nash - Small file size handling for inserts into log files. #413
- @vinothchandar - Update Gemfile.lock with higher ffi version
- @bvaradar - Simplify and fix CLI to schedule and run compactions #447
- @n3nash - Fix an intermittently failing test case in TestMergeOnRead due to incorrect prev commit time #448
- @bvaradar - CLI to create and desc hoodie table #446
- @vinothchandar - Reworking the deltastreamer tool #449
- @bvaradar - Docs for describing async compaction and how to operate it #445
- @n3nash - Adding check for rolling stats not present in existing timeline to handle backwards compatibility #451
- @bvaradar @vinothchandar - Moving all dependencies off cdh and to apache #420
- @bvaradar - Reduce minimum delta-commits required for compaction #452
- @bvaradar - Use spark Master from environment if set #454
- Added the ability to run compactions asynchronously & in parallel to ingestion/write!!!
- Day-based compaction is now agnostic of IO budgets, i.e., it does not respect them
- Adds ability to throttle writes to HBase via the HBaseIndex
- (Merge on read) Inserts are sent to log files, if they are indexable.
- @n3nash - Adding ability for inserts to be written to log files #400
- @n3nash - Fixing bug introduced in rollback for MOR table type with inserts into log files #417
- @n3nash - Changing Day based compaction strategy to be IO agnostic #398
- @ovj - Changing access level to protected so that subclasses can access it #421
- @n3nash - Fixing missing hoodie record location in HoodieRecord when record is read from disk after being spilled #419
- @bvaradar - Async compaction - Single Consolidated PR #404
- @bvaradar - BUGFIX - Use Guava Optional (which is Serializable) in CompactionOperation to avoid NotSerializableException #435
- @n3nash - Adding another metric to HoodieWriteStat #434
- @n3nash - Fixing Null pointer exception in finally block #440
- @kaushikd49 - Throttling to limit QPS from HbaseIndex #427
- Parallelized Parquet writing & input record read, resulting in up to 2x performance improvement
- Better out-of-box configs to support up to 500GB upserts, improved ROPathFilter performance
- Added a union mode for RT View that supports near-real-time event ingestion without update semantics
- Added a tuning guide with suggestions for oft-encountered problems
- New configs for compression ratio, index storage levels
- @jianxu - Use hadoopConf in HoodieTableMetaClient and related tests #343
- @jianxu - Add more options in HoodieWriteConfig #341
- @n3nash - Adding a tool to read/inspect a HoodieLogFile #328
- @ovj - Parallelizing parquet write and spark's external read operation. #294
- @n3nash - Fixing memory leak due to HoodieLogFileReader holding on to a logblock #346
- @kaushikd49 - DeduplicateRecords based on recordKey if global index is used #345
- @jianxu - Checking storage level before persisting preppedRecords #358
- @n3nash - Adding config for parquet compression ratio #366
- @xjodoin - Replace deprecated jackson version #367
- @n3nash - Making ExternalSpillableMap generic for any datatype #350
- @bvaradar - CodeStyle formatting to conform to basic Checkstyle rules. #360
- @vinothchandar - Update release notes for 0.4.1 (post) #371
- @bvaradar - Issue-329 : Refactoring TestHoodieClientOnCopyOnWriteStorage and adding test-cases #372
- @n3nash - Parallelized read-write operations in Hoodie Merge phase #370
- @n3nash - Using BufferedFsInputStream to wrap FSInputStream for FSDataInputStream #373
- @suniluber - Fix for updating duplicate records in same/different files in same pa… #380
- @bvaradar - Fixit : Add Support for ordering and limiting results in CLI show commands #383
- @n3nash - Adding metrics for MOR and COW #365
- @n3nash - Adding a fix/workaround when fs.append() unable to return a valid outputstream #388
- @n3nash - Minor fixes for MergeOnRead MVP release readiness #387
- @bvaradar - Issue-257: Support union mode in HoodieRealtimeRecordReader for pure insert workloads #379
- @n3nash - Enabling global index for MOR #389
- @suniluber - Added a new filter function to filter by record keys when reading parquet file #395
- @vinothchandar - Improving out of box experience for data source #295
- @xjodoin - Fix wrong use of TemporaryFolder junit rule #411
- Good enhancements for merge-on-read write path : spillable map for merging, evolvable log format, rollback support
- Cloud file systems should now work out-of-box for copy-on-write tables, with configs picked up from SparkContext
- The compaction action is no more; multiple delta commits now lead to a commit upon compaction
- API-level changes include: compaction API, new prepped APIs for higher pluggability for advanced clients
- @n3nash - Separated rollback as a table operation, implement rollback for MOR #247
- @n3nash - Implementing custom payload/merge hooks abstractions for application #275
- @vinothchandar - Reformat project & tighten code style guidelines #280
- @n3nash - Separating out compaction() API #282
- @n3nash - Enable hive sync even if there is no compaction commit #286
- @n3nash - Partition compaction strategy #281
- @n3nash - Removing compaction action type and associated compaction timeline operations, replace with commit action type #288
- @vinothchandar - Multi/Cloud FS Support for Copy-On-Write tables #293
- @vinothchandar - Update Gemfile.lock #298
- @n3nash - Reducing memory footprint required in HoodieAvroDataBlock and HoodieAppendHandle #290
- @jianxu - Add FinalizeWrite in HoodieCreateHandle for COW tables #285
- @n3nash - Adding global indexing to HbaseIndex implementation #318
- @n3nash - Small File Size correction handling for MOR table type #299
- @jianxu - Use FastDateFormat for thread safety #320
- @vinothchandar - Fix formatting in HoodieWriteClient #322
- @n3nash - Write smaller sized multiple blocks to log file instead of a large one #317
- @n3nash - Added support for Disk Spillable Compaction to prevent OOM issues #289
- @jianxu - Add new APIs in HoodieReadClient and HoodieWriteClient #327
- @jianxu - Handle inflight clean instants during Hoodie instants archiving #332
- @n3nash - Introducing HoodieLogFormat V2 with versioning support #331
- @n3nash - Re-factoring Compaction as first level API in WriteClient similar to upsert/insert #330
- Spark datasource API now supported for Copy-On-Write datasets, across all views
- BloomIndex can now prune based on key ranges & cut down index tagging time dramatically, for time-prefixed/ordered record keys
- Hive sync tool registers RO and RT tables now.
- Client application can now specify the partitioner to be used by bulkInsert(), useful for low-level control over initial record placement
- Framework for metadata tracking inside IO handles, to implement Spark accumulator-style counters, that are consistent with the timeline
- Bug fixes around cleaning, savepoints & upsert's partitioner.
- @gekath - Writes relative paths to .commit files #184
- @kaushikd49 - Correct clean bug that causes exception when partitionPaths are empty #202
- @vinothchandar - Refactor HoodieTableFileSystemView using FileGroups & FileSlices #201
- @prazanna - Savepoint should not create a hole in the commit timeline #207
- @jianxu - Fix TimestampBasedKeyGenerator in HoodieDeltaStreamer when DATE_STRING is used #211
- @n3nash - Sync Tool registers 2 tables, RO and RT Tables #210
- @n3nash - Using FsUtils instead of Files API to extract file extension #213
- @vinothchandar - Edits to documentation #219
- @n3nash - Enabled deletes in merge_on_read #218
- @n3nash - Use HoodieLogFormat for the commit archived log #205
- @n3nash - fix for cleaning log files in master branch (mor) #228
- @vinothchandar - Adding range based pruning to bloom index #232
- @n3nash - Use CompletedFileSystemView instead of CompactedView considering deltacommits too #229
- @n3nash - suppressing logs (under 4MB) for jenkins #240
- @jianxu - Add nested fields support for MOR tables #234
- @n3nash - adding new config to separate shuffle and write parallelism #230
- @n3nash - adding ability to read archived files written in log format #252
- @ovj - Removing randomization from UpsertPartitioner #253
- @ovj - Replacing SortBy with custom partitioner #245
- @esmioley - Update deprecated hash function #259
- @vinothchandar - Adding canIndexLogFiles(), isImplicitWithStorage(), isGlobal() to HoodieIndex #268
- @kaushikd49 - Hoodie Event callbacks #251
- @vinothchandar - Spark Data Source (finally) #266
- Refer to GitHub
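
The notes above mention that BloomIndex can prune based on key ranges, which helps most with time-prefixed or ordered record keys. The idea can be illustrated with a minimal sketch. This is not Hudi's implementation: `FileKeyRange` and `candidate_files` are hypothetical names, and the sketch assumes each file tracks the smallest and largest record key it contains, so that a lookup only needs to consult the bloom filters of files whose range could hold the key.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FileKeyRange:
    """Hypothetical per-file metadata: smallest and largest record key in the file."""
    file_id: str
    min_key: str
    max_key: str


def candidate_files(record_key: str, ranges: Iterable[FileKeyRange]) -> List[str]:
    """Keep only files whose [min_key, max_key] range could contain record_key.

    Files outside the range are skipped without consulting their bloom filter.
    """
    return [r.file_id for r in ranges if r.min_key <= record_key <= r.max_key]


# Illustrative data only: time-prefixed keys make the ranges narrow,
# so most files are pruned for any given key.
files = [
    FileKeyRange("file-a", "2018-01-01_000", "2018-01-01_999"),
    FileKeyRange("file-b", "2018-01-02_000", "2018-01-02_999"),
    FileKeyRange("file-c", "2018-01-02_500", "2018-01-03_250"),
]

print(candidate_files("2018-01-02_700", files))  # ['file-b', 'file-c']
```

With random (for example, UUID-style) keys, every file's range tends to span the whole key space, so this kind of pruning would be expected to remove few or no candidates.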