Skip to content

teragrep/dpf_03

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

44 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Coverity Scan Build Status

DPF_03

Holds a customized lexical tokenizer as a Spark UDF and a bloom filter aggregator. Used to tokenize spark string columns and run Spark aggregation into a bloom filter with configurable filter size selection.

Features

TokenizerUDF

Spark UDF that will tokenize incoming string value and return it as a list of byte arrays. Tokenization rules are set in blf_01.

ByteArrayListAsStringListUDF

Spark UDF that converts results from TokenizerUDF into a list of strings.

BloomFilterAggregator

Custom spark aggregator that aggregates a column string tokens into a single bloom filter and returns the bytes of the resulting filter that can be processed.

Filter size is selected by giving the aggregator the name of a spark column that holds an estimated value of tokens and by configuring a map of bloom filters with preset values (expected number of items, false positive probability).

Documentation

See the official documentation on docs.teragrep.com.

Limitations

Compatible with Java version 1.8, other versions might not work.

How to [compile/use/implement]

See tests for how to apply and import into a Spark project

Contributing

You can involve yourself with our project by opening an issue or submitting a pull request.

Contribution requirements:

  1. All changes must be accompanied by a new or changed test. If you think testing is not required in your pull request, include a sufficient explanation as why you think so.

  2. Security checks must pass

  3. Pull requests must align with the principles and values of extreme programming.

  4. Pull requests must follow the principles of Object Thinking and Elegant Objects (EO).

Read more in our Contributing Guideline.

Contributor License Agreement

Contributors must sign Teragrep Contributor License Agreement before a pull request is accepted to organization’s repositories.

You need to submit the CLA only once. After submitting the CLA you can contribute to all Teragrep’s repositories.