
Training Dictionary Issue with Small Sample Sizes and from_files Function #260

Open
DvirYo-starkware opened this issue Jan 31, 2024 · 2 comments

@DvirYo-starkware

I'm running into an issue when training a dictionary from a small number of samples. It affects all of the training functions: from_continuous, from_files, and from_samples. A code example demonstrating the problem is provided below.

I noticed that the documentation for from_continuous says "Train a dictionary from a big continuous chunk of data." While this behavior might be intentional for from_continuous, it is not stated for from_files, which is the function I'm actually interested in.

My use case is training a dictionary on a dataset of a million to a few hundred million items, each ranging in size from a few hundred bytes to a few dozen kilobytes. However, writing all the data to a single file and training on that fails with the error described above, and padding the input with dummy empty files just to make training run produces a very poor dictionary.

What is the recommended approach for this use case? Should I split the data into several smaller files, and if so, what file size would allow training to succeed without hurting the quality of the resulting dictionary? Any guidance is appreciated.

#[test]
fn train_dict_fail_for_small_size() {
    const SAMPLE_LENGTH: usize = 10;

    // Failure: two 10-byte samples are rejected with "Src size is incorrect".
    let samples = [[0u8; SAMPLE_LENGTH]; 2];
    let ret_val = zstd::dict::from_samples(&samples, 1000);
    assert_eq!(
        format!("{ret_val:?}"),
        r#"Err(Custom { kind: Other, error: "Src size is incorrect" })"#
    );

    // Success: ten samples of the same size train without error.
    let samples = [[0u8; SAMPLE_LENGTH]; 10];
    let ret_val = zstd::dict::from_samples(&samples, 1000);
    assert!(ret_val.is_ok());
}
@gyscos (Owner) commented Mar 21, 2024

Hi, and thanks for the report!

from_continuous needs the data to be continuous in memory, not necessarily in a single file. This is the API that the C library uses.

from_files is a convenience method that builds the continuous chunk of memory from the given files.
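As a minimal usage sketch of that (the file names and the 100 * 1024 target size here are just placeholders, not values from this thread):

use std::io;

fn build_dict_from_files() -> io::Result<Vec<u8>> {
    // Each file holds one typical "item"; see the note on splitting below.
    zstd::dict::from_files(["item-0001.bin", "item-0002.bin"], 100 * 1024)
}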

from_samples is similar to from_files but directly takes a list of samples - it still needs to internally copy everything into a large continuous chunk to actually train.
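Roughly, and purely as an illustration (a sketch of the idea, not the crate's actual code; the helper name is made up), that internal flattening can be pictured like this:

fn from_samples_sketch<S: AsRef<[u8]>>(samples: &[S], max_size: usize) -> std::io::Result<Vec<u8>> {
    // Concatenate every sample into one contiguous buffer, remembering each
    // sample's length, then train on the combined buffer.
    let mut data = Vec::new();
    let mut sizes = Vec::new();
    for sample in samples {
        data.extend_from_slice(sample.as_ref());
        sizes.push(sample.as_ref().len());
    }
    zstd::dict::from_continuous(&data, &sizes, max_size)
}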

Note that there are limits to the input data that zstd can use: if the entire input data (the sum of all files) is too small, it will reject the request. If it's too big, it will most likely ignore the data beyond some amount. I'm not entirely sure where to find the exact cutoff values.

As for the splitting, you should have each sample, or each file, represent a typical "message" or "item" you would try to compress in real conditions.
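A sketch of that advice for the use case described above (assumptions: the items are already in memory as Vec<u8>, the 100 KB target size is only an example, and the function name is made up):

fn train_dictionary(items: &[Vec<u8>]) -> std::io::Result<Vec<u8>> {
    // One entry in `items` per real-world "item"/"message"; do not
    // concatenate them into a single giant sample.
    // Upstream zstd's ZDICT docs suggest a total sample size of roughly
    // 100x the target dictionary size, so plenty of samples help.
    let max_dict_size = 100 * 1024; // example: 100 KB dictionary
    zstd::dict::from_samples(items, max_dict_size)
}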

@DvirYo-starkware (Author) commented

Thank you for the answer.
I hadn't considered that there might be limitations like that. Once aware of them, a quick search turned up the following zstd issue, which says the dictionary training input is limited to 2 GB:
facebook/zstd#3111
