Chunker Memory leak #647

Open
Xib1uvXi opened this issue Jul 30, 2024 · 5 comments
Labels
need/analysis Needs further analysis before proceeding

Comments

@Xib1uvXi

When I add a folder containing many small files to kubo, memory usage always spikes abnormally.

  • Total folder size: 400MiB
  • Number of files: 10000
  • Average size per file: 40KiB

I located the source of the problem by analyzing a heap profile:

[heap profile screenshot]

// NewSizeSplitter returns a new size-based Splitter with the given block size.
func NewSizeSplitter(r io.Reader, size int64) Splitter {
	return &sizeSplitterv2{
		r:    r,
		size: uint32(size),
	}
}

// NextBytes produces a new chunk.
func (ss *sizeSplitterv2) NextBytes() ([]byte, error) {
	if ss.err != nil {
		return nil, ss.err
	}

	full := pool.Get(int(ss.size))
	n, err := io.ReadFull(ss.r, full)
	switch err {
	case io.ErrUnexpectedEOF:
		// Final, partial chunk: copy into a fresh, exactly-sized heap
		// allocation and return the pooled buffer. Every file whose size
		// is not a multiple of the chunk size takes this path once.
		ss.err = io.EOF
		small := make([]byte, n)
		copy(small, full)
		pool.Put(full)
		return small, nil
	case nil:
		return full, nil
	default:
		pool.Put(full)
		return nil, err
	}
}

// Reader returns the io.Reader associated to this Splitter.
func (ss *sizeSplitterv2) Reader() io.Reader {
	return ss.r
}

Q1: Will using a pool result in abnormal memory consumption when the file size is much smaller than the chunk size?
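
For reference, here is a minimal sketch that drives this path in a loop of many small files. It assumes the boxo chunker API (github.com/ipfs/boxo/chunker); the 40 KiB payload and 10000 iterations mirror the numbers above.

package main

import (
	"bytes"
	"io"

	chunk "github.com/ipfs/boxo/chunker"
)

func main() {
	// One 40 KiB "file", far below the 256 KiB default chunk size.
	data := make([]byte, 40*1024)
	for i := 0; i < 10000; i++ {
		spl := chunk.NewSizeSplitter(bytes.NewReader(data), chunk.DefaultBlockSize)
		for {
			b, err := spl.NextBytes()
			if err == io.EOF {
				break
			}
			if err != nil {
				panic(err)
			}
			// Each simulated file hits the io.ErrUnexpectedEOF branch
			// exactly once, allocating a fresh 40 KiB slice outside the pool.
			_ = b
		}
	}
}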

@lidel lidel added the need/analysis Needs further analysis before proceeding label Jul 30, 2024
@gammazero
Contributor

gammazero commented Aug 1, 2024

Will using a pool result in abnormal memory consumption when the file size is much smaller than the chunk size?

The pool allocates buffers whose capacity is the requested chunk size rounded up to the next power of 2. A small file causes a fresh allocation outside the pool, and the pooled buffer is returned to the pool.

		small := make([]byte, n)
		copy(small, full) 
		pool.Put(full)  

Returning the buffer to the pool does not necessarily free it; the pool may hold on to it. A large number of files causes one of these extra allocations per file, when reading the last partial chunk. So I do not think it is about small files as such, but about the number of files.
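
To make the sizing concrete, here is a small sketch (assuming github.com/libp2p/go-buffer-pool, which I believe is the pool package the chunker uses):

package main

import (
	"fmt"

	pool "github.com/libp2p/go-buffer-pool"
)

func main() {
	// Pool buffers are rounded up to the next power of two.
	full := pool.Get(200 * 1024)      // request 204800 bytes
	fmt.Println(len(full), cap(full)) // 204800 262144

	// The small-file path copies into a plain allocation and returns
	// the pooled buffer; this copy happens once per file.
	small := make([]byte, 40*1024)
	copy(small, full[:len(small)])
	pool.Put(full)
	fmt.Println(len(small), cap(small)) // 40960 40960
}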

It may be better to keep the pool-allocated buffer, and avoid the extra allocation and copy:

		small := full[:n]

or use the pool to allocate the smaller buffer:

		small := pool.Get(n)
		copy(small, full)
		pool.Put(full)

Having fewer different buffer sizes might help with GC. Will test to see if this makes any difference.
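
For illustration, here is how NextBytes could look with the first alternative applied. This is only a sketch, not a committed change; the trade-off is that the returned slice pins the whole power-of-two pool buffer until the caller drops it.

// Sketch: return the pooled buffer truncated in place instead of copying.
func (ss *sizeSplitterv2) NextBytes() ([]byte, error) {
	if ss.err != nil {
		return nil, ss.err
	}

	full := pool.Get(int(ss.size))
	n, err := io.ReadFull(ss.r, full)
	switch err {
	case io.ErrUnexpectedEOF:
		ss.err = io.EOF
		// Keep the pooled buffer: no extra allocation, no copy.
		return full[:n], nil
	case nil:
		return full, nil
	default:
		pool.Put(full)
		return nil, err
	}
}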

@gammazero gammazero reopened this Aug 1, 2024
@gammazero
Contributor

accidental close - reopened

@Xib1uvXi
Author

Xib1uvXi commented Aug 1, 2024

I have reimplemented the Splitter and modified some kubo-related code. With a large number of files, memory usage has stabilized and no abnormal growth has been observed.

chunker:

func NewAutoSplitter(r io.Reader, size int64, fileSize int64) chunk.Splitter {
	ss := &AutoSplitter{
		r:        r,
		size:     size,
		usePool:  true,
		fileSize: fileSize,
	}

	if fileSize >= size {
		ss.mode = normal
	} else {
		// fileSize < size: the whole file fits in one chunk, so
		// pre-size a single buffer instead of drawing from the pool.
		ss.mode = small
		ss.buffer = make([]byte, fileSize)
	}

	return ss
}
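
The snippet above omits the struct and NextBytes. Purely as a sketch of the idea (everything beyond the constructor is my guess, not the actual implementation), the small mode could read the whole file into the pre-sized buffer in one shot:

// Hypothetical sketch; field names follow the constructor above.
func (ss *AutoSplitter) NextBytes() ([]byte, error) {
	if ss.err != nil {
		return nil, ss.err
	}
	if ss.mode == small {
		// The whole file fits in one chunk: a single exactly-sized
		// read, with no pool round trip and no extra copy.
		n, err := io.ReadFull(ss.r, ss.buffer)
		ss.err = io.EOF
		if err != nil && err != io.ErrUnexpectedEOF {
			return nil, err
		}
		return ss.buffer[:n], nil
	}
	// normal mode: pooled chunking, as in sizeSplitterv2.
	full := pool.Get(int(ss.size))
	n, err := io.ReadFull(ss.r, full)
	switch err {
	case io.ErrUnexpectedEOF:
		ss.err = io.EOF
		last := make([]byte, n)
		copy(last, full)
		pool.Put(full)
		return last, nil
	case nil:
		return full, nil
	default:
		pool.Put(full)
		return nil, err
	}
}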

kubo, core/coreapi/unixfs.go

// Constructs a node from reader's data, and adds it. Doesn't pin.
func (adder *Adder) add(reader io.Reader, fileSize int64) (ipld.Node, error) {
	chnk := chunker.NewAutoSplitter(reader, chunkSize, fileSize)
	
	params := ihelper.DagBuilderParams{
		Dagserv:    adder.bufferedDS,
		RawLeaves:  adder.RawLeaves,
		Maxlinks:   ihelper.DefaultLinksPerBlock,
		NoCopy:     adder.NoCopy,
		CidBuilder: adder.CidBuilder,
	}

	db, err := params.New(chnk)
	if err != nil {
		return nil, err
	}
	var nd ipld.Node
	if adder.Trickle {
		nd, err = trickle.Layout(db)
	} else {
		nd, err = balanced.Layout(db)
	}
	if err != nil {
		return nil, err
	}

	return nd, adder.bufferedDS.Commit()
}

@gammazero
Contributor

gammazero commented Aug 1, 2024

@Xib1uvXi I would be very interested to see what this looks like for you after running a while: #649

Otherwise, avoiding the pool entirely may end up being more efficient. I think that may be the case for many files, whether large or small, since the extra allocation is caused by the partially filled chunk at the end of each file.

That PR includes a benchmark comparing a chunker that uses the go-buffer-pool against one that does not. The benchmark simulates your use case of 10000 files, with sizes varying between 20K and 60K bytes.
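
For readers without the PR handy, a benchmark in that spirit might look roughly like this (a sketch, not the actual code from #649; it assumes placement in the chunker package so NewSizeSplitter and DefaultBlockSize are in scope):

import (
	"bytes"
	"io"
	"math/rand"
	"testing"
)

func BenchmarkManySmallFiles(b *testing.B) {
	rng := rand.New(rand.NewSource(1))
	data := make([]byte, 60*1024)
	rng.Read(data)

	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		// 10000 files, each between 20K and 60K bytes.
		for f := 0; f < 10000; f++ {
			size := 20*1024 + rng.Intn(40*1024+1)
			spl := NewSizeSplitter(bytes.NewReader(data[:size]), DefaultBlockSize)
			for {
				_, err := spl.NextBytes()
				if err == io.EOF {
					break
				}
				if err != nil {
					b.Fatal(err)
				}
			}
		}
	}
}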
