Chunker Memory leak #647

Open
Xib1uvXi opened this issue Jul 30, 2024 · 5 comments
Labels
need/analysis Needs further analysis before proceeding

Comments

@Xib1uvXi

When I add a folder containing many small files to kubo, memory usage always spikes abnormally.

  • Total folder size: 400MiB
  • Number of files: 10000
  • Average size per file: 40KiB

I located the source of the problem by analyzing a heap profile:

[heap profile screenshot]

// NewSizeSplitter returns a new size-based Splitter with the given block size.
func NewSizeSplitter(r io.Reader, size int64) Splitter {
	return &sizeSplitterv2{
		r:    r,
		size: uint32(size),
	}
}

// NextBytes produces a new chunk.
func (ss *sizeSplitterv2) NextBytes() ([]byte, error) {
	if ss.err != nil {
		return nil, ss.err
	}

	full := pool.Get(int(ss.size))
	n, err := io.ReadFull(ss.r, full)
	switch err {
	case io.ErrUnexpectedEOF:
		// Final, partial chunk: copy into a fresh, exactly-sized heap
		// allocation and return the pooled buffer. Every file whose size
		// is not a multiple of the chunk size takes this path once.
		ss.err = io.EOF
		small := make([]byte, n)
		copy(small, full)
		pool.Put(full)
		return small, nil
	case nil:
		return full, nil
	default:
		pool.Put(full)
		return nil, err
	}
}

// Reader returns the io.Reader associated to this Splitter.
func (ss *sizeSplitterv2) Reader() io.Reader {
	return ss.r
}

Q1: Will using a pool result in abnormal memory consumption when the file size is much smaller than the chunk size?
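
For reference, here is a minimal sketch that drives this path in a loop of many small files. It assumes the boxo chunker API (github.com/ipfs/boxo/chunker); the 40 KiB payload and 10000 iterations mirror the numbers above.

package main

import (
	"bytes"
	"io"

	chunk "github.com/ipfs/boxo/chunker"
)

func main() {
	// One 40 KiB "file", far below the 256 KiB default chunk size.
	data := make([]byte, 40*1024)
	for i := 0; i < 10000; i++ {
		spl := chunk.NewSizeSplitter(bytes.NewReader(data), chunk.DefaultBlockSize)
		for {
			b, err := spl.NextBytes()
			if err == io.EOF {
				break
			}
			if err != nil {
				panic(err)
			}
			// Each simulated file hits the io.ErrUnexpectedEOF branch
			// exactly once, allocating a fresh 40 KiB slice outside the pool.
			_ = b
		}
	}
}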

@lidel lidel added the need/analysis Needs further analysis before proceeding label Jul 30, 2024
@gammazero
Contributor

gammazero commented Aug 1, 2024

Will using a pool result in abnormal memory consumption when the file size is much smaller than the chunk size?

The pool allocates buffers whose capacity is the requested chunk size rounded up to the next power of 2. A small file causes a fresh allocation outside the pool, and the pooled buffer is returned to the pool.

		small := make([]byte, n)
		copy(small, full) 
		pool.Put(full)  

Returning the buffer to the pool does not necessarily free it; the pool may hold on to it. A large number of files causes one of these extra allocations per file, when reading the last partial chunk. So I do not think it is about small files as such, but about the number of files.
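
To make the sizing concrete, here is a small sketch (assuming github.com/libp2p/go-buffer-pool, which I believe is the pool package the chunker uses):

package main

import (
	"fmt"

	pool "github.com/libp2p/go-buffer-pool"
)

func main() {
	// Pool buffers are rounded up to the next power of two.
	full := pool.Get(200 * 1024)      // request 204800 bytes
	fmt.Println(len(full), cap(full)) // 204800 262144

	// The small-file path copies into a plain allocation and returns
	// the pooled buffer; this copy happens once per file.
	small := make([]byte, 40*1024)
	copy(small, full[:len(small)])
	pool.Put(full)
	fmt.Println(len(small), cap(small)) // 40960 40960
}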

It may be better to keep the pool-allocated buffer, and avoid the extra allocation and copy:

		small := full[:n]

or use the pool to allocate the smaller buffer:

		small := pool.Get(n)
		copy(small, full)
		pool.Put(full)

Having fewer different buffer sizes might help with GC. Will test to see if this makes any difference.
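
For illustration, here is how NextBytes could look with the first alternative applied. This is only a sketch, not a committed change; the trade-off is that the returned slice pins the whole power-of-two pool buffer until the caller drops it.

// Sketch: return the pooled buffer truncated in place instead of copying.
func (ss *sizeSplitterv2) NextBytes() ([]byte, error) {
	if ss.err != nil {
		return nil, ss.err
	}

	full := pool.Get(int(ss.size))
	n, err := io.ReadFull(ss.r, full)
	switch err {
	case io.ErrUnexpectedEOF:
		ss.err = io.EOF
		// Keep the pooled buffer: no extra allocation, no copy.
		return full[:n], nil
	case nil:
		return full, nil
	default:
		pool.Put(full)
		return nil, err
	}
}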

@gammazero gammazero reopened this Aug 1, 2024
@gammazero
Contributor

accidental close - reopened

@Xib1uvXi
Author

Xib1uvXi commented Aug 1, 2024

I have reimplemented the Splitter and modified some kubo-related code. With a large number of files, memory usage has stabilized and no abnormal growth has been observed.

chunker:

func NewAutoSplitter(r io.Reader, size int64, fileSize int64) chunk.Splitter {
	ss := &AutoSplitter{
		r:        r,
		size:     size,
		usePool:  true,
		fileSize: fileSize,
	}

	if fileSize >= size {
		ss.mode = normal
	} else {
		// fileSize < size: the whole file fits in one chunk, so
		// pre-size a single buffer instead of drawing from the pool.
		ss.mode = small
		ss.buffer = make([]byte, fileSize)
	}

	return ss
}
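
The snippet above omits the struct and NextBytes. Purely as a sketch of the idea (everything beyond the constructor is my guess, not the actual implementation), the small mode could read the whole file into the pre-sized buffer in one shot:

// Hypothetical sketch; field names follow the constructor above.
func (ss *AutoSplitter) NextBytes() ([]byte, error) {
	if ss.err != nil {
		return nil, ss.err
	}
	if ss.mode == small {
		// The whole file fits in one chunk: a single exactly-sized
		// read, with no pool round trip and no extra copy.
		n, err := io.ReadFull(ss.r, ss.buffer)
		ss.err = io.EOF
		if err != nil && err != io.ErrUnexpectedEOF {
			return nil, err
		}
		return ss.buffer[:n], nil
	}
	// normal mode: pooled chunking, as in sizeSplitterv2.
	full := pool.Get(int(ss.size))
	n, err := io.ReadFull(ss.r, full)
	switch err {
	case io.ErrUnexpectedEOF:
		ss.err = io.EOF
		last := make([]byte, n)
		copy(last, full)
		pool.Put(full)
		return last, nil
	case nil:
		return full, nil
	default:
		pool.Put(full)
		return nil, err
	}
}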

kubo, core/coreapi/unixfs.go

// Constructs a node from reader's data, and adds it. Doesn't pin.
func (adder *Adder) add(reader io.Reader, fileSize int64) (ipld.Node, error) {
	chnk := chunker.NewAutoSplitter(reader, chunkSize, fileSize)
	
	params := ihelper.DagBuilderParams{
		Dagserv:    adder.bufferedDS,
		RawLeaves:  adder.RawLeaves,
		Maxlinks:   ihelper.DefaultLinksPerBlock,
		NoCopy:     adder.NoCopy,
		CidBuilder: adder.CidBuilder,
	}

	db, err := params.New(chnk)
	if err != nil {
		return nil, err
	}
	var nd ipld.Node
	if adder.Trickle {
		nd, err = trickle.Layout(db)
	} else {
		nd, err = balanced.Layout(db)
	}
	if err != nil {
		return nil, err
	}

	return nd, adder.bufferedDS.Commit()
}

@gammazero
Contributor

gammazero commented Aug 1, 2024

@Xib1uvXi I would be very interested to see what this looks like for you after running a while: #649

Otherwise, avoiding the pool entirely may end up being more efficient. I think that may be the case for many files, whether large or small, since the extra allocation is caused by the partially filled chunk at the end of each file.

That PR includes a benchmark comparing a chunker that uses the go-buffer-pool against one that does not. The benchmark simulates your use case of 10000 files, with sizes varying between 20K and 60K bytes.
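
For readers without the PR handy, a benchmark in that spirit might look roughly like this (a sketch, not the actual code from #649; it assumes placement in the chunker package so NewSizeSplitter and DefaultBlockSize are in scope):

import (
	"bytes"
	"io"
	"math/rand"
	"testing"
)

func BenchmarkManySmallFiles(b *testing.B) {
	rng := rand.New(rand.NewSource(1))
	data := make([]byte, 60*1024)
	rng.Read(data)

	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		// 10000 files, each between 20K and 60K bytes.
		for f := 0; f < 10000; f++ {
			size := 20*1024 + rng.Intn(40*1024+1)
			spl := NewSizeSplitter(bytes.NewReader(data[:size]), DefaultBlockSize)
			for {
				_, err := spl.NextBytes()
				if err == io.EOF {
					break
				}
				if err != nil {
					b.Fatal(err)
				}
			}
		}
	}
}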
