
Remove mandatory dependency on pandas #3133

Closed
wants to merge 3 commits

Conversation

@eginhard (Contributor) commented Nov 1, 2023

pandas was only used in a few places to read or write CSV files, which can just as well be done with the standard library. I've changed that accordingly and moved pandas to the optional notebooks requirements.
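For illustration, the kind of stdlib replacement described here might look like the following sketch. The function name, the pipe delimiter, and the column names are assumptions for the example, not taken from the actual TTS formatters:

```python
import csv

def load_metadata(path, delimiter="|"):
    """Read a metadata CSV into a list of dicts using only the standard
    library, roughly what pandas.read_csv previously provided here.
    Name, delimiter, and columns are illustrative."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f, delimiter=delimiter))
```

`csv.DictReader` takes the column names from the file's header row, so callers can keep accessing fields by name much as they did with DataFrame rows.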

I also removed the separate numba pin for Python < 3.9 because only Python >= 3.9 is supported now.

@erogol (Member) commented Nov 8, 2023

Pandas is significantly faster for reading large CSV files. I'd prefer to keep it.

@eginhard (Contributor, Author) commented Nov 8, 2023

> Pandas is significantly faster for reading large CSV files. I'd prefer to keep it.

Is it? For a ~2GB CSV file generated with:

import pandas as pd

# Note: pd._testing.rands_array is a private pandas helper (removed in
# pandas 2.x); it generates an array of random fixed-length strings.
N = 10_000_000
df = pd.DataFrame()
df["filename"] = pd._testing.rands_array(100, N)
df["transcript"] = pd._testing.rands_array(100, N)
df["speaker"] = pd._testing.rands_array(10, N)
df["emotion"] = pd._testing.rands_array(10, N)
df.to_csv("random.csv", sep=",")

Loading with:

import csv
import time

import pandas as pd

input_file = "random.csv"

start_time = time.time()
with open(input_file, newline="") as f:
    for row in csv.reader(f):
        pass
print("csv.reader took %s seconds" % (time.time() - start_time))

start_time = time.time()
with open(input_file, newline="") as f:
    for row in csv.DictReader(f):
        pass
print("csv.DictReader took %s seconds" % (time.time() - start_time))

start_time = time.time()
data = pd.read_csv(input_file)
# Iterating a DataFrame yields column names, not rows, so this loop is
# effectively free; the timing measures the eager read_csv parse itself.
for row in data:
    pass
print("pd.read_csv took %s seconds" % (time.time() - start_time))

I get:

csv.reader took 9.466115713119507 seconds
csv.DictReader took 15.79174280166626 seconds
pd.read_csv took 23.6483256816864 seconds

In any case, this is done only once before training, which takes much longer, so I don't think it's significant either way.

@akx (Contributor) commented Nov 10, 2023

Yeah, Pandas is definitely not faster if you're just using it to read or write CSV – especially the `.itertuples()` iteration that tts ends up doing can be significantly slow, since Pandas DataFrames aren't meant to be used like that.

Not pulling in Pandas as a dependency is a win for CI times and for saving the planet, too.

In addition, using csv opens up the possibility of making these calls stream content instead of having to read everything into a DF first.
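The streaming idea mentioned here could be sketched as a generator that yields one row at a time, keeping memory use constant regardless of file size. Names and the pipe delimiter are assumptions for the example:

```python
import csv
from typing import Dict, Iterator

def iter_metadata(path, delimiter="|") -> Iterator[Dict[str, str]]:
    """Yield metadata rows one at a time instead of materializing
    everything in a DataFrame first (a hypothetical helper; column
    names come from the file's header row)."""
    with open(path, newline="", encoding="utf-8") as f:
        yield from csv.DictReader(f, delimiter=delimiter)
```

A caller can then filter or transform rows on the fly, e.g. `[r for r in iter_metadata(path) if r["speaker"] == "a"]`, without ever holding the whole file in memory.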

@erogol (Member) commented Nov 13, 2023

This is what I get on my side:

csv.reader took 14.11574387550354 seconds
csv.DictReader took 25.624335289001465 seconds
pd.read_csv took 13.326128244400024 seconds

I am against changing anything that already works fine without a practical reason. Any unnoticed problem in loading affects the entire training result, and I am wary of that.

I'd rather struggle with deps than take that risk.

@erogol (Member) commented Nov 13, 2023

I guess one solution to the issue is separating deps for inference and training. Given that more people are interested in using the models, they would not want to go to the trouble of dealing with extra deps.

@akx (Contributor) commented Nov 13, 2023

csv.reader took 12.972580194473267 seconds
csv.DictReader took 17.6508891582489 seconds
pd.read_csv took 18.601688861846924 seconds

on my machine (repeatedly).

> I am against changing anything that already works fine without a practical reason. Any unnoticed problem in loading affects the entire training result, and I am wary of that.

How do we know that code works fine? If there's a test for loading those datasets, it should catch any problems, right?

@eginhard (Contributor, Author)

I can add some tests as well if that helps (and might factor it out into read_csv/write_csv functions then that could be used by all formatters, most of them currently do some ad-hoc CSV parsing).
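The shared writer half of those factored-out helpers could look roughly like this. The name `write_csv`, its signature, and the pipe delimiter are hypothetical, not the actual proposal:

```python
import csv

def write_csv(path, rows, fieldnames=None, delimiter="|"):
    """Hypothetical shared writer for the formatters: writes a sequence
    of dicts to CSV, inferring the header from the first row if no
    fieldnames are given."""
    rows = list(rows)
    if fieldnames is None:
        fieldnames = list(rows[0]) if rows else []
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, delimiter=delimiter)
        writer.writeheader()
        writer.writerows(rows)
```

Centralizing reading and writing in two such functions would also give a single obvious place to attach the tests discussed here.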

> I guess one solution to the issue is separating deps for inference and training. Given that more people are interested in using the models, they would not want to go to the trouble of dealing with extra deps.

That would be very useful and I can work on this. How about also making the other language-specific deps optional like already done for Japanese?
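A minimal sketch of what such a split could look like with setuptools extras, assuming a setup.py-based layout; every package name and extras group below is illustrative, not the project's actual dependency list:

```python
# Hypothetical setup.py fragment: `pip install TTS` would pull only the
# inference deps, while `pip install TTS[train]` adds training-only ones.
from setuptools import setup

setup(
    name="TTS",
    install_requires=[
        # inference-only core deps would stay here
    ],
    extras_require={
        "train": ["tensorboard", "pandas"],  # training-only (illustrative)
        "notebooks": ["pandas"],             # where this PR moved pandas
        "ja": ["mecab-python3"],             # language-specific, like Japanese
    },
)
```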

@erogol (Member) commented Nov 14, 2023

@eginhard if someone is willing to do that, I am totally in, but it's a hell that I can't budge right now :)

About testing, I'd rather keep pandas. So the problem with pandas is dep conflicts, right? If so, how about fixing that? Would that work for you?

@akx (Contributor) commented Nov 15, 2023

> if someone is willing to do that [splitting dependencies], I am totally in but it's a hell that I can't budge right now :)

I could also take a shot at that (been trying to do that before...), but would such a PR get accepted? I mean, I'm still kinda waiting for @erogol to take a look at #3004 too, which is less intrusive than that 😉

> So the problem with pandas is dep conflicts, right? If so, how about fixing that? Would that work for you?

Generally speaking, the problem is downloading and installing a whole host of dependencies when they're not needed, contributing to e.g. slower CI times (which, I understand, is a problem in this repo) and more disk space consumed.

(Anecdotally, in a Python 3.11 Docker container, installing pandas (with its dependencies) for the heck of it takes 12 seconds and consumes 204 megabytes of disk space.)

For any extra dependency, there's of course added surface for supply chain attacks. That's likely not an issue for pandas, but should a ne'er-do-well gain publish access to any of the 141 packages that every single coqui-ai/TTS installation requires (Babel-2.13.1 GitPython-3.1.40 Jinja2-3.1.2 MarkupSafe-2.1.3 TTS-0.20.4 Werkzeug-3.0.1 absl-py-2.0.0 accelerate-0.24.1 aiohttp-3.8.6 aiosignal-1.3.1 anyascii-0.3.2 appdirs-1.4.4 async-timeout-4.0.3 attrs-23.1.0 audioread-3.0.1 bangla-0.0.2 blinker-1.7.0 bnnumerizer-0.0.2 bnunicodenormalizer-0.1.6 cachetools-5.3.2 certifi-2023.7.22 cffi-1.16.0 charset-normalizer-3.3.2 clean-fid-0.1.35 click-8.1.7 clip-anytorch-2.5.2 contourpy-1.2.0 coqpit-0.0.17 cycler-0.12.1 cython-0.29.30 dateparser-1.1.8 dctorch-0.1.2 decorator-5.1.1 docker-pycreds-0.4.0 docopt-0.6.2 einops-0.6.1 encodec-0.1.1 filelock-3.13.1 flask-2.3.3 fonttools-4.44.0 frozenlist-1.4.0 fsspec-2023.6.0 ftfy-6.1.1 g2pkk-0.1.2 gitdb-4.0.11 google-auth-2.23.4 google-auth-oauthlib-1.1.0 grpcio-1.59.2 gruut-2.2.3 gruut-ipa-0.13.0 gruut_lang_de-2.0.0 gruut_lang_en-2.0.0 gruut_lang_es-2.0.0 gruut_lang_fr-2.0.2 hangul-romanize-0.1.0 huggingface-hub-0.19.2 idna-3.4 imageio-2.32.0 inflect-5.6.2 itsdangerous-2.1.2 jamo-0.4.1 jieba-0.42.1 joblib-1.3.2 jsonlines-1.2.0 jsonmerge-1.9.2 jsonschema-4.19.2 jsonschema-specifications-2023.11.1 k-diffusion-0.1.1 kiwisolver-1.4.5 kornia-0.7.0 lazy-loader-0.3 librosa-0.10.1 llvmlite-0.40.1 markdown-3.5.1 matplotlib-3.7.3 mpmath-1.3.0 msgpack-1.0.7 multidict-6.0.4 networkx-2.8.8 nltk-3.8.1 num2words-0.5.13 numba-0.57.0 numpy-1.24.3 oauthlib-3.2.2 packaging-23.1 pandas-1.5.3 pillow-10.0.1 platformdirs-4.0.0 pooch-1.8.0 protobuf-4.23.4 psutil-5.9.6 pyasn1-0.5.0 pyasn1-modules-0.3.0 pycparser-2.21 pynndescent-0.5.10 pyparsing-3.1.1 pypinyin-0.49.0 pysbd-0.3.4 python-crfsuite-0.9.9 python-dateutil-2.8.2 pytz-2023.3.post1 pyyaml-6.0.1 referencing-0.31.0 regex-2023.10.3 requests-2.31.0 requests-oauthlib-1.3.1 rpds-py-0.12.0 rsa-4.9 
safetensors-0.4.0 scikit-image-0.22.0 scikit-learn-1.3.0 scipy-1.11.3 sentry-sdk-1.35.0 setproctitle-1.3.3 six-1.16.0 smmap-5.0.1 soundfile-0.12.1 soxr-0.3.7 sympy-1.12 tensorboard-2.15.1 tensorboard-data-server-0.7.2 threadpoolctl-3.2.0 tifffile-2023.9.26 tokenizers-0.13.3 torch-2.1.0 torchaudio-2.1.0 torchdiffeq-0.2.3 torchsde-0.2.6 torchvision-0.16.0 tqdm-4.64.1 trainer-0.0.31 trampoline-0.1.2 transformers-4.33.3 typing-extensions-4.8.0 tzlocal-5.2 umap-learn-0.5.4 unidecode-1.3.7 urllib3-2.1.0 wandb-0.16.0 wcwidth-0.2.10 yarl-1.9.2 at the time of writing) there'd be trouble.

> About testing, I'd rather keep pandas.

Why's that? I'm not sure I'm following.

@eginhard (Contributor, Author)

In addition to what @akx said, our motivation is to use TTS as a library without having to pull in so many unnecessary deps. A default version of the package that only includes the inference deps would work for that as well, if you really want to keep pandas. I'll probably look into better logging (#1691) first, though.

@eginhard closed this Nov 16, 2023
@akx (Contributor) commented Nov 17, 2023

@eginhard FWIW, if you're going to work on fixing the logging, ast-grep might help:

ast-grep run --pattern 'print($VAL)' --rewrite 'logger.info($VAL)' -l python -A tts

(I did just that in Stability-AI/generative-models#77)

@eginhard deleted the remove-pandas branch on April 8, 2024