
Remove mandatory dependency on pandas #3133

Closed
wants to merge 3 commits

Conversation

@eginhard (Contributor) commented Nov 1, 2023

pandas was only used in a few places to read or write CSV files, which can just as well be done with the standard library. I've changed that accordingly and moved pandas to the optional notebooks requirements.
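For illustration, the kind of stdlib replacement described here might look like the following sketch. The function name, the pipe delimiter, and the column names are assumptions for the example, not taken from the actual TTS formatters:

```python
import csv

def load_metadata(path, delimiter="|"):
    """Read a metadata CSV into a list of dicts using only the standard
    library, roughly what pandas.read_csv previously provided here.
    Name, delimiter, and columns are illustrative."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f, delimiter=delimiter))
```

`csv.DictReader` takes the column names from the file's header row, so callers can keep accessing fields by name much as they did with DataFrame rows.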

I also removed the separate numba pin for Python < 3.9 because only Python >= 3.9 is supported now.

@erogol (Member) commented Nov 8, 2023

Pandas is significantly faster for reading large CSV files. I'd prefer to keep it.

@eginhard (Contributor, Author) commented Nov 8, 2023

> Pandas is significantly faster for reading large CSV files. I'd prefer to keep it.

Is it? For a ~2GB CSV file generated with:

import pandas as pd

# Note: pd._testing.rands_array is a private pandas helper (removed in
# pandas 2.x); it generates an array of random fixed-length strings.
N = 10_000_000
df = pd.DataFrame()
df["filename"] = pd._testing.rands_array(100, N)
df["transcript"] = pd._testing.rands_array(100, N)
df["speaker"] = pd._testing.rands_array(10, N)
df["emotion"] = pd._testing.rands_array(10, N)
df.to_csv("random.csv", sep=",")

Loading with:

import csv
import time

import pandas as pd

input_file = "random.csv"

start_time = time.time()
with open(input_file, newline="") as f:
    for row in csv.reader(f):
        pass
print("csv.reader took %s seconds" % (time.time() - start_time))

start_time = time.time()
with open(input_file, newline="") as f:
    for row in csv.DictReader(f):
        pass
print("csv.DictReader took %s seconds" % (time.time() - start_time))

start_time = time.time()
data = pd.read_csv(input_file)
# Iterating a DataFrame yields column names, not rows, so this loop is
# effectively free; the timing measures the eager read_csv parse itself.
for row in data:
    pass
print("pd.read_csv took %s seconds" % (time.time() - start_time))

I get:

csv.reader took 9.466115713119507 seconds
csv.DictReader took 15.79174280166626 seconds
pd.read_csv took 23.6483256816864 seconds

In any case, this is done only once before training, which takes much longer, so I don't think it's significant either way.

@akx (Contributor) commented Nov 10, 2023

Yeah, Pandas is definitely not faster if you're just using it to read or write CSV – especially the `.itertuples()` iteration that tts ends up doing can be significantly slow, since Pandas DataFrames aren't meant to be used like that.

Not pulling in Pandas as a dependency is a win for CI times and for saving the planet, too.

In addition, using csv opens up the possibility of making these calls stream content instead of having to read everything into a DF first.
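The streaming idea mentioned here could be sketched as a generator that yields one row at a time, keeping memory use constant regardless of file size. Names and the pipe delimiter are assumptions for the example:

```python
import csv
from typing import Dict, Iterator

def iter_metadata(path, delimiter="|") -> Iterator[Dict[str, str]]:
    """Yield metadata rows one at a time instead of materializing
    everything in a DataFrame first (a hypothetical helper; column
    names come from the file's header row)."""
    with open(path, newline="", encoding="utf-8") as f:
        yield from csv.DictReader(f, delimiter=delimiter)
```

A caller can then filter or transform rows on the fly, e.g. `[r for r in iter_metadata(path) if r["speaker"] == "a"]`, without ever holding the whole file in memory.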

@erogol (Member) commented Nov 13, 2023

This is what I get on my side:

csv.reader took 14.11574387550354 seconds
csv.DictReader took 25.624335289001465 seconds
pd.read_csv took 13.326128244400024 seconds

I am against changing anything that already works fine without a practical reason. Any unnoticed problem in loading affects the entire training result, and I am wary of that.

I'd rather struggle with deps than take that risk.

@erogol (Member) commented Nov 13, 2023

I guess one solution to the issue is separating deps for inference and training. Given that more people are interested in using the models, they would not want to go to the trouble of dealing with extra deps.

@akx (Contributor) commented Nov 13, 2023

csv.reader took 12.972580194473267 seconds
csv.DictReader took 17.6508891582489 seconds
pd.read_csv took 18.601688861846924 seconds

on my machine (repeatedly).

> I am against changing anything that already works fine without a practical reason. Any unnoticed problem in loading affects the entire training result, and I am wary of that.

How do we know that code works fine? If there's a test for loading those datasets, it should catch any problems, right?

@eginhard (Contributor, Author)

I can add some tests as well if that helps (and might factor it out into read_csv/write_csv functions then that could be used by all formatters, most of them currently do some ad-hoc CSV parsing).
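The shared writer half of those factored-out helpers could look roughly like this. The name `write_csv`, its signature, and the pipe delimiter are hypothetical, not the actual proposal:

```python
import csv

def write_csv(path, rows, fieldnames=None, delimiter="|"):
    """Hypothetical shared writer for the formatters: writes a sequence
    of dicts to CSV, inferring the header from the first row if no
    fieldnames are given."""
    rows = list(rows)
    if fieldnames is None:
        fieldnames = list(rows[0]) if rows else []
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, delimiter=delimiter)
        writer.writeheader()
        writer.writerows(rows)
```

Centralizing reading and writing in two such functions would also give a single obvious place to attach the tests discussed here.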

> I guess one solution to the issue is separating deps for inference and training. Given that more people are interested in using the models, they would not want to go to the trouble of dealing with extra deps.

That would be very useful and I can work on this. How about also making the other language-specific deps optional like already done for Japanese?
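A minimal sketch of what such a split could look like with setuptools extras, assuming a setup.py-based layout; every package name and extras group below is illustrative, not the project's actual dependency list:

```python
# Hypothetical setup.py fragment: `pip install TTS` would pull only the
# inference deps, while `pip install TTS[train]` adds training-only ones.
from setuptools import setup

setup(
    name="TTS",
    install_requires=[
        # inference-only core deps would stay here
    ],
    extras_require={
        "train": ["tensorboard", "pandas"],  # training-only (illustrative)
        "notebooks": ["pandas"],             # where this PR moved pandas
        "ja": ["mecab-python3"],             # language-specific, like Japanese
    },
)
```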

@erogol (Member) commented Nov 14, 2023

@eginhard if someone is willing to do that, I am totally in, but it's a hell that I can't budge right now :)

About testing, I'd rather keep pandas. So the problem with pandas is dep conflicts, right? If so, how about fixing that? Would that work for you?

@akx (Contributor) commented Nov 15, 2023

> if someone is willing to do that [splitting dependencies], I am totally in but it's a hell that I can't budge right now :)

I could also take a shot at that (been trying to do that before...), but would such a PR get accepted? I mean, I'm still kinda waiting for @erogol to take a look at #3004 too, which is less intrusive than that 😉

> So the problem with pandas is dep conflicts, right? If so, how about fixing that? Would that work for you?

Generally speaking, the problem is downloading and installing a whole host of dependencies when they're not needed, contributing to e.g. slower CI times (which, I understand, is a problem in this repo) and more disk space consumed.

(Anecdotally, in a Python 3.11 Docker container, installing pandas (with its dependencies) for the heck of it takes 12 seconds and consumes 204 megabytes of disk space.)

For any extra dependency, there's of course added surface for supply chain attacks. That's likely not an issue for pandas, but should a ne'er-do-well gain publish access to any of the 141 packages that every single coqui-ai/TTS installation requires (Babel-2.13.1 GitPython-3.1.40 Jinja2-3.1.2 MarkupSafe-2.1.3 TTS-0.20.4 Werkzeug-3.0.1 absl-py-2.0.0 accelerate-0.24.1 aiohttp-3.8.6 aiosignal-1.3.1 anyascii-0.3.2 appdirs-1.4.4 async-timeout-4.0.3 attrs-23.1.0 audioread-3.0.1 bangla-0.0.2 blinker-1.7.0 bnnumerizer-0.0.2 bnunicodenormalizer-0.1.6 cachetools-5.3.2 certifi-2023.7.22 cffi-1.16.0 charset-normalizer-3.3.2 clean-fid-0.1.35 click-8.1.7 clip-anytorch-2.5.2 contourpy-1.2.0 coqpit-0.0.17 cycler-0.12.1 cython-0.29.30 dateparser-1.1.8 dctorch-0.1.2 decorator-5.1.1 docker-pycreds-0.4.0 docopt-0.6.2 einops-0.6.1 encodec-0.1.1 filelock-3.13.1 flask-2.3.3 fonttools-4.44.0 frozenlist-1.4.0 fsspec-2023.6.0 ftfy-6.1.1 g2pkk-0.1.2 gitdb-4.0.11 google-auth-2.23.4 google-auth-oauthlib-1.1.0 grpcio-1.59.2 gruut-2.2.3 gruut-ipa-0.13.0 gruut_lang_de-2.0.0 gruut_lang_en-2.0.0 gruut_lang_es-2.0.0 gruut_lang_fr-2.0.2 hangul-romanize-0.1.0 huggingface-hub-0.19.2 idna-3.4 imageio-2.32.0 inflect-5.6.2 itsdangerous-2.1.2 jamo-0.4.1 jieba-0.42.1 joblib-1.3.2 jsonlines-1.2.0 jsonmerge-1.9.2 jsonschema-4.19.2 jsonschema-specifications-2023.11.1 k-diffusion-0.1.1 kiwisolver-1.4.5 kornia-0.7.0 lazy-loader-0.3 librosa-0.10.1 llvmlite-0.40.1 markdown-3.5.1 matplotlib-3.7.3 mpmath-1.3.0 msgpack-1.0.7 multidict-6.0.4 networkx-2.8.8 nltk-3.8.1 num2words-0.5.13 numba-0.57.0 numpy-1.24.3 oauthlib-3.2.2 packaging-23.1 pandas-1.5.3 pillow-10.0.1 platformdirs-4.0.0 pooch-1.8.0 protobuf-4.23.4 psutil-5.9.6 pyasn1-0.5.0 pyasn1-modules-0.3.0 pycparser-2.21 pynndescent-0.5.10 pyparsing-3.1.1 pypinyin-0.49.0 pysbd-0.3.4 python-crfsuite-0.9.9 python-dateutil-2.8.2 pytz-2023.3.post1 pyyaml-6.0.1 referencing-0.31.0 regex-2023.10.3 requests-2.31.0 requests-oauthlib-1.3.1 rpds-py-0.12.0 rsa-4.9 
safetensors-0.4.0 scikit-image-0.22.0 scikit-learn-1.3.0 scipy-1.11.3 sentry-sdk-1.35.0 setproctitle-1.3.3 six-1.16.0 smmap-5.0.1 soundfile-0.12.1 soxr-0.3.7 sympy-1.12 tensorboard-2.15.1 tensorboard-data-server-0.7.2 threadpoolctl-3.2.0 tifffile-2023.9.26 tokenizers-0.13.3 torch-2.1.0 torchaudio-2.1.0 torchdiffeq-0.2.3 torchsde-0.2.6 torchvision-0.16.0 tqdm-4.64.1 trainer-0.0.31 trampoline-0.1.2 transformers-4.33.3 typing-extensions-4.8.0 tzlocal-5.2 umap-learn-0.5.4 unidecode-1.3.7 urllib3-2.1.0 wandb-0.16.0 wcwidth-0.2.10 yarl-1.9.2 at the time of writing) there'd be trouble.

> About testing, I'd rather keep pandas.

Why's that? I'm not sure I'm following.

@eginhard (Contributor, Author)

In addition to what @akx said, our motivation is to use TTS as a library without having to pull in so many unnecessary deps. A default version of the package that only includes the inference deps would work for that as well, if you really want to keep pandas. I'll probably look into better logging (#1691) first, though.

@eginhard closed this Nov 16, 2023
@akx (Contributor) commented Nov 17, 2023

@eginhard FWIW, if you're going to work on fixing the logging, ast-grep might help:

ast-grep run --pattern 'print($VAL)' --rewrite 'logger.info($VAL)' -l python -A tts

(I did just that in Stability-AI/generative-models#77)

@eginhard deleted the remove-pandas branch on April 8, 2024