Meta data file_name in the GitHub part of The Pile a bit off #90

Open
thomwolf opened this issue Jul 1, 2021 · 2 comments
@thomwolf

thomwolf commented Jul 1, 2021

Hi,

Apologies if this is not the right place to report this, but after downloading and exploring the preprocessed GitHub part of The Pile, I noticed that the file_name metadata is sometimes a little off, which makes it harder to filter files by file extension.

For instance, in the first sample of data_114_time1601108762_default.jsonl, downloaded from https://the-eye.eu/public/AI/pile_preliminary_components/, file_name is given as jadx_termux.sh, but the text appears to be an extract from the changelog of the same repo.

Not sure how important this is for people here but maybe it should be mentioned somewhere?

{
 "text": "## [1.1]\n### Added\n- Added Update() for auto-update\n\n## [1.2]\n### Added\n- extra flag or option `-a` to use __aapt2__ instead of __aapt__.\n- Issue template\n### Changed\n- use getopts for parameters handling\n### Fixed\n- fix update()\n\n## [1.3]\n### Added\n- add aapt2 to bind()\n### Fixed\n- set `LD_LIBRARY_PATH` to avoid libraries access from termux i.e `$PREFIX/lib`\n\n## [1.4]\n### Added\n- patched binaries of aapt2 to skip invalid names while recompiling\n### Fixed\n- fixes #10\n\n## [1.5]\n### Changed\n- stick to alpine v3.10.2 instead of latest one\n\n## [1.6]\n### Added\n- custom path of framework directory\n- new flag `-V` to enable verbose mode for decompiling & recompiling only\n### Changed\n- update apktool to 2.4.1 \n- remove framework app __1.apk__ after each decompiling\n\n## [1.7]\n### Added\n- new option `--no-res` to decompile app except resources.\n- new option `--no-smali` to prevent disassembly of the dex file(s)\n\n## [1.8]\n### Added\n- new option `--no-assets` to prevent decoding of unknown assets files\n- `-z` for zipalign\n- `--frame-path` to specify framework directory\n- `-R` recompile + sign\n\n## [1.9]\n### Added\n- new option `--enable-perm` to enable all permissions automatically in binded or non binded payloads\n\n## [2.0]\n### Added\n- Kali support\n### Changed\n- remove option `-a` & defaults to `aapt2`\n\n## [2.1]\n### Added \n- jadx support\n- new option `--to-java` to decode [dex,apk,zip] to java sources\n- `--deobf` can use along with `--to-java`\n\n## [2.2]\n### Changed\n- now apksigner in termux is from sdk so a key ( PKCS12 ) is added.\n",
 "meta":
   {"repo_name": "Hax4us/Apkmod",
    "stars": "114",
    "repo_language": "Shell",
    "file_name": "jadx_termux.sh",
    "mime_type": "text/plain"}
}
@UniverseFly

I have the same issue after inspecting the data downloaded from http://eaidata.bmk.sh/data/github_small.jsonl.zst. It seems the value of the 'file_name' key is identical for every repo.
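One way to check this on a decompressed shard is sketched below. It assumes each line is a JSON object with a `meta` dict containing `repo_name` and `file_name`, as in the sample record above. The function name `repos_with_constant_file_name` and the sample records are made up for illustration; they are not taken from the dataset or from any Pile tooling.

```python
import json
from collections import defaultdict


def repos_with_constant_file_name(lines):
    """Return repos that have several records but only one distinct file_name.

    `lines` is any iterable of JSON Lines strings (for example an open,
    already-decompressed .jsonl file). Hypothetical helper, for illustration.
    """
    names = defaultdict(set)   # repo_name -> distinct file_name values
    counts = defaultdict(int)  # repo_name -> number of records seen
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        meta = json.loads(line)["meta"]
        repo = meta["repo_name"]
        names[repo].add(meta["file_name"])
        counts[repo] += 1
    return sorted(r for r in names if counts[r] > 1 and len(names[r]) == 1)


# Made-up records: repo "a/x" repeats one file_name, repo "b/y" does not.
sample = [
    json.dumps({"text": "t1", "meta": {"repo_name": "a/x", "file_name": "last.sh"}}),
    json.dumps({"text": "t2", "meta": {"repo_name": "a/x", "file_name": "last.sh"}}),
    json.dumps({"text": "t3", "meta": {"repo_name": "b/y", "file_name": "one.py"}}),
    json.dumps({"text": "t4", "meta": {"repo_name": "b/y", "file_name": "two.py"}}),
]
suspicious = repos_with_constant_file_name(sample)
```

A repo flagged this way is only a candidate: a repo could legitimately contain several files with the same base name in different directories.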

@osainz59

This is a bug caused by https://github.com/EleutherAI/github-downloader/blob/345e7c4cbb9e0dc8a0615fd995a08bf9d73b3fe6/download_repo_text.py#L201C25-L201C49

They append a reference to the same dict every time, so only the name and type of the last file are stored in meta.
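The aliasing pattern described here can be reproduced in a few lines. This is a minimal sketch and not the actual code from download_repo_text.py: `collect_buggy`, `collect_fixed`, and the file list are hypothetical names and data chosen for illustration.

```python
def collect_buggy(files):
    """Reuse one meta dict for every file (the aliasing pattern)."""
    meta = {}
    out = []
    for name, text in files:
        meta["file_name"] = name
        out.append((text, meta))  # every entry points at the SAME dict object
    return out


def collect_fixed(files):
    """Build a fresh meta dict per file, so entries are independent."""
    out = []
    for name, text in files:
        out.append((text, {"file_name": name}))
    return out


# Made-up file list for illustration.
files = [("CHANGELOG.md", "## [1.1]"), ("jadx_termux.sh", "#!/bin/sh")]

buggy_names = [meta["file_name"] for _, meta in collect_buggy(files)]
fixed_names = [meta["file_name"] for _, meta in collect_fixed(files)]
```

In the buggy variant every record reports the name of the last file processed, which matches the sample above where changelog text is labelled jadx_termux.sh. Copying the dict before appending (for example with `dict(meta)`) would be another way to avoid the shared reference.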
