[Feature request] Papers API #2553

Open
nbroad1881 opened this issue Sep 18, 2024 · 5 comments · May be fixed by #2554
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@nbroad1881
Contributor

I can do the following to search for papers: curl 'https://huggingface.co/api/papers/search?q=attention'

And I get this:

[{"id":"2409.07146","title":"Gated Slot Attention for Efficient Linear-Time Sequence Modeling","thumbnailUrl":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2409.07146.png","source":"hf"},{"id":"2409.03752","title":"Attention Heads of Large Language Models: A Survey","thumbnailUrl":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2409.03752.png","source":"hf"},{"id":"2409.00391","title":"Density Adaptive Attention-based Speech Network: Enhancing Feature Understanding for Mental Health Disorders","thumbnailUrl":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2409.00391.png","source":"hf"},{"id":"2408.10945","title":"HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments","thumbnailUrl":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2408.10945.png","source":"hf"},{"id":"2408.12588","title":"Real-Time Video Generation with Pyramid Attention Broadcast","thumbnailUrl":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2408.12588.png","source":"hf"},{"id":"2408.11237","title":"Out-of-Distribution Detection with Attention Head Masking for Multimodal Document Classification","thumbnailUrl":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2408.11237.png","source":"hf"},{"id":"2408.00760","title":"Smoothed Energy Guidance: Guiding Diffusion Models with Reduced Energy Curvature of Attention","thumbnailUrl":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2408.00760.png","source":"hf"},{"id":"2407.16291","title":"TAPTRv2: Attention-based Position Update Improves Tracking Any Point","thumbnailUrl":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2407.16291.png","source":"hf"}]

I can do this to get papers for a certain date:

curl 'https://huggingface.co/api/papers/search?date=2024-09-17'

I don't know what other query terms I can use. It would be nice to have this as part of the hub client.
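
For reference, the two curl calls above can be sketched in Python using only the standard library. This is a minimal sketch, not part of the hub client: `build_search_url` and `search_papers` are hypothetical helper names, and `q` and `date` are the only query parameters used because they are the only ones shown in this issue.

```python
import json
from typing import Any, Dict, List, Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl examples in this issue.
SEARCH_URL = "https://huggingface.co/api/papers/search"


def build_search_url(q: Optional[str] = None, date: Optional[str] = None) -> str:
    """Build the search URL from the two query parameters seen in this issue."""
    params: Dict[str, str] = {}
    if q is not None:
        params["q"] = q
    if date is not None:
        params["date"] = date  # e.g. "2024-09-17"
    if not params:
        raise ValueError("pass at least one of `q` or `date`")
    return f"{SEARCH_URL}?{urlencode(params)}"


def search_papers(
    q: Optional[str] = None, date: Optional[str] = None, timeout: float = 10.0
) -> List[Dict[str, Any]]:
    """Call the endpoint and return the decoded JSON list (needs network access)."""
    with urlopen(build_search_url(q=q, date=date), timeout=timeout) as response:
        return json.load(response)
```

Calling `search_papers(q="attention")` should return a list of dicts shaped like the JSON output above (`id`, `title`, `thumbnailUrl`, `source`), assuming the endpoint keeps responding as it did when this issue was opened.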

@Wauplin

@Wauplin
Contributor

Wauplin commented Sep 18, 2024

So basically a list_papers method similar to list_models/list_datasets/list_spaces but with different query parameters, correct? What would your use case be?

We could start with a simple list_papers method with a single query parameter (date?) returning a PaperInfo dataclass and then expand to other parameters. We haven't had many requests for the Paper API yet, but maybe it's time to look into it :)
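
To make the proposal concrete, here is a rough sketch of what such a method could look like. Everything in it is an assumption for discussion: the `PaperInfo` fields simply mirror the keys in the search output from the issue description, and `fetch_json` is a hypothetical injected callable standing in for whatever HTTP layer the client would use. It is not the API of `huggingface_hub`.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional

# Hypothetical transport: takes a path and query params, returns decoded JSON.
FetchJson = Callable[[str, Dict[str, str]], List[Dict[str, Any]]]


@dataclass
class PaperInfo:
    """Illustrative container; fields mirror the search response in this issue."""

    id: str
    title: str
    thumbnail_url: Optional[str] = None
    source: Optional[str] = None

    @classmethod
    def from_search_item(cls, item: Dict[str, Any]) -> "PaperInfo":
        return cls(
            id=item["id"],
            title=item["title"],
            thumbnail_url=item.get("thumbnailUrl"),
            source=item.get("source"),
        )


def list_papers(date: str, fetch_json: FetchJson) -> List[PaperInfo]:
    """Start with a single query parameter (`date`) and expand later."""
    items = fetch_json("/api/papers/search", {"date": date})
    return [PaperInfo.from_search_item(item) for item in items]
```

Injecting `fetch_json` keeps the sketch testable without network access; a real method would presumably reuse the client's existing session and error handling instead.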

@Wauplin Wauplin added good first issue Good for newcomers enhancement New feature or request labels Sep 18, 2024
@WizKnight
Contributor

Hey @Wauplin :), I'm interested in working on this!

@Raghul-M

Hey @Wauplin :), I'm interested in working on this! Can I get more context on this issue?

@hlky hlky linked a pull request Sep 19, 2024 that will close this issue
@Wauplin
Contributor

Wauplin commented Sep 19, 2024

Thank you everyone for proposing your help and even creating a PR @hlky. I was not expecting such enthusiasm on this feature! 🤗 I'll need some time to check which direction we want to take and review the PR. I'll keep you posted :)

@hlky
Contributor

hlky commented Sep 19, 2024

Just to note: for results from the search path, e.g. https://huggingface.co/api/papers/search?q=attention, we can access the full metadata of each paper via https://huggingface.co/api/papers/{paper_id}, e.g. https://huggingface.co/api/papers/2408.12588.

This almost matches the Paper dataclass ("paper" in the daily_papers endpoint), except that it includes submittedOnDailyAt and submittedOnDailyBy, which are not part of Paper/"paper" in the daily_papers endpoint; there they appear as publishedAt and submittedBy on DailyPaper, respectively. numComments is also missing from papers/{paper_id}.
Example:

daily_paper:

{
    "paper": {
        "id": "2409.11074",
        "authors": [
            {
                "_id": "66ead57361228b02f8144cdf",
                "name": "Adrian Cosma",
                "hidden": false
            },
            {
                "_id": "66ead57361228b02f8144ce0",
                "name": "Ana-Maria Bucur",
                "hidden": false
            },
            {
                "_id": "66ead57361228b02f8144ce1",
                "name": "Emilian Radoi",
                "hidden": false
            }
        ],
        "publishedAt": "2024-09-17T11:03:46.000Z",
        "title": "RoMath: A Mathematical Reasoning Benchmark in Romanian",
        "summary": "Mathematics has long been conveyed through natural language, primarily for\nhuman understanding. With the rise of mechanized mathematics and proof\nassistants, there is a growing need to understand informal mathematical text,\nyet most existing benchmarks focus solely on English, overlooking other\nlanguages. This paper introduces RoMath, a Romanian mathematical reasoning\nbenchmark suite comprising three datasets: RoMath-Baccalaureate,\nRoMath-Competitions and RoMath-Synthetic, which cover a range of mathematical\ndomains and difficulty levels, aiming to improve non-English language models\nand promote multilingual AI development. By focusing on Romanian, a\nlow-resource language with unique linguistic features, RoMath addresses the\nlimitations of Anglo-centric models and emphasizes the need for dedicated\nresources beyond simple automatic translation. We benchmark several open-weight\nlanguage models, highlighting the importance of creating resources for\nunderrepresented languages. We make the code and dataset available.",
        "upvotes": 1,
        "discussionId": "66ead57461228b02f8144d31"
    },
    "publishedAt": "2024-09-19T17:17:31.279Z",
    "title": "RoMath: A Mathematical Reasoning Benchmark in Romanian",
    "thumbnail": "https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2409.11074.png",
    "numComments": 1,
    "submittedBy": {
        "avatarUrl": "/avatars/1208629f14f010dbc2cd94f3c30f9baf.svg",
        "fullname": "JB D.",
        "name": "IAMJB",
        "type": "user",
        "isPro": false,
        "isHf": true,
        "isMod": false
    }
}

papers/{paper_id}:

{
    "id": "2409.11074",
    "authors": [
        {
            "_id": "66ead57361228b02f8144cdf",
            "name": "Adrian Cosma",
            "hidden": false
        },
        {
            "_id": "66ead57361228b02f8144ce0",
            "name": "Ana-Maria Bucur",
            "hidden": false
        },
        {
            "_id": "66ead57361228b02f8144ce1",
            "name": "Emilian Radoi",
            "hidden": false
        }
    ],
    "publishedAt": "2024-09-17T11:03:46.000Z",
    "submittedOnDailyAt": "2024-09-19T17:17:31.279Z",
    "title": "RoMath: A Mathematical Reasoning Benchmark in Romanian",
    "submittedOnDailyBy": {
        "_id": "62716952bcef985363db8485",
        "avatarUrl": "/avatars/1208629f14f010dbc2cd94f3c30f9baf.svg",
        "isPro": false,
        "fullname": "JB D.",
        "user": "IAMJB",
        "type": "user"
    },
    "summary": "Mathematics has long been conveyed through natural language, primarily for\nhuman understanding. With the rise of mechanized mathematics and proof\nassistants, there is a growing need to understand informal mathematical text,\nyet most existing benchmarks focus solely on English, overlooking other\nlanguages. This paper introduces RoMath, a Romanian mathematical reasoning\nbenchmark suite comprising three datasets: RoMath-Baccalaureate,\nRoMath-Competitions and RoMath-Synthetic, which cover a range of mathematical\ndomains and difficulty levels, aiming to improve non-English language models\nand promote multilingual AI development. By focusing on Romanian, a\nlow-resource language with unique linguistic features, RoMath addresses the\nlimitations of Anglo-centric models and emphasizes the need for dedicated\nresources beyond simple automatic translation. We benchmark several open-weight\nlanguage models, highlighting the importance of creating resources for\nunderrepresented languages. We make the code and dataset available.",
    "upvotes": 1,
    "discussionId": "66ead57461228b02f8144d31"
}
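
One way to reconcile the two shapes is sketched below, assuming only the field names visible in the examples above. `from_daily_paper` and `from_paper_endpoint` are hypothetical helper names, and the choice to normalize onto the papers/{paper_id} key names is arbitrary. The nested user objects also differ in the examples (`name` in submittedBy, `user` in submittedOnDailyBy), so the sketch passes them through untouched.

```python
from typing import Any, Dict


def from_daily_paper(entry: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten a daily_papers entry onto the papers/{paper_id} key names."""
    paper = dict(entry["paper"])  # shallow copy so the input is not mutated
    # The outer publishedAt/submittedBy on DailyPaper correspond to
    # submittedOnDailyAt/submittedOnDailyBy on papers/{paper_id}.
    paper["submittedOnDailyAt"] = entry.get("publishedAt")
    paper["submittedOnDailyBy"] = entry.get("submittedBy")
    paper["numComments"] = entry.get("numComments")
    return paper


def from_paper_endpoint(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return a papers/{paper_id} payload with the same key set as above."""
    paper = dict(payload)
    # numComments is not returned by papers/{paper_id} in the example above.
    paper.setdefault("numComments", None)
    return paper
```

With both helpers applied, the two payloads expose the same top-level keys, so a single dataclass could be built from either, with numComments left unset when it comes from papers/{paper_id}.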
