LazyFrame from http endpoint #993

Contributor

@ceyhunkerti ceyhunkerti commented Sep 24, 2024

  • Fixes the related issue: from_parquet does not handle FSS.HTTP.Entry when lazy: true (#990).

  • I couldn't add a test. The Bypass setup doesn't seem to work with the lazy option; maybe someone more familiar with the repo can point me to how to add one.

  • I also observed that DF.from_ndjson doesn't work with the lazy option, although it's documented. I'll check further and create an issue if I can confirm it.

I tried to add a simple test like the one below, but it hangs:

  describe "from_parquet/2 - HTTP" do
    setup do
      [bypass: Bypass.open(), df: Explorer.Datasets.wine()]
    end

    test "reads a parquet file from an HTTP server", %{bypass: bypass, df: df} do
      # Serve the wine dataset as parquet bytes from the local Bypass server.
      Bypass.expect(bypass, "GET", "/path/to/file.parquet", fn conn ->
        bytes = DF.dump_parquet!(df)
        Plug.Conn.resp(conn, 200, bytes)
      end)

      url = http_endpoint(bypass.port) <> "/path/to/file.parquet"

      # Build the lazy frame from the plain-HTTP URL, then materialize it.
      assert {:ok, ldf} = DF.from_parquet(url, lazy: true)
      df1 = DF.compute(ldf)

      assert DF.to_columns(df1) == DF.to_columns(df)
    end
  end

  defp http_endpoint(port), do: "http://localhost:#{port}"

It eventually fails with something like:

     test/explorer/data_frame/lazy_test.exs:223
     match (=) failed
     code:  assert {:ok, ldf} = DF.from_parquet(url, lazy: true)
     left:  {:ok, ldf}
     right: {
              :error,
              %RuntimeError{message: "Polars Error: Generic HTTP error: Request error: Error after 2 retries in 180.2322859s, max_retries:10, retry_timeout:180s, source:error sending request for url (http://localhost:46017/test): 'parquet scan' failed: 'select' input failed to resolve"}
            }
     stacktrace:
       test/explorer/data_frame/lazy_test.exs:232: (test)

@philss
Member

philss commented Sep 25, 2024

I couldn't add a test. The Bypass setup doesn't seem to work with the lazy option; maybe someone more familiar with the repo can point me to how to add one.

I think the problem may be due to a restriction in Polars: it may require HTTPS. If you tested manually and it's working, we can move on without the test :)
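
If an automated check is wanted later, one option is an opt-in test against a public HTTPS file instead of Bypass. The following is only a sketch: the module name and the `:external` tag are hypothetical, it reuses the Hugging Face URL quoted in this thread, and it assumes the file lives under `test/` and that `test_helper.exs` contains `ExUnit.configure(exclude: [:external])` so the test only runs with `mix test --include external`.

```elixir
# Hypothetical opt-in test; module name and :external tag are assumptions,
# not existing conventions of this repository.
defmodule LazyHTTPParquetTest do
  use ExUnit.Case, async: true

  alias Explorer.DataFrame, as: DF

  # Public HTTPS parquet file quoted in this thread; it may move or disappear.
  @url "https://huggingface.co/datasets/aqubed/kub_tickets_small/resolve/main/data/train-00000-of-00001-47868532d4f55873.parquet"

  # Excluded by default when test_helper.exs excludes :external.
  @tag :external
  test "from_parquet/2 builds a lazy frame from an HTTPS URL" do
    assert {:ok, ldf} = DF.from_parquet(@url, lazy: true)

    # Materialize the lazy frame and check that some rows came back.
    df = DF.compute(ldf)
    assert DF.n_rows(df) > 0
  end
end
```

The trade-off is that such a test depends on the network and on a third-party host, which is why it is tagged and excluded by default rather than run in CI.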

@ceyhunkerti
Contributor Author

Yes, I tested it with the following example from the original issue, and it's working:

frame = Explorer.DataFrame.from_parquet!("https://huggingface.co/datasets/aqubed/kub_tickets_small/resolve/main/data/train-00000-of-00001-47868532d4f55873.parquet", lazy: true)
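
For anyone repeating the manual check, a minimal sketch of the full round trip could look like the following. It assumes Explorer is already available at a version that includes this change and that the URL above is still reachable; the `head` size of 5 is arbitrary.

```elixir
alias Explorer.DataFrame, as: DF

# Same public HTTPS file as in the manual check above.
url =
  "https://huggingface.co/datasets/aqubed/kub_tickets_small/resolve/main/data/train-00000-of-00001-47868532d4f55873.parquet"

# With lazy: true this returns a lazy frame instead of an eager one.
ldf = DF.from_parquet!(url, lazy: true)

# Keep only the first rows, then materialize the result with compute/1.
df =
  ldf
  |> DF.head(5)
  |> DF.compute()

IO.inspect(DF.names(df), label: "columns")
IO.inspect(DF.n_rows(df), label: "rows")
```

Calling `DF.compute/1` is the step that would have surfaced the failure from #990, so it is worth including in any manual verification.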

@philss
Member

philss commented Sep 25, 2024

@ceyhunkerti awesome! Thank you!

@josevalim josevalim merged commit ed18e53 into elixir-explorer:main Sep 25, 2024
3 checks passed
@josevalim
Member

💚 💙 💜 💛 ❤️
