
Docker Image #155

Open
vikaskookna opened this issue Oct 11, 2024 · 6 comments
Assignees: unclecode
Labels: enhancement (New feature or request), question (Further information is requested)

Comments

@vikaskookna

I created an AWS Lambda Docker image, and it fails on this line:
from crawl4ai import AsyncWebCrawler

  "errorMessage": "[Errno 30] Read-only file system: '/home/sbx_user1051'",
  "errorType": "OSError",
  "requestId": "",
  "stackTrace": [
    "  File \"/var/lang/lib/python3.12/importlib/__init__.py\", line 90, in import_module\n    return _bootstrap._gcd_import(name[level:], package, level)\n",
    "  File \"<frozen importlib._bootstrap>\", line 1387, in _gcd_import\n",
    "  File \"<frozen importlib._bootstrap>\", line 1360, in _find_and_load\n",
    "  File \"<frozen importlib._bootstrap>\", line 1331, in _find_and_load_unlocked\n",
    "  File \"<frozen importlib._bootstrap>\", line 935, in _load_unlocked\n",
    "  File \"<frozen importlib._bootstrap_external>\", line 995, in exec_module\n",
    "  File \"<frozen importlib._bootstrap>\", line 488, in _call_with_frames_removed\n",
    "  File \"/var/task/lambda_function.py\", line 3, in <module>\n    from crawl4ai import AsyncWebCrawler\n",
    "  File \"/var/lang/lib/python3.12/site-packages/crawl4ai/__init__.py\", line 3, in <module>\n    from .async_webcrawler import AsyncWebCrawler\n",
    "  File \"/var/lang/lib/python3.12/site-packages/crawl4ai/async_webcrawler.py\", line 8, in <module>\n    from .async_database import async_db_manager\n",
    "  File \"/var/lang/lib/python3.12/site-packages/crawl4ai/async_database.py\", line 8, in <module>\n    os.makedirs(DB_PATH, exist_ok=True)\n",
    "  File \"<frozen os>\", line 215, in makedirs\n",
    "  File \"<frozen os>\", line 225, in makedirs\n"
  ]
}
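(For what it's worth, the trace shows why the bare import fails: async_database.py builds its DB_PATH under the home directory and calls os.makedirs at import time, and on Lambda everything outside /tmp is read-only. A minimal sketch of that failure mode, using the sandbox home from the trace purely for illustration:)

import os

# Illustrative only: creating a directory anywhere outside /tmp on Lambda
# raises OSError [Errno 30] (read-only file system).
os.makedirs('/home/sbx_user1051/.crawl4ai', exist_ok=True)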
@akamf

akamf commented Oct 11, 2024

I had the same issue and bypassed it by setting the DB_PATH to '/tmp/' (the only writable directory in AWS Lambda) before importing the crawl4ai package. My solution:

import os
from pathlib import Path

# Pre-create the data directory under /tmp, the only writable path in Lambda.
os.makedirs('/tmp/.crawl4ai', exist_ok=True)
DB_PATH = '/tmp/.crawl4ai/crawl4ai.db'
# Redirect anything that resolves the home directory (crawl4ai derives its
# data folder from Path.home()) to /tmp before the import below runs.
Path.home = lambda: Path("/tmp")

from crawl4ai import AsyncWebCrawler

Hope this works for you as well.
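(Not from the thread, but for completeness: a minimal sketch of how this workaround could sit inside a Lambda handler, assuming the usual crawl4ai AsyncWebCrawler/arun usage from its README. The handler name and the "url" event field are illustrative only.)

import asyncio
import os
from pathlib import Path

# Workaround from above: only /tmp is writable in Lambda, and crawl4ai
# resolves its data folder from the home directory, so patch Path.home
# before the import below runs.
os.makedirs('/tmp/.crawl4ai', exist_ok=True)
Path.home = lambda: Path('/tmp')

from crawl4ai import AsyncWebCrawler


async def _crawl(url: str) -> str:
    # Standard crawl4ai usage: arun() returns a result with a .markdown field.
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
        return result.markdown


def main(event, context):
    # "url" is an assumed event field used here for illustration only.
    url = event.get("url", "https://example.com")
    return {"markdown": asyncio.run(_crawl(url))}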

@vikaskookna
Author

OK, I will try this @akamf.
Did you create a Lambda layer or a Docker image? When I tried with a layer, it exceeded the 250 MB limit; how did you manage this?

@vikaskookna
Copy link
Author

After doing what you mentioned, I got this error:
Error processing https://chatclient.ai: BrowserType.launch: Executable doesn't exist at /home/sbx_user1051/.cache/ms-playwright/chromium-1134/chrome-linux/chrome
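(Side note: that path is Playwright looking for its browser cache under the sandbox home directory, which is read-only and empty in Lambda. If the browsers are baked into the image at a known location, the PLAYWRIGHT_BROWSERS_PATH environment variable can redirect the lookup; the directory below is an assumption and has to match wherever `playwright install chromium` put the browser at build time, as in the Dockerfile in the next comment.)

import os

# Must be set before Playwright resolves its browser location; the path is
# an assumed install location baked into the image, not a Playwright default.
os.environ.setdefault("PLAYWRIGHT_BROWSERS_PATH", "/ms-playwright-browsers")

from crawl4ai import AsyncWebCrawler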

@akamf
Copy link

akamf commented Oct 11, 2024

I created a Docker image where I installed Playwright and its dependencies, and then Chromium with Playwright. The Docker image is really big though (because of Playwright, I guess), so I'm currently working on optimizing it.

But our latest Dockerfile looks like this:

# Build stage: install Node, Playwright, and the system libraries Chromium needs.
FROM amazonlinux:2 AS build

RUN curl -sL https://rpm.nodesource.com/setup_16.x | bash - && \
    yum install -y nodejs gcc-c++ make python3-devel \
    libX11 libXcomposite libXcursor libXdamage libXext libXi libXtst cups-libs \
    libXScrnSaver pango at-spi2-atk gtk3 iputils libdrm nss alsa-lib \
    libgbm fontconfig freetype freetype-devel ipa-gothic-fonts

# Install Chromium into a fixed directory so the runtime stage can copy it.
RUN npm install -g playwright && \
    PLAYWRIGHT_BROWSERS_PATH=/ms-playwright-browsers playwright install chromium

# Runtime stage: the official AWS Lambda Python base image.
FROM public.ecr.aws/lambda/python:3.11

WORKDIR ${LAMBDA_TASK_ROOT}

COPY requirements.txt .
RUN pip3 install --upgrade pip && \
    pip3 install -r requirements.txt --target "${LAMBDA_TASK_ROOT}" --verbose

# Bring the libraries, binaries, and browsers over from the build stage.
COPY --from=build /usr/lib /usr/lib
COPY --from=build /usr/local/lib /usr/local/lib
COPY --from=build /usr/bin /usr/bin
COPY --from=build /usr/local/bin /usr/local/bin
COPY --from=build /ms-playwright-browsers /ms-playwright-browsers

# Point Playwright at the browsers baked into the image above.
ENV PLAYWRIGHT_BROWSERS_PATH=/ms-playwright-browsers

COPY handler.py .

CMD [ "handler.main" ]

I don't know if this is the best solution, but it works for us. Like I said, I'm working on some optimisation for it.

@vikaskookna
Author

vikaskookna commented Oct 11, 2024

Thanks @akamf, I tried this but it gave me these errors. I'm using an M1 Mac and built the image using this command:

docker build --platform linux/amd64 -t fetchlinks .

/var/task/playwright/driver/node: /lib64/libm.so.6: version `GLIBC_2.27' not found (required by /var/task/playwright/driver/node)
/var/task/playwright/driver/node: /lib64/libc.so.6: version `GLIBC_2.28' not found (required by /var/task/playwright/driver/node)
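(Those errors usually point to a glibc mismatch: the node binary bundled with the pip-installed Playwright driver expects glibc 2.27/2.28, while Amazon Linux 2, which the python:3.11 Lambda base image is built on, ships glibc 2.26. A quick way to confirm what the runtime image actually provides is to check from Python inside the container; this is just a diagnostic sketch, not a fix.)

# Diagnostic: print the C library version of the running image.
import platform

print(platform.libc_ver())  # e.g. ('glibc', '2.26') on an Amazon Linux 2 base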

@unclecode unclecode self-assigned this Oct 12, 2024
@unclecode
Owner

Hi @vikaskookna @akamf

By next week, I will create the Dockerfile and also upload the Docker image to Docker Hub. I hope this can also help you.

@unclecode unclecode added the enhancement (New feature or request) and question (Further information is requested) labels on Oct 12, 2024
@unclecode unclecode changed the title from "Fails on aws lambda" to "Docker Image" on Oct 12, 2024
Projects: None yet
Development: No branches or pull requests
3 participants