[DRAFT] [ENHANCEMENT] Improve performance with persistent storage #5585

Open
wants to merge 28 commits into
base: develop
Changes from 23 commits
83703a7
Revert "Revert "improvement: add some SQLite pragma statement setting…
jfcalvo Jul 9, 2024
4b3fa32
chore: Add script to backup the sqlite db
frascuchon Oct 8, 2024
e73708b
refactor: Add logic to backup and restore db on server restart
frascuchon Oct 8, 2024
0562c73
change backup folder
frascuchon Oct 8, 2024
cec7a73
fix: Using argilla home to restore backup files
frascuchon Oct 8, 2024
e254093
create a zero-backup of the existing files
frascuchon Oct 8, 2024
3b7ac8f
chore: Remove extra line in CHANGELOG
frascuchon Oct 8, 2024
853628c
format
frascuchon Oct 8, 2024
4f0e331
fix recursive backup copy
frascuchon Oct 8, 2024
57c635f
backup also the server id file
frascuchon Oct 8, 2024
139e227
chore: Change statement order
frascuchon Oct 8, 2024
3555f15
chore: Change init logic to copy from existing argilla files to backu…
frascuchon Oct 8, 2024
12dca42
Revert "Revert "Revert "improvement: add some SQLite pragma statement…
frascuchon Oct 8, 2024
352e93a
chore: Review start.sh
frascuchon Oct 8, 2024
898afaf
chore: configure lOG
frascuchon Oct 8, 2024
80a7d4e
define backup interval env var
frascuchon Oct 8, 2024
efbd721
chore: Skip init if backup folder exists
frascuchon Oct 9, 2024
8817889
chore: Rename process
frascuchon Oct 9, 2024
777b939
apply some changes
frascuchon Oct 9, 2024
1da94cc
Update argilla-server/docker/argilla-hf-spaces/scripts/argilla_home_b…
frascuchon Oct 9, 2024
183711e
Update argilla-server/docker/argilla-hf-spaces/Procfile
frascuchon Oct 10, 2024
73f9076
add restore python script
frascuchon Oct 10, 2024
4e4ec98
refactor: Improve backup process
frascuchon Oct 10, 2024
5b35c9b
Update argilla-server/docker/argilla-hf-spaces/scripts/argilla_home_b…
frascuchon Oct 10, 2024
0324f51
ci: Update step
frascuchon Oct 10, 2024
cf2fc60
chore: Change logging message
frascuchon Oct 11, 2024
9cf5978
Clean backup folder before
frascuchon Oct 11, 2024
b4a23f9
increase backup id after error
frascuchon Oct 11, 2024
13 changes: 7 additions & 6 deletions argilla-server/docker/argilla-hf-spaces/Dockerfile
@@ -5,11 +5,6 @@ FROM ${ARGILLA_SERVER_IMAGE}:${ARGILLA_VERSION}

USER root

-# Copy Argilla distribution files
-COPY scripts/start.sh /home/argilla
-COPY Procfile /home/argilla
-COPY requirements.txt /packages/requirements.txt

RUN apt-get update && \
    apt-get install -y apt-transport-https gnupg wget

@@ -24,6 +19,13 @@ RUN wget -qO - https://packages.redis.io/gpg | gpg --dearmor -o /usr/share/keyri
RUN apt-get install -y lsb-release
RUN echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | tee /etc/apt/sources.list.d/redis.list


+# Copy Argilla distribution files
+# (the destination ends with a slash because a wildcard COPY can match several files)
+COPY scripts/* /home/argilla/
+COPY Procfile /home/argilla
+COPY requirements.txt /packages/requirements.txt
+
+# Install dependencies
RUN \
    # Create a directory where Argilla will store the data
    mkdir /data && \
@@ -59,7 +61,6 @@ USER argilla
ENV ELASTIC_CONTAINER=true
ENV ES_JAVA_OPTS="-Xms1g -Xmx1g"

-ENV ARGILLA_HOME_PATH=/data/argilla
ENV BACKGROUND_NUM_WORKERS=2
ENV REINDEX_DATASETS=1

1 change: 1 addition & 0 deletions argilla-server/docker/argilla-hf-spaces/Procfile
@@ -2,3 +2,4 @@ elastic: /usr/share/elasticsearch/bin/elasticsearch
redis: /usr/bin/redis-server
worker: sleep 30; rq worker-pool --num-workers ${BACKGROUND_NUM_WORKERS}
argilla: sleep 30; /bin/bash start_argilla_server.sh
argilla-backup: sleep 30; python argilla_home_backup_cron.py
@@ -0,0 +1,115 @@
# Copyright 2021-present, the Recognai S.L. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import os
import sqlite3
import time
from pathlib import Path
from urllib.parse import urlparse

import httpx

from argilla_server.database import database_url_sync
from argilla_server.settings import settings
from argilla_server.telemetry import get_server_id, SERVER_ID_DAT_FILE

logging.basicConfig(
    handlers=[logging.StreamHandler()],
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    level=logging.INFO,
    force=True,
)

_LOGGER = logging.getLogger("argilla.backup")


def _run_backup(src: Path, dst_folder: Path, backup_id: int):
    backup_folder = Path(dst_folder) / str(backup_id)

    # Create the folder for this backup slot if it does not exist yet
    backup_folder.mkdir(exist_ok=True)

    backup_file = os.path.join(backup_folder, src.name)

    src_conn = sqlite3.connect(src, isolation_level="DEFERRED")
    dst_conn = sqlite3.connect(backup_file, isolation_level="DEFERRED")

    try:
        _LOGGER.info("Creating a db backup in %s", backup_file)
        with src_conn, dst_conn:
            src_conn.backup(dst_conn)
        _LOGGER.info("DB backup created at %s", backup_file)
    finally:
        src_conn.close()
        dst_conn.close()


def db_backup(backup_folder: str, interval: int = 15, num_of_backups: int = 20):
    url_db = database_url_sync()
    db_path = Path(urlparse(url_db).path)

    backup_path = Path(backup_folder).absolute()

    if not backup_path.exists():
        backup_path.mkdir()

    backup_id = 0
    while True:
        try:
            _run_backup(src=db_path, dst_folder=backup_path, backup_id=backup_id)
            backup_id = (backup_id + 1) % num_of_backups
        except Exception as e:
            _LOGGER.exception(f"Error creating backup: {e}")

        time.sleep(interval)


def server_id_backup(backup_folder: str):
    backup_path = Path(backup_folder).absolute()
    if not backup_path.exists():
        backup_path.mkdir()

    # Force to create the server id file
    get_server_id()

    server_id_file = Path(settings.home_path) / SERVER_ID_DAT_FILE

    _LOGGER.info(f"Copying server id file to {backup_folder}")
    # Copy with pathlib instead of shelling out to `cp`, so paths containing
    # spaces or shell metacharacters are handled safely
    (backup_path / SERVER_ID_DAT_FILE).write_bytes(server_id_file.read_bytes())
    _LOGGER.info("Server id file copied!")


def is_argilla_alive():
    try:
        with httpx.Client() as client:
            response = client.get("http://localhost:6900/api/v1/status")
            response.raise_for_status()
            return True
    except Exception as e:
        _LOGGER.exception(f"Error checking if argilla is alive: {e}")
        return False


if __name__ == "__main__":
    argilla_data: str = "/data/argilla"
    # start.sh exports ARGILLA_BACKUPS_PATH (plural), so read that name here
    backup_path = os.environ["ARGILLA_BACKUPS_PATH"]
    backup_interval = int(os.getenv("ARGILLA_BACKUP_INTERVAL") or "15")
    num_of_backups = int(os.getenv("ARGILLA_NUM_OF_BACKUPS") or "20")

    while not is_argilla_alive():
        _LOGGER.info("Waiting for the server to be ready...")
        time.sleep(5)

    server_id_backup(argilla_data)
    db_backup(backup_path, interval=backup_interval, num_of_backups=num_of_backups)
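
Note on the rotation scheme used by the script above: each run writes into a numbered slot folder and the slot index wraps at `num_of_backups`, so at most that many copies are kept. The following is a minimal standalone sketch of that idea, not the PR's code. The helper names `backup_once` and `next_backup_id` and the demo database are hypothetical; only `sqlite3.Connection.backup` and the modulo rotation come from the script.

```python
import sqlite3
import tempfile
from pathlib import Path


def backup_once(src: Path, dst_folder: Path, backup_id: int) -> Path:
    """Copy `src` into `dst_folder/<backup_id>/` using SQLite's online backup API."""
    slot = dst_folder / str(backup_id)
    slot.mkdir(parents=True, exist_ok=True)
    target = slot / src.name

    src_conn = sqlite3.connect(src)
    dst_conn = sqlite3.connect(target)
    try:
        # Connection.backup copies the database page by page into dst_conn
        src_conn.backup(dst_conn)
    finally:
        src_conn.close()
        dst_conn.close()
    return target


def next_backup_id(current: int, num_of_backups: int) -> int:
    """Advance the slot index, wrapping so only `num_of_backups` slots are used."""
    return (current + 1) % num_of_backups


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical demo database standing in for argilla.db
        db = Path(tmp) / "demo.db"
        conn = sqlite3.connect(db)
        conn.execute("CREATE TABLE t (x INTEGER)")
        conn.execute("INSERT INTO t VALUES (1)")
        conn.commit()
        conn.close()

        backup_id = 0
        for _ in range(5):
            backup_once(db, Path(tmp) / "backups", backup_id)
            backup_id = next_backup_id(backup_id, num_of_backups=3)

        # Five runs with three slots leave only the folders 0, 1 and 2
        print(sorted(p.name for p in (Path(tmp) / "backups").iterdir()))
```

Because the slot folders are reused, the restore side cannot rely on folder names to find the latest copy and has to look at modification times instead.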
@@ -0,0 +1,41 @@
# Copyright 2021-present, the Recognai S.L. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import glob
import logging
import os
import shutil

logging.basicConfig(
    handlers=[logging.StreamHandler()],
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    level=logging.INFO,
    force=True,
)

_LOGGER = logging.getLogger("argilla.backup")

if __name__ == "__main__":
    backups_path = os.environ["ARGILLA_BACKUPS_PATH"]

    # Sort the backup folders, most recently modified first
    folders = glob.glob(f"{backups_path}/*")
    folders.sort(key=os.path.getmtime, reverse=True)

    if len(folders) > 1:
        # Skip the newest backup and restore the one before it
        safe_backup = folders[1]
        argilla_home = os.environ["ARGILLA_HOME_PATH"]

        _LOGGER.info(f"Copying {safe_backup} backup to the argilla home folder at {argilla_home}")
        # Copy with shutil instead of shelling out to `cp -r`
        shutil.copytree(safe_backup, argilla_home, dirs_exist_ok=True)
        _LOGGER.info("Backup restored!")
    else:
        _LOGGER.info("No safe backup found to restore. Exiting...")
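
Note on the selection rule in the restore script above: it sorts the backup folders by modification time and takes the second newest one. The diff does not state why; a plausible reading, offered here as an assumption, is that the newest slot may have been interrupted mid-write when the container stopped. The sketch below isolates that rule so it can be tested without a running Space. The helper names `pick_safe_backup` and `restore` are hypothetical.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def pick_safe_backup(backups_path: Path) -> Optional[Path]:
    """Return the second most recently modified backup folder, or None."""
    folders = sorted(
        (p for p in backups_path.iterdir() if p.is_dir()),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    return folders[1] if len(folders) > 1 else None


def restore(backup: Path, home: Path) -> None:
    """Copy the backup contents into `home`, overwriting files that already exist."""
    shutil.copytree(backup, home, dirs_exist_ok=True)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "backups"
        for i, name in enumerate(["0", "1", "2"]):
            slot = root / name
            slot.mkdir(parents=True)
            (slot / "argilla.db").write_text(name)
            # Set explicit mtimes so the ordering is deterministic in this demo
            os.utime(slot, (1000 + i, 1000 + i))

        safe = pick_safe_backup(root)
        if safe is not None:
            restore(safe, Path(tmp) / "home")
            print(safe.name)
```

One consequence of this rule is that a Space with a single backup folder restores nothing, which matches the `else` branch of the script.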
28 changes: 28 additions & 0 deletions argilla-server/docker/argilla-hf-spaces/scripts/start.sh
@@ -12,7 +12,35 @@ export OAUTH2_HUGGINGFACE_SCOPE=$OAUTH_SCOPES
# See https://huggingface.co/docs/hub/en/spaces-overview#helper-environment-variables for more details
DEFAULT_USERNAME=$(curl -L -s https://huggingface.co/api/users/${SPACES_CREATOR_USER_ID}/overview | jq -r '.user' || echo "${SPACE_AUTHOR_NAME}")
export USERNAME="${USERNAME:-$DEFAULT_USERNAME}"

DEFAULT_PASSWORD=$(pwgen -s 16 1)
export PASSWORD="${PASSWORD:-$DEFAULT_PASSWORD}"

export ARGILLA_BACKUPS_PATH=/data/argilla/backups

if [ ! -d "$ARGILLA_BACKUPS_PATH" ]; then
  echo "Initializing backups folder..."
  mkdir -p "$ARGILLA_BACKUPS_PATH"

  # If the db file exists, copy it to the argilla home path
  if [ -f /data/argilla/argilla.db ]; then
    echo "Found argilla.db file, copying it to the argilla home path..."
    cp /data/argilla/argilla.db "$ARGILLA_HOME_PATH" || true
  fi

  # If the server id file exists, copy it to the argilla home path
  if [ -f /data/argilla/server_id.dat ]; then
    echo "Found server_id.dat file, copying it to the argilla home path..."
    cp /data/argilla/server_id.dat "$ARGILLA_HOME_PATH" || true
  fi

else
  echo "Backup folder already exists..."
fi

# Copy the backup files to the argilla folder
echo "Restoring files from backup folder..."
python restore_argilla_backup.py

echo "Starting processes..."
honcho start
4 changes: 2 additions & 2 deletions argilla-server/src/argilla_server/telemetry/_helpers.py
@@ -21,7 +21,7 @@

_LOGGER = logging.getLogger(__name__)

-_SERVER_ID_DAT_FILE = "server_id.dat"
+SERVER_ID_DAT_FILE = "server_id.dat"


def get_server_id() -> UUID:
@@ -34,7 +34,7 @@ def get_server_id() -> UUID:

     """

-    server_id_file = os.path.join(settings.home_path, _SERVER_ID_DAT_FILE)
+    server_id_file = os.path.join(settings.home_path, SERVER_ID_DAT_FILE)

     if os.path.exists(server_id_file):
         with open(server_id_file, "r") as f: