Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Question] Low Recall in Filtered-Disk ANN on Sift Dataset-1M with Uniform and Zipf Distributions #550

Open
AnasAito opened this issue May 9, 2024 · 0 comments
Labels
question Further information is requested

Comments

@AnasAito
Copy link

AnasAito commented May 9, 2024

Issue Summary:
I am experimenting with the Filtered-Disk ANN algorithm (the memory version) on the Sift dataset (1M). I followed the steps outlined in the markdown file to generate synthetic labels, build the index, and perform the search. However, I'm encountering an issue with low recall levels when using distributions other than random, such as the one_per_point (uniform distribution) and the Zipf distribution.

Expected Behavior:
I expected the recall levels to be consistent across different label distributions, as suggested in the Filtered-Disk ANN paper. Specifically, I expected the recall to remain high even with distributions other than random.

Observed Behavior:
When using the one_label_per_row and Zipf distributions, the recall levels are significantly lower compared to the random distribution. The only difference is that in the random setting we're dealing with 50% specificity, in the other cases the specificity is low (5%)

Steps to Reproduce:

I used the same parameters as described in the Filtered-Disk ANN paper.
I have attached the Bash script to reproduce the experiment.

#!/bin/bash

function json_time {
  command="$@"
  echo "Executing $command"
  /usr/bin/time --quiet -o /home/anas.aitaomar/logs/time.log -a --format '{"command":"%C", "wallclock": %e, "user": %U, "sys": %S}' $command
  ret=$?
  if [ $ret -ne 0 ]; then
    echo "{\"command\": \"$command\", \"status_code\": $ret}" >> /home/anas.aitaomar/logs/time.log
  fi
}

rm -rf data
mkdir data
rm /home/anas.aitaomar/logs/time.log
touch /home/anas.aitaomar/logs/time.log
chmod 666 /home/anas.aitaomar/logs/time.log

if [ -d "build/home/anas.aitaomar/s" ]; then
	export BASE_PATH="build/home/anas.aitaomar/s"
else
	export BASE_PATH="build/apps"
fi


# var definition
INDEX_TYPE="filtered_vanama"
# INDEX_TYPE="stitched_vanama"

# LABELS_DISTRIBUTION="zipf"
# LABELS_DISTRIBUTION="random"
LABELS_DISTRIBUTION="one_per_point"


if [ "$INDEX_TYPE" = "filtered_vanama" ]; then
    R="96"
    L="90"
else
    R="32"
    L="100"
fi


R_STITCHED="64"
L_FILTER=$L
ALPHA="1.2"
L_SEARCH="10 100 600 650"

LABEL_COUNT="12"
LABEL_TO_SEARCH="1"

NEIGHBORS_COUNT="10"
THREADS_COUNT="48"



json_time $BASE_PATH/utils/fvecs_to_bin float sift/sift_base.fvecs data/sift_base.bin
json_time $BASE_PATH/utils/fvecs_to_bin float sift/sift_query.fvecs data/sift_query.bin

# generate labels
json_time $BASE_PATH/utils/generate_synthetic_labels  --num_labels $LABEL_COUNT --num_points 1000000  --output_file data/labels_base.txt --distribution_type $LABELS_DISTRIBUTION

# labels dist 
echo "----------------------------------------------------"
echo ""
json_time $BASE_PATH/utils/stats_label_data --labels_file data/labels_base.txt --universal_label 0
echo ""
echo "---------------------------------------------------"
# generate gt - label x(12)  
json_time $BASE_PATH/utils/compute_groundtruth_for_filters --data_type float --dist_fn l2 --base_file data/sift_base.bin --query_file data/sift_query.bin --gt_file data/sift_gt.bin --K $NEIGHBORS_COUNT --filter_label $LABEL_TO_SEARCH --label_file data/labels_base.txt 

if [ "$INDEX_TYPE" = "filtered_vanama" ]; then
    # build index 
    json_time $BASE_PATH/build_memory_index  --data_type float --dist_fn l2 --data_path data/sift_base.bin --index_path_prefix data/sift_filtered_index -R $R  --FilteredLbuild $L_FILTER --alpha $ALPHA --label_file data/labels_base.txt --universal_label 0 --num_threads $THREADS_COUNT
else
    # build stitched-index 
    json_time $BASE_PATH/build_stitched_index  --data_type float --data_path data/sift_base.bin --index_path_prefix data/sift_filtered_index -R $R -L $L --stitched_R $R_STITCHED --alpha $ALPHA --label_file data/labels_base.txt --universal_label 0 --num_threads $THREADS_COUNT
fi


# search 
json_time $BASE_PATH/search_memory_index  --data_type float --dist_fn l2 --index_path_prefix data/sift_filtered_index --query_file data/sift_query.bin --gt_file data/sift_gt.bin --filter_label $LABEL_TO_SEARCH -K $NEIGHBORS_COUNT -L $L_SEARCH --result_path data/filtered_search_results --num_threads $THREADS_COUNT

Results

  • one_per_point
Reading truthset file data/sift_gt.bin ...
Metadata: #pts = 10000, #dims = 10... 
L2: Using AVX2 distance computation DistanceL2Float
Resizing took: 0.0387484s
From graph header, expected_file_size: 379395480, _max_observed_degree: 96, _start: 123742, file_frozen_pts: 0
Loading vamana graph data/sift_filtered_index...done. Index has 1000000 nodes and 93848864 out-edges, _start is set to 123742
Identified 13 distinct label(s)
Num frozen points:0 _nd: 1000000 _start: 123742 size(_location_to_tag): 0 size(_tag_to_location):0 Max points: 1000000
Index loaded
Using 48 threads to search
  Ls         QPS     Avg dist cmps  Mean Latency (mus)   99.9 Latency   Recall@10
=================================================================================
  10    51225.34            296.26              869.38       24950.13       46.54
 100    13521.95           1165.10             3472.06       30612.68       49.84
 600     3233.68           4394.70            14768.81       55484.78       49.64
 650     3143.47           4660.87            14896.70       55146.52       49.64
Done searching. Now saving results 
  • Random
Metadata: #pts = 10000, #dims = 10... 
L2: Using AVX2 distance computation DistanceL2Float
Resizing took: 0.0366022s
From graph header, expected_file_size: 380582992, _max_observed_degree: 96, _start: 123742, file_frozen_pts: 0
Loading vamana graph data/sift_filtered_index...done. Index has 1000000 nodes and 94145742 out-edges, _start is set to 123742
Identified 13 distinct label(s)
Num frozen points:0 _nd: 1000000 _start: 123742 size(_location_to_tag): 0 size(_tag_to_location):0 Max points: 1000000
Index loaded
Using 48 threads to search
  Ls         QPS     Avg dist cmps  Mean Latency (mus)   99.9 Latency   Recall@10
=================================================================================
  10    43119.56            599.91             1056.97       29310.13       81.11
 100     8915.00           2456.43             5341.05       30007.46       98.84
 600     1743.98           8778.31            27424.71       92325.24       99.86
 650     1612.75           9283.70            29650.47       90650.07       99.87
Done searching. Now saving results s 

Questions:
I'm trying to understand the cause of this low recall, is it because of the attribute distribution or linked directly to the specificity or maybe the set of parameters used.
Thanks in advance.

@AnasAito AnasAito added the question Further information is requested label May 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

1 participant