
Prefix Space with Llama Tokenizer #40

Yuxing0610 commented Apr 25, 2024

```python
import math

import torch
from transformers import AutoTokenizer

from transformers_cfg.grammar_utils import IncrementalGrammarConstraint
from transformers_cfg.generation.logits_process import GrammarConstrainedLogitsProcessor

tokenizer = AutoTokenizer.from_pretrained(
    "NousResearch/Llama-2-7b-hf",
    token="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",
)

tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

with open(f"examples/grammars/geo_query.ebnf", "r") as file:
    grammar_str = file.read()

query = "answer(population_1(cityid('austin', _)))"
query_prefix_1 = ""
query_prefix_2 = "answer"
query_prefix_3 = "answer("

grammar = IncrementalGrammarConstraint(grammar_str, "root", tokenizer)
# Parse from the first token
grammar_processor = GrammarConstrainedLogitsProcessor(grammar, 0)

encoded_1 = tokenizer(query_prefix_1, add_special_tokens=False, return_tensors="pt", padding=True)
encoded_2 = tokenizer(query_prefix_2, add_special_tokens=False, return_tensors="pt", padding=True)
encoded_3 = tokenizer(query_prefix_3, add_special_tokens=False, return_tensors="pt", padding=True)


scores_1 = grammar_processor.process_logits(encoded_1["input_ids"], torch.zeros(1, len(tokenizer)))
print(torch.nonzero(scores_1[0] != -math.inf).squeeze(axis=1))
#tensor([  273,   550, 12011, 29874])
scores_2 = grammar_processor.process_logits(encoded_2["input_ids"], torch.zeros(1, len(tokenizer)))
print(torch.nonzero(scores_2[0] != -math.inf).squeeze(axis=1))
#tensor([2])
grammar_processor.process_logits(encoded_3["input_ids"], torch.zeros(1, len(tokenizer)))

```

Running this raises the following error on the final `process_logits` call:

ValueError: All stacks are empty, so the only token accepted is EOS(2) but got 29898

This happens because the Llama tokenizer adds a dummy whitespace at the start of the sequence. As a result, encoding `answer` yields token id 1234 instead of 12011:

```python
print(tokenizer.convert_ids_to_tokens([1234]))
# ['▁answer']
print(tokenizer.convert_ids_to_tokens([12011]))
# ['answer']
```

When the empty first prefix is parsed, the set of allowed next tokens does not contain 1234 (`▁answer`); it only contains ids of pieces without the prefix space, such as 12011 (`answer`). The second prefix is therefore encoded into a token the grammar does not accept, after which the processor only allows the EOS token, so the third prefix raises the error above.
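
For concreteness, here is a small check (built on the snippet above; the ids are the ones reported in this issue) showing what the grammar actually receives for the second prefix:

```python
# With the dummy prefix space, "answer" encodes to the '▁answer' token (1234),
# which is not in the allowed set {273, 550, 12011, 29874} computed for the empty prefix.
print(encoded_2["input_ids"][0].tolist())
# [1234]
print(tokenizer.convert_ids_to_tokens(encoded_2["input_ids"][0].tolist()))
# ['▁answer']
```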

It would be very helpful if we could deal with this prefix whitespace problem.
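
In the meantime, one possible client-side workaround is to strip the SentencePiece word-boundary marker from the first token before handing the ids to the processor. This is only a sketch under an assumption: the vocabulary must also contain the space-free variant of that first piece (true for `▁answer` / `answer` here, but not guaranteed in general), and `strip_leading_space_piece` is a hypothetical helper, not part of transformers-CFG.

```python
def strip_leading_space_piece(tokenizer, ids):
    # Hypothetical helper (not part of transformers-CFG): if the first piece starts
    # with the SentencePiece word-boundary marker '▁', swap it for the bare piece
    # when that piece exists in the vocabulary; otherwise leave the ids untouched.
    if not ids:
        return ids
    first_piece = tokenizer.convert_ids_to_tokens([ids[0]])[0]
    if first_piece.startswith("▁") and len(first_piece) > 1:
        bare_id = tokenizer.convert_tokens_to_ids(first_piece[1:])
        if bare_id is not None and bare_id != tokenizer.unk_token_id:
            return [bare_id] + list(ids[1:])
    return ids

# Applied to the third prefix from above (re-create grammar_processor first if the
# earlier failing call left it in a bad state):
ids_3 = strip_leading_space_piece(tokenizer, encoded_3["input_ids"][0].tolist())
print(tokenizer.convert_ids_to_tokens(ids_3))
# expected: ['answer', '(']
scores_3 = grammar_processor.process_logits(
    torch.tensor([ids_3]), torch.zeros(1, len(tokenizer))
)
```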
