
Prefix Space with Llama Tokenizer #40

Yuxing0610 commented Apr 25, 2024

```python
import math

import torch
from transformers import AutoTokenizer

from transformers_cfg.grammar_utils import IncrementalGrammarConstraint
from transformers_cfg.generation.logits_process import GrammarConstrainedLogitsProcessor

tokenizer = AutoTokenizer.from_pretrained(
    "NousResearch/Llama-2-7b-hf",
    token="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",
)

tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

with open(f"examples/grammars/geo_query.ebnf", "r") as file:
    grammar_str = file.read()

query = "answer(population_1(cityid('austin', _)))"
query_prefix_1 = ""
query_prefix_2 = "answer"
query_prefix_3 = "answer("

grammar = IncrementalGrammarConstraint(grammar_str, "root", tokenizer)
# Parse from the first token
grammar_processor = GrammarConstrainedLogitsProcessor(grammar, 0)

encoded_1 = tokenizer(query_prefix_1, add_special_tokens=False, return_tensors="pt", padding=True)
encoded_2 = tokenizer(query_prefix_2, add_special_tokens=False, return_tensors="pt", padding=True)
encoded_3 = tokenizer(query_prefix_3, add_special_tokens=False, return_tensors="pt", padding=True)


scores_1 = grammar_processor.process_logits(encoded_1["input_ids"], torch.zeros(1, len(tokenizer)))
print(torch.nonzero(scores_1[0] != -math.inf).squeeze(axis=1))
#tensor([  273,   550, 12011, 29874])
scores_2 = grammar_processor.process_logits(encoded_2["input_ids"], torch.zeros(1, len(tokenizer)))
print(torch.nonzero(scores_2[0] != -math.inf).squeeze(axis=1))
#tensor([2])
grammar_processor.process_logits(encoded_3["input_ids"], torch.zeros(1, len(tokenizer)))

```

Running this raises the following error on the final `process_logits` call:

ValueError: All stacks are empty, so the only token accepted is EOS(2) but got 29898

This happens because the Llama tokenizer adds a dummy whitespace at the start of the sequence. As a result, encoding `answer` yields token id 1234 instead of 12011:

```python
print(tokenizer.convert_ids_to_tokens([1234]))
# ['▁answer']
print(tokenizer.convert_ids_to_tokens([12011]))
# ['answer']
```

When the empty first prefix is parsed, the set of allowed next tokens does not contain 1234 (`▁answer`); it only contains ids of pieces without the prefix space, such as 12011 (`answer`). The second prefix is therefore encoded into a token the grammar does not accept, after which the processor only allows the EOS token, so the third prefix raises the error above.
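
For concreteness, here is a small check (built on the snippet above; the ids are the ones reported in this issue) showing what the grammar actually receives for the second prefix:

```python
# With the dummy prefix space, "answer" encodes to the '▁answer' token (1234),
# which is not in the allowed set {273, 550, 12011, 29874} computed for the empty prefix.
print(encoded_2["input_ids"][0].tolist())
# [1234]
print(tokenizer.convert_ids_to_tokens(encoded_2["input_ids"][0].tolist()))
# ['▁answer']
```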

It would be very helpful if we could deal with this prefix whitespace problem.
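
In the meantime, one possible client-side workaround is to strip the SentencePiece word-boundary marker from the first token before handing the ids to the processor. This is only a sketch under an assumption: the vocabulary must also contain the space-free variant of that first piece (true for `▁answer` / `answer` here, but not guaranteed in general), and `strip_leading_space_piece` is a hypothetical helper, not part of transformers-CFG.

```python
def strip_leading_space_piece(tokenizer, ids):
    # Hypothetical helper (not part of transformers-CFG): if the first piece starts
    # with the SentencePiece word-boundary marker '▁', swap it for the bare piece
    # when that piece exists in the vocabulary; otherwise leave the ids untouched.
    if not ids:
        return ids
    first_piece = tokenizer.convert_ids_to_tokens([ids[0]])[0]
    if first_piece.startswith("▁") and len(first_piece) > 1:
        bare_id = tokenizer.convert_tokens_to_ids(first_piece[1:])
        if bare_id is not None and bare_id != tokenizer.unk_token_id:
            return [bare_id] + list(ids[1:])
    return ids

# Applied to the third prefix from above (re-create grammar_processor first if the
# earlier failing call left it in a bad state):
ids_3 = strip_leading_space_piece(tokenizer, encoded_3["input_ids"][0].tolist())
print(tokenizer.convert_ids_to_tokens(ids_3))
# expected: ['answer', '(']
scores_3 = grammar_processor.process_logits(
    torch.tensor([ids_3]), torch.zeros(1, len(tokenizer))
)
```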
