Incomplete Output even with max_new_tokens
#26 by vermanic
So the output of my model ends abruptly, and I ideally want it to finish the paragraph/sentence/code block it was in the middle of.
I have set max_new_tokens = 300 and also ask in the prompt to limit the reply to 300 words.
The response is always long and ends abruptly. Is there any way I can get a complete output within the desired number of output tokens?
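One option (a minimal sketch, not something max_new_tokens does by itself): give the model a little headroom past the target budget and stop at the first sentence boundary once a soft minimum is reached, via a custom StoppingCriteria. The class name StopAtSentenceBoundary and the thresholds below are illustrative assumptions, not part of the transformers library:

import torch
from transformers import StoppingCriteria, StoppingCriteriaList

class StopAtSentenceBoundary(StoppingCriteria):  # hypothetical helper, not a library class
    """Stop once `min_new_tokens` have been generated and the text ends a sentence."""
    def __init__(self, tokenizer, prompt_len, min_new_tokens=250):
        self.tokenizer = tokenizer
        self.prompt_len = prompt_len
        self.min_new_tokens = min_new_tokens

    def __call__(self, input_ids, scores, **kwargs):
        if input_ids.shape[-1] - self.prompt_len < self.min_new_tokens:
            return False  # keep generating until the soft minimum is reached
        last_token = self.tokenizer.decode(input_ids[0, -1])
        return last_token.rstrip().endswith((".", "!", "?", "}"))

# usage, giving the model headroom past the soft limit:
# criteria = StoppingCriteriaList([StopAtSentenceBoundary(tokenizer, inputs.shape[-1])])
# outputs = model.generate(inputs, max_new_tokens=400, stopping_criteria=criteria)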
Code:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "HuggingFaceH4/starchat-alpha"
device = "cuda" if torch.cuda.is_available() else "cpu"  # "cuda:X" for a specific GPU, "cpu" for CPU usage

class StarCoderModel:
    def __init__(self):
        print("Running on " + device)
        self.tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        # make sure `--gpus all` is provided in the docker run command if a GPU is required
        self.model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

    def infer(self, input_text, token_count):
        print(input_text)
        print(token_count)
        inputs = self.tokenizer.encode(input_text, return_tensors="pt").to(device)
        print(len(self.tokenizer.tokenize(input_text)))
        outputs = self.model.generate(inputs, max_new_tokens=token_count, pad_token_id=self.tokenizer.eos_token_id)
        # slice off the prompt by token count; character-based slicing of the
        # decoded string can misalign when decoding does not reproduce the input exactly
        return self.tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
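A likely contributor to the abrupt endings, separate from the token budget: starchat-alpha ends each assistant turn with the special <|end|> token, while the snippet above only stops on the default eos token, so generation tends to run until the max_new_tokens cap. A sketch of passing <|end|> as the stop token, assuming it is in this tokenizer's vocabulary as described on the model card:

# treat the chat turn delimiter as the stop token so generation can
# finish naturally before max_new_tokens is exhausted
end_id = self.tokenizer.convert_tokens_to_ids("<|end|>")
outputs = self.model.generate(
    inputs,
    max_new_tokens=token_count,
    eos_token_id=end_id,
    pad_token_id=end_id,
)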
Sample:
private DataType FuntionName(String someId) {
    // TODO: Replace with implementation that utilizes someId to obtain information
    return DataType.Value;
}
The comment:
- If someId is present in the code, use the getAPI from Client with someId as a parameter to obtain some information.
- If the