Added simple repetition detection inside the generate() loop to stop the model
from endlessly producing patterns like “A5A5A5...”.
- Imported `re` for regex pattern matching.
- Stops generation if “A5” repeats 10+ times or the same token appears 10 times.
- Prints a warning and exits safely instead of looping infinitely.
Fixes: #1008
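
For reference, the check described above could look roughly like the sketch below. This is a minimal illustration, not the code from this change: `is_stuck`, `next_token`, and `decode` are hypothetical names, and it assumes "the same token appears 10 times" means ten consecutive identical tokens.

```python
import re


def is_stuck(text, token_ids, pattern="A5", max_repeats=10):
    """Return True if the output looks like a degenerate repetition loop."""
    # Case 1: the pattern occurs max_repeats or more times back to back.
    if re.search(f"(?:{re.escape(pattern)}){{{max_repeats},}}", text):
        return True
    # Case 2 (assumption): the last max_repeats token ids are all identical.
    tail = token_ids[-max_repeats:]
    return len(tail) == max_repeats and len(set(tail)) == 1


def generate(next_token, decode, max_new_tokens=200):
    """Toy generation loop; next_token and decode stand in for the model."""
    token_ids = []
    for _ in range(max_new_tokens):
        token_ids.append(next_token(token_ids))
        if is_stuck(decode(token_ids), token_ids):
            # Warn and leave the loop instead of generating forever.
            print("Warning: repetition detected, stopping generation early.")
            break
    return token_ids
```

With a stand-in model that always emits the same token, `generate` stops after ten tokens rather than running to `max_new_tokens`.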
* handle missing scale_inv_name
Fixed an issue where a `weight` and its `weight_scale_inv` (e.g. `model.layers.39.mlp.experts.92.gate_proj.weight` and `model.layers.39.mlp.experts.92.gate_proj.weight_scale_inv`) were stored in different safetensors files, which caused an assertion error because `scale_inv_name` was not in the `state_dict`.
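
One way to handle this case is sketched below: try the current shard first, then fall back to whichever shard holds the scale tensor. The helper `find_scale_inv`, the `weight_map` argument (assumed to resemble the name-to-file mapping in a sharded checkpoint index), and the `load_shard` callable are all hypothetical and not taken from this change.

```python
def find_scale_inv(weight_name, state_dict, weight_map, load_shard):
    """Look up the scale tensor for weight_name, even if it is in another shard.

    state_dict : tensors of the shard currently being processed
    weight_map : tensor name -> shard filename (assumed index structure)
    load_shard : callable returning the tensors of a shard (hypothetical)
    """
    scale_inv_name = f"{weight_name}_scale_inv"
    # Common case: weight and scale live in the same shard.
    if scale_inv_name in state_dict:
        return state_dict[scale_inv_name]
    # Fallback: find the shard that actually holds the scale tensor.
    shard_file = weight_map.get(scale_inv_name)
    if shard_file is None:
        return None  # no scale recorded anywhere; the caller decides what to do
    return load_shard(shard_file)[scale_inv_name]


# Usage with plain dicts standing in for real shards.
shards = {
    "shard-a": {"gate_proj.weight": "W"},
    "shard-b": {"gate_proj.weight_scale_inv": "S"},
}
weight_map = {name: fname for fname, tensors in shards.items() for name in tensors}
scale = find_scale_inv("gate_proj.weight", shards["shard-a"], weight_map, shards.__getitem__)
```

Returning `None` instead of asserting leaves the decision about a genuinely missing scale to the caller.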
* sort filenames to reduce memory cost
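
A toy model of why ordering can matter, under stated assumptions: shards are held in a small LRU cache, and tensors in one shard sometimes need a tensor from the next-numbered shard. The cache policy, the shard names, and `count_shard_loads` are hypothetical and only illustrate the idea.

```python
from collections import OrderedDict


def count_shard_loads(filenames, neighbor_of, max_cached=2):
    """Count how many times a shard is read when only max_cached stay in memory."""
    cache = OrderedDict()
    loads = 0

    def touch(name):
        nonlocal loads
        if name in cache:
            cache.move_to_end(name)  # mark as most recently used
            return
        loads += 1  # cache miss: the shard has to be read again
        cache[name] = True
        while len(cache) > max_cached:
            cache.popitem(last=False)  # evict the least recently used shard

    for name in filenames:
        touch(name)
        dependency = neighbor_of.get(name)
        if dependency is not None:
            touch(dependency)
    return loads


# Hypothetical shard names; each shard depends on the next-numbered one.
shards = [f"model-{i:05d}-of-00004.safetensors" for i in range(1, 5)]
neighbor_of = {shards[i]: shards[i + 1] for i in range(3)}

sorted_loads = count_shard_loads(sorted(shards), neighbor_of)
shuffled_loads = count_shard_loads(
    [shards[2], shards[0], shards[3], shards[1]], neighbor_of
)
```

In this toy setup the sorted order reads each shard once, while the shuffled order re-reads shards that were already evicted.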
* Add CUDA cache clearing in memory management
Added torch.cuda.empty_cache() to free up unused memory on the GPU,