I ran into some trouble trying to run this repo locally, both with and without the training data, e.g.:
python scripts/local_infer_ada.py --text "Your text to be detected"
and noticed a few issues, described below:
- Small typo:

  parser.add_argument('--sampling_mode_name', type=str, default="gemma-9b-instruct") -> parser.add_argument('--sampling_model_name', type=str, default="gemma-9b-instruct")
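For what it's worth, the typo matters because argparse derives the attribute name from the option string, so any code reading `args.sampling_model_name` would hit an `AttributeError`. A quick standalone check (not repo code, just illustrating the argparse behavior):

```python
import argparse

parser = argparse.ArgumentParser()
# the typo'd flag: argparse exposes it as `args.sampling_mode_name`
parser.add_argument('--sampling_mode_name', type=str, default="gemma-9b-instruct")
args = parser.parse_args([])

print(hasattr(args, 'sampling_model_name'))  # False -- the intended name doesn't exist
print(args.sampling_mode_name)               # gemma-9b-instruct
```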
- I think there is also an issue with the BSplineTwoSample class in the scripts/nuisance_func.py file. It looks like fit receives plain text where it expects tokenized tensors. I ended up modifying the fit function as follows to fix the issue (the "device" changes are only because I'm running on CPU, so you can ignore those):
class BSplineTwoSample(nn.Module):
    def __init__(self, bspline_args, device="cpu"):
        super().__init__()
        self.bspline = BSpline(**bspline_args)
        self.bspline = self.bspline.to(device)
        self.device = device

    ...

    def tokenize_text(self, texts: list[str], tokenizer):
        out = []
        for t in texts:
            tok = tokenizer(
                t,
                return_tensors="pt",
                truncation=True,
                padding=False,  # important
            ).to(self.device)
            out.append(tok)
        return out

    def fit(
        self,
        human_text_list: list[str],
        machine_text_list: list[str],
        scoring_tokenizer,
        model,
        args,
        constraint=False,
    ):
        print("Learning witness function...")
        print("Fetch log-likelihood of human texts...")
        human_token_list = self.tokenize_text(human_text_list, scoring_tokenizer)
        machine_token_list = self.tokenize_text(machine_text_list, scoring_tokenizer)
        z_ij_u = self.get_zij(human_token_list[0:5], model, args)
        print("Fetch log-likelihood of LLM texts...")
        z_ij_v = self.get_zij(machine_token_list[0:5], model, args)
        beta_hat = self.compute_beta_hat(z_ij_u, z_ij_v, constraint)
        self.beta_hat = beta_hat
        print("beta_hat:", torch.round(beta_hat, decimals=3))
- The example code expects a folder structure that doesn't exist:
parser.add_argument('--train_dataset', type=separated_string, default="./exp_main/data/xsum_gpt2-xl&./exp_main/data/writing_gpt2-xl")
To work around this for testing, I switched it to a path that does exist:
parser.add_argument('--train_dataset', type=separated_string, default="./exp_gpt3to4/data/writing_gpt-4o")
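Side note: I haven't checked the actual implementation of separated_string, but judging by the `&` in the default value, I assume it just splits the argument on `&` to allow multiple dataset paths. A hypothetical stand-in, purely for illustration:

```python
# hypothetical stand-in for `separated_string` -- ASSUMPTION: it splits on '&'
def separated_string(s: str) -> list[str]:
    return s.split('&')

print(separated_string("./exp_main/data/xsum_gpt2-xl&./exp_main/data/writing_gpt2-xl"))
# ['./exp_main/data/xsum_gpt2-xl', './exp_main/data/writing_gpt2-xl']
```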
I can open a PR with these fixes if they're along the lines of what you intended. It would also be nice to have the complete data in the referenced ./exp_main/data folder, if you're able to add it.
Also, for the no-training-data example, since --w_fun defaults to "bspline", you'd need to pass "identity" instead, if I'm understanding it correctly:
python scripts/local_infer_ada.py --text "Your text to be detected" --w_fun "identity"