
Issues Running local_infer_ada.py Script #2

@PrestonBlackburn

Description


I ran into trouble trying to run this repo locally, both with and without the training data, e.g.:

python scripts/local_infer_ada.py --text "Your text to be detected"

I noticed a few issues:

  1. Small typo: --sampling_mode_name should be --sampling_model_name
    parser.add_argument('--sampling_mode_name', type=str, default="gemma-9b-instruct") -> parser.add_argument('--sampling_model_name', type=str, default="gemma-9b-instruct")

  2. I think there is also an issue with the BSplineTwoSample class in scripts/nuisance_func.py: its fit function appears to expect tokenized tensors rather than plain text. I modified fit like this to work around it (you can ignore the "device" changes; I only added those because I'm running on CPU):

class BSplineTwoSample(nn.Module):
    def __init__(self, bspline_args, device="cpu"):
        super().__init__()
        self.bspline = BSpline(**bspline_args).to(device)
        self.device = device
    
    ...

    def tokenize_text(self, texts: list[str], tokenizer):
        out = []
        for t in texts:
            tok = tokenizer(
                t,
                return_tensors="pt",
                truncation=True,
                padding=False  # important
            ).to(self.device)
            out.append(tok)
        return out

    def fit(
            self, 
            human_text_list: list[str], 
            machine_text_list: list[str], 
            scoring_tokenizer,
            model, 
            args, 
            constraint=False
        ):
        print("Learning witness function...")
        print("Fetch log-likelihood of human texts...")
        human_token_list = self.tokenize_text(human_text_list, scoring_tokenizer)
        machine_token_list = self.tokenize_text(machine_text_list, scoring_tokenizer)
        z_ij_u = self.get_zij(human_token_list[0:5], model, args)
        print("Fetch log-likelihood of LLM texts...")
        z_ij_v = self.get_zij(machine_token_list[0:5], model, args)
        beta_hat = self.compute_beta_hat(z_ij_u, z_ij_v, constraint)
        self.beta_hat = beta_hat
        print("beta_hat:", torch.round(beta_hat, decimals=3))
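The core of the fix is just "tokenize each text before handing it to fit". A minimal, self-contained sketch of that pattern (StubTokenizer here is a hypothetical stand-in for the Hugging Face scoring_tokenizer, which would return tensors instead of plain lists):

```python
class StubTokenizer:
    """Hypothetical stand-in for a Hugging Face tokenizer:
    maps whitespace-split words to integer ids."""
    def __init__(self):
        self.vocab = {}

    def __call__(self, text):
        ids = [self.vocab.setdefault(w, len(self.vocab)) for w in text.split()]
        return {"input_ids": ids}

def tokenize_text(texts, tokenizer):
    # Tokenize each text individually, mirroring the fix above:
    # fit() receives tokenized inputs, not plain strings.
    return [tokenizer(t) for t in texts]

human_tokens = tokenize_text(["a short human text"], StubTokenizer())
print(human_tokens[0]["input_ids"])  # → [0, 1, 2, 3]
```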
  3. The example code expects a folder structure that doesn't exist:
    parser.add_argument('--train_dataset', type=separated_string, default="./exp_main/data/xsum_gpt2-xl&./exp_main/data/writing_gpt2-xl")

    To work around this for testing, I switched it to a path that does exist:
    parser.add_argument('--train_dataset', type=separated_string, default="./exp_gpt3to4/data/writing_gpt-4o")
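For anyone else hitting this: judging by the &-joined default, separated_string presumably splits the argument into a list of dataset paths. A minimal sketch of that kind of argparse type (this reimplementation is an assumption on my part, not the repo's actual code; note argparse applies the type converter to string defaults too):

```python
import argparse

def separated_string(value):
    # Assumed behavior: split an '&'-joined string into dataset paths.
    return value.split("&")

parser = argparse.ArgumentParser()
parser.add_argument(
    "--train_dataset",
    type=separated_string,
    default="./exp_main/data/xsum_gpt2-xl&./exp_main/data/writing_gpt2-xl",
)
args = parser.parse_args([])
print(args.train_dataset)
# → ['./exp_main/data/xsum_gpt2-xl', './exp_main/data/writing_gpt2-xl']
```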


I can open a PR with these fixes if they're along the lines of what you intended. It would also be nice to have the complete data in the referenced ./exp_main/data folder, if you're able to add that.

Also, for the no-training-data example: since --w_fun defaults to bspline, you'd also need to pass "identity" instead of "bspline", if I'm understanding it correctly:

python scripts/local_infer_ada.py --text "Your text to be detected" --w_fun "identity"
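To sketch why that flag matters: the script presumably dispatches on the --w_fun value, and the bspline witness function can't be used without training data to fit it. The dispatch below is purely illustrative (make_witness and identity_w are names I made up, not the repo's actual code):

```python
import argparse

def identity_w(x):
    # Trivial witness function: pass the score through unchanged.
    return x

def make_witness(name):
    # Hypothetical dispatch on the --w_fun value.
    if name == "identity":
        return identity_w
    if name == "bspline":
        raise RuntimeError("bspline witness requires training data to fit")
    raise ValueError(f"unknown w_fun: {name}")

parser = argparse.ArgumentParser()
parser.add_argument("--w_fun", type=str, default="bspline")
args = parser.parse_args(["--w_fun", "identity"])
w = make_witness(args.w_fun)
print(w(0.5))  # → 0.5
```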
