I ran into some trouble trying to run this repo locally, both with and without the training data, e.g.:
python scripts/local_infer_ada.py --text "Your text to be detected"
and noticed a few issues, described below:
- Small typo:

  parser.add_argument('--sampling_mode_name', type=str, default="gemma-9b-instruct") -> parser.add_argument('--sampling_model_name', type=str, default="gemma-9b-instruct")
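For what it's worth, the typo matters because argparse derives the attribute name from the option string, so any code reading `args.sampling_model_name` would hit an `AttributeError`. A quick standalone check (not repo code, just illustrating the argparse behavior):

```python
import argparse

parser = argparse.ArgumentParser()
# the typo'd flag: argparse exposes it as `args.sampling_mode_name`
parser.add_argument('--sampling_mode_name', type=str, default="gemma-9b-instruct")
args = parser.parse_args([])

print(hasattr(args, 'sampling_model_name'))  # False -- the intended name doesn't exist
print(args.sampling_mode_name)               # gemma-9b-instruct
```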
- I think there is also an issue with the BSplineTwoSample class in the scripts/nuisance_func.py file. It looks like fit receives plain text where it expects tokenized tensors. I ended up modifying the fit function as follows to fix the issue (the "device" changes are only because I'm running on CPU, so you can ignore those):
class BSplineTwoSample(nn.Module):
    def __init__(self, bspline_args, device="cpu"):
        super().__init__()
        self.bspline = BSpline(**bspline_args)
        self.bspline = self.bspline.to(device)
        self.device = device

    ...

    def tokenize_text(self, texts: list[str], tokenizer):
        out = []
        for t in texts:
            tok = tokenizer(
                t,
                return_tensors="pt",
                truncation=True,
                padding=False,  # important
            ).to(self.device)
            out.append(tok)
        return out

    def fit(
        self,
        human_text_list: list[str],
        machine_text_list: list[str],
        scoring_tokenizer,
        model,
        args,
        constraint=False,
    ):
        print("Learning witness function...")
        print("Fetch log-likelihood of human texts...")
        human_token_list = self.tokenize_text(human_text_list, scoring_tokenizer)
        machine_token_list = self.tokenize_text(machine_text_list, scoring_tokenizer)
        z_ij_u = self.get_zij(human_token_list[0:5], model, args)
        print("Fetch log-likelihood of LLM texts...")
        z_ij_v = self.get_zij(machine_token_list[0:5], model, args)
        beta_hat = self.compute_beta_hat(z_ij_u, z_ij_v, constraint)
        self.beta_hat = beta_hat
        print("beta_hat:", torch.round(beta_hat, decimals=3))
- The example code expects a folder structure that doesn't exist:
parser.add_argument('--train_dataset', type=separated_string, default="./exp_main/data/xsum_gpt2-xl&./exp_main/data/writing_gpt2-xl")
To work around this for testing, I switched it to a path that does exist:
parser.add_argument('--train_dataset', type=separated_string, default="./exp_gpt3to4/data/writing_gpt-4o")
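Side note: I haven't checked the actual implementation of separated_string, but judging by the `&` in the default value, I assume it just splits the argument on `&` to allow multiple dataset paths. A hypothetical stand-in, purely for illustration:

```python
# hypothetical stand-in for `separated_string` -- ASSUMPTION: it splits on '&'
def separated_string(s: str) -> list[str]:
    return s.split('&')

print(separated_string("./exp_main/data/xsum_gpt2-xl&./exp_main/data/writing_gpt2-xl"))
# ['./exp_main/data/xsum_gpt2-xl', './exp_main/data/writing_gpt2-xl']
```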
I can open a PR with these fixes if they're along the lines of what you intended. It would also be nice to have the complete data in the referenced ./exp_main/data folder, if you're able to add it.
Also, for the no-training-data example, since --w_fun defaults to "bspline", you'd need to pass "identity" instead, if I'm understanding it correctly:
python scripts/local_infer_ada.py --text "Your text to be detected" --w_fun "identity"