Design Strings
Defining the experimental design can seem daunting at first. This page provides guidelines on how to specify the relevant analysis parameters and explains how to generate long design strings quickly.
A "design string" is a string that gives the value of a categorical variable for each sample, with values separated by commas (spaces are ignored). The values are listed in the same order as the columns of the input expression matrix, from left to right. As an example, take four samples named "A", "B", "C" and "D", included in this order (from left to right) in an input file. Samples "A" and "C" are "control" samples, while samples "B" and "D" are "treated" samples. To express this variable (the status of the samples) in a design string, we would write: "control, treated, control, treated".
Design strings can be very repetitive, and very long if many samples are present in the analysis. Two shorthands are provided to alleviate this:
- "(...):x": The comma-separated values (...) in the parentheses are repeated sequentially "x" times. For example: "(a, b):2" is equal to "a, b, a, b".
- "[...]:x": Each of the comma-separated values (...) in the square brackets is repeated "x" times before moving on to the next. For example: "[a, b]:2" is equal to "a, a, b, b".
Please note that these patterns cannot be nested, but they can be combined in the same string ("(a, b):3, b, a, [a]:3, b" is valid, for example). The earlier example can therefore be expressed with the equivalent string "(control, treated):2".
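The expansion rules above can be sketched in a few lines of Python. This is only an illustration of the two shorthands as described on this page, not bioTEA's own parser; the function name `expand_design` is made up for the example.

```python
import re

# Matches "(a, b):2" or "[a, b]:2". Nested patterns are not supported.
_SHORTHAND = re.compile(r"\(([^()\[\]]*)\):(\d+)|\[([^()\[\]]*)\]:(\d+)")


def _split(values):
    """Split on commas, ignoring spaces and empty items."""
    return [v.strip() for v in values.split(",") if v.strip()]


def expand_design(design):
    """Expand both shorthands into one value per sample (hypothetical helper)."""

    def replace(match):
        if match.group(1) is not None:
            # "(...):x" repeats the whole sequence x times.
            values, times = _split(match.group(1)), int(match.group(2))
            expanded = values * times
        else:
            # "[...]:x" repeats each value x times before the next one.
            values, times = _split(match.group(3)), int(match.group(4))
            expanded = [v for v in values for _ in range(times)]
        return ", ".join(expanded)

    return _split(_SHORTHAND.sub(replace, design))


print(expand_design("(control, treated):2"))
```

Expanding a string this way before running the analysis is a quick way to check that it yields exactly one value per column of the expression matrix.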
The experimental design string, specified under design > experimental_design in the bioTEA analyze options, represents the groups into which the samples fall. These are the groups of interest for the analysis, and are the same groups specified in the contrasts parameter, which determines between which samples differentially expressed genes (DEGs) are detected.
An additional variable that can be included in this same parameter is the sample pairing. Samples are paired if, for instance, they derive from the same patient or the same tissue. These pairings can be expressed by adding a number, representing a certain pairing, at the end of each group name. For instance, consider six samples that fall into the "control" and "tumor" groups as follows: "control, tumor, control, tumor, control, tumor". The first two samples come from the same patient, as do the next two, and so on. This pairing can be expressed in the following way: "control1, tumor1, control2, tumor2, control3, tumor3". The actual numbers used are unimportant, as only the pairings of the samples are considered.
The shorthand patterns support a way to specify these numbers easily. An asterisk (*) added at the end of a value is automatically replaced with a progressively increasing number. For example, the string "control1, tumor1, control2, tumor2, control3, tumor3" can be expressed as "(control*, tumor*):3". "[control*, tumor*]:2" instead expands to "control1, control2, tumor1, tumor2". Note that any numbers that are already present in the string will be skipped by the shorthand patterns. For instance, "tumor4, (control*):2" results in "tumor4, control5, control6".
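The numbering behavior can be sketched as follows. This is one reading that reproduces the three examples above, not bioTEA's actual code. It assumes that numbering starts after the largest number typed explicitly in the string and keeps counting across successive shorthands. That assumption and the name `expand_with_pairings` are ours.

```python
import re

# Matches "(a, b):2" or "[a, b]:2". Nested patterns are not supported.
_SHORTHAND = re.compile(r"\(([^()\[\]]*)\):(\d+)|\[([^()\[\]]*)\]:(\d+)")


def _split(values):
    """Split on commas, ignoring spaces and empty items."""
    return [v.strip() for v in values.split(",") if v.strip()]


def _fill(value, number):
    """Replace a trailing asterisk with the given pairing number."""
    return value[:-1] + str(number) if value.endswith("*") else value


def expand_with_pairings(design):
    # ASSUMPTION: start after the largest explicit number in the string.
    # Repetition counts such as ":2" are removed first so they do not count.
    without_counts = re.sub(r"([)\]]):\d+", r"\1", design)
    counter = max((int(n) for n in re.findall(r"\d+", without_counts)), default=0)

    def replace(match):
        nonlocal counter
        sequential = match.group(1) is not None
        values = _split(match.group(1) if sequential else match.group(3))
        times = int(match.group(2) if sequential else match.group(4))
        numbers = range(counter + 1, counter + times + 1)
        if any(v.endswith("*") for v in values):
            counter += times  # ASSUMPTION: later shorthands keep counting
        if sequential:
            expanded = [_fill(v, n) for n in numbers for v in values]
        else:
            expanded = [_fill(v, n) for v in values for n in numbers]
        return ", ".join(expanded)

    return _split(_SHORTHAND.sub(replace, design))


print(expand_with_pairings("tumor4, (control*):2"))
```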
Important: As numbers represent pairings, the group names cannot contain any numbers (in any position of the string). Additionally, either all samples must have a pairing annotation, or none may. If the samples are paired, each pairing must include at least one sample from each group.
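The three rules above are easy to check on an already expanded list of values before starting an analysis. The following is a minimal sketch of such a check; `check_pairings` is a hypothetical helper written for this page, and bioTEA's own validation may report problems differently.

```python
import re
from collections import defaultdict


def check_pairings(samples):
    """Return a list of problems; an empty list means the rules hold."""
    problems = []
    parsed = []
    for sample in samples:
        # A group name without digits, then an optional pairing number.
        match = re.fullmatch(r"(\D+)(\d*)", sample)
        if match is None:
            problems.append(f"{sample!r}: digits may only appear as a trailing pairing number")
        else:
            parsed.append((match.group(1), match.group(2)))

    # Either every sample is paired, or none is.
    paired = [pair for _, pair in parsed if pair]
    if paired and len(paired) != len(parsed):
        problems.append("either all samples carry a pairing number, or none do")

    # Every pairing must contain at least one sample of each group.
    groups = {group for group, _ in parsed}
    members = defaultdict(set)
    for group, pair in parsed:
        if pair:
            members[pair].add(group)
    for pair in sorted(members):
        for group in sorted(groups - members[pair]):
            problems.append(f"pairing {pair} has no sample in group {group!r}")
    return problems


print(check_pairings(["control1", "tumor1", "control2"]))
```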
Running the analysis in batches is often necessary, but it introduces so-called batch effects: systematic differences in the expression measurements due to technical variables (such as the operator who ran the analysis, the hybridization and exposure times, etc.). Batch effects have been handled in different ways in the literature. Nygaard et al. offer a review of the subject, and bioTEA follows the recommended guidelines. When running the limma analysis, the data is corrected by including the batches variable in the limma analysis, so that the linear models can take the effect of the batches into account. When running RankProd, the batch effects are taken into account in the algorithm itself. Note that RankProd can only correct batch effects if, for each grouping variable, there are at least two samples in each batch. If this is not the case, no correction is performed (and you will see a warning in the logs).
From Nygaard et al., we learn that the best way to counter batch effects is to use balanced batches; otherwise, the effect of the batch and that of the variable of interest are indistinguishable.
The various batches can be specified with the design > batches variable.
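The RankProd condition described above can be checked by counting samples per batch and group. The sketch below reads "each grouping variable" as "each group" and expects group names without pairing numbers; both are assumptions, and the function names are made up for the example.

```python
from collections import Counter
from itertools import product


def samples_per_batch_and_group(groups, batches):
    """Count how many samples fall in each (batch, group) combination."""
    if len(groups) != len(batches):
        raise ValueError("groups and batches must have one value per sample")
    return Counter(zip(batches, groups))


def rankprod_can_correct(groups, batches):
    """True if every group has at least two samples in every batch."""
    counts = samples_per_batch_and_group(groups, batches)
    return all(
        counts[(batch, group)] >= 2
        for batch, group in product(set(batches), set(groups))
    )


groups = ["control", "treated"] * 4
batches = ["b1"] * 4 + ["b2"] * 4
print(samples_per_batch_and_group(groups, batches))
print(rankprod_can_correct(groups, batches))
```

The same table of counts also shows at a glance whether the batches are balanced, i.e. whether each group is represented equally in every batch.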
limma can accept a virtually unlimited number of variables to control for in the tests. Extra variables may be passed to limma in design > extra_limma_vars, as a list of design strings, one named entry per additional variable. See the examples in the file generated by biotea initialize (without specifying a metadata file).
A technical replicate is a sample that is identical in all respects to another (e.g. from the same biological sample), but analyzed in parallel or a second time. Treating it as an independent sample (i.e. including it in the analysis as-is) increases the sample size without adding any real variability. This is called pseudoreplication, and causes p-values to be lower than they should be.
bioTEA used to allow the inclusion of technical replicates, but this feature was ultimately removed. This doesn't mean that it will never be included again, but for now we suggest averaging the expression values of the technical replicates to obtain a single sample, and using that for the analysis. If your experiment has a lot of technical replicates, it might be better to use limma or RankProd directly in a custom script. biotea prepare can be used regardless of technical replicates.
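Averaging technical replicates, as suggested above, can be done with a short script before the matrix is passed to the analysis. This is a minimal sketch using only the standard library. The data layout (one list of expression values per sample column) and the sample names are invented for the example.

```python
from statistics import mean


def average_replicates(columns, replicate_of):
    """Collapse technical replicates into one averaged column each.

    columns: sample name -> list of expression values (one per gene).
    replicate_of: technical replicate name -> biological sample name.
    Samples not listed in replicate_of are kept unchanged.
    """
    grouped = {}
    for sample, values in columns.items():
        grouped.setdefault(replicate_of.get(sample, sample), []).append(values)
    # zip(*cols) walks the replicate columns gene by gene.
    return {
        name: [mean(gene_values) for gene_values in zip(*cols)]
        for name, cols in grouped.items()
    }


columns = {"A_rep1": [1.0, 2.0], "A_rep2": [3.0, 4.0], "B": [5.0, 6.0]}
print(average_replicates(columns, {"A_rep1": "A", "A_rep2": "A"}))
```

After averaging, the design strings must describe the collapsed columns, with one value per remaining sample.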