Identification of cluster issue, at least in understanding #1
Description
Hello,
I am following up on MacsyFinder Issue #81 and on my not fully understanding how MacsyFinder and the CONJScan models evaluate hits and systems.
I have now found another and larger example where I do not completely understand the outcome of MacsyFinder's CONJScan model evaluation.
The proteins I am looking at can be found here:
IMGPR_plasmid_2502790010_000004_2502790010_2502790446.faa.txt
I ran MacsyFinder with the -vvv flag to get additional debugging information in the .log file.
From this information, I made the following observations:
- MOB system can be identified
Output to .log for identified MOB gene
INFO : search_systems: L 178 : Check model CONJScan/Plasmids/MOB
DEBUG : search_systems: L 181 : ############################# hits related to MOB ##############################
DEBUG : search_systems: L 184 :
id rep_name pos seq_len gene_name i_eval score profile_cov seq_cov beg_match end_match
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_258 IMGPR_plasmid_2502790010_000004_2502790010_2502790446 258 730 T4SS_t4cp2 7.000e-31 101.700 0.973 0.330 413 653
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_276 IMGPR_plasmid_2502790010_000004_2502790010_2502790446 276 964 T4SS_virb4 6.900e-94 310.300 0.831 0.939 52 956
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_287 IMGPR_plasmid_2502790010_000004_2502790010_2502790446 287 615 T4SS_MOBH 2.400e-42 139.300 0.907 0.293 26 205
DEBUG : search_systems: L 185 : ################################################################################
INFO : search_systems: L 186 : Building clusters
DEBUG : search_systems: L 189 : ################################### CLUSTERS ###################################
DEBUG : search_systems: L 190 :
Cluster:
- model = MOB
- replicon = IMGPR_plasmid_2502790010_000004_2502790010_2502790446
- hits = (IMGPR_plasmid_2502790010_000004_2502790010_2502790446_258, T4SS_t4cp2, 258), (IMGPR_plasmid_2502790010_000004_2502790010_2502790446_276, T4SS_virb4, 276), (IMGPR_plasmid_2502790010_000004_2502790010_2502790446_287, T4SS_MOBH, 287)
DEBUG : search_systems: L 191 : ===================== LONERS =====================
DEBUG : search_systems: L 192 :
DEBUG : search_systems: L 195 : ################################################################################
INFO : search_systems: L 196 : Searching systems
DEBUG : system: L 832 : ##################################################
DEBUG : system: L 833 : mandatory_genes: ['T4SS_MOBB']
DEBUG : system: L 834 : accessory_genes: ['T4SS_virb4', 'T4SS_t4cp1']
DEBUG : system: L 835 : neutral_genes: []
DEBUG : system: L 836 : forbidden_genes: []
DEBUG : system: L 861 : is a system
DEBUG : system: L 864 : ##################################################
DEBUG : search_systems: L 214 : ################################# MultiSystems #################################
DEBUG : search_systems: L 215 :
Cluster:
- model = MOB
- replicon = IMGPR_plasmid_2502790010_000004_2502790010_2502790446
- hits = (IMGPR_plasmid_2502790010_000004_2502790010_2502790446_258, T4SS_t4cp2, 258), (IMGPR_plasmid_2502790010_000004_2502790010_2502790446_276, T4SS_virb4, 276), (IMGPR_plasmid_2502790010_000004_2502790010_2502790446_287, T4SS_MOBH, 287)
- No reason is reported for the missing T4SS_typeG identification
From the identification of the MOB system, it seems that all mandatory genes for the T4SS_typeG system are present (MOBH, T4SS_virb4, and T4SS_t4cp2).
Additionally these genes all seem to be located as expected with MOBH and T4SS_t4cp2 being allowed loner status, and T4SS_virb4 among accessory genes.
In addition to the 3 mandatory genes, a minimum of 3 additional genes has to be identified (if I understand the min_genes_required="6" definition correctly). This is reached by the number of accessory genes identified.
Finally, the inter_gene_max_space="500" requirement is relatively relaxed and is also fulfilled (if I understand inter_gene_max_space correctly as the number of genes allowed to separate the genes of the system). In this case, a total of 14 other genes are scattered in and among the putative T4SS_typeG cluster.
Output to .log from T4SS_typeG identification
INFO : search_systems: L 178 : Check model CONJScan/Plasmids/T4SS_typeG
DEBUG : search_systems: L 181 : ########################## hits related to T4SS_typeG ##########################
DEBUG : search_systems: L 184 :
id rep_name pos seq_len gene_name i_eval score profile_cov seq_cov beg_match end_match
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_253 IMGPR_plasmid_2502790010_000004_2502790010_2502790446 253 199 T4SS_G_tfc2 1.000e-71 235.100 0.970 0.960 9 199
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_255 IMGPR_plasmid_2502790010_000004_2502790010_2502790446 255 245 T4SS_G_tfc3 8.700e-106 346.800 0.992 0.996 1 244
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_256 IMGPR_plasmid_2502790010_000004_2502790010_2502790446 256 189 T4SS_T_virB1 3.300e-09 31.100 0.615 0.503 41 135
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_257 IMGPR_plasmid_2502790010_000004_2502790010_2502790446 257 182 T4SS_G_tfc5 1.600e-72 237.300 0.994 0.962 8 182
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_258 IMGPR_plasmid_2502790010_000004_2502790010_2502790446 258 730 T4SS_t4cp2 7.000e-31 101.700 0.973 0.330 413 653
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_259 IMGPR_plasmid_2502790010_000004_2502790010_2502790446 259 249 T4SS_G_tfc7 7.400e-114 373.900 1.000 1.000 1 249
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_268 IMGPR_plasmid_2502790010_000004_2502790010_2502790446 268 127 T4SS_G_tfc8 8.100e-45 145.900 0.974 0.890 4 116
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_269 IMGPR_plasmid_2502790010_000004_2502790010_2502790446 269 77 T4SS_G_tfc9 6.900e-35 113.800 0.975 1.000 1 77
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_270 IMGPR_plasmid_2502790010_000004_2502790010_2502790446 270 119 T4SS_G_tfc10 1.100e-53 174.300 0.983 0.992 2 119
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_271 IMGPR_plasmid_2502790010_000004_2502790010_2502790446 271 132 T4SS_G_tfc11 6.600e-60 194.800 0.938 0.962 4 130
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_272 IMGPR_plasmid_2502790010_000004_2502790010_2502790446 272 230 T4SS_G_tfc12 2.300e-109 358.100 0.964 0.965 1 222
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_273 IMGPR_plasmid_2502790010_000004_2502790010_2502790446 273 303 T4SS_G_tfc13 1.700e-124 409.100 0.990 0.990 1 300
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_274 IMGPR_plasmid_2502790010_000004_2502790010_2502790446 274 472 T4SS_G_tfc14 6.700e-208 686.000 1.000 1.000 1 472
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_275 IMGPR_plasmid_2502790010_000004_2502790010_2502790446 275 149 T4SS_G_tfc15 4.300e-61 199.500 0.986 0.966 6 149
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_276 IMGPR_plasmid_2502790010_000004_2502790010_2502790446 276 964 T4SS_virb4 6.900e-94 310.300 0.831 0.939 52 956
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_279 IMGPR_plasmid_2502790010_000004_2502790010_2502790446 279 148 T4SS_G_tfc24 2.800e-60 196.700 0.986 0.926 12 148
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_280 IMGPR_plasmid_2502790010_000004_2502790010_2502790446 280 316 T4SS_G_tfc23 2.100e-150 494.900 0.988 0.984 5 315
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_281 IMGPR_plasmid_2502790010_000004_2502790010_2502790446 281 464 T4SS_G_tfc22 3.500e-204 673.100 0.991 0.972 12 462
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_282 IMGPR_plasmid_2502790010_000004_2502790010_2502790446 282 119 T4SS_G_tfc18 9.900e-40 129.400 0.982 0.958 5 118
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_283 IMGPR_plasmid_2502790010_000004_2502790010_2502790446 283 506 T4SS_G_tfc19 3.400e-241 795.600 0.984 0.988 3 502
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_287 IMGPR_plasmid_2502790010_000004_2502790010_2502790446 287 615 T4SS_MOBH 2.400e-42 139.300 0.907 0.293 26 205
DEBUG : search_systems: L 185 : ################################################################################
INFO : search_systems: L 186 : Building clusters
DEBUG : search_systems: L 189 : ################################### CLUSTERS ###################################
DEBUG : search_systems: L 190 :
DEBUG : search_systems: L 191 : ===================== LONERS =====================
DEBUG : search_systems: L 192 :
DEBUG : search_systems: L 195 : ################################################################################
INFO : search_systems: L 196 : Searching systems
DEBUG : search_systems: L 214 : ################################# MultiSystems #################################
DEBUG : search_systems: L 215 :
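The spacing and gene-count reasoning above can be checked directly against the hit positions in the log. The following is a minimal sketch, not MacsyFinder's own clustering code: it assumes that inter_gene_max_space counts the genes without a hit between two consecutive hits, takes the values 500 and 6 as quoted in this issue, and ignores any wrap-around on the circular replicon. The function name `intervening_genes` is hypothetical.

```python
# Positions of the T4SS_typeG hits, copied from the 'pos' column of the log above.
hit_positions = [253, 255, 256, 257, 258, 259, 268, 269, 270, 271, 272,
                 273, 274, 275, 276, 279, 280, 281, 282, 283, 287]

# Values quoted in this issue for the T4SS_typeG model (not read from the model file).
INTER_GENE_MAX_SPACE = 500
MIN_GENES_REQUIRED = 6


def intervening_genes(positions):
    """Return, for each pair of consecutive hits, the number of genes between them.

    Assumption: this mirrors the meaning of inter_gene_max_space as understood
    in this issue; it is not taken from the MacsyFinder source.
    """
    ordered = sorted(positions)
    return [b - a - 1 for a, b in zip(ordered, ordered[1:])]


gaps = intervening_genes(hit_positions)
largest_gap = max(gaps)          # widest stretch of genes without a hit
total_non_hits = sum(gaps)       # genes without a hit inside the putative cluster

print(f"hits: {len(hit_positions)}")
print(f"largest gap between consecutive hits: {largest_gap}")
print(f"genes without a hit inside the cluster: {total_non_hits}")
print(f"spacing satisfied: {largest_gap <= INTER_GENE_MAX_SPACE}")
print(f"gene count satisfied: {len(hit_positions) >= MIN_GENES_REQUIRED}")
```

Under these assumptions, the largest gap is 8 genes and 14 genes without a hit lie between positions 253 and 287, so neither the spacing nor the gene-count threshold explains the empty CLUSTERS section.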
Given the above, I don't know whether I am misunderstanding how MacsyFinder evaluates the CONJScan rules or whether there is some other problem that I am not able to identify.
I hope you can clarify.
Cheers,
Magnus
Command:
macsyfinder -o test_MacsyFinder_dCONJ_typeG -vvv --replicon-topology circular --db-type ordered_replicon --models CONJScan/Plasmids --sequence-db IMGPR_plasmid_2502790010_000004_2502790010_2502790446.faa
OS:
- Linux
- Windows
- Mac
MacSyFinder Version:
MacSyFinder 2.1.5
using:
- Python 3.13.7 | packaged by conda-forge | (main, Sep 3 2025, 14:24:46) [Clang 19.1.7 ]
- MacSyLib 1.0.3
- NetworkX 3.5
- Pandas 2.3.3