
Identification of cluster issue, at least in understanding #1

@milnus

Hello,

Following up on MacSyFinder Issue #81, and on my not fully understanding how MacSyFinder and the CONJScan models evaluate hits and systems.

I have now found another, larger example where I do not completely understand the outcome of MacSyFinder's CONJScan model evaluation.
The proteins I am looking at can be found here:
IMGPR_plasmid_2502790010_000004_2502790010_2502790446.faa.txt

I ran MacSyFinder with the -vvv flag to get additional debugging information in the .log file.
From this output I made the following observations:

  1. The MOB system can be identified
    Output in the .log for the identified MOB system:
INFO     : search_systems: L 178 : Check model CONJScan/Plasmids/MOB
DEBUG    : search_systems: L 181 : ############################# hits related to MOB ##############################
DEBUG    : search_systems: L 184 : 
id	rep_name	pos	seq_len	gene_name	i_eval	score	profile_cov	seq_cov	beg_match	end_match
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_258	IMGPR_plasmid_2502790010_000004_2502790010_2502790446	258	730	T4SS_t4cp2	7.000e-31	101.700	0.973	0.330	413	653
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_276	IMGPR_plasmid_2502790010_000004_2502790010_2502790446	276	964	T4SS_virb4	6.900e-94	310.300	0.831	0.939	52	956
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_287	IMGPR_plasmid_2502790010_000004_2502790010_2502790446	287	615	T4SS_MOBH	2.400e-42	139.300	0.907	0.293	26	205

DEBUG    : search_systems: L 185 : ################################################################################
INFO     : search_systems: L 186 : Building clusters
DEBUG    : search_systems: L 189 : ################################### CLUSTERS ###################################
DEBUG    : search_systems: L 190 : 
Cluster:
- model = MOB
- replicon = IMGPR_plasmid_2502790010_000004_2502790010_2502790446
- hits = (IMGPR_plasmid_2502790010_000004_2502790010_2502790446_258, T4SS_t4cp2, 258), (IMGPR_plasmid_2502790010_000004_2502790010_2502790446_276, T4SS_virb4, 276), (IMGPR_plasmid_2502790010_000004_2502790010_2502790446_287, T4SS_MOBH, 287)
DEBUG    : search_systems: L 191 : ===================== LONERS =====================
DEBUG    : search_systems: L 192 : 

DEBUG    : search_systems: L 195 : ################################################################################
INFO     : search_systems: L 196 : Searching systems
DEBUG    : system: L 832 : ##################################################
DEBUG    : system: L 833 : mandatory_genes: ['T4SS_MOBB']
DEBUG    : system: L 834 : accessory_genes: ['T4SS_virb4', 'T4SS_t4cp1']
DEBUG    : system: L 835 : neutral_genes: []
DEBUG    : system: L 836 : forbidden_genes: []
DEBUG    : system: L 861 : is a system
DEBUG    : system: L 864 : ##################################################
DEBUG    : search_systems: L 214 : ################################# MultiSystems #################################
DEBUG    : search_systems: L 215 : 
Cluster:
- model = MOB
- replicon = IMGPR_plasmid_2502790010_000004_2502790010_2502790446
- hits = (IMGPR_plasmid_2502790010_000004_2502790010_2502790446_258, T4SS_t4cp2, 258), (IMGPR_plasmid_2502790010_000004_2502790010_2502790446_276, T4SS_virb4, 276), (IMGPR_plasmid_2502790010_000004_2502790010_2502790446_287, T4SS_MOBH, 287)

  2. No reason is given for why T4SS_typeG is not identified
    Judging from the identification of the MOB system, all mandatory genes for the T4SS_typeG system (T4SS_MOBH, T4SS_virb4, and T4SS_t4cp2) seem to be present.

Additionally, these genes all seem to be located as expected, with T4SS_MOBH and T4SS_t4cp2 being allowed loner status and T4SS_virb4 located among the accessory genes.

In addition to the 3 mandatory genes, at least 3 additional genes have to be identified (if I understand the min_genes_required="6" definition correctly). This is met by the number of accessory genes identified.
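
This reading of the quorum can be written out as a minimal Python sketch. It is not MacSyFinder's implementation: `meets_quorum` is a hypothetical helper, the mandatory set is the one assumed above, and every gene listed in the T4SS_typeG hit table of the .log output below is assumed to count toward min_genes_required.

```python
# Hypothetical helper, NOT part of MacSyFinder: a simplified quorum check
# that mirrors the reading described in this issue.
def meets_quorum(hit_genes, mandatory, min_genes_required):
    """True if every mandatory gene is hit and enough distinct genes are found."""
    found = set(hit_genes)
    return set(mandatory) <= found and len(found) >= min_genes_required


# Gene names copied from the T4SS_typeG hit table in the .log output.
hits = [
    "T4SS_G_tfc2", "T4SS_G_tfc3", "T4SS_T_virB1", "T4SS_G_tfc5",
    "T4SS_t4cp2", "T4SS_G_tfc7", "T4SS_G_tfc8", "T4SS_G_tfc9",
    "T4SS_G_tfc10", "T4SS_G_tfc11", "T4SS_G_tfc12", "T4SS_G_tfc13",
    "T4SS_G_tfc14", "T4SS_G_tfc15", "T4SS_virb4", "T4SS_G_tfc24",
    "T4SS_G_tfc23", "T4SS_G_tfc22", "T4SS_G_tfc18", "T4SS_G_tfc19",
    "T4SS_MOBH",
]

# Assumption: these are the mandatory genes of T4SS_typeG, as read above.
mandatory = {"T4SS_MOBH", "T4SS_virb4", "T4SS_t4cp2"}

print(meets_quorum(hits, mandatory, min_genes_required=6))
```

Under these assumptions the check passes, which is why the empty cluster list in the log is surprising.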

Finally, the inter_gene_max_space="500" requirement is relatively relaxed and also fulfilled (if I understand correctly that inter_gene_max_space is the number of genes allowed between consecutive genes of the system). In this case, a total of 14 genes are scattered in and among the putative T4SS_typeG cluster.
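
The spacing argument can be checked directly against the positions in the log. The sketch below assumes that inter_gene_max_space counts the genes lying between two consecutive hits, and it ignores wrap-around on the circular replicon; `genes_between` is a hypothetical helper, not a MacSyFinder function.

```python
# Positions (pos column) of the 21 T4SS_typeG hits, copied from the .log output.
positions = [253, 255, 256, 257, 258, 259, 268, 269, 270, 271, 272, 273,
             274, 275, 276, 279, 280, 281, 282, 283, 287]

# Assumption: the limit applies to the number of genes between consecutive hits.
INTER_GENE_MAX_SPACE = 500


def genes_between(sorted_positions):
    """Number of non-hit genes between each pair of consecutive hits."""
    return [b - a - 1 for a, b in zip(sorted_positions, sorted_positions[1:])]


gaps = genes_between(sorted(positions))

print(sum(gaps))   # 14 genes without a hit inside the putative cluster
print(max(gaps))   # 8, the largest gap (between positions 259 and 268)
print(all(g <= INTER_GENE_MAX_SPACE for g in gaps))  # True
```

The total of 14 intervening genes matches the count given above, and the largest single gap is far below 500.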

Output in the .log from the T4SS_typeG identification:

INFO     : search_systems: L 178 : Check model CONJScan/Plasmids/T4SS_typeG
DEBUG    : search_systems: L 181 : ########################## hits related to T4SS_typeG ##########################
DEBUG    : search_systems: L 184 : 
id	rep_name	pos	seq_len	gene_name	i_eval	score	profile_cov	seq_cov	beg_match	end_match
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_253	IMGPR_plasmid_2502790010_000004_2502790010_2502790446	253	199	T4SS_G_tfc2	1.000e-71	235.100	0.970	0.960	9	199
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_255	IMGPR_plasmid_2502790010_000004_2502790010_2502790446	255	245	T4SS_G_tfc3	8.700e-106	346.800	0.992	0.996	1	244
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_256	IMGPR_plasmid_2502790010_000004_2502790010_2502790446	256	189	T4SS_T_virB1	3.300e-09	31.100	0.615	0.503	41	135
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_257	IMGPR_plasmid_2502790010_000004_2502790010_2502790446	257	182	T4SS_G_tfc5	1.600e-72	237.300	0.994	0.962	8	182
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_258	IMGPR_plasmid_2502790010_000004_2502790010_2502790446	258	730	T4SS_t4cp2	7.000e-31	101.700	0.973	0.330	413	653
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_259	IMGPR_plasmid_2502790010_000004_2502790010_2502790446	259	249	T4SS_G_tfc7	7.400e-114	373.900	1.000	1.000	1	249
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_268	IMGPR_plasmid_2502790010_000004_2502790010_2502790446	268	127	T4SS_G_tfc8	8.100e-45	145.900	0.974	0.890	4	116
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_269	IMGPR_plasmid_2502790010_000004_2502790010_2502790446	269	77	T4SS_G_tfc9	6.900e-35	113.800	0.975	1.000	1	77
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_270	IMGPR_plasmid_2502790010_000004_2502790010_2502790446	270	119	T4SS_G_tfc10	1.100e-53	174.300	0.983	0.992	2	119
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_271	IMGPR_plasmid_2502790010_000004_2502790010_2502790446	271	132	T4SS_G_tfc11	6.600e-60	194.800	0.938	0.962	4	130
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_272	IMGPR_plasmid_2502790010_000004_2502790010_2502790446	272	230	T4SS_G_tfc12	2.300e-109	358.100	0.964	0.965	1	222
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_273	IMGPR_plasmid_2502790010_000004_2502790010_2502790446	273	303	T4SS_G_tfc13	1.700e-124	409.100	0.990	0.990	1	300
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_274	IMGPR_plasmid_2502790010_000004_2502790010_2502790446	274	472	T4SS_G_tfc14	6.700e-208	686.000	1.000	1.000	1	472
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_275	IMGPR_plasmid_2502790010_000004_2502790010_2502790446	275	149	T4SS_G_tfc15	4.300e-61	199.500	0.986	0.966	6	149
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_276	IMGPR_plasmid_2502790010_000004_2502790010_2502790446	276	964	T4SS_virb4	6.900e-94	310.300	0.831	0.939	52	956
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_279	IMGPR_plasmid_2502790010_000004_2502790010_2502790446	279	148	T4SS_G_tfc24	2.800e-60	196.700	0.986	0.926	12	148
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_280	IMGPR_plasmid_2502790010_000004_2502790010_2502790446	280	316	T4SS_G_tfc23	2.100e-150	494.900	0.988	0.984	5	315
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_281	IMGPR_plasmid_2502790010_000004_2502790010_2502790446	281	464	T4SS_G_tfc22	3.500e-204	673.100	0.991	0.972	12	462
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_282	IMGPR_plasmid_2502790010_000004_2502790010_2502790446	282	119	T4SS_G_tfc18	9.900e-40	129.400	0.982	0.958	5	118
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_283	IMGPR_plasmid_2502790010_000004_2502790010_2502790446	283	506	T4SS_G_tfc19	3.400e-241	795.600	0.984	0.988	3	502
IMGPR_plasmid_2502790010_000004_2502790010_2502790446_287	IMGPR_plasmid_2502790010_000004_2502790010_2502790446	287	615	T4SS_MOBH	2.400e-42	139.300	0.907	0.293	26	205

DEBUG    : search_systems: L 185 : ################################################################################
INFO     : search_systems: L 186 : Building clusters
DEBUG    : search_systems: L 189 : ################################### CLUSTERS ###################################
DEBUG    : search_systems: L 190 : 

DEBUG    : search_systems: L 191 : ===================== LONERS =====================
DEBUG    : search_systems: L 192 : 

DEBUG    : search_systems: L 195 : ################################################################################
INFO     : search_systems: L 196 : Searching systems
DEBUG    : search_systems: L 214 : ################################# MultiSystems #################################
DEBUG    : search_systems: L 215 : 

Given the above, I don't know whether I am misunderstanding how MacSyFinder evaluates the CONJScan rules or whether there is some other problem that I am not able to identify.
I hope you can clarify.

Cheers,
Magnus


Command:

macsyfinder -o test_MacsyFinder_dCONJ_typeG -vvv --replicon-topology circular --db-type ordered_replicon --models CONJScan/Plasmids --sequence-db IMGPR_plasmid_2502790010_000004_2502790010_2502790446.faa

OS:

  • Linux
  • Windows
  • Mac

MacSyFinder Version:

MacSyFinder 2.1.5
using:
- Python 3.13.7 | packaged by conda-forge | (main, Sep  3 2025, 14:24:46) [Clang 19.1.7 ]
- MacSyLib 1.0.3
- NetworkX 3.5
- Pandas 2.3.3
