metator package

Submodules

metator.align module

Alignement of reads

Align read files onto the assembly and return a tsv file with 9 columns: ReadID, ContigA, Position_startA, Position_endA, StrandA, ContigB, Position_startB, Position_endB, StrandB. Reads are mapped separately, sorted by names, then interleaved (rather than mapped in paired-end mode) to capture the pairs mapping on two different contigs.

This module contains all these alignment functions:
  • align

  • get_contact_pairs

  • merge_alignement

  • process_bamfile

  • process_bwa_bamfile

metator.align.align(fq_in, index, aligner, bam_out, n_cpu, tmp_dir, iterative=False, fq_in_2=None)[source]

Alignment Aligns reads of fq_in with bowtie2. Parameters of bowtie2 are set as –very-sensitive-local and only the aligned reads are kept in the bam file. Multiple files could be given, but only one file will be written as output.

Parameters:

fq_instr

Path to input fastq file to align.

indexstr

Path to the bowtie2 index genome.

alignerstr

Name of the aligner algorithm to use. Either bowtie2 or bwa.

bam_outstr

Path where the alignment should be written in BAM format.

n_cpuint

The number of CPUs to use for the alignment.

tmp_dir:

Path to directory to store temporary files.

iterative:

Wether to use the iterative mapping procedure (truncating reads and extending iteratively)

fq_in_2str

Path to the reverse fastq file for bwa (map both reads at the same time but do not take into account their poistion as pairs.)

Returns:

int :

0

metator.align.get_contact_pairs(for_in, rev_in, index, assembly, aligner, aligner_mode, min_qual, start, depth_file, enzyme, out_dir, tmp_dir, n_cpu)[source]

General function to do the whole alignment of both fastq.

The Function write at the output directory location given as an argument and return a tsv file of the aligned reads with 9 columns: ReadID, ContigA, Position_startA, Position_endA, StrandA, ContigB, Position_startB, Position_endB, StrandB. The name of the file will be alignment.txt.

Two start stages are possible, from fastq or bam files.

Parameters
  • for_in (str) – Path to input forward fastq or bam file to align. If multiple files are given, list of path separated by a comma.

  • rev_in (str) – Path to input reverse fastq or bam file to align. If multiple files are given, list of path separated by a comma.

  • index (str) – Path to the bowtie2 index of the assembly.

  • assembly (str) – The initial assembly path acting as the alignment file’s reference assembly.

  • aligner (str) – Either ‘bowtie2’ or ‘bwa’ aligner used or to be use to map the reads.

  • aligner_mode (str) – Either ‘normal’, ‘iterative’ or ‘cutsiste’. Mode to align the HiC reads.

  • min_qual (int) – Minimum mapping quality required to keep Hi-C pairs.

  • start (str) – Either fastq or bam. Starting point for the pipeline.

  • depth_file (str or None) – Path to the depth.txt file from jgi_summarize_bam_contig_depths from Metabat2 Software.

  • enzyme (str or None) – String that contains the names of the enzyme separated by a comma.

  • out_dir (str) – Path to directory where to write the output file.

  • tmp_dir (str) – Path where temporary files should be written.

  • n_cpu (int) – The number of CPUs to use for the alignment.

Returns

  • list of str – List of path of the Files with the table containing the alignement data of the pairs: ReadID, ContigA, Position_startA, Position_endA, StrandA, ContigB, Position_startB, Position_endB, StrandB.

  • dict – Dictionnary of the all the contigs from the assembly, the contigs names are the keys to the data of the contig available with the following keys: “id”, “length”, “GC”, “hit”, “coverage”. Coverage still at 0 and need to be updated later.

  • dict – Dictionnary for hit information on each contigs.

metator.align.merge_alignment(forward_aligned, reverse_aligned, contig_data, out_file)[source]

Merge forward and reverse alignment into one file with only pairs which have both reads are aligned on the genome with 9 columns: ReadID, ContigA, Position_startA, Position_endA, StrandA, ContigB, Position_startB, Position_endB, StrandB.

Parameters
  • forward_alignement (str) – File with the table containing the data of the forward reads kept after the alignment. With five columns: ReadID, Contig, Position_start, Position_end, strand.

  • reverse_alignement (str) – File with the table containing the data of the forward reads kept after the alignment. With five columns: ReadID, Contig, Position_start, Position_end, strand.

  • contig_data (dict) – Dictionnary of the all the contigs from the assembly, the contigs names are the keys to the data of the contig available with the following keys: “id”, “length”, “GC”, “hit”, “coverage”. Coverage still at 0 and need to be updated later.

  • out_file (str) – Path to write the output pairs file.

Returns

File with the table containing the alignement data of the pairs: ReadID, ContigA, Position_startA, Position_endA, StrandA, ContigB, Position_startB, Position_endB, StrandB

Return type

str

metator.align.process_bamfile(alignment, min_qual, filtered_out)[source]

Filter alignment BAM files

Reads all the reads in the input BAM alignment file. Keep reads in the output if they are aligned with a good quality (greater than min quality threshold given) saving their only some columns: ReadID, Contig, Position_start, Position_end, strand to save memory.

Parameters
  • alignment (str) – Path to the input temporary alignment.

  • min_qual (int) – Minimum mapping quality required to keep a Hi-C pair.

  • filtered_out (str) – Path to the output temporary tsv alignement.

Returns

Number of reads aligned.

Return type

int

metator.align.process_bwa_bamfile(alignment, min_qual, contig_data, out_file)[source]

Filter alignment BAM files

Reads all the reads in the input BAM alignment file. Keep reads in the output if they are aligned with a good quality (greater than min quality threshold given) saving their only some columns: ReadID, Contig, Position_start, Position_end, strand to save memory.

Parameters
  • alignment (str) – Path to the input temporary alignment.

  • min_qual (int) – Minimum mapping quality required to keep a Hi-C pair.

  • contig_data (dict) – Dictionnary of the all the contigs from the assembly, the contigs names are the keys to the data of the contig available with the following keys: “id”, “length”, “GC”, “hit”, “coverage”. Coverage still at 0 and need to be updated later.

  • out_file (str) – Path to the output pairs file.

Returns

Number of pairs aligned.

Return type

int

metator.commands module

Abstract command classes for metaTOR

This module contains all classes related to metaTOR commands:
  • network

  • partition

  • pipeline

  • validation

  • qualitycheck

  • contactmap

Note

Structure based on Rémy Greinhofer (rgreinho) tutorial on subcommands in docopt : https://github.com/rgreinho/docopt-subcommands-example abignaud, 20201118

raises NotImplementedError

Will be raised if AbstractCommand is called for some reason instead of one of its children.

class metator.commands.AbstractCommand(command_args, global_args)[source]

Bases: object

Abstract base command class

Base class for the commands from which other metaTOR commands derive.

check_output_path(path, force=False)[source]

Throws error if the output file exists. Create required file tree otherwise.

execute()[source]

Execute the commands

class metator.commands.Contactmap(command_args, global_args)[source]

Bases: AbstractCommand

Generates a HiC contact map of a MetaTOR object from the pairs files, the contig data file and a fasta file containing the contigs of interest.

Generates the Hi-C matrix of one MetaTOR object form the pair alignment file of metaTOR. MetaTOR object integrated are contigs, and core, overlapping, recursive, final or personalized bins. For the personalized bins you should add a column in the contig_data_final.txt file with a header called “Other”.

Do not display the matrix, you have to run another pipeline to display it such hicstuff view or cooler show depending on the metrix format chosen. It’s also possible to try to scaffold the bin using instagraal pipeline using the output.

usage:

contactmap –assembly=FILE –contig-data=FILE –enzyme=STR –name=STR [–filter] [–force] [–mat-fmt=graal] [–min-size=5000] [–no-clean-up] [–object=final_bin] [–outdir=DIR] [–pcr-dup] [–tmpdir=DIR] [–threads=1] <pairsfile>…

argument:

pairsfile File(s) containing pairs information.

options:
-a, --assembly=FILE

Path to the fasta file containing the contigs of interest. Could be the whole or the extracted contigs of one bin.

-c, --contig-data=FILE

Path to the contig_data_final.txt file from MetaTOR output.

-e, --enzyme=STR

The list of restriction enzyme used to digest the contigs separated by a comma. Example: HpaII,MluCI.

-n, --name=STR

Name of the metator or its numerical ID for the core bin, overlapping bin or recursive bin objects. Example: “NODE_1”, “MetaTOR_1_0” or 8.

-D, –pcr-dup, Filter out PCR duplicates based on read

positions.

-f, --filter

Filter out spurious 3C events (loops and uncuts) using hicstuff filter. For more informations, see Cournac et al. BMC Genomics, 2012.

-F, --force

Write files even if the output files already exists.

-m, --mat-fmt=STR

The format of the output sparse matrix. Can be “bg2” for 2D Bedgraph format, “cool” for Mirnylab’s cooler software, or “graal” for graal-compatible plain text COO format. [default: graal]

-N, --no-clean-up

If enabled intermediary files will be kept.

-o, --outdir=DIR

Output directory. Default creates a new directory contact_map in the current directory.

-O, --object=STR

Type of metator object used to build a contact map. Either “contig”, “core_bin”, “overlapping_bin”, “recursive_bin”, “final_bin” or “other”. [Default: final_bin]

-s, --min-size=INT

Minimum size threshold to consider contigs. [Default: 5000]

-t, --threads=INT

Number of threads to allocate. [Default: 1]

-T, --tmpdir=DIR

Directory for storing intermediary files and temporary files. Default creates a “tmp” folder in the current directory.

execute()[source]

Execute the commands

class metator.commands.Network(command_args, global_args)[source]

Bases: AbstractCommand

Generation of network command

Generate metaHiC contigs network from fastq reads or bam files.Generates a network file (in edgelist form) from either fastq files or bam files. For both starting input forward and reverse reads or alignment need to be in seprated files with the same read names. In the output network contigs are the network nodes and the edges are the contact counts.

If multiple fastq are given will align them seprately and return a hit file with a number of hit for each sample.

Note: the network is in a strict barebone form so that it can be reused and imported quickly into other applications etc. Verbose information about every single node in the network is written on a ‘contig data’ file.

usage:

network –forward=STR –assembly=FILE [–reverse=STR] [–aligner=bowtie2] [–aligner-mode=normal] [–depth=FILE] [–edge=0] [–enzyme=STR] [–normalization=empirical_hit] [–no-clean-up] [–outdir=DIR] [–min-quality=30] [–self-contacts] [–start=fastq] [–threads=1] [–tmpdir=DIR]

options:
-1, --forward=STR

Fastq file or list of Fastq separated by a comma containing the forward reads to be aligned or their corresponding bam or pairs files.

-2, --reverse=STR

Fastq file or list of Fastq separated by a comma containing the reverse reads to be aligned or their corresponding bam files. Forward and reverse reads need to have the same identifier (read names). If start is set to pair, no argument is necessary.

-a, --assembly=FILE

The initial assembly path acting as the alignment file’s reference genome or the basename of the bowtie2 index.

-b, --aligner=STR

Aligner algorithm to use. Either “bwa” or “bowtie2”. [Default: bowtie2]

-B, --aligner-mode=STR

Mode of alignment from hicstuff. Either normal, iterative or cutsite. [Default: normal]

-d, --depth=FILE

The depth.txt file from the shotgun reads used to made the assembly computed by jgi_summarize_bam_contig_depths from metabat2 pipeline.

-E, --edge=INT

Distance of the edge region in base pair on the contigs where the mapping reads are not considered as inter contigs. [Default: 0]

-e, --enzyme=STR

The list of restriction enzyme used to digest the contigs separated by a comma. Example: HpaII,MluCI.

-n, –normalization=STR If None, do not normalized the count of a

contact by the geometric mean of the coverage of the contigs. Otherwise it’s the type of normalization. 6 values are possible “None”, “abundance”, “length”, “RS”, “empirical_hit”, “theoritical_hit”. [Default: empirical_hit]

-N, --no-clean-up

Do not remove temporary files.

-o, --outdir=DIR

The output directory to write the bam files the network and contig data into. Default: current directory.

-q, --min-quality=INT

Threshold of quality necessary to considered a read properly aligned. [Default: 30]

-s, --self-contacts

If enabled, count alignments between a contig and itself (intracontigs contacts).

-S, --start=STR

Start stage of the pipeline. Either “fastq”, “bam”, or “pair”. [Default: fastq]

-t, --threads=INT

Number of parallel threads allocated for the alignement. [Default: 1]

-T, --tmpdir=DIR

Temporary directory. Default to current directory. [Default: ./tmp]

execute()[source]

Execute the commands

class metator.commands.Pairs(command_args, global_args)[source]

Bases: AbstractCommand

Sort and pairs files for faster assess to the data.

Sort the pairs file using pairtools. Compress them using bgzip. Index them using pairix.

usage:

pairs [–force] [–remove] [–threads=1] <pairsfile>…

Parameters

File (pairsfile) –

options:
-F, --force

Write files even if the output files already exists.

-r, --remove

Remove the input file at the end to keep only the sorted, compressed and indexed pairs file.

-t, --threads=INT

Numbers of thread to allocate. [Default: 1]

execute()[source]

Execute the commands

class metator.commands.Partition(command_args, global_args)[source]

Bases: AbstractCommand

Partition the network using Louvain algorithm

Partition the network file using iteratively the Louvain or Leiden algorithm. Then looks for ‘cores’ bins that are constituted by the contigs which are always in the same cluster at each iterations. Then using the Hamming distance from these cores bins, merege the bins with a small Hamming distance (close to 1), bigger than the percentage (overlapping parameter) given.

It will also update the file to integrate the bins information of the contigs.

Furthermore, both Leiden and Louvain algorithm are available here. However, the benchmark made show that here the Louvain algorithm have better performance and is faster on seawater and gut metagenomic samples.

Note that the Louvain or Leiden software are not, in the strictest sense, necessary. Any program that assigns a node to a bin, does so non deterministically and solely outputs a list in the form: ‘node_id bin_id’ could be plugged instead.

usage:

partition –assembly=FILE –contigs=FILE –network=FILE [–algorithm=louvain] [–cluster-matrix] [–force] [–iterations=100] [–no-clean-up] [–outdir=DIR] [–overlap=80] [–prefix=STR] [–res-param=1.0] [–size=500000] [–threads=1] [–tmpdir=DIR]

options:
-a, --assembly=FILE

The path to the assembly fasta file used to do the alignment.

-A, --algorithm=STR

Either “louvain” or “leiden”, algorithm to use to partition the network. [Default: louvain]

-c, --contigs=FILE

The path to the tsv file containing the data of the contigs (ID, Name, Length, GC content, Hit, Coverage, Restriction Site).

-C, --cluster-matrix

If enabled, save the clustering matrix.

-F, --force

If enabled, would remove directory of overlapping bins in the output directory.

-i, --iterations=INT

Number of iterations of Louvain. [Default: 100]

-n, --network=FILE

Path to the file containing the network information from the meta HiC experiment compute in network function previously.

-N, --no-clean-up

Do not remove temporary files.

-o, --outdir=DIR

Path to the directory to write the output. Default to current directory. [Default: ./]

-O, --overlap=INT

Hamming distance threshold to use to merge bins (percentage). [Default: 80]

-p, --prefix=STR

Prefix to use for fasta files. By default just ‘metator_’, otherwise ‘STR_metator_’.

-r, --res-param=FLOAT

Resolution paramter to use for Leiden algorithm. [Default: 1.0]

-s, --size=INT

Threshold size to keep bins in base pair. [Default: 500000]

-t, --threads=INT

Number of parallel threads allocated for the partition. [Default: 1]

-T, --tmpdir=DIR

Temporary directory. Default to current directory. [Default: ./tmp]

execute()[source]

Execute the commands

class metator.commands.Pipeline(command_args, global_args)[source]

Bases: AbstractCommand

Launch the full metator pipeline

Partition the contigs from teh assembly in bins from the metaHiC reads.

It’s possible to start from the fastq, the bam, the pair or the network files. It’s will also possible to ask or not to run a validation step which will decontaminate the bins when it’s necessary. However as it’s the critical step for memory usage (~40G), it’s possible to skip these step.

usage:

pipeline –assembly=FILE [–forward=STR] [–reverse=STR] [–algorithm=louvain] [–aligner=bowtie2] [–aligner-mode=normal] [–cluster-matrix] [–depth=FILE] [–edge=0] [–enzyme=STR] [–force] [–iterations=100] [–rec-iter=10] [–junctions=NNNNN] [–no-clean-up] [–normalization=empirical_hit] [–outdir=DIR] [–overlap=80] [–prefix=STR] [–rec-overlap=90] [–min-quality=30] [–res-param=1.0] [–size=500000] [–start=fastq] [–scaffold] [–threads=1] [–tmpdir=DIR]

options:
-1, --forward=STR

Fastq file or list of Fastq separated by a comma containing the forward reads to be aligned or their corresponding bam or pairs files.

-2, --reverse=STR

Fastq file or list of Fastq separated by a comma containing the reverse reads to be aligned or their corresponding bam files. Forward and reverse reads need to have the same identifier (read names). If pair start is used no argument is necessary.

-a, --assembly=FILE

The initial assembly path acting as the alignment file’s reference genome or the basename of the bowtie2 index.

-A, --algorithm=STR

Algorithm to use. Either “louvain”, “leiden” or “spinglass”. If spinglass is chosen, the first partition will be done using louvain [Default: louvain]

-b, --aligner=STR

Aligner algorithm to use. Either “bwa” or “bowtie2”. [Default: bowtie2]

-B, --aligner-mode=STR

Mode of alignment from hicstuff. Either normal, iterative or cutsite. [Default: normal]

-C, --cluster-matrix

If enabled, save the clustering matrix.

-d, --depth=FILE

The depth.txt file from the shotgun reads used to made the assembly computed by jgi_summarize_bam_contig_depths from metabat2 pipeline.

-E, --edge=INT

Distance of the edge region in base pair on the contigs where the mapping reads are not considered as inter contigs. [Default: 0]

-e, --enzyme=STR

The list of restriction enzyme used to digest the contigs separated by a comma. Example: HpaII,MluCI.

-F, --force

If enable, would remove directory of overlapping bins in the output directory.

-i, --iterations=INT

Number of iterations of Louvain for the partition step. [Default: 100]

-j, --rec-iter=INT

Number of iterations of Louvain for the recursive step. [Default: 10]

-J, --junctions=STR

Sequences to use as junction between contigs. [Default: NNNNN]

-N, --no-clean-up

Do not remove temporary files.

-m, –normalization=STR If None, do not normalized the count of a

contact by the geometric mean of the coverage of the contigs. Otherwise it’s the type of normalization. 6 values are possible None, abundance, length, RS, empirical_hit, theoritical_hit. [Default: empirical_hit]

-o, --outdir=DIR

The output directory to write the bam files the network and contig data into. Default: current directory.

-O, --overlap=INT

Hamming distance threshold to use to merge bins for the first step (percentage). [Default: 80]

-p, --prefix=STR

Prefix to use for fasta files. By default just ‘metator_’, otherwise ‘STR_metator_’.

-P, --rec-overlap=INT

Hamming distance threshold to use to merge bins for the recursive step (percentage). [Default: 90]

-q, --min-quality=INT

Threshold of quality necessary to considered a read properly aligned. [Default: 30]

-r, --res-param=FLOAT

Resolution paramter to use for Leiden algorithm. [Default: 1.0]

-s, --size=INT

Threshold size to keep bins in base pair. [Default: 500000]

-S, --start=STR

Start stage of the pipeline. Either fastq, bam, or pair. [Default: fastq]

-t, --threads=INT

Number of parallel threads allocated for the alignement. [Default: 1]

-T, --tmpdir=DIR

Temporary directory. Default to current directory. [Default: ./tmp]

-v, --scaffold

If enables, it will sacffold the genomes at the end.

execute()[source]

Execute the commands

class metator.commands.Qc(command_args, global_args)[source]

Bases: AbstractCommand

Generates some quality check on the output of metator.

The quality check is done on the high quality MAGs and only contigs witha size bigger than 100kb. The goal is to assess HiC quality metrics on genomes we kinda know to have a estimation of the quality of the library.

If the plot are set it will return some plot of the event distribution on the high quality contigs.

usage:

qc –assembly=FILE –enzyme=STR [–bin-summary=FILE] [–contig-data=FILE] [–metator-dir=DIR] [–no-clean-up] [–outdir=DIR] [–prefix=STR] [–plot] [–threshold=STR] [–tmpdir=DIR] <pairsfile>…

Parameters

File (pairsfile) –

options:
-a, --assembly=FILE

Path to the fasta file containing the contigs of interest. Could be the whole or the extracted contigs of one bin.

-b, --bin-summary=FILE

Path to the bin_summary.txt file from MetaTOR output.

-c, --contig-data=FILE

Path to the contig_data_final.txt file from MetaTOR output.

-e, --enzyme=STR

The list of restriction enzyme used to digest the contigs separated by a comma. Example: HpaII,MluCI.

-n, --no-clean-up

If enabled intermediary files will be kept.

-O, --metator-dir=DIR

Output directory from metator pipeline. If set contigs data and bin summary will be found automatically.

-o, --outdir=DIR

Directory to save output plots and log.

-p, --prefix=STR

Name of the sample to add on plot and files.

-P, --plot

If enable display some plots.

-t, --threshold=STR

Hicstuff religation and loop thresholds. Two integers seperated by a coma.

-T, --tmpdir=DIR

Temporary directory to save pairs files

execute()[source]

Execute the commands

class metator.commands.Scaffold(command_args, global_args)[source]

Bases: AbstractCommand

Scaffold a bin from metator.

Use the pairs to build a scaffold from the binned contigs. The scaffolding is based on contact between the edges of the contigs. There are normalized based on intra contacts of the contigs except for the small contigs where a low score is given.

usage:

scaffold –bin-name=STR –input-fasta=FILE [–junctions=NNNNN] [–out-fasta=FILE] [–out-frags=FILE] [–threads=1] [–threshold=0.05] [–window-size=5000] <pairsfile>…

Parameters

File (pairsfile) –

options:
-b, --bin-name=STR

Name of the bin to scaffold.

-i, --input-fasta=FILE

Path to the input fasta.

-j, --junctions=STR

Sequences to use as junction between contigs. [Default: NNNNN]

-o, --out-fasta=FILE

Path to write the output fasta. Default in the current directory: ‘”$bin_name”_scaffolded.fa’.

-O, --out-frags=FILE

Path to write the fragments information. Default in the current directory: ‘”$bin_name”_info_frags.txt’.

-t, --threads=INT

Numbers of threads to allocate if pairs need to be sorted. [Default: 1]

-T, --threshold=FLOAT

Threshold score to consider an association between two contigs. [Default: 0.05]

-w, --window-size=INT

Size of the window in base pair to use as edge window to scaffold. [Default: 5000]

execute()[source]

Execute the commands

class metator.commands.Validation(command_args, global_args)[source]

Bases: AbstractCommand

Use CheckM to validate the bins.

Recursively decontaminate the contaminated bins evaluated by checkM. The script returns new decontaminated fasta and summary files of the decontamination.

Only bins with more than 50% completion and 5% contamination are subject to the recursive step. If the recursive step gave worst results than the first (decrease of the completion with no decrease of the contamination), it will keep the original bin.

usage:

validation –assembly=FILE –contigs=FILE –fasta=DIR –network=FILE [–algorithm=louvain] [–cluster-matrix] [–force] [–iterations=10] [–no-clean-up] [–outdir=DIR] [–overlap=90] [–prefix=STR] [–res-param=1.0] [–size=500000] [–threads=1] [–tmpdir=DIR]

options:
-a, --assembly=FILE

The path to the assembly fasta file used to do the alignment.

-A, --algorithm=STR

Algorithm to use. Either “louvain”, “leiden” or “spinglass”. [Default: louvain]

-c, --contigs=FILE

The path to the file containing the data of the contigs from the partition step (13 columns).

-C, --cluster-matrix

If enabled, save the clustering matrix.

-f, --fasta=DIR

Path to the directory containing the input fasta files of the bins to decontaminate.

-F, --force

If enable, would remove directory of recursive bins in the output directory.

-i, --iterations=INT

Number of recursive iterations of Louvain. [Default: 10]

-n, --network=FILE

Path to the file containing the network information from the meta HiC experiment compute in network function previously.

-N, --no-clean-up

Do not remove temporary files.

-o, --outdir=DIR

Path to the directory to write the output. Default to current directory. [Default: ./]

-O, --overlap=INT

Hamming distance threshold to use to merge bins (percentage). [Default: 90]

-p, --prefix=STR

Prefix to use for fasta files. By default just ‘metator_’, otherwise ‘STR_metator_’.

-r, --res-param=FLOAT

Resolution paramter to use for Leiden algorithm. [Default: 1.0]

-s, --size=INT

Threshold size to keep bins in base pair. [Default: 500000]

-t, --threads=INT

Number of parallel threads allocated for the partition. [Default: 1]

-T, --tmpdir=DIR

Temporary directory. Default to current directory. [Default: ./tmp]

execute()[source]

Execute the commands

metator.commands.generate_log_header(log_path, cmd, args)[source]

metator.contact_map module

Generates a HiC contact map of a MetaTOR object.

General utility functions and class for handling one MetaTOR object and for generating its HiC contact map using hicstuff pipeline.

Core class to handle bin object easily:
  • MetatorObject

Core function to build the network are:
  • extract_pairs

  • generate_contact_map

class metator.contact_map.MetatorObject(metator_object, name, assembly, contig_data_file, pairs, min_size=5000)[source]

Bases: object

Class to handle MetaTOR object informations and to extract easily aligned pairs from the object given.

The class is initiate by the MetaTOR object type, its name, the assembly file, the contig data file from MetaTOR and the pairs files.

It’s possible to add a threshold on the contigs size.

get_contigs_data()[source]

Method to get contigs data Dataframe

Returns:

pandas.core.frame.DataFrame

Table with informations from metaTOR binning for each contigs.

set_contigs()[source]

Method to extract the list of contigs of the class from the contig data file and their size.

set_large_contigs()[source]

Method to keep only contigs bigger than the threshold given to remove small bins which will be smaller than the binning value and which will prevent a good scaffolding with Instagraal.

set_metator_object(metator_object, name)[source]

Method to get the metator object and name of the object usable for the algorithm.

Parameters:

metator_objectstr Object to extract contigs to build the matrix.

Either “contig”, “core_bin”, “overlapping_bin”, “recursive_bin”, “final_bin” or “other”.

namestr

Name of the object. Could be the name of a contig, an id of a bin or the name of the bin. Example: “NODE_1” or “MetaTOR_1_0”.

write_fasta(tmp_dir, out_dir)[source]

Method to write the new fasta with only the contigs of interest in the outdir directory.

Parameters:

tmp_dirstr

Path to the temporary directory to write the list of contigs of the object.

out_dirstr

Path to the output directory where to write the new fasta file.

metator.contact_map.extract_pairs(metator_data)[source]

Function to extract the pairs of a MetaTOR object from pairs files.

Parameters:

metator_dataobject from class MetatorObject

Object of the class MetatorObject. It should have the parameters contigs and the output pairs file of the metaTOR object set.

Returns:

int:

Number of pairs from the MetaTOR object.

metator.contact_map.generate_contact_map(assembly, contig_data_file, enzyme, name, pairs, out_dir, tmp_dir, filter_events=False, force=False, mat_fmt='graal', metator_object='final_bin', min_size=5000, pcr_duplicates=False, threads=1)[source]

General function to extract pairs of the MetaTOR object and generate its the contact map.

Parameters:

assemblystr

Path to the fasta file containing the contigs of interest. Could be the whole or the extracted contigs of one bin.

contig_data_filestr

Path to the contig_data_final.txt file form MetaTOR output.

enzymestr

Enzyme used to digest the genome in the HiC experiment. Example: HpaII,MluCI.

namestr

Name of the object. Could be the name of a contig, an id of a bin or the name of the bin. Example: “NODE_1” or “MetaTOR_1_0”.

pairsstr

Path of the “.pairs” file or bgzip indexed pair file. If more than one is given, files should be separated by a comma.

out_dirstr

Path where output files should be written. Current directory by default.

tmp_dirstr

Path where temporary files will be written.

filter_eventsbool

Filter spurious or uninformative 3C events. Requires a restriction enzyme. Default: False.

forcebool

If True, overwrite existing files with the same name as output. Default: False.

mat_fmtstr

Select the output matrix format. Can be either “bg2” for the bedgraph2 format, “cool” for Mirnylab’s cool format, or graal for a plain text COO format compatible with Koszullab’s instagraal software. Default: “graal”.

metator_objectstr

Object to extract contigs to build the matrix. Either “contig”, “core_bin”, “overlapping_bin”, “recursive_bin”, “final_bin” or “other”.

min_sizeint

Minimum contig size required to keep it.

pcr_duplicatesbool

If True, PCR duplicates will be filtered based on genomic positions. Pairs where both reads have exactly the same coordinates are considered duplicates and only one of those will be conserved. Default: False.

threadsint

Numbers of threads to use. Default: 1.

metator.figures module

Figures foor metaTOR output.

General utility functions to plot some figures to describe metaTOR output.

Core functions to plot the firgures are:
  • build_matrix_network

  • figures_bins_distribution

  • figures_bins_size_distribution

  • figure_camembert_quality

  • figures_mags_GC_boxplots

  • figures_mags_HiC_cov_boxplots

  • figures_mags_SG_cov_boxplots

  • generates_frags_network

  • network_heatmap

  • plot_figures

  • reindex_df

metator.figures.build_matrix_network(pairs_files, info_contigs, N, bin_size)[source]

Build a binned matrix from the pairs alignement files and the final binning information in the contig data file.

Parameters:

pairs_fileslist of str

List of path of the alignment “.pairs” file.

info_contigs: dict

Dictionnary with frags id of the contigs.

Nint

Size of the final bin matrix.

bin_sizeint

Size of the bin to plot the heatmap.

Returns:

numpy.array:

Binned dense matrix of the network.

metator.figures.figure_camembert_quality(out_file, prefix, n_religated, n_loops, n_weirds, n_informative_intra, n_informative_inter, n_intra_mags, n_inter_mags, uncut_thr, loop_thr)[source]

Functon to plot nice hicstuff camembert from metator high quality contigs.

metator.figures.figures_bins_distribution(bin_summary, out_file)[source]

Function to plot the distribution of the estimated completion and contamination of the final bins generated by metaTOR. The bins are ordered by decreasing completion and contamination values superior to 105% are set to 105%.

Parameters:

bin_summarypandas.core.frame.DataFrame

Table with the informations about the final bins.

out_filestr

Path where to save the figure.

metator.figures.figures_bins_size_distribution(bin_summary, total_size, threshold, out_file)[source]

Function to plot a camembert of the fraction of size corresponding to each quality of bins. We defined 6 categories:

  • High quality MAGs (HQ MAGs): >= 90% completion ; < 5% contamination

  • Medium quality MAGs (MQ MAGs): >= 70% completion ; < 10% contamination

  • Low quality MAGs (LQ MAGs): >= 50% completion ; < 10% contamination

  • Contaminated bins: >= 50% completion ; >= 10% contamination

  • Bins superior to size threshold but with less 50% completion.

  • Unbinned contigs

Parameters:

bin_summarypandas.core.frame.DataFrame

Table with the informations about the final bins.

total_sizeint

Size of the whole assembly.

thresholdint

Minimun size of the bins considered.

out_filestr

Path where to save the figure.

metator.figures.figures_mags_GC_boxplots(contigs_data, out_file)[source]

Function to plot barplots GC content of the contigs for each bins. The count are weighted by the size of the contigs.

Parameters:

contigs_datapandas.DataFrame

Table with all the data from the contigs, the ubinned contigs should have removed and a column size_weight should have been add to weight with the size.

out_filestr

Path where to write the figure.

metator.figures.figures_mags_HiC_cov_boxplots(contigs_data, out_file)[source]

Function to plot barplots HiC_coverage coverage of the contigs for each bins. The count are weighted by the size of the contigs.

Parameters:

contigs_datapandas.DataFrame

Table with all the data from the contigs, the ubinned contigs should have removed and a column size_weight should have been add to weight with the size.

out_filestr

Path where to write the figure.

metator.figures.figures_mags_SG_cov_boxplots(contigs_data, out_file)[source]

Function to plot barplots assembly coverage of the contigs for each bins. The count are weighted by the size of the contigs.

Parameters:

contigs_datapandas.DataFrame

Table with all the data from the contigs, the ubinned contigs should have removed and a column size_weight should have been add to weight with the size.

out_filestr

Path where to write the figure.

metator.figures.generates_frags_network(contig_data, bin_summary, bin_size)[source]

Generates info fragments for all contigs and start position all bins.

Parameters:

contigs_datapandas.DataFrame

Table with all the data from the contigs.

bin_summarydict

Dictionnary with the informations of the final bins kept by MetaTOR.

bin_sizeint

Size of the bin to plot the heatmap.

Returns:

dict:

Dictionnary with frags id of the contigs.

list of int:

List of the start positions of the mags

int:

Size of the final binned matrix

metator.figures.network_heatmap(matrix, out_file=None, mag_starts=None)[source]

Plot the heatmap of the network, with the contig ordering by their bin attribution.

Parameters:

matrixnumpy.array

Binned dense matrix of the network.

out_filestr

Name of the file to save the plot.

mag_startslist of int

List of bin positions where to draw mags starts as dotted lines.

metator.figures.plot_figures(out_dir, contigs_data, bin_summary, threshold)[source]

Function to generates all figures.

Parameters:

out_dirstr

Path of the output directory of MetaTOR.

contigs_datapandas.DataFrame

Table with all the data from the contigs.

bin_summarydict

Dictionnary with the informations of the final bins kept by MetaTOR.

thresholdint

Minimun size of the bins considered.

metator.figures.reindex_df(dataframe, weight_col)[source]

Expand the dataframe to prepare for resampling result is 1 row per count per sample.

Parameters:

dataframepandas.core.frame.DataFrame

Table to expand.

weight_colstr

Name of the weighted column to use.

Returns:

pandas.core.frame.DataFrame:

Table expanded with new index.

metator.io module

Core tools to build I/O for metaTOR

This mdoule contains all core I/O functions:
  • check_checkm

  • check_fasta_index

  • check_is_fasta

  • check_louvain_cpp

  • check_pairix

  • check_pairtools

  • generate_fasta_index

  • generate_temp_dir

  • get_restriction_site

  • get_pairs_data

  • process_ligation_sites

  • read_bin_summary

  • read_compressed

  • read_contig_data

  • read_results_checkm

  • retrieve_fasta

  • sort_pairs

  • sort_pairs_pairtools

  • write_bin_summary

metator.io.check_checkm()[source]

Function to test if CheckM is in the path.

Returns:

bool:

True if checkM found in the path, False otherwise.

metator.io.check_fasta_index(ref, mode='bowtie2')[source]

Function from hicstuff.io (https://github.com/koszullab/hicstuff/) Checks for the existence of a bowtie2 or bwa index based on the reference file name.

Parameters:

refstr

Path to the reference genome.

modestr

The alignment software used to build the index. bowtie2 or bwa. If any other value is given, the function returns the reference path.

Returns:

str

The bowtie2 or bwa index basename. None if no index was found

metator.io.check_is_fasta(in_file)[source]

Function from hicstuff.io (https://github.com/koszullab/hicstuff/) Checks whether input file is in fasta format.

Parameters:

in_filestr

Path to the input file.

Returns:

bool :

True if the input is in fasta format, False otherwise

metator.io.check_louvain_cpp(louvain_path)[source]

Function to check is the Louvain functions are callable.

Parameters:

louvain_pathstr

Path of the directory where the Louvain functions are.

Returns:

bool:

Boolean value describing either the Louvain functions are callable or not.

metator.io.check_pairix()[source]

Function to test if pairix is in the path.

Returns:

bool:

True if pairix found in the path, False otherwise.

metator.io.check_pairtools()[source]

Function to test if pairtools is in the path.

Returns:

bool:

True if pairtools found in the path, False otherwise.

metator.io.generate_fasta_index(fasta, aligner, outdir)[source]

Generate fasta index.

Parameters:

fastastr

Path to the fasta reference to index.

alignerstr

Aligner to use to build the index.

outdirstr

Path to the directory to write the index.

Returns:

str:

Path to the bowtie2 index build

metator.io.generate_temp_dir(path)[source]

Temporary directory generation

Function from hicstuff.io (https://github.com/koszullab/hicstuff/) Generates a temporary file with a random name at the input path.

Parameters:

pathstr

The path at which the temporary directory will be created.

Returns:

str

The path of the newly created temporary directory.

metator.io.get_pairs_data(pairfile, threads=1, remove=False, force=False)[source]

Extract pairs data from pypairix indexed pairs file. If no pypairix indexed found, sort pairs files using pairtools executable.

Parameters
  • pairfile (str) – Path to the pairfile to extract data.

  • threads (int) – Number of threads to use to sort pairs if necessary. [Default: 1]

  • remove (bool) – If set to true, it will remove the unsorted pair file. [Default: False]

  • force (bool) – If set overwrite existing files. [Default: False]

Returns

Path to the sorted and indexed pair file.

Return type

str

metator.io.get_restriction_site(enzyme)[source]

Function to return a regex which corresponds to all possible restriction sites given a set of enzyme.

Parameters:

enzymestr

String that contains the names of the enzyme separated by a comma.

Returns:

str :

Regex that corresponds to all possible restriction sites given a set of enzyme.

Examples:

>>> get_restriction_site('DpnII')
'GATC'
>>> get_restriction_site('DpnII,HinfI')
'GA.TC|GATC'
metator.io.getrandbits(k) x.  Generates an int with k random bits.
metator.io.micomplete_results_to_dict(micomplete_file)[source]

Read micomplte output file and transfrom it as a dictionnary with the bin name as keys and bin information as values.

Parameters

micomplete_file (str) – Path to the micomplete output file.

Returns

Dictionnary of the output of miComplete as values and the bin id as keys.

Return type

dict

metator.io.read_bin_summary(bin_summary_file)[source]

Read bin summary file from metator pipeline.

Parameters

bin_summary_file (str) – Path to the bin summary file.

Returns

Table from bin summary with bin name as index.

Return type

pandas.DataFrame

metator.io.read_compressed(filename)[source]

Read compressed file

Function from hicstuff.io (https://github.com/koszullab/hicstuff/) Opens the file in read mode with appropriate decompression algorithm.

Parameters:

filenamestr

The path to the input file

Returns:

file-like object

The handle to access the input file’s content

metator.io.read_contig_data(contig_data_file)[source]

Read bin summary file from metator pipeline.

Parameters

contig_data_file (str) – Path to the contig data file.

Returns

Table from contig data with contig name as index.

Return type

pandas.DataFrame

metator.io.read_results_checkm(checkm_file, checkm_taxonomy_file)[source]

Function to transform the output summary file of checkm into a dictionnary.

Parameters:

checkm_filestr

Path to the summary output file of CheckM.

Returns:

dict:

Dictionnary of the output of checkm with binID as keys and with three values lineage, completness and contamination.

metator.io.retrieve_fasta(in_file, aligner, tmpdir)[source]

Function to retrieve fasta from the given reference file. If index is given retrieve it using bowtie2 inspect. Thraw an error if not a fasta or bowtie2 index.

Parameters:

in_filestr

Path to the reference file given.

alignerstr

Name of the aligner used. Either ‘bowtie2’ or ‘bwa’.

tmpdirstr

Path to the temp directory to write the fasta if necessary.

Returns:

str:

Path to the fasta file.

metator.io.save_sparse_matrix(s_mat, path)[source]

Save a sparse matrix

Saves a sparse matrix object into tsv format.

Parameters
  • s_mat (scipy.sparse.coo_matrix) – The sparse matrix to save on disk

  • path (str) – File path where the matrix will be stored

metator.io.sort_pairs(in_file, out_file, tmp_dir=None, threads=1, buffer='2G')[source]

Adapted function from hicstuff.io (https://github.com/koszullab/hicstuff/) Sort a pairs file in batches using UNIX sort.

Parameters:

in_filestr

Path to the unsorted input file

out_filestr

Path to the sorted output file.

keyslist of str

list of columns to use as sort keys. Each column can be one of readID, chr1, pos1, chr2, pos2, frag1, frag2. Key priorities are according to the order in the list.

tmp_dirstr

Path to the directory where temporary files will be created. Defaults to current directory.

threadsint

Number of parallel sorting threads.

bufferstr

Buffer size used for sorting. Consists of a number and a unit.

metator.io.sort_pairs_pairtools(pairfile, threads=1, remove=False, force=False)[source]

Sort pairs files using pairtools executable. Pairix only works with compressed pair files. So we use bgzip to compress them.

Parameters
  • pairfile (str) – Path to the pairfile to sort, compress and index.

  • threads (int) – Number of threads to use. [Default: 1]

  • remove (bool) – If set to true, it will remove the unsorted pair file. [Default: False]

  • force (bool) – If set overwrite existing files. [Default: False]

Returns

Path to the sorted and indexed pair file.

Return type

str

metator.io.write_bin_summary(bin_summary, bin_summary_file)[source]

Function to write the bin summary from dictionnary to table text file.

Parameters:

bin_summarydict

Dictionnary with the output of the miComplete of the bins.

bin_summary_filestr

Path to the output file to write the summary informations of the bins.

metator.log module

Basic logging setup for metaTOR

Basic logging setup with three main handlers (stdout, log file and optionally texting with the right API). By default the log file is disabled, but can be enabled or changed using set_file_handler.

metator.log.set_file_handler(log_path, formatter=<logging.Formatter object>)[source]

Change the file handler for custom log file location

metator.log.setup_text_logging(credentials='text_credentials.json')[source]

Setup text logging

Setup text notifications on errors with a basic API. In order for it to work, a file named ‘text_credentials.json’ must be in the directory where a command is run. The logger performs a GET request to a text notification API.

Parameters

credentials (str or pathlib.Path) – A JSON file with necessary credentials for the GET request

metator.main module

MetaHiC pipeline for generating and manipulating MAGs.

usage:

metator [-hv] <command> [<args>…]

options:
-h, --help

shows the help

-v, --version

shows the version

The subcommands are:
network Generate metaHiC contigs network from fastq reads or bam files

and normalize it.

partition Partition metaHiC network using Louvain or Leiden algorithm. pipeline Use all the others command to give a binned output in one

command line.

validation Validates bins using CheckM and make a recursive partition to

try to decontaminate them.

qc Generates some quality check on the output of metator. contactmap Generates a HiC contact map from one metaTOR object from the

final ouptut of metaTOR.

scaffold Scaffold a metator bin based on pairs files. pairs Sort the pairs file using pairtools. Compress them using bgzip.

Index them using pairix.

metator.main.main()[source]

metator.network module

Generate metaHiC contigs network from fastq reads or bam files.

General utility functions for handling aligment files from align module and generating metaHiC networks.

Core function to build the network are:
  • alignment_to_contacts

  • compute_network

  • create_contig_data

  • normalize_pair

  • precompute_network

  • write_contig_data

  • write_hit_data

metator.network.alignment_to_contacts(alignment_files, contig_data, edge, hit_data, out_dir, output_file_network, output_file_contig_data, tmp_dir, n_cpus, normalization, self_contacts)[source]

Generates a network file (in edgelist form) from an alignment. Contigs are the network nodes and the edges are the contact counts.

The network is in a strict barebone form so that it can be reused and imported quickly into other applications etc. Verbose information about every single node in the network is written on a ‘contig data’ file.

Parameters:

alignment_fileslist of str

List of path to the alignment file(s) used as input.

contig_datadict

Dictionnary of the all the contigs from the assembly, the contigs names are the keys to the data of the contig available with the following keys: “id”, “length”, “GC”, “hit”, “coverage”. Coverage still at 0 and need to be updated later.

edgeint

Distance of the edge region in base pair on the contigs where the mapping reads are not considered as inter contigs.

hit_datadict:

Dictionnary for hit information on each contigs.

out_dirstr

The output directory to write the network and chunk data into.

output_file_networkstr, optional

The specific file name for the output network file. Default is ‘network.txt’

output_file_contig_datastr, optional

The specific file name for the output chunk data file. Default is ‘idx_contig_length_GC_hit_cov.txt’

tmp_dirstr

Path to th temporary directory. Default in the working directory

normalizationstr

If None, do not normalized the count of a contact by the geometric mean of the coverage of the contigs. Otherwise it’s the type of normalization.

self_contactsbool

Whether to return network with self contact. Default is False.

Returns:

str:

Path to the network file.

str:

Path to the verbose contig data file.

metator.network.compute_network(pre_network_file, network_file, contig_data, tmp_dir, tmp_file, n_cpus, normalization)[source]

Generate the network file from the prenetwork file.

Parameters:

pre_network_filestr

Path of the input file (prenetwork file, output from teh precompute network function)

network_filestr

Path of the output file (network file).

contig_datadict

Dictionnary of the all the contigs from the assembly, the contigs names are the keys to the data of the contig available with the following keys: “id”, “length”, “GC”, “hit”, “coverage”, “RS”.

tmp_dirstr

Path to the temporary directory.

tmp_filestr

Path of the temporary file to write the sorted precompute network.

n_cpusint

Number of cpus used to sort the prenetwork.

normalizationstr

If None, do not normalized the count of a contact by the geometric mean of the coverage of the contigs. Otherwise it’s the type of normalization.

metator.network.create_contig_data(assembly, nb_alignment=1, depth_file=None, enzyme=None)[source]

Create a dictionnary with data on each Contig.

Parameters:

assemblystr

Path to the assembly fasta file

nb_alignementint

Numbers of alignment files. [Default: 1]

depth_filestr or None

Path to the depth.txt file from jgi_summarize_bam_contig_depths from Metabat2 Software. [Default: None]

enzymestr or None

String that contains the names of the enzyme separated by a comma. [Default: None]

Returns:

dict:

Dictionnary of the all the contigs from the assembly, the contigs names are the keys to the data of the contig available with the following keys: “id”, “length”, “GC”, “hit”, “coverage”, “RS”. Hit is set to 0 and need to be updated later.

dict:

Dictionnary for hit information on each contigs.

metator.network.normalize_pair(contig_data, pair, n_occ, normalization)[source]

Function to do the normalization of an inter contig contact depending on the mode of normalisation given.

Parameters:

contig_datadict

Dictionnary of the all the contigs from the assembly, the contigs names are the keys to the data of the contig available with the following keys: “id”, “length”, “GC”, “hit”, “coverage”, “RS”.

pairtuple

Tuple of the id of the contigs contributing to the contact.

n_occint

Contact count between the two contigs.

normalizationstr

Mode of normalization to use.

Returns:

float:

Normalized contact count.

metator.network.precompute_network(alignment_files, contig_data, edge, hit_data, out_file, tmp_dir, self_contacts=False)[source]

Write a file with only the contig id separated by a tabulation and count the contacts by contigs to be able to compute directlty the normalized network.

Parameters:

alignment_fileslist of str

List of path to the alignment file(s).

contig_datadict

Dictionnary of the all the contigs from the assembly, the contigs names are the keys to the data of the contig available with the following keys: “id”, “length”, “GC”, “hit”, “coverage”. Coverage still at 0 and need to be updated later.

edgeint

Distance of the edge region in base pair on the contigs where the mapping reads are not considered as inter contigs.

hit_datadict

Dictionnary with the count of hits for each aligment file.

out_filestr

Path to the write the output_file which will be necessary to compute the network.

tmp_dirstr

Directory where to store temporary files.

self_contactsbool

If True, the contacts on the same contigs will be kept. Otherwise only displays the inter contigs contacts. [Default False]

Return:

dict:

Dictionnary of the all the contigs from the assembly, the contigs names are the keys to the data of the contig available with the following keys: “id”, “length”, “GC”, “hit”, “coverage” “RS”. Coverage still at 0 and need to be updated later.

metator.network.write_contig_data(contig_data, output_path)[source]

Function to write the contig data file at the output path given. The file will contains 7 columns separated by a tabulation: id, name, length, GC_content, hit, coverage, restriction site for each contig with an header.

Parameters:

contig_datadict

Dictionnary of the all the contigs from the assembly, the contigs names are the keys to the data of the contig available with the following keys: “id”, “length”, “GC”, “hit”, “coverage”, “RS”.

output_pathstr

Path to the output file where the data from the dictionnary will be written

metator.network.write_hit_data(hit_data, output_path)[source]

Function to write the contig data file at the output path given. The file will contains 6 columns separated by a tabulation: id, name, length, GC_content, hit, coverage for each contig.

Parameters:

hit_datadict

Dictionnary of the all the contigs from the assembly, the contigs names are the keys to a list of hits from each alignment files separately.

output_pathstr

Path to the output file where the data from the dictionnary will be written.

metator.partition module

Partition metaHiC network using Louvain or Leiden algorithm.

General utility function for generating bins from the network file using Louvain algorithm. If an overlapping threshold is given, it will use the Hamming distance to group together bins relatively closed (to avaoid to split genomes in different bins).

Core functions to partition the network are:
  • algo_partition

  • build_clustering_matrix

  • defined_overlapping_bins

  • detect_core_bins

  • generate_fasta

  • get_distances_splitmat

  • get_hamming_distance

  • leiden_iterations_java

  • louvain_iterations_cpp

  • partition

  • remove_isolates

  • update_contigs_data

Deprecated Spinglass functions:
  • spinglass_partition

metator.partition.algo_partition(algorithm='louvain', network_file=None, network=None, iterations=10, resolution_parameter=1.0, tmpdir='.', spin=2)[source]

Function to partition the network depednding on the used algorithm.

Parameters:

algorithmstr

Algorithm to use to partition network. [Default: louvain]

network_filestr

Path to the network computed previously. The file is 3 columns table separated by a tabulation with the id of the first contigs the id of the second one and the weights of the edge normalized or not. Mandatory if louvain or leiden algorithm. [Default: None]

networknetworkx.classes.graph.Graph

Network of interaction of a contaminated bins. Mandatory if spinglass algorithm. [Default: None]

iterationsint

Number of iterations of the algorithm of Leiden or Louvain. [Default: 10]

resolution_parameterfloat

Resolution parameter for Leiden clustering. [Default: 1.0]

tmp_dirstr

Path to the temporary directory. [Default: current directory]

spinint

Deprecated. Number of final cluster if spinglass algorithm chosen. [Default: 2]

Returns:

dict:

Dictionnary with the id of the contig as key and the list of the results of each iterations separated by a semicolon as values.

metator.partition.build_clustering_matrix(core_bins_contigs, hamming_distance, N)[source]

Function to return the clustering matrix in sparse format.

For each contigs, the value correspond to the number of iterations where the contigs are clusterized together divided by the number of iterations. A value of 1 means that the contigs are in the same core bin.

Parameters:

core_bins_contigsdict

Dictionnary which has as keys the core bins id and as value the id of the contigs of the core bin.

hamming_distancescipy.sparse.csr.csr_matrix:

Matrix with all the previously computed hamming distance between two core bins.

Nint

Number of contigs in the assembly.

Returns:

scipy.sparse.coo.coo_matrix:

Matrix with all the previously computed hamming distance between two contigs.

metator.partition.defined_overlapping_bins(overlap, hamming_distance, core_bins_contigs)[source]

This function extract the overlapped bins

From the hamming distances between the core bins, the function identifies the overlapping bins and create a dictionnary with the list of the contigs ID for each core bin.

Two core bins are considered overlapping if there have a percentage of identity superior or equal to the threshold given.

Parameters:

overlapfloat

hamming distance threshold use to consider that two bins are overlapping.

hamming_distancescipy.sparse.csr.csr_matrix

Matrix with all the previously computed hamming distance between two core bins.

core_bins_contigsdict

Dictionnary which has as keys the core bins id and as value the id of the contigs of the core bin.

Returns:

dict:

A dictionnary with the id of the overlapping bins as keys and the list of id of their contigs as values.

metator.partition.detect_core_bins(output_partition, iterations)[source]

Detect core bins from the output of the partition algorithm.

The function search for duplicated values in the output of Louvain or Leiden algorithm in order to find contigs which are always in the same bin. The bins find with this method are called the core bins.

Parameters:

output_partitiondict

Dictionnary with the id of the contig as key and the list of the results of each iterations separated by a semicolon as values.

iterationsint

Number of iterations made previously with the partition algorithm.

Returns:

dict:

Dictionnary which has as keys the core bins id and as value the id of the contigs of the core bin.

pandas.core.frame.DataFrame:

Table with the id of the core bin and their values for each iterations.

metator.partition.generate_fasta(assembly, overlapping_bins, contigs_data, size, output_dir, tmpdir, prefix)[source]

Generate the fasta files of each bins from the assembly.

Parameters:

assemblystr

Path to the fasta file of the original assembly.

overlapping_binsdict

A dictionnary with the id of the overlapping bins as keys and the list of id of their contigs as values.

contigs_datapandas.core.frame.DataFrame

Table with all the information on the contigs included their appartenance to the bins.

sizeint

Thrshold size chosen to write the bins.

output_dirstr

Path to the output directory where the fasta of all the bin will be written.

tmpdirstr

Path to the temporary directory to write the temporary contigs list files.

prefixstr

Sample prefix to use.

metator.partition.get_distances_splitmat(bins, core_bins_iterations)[source]

This function takes a segment of the full iterative clustering matrix and computes, for each index (i.e. contig), the hamming distance to each of the other indices.

Parameters:

binspandas.core.frame.DataFrame

Slice of the table with the id of the core bin and their values for each iterations.

core_bins_iterationspandas.core.frame.DataFrame

Table with the id of the core bin and their values for each iterations.

Returns:

scipy.sparse.csr.csr_matrix:

matrix of the distance of the possible pairs from the slice of the table and the table itself.

metator.partition.get_hamming_distance(core_bins_iterations, threads)[source]

Generate matrix of Hamming distances between all pairs of core bins.

Parameters:

core_bins_iterationspandas.core.frame.DataFrame

Table with the id of the core bin as index and their values for each iterations.

threadsint

Number of cores to parallelize computation.

Returns:

scipy.sparse.csr.csr_matrix:

Matrix with all the previously computed hamming distance between two core bins.

metator.partition.leiden_iterations_java(network_file, iterations, resolution_parameter, tmp_dir, leiden_path)[source]

Use the java implementation of Leiden to partition the network.

Parameters:

network_filestr

Path to the network computed previously. The file is 3 columns table separated by a tabulation with the id of the first contigs the id of the second one and the weights of the edge normalized or not.

iterationsint

Number of iterations of the algorithm of Leiden.

resolution_parameterfloat

Resolution parameter for Leiden clustering.

tmp_dirstr

Path to the temporary directory.

leiden_pathstr

Path to the directory with network analysis java implementation.

Returns:

dict:

Dictionnary with the id of the contig as key and the list of the results of each iterations separated by a semicolon as values.

metator.partition.louvain_iterations_cpp(network_file, iterations, tmp_dir, louvain_path)[source]

Use the cpp original Louvain to partition the network.

Parameters:

network_filestr

Path to the network computed previously. The file is 3 columns table separated by a tabulation with the id of the first contigs the id of the second one and the weights of the edge normalized or not.

iterationsint

Number of iterations of the algorithm of Louvain.

tmp_dirstr

Path to the temporary directory.

louvain_pathstr

Path to the directory with louvain functions.

Returns:

dict:

Dictionnary with the id of the contig as key and the list of the results of each iterations separated by a semicolon as values.

metator.partition.partition(algorithm, assembly, cluster_matrix, contig_data_file, iterations, network_file, outdir, fasta_dir, overlapping_parameter, resolution_parameter, size, temp_directory, threads, prefix)[source]

Function to call the others functions to partition the network.

Parameters:

algorithmstr

Algorithm to use to partition the network. Either leiden or louvain.

assemblystr

Path to the assembly file used for the partition.

cluster_matrixbool

If True, build and save the clustering matrix.

contig_data_filestr

Path to the contig data table to update.

iterationsint

Number of iterations to use for the partition.

network_filestr

Path to the network file.

outdirstr

Path to the output directory where to write the output files.

fasta_dirstr

Path to directory where to write the fasta files.

overlapping_parameterint

Hamming distance threshold to use to merge bins (percentage).

resolution_parameterfloat

Resolution parameter to use if Leiden algorithm is chosen. It will be a factor of the cost function used. A resolution parameter of 1 will be equivalent as the modularity function used in Louvain. Higher these parameters, smaller the bins will be in the output.

sizeint

Threshold size in base pair of the output bins.

temp_directorystr

Path to the directory used to write temporary files.

threadsint

Number of threads to use.

prefixstr

Sample prefix to use.

Returns:

scipy.sparse.coo.coo_matrix:

Matrix with all the previously computed hamming distance between two contigs.

str:

Path to the new contig data file with the bin informations in it.

metator.partition.remove_isolates(output_partition, network_file)[source]

Remove isolates, i.e. nodes without any contacts in the network in the partition. This step is necessary as it will slow the further process of the communities. This function is only useful while using Leiden algorithm.

Parameters:

output_partitiondict

Dictionnary with the id of the contig as key and the list of the results of each iterations separated by a semicolon as values.

network_filestr

Path to the network computed previously. The file is 3 columns table separated by a tabulation with the id of the first contigs the id of the second one and the weights of the edge normalized or not.

Returns:

dict:

Dictionnary with the id of the contig as key and the list of the results of each iterations separated by a semicolon as values without isolates.

metator.partition.spinglass_partition(subnetwork, spins=2)[source]

Use spinglass function from cdlib to partition the network.

This function is only used in the validation step as it will take a long time to run on a large network.

Parameters:

subnetworknetworkx.classes.graph.Graph

Network of interaction of a contaminated bins.

spinsint

Number of expected MAGs in the contaminated bins.

Returns: dict:

Dictionnary with the id of the contig as key and the clustering result as values.

metator.partition.update_contigs_data(contig_data_file, core_bins_contigs, overlapping_bins, outdir)[source]

Add bin information in the contigs data file.

This function allow to update the contigs data file which were created previously in the network functions with the columns: contig id, contig name, contig length, GC content, hit, coverage, restriction site. The function will add six columns: core bin id, core bin number of contigs, core bin length, overlapping bin id, overlapping bin number of contigs, overlapping bin length.

Parameters:

contig_data_filestr

Path to the contigs data file.

core_bins_contigsdict

Dictionnary which has as keys the core bins id and as value the id of the contigs of the core bin.

overlapping_binsdict

A dictionnary with the id of the overlapping bins as keys and the list of id of their contigs as values.

outdirstr

Path of the output directory to write the update contigs data file.

Returns:

pandas.core.frame.DataFrame:

Table with all the information on the contigs included their appartenance to the bins.

metator.validation module

Validates bins using miComplete and make a recursive partition to try to decontaminate them.

General functions to validate bins completion using miComplete and make recursive iterations of Louvain or Leiden to try to partition contaminated bins. Only bins with more than 50% completion and 5% contamination are subject to the recursive step. If the recursive step gave worst results than the first (decrease of the completion with no decrease of the contamination), it will keep the original bin.

Functions in this module:
  • correct_final_bin

  • get_bin_coverage

  • give_results_info

  • merge_micomplete

  • micomplete_compare_bins

  • micomplete_quality

  • recursive_clustering

  • recursive_clustering_worker

  • recursive_decontamination

  • update_contigs_data_recursive

  • write_bins_contigs

CheckM deprecated functions:
  • checkm

  • checkm_compare_bins

metator.validation.checkm(fasta_dir, outfile, taxonomy_file, tmpdir, threads)[source]

Function to evaluate fasta bins using CheckM. Write the checkM results summary in the outfile and the taxonomy results in the the taxonomy file.

Parameters:

fasta_dirstr

Path to the input fasta of the bins to evaluate.

outfilestr

Path to the file where the results of checkm will be written.

taxonomy_filestr

path to the file where checkm taxonomy results will be written.

tmpdirstr

Path to the temporary directory where CheckM intermediary files will be written.

threadsint

Numbers of threads to use for CheckM.

metator.validation.checkm_compare_bins(overlapping_checkm_file, overlapping_taxonomy_file, recursive_checkm_file, recursive_taxonomy_file, prefix)[source]

Compare the completness and contamination of the bins from the first step and from the recursive step. If the recursive step decrease the completion of one bin without decreasing its contamination, it will kept the bin before the recursive step. Moreover if the completion goes below 50% with the recursive it will not take the recursive bins.

Parameters:

overlapping_checkm_filestr

Path to the checkm summary from the overlapping step.

overlapping_taxonomy_filestr

path to the overlapping checkm taxonomy results file.

recursive_checkm_filestr

Path to the checkm summary from the recursive step.

recursive_taxonomy_filestr

path to the recursive checkm taxonomy results file.

prefixstr

Sample prefix to use.

Returns:

dict:

Dictionnary with the informations of the final bins kept by MetaTOR.

metator.validation.correct_final_bin(contigs_data, final_fasta_dir, bin_summary)[source]

Function to compute the coverage of each bin.

Parameters:

bin_summarydict

Dictionnary with the informations of the final bins kept by MetaTOR.

contigs_datapandas.core.frame.DataFrame

Dataframe with the contigs informations.

Returns:

pandas.core.frame.DataFrame

Dataframe with the contigs informations.

metator.validation.get_bin_coverage(bin_summary, contigs_data)[source]

Function to compute the coverage of each bin.

Parameters:

bin_summarydict

Dictionnary with the informations of the final bins kept by MetaTOR.

contigs_datapandas.core.frame.DataFrame

Dataframe with the contigs informations.

Returns:

dict:

Dictionnary with the informations of the final bins kept by MetaTOR with the coverage.

metator.validation.give_results_info(bin_summary)[source]

Function to return the general information about the binning results.

Parameters:

bin_summarydict

Dictionnary with the summary results of the kept bins.

metator.validation.merge_micomplete(out_bact105, out_arch131, outfile)[source]

Function to merge bacterial and archaeal output of micomplete into one file.

Parameters: out_bact105 : str

Output of micomplete using 105 bacterial markers.

out_arch131str

Output of micomplete using 131 arcaheal markers.

outfilestr

Final merged output.

metator.validation.micomplete_compare_bins(recursive_micomplete_file, bin_summary, parent_dict, step)[source]

Compare the completness and contamination of the bins from the first step and from the recursive step. If the recursive step decrease the completion of one bin without decreasing its contamination, it will kept the bin before the recursive step. Moreover if the completion goes below 50% with the recursive it will not take the recursive bins. The goal is to keep complete bins even if there are contaminated that the users can manually curate.

Parameters:

recursive_micomplete_filestr

Path to the miComplete summary from the recursive step.

bin_summarydict

Dictionnary containing iinformation about the bins.

parent_dictdict

Dictionnary with recursive bin_id as key and parent bin as values.

stepint

Recursive step.

Returns:

dict:

Dictionnary with the informations of the final bins kept by MetaTOR.

metator.validation.micomplete_quality(fasta_dir, outfile, threads)[source]

Function to evaluate fasta bins using miComplete. Write the bins quality summury in the outfile.

Parameters:

fasta_dirstr

Path to the input fasta of the bins to evaluate.

outfilestr

Path to the file where the results of miComplete will be written.

threadsint

Numbers of threads to use for miComplete.

metator.validation.recursive_clustering(assembly, iterations, overlapping_parameter, resolution_parameter, outdir, recursive_fasta_dir, algorithm, tmpdir, bin_summary, contigs_data, network, cluster_matrix, size, threads, prefix)[source]

Function to run recursive iterations on contaminated bins in order to try to improve the quality of the bins using Louvain or Leiden algorthm.

Parameters:

assemblystr

Path to the fasta file used as assembly.

iterationsint

Number of iterations to use for recursive iterations of Louvain or Leiden.

overlapping_parameterfloat

Hamming distance threshold to consider two bins as the same bin.

resolution parameterfloat

Resolution parameter of Leiden algorithm.

outdirstr

Path to the output directory.

recursive_fasta_dirstr

Path to the directory where to write the decontaminated fasta.

algorithmstr

Algorithm to use, either louvain, leiden or spinglass.

tmpdirstr

Path the temp directory.

bin_summarydict

Dictionnary containing iinformation about the bins.

contigs_datapandas.DataFrame

Table with all the data from the contigs.

networkstr

Metator full network.

cluster_matrixbool

If True, build the clustering matrix and save it.

sizeint

Size threshodl in base pairs of the bins.

threadsint

Number of threads to use.

prefixstr

Sample prefix to use.

Returns:

boolean:

True if at least one new recursive bin has been generated.

pandas.DataFrame

Updated dictionnary which has as keys the values of the iterations from the recursive partition separated by a semicolon and as values the list of the id of the contigs.

scipy.sparse.coo.coo_matrix:

Matrix with all the previously computed hamming distance between two contigs.

dict

Dictionnary with recursive bin_id as key and parent bin as values.

metator.validation.recursive_clustering_worker(bin_id, bin_summary, tmpdir, network, algorithm, iterations, resolution_parameter, contigs_data)[source]

Worker to partition one bin if it’s contaminated.

Parameters:

bin_summarydict

Dictionnary of the output of micomplete as values and the bin id as keys.

tmpdirstr

Path the temp directory.

algorithmstr

Algorithm to use, either louvain, leiden or spinglass.

networknetworkx.classes.graph.Graph

Network of contigs from HiC librairies.

iterationsint

Number of iterations to use for recursive iterations of Louvain or Leiden.

resolution parameterfloat

Resolution parameter of Leiden algorithm.

contigs_data_filestr

Path to the contigs data file from metator partition.

metator.validation.recursive_decontamination(algorithm, assembly, cluster_matrix, contig_data_file, final_fasta_dir, input_fasta_dir, iterations, network_file, outdir, overlapping_parameter, recursive_fasta_dir, resolution_parameter, size, temp_directory, threads, prefix)[source]

Function to validate bins do the recursive decontamination using Louvain or Leiden algorithm

Parameters:

algorithmstr

Algorithm to use to recursively partition the network. Either leiden, louvain or spinglass.

assemblystr

Path to the assembly file used for the partition.

cluster_matrixbool

If True, build the clustering matrix and save it.

contig_data_filestr

Path to the contig data table to update.

final_fasta_dirstr

Path to write the final fasta decontaminated bins.

input_fasta_dirstr

Path to the directory where the fasta bin from the partition are.

iterationsint

Number of iterations to use for the recursive partition.

network_filestr

Path to the network file.

outdirstr

Path to the output directory where to write the output files.

overlapping_parameterint

Hamming distance threshold in percentage to use to consider to bins as one in the recursive partition.

recursive_fasta_dirstr

Path to write the fasta decontaminated bins.

resolution_parameterfloat

Resolution parameter to use if Leiden algorithm is chosen. It will be a factor of the cost function used. A resolution parameter of 1 will be equivalent as the modularity function used in Louvain. Higher these parameters, smaller the bins will be in the output.

sizeint

Threshold size in base pair of the output bins.

temp_directorystr

Path to the directory used to write temporary files.

threadsint

Number of threads to use.

prefixstr

Sample prefix to use.

Returns:

scipy.sparse.coo.coo_matrix:

Matrix with all the previously computed hamming distance between two contigs.

metator.validation.update_contigs_data_recursive(bin_id, contigs_data, recursive_bins, assembly, outdir, tmpdir, size, contamination, parent_dict, prefix)[source]

Update the data of the bin according to the recursive step and generated their fasta.

Parameters:

bin_idstr

Name of the parental bin.

contigs_datapandas.DataFrame

Table with all the data from the contigs.

recursive_binsdict

Dictionnary which has as keys the values of the recursive iterations from Louvain or Leiden separated by a semicolon and as values the list of the id of the contigs.

assemblystr

Path to the fasta file.

outdirstr

Path to the output directory to write the new fasta.

tmpdirstr

Path to the file where to write the temporary contigs list.

sizeint

Size threshold to generate fasta.

contaminationboolean

True if one bin has already been generated, false otherwise.

parent_dictdict

Dictionnary with recursive bin_id as key and parent bin as values.

prefixstr

Sample prefix to use.

Returns:

boolean:

True if one bin has already been generated, false otherwise.

pandas.DataFrame

Updated dictionnary which has as keys the values of the iterations from the recursive partition separated by a semicolon and as values the list of the id of the contigs.

dict

Dictionnary with recursive bin_id as key and parent bin as values.

metator.validation.write_bins_contigs(bin_summary, contigs_data, outfile, prefix)[source]

Function to write a table with the nodes kept in the bins and their bin id. The file is adapted to be added in anvio.

Parameters:

bin_summarydict

Dictionnary with the informations of the final bins kept by MetaTOR.

contigs_datapandas.core.frame.DataFrame

Dataframe with the contigs informations.

outfilestr

Path where to write the output file.

prefixstr

Sample prefix to use.

Returns:

pandas.core.frame.DataFrame

Dataframe with the contigs informations with the final bin information.

metator.version module

Module contents