metator package¶

Submodules¶

metator.align module¶

Alignement of reads

Align read files onto the assembly and return a tsv file with 9 columns: ReadID, ContigA, Position_startA, Position_endA, StrandA, ContigB, Position_startB, Position_endB, StrandB. Reads are mapped separately, sorted by names, then interleaved (rather than mapped in paired-end mode) to capture the pairs mapping on two different contigs.

This module contains all these alignment functions:

align
get_contact_pairs
merge_alignement
process_bamfile
process_bwa_bamfile

metator.align.align(fq_in, index, aligner, bam_out, n_cpu, tmp_dir, iterative=False, fq_in_2=None)[source]¶

Alignment Aligns reads of fq_in with bowtie2. Parameters of bowtie2 are set as –very-sensitive-local and only the aligned reads are kept in the bam file. Multiple files could be given, but only one file will be written as output.

Parameters:¶

fq_instr: Path to input fastq file to align.
indexstr: Path to the bowtie2 index genome.
alignerstr: Name of the aligner algorithm to use. Either bowtie2 or bwa.
bam_outstr: Path where the alignment should be written in BAM format.
n_cpuint: The number of CPUs to use for the alignment.
tmp_dir:: Path to directory to store temporary files.
iterative:: Wether to use the iterative mapping procedure (truncating reads and extending iteratively)
fq_in_2str: Path to the reverse fastq file for bwa (map both reads at the same time but do not take into account their poistion as pairs.)

Returns:¶

int :: 0

metator.align.get_contact_pairs(for_in, rev_in, index, assembly, aligner, aligner_mode, min_qual, start, depth_file, enzyme, out_dir, tmp_dir, n_cpu)[source]¶

General function to do the whole alignment of both fastq.

The Function write at the output directory location given as an argument and return a tsv file of the aligned reads with 9 columns: ReadID, ContigA, Position_startA, Position_endA, StrandA, ContigB, Position_startB, Position_endB, StrandB. The name of the file will be alignment.txt.

Two start stages are possible, from fastq or bam files.

Parameters

for_in (str) – Path to input forward fastq or bam file to align. If multiple files are given, list of path separated by a comma.
rev_in (str) – Path to input reverse fastq or bam file to align. If multiple files are given, list of path separated by a comma.
index (str) – Path to the bowtie2 index of the assembly.
assembly (str) – The initial assembly path acting as the alignment file’s reference assembly.
aligner (str) – Either ‘bowtie2’ or ‘bwa’ aligner used or to be use to map the reads.
aligner_mode (str) – Either ‘normal’, ‘iterative’ or ‘cutsiste’. Mode to align the HiC reads.
min_qual (int) – Minimum mapping quality required to keep Hi-C pairs.
start (str) – Either fastq or bam. Starting point for the pipeline.
depth_file (str or None) – Path to the depth.txt file from jgi_summarize_bam_contig_depths from Metabat2 Software.
enzyme (str or None) – String that contains the names of the enzyme separated by a comma.
out_dir (str) – Path to directory where to write the output file.
tmp_dir (str) – Path where temporary files should be written.
n_cpu (int) – The number of CPUs to use for the alignment.

Returns

list of str – List of path of the Files with the table containing the alignement data of the pairs: ReadID, ContigA, Position_startA, Position_endA, StrandA, ContigB, Position_startB, Position_endB, StrandB.
dict – Dictionnary of the all the contigs from the assembly, the contigs names are the keys to the data of the contig available with the following keys: “id”, “length”, “GC”, “hit”, “coverage”. Coverage still at 0 and need to be updated later.
dict – Dictionnary for hit information on each contigs.

metator.align.merge_alignment(forward_aligned, reverse_aligned, contig_data, out_file)[source]¶

Merge forward and reverse alignment into one file with only pairs which have both reads are aligned on the genome with 9 columns: ReadID, ContigA, Position_startA, Position_endA, StrandA, ContigB, Position_startB, Position_endB, StrandB.

Parameters

forward_alignement (str) – File with the table containing the data of the forward reads kept after the alignment. With five columns: ReadID, Contig, Position_start, Position_end, strand.
reverse_alignement (str) – File with the table containing the data of the forward reads kept after the alignment. With five columns: ReadID, Contig, Position_start, Position_end, strand.
contig_data (dict) – Dictionnary of the all the contigs from the assembly, the contigs names are the keys to the data of the contig available with the following keys: “id”, “length”, “GC”, “hit”, “coverage”. Coverage still at 0 and need to be updated later.
out_file (str) – Path to write the output pairs file.

Returns

File with the table containing the alignement data of the pairs: ReadID, ContigA, Position_startA, Position_endA, StrandA, ContigB, Position_startB, Position_endB, StrandB

Return type

str

metator.align.process_bamfile(alignment, min_qual, filtered_out)[source]¶

Filter alignment BAM files

Reads all the reads in the input BAM alignment file. Keep reads in the output if they are aligned with a good quality (greater than min quality threshold given) saving their only some columns: ReadID, Contig, Position_start, Position_end, strand to save memory.

Parameters

alignment (str) – Path to the input temporary alignment.
min_qual (int) – Minimum mapping quality required to keep a Hi-C pair.
filtered_out (str) – Path to the output temporary tsv alignement.

Returns

Number of reads aligned.

Return type

int

metator.align.process_bwa_bamfile(alignment, min_qual, contig_data, out_file)[source]¶

Filter alignment BAM files

Parameters

alignment (str) – Path to the input temporary alignment.
min_qual (int) – Minimum mapping quality required to keep a Hi-C pair.
contig_data (dict) – Dictionnary of the all the contigs from the assembly, the contigs names are the keys to the data of the contig available with the following keys: “id”, “length”, “GC”, “hit”, “coverage”. Coverage still at 0 and need to be updated later.
out_file (str) – Path to the output pairs file.

Returns

Number of pairs aligned.

Return type

int

metator.commands module¶

Abstract command classes for metaTOR

This module contains all classes related to metaTOR commands:

network
partition
pipeline
validation
qualitycheck
contactmap

Note

Structure based on Rémy Greinhofer (rgreinho) tutorial on subcommands in docopt : https://github.com/rgreinho/docopt-subcommands-example abignaud, 20201118

raises NotImplementedError: Will be raised if AbstractCommand is called for some reason instead of one of its children.

class metator.commands.AbstractCommand(command_args, global_args)[source]¶

Bases: object

Abstract base command class

Base class for the commands from which other metaTOR commands derive.

check_output_path(path, force=False)[source]¶: Throws error if the output file exists. Create required file tree otherwise.

execute()[source]¶: Execute the commands

class metator.commands.Contactmap(command_args, global_args)[source]¶

Bases: AbstractCommand

Generates a HiC contact map of a MetaTOR object from the pairs files, the contig data file and a fasta file containing the contigs of interest.

Generates the Hi-C matrix of one MetaTOR object form the pair alignment file of metaTOR. MetaTOR object integrated are contigs, and core, overlapping, recursive, final or personalized bins. For the personalized bins you should add a column in the contig_data_final.txt file with a header called “Other”.

Do not display the matrix, you have to run another pipeline to display it such hicstuff view or cooler show depending on the metrix format chosen. It’s also possible to try to scaffold the bin using instagraal pipeline using the output.

usage:

contactmap –assembly=FILE –contig-data=FILE –enzyme=STR –name=STR [–filter] [–force] [–mat-fmt=graal] [–min-size=5000] [–no-clean-up] [–object=final_bin] [–outdir=DIR] [–pcr-dup] [–tmpdir=DIR] [–threads=1] <pairsfile>…

argument:

pairsfile File(s) containing pairs information.

options:

-a, --assembly=FILE: Path to the fasta file containing the contigs of interest. Could be the whole or the extracted contigs of one bin.
-c, --contig-data=FILE: Path to the contig_data_final.txt file from MetaTOR output.
-e, --enzyme=STR: The list of restriction enzyme used to digest the contigs separated by a comma. Example: HpaII,MluCI.
-n, --name=STR: Name of the metator or its numerical ID for the core bin, overlapping bin or recursive bin objects. Example: “NODE_1”, “MetaTOR_1_0” or 8.

-D, –pcr-dup, Filter out PCR duplicates based on read: positions.

-f, --filter: Filter out spurious 3C events (loops and uncuts) using hicstuff filter. For more informations, see Cournac et al. BMC Genomics, 2012.
-F, --force: Write files even if the output files already exists.
-m, --mat-fmt=STR: The format of the output sparse matrix. Can be “bg2” for 2D Bedgraph format, “cool” for Mirnylab’s cooler software, or “graal” for graal-compatible plain text COO format. [default: graal]
-N, --no-clean-up: If enabled intermediary files will be kept.
-o, --outdir=DIR: Output directory. Default creates a new directory contact_map in the current directory.
-O, --object=STR: Type of metator object used to build a contact map. Either “contig”, “core_bin”, “overlapping_bin”, “recursive_bin”, “final_bin” or “other”. [Default: final_bin]
-s, --min-size=INT: Minimum size threshold to consider contigs. [Default: 5000]
-t, --threads=INT: Number of threads to allocate. [Default: 1]
-T, --tmpdir=DIR: Directory for storing intermediary files and temporary files. Default creates a “tmp” folder in the current directory.

execute()[source]¶: Execute the commands

class metator.commands.Network(command_args, global_args)[source]¶

Bases: AbstractCommand

Generation of network command

Generate metaHiC contigs network from fastq reads or bam files.Generates a network file (in edgelist form) from either fastq files or bam files. For both starting input forward and reverse reads or alignment need to be in seprated files with the same read names. In the output network contigs are the network nodes and the edges are the contact counts.

If multiple fastq are given will align them seprately and return a hit file with a number of hit for each sample.

Note: the network is in a strict barebone form so that it can be reused and imported quickly into other applications etc. Verbose information about every single node in the network is written on a ‘contig data’ file.

usage:

network –forward=STR –assembly=FILE [–reverse=STR] [–aligner=bowtie2] [–aligner-mode=normal] [–depth=FILE] [–edge=0] [–enzyme=STR] [–normalization=empirical_hit] [–no-clean-up] [–outdir=DIR] [–min-quality=30] [–self-contacts] [–start=fastq] [–threads=1] [–tmpdir=DIR]

options:

-1, --forward=STR: Fastq file or list of Fastq separated by a comma containing the forward reads to be aligned or their corresponding bam or pairs files.
-2, --reverse=STR: Fastq file or list of Fastq separated by a comma containing the reverse reads to be aligned or their corresponding bam files. Forward and reverse reads need to have the same identifier (read names). If start is set to pair, no argument is necessary.
-a, --assembly=FILE: The initial assembly path acting as the alignment file’s reference genome or the basename of the bowtie2 index.
-b, --aligner=STR: Aligner algorithm to use. Either “bwa” or “bowtie2”. [Default: bowtie2]
-B, --aligner-mode=STR: Mode of alignment from hicstuff. Either normal, iterative or cutsite. [Default: normal]
-d, --depth=FILE: The depth.txt file from the shotgun reads used to made the assembly computed by jgi_summarize_bam_contig_depths from metabat2 pipeline.
-E, --edge=INT: Distance of the edge region in base pair on the contigs where the mapping reads are not considered as inter contigs. [Default: 0]
-e, --enzyme=STR: The list of restriction enzyme used to digest the contigs separated by a comma. Example: HpaII,MluCI.

-n, –normalization=STR If None, do not normalized the count of a: contact by the geometric mean of the coverage of the contigs. Otherwise it’s the type of normalization. 6 values are possible “None”, “abundance”, “length”, “RS”, “empirical_hit”, “theoritical_hit”. [Default: empirical_hit]

-N, --no-clean-up: Do not remove temporary files.
-o, --outdir=DIR: The output directory to write the bam files the network and contig data into. Default: current directory.
-q, --min-quality=INT: Threshold of quality necessary to considered a read properly aligned. [Default: 30]
-s, --self-contacts: If enabled, count alignments between a contig and itself (intracontigs contacts).
-S, --start=STR: Start stage of the pipeline. Either “fastq”, “bam”, or “pair”. [Default: fastq]
-t, --threads=INT: Number of parallel threads allocated for the alignement. [Default: 1]
-T, --tmpdir=DIR: Temporary directory. Default to current directory. [Default: ./tmp]

execute()[source]¶: Execute the commands

class metator.commands.Pairs(command_args, global_args)[source]¶

Bases: AbstractCommand

Sort and pairs files for faster assess to the data.

Sort the pairs file using pairtools. Compress them using bgzip. Index them using pairix.

usage:: pairs [–force] [–remove] [–threads=1] <pairsfile>…

Parameters: File (pairsfile) –

options:

-F, --force: Write files even if the output files already exists.
-r, --remove: Remove the input file at the end to keep only the sorted, compressed and indexed pairs file.
-t, --threads=INT: Numbers of thread to allocate. [Default: 1]

execute()[source]¶: Execute the commands

class metator.commands.Partition(command_args, global_args)[source]¶

Bases: AbstractCommand

Partition the network using Louvain algorithm

Partition the network file using iteratively the Louvain or Leiden algorithm. Then looks for ‘cores’ bins that are constituted by the contigs which are always in the same cluster at each iterations. Then using the Hamming distance from these cores bins, merege the bins with a small Hamming distance (close to 1), bigger than the percentage (overlapping parameter) given.

It will also update the file to integrate the bins information of the contigs.

Furthermore, both Leiden and Louvain algorithm are available here. However, the benchmark made show that here the Louvain algorithm have better performance and is faster on seawater and gut metagenomic samples.

Note that the Louvain or Leiden software are not, in the strictest sense, necessary. Any program that assigns a node to a bin, does so non deterministically and solely outputs a list in the form: ‘node_id bin_id’ could be plugged instead.

usage:

partition –assembly=FILE –contigs=FILE –network=FILE [–algorithm=louvain] [–cluster-matrix] [–force] [–iterations=100] [–no-clean-up] [–outdir=DIR] [–overlap=80] [–prefix=STR] [–res-param=1.0] [–size=500000] [–threads=1] [–tmpdir=DIR]

options:

-a, --assembly=FILE: The path to the assembly fasta file used to do the alignment.
-A, --algorithm=STR: Either “louvain” or “leiden”, algorithm to use to partition the network. [Default: louvain]
-c, --contigs=FILE: The path to the tsv file containing the data of the contigs (ID, Name, Length, GC content, Hit, Coverage, Restriction Site).
-C, --cluster-matrix: If enabled, save the clustering matrix.
-F, --force: If enabled, would remove directory of overlapping bins in the output directory.
-i, --iterations=INT: Number of iterations of Louvain. [Default: 100]
-n, --network=FILE: Path to the file containing the network information from the meta HiC experiment compute in network function previously.
-N, --no-clean-up: Do not remove temporary files.
-o, --outdir=DIR: Path to the directory to write the output. Default to current directory. [Default: ./]
-O, --overlap=INT: Hamming distance threshold to use to merge bins (percentage). [Default: 80]
-p, --prefix=STR: Prefix to use for fasta files. By default just ‘metator_’, otherwise ‘STR_metator_’.
-r, --res-param=FLOAT: Resolution paramter to use for Leiden algorithm. [Default: 1.0]
-s, --size=INT: Threshold size to keep bins in base pair. [Default: 500000]
-t, --threads=INT: Number of parallel threads allocated for the partition. [Default: 1]
-T, --tmpdir=DIR: Temporary directory. Default to current directory. [Default: ./tmp]

execute()[source]¶: Execute the commands

class metator.commands.Pipeline(command_args, global_args)[source]¶

Bases: AbstractCommand

Launch the full metator pipeline

Partition the contigs from teh assembly in bins from the metaHiC reads.

It’s possible to start from the fastq, the bam, the pair or the network files. It’s will also possible to ask or not to run a validation step which will decontaminate the bins when it’s necessary. However as it’s the critical step for memory usage (~40G), it’s possible to skip these step.

usage:

pipeline –assembly=FILE [–forward=STR] [–reverse=STR] [–algorithm=louvain] [–aligner=bowtie2] [–aligner-mode=normal] [–cluster-matrix] [–depth=FILE] [–edge=0] [–enzyme=STR] [–force] [–iterations=100] [–rec-iter=10] [–junctions=NNNNN] [–no-clean-up] [–normalization=empirical_hit] [–outdir=DIR] [–overlap=80] [–prefix=STR] [–rec-overlap=90] [–min-quality=30] [–res-param=1.0] [–size=500000] [–start=fastq] [–scaffold] [–threads=1] [–tmpdir=DIR]

options:

-1, --forward=STR: Fastq file or list of Fastq separated by a comma containing the forward reads to be aligned or their corresponding bam or pairs files.
-2, --reverse=STR: Fastq file or list of Fastq separated by a comma containing the reverse reads to be aligned or their corresponding bam files. Forward and reverse reads need to have the same identifier (read names). If pair start is used no argument is necessary.
-a, --assembly=FILE: The initial assembly path acting as the alignment file’s reference genome or the basename of the bowtie2 index.
-A, --algorithm=STR: Algorithm to use. Either “louvain”, “leiden” or “spinglass”. If spinglass is chosen, the first partition will be done using louvain [Default: louvain]
-b, --aligner=STR: Aligner algorithm to use. Either “bwa” or “bowtie2”. [Default: bowtie2]
-B, --aligner-mode=STR: Mode of alignment from hicstuff. Either normal, iterative or cutsite. [Default: normal]
-C, --cluster-matrix: If enabled, save the clustering matrix.
-d, --depth=FILE: The depth.txt file from the shotgun reads used to made the assembly computed by jgi_summarize_bam_contig_depths from metabat2 pipeline.
-E, --edge=INT: Distance of the edge region in base pair on the contigs where the mapping reads are not considered as inter contigs. [Default: 0]
-e, --enzyme=STR: The list of restriction enzyme used to digest the contigs separated by a comma. Example: HpaII,MluCI.
-F, --force: If enable, would remove directory of overlapping bins in the output directory.
-i, --iterations=INT: Number of iterations of Louvain for the partition step. [Default: 100]
-j, --rec-iter=INT: Number of iterations of Louvain for the recursive step. [Default: 10]
-J, --junctions=STR: Sequences to use as junction between contigs. [Default: NNNNN]
-N, --no-clean-up: Do not remove temporary files.

-m, –normalization=STR If None, do not normalized the count of a: contact by the geometric mean of the coverage of the contigs. Otherwise it’s the type of normalization. 6 values are possible None, abundance, length, RS, empirical_hit, theoritical_hit. [Default: empirical_hit]

-o, --outdir=DIR: The output directory to write the bam files the network and contig data into. Default: current directory.
-O, --overlap=INT: Hamming distance threshold to use to merge bins for the first step (percentage). [Default: 80]
-p, --prefix=STR: Prefix to use for fasta files. By default just ‘metator_’, otherwise ‘STR_metator_’.
-P, --rec-overlap=INT: Hamming distance threshold to use to merge bins for the recursive step (percentage). [Default: 90]
-q, --min-quality=INT: Threshold of quality necessary to considered a read properly aligned. [Default: 30]
-r, --res-param=FLOAT: Resolution paramter to use for Leiden algorithm. [Default: 1.0]
-s, --size=INT: Threshold size to keep bins in base pair. [Default: 500000]
-S, --start=STR: Start stage of the pipeline. Either fastq, bam, or pair. [Default: fastq]
-t, --threads=INT: Number of parallel threads allocated for the alignement. [Default: 1]
-T, --tmpdir=DIR: Temporary directory. Default to current directory. [Default: ./tmp]
-v, --scaffold: If enables, it will sacffold the genomes at the end.

execute()[source]¶: Execute the commands

class metator.commands.Qc(command_args, global_args)[source]¶

Bases: AbstractCommand

Generates some quality check on the output of metator.

The quality check is done on the high quality MAGs and only contigs witha size bigger than 100kb. The goal is to assess HiC quality metrics on genomes we kinda know to have a estimation of the quality of the library.

If the plot are set it will return some plot of the event distribution on the high quality contigs.

usage:: qc –assembly=FILE –enzyme=STR [–bin-summary=FILE] [–contig-data=FILE] [–metator-dir=DIR] [–no-clean-up] [–outdir=DIR] [–prefix=STR] [–plot] [–threshold=STR] [–tmpdir=DIR] <pairsfile>…

Parameters: File (pairsfile) –

options:

-a, --assembly=FILE: Path to the fasta file containing the contigs of interest. Could be the whole or the extracted contigs of one bin.
-b, --bin-summary=FILE: Path to the bin_summary.txt file from MetaTOR output.
-c, --contig-data=FILE: Path to the contig_data_final.txt file from MetaTOR output.
-e, --enzyme=STR: The list of restriction enzyme used to digest the contigs separated by a comma. Example: HpaII,MluCI.
-n, --no-clean-up: If enabled intermediary files will be kept.
-O, --metator-dir=DIR: Output directory from metator pipeline. If set contigs data and bin summary will be found automatically.
-o, --outdir=DIR: Directory to save output plots and log.
-p, --prefix=STR: Name of the sample to add on plot and files.
-P, --plot: If enable display some plots.
-t, --threshold=STR: Hicstuff religation and loop thresholds. Two integers seperated by a coma.
-T, --tmpdir=DIR: Temporary directory to save pairs files

execute()[source]¶: Execute the commands

class metator.commands.Scaffold(command_args, global_args)[source]¶

Bases: AbstractCommand

Scaffold a bin from metator.

Use the pairs to build a scaffold from the binned contigs. The scaffolding is based on contact between the edges of the contigs. There are normalized based on intra contacts of the contigs except for the small contigs where a low score is given.

usage:: scaffold –bin-name=STR –input-fasta=FILE [–junctions=NNNNN] [–out-fasta=FILE] [–out-frags=FILE] [–threads=1] [–threshold=0.05] [–window-size=5000] <pairsfile>…

Parameters: File (pairsfile) –

options:

-b, --bin-name=STR: Name of the bin to scaffold.
-i, --input-fasta=FILE: Path to the input fasta.
-j, --junctions=STR: Sequences to use as junction between contigs. [Default: NNNNN]
-o, --out-fasta=FILE: Path to write the output fasta. Default in the current directory: ‘”$bin_name”_scaffolded.fa’.
-O, --out-frags=FILE: Path to write the fragments information. Default in the current directory: ‘”$bin_name”_info_frags.txt’.
-t, --threads=INT: Numbers of threads to allocate if pairs need to be sorted. [Default: 1]
-T, --threshold=FLOAT: Threshold score to consider an association between two contigs. [Default: 0.05]
-w, --window-size=INT: Size of the window in base pair to use as edge window to scaffold. [Default: 5000]

execute()[source]¶: Execute the commands

class metator.commands.Validation(command_args, global_args)[source]¶

Bases: AbstractCommand

Use CheckM to validate the bins.

Recursively decontaminate the contaminated bins evaluated by checkM. The script returns new decontaminated fasta and summary files of the decontamination.

Only bins with more than 50% completion and 5% contamination are subject to the recursive step. If the recursive step gave worst results than the first (decrease of the completion with no decrease of the contamination), it will keep the original bin.

usage:

validation –assembly=FILE –contigs=FILE –fasta=DIR –network=FILE [–algorithm=louvain] [–cluster-matrix] [–force] [–iterations=10] [–no-clean-up] [–outdir=DIR] [–overlap=90] [–prefix=STR] [–res-param=1.0] [–size=500000] [–threads=1] [–tmpdir=DIR]

options:

-a, --assembly=FILE: The path to the assembly fasta file used to do the alignment.
-A, --algorithm=STR: Algorithm to use. Either “louvain”, “leiden” or “spinglass”. [Default: louvain]
-c, --contigs=FILE: The path to the file containing the data of the contigs from the partition step (13 columns).
-C, --cluster-matrix: If enabled, save the clustering matrix.
-f, --fasta=DIR: Path to the directory containing the input fasta files of the bins to decontaminate.
-F, --force: If enable, would remove directory of recursive bins in the output directory.
-i, --iterations=INT: Number of recursive iterations of Louvain. [Default: 10]
-n, --network=FILE: Path to the file containing the network information from the meta HiC experiment compute in network function previously.
-N, --no-clean-up: Do not remove temporary files.
-o, --outdir=DIR: Path to the directory to write the output. Default to current directory. [Default: ./]
-O, --overlap=INT: Hamming distance threshold to use to merge bins (percentage). [Default: 90]
-p, --prefix=STR: Prefix to use for fasta files. By default just ‘metator_’, otherwise ‘STR_metator_’.
-r, --res-param=FLOAT: Resolution paramter to use for Leiden algorithm. [Default: 1.0]
-s, --size=INT: Threshold size to keep bins in base pair. [Default: 500000]
-t, --threads=INT: Number of parallel threads allocated for the partition. [Default: 1]
-T, --tmpdir=DIR: Temporary directory. Default to current directory. [Default: ./tmp]

execute()[source]¶: Execute the commands

metator.commands.generate_log_footer(log_path)[source]¶

metator.commands.generate_log_header(log_path, cmd, args)[source]¶

metator.contact_map module¶

Generates a HiC contact map of a MetaTOR object.

General utility functions and class for handling one MetaTOR object and for generating its HiC contact map using hicstuff pipeline.

Core class to handle bin object easily:

MetatorObject

Core function to build the network are:

extract_pairs
generate_contact_map

class metator.contact_map.MetatorObject(metator_object, name, assembly, contig_data_file, pairs, min_size=5000)[source]¶

Bases: object

Class to handle MetaTOR object informations and to extract easily aligned pairs from the object given.

The class is initiate by the MetaTOR object type, its name, the assembly file, the contig data file from MetaTOR and the pairs files.

It’s possible to add a threshold on the contigs size.

get_contigs_data()[source]¶

Method to get contigs data Dataframe

Returns:¶

pandas.core.frame.DataFrame: Table with informations from metaTOR binning for each contigs.

set_contigs()[source]¶: Method to extract the list of contigs of the class from the contig data file and their size.

set_large_contigs()[source]¶: Method to keep only contigs bigger than the threshold given to remove small bins which will be smaller than the binning value and which will prevent a good scaffolding with Instagraal.

set_metator_object(metator_object, name)[source]¶

Method to get the metator object and name of the object usable for the algorithm.

Parameters:¶

metator_objectstr Object to extract contigs to build the matrix.: Either “contig”, “core_bin”, “overlapping_bin”, “recursive_bin”, “final_bin” or “other”.
namestr: Name of the object. Could be the name of a contig, an id of a bin or the name of the bin. Example: “NODE_1” or “MetaTOR_1_0”.

write_fasta(tmp_dir, out_dir)[source]¶

Method to write the new fasta with only the contigs of interest in the outdir directory.

Parameters:¶

tmp_dirstr: Path to the temporary directory to write the list of contigs of the object.
out_dirstr: Path to the output directory where to write the new fasta file.

metator.contact_map.extract_pairs(metator_data)[source]¶

Function to extract the pairs of a MetaTOR object from pairs files.

Parameters:¶

metator_dataobject from class MetatorObject: Object of the class MetatorObject. It should have the parameters contigs and the output pairs file of the metaTOR object set.

Returns:¶

int:: Number of pairs from the MetaTOR object.

metator.contact_map.generate_contact_map(assembly, contig_data_file, enzyme, name, pairs, out_dir, tmp_dir, filter_events=False, force=False, mat_fmt='graal', metator_object='final_bin', min_size=5000, pcr_duplicates=False, threads=1)[source]¶

General function to extract pairs of the MetaTOR object and generate its the contact map.

Parameters:¶

assemblystr: Path to the fasta file containing the contigs of interest. Could be the whole or the extracted contigs of one bin.
contig_data_filestr: Path to the contig_data_final.txt file form MetaTOR output.
enzymestr: Enzyme used to digest the genome in the HiC experiment. Example: HpaII,MluCI.
namestr: Name of the object. Could be the name of a contig, an id of a bin or the name of the bin. Example: “NODE_1” or “MetaTOR_1_0”.
pairsstr: Path of the “.pairs” file or bgzip indexed pair file. If more than one is given, files should be separated by a comma.
out_dirstr: Path where output files should be written. Current directory by default.
tmp_dirstr: Path where temporary files will be written.
filter_eventsbool: Filter spurious or uninformative 3C events. Requires a restriction enzyme. Default: False.
forcebool: If True, overwrite existing files with the same name as output. Default: False.
mat_fmtstr: Select the output matrix format. Can be either “bg2” for the bedgraph2 format, “cool” for Mirnylab’s cool format, or graal for a plain text COO format compatible with Koszullab’s instagraal software. Default: “graal”.
metator_objectstr: Object to extract contigs to build the matrix. Either “contig”, “core_bin”, “overlapping_bin”, “recursive_bin”, “final_bin” or “other”.
min_sizeint: Minimum contig size required to keep it.
pcr_duplicatesbool: If True, PCR duplicates will be filtered based on genomic positions. Pairs where both reads have exactly the same coordinates are considered duplicates and only one of those will be conserved. Default: False.
threadsint: Numbers of threads to use. Default: 1.

metator.figures module¶

Figures foor metaTOR output.

General utility functions to plot some figures to describe metaTOR output.

Core functions to plot the firgures are:

build_matrix_network
figures_bins_distribution
figures_bins_size_distribution
figure_camembert_quality
figures_mags_GC_boxplots
figures_mags_HiC_cov_boxplots
figures_mags_SG_cov_boxplots
generates_frags_network
network_heatmap
plot_figures
reindex_df

metator.figures.build_matrix_network(pairs_files, info_contigs, N, bin_size)[source]¶

Build a binned matrix from the pairs alignement files and the final binning information in the contig data file.

Parameters:¶

pairs_fileslist of str: List of path of the alignment “.pairs” file.
info_contigs: dict: Dictionnary with frags id of the contigs.
Nint: Size of the final bin matrix.
bin_sizeint: Size of the bin to plot the heatmap.

Returns:¶

numpy.array:: Binned dense matrix of the network.

metator.figures.figure_camembert_quality(out_file, prefix, n_religated, n_loops, n_weirds, n_informative_intra, n_informative_inter, n_intra_mags, n_inter_mags, uncut_thr, loop_thr)[source]¶: Functon to plot nice hicstuff camembert from metator high quality contigs.

metator.figures.figures_bins_distribution(bin_summary, out_file)[source]¶

Function to plot the distribution of the estimated completion and contamination of the final bins generated by metaTOR. The bins are ordered by decreasing completion and contamination values superior to 105% are set to 105%.

Parameters:¶

bin_summarypandas.core.frame.DataFrame: Table with the informations about the final bins.
out_filestr: Path where to save the figure.

metator.figures.figures_bins_size_distribution(bin_summary, total_size, threshold, out_file)[source]¶

Function to plot a camembert of the fraction of size corresponding to each quality of bins. We defined 6 categories:

High quality MAGs (HQ MAGs): >= 90% completion ; < 5% contamination

Medium quality MAGs (MQ MAGs): >= 70% completion ; < 10% contamination

Low quality MAGs (LQ MAGs): >= 50% completion ; < 10% contamination

Contaminated bins: >= 50% completion ; >= 10% contamination

Bins superior to size threshold but with less 50% completion.

Unbinned contigs

Parameters:¶

bin_summarypandas.core.frame.DataFrame: Table with the informations about the final bins.
total_sizeint: Size of the whole assembly.
thresholdint: Minimun size of the bins considered.
out_filestr: Path where to save the figure.

metator.figures.figures_mags_GC_boxplots(contigs_data, out_file)[source]¶

Function to plot barplots GC content of the contigs for each bins. The count are weighted by the size of the contigs.

Parameters:¶

contigs_datapandas.DataFrame: Table with all the data from the contigs, the ubinned contigs should have removed and a column size_weight should have been add to weight with the size.
out_filestr: Path where to write the figure.

metator.figures.figures_mags_HiC_cov_boxplots(contigs_data, out_file)[source]¶

Function to plot barplots HiC_coverage coverage of the contigs for each bins. The count are weighted by the size of the contigs.

Parameters:¶

contigs_datapandas.DataFrame: Table with all the data from the contigs, the ubinned contigs should have removed and a column size_weight should have been add to weight with the size.
out_filestr: Path where to write the figure.

metator.figures.figures_mags_SG_cov_boxplots(contigs_data, out_file)[source]¶

Function to plot barplots assembly coverage of the contigs for each bins. The count are weighted by the size of the contigs.

Parameters:¶

contigs_datapandas.DataFrame: Table with all the data from the contigs, the ubinned contigs should have removed and a column size_weight should have been add to weight with the size.
out_filestr: Path where to write the figure.

metator.figures.generates_frags_network(contig_data, bin_summary, bin_size)[source]¶

Generates info fragments for all contigs and start position all bins.

Parameters:¶

contigs_datapandas.DataFrame: Table with all the data from the contigs.
bin_summarydict: Dictionnary with the informations of the final bins kept by MetaTOR.
bin_sizeint: Size of the bin to plot the heatmap.

Returns:¶

dict:: Dictionnary with frags id of the contigs.
list of int:: List of the start positions of the mags
int:: Size of the final binned matrix

metator.figures.network_heatmap(matrix, out_file=None, mag_starts=None)[source]¶

Plot the heatmap of the network, with the contig ordering by their bin attribution.

Parameters:¶

matrixnumpy.array: Binned dense matrix of the network.
out_filestr: Name of the file to save the plot.
mag_startslist of int: List of bin positions where to draw mags starts as dotted lines.

metator.figures.plot_figures(out_dir, contigs_data, bin_summary, threshold)[source]¶

Function to generates all figures.

Parameters:¶

out_dirstr: Path of the output directory of MetaTOR.
contigs_datapandas.DataFrame: Table with all the data from the contigs.
bin_summarydict: Dictionnary with the informations of the final bins kept by MetaTOR.
thresholdint: Minimun size of the bins considered.

metator.figures.reindex_df(dataframe, weight_col)[source]¶

Expand the dataframe to prepare for resampling result is 1 row per count per sample.

Parameters:¶

dataframepandas.core.frame.DataFrame: Table to expand.
weight_colstr: Name of the weighted column to use.

Returns:¶

pandas.core.frame.DataFrame:: Table expanded with new index.

metator.io module¶

Core tools to build I/O for metaTOR

This mdoule contains all core I/O functions:

check_checkm
check_fasta_index
check_is_fasta
check_louvain_cpp
check_pairix
check_pairtools
generate_fasta_index
generate_temp_dir
get_restriction_site
get_pairs_data
process_ligation_sites
read_bin_summary
read_compressed
read_contig_data
read_results_checkm
retrieve_fasta
sort_pairs
sort_pairs_pairtools
write_bin_summary

metator.io.check_checkm()[source]¶

Function to test if CheckM is in the path.

Returns:¶

bool:: True if checkM found in the path, False otherwise.

metator.io.check_fasta_index(ref, mode='bowtie2')[source]¶

Function from hicstuff.io (https://github.com/koszullab/hicstuff/) Checks for the existence of a bowtie2 or bwa index based on the reference file name.

Parameters:¶

refstr: Path to the reference genome.
modestr: The alignment software used to build the index. bowtie2 or bwa. If any other value is given, the function returns the reference path.

Returns:¶

str: The bowtie2 or bwa index basename. None if no index was found

metator.io.check_is_fasta(in_file)[source]¶

Function from hicstuff.io (https://github.com/koszullab/hicstuff/) Checks whether input file is in fasta format.

Parameters:¶

in_filestr: Path to the input file.

Returns:¶

bool :: True if the input is in fasta format, False otherwise

metator.io.check_louvain_cpp(louvain_path)[source]¶

Function to check is the Louvain functions are callable.

Parameters:¶

louvain_pathstr: Path of the directory where the Louvain functions are.

Returns:¶

bool:: Boolean value describing either the Louvain functions are callable or not.

metator.io.check_pairix()[source]¶

Function to test if pairix is in the path.

Returns:¶

bool:: True if pairix found in the path, False otherwise.

metator.io.check_pairtools()[source]¶

Function to test if pairtools is in the path.

Returns:¶

bool:: True if pairtools found in the path, False otherwise.

metator.io.generate_fasta_index(fasta, aligner, outdir)[source]¶

Generate fasta index.

Parameters:¶

fastastr: Path to the fasta reference to index.
alignerstr: Aligner to use to build the index.
outdirstr: Path to the directory to write the index.

Returns:¶

str:: Path to the bowtie2 index build

metator.io.generate_temp_dir(path)[source]¶

Temporary directory generation

Function from hicstuff.io (https://github.com/koszullab/hicstuff/) Generates a temporary file with a random name at the input path.

Parameters:¶

pathstr: The path at which the temporary directory will be created.

Returns:¶

str: The path of the newly created temporary directory.

metator.io.get_pairs_data(pairfile, threads=1, remove=False, force=False)[source]¶

Extract pairs data from pypairix indexed pairs file. If no pypairix indexed found, sort pairs files using pairtools executable.

Parameters

pairfile (str) – Path to the pairfile to extract data.
threads (int) – Number of threads to use to sort pairs if necessary. [Default: 1]
remove (bool) – If set to true, it will remove the unsorted pair file. [Default: False]
force (bool) – If set overwrite existing files. [Default: False]

Returns

Path to the sorted and indexed pair file.

Return type

str

metator.io.get_restriction_site(enzyme)[source]¶

Function to return a regex which corresponds to all possible restriction sites given a set of enzyme.

Parameters:¶

enzymestr: String that contains the names of the enzyme separated by a comma.

Returns:¶

str :: Regex that corresponds to all possible restriction sites given a set of enzyme.

Examples:¶

>>> get_restriction_site('DpnII')
'GATC'
>>> get_restriction_site('DpnII,HinfI')
'GA.TC|GATC'

metator.io.getrandbits(k) → x. Generates an int with k random bits.¶

metator.io.micomplete_results_to_dict(micomplete_file)[source]¶

Read micomplte output file and transfrom it as a dictionnary with the bin name as keys and bin information as values.

Parameters: micomplete_file (str) – Path to the micomplete output file.
Returns: Dictionnary of the output of miComplete as values and the bin id as keys.
Return type: dict

metator.io.read_bin_summary(bin_summary_file)[source]¶

Read bin summary file from metator pipeline.

Parameters: bin_summary_file (str) – Path to the bin summary file.
Returns: Table from bin summary with bin name as index.
Return type: pandas.DataFrame

metator.io.read_compressed(filename)[source]¶

Read compressed file

Function from hicstuff.io (https://github.com/koszullab/hicstuff/) Opens the file in read mode with appropriate decompression algorithm.

Parameters:¶

filenamestr: The path to the input file

Returns:¶

file-like object: The handle to access the input file’s content

metator.io.read_contig_data(contig_data_file)[source]¶

Read bin summary file from metator pipeline.

Parameters: contig_data_file (str) – Path to the contig data file.
Returns: Table from contig data with contig name as index.
Return type: pandas.DataFrame

metator.io.read_results_checkm(checkm_file, checkm_taxonomy_file)[source]¶

Function to transform the output summary file of checkm into a dictionnary.

Parameters:¶

checkm_filestr: Path to the summary output file of CheckM.

Returns:¶

dict:: Dictionnary of the output of checkm with binID as keys and with three values lineage, completness and contamination.

metator.io.retrieve_fasta(in_file, aligner, tmpdir)[source]¶

Function to retrieve fasta from the given reference file. If index is given retrieve it using bowtie2 inspect. Thraw an error if not a fasta or bowtie2 index.

Parameters:¶

in_filestr: Path to the reference file given.
alignerstr: Name of the aligner used. Either ‘bowtie2’ or ‘bwa’.
tmpdirstr: Path to the temp directory to write the fasta if necessary.

Returns:¶

str:: Path to the fasta file.

metator.io.save_sparse_matrix(s_mat, path)[source]¶

Save a sparse matrix

Saves a sparse matrix object into tsv format.

Parameters

s_mat (scipy.sparse.coo_matrix) – The sparse matrix to save on disk
path (str) – File path where the matrix will be stored

metator.io.sort_pairs(in_file, out_file, tmp_dir=None, threads=1, buffer='2G')[source]¶

Adapted function from hicstuff.io (https://github.com/koszullab/hicstuff/) Sort a pairs file in batches using UNIX sort.

Parameters:¶

in_filestr: Path to the unsorted input file
out_filestr: Path to the sorted output file.
keyslist of str: list of columns to use as sort keys. Each column can be one of readID, chr1, pos1, chr2, pos2, frag1, frag2. Key priorities are according to the order in the list.
tmp_dirstr: Path to the directory where temporary files will be created. Defaults to current directory.
threadsint: Number of parallel sorting threads.
bufferstr: Buffer size used for sorting. Consists of a number and a unit.

metator.io.sort_pairs_pairtools(pairfile, threads=1, remove=False, force=False)[source]¶

Sort pairs files using pairtools executable. Pairix only works with compressed pair files. So we use bgzip to compress them.

Parameters

pairfile (str) – Path to the pairfile to sort, compress and index.
threads (int) – Number of threads to use. [Default: 1]
remove (bool) – If set to true, it will remove the unsorted pair file. [Default: False]
force (bool) – If set overwrite existing files. [Default: False]

Returns

Path to the sorted and indexed pair file.

Return type

str

metator.io.write_bin_summary(bin_summary, bin_summary_file)[source]¶

Function to write the bin summary from dictionnary to table text file.

Parameters:¶

bin_summarydict: Dictionnary with the output of the miComplete of the bins.
bin_summary_filestr: Path to the output file to write the summary informations of the bins.

metator.log module¶

Basic logging setup for metaTOR

Basic logging setup with three main handlers (stdout, log file and optionally texting with the right API). By default the log file is disabled, but can be enabled or changed using set_file_handler.

metator.log.set_file_handler(log_path, formatter=<logging.Formatter object>)[source]¶: Change the file handler for custom log file location

metator.log.setup_text_logging(credentials='text_credentials.json')[source]¶

Setup text logging

Setup text notifications on errors with a basic API. In order for it to work, a file named ‘text_credentials.json’ must be in the directory where a command is run. The logger performs a GET request to a text notification API.

Parameters: credentials (str or pathlib.Path) – A JSON file with necessary credentials for the GET request

metator.main module¶

MetaHiC pipeline for generating and manipulating MAGs.

usage:

metator [-hv] <command> [<args>…]

options:

-h, --help: shows the help
-v, --version: shows the version

The subcommands are:

network Generate metaHiC contigs network from fastq reads or bam files: and normalize it.

partition Partition metaHiC network using Louvain or Leiden algorithm. pipeline Use all the others command to give a binned output in one

command line.

validation Validates bins using CheckM and make a recursive partition to: try to decontaminate them.

qc Generates some quality check on the output of metator. contactmap Generates a HiC contact map from one metaTOR object from the

final ouptut of metaTOR.

scaffold Scaffold a metator bin based on pairs files. pairs Sort the pairs file using pairtools. Compress them using bgzip.

Index them using pairix.

metator.main.main()[source]¶

metator.network module¶

Generate metaHiC contigs network from fastq reads or bam files.

General utility functions for handling aligment files from align module and generating metaHiC networks.

Core function to build the network are:

alignment_to_contacts
compute_network
create_contig_data
normalize_pair
precompute_network
write_contig_data
write_hit_data

metator.network.alignment_to_contacts(alignment_files, contig_data, edge, hit_data, out_dir, output_file_network, output_file_contig_data, tmp_dir, n_cpus, normalization, self_contacts)[source]¶

Generates a network file (in edgelist form) from an alignment. Contigs are the network nodes and the edges are the contact counts.

The network is in a strict barebone form so that it can be reused and imported quickly into other applications etc. Verbose information about every single node in the network is written on a ‘contig data’ file.

Parameters:¶

alignment_fileslist of str: List of path to the alignment file(s) used as input.
contig_datadict: Dictionnary of the all the contigs from the assembly, the contigs names are the keys to the data of the contig available with the following keys: “id”, “length”, “GC”, “hit”, “coverage”. Coverage still at 0 and need to be updated later.
edgeint: Distance of the edge region in base pair on the contigs where the mapping reads are not considered as inter contigs.
hit_datadict:: Dictionnary for hit information on each contigs.
out_dirstr: The output directory to write the network and chunk data into.
output_file_networkstr, optional: The specific file name for the output network file. Default is ‘network.txt’
output_file_contig_datastr, optional: The specific file name for the output chunk data file. Default is ‘idx_contig_length_GC_hit_cov.txt’
tmp_dirstr: Path to th temporary directory. Default in the working directory
normalizationstr: If None, do not normalized the count of a contact by the geometric mean of the coverage of the contigs. Otherwise it’s the type of normalization.
self_contactsbool: Whether to return network with self contact. Default is False.

Returns:¶

str:: Path to the network file.
str:: Path to the verbose contig data file.

metator.network.compute_network(pre_network_file, network_file, contig_data, tmp_dir, tmp_file, n_cpus, normalization)[source]¶

Generate the network file from the prenetwork file.

Parameters:¶

pre_network_filestr: Path of the input file (prenetwork file, output from teh precompute network function)
network_filestr: Path of the output file (network file).
contig_datadict: Dictionnary of the all the contigs from the assembly, the contigs names are the keys to the data of the contig available with the following keys: “id”, “length”, “GC”, “hit”, “coverage”, “RS”.
tmp_dirstr: Path to the temporary directory.
tmp_filestr: Path of the temporary file to write the sorted precompute network.
n_cpusint: Number of cpus used to sort the prenetwork.
normalizationstr: If None, do not normalized the count of a contact by the geometric mean of the coverage of the contigs. Otherwise it’s the type of normalization.

metator.network.create_contig_data(assembly, nb_alignment=1, depth_file=None, enzyme=None)[source]¶

Create a dictionnary with data on each Contig.

Parameters:¶

assemblystr: Path to the assembly fasta file
nb_alignementint: Numbers of alignment files. [Default: 1]
depth_filestr or None: Path to the depth.txt file from jgi_summarize_bam_contig_depths from Metabat2 Software. [Default: None]
enzymestr or None: String that contains the names of the enzyme separated by a comma. [Default: None]

Returns:¶

dict:: Dictionnary of the all the contigs from the assembly, the contigs names are the keys to the data of the contig available with the following keys: “id”, “length”, “GC”, “hit”, “coverage”, “RS”. Hit is set to 0 and need to be updated later.
dict:: Dictionnary for hit information on each contigs.

metator.network.normalize_pair(contig_data, pair, n_occ, normalization)[source]¶

Function to do the normalization of an inter contig contact depending on the mode of normalisation given.

Parameters:¶

contig_datadict: Dictionnary of the all the contigs from the assembly, the contigs names are the keys to the data of the contig available with the following keys: “id”, “length”, “GC”, “hit”, “coverage”, “RS”.
pairtuple: Tuple of the id of the contigs contributing to the contact.
n_occint: Contact count between the two contigs.
normalizationstr: Mode of normalization to use.

Returns:¶

float:: Normalized contact count.

metator.network.precompute_network(alignment_files, contig_data, edge, hit_data, out_file, tmp_dir, self_contacts=False)[source]¶

Write a file with only the contig id separated by a tabulation and count the contacts by contigs to be able to compute directlty the normalized network.

Parameters:¶

alignment_fileslist of str: List of path to the alignment file(s).
contig_datadict: Dictionnary of the all the contigs from the assembly, the contigs names are the keys to the data of the contig available with the following keys: “id”, “length”, “GC”, “hit”, “coverage”. Coverage still at 0 and need to be updated later.
edgeint: Distance of the edge region in base pair on the contigs where the mapping reads are not considered as inter contigs.
hit_datadict: Dictionnary with the count of hits for each aligment file.
out_filestr: Path to the write the output_file which will be necessary to compute the network.
tmp_dirstr: Directory where to store temporary files.
self_contactsbool: If True, the contacts on the same contigs will be kept. Otherwise only displays the inter contigs contacts. [Default False]

Return:¶

dict:: Dictionnary of the all the contigs from the assembly, the contigs names are the keys to the data of the contig available with the following keys: “id”, “length”, “GC”, “hit”, “coverage” “RS”. Coverage still at 0 and need to be updated later.

metator.network.write_contig_data(contig_data, output_path)[source]¶

Function to write the contig data file at the output path given. The file will contains 7 columns separated by a tabulation: id, name, length, GC_content, hit, coverage, restriction site for each contig with an header.

Parameters:¶

contig_datadict: Dictionnary of the all the contigs from the assembly, the contigs names are the keys to the data of the contig available with the following keys: “id”, “length”, “GC”, “hit”, “coverage”, “RS”.
output_pathstr: Path to the output file where the data from the dictionnary will be written

metator.network.write_hit_data(hit_data, output_path)[source]¶

Function to write the contig data file at the output path given. The file will contains 6 columns separated by a tabulation: id, name, length, GC_content, hit, coverage for each contig.

Parameters:¶

hit_datadict: Dictionnary of the all the contigs from the assembly, the contigs names are the keys to a list of hits from each alignment files separately.
output_pathstr: Path to the output file where the data from the dictionnary will be written.

metator.partition module¶

Partition metaHiC network using Louvain or Leiden algorithm.

General utility function for generating bins from the network file using Louvain algorithm. If an overlapping threshold is given, it will use the Hamming distance to group together bins relatively closed (to avaoid to split genomes in different bins).

Core functions to partition the network are:

algo_partition
build_clustering_matrix
defined_overlapping_bins
detect_core_bins
generate_fasta
get_distances_splitmat
get_hamming_distance
leiden_iterations_java
louvain_iterations_cpp
partition
remove_isolates
update_contigs_data

Deprecated Spinglass functions:

spinglass_partition

metator.partition.algo_partition(algorithm='louvain', network_file=None, network=None, iterations=10, resolution_parameter=1.0, tmpdir='.', spin=2)[source]¶

Function to partition the network depednding on the used algorithm.

Parameters:¶

algorithmstr: Algorithm to use to partition network. [Default: louvain]
network_filestr: Path to the network computed previously. The file is 3 columns table separated by a tabulation with the id of the first contigs the id of the second one and the weights of the edge normalized or not. Mandatory if louvain or leiden algorithm. [Default: None]
networknetworkx.classes.graph.Graph: Network of interaction of a contaminated bins. Mandatory if spinglass algorithm. [Default: None]
iterationsint: Number of iterations of the algorithm of Leiden or Louvain. [Default: 10]
resolution_parameterfloat: Resolution parameter for Leiden clustering. [Default: 1.0]
tmp_dirstr: Path to the temporary directory. [Default: current directory]
spinint: Deprecated. Number of final cluster if spinglass algorithm chosen. [Default: 2]

Returns:¶

dict:: Dictionnary with the id of the contig as key and the list of the results of each iterations separated by a semicolon as values.

metator.partition.build_clustering_matrix(core_bins_contigs, hamming_distance, N)[source]¶

Function to return the clustering matrix in sparse format.

For each contigs, the value correspond to the number of iterations where the contigs are clusterized together divided by the number of iterations. A value of 1 means that the contigs are in the same core bin.

Parameters:¶

core_bins_contigsdict: Dictionnary which has as keys the core bins id and as value the id of the contigs of the core bin.
hamming_distancescipy.sparse.csr.csr_matrix:: Matrix with all the previously computed hamming distance between two core bins.
Nint: Number of contigs in the assembly.

Returns:¶

scipy.sparse.coo.coo_matrix:: Matrix with all the previously computed hamming distance between two contigs.

metator.partition.defined_overlapping_bins(overlap, hamming_distance, core_bins_contigs)[source]¶

This function extract the overlapped bins

From the hamming distances between the core bins, the function identifies the overlapping bins and create a dictionnary with the list of the contigs ID for each core bin.

Two core bins are considered overlapping if there have a percentage of identity superior or equal to the threshold given.

Parameters:¶

overlapfloat: hamming distance threshold use to consider that two bins are overlapping.
hamming_distancescipy.sparse.csr.csr_matrix: Matrix with all the previously computed hamming distance between two core bins.
core_bins_contigsdict: Dictionnary which has as keys the core bins id and as value the id of the contigs of the core bin.

Returns:¶

dict:: A dictionnary with the id of the overlapping bins as keys and the list of id of their contigs as values.

metator.partition.detect_core_bins(output_partition, iterations)[source]¶

Detect core bins from the output of the partition algorithm.

The function search for duplicated values in the output of Louvain or Leiden algorithm in order to find contigs which are always in the same bin. The bins find with this method are called the core bins.

Parameters:¶

output_partitiondict: Dictionnary with the id of the contig as key and the list of the results of each iterations separated by a semicolon as values.
iterationsint: Number of iterations made previously with the partition algorithm.

Returns:¶

dict:: Dictionnary which has as keys the core bins id and as value the id of the contigs of the core bin.
pandas.core.frame.DataFrame:: Table with the id of the core bin and their values for each iterations.

metator.partition.generate_fasta(assembly, overlapping_bins, contigs_data, size, output_dir, tmpdir, prefix)[source]¶

Generate the fasta files of each bins from the assembly.

Parameters:¶

assemblystr: Path to the fasta file of the original assembly.
overlapping_binsdict: A dictionnary with the id of the overlapping bins as keys and the list of id of their contigs as values.
contigs_datapandas.core.frame.DataFrame: Table with all the information on the contigs included their appartenance to the bins.
sizeint: Thrshold size chosen to write the bins.
output_dirstr: Path to the output directory where the fasta of all the bin will be written.
tmpdirstr: Path to the temporary directory to write the temporary contigs list files.
prefixstr: Sample prefix to use.

metator.partition.get_distances_splitmat(bins, core_bins_iterations)[source]¶

This function takes a segment of the full iterative clustering matrix and computes, for each index (i.e. contig), the hamming distance to each of the other indices.

Parameters:¶

binspandas.core.frame.DataFrame: Slice of the table with the id of the core bin and their values for each iterations.
core_bins_iterationspandas.core.frame.DataFrame: Table with the id of the core bin and their values for each iterations.

Returns:¶

scipy.sparse.csr.csr_matrix:: matrix of the distance of the possible pairs from the slice of the table and the table itself.

metator.partition.get_hamming_distance(core_bins_iterations, threads)[source]¶

Generate matrix of Hamming distances between all pairs of core bins.

Parameters:¶

core_bins_iterationspandas.core.frame.DataFrame: Table with the id of the core bin as index and their values for each iterations.
threadsint: Number of cores to parallelize computation.

Returns:¶

scipy.sparse.csr.csr_matrix:: Matrix with all the previously computed hamming distance between two core bins.

metator.partition.leiden_iterations_java(network_file, iterations, resolution_parameter, tmp_dir, leiden_path)[source]¶

Use the java implementation of Leiden to partition the network.

Parameters:¶

network_filestr: Path to the network computed previously. The file is 3 columns table separated by a tabulation with the id of the first contigs the id of the second one and the weights of the edge normalized or not.
iterationsint: Number of iterations of the algorithm of Leiden.
resolution_parameterfloat: Resolution parameter for Leiden clustering.
tmp_dirstr: Path to the temporary directory.
leiden_pathstr: Path to the directory with network analysis java implementation.

Returns:¶

dict:: Dictionnary with the id of the contig as key and the list of the results of each iterations separated by a semicolon as values.

metator.partition.louvain_iterations_cpp(network_file, iterations, tmp_dir, louvain_path)[source]¶

Use the cpp original Louvain to partition the network.

Parameters:¶

network_filestr: Path to the network computed previously. The file is 3 columns table separated by a tabulation with the id of the first contigs the id of the second one and the weights of the edge normalized or not.
iterationsint: Number of iterations of the algorithm of Louvain.
tmp_dirstr: Path to the temporary directory.
louvain_pathstr: Path to the directory with louvain functions.

Returns:¶

dict:: Dictionnary with the id of the contig as key and the list of the results of each iterations separated by a semicolon as values.

metator.partition.partition(algorithm, assembly, cluster_matrix, contig_data_file, iterations, network_file, outdir, fasta_dir, overlapping_parameter, resolution_parameter, size, temp_directory, threads, prefix)[source]¶

Function to call the others functions to partition the network.

Parameters:¶

algorithmstr: Algorithm to use to partition the network. Either leiden or louvain.
assemblystr: Path to the assembly file used for the partition.
cluster_matrixbool: If True, build and save the clustering matrix.
contig_data_filestr: Path to the contig data table to update.
iterationsint: Number of iterations to use for the partition.
network_filestr: Path to the network file.
outdirstr: Path to the output directory where to write the output files.
fasta_dirstr: Path to directory where to write the fasta files.
overlapping_parameterint: Hamming distance threshold to use to merge bins (percentage).
resolution_parameterfloat: Resolution parameter to use if Leiden algorithm is chosen. It will be a factor of the cost function used. A resolution parameter of 1 will be equivalent as the modularity function used in Louvain. Higher these parameters, smaller the bins will be in the output.
sizeint: Threshold size in base pair of the output bins.
temp_directorystr: Path to the directory used to write temporary files.
threadsint: Number of threads to use.
prefixstr: Sample prefix to use.

Returns:¶

scipy.sparse.coo.coo_matrix:: Matrix with all the previously computed hamming distance between two contigs.
str:: Path to the new contig data file with the bin informations in it.

metator.partition.remove_isolates(output_partition, network_file)[source]¶

Remove isolates, i.e. nodes without any contacts in the network in the partition. This step is necessary as it will slow the further process of the communities. This function is only useful while using Leiden algorithm.

Parameters:¶

output_partitiondict: Dictionnary with the id of the contig as key and the list of the results of each iterations separated by a semicolon as values.
network_filestr: Path to the network computed previously. The file is 3 columns table separated by a tabulation with the id of the first contigs the id of the second one and the weights of the edge normalized or not.

Returns:¶

dict:: Dictionnary with the id of the contig as key and the list of the results of each iterations separated by a semicolon as values without isolates.

metator.partition.spinglass_partition(subnetwork, spins=2)[source]¶

Use spinglass function from cdlib to partition the network.

This function is only used in the validation step as it will take a long time to run on a large network.

Parameters:¶

subnetworknetworkx.classes.graph.Graph: Network of interaction of a contaminated bins.
spinsint: Number of expected MAGs in the contaminated bins.

Returns: dict:

Dictionnary with the id of the contig as key and the clustering result as values.

metator.partition.update_contigs_data(contig_data_file, core_bins_contigs, overlapping_bins, outdir)[source]¶

Add bin information in the contigs data file.

This function allow to update the contigs data file which were created previously in the network functions with the columns: contig id, contig name, contig length, GC content, hit, coverage, restriction site. The function will add six columns: core bin id, core bin number of contigs, core bin length, overlapping bin id, overlapping bin number of contigs, overlapping bin length.

Parameters:¶

contig_data_filestr: Path to the contigs data file.
core_bins_contigsdict: Dictionnary which has as keys the core bins id and as value the id of the contigs of the core bin.
overlapping_binsdict: A dictionnary with the id of the overlapping bins as keys and the list of id of their contigs as values.
outdirstr: Path of the output directory to write the update contigs data file.

Returns:¶

pandas.core.frame.DataFrame:: Table with all the information on the contigs included their appartenance to the bins.

metator.validation module¶

Validates bins using miComplete and make a recursive partition to try to decontaminate them.

General functions to validate bins completion using miComplete and make recursive iterations of Louvain or Leiden to try to partition contaminated bins. Only bins with more than 50% completion and 5% contamination are subject to the recursive step. If the recursive step gave worst results than the first (decrease of the completion with no decrease of the contamination), it will keep the original bin.

Functions in this module:

correct_final_bin
get_bin_coverage
give_results_info
merge_micomplete
micomplete_compare_bins
micomplete_quality
recursive_clustering
recursive_clustering_worker
recursive_decontamination
update_contigs_data_recursive
write_bins_contigs

CheckM deprecated functions:

checkm
checkm_compare_bins

metator.validation.checkm(fasta_dir, outfile, taxonomy_file, tmpdir, threads)[source]¶

Function to evaluate fasta bins using CheckM. Write the checkM results summary in the outfile and the taxonomy results in the the taxonomy file.

Parameters:¶

fasta_dirstr: Path to the input fasta of the bins to evaluate.
outfilestr: Path to the file where the results of checkm will be written.
taxonomy_filestr: path to the file where checkm taxonomy results will be written.
tmpdirstr: Path to the temporary directory where CheckM intermediary files will be written.
threadsint: Numbers of threads to use for CheckM.

metator.validation.checkm_compare_bins(overlapping_checkm_file, overlapping_taxonomy_file, recursive_checkm_file, recursive_taxonomy_file, prefix)[source]¶

Compare the completness and contamination of the bins from the first step and from the recursive step. If the recursive step decrease the completion of one bin without decreasing its contamination, it will kept the bin before the recursive step. Moreover if the completion goes below 50% with the recursive it will not take the recursive bins.

Parameters:¶

overlapping_checkm_filestr: Path to the checkm summary from the overlapping step.
overlapping_taxonomy_filestr: path to the overlapping checkm taxonomy results file.
recursive_checkm_filestr: Path to the checkm summary from the recursive step.
recursive_taxonomy_filestr: path to the recursive checkm taxonomy results file.
prefixstr: Sample prefix to use.

Returns:¶

dict:: Dictionnary with the informations of the final bins kept by MetaTOR.

metator.validation.correct_final_bin(contigs_data, final_fasta_dir, bin_summary)[source]¶

Function to compute the coverage of each bin.

Parameters:¶

bin_summarydict: Dictionnary with the informations of the final bins kept by MetaTOR.
contigs_datapandas.core.frame.DataFrame: Dataframe with the contigs informations.

Returns:¶

pandas.core.frame.DataFrame: Dataframe with the contigs informations.

metator.validation.get_bin_coverage(bin_summary, contigs_data)[source]¶

Function to compute the coverage of each bin.

Parameters:¶

bin_summarydict: Dictionnary with the informations of the final bins kept by MetaTOR.
contigs_datapandas.core.frame.DataFrame: Dataframe with the contigs informations.

Returns:¶

dict:: Dictionnary with the informations of the final bins kept by MetaTOR with the coverage.

metator.validation.give_results_info(bin_summary)[source]¶

Function to return the general information about the binning results.

Parameters:¶

bin_summarydict: Dictionnary with the summary results of the kept bins.

metator.validation.merge_micomplete(out_bact105, out_arch131, outfile)[source]¶

Function to merge bacterial and archaeal output of micomplete into one file.

Parameters: out_bact105 : str

Output of micomplete using 105 bacterial markers.

out_arch131str: Output of micomplete using 131 arcaheal markers.
outfilestr: Final merged output.

metator.validation.micomplete_compare_bins(recursive_micomplete_file, bin_summary, parent_dict, step)[source]¶

Parameters:¶

recursive_micomplete_filestr: Path to the miComplete summary from the recursive step.
bin_summarydict: Dictionnary containing iinformation about the bins.
parent_dictdict: Dictionnary with recursive bin_id as key and parent bin as values.
stepint: Recursive step.

Returns:¶

dict:: Dictionnary with the informations of the final bins kept by MetaTOR.

metator.validation.micomplete_quality(fasta_dir, outfile, threads)[source]¶

Function to evaluate fasta bins using miComplete. Write the bins quality summury in the outfile.

Parameters:¶

fasta_dirstr: Path to the input fasta of the bins to evaluate.
outfilestr: Path to the file where the results of miComplete will be written.
threadsint: Numbers of threads to use for miComplete.

metator.validation.recursive_clustering(assembly, iterations, overlapping_parameter, resolution_parameter, outdir, recursive_fasta_dir, algorithm, tmpdir, bin_summary, contigs_data, network, cluster_matrix, size, threads, prefix)[source]¶

Function to run recursive iterations on contaminated bins in order to try to improve the quality of the bins using Louvain or Leiden algorthm.

Parameters:¶

assemblystr: Path to the fasta file used as assembly.
iterationsint: Number of iterations to use for recursive iterations of Louvain or Leiden.
overlapping_parameterfloat: Hamming distance threshold to consider two bins as the same bin.
resolution parameterfloat: Resolution parameter of Leiden algorithm.
outdirstr: Path to the output directory.
recursive_fasta_dirstr: Path to the directory where to write the decontaminated fasta.
algorithmstr: Algorithm to use, either louvain, leiden or spinglass.
tmpdirstr: Path the temp directory.
bin_summarydict: Dictionnary containing iinformation about the bins.
contigs_datapandas.DataFrame: Table with all the data from the contigs.
networkstr: Metator full network.
cluster_matrixbool: If True, build the clustering matrix and save it.
sizeint: Size threshodl in base pairs of the bins.
threadsint: Number of threads to use.
prefixstr: Sample prefix to use.

Returns:¶

boolean:: True if at least one new recursive bin has been generated.
pandas.DataFrame: Updated dictionnary which has as keys the values of the iterations from the recursive partition separated by a semicolon and as values the list of the id of the contigs.
scipy.sparse.coo.coo_matrix:: Matrix with all the previously computed hamming distance between two contigs.
dict: Dictionnary with recursive bin_id as key and parent bin as values.

metator.validation.recursive_clustering_worker(bin_id, bin_summary, tmpdir, network, algorithm, iterations, resolution_parameter, contigs_data)[source]¶

Worker to partition one bin if it’s contaminated.

Parameters:¶

bin_summarydict: Dictionnary of the output of micomplete as values and the bin id as keys.
tmpdirstr: Path the temp directory.
algorithmstr: Algorithm to use, either louvain, leiden or spinglass.
networknetworkx.classes.graph.Graph: Network of contigs from HiC librairies.
iterationsint: Number of iterations to use for recursive iterations of Louvain or Leiden.
resolution parameterfloat: Resolution parameter of Leiden algorithm.
contigs_data_filestr: Path to the contigs data file from metator partition.

metator.validation.recursive_decontamination(algorithm, assembly, cluster_matrix, contig_data_file, final_fasta_dir, input_fasta_dir, iterations, network_file, outdir, overlapping_parameter, recursive_fasta_dir, resolution_parameter, size, temp_directory, threads, prefix)[source]¶

Function to validate bins do the recursive decontamination using Louvain or Leiden algorithm

Parameters:¶

algorithmstr: Algorithm to use to recursively partition the network. Either leiden, louvain or spinglass.
assemblystr: Path to the assembly file used for the partition.
cluster_matrixbool: If True, build the clustering matrix and save it.
contig_data_filestr: Path to the contig data table to update.
final_fasta_dirstr: Path to write the final fasta decontaminated bins.
input_fasta_dirstr: Path to the directory where the fasta bin from the partition are.
iterationsint: Number of iterations to use for the recursive partition.
network_filestr: Path to the network file.
outdirstr: Path to the output directory where to write the output files.
overlapping_parameterint: Hamming distance threshold in percentage to use to consider to bins as one in the recursive partition.
recursive_fasta_dirstr: Path to write the fasta decontaminated bins.
resolution_parameterfloat: Resolution parameter to use if Leiden algorithm is chosen. It will be a factor of the cost function used. A resolution parameter of 1 will be equivalent as the modularity function used in Louvain. Higher these parameters, smaller the bins will be in the output.
sizeint: Threshold size in base pair of the output bins.
temp_directorystr: Path to the directory used to write temporary files.
threadsint: Number of threads to use.
prefixstr: Sample prefix to use.

Returns:¶

scipy.sparse.coo.coo_matrix:: Matrix with all the previously computed hamming distance between two contigs.

metator.validation.update_contigs_data_recursive(bin_id, contigs_data, recursive_bins, assembly, outdir, tmpdir, size, contamination, parent_dict, prefix)[source]¶

Update the data of the bin according to the recursive step and generated their fasta.

Parameters:¶

bin_idstr: Name of the parental bin.
contigs_datapandas.DataFrame: Table with all the data from the contigs.
recursive_binsdict: Dictionnary which has as keys the values of the recursive iterations from Louvain or Leiden separated by a semicolon and as values the list of the id of the contigs.
assemblystr: Path to the fasta file.
outdirstr: Path to the output directory to write the new fasta.
tmpdirstr: Path to the file where to write the temporary contigs list.
sizeint: Size threshold to generate fasta.
contaminationboolean: True if one bin has already been generated, false otherwise.
parent_dictdict: Dictionnary with recursive bin_id as key and parent bin as values.
prefixstr: Sample prefix to use.

Returns:¶

boolean:: True if one bin has already been generated, false otherwise.
pandas.DataFrame: Updated dictionnary which has as keys the values of the iterations from the recursive partition separated by a semicolon and as values the list of the id of the contigs.
dict: Dictionnary with recursive bin_id as key and parent bin as values.

metator.validation.write_bins_contigs(bin_summary, contigs_data, outfile, prefix)[source]¶

Function to write a table with the nodes kept in the bins and their bin id. The file is adapted to be added in anvio.

Parameters:¶

bin_summarydict: Dictionnary with the informations of the final bins kept by MetaTOR.
contigs_datapandas.core.frame.DataFrame: Dataframe with the contigs informations.
outfilestr: Path where to write the output file.
prefixstr: Sample prefix to use.

Returns:¶

pandas.core.frame.DataFrame: Dataframe with the contigs informations with the final bin information.