metator package¶
Submodules¶
metator.align module¶
Alignement of reads
Align read files onto the assembly and return a tsv file with 9 columns: ReadID, ContigA, Position_startA, Position_endA, StrandA, ContigB, Position_startB, Position_endB, StrandB. Reads are mapped separately, sorted by names, then interleaved (rather than mapped in paired-end mode) to capture the pairs mapping on two different contigs.
- This module contains all these alignment functions:
align
get_contact_pairs
merge_alignement
process_bamfile
process_bwa_bamfile
- metator.align.align(fq_in, index, aligner, bam_out, n_cpu, tmp_dir, iterative=False, fq_in_2=None)[source]¶
Alignment Aligns reads of fq_in with bowtie2. Parameters of bowtie2 are set as –very-sensitive-local and only the aligned reads are kept in the bam file. Multiple files could be given, but only one file will be written as output.
Parameters:¶
- fq_instr
Path to input fastq file to align.
- indexstr
Path to the bowtie2 index genome.
- alignerstr
Name of the aligner algorithm to use. Either bowtie2 or bwa.
- bam_outstr
Path where the alignment should be written in BAM format.
- n_cpuint
The number of CPUs to use for the alignment.
- tmp_dir:
Path to directory to store temporary files.
- iterative:
Wether to use the iterative mapping procedure (truncating reads and extending iteratively)
- fq_in_2str
Path to the reverse fastq file for bwa (map both reads at the same time but do not take into account their poistion as pairs.)
Returns:¶
- int :
0
- metator.align.get_contact_pairs(for_in, rev_in, index, assembly, aligner, aligner_mode, min_qual, start, depth_file, enzyme, out_dir, tmp_dir, n_cpu)[source]¶
General function to do the whole alignment of both fastq.
The Function write at the output directory location given as an argument and return a tsv file of the aligned reads with 9 columns: ReadID, ContigA, Position_startA, Position_endA, StrandA, ContigB, Position_startB, Position_endB, StrandB. The name of the file will be alignment.txt.
Two start stages are possible, from fastq or bam files.
- Parameters
for_in (str) – Path to input forward fastq or bam file to align. If multiple files are given, list of path separated by a comma.
rev_in (str) – Path to input reverse fastq or bam file to align. If multiple files are given, list of path separated by a comma.
index (str) – Path to the bowtie2 index of the assembly.
assembly (str) – The initial assembly path acting as the alignment file’s reference assembly.
aligner (str) – Either ‘bowtie2’ or ‘bwa’ aligner used or to be use to map the reads.
aligner_mode (str) – Either ‘normal’, ‘iterative’ or ‘cutsiste’. Mode to align the HiC reads.
min_qual (int) – Minimum mapping quality required to keep Hi-C pairs.
start (str) – Either fastq or bam. Starting point for the pipeline.
depth_file (str or None) – Path to the depth.txt file from jgi_summarize_bam_contig_depths from Metabat2 Software.
enzyme (str or None) – String that contains the names of the enzyme separated by a comma.
out_dir (str) – Path to directory where to write the output file.
tmp_dir (str) – Path where temporary files should be written.
n_cpu (int) – The number of CPUs to use for the alignment.
- Returns
list of str – List of path of the Files with the table containing the alignement data of the pairs: ReadID, ContigA, Position_startA, Position_endA, StrandA, ContigB, Position_startB, Position_endB, StrandB.
dict – Dictionnary of the all the contigs from the assembly, the contigs names are the keys to the data of the contig available with the following keys: “id”, “length”, “GC”, “hit”, “coverage”. Coverage still at 0 and need to be updated later.
dict – Dictionnary for hit information on each contigs.
- metator.align.merge_alignment(forward_aligned, reverse_aligned, contig_data, out_file)[source]¶
Merge forward and reverse alignment into one file with only pairs which have both reads are aligned on the genome with 9 columns: ReadID, ContigA, Position_startA, Position_endA, StrandA, ContigB, Position_startB, Position_endB, StrandB.
- Parameters
forward_alignement (str) – File with the table containing the data of the forward reads kept after the alignment. With five columns: ReadID, Contig, Position_start, Position_end, strand.
reverse_alignement (str) – File with the table containing the data of the forward reads kept after the alignment. With five columns: ReadID, Contig, Position_start, Position_end, strand.
contig_data (dict) – Dictionnary of the all the contigs from the assembly, the contigs names are the keys to the data of the contig available with the following keys: “id”, “length”, “GC”, “hit”, “coverage”. Coverage still at 0 and need to be updated later.
out_file (str) – Path to write the output pairs file.
- Returns
File with the table containing the alignement data of the pairs: ReadID, ContigA, Position_startA, Position_endA, StrandA, ContigB, Position_startB, Position_endB, StrandB
- Return type
str
- metator.align.process_bamfile(alignment, min_qual, filtered_out)[source]¶
Filter alignment BAM files
Reads all the reads in the input BAM alignment file. Keep reads in the output if they are aligned with a good quality (greater than min quality threshold given) saving their only some columns: ReadID, Contig, Position_start, Position_end, strand to save memory.
- Parameters
alignment (str) – Path to the input temporary alignment.
min_qual (int) – Minimum mapping quality required to keep a Hi-C pair.
filtered_out (str) – Path to the output temporary tsv alignement.
- Returns
Number of reads aligned.
- Return type
int
- metator.align.process_bwa_bamfile(alignment, min_qual, contig_data, out_file)[source]¶
Filter alignment BAM files
Reads all the reads in the input BAM alignment file. Keep reads in the output if they are aligned with a good quality (greater than min quality threshold given) saving their only some columns: ReadID, Contig, Position_start, Position_end, strand to save memory.
- Parameters
alignment (str) – Path to the input temporary alignment.
min_qual (int) – Minimum mapping quality required to keep a Hi-C pair.
contig_data (dict) – Dictionnary of the all the contigs from the assembly, the contigs names are the keys to the data of the contig available with the following keys: “id”, “length”, “GC”, “hit”, “coverage”. Coverage still at 0 and need to be updated later.
out_file (str) – Path to the output pairs file.
- Returns
Number of pairs aligned.
- Return type
int
metator.commands module¶
Abstract command classes for metaTOR
- This module contains all classes related to metaTOR commands:
network
partition
pipeline
validation
qualitycheck
contactmap
Note
Structure based on Rémy Greinhofer (rgreinho) tutorial on subcommands in docopt : https://github.com/rgreinho/docopt-subcommands-example abignaud, 20201118
- raises NotImplementedError
Will be raised if AbstractCommand is called for some reason instead of one of its children.
- class metator.commands.AbstractCommand(command_args, global_args)[source]¶
Bases:
object
Abstract base command class
Base class for the commands from which other metaTOR commands derive.
- class metator.commands.Contactmap(command_args, global_args)[source]¶
Bases:
AbstractCommand
Generates a HiC contact map of a MetaTOR object from the pairs files, the contig data file and a fasta file containing the contigs of interest.
Generates the Hi-C matrix of one MetaTOR object form the pair alignment file of metaTOR. MetaTOR object integrated are contigs, and core, overlapping, recursive, final or personalized bins. For the personalized bins you should add a column in the contig_data_final.txt file with a header called “Other”.
Do not display the matrix, you have to run another pipeline to display it such hicstuff view or cooler show depending on the metrix format chosen. It’s also possible to try to scaffold the bin using instagraal pipeline using the output.
- usage:
contactmap –assembly=FILE –contig-data=FILE –enzyme=STR –name=STR [–filter] [–force] [–mat-fmt=graal] [–min-size=5000] [–no-clean-up] [–object=final_bin] [–outdir=DIR] [–pcr-dup] [–tmpdir=DIR] [–threads=1] <pairsfile>…
- argument:
pairsfile File(s) containing pairs information.
- options:
- -a, --assembly=FILE
Path to the fasta file containing the contigs of interest. Could be the whole or the extracted contigs of one bin.
- -c, --contig-data=FILE
Path to the contig_data_final.txt file from MetaTOR output.
- -e, --enzyme=STR
The list of restriction enzyme used to digest the contigs separated by a comma. Example: HpaII,MluCI.
- -n, --name=STR
Name of the metator or its numerical ID for the core bin, overlapping bin or recursive bin objects. Example: “NODE_1”, “MetaTOR_1_0” or 8.
- -D, –pcr-dup, Filter out PCR duplicates based on read
positions.
- -f, --filter
Filter out spurious 3C events (loops and uncuts) using hicstuff filter. For more informations, see Cournac et al. BMC Genomics, 2012.
- -F, --force
Write files even if the output files already exists.
- -m, --mat-fmt=STR
The format of the output sparse matrix. Can be “bg2” for 2D Bedgraph format, “cool” for Mirnylab’s cooler software, or “graal” for graal-compatible plain text COO format. [default: graal]
- -N, --no-clean-up
If enabled intermediary files will be kept.
- -o, --outdir=DIR
Output directory. Default creates a new directory contact_map in the current directory.
- -O, --object=STR
Type of metator object used to build a contact map. Either “contig”, “core_bin”, “overlapping_bin”, “recursive_bin”, “final_bin” or “other”. [Default: final_bin]
- -s, --min-size=INT
Minimum size threshold to consider contigs. [Default: 5000]
- -t, --threads=INT
Number of threads to allocate. [Default: 1]
- -T, --tmpdir=DIR
Directory for storing intermediary files and temporary files. Default creates a “tmp” folder in the current directory.
- class metator.commands.Network(command_args, global_args)[source]¶
Bases:
AbstractCommand
Generation of network command
Generate metaHiC contigs network from fastq reads or bam files.Generates a network file (in edgelist form) from either fastq files or bam files. For both starting input forward and reverse reads or alignment need to be in seprated files with the same read names. In the output network contigs are the network nodes and the edges are the contact counts.
If multiple fastq are given will align them seprately and return a hit file with a number of hit for each sample.
Note: the network is in a strict barebone form so that it can be reused and imported quickly into other applications etc. Verbose information about every single node in the network is written on a ‘contig data’ file.
- usage:
network –forward=STR –assembly=FILE [–reverse=STR] [–aligner=bowtie2] [–aligner-mode=normal] [–depth=FILE] [–edge=0] [–enzyme=STR] [–normalization=empirical_hit] [–no-clean-up] [–outdir=DIR] [–min-quality=30] [–self-contacts] [–start=fastq] [–threads=1] [–tmpdir=DIR]
- options:
- -1, --forward=STR
Fastq file or list of Fastq separated by a comma containing the forward reads to be aligned or their corresponding bam or pairs files.
- -2, --reverse=STR
Fastq file or list of Fastq separated by a comma containing the reverse reads to be aligned or their corresponding bam files. Forward and reverse reads need to have the same identifier (read names). If start is set to pair, no argument is necessary.
- -a, --assembly=FILE
The initial assembly path acting as the alignment file’s reference genome or the basename of the bowtie2 index.
- -b, --aligner=STR
Aligner algorithm to use. Either “bwa” or “bowtie2”. [Default: bowtie2]
- -B, --aligner-mode=STR
Mode of alignment from hicstuff. Either normal, iterative or cutsite. [Default: normal]
- -d, --depth=FILE
The depth.txt file from the shotgun reads used to made the assembly computed by jgi_summarize_bam_contig_depths from metabat2 pipeline.
- -E, --edge=INT
Distance of the edge region in base pair on the contigs where the mapping reads are not considered as inter contigs. [Default: 0]
- -e, --enzyme=STR
The list of restriction enzyme used to digest the contigs separated by a comma. Example: HpaII,MluCI.
- -n, –normalization=STR If None, do not normalized the count of a
contact by the geometric mean of the coverage of the contigs. Otherwise it’s the type of normalization. 6 values are possible “None”, “abundance”, “length”, “RS”, “empirical_hit”, “theoritical_hit”. [Default: empirical_hit]
- -N, --no-clean-up
Do not remove temporary files.
- -o, --outdir=DIR
The output directory to write the bam files the network and contig data into. Default: current directory.
- -q, --min-quality=INT
Threshold of quality necessary to considered a read properly aligned. [Default: 30]
- -s, --self-contacts
If enabled, count alignments between a contig and itself (intracontigs contacts).
- -S, --start=STR
Start stage of the pipeline. Either “fastq”, “bam”, or “pair”. [Default: fastq]
- -t, --threads=INT
Number of parallel threads allocated for the alignement. [Default: 1]
- -T, --tmpdir=DIR
Temporary directory. Default to current directory. [Default: ./tmp]
- class metator.commands.Pairs(command_args, global_args)[source]¶
Bases:
AbstractCommand
Sort and pairs files for faster assess to the data.
Sort the pairs file using pairtools. Compress them using bgzip. Index them using pairix.
- usage:
pairs [–force] [–remove] [–threads=1] <pairsfile>…
- Parameters
File (pairsfile) –
- options:
- -F, --force
Write files even if the output files already exists.
- -r, --remove
Remove the input file at the end to keep only the sorted, compressed and indexed pairs file.
- -t, --threads=INT
Numbers of thread to allocate. [Default: 1]
- class metator.commands.Partition(command_args, global_args)[source]¶
Bases:
AbstractCommand
Partition the network using Louvain algorithm
Partition the network file using iteratively the Louvain or Leiden algorithm. Then looks for ‘cores’ bins that are constituted by the contigs which are always in the same cluster at each iterations. Then using the Hamming distance from these cores bins, merege the bins with a small Hamming distance (close to 1), bigger than the percentage (overlapping parameter) given.
It will also update the file to integrate the bins information of the contigs.
Furthermore, both Leiden and Louvain algorithm are available here. However, the benchmark made show that here the Louvain algorithm have better performance and is faster on seawater and gut metagenomic samples.
Note that the Louvain or Leiden software are not, in the strictest sense, necessary. Any program that assigns a node to a bin, does so non deterministically and solely outputs a list in the form: ‘node_id bin_id’ could be plugged instead.
- usage:
partition –assembly=FILE –contigs=FILE –network=FILE [–algorithm=louvain] [–cluster-matrix] [–force] [–iterations=100] [–no-clean-up] [–outdir=DIR] [–overlap=80] [–prefix=STR] [–res-param=1.0] [–size=500000] [–threads=1] [–tmpdir=DIR]
- options:
- -a, --assembly=FILE
The path to the assembly fasta file used to do the alignment.
- -A, --algorithm=STR
Either “louvain” or “leiden”, algorithm to use to partition the network. [Default: louvain]
- -c, --contigs=FILE
The path to the tsv file containing the data of the contigs (ID, Name, Length, GC content, Hit, Coverage, Restriction Site).
- -C, --cluster-matrix
If enabled, save the clustering matrix.
- -F, --force
If enabled, would remove directory of overlapping bins in the output directory.
- -i, --iterations=INT
Number of iterations of Louvain. [Default: 100]
- -n, --network=FILE
Path to the file containing the network information from the meta HiC experiment compute in network function previously.
- -N, --no-clean-up
Do not remove temporary files.
- -o, --outdir=DIR
Path to the directory to write the output. Default to current directory. [Default: ./]
- -O, --overlap=INT
Hamming distance threshold to use to merge bins (percentage). [Default: 80]
- -p, --prefix=STR
Prefix to use for fasta files. By default just ‘metator_’, otherwise ‘STR_metator_’.
- -r, --res-param=FLOAT
Resolution paramter to use for Leiden algorithm. [Default: 1.0]
- -s, --size=INT
Threshold size to keep bins in base pair. [Default: 500000]
- -t, --threads=INT
Number of parallel threads allocated for the partition. [Default: 1]
- -T, --tmpdir=DIR
Temporary directory. Default to current directory. [Default: ./tmp]
- class metator.commands.Pipeline(command_args, global_args)[source]¶
Bases:
AbstractCommand
Launch the full metator pipeline
Partition the contigs from teh assembly in bins from the metaHiC reads.
It’s possible to start from the fastq, the bam, the pair or the network files. It’s will also possible to ask or not to run a validation step which will decontaminate the bins when it’s necessary. However as it’s the critical step for memory usage (~40G), it’s possible to skip these step.
- usage:
pipeline –assembly=FILE [–forward=STR] [–reverse=STR] [–algorithm=louvain] [–aligner=bowtie2] [–aligner-mode=normal] [–cluster-matrix] [–depth=FILE] [–edge=0] [–enzyme=STR] [–force] [–iterations=100] [–rec-iter=10] [–junctions=NNNNN] [–no-clean-up] [–normalization=empirical_hit] [–outdir=DIR] [–overlap=80] [–prefix=STR] [–rec-overlap=90] [–min-quality=30] [–res-param=1.0] [–size=500000] [–start=fastq] [–scaffold] [–threads=1] [–tmpdir=DIR]
- options:
- -1, --forward=STR
Fastq file or list of Fastq separated by a comma containing the forward reads to be aligned or their corresponding bam or pairs files.
- -2, --reverse=STR
Fastq file or list of Fastq separated by a comma containing the reverse reads to be aligned or their corresponding bam files. Forward and reverse reads need to have the same identifier (read names). If pair start is used no argument is necessary.
- -a, --assembly=FILE
The initial assembly path acting as the alignment file’s reference genome or the basename of the bowtie2 index.
- -A, --algorithm=STR
Algorithm to use. Either “louvain”, “leiden” or “spinglass”. If spinglass is chosen, the first partition will be done using louvain [Default: louvain]
- -b, --aligner=STR
Aligner algorithm to use. Either “bwa” or “bowtie2”. [Default: bowtie2]
- -B, --aligner-mode=STR
Mode of alignment from hicstuff. Either normal, iterative or cutsite. [Default: normal]
- -C, --cluster-matrix
If enabled, save the clustering matrix.
- -d, --depth=FILE
The depth.txt file from the shotgun reads used to made the assembly computed by jgi_summarize_bam_contig_depths from metabat2 pipeline.
- -E, --edge=INT
Distance of the edge region in base pair on the contigs where the mapping reads are not considered as inter contigs. [Default: 0]
- -e, --enzyme=STR
The list of restriction enzyme used to digest the contigs separated by a comma. Example: HpaII,MluCI.
- -F, --force
If enable, would remove directory of overlapping bins in the output directory.
- -i, --iterations=INT
Number of iterations of Louvain for the partition step. [Default: 100]
- -j, --rec-iter=INT
Number of iterations of Louvain for the recursive step. [Default: 10]
- -J, --junctions=STR
Sequences to use as junction between contigs. [Default: NNNNN]
- -N, --no-clean-up
Do not remove temporary files.
- -m, –normalization=STR If None, do not normalized the count of a
contact by the geometric mean of the coverage of the contigs. Otherwise it’s the type of normalization. 6 values are possible None, abundance, length, RS, empirical_hit, theoritical_hit. [Default: empirical_hit]
- -o, --outdir=DIR
The output directory to write the bam files the network and contig data into. Default: current directory.
- -O, --overlap=INT
Hamming distance threshold to use to merge bins for the first step (percentage). [Default: 80]
- -p, --prefix=STR
Prefix to use for fasta files. By default just ‘metator_’, otherwise ‘STR_metator_’.
- -P, --rec-overlap=INT
Hamming distance threshold to use to merge bins for the recursive step (percentage). [Default: 90]
- -q, --min-quality=INT
Threshold of quality necessary to considered a read properly aligned. [Default: 30]
- -r, --res-param=FLOAT
Resolution paramter to use for Leiden algorithm. [Default: 1.0]
- -s, --size=INT
Threshold size to keep bins in base pair. [Default: 500000]
- -S, --start=STR
Start stage of the pipeline. Either fastq, bam, or pair. [Default: fastq]
- -t, --threads=INT
Number of parallel threads allocated for the alignement. [Default: 1]
- -T, --tmpdir=DIR
Temporary directory. Default to current directory. [Default: ./tmp]
- -v, --scaffold
If enables, it will sacffold the genomes at the end.
- class metator.commands.Qc(command_args, global_args)[source]¶
Bases:
AbstractCommand
Generates some quality check on the output of metator.
The quality check is done on the high quality MAGs and only contigs witha size bigger than 100kb. The goal is to assess HiC quality metrics on genomes we kinda know to have a estimation of the quality of the library.
If the plot are set it will return some plot of the event distribution on the high quality contigs.
- usage:
qc –assembly=FILE –enzyme=STR [–bin-summary=FILE] [–contig-data=FILE] [–metator-dir=DIR] [–no-clean-up] [–outdir=DIR] [–prefix=STR] [–plot] [–threshold=STR] [–tmpdir=DIR] <pairsfile>…
- Parameters
File (pairsfile) –
- options:
- -a, --assembly=FILE
Path to the fasta file containing the contigs of interest. Could be the whole or the extracted contigs of one bin.
- -b, --bin-summary=FILE
Path to the bin_summary.txt file from MetaTOR output.
- -c, --contig-data=FILE
Path to the contig_data_final.txt file from MetaTOR output.
- -e, --enzyme=STR
The list of restriction enzyme used to digest the contigs separated by a comma. Example: HpaII,MluCI.
- -n, --no-clean-up
If enabled intermediary files will be kept.
- -O, --metator-dir=DIR
Output directory from metator pipeline. If set contigs data and bin summary will be found automatically.
- -o, --outdir=DIR
Directory to save output plots and log.
- -p, --prefix=STR
Name of the sample to add on plot and files.
- -P, --plot
If enable display some plots.
- -t, --threshold=STR
Hicstuff religation and loop thresholds. Two integers seperated by a coma.
- -T, --tmpdir=DIR
Temporary directory to save pairs files
- class metator.commands.Scaffold(command_args, global_args)[source]¶
Bases:
AbstractCommand
Scaffold a bin from metator.
Use the pairs to build a scaffold from the binned contigs. The scaffolding is based on contact between the edges of the contigs. There are normalized based on intra contacts of the contigs except for the small contigs where a low score is given.
- usage:
scaffold –bin-name=STR –input-fasta=FILE [–junctions=NNNNN] [–out-fasta=FILE] [–out-frags=FILE] [–threads=1] [–threshold=0.05] [–window-size=5000] <pairsfile>…
- Parameters
File (pairsfile) –
- options:
- -b, --bin-name=STR
Name of the bin to scaffold.
- -i, --input-fasta=FILE
Path to the input fasta.
- -j, --junctions=STR
Sequences to use as junction between contigs. [Default: NNNNN]
- -o, --out-fasta=FILE
Path to write the output fasta. Default in the current directory: ‘”$bin_name”_scaffolded.fa’.
- -O, --out-frags=FILE
Path to write the fragments information. Default in the current directory: ‘”$bin_name”_info_frags.txt’.
- -t, --threads=INT
Numbers of threads to allocate if pairs need to be sorted. [Default: 1]
- -T, --threshold=FLOAT
Threshold score to consider an association between two contigs. [Default: 0.05]
- -w, --window-size=INT
Size of the window in base pair to use as edge window to scaffold. [Default: 5000]
- class metator.commands.Validation(command_args, global_args)[source]¶
Bases:
AbstractCommand
Use CheckM to validate the bins.
Recursively decontaminate the contaminated bins evaluated by checkM. The script returns new decontaminated fasta and summary files of the decontamination.
Only bins with more than 50% completion and 5% contamination are subject to the recursive step. If the recursive step gave worst results than the first (decrease of the completion with no decrease of the contamination), it will keep the original bin.
- usage:
validation –assembly=FILE –contigs=FILE –fasta=DIR –network=FILE [–algorithm=louvain] [–cluster-matrix] [–force] [–iterations=10] [–no-clean-up] [–outdir=DIR] [–overlap=90] [–prefix=STR] [–res-param=1.0] [–size=500000] [–threads=1] [–tmpdir=DIR]
- options:
- -a, --assembly=FILE
The path to the assembly fasta file used to do the alignment.
- -A, --algorithm=STR
Algorithm to use. Either “louvain”, “leiden” or “spinglass”. [Default: louvain]
- -c, --contigs=FILE
The path to the file containing the data of the contigs from the partition step (13 columns).
- -C, --cluster-matrix
If enabled, save the clustering matrix.
- -f, --fasta=DIR
Path to the directory containing the input fasta files of the bins to decontaminate.
- -F, --force
If enable, would remove directory of recursive bins in the output directory.
- -i, --iterations=INT
Number of recursive iterations of Louvain. [Default: 10]
- -n, --network=FILE
Path to the file containing the network information from the meta HiC experiment compute in network function previously.
- -N, --no-clean-up
Do not remove temporary files.
- -o, --outdir=DIR
Path to the directory to write the output. Default to current directory. [Default: ./]
- -O, --overlap=INT
Hamming distance threshold to use to merge bins (percentage). [Default: 90]
- -p, --prefix=STR
Prefix to use for fasta files. By default just ‘metator_’, otherwise ‘STR_metator_’.
- -r, --res-param=FLOAT
Resolution paramter to use for Leiden algorithm. [Default: 1.0]
- -s, --size=INT
Threshold size to keep bins in base pair. [Default: 500000]
- -t, --threads=INT
Number of parallel threads allocated for the partition. [Default: 1]
- -T, --tmpdir=DIR
Temporary directory. Default to current directory. [Default: ./tmp]
metator.contact_map module¶
Generates a HiC contact map of a MetaTOR object.
General utility functions and class for handling one MetaTOR object and for generating its HiC contact map using hicstuff pipeline.
- Core class to handle bin object easily:
MetatorObject
- Core function to build the network are:
extract_pairs
generate_contact_map
- class metator.contact_map.MetatorObject(metator_object, name, assembly, contig_data_file, pairs, min_size=5000)[source]¶
Bases:
object
Class to handle MetaTOR object informations and to extract easily aligned pairs from the object given.
The class is initiate by the MetaTOR object type, its name, the assembly file, the contig data file from MetaTOR and the pairs files.
It’s possible to add a threshold on the contigs size.
- get_contigs_data()[source]¶
Method to get contigs data Dataframe
Returns:¶
- pandas.core.frame.DataFrame
Table with informations from metaTOR binning for each contigs.
- set_contigs()[source]¶
Method to extract the list of contigs of the class from the contig data file and their size.
- set_large_contigs()[source]¶
Method to keep only contigs bigger than the threshold given to remove small bins which will be smaller than the binning value and which will prevent a good scaffolding with Instagraal.
- set_metator_object(metator_object, name)[source]¶
Method to get the metator object and name of the object usable for the algorithm.
Parameters:¶
- metator_objectstr Object to extract contigs to build the matrix.
Either “contig”, “core_bin”, “overlapping_bin”, “recursive_bin”, “final_bin” or “other”.
- namestr
Name of the object. Could be the name of a contig, an id of a bin or the name of the bin. Example: “NODE_1” or “MetaTOR_1_0”.
- write_fasta(tmp_dir, out_dir)[source]¶
Method to write the new fasta with only the contigs of interest in the outdir directory.
Parameters:¶
- tmp_dirstr
Path to the temporary directory to write the list of contigs of the object.
- out_dirstr
Path to the output directory where to write the new fasta file.
- metator.contact_map.extract_pairs(metator_data)[source]¶
Function to extract the pairs of a MetaTOR object from pairs files.
Parameters:¶
- metator_dataobject from class MetatorObject
Object of the class MetatorObject. It should have the parameters contigs and the output pairs file of the metaTOR object set.
Returns:¶
- int:
Number of pairs from the MetaTOR object.
- metator.contact_map.generate_contact_map(assembly, contig_data_file, enzyme, name, pairs, out_dir, tmp_dir, filter_events=False, force=False, mat_fmt='graal', metator_object='final_bin', min_size=5000, pcr_duplicates=False, threads=1)[source]¶
General function to extract pairs of the MetaTOR object and generate its the contact map.
Parameters:¶
- assemblystr
Path to the fasta file containing the contigs of interest. Could be the whole or the extracted contigs of one bin.
- contig_data_filestr
Path to the contig_data_final.txt file form MetaTOR output.
- enzymestr
Enzyme used to digest the genome in the HiC experiment. Example: HpaII,MluCI.
- namestr
Name of the object. Could be the name of a contig, an id of a bin or the name of the bin. Example: “NODE_1” or “MetaTOR_1_0”.
- pairsstr
Path of the “.pairs” file or bgzip indexed pair file. If more than one is given, files should be separated by a comma.
- out_dirstr
Path where output files should be written. Current directory by default.
- tmp_dirstr
Path where temporary files will be written.
- filter_eventsbool
Filter spurious or uninformative 3C events. Requires a restriction enzyme. Default: False.
- forcebool
If True, overwrite existing files with the same name as output. Default: False.
- mat_fmtstr
Select the output matrix format. Can be either “bg2” for the bedgraph2 format, “cool” for Mirnylab’s cool format, or graal for a plain text COO format compatible with Koszullab’s instagraal software. Default: “graal”.
- metator_objectstr
Object to extract contigs to build the matrix. Either “contig”, “core_bin”, “overlapping_bin”, “recursive_bin”, “final_bin” or “other”.
- min_sizeint
Minimum contig size required to keep it.
- pcr_duplicatesbool
If True, PCR duplicates will be filtered based on genomic positions. Pairs where both reads have exactly the same coordinates are considered duplicates and only one of those will be conserved. Default: False.
- threadsint
Numbers of threads to use. Default: 1.
metator.figures module¶
Figures foor metaTOR output.
General utility functions to plot some figures to describe metaTOR output.
- Core functions to plot the firgures are:
build_matrix_network
figures_bins_distribution
figures_bins_size_distribution
figure_camembert_quality
figures_mags_GC_boxplots
figures_mags_HiC_cov_boxplots
figures_mags_SG_cov_boxplots
generates_frags_network
network_heatmap
plot_figures
reindex_df
- metator.figures.build_matrix_network(pairs_files, info_contigs, N, bin_size)[source]¶
Build a binned matrix from the pairs alignement files and the final binning information in the contig data file.
Parameters:¶
- pairs_fileslist of str
List of path of the alignment “.pairs” file.
- info_contigs: dict
Dictionnary with frags id of the contigs.
- Nint
Size of the final bin matrix.
- bin_sizeint
Size of the bin to plot the heatmap.
Returns:¶
- numpy.array:
Binned dense matrix of the network.
- metator.figures.figure_camembert_quality(out_file, prefix, n_religated, n_loops, n_weirds, n_informative_intra, n_informative_inter, n_intra_mags, n_inter_mags, uncut_thr, loop_thr)[source]¶
Functon to plot nice hicstuff camembert from metator high quality contigs.
- metator.figures.figures_bins_distribution(bin_summary, out_file)[source]¶
Function to plot the distribution of the estimated completion and contamination of the final bins generated by metaTOR. The bins are ordered by decreasing completion and contamination values superior to 105% are set to 105%.
Parameters:¶
- bin_summarypandas.core.frame.DataFrame
Table with the informations about the final bins.
- out_filestr
Path where to save the figure.
- metator.figures.figures_bins_size_distribution(bin_summary, total_size, threshold, out_file)[source]¶
Function to plot a camembert of the fraction of size corresponding to each quality of bins. We defined 6 categories:
High quality MAGs (HQ MAGs): >= 90% completion ; < 5% contamination
Medium quality MAGs (MQ MAGs): >= 70% completion ; < 10% contamination
Low quality MAGs (LQ MAGs): >= 50% completion ; < 10% contamination
Contaminated bins: >= 50% completion ; >= 10% contamination
Bins superior to size threshold but with less 50% completion.
Unbinned contigs
Parameters:¶
- bin_summarypandas.core.frame.DataFrame
Table with the informations about the final bins.
- total_sizeint
Size of the whole assembly.
- thresholdint
Minimun size of the bins considered.
- out_filestr
Path where to save the figure.
- metator.figures.figures_mags_GC_boxplots(contigs_data, out_file)[source]¶
Function to plot barplots GC content of the contigs for each bins. The count are weighted by the size of the contigs.
Parameters:¶
- contigs_datapandas.DataFrame
Table with all the data from the contigs, the ubinned contigs should have removed and a column size_weight should have been add to weight with the size.
- out_filestr
Path where to write the figure.
- metator.figures.figures_mags_HiC_cov_boxplots(contigs_data, out_file)[source]¶
Function to plot barplots HiC_coverage coverage of the contigs for each bins. The count are weighted by the size of the contigs.
Parameters:¶
- contigs_datapandas.DataFrame
Table with all the data from the contigs, the ubinned contigs should have removed and a column size_weight should have been add to weight with the size.
- out_filestr
Path where to write the figure.
- metator.figures.figures_mags_SG_cov_boxplots(contigs_data, out_file)[source]¶
Function to plot barplots assembly coverage of the contigs for each bins. The count are weighted by the size of the contigs.
Parameters:¶
- contigs_datapandas.DataFrame
Table with all the data from the contigs, the ubinned contigs should have removed and a column size_weight should have been add to weight with the size.
- out_filestr
Path where to write the figure.
- metator.figures.generates_frags_network(contig_data, bin_summary, bin_size)[source]¶
Generates info fragments for all contigs and start position all bins.
Parameters:¶
- contigs_datapandas.DataFrame
Table with all the data from the contigs.
- bin_summarydict
Dictionnary with the informations of the final bins kept by MetaTOR.
- bin_sizeint
Size of the bin to plot the heatmap.
Returns:¶
- dict:
Dictionnary with frags id of the contigs.
- list of int:
List of the start positions of the mags
- int:
Size of the final binned matrix
- metator.figures.network_heatmap(matrix, out_file=None, mag_starts=None)[source]¶
Plot the heatmap of the network, with the contig ordering by their bin attribution.
Parameters:¶
- matrixnumpy.array
Binned dense matrix of the network.
- out_filestr
Name of the file to save the plot.
- mag_startslist of int
List of bin positions where to draw mags starts as dotted lines.
- metator.figures.plot_figures(out_dir, contigs_data, bin_summary, threshold)[source]¶
Function to generates all figures.
Parameters:¶
- out_dirstr
Path of the output directory of MetaTOR.
- contigs_datapandas.DataFrame
Table with all the data from the contigs.
- bin_summarydict
Dictionnary with the informations of the final bins kept by MetaTOR.
- thresholdint
Minimun size of the bins considered.
- metator.figures.reindex_df(dataframe, weight_col)[source]¶
Expand the dataframe to prepare for resampling result is 1 row per count per sample.
Parameters:¶
- dataframepandas.core.frame.DataFrame
Table to expand.
- weight_colstr
Name of the weighted column to use.
Returns:¶
- pandas.core.frame.DataFrame:
Table expanded with new index.
metator.io module¶
Core tools to build I/O for metaTOR
- This mdoule contains all core I/O functions:
check_checkm
check_fasta_index
check_is_fasta
check_louvain_cpp
check_pairix
check_pairtools
generate_fasta_index
generate_temp_dir
get_restriction_site
get_pairs_data
process_ligation_sites
read_bin_summary
read_compressed
read_contig_data
read_results_checkm
retrieve_fasta
sort_pairs
sort_pairs_pairtools
write_bin_summary
- metator.io.check_checkm()[source]¶
Function to test if CheckM is in the path.
Returns:¶
- bool:
True if checkM found in the path, False otherwise.
- metator.io.check_fasta_index(ref, mode='bowtie2')[source]¶
Function from hicstuff.io (https://github.com/koszullab/hicstuff/) Checks for the existence of a bowtie2 or bwa index based on the reference file name.
Parameters:¶
- refstr
Path to the reference genome.
- modestr
The alignment software used to build the index. bowtie2 or bwa. If any other value is given, the function returns the reference path.
Returns:¶
- str
The bowtie2 or bwa index basename. None if no index was found
- metator.io.check_is_fasta(in_file)[source]¶
Function from hicstuff.io (https://github.com/koszullab/hicstuff/) Checks whether input file is in fasta format.
Parameters:¶
- in_filestr
Path to the input file.
Returns:¶
- bool :
True if the input is in fasta format, False otherwise
- metator.io.check_louvain_cpp(louvain_path)[source]¶
Function to check is the Louvain functions are callable.
Parameters:¶
- louvain_pathstr
Path of the directory where the Louvain functions are.
Returns:¶
- bool:
Boolean value describing either the Louvain functions are callable or not.
- metator.io.check_pairix()[source]¶
Function to test if pairix is in the path.
Returns:¶
- bool:
True if pairix found in the path, False otherwise.
- metator.io.check_pairtools()[source]¶
Function to test if pairtools is in the path.
Returns:¶
- bool:
True if pairtools found in the path, False otherwise.
- metator.io.generate_fasta_index(fasta, aligner, outdir)[source]¶
Generate fasta index.
Parameters:¶
- fastastr
Path to the fasta reference to index.
- alignerstr
Aligner to use to build the index.
- outdirstr
Path to the directory to write the index.
Returns:¶
- str:
Path to the bowtie2 index build
- metator.io.generate_temp_dir(path)[source]¶
Temporary directory generation
Function from hicstuff.io (https://github.com/koszullab/hicstuff/) Generates a temporary file with a random name at the input path.
Parameters:¶
- pathstr
The path at which the temporary directory will be created.
Returns:¶
- str
The path of the newly created temporary directory.
- metator.io.get_pairs_data(pairfile, threads=1, remove=False, force=False)[source]¶
Extract pairs data from pypairix indexed pairs file. If no pypairix indexed found, sort pairs files using pairtools executable.
- Parameters
pairfile (str) – Path to the pairfile to extract data.
threads (int) – Number of threads to use to sort pairs if necessary. [Default: 1]
remove (bool) – If set to true, it will remove the unsorted pair file. [Default: False]
force (bool) – If set overwrite existing files. [Default: False]
- Returns
Path to the sorted and indexed pair file.
- Return type
str
- metator.io.get_restriction_site(enzyme)[source]¶
Function to return a regex which corresponds to all possible restriction sites given a set of enzyme.
Parameters:¶
- enzymestr
String that contains the names of the enzyme separated by a comma.
Returns:¶
- str :
Regex that corresponds to all possible restriction sites given a set of enzyme.
Examples:¶
>>> get_restriction_site('DpnII') 'GATC' >>> get_restriction_site('DpnII,HinfI') 'GA.TC|GATC'
- metator.io.getrandbits(k) x. Generates an int with k random bits. ¶
- metator.io.micomplete_results_to_dict(micomplete_file)[source]¶
Read micomplte output file and transfrom it as a dictionnary with the bin name as keys and bin information as values.
- Parameters
micomplete_file (str) – Path to the micomplete output file.
- Returns
Dictionnary of the output of miComplete as values and the bin id as keys.
- Return type
dict
- metator.io.read_bin_summary(bin_summary_file)[source]¶
Read bin summary file from metator pipeline.
- Parameters
bin_summary_file (str) – Path to the bin summary file.
- Returns
Table from bin summary with bin name as index.
- Return type
pandas.DataFrame
- metator.io.read_compressed(filename)[source]¶
Read compressed file
Function from hicstuff.io (https://github.com/koszullab/hicstuff/) Opens the file in read mode with appropriate decompression algorithm.
Parameters:¶
- filenamestr
The path to the input file
Returns:¶
- file-like object
The handle to access the input file’s content
- metator.io.read_contig_data(contig_data_file)[source]¶
Read bin summary file from metator pipeline.
- Parameters
contig_data_file (str) – Path to the contig data file.
- Returns
Table from contig data with contig name as index.
- Return type
pandas.DataFrame
- metator.io.read_results_checkm(checkm_file, checkm_taxonomy_file)[source]¶
Function to transform the output summary file of checkm into a dictionnary.
Parameters:¶
- checkm_filestr
Path to the summary output file of CheckM.
Returns:¶
- dict:
Dictionnary of the output of checkm with binID as keys and with three values lineage, completness and contamination.
- metator.io.retrieve_fasta(in_file, aligner, tmpdir)[source]¶
Function to retrieve fasta from the given reference file. If index is given retrieve it using bowtie2 inspect. Thraw an error if not a fasta or bowtie2 index.
Parameters:¶
- in_filestr
Path to the reference file given.
- alignerstr
Name of the aligner used. Either ‘bowtie2’ or ‘bwa’.
- tmpdirstr
Path to the temp directory to write the fasta if necessary.
Returns:¶
- str:
Path to the fasta file.
- metator.io.save_sparse_matrix(s_mat, path)[source]¶
Save a sparse matrix
Saves a sparse matrix object into tsv format.
- Parameters
s_mat (scipy.sparse.coo_matrix) – The sparse matrix to save on disk
path (str) – File path where the matrix will be stored
- metator.io.sort_pairs(in_file, out_file, tmp_dir=None, threads=1, buffer='2G')[source]¶
Adapted function from hicstuff.io (https://github.com/koszullab/hicstuff/) Sort a pairs file in batches using UNIX sort.
Parameters:¶
- in_filestr
Path to the unsorted input file
- out_filestr
Path to the sorted output file.
- keyslist of str
list of columns to use as sort keys. Each column can be one of readID, chr1, pos1, chr2, pos2, frag1, frag2. Key priorities are according to the order in the list.
- tmp_dirstr
Path to the directory where temporary files will be created. Defaults to current directory.
- threadsint
Number of parallel sorting threads.
- bufferstr
Buffer size used for sorting. Consists of a number and a unit.
- metator.io.sort_pairs_pairtools(pairfile, threads=1, remove=False, force=False)[source]¶
Sort pairs files using pairtools executable. Pairix only works with compressed pair files. So we use bgzip to compress them.
- Parameters
pairfile (str) – Path to the pairfile to sort, compress and index.
threads (int) – Number of threads to use. [Default: 1]
remove (bool) – If set to true, it will remove the unsorted pair file. [Default: False]
force (bool) – If set overwrite existing files. [Default: False]
- Returns
Path to the sorted and indexed pair file.
- Return type
str
- metator.io.write_bin_summary(bin_summary, bin_summary_file)[source]¶
Function to write the bin summary from dictionnary to table text file.
Parameters:¶
- bin_summarydict
Dictionnary with the output of the miComplete of the bins.
- bin_summary_filestr
Path to the output file to write the summary informations of the bins.
metator.log module¶
Basic logging setup for metaTOR
Basic logging setup with three main handlers (stdout, log file and optionally texting with the right API). By default the log file is disabled, but can be enabled or changed using set_file_handler.
- metator.log.set_file_handler(log_path, formatter=<logging.Formatter object>)[source]¶
Change the file handler for custom log file location
- metator.log.setup_text_logging(credentials='text_credentials.json')[source]¶
Setup text logging
Setup text notifications on errors with a basic API. In order for it to work, a file named ‘text_credentials.json’ must be in the directory where a command is run. The logger performs a GET request to a text notification API.
- Parameters
credentials (str or pathlib.Path) – A JSON file with necessary credentials for the GET request
metator.main module¶
MetaHiC pipeline for generating and manipulating MAGs.
- usage:
metator [-hv] <command> [<args>…]
- options:
- -h, --help
shows the help
- -v, --version
shows the version
- The subcommands are:
- network Generate metaHiC contigs network from fastq reads or bam files
and normalize it.
partition Partition metaHiC network using Louvain or Leiden algorithm. pipeline Use all the others command to give a binned output in one
command line.
- validation Validates bins using CheckM and make a recursive partition to
try to decontaminate them.
qc Generates some quality check on the output of metator. contactmap Generates a HiC contact map from one metaTOR object from the
final ouptut of metaTOR.
scaffold Scaffold a metator bin based on pairs files. pairs Sort the pairs file using pairtools. Compress them using bgzip.
Index them using pairix.
metator.network module¶
Generate metaHiC contigs network from fastq reads or bam files.
General utility functions for handling aligment files from align module and generating metaHiC networks.
- Core function to build the network are:
alignment_to_contacts
compute_network
create_contig_data
normalize_pair
precompute_network
write_contig_data
write_hit_data
- metator.network.alignment_to_contacts(alignment_files, contig_data, edge, hit_data, out_dir, output_file_network, output_file_contig_data, tmp_dir, n_cpus, normalization, self_contacts)[source]¶
Generates a network file (in edgelist form) from an alignment. Contigs are the network nodes and the edges are the contact counts.
The network is in a strict barebone form so that it can be reused and imported quickly into other applications etc. Verbose information about every single node in the network is written on a ‘contig data’ file.
Parameters:¶
- alignment_fileslist of str
List of path to the alignment file(s) used as input.
- contig_datadict
Dictionnary of the all the contigs from the assembly, the contigs names are the keys to the data of the contig available with the following keys: “id”, “length”, “GC”, “hit”, “coverage”. Coverage still at 0 and need to be updated later.
- edgeint
Distance of the edge region in base pair on the contigs where the mapping reads are not considered as inter contigs.
- hit_datadict:
Dictionnary for hit information on each contigs.
- out_dirstr
The output directory to write the network and chunk data into.
- output_file_networkstr, optional
The specific file name for the output network file. Default is ‘network.txt’
- output_file_contig_datastr, optional
The specific file name for the output chunk data file. Default is ‘idx_contig_length_GC_hit_cov.txt’
- tmp_dirstr
Path to th temporary directory. Default in the working directory
- normalizationstr
If None, do not normalized the count of a contact by the geometric mean of the coverage of the contigs. Otherwise it’s the type of normalization.
- self_contactsbool
Whether to return network with self contact. Default is False.
Returns:¶
- str:
Path to the network file.
- str:
Path to the verbose contig data file.
- metator.network.compute_network(pre_network_file, network_file, contig_data, tmp_dir, tmp_file, n_cpus, normalization)[source]¶
Generate the network file from the prenetwork file.
Parameters:¶
- pre_network_filestr
Path of the input file (prenetwork file, output from teh precompute network function)
- network_filestr
Path of the output file (network file).
- contig_datadict
Dictionnary of the all the contigs from the assembly, the contigs names are the keys to the data of the contig available with the following keys: “id”, “length”, “GC”, “hit”, “coverage”, “RS”.
- tmp_dirstr
Path to the temporary directory.
- tmp_filestr
Path of the temporary file to write the sorted precompute network.
- n_cpusint
Number of cpus used to sort the prenetwork.
- normalizationstr
If None, do not normalized the count of a contact by the geometric mean of the coverage of the contigs. Otherwise it’s the type of normalization.
- metator.network.create_contig_data(assembly, nb_alignment=1, depth_file=None, enzyme=None)[source]¶
Create a dictionnary with data on each Contig.
Parameters:¶
- assemblystr
Path to the assembly fasta file
- nb_alignementint
Numbers of alignment files. [Default: 1]
- depth_filestr or None
Path to the depth.txt file from jgi_summarize_bam_contig_depths from Metabat2 Software. [Default: None]
- enzymestr or None
String that contains the names of the enzyme separated by a comma. [Default: None]
Returns:¶
- dict:
Dictionnary of the all the contigs from the assembly, the contigs names are the keys to the data of the contig available with the following keys: “id”, “length”, “GC”, “hit”, “coverage”, “RS”. Hit is set to 0 and need to be updated later.
- dict:
Dictionnary for hit information on each contigs.
- metator.network.normalize_pair(contig_data, pair, n_occ, normalization)[source]¶
Function to do the normalization of an inter contig contact depending on the mode of normalisation given.
Parameters:¶
- contig_datadict
Dictionnary of the all the contigs from the assembly, the contigs names are the keys to the data of the contig available with the following keys: “id”, “length”, “GC”, “hit”, “coverage”, “RS”.
- pairtuple
Tuple of the id of the contigs contributing to the contact.
- n_occint
Contact count between the two contigs.
- normalizationstr
Mode of normalization to use.
Returns:¶
- float:
Normalized contact count.
- metator.network.precompute_network(alignment_files, contig_data, edge, hit_data, out_file, tmp_dir, self_contacts=False)[source]¶
Write a file with only the contig id separated by a tabulation and count the contacts by contigs to be able to compute directlty the normalized network.
Parameters:¶
- alignment_fileslist of str
List of path to the alignment file(s).
- contig_datadict
Dictionnary of the all the contigs from the assembly, the contigs names are the keys to the data of the contig available with the following keys: “id”, “length”, “GC”, “hit”, “coverage”. Coverage still at 0 and need to be updated later.
- edgeint
Distance of the edge region in base pair on the contigs where the mapping reads are not considered as inter contigs.
- hit_datadict
Dictionnary with the count of hits for each aligment file.
- out_filestr
Path to the write the output_file which will be necessary to compute the network.
- tmp_dirstr
Directory where to store temporary files.
- self_contactsbool
If True, the contacts on the same contigs will be kept. Otherwise only displays the inter contigs contacts. [Default False]
Return:¶
- dict:
Dictionnary of the all the contigs from the assembly, the contigs names are the keys to the data of the contig available with the following keys: “id”, “length”, “GC”, “hit”, “coverage” “RS”. Coverage still at 0 and need to be updated later.
- metator.network.write_contig_data(contig_data, output_path)[source]¶
Function to write the contig data file at the output path given. The file will contains 7 columns separated by a tabulation: id, name, length, GC_content, hit, coverage, restriction site for each contig with an header.
Parameters:¶
- contig_datadict
Dictionnary of the all the contigs from the assembly, the contigs names are the keys to the data of the contig available with the following keys: “id”, “length”, “GC”, “hit”, “coverage”, “RS”.
- output_pathstr
Path to the output file where the data from the dictionnary will be written
- metator.network.write_hit_data(hit_data, output_path)[source]¶
Function to write the contig data file at the output path given. The file will contains 6 columns separated by a tabulation: id, name, length, GC_content, hit, coverage for each contig.
Parameters:¶
- hit_datadict
Dictionnary of the all the contigs from the assembly, the contigs names are the keys to a list of hits from each alignment files separately.
- output_pathstr
Path to the output file where the data from the dictionnary will be written.
metator.partition module¶
Partition metaHiC network using Louvain or Leiden algorithm.
General utility function for generating bins from the network file using Louvain algorithm. If an overlapping threshold is given, it will use the Hamming distance to group together bins relatively closed (to avaoid to split genomes in different bins).
- Core functions to partition the network are:
algo_partition
build_clustering_matrix
defined_overlapping_bins
detect_core_bins
generate_fasta
get_distances_splitmat
get_hamming_distance
leiden_iterations_java
louvain_iterations_cpp
partition
remove_isolates
update_contigs_data
- Deprecated Spinglass functions:
spinglass_partition
- metator.partition.algo_partition(algorithm='louvain', network_file=None, network=None, iterations=10, resolution_parameter=1.0, tmpdir='.', spin=2)[source]¶
Function to partition the network depednding on the used algorithm.
Parameters:¶
- algorithmstr
Algorithm to use to partition network. [Default: louvain]
- network_filestr
Path to the network computed previously. The file is 3 columns table separated by a tabulation with the id of the first contigs the id of the second one and the weights of the edge normalized or not. Mandatory if louvain or leiden algorithm. [Default: None]
- networknetworkx.classes.graph.Graph
Network of interaction of a contaminated bins. Mandatory if spinglass algorithm. [Default: None]
- iterationsint
Number of iterations of the algorithm of Leiden or Louvain. [Default: 10]
- resolution_parameterfloat
Resolution parameter for Leiden clustering. [Default: 1.0]
- tmp_dirstr
Path to the temporary directory. [Default: current directory]
- spinint
Deprecated. Number of final cluster if spinglass algorithm chosen. [Default: 2]
Returns:¶
- dict:
Dictionnary with the id of the contig as key and the list of the results of each iterations separated by a semicolon as values.
- metator.partition.build_clustering_matrix(core_bins_contigs, hamming_distance, N)[source]¶
Function to return the clustering matrix in sparse format.
For each contigs, the value correspond to the number of iterations where the contigs are clusterized together divided by the number of iterations. A value of 1 means that the contigs are in the same core bin.
Parameters:¶
- core_bins_contigsdict
Dictionnary which has as keys the core bins id and as value the id of the contigs of the core bin.
- hamming_distancescipy.sparse.csr.csr_matrix:
Matrix with all the previously computed hamming distance between two core bins.
- Nint
Number of contigs in the assembly.
Returns:¶
- scipy.sparse.coo.coo_matrix:
Matrix with all the previously computed hamming distance between two contigs.
- metator.partition.defined_overlapping_bins(overlap, hamming_distance, core_bins_contigs)[source]¶
This function extract the overlapped bins
From the hamming distances between the core bins, the function identifies the overlapping bins and create a dictionnary with the list of the contigs ID for each core bin.
Two core bins are considered overlapping if there have a percentage of identity superior or equal to the threshold given.
Parameters:¶
- overlapfloat
hamming distance threshold use to consider that two bins are overlapping.
- hamming_distancescipy.sparse.csr.csr_matrix
Matrix with all the previously computed hamming distance between two core bins.
- core_bins_contigsdict
Dictionnary which has as keys the core bins id and as value the id of the contigs of the core bin.
Returns:¶
- dict:
A dictionnary with the id of the overlapping bins as keys and the list of id of their contigs as values.
- metator.partition.detect_core_bins(output_partition, iterations)[source]¶
Detect core bins from the output of the partition algorithm.
The function search for duplicated values in the output of Louvain or Leiden algorithm in order to find contigs which are always in the same bin. The bins find with this method are called the core bins.
Parameters:¶
- output_partitiondict
Dictionnary with the id of the contig as key and the list of the results of each iterations separated by a semicolon as values.
- iterationsint
Number of iterations made previously with the partition algorithm.
Returns:¶
- dict:
Dictionnary which has as keys the core bins id and as value the id of the contigs of the core bin.
- pandas.core.frame.DataFrame:
Table with the id of the core bin and their values for each iterations.
- metator.partition.generate_fasta(assembly, overlapping_bins, contigs_data, size, output_dir, tmpdir, prefix)[source]¶
Generate the fasta files of each bins from the assembly.
Parameters:¶
- assemblystr
Path to the fasta file of the original assembly.
- overlapping_binsdict
A dictionnary with the id of the overlapping bins as keys and the list of id of their contigs as values.
- contigs_datapandas.core.frame.DataFrame
Table with all the information on the contigs included their appartenance to the bins.
- sizeint
Thrshold size chosen to write the bins.
- output_dirstr
Path to the output directory where the fasta of all the bin will be written.
- tmpdirstr
Path to the temporary directory to write the temporary contigs list files.
- prefixstr
Sample prefix to use.
- metator.partition.get_distances_splitmat(bins, core_bins_iterations)[source]¶
This function takes a segment of the full iterative clustering matrix and computes, for each index (i.e. contig), the hamming distance to each of the other indices.
Parameters:¶
- binspandas.core.frame.DataFrame
Slice of the table with the id of the core bin and their values for each iterations.
- core_bins_iterationspandas.core.frame.DataFrame
Table with the id of the core bin and their values for each iterations.
Returns:¶
- scipy.sparse.csr.csr_matrix:
matrix of the distance of the possible pairs from the slice of the table and the table itself.
- metator.partition.get_hamming_distance(core_bins_iterations, threads)[source]¶
Generate matrix of Hamming distances between all pairs of core bins.
Parameters:¶
- core_bins_iterationspandas.core.frame.DataFrame
Table with the id of the core bin as index and their values for each iterations.
- threadsint
Number of cores to parallelize computation.
Returns:¶
- scipy.sparse.csr.csr_matrix:
Matrix with all the previously computed hamming distance between two core bins.
- metator.partition.leiden_iterations_java(network_file, iterations, resolution_parameter, tmp_dir, leiden_path)[source]¶
Use the java implementation of Leiden to partition the network.
Parameters:¶
- network_filestr
Path to the network computed previously. The file is 3 columns table separated by a tabulation with the id of the first contigs the id of the second one and the weights of the edge normalized or not.
- iterationsint
Number of iterations of the algorithm of Leiden.
- resolution_parameterfloat
Resolution parameter for Leiden clustering.
- tmp_dirstr
Path to the temporary directory.
- leiden_pathstr
Path to the directory with network analysis java implementation.
Returns:¶
- dict:
Dictionnary with the id of the contig as key and the list of the results of each iterations separated by a semicolon as values.
- metator.partition.louvain_iterations_cpp(network_file, iterations, tmp_dir, louvain_path)[source]¶
Use the cpp original Louvain to partition the network.
Parameters:¶
- network_filestr
Path to the network computed previously. The file is 3 columns table separated by a tabulation with the id of the first contigs the id of the second one and the weights of the edge normalized or not.
- iterationsint
Number of iterations of the algorithm of Louvain.
- tmp_dirstr
Path to the temporary directory.
- louvain_pathstr
Path to the directory with louvain functions.
Returns:¶
- dict:
Dictionnary with the id of the contig as key and the list of the results of each iterations separated by a semicolon as values.
- metator.partition.partition(algorithm, assembly, cluster_matrix, contig_data_file, iterations, network_file, outdir, fasta_dir, overlapping_parameter, resolution_parameter, size, temp_directory, threads, prefix)[source]¶
Function to call the others functions to partition the network.
Parameters:¶
- algorithmstr
Algorithm to use to partition the network. Either leiden or louvain.
- assemblystr
Path to the assembly file used for the partition.
- cluster_matrixbool
If True, build and save the clustering matrix.
- contig_data_filestr
Path to the contig data table to update.
- iterationsint
Number of iterations to use for the partition.
- network_filestr
Path to the network file.
- outdirstr
Path to the output directory where to write the output files.
- fasta_dirstr
Path to directory where to write the fasta files.
- overlapping_parameterint
Hamming distance threshold to use to merge bins (percentage).
- resolution_parameterfloat
Resolution parameter to use if Leiden algorithm is chosen. It will be a factor of the cost function used. A resolution parameter of 1 will be equivalent as the modularity function used in Louvain. Higher these parameters, smaller the bins will be in the output.
- sizeint
Threshold size in base pair of the output bins.
- temp_directorystr
Path to the directory used to write temporary files.
- threadsint
Number of threads to use.
- prefixstr
Sample prefix to use.
Returns:¶
- scipy.sparse.coo.coo_matrix:
Matrix with all the previously computed hamming distance between two contigs.
- str:
Path to the new contig data file with the bin informations in it.
- metator.partition.remove_isolates(output_partition, network_file)[source]¶
Remove isolates, i.e. nodes without any contacts in the network in the partition. This step is necessary as it will slow the further process of the communities. This function is only useful while using Leiden algorithm.
Parameters:¶
- output_partitiondict
Dictionnary with the id of the contig as key and the list of the results of each iterations separated by a semicolon as values.
- network_filestr
Path to the network computed previously. The file is 3 columns table separated by a tabulation with the id of the first contigs the id of the second one and the weights of the edge normalized or not.
Returns:¶
- dict:
Dictionnary with the id of the contig as key and the list of the results of each iterations separated by a semicolon as values without isolates.
- metator.partition.spinglass_partition(subnetwork, spins=2)[source]¶
Use spinglass function from cdlib to partition the network.
This function is only used in the validation step as it will take a long time to run on a large network.
Parameters:¶
- subnetworknetworkx.classes.graph.Graph
Network of interaction of a contaminated bins.
- spinsint
Number of expected MAGs in the contaminated bins.
Returns: dict:
Dictionnary with the id of the contig as key and the clustering result as values.
- metator.partition.update_contigs_data(contig_data_file, core_bins_contigs, overlapping_bins, outdir)[source]¶
Add bin information in the contigs data file.
This function allow to update the contigs data file which were created previously in the network functions with the columns: contig id, contig name, contig length, GC content, hit, coverage, restriction site. The function will add six columns: core bin id, core bin number of contigs, core bin length, overlapping bin id, overlapping bin number of contigs, overlapping bin length.
Parameters:¶
- contig_data_filestr
Path to the contigs data file.
- core_bins_contigsdict
Dictionnary which has as keys the core bins id and as value the id of the contigs of the core bin.
- overlapping_binsdict
A dictionnary with the id of the overlapping bins as keys and the list of id of their contigs as values.
- outdirstr
Path of the output directory to write the update contigs data file.
Returns:¶
- pandas.core.frame.DataFrame:
Table with all the information on the contigs included their appartenance to the bins.
metator.validation module¶
Validates bins using miComplete and make a recursive partition to try to decontaminate them.
General functions to validate bins completion using miComplete and make recursive iterations of Louvain or Leiden to try to partition contaminated bins. Only bins with more than 50% completion and 5% contamination are subject to the recursive step. If the recursive step gave worst results than the first (decrease of the completion with no decrease of the contamination), it will keep the original bin.
- Functions in this module:
correct_final_bin
get_bin_coverage
give_results_info
merge_micomplete
micomplete_compare_bins
micomplete_quality
recursive_clustering
recursive_clustering_worker
recursive_decontamination
update_contigs_data_recursive
write_bins_contigs
- CheckM deprecated functions:
checkm
checkm_compare_bins
- metator.validation.checkm(fasta_dir, outfile, taxonomy_file, tmpdir, threads)[source]¶
Function to evaluate fasta bins using CheckM. Write the checkM results summary in the outfile and the taxonomy results in the the taxonomy file.
Parameters:¶
- fasta_dirstr
Path to the input fasta of the bins to evaluate.
- outfilestr
Path to the file where the results of checkm will be written.
- taxonomy_filestr
path to the file where checkm taxonomy results will be written.
- tmpdirstr
Path to the temporary directory where CheckM intermediary files will be written.
- threadsint
Numbers of threads to use for CheckM.
- metator.validation.checkm_compare_bins(overlapping_checkm_file, overlapping_taxonomy_file, recursive_checkm_file, recursive_taxonomy_file, prefix)[source]¶
Compare the completness and contamination of the bins from the first step and from the recursive step. If the recursive step decrease the completion of one bin without decreasing its contamination, it will kept the bin before the recursive step. Moreover if the completion goes below 50% with the recursive it will not take the recursive bins.
Parameters:¶
- overlapping_checkm_filestr
Path to the checkm summary from the overlapping step.
- overlapping_taxonomy_filestr
path to the overlapping checkm taxonomy results file.
- recursive_checkm_filestr
Path to the checkm summary from the recursive step.
- recursive_taxonomy_filestr
path to the recursive checkm taxonomy results file.
- prefixstr
Sample prefix to use.
Returns:¶
- dict:
Dictionnary with the informations of the final bins kept by MetaTOR.
- metator.validation.correct_final_bin(contigs_data, final_fasta_dir, bin_summary)[source]¶
Function to compute the coverage of each bin.
Parameters:¶
- bin_summarydict
Dictionnary with the informations of the final bins kept by MetaTOR.
- contigs_datapandas.core.frame.DataFrame
Dataframe with the contigs informations.
Returns:¶
- pandas.core.frame.DataFrame
Dataframe with the contigs informations.
- metator.validation.get_bin_coverage(bin_summary, contigs_data)[source]¶
Function to compute the coverage of each bin.
Parameters:¶
- bin_summarydict
Dictionnary with the informations of the final bins kept by MetaTOR.
- contigs_datapandas.core.frame.DataFrame
Dataframe with the contigs informations.
Returns:¶
- dict:
Dictionnary with the informations of the final bins kept by MetaTOR with the coverage.
- metator.validation.give_results_info(bin_summary)[source]¶
Function to return the general information about the binning results.
Parameters:¶
- bin_summarydict
Dictionnary with the summary results of the kept bins.
- metator.validation.merge_micomplete(out_bact105, out_arch131, outfile)[source]¶
Function to merge bacterial and archaeal output of micomplete into one file.
Parameters: out_bact105 : str
Output of micomplete using 105 bacterial markers.
- out_arch131str
Output of micomplete using 131 arcaheal markers.
- outfilestr
Final merged output.
- metator.validation.micomplete_compare_bins(recursive_micomplete_file, bin_summary, parent_dict, step)[source]¶
Compare the completness and contamination of the bins from the first step and from the recursive step. If the recursive step decrease the completion of one bin without decreasing its contamination, it will kept the bin before the recursive step. Moreover if the completion goes below 50% with the recursive it will not take the recursive bins. The goal is to keep complete bins even if there are contaminated that the users can manually curate.
Parameters:¶
- recursive_micomplete_filestr
Path to the miComplete summary from the recursive step.
- bin_summarydict
Dictionnary containing iinformation about the bins.
- parent_dictdict
Dictionnary with recursive bin_id as key and parent bin as values.
- stepint
Recursive step.
Returns:¶
- dict:
Dictionnary with the informations of the final bins kept by MetaTOR.
- metator.validation.micomplete_quality(fasta_dir, outfile, threads)[source]¶
Function to evaluate fasta bins using miComplete. Write the bins quality summury in the outfile.
Parameters:¶
- fasta_dirstr
Path to the input fasta of the bins to evaluate.
- outfilestr
Path to the file where the results of miComplete will be written.
- threadsint
Numbers of threads to use for miComplete.
- metator.validation.recursive_clustering(assembly, iterations, overlapping_parameter, resolution_parameter, outdir, recursive_fasta_dir, algorithm, tmpdir, bin_summary, contigs_data, network, cluster_matrix, size, threads, prefix)[source]¶
Function to run recursive iterations on contaminated bins in order to try to improve the quality of the bins using Louvain or Leiden algorthm.
Parameters:¶
- assemblystr
Path to the fasta file used as assembly.
- iterationsint
Number of iterations to use for recursive iterations of Louvain or Leiden.
- overlapping_parameterfloat
Hamming distance threshold to consider two bins as the same bin.
- resolution parameterfloat
Resolution parameter of Leiden algorithm.
- outdirstr
Path to the output directory.
- recursive_fasta_dirstr
Path to the directory where to write the decontaminated fasta.
- algorithmstr
Algorithm to use, either louvain, leiden or spinglass.
- tmpdirstr
Path the temp directory.
- bin_summarydict
Dictionnary containing iinformation about the bins.
- contigs_datapandas.DataFrame
Table with all the data from the contigs.
- networkstr
Metator full network.
- cluster_matrixbool
If True, build the clustering matrix and save it.
- sizeint
Size threshodl in base pairs of the bins.
- threadsint
Number of threads to use.
- prefixstr
Sample prefix to use.
Returns:¶
- boolean:
True if at least one new recursive bin has been generated.
- pandas.DataFrame
Updated dictionnary which has as keys the values of the iterations from the recursive partition separated by a semicolon and as values the list of the id of the contigs.
- scipy.sparse.coo.coo_matrix:
Matrix with all the previously computed hamming distance between two contigs.
- dict
Dictionnary with recursive bin_id as key and parent bin as values.
- metator.validation.recursive_clustering_worker(bin_id, bin_summary, tmpdir, network, algorithm, iterations, resolution_parameter, contigs_data)[source]¶
Worker to partition one bin if it’s contaminated.
Parameters:¶
- bin_summarydict
Dictionnary of the output of micomplete as values and the bin id as keys.
- tmpdirstr
Path the temp directory.
- algorithmstr
Algorithm to use, either louvain, leiden or spinglass.
- networknetworkx.classes.graph.Graph
Network of contigs from HiC librairies.
- iterationsint
Number of iterations to use for recursive iterations of Louvain or Leiden.
- resolution parameterfloat
Resolution parameter of Leiden algorithm.
- contigs_data_filestr
Path to the contigs data file from metator partition.
- metator.validation.recursive_decontamination(algorithm, assembly, cluster_matrix, contig_data_file, final_fasta_dir, input_fasta_dir, iterations, network_file, outdir, overlapping_parameter, recursive_fasta_dir, resolution_parameter, size, temp_directory, threads, prefix)[source]¶
Function to validate bins do the recursive decontamination using Louvain or Leiden algorithm
Parameters:¶
- algorithmstr
Algorithm to use to recursively partition the network. Either leiden, louvain or spinglass.
- assemblystr
Path to the assembly file used for the partition.
- cluster_matrixbool
If True, build the clustering matrix and save it.
- contig_data_filestr
Path to the contig data table to update.
- final_fasta_dirstr
Path to write the final fasta decontaminated bins.
- input_fasta_dirstr
Path to the directory where the fasta bin from the partition are.
- iterationsint
Number of iterations to use for the recursive partition.
- network_filestr
Path to the network file.
- outdirstr
Path to the output directory where to write the output files.
- overlapping_parameterint
Hamming distance threshold in percentage to use to consider to bins as one in the recursive partition.
- recursive_fasta_dirstr
Path to write the fasta decontaminated bins.
- resolution_parameterfloat
Resolution parameter to use if Leiden algorithm is chosen. It will be a factor of the cost function used. A resolution parameter of 1 will be equivalent as the modularity function used in Louvain. Higher these parameters, smaller the bins will be in the output.
- sizeint
Threshold size in base pair of the output bins.
- temp_directorystr
Path to the directory used to write temporary files.
- threadsint
Number of threads to use.
- prefixstr
Sample prefix to use.
Returns:¶
- scipy.sparse.coo.coo_matrix:
Matrix with all the previously computed hamming distance between two contigs.
- metator.validation.update_contigs_data_recursive(bin_id, contigs_data, recursive_bins, assembly, outdir, tmpdir, size, contamination, parent_dict, prefix)[source]¶
Update the data of the bin according to the recursive step and generated their fasta.
Parameters:¶
- bin_idstr
Name of the parental bin.
- contigs_datapandas.DataFrame
Table with all the data from the contigs.
- recursive_binsdict
Dictionnary which has as keys the values of the recursive iterations from Louvain or Leiden separated by a semicolon and as values the list of the id of the contigs.
- assemblystr
Path to the fasta file.
- outdirstr
Path to the output directory to write the new fasta.
- tmpdirstr
Path to the file where to write the temporary contigs list.
- sizeint
Size threshold to generate fasta.
- contaminationboolean
True if one bin has already been generated, false otherwise.
- parent_dictdict
Dictionnary with recursive bin_id as key and parent bin as values.
- prefixstr
Sample prefix to use.
Returns:¶
- boolean:
True if one bin has already been generated, false otherwise.
- pandas.DataFrame
Updated dictionnary which has as keys the values of the iterations from the recursive partition separated by a semicolon and as values the list of the id of the contigs.
- dict
Dictionnary with recursive bin_id as key and parent bin as values.
- metator.validation.write_bins_contigs(bin_summary, contigs_data, outfile, prefix)[source]¶
Function to write a table with the nodes kept in the bins and their bin id. The file is adapted to be added in anvio.
Parameters:¶
- bin_summarydict
Dictionnary with the informations of the final bins kept by MetaTOR.
- contigs_datapandas.core.frame.DataFrame
Dataframe with the contigs informations.
- outfilestr
Path where to write the output file.
- prefixstr
Sample prefix to use.
Returns:¶
- pandas.core.frame.DataFrame
Dataframe with the contigs informations with the final bin information.