Usage
To access the features provided by the fastx-barber suite, use the fbarber command. Running fbarber -h provides helpful details on how to run the commands.
Match
usage: fbarber match [-h] [--pattern PATTERN] [--version] [--unmatched-output UNMATCHED_OUTPUT]
[--compress-level COMPRESS_LEVEL] [--log-file LOG_FILE] [--chunk-size CHUNK_SIZE]
[--threads THREADS] [--temp-dir TEMP_DIR]
in.fastx[.gz] out.fastx[.gz]
The match command selects reads from a fastx file based on a regular expression (--pattern, see regular expressions for more details). Only reads matching the regex are written to the output file. Reads that do not match the pattern can be exported to a separate file with the --unmatched-output option. This script can be parallelized; for more details see Parallelization.
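As a rough illustration of the selection logic described above (not fbarber's actual implementation), the sketch below partitions reads into matched and unmatched groups. It uses Python's standard re module rather than the regex package, and split_by_pattern is a hypothetical helper name. Whether fbarber searches anywhere in the read or only from its start is an assumption here.

```python
import re


def split_by_pattern(records, pattern):
    """Partition (header, sequence) records by whether the sequence matches."""
    compiled = re.compile(pattern)
    matched, unmatched = [], []
    for header, seq in records:
        # search() looks anywhere in the read; anchor the pattern with ^
        # to require a match at the 5' (left) end.
        target = matched if compiled.search(seq) else unmatched
        target.append((header, seq))
    return matched, unmatched


records = [("read_1", "ACGTTTGA"), ("read_2", "TTGACCAA")]
matched, unmatched = split_by_pattern(records, "^ACGT")
```

Here matched plays the role of the output file, while unmatched corresponds to what --unmatched-output would collect.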
Trim
Trimming is a common operation, generally performed before alignment to a reference genome, either to remove non-genomic read portions (e.g., prefix, linker) or to remove terminal low-quality bases. The trim command provides access to different tools for this purpose.
Trim by length
usage: fbarber trim length [-h] [-l LENGTH] [-s {3,5}] [--version] [--compress-level COMPRESS_LEVEL]
[--log-file LOG_FILE] [--chunk-size CHUNK_SIZE] [--threads THREADS]
[--temp-dir TEMP_DIR]
in.fastx[.gz] out.fastx[.gz]
The trim length command trims a given number (-l) of bases from either side (-s) of the reads (5’: left; 3’: right). This script can be parallelized; for more details see Parallelization.
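A minimal sketch of the operation, assuming reads are plain strings; trim_length is a hypothetical function written for illustration and is not part of the fbarber API.

```python
def trim_length(seq, length, side):
    """Remove `length` bases from the 5' (left) or 3' (right) end of a read."""
    if side == 5:
        return seq[length:]
    if side == 3:
        # max() guards against lengths longer than the read itself.
        return seq[:max(len(seq) - length, 0)]
    raise ValueError("side must be 3 or 5")


left_trimmed = trim_length("ACGTACGT", 3, 5)
right_trimmed = trim_length("ACGTACGT", 3, 3)
```

For a fastq record, the same slice would have to be applied to the quality string so that sequence and quality stay aligned.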
Trim by quality
usage: fbarber trim quality [-h] [-q QSCORE] [-s {3,5}] [--version] [--phred-offset PHRED_OFFSET]
[--compress-level COMPRESS_LEVEL] [--log-file LOG_FILE]
[--chunk-size CHUNK_SIZE] [--threads THREADS] [--temp-dir TEMP_DIR]
in.fastq[.gz] out.fastq[.gz]
The trim quality command removes all consecutive bases with a QSCORE below a given threshold (-q) from either side (-s) of the reads (5’: left; 3’: right). For more details on the QSCORE, see QSCORE. This script can be parallelized; for more details see Parallelization.
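The behavior described above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not fbarber's code: trim_quality is a hypothetical name, and trimming is assumed to stop at the first base whose QSCORE reaches the threshold.

```python
def trim_quality(seq, qual, min_qscore, side, phred_offset=33):
    """Drop consecutive bases with QSCORE below min_qscore from one end."""
    if side not in (3, 5):
        raise ValueError("side must be 3 or 5")
    scores = [ord(char) - phred_offset for char in qual]
    if side == 3:
        scores.reverse()  # walk inwards from the 3' (right) end
    n_low = 0
    for score in scores:
        if score >= min_qscore:
            break  # first acceptable base: stop trimming
        n_low += 1
    if side == 5:
        return seq[n_low:], qual[n_low:]
    keep = len(seq) - n_low
    return seq[:keep], qual[:keep]


# '#' encodes QSCORE 2 and 'I' encodes QSCORE 40 with a PHRED offset of 33.
from_left = trim_quality("ACGTACGT", "##IIII#I", 20, 5)
from_right = trim_quality("ACGTACGT", "##IIII##", 20, 3)
```

Note that the low-quality base inside the first read (the second-to-last position) is kept, since only terminal runs are removed.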
Trim by regular expression
usage: fbarber trim regex [-h] [--pattern PATTERN] [--version] [--unmatched-output UNMATCHED_OUTPUT]
[--compress-level COMPRESS_LEVEL] [--log-file LOG_FILE]
[--chunk-size CHUNK_SIZE] [--threads THREADS] [--temp-dir TEMP_DIR]
in.fastx[.gz] out.fastx[.gz]
The trim regex command tries to match a regex (--pattern, see regular expressions for more details) to the reads, and then removes the portion that matched. If a read does not match the pattern, it is not written to the output; the --unmatched-output option can be used to export unmatched reads to a separate file instead. This script can be parallelized; for more details see Parallelization.
Flags
The flag command provides access to tools that extract portions of the reads (flags) and store them in the read headers, filter them, match them to a regex, calculate statistics, or split reads based on their value. These operations can be performed either simultaneously, at the time of flag extraction (see extract flags below), or on a file with previously extracted flags (see after flag extraction).
Extract flags
usage: fbarber flag extract [-h] [--pattern PATTERN] [--version] [--unmatched-output UNMATCHED_OUTPUT]
[--flag-delim FLAG_DELIM]
[--selected-flags SELECTED_FLAGS [SELECTED_FLAGS ...]]
[--flagstats FLAGSTATS [FLAGSTATS ...]] [--split-by SPLIT_BY]
[--filter-qual-flags FILTER_QUAL_FLAGS [FILTER_QUAL_FLAGS ...]]
[--filter-qual-output FILTER_QUAL_OUTPUT] [--phred-offset PHRED_OFFSET]
[--no-qual-flags] [--comment-space COMMENT_SPACE]
[--compress-level COMPRESS_LEVEL] [--log-file LOG_FILE]
[--chunk-size CHUNK_SIZE] [--threads THREADS] [--temp-dir TEMP_DIR]
in.fastx[.gz] out.fastx[.gz]
The flag extract command matches a regular expression (--pattern, see regular expressions for more details) to a read. Flags can be specified in the pattern as named groups using regular expression syntax, e.g., ^(?<umi>.{8})(?<bc>AGTCTAGA){s<2} specifies a flag called “umi” consisting of 8 consecutive characters from the left end of the reads, and a second flag called “bc” with value “AGTCTAGA”, allowing for up to 1 substitution (or mismatch).
By default, the part of the read matching the pattern is trimmed, and all flags specified in the pattern are extracted (i.e., saved in the header). It is possible to extract only a subset of the flags by using the --selected-flags option. Moreover, use the --unmatched-output option to write any read not matching the pattern to a separate file.
When extracting flags, it is possible to simultaneously perform a number of operations that can also be performed after flag extraction:
- Use the --flagstats option to calculate the frequency of flag values. See calculate flag value frequency for more details.
- Use the --filter-qual-flags option to filter reads by quality. To output reads that do not pass the specified filter(s), use the --filter-qual-output option. See filter by flag quality for more details.
- Split reads to different files based on the value of a flag by using the --split-by option. See split by flag value for more details.
This script can be parallelized; for more details see Parallelization.
Flag extraction example
Flags are appended to the initial part of each read header, identified after removing any header comments. The --comment-space value (defaulting to a white space) is used to identify and remove header comments. The --flag-delim character (defaulting to a tilde, ~) is used to separate flags and key/value flag pairs. For example, applying the default values and the above pattern to the read below,
>Read_1 header_comment1 header_comment2
ACTGGACTAGTCTAGAGTATCGATCAGTCAGTCGATCG
would generate the following result:
>Read_1~~umi~ACTGGACT~~bc~AGTCTAGA header_comment1 header_comment2
GTATCGATCAGTCAGTCGATCG
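To make the header layout concrete, here is a small sketch that rebuilds this example with Python's standard re module. It is only an approximation of what fbarber does: the stdlib module requires the (?P<name>...) group syntax and has no fuzzy matching, so the {s<2} part of the pattern is omitted, and extract_flags is a hypothetical helper.

```python
import re

# Exact-match version of the documented pattern (no fuzzy matching in stdlib re).
PATTERN = re.compile(r"^(?P<umi>.{8})(?P<bc>AGTCTAGA)")


def extract_flags(header, seq, flag_delim="~", comment_space=" "):
    """Move named groups from the read into its header and trim the match."""
    match = PATTERN.search(seq)
    if match is None:
        return None  # unmatched read
    # Split the read name from any header comments.
    name, sep, comments = header.partition(comment_space)
    flags = "".join(
        f"{flag_delim * 2}{key}{flag_delim}{value}"
        for key, value in match.groupdict().items()
    )
    trimmed = seq[:match.start()] + seq[match.end():]
    return name + flags + sep + comments, trimmed


new_header, new_seq = extract_flags(
    "Read_1 header_comment1 header_comment2",
    "ACTGGACTAGTCTAGAGTATCGATCAGTCAGTCGATCG",
)
```

The doubled delimiter separates flags from each other, while a single delimiter separates each flag name from its value.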
Using a simple alphanumeric pattern
When used together with --simple-pattern, the --pattern option accepts a simple alphanumeric pattern: a string composed of flag names and flag lengths. This pattern is always applied to the start (left end, 5’) of a sequence. This is especially useful to extract flags of known length, independently of their expected sequence. For example, a record starting with a UMI of 8 nt, a barcode (BC) of 8 nt, and a cutsite (CS) of 4 nt could be treated with the following pattern: UMI8BC8CS4. The rest of the execution proceeds in the same manner as with a normal regular expression. This can be particularly convenient, as it provides a modest performance boost.
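One way to picture a simple pattern is as shorthand for an anchored regex with one fixed-length named group per flag. The sketch below performs that translation; it is an assumption about the semantics rather than a description of fbarber's internals (which may avoid regular expressions altogether), and it assumes flag names contain letters only. simple_to_regex is a hypothetical name.

```python
import re


def simple_to_regex(simple_pattern):
    """Translate e.g. 'UMI8BC8CS4' into an anchored regex with named groups."""
    # Each flag is a run of letters (name) followed by a run of digits (length).
    parts = re.findall(r"([A-Za-z]+)(\d+)", simple_pattern)
    groups = "".join(f"(?P<{name}>.{{{length}}})" for name, length in parts)
    return "^" + groups


pattern = simple_to_regex("UMI8BC8CS4")
flags = re.match(pattern, "ACTGGACTAGTCTAGAGTATCG").groupdict()
```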
Extracting quality flags (default)
When running flag extract on a fastq file, the portion of the quality string corresponding to each flag is also stored in the header. The quality string is saved as a separate flag, named by adding a “q” prefix to the flag name.
To avoid extracting quality flags from fastq files, please use the --no-qual-flags option.
The previous example applied to a fastq file,
@Read_1 header_comment1 header_comment2
ACTGGACTAGTCTAGAGTATCGATCAGTCAGTCGATCG
+
AAA/AAAAAEAAAA//AAAAAAAAAAAAAAAAAAAAAA
would generate the following result:
@Read_1~~umi~ACTGGACT~~qumi~AAA/AAAA~~bc~AGTCTAGA~~qbc~AEAAAA// header_comment1 header_comment2
GTATCGATCAGTCAGTCGATCG
+
AAAAAAAAAAAAAAAAAAAAAA
IMPORTANT: as this approach is prone to flag name conflicts, it will change with v0.2.0. Follow issue #38 for more updates. In the meantime, please refrain from using flags that start with the letter “q”.
After flag extraction
As mentioned above, a number of actions can be performed either at the time of flag extraction (simultaneously), or on files with previously extracted flags. When running these commands after flag extraction, it is crucial to use the appropriate --flag-delim (default “~”) and --comment-space (default “ ”) to properly read the flags.
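The sketch below shows why both values matter when reading flags back from a header. The header layout is inferred from the extraction examples in this section, and parse_flags is a hypothetical helper, not an fbarber function.

```python
def parse_flags(header, flag_delim="~", comment_space=" "):
    """Return (read name, {flag name: value}) from a header with flags."""
    # Everything after the first comment separator is a header comment.
    head = header.split(comment_space, 1)[0]
    # A doubled delimiter separates flags; a single one separates key and value.
    name, *pairs = head.split(flag_delim * 2)
    flags = dict(pair.split(flag_delim, 1) for pair in pairs)
    return name, flags


name, flags = parse_flags(
    "Read_1~~umi~ACTGGACT~~bc~AGTCTAGA header_comment1 header_comment2"
)
```

Parsing the same header with a different delimiter would find no flags at all, which is why the values used at extraction time need to be passed again.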
Filter by flag quality
usage: fbarber flag filter [-h] [--version] [--flag-delim FLAG_DELIM] [--comment-space COMMENT_SPACE]
[--filter-qual-flags FILTER_QUAL_FLAGS [FILTER_QUAL_FLAGS ...]]
[--filter-qual-output FILTER_QUAL_OUTPUT] [--phred-offset PHRED_OFFSET]
[--compress-level COMPRESS_LEVEL] [--log-file LOG_FILE]
[--chunk-size CHUNK_SIZE] [--threads THREADS] [--temp-dir TEMP_DIR]
in.fastx[.gz] out.fastx[.gz]
The flag filter command applies one or more filters to previously extracted flags. For each flag to be filtered, it is possible to specify a minimum QSCORE threshold and a maximum allowed fraction (percentage of bases) with QSCORE below the threshold. Specifically, the filters can be set as space-separated strings in the format flag_name,min_QSCORE,max_fraction.
Any read with at least one flag with more bases below the QSCORE threshold than allowed is discarded. To export reads that do not pass a filter, use the --filter-qual-output option.
For more details on the QSCORE, see QSCORE. This script can be parallelized; for more details see Parallelization.
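A hedged sketch of the filter logic follows. It assumes that max_fraction is expressed on a 0 to 1 scale and that a flag passes when its fraction of low-quality bases does not exceed that value; both parse_filter and passes_filter are hypothetical names, and the exact flag naming expected by fbarber is not covered here.

```python
def parse_filter(spec):
    """Split a 'flag_name,min_QSCORE,max_fraction' string into typed fields."""
    flag_name, min_qscore, max_fraction = spec.split(",")
    return flag_name, int(min_qscore), float(max_fraction)


def passes_filter(qual, min_qscore, max_fraction, phred_offset=33):
    """True if the fraction of bases below min_qscore is within max_fraction."""
    if not qual:
        return True  # nothing to evaluate
    low = sum(1 for char in qual if ord(char) - phred_offset < min_qscore)
    return low / len(qual) <= max_fraction


flag_name, min_qscore, max_fraction = parse_filter("umi,30,0.2")
# '/' encodes QSCORE 14, 'A' encodes 32 and 'E' encodes 36 (offset 33).
umi_ok = passes_filter("AAA/AAAA", min_qscore, max_fraction)  # 1 of 8 below 30
bc_ok = passes_filter("AEAAAA//", min_qscore, max_fraction)   # 2 of 8 below 30
```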
Match flags with regular expressions
usage: fbarber flag regex [-h] [--pattern PATTERN [PATTERN ...]] [--version]
[--unmatched-output UNMATCHED_OUTPUT] [--flag-delim FLAG_DELIM]
[--comment-space COMMENT_SPACE] [--compress-level COMPRESS_LEVEL]
[--log-file LOG_FILE] [--chunk-size CHUNK_SIZE] [--threads THREADS]
[--temp-dir TEMP_DIR]
in.fastx[.gz] out.fastx[.gz]
The flag regex command tries to match one or more flags to regular expressions. Any read with at least one non-matching flag, or where a specified flag is not present, is not written to the output. To export these reads to a separate file, use the --unmatched-output option.
Regular expressions can be specified as space-separated strings in the format "flag_name,regex". We recommend wrapping each string in quotes.
This script can be parallelized; for more details see Parallelization.
Split by flag value
usage: fbarber flag split [-h] [--version] [--flag-delim FLAG_DELIM] [--comment-space COMMENT_SPACE]
[--split-by SPLIT_BY] [--compress-level COMPRESS_LEVEL] [--log-file LOG_FILE]
[--chunk-size CHUNK_SIZE] [--threads THREADS] [--temp-dir TEMP_DIR]
in.fastx[.gz] out.fastx[.gz]
The flag split command splits reads into separate files based on the value of a specific flag (--split-by). This script can be parallelized; for more details see Parallelization.
Calculate flag value frequency
usage: fbarber flag stats [-h] [--version] [--flag-delim FLAG_DELIM] [--comment-space COMMENT_SPACE]
[--flagstats FLAGSTATS [FLAGSTATS ...]] [--compress-level COMPRESS_LEVEL]
[--log-file LOG_FILE] [--chunk-size CHUNK_SIZE] [--threads THREADS]
[--temp-dir TEMP_DIR]
in.fastx[.gz]
The flag stats command calculates the frequency of the values of one or more flags (--flagstats). This script can be parallelized; for more details see Parallelization.
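Conceptually, this amounts to counting how often each value of a flag occurs. The sketch below does so with collections.Counter; the output layout (count plus fraction per value) is an assumption for illustration and may differ from what fbarber writes, and flag_frequency is a hypothetical name.

```python
from collections import Counter


def flag_frequency(flag_values):
    """Map each flag value to (count, fraction of all reads), most common first."""
    counts = Counter(flag_values)
    total = sum(counts.values())
    return {value: (n, n / total) for value, n in counts.most_common()}


barcodes = ["AGTCTAGA", "AGTCTAGA", "AGTCTAGT", "AGTCTAGA"]
frequency = flag_frequency(barcodes)
```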
Find sequence
usage: fbarber find_seq [-h] [--version] [--output out.bed[.gz]] [--prefix prefix] [--case-insensitive]
[--global-name] [--compress-level COMPRESS_LEVEL] [--log-file LOG_FILE]
in.fastx[.gz] needle
The find_seq command locates a substring (needle) in the records of a fastx file, and produces a BED file with the identified locations. The --case-insensitive option can be used to make the search case-insensitive. The generated BED file is a BED4 file where the chromosome name corresponds to the FASTX record header value, and the location name is formed by the --prefix value and the location ID. Location IDs are assigned incrementally per record searched. To obtain location IDs incrementing over the whole FASTX file, use the --global-name option.
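The following sketch illustrates how such BED4 rows could be built. Several details are assumptions made for the example rather than documented fbarber behavior: the "loc_" prefix, IDs starting at 1, and overlapping occurrences being reported. Coordinates follow the BED convention (0-based start, exclusive end). The find_seq function here is a hypothetical stand-in, not the command's implementation.

```python
def find_seq(records, needle, prefix="loc_", case_insensitive=False,
             global_name=False):
    """Return BED4 rows (chrom, start, end, name) for each needle occurrence."""
    rows = []
    location_id = 0
    for header, seq in records:
        if not global_name:
            location_id = 0  # restart numbering for every record
        haystack, target = seq, needle
        if case_insensitive:
            haystack, target = seq.upper(), needle.upper()
        start = haystack.find(target)
        while start != -1:
            location_id += 1
            rows.append((header, start, start + len(needle),
                         f"{prefix}{location_id}"))
            start = haystack.find(target, start + 1)
    return rows


records = [("chr_a", "ACGTACGT"), ("chr_b", "ttacgt")]
per_record = find_seq(records, "ACG", case_insensitive=True)
whole_file = find_seq(records, "ACG", case_insensitive=True, global_name=True)
```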
General
Output
For all fbarber commands, the format (fasta/q) of the output must match that of the input. The barber automatically detects from the output extension whether the output should be compressed (expects a .gz suffix), in which case it uses the specified compression level (--compress-level, defaults to 6).
Regular expressions
fbarber uses the regex Python package to compile, match, and generally manage regular expressions. Thus, the barber supports fuzzy matching, where a number of allowed deletions/insertions/substitutions can be specified (NOTE: fuzzy matching might slow execution, as it takes longer to compute). For more details on the fuzzy matching syntax, please check the regex package documentation.
QSCORE
fbarber uses the latest standard QSCORE definition of QSCORE = -10 log10(Pe), where Pe is the error probability of a base. The QSCORE is read from the quality string of a FASTQ file using a certain PHRED offset (--phred-offset). The default PHRED offset is 33, following the latest Illumina standards (chr(Q+33)). As the barber uses the biopython package for quality calculation, we direct the user to its documentation, which provides a nice historical overview of the topic.
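The definition and the encoding above translate directly into a few lines of Python. The sketch uses only the standard library rather than biopython, and the function names are chosen for illustration.

```python
import math


def qscore_from_error(p_error):
    """QSCORE = -10 * log10(Pe)."""
    return -10 * math.log10(p_error)


def error_from_qscore(qscore):
    """Inverse of the definition: Pe = 10 ** (-QSCORE / 10)."""
    return 10 ** (-qscore / 10)


def decode_quality(qual, phred_offset=33):
    """Convert a quality string into a list of QSCOREs."""
    return [ord(char) - phred_offset for char in qual]


def encode_quality(qscores, phred_offset=33):
    """Convert QSCOREs back into a quality string, i.e. chr(Q + offset)."""
    return "".join(chr(q + phred_offset) for q in qscores)


scores = decode_quality("AE/")
```

For instance, an error probability of 0.001 corresponds to a QSCORE of 30, and the characters A, E and / decode to 32, 36 and 14 with the default offset.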
Logging
By default, the script log is written to the terminal (stdout). To save the log to a file, we recommend using the --log-file option.
Parallelization
Parallelization is achieved (using joblib) by splitting the input file into chunks, which are then processed concurrently on separate threads. Finally, the output of each chunk is merged into the final output, retaining the initial order.
It is possible to specify the number of reads per chunk and the number of concurrent threads using the --chunk-size and --threads options, respectively. Input file chunks and single chunk output files are stored in a temporary directory, which can be changed using the --temp-dir option.
As I/O operations represent the bottleneck in most cases, especially on solid-state drives and particularly when processing one read at a time, this approach can speed up execution when the chunks are large enough to be spread over multiple threads. Subprocesses are instantiated at execution start, and the overhead time is proportional to the number of threads.
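The chunk, process, and merge scheme can be sketched as below. Note the substitutions: this example uses concurrent.futures from the standard library instead of joblib, keeps chunks in memory instead of writing them to a temporary directory, and uses a placeholder process_chunk step; all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def chunked(records, chunk_size):
    """Yield consecutive lists of at most chunk_size records."""
    iterator = iter(records)
    while chunk := list(islice(iterator, chunk_size)):
        yield chunk


def process_chunk(chunk):
    """Placeholder per-chunk work: trim 2 bases from the 5' end of each read."""
    return [(header, seq[2:]) for header, seq in chunk]


def run_parallel(records, chunk_size, threads):
    """Process chunks concurrently and merge results in the input order."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() returns results in submission order, whatever the finish order.
        results = pool.map(process_chunk, chunked(records, chunk_size))
        return [record for chunk in results for record in chunk]


reads = [(f"read_{i}", "ACGT") for i in range(5)]
processed = run_parallel(reads, chunk_size=2, threads=2)
```

Because results are collected in submission order, the merged output preserves the order of the input reads even if a later chunk finishes first.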