labtools.adtools.seqlib

Module Contents

Functions

read_fasta(filename)

Generator for reading entries in a fasta file.

read_fastq(filename[, subset])

Generator for reading entries in a fastq file.

read_fastq_big(filename[, subset, progress])

Generator for reading a fastq file without loading it into memory.

get_numreads(filename)

Returns number of reads in a fastq or fastq.gz file.

get_numreads_old(filename)

Returns number of reads in a fastq file.

write_bc_dict(bc_dict, name)

Writes bc_dict to a csv.

read_bc_dict(filename)

Reads bc_dict from a csv.

labtools.adtools.seqlib.read_fasta(filename)

Generator for reading entries in a fasta file.

Yields 2 lines of a fasta file at a time (name, seq).

Parameters:

filename (str) – Path to fasta or fasta.gz file.

Yields:

(name, seq) ((str, str)) – Name of sequence, biological sequence.

Examples

>>> for line in read_fasta("example.fasta"):
...     name = line[0]
...     seq = line[1]
...     print(name, seq)
Geraldine
ACGTGCTGAGGCTGCGCTAGCAT
Gustavo
CTGATGCTAGATGCTGATA
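
The two-line reading pattern described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the names `open_text` and `iter_fasta_sketch` are hypothetical, and the sketch assumes each record occupies exactly two lines and that the leading ">" is stripped from the name.

```python
import gzip
import os
import tempfile


def open_text(path):
    """Open a plain or gzip-compressed file in text mode (hypothetical helper)."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def iter_fasta_sketch(path):
    """Yield (name, seq) tuples; assumes every record is exactly two lines."""
    with open_text(path) as handle:
        while True:
            header = handle.readline()
            if not header:  # an empty string means end of file
                break
            seq = handle.readline()
            # Assumption: drop the leading ">" and the trailing newlines.
            yield header.rstrip("\n").lstrip(">"), seq.rstrip("\n")


# Demonstration on a temporary two-record file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "example.fasta")
    with open(demo_path, "w") as out:
        out.write(">Geraldine\nACGTGCTGAGGCTGCGCTAGCAT\n")
        out.write(">Gustavo\nCTGATGCTAGATGCTGATA\n")
    records = list(iter_fasta_sketch(demo_path))

for name, seq in records:
    print(name, seq)
```

Multi-line sequences, which many fasta files contain, would need the sequence lines to be joined until the next ">" header; the sketch does not handle that case.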

labtools.adtools.seqlib.read_fastq(filename, subset=None)

Generator for reading entries in a fastq file.

Reads a fastq file 4 lines at a time (name, seq, +, quality) and yields the name, sequence and quality of each entry.

Parameters:
  • filename (str) – Path to fastq or fastq.gz file.

  • subset (int, optional) – Number of reads to randomly sample from the fastq file.

Yields:

(name, seq, qual) ((str, str, str)) – tuple of str containing name, seq and quality for entry.

Examples

>>> for line in read_fastq("example.fastq"):
...     name = line[0]
...     seq = line[1]
...     qual = line[2]
...     print(name, seq)
Geraldine
ACGTGCTGAGGCTGCGCTAGCAT
Gustavo
CTGATGCTAGATGCTGATA
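
As a rough sketch of the four-line grouping and the optional random subset, assuming all records are collected in memory before sampling (the source does not say how `read_fastq` samples). The name `iter_fastq_sketch` and its `seed` argument are hypothetical and exist only to make the example reproducible.

```python
import random


def iter_fastq_sketch(lines, subset=None, seed=None):
    """Yield (name, seq, qual) tuples from an iterable of fastq lines.

    Assumption: all records are held in memory so that `subset` of them
    can be drawn at random without replacement.
    """
    lines = [line.rstrip("\n") for line in lines]
    # Each record spans 4 lines: name, sequence, "+" separator, quality.
    records = [
        (lines[i], lines[i + 1], lines[i + 3])
        for i in range(0, len(lines) - 3, 4)
    ]
    if subset is not None:
        rng = random.Random(seed)
        records = rng.sample(records, min(subset, len(records)))
    for record in records:
        yield record


fastq_text = "@Geraldine\nACGT\n+\nIIII\n@Gustavo\nCTGA\n+\nFFFF\n"
all_records = list(iter_fastq_sketch(fastq_text.splitlines()))
one_record = list(iter_fastq_sketch(fastq_text.splitlines(), subset=1, seed=0))
```

The "+" separator line is read but discarded, which is why the yielded tuple has three elements rather than four.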

labtools.adtools.seqlib.read_fastq_big(filename, subset=None, progress=True, **kwargs)

Generator for reading a fastq file without loading it into memory.

Reads a fastq file 4 lines at a time (name, seq, +, quality) without loading the whole file into memory. Useful when the fastq file is large and reading it into RAM would crash the computer. Supports subsetting with sklearn.utils.random.sample_without_replacement().

Parameters:
  • filename (str) – Path to fastq or fastq.gz file.

  • subset (int, optional) – Number of reads to randomly subsample from file.

Yields:

(name, seq, qual) ((str, str, str)) – tuple of str containing name, seq and quality for entry.

Examples

>>> for line in read_fastq_big("example.fastq"):
...     name = line[0]
...     seq = line[1]
...     qual = line[2]
...     print(name, seq)
Geraldine
ACGTGCTGAGGCTGCGCTAGCAT
Gustavo
CTGATGCTAGATGCTGATA
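
One way to subsample without holding the file in memory is a two-pass approach: count the records, pick the indices to keep, then stream the file and yield only those records. The sketch below illustrates that idea under stated assumptions. It uses the standard library's `random.sample` in place of the scikit-learn sampler named above, and the names `open_text`, `stream_fastq_sketch` and `seed` are hypothetical.

```python
import gzip
import os
import random
import tempfile


def open_text(path):
    """Open a plain or gzip-compressed file in text mode (hypothetical helper)."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def stream_fastq_sketch(path, subset=None, seed=None):
    """Yield (name, seq, qual) tuples, keeping one record in memory at a time."""
    keep = None
    if subset is not None:
        # First pass: count records without storing them.
        with open_text(path) as handle:
            n_records = sum(1 for _ in handle) // 4
        rng = random.Random(seed)
        # Only the chosen record indices are stored, not the reads themselves.
        keep = set(rng.sample(range(n_records), min(subset, n_records)))
    # Second pass: stream the file record by record.
    with open_text(path) as handle:
        index = 0
        while True:
            name = handle.readline()
            if not name:  # end of file
                break
            seq = handle.readline()
            handle.readline()  # "+" separator line, discarded
            qual = handle.readline()
            if keep is None or index in keep:
                yield name.rstrip("\n"), seq.rstrip("\n"), qual.rstrip("\n")
            index += 1


# Demonstration on a temporary gzip-compressed file with five records.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "example.fastq.gz")
    with gzip.open(demo_path, "wt") as out:
        for i in range(5):
            out.write(f"@read{i}\nACGT\n+\nIIII\n")
    streamed_all = list(stream_fastq_sketch(demo_path))
    streamed_subset = list(stream_fastq_sketch(demo_path, subset=2, seed=1))
```

The trade-off is that the file is read twice when `subset` is given, in exchange for memory use that depends on the subset size rather than on the file size.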

labtools.adtools.seqlib.get_numreads(filename)

Returns number of reads in a fastq or fastq.gz file.

Parameters:

filename (str) – Path to fastq or fastq.gz file.

Returns:

numreads – Number of reads in the fastq file.

Return type:

int

Examples

>>> get_numreads("example.fastq")
124
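
Because each fastq record occupies four lines, a read count can be obtained by counting lines and dividing by four. The following is a minimal sketch of that idea, not the library's implementation; `count_reads_sketch` is a hypothetical name, and the sketch assumes well-formed four-line records with gzip files identified by a ".gz" suffix.

```python
import gzip
import os
import tempfile


def count_reads_sketch(path):
    """Count reads in a fastq or fastq.gz file by counting lines."""
    opener = gzip.open if path.endswith(".gz") else open
    # Binary mode avoids decoding overhead; only line breaks matter here.
    with opener(path, "rb") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


# Demonstration: the same three records written plain and gzip-compressed.
demo_text = "".join(f"@read{i}\nACGT\n+\nIIII\n" for i in range(3))
with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "example.fastq")
    gz_path = os.path.join(tmp, "example.fastq.gz")
    with open(plain_path, "w") as out:
        out.write(demo_text)
    with gzip.open(gz_path, "wt") as out:
        out.write(demo_text)
    plain_count = count_reads_sketch(plain_path)
    gz_count = count_reads_sketch(gz_path)
```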

labtools.adtools.seqlib.get_numreads_old(filename)

Returns number of reads in a fastq file.

Parameters:

filename (str) – Path to fastq file.

Returns:

numreads – Number of reads in the fastq file.

Return type:

int

Examples

>>> get_numreads_old("example.fastq")
124

labtools.adtools.seqlib.write_bc_dict(bc_dict, name)

Writes bc_dict to a csv.

Parameters:
  • bc_dict (dict) – Dictionary output from counter.create_map().

  • name (str) – Filename for output csv. Example: “Library1_dictionary”

labtools.adtools.seqlib.read_bc_dict(filename)

Reads bc_dict from a csv.

Parameters:

filename (str) – Path to csv containing a single dictionary.

Returns:

bc_dict – Dictionary read from the csv.

Return type:

dict
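
The csv layout used for bc_dict is not specified here. As a hedged illustration of a dictionary round trip through a csv, the sketch below assumes a flat mapping of strings to strings stored as one key,value row per entry. The function names and the example barcodes are hypothetical, and the real bc_dict produced by counter.create_map() may be structured differently.

```python
import csv
import os
import tempfile


def write_mapping_sketch(mapping, path):
    """Write a flat dict to a csv, one key,value row per entry (assumed layout)."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for key, value in mapping.items():
            writer.writerow([key, value])


def read_mapping_sketch(path):
    """Read a csv written by write_mapping_sketch back into a dict."""
    with open(path, newline="") as handle:
        # Skip empty rows so a trailing blank line does not raise an error.
        return {row[0]: row[1] for row in csv.reader(handle) if row}


# Round trip on a temporary file with made-up barcode entries.
original = {"ACGTAC": "Geraldine", "TTGACA": "Gustavo"}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "Library1_dictionary.csv")
    write_mapping_sketch(original, demo_path)
    restored = read_mapping_sketch(demo_path)
```

Values come back as strings, so a mapping that holds numbers or nested containers would need an explicit conversion step on reading.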