Example usage

The poorly named labtools python package is a collection of diverse modules that might be of use to Staller Lab members and collaborators.

tools to analyze raw sequencing data
simple tools to shuffle/mutate sequences for library design
(hopefully) tools to run in-house neural net

# add the comment 
    # @hidden
# anywhere in any cell you want to hide
# works with custom sphinx layout

# Following code is used to style and truncate the pandas dataframe outputs for easier example viewing

def format_index(s):
    return f"font-size: 8pt; max-width: 250px; text-overflow: ellipsis; overflow: hidden"
def format(s):
    return ["font-size: 10pt;"] *len(s)
def display_df(df, **kwargs):
    disp = df.head(10).style.applymap_index(format_index)
    disp = disp.applymap_index(format_index, axis = 0)
    disp = disp.apply(format)
    return disp

# @hidden

Sequencing Analysis Tools

Sort Processing Example

What you need

Also, see documentation for clearer idea of what inputs should look like

data_files : list of str
- path to fastq file for each sample in the sort in order of bins
bin_counts : list of int
- cells per bin in order of data files
bin_values : list of int
- mean or median fluorescence of the bin in order of data files

If you want to look for only perfect matches to certain sequences (see note)

design_file : str
- path to a csv with one column headered as “ArrayDNA” which contains your 120 bp AD sequences as DNA

NOTE: If you want to count untransformed plasmid, please add what that tile would look like to your design file. The length does not matter as long as the sequence covers the end of the read. If you used phasing, include additional seqs for each primer possibility. For example, for pMVS219 backbone phased with 4 R primers, you should add the following:

“GGTTAATTAAGGCGCGCCACTTCTAAATAAGCGATAG”
“GGTTAATTAAGGCGCGCCACTTCTAAATAAGCGATA”
“GGTTAATTAAGGCGCGCCACTTCTAAATAAGCGAT”
“GGTTAATTAAGGCGCGCCACTTCTAAATAAGCGA”

NOTE: the read cutoff is 10 reads total summed across all bins This means that if a tile is not found at least 10 times combined across any of the fastq files, it will not be analysed.

If your reads differ from default (see example)

ad_preceder : str
- the sequence directly preceding your AD sequence in your reads (anchor sequence)

If you have barcodes, you need the sequence directly preceding them and anteceding them (anchor sequences), otherwise ignore these. See the example with barcodes below.

bc_preceder : str
bc_anteceder : str

What you get out

a dataframe with your AD tiles as indices, normalized scores for tile at each bin, and the tile Activity value

Use the sequence directly preceding the AD/tile as an anchor sequence. Additional characters between AD/tile and preceding anchor seqeunce will not work

My preceding anchor sequence is in blue while my AD/tile sequence is in green in this example read.

The anchor sequence preceding the barcode is purple while the anchor sequence anteceding is red. Barcode length is 11 by default. (Not necessary if you don’t have barcodes)

Example read showing the default values. No need to change anything if your sequence matches the defaults.

TCCCTGCGGGCTCTACTTCATCGGCTAGCGGTTCTT…CTGCTAAATGATAAATAGATGAGGGCCCGTCAACATAGAAGGAGAGAAACATCTAAAAAAGCGATA

Imports

from labtools.adtools import sort

Initialize a sort

my_sort = sort.Sort(["../exampledata/bin1.fastq", "../exampledata/bin2.fastq", "../exampledata/bin3.fastq", 
                "../exampledata/bin4.fastq"], bin_counts = [100000,100000,100000,100000], 
               bin_values = [61,141,251,1462], design_file = "../exampledata/unique_seqs.csv")

Get activity values

activities, total_reads, reads_per_bin = my_sort.process()

Opening file bin1.fastq ...

Reading lines... 0 reads completed

Total parsed reads = 2,000

Opening file bin2.fastq ...

Reading lines... 0 reads completed

Total parsed reads = 2,000

Opening file bin3.fastq ...

Reading lines... 0 reads completed

Total parsed reads = 2,000

Opening file bin4.fastq ...

Reading lines... 0 reads completed

Total parsed reads = 2,000

# activity + normalized abundance in each bin
display_df(activities)

	0	1	2	3	Activity
GAAGATCCAACTTGGTTTGATTCTGGTTCTCAATTTATCTTAAATTCTCAACAATTGGTTGAAGCTTTGTCTTTGTGTGATGATGCTGCTGGTTCTCAAGATAGAGAAGAGAATACTAAT	0.251461	0.288201	0.460339	0.000000	171.520411
GAAGCTTTGTCTTTGTGTGATGATTTGTTGGGTGATCAAGATAGAGAAGAGAATGATAATGATGGTGATTTGAAAGATAAACAACCATGTATTGCTGATTATGCTCATTTGGGTCCAGAA	0.097356	0.519277	0.343084	0.040284	224.165635
GATTTGGCTGAAGATGATGAAGTTATGTGTATGGAAGATGAAGTTCAATCTATTCAACCAAATCATGAAAGACCAGATGATGGTCCAGAATTGGAACATGGTTTGGAGAATGGTGCTAAA	0.352021	0.502035	0.145944	0.000000	128.892233
GGTCAAAGGAAGAGAAGGAAGATTACTCCAACTTTGGTTAATGATGAACCAGTTAGATGGCATAAGACTGGTAGAACTAAACCAGTTATGTTGTCTGGTGTTCAAAGAGGTTGTAAGAAA	0.378641	0.367199	0.254160	0.000000	138.666285
TCTGAATTGACTTCTACTTTGGGTATTTCTCATAGATTGCCACAATCTTTGACTCCATGTGTTAAGACTGGTTCTTTGCAATCTGGTGGTTTGGTTCAATCTGTTCCATTTGAAGAATTG	0.330550	0.278750	0.390700	0.000000	157.533007
GAAATGGCTGATGATAAAGAAGAACAAGAGAAAGATAGAGATAATGAGAATCAAGGTGAAGAAGATCCAACTTGGGCTGATTCTGGTGATCAATTTATTGCTAATTCTCAACAATTGGTT	0.361290	0.420448	0.218262	0.000000	136.105579
TCTGAACCACATGTCTTGATTGAAGAATTTATTAGACCAGTTACTGAAGATGTTGGTATTAATTATACTCATCCACAGAATTTGCCAGGTGCTAATAAAGATGGTGTTTCTGTCTTCTTT	0.315398	0.367040	0.317562	0.000000	150.700019
CAATTTATCTTGAATGCTCAACAATTGGTTGAAGCTTTGTCTTTGTGTGATGATTTGTTGGGTTCTCAAGATAGAGAAGAGAATACTAATTCTGGTTCTTTGAAAGATAAACAACCATGT	0.393467	0.254385	0.352148	0.000000	148.258990
ATTATTTGGCATTTGTTGGCTAAATCTGGTTTGTCTGGTTTGTCTTCTCATCCATTTATTGATGAATTTATTCCAACTGTTAATCAAGATGATGGTATTTGTTATACTCATCCTAAGAAT	0.566048	0.243975	0.189977	0.000000	116.613702
AAGAAGAGGAAGAATAAGAATCAAGGTAAGAAGAAACCAACTTGGTTTAAATCTGGTGCTCAATTTATCTTGAATGCTCAACAATTGGTTAAAGCTGCTTCTGCTTGTAAGAAATTGTTG	0.253394	0.491474	0.255133	0.000000	148.793098

# the total reads for each sequence
total_reads

GAAGATCCAACTTGGTTTGATTCTGGTTCTCAATTTATCTTAAATTCTCAACAATTGGTTGAAGCTTTGTCTTTGTGTGATGATGCTGCTGGTTCTCAAGATAGAGAAGAGAATACTAAT    44.0
GAAGCTTTGTCTTTGTGTGATGATTTGTTGGGTGATCAAGATAGAGAAGAGAATGATAATGATGGTGATTTGAAAGATAAACAACCATGTATTGCTGATTATGCTCATTTGGGTCCAGAA    21.0
GATTTGGCTGAAGATGATGAAGTTATGTGTATGGAAGATGAAGTTCAATCTATTCAACCAAATCATGAAAGACCAGATGATGGTCCAGAATTGGAACATGGTTTGGAGAATGGTGCTAAA    49.0
GGTCAAAGGAAGAGAAGGAAGATTACTCCAACTTTGGTTAATGATGAACCAGTTAGATGGCATAAGACTGGTAGAACTAAACCAGTTATGTTGTCTGGTGTTCAAAGAGGTTGTAAGAAA    16.0
TCTGAATTGACTTCTACTTTGGGTATTTCTCATAGATTGCCACAATCTTTGACTCCATGTGTTAAGACTGGTTCTTTGCAATCTGGTGGTTTGGTTCAATCTGTTCCATTTGAAGAATTG    70.0
                                                                                                                            ... 
GACCCAACTGAATGGTTTGATTCTGGTGCTCAATTTATCTTGAATGCTCAACAATTGGTTGAAGCTCAATGTTTGGATGATAATTTGACTAGAGAATTGGAATCTAATGATGGTGCTTTG    23.0
TCTACTGATTCTACTCCAATGTTTGATTATGATAATTTGGAAGATAATTCTAAAGATTGGACTTCTTTGTTTGATAATGATATTCCAGTTACTACTGATGATGTTTCTTTGGCTGATAAA    24.0
TCTACTGGTCAAGTCTTGTTTGATATTGATGACTTTAGATGGTTGTTGGATCCAGATGATGAACAATTGGGTAAAGAAGCTATCTTGTCTGATCAATTTGGTAAACCAACTCCAGAGAAT    12.0
GAAGATCCAACTTCTGATTCTGCTATTCAACAATTGTGGAATCAAGGATTCTTGTTTGTTGAATCTTTGTCTTTGTGTGATGATTTGTTGGGTTCTCAAGATAGAGAAGAGAATACTAAT    11.0
GAAATTGATCAAATTTCTGATCCAGATAAATTGCCAGTTAATTTGGAACCATTTAGATTGGATCAATTGGAATTTACTGGTGATGATACTTCTGGTGCTGGTTTGAAATTTCAATGGGAT    13.0
Length: 200, dtype: float64

# the reads of each tile per bin
display_df(reads_per_bin)

	0	1	2	3
GAAGATCCAACTTGGTTTGATTCTGGTTCTCAATTTATCTTAAATTCTCAACAATTGGTTGAAGCTTTGTCTTTGTGTGATGATGCTGCTGGTTCTCAAGATAGAGAAGAGAATACTAAT	11.000000	13.000000	20.000000	0.000000
GAAGCTTTGTCTTTGTGTGATGATTTGTTGGGTGATCAAGATAGAGAAGAGAATGATAATGATGGTGATTTGAAAGATAAACAACCATGTATTGCTGATTATGCTCATTTGGGTCCAGAA	2.000000	11.000000	7.000000	1.000000
GATTTGGCTGAAGATGATGAAGTTATGTGTATGGAAGATGAAGTTCAATCTATTCAACCAAATCATGAAAGACCAGATGATGGTCCAGAATTGGAACATGGTTTGGAGAATGGTGCTAAA	17.000000	25.000000	7.000000	0.000000
GGTCAAAGGAAGAGAAGGAAGATTACTCCAACTTTGGTTAATGATGAACCAGTTAGATGGCATAAGACTGGTAGAACTAAACCAGTTATGTTGTCTGGTGTTCAAAGAGGTTGTAAGAAA	6.000000	6.000000	4.000000	0.000000
TCTGAATTGACTTCTACTTTGGGTATTTCTCATAGATTGCCACAATCTTTGACTCCATGTGTTAAGACTGGTTCTTTGCAATCTGGTGGTTTGGTTCAATCTGTTCCATTTGAAGAATTG	23.000000	20.000000	27.000000	0.000000
GAAATGGCTGATGATAAAGAAGAACAAGAGAAAGATAGAGATAATGAGAATCAAGGTGAAGAAGATCCAACTTGGGCTGATTCTGGTGATCAATTTATTGCTAATTCTCAACAATTGGTT	5.000000	6.000000	3.000000	0.000000
TCTGAACCACATGTCTTGATTGAAGAATTTATTAGACCAGTTACTGAAGATGTTGGTATTAATTATACTCATCCACAGAATTTGCCAGGTGCTAATAAAGATGGTGTTTCTGTCTTCTTT	5.000000	6.000000	5.000000	0.000000
CAATTTATCTTGAATGCTCAACAATTGGTTGAAGCTTTGTCTTTGTGTGATGATTTGTTGGGTTCTCAAGATAGAGAAGAGAATACTAATTCTGGTTCTTTGAAAGATAAACAACCATGT	9.000000	6.000000	8.000000	0.000000
ATTATTTGGCATTTGTTGGCTAAATCTGGTTTGTCTGGTTTGTCTTCTCATCCATTTATTGATGAATTTATTCCAACTGTTAATCAAGATGATGGTATTTGTTATACTCATCCTAAGAAT	9.000000	4.000000	3.000000	0.000000
AAGAAGAGGAAGAATAAGAATCAAGGTAAGAAGAAACCAACTTGGTTTAAATCTGGTGCTCAATTTATCTTGAATGCTCAACAATTGGTTAAAGCTGCTTCTGCTTGTAAGAAATTGTTG	5.000000	10.000000	5.000000	0.000000

Locate any tiles, not just perfect matches

If you want to find sequences even if they don’t match your designed sequences, simply do not include the design file. A short (less than 120 bp) tile will probably appear with a significant number of reads. This is probably your untransformed background (cells transformed with a plasmid that did not get a tile). Most of the non-perfect matching tiles will be sequencing errors. You might find a way to map these back to their true tile. A few of the non-perfect matching tiles (likely those with a significant number of reads) will be sequencing library PCR amplification errors. An even fewer number of the non-perfect matching tiles could be mutants that arose within the actual cell, or free tiles. These will probably have a significant number of reads. Someone may put in the effort to distinguish these categories in the future.

my_sort = sort.Sort(["../exampledata/bin1.fastq", "../exampledata/bin2.fastq", "../exampledata/bin3.fastq", 
                "../exampledata/bin4.fastq"], bin_counts = [100000,100000,100000,100000], 
               bin_values = [61,141,251,1462])
activities_no_design, numreads_total, _ = my_sort.process()

Opening file bin1.fastq ...

Reading lines... 0 reads completed

Total parsed reads = 2,000

Opening file bin2.fastq ...

Reading lines... 0 reads completed

Total parsed reads = 2,000

Opening file bin3.fastq ...

Reading lines... 0 reads completed

Total parsed reads = 2,000

Opening file bin4.fastq ...

Reading lines... 0 reads completed

Total parsed reads = 2,000

display_df(activities_no_design)

	0	1	2	3	Activity
GAAGATCCAACTTGGTTTGATTCTGGTTCTCAATTTATCTTAAATTCTCAACAATTGGTTGAAGCTTTGTCTTTGTGTGATGATGCTGCTGGTTCTCAAGATAGAGAAGAGAATACTAAT	0.249091	0.283698	0.467211	0.000000	172.465991
GAAGCTTTGTCTTTGTGTGATGATTTGTTGGGTGATCAAGATAGAGAAGAGAATGATAATGATGGTGATTTGAAAGATAAACAACCATGTATTGCTGATTATGCTCATTTGGGTCCAGAA	0.096596	0.512002	0.348777	0.042625	227.945173
GATTTGGCTGAAGATGATGAAGTTATGTGTATGGAAGATGAAGTTCAATCTATTCAACCAAATCATGAAAGACCAGATGATGGTCCAGAATTGGAACATGGTTTGGAGAATGGTGCTAAA	0.351864	0.498670	0.149466	0.000000	129.292147
GGTTAATTAAGGCGCGCCACTTCTAAATAAGCGA	0.210811	0.528218	0.260971	0.000000	152.841957
GGTCAAAGGAAGAGAAGGAAGATTACTCCAACTTTGGTTAATGATGAACCAGTTAGATGGCATAAGACTGGTAGAACTAAACCAGTTATGTTGTCTGGTGTTCAAAGAGGTTGTAAGAAA	0.377151	0.363465	0.259384	0.000000	139.360114
TCTGAATTGACTTCTACTTTGGGTATTTCTCATAGATTGCCACAATCTTTGACTCCATGTGTTAAGACTGGTTCTTTGCAATCTGGTGGTTTGGTTCAATCTGTTCCATTTGAAGAATTG	0.327972	0.274844	0.397184	0.000000	158.452427
GAAATGGCTGATGATAAAGAAGAACAAGAGAAAGATAGAGATAATGAGAATCAAGGTGAAGAAGATCCAACTTGGGCTGATTCTGGTGATCAATTTATTGCTAATTCTCAACAATTGGTT	0.360305	0.416677	0.223018	0.000000	136.707585
TCTGAACCACATGTCTTGATTGAAGAATTTATTAGACCAGTTACTGAAGATGTTGGTATTAATTATACTCATCCACAGAATTTGCCAGGTGCTAATAAAGATGGTGTTTCTGTCTTCTTT	0.313669	0.362744	0.323586	0.000000	151.500974
CAATTTATCTTGAATGCTCAACAATTGGTTGAAGCTTTGTCTTTGTGTGATGATTTGTTGGGTTCTCAAGATAGAGAAGAGAATACTAATTCTGGTTCTTTGAAAGATAAACAACCATGT	0.390706	0.251019	0.358275	0.000000	149.153731
ATTATTTGGCATTTGTTGGCTAAATCTGGTTTGTCTGGTTTGTCTTCTCATCCATTTATTGATGAATTTATTCCAACTGTTAATCAAGATGATGGTATTTGTTATACTCATCCTAAGAAT	0.564274	0.241688	0.194038	0.000000	117.202291

numreads_total

GAAGATCCAACTTGGTTTGATTCTGGTTCTCAATTTATCTTAAATTCTCAACAATTGGTTGAAGCTTTGTCTTTGTGTGATGATGCTGCTGGTTCTCAAGATAGAGAAGAGAATACTAAT    44.0
GAAGCTTTGTCTTTGTGTGATGATTTGTTGGGTGATCAAGATAGAGAAGAGAATGATAATGATGGTGATTTGAAAGATAAACAACCATGTATTGCTGATTATGCTCATTTGGGTCCAGAA    21.0
GATTTGGCTGAAGATGATGAAGTTATGTGTATGGAAGATGAAGTTCAATCTATTCAACCAAATCATGAAAGACCAGATGATGGTCCAGAATTGGAACATGGTTTGGAGAATGGTGCTAAA    49.0
GGTTAATTAAGGCGCGCCACTTCTAAATAAGCGA                                                                                          24.0
GGTCAAAGGAAGAGAAGGAAGATTACTCCAACTTTGGTTAATGATGAACCAGTTAGATGGCATAAGACTGGTAGAACTAAACCAGTTATGTTGTCTGGTGTTCAAAGAGGTTGTAAGAAA    16.0
                                                                                                                            ... 
TCTACTGATTCTACTCCAATGTTTGATTATGATAATTTGGAAGATAATTCTAAAGATTGGACTTCTTTGTTTGATAATGATATTCCAGTTACTACTGATGATGTTTCTTTGGCTGATAAA    24.0
TCTACTGGTCAAGTCTTGTTTGATATTGATGACTTTAGATGGTTGTTGGATCCAGATGATGAACAATTGGGTAAAGAAGCTATCTTGTCTGATCAATTTGGTAAACCAACTCCAGAGAAT    12.0
AATACTCCAACTCCACCATCTTTGGTTGATGGTGTTGCTGGTGATGAAGAAGCATTTGATGAGATGTTTGATCCATTCTTTGAAGAATTGGATTCTATTCCAGAAGCTGCTTTGTGATAA    14.0
GAAGATCCAACTTCTGATTCTGCTATTCAACAATTGTGGAATCAAGGATTCTTGTTTGTTGAATCTTTGTCTTTGTGTGATGATTTGTTGGGTTCTCAAGATAGAGAAGAGAATACTAAT    11.0
GAAATTGATCAAATTTCTGATCCAGATAAATTGCCAGTTAATTTGGAACCATTTAGATTGGATCAATTGGAATTTACTGGTGATGATACTTCTGGTGCTGGTTTGAAATTTCAATGGGAT    13.0
Length: 220, dtype: float64

Get data for reads that include barcodes AND tiles

Support for barcoded only data may or may not be added in the future.

activities_barcoded, total_reads, reads_per_bin = my_sort.process(barcoded = True)

Opening file bin1.fastq ...

Reading lines... 0 reads completed

Total parsed reads = 2,000

Opening file bin2.fastq ...

Reading lines... 0 reads completed

Total parsed reads = 2,000

Opening file bin3.fastq ...

Reading lines... 0 reads completed

Total parsed reads = 2,000

Opening file bin4.fastq ...

Reading lines... 0 reads completed

Total parsed reads = 2,000

Note that you get back less tiles using this method. The reason for this is because the read must have a locatable barcode AND tile, which is less likely than having one or the other. The primary reason for doing this analysis is assessing per-transformant variation. The assumption is that unique tile-barcode pairs come from unique original transformants.

display_df(activities_barcoded)

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/bioinformatics_exploration/.venv/lib/python3.8/site-packages/IPython/core/formatters.py:344, in BaseFormatter.__call__(self, obj)
   method = get_real_method(obj, self.print_method)
   if method is not None:
--> 344         return method()
   return None
else:

File ~/bioinformatics_exploration/.venv/lib/python3.8/site-packages/pandas/io/formats/style.py:385, in Styler._repr_html_(self)
"""
Hooks into Jupyter notebook rich display system, which calls _repr_html_ by
default if an object is returned at the end of a cell.
"""
if get_option("styler.render.repr") == "html":
--> 385     return self.to_html()
return None

File ~/bioinformatics_exploration/.venv/lib/python3.8/site-packages/pandas/io/formats/style.py:1377, in Styler.to_html(self, buf, table_uuid, table_attributes, sparse_index, sparse_columns, bold_headers, caption, max_rows, max_columns, encoding, doctype_html, exclude_styles, **kwargs)
   obj.set_caption(caption)
# Build HTML string..
-> 1377 html = obj._render_html(
   sparse_index=sparse_index,
   sparse_columns=sparse_columns,
   max_rows=max_rows,
   max_cols=max_columns,
   exclude_styles=exclude_styles,
   encoding=encoding or get_option("styler.render.encoding"),
   doctype_html=doctype_html,
   **kwargs,
)
return save_to_buffer(
   html, buf=buf, encoding=(encoding if buf is not None else None)
)

File ~/bioinformatics_exploration/.venv/lib/python3.8/site-packages/pandas/io/formats/style_render.py:206, in StylerRenderer._render_html(self, sparse_index, sparse_columns, max_rows, max_cols, **kwargs)
def _render_html(
   self,
   sparse_index: bool,
   (...)
   **kwargs,
) -> str:
   """
   Renders the ``Styler`` including all applied styles to HTML.
   Generates a dict with necessary kwargs passed to jinja2 template.
   """
--> 206     d = self._render(sparse_index, sparse_columns, max_rows, max_cols, "&nbsp;")
   d.update(kwargs)
   return self.template_html.render(
       **d,
       html_table_tpl=self.template_html_table,
       html_style_tpl=self.template_html_style,
   )

File ~/bioinformatics_exploration/.venv/lib/python3.8/site-packages/pandas/io/formats/style_render.py:163, in StylerRenderer._render(self, sparse_index, sparse_columns, max_rows, max_cols, blank)
def _render(
   self,
   sparse_index: bool,
   (...)
   blank: str = "",
):
   """
   Computes and applies styles and then generates the general render dicts.

   Also extends the `ctx` and `ctx_index` attributes with those of concatenated
   stylers for use within `_translate_latex`
   """
--> 163     self._compute()
   dxs = []
   ctx_len = len(self.index)

File ~/bioinformatics_exploration/.venv/lib/python3.8/site-packages/pandas/io/formats/style_render.py:258, in StylerRenderer._compute(self)
r = self
for func, args, kwargs in self._todo:
--> 258     r = func(self)(*args, **kwargs)
return r

File ~/bioinformatics_exploration/.venv/lib/python3.8/site-packages/pandas/io/formats/style.py:1870, in Styler._apply_index(self, func, axis, level, method, **kwargs)
obj = self.index if axis == 0 else self.columns
levels_ = refactor_levels(level, obj)
-> 1870 data = DataFrame(obj.to_list()).loc[:, levels_]
if method == "apply":
   result = data.apply(func, axis=0, **kwargs)

File ~/bioinformatics_exploration/.venv/lib/python3.8/site-packages/pandas/core/indexing.py:1067, in _LocationIndexer.__getitem__(self, key)
   if self._is_scalar_access(key):
       return self.obj._get_value(*key, takeable=self._takeable)
-> 1067     return self._getitem_tuple(key)
else:
   # we by definition only have the 0th axis
   axis = self.axis or 0

File ~/bioinformatics_exploration/.venv/lib/python3.8/site-packages/pandas/core/indexing.py:1256, in _LocIndexer._getitem_tuple(self, tup)
if self._multi_take_opportunity(tup):
   return self._multi_take(tup)
-> 1256 return self._getitem_tuple_same_dim(tup)

File ~/bioinformatics_exploration/.venv/lib/python3.8/site-packages/pandas/core/indexing.py:924, in _LocationIndexer._getitem_tuple_same_dim(self, tup)
if com.is_null_slice(key):
   continue
--> 924 retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
# We should never have retval.ndim < self.ndim, as that should
#  be handled by the _getitem_lowerdim call above.
assert retval.ndim == self.ndim

File ~/bioinformatics_exploration/.venv/lib/python3.8/site-packages/pandas/core/indexing.py:1301, in _LocIndexer._getitem_axis(self, key, axis)
   if hasattr(key, "ndim") and key.ndim > 1:
       raise ValueError("Cannot index with multidimensional key")
-> 1301     return self._getitem_iterable(key, axis=axis)
# nested tuple slicing
if is_nested_tuple(key, labels):

File ~/bioinformatics_exploration/.venv/lib/python3.8/site-packages/pandas/core/indexing.py:1239, in _LocIndexer._getitem_iterable(self, key, axis)
self._validate_key(key, axis)
# A collection of keys
-> 1239 keyarr, indexer = self._get_listlike_indexer(key, axis)
return self.obj._reindex_with_indexers(
   {axis: [keyarr, indexer]}, copy=True, allow_dups=True
)

File ~/bioinformatics_exploration/.venv/lib/python3.8/site-packages/pandas/core/indexing.py:1432, in _LocIndexer._get_listlike_indexer(self, key, axis)
ax = self.obj._get_axis(axis)
axis_name = self.obj._get_axis_name(axis)
-> 1432 keyarr, indexer = ax._get_indexer_strict(key, axis_name)
return keyarr, indexer

File ~/bioinformatics_exploration/.venv/lib/python3.8/site-packages/pandas/core/indexes/base.py:6070, in Index._get_indexer_strict(self, key, axis_name)
else:
   keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 6070 self._raise_if_missing(keyarr, indexer, axis_name)
keyarr = self.take(indexer)
if isinstance(key, Index):
   # GH 42790 - Preserve name from an Index

File ~/bioinformatics_exploration/.venv/lib/python3.8/site-packages/pandas/core/indexes/base.py:6130, in Index._raise_if_missing(self, key, indexer, axis_name)
   if use_interval_msg:
       key = list(key)
-> 6130     raise KeyError(f"None of [{key}] are in the [{axis_name}]")
not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
raise KeyError(f"{not_found} not in index")

KeyError: "None of [Int64Index([0], dtype='int64')] are in the [columns]"

<pandas.io.formats.style.Styler at 0x7f29a0d8dee0>

Support for custom anchor sequences

If your tiles or barcodes have a custom anchor sequence (AKA the non-variable portion of the read that is used to locate the variable portion of the read), you can specify that in the kwargs of your Sort(). This passes the arguments to the pull_AD() function used on each read to locate the sequence of interest (AKA AD or tile).

Use the sequence directly preceding the AD/tile as an anchor sequence. Additional characters between AD/tile and preceding anchor seqeunce will not work

My preceding anchor sequence is in blue while my AD/tile sequence is in green in this example read.

The anchor sequence preceding the barcode is purple while the anchor sequence anteceding is red. Barcode length is 11 by default. (Not necessary if you don’t have barcodes)

Example read

TCCCTGCGGGCTCTACTTCATCGGCTAGCGGTTCTT…CTGCTAAATGATAAATAGATGAGGGCCCGTCAACATAGAAGGAGAGAAACATCTAAAAAAGCGATA

Specify alternate values in a dictionary and pass that dictionary to Sort.process()

# these are the default values which would work for my example read above
# no input is required if they work for you
kwargs = {"ad_preceder":"GCTAGC", "bc_preceder":"GGGCCCG", "bc_anteceder":"GGAGAGAA", "bclength":11, "ad_length":120}

activities_barcoded, _, _ = my_sort.process(barcoded = True, **kwargs)

Opening file bin1.fastq ...

Reading lines... 0 reads completed

Total parsed reads = 2,000

Opening file bin2.fastq ...

Reading lines... 0 reads completed

Total parsed reads = 2,000

Opening file bin3.fastq ...

Reading lines... 0 reads completed

Total parsed reads = 2,000

Opening file bin4.fastq ...

Reading lines... 0 reads completed

Total parsed reads = 2,000

import pandas as pd

pd.set_option('display.max_colwidth', 1)

Example output for pull_AD

This might be useful for someone who wants to use pull_AD to analyze reads for a purpose outside of calculating activities.

from labtools.adtools.finder import pull_AD

read = "TCCCTGCGGGCTCTACTTCATCGGCTAGCGGTTCTTCTAAATTGAGATGTGATAATAATGCTGCTGCTCATGTTAAATTGGATTCATTTCCAGCTGGTGTTAGATTTGATACATCTGATGAAGAATTGTTGGAACATTTGGCTGCTAAATGATAAATAGATGAGGGCCCGTCAACATAGAAGGAGAGAAACATCTAAAAAAGCGATA"
pull_AD(read, kwargs)

('GGTTCTTCTAAATTGAGATGTGATAATAATGCTGCTGCTCATGTTAAATTGGATTCATTTCCAGCTGGTGTTAGATTTGATACATCTGATGAAGAATTGTTGGAACATTTGGCTGCTAAA',
 'TCAACATAGAA')

Counting tiles in a fastq file

This essentially applies the pull_AD function shown above to every read in the fastq file. You can also pass the kwargs dict to it to specify custom anchor sequences.

from labtools.adtools.counter import seq_counter
seq_counter("../exampledata/mini.fastq")

Opening file mini.fastq ...

Reading lines... 0 reads completed

Total parsed reads = 4

GGTTCTTCTAAATTGAGATGTGATAATAATGCTGCTGCTCATGTTAAATTGGATTCATTTCCAGCTGGTGTTAGATTTGATACATCTGATGAAGAATTGTTGGAACATTTGGCTGCTAAA    1
GAAGAATTGTTTTTACATTTGTCTGCTAAGATTGGTAGATCTTCTAGGAAACCACATCCATTCTTGGATGAATTTATTCATACTTTGGTTGAAGAAGATGGTATTTGTAGAACTCATCCA    3
dtype: int64

Use barcoded = True to count tile and barcode pairs
Use the design_file flag to look for specific tiles

counts = seq_counter("../exampledata/bin1.fastq", barcoded = True, design_to_use="../exampledata/unique_seqs.csv")
counts

Opening file bin1.fastq ...

Reading lines... 0 reads completed

Total parsed reads = 2,000

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[20], line 1
----> 1 counts = seq_counter("../exampledata/bin1.fastq", barcoded = True, design_to_use="../exampledata/unique_seqs.csv")
      2 counts

File ~/labtools/src/labtools/adtools/counter.py:96, in seq_counter(fastq, design_to_use, barcoded, only_bcs, **kwargs)
     94 design = pd.read_csv(design_to_use)
     95 if barcoded:
---> 96     counts = counts.where(counts.index.droplevel(1).isin(design.ArrayDNA)).dropna()
     97 else:
     98     counts = counts.where(counts.index.isin(design.ArrayDNA)).dropna()

File ~/bioinformatics_exploration/.venv/lib/python3.8/site-packages/pandas/core/indexes/base.py:2180, in Index.droplevel(self, level)
   2177 if not isinstance(level, (tuple, list)):
   2178     level = [level]
-> 2180 levnums = sorted(self._get_level_number(lev) for lev in level)[::-1]
   2182 return self._drop_level_numbers(levnums)

File ~/bioinformatics_exploration/.venv/lib/python3.8/site-packages/pandas/core/indexes/base.py:2180, in <genexpr>(.0)
   2177 if not isinstance(level, (tuple, list)):
   2178     level = [level]
-> 2180 levnums = sorted(self._get_level_number(lev) for lev in level)[::-1]
   2182 return self._drop_level_numbers(levnums)

File ~/bioinformatics_exploration/.venv/lib/python3.8/site-packages/pandas/core/indexes/base.py:2055, in Index._get_level_number(self, level)
   2054 def _get_level_number(self, level) -> int:
-> 2055     self._validate_index_level(level)
   2056     return 0

File ~/bioinformatics_exploration/.venv/lib/python3.8/site-packages/pandas/core/indexes/base.py:2046, in Index._validate_index_level(self, level)
   2041         raise IndexError(
   2042             "Too many levels: Index has only 1 level, "
   2043             f"{level} is not a valid level number"
   2044         )
   2045     elif level > 0:
-> 2046         raise IndexError(
   2047             f"Too many levels: Index has only 1 level, not {level + 1}"
   2048         )
   2049 elif level != self.name:
   2050     raise KeyError(
   2051         f"Requested level ({level}) does not match index name ({self.name})"
   2052     )

IndexError: Too many levels: Index has only 1 level, not 2

Sequence Design Tools

Imports

from labtools import shuffle

Shuffle a sequence

Create shuffles of the alphabet:

shuffles_list, names_list = shuffle.windowed_shuffle("ABCDEFGHIJKLMNOPQRSTUVWYXZ")
shuffles_list

['adcbeFGHIJKLMNOPQRSTUVWYXZ',
 'AcbedfGHIJKLMNOPQRSTUVWYXZ',
 'ABgcefdHIJKLMNOPQRSTUVWYXZ',
 'ABCdhfegIJKLMNOPQRSTUVWYXZ',
 'ABCDfigheJKLMNOPQRSTUVWYXZ',
 'ABCDEfhjigKLMNOPQRSTUVWYXZ',
 'ABCDEFgkhjiLMNOPQRSTUVWYXZ',
 'ABCDEFGihljkMNOPQRSTUVWYXZ',
 'ABCDEFGHjmkilNOPQRSTUVWYXZ',
 'ABCDEFGHIknjlmOPQRSTUVWYXZ',
 'ABCDEFGHIJlnomkPQRSTUVWYXZ',
 'ABCDEFGHIJKlmnpoQRSTUVWYXZ',
 'ABCDEFGHIJKLqnmpoRSTUVWYXZ',
 'ABCDEFGHIJKLMopnqrSTUVWYXZ',
 'ABCDEFGHIJKLMNqorpsTUVWYXZ',
 'ABCDEFGHIJKLMNOtqsrpUVWYXZ',
 'ABCDEFGHIJKLMNOPtsruqVWYXZ',
 'ABCDEFGHIJKLMNOPQuvtsrWYXZ',
 'ABCDEFGHIJKLMNOPQRwuvstYXZ',
 'ABCDEFGHIJKLMNOPQRSvtwyuXZ',
 'ABCDEFGHIJKLMNOPQRSTxvyuwZ',
 'ABCDEFGHIJKLMNOPQRSTUzvywx']

Example usage

Sequencing Analysis Tools

Sort Processing Example

What you need

What you get out

Use the sequence directly preceding the AD/tile as an anchor sequence. Additional characters between AD/tile and preceding anchor seqeunce will not work

Example read showing the default values. No need to change anything if your sequence matches the defaults.

Imports

Initialize a sort

Get activity values

Locate any tiles, not just perfect matches

Get data for reads that include barcodes AND tiles

Support for custom anchor sequences

Use the sequence directly preceding the AD/tile as an anchor sequence. Additional characters between AD/tile and preceding anchor seqeunce will not work

Example read

Specify alternate values in a dictionary and pass that dictionary to Sort.process()

Example output for pull_AD

Counting tiles in a fastq file

Sequence Design Tools

Imports

Shuffle a sequence

Predictors (coming soon?)