MySQL SUBSTRING using queried length and position values vs programming language substring after retrieval

https://stackoverflow.com/questions/16178878

11-04-2022

Question

I am accessing a Chado-structured MySQL database. I search by gene product; in this example the product is "bifunctional GDP-fucose synthetase: GDP-4-dehydro-6-deoxy-D-mannose epimerase and GDP-4-dehydro-6-L-deoxygalactose reductase".

I can then use JOIN statements to find which assembly this gene is located on and what its coordinates are. The SQL statement below is valid and returns the assembly's sequence (not just the gene's sequence), along with the start and stop positions of the gene of interest on the assembly.

SELECT f.uniquename AS protein_accession, product.value AS protein_name, srcfeature.residues AS residue_sequence, srcassembly.name AS source_type, location.fmin AS location_min, location.fmax AS location_max, location.strand
FROM feature f
JOIN cvterm polypeptide ON f.type_id=polypeptide.cvterm_id
JOIN featureprop product ON f.feature_id=product.feature_id
JOIN cvterm productprop ON product.type_id=productprop.cvterm_id
JOIN featureloc location ON f.feature_id=location.feature_id
JOIN feature srcfeature ON location.srcfeature_id=srcfeature.feature_id
JOIN cvterm srcassembly ON srcfeature.type_id=srcassembly.cvterm_id
WHERE polypeptide.name = 'polypeptide'
AND productprop.name = 'gene_product_name'
AND product.value LIKE '%bifunctional GDP-fucose synthetase: GDP-4-dehydro-6-deoxy-D-mannose epimerase and GDP-4-dehydro-6-L-deoxygalactose reductase%';

The assembly sequence is very long and I definitely don't need all of it. Is it better to extract the part I need with MySQL's SUBSTRING function, to avoid retrieving the whole sequence, or to use a programming language's substring method after retrieval? The query below is my attempt at the SUBSTRING approach, using values obtained during the query for position and length. It does not work; my guess is that it needs multiple SELECT statements. The SQL is getting really ugly, and I'm not even sure a working end result would be better.

What are your thoughts? Is it better to do this with SQL SUBSTRING, or to retrieve the whole sequence and use a programming language's substring method to display only what I want?

SELECT f.uniquename AS protein_accession, product.value AS protein_name, SUBSTRING(srcfeature.residues AS residue_sequence, location_min, location_max - location_min), srcassembly.name AS source_type, location.fmin AS location_min, location.fmax AS location_max, location.strand
FROM feature f
JOIN cvterm polypeptide ON f.type_id=polypeptide.cvterm_id
JOIN featureprop product ON f.feature_id=product.feature_id
JOIN cvterm productprop ON product.type_id=productprop.cvterm_id
JOIN featureloc location ON f.feature_id=location.feature_id
JOIN feature srcfeature ON location.srcfeature_id=srcfeature.feature_id
JOIN cvterm srcassembly ON srcfeature.type_id=srcassembly.cvterm_id
WHERE polypeptide.name = 'polypeptide'
AND productprop.name = 'gene_product_name'
AND product.value LIKE '%bifunctional GDP-fucose synthetase: GDP-4-dehydro-6-deoxy-D-mannose epimerase and GDP-4-dehydro-6-L-deoxygalactose reductase%';

EDIT: Here's an example result for a different gene (shorter name). I have omitted the residue_sequence column because it is thousands of characters long. I would have to use the values of location_min and location_max shown here to SUBSTRING properly.

+-------------------+---------------------------------------------------+-------------+--------------+--------------+--------+
| protein_accession | protein_name                                      | source_type | location_min | location_max | strand |
+-------------------+---------------------------------------------------+-------------+--------------+--------------+--------+
| ECDH10B_0026      | bifunctional riboflavin kinase and FAD synthetase | assembly    |        21406 |        22348 |      1 |
+-------------------+---------------------------------------------------+-------------+--------------+--------------+--------+

Solution

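Your AS was in the wrong place. It needs to go after the closing parenthesis of SUBSTRING(). Also, MySQL does not let you refer to a column alias (location_min, location_max) elsewhere in the same SELECT list, so use the underlying columns location.fmin and location.fmax inside the call:

SELECT f.uniquename AS protein_accession, product.value AS protein_name,
       -- Chado stores fmin/fmax as interbase (0-based) coordinates and SUBSTRING is 1-based, hence the + 1
       SUBSTRING(srcfeature.residues, location.fmin + 1, location.fmax - location.fmin) AS residue_sequence,
       srcassembly.name AS source_type, location.fmin AS location_min, location.fmax AS location_max, location.strand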
FROM feature f
JOIN cvterm polypeptide ON f.type_id=polypeptide.cvterm_id
JOIN featureprop product ON f.feature_id=product.feature_id
JOIN cvterm productprop ON product.type_id=productprop.cvterm_id
JOIN featureloc location ON f.feature_id=location.feature_id
JOIN feature srcfeature ON location.srcfeature_id=srcfeature.feature_id
JOIN cvterm srcassembly ON srcfeature.type_id=srcassembly.cvterm_id
WHERE polypeptide.name = 'polypeptide'
AND productprop.name = 'gene_product_name'
AND product.value LIKE '%bifunctional GDP-fucose synthetase: GDP-4-dehydro-6-deoxy-D-mannose epimerase and GDP-4-dehydro-6-L-deoxygalactose reductase%';

As for your other question, I think it makes much more sense to extract the data you want in the query rather than pass unnecessary data back to the application. This saves communication overhead, because only the gene's residues are sent instead of the whole assembly sequence. The database may also be able to do the work in parallel if it is using multiple threads or processors.
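For comparison, here is a minimal Python sketch of the application-side alternative, assuming the full residues value has already been fetched. The function names and the toy sequence are hypothetical; only the fmin, fmax and strand columns come from the question. It assumes Chado's interbase (0-based, half-open) coordinates, which map directly onto a Python slice, and it treats strand -1 as meaning the reverse complement is wanted. SQLite's substr (1-based, like MySQL's SUBSTRING) is used purely as a stand-in to check that both approaches return the same residues.

```python
import sqlite3

# Translation table for complementing DNA bases (assumes plain A/C/G/T residues).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract_feature(residues: str, fmin: int, fmax: int, strand: int = 1) -> str:
    """Slice a feature out of its source sequence on the application side."""
    # Interbase coordinates line up with Python's half-open slices: no offset needed.
    piece = residues[fmin:fmax]
    if strand == -1:
        # Assumption: reverse-strand features should be shown as the reverse complement.
        piece = piece.translate(_COMPLEMENT)[::-1]
    return piece


def extract_feature_in_sql(residues: str, fmin: int, fmax: int) -> str:
    """Same slice done by a database; SQLite stands in for MySQL here."""
    conn = sqlite3.connect(":memory:")
    try:
        # substr() is 1-based, so the start is fmin + 1 and the length is fmax - fmin.
        row = conn.execute(
            "SELECT substr(?, ? + 1, ? - ?)", (residues, fmin, fmax, fmin)
        ).fetchone()
    finally:
        conn.close()
    return row[0]


assembly = "AACCGGTTACGT"  # toy stand-in for srcfeature.residues
app_side = extract_feature(assembly, 4, 8)
db_side = extract_feature_in_sql(assembly, 4, 8)
```

Both paths yield the same string, so the choice comes down to how much data crosses the connection: the application-side version still has to receive the entire assembly sequence before slicing it.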

OTHER TIPS

If something like this will work for you (written here with MySQL's LOCATE and CHAR_LENGTH; PATINDEX and LEN are the SQL Server equivalents, and LOCATE takes a literal substring rather than a '%...%' wildcard pattern):

SELECT f.uniquename AS protein_accession,
       product.value AS protein_name,
       SUBSTRING(
                   srcfeature.residues,
                   LOCATE('SOMPATTERN', srcfeature.residues),
                   CHAR_LENGTH(srcfeature.residues) - LOCATE('SOMPATTERN', srcfeature.residues)
                ) AS residue_sequence,
       srcassembly.name AS source_type,

then try that in the SQL. If not, use the application programming language.
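If the pattern-based extraction ends up in the application instead, a rough Python equivalent might look like the sketch below. The function name from_pattern_to_end and the sample strings are hypothetical, and it assumes the goal is everything from the first match through the end of the sequence.

```python
def from_pattern_to_end(residues: str, pattern: str) -> str:
    """Return the sequence from the first occurrence of pattern to the end."""
    # str.find returns -1 when the pattern is absent (it does not raise).
    start = residues.find(pattern)
    if start == -1:
        return ""
    return residues[start:]


tail = from_pattern_to_end("AACCGGTT", "CGG")
missing = from_pattern_to_end("AACCGGTT", "TTTT")
```

Returning an empty string for a missing pattern is a design choice made here for illustration; raising an error may suit some callers better.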

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow