Escaping placeholders, take 2

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Escaping placeholders, take 2

Tim Bunce
Wherein I attempt to summarize the requirements, the background, the options,
the risks, and the counter-proposal ...

** Background

Drivers that support placeholders parse the SQL statement to find them.
The parsing is really just simple tokenization that only knows how to skip
over quoted strings, quoted identifiers, and comments for the SQL dialect
used by the database. The remaining text is checked for placeholders.

A question mark is the international SQL standard for placeholders.
Per ISO/IEC 9075-2:2003 section 4.24 Dynamic SQL concepts:

    In SQL-statements that are executed dynamically, the parameters are
    called dynamic parameters (<dynamic parameter specification>s) and
    are represented in SQL language by a <question mark> (?).

Some drivers also support placeholders expressed as a colon ':' followed
by a number and/or a name. Some drivers support other forms like a dollar
'$' followed by a number. Different types of placeholders cannot be
mixed within a statement.

The number of parameters passed to execute (etc.) has to match the number
of placeholders found in the statement or an error is reported.

An individual SQL statement typically comes primarily from one of these sources:
1. Written literally in the source code, perhaps with some $interpolation
2. Constructed dynamically, in whole or part, using a module like SQL::Abstract
3. From an external source, like an SQL script file or 'catalog'.


** Historical Issues and Work-arounds

Sometimes the driver will identify placeholders in the SQL statement
that weren't intended by the developer.  So some drivers have added
ad-hoc ways to limit the parsing of placeholders.  For example:

  pg_direct - query passed directly without parsing for placeholders at all
  pg_placeholder_dollaronly - question marks are not treated as placeholders
  pg_placeholder_nocolons - colons are not treated as placeholders
  ora_placeholders - disable processing of all placeholders
  odbc_ignore_named_placeholders - disable processing of :foo placeholders

In most cases these have been added as a workaround for statements that
contain colons followed by words that the driver was treating as placeholders.
(For example :old and :new when defining Oracle triggers.)

The exception is pg_placeholder_dollaronly which was added in 2008 to
allow use of postgres geometric operators like "?||" ("are parallel")
See http://www.postgresql.org/docs/9.4/static/functions-geometry.html

So supporting multiple styles of placeholders has both caused problems
and allowed drivers to "dodge the bullet" when one style has become a
problem in a particular situation. But that's only a limited solution.
It doesn't promote the development and use of generic higher-level reusable
modules. (You could frame the progress of software development in terms
of enabling developers to work at ever-higher levels of abstraction.)


** Limitation of Existing Workarounds

A key problem with these attribute workarounds is that they affect the
entire statement.  That's a good approach when loading statements from
an external source, and often ok when writing individual DBI calls directly
where an alternative placeholder style or quote()'ing can be used.

In modern application development, however, the DBI is just another
foundation layer, low down in the stack. Above it are modules like
SQL::Abstract, and ORMs like DBIx::Class. Companies have large and
growing investments in these stacks, plus the modules they've built over
them to provide abstraction and encapsulation of business logic.

For these large applications the attribute workarounds are rarely helpful.

Working 'high up' in an application stack that's dynamically constructing
SQL with placeholders it should be possible to express any desired behavior.


** Requirements

Some mechanism is needed to allow an individual character sequence that
normally represents a placeholder, to appear in the SQL without it being
treated as a placeholder by the driver. To "escape" it's usual meaning.

If escaping question mark placeholders was already defined in 2008 then
pg_placeholder_dollaronly might not have been added to DBD::Pg.

The issue of escaping placeholders is on the table now because
PostgreSQL supports a JSON type with a rich set of features and
operators, some of which use a question mark.  For example, the
expression json_type_field ? 'bar' is true if the string 'bar' exists as
a key/element exist within the JSON value of json_type_field. For more
information see http://www.postgresql.org/docs/9.4/static/functions-json.html
Use of JSON and these operators is likely to be much more common than
use of the geometric operators mentioned earlier.

So, while the needs of Postgres users is what prompted this topic now,
the requirement is more general. It would be good to agree on an approach
that could be applied to most drivers if the need arises.


** Options

There seem to be several options worth considering....

1. Backslash

A backslash could be placed immediately before a question mark to
'escape' the usual interpretation. For example:

    SELECT * FROM table WHERE json_field \? ?

    $sql_abstract->where({ json_field => { '\?' => $bar } })

I've not been able to find any SQL dialect where the backslash is even
valid (outside of quoted strings and quoted identifiers) let alone
valid before a placeholder.

The only significant downside is the need to consider single vs double
quoting when expressing the backslash:

    $sql_abstract->where({ json_field => { "\\?" => $bar } })

but that seems very minor as the use of backslashes in single and double
quotes is natural and familar for Perl developers.


2. Doubling

A few people suggested that a placeholder could be escaped by repeating it.
This is reminicent of the way that quotes can be embedded within an SQL
string by doubling them, like 'don''t'.  For example:

    SELECT * FROM table WHERE json_field ?? ?

    $sql_abstract->where({ json_field => { '??' => $bar } })

I presume there are no database dialects where two adjacent placeholders would
be valid, so I think it's safe.

Doubling of a delimiter as a way to embed the delimiter within the
delimited text has established precedent. However I can't think of a
precedent for doubling of something that isn't a delimiter in order to
escape its meaning.

So, to my mind, this doesn't work well because...
- The relationship to doubling of delimiters is tenuous at best.
- The construct has little mnemonic value, e.g. relative to backslash.
- Doesn't apply easily to colon placeholders, especially for SQL
  dialects that already have a meaning for double colons.


3. Vendor escape clauses

The X/Open SQL CAE specification defined an escape clause as: "a syntactic
mechanism for vendor-specific SQL extensions to be implemented in the
framework of standardized SQL". The original verbose form has been
deprecated in favor of a short form using braces that was popularized by
ODBC and JDBC. For example "{ts 'value'}". See http://tinyurl.com/m9r5rrh

Effectively we'd have to agree on some kind of syntax within braces,
perhaps something like:

    SELECT * FROM table WHERE json_field {verbatim ?} 'bar'

    $sql_abstract->where({ json_field => { '{verbatim ?}' => $bar } })

There's no easy way to be sure we'd avoid clashing with future SQL standards.
If we go down this road we might open the door to deeper support of
escape clause syntax as a mechanism for portability, as it was
originally intended. But that's a potentially complex can of worms that
no one has mentioned recently (except me, just now).


** Option Summary

All three options appear safe enough to enable by default, i.e. I can't think
of any cases where the behaviour of existing working code would change because
the proposed construct wouldn't be valid to start with. No code will break!

Overall I'm not very keen on vendor escape clauses, and even less so about
doubling, so my preference is still to use a backslash.  It's a simple and
mnemonic use of the familar concept of using a backslash to 'escape' from
the default meaning of a character.


** DBI Specification Change

Here's a proposed DBI specification change:

@@ -2422,2 +2422,7 @@ =head2 Placeholders and Bind Values

+Some drivers allow you to prevent the recognition of a placeholder by placing a
+single backslash character (C<\>) immediately before it. The driver will remove
+the backslash character and ignore the placeholder, passing it unchanged to the
+backend. If the driver supports this then L</get_info>(9000) will return true.

@@ -4971,2 +4976,9 @@ =head3 C<get_info>

+Values from 9000 to 9999 for get_info are officially reserved for use by Perl DBI.
+Values in that range which have been assigned a meaning are defined here:
+
+C<9000>: true if a backslash character (C<\>) before placeholder-like text
+(e.g. C<?>, C<:foo>) will prevent it being treated as a placeholder by the driver.
+The backslash will be removed before the text is passed to the backend.

(I'm undecided about making the get_info(9000) return value a bit-map
e.g., 0x01= can escape question mark placeholders, 0x02= can escape
colon placeholders, 0x04= can escape driver-specific placeholder forms.
That's probably too much detail for a feature that's unlikely to be used
in practice. See below.)


** Implementation

The DBI has a preparse() method that's intended for drivers to use for
identifying placeholders. Very few (if any?) drivers actually use that
though. So implementing support for escaping placeholders would need to
be done by each driver.

For most driver's there no need or urgency in implementing it at all!
Being able to escape placeholders is only of use if the placeholder
syntax has some meaning in the SQL dialect of the database.

For question mark placeholders the only database we're aware of that
assigns a meaning to a question mark is PostgreSQL.

For colon placeholders there are many databases that assign a meaning to
colons as a prefix to name and/or numbers. Individual driver authors
would, as always, be free to implement this placeholder escaping feature
if and when they wish.


** Counter-proposal

On Sun, Dec 21, 2014 at 03:17:39PM -0000, Greg Sabino Mullane (the tenth man) wrote:
>
> I'm going to play devil's advocate a bit here. There are some problems with
> this approach. First, it will require that the user know if the underlying
> DBD supports backslashes.

Of course. This is no different from the developer requiring to know that
the underlying DBD supports any other feature that the developer relies on.

For most cases the familiar dependency management mechanisms work fine:
    use DBD::Foo '3.456';
There's also the get_info call, but I doubt that'll be needed in practice.


> Which likely means that SQL::Abstract and/or DBIx::Class will need to
> know as well. (Unless they expose the DBD directly to the user, which
> ruins the point a bit).

Not true.


> So we will have both escaped and unescaped versions of SQL
> floating around, subject to the whims of whether or not your particular
> DBD supports it (and in which version).

There's no such thing as "escaped and unescaped versions of SQL".
That's a false premise.  This is a really important point: the syntax
being proposed *is invalid SQL*.

Exactly the same approach has been used to safely extend other langauges.
For example the '(?' sequence was invalid in early Perl 5 regexps which
allowed Larry to adopt it to enable new features, like (?:foo)


> Which leads to my counter-proposal: have SQL::Abstract accept
> dollar-number placeholders.

s/accept/generate/.

I thought this might be workable and had looked at the SQL::Abstract
source code a few times but didn't think it an attractive option.
(Partly because of the complexity and performance cost, and partly
because it just seems like a poor approach to the issue.  Relying on the
fact that some drivers support non-standard placeholder styles seems to
be an inelegant fudge to me.  It would be pushing complexity and cost
into code that simply shouldn't need to care.)

Then ribasushi, a principle maintainer of SQL::Abstract and DBIx::Class,
pointed out the fatal flaw: numbered placeholders aren't directly composable.

It's easy to pass around a fragment of SQL containing question mark
placeholders along with a list of values to bind to those placeholders.
These fragments can be directly composed into a larger SQL statement
with the corresponding bind values simply pushed onto a list.

The as_query() method in DBIx::Class is a good example use-case. It returns

    \[ $sql_with_placeholders, @binds ]

and that form of 'SQL fragment' is "pervasive in the public API". E.g. see
https://metacpan.org/pod/distribution/DBIx-Class/lib/DBIx/Class/Manual/Cookbook.pod#Subqueries

For example:

    $resultset1->search_rs({
        foo => { -in     => $resultset2->as_query },
        bar => { -not_in => $resultset3->as_query },
        ...
    })->all;

The $resultset1/2/3 variables could represent different complex
multi-table joins with many conditions and many bind variables.
This is a very powerful feature of DBIx::Class, and it works well.

Trying to do the same with numbered placeholders (colon or dollar) would
require parsing and rewriting of the SQL.  Clearly non-trivial and slow.
For this reason he wouldn't accept such a patch to SQL::Abstract, and I'd
agree with that position.


Question marks are the international standard for placeholders.
The DBI should have specified a way to escape them from the start.
I'm attempting to fix that now.


To demonstrate the relative complexity, here's a proof-of-concept patch
for DBD::Pg to implement support for escaping question mark placeholders:

  https://github.com/timbunce/dbdpg/commit/54358c7a7efeeaf2666c5e28c301e47624fb9615

You can see that it's just 8 lines of simple code. (Naturally I'll add tests
and docs etc. to this before sending a pull request.)


Tim

p.s. When replying, please edit your replies to only quote the relevant
parts of this (very long) email.


_______________________________________________
List: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/dbix-class
IRC: irc.perl.org#dbix-class
SVN: http://dev.catalyst.perl.org/repos/bast/DBIx-Class/
Searchable Archive: http://www.grokbase.com/group/dbix-class@...
Reply | Threaded
Open this post in threaded view
|

Re: Escaping placeholders, take 2

Darren Duncan
Thank you for this post Tim, it seemed to lay out the issues well and make a lot
of things clear.

I'm now inclined to support your backslash-escape proposal at the DBI driver
level in principle.

(I also agree that the doubling method is nasty.)

Or alternately I suggest variation on that proposal that brings in shades of the
vendor escape clauses.

I suggest a variant where instead of a single backslash character indicating an
escape, we have some multi-character thing to indicate it, ideally involving
delimiters so it is more clear on how far the effect of the new escape feature
is supposed to go.

For example, using \{?} rather than \? or \{:foo} rather than \:foo.

One benefit of that is if you have some SQL that contains ? or : numerous times,
you only need to surround the whole fragment with \{} and not individually
escape each occurrence, making for neater code.

As to dealing with say literal { or } in the SQL as I could see reasonably
happening, the quoting mechanism could be made more generic like Perl's "q"/etc,
so for example it would take the form \X...X where whatever character appears
after the \ is what is matched, and it could be a multiplicity if necessary
within reason, eg \{{{...}}} would just work.  For that matter, heredocs or the
"quoted printable" feature you can see in email messages or such, eg you have
\"foo"...foo or some such.

I'm speaking in principle here, I'm not proposing a specific feature set, but
both "q" as well as Perl 6's related quoting mechanism is useful for guidance,
and I think something involving delimiters is best.

But if some of that sounds unduly complicated, I have a better idea.

I propose that the DBI include an API for users to tell the driver what possible
escape delimiters they are using.  For example, doable at least at a statement
level (and optionally on a connection/etc level for defaulting if that makes sense).

The API could involve an 'attr' given when preparing a SQL statement or other
appropriate places.

   placeholder_escape_delimiters => [ '\{','}' ]
   placeholder_escape_delimiters => [ '\{','}','\[',']' ]
   placeholder_escape_delimiters => [ '\{{{','}}}' ]
   placeholder_escape_delimiters => [ '{{{','}}}' ]

In this way, the backslash is no longer special, or necessary, though I
anticipate it would often still be used for the mnemonics.

Rather, when the driver is parsing the SQL for placeholders, if it encounters
any left delimiter strings, if will leave the following SQL unaltered until it
encounters the corresponding right delimiter string, and then it looks for
placeholders again.

(As to numbered placeholders, which are effectively a special case of named
placeholders, not being directly composable in SQL::Abstract, I see that as
being a problem itself.  It would be a great help to developers in principle if
the native way for working with parameters was named rather than positionally.
However, that is really a separate matter to deal with and I think it is a good
idea for Tim's proposal in some form to happen regardless of dealing with this
separate matter.)

-- Darren Duncan


_______________________________________________
List: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/dbix-class
IRC: irc.perl.org#dbix-class
SVN: http://dev.catalyst.perl.org/repos/bast/DBIx-Class/
Searchable Archive: http://www.grokbase.com/group/dbix-class@...