-
Notifications
You must be signed in to change notification settings - Fork 45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AST-based SQLite driver #164
base: develop
Are you sure you want to change the base?
Conversation
@adamziel Looking deeper into this, I think that translating specific AST subnodes is pretty straightforward, but what's an open question is the driver/translator architecture, and I'm kind of going back and forth on multiple options. Splitting the driver & translator Splitting top-level constructs To add a bit more context, even though the top-level statement rule can be re-used in a compound statement for stored procedures, stored functions, triggers, and events, it's probably good to distinguish that case from the top-level use case. E.g., an Splitting further? Translating vs. interpreting class WP_SQLite_Create_Table_Translator {
public function translate_create_table( WP_Parser_Node $ast ) {
// CREATE TABLE ... LIKE ...
// CREATE TABLE ... (LIKE ...)
$has_like = $ast->get_descendant_token( WP_MySQL_Lexer::LIKE_SYMBOL );
if ($has_like) {
// Here we will need to run a query to get the table structure,
// and then create the table:
return $this->create_table_like( $ast );
}
// CREATE TABLE ... ( elements ) options
// CREATE TABLE ... AS ...
$elements_ast = $ast->get_child( 'tableElementList' );
$has_as = $ast->get_descendant_token( WP_MySQL_Lexer::AS_SYMBOL );
if ( $elements_ast && $has_as ) {
throw new \Exception('"CREATE TABLE" with both column/constraint definition and "AS" clause is not supported');
}
// SQLite supporst both ( elements ) and "AS", so we can just rewrite
// the query and run it. Rewriting can be done either partly here,
// or fully in the translator.
$query = $this->translator->translate( $ast );
return $this->connection->query( $query );
// Or, we rewrite the top-level `CREATE TABLE` here and nested nodes in the translator:
$query = new WP_SQLite_Expression(
array_merge(
array( WP_SQLite_Token_Factory::raw( 'CREATE TABLE' ) ),
array(
$this->translate(
$ast->get_child( 'tableName' )->get_child()
),
),
array( WP_SQLite_Token_Factory::raw( '(' ) ),
$this->translator->translate( $ast->get_child( 'tableElementList' ) ),
array( WP_SQLite_Token_Factory::raw( ')' ) )
)
);
return $this->connection->query( $query ); I'm not yet sure what's the optimal solution, but it feels like at least a bit of a concern separation would be useful. It would also be nice if we could solidify some AST node translation by unit testing the translated fragment as well (not only by the end-to-end tests that are run against SQLite), as such test would give immediate feedback and great diffs. |
fac342d
to
b8d04ea
Compare
Is there any downside to starting with a single giant class and splitting once that no longer serves the purpose? Separation of concerns is a useful principle but it cuts both ways. Take Enterprise FizzBuzz. Great separation of concerns, extremely unreadable codebase. Too many concerns were defined and separated. Therefore, I propose starting with just a single concern: Running MySQL queries in SQLite. Why? Because we don't really need modular code here. I don't think we'll ever need to use, say, SELECT statement translation as a component in any other context – therefore is it really useful to separate it? I only propose that approach as a starting point. Perhaps it will make things unnecessarily complex and we'll have to look for other ways to slice it. I'm just not too excited about preemptively decoupling things that may not need decoupling at all. The HTML Tag processor, for example, is just a single class even though it tackles a complex problem. It's a huge advantage – most methods are private so we can adjust them without breaking BC, you never have to analyze a dependency graph or wonder where should a specific piece of logic go, you always have full access to the entire state, and you can extend it easily. I've tried splitting that logic a few times in the current iteration of the plugin and the outcome was always messy. Making a distinction between translating and interpreting is tempting, but it complicates the design. Perhaps things are different now that we have AST – if yes, then we should hit the wall really soon with a single class approach. |
@adamziel Thanks for your insights! I did not consider that separation of concerns also creates new public APIs as a side effect. That's a very important point. Your proposal makes a lot of sense to me. Somehow, it seemed natural to me that the HTML Tag Processor, or the MySQL lexer, for instance, is handled by a single class, but I wasn't sure about the driver. I guess for parser-like logic, a single huge class is quite common, but in the case of the driver, I started to see too many different things (DB connection, query translation, query execution, etc.), and got myself caught overthinking it.
The only downside is that the giant class can will have multiple responsibilities, and it can get a bit difficult to read. For instance, some state variables will make sense for selects, but not for inserts, etc. But we can mitigate that with good naming, docs, and making the state as simple as possible.
Wow, what a codebase! 😂 And a nice demonstration of what over-architecting can cause.
💯
Makes sense!
This! 💯 Looking at the architecture, I'd like to simplify also the current draft with The For I will quickly try out these simplifications to see if they could make sense. |
Additionally, move the current SQLite driver prototype to a temporary class WP_SQLite_Driver_Prototype to be gradually moved to the new driver.
b8d04ea
to
63460dd
Compare
63460dd
to
a02f751
Compare
a02f751
to
b8b4500
Compare
$this->assertSame( null, $root->get_descendant_token( 123 ) ); | ||
|
||
// Test multiple descendant methods. | ||
// @TODO: Consider breadth-first search vs depth-first search. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@adamziel I initially kept the logic from the first prototype (which was traversing descendants breadth-first), and added this note to look into it later, but now it seems to me it may be more useful and natural to do a dept-first search (which would return tokens in the order from the original query).
I'm still not 100% sure which one we need more — usually, I just want to know "is there this node or that token somewhere in this subtree", and in that case, both searches work. But sometimes it may be useful to check for a next token or more, for instance, and then get_descendant_tokens()
would be more natural with a depth-first traversal. If really needed, we could also provide both. What do you think?
@adamziel Another thought regarding Sometimes, it could be useful in the driver, if we want to translate some constructs in a way "if I'm in the context of That said, I think there's a problem — adding parent references to |
@JanJakes what would be a specific code example where a parent reference would help? I get "do this, do that," but what's "this" and "that"? |
Also the HTML and the XML API solve the same problem with the get_ breadcrumbs a method that gives you a stack of parent elements names. Maybe the same would use here for AST notes? |
@adamziel For now, I've run into the
Without the case 'fieldDefinition':
$has_primary_key = ...;
$has_autoincrement = ...;
if ( $has_primary_key ) {
$children = $ast->get_children();
$data_type_node = array_shift( $children );
$data_type = $this->translate( $data_type_node );
if ( 'INTEGER' === $data_type ) {
$data_type = $has_autoincrement ? 'INTEGER' : 'INT';
}
return $data_type . $this->translate_sequence( $children );
}
return $this->translate_sequence( $ast->get_children() ); Here, the However, in the case of |
Two ideas I got to avoid a parent reference:
|
Are these actually different? You need to traverse the tree node by node in either case. Or would the latter mean doing one pass to build a flat index of, say, modifiers and options, and then doing the translation in the second pass? Because I don't think that index could be flat - we'd still need a recursive data structure to account for subqueries. |
@adamziel I think they are a bit different:
Looking at createTable:
TEMPORARY_SYMBOL? TABLE_SYMBOL ifNotExists? tableName (
LIKE_SYMBOL tableRef
| OPEN_PAR_SYMBOL LIKE_SYMBOL tableRef CLOSE_PAR_SYMBOL
| (OPEN_PAR_SYMBOL tableElementList CLOSE_PAR_SYMBOL)? createTableOptions? partitionClause? duplicateAsQueryExpression?
)
; we would be able to get a list of all column and constraint definitions, organize them somehow, and then generate the new Here's maybe a good example for thinking about this: A
As a consequence, option 1 fixes only the particular cases that SQLite supports differently, leaving the rest mostly to their similarities. If there is a specific keyword or modifier we need to ignore, it needs to be done explicitly. With option 2, we would be picking and allow-listing the things we support, possibly ignoring any other ones. In many cases, I think option 2 may be a bit more verbose, unless there are many exceptions and edge cases that would have to be handled in option 1. Ultimately, we can do a little bit of both — sometimes we can be more descriptive for a specific node, we can decompose it and fully reconstruct in an option 2 fashion, while still leveraging option 1 as well by translating some subnodes recursively. The decision may be needed on a case-by-case, but it is a question of what would be a useful guideline to deciding this. And here's a particular example: In MySQL, a PRIMARY KEY with AUTOINCREMENT can be defined in both of these ways: While in SQLite, b) is a syntax error. It means that we have to rewrite it to something like a) because AUTOINCREMENT in SQLite can be only in the column definition and only after a PRIMARY KEY constraint. And here we're touching on the questions — handling this particular case vs reading and normalizing the full statement. Another interesting example:
Going back to the two options, with option 1, we need to catch and handle this case, while with option 2, it would probably just work due to the full statement reading and normalization, at the likely cost of more verbosity. Ultimately, it may also work that we only normalize the PRIMARY KEY & AUTOINCREMENT definitions, leaving the rest as-is. At the moment, I don't know yet if there will be more of such cases in this particular statement. |
04d7c17
to
665d8bc
Compare
🎄 UpdateI'm finally here with an update, I'm sorry for being overdue. It's a longer read, probably should've been split into multiple. I'll start with a summary:
The full list of failures as of today can be found here: https://gist.github.com/JanJakes/23e36446b24c88bee7ca81415d9f6059 Now, I'd like to share some more details about the progress, and the rabbit holes I encountered along the way. Simple statements -> Basic DDL -> Information schemaA couple of weeks ago, I started to implement the new SQLite driver to translate MySQL statements to SQLite statements, using the new MySQL parser. I implemented the basics of a few simple statements, such as This turned out to bring some complexities, and after further discussions, and realizing that a pure MySQL -> SQLite DDL conversion is lossy, we decided to implement a subset of MySQL's Implementing
|
Monumental work here @JanJakes, and a tremendous update. Mad props.
I would love that! It feels like a good spot to park, do a round or a few rounds of review, and merge. Let's just rewire the CI tests to the old driver to avoid failing checks on trunk. |
$this->pdo->query( 'PRAGMA encoding="UTF-8";' ); | ||
|
||
$valid_journal_modes = array( 'DELETE', 'TRUNCATE', 'PERSIST', 'MEMORY', 'WAL', 'OFF' ); | ||
if ( defined( 'SQLITE_JOURNAL_MODE' ) && in_array( SQLITE_JOURNAL_MODE, $valid_journal_modes, true ) ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How easy/hard would if be to avoid referencing any global constants and variables in this driver? Ideally we could use it outside of WordPress as a drop-in PDO replacement and it's not easy to see all the configuration options when we rely on those constants. Can we source this information from a constructor argument? If we need a bit of dedicated code to connect it with WordPress, we could have a separate function that glues all the WordPress-specific bits with new WP_SQLite_Driver()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just marking that this is copy-pasted from the legacy translator. It will be definitely great to refactor this to an explicit set of config options 👍
a8a1a93
to
335388b
Compare
This PR is part of the Advanced MySQL support project and builds on the previous work in Exhaustive MySQL parser.
This is a WIP, currently stacked on top of #157.