
AST-based SQLite driver #164

Draft: wants to merge 37 commits into develop
Conversation

JanJakes
Collaborator

This PR is part of the Advanced MySQL support project and builds on the previous work in Exhaustive MySQL parser.

This is a WIP, currently stacked on top of #157.

@JanJakes JanJakes mentioned this pull request Nov 14, 2024
@JanJakes
Collaborator Author

@adamziel Looking deeper into this, I think that translating specific AST subnodes is pretty straightforward; the open question is the driver/translator architecture, and I'm going back and forth between multiple options.

Splitting the driver & translator
I'm not sure whether having everything in one large class is a good idea, and I'm exploring ways to split the code a bit. That said, there is the large switch/case statement in the prototype that doesn't need to know much about the AST structure — it handles some nodes specifically and recursively passes many others back down to the translator. This has some advantages; for example, we don't need to worry much about where a specific grammar rule can appear.
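To make that concrete, here's a minimal sketch of the recursive switch/case shape, using a stand-in node class and an invented `backtickIdentifier` rule (the real WP_Parser_Node API and grammar rule names differ):

```php
// Stand-in node class for illustration; the real WP_Parser_Node differs.
class Demo_Node {
	public $rule;
	public $children; // Child Demo_Node instances or plain token strings.
	public function __construct( $rule, $children ) {
		$this->rule     = $rule;
		$this->children = $children;
	}
}

function demo_translate( $node ) {
	if ( is_string( $node ) ) {
		return $node; // Most tokens pass through unchanged.
	}
	switch ( $node->rule ) {
		case 'backtickIdentifier': // Invented rule name, for illustration only.
			// A node that needs MySQL-to-SQLite-specific handling.
			return '"' . trim( $node->children[0], '`' ) . '"';
		default:
			// The default case doesn't care where a rule can appear:
			// it just translates all children recursively.
			return implode( ' ', array_map( 'demo_translate', $node->children ) );
	}
}
```

The nice property is that only the `case` branches need MySQL-specific knowledge; everything else falls through the default.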

Splitting top-level constructs
That said, some kind of distinction makes a lot of sense to me, at least at the highest level. There is a big difference whether we're running a SELECT statement, or an UPDATE statement, or CREATE TABLE, etc. These top-level statements have pretty different semantics, they can return different things, and some special constructs make sense only in some cases (e.g., SQL_CALC_FOUND_ROWS works only with SELECT statements). Therefore, I'm pretty confident that on the top level, it makes sense to split these use cases a bit into separate classes.

To add a bit more context, even though the top-level statement rule can be re-used in a compound statement for stored procedures, stored functions, triggers, and events, it's probably good to distinguish that case from the top-level use case. E.g., an ALTER TABLE ... statement on the top-level may result in directly executing multiple SQLite queries, while when the same is passed as a procedure body, we may need to say it's unsupported.

Splitting further?
However, going a little bit deeper, any kind of distinction can be problematic. There are surely some grammar rules that can appear only in the context of others — tableElementList can never appear in the SELECT clause, for instance — but would it make any sense to implement such rules in CREATE/UPDATE translator class? Or, should there be one huge switch to handle all the grammar rules, with maybe the exception of the top-level ones, as suggested above?

Translating vs. interpreting
Another way of looking at organizing the code is the difference between translating MySQL syntax into SQLite syntax vs. emulating a MySQL statement with multiple different SQLite statements or PHP calls. We can't completely separate these two, but I think it would be nice to be clear about where we are just translating, and where some logic is actually executed. Again, it seems to me that the top-level vs. deeper level could help here — each top-level statement could be handled by a "handler" class that would interpret the query in full, and these handlers could use a generic "translator" within them, whose task would only be to translate some MySQL constructs into SQLite syntax.

class WP_SQLite_Create_Table_Translator {
	public function translate_create_table( WP_Parser_Node $ast ) {
		// CREATE TABLE ... LIKE ...
		// CREATE TABLE ... (LIKE ...)
		$has_like = $ast->get_descendant_token( WP_MySQL_Lexer::LIKE_SYMBOL );
		if ( $has_like ) {
			// Here we will need to run a query to get the table structure,
			// and then create the table:
			return $this->create_table_like( $ast );
		}

		// CREATE TABLE ... ( elements ) options
		// CREATE TABLE ... AS ...
		$elements_ast = $ast->get_child( 'tableElementList' );
		$has_as       = $ast->get_descendant_token( WP_MySQL_Lexer::AS_SYMBOL );
		if ( $elements_ast && $has_as ) {
			throw new \Exception( '"CREATE TABLE" with both column/constraint definitions and an "AS" clause is not supported' );
		}

		// SQLite supports both ( elements ) and "AS", so we can just rewrite
		// the query and run it. Rewriting can be done either partly here,
		// or fully in the translator:
		$query = $this->translator->translate( $ast );
		return $this->connection->query( $query );

		// Or, we rewrite the top-level `CREATE TABLE` here and nested nodes
		// in the translator:
		$query = new WP_SQLite_Expression(
			array_merge(
				array( WP_SQLite_Token_Factory::raw( 'CREATE TABLE' ) ),
				array(
					$this->translate(
						$ast->get_child( 'tableName' )->get_child()
					),
				),
				array( WP_SQLite_Token_Factory::raw( '(' ) ),
				$this->translator->translate( $ast->get_child( 'tableElementList' ) ),
				array( WP_SQLite_Token_Factory::raw( ')' ) )
			)
		);
		return $this->connection->query( $query );
	}
}

I'm not yet sure what the optimal solution is, but it feels like at least some separation of concerns would be useful. It would also be nice if we could solidify AST node translation by unit testing the translated fragments as well (not only via the end-to-end tests run against SQLite), as such tests would give immediate feedback and great diffs.

@adamziel
Collaborator

adamziel commented Nov 15, 2024

Is there any downside to starting with a single giant class and splitting once that no longer serves the purpose?

Separation of concerns is a useful principle but it cuts both ways. Take Enterprise FizzBuzz. Great separation of concerns, extremely unreadable codebase. Too many concerns were defined and separated.

Therefore, I propose starting with just a single concern: Running MySQL queries in SQLite. Why? Because we don't really need modular code here. I don't think we'll ever need to use, say, SELECT statement translation as a component in any other context – therefore is it really useful to separate it?

I only propose that approach as a starting point. Perhaps it will make things unnecessarily complex and we'll have to look for other ways to slice it. I'm just not too excited about preemptively decoupling things that may not need decoupling at all.

The HTML Tag processor, for example, is just a single class even though it tackles a complex problem. It's a huge advantage – most methods are private so we can adjust them without breaking BC, you never have to analyze a dependency graph or wonder where should a specific piece of logic go, you always have full access to the entire state, and you can extend it easily.

I've tried splitting that logic a few times in the current iteration of the plugin, and the outcome was always messy. Making a distinction between translating and interpreting is tempting, but it complicates the design. Perhaps things are different now that we have an AST — if so, then we should hit the wall really soon with the single-class approach.

@JanJakes
Collaborator Author

@adamziel Thanks for your insights! I did not consider that separation of concerns also creates new public APIs as a side effect. That's a very important point. Your proposal makes a lot of sense to me. Somehow, it seemed natural to me that the HTML Tag Processor, or the MySQL lexer, for instance, is handled by a single class, but I wasn't sure about the driver. I guess for parser-like logic, a single huge class is quite common, but in the case of the driver, I started to see too many different things (DB connection, query translation, query execution, etc.), and got myself caught overthinking it.

Is there any downside to starting with a single giant class and splitting once that no longer serves the purpose?

The only downside is that the giant class will have multiple responsibilities, and it can get a bit difficult to read. For instance, some state variables will make sense for SELECTs but not for INSERTs, etc. But we can mitigate that with good naming, docs, and keeping the state as simple as possible.

Separation of concerns is a useful principle but it cuts both ways. Take Enterprise FizzBuzz. Great separation of concerns, extremely unreadable codebase. Too many concerns were defined and separated.

Wow, what a codebase! 😂 And a nice demonstration of what over-architecting can cause.

Therefore, I propose starting with just a single concern: Running MySQL queries in SQLite.

💯

I only propose that approach as a starting point. Perhaps it will make things unnecessarily complex and we'll have to look for other ways to slice it. I'm just not too excited about preemptively decoupling things that may not need decoupling at all.

Makes sense!

It's a huge advantage – most methods are private so we can adjust them without breaking BC

This! 💯


Looking at the architecture, I'd also like to simplify the current draft with WP_SQLite_Expression and WP_SQLite_Query_Builder. I think maybe we could make it work without them.

The WP_SQLite_Query_Builder takes the expression and stringifies it, joining all strings with a space and wrapping sub-expressions with parentheses. It covers many use cases out of the box, but it's not really a definition of the SQLite SQL dialect or anything like that, and it's not very flexible — for instance, a list of column definitions is delimited by a comma, not a space, and maybe the current implementation would put spaces even into places where they aren't expected. Apart from that, it performs some operations on some tokens (quoting identifiers, for instance), which could be done by the driver itself.

For WP_SQLite_Expression, it may be possible to replace it within the driver with return types of the recursive translate calls — either we return a token, or we return an array (~ the expression). In any case, if we need to keep it for some reason, I'd consider renaming it to something like WP_SQLite_Node to avoid confusion with SQL language expressions or any hint that it maps to some SQLite construct.

I will quickly try out these simplifications to see if they could make sense.

Base automatically changed from recursive-descent-parser to develop November 20, 2024 07:39
$this->assertSame( null, $root->get_descendant_token( 123 ) );

// Test multiple descendant methods.
// @TODO: Consider breadth-first search vs depth-first search.
Collaborator Author

@adamziel I initially kept the logic from the first prototype (which was traversing descendants breadth-first), and added this note to look into it later, but now it seems to me it may be more useful and natural to do a depth-first search (which would return tokens in the order they appear in the original query).

I'm still not 100% sure which one we need more — usually, I just want to know "is there this node or that token somewhere in this subtree", and in that case, both searches work. But sometimes it may be useful to check for the next token or a few, for instance, and then get_descendant_tokens() would be more natural with a depth-first traversal. If really needed, we could also provide both. What do you think?
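For illustration, here's a depth-first collection over a stand-in tree (nested arrays as nodes, strings as tokens; not the actual WP_Parser_Node API), which yields tokens in source order:

```php
// Depth-first token collection: descend into each child before moving to
// the next sibling, so tokens come out in the order of the original query.
function demo_get_descendant_tokens_dfs( array $children ) {
	$tokens = array();
	foreach ( $children as $child ) {
		if ( is_array( $child ) ) {
			// A nested array stands in for a child node: recurse first.
			$tokens = array_merge( $tokens, demo_get_descendant_tokens_dfs( $child ) );
		} else {
			$tokens[] = $child;
		}
	}
	return $tokens;
}
```

A breadth-first variant would instead visit all immediate children before any grandchildren, so sibling subtrees would interleave in the output.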

@JanJakes
Collaborator Author

@adamziel Another thought regarding WP_Parser_Node: We now provide child and descendant methods. Should we also provide parent methods? It can be useful, but I think it has a major flaw. I'll explain:

Sometimes, it could be useful in the driver, if we want to translate some constructs in a way "if I'm in the context of createTable, do this, otherwise do that". This is a simple way to express some specific nuances, but it's not the only way — we could also handle it from the top within the createTable node instead, constructing the statement a bit more top-down (but it can be more verbose at times).

That said, I think there's a problem — adding parent references to WP_Parser_Node would create reference cycles. PHP's GC is based on reference counting, which frees an object as soon as nothing references it; circular parent-child references defeat that, so the trees would only be freed later by the cycle collector, with extra overhead. WeakReference was only introduced in PHP 7.4, so I guess currently, it doesn't make sense to do this. It also complicates the tree a bit and makes subtrees not "independent". There is the alternative of saving some context (a node stack) in the driver during the translation, for instance.
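For reference, this is roughly what a weak parent link would look like on PHP 7.4+ (illustrative class, not a proposal for the actual API):

```php
// A parent link held via WeakReference does not keep the parent alive,
// so no strong child -> parent -> child cycle is created.
class Demo_Tree_Node {
	private $parent_ref = null;

	public function set_parent( Demo_Tree_Node $parent ) {
		$this->parent_ref = WeakReference::create( $parent );
	}

	public function get_parent() {
		// Returns null once the parent has been garbage-collected.
		return $this->parent_ref ? $this->parent_ref->get() : null;
	}
}
```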

@adamziel
Collaborator

adamziel commented Nov 23, 2024

@JanJakes what would be a specific code example where a parent reference would help? I get "do this, do that," but what's "this" and "that"?

@adamziel
Collaborator

Also, the HTML and the XML APIs solve the same problem with the get_breadcrumbs() method, which gives you a stack of parent element names. Maybe the same would work here for AST nodes?

@JanJakes
Collaborator Author

@JanJakes what would be a specific code example where a parent reference would help? I get "do this, do that," but what's "this" and "that"?

@adamziel For now, I've run into the INTEGER PRIMARY KEY "without" vs "with" autoincrement — in the former case, I'd like to translate it as INT PRIMARY KEY while in the latter one as INTEGER PRIMARY KEY due to the "bug" that's preserved in SQLite for back compatibility. In the AST, it would look something like this:

createStatement
   |- createTable
     |- tableElementList
       |- tableElement          <-- Here I can reliably determine PRIMARY KEY
       |  |- columnDefinition
       |    |- fieldDefinition  <-- Here I can reliably determine AUTOINCREMENT
       |     |- dataType        <-- Here I want to know about PRIMARY KEY and AUTOINCREMENT
       |     |- columnAttribute <-- Here PRIMARY KEY and AUTOINCREMENT can be specified
       |     ...
       |- tableElement
         |- tableConstraintDef  <-- But PRIMARY KEY can also be specified here (AUTOINCREMENT not)

Without the tableConstraintDef option for primary keys, it would be simple enough to handle this case in fieldDefinition in a top-down manner. Something like:

case 'fieldDefinition':
	$has_primary_key   = ...;
	$has_autoincrement = ...;
	if ( $has_primary_key ) {
		$children       = $ast->get_children();
		$data_type_node = array_shift( $children );
		$data_type      = $this->translate( $data_type_node );
		if ( 'INTEGER' === $data_type ) {
			$data_type = $has_autoincrement ? 'INTEGER' : 'INT';
		}
		return $data_type . $this->translate_sequence( $children );
	}
	return $this->translate_sequence( $ast->get_children() );

Here, the $has_primary_key check can peek below for a descendant token, but it has no way to see the tableConstraintDef, which would be in a different tableElement. I would either need a reference to the parent tableElement to use it to search for a primary key, or I would need to save it somehow ($this->set_context( 'primary_key', ... )) and then read it below.


However, in the case of CREATE TABLE, I'm also wondering whether it may make more sense to fully parse the create table statement (into columns, constraints, etc.), and only then construct a query from that, like we do in the current translator. It's granular changes on specific AST nodes vs reading the full MySQL statement and generating a new SQLite statement from that. I don't know yet which of these options will be ultimately more convenient and reliable.

@adamziel
Collaborator

Two ideas I got to avoid a parent reference:

  • Could the driver or parser class keep an array like $child => $parent and expose a method such as get_parent_of?
  • The query execution process could keep track of the stack of nodes it entered
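Since PHP arrays can't use objects as keys, the first idea could be sketched with SplObjectStorage (all names here are invented for illustration):

```php
// Minimal node for illustration; the real WP_Parser_Node differs.
class Demo_N {
	public $children;
	public function __construct( $children = array() ) {
		$this->children = $children;
	}
}

// External child => parent index, kept outside the nodes themselves.
// Note it holds strong references to the nodes, so it should live only
// as long as the translation of a single query.
class Demo_Parent_Index {
	private $map;

	public function __construct() {
		$this->map = new SplObjectStorage();
	}

	public function index( $node ) {
		foreach ( $node->children as $child ) {
			if ( is_object( $child ) ) {
				$this->map[ $child ] = $node;
				$this->index( $child );
			}
		}
	}

	public function get_parent_of( $node ) {
		return isset( $this->map[ $node ] ) ? $this->map[ $node ] : null;
	}
}
```

The second idea (a stack of entered nodes) avoids even this index: the driver pushes each node on entry and pops on exit, so the current ancestors are always at hand.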

@adamziel
Collaborator

adamziel commented Nov 23, 2024

It's granular changes on specific AST nodes vs reading the full MySQL statement and generating a new SQLite statement from that. I don't know yet which of these options will be ultimately more convenient and reliable.

Are these actually different? You need to traverse the tree node by node in either case. Or would the latter mean doing one pass to build a flat index of, say, modifiers and options, and then doing the translation in the second pass? Because I don't think that index could be flat - we'd still need a recursive data structure to account for subqueries.

@JanJakes
Collaborator Author

It's granular changes on specific AST nodes vs reading the full MySQL statement and generating a new SQLite statement from that. I don't know yet which of these options will be ultimately more convenient and reliable.

Are these actually different? You need to traverse the tree node by node in either case. Or would the latter mean doing one pass to build a flat index of, say, modifiers and options, and then doing the translation in the second pass? Because I don't think that index could be flat - we'd still need a recursive data structure to account for subqueries.

@adamziel I think they are a bit different:

  1. The first approach with only the necessary granular changes often relies on the default case that just recursively translates all children and concatenates them with a space. At the moment, this often works thanks to the similarity of the two SQL dialects, so even tokens are reused (SELECT, parentheses, etc.). In a way, we're just taking the MySQL query and doing some adjustments, where needed. Over time, we'll probably cover the translation more and more on each node, and will rely a bit less on the shape of the original MySQL query.
  2. The second approach would be oriented more towards reading data and then constructing a whole new query, rather than doing only the necessary changes. The index of the elements could be flat, at least in this case. More generally, even with this approach, the recursion is still possible — if I know that I have a list of some definitions, I can still get them with $definition = $this->translate( ... ), in case they can contain expressions, subqueries, etc.

Looking at CREATE TABLE in particular, and the grammar branch that contains tableElementList (the last below),

createTable:
    TEMPORARY_SYMBOL? TABLE_SYMBOL ifNotExists? tableName (
        LIKE_SYMBOL tableRef
        | OPEN_PAR_SYMBOL LIKE_SYMBOL tableRef CLOSE_PAR_SYMBOL
        | (OPEN_PAR_SYMBOL tableElementList CLOSE_PAR_SYMBOL)? createTableOptions? partitionClause? duplicateAsQueryExpression?
    )
;

we would be able to get a list of all column and constraint definitions, organize them somehow, and then generate the new CREATE TABLE query. The question is whether we should do that, or if we should just go with option 1.


Here's maybe a good example for thinking about this:

A PRIMARY KEY constraint can be defined either within a particular column definition, or in a different table element, separately from the column definition. Now:

  1. With option 1, we'd keep the definition where it was in MySQL, not normalizing it in any way, and relying on the fact that both MySQL and SQLite support defining PRIMARY KEY in both places. When we would run into a specific constraint that can be defined in both places in MySQL, but only in a single one in SQLite, we'd just handle that specific case.
  2. With option 2, we would first create a list of all columns and constraints, and then generate a query from that. Inline constraints would be normalized to separate constraint definitions (or the other way round), etc.

As a consequence, option 1 fixes only the particular cases that SQLite supports differently, leaving the rest mostly to their similarities. If there is a specific keyword or modifier we need to ignore, it needs to be done explicitly.

With option 2, we would be picking and allow-listing the things we support, possibly ignoring any other ones. In many cases, I think option 2 may be a bit more verbose, unless there are many exceptions and edge cases that would have to be handled in option 1.

Ultimately, we can do a little bit of both — sometimes we can be more descriptive for a specific node, decomposing it and fully reconstructing it in an option 2 fashion, while still leveraging option 1 by translating some subnodes recursively. The decision may be needed on a case-by-case basis, but the question is what would be a useful guideline for deciding this.


And here's a particular example:

In MySQL, a PRIMARY KEY with AUTOINCREMENT can be defined in both of these ways:
a) CREATE TABLE t (id INT PRIMARY KEY AUTO_INCREMENT);
b) CREATE TABLE t (id INT AUTO_INCREMENT, PRIMARY KEY (id) );

While in SQLite, b) is a syntax error. It means that we have to rewrite it to something like a), because AUTOINCREMENT in SQLite can only appear in the column definition, and only after a PRIMARY KEY constraint. And here we're touching on the question — handling this particular case vs. reading and normalizing the full statement.


Another interesting example:

  • In MySQL, we can write both PRIMARY KEY AUTO_INCREMENT and AUTO_INCREMENT PRIMARY KEY.
  • In SQLite, it always needs to be PRIMARY KEY AUTOINCREMENT, the other one is a syntax error.

Going back to the two options, with option 1, we need to catch and handle this case, while with option 2, it would probably just work due to the full statement reading and normalization, at the likely cost of more verbosity.

Ultimately, it may also work that we only normalize the PRIMARY KEY & AUTOINCREMENT definitions, leaving the rest as-is. At the moment, I don't know yet if there will be more of such cases in this particular statement.

@JanJakes
Collaborator Author

🎄 Update

I'm finally here with an update, and I'm sorry for being overdue. It's a longer read and probably should've been split into multiple updates.

I'll start with a summary:

  1. 61% (71/116) of the original tests are passing. Before starting the INFORMATION_SCHEMA rewrite, it was 33% (38/116).
  2. A significant portion of some of the hardest statements is done — CREATE TABLE and ALTER TABLE operations.
  3. The basic forms of SHOW and DESCRIBE using the INFORMATION_SCHEMA are implemented.
  4. However, I think there's also still quite a bit of work left to get to 100%, especially to get everything correct and stable.
  5. The remaining failing tests are:
    • String matching and regexp support (LIKE, RLIKE, etc.) — 8 tests.
    • Date/time related columns — 5 tests.
    • SQL_CALC_FOUND_ROWS — 4 tests.
    • LEFT function — 4 tests.
    • SHOW TABLES with LIKE and FROM — 4 tests.
    • ON CONFLICT ... — 3 tests.
    • SELECT FROM DUAL — 2 tests.
    • Other things like CREATE TEMPORARY TABLE, HAVING COUNT variations, CURRENT_TIMESTAMP syntax, ON UPDATE, CAST as binary, etc.
  6. For more details, check out the WP_SQLite_Information_Schema_Builder.

The full list of failures as of today can be found here: https://gist.github.com/JanJakes/23e36446b24c88bee7ca81415d9f6059


Now, I'd like to share some more details about the progress, and the rabbit holes I encountered along the way.

Simple statements -> Basic DDL -> Information schema

A couple of weeks ago, I started to implement the new SQLite driver to translate MySQL statements to SQLite statements, using the new MySQL parser. I implemented the basics of a few simple statements, such as SELECT, INSERT, UPDATE, REPLACE, and DELETE, and then I switched my focus to DDL statements, such as CREATE TABLE and ALTER TABLE.

This turned out to bring some complexities, and after further discussions, and realizing that a pure MySQL -> SQLite DDL conversion is lossy, we decided to implement a subset of MySQL's INFORMATION_SCHEMA tables.

Implementing INFORMATION_SCHEMA tables

First, I decided to focus on the schema tables that are necessary to describe tables, columns, indexes, and constraints. I started by implementing the CREATE TABLE statement. This is essentially an AST to information schema query conversion. In its basics, it's a simple logic, but reaching high coverage and correctness turned out to be pretty extensive.

Here are some interesting issues I encountered:

  • Some columns depend on each other. Sometimes, a charset determines a collation (we need a full map of default collations for each charset), and sometimes a collation determines a charset (here, it's just its prefix). Column lengths and numeric precisions depend on data types, charset (we need a map of charset to maximum bytes per character), etc.
  • Some tables depend on each other. Adding or dropping a constraint may affect the nullability and some other attributes of a column as well, and this data needs to be synchronized. Maybe in the future, we could reduce the number of actual information schema tables, and use views to present the data (MySQL has a similar approach from version 8).
  • Some operations have non-trivial effects on the schema. Dropping a column will remove it from existing indexes, meaning that multi-column index records may need to be renumbered, and other column attributes may change.
  • Some information schema values require a lot of normalization. There are so many ways to write varchars, national varchars, and text fields (char varying, character varying, varchar, national varchar, nvarchar, nchar varchar, national char varying, nchar varying, long char varying, long character varying, long varchar, ...), and some keywords can change their meaning (long converts varchar to text, appending CHARSET BINARY seems to change varchar to varbinary, etc.).
  • For some values, we may need to implement an AST -> MySQL string conversion. More on that below.

I've either addressed or TODO-ed all the issues I encountered, but more extensive test cases for some of them are still needed. More details can be seen in WP_SQLite_Information_Schema_Builder, which implements all of the AST -> information schema logic.

MySQL AST -> string conversion

As I was working on the information schema builder, I realized I not only need to read the MySQL AST, but in some cases, it's also necessary to convert some nodes back to strings. A simple example of this is reading a table name from a node that has multiple nesting levels below (identifier -> pureIdentifier -> IDENTIFIER). A more complex example is a column definition with a DEFAULT(expr), in which case I need to save the serialized expression in the INFORMATION_SCHEMA.

I think there are multiple approaches we could consider:

  1. Maybe just walking the tree and joining all tokens with spaces would work.
  2. We could store the original whitespaces in the AST, and then join the values.
  3. We could store offsets in the original SQL input for each node, and then just cut out what we need.
  4. We could implement an AST -> SQL string printer, maybe a bit more involved than option 1.

Additionally, I often need a bit more than the original string. For identifiers, the raw value is not enough — I need to unquote it first (replace quoted sequences with characters, trim quotes), and it's a question of where such logic belongs.
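For the identifier case specifically, the unquoting logic is small and self-contained either way. MySQL identifiers are backtick-quoted, with a doubled backtick escaping a literal one (a sketch; the real helper may live elsewhere and have a different name):

```php
// Unquote a MySQL backtick-quoted identifier: strip the outer backticks
// and replace each escaped `` sequence with a literal backtick.
function demo_unquote_mysql_identifier( $value ) {
	if ( strlen( $value ) >= 2 && '`' === $value[0] && '`' === substr( $value, -1 ) ) {
		return str_replace( '``', '`', substr( $value, 1, -1 ) );
	}
	return $value; // Not quoted; return as-is.
}
```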

ALTER TABLE

The next topic after storing the CREATE TABLE information in the INFORMATION_SCHEMA was implementing support for the most common table-altering operations. For that, I refactored my initial draft into reusable methods, and realized that ALTER TABLE statements bring another set of challenges.

SQLite has limited support for ALTER TABLE statements, so I started by trying to map the statements that SQLite can handle to their SQLite counterparts, while falling back to a table-copying approach for the other ones. This turned out to be a rabbit hole:

  1. ALTER TABLE statements can contain multiple operations, and they often do. Therefore, I first needed to decide whether I had to copy the table, or whether I would be able to map a single MySQL statement to multiple SQLite ALTER TABLE statements.
  2. Determining which MySQL ALTER TABLE statements are directly supported by SQLite is not straightforward. It may seem that an operation, such as ADD COLUMN is simply supported, but the reality is that it might not be in all cases (e.g., with FIRST/AFTER ...).
  3. Implementing the "direct" plus "fallback to full copy" approach means doing many things twice.

For these reasons and after some experimenting, I decided to first focus on implementing the full table-copy ALTER TABLE algorithm, leaving the direct translation of some operations as an optimization for later. The approach is pretty straightforward:

  1. Reflect all ALTER TABLE operations in the INFORMATION_SCHEMA.
  2. Run the SQLite ALTER TABLE algorithm with full table copy as per the SQLite docs.

This turned out to work flawlessly and to be very helpful in the initial implementation phase. At the moment, implementing a new ALTER TABLE operation is as simple as reflecting its changes in the information schema. This is great for broadening statement support before focusing on optimizing some of the statements.

Finally, the ALTER TABLE algorithm recommended in the SQLite docs requires only one table copy, so in the new driver, we don't do two copies like we did in the old one.
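For reference, the shape of that table-copy procedure from the SQLite docs, demonstrated with PDO against an in-memory database (simplified; real code would also recreate indexes, triggers, and views, and would verify foreign keys with PRAGMA foreign_key_check before committing):

```php
$pdo = new PDO( 'sqlite::memory:' );
$pdo->exec( 'CREATE TABLE t (id INTEGER, name TEXT)' );
$pdo->exec( "INSERT INTO t VALUES (1, 'a'), (2, 'b')" );

// Foreign key enforcement must be toggled outside the transaction.
$pdo->exec( 'PRAGMA foreign_keys = OFF' );
$pdo->beginTransaction();

// 1. Create a new table with the desired (altered) schema.
$pdo->exec( 'CREATE TABLE t_new (id INTEGER PRIMARY KEY, name TEXT NOT NULL)' );
// 2. Copy the data across (the single table copy).
$pdo->exec( 'INSERT INTO t_new (id, name) SELECT id, name FROM t' );
// 3. Drop the old table and move the new one into its place.
$pdo->exec( 'DROP TABLE t' );
$pdo->exec( 'ALTER TABLE t_new RENAME TO t' );

$pdo->commit();
$pdo->exec( 'PRAGMA foreign_keys = ON' );
```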

Connecting the pieces

What turned out to be challenging in this first phase of implementing the INFORMATION_SCHEMA support was that until I reached some level of completeness, I couldn't see any progress in terms of passing tests, and it wasn't easy to estimate where exactly we were regarding compatibility with the existing SQLite translator.

This is because:

  1. The actual database operations can't be implemented before the data is stored in INFORMATION_SCHEMA.
  2. Saving only some data in the INFORMATION_SCHEMA means the tests will fail because the data is incomplete.
  3. Implementing the INFORMATION_SCHEMA builder and the CREATE/ALTER TABLE operations is not enough either. Most of the tests verify the results using SHOW and DESCRIBE statements.

Therefore, while working on the INFORMATION_SCHEMA, I didn't see any progress on the original SQL translation test suite (but I did write a new set of tests along the way). But the reward comes at the end — quickly prototyping SHOW and DESCRIBE statements with the INFORMATION_SCHEMA at hand turned out to be easy and straightforward. Suddenly, many tests started to pass, and for some others, this revealed incorrect assertions, as the old driver doesn't always have 100% correct data at hand.

Next steps

There is still a good amount of work left to reach 100% of the test suite for the current SQLite translator, but I think the current state provides a good foundation for many of the hard parts, and building on that will hopefully be easier. Beyond that, there are some MySQL features that can likely be ported from the old translator without much of a rewrite.

Additionally, the existing test suite may not fully cover everything that the current translator implements, so I think it would be good to look at multiple measures:

  1. The percentage of passing tests in the existing test suite.
  2. Checking the existing translator method by method and ensuring we implement all the operations.
  3. Running WordPress with the new driver.
  4. Examining the new driver using Playground Tester.

That said, maybe the path to get this merged, e.g., with a feature flag, is to focus on point 1, and do the rest in other PRs.

@adamziel
Collaborator

Monumental work here @JanJakes, and a tremendous update. Mad props.

That said, maybe the path to get this merged, e.g., with a feature flag, is to focus on point 1, and do the rest in other PRs.

I would love that! It feels like a good spot to park, do a round or a few rounds of review, and merge. Let's just rewire the CI tests to the old driver to avoid failing checks on trunk.

$this->pdo->query( 'PRAGMA encoding="UTF-8";' );

$valid_journal_modes = array( 'DELETE', 'TRUNCATE', 'PERSIST', 'MEMORY', 'WAL', 'OFF' );
if ( defined( 'SQLITE_JOURNAL_MODE' ) && in_array( SQLITE_JOURNAL_MODE, $valid_journal_modes, true ) ) {
Collaborator

How easy/hard would it be to avoid referencing any global constants and variables in this driver? Ideally, we could use it outside of WordPress as a drop-in PDO replacement, and it's not easy to see all the configuration options when we rely on those constants. Can we source this information from a constructor argument? If we need a bit of dedicated code to connect it with WordPress, we could have a separate function that glues all the WordPress-specific bits with new WP_SQLite_Driver()
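A hypothetical sketch of what that could look like (class and option names are invented, not the actual driver API):

```php
// Hypothetical sketch only: explicit configuration replacing the
// SQLITE_JOURNAL_MODE global constant. All names here are illustrative.
class WP_SQLite_Driver_Demo_Config {
	public $journal_mode;

	public function __construct( array $options = array() ) {
		$valid_journal_modes = array( 'DELETE', 'TRUNCATE', 'PERSIST', 'MEMORY', 'WAL', 'OFF' );
		// DELETE is SQLite's default journal mode.
		$mode = isset( $options['journal_mode'] ) ? $options['journal_mode'] : 'DELETE';
		if ( ! in_array( $mode, $valid_journal_modes, true ) ) {
			throw new InvalidArgumentException( 'Invalid journal mode: ' . $mode );
		}
		$this->journal_mode = $mode;
	}
}
```

A small WordPress-specific glue function could then read the constants and build this config, keeping the driver itself free of globals.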

Collaborator Author

Just noting that this is copy-pasted from the legacy translator. It will definitely be great to refactor this into an explicit set of config options 👍
