Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Data Liberation] Topological sorter, entities remapping and add missing imports #2030

Draft
wants to merge 70 commits into
base: trunk
Choose a base branch
from

Conversation

zaerl
Copy link
Collaborator

@zaerl zaerl commented Nov 26, 2024

You can find a discussion here with a more detailed explanation: #2090

Motivation for the change, related issues

Topological Sorting of WXR entities before starting the import to ensure parent posts are imported before child posts. Processing WXR in a topological order may require an index with all offsets and lengths of items in the WXR.

Implementation details

Entities preloading/loading/mapping

The topological sort happens during a STAGE_TOPOLOGICAL_SORT phase runt before the STAGE_FRONTLOAD_ASSETS.

This PR removes the WP_Entity_Importer::mapping and WP_Entity_Importer::exists arrays and all memory preload. It is slow and memory-consuming, with thousands of entries and not support sessions. It adds a new table with import IDs and mapped IDs. During the first phase, it is prefilled. During the WP_Entity_Importer import it maps the imported IDs by using the wxr_importer_* filters and actions.

New WP-CLI script

This PR also introduces the new CLI script and moves the logger there. moved to #2104

New unit tests

  1. Added all the 130 WordPress core unit tests. Updated the WXRs to latest version.
  2. Added a PlaygroundTestCase base class that cleanup the WordPress database after a test, as _delete_all_data does
    Moved to [Data Liberation] Add support for terms meta and new unit tests #2105

Missing imports and hierarchies

Added support for:

  1. Posts hierarchy
  2. Term hierarchy
  3. Term meta
  4. Comment meta

New PHPUnit filter

This PR adds a PHPUNIT_FILTER constant to packages/playground/data-liberation/tests/import/blueprint-import.json. If the value is not falsy it will be passed to PHPUnit when calling npx nx run playground-data-liberation:test:wp-phpunit. So, for example, you can set "PHPUNIT_FILTER": "WPRewriteUrlsTests". It will be the same as running phpunit --filter WPRewriteUrlsTests.
Moved to #2105

Testing Instructions (or ideally a Blueprint)

Unit tests

npx nx run playground-data-liberation:test:wp-phpunit

# Or only tests that do not need a WordPress environment
cd packages/playground/data-liberation
./vendor/bin/phpunit

Many of the tests can be run only in a real WordPress environment, that's why wp-phpunit exists.

Data test

  1. Spin Playground
  2. Import whatever .xml you find in packages/playground/data-liberation/tests/wxr/*
  3. In the above discussion, you can use a plugin to create an XML with thousands of entries if you want.

About to add and pass all the tests found here https://github.com/WordPress/wordpress-importer/tree/master/phpunit/tests.

New script

wp data-liberation import test.xml

or:

cd packages/playground/data-liberation/bin/import
bash import-wxr.sh a-folder-with-xmls

It accepts a file, a URL, or a folder. Run wp help data-liberation import to see all options. moved to #2104

@zaerl zaerl force-pushed the add/topological-sort branch from c51d9c4 to 7778714 Compare November 26, 2024 13:52
@zaerl zaerl self-assigned this Nov 26, 2024
@zaerl zaerl force-pushed the add/topological-sort branch 2 times, most recently from 85c850b to 78f1cb3 Compare November 29, 2024 11:19
@zaerl zaerl force-pushed the add/topological-sort branch from 78f1cb3 to 5af5722 Compare November 29, 2024 13:02
* Quicksort performs badly on already sorted arrays, O(n^2) is the worst case.
* Let's consider using a different sorting algorithm.
*/
uksort( $elements, $sort_callback );
Copy link
Collaborator

@adamziel adamziel Nov 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This requires fitting all $elements into memory. This may be fine for v1, since every $element is relatively small, but we'll hit the limits of this approach sooner than later. Is it possible to perform topological sorting with at a reasonable speed without holding everything in memory? If not, how much RAM would this need to process one of these huge VIP 1TB exports?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can save the data in a custom DB table (Jetpack have a custom wp_jetpack_sync_queue table) instead of a simple array in memory. I didn't disturb the database in this first phase, but we can. Reproducing this in a table and using the custom sort here is straightforward. At the end of collecting the byte offsets, it's a matter of ALTER TABLE wxr_sorting ORDER BY custom_sorting_like_the_one_above, and we are done, even with our new reentrancy model.

When it finds a row in this table, the streamer can jump to the "correct" position (the one it should load before the current one) and proceed to the next one.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lovely! If it's straightforward, let's include it in v1 – it seems like we can ship something that's on the right track with relatively little effort. It would also enforce shaping the API to stream the sorted list instead of loading it into memory, which would have implications for the reentrancy cursor. It may seem excessive for v1, but thinking about it more it seems like something that could have a ripple through the entire system if we don't account for streaming early on.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may seem excessive for v1, but thinking about it more it seems like something that could have a ripple through the entire system if we don't account for streaming early on.

Agree. It's reasonable, let's proceed this way. 👍

@@ -93,7 +93,7 @@ public function __construct( $options = array() ) {
$this->mapping['term_id'] = array();
$this->requires_remapping = $empty_types;
$this->exists = $empty_types;
$this->logger = new Logger();
$this->logger = isset( $options['logger'] ) ? $options['logger'] : new WP_Logger();
Copy link
Collaborator

@adamziel adamziel Dec 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually not sure we need a logger beyond _doing_it_wrong(). All the progress information are now exposed as numbers and the API consumer may implements its own logging using any technique.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense, thanks.

Comment on lines 230 to 236
if ( true === $this->topological_sort_next_entity( $count ) ) {
return true;
}

// We indexed all the entities. Now sort them topologically.
$this->topological_sorter->sort_topologically();
$this->topological_sorter = null;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It reads confusing to me:

  1. Sort every entity we encounter and short circuit while we have entities
  2. Once we run out of entities to sort, sort them

What would be another way to call these operations?

/**
* The base name of the table.
*/
const TABLE_NAME = 'data_liberation_index';
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is called index, but it's not used in the index_next_entities() stage, only in the topological sort stage. How about data_liberation_topological_index or something to that effect?

* Run in the 'plugins_loaded' action.
*/
public static function load() {
if ( self::DB_VERSION !== (int) get_site_option( self::OPTION_NAME ) ) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused about the distinction between load() and activate() – could we either have one or document them a bit more to explain the usage?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The activation/deactivation is when a user clicks "activate" and "deactivate" link. The load action is usually used if you need to change schema:

The best practices are here:

  1. Installation, activation steps: https://developer.wordpress.org/plugins/plugin-basics/activation-deactivation-hooks/
  2. Custom tables: https://learn.wordpress.org/lesson/custom-database-tables/
  3. Schema change: https://developer.wordpress.org/plugins/creating-tables-with-plugins/

Names can be misleading. Let me add more docs.

$table_name = self::get_table_name();

// Create the table if it doesn't exist.
// @TODO: remove this custom SQLite declaration after first phase of unit tests is done.
Copy link
Collaborator

@adamziel adamziel Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the CREATE TABLE statement doesn't work with the legacy SQLite integration we're using now? cc @JanJakes this could be useful for testing

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adamziel @zaerl The MySQL query actually seems to work fine with the current translator.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JanJakes it is, thanks! Removed the special SQL.

/**
* Run by register_deactivation_hook.
*/
public static function deactivate() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would there be any advantage to running this when the plugin is uninstalled and not just deactivated?

Copy link
Collaborator Author

@zaerl zaerl Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've read this here https://learn.wordpress.org/lesson/custom-database-tables/:

If the users of your plugin do not need the data in this table if they deactivate the plugin, you could trigger this on the plugin deactivation hook.

register_deactivation_hook( __FILE__, 'wp_learn_delete_table' );
However, if the data in that table is important, and your users might want to keep it, even if the plugin is deactivated, you could delete the table using one of the two uninstall methods available to plugins.

Our data is not "important" in the strict sense of the term. It can be recreated again; it's just bulky. So, removing it when the user deactivates the plugin is okay. What do you think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good 👍

* @param int $byte_offset The byte offset of the category.
* @param array $data The category data.
*/
public function map_category( $byte_offset, $data ) {
Copy link
Collaborator

@adamziel adamziel Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

map makes me think of map/reduce and a mapper callback. How about insert_category or add_category or index_category? Ditto for the other map_ method

'element_id' => (string) $data['term_id'],
'parent_id' => $category_parent,
'byte_offset' => $byte_offset,
// Items with a parent has at least a sort order of 2.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's document this more. What's the logic behind the sort_order of 1, 2 etc? Should it be more for deeply nested categories or not?

*
* @var int
*/
protected $orphan_post_counter = 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to restore this (or any other) value after stopping and resuming the import?

*
* @return int|bool The byte offset of the category, or false if the category is not found.
*/
public function get_category_byte_offset( $session_id, $slug ) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How many methods can we make private? Every public method makes a public API and gets covered by the WordPress indefinite BC policy upon merge to WordPress core.


return $wpdb->get_var(
$wpdb->prepare(
'SELECT byte_offset FROM %i WHERE element_id = %s AND element_type = %d AND session_id = %d LIMIT 1',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to fetch these in chunks of 100 or 1000 to avoid hitting the database every time? Also, noting the docstring says "and remove it from the list" while the method doesn't perform the removal.


// MySQL version - update sort_order using a subquery
return $wpdb->query(
$wpdb->prepare(
Copy link
Collaborator

@adamziel adamziel Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How well does this perform on large datasets? Running a single, table-wide UPDATE that would potentially affect a few hundred million rows is making me anxious, although that's just a gut feeling.

@zaerl zaerl force-pushed the add/topological-sort branch from ecc7336 to e3ba973 Compare January 8, 2025 22:48
* We create a custom table that contains the IDs and the new IDs created in the
* target system sorted in the parent-child order.
*
* This class extends the WP_WXR_Entity_Reader class and overrides the
Copy link
Collaborator

@adamziel adamziel Jan 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat idea! I initially thought about ingraining the topological sorting knowledge in the streaming importer to apply it to all possible data sources and worried about disabling it for, say, Markdown imports.

Thinking about that more, WXR may be the only source we'll see in the near future that yields data in a child-first order. I kind of like keeping and maturing this logic in the entity reader class. Even if we need to move it to the importer one day, the API will mostly stay the same. Cool!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see, the logic is split between this reader the Importer. Hm. Couldn't we contain it to just one of these places?

* read_next_entity function to emit the entities in the correct order.
*
* List of entities Sort order
* entity 1 entity 1 3
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does the number 3 stand for? Is it "max depth of descendants"?

* The base name of the table used to store the IDs, the new IDs and the
* sort order.
*/
const TABLE_NAME = 'data_liberation_map';
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If sorting is WXR-specific, let's give this table a WXR-specific name that also mentions sorting. Maybe data_liberation_wxr_import_topologically_sorted_entities? It's long, sure, but it's clear.

Comment on lines +58 to +60
* The current session ID.
*/
protected $current_session = null;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Let's clarify the word "session" as it can have multiple meanings in a PHP app

Suggested change
* The current session ID.
*/
protected $current_session = null;
* The current import session ID.
*/
protected $import_session_id = null;

// Initialize WP_WXR_Entity_Reader.
$reader = parent::create( $upstream, $cursor, $options );

if ( array_key_exists( 'post_id', $options ) ) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep the names consistent – if it's a session ID, let's call the option session_id

if ( ! empty( $next_cursor ) ) {
$next_cursor = json_decode( $next_cursor, true );

/*if ( ! empty( $next_cursor ) ) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be restored or removed? I remember you've mentioned an infinite loop – was it here?

* Run during the register_activation_hook or similar actions. It creates
* the table if it doesn't exist.
*/
public static function create_or_update_db() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Let's clarify this is about a table

Suggested change
public static function create_or_update_db() {
public static function create_or_update_db_table() {

* Run by register_deactivation_hook or similar. It drops the table and
* deletes the option.
*/
public static function delete_db() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Let's clarify this is about a table

* @param array $data The data to map.
* @param mixed $cursor_id The stream cursor ID.
*/
public function add_next_entity( $entity = null ) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this logic be moved to the stream importer? I'm confused seeing a Reader that's also a Writer. Also, the method name add_next_entity suggests I can append an entity that's not in the original WXR file, which is also confusing.

return false;
}

$entity = $entity ?? $this->current();
Copy link
Collaborator

@adamziel adamziel Jan 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to this PR, but for posterity:

I'm having second thoughts about the "every entity reader is an iterator" idea.

There's get_entity() and current(). The latter has a name that suggests it's the same as the former, but it doesn't actually do the same thing. The PHP iterator design assumes a newly created iterator is already initialized and current() comes before next(), which is in direct opposition to the Reader that expects next_entity() to be called before get_entity().

I'm thinking we could either remove the iterator-ness completely, or create a single new Entity_Reader_Iterator( $reader ) class to have a clearly separated single point of entry.


// Default sort order is 1.
$sort_order = 1;
$cursor_id = $this->get_reentrancy_cursor();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or just $cursor?

$parent_id_type = null;
$check_existing = true;

// Map the parent ID if the entity has one.
Copy link
Collaborator

@adamziel adamziel Jan 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we mapping IDs while sorting? I think this already came up in one of the previous versions of the sorter. Mapping seems like a separate problem we'll need to handle for every entity source, not just WXR.

Copy link
Collaborator

@adamziel adamziel Jan 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be confused here. It seems like "mapping parent ID" refers to assigning the parent value to $new_entity and not the "reading a post with one ID but inserting it with another parent ID." Also, entity refers to the thing we import, $new_entity refers to a row in the database table we use for sorting.

As I read, I find myself slowing down frequently to understand the ideas behind the words. I'd like to ask you to revisit the terminology used in this PR, especially in comments and variable and function names. Let's make the words more precise than they need to be. My goal is to make all the concepts unambiguous so that a junior WordPress dev could read it without getting confused about any part of the codebase. It will make the next review much easier for me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Needs review
Development

Successfully merging this pull request may close these issues.

3 participants