[Data Liberation] Topological sorter, entities remapping and add missing imports #2030

zaerl · 2024-11-26T13:50:45Z

You can find a discussion here with a more detailed explanation: #2090

Motivation for the change, related issues

Sort WXR entities topologically before starting the import so that parent posts are imported before their child posts. Processing a WXR file in topological order may require an index with the offsets and lengths of all items in the file.
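As an illustration only, the parent-before-child ordering described above can be sketched as a breadth-first walk over an in-memory index. The function name `sort_parents_first` and the row keys `parent_id` and `byte_offset` are assumptions made for this example, not the code in this PR.

```php
<?php
/**
 * Hypothetical sketch: order indexed entities so that every parent comes
 * before its children. Each index row holds a parent ID (0 = no parent)
 * and the byte offset of the entity in the WXR file.
 */
function sort_parents_first( array $index ): array {
	$children = array();
	$roots    = array();

	foreach ( $index as $id => $row ) {
		$parent = $row['parent_id'];
		// Entities whose parent is absent from the file are treated as roots.
		if ( 0 === $parent || ! isset( $index[ $parent ] ) ) {
			$roots[] = $id;
		} else {
			$children[ $parent ][] = $id;
		}
	}

	// Breadth-first walk: a child is queued only after its parent is emitted.
	// Entities caught in a parent cycle are unreachable and are left out.
	$sorted = array();
	$queue  = $roots;
	while ( ! empty( $queue ) ) {
		$id       = array_shift( $queue );
		$sorted[] = array(
			'id'          => $id,
			'byte_offset' => $index[ $id ]['byte_offset'],
		);
		foreach ( $children[ $id ] ?? array() as $child_id ) {
			$queue[] = $child_id;
		}
	}

	return $sorted;
}

// Post 10 appears in the file before its parent, post 20.
$index = array(
	10 => array( 'parent_id' => 20, 'byte_offset' => 100 ),
	20 => array( 'parent_id' => 0, 'byte_offset' => 250 ),
	30 => array( 'parent_id' => 10, 'byte_offset' => 400 ),
	40 => array( 'parent_id' => 99, 'byte_offset' => 550 ),
);

$order = array_column( sort_parents_first( $index ), 'id' );
// $order is array( 20, 40, 10, 30 )
```

The importer could then seek to each `byte_offset` in that order. This sketch keeps the whole index in memory.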

Implementation details

Entities preloading/loading/mapping

The topological sort happens during a STAGE_TOPOLOGICAL_SORT phase that runs before STAGE_FRONTLOAD_ASSETS.

This PR removes the WP_Entity_Importer::mapping and WP_Entity_Importer::exists arrays and all in-memory preloading. That approach is slow and memory-consuming with thousands of entries, and it does not support sessions. The PR adds a new table with import IDs and mapped IDs. The table is prefilled during the first phase. During the WP_Entity_Importer import, the imported IDs are mapped using the wxr_importer_* filters and actions.

New WP-CLI script

~~This PR also introduces the new CLI script and moves the logger there.~~ moved to #2104

New unit tests

~~Added all the 130 WordPress core unit tests. Updated the WXRs to latest version.~~
~~Added a PlaygroundTestCase base class that cleanup the WordPress database after a test, as _delete_all_data does~~
Moved to [Data Liberation] Add support for terms meta and new unit tests #2105

Missing imports and hierarchies

Added support for:

Posts hierarchy
Term hierarchy
~~Term meta~~
Comment meta

New PHPUnit filter

This PR adds a PHPUNIT_FILTER constant to packages/playground/data-liberation/tests/import/blueprint-import.json. If the value is not falsy it will be passed to PHPUnit when calling npx nx run playground-data-liberation:test:wp-phpunit. So, for example, you can set "PHPUNIT_FILTER": "WPRewriteUrlsTests". It will be the same as running phpunit --filter WPRewriteUrlsTests.
Moved to #2105

Testing Instructions (or ideally a Blueprint)

Unit tests

npx nx run playground-data-liberation:test:wp-phpunit

# Or only tests that do not need a WordPress environment
cd packages/playground/data-liberation
./vendor/bin/phpunit

Many of the tests can only run in a real WordPress environment, which is why wp-phpunit exists.

Data test

Spin up Playground
Import any .xml file you find in packages/playground/data-liberation/tests/wxr/*
The discussion linked above mentions a plugin you can use to create an XML file with thousands of entries if you want.

About to add and pass all the tests found here https://github.com/WordPress/wordpress-importer/tree/master/phpunit/tests.

New script

wp data-liberation import test.xml

~~or:~~

cd packages/playground/data-liberation/bin/import
bash import-wxr.sh a-folder-with-xmls

~~It accepts a file, a URL, or a folder. Run wp help data-liberation import to see all options.~~ moved to #2104

adamziel · 2024-11-29T13:38:40Z

packages/playground/data-liberation/src/import/WP_Topological_Sorter.php

+		 * Quicksort performs badly on already sorted arrays, O(n^2) is the worst case.
+		 * Let's consider using a different sorting algorithm.
+		 */
+		uksort( $elements, $sort_callback );


This requires fitting all $elements into memory. This may be fine for v1, since every $element is relatively small, but we'll hit the limits of this approach sooner rather than later. Is it possible to perform topological sorting at a reasonable speed without holding everything in memory? If not, how much RAM would this need to process one of these huge VIP 1TB exports?

We can save the data in a custom DB table (Jetpack has a custom wp_jetpack_sync_queue table) instead of a simple array in memory. I didn't touch the database in this first phase, but we can. Reproducing this in a table and using the custom sort here is straightforward. At the end of collecting the byte offsets, it's a matter of ALTER TABLE wxr_sorting ORDER BY custom_sorting_like_the_one_above, and we are done, even with our new reentrancy model.

When it finds a row in this table, the streamer can jump to the "correct" position (the one it should load before the current one) and proceed to the next one.

Lovely! If it's straightforward, let's include it in v1 – it seems like we can ship something that's on the right track with relatively little effort. It would also enforce shaping the API to stream the sorted list instead of loading it into memory, which would have implications for the reentrancy cursor. It may seem excessive for v1, but thinking about it more it seems like something that could have a ripple through the entire system if we don't account for streaming early on.

It may seem excessive for v1, but thinking about it more it seems like something that could have a ripple through the entire system if we don't account for streaming early on.

Agree. It's reasonable, let's proceed this way. 👍
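For illustration, streaming the sorted list instead of loading it into memory could look roughly like the generator below. Everything here is an assumption for the sake of the example: the `stream_sorted_rows` name, the `sort_key` column, and the `$fetch_page` callable, which stands in for a query along the lines of `SELECT ... WHERE sort_key > ? ORDER BY sort_key LIMIT ?`. It assumes `sort_key` values are unique.

```php
<?php
/**
 * Hypothetical sketch: yield sorted index rows one at a time while holding
 * at most one page of rows in memory.
 *
 * @param callable $fetch_page Receives ( $after_sort_key, $limit ) and returns
 *                             up to $limit rows with a sort_key greater than
 *                             $after_sort_key (null means "from the start").
 * @param int      $page_size  Number of rows to fetch per round trip.
 */
function stream_sorted_rows( callable $fetch_page, int $page_size = 1000 ): Generator {
	$after = null;
	do {
		$rows = $fetch_page( $after, $page_size );
		foreach ( $rows as $row ) {
			// The last key seen is all that is needed to resume later.
			$after = $row['sort_key'];
			yield $row;
		}
	} while ( count( $rows ) === $page_size );
}

// In-memory stand-in for the sorted table, already ordered by sort_key.
$table = array(
	array( 'sort_key' => 1, 'byte_offset' => 250 ),
	array( 'sort_key' => 2, 'byte_offset' => 100 ),
	array( 'sort_key' => 3, 'byte_offset' => 400 ),
	array( 'sort_key' => 4, 'byte_offset' => 550 ),
	array( 'sort_key' => 5, 'byte_offset' => 700 ),
);

$fetch_page = function ( $after, $limit ) use ( $table ) {
	$rows = array();
	foreach ( $table as $row ) {
		if ( null === $after || $row['sort_key'] > $after ) {
			$rows[] = $row;
		}
		if ( count( $rows ) === $limit ) {
			break;
		}
	}
	return $rows;
};

$offsets = array();
foreach ( stream_sorted_rows( $fetch_page, 2 ) as $row ) {
	$offsets[] = $row['byte_offset'];
}
```

Because each page is requested by the last key seen rather than by a numeric offset, that key could double as part of a reentrancy cursor. That is one possible design, not necessarily the one this PR adopts.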

adamziel · 2024-12-09T23:31:00Z

packages/playground/data-liberation/src/import/WP_Entity_Importer.php

@@ -93,7 +93,7 @@ public function __construct( $options = array() ) {
 		$this->mapping['term_id']   = array();
 		$this->requires_remapping   = $empty_types;
 		$this->exists               = $empty_types;
-		$this->logger               = new Logger();
+		$this->logger               = isset( $options['logger'] ) ? $options['logger'] : new WP_Logger();


I'm actually not sure we need a logger beyond _doing_it_wrong(). All the progress information is now exposed as numbers, and the API consumer may implement its own logging using any technique.

Makes sense, thanks.

adamziel · 2024-12-09T23:37:23Z

packages/playground/data-liberation/src/import/WP_Stream_Importer.php

+				if ( true === $this->topological_sort_next_entity( $count ) ) {
+					return true;
+				}
+
+				// We indexed all the entities. Now sort them topologically.
+				$this->topological_sorter->sort_topologically();
+				$this->topological_sorter = null;


It reads confusing to me:

Sort every entity we encounter and short circuit while we have entities

Once we run out of entities to sort, sort them

What would be another way to call these operations?

adamziel · 2024-12-10T00:17:03Z

packages/playground/data-liberation/src/import/WP_Topological_Sorter.php

+	/**
+	 * The base name of the table.
+	 */
+	const TABLE_NAME = 'data_liberation_index';


This is called index, but it's not used in the index_next_entities() stage, only in the topological sort stage. How about data_liberation_topological_index or something to that effect?

adamziel · 2024-12-10T00:19:17Z

packages/playground/data-liberation/src/import/WP_Topological_Sorter.php

+	 * Run in the 'plugins_loaded' action.
+	 */
+	public static function load() {
+		if ( self::DB_VERSION !== (int) get_site_option( self::OPTION_NAME ) ) {


I'm confused about the distinction between load() and activate() – could we either have one or document them a bit more to explain the usage?

The activation/deactivation hooks run when a user clicks the "Activate" and "Deactivate" links. The load action is usually used when you need to change the schema.

The best practices are here:

Installation, activation steps: https://developer.wordpress.org/plugins/plugin-basics/activation-deactivation-hooks/

Custom tables: https://learn.wordpress.org/lesson/custom-database-tables/

Schema change: https://developer.wordpress.org/plugins/creating-tables-with-plugins/

Names can be misleading. Let me add more docs.

adamziel · 2024-12-10T00:20:22Z

packages/playground/data-liberation/src/import/WP_Topological_Sorter.php

+		$table_name = self::get_table_name();
+
+		// Create the table if it doesn't exist.
+		// @TODO: remove this custom SQLite declaration after first phase of unit tests is done.


I understand the CREATE TABLE statement doesn't work with the legacy SQLite integration we're using now? cc @JanJakes this could be useful for testing

@adamziel @zaerl The MySQL query actually seems to work fine with the current translator.

@JanJakes it is, thanks! Removed the special SQL.

adamziel · 2024-12-10T00:22:03Z

packages/playground/data-liberation/src/import/WP_Topological_Sorter.php

+	/**
+	 * Run by register_deactivation_hook.
+	 */
+	public static function deactivate() {


Would there be any advantage to running this when the plugin is uninstalled and not just deactivated?

I've read this here https://learn.wordpress.org/lesson/custom-database-tables/:

If the users of your plugin do not need the data in this table if they deactivate the plugin, you could trigger this on the plugin deactivation hook.

register_deactivation_hook( __FILE__, 'wp_learn_delete_table' );
However, if the data in that table is important, and your users might want to keep it, even if the plugin is deactivated, you could delete the table using one of the two uninstall methods available to plugins.

Our data is not "important" in the strict sense of the term. It can be recreated again; it's just bulky. So, removing it when the user deactivates the plugin is okay. What do you think?

Sounds good 👍

adamziel · 2024-12-10T00:22:39Z

packages/playground/data-liberation/src/import/WP_Topological_Sorter.php

+	 * @param int $byte_offset The byte offset of the category.
+	 * @param array $data The category data.
+	 */
+	public function map_category( $byte_offset, $data ) {


map makes me think of map/reduce and a mapper callback. How about insert_category or add_category or index_category? Ditto for the other map_ method

adamziel · 2024-12-10T00:23:46Z

packages/playground/data-liberation/src/import/WP_Topological_Sorter.php

+				'element_id'   => (string) $data['term_id'],
+				'parent_id'    => $category_parent,
+				'byte_offset'  => $byte_offset,
+				// Items with a parent has at least a sort order of 2.


Let's document this more. What's the logic behind the sort_order of 1, 2 etc? Should it be more for deeply nested categories or not?

adamziel · 2024-12-10T00:24:54Z

packages/playground/data-liberation/src/import/WP_Topological_Sorter.php

+	 *
+	 * @var int
+	 */
+	protected $orphan_post_counter = 0;


Do we need to restore this (or any other) value after stopping and resuming the import?

adamziel · 2024-12-10T00:26:56Z

packages/playground/data-liberation/src/import/WP_Topological_Sorter.php

+	 *
+	 * @return int|bool The byte offset of the category, or false if the category is not found.
+	 */
+	public function get_category_byte_offset( $session_id, $slug ) {


How many methods can we make private? Every public method makes a public API and gets covered by the WordPress indefinite BC policy upon merge to WordPress core.

adamziel · 2024-12-10T00:29:03Z

packages/playground/data-liberation/src/import/WP_Topological_Sorter.php

+
+		return $wpdb->get_var(
+			$wpdb->prepare(
+				'SELECT byte_offset FROM %i WHERE element_id = %s AND element_type = %d AND session_id = %d LIMIT 1',


Would it make sense to fetch these in chunks of 100 or 1000 to avoid hitting the database every time? Also, noting the docstring says "and remove it from the list" while the method doesn't perform the removal.

adamziel · 2024-12-10T00:30:20Z

packages/playground/data-liberation/src/import/WP_Topological_Sorter.php

+
+		// MySQL version - update sort_order using a subquery
+		return $wpdb->query(
+			$wpdb->prepare(


How well does this perform on large datasets? Running a single, table-wide UPDATE that would potentially affect a few hundred million rows is making me anxious, although that's just a gut feeling.

adamziel · 2025-01-09T14:48:02Z

packages/playground/data-liberation/src/entity-readers/WP_WXR_Sorted_Entity_Reader.php

+ * We create a custom table that contains the IDs and the new IDs created in the
+ * target system sorted in the parent-child order.
+ *
+ * This class extends the WP_WXR_Entity_Reader class and overrides the


Neat idea! I initially thought about ingraining the topological sorting knowledge in the streaming importer to apply it to all possible data sources and worried about disabling it for, say, Markdown imports.

Thinking about that more, WXR may be the only source we'll see in the near future that yields data in a child-first order. I kind of like keeping and maturing this logic in the entity reader class. Even if we need to move it to the importer one day, the API will mostly stay the same. Cool!

Ah, I see, the logic is split between this reader and the Importer. Hm. Couldn't we contain it to just one of these places?

adamziel · 2025-01-09T14:48:59Z

packages/playground/data-liberation/src/entity-readers/WP_WXR_Sorted_Entity_Reader.php

+ * read_next_entity function to emit the entities in the correct order.
+ *
+ * List of entities      Sort order
+ * entity 1              entity 1          3


What does the number 3 stand for? Is it "max depth of descendants"?

adamziel · 2025-01-09T14:50:31Z

packages/playground/data-liberation/src/entity-readers/WP_WXR_Sorted_Entity_Reader.php

+	 * The base name of the table used to store the IDs, the new IDs and the
+	 * sort order.
+	 */
+	const TABLE_NAME = 'data_liberation_map';


If sorting is WXR-specific, let's give this table a WXR-specific name that also mentions sorting. Maybe data_liberation_wxr_import_topologically_sorted_entities? It's long, sure, but it's clear.

adamziel · 2025-01-09T14:53:03Z

packages/playground/data-liberation/src/entity-readers/WP_WXR_Sorted_Entity_Reader.php

+	 * The current session ID.
+	 */
+	protected $current_session = null;


Nit: Let's clarify the word "session" as it can have multiple meanings in a PHP app

Suggested change

-	 * The current session ID.
-	 */
-	protected $current_session = null;
+	 * The current import session ID.
+	 */
+	protected $import_session_id = null;

adamziel · 2025-01-09T15:05:24Z

packages/playground/data-liberation/src/entity-readers/WP_WXR_Sorted_Entity_Reader.php

+		// Initialize WP_WXR_Entity_Reader.
+		$reader = parent::create( $upstream, $cursor, $options );
+
+		if ( array_key_exists( 'post_id', $options ) ) {


Let's keep the names consistent – if it's a session ID, let's call the option session_id

adamziel · 2025-01-09T15:06:59Z

packages/playground/data-liberation/src/entity-readers/WP_WXR_Sorted_Entity_Reader.php

+		if ( ! empty( $next_cursor ) ) {
+			$next_cursor = json_decode( $next_cursor, true );
+
+			/*if ( ! empty( $next_cursor ) ) {


Should this be restored or removed? I remember you've mentioned an infinite loop – was it here?

adamziel · 2025-01-09T15:11:11Z

packages/playground/data-liberation/src/entity-readers/WP_WXR_Sorted_Entity_Reader.php

+	 * Run during the register_activation_hook or similar actions. It creates
+	 * the table if it doesn't exist.
+	 */
+	public static function create_or_update_db() {


Nit: Let's clarify this is about a table

Suggested change

-	public static function create_or_update_db() {
+	public static function create_or_update_db_table() {

adamziel · 2025-01-09T15:11:30Z

packages/playground/data-liberation/src/entity-readers/WP_WXR_Sorted_Entity_Reader.php

+	 * Run by register_deactivation_hook or similar. It drops the table and
+	 * deletes the option.
+	 */
+	public static function delete_db() {


Nit: Let's clarify this is about a table

adamziel · 2025-01-09T15:15:33Z

packages/playground/data-liberation/src/entity-readers/WP_WXR_Sorted_Entity_Reader.php

+	 * @param array  $data The data to map.
+	 * @param mixed  $cursor_id The stream cursor ID.
+	 */
+	public function add_next_entity( $entity = null ) {


Could this logic be moved to the stream importer? I'm confused seeing a Reader that's also a Writer. Also, the method name add_next_entity suggests I can append an entity that's not in the original WXR file, which is also confusing.

adamziel · 2025-01-09T15:21:52Z

packages/playground/data-liberation/src/entity-readers/WP_WXR_Sorted_Entity_Reader.php

+			return false;
+		}
+
+		$entity      = $entity ?? $this->current();


Unrelated to this PR, but for posterity:

I'm having second thoughts about the "every entity reader is an iterator" idea.

There's get_entity() and current(). The latter has a name that suggests it's the same as the former, but it doesn't actually do the same thing. The PHP iterator design assumes a newly created iterator is already initialized and current() comes before next(), which is in direct opposition to the Reader that expects next_entity() to be called before get_entity().

I'm thinking we could either remove the iterator-ness completely, or create a single new Entity_Reader_Iterator( $reader ) class to have a clearly separated single point of entry.
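A minimal sketch of the second option, assuming a reader that exposes only next_entity() and get_entity(). The `Sketch_Entity_Reader` interface and the `Array_Entity_Reader` class are hypothetical stand-ins invented for this example. The real entity readers may look different.

```php
<?php
/** Hypothetical minimal reader contract, used only for this sketch. */
interface Sketch_Entity_Reader {
	public function next_entity(): bool;
	public function get_entity();
}

/** Adapts a forward-only reader to PHP's Iterator protocol. */
final class Entity_Reader_Iterator implements Iterator {
	private $reader;
	private $started = false;
	private $valid   = false;
	private $key     = -1;

	public function __construct( Sketch_Entity_Reader $reader ) {
		$this->reader = $reader;
	}

	// PHP iterators start positioned on the first item, so the first access
	// performs the initial next_entity() call on the reader.
	private function ensure_started(): void {
		if ( ! $this->started ) {
			$this->started = true;
			$this->advance();
		}
	}

	private function advance(): void {
		$this->valid = $this->reader->next_entity();
		if ( $this->valid ) {
			++$this->key;
		}
	}

	#[\ReturnTypeWillChange]
	public function current() {
		$this->ensure_started();
		return $this->valid ? $this->reader->get_entity() : null;
	}

	#[\ReturnTypeWillChange]
	public function key() {
		$this->ensure_started();
		return $this->valid ? $this->key : null;
	}

	public function next(): void {
		$this->ensure_started();
		$this->advance();
	}

	public function valid(): bool {
		$this->ensure_started();
		return $this->valid;
	}

	// The reader is forward-only, so rewind() cannot restart the iteration.
	public function rewind(): void {
		$this->ensure_started();
	}
}

/** Array-backed reader, used only to demonstrate the adapter. */
final class Array_Entity_Reader implements Sketch_Entity_Reader {
	private $entities;
	private $position = -1;

	public function __construct( array $entities ) {
		$this->entities = array_values( $entities );
	}

	public function next_entity(): bool {
		if ( $this->position + 1 >= count( $this->entities ) ) {
			return false;
		}
		++$this->position;
		return true;
	}

	public function get_entity() {
		return $this->entities[ $this->position ];
	}
}

$seen = array();
foreach ( new Entity_Reader_Iterator( new Array_Entity_Reader( array( 'post', 'comment' ) ) ) as $entity ) {
	$seen[] = $entity;
}
```

With an adapter like this, the reader itself would no longer need current() at all, which removes the ambiguity between current() and get_entity().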

adamziel · 2025-01-09T15:24:53Z

packages/playground/data-liberation/src/entity-readers/WP_WXR_Sorted_Entity_Reader.php

+
+		// Default sort order is 1.
+		$sort_order = 1;
+		$cursor_id  = $this->get_reentrancy_cursor();


Or just $cursor?

adamziel · 2025-01-09T15:28:29Z

packages/playground/data-liberation/src/entity-readers/WP_WXR_Sorted_Entity_Reader.php

+		$parent_id_type = null;
+		$check_existing = true;
+
+		// Map the parent ID if the entity has one.


Why are we mapping IDs while sorting? I think this already came up in one of the previous versions of the sorter. Mapping seems like a separate problem we'll need to handle for every entity source, not just WXR.

I may be confused here. It seems like "mapping parent ID" refers to assigning the parent value to $new_entity and not the "reading a post with one ID but inserting it with another parent ID." Also, entity refers to the thing we import, $new_entity refers to a row in the database table we use for sorting.

As I read, I find myself slowing down frequently to understand the ideas behind the words. I'd like to ask you to revisit the terminology used in this PR, especially in comments and variable and function names. Let's make the words more precise than they need to be. My goal is to make all the concepts unambiguous so that a junior WordPress dev could read it without getting confused about any part of the codebase. It will make the next review much easier for me.

zaerl force-pushed the add/topological-sort branch from c51d9c4 to 7778714 Compare November 26, 2024 13:52

zaerl self-assigned this Nov 26, 2024

zaerl added [Aspect] Data Liberation [Type] Enhancement New feature or request labels Nov 26, 2024

zaerl force-pushed the add/topological-sort branch 2 times, most recently from 85c850b to 78f1cb3 Compare November 29, 2024 11:19

adamziel mentioned this pull request Nov 29, 2024

Tracking Issue: Next-gen PHP Importers for Data Liberation #1894

Open

85 tasks

zaerl force-pushed the add/topological-sort branch from 78f1cb3 to 5af5722 Compare November 29, 2024 13:02

zaerl mentioned this pull request Dec 2, 2024

Importing a WXR with Site Editor templates should work #1996

Closed

zaerl force-pushed the add/topological-sort branch from 495be8b to e95618e Compare December 4, 2024 14:05

zaerl mentioned this pull request Dec 10, 2024

[Data Liberation] Wrong comment meta key values #2067

Open

zaerl added 17 commits January 8, 2025 23:26

Add new unit tests

34e2752

Fix: remove debug code

f6601eb

Add a set_session method

f58bb44

Add support for sessions

7615432

Fix: serialized term meta

1aba667

Fix: missing brace

98565ec

Remove "count" parameter

787c224

Add new sorter

b11fe9b

Add unit test

9d19eb9

Removed all changes of #2105 and #2104

0b68a60

Removed import scrit

19db782

Fix: remove terms meta from import session

28fe35d

Fix: restore functions.php file

7e2c1cf

Add fseek() support

8ed77ed

Fix: typo

2bf73dc

Fix: set cursor_id to null

5ae2e14

Fix: rename class to follow new standard

e3ba973

zaerl force-pushed the add/topological-sort branch from ecc7336 to e3ba973 Compare January 8, 2025 22:48
