
TA1 quick-start guide

This aims to be a tutorial, or a quick-start guide, for newcomers to the D3M project who are interested in writing TA1 primitives. It is not meant to be a comprehensive guide to everything about D3M, or even just TA1. The goal here is for the reader to be able to write a new, simple, but working primitive by the end of this tutorial. To achieve this goal, this tutorial is divided into several sections.

Important links

First, here is a list of some important links that should help you with reference and instructional material beyond this quick start guide. Be aware also that the d3m core package source code has extensive docstrings that :ref:`you may find helpful <api-reference>`.

Overview of primitives and pipelines

Let's start with basic definitions in order for us to understand a little bit better what happens when we run a pipeline later in the tutorial.

A pipeline is basically a series of steps that are executed in order to solve a particular problem (such as prediction based on historical data). A step of a pipeline is usually a primitive (a step can be something else, however, like a sub-pipeline, but for the purposes of this tutorial, assume that each step is a primitive): something that individually could, for example, transform data into another format, or fit a model for prediction. There are many types of primitives (see the primitives index repo for the full list of available primitives). In a pipeline, the steps must be arranged so that each step can read the data in the format produced by the preceding step.

For this tutorial, let's try to use the example pipeline that comes with a primitive called d3m.primitives.classification.logistic_regression.SKlearn to predict baseball hall-of-fame players, based on their stats (see the 185_baseball dataset).

Let's take a look at the example pipeline. Many example pipelines can be found in the primitives index repo, where they demonstrate how to use particular primitives. At the time of this writing, an example pipeline can be found here, but this repository's directory names and files periodically change, so it is prudent to see how to navigate to this file too.

The index is organized as follows, with each level nested inside the previous one:

  • v2020.1.9 (version of the core package of the index, changes periodically)
    • JPL (the organization that develops/maintains the primitive)
      • d3m.primitives.classification.logistic_regression.SKlearn (the python path of the actual primitive)
        • 2019.11.13 (the version of this primitive, changes periodically)
          • pipelines
            • 862df0a2-2f87-450d-a6bd-24e9269a8ba6.json (actual pipeline description filename, changes periodically)

Early on in this JSON document, you will see a list called steps. This is the actual list of primitive steps that run one after another in a pipeline. Each step has the information about the primitive, as well as arguments, outputs, and hyper-parameters, if any. This specific pipeline has 5 steps (the d3m.primitives prefix is omitted in the following list):

  • data_transformation.dataset_to_dataframe.Common
  • data_transformation.column_parser.Common
  • data_cleaning.imputer.SKlearn
  • classification.logistic_regression.SKlearn
  • data_transformation.construct_predictions.Common
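
As a rough illustration, the same list can be read out of a pipeline description programmatically. The sketch below assumes only the layout described above (a top-level steps list whose entries carry primitive.python_path); the inline document is a trimmed, hypothetical stand-in for the real pipeline file, and list_step_paths is a helper name invented for this example.

```python
import json

# Trimmed, hypothetical stand-in for a pipeline description file.
PIPELINE_JSON = """
{
  "steps": [
    {"type": "PRIMITIVE",
     "primitive": {"python_path": "d3m.primitives.data_transformation.dataset_to_dataframe.Common"}},
    {"type": "PRIMITIVE",
     "primitive": {"python_path": "d3m.primitives.classification.logistic_regression.SKlearn"}}
  ]
}
"""

PREFIX = "d3m.primitives."


def list_step_paths(pipeline):
    """Return the python paths of all primitive steps, without the common prefix."""
    paths = []
    for step in pipeline["steps"]:
        if step.get("type") != "PRIMITIVE":
            continue  # e.g. a sub-pipeline step, which has no primitive entry
        path = step["primitive"]["python_path"]
        if path.startswith(PREFIX):
            path = path[len(PREFIX):]
        paths.append(path)
    return paths


for number, path in enumerate(list_step_paths(json.loads(PIPELINE_JSON))):
    print(number, path)
```

To try it on a real pipeline, replace json.loads(PIPELINE_JSON) with json.load on the opened pipeline file.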

Now let's take a look at the first primitive step in that pipeline. We can find the source code of this primitive in the common-primitives repo (common_primitives/dataset_to_dataframe.py). Take a look particularly at the produce method. This is essentially what the primitive does. Try to do this for the other primitive steps in the pipeline as well: take a cursory look at what each one essentially does (note that for the actual classifier primitive, you should look at the fit method as well to see how the model is trained). Primitives whose python path has a *.Common suffix are in the common-primitives repository, and those that have a *.SKlearn suffix are in the sklearn-wrap repository (check out the dist branch, to which primitives are being generated).

If you're having a hard time looking for the correct source file, you can try taking the primitive id from the primitive step description in the pipeline, and grep for it. For example, if you were looking for the source code of the first primitive step in this pipeline, first look at the primitive info in that step and get its id:

"primitive": {
  "id": "4b42ce1e-9b98-4a25-b68e-fad13311eb65",
  "version": "0.3.0",
  "python_path": "d3m.primitives.data_transformation.dataset_to_dataframe.Common",
  "name": "Extract a DataFrame from a Dataset"
},

Then, run this:

git clone https://gitlab.com/datadrivendiscovery/common-primitives.git
cd common-primitives
grep -r 4b42ce1e-9b98-4a25-b68e-fad13311eb65 . | grep -F .py
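
If grep is not available, a similar search can be sketched in Python. This is only an approximation of the command above: it walks a directory tree and reports the files with a given suffix that contain the id. The helper name find_files_containing is made up for this example.

```python
import os


def find_files_containing(root, needle, suffix=".py"):
    """Return the sorted paths under `root` that end in `suffix` and contain `needle`."""
    matches = []
    for directory, _subdirs, filenames in os.walk(root):
        for filename in filenames:
            if not filename.endswith(suffix):
                continue
            path = os.path.join(directory, filename)
            # Ignore undecodable bytes so that stray binary content does not abort the search.
            with open(path, encoding="utf-8", errors="ignore") as handle:
                if needle in handle.read():
                    matches.append(path)
    return sorted(matches)
```

Calling find_files_containing(".", "4b42ce1e-9b98-4a25-b68e-fad13311eb65") from inside the cloned repository should list the source file that defines the primitive.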

However, this series of commands assumes that you know exactly which repository contains the primitive's source code (the git clone command). Since this is probably not the case for an arbitrarily given primitive, there is a way to find out the repository URL of any primitive. It requires using a d3m Docker image, which is described in the next section.

Setting up a local d3m environment

In order to run a pipeline, you must have a Python environment where the d3m core package and the packages of the primitives are installed. While it is possible to set up a Python virtual environment and install the packages through pip, in this tutorial we're going to use the d3m Docker images instead (in many cases, even beyond this tutorial, this will save you a lot of time and effort trying to find any missing primitive packages, manually installing them, and troubleshooting installation errors). So, make sure Docker is installed on your system.

You can find the list of D3M docker images here. The one we're going to use in this tutorial is the stable primitives image (feel free to use the latest one instead; just change the image tag in the commands accordingly):

docker pull registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9

Once you have downloaded the image, we can finally run the d3m package (and hence run a pipeline). Before running a pipeline though, let's first try to get a list of what primitives are installed in the image's Python environment:

docker run --rm registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 python3.6 -m d3m primitive search

You should get a big list of primitives. All primitives known to D3M should be there.

You can also run the docker container in interactive mode (to run commands as if you had logged into the machine that the container provides) by using the -it option:

docker run --rm -it registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9

The previous section mentions a method of determining where the source code of an arbitrarily given primitive can be found. We can do this using the d3m python package within a d3m docker container. First get the python_path of the primitive step (see the JSON snippet above of the primitive's info from the pipeline). Then, run this command:

docker run --rm registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 python3.6 -m d3m primitive describe d3m.primitives.data_transformation.dataset_to_dataframe.Common

Near the top of the huge JSON string describing the primitive, you'll see "source", and inside it, "uris". To help read the JSON, you can use the jq utility:

docker run --rm -it registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9
python -m d3m primitive describe d3m.primitives.data_transformation.dataset_to_dataframe.Common | jq .source.uris

This should give the URI of the git repo where the source code of that primitive can be found. You can also substitute the primitive id for the python_path in that command, but the command usually returns a result faster if you provide the python_path. Note also that you can only do this for primitives that have been submitted for a particular image (primitives that are contained in the primitives index repo).
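
If jq is not at hand, the same field can be pulled out with a few lines of Python. The sketch below assumes, as the jq filter above does, that the description is a JSON object with a "source" entry containing "uris". The sample text is a heavily trimmed, hypothetical description, not real command output, and source_uris is a helper name invented here.

```python
import json

# Heavily trimmed, hypothetical primitive description (not real command output).
SAMPLE_DESCRIPTION = """
{
  "id": "4b42ce1e-9b98-4a25-b68e-fad13311eb65",
  "python_path": "d3m.primitives.data_transformation.dataset_to_dataframe.Common",
  "source": {
    "uris": ["https://gitlab.com/datadrivendiscovery/common-primitives.git"]
  }
}
"""


def source_uris(description):
    """Return the list stored under source.uris, or an empty list if it is absent."""
    return list(description.get("source", {}).get("uris", []))


print(source_uris(json.loads(SAMPLE_DESCRIPTION)))
```

To use it on real output, save the output of the describe command to a file and pass the result of json.load to source_uris.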

It can be unclear at first how to use the d3m python package, but you can always access the help string for each d3m command at every level of the command chain by using the -h flag. This is especially useful for getting a list of all the possible arguments for the runtime module.

docker run --rm registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 python3.6 -m d3m -h
docker run --rm registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 python3.6 -m d3m primitive -h
docker run --rm registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 python3.6 -m d3m runtime -h
docker run --rm registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 python3.6 -m d3m runtime fit-score -h

One last point before we try running a pipeline. The docker container must be able to access the dataset location and the pipeline location from the host filesystem. We can do this by bind-mounting a host directory that contains both the datasets repo and the primitives index repo to a container directory. Git clone these repos, and also make another empty directory called pipeline-outputs. Now, if your directory structure looks like this:

/home/foo/d3m
├── datasets
├── pipeline-outputs
└── primitives

Then you'll want to bind-mount /home/foo/d3m to a directory in the container, say /mnt/d3m. You can specify this mapping in the docker command itself:

docker run \
    --rm \
    -v /home/foo/d3m:/mnt/d3m \
    registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 \
    ls /mnt/d3m

If you're reading this tutorial from a text editor, it might be a good idea at this point to find and replace /home/foo/d3m with the actual path in your system where the datasets, pipeline-outputs, and primitives directories are all located. This will make it easier for you to just copy and paste the commands from here on out, instead of changing the faux path every time.
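
Since the same docker run prefix recurs in every command from here on, one option is to assemble it in a small script. The sketch below is only a convenience idea, not part of the d3m tooling: docker_command is a hypothetical helper that builds the argument list shown above, which could then be handed to subprocess.run.

```python
import shlex

IMAGE = "registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9"


def docker_command(host_dir, container_args, mount_point="/mnt/d3m", image=IMAGE):
    """Build the `docker run` argument list that bind-mounts `host_dir` at `mount_point`."""
    volume = "{}:{}".format(host_dir, mount_point)
    return ["docker", "run", "--rm", "-v", volume, image] + list(container_args)


command = docker_command("/home/foo/d3m", ["ls", "/mnt/d3m"])
# Print a copy-pasteable version of the command; nothing is executed here.
print(" ".join(shlex.quote(part) for part in command))
```

Passing the list to subprocess.run(command, check=True) would execute it, provided Docker is installed and the image has been pulled.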

Running an example pipeline

At this point, let's try running a pipeline. Again, we're going to run the example pipeline that comes with d3m.primitives.classification.logistic_regression.SKlearn. There are two ways to run a pipeline: by specifying all the necessary paths of the dataset, or by specifying and using a pipeline run file. Let's make sure first though that the dataset is available, as described in the next subsection.

Preparing the dataset

Towards the end of the previous section, you were asked to git clone the datasets repo to your machine. Most likely, you accomplished that like this:

git clone https://datasets.datadrivendiscovery.org/d3m/datasets.git

But unless you had git LFS installed, the entire contents of the repo might not have actually been downloaded.

The repo is organized such that all files larger than 100 KB are stored in git LFS. Thus, if you cloned without git LFS installed, you most likely have to do a one-time extra step before you can use a dataset, as some files of that dataset that are over 100 KB will not have the actual data in them (although they will still exist as files in the cloned repo). This is true even for the dataset that we will use in this exercise, 185_baseball. To verify this, open this file in a text editor:

datasets/training_datasets/seed_datasets_archive/185_baseball/185_baseball_dataset/tables/learningData.csv

Then, see if it contains text similar to this:

version https://git-lfs.github.com/spec/v1
oid sha256:931943cc4a675ee3f46be945becb47f53e4297ec3e470c4e3e1f1db66ad3b8d6
size 131187

If it does, then this dataset has not yet been fully downloaded from git LFS (but if it looks like a normal CSV file, then you can skip the rest of this subsection and move on). To download this dataset, simply run this command inside the datasets directory:

git lfs pull -I training_datasets/seed_datasets_archive/185_baseball/

Inspect the file again, and you should see that it looks like a normal CSV file now.
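
The manual check above can also be scripted. The following sketch relies only on the pointer format shown earlier (a file that starts with the line version https://git-lfs.github.com/spec/v1); is_lfs_pointer is a helper name chosen for this example.

```python
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file at `path` starts with the git LFS pointer header."""
    with open(path, "rb") as handle:
        # Only the first bytes are read, so large data files stay cheap to check.
        return handle.read(len(LFS_HEADER)) == LFS_HEADER
```

If is_lfs_pointer returns True for learningData.csv, the git lfs pull step above has not been run yet for that dataset.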

In general, if you don't know which specific dataset a certain example pipeline in the primitives repo uses, inspect the pipeline run output file of that primitive (whose file path is similar to that of the pipeline JSON file, as described in the :ref:`overview-of-primitives-and-pipelines` section, but instead of going to pipelines, go to pipeline_runs). The pipeline run is initially gzipped in the primitives repo, so decompress it first. Then open up the actual .yml file and look for datasets; the id field under it is the dataset id. If you do that for the example pipeline run of the SKlearn logistic regression primitive that we're looking at for this exercise, you'll find that the dataset id is 185_baseball_dataset. The name of the main dataset directory is this string, without the _dataset part.
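
The naming rule in the last sentence can be written down as a small helper. This is a sketch under two assumptions taken from this tutorial only: the dataset id ends in _dataset, and the dataset lives under training_datasets/seed_datasets_archive, as 185_baseball does; other datasets may be located elsewhere. The function name dataset_directory is hypothetical.

```python
import posixpath

SUFFIX = "_dataset"


def dataset_directory(datasets_root, dataset_id):
    """Map a dataset id such as '185_baseball_dataset' to its main directory."""
    if not dataset_id.endswith(SUFFIX):
        raise ValueError("expected an id ending in {!r}: {!r}".format(SUFFIX, dataset_id))
    name = dataset_id[:-len(SUFFIX)]
    # Assumption: the dataset sits in the seed datasets archive, like 185_baseball.
    return posixpath.join(datasets_root, "training_datasets", "seed_datasets_archive", name)


print(dataset_directory("/mnt/d3m/datasets", "185_baseball_dataset"))
```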

Now, let's actually run the pipeline using the two ways mentioned earlier.

Specifying all the necessary paths of a dataset

You can use this if there is no existing pipeline run yet for a pipeline, or if you want to manually specify the dataset path (set the paths for --problem, --input, --test-input, --score-input, --pipeline to your target dataset location).

Remember to change the bind mount paths as appropriate for your system (specified by -v).

docker run \
    --rm \
    -v /home/foo/d3m:/mnt/d3m \
    registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 \
    python -m d3m \
        runtime \
        fit-score \
            --problem /mnt/d3m/datasets/training_datasets/seed_datasets_archive/185_baseball/185_baseball_problem/problemDoc.json \
            --input /mnt/d3m/datasets/training_datasets/seed_datasets_archive/185_baseball/TRAIN/dataset_TRAIN/datasetDoc.json \
            --test-input /mnt/d3m/datasets/training_datasets/seed_datasets_archive/185_baseball/TEST/dataset_TEST/datasetDoc.json \
            --score-input /mnt/d3m/datasets/training_datasets/seed_datasets_archive/185_baseball/SCORE/dataset_TEST/datasetDoc.json \
            --pipeline /mnt/d3m/primitives/v2020.1.9/JPL/d3m.primitives.classification.logistic_regression.SKlearn/2019.11.13/pipelines/862df0a2-2f87-450d-a6bd-24e9269a8ba6.json \
            --output /mnt/d3m/pipeline-outputs/predictions.csv \
            --output-run /mnt/d3m/pipeline-outputs/run.yml

The score is displayed after the pipeline run. The output predictions will be stored at the path specified by --output, and information about the pipeline run is stored at the path specified by --output-run.

Again, you can use the -h flag on fit-score to access the help string and read about the different arguments, as described earlier.

If you get a python error that complains about missing columns, or something that looks like this:

ValueError: Mismatch between column name in data 'version https://git-lfs.github.com/spec/v1' and column name in metadata 'd3mIndex'.

Chances are that the 185_baseball dataset has not yet been downloaded through git LFS. See the :ref:`previous subsection <preparing-dataset>` for details on how to verify and do this.

Using a pipeline run file

Instead of specifying all the specific dataset paths, you can also use an existing pipeline run to essentially "re-run" a previous run of the pipeline:

docker run \
    --rm \
    -v /home/foo/d3m:/mnt/d3m \
    registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 \
    python -m d3m \
        --pipelines-path /mnt/d3m/primitives/v2020.1.9/JPL/d3m.primitives.classification.logistic_regression.SKlearn/2019.11.13/pipelines \
        runtime \
            --datasets /mnt/d3m/datasets \
        fit-score \
            --input-run /mnt/d3m/primitives/v2020.1.9/JPL/d3m.primitives.classification.logistic_regression.SKlearn/2019.11.13/pipeline_runs/pipeline_run.yml.gz \
            --output /mnt/d3m/pipeline-outputs/predictions.csv \
            --output-run /mnt/d3m/pipeline-outputs/run.yml

In this case, --input-run is the pipeline run file that this pipeline will re-run, and --output-run is the new pipeline run file that will be generated.

Note that if you choose fit-score for the d3m runtime option, the pipeline actually runs in two phases: fit, and produce. You can verify this by searching for phase in the pipeline run file.

Lastly, if you want to run multiple commands in the docker container, simply chain your commands with && and wrap them in double quotes (") for bash -c. As an example:

docker run \
    --rm \
    -v /home/foo/d3m:/mnt/d3m \
    registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 \
    /bin/bash -c \
        "python -m d3m \
            --pipelines-path /mnt/d3m/primitives/v2020.1.9/JPL/d3m.primitives.classification.logistic_regression.SKlearn/2019.11.13/pipelines \
            runtime \
                --datasets /mnt/d3m/datasets/training_datasets/seed_datasets_archive/185_baseball \
            fit-score \
                --input-run /mnt/d3m/primitives/v2020.1.9/JPL/d3m.primitives.classification.logistic_regression.SKlearn/2019.11.13/pipeline_runs/pipeline_run.yml \
                --output /mnt/d3m/pipeline-outputs/predictions.csv \
                --output-run /mnt/d3m/pipeline-outputs/run.yml && \
        head /mnt/d3m/pipeline-outputs/predictions.csv"

Writing a new primitive

Let's now try to write a very simple new primitive - one that simply passes whatever input data it receives from the previous step to the next step in the pipeline. Let's call this primitive "Passthrough".

We will use this skeleton primitive repo as a starting point for this exercise. A d3m primitive repo does not have to follow the exact same directory structure as this, but this is a good structure to start with, at least. git clone the repo into docs-quickstart at the same place where the other repos that we have used earlier are located (datasets, pipeline-outputs, primitives).

Alternatively, you can also use the test primitives as a model/starting point. test_primitives/null.py is essentially the same primitive that we are trying to write.

Primitive source code

In the docs-quickstart directory, open quickstart_primitives/sample_primitive1/input_to_output.py. The first important thing to change here is the primitive metadata, which are the first objects defined under the InputToOutputPrimitive class. Modify the following fields (unless otherwise noted, the values you put in must be strings):

  • id: The primitive's UUID v4 number/identifier. To generate one, you can simply run this inline Python command:

    python -c "import uuid; print(uuid.uuid4())"

  • version: You can use semantic versioning for this or another style of versioning. Write "0.1.0" for this exercise. You should bump the version of the primitive at least every time public interfaces of the primitive change (e.g. hyper-parameters).

  • name: The primitive's name. Write "Passthrough primitive" for this exercise.

  • description: A short description of the primitive. Write "A primitive which directly outputs the input." for this exercise.

  • python_path: This follows this format:

    d3m.primitives.<primitive family>.<primitive name>.<kind>
    

    Primitive families can be found in the d3m metadata page (wait a few seconds for the page to load completely), and primitive names can be found in the d3m core package source code. The last segment can be used to attribute the primitive to the author and/or describe in which way it is different from other primitives with same primitive family and primitive name, e.g., a different implementation with different trade-offs.

    For this exercise, write "d3m.primitives.operator.input_to_output.Quickstart". Note that input_to_output is not currently registered as a standard primitive name and using it will produce a warning. For primitives you intend to publish, make a merge request to the d3m core package to add any primitive names you need.

  • primitive_family: This must be the same primitive family as the one used in python_path, given as an enumeration value. You can use a string or a Python enumeration value. Add this import statement (if not there already):

    from d3m.metadata import base as metadata_base

    Then write metadata_base.PrimitiveFamily.OPERATOR (as a value, not a string, so do not put quotation marks) as the value of this field.

  • algorithm_types: Algorithm type(s) that the primitive implements. This can be multiple values in an array. Values can be chosen from the d3m metadata page as well. Write [metadata_base.PrimitiveAlgorithmType.IDENTITY_FUNCTION] here for this exercise (as a list that contains one element, not a string).

  • source: General info about the author of this primitive. name is usually the name of the person or the team that wrote this primitive. contact is a mailto URI to the email address of whoever one should contact about this primitive. uris are usually the git clone URL of the repo, and you can also add the URL of the source file of this primitive.

    Write these for the exercise:

    "name": "My Name",
    "contact": "mailto:myname@example.com",
    "uris": ["https://gitlab.com/datadrivendiscovery/docs-quickstart.git"],

  • keywords: Key words for what this primitive is or does. Write ["passthrough"].

  • installation: Information about how to install this primitive. Add these import statements first:

    import os.path
    from d3m import utils

    Then replace the installation entry with this:

    "installation": [{
        "type": metadata_base.PrimitiveInstallationType.PIP,
        "package_uri": "git+https://gitlab.com/datadrivendiscovery/docs-quickstart@{git_commit}#egg=quickstart_primitives".format(
            git_commit=utils.current_git_commit(os.path.dirname(__file__))
        ),
    }],

    In general, for your own actual primitives, you might only need to substitute the git repo URL here as well as the python egg name.
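
To make the python_path format from the list above concrete, here is a small sketch that splits a path into its three variable segments and rejects anything that does not have the d3m.primitives.<primitive family>.<primitive name>.<kind> shape. It is a plain string check written for this tutorial, not the validation that the d3m core package itself performs; parse_python_path and PythonPath are names invented here.

```python
from collections import namedtuple

PythonPath = namedtuple("PythonPath", ["family", "name", "kind"])


def parse_python_path(python_path):
    """Split 'd3m.primitives.<family>.<name>.<kind>' into its last three segments."""
    parts = python_path.split(".")
    if len(parts) != 5 or parts[:2] != ["d3m", "primitives"] or not all(parts):
        raise ValueError("not a d3m primitive python path: {!r}".format(python_path))
    return PythonPath(*parts[2:])


parsed = parse_python_path("d3m.primitives.operator.input_to_output.Quickstart")
print(parsed.family, parsed.name, parsed.kind)
```

In this exercise the family segment, operator, is the lowercase counterpart of the PrimitiveFamily.OPERATOR value used for primitive_family.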

Next, let's take a look at the produce method. You can see that it simply makes a new dataframe out of the input data, and returns it as the output. To see for ourselves though that our primitive (and thus this produce method) gets called during the pipeline run, let's add a log statement here. The produce method should now look something like this:

def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> base.CallResult[Outputs]:
    self.logger.warning('Hi, InputToOutputPrimitive.produce was called!')
    return base.CallResult(value=inputs)

Note that this is simply an example primitive that is intentionally simple for the purposes of this tutorial. It does not necessarily model a well-written primitive, by any means. For guidelines on how to write a good primitive, take a look at the :ref:`primitive-good-citizen`.

setup.py

Next, we fill in the necessary information in setup.py so that pip can correctly install our primitive in our local d3m environment. Open setup.py (in the project root), and modify the following fields:

  • name: Same as the egg name you used in package_uri

  • version: Same as the primitive metadata's version

  • description: Same as the primitive metadata's description, or a description of all primitives if there are multiple primitives in the package you are making

  • author: Same as the primitive metadata's source.name

  • url: Same as main URL in the primitive metadata's source.uris

  • packages: This is an array of the python packages that this primitive repo contains. You can use the find_packages helper:

    packages=find_packages(exclude=['pipelines']),

  • keywords: A list of keywords. An important standard keyword is d3m_primitive, which makes all primitives discoverable on PyPI

  • install_requires: This is an array of the python package dependencies of the primitives contained in this repo. Our primitive needs nothing except the d3m core package (and the common-primitives package too for testing, but this is not a package dependency), so write this as the value of this field: ['d3m']

  • entry_points: This is how the d3m runtime maps your primitives' d3m python paths to your repo's local python paths. For this exercise, it should look like this:

    entry_points={
        'd3m.primitives': [
            'operator.input_to_output.Quickstart = quickstart_primitives.sample_primitive1:InputToOutputPrimitive',
        ],
    }

That's it for this file. Briefly review it for any possible syntax errors.
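
The entry point string follows a simple pattern: the d3m python path without its d3m.primitives. prefix, then the local module and class joined by a colon. The sketch below only reproduces that pattern as string formatting, which may help when a package contains several primitives; entry_point is a hypothetical helper, not a d3m or setuptools function.

```python
PREFIX = "d3m.primitives."


def entry_point(python_path, module, class_name):
    """Format one entry for the 'd3m.primitives' entry point group."""
    if not python_path.startswith(PREFIX):
        raise ValueError("python path must start with {!r}".format(PREFIX))
    return "{} = {}:{}".format(python_path[len(PREFIX):], module, class_name)


print(entry_point(
    "d3m.primitives.operator.input_to_output.Quickstart",
    "quickstart_primitives.sample_primitive1",
    "InputToOutputPrimitive",
))
```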

Primitive unit tests

Let's now make a python test for this primitive, which in this case will just assert whether the input dataframe to the primitive equals the output dataframe. Make a new file called test_input_to_output.py inside quickstart_primitives/sample_primitive1 (the same directory as input_to_output.py), and write this as its contents:

import unittest
import os

from d3m import container
from common_primitives import dataset_to_dataframe
from input_to_output import InputToOutputPrimitive


class InputToOutputTestCase(unittest.TestCase):
    def test_output_equals_input(self):
        dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), '..', '..', 'tests-data', 'datasets', 'timeseries_dataset_1', 'datasetDoc.json'))

        dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path))

        dataframe_hyperparams_class = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams()
        dataframe_primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=dataframe_hyperparams_class.defaults())
        dataframe = dataframe_primitive.produce(inputs=dataset).value

        i2o_hyperparams_class = InputToOutputPrimitive.metadata.get_hyperparams()
        i2o_primitive = InputToOutputPrimitive(hyperparams=i2o_hyperparams_class.defaults())
        output = i2o_primitive.produce(inputs=dataframe).value

        self.assertTrue(output.equals(dataframe))


if __name__ == '__main__':
    unittest.main()

For the dataset that this test uses, add the d3m tests-data repository as a git submodule at the root of the docs-quickstart repository. Then let's install this new primitive to the Docker image's d3m environment, and run this test using the command below:

docker run \
    --rm \
    -v /home/foo/d3m:/mnt/d3m \
    registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 \
    /bin/bash -c \
        "pip3 install -e /mnt/d3m/docs-quickstart && \
        cd /mnt/d3m/docs-quickstart/quickstart_primitives/sample_primitive1 && \
        python test_input_to_output.py"

You should see a log statement like this, as well as the python unittest pass message:

Hi, InputToOutputPrimitive.produce was called!
.
----------------------------------------------------------------------
Ran 1 test in 0.011s

Using this primitive in a pipeline

Having seen the primitive test pass, we can now confidently include this primitive in a pipeline. Let's take the same pipeline that we ran :ref:`before <running-example-pipeline>` (the sklearn logistic regression's example pipeline), and add a step using this primitive.

In the root directory of your repository, create these directories: pipelines/operator.input_to_output.Quickstart. Then, from the d3m primitives repo, copy the JSON pipeline description file from primitives/v2020.1.9/JPL/d3m.primitives.classification.logistic_regression.SKlearn/2019.11.13/pipelines into the directory we just created. Open this file, and replace the id (generate another UUID v4 number using the inline python command earlier, different from the primitive id), as well as the created timestamp using this inline python command (add Z at the end of the generated timestamp):

python -c "import datetime; \
print(datetime.datetime.now(datetime.timezone.utc).replace(tzinfo=None).isoformat())"

You can rename the json file too using the new pipeline id.

Next, change the step number in the pipeline's output reference ("steps.4.produce" at the time of this writing) to be one more than the current number, so that it reads as shown below:

"outputs": [
  {
    "data": "steps.5.produce",
    "name": "output predictions"
  }
],

Then, find the step that contains the d3m.primitives.classification.logistic_regression.SKlearn primitive (search for this string in the file), and right above it, add the following JSON object. Remember to change primitive.id to the primitive's id that you generated in the earlier :ref:`primitive-source-code` subsection.

{
  "type": "PRIMITIVE",
  "primitive": {
    "id": "30d5f2fa-4394-4e46-9857-2029ec9ed0e0",
    "version": "0.1.0",
    "python_path": "d3m.primitives.operator.input_to_output.Quickstart",
    "name": "Passthrough primitive"
  },
  "arguments": {
    "inputs": {
      "type": "CONTAINER",
      "data": "steps.2.produce"
    }
  },
  "outputs": [
    {
      "id": "produce"
    }
  ]
},

Make sure that the step number ("steps.N.produce") in arguments.inputs.data is correct: it should refer to the step immediately before the new one, so N is one greater than the number used by the previous step and one less than the number that the next step will use. Update the succeeding steps as well, with the following caveats:

  • For d3m.primitives.classification.logistic_regression.SKlearn, increment the step number both for arguments.inputs.data and arguments.outputs.data (at the time of this writing, the number should be changed to 3).
  • For d3m.primitives.data_transformation.construct_predictions.Common, increment the step number for arguments.inputs.data (at the time of this writing, the number should be changed to 4), but do not change the one for arguments.reference.data (the value should stay as "steps.0.produce")

Generally, you can also programmatically generate a pipeline, as described in the :ref:`pipeline-description-example`.
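
The manual renumbering above can also be expressed as a short script that edits the pipeline JSON. This is a sketch under the assumptions of this exercise only: references are strings of the form steps.N.produce stored in arguments.*.data and in the top-level outputs, and every later consumer of the preceding step should be routed through the new step. References held in hyper-parameters or in lists are not handled. The function insert_step is a name made up for this example and is not part of the d3m API.

```python
import copy
import re

# Matches data references such as "steps.2.produce".
STEP_REFERENCE = re.compile(r"^steps\.(\d+)\.(.+)$")


def insert_step(pipeline, index, new_step):
    """Return a copy of `pipeline` with `new_step` inserted at position `index`."""
    result = copy.deepcopy(pipeline)

    def remap(reference):
        match = STEP_REFERENCE.match(reference)
        if match is None:
            return reference  # e.g. "inputs.0", which is not a step reference
        number, output = int(match.group(1)), match.group(2)
        if number >= index:
            number += 1  # the referenced step moved down by one position
        elif number == index - 1:
            number = index  # consumers of the predecessor now read from the new step
        return "steps.{}.{}".format(number, output)

    # Only steps that end up after the new one need their references updated.
    for step in result["steps"][index:]:
        for argument in step.get("arguments", {}).values():
            if isinstance(argument.get("data"), str):
                argument["data"] = remap(argument["data"])
    for output in result.get("outputs", []):
        output["data"] = remap(output["data"])

    result["steps"].insert(index, copy.deepcopy(new_step))
    return result
```

For this exercise, index would be 3 and new_step the JSON object shown earlier, loaded as a dictionary; references to earlier steps, such as "steps.0.produce" in arguments.reference.data, are left as they are.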

Now we can finally run this pipeline that uses our new primitive. In the command below, modify the pipeline JSON filename in the --pipeline argument to match the filename of your pipeline file (if you changed it to the new pipeline id that you generated).

docker run \
    --rm \
    -v /home/foo/d3m:/mnt/d3m \
    registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 \
    /bin/bash -c \
        "pip3 install -e /mnt/d3m/docs-quickstart && \
        python -m d3m \
            runtime \
            fit-score \
                --problem /mnt/d3m/datasets/training_datasets/seed_datasets_archive/185_baseball/185_baseball_problem/problemDoc.json \
                --input /mnt/d3m/datasets/training_datasets/seed_datasets_archive/185_baseball/TRAIN/dataset_TRAIN/datasetDoc.json \
                --test-input /mnt/d3m/datasets/training_datasets/seed_datasets_archive/185_baseball/TEST/dataset_TEST/datasetDoc.json \
                --score-input /mnt/d3m/datasets/training_datasets/seed_datasets_archive/185_baseball/SCORE/dataset_TEST/datasetDoc.json \
                --pipeline /mnt/d3m/docs-quickstart/pipelines/operator.input_to_output.Quickstart/0f290525-3fec-44f7-ab93-bd778747b91e.json \
                --output /mnt/d3m/pipeline-outputs/predictions_new.csv \
                --output-run /mnt/d3m/pipeline-outputs/run_new.yml"

In the output, you should see the log statement as a warning, before the score is shown (similar to the text below):

...
WARNING:d3m.primitives.operator.input_to_output.Quickstart:Hi, InputToOutputPrimitive.produce was called!
...
metric,value,normalized,randomSeed
F1_MACRO,0.31696136214800263,0.31696136214800263,0

Verify that the old and new predictions files in pipeline-outputs (predictions.csv and predictions_new.csv) are the same (you can use diff), as well as the scores in the old and new run files (run.yml and run_new.yml; search for scores in the files).
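
As an alternative to diff, the comparison of the two predictions files can be done with Python's standard library. The sketch below compares file contents byte for byte; same_predictions is a helper name chosen for this example, and the paths in the usage note assume the directory layout used throughout this tutorial.

```python
import filecmp


def same_predictions(old_path, new_path):
    """Return True if the two files have identical contents."""
    # shallow=False forces a content comparison instead of relying on file metadata.
    return filecmp.cmp(old_path, new_path, shallow=False)
```

For example, same_predictions("/home/foo/d3m/pipeline-outputs/predictions.csv", "/home/foo/d3m/pipeline-outputs/predictions_new.csv") should return True, since the passthrough primitive does not change the data.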

Beyond this tutorial

Congratulations! You just built your own primitive and you were able to use it in a d3m pipeline!

Normally, when you build your own primitives, you would proceed to validating the primitives to be included in the d3m primitive index of all known primitives. See the primitives repo README for details on how to do this.