[source-gitlab] New child stream using SubstreamPartitionRouter and incremental_dependency only gets full refreshes #50962

Open
marianob-span1 opened this issue Jan 7, 2025 · 0 comments


Connector Name

source-gitlab

Connector Version

4.3.3

What step the error happened?

During the sync

Relevant information

I have added a new stream for Merge Request Discussions as a child of Merge Requests (I'm happy to open a PR to contribute it). I expected it to use the parent Merge Requests cursor and filter Discussions based on it.

Relevant lines from manifest.yaml here:

...
merge_requests_child_streams_retriever:
  $ref: "#/definitions/retriever"
  partition_router:
    type: SubstreamPartitionRouter
    parent_stream_configs:
      - type: ParentStreamConfig
        parent_key: "iid"
        stream: "#/definitions/merge_requests_stream"
        partition_field: "iid"
        incremental_dependency: true
...
base_merge_requests_child_stream:
  $ref: "#/definitions/base_full_refresh_stream"
  retriever: "#/definitions/merge_requests_child_streams_retriever"

merge_request_discussions_stream:
  name: "merge_request_discussions"
  $ref: "#/definitions/base_merge_requests_child_stream"
  retriever:
    $ref: "#/definitions/retriever"
    partition_router:
      type: SubstreamPartitionRouter
      parent_stream_configs:
        - type: ParentStreamConfig
          parent_key: "iid"
          stream: "#/definitions/merge_requests_stream"
          partition_field: "iid"
  $parameters:
    path: "projects/{{ stream_slice.parent_slice.id }}/merge_requests/{{ stream_slice.iid }}/discussions"

From the command line, passing the right state works as expected (e.g. poetry run source-gitlab read --config secrets/config.json --catalog catalog_mr_discussions.json --state state_mr_discussions.json): the connector pulls only a small number of Discussions based on the Merge Requests cursor.

On the Airbyte Platform, however, the connector never receives the state for Merge Request Discussions, so every sync of that stream is a full refresh.

I verified that the connection saved the complete state: there is a parent_state object with the right Merge Requests cursor for the new Discussions stream (IDs and names have been anonymized):

{
    "streamDescriptor": {
      "name": "merge_request_discussions"
    },
    "streamState": {
      "states": [
         // Partition states omitted for brevity 
       ],
      "parent_state": {
        "merge_requests": {
          "states": [
            {
              "cursor": {
                "updated_at": "2025-01-07T14:28:40.290Z"
              },
              "partition": {
                  "id": 12345678,
                  "parent_slice": {
                      "id": "org123%2Fgroup-a%2Fgroup-b%2Fexample-project",
                      "name": "example-project",
                      "namespace_id": 11111111,
                      "namespace_name": "group-b",
                      "name_with_namespace": "organization / group-a / group-b / example-project",
                      "namespace_full_path": "org123/group-a/group-b"
                  }
              }
            }
          ]
        }
      }
    }
  },
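
For anyone trying to reproduce this locally, here is a minimal sketch of turning a saved-state entry like the one above into the state-message shape that appears in input_state.json. The function name to_cli_state is hypothetical, the optional namespace handling is an assumption, and the sample partition is shortened for illustration; only the key names (streamDescriptor/streamState on one side, stream_descriptor/stream_state on the other) come from the excerpts in this issue.

```python
import json


def to_cli_state(platform_streams):
    """Convert saved-state entries (camelCase keys) into STREAM state messages."""
    messages = []
    for entry in platform_streams:
        descriptor = entry["streamDescriptor"]
        stream_descriptor = {"name": descriptor["name"]}
        # Assumption: a namespace may be present; copy it only when it is set.
        if descriptor.get("namespace") is not None:
            stream_descriptor["namespace"] = descriptor["namespace"]
        messages.append(
            {
                "type": "STREAM",
                "stream": {
                    "stream_descriptor": stream_descriptor,
                    "stream_state": entry.get("streamState", {}),
                },
            }
        )
    return messages


# Shortened sample based on the saved state above (partition details trimmed).
saved = [
    {
        "streamDescriptor": {"name": "merge_request_discussions"},
        "streamState": {
            "states": [],
            "parent_state": {
                "merge_requests": {
                    "states": [
                        {
                            "cursor": {"updated_at": "2025-01-07T14:28:40.290Z"},
                            "partition": {"id": 12345678, "parent_slice": {}},
                        }
                    ]
                }
            },
        },
    }
]

cli_state = to_cli_state(saved)
print(json.dumps(cli_state, indent=2))
```

Writing the printed list to a file gives something that could be passed via --state in a local run, assuming the connector accepts the same message shape it receives on the platform.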

But it seems that the connector does not get the full input state. I went into the workspaces Docker volume, and the input_state.json file contains only a state for Merge Requests and nothing else, which explains the full refresh on Discussions:

root@56b7143deb47:/data# cat 141/0/input_state.json
[{"type":"STREAM","stream":{"stream_descriptor":{"name":"merge_requests"},"stream_state":{"states":[{"cursor":{"updated_at":"2025-01-07T14:28:40.290Z"},"partition":{"id":12345678,"parent_slice":{"id":"org123%2Fgroup-a%2Fgroup-b%2Fexample-project","name":"example-project","namespace_id":11111111,"namespace_name":"group-b","name_with_namespace":"organization / group-a / group-b / example-project","namespace_full_path":"org123/group-a/group-b"}}}]}}}]
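
The check above can also be scripted. This is a rough sketch, not part of the connector: the helper names streams_in_state and missing_streams are hypothetical, and the inline sample is a trimmed copy of the input_state.json content shown above (stream_state emptied for brevity).

```python
import json


def streams_in_state(state_messages):
    """Return the set of stream names that have a STREAM state message."""
    names = set()
    for message in state_messages:
        if message.get("type") != "STREAM":
            continue
        names.add(message["stream"]["stream_descriptor"]["name"])
    return names


def missing_streams(state_messages, expected):
    """Return the expected stream names that have no state message, sorted."""
    return sorted(set(expected) - streams_in_state(state_messages))


# Trimmed copy of the input_state.json content from the workspace volume.
input_state = json.loads(
    '[{"type":"STREAM","stream":{"stream_descriptor":{"name":"merge_requests"},'
    '"stream_state":{"states":[]}}}]'
)

missing = missing_streams(input_state, ["merge_requests", "merge_request_discussions"])
print(missing)  # → ['merge_request_discussions']
```

To run it against a real file, replace the inline sample with json.load on the path of the job's input_state.json.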

I'm not sure whether this is specific to GitLab and the stream I'm trying to add, or something platform-wide like the issue described here.

Relevant log output

Contribute

  • Yes, I want to contribute