-
@brunoapimentel @lkolacek @ejegrova @eskultety @taylormadore: This is what a design doc could look like in GitHub Discussions. Unfortunately, we don't have the ability to comment directly in the text of the doc, but something like the following could work (sorry, Bruno, Taylor).
-
"""
What if the API request fails (I know it rarely does, right? ^ ^)? Retry a few times before failing the job? How does the "cleanup_job" report failures? Should we use the same approach?
-
"""
As long as we use the worker requests session, requests should retry: https://github.com/containerbuildsystem/cachito/blob/ccadeef3f9f86d08fa4b8434a3f7f15c9edb9cee/cachito/workers/requests.py#L28
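For anyone who wants the retry idea spelled out, here is a minimal stdlib-only sketch of "retry a few times before failing the job". It is not the cachito worker session linked above, which configures retries on the requests session itself; `call_with_retries` and its parameters are hypothetical names used only for illustration.

```python
import time


def call_with_retries(func, attempts=3, backoff_seconds=0.5, sleep=time.sleep):
    """Call func() up to `attempts` times, re-raising the last error.

    Hypothetical helper: waits backoff_seconds * 2**n between tries.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            # Out of retries: let the caller (e.g. the job) fail loudly.
            if attempt == attempts - 1:
                raise
            sleep(backoff_seconds * 2 ** attempt)
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays.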
-
Obviously, this is very different from GDocs. Another thing we lose is the document versioning and history that GDocs provides, but I don't have a sense of how valuable that is.
-
Note that in addition to the usual emoji response button, there is also an upvote, which we could use as our "+1" indicator for approvals.
-
The GH markdown editor is quite nice, with some cool new "slash" features in beta, including
A really long text block (doesn't have to be code):

    "/requests/latest":
      get:
        operationId: cachito.web.api_v1.get_latest_request
        summary: Get the latest request
        description: Return the latest request for a given repo_name and ref
        parameters:
          - name: repo_name
            in: query
            description: A repository name to filter by
            schema:
              type: string
              maxLength: 200
              example: release-engineering/retrodep
          - name: ref
            in: query
            description: A git ref to filter request by
            schema:
              type: string
              minLength: 40
              maxLength: 40
              pattern: '^[a-f0-9]{40}$'
              example: bc9767a71ede6e0084ae4a9e01dcd8b81c30b741
        responses:
          "200":
            description: The requested Cachito request
            content:
              application/json:
                schema:
                  $ref: "#/components/schemas/Request"
          "404":
            description: The request wasn't found
            content:
              application/json:
                schema:
                  type: object
                  properties:
                    error:
                      type: string
                      example: The requested resource was not found
-
Thank you @ben-alkov for taking a look at this! Yes, versioning becomes a problem once you update the original post with additional findings, because that immediately renders most of the existing comments irrelevant. Do we want to start deleting irrelevant comments? Probably not. So keeping the discussion history relevant to the latest findings, and navigating through it, is going to be a challenge. That's where, IMO, mailing lists shine: the thread can be infinite, and you always know which email, and (after opening one) which bit of its body, a given message responds to. GH comments (or any comments, for that matter) don't give you that, which would be unacceptable for large projects with many stakeholders because it would get extremely messy. That said, are we in that kind of situation? Not at the moment, so unless we want to create a mailing list, I think GH Discussions will do. It's not the most refined interface out there, but then again, it's all hosted in a single place, so contributors always know where to look for the source of truth, and the experience is quite consistent (whether that's a good or a bad thing). Back to GDocs: it's true that it has "versioning", but how useful is it, given its interface? Even if it were nice and refined, the fact that one cannot use simple, quick markdown formatting for technical discussions automatically disqualifies GDocs, IMO.
As for commenting directly on a line of text or code, at least there's the usual quoting people are already used to from reviews, so one can always refer to a specific paragraph. It might become a PITA for code blocks once the quoting nests deeply during a discussion, but hey, it is what it is; we can always look for something better in the meantime and add pointers to it here in the repo. The one thing to bear in mind is the motivation for even thinking about this: to host all relevant pieces of information and discussions here in the main repository AND to streamline developers' day-to-day workflows by moving many of the processes to GitHub, making the overall experience consistent. What might turn out to be a nice feature is that we can create an issue automatically from the discussion (not sure how messy that will be). Once the main issue is tracked, it can be decomposed into smaller pieces linked to the main issue. Just food for thought.
-
Background
Where do cachito-archives come from?
Cachito generates a source archive as part of processing requests that involve
git repositories in two different cases
has not been archived before, cachito will clone it and create an
archive
rubygems that are not available in their respective registries and instead
must be retrieved from a git repository, cachito will clone and package these
repositories
After archiving, it uploads them to a Nexus repository.
Structure of cachito-archives
The archive directory is hosted on an OpenShift dynamic NFS Persistent Volume
Claim (PVC) with ReadWriteMany (RWX) access mode. This directory is mounted
across all worker pods. When cachito processes a request and generates a source
archive, it creates a tarball with the following directory structure:
Standard repository archives are named 'namespace/repo_name/<git_ref>.tar.gz'.
Archives including git submodules are named
'namespace/repo_name/<git_ref>-with-submodules.tar.gz'.
For repositories cloned via SSH, the namespace includes the clone method:
'git@github.com:namespace/repo_name/<git_ref>.tar.gz'.
From the archive structure, we have the repository namespace/name, but not the
full repo URL that is stored for the request in the cachito DB.
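The naming scheme above can be turned back into its parts with a few lines of Python. This is only a sketch: `parse_archive_path` and `ArchiveInfo` are hypothetical names, and the input is assumed to be a path relative to the archives directory.

```python
from pathlib import PurePosixPath
from typing import NamedTuple

SUBMODULES_SUFFIX = "-with-submodules"


class ArchiveInfo(NamedTuple):
    repo_name: str  # e.g. "namespace/repo_name", possibly SSH-prefixed
    ref: str
    with_submodules: bool


def parse_archive_path(relative_path: str) -> ArchiveInfo:
    """Split '<repo_name>/<git_ref>[-with-submodules].tar.gz' into parts."""
    path = PurePosixPath(relative_path)
    if not path.name.endswith(".tar.gz") or len(path.parts) < 2:
        raise ValueError(f"not an archive path: {relative_path!r}")
    stem = path.name[: -len(".tar.gz")]
    with_submodules = stem.endswith(SUBMODULES_SUFFIX)
    if with_submodules:
        stem = stem[: -len(SUBMODULES_SUFFIX)]
    # Everything before the file name is the repository namespace/name.
    return ArchiveInfo(str(path.parent), stem, with_submodules)
```

For an SSH-cloned archive the returned `repo_name` keeps the 'git@github.com:' prefix, since that prefix is part of the directory name.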
See
Proposal
Archive “Pruner” Script
Develop a script designed to clean up old source archives by deleting those which exceed a specified age
This script should
for deletion
weekends. This scheduling should be implemented as a cron job within
OpenShift, mirroring the approach taken with our current script that
identifies stale requests
Determine whether an archive is stale by performing the following process
extracted repo_name and ref. Querying the cachito DB seems preferable to
relying on file system timestamps. We have copied the cachito-archives volume
before and the integrity of that metadata could be questionable
threshold based on the "created" timestamp within the request
set time frame and log the details
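The age check against the request's "created" timestamp could look roughly like the sketch below. The function name `is_stale`, the ISO 8601 timestamp format, and treating timestamps without a timezone as UTC are all assumptions, not confirmed details of the cachito API.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(created: str, max_age_days: int, now: Optional[datetime] = None) -> bool:
    """Return True if the latest request is older than max_age_days."""
    created_at = datetime.fromisoformat(created)
    if created_at.tzinfo is None:
        # Assumption: timestamps without an offset are in UTC.
        created_at = created_at.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return now - created_at > timedelta(days=max_age_days)
```

Taking `now` as an optional argument makes the threshold logic reproducible in tests and dry runs.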
Additional Topics
Place the new script in cachito/workers alongside the existing
cleanup_job.py script
Extract the repo_name and ref from the path, but whether it has submodules
is not relevant to the decision of whether or not to prune the archive.
As mentioned above, the repositories cloned using SSH are stored in a
different namespace within the archives directory than those cloned via
HTTPS, even if they refer to the same repository. The naming convention for
archives cloned via SSH looks like 'git@github.com:namespace/repo_name/<git_ref>.tar.gz'.
I don’t think we need to address this distinction at this time, since the
impact is negligible.
Specifically, if we receive both SSH and non-SSH clone requests for the same
repository and reference, the determination of the "age" for the archive
that was cloned without SSH will be based on whichever request, SSH or
non-SSH, is the most recent. In contrast, the age of the SSH-cloned archive
will be determined solely by when the SSH clone request was made.
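A small illustration of why the two archives age differently, assuming the lookup is a plain substring match of the archive's directory name against the stored repo URL. The HTTPS URL form and the helper name `matching_repos` are assumptions made for this example.

```python
def matching_repos(archive_repo_name: str, request_repos: list) -> list:
    """Return the request repo URLs that contain archive_repo_name."""
    return [repo for repo in request_repos if archive_repo_name in repo]


requests_in_db = [
    "https://github.com/namespace/repo_name",  # HTTPS clone request (assumed form)
    "git@github.com:namespace/repo_name",      # SSH clone request
]

# Directory name of the HTTPS-cloned archive: matches both requests.
https_matches = matching_repos("namespace/repo_name", requests_in_db)
# Directory name of the SSH-cloned archive: matches the SSH request only.
ssh_matches = matching_repos("git@github.com:namespace/repo_name", requests_in_db)
```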
application itself
It seems like this should be an ancillary, deployment-specific feature and
should be implemented outside the main application. Upstream cachito will
continue to retain source archives indefinitely
This refers to archives created for git dependencies that are subsequently
uploaded to Nexus. They should be safe to delete, but we would need to
fall back to file system timestamps or some similar method because there is no
associated request in the cachito DB for them. The number of these should be
much smaller, and we can revisit this with a backlog story if it ever becomes
an issue.
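If the file system fallback is ever needed, it could be as small as the sketch below. `is_stale_by_mtime` is a hypothetical name, and the caveat raised earlier about copied volumes applies: the modification time is only as trustworthy as the volume's metadata.

```python
import os
import time
from typing import Optional


def is_stale_by_mtime(path: str, max_age_days: int, now: Optional[float] = None) -> bool:
    """Return True if the file's mtime is older than max_age_days."""
    now = time.time() if now is None else now
    age_seconds = now - os.stat(path).st_mtime
    return age_seconds > max_age_days * 24 * 60 * 60
```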
New API Endpoint -
GET api/v1/requests/latest
Purpose
Return the most recent request for the specified repository name and git reference.
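From the pruner's side, calling the endpoint mostly comes down to building the query string. The sketch below does only that, using the standard library; `latest_request_url` and the example host are hypothetical.

```python
from urllib.parse import urlencode


def latest_request_url(api_root: str, repo_name: str, ref: str) -> str:
    """Build the GET URL for the latest request of repo_name at ref."""
    # urlencode percent-encodes the "/" in repo_name.
    query = urlencode({"repo_name": repo_name, "ref": ref})
    return f"{api_root.rstrip('/')}/api/v1/requests/latest?{query}"
```

For example, `latest_request_url("https://cachito.example.com", "release-engineering/retrodep", "bc9767a71ede6e0084ae4a9e01dcd8b81c30b741")` yields a URL whose query string carries both parameters.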
Operation
The endpoint queries the Request table, leveraging an index on the ref column to
efficiently filter results. It matches the repo column based on a substring
search, which requires a wildcard comparison due to the nature of the repo_name
input.
To optimize performance, the 'ref' parameter is filtered first to reduce the
dataset before applying the wildcard search on the 'repo_name'. The latest request
is identified by the highest id value.
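The filtering logic described here can be sketched in plain SQL, run through Python's sqlite3 module so it is easy to try out. This is not the SQLAlchemy draft: the lowercase table name `request` and the helper names are assumptions, and the query planner, not the order of the WHERE clauses, decides whether the index on `ref` is used.

```python
import sqlite3
from typing import Optional


def escape_like(value: str) -> str:
    """Escape LIKE wildcards so repo_name is matched literally."""
    return value.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")


def get_latest_request_id(
    conn: sqlite3.Connection, repo_name: str, ref: str
) -> Optional[int]:
    """Return the highest request id matching ref and repo_name, or None."""
    row = conn.execute(
        "SELECT id FROM request"
        " WHERE ref = ? AND repo LIKE ? ESCAPE '\\'"
        " ORDER BY id DESC LIMIT 1",
        (ref, f"%{escape_like(repo_name)}%"),
    ).fetchone()
    return row[0] if row else None
```

Escaping matters because `_` is itself a LIKE wildcard, and repository names frequently contain underscores.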
SQLAlchemy Query (Draft)
Parameters
directory structure. This parameter is a substring of the full repository
URL stored in the 'repo' column. Example: "release-engineering/retrodep"
from the database index. Example: "bc9767a71ede6e0084ae4a9e01dcd8b81c30b741"
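Before querying, the pruner could check both parameters against the constraints in the draft spec (a 40-character lowercase hex ref, and a repo_name of at most 200 characters). A minimal sketch, with `validate_params` as a hypothetical helper name:

```python
import re

REF_PATTERN = re.compile(r"[a-f0-9]{40}")
REPO_NAME_MAX_LENGTH = 200


def validate_params(repo_name: str, ref: str) -> None:
    """Raise ValueError if either parameter violates the draft constraints."""
    if len(repo_name) > REPO_NAME_MAX_LENGTH:
        raise ValueError(f"repo_name exceeds {REPO_NAME_MAX_LENGTH} characters")
    # fullmatch rejects trailing newlines that '^...$' with match() would accept.
    if not REF_PATTERN.fullmatch(ref):
        raise ValueError("ref must be a 40-character lowercase hex git ref")
```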
Response
OpenAPI Specification (Draft)