Apply the GitHub app filter before fetching data in the integration, not after
complete
Kevin Puventhiranathan
We currently have more than 14k repos in our Port inventory.
Only about a quarter of them are active (not archived).
If we set the query in the selector to ".repo.archived == false", the GitHub integration pulls all repos, and then filters at the point of mapping.
The "List organization repositories" endpoint in the REST API doesn't support filtering by archived status.
But the Search API (REST, GraphQL) allows only active repos to be fetched using the following query:
archived:false
- https://docs.github.com/en/rest/search/search?versionId=free-pro-team%40latest&productId=search-github&restPage=searching-on-github%2Csearching-for-repositories&apiVersion=2022-11-28#search-repositories
- https://docs.github.com/en/graphql/reference/queries#search
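As a rough illustration of the request above, here is a minimal Python sketch of fetching only non-archived repos through the REST Search API. It is not how the Port integration is implemented. The organization name, the `fetch_active_repos` helper, and the injectable `get_json` parameter are assumptions made for the example, and it stops at the 1,000 results the Search API returns per query.

```python
import json
import urllib.parse
import urllib.request

SEARCH_URL = "https://api.github.com/search/repositories"
MAX_SEARCH_RESULTS = 1000  # the Search API returns at most 1,000 results per query


def build_search_url(org, page, per_page=100):
    """Build a search URL that asks GitHub for non-archived repos only."""
    query = f"org:{org} archived:false"
    params = urllib.parse.urlencode({"q": query, "per_page": per_page, "page": page})
    return f"{SEARCH_URL}?{params}"


def http_get_json(url, token=None):
    """Perform a GET request and decode the JSON body (token is optional)."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def fetch_active_repos(org, get_json=http_get_json, per_page=100):
    """Page through search results until a short page or the result cap."""
    repos = []
    page = 1
    while len(repos) < MAX_SEARCH_RESULTS:
        payload = get_json(build_search_url(org, page, per_page))
        items = payload.get("items", [])
        repos.extend(items)
        if len(items) < per_page:
            break  # last page reached
        page += 1
    return repos[:MAX_SEARCH_RESULTS]
```

With this approach the archived repos never leave GitHub, so nothing has to be filtered out afterwards at the mapping stage.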
We would like the filter to be applied before the data is fetched, not after, as it currently is.
It would make the GitHub integration faster and reduce load.
Maya Margalit
marked this post as complete
Supported in the new GitHub Ocean integration!
Docs: https://docs.port.io/build-your-software-catalog/sync-data-to-catalog/git/github-ocean/
Migration guide: https://docs.port.io/build-your-software-catalog/sync-data-to-catalog/git/github-ocean/migration-guide
Melody Anyaegbulam
Hey Kevin Puventhiranathan, this has now been implemented and is available on GitHub Ocean from v3.1.0-beta. You can bump to the latest version to use it; the docs are here: https://docs.port.io/build-your-software-catalog/sync-data-to-catalog/git/github-ocean/#ingest-repositories-via-search-api
Melody Anyaegbulam
Hi Kevin Puventhiranathan, we've reviewed your feature request and are pleased to confirm we will be implementing it as an optional add-on.
This functionality is enabled by the GitHub Search API. While this API delivers the specific capability you are looking for, we must highlight several technical constraints:
Result Limitation: The Search API enforces a hard limit of 1,000 results for any single query. Based on your current repository count, this means the add-on would be limited to ingesting roughly one-third of your active repositories.
Stricter Rate Limits: The Search API has significantly more restrictive rate limits than other GitHub endpoints, which could impact overall system performance during ingestion.
Potential for Stale Data: The nature of the Search API means there is an increased potential for data to become momentarily stale compared to real-time endpoint usage.
We believe that offering this as an add-on is the best way to deliver the requested value while keeping these upstream constraints isolated.
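The "roughly one-third" figure in the result limitation above can be sanity-checked with a short calculation. This is only a sketch using the numbers quoted in this thread (about 14k repos, about a quarter of them active); the `search_coverage` helper is hypothetical.

```python
SEARCH_RESULT_CAP = 1000  # maximum results the Search API returns per query


def search_coverage(total_repos, active_fraction, cap=SEARCH_RESULT_CAP):
    """Return (active repos, repos reachable by one query, share reachable)."""
    active = round(total_repos * active_fraction)
    if active == 0:
        return 0, 0, 1.0  # nothing to ingest, so nothing is missed
    reachable = min(active, cap)
    return active, reachable, reachable / active


active, reachable, share = search_coverage(14_000, 0.25)
print(f"{reachable} of about {active} active repos ({share:.0%})")
# prints: 1000 of about 3500 active repos (29%)
```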
Marcel Rossouw
Have you tried adding something like this in the query selector? We used this to remove pull requests from archived repos already in the catalog, so perhaps it can also remove those archived repos (it's a workaround at best).
(if .head?.repo?.archived then false else true end)
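For readers less familiar with jq, the selector above can be read roughly as the following Python predicate. This is a sketch for dictionary-shaped pull request payloads only; the `keep_pull_request` name is hypothetical, and other input shapes are not covered.

```python
def keep_pull_request(pr):
    """Approximate the jq selector: keep a PR unless its head repo is archived."""
    # Missing or null "head" / "repo" keys behave like jq's null propagation.
    head = pr.get("head") or {}
    repo = head.get("repo") or {}
    archived = repo.get("archived")
    # jq treats only null and false as falsy, so those two values keep the PR.
    return archived is None or archived is False
```

Note that a pull request with no head repo information is kept, since the jq expression falls through to the `else true` branch in that case.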
Kevin Puventhiranathan
Marcel Rossouw Thank you for your suggestion!
The query selector filters the data after it has reached Port.
So using the example above, the 14k repository resource payloads will be sent to Port and then filtered out by the mapping process.
It'd be a lot more efficient if the repos were filtered out when querying GitHub, so only active repo payloads are sent to Port.