The GitHub read and publish feature makes it easy to share data between Frictionless and GitHub repositories. All read/write functionality is a wrapper around the PyGithub library, which is used under the hood to connect to the GitHub API.
We need to install the GitHub extra dependencies to use this feature:
pip install frictionless[github] --pre
pip install 'frictionless[github]' --pre # for zsh shell
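After installing, a quick check that both packages can be found may save a confusing import error later. This is a minimal sketch: `has_module` is a hypothetical helper introduced here, and it assumes PyGithub is importable under its usual import name `github`.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module with this name can be located."""
    return importlib.util.find_spec(name) is not None


# Assumption: PyGithub is imported as "github".
missing = [name for name in ("frictionless", "github") if not has_module(name)]
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("GitHub extra looks installed")
```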
You can read data from a github repository as follows:
from frictionless import Package
package = Package("https://github.com/fdtester/test-repo-with-datapackage-json")
print(package)
{'name': 'test-package',
'resources': [{'name': 'first-resource',
'type': 'table',
'path': 'table.xls',
'scheme': 'file',
'format': 'xls',
'mediatype': 'application/vnd.ms-excel',
'schema': {'fields': [{'name': 'id', 'type': 'number'},
{'name': 'name', 'type': 'string'}]}}]}
To increase the access limit, pass 'apikey' as a parameter to the reader function as follows:
from frictionless import portals, Package
control = portals.GithubControl(apikey=apikey)
package = Package("https://github.com/fdtester/test-repo-with-datapackage-json", control=control)
print(package)
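The examples on this page use an `apikey` variable without showing where it comes from. One common approach is to read the token from an environment variable rather than hard-coding it. The sketch below is an illustration only: the variable name `GITHUB_ACCESS_TOKEN` and the helpers `load_apikey` and `make_control` are assumptions, not part of frictionless.

```python
import os
from typing import Optional


def load_apikey(env_var: str = "GITHUB_ACCESS_TOKEN") -> Optional[str]:
    """Read the token from the environment; return None if unset or blank."""
    token = os.environ.get(env_var, "").strip()
    return token or None


def make_control(apikey: Optional[str]):
    """Build a GithubControl only when a token is available."""
    if apikey is None:
        return None  # no token: reads stay unauthenticated
    # Imported lazily so this module loads without frictionless[github].
    from frictionless import portals

    return portals.GithubControl(apikey=apikey)


apikey = load_apikey()
print("authenticated" if apikey else "unauthenticated")
```

The value returned by `make_control(apikey)` can then be passed as `control=` when creating a `Package`.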
The reader function can read a package from repos with or without a data package descriptor. If the repo does not have a descriptor, it creates one named after the repo. By default, the function reads files of type csv, xlsx and xls, but we can set the file types using control parameters.
If the repo has a descriptor, it simply returns the descriptor as shown above.
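To make the fallback behaviour concrete, here is a rough sketch of how files might be selected by format and how a package name could be derived from the repo name. It is not the plugin's actual implementation; `select_data_files` and `default_package_name` are hypothetical helpers, and the default formats are the ones named in the text.

```python
from pathlib import PurePosixPath
from typing import Iterable, List

# Default formats as stated in the text.
DEFAULT_FORMATS = ("csv", "xlsx", "xls")


def select_data_files(
    paths: Iterable[str], formats: Iterable[str] = DEFAULT_FORMATS
) -> List[str]:
    """Keep only the paths whose extension is one of the requested formats."""
    wanted = {fmt.lower() for fmt in formats}
    return [
        path
        for path in paths
        if PurePosixPath(path).suffix.lstrip(".").lower() in wanted
    ]


def default_package_name(repo_full_name: str) -> str:
    """Use the repo name (the part after the owner) as the package name."""
    return repo_full_name.rsplit("/", 1)[-1]


files = ["README.md", "data/capitals.csv", "table.xls", "notes.txt"]
print(select_data_files(files))                   # ['data/capitals.csv', 'table.xls']
print(select_data_files(files, formats=["csv"]))  # ['data/capitals.csv']
print(default_package_name("fdtester/test-repo-without-datapackage"))
```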
Once you have read the package from the repo, you can easily access its resources and their data, for example:
from frictionless import Package
package = Package("https://github.com/fdtester/test-repo-with-datapackage-json")
print(package.get_resource('first-resource').read_rows())
[{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]
A Catalog is a container for packages. We can read one or more repositories from GitHub and create a catalog.
from frictionless import portals, Catalog
control = portals.GithubControl(search="'TestAction: Read' in:readme", apikey=apikey)
catalog = Catalog(
"https://github.com/fdtester", control=control
)
print("Total packages", len(catalog.packages))
print(catalog.packages[:2])
Total packages 4
[{'resources': [{'name': 'capitals',
'type': 'table',
'path': 'data/capitals.csv',
'scheme': 'file',
'format': 'csv',
'encoding': 'utf-8',
'mediatype': 'text/csv',
'dialect': {'csv': {'skipInitialSpace': True}},
'schema': {'fields': [{'name': 'id', 'type': 'integer'},
{'name': 'cid', 'type': 'integer'},
{'name': 'name', 'type': 'string'}]}}]},
{'name': 'test-repo-jquery',
'resources': [{'name': 'country-1',
'type': 'table',
'path': 'https://raw.githubusercontent.com/fdtester/test-repo-jquery/main/country-1.csv',
'scheme': 'https',
'format': 'csv',
'mediatype': 'text/csv'}]}]
To read a catalog, we need an authenticated user, so we have to pass the token as 'apikey' to the function. In the above example, we use search text to narrow the repositories down to a small number. The search field is not mandatory.
We can simply use 'control' parameters and get the same result as above, for example:
from frictionless import portals, Catalog
control = portals.GithubControl(search="'TestAction: Read' in:readme", user="fdtester", apikey=apikey)
catalog = Catalog(control=control)
print("Total packages", len(catalog.packages))
print(catalog.packages[:2])
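The URL form and the `user` parameter carry the same information, which is why both calls give the same catalog. As an illustration of that mapping (not the plugin's own parsing code), a GitHub URL can be split into its user and optional repo parts. `split_github_url` is a hypothetical helper.

```python
from typing import Optional, Tuple
from urllib.parse import urlparse


def split_github_url(url: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (user, repo) from a GitHub URL; either part may be None."""
    parts = [part for part in urlparse(url).path.split("/") if part]
    user = parts[0] if len(parts) > 0 else None
    repo = parts[1] if len(parts) > 1 else None
    return user, repo


print(split_github_url("https://github.com/fdtester"))
# ('fdtester', None)
print(split_github_url("https://github.com/fdtester/test-repo-with-datapackage-json"))
# ('fdtester', 'test-repo-with-datapackage-json')
```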
As shown in the example above, we can use different qualifiers to search the repos. The above example searches for all the repos that have the text 'TestAction: Read' in their readme files. Similarly, we can use many other qualifiers and combinations of them. For the full list of qualifiers, see the GitHub search documentation.
Some examples of the qualifiers:
'jquery' in:name
'jquery' in:name user:name
sort:updated-asc 'TestAction: Read' in:readme
If we want to read the repositories of user 'fdtester' that have 'jquery' in their name, we write the search query as follows:
from frictionless import portals, Catalog
control = portals.GithubControl(apikey=apikey, search="user:fdtester jquery in:name")
catalog = Catalog(control=control)
print(catalog.packages)
[{'name': 'test-repo-jquery',
'resources': [{'name': 'country-1',
'type': 'table',
'path': 'https://raw.githubusercontent.com/fdtester/test-repo-jquery/main/country-1.csv',
'scheme': 'https',
'format': 'csv',
'mediatype': 'text/csv'}]}]
There is only one repository with 'jquery' in its name in this user's account, so only one repository is returned.
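Search strings like these are easy to mistype, so it can help to assemble them from named parts. The following is a small sketch; `build_search` and its parameter names are hypothetical and simply reproduce the query shapes used on this page.

```python
from typing import Optional


def build_search(
    text: str = "",
    *,
    where: Optional[str] = None,
    user: Optional[str] = None,
    sort: Optional[str] = None,
) -> str:
    """Join search text and qualifiers into a single query string."""
    parts = []
    if user:
        parts.append(f"user:{user}")
    if sort:
        parts.append(f"sort:{sort}")
    if text:
        # Quote multi-word text so it is treated as one phrase.
        parts.append(f"'{text}'" if " " in text else text)
    if where:
        parts.append(f"in:{where}")
    return " ".join(parts)


print(build_search("jquery", where="name", user="fdtester"))
# user:fdtester jquery in:name
print(build_search("TestAction: Read", where="readme", user="fdtester", sort="updated-desc"))
# user:fdtester sort:updated-desc 'TestAction: Read' in:readme
```

The resulting string can be passed as the `search` argument of `portals.GithubControl`.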
We can also read repositories in a defined order using the 'sort' param or qualifier. Here we read the repos with the text 'TestAction: Read' in their readme file, most recently updated first, for example:
from frictionless import portals, Catalog
control = portals.GithubControl(apikey=apikey, search="user:fdtester sort:updated-desc 'TestAction: Read' in:readme")
catalog = Catalog(control=control)
for index, package in enumerate(catalog.packages):
    print(f"package:{index}", "\n")
    print(package)
package:0
{'name': 'test-repo-jquery',
'resources': [{'name': 'country-1',
'type': 'table',
'path': 'https://raw.githubusercontent.com/fdtester/test-repo-jquery/main/country-1.csv',
'scheme': 'https',
'format': 'csv',
'mediatype': 'text/csv'}]}
package:1
{'resources': [{'name': 'capitals',
'type': 'table',
'path': 'data/capitals.csv',
'scheme': 'file',
'format': 'csv',
'encoding': 'utf-8',
'mediatype': 'text/csv',
'dialect': {'csv': {'skipInitialSpace': True}},
'schema': {'fields': [{'name': 'id', 'type': 'integer'},
{'name': 'cid', 'type': 'integer'},
{'name': 'name', 'type': 'string'}]}}]}
package:2
{'name': 'test-tabulator',
'resources': [{'name': 'first-resource',
'path': 'table.xls',
'schema': {'fields': [{'name': 'id', 'type': 'number'},
{'name': 'name', 'type': 'string'}]}},
{'name': 'number-two',
'path': 'table-reverse.csv',
'schema': {'fields': [{'name': 'id', 'type': 'integer'},
{'name': 'name', 'type': 'string'}]}}]}
To write data to the repository, we use the Package.publish function as follows:
from frictionless import portals, Package
package = Package('1174/datapackage.json')
control = portals.GithubControl(repo="test-new-repo-doc", name='FD', email=email, apikey=apikey)
response = package.publish(control=control)
print(response)
Repository(full_name="fdtester/test-new-repo-doc")
We need to specify name explicitly if the user doesn't have a name set in their GitHub account, and email if the account's email is private and hidden. Otherwise, this information is taken from the user account. To publish/write to a repository, we need an API token with 'repository write' access.
If the package is successfully published, the response is a 'Repository' instance.
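Since name and email are only needed in some cases while the token is always required for writing, the control arguments can be assembled conditionally. This is a sketch under stated assumptions: `publish_options` is a hypothetical helper, and the keys it returns match the GithubControl parameters used in the example above.

```python
from typing import Dict, Optional


def publish_options(
    repo: str,
    apikey: Optional[str],
    name: Optional[str] = None,
    email: Optional[str] = None,
) -> Dict[str, str]:
    """Collect keyword arguments for a publishing control."""
    if not apikey:
        raise ValueError("publishing requires an apikey with repository write access")
    options = {"repo": repo, "apikey": apikey}
    # Only pass name/email when given, so the account values are used otherwise.
    if name:
        options["name"] = name
    if email:
        options["email"] = email
    return options


options = publish_options("test-new-repo-doc", apikey="<token>", name="FD")
print(sorted(options))  # ['apikey', 'name', 'repo']

# Intended use (needs frictionless[github] and a real token):
# control = portals.GithubControl(**options)
# response = package.publish(control=control)
```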
We can control the behavior of all three of the above functions using various params.
For example, to read only 'csv' files in package we use the following code:
from frictionless import portals, Package
control = portals.GithubControl(user="fdtester", formats=["csv"], repo="test-repo-without-datapackage")
package = Package("https://github.com/fdtester/test-repo-with-datapackage-json")
print(package)
{'name': 'test-package',
'resources': [{'name': 'first-resource',
'type': 'table',
'path': 'table.xls',
'scheme': 'file',
'format': 'xls',
'mediatype': 'application/vnd.ms-excel',
'schema': {'fields': [{'name': 'id', 'type': 'number'},
{'name': 'name', 'type': 'string'}]}}]}
To read the first page of the search results and create a catalog, we use the per_page and page params as follows:
from frictionless import portals, Catalog
control = portals.GithubControl(apikey=apikey, search="user:fdtester sort:updated-desc 'TestAction: Read' in:readme", per_page=1, page=1)
catalog = Catalog(control=control)
print(catalog.packages)
[{'name': 'test-repo-jquery',
'resources': [{'name': 'country-1',
'type': 'table',
'path': 'https://raw.githubusercontent.com/fdtester/test-repo-jquery/main/country-1.csv',
'scheme': 'https',
'format': 'csv',
'mediatype': 'text/csv'}]}]
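The relationship between `page`, `per_page` and the total number of results is simple arithmetic, sketched below. The helpers are hypothetical and operate on a plain list rather than on the GitHub API; the default of 30 and maximum of 100 results per page are the values given in the control reference on this page.

```python
import math
from typing import List, Sequence

DEFAULT_PER_PAGE = 30
MAX_PER_PAGE = 100


def clamp_per_page(per_page: int = DEFAULT_PER_PAGE) -> int:
    """Keep per_page within the range 1..MAX_PER_PAGE."""
    return max(1, min(per_page, MAX_PER_PAGE))


def page_count(total_results: int, per_page: int = DEFAULT_PER_PAGE) -> int:
    """Number of pages needed to list all results."""
    return math.ceil(total_results / clamp_per_page(per_page))


def page_slice(items: Sequence, page: int, per_page: int = DEFAULT_PER_PAGE) -> List:
    """Return the items on a given page; pages are numbered from 1."""
    size = clamp_per_page(per_page)
    start = (page - 1) * size
    return list(items[start:start + size])


results = ["repo-a", "repo-b", "repo-c", "repo-d"]
print(page_count(len(results), per_page=1))      # 4
print(page_slice(results, page=1, per_page=1))   # ['repo-a']
```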
Similarly, we can also control the write function using params as follows:
from frictionless import portals, Package
package = Package('datapackage.json')
control = portals.GithubControl(repo="test-repo", name='FD Test', email="test@gmail", apikey=apikey)
response = package.publish(control=control)
print(response)
Repository(full_name="fdtester/test-repo")
Github control representation
(*, title: Optional[str] = None, description: Optional[str] = None, apikey: Optional[str] = None, basepath: Optional[str] = None, email: Optional[str] = None, formats: Optional[List[str]] = [csv, tsv, xlsx, xls, jsonl, ndjson], name: Optional[str] = None, order: Optional[str] = None, page: Optional[int] = None, per_page: Optional[int] = 30, repo: Optional[str] = None, search: Optional[str] = None, sort: Optional[str] = None, user: Optional[str] = None, filename: Optional[str] = None, enable_pages: Optional[bool] = None) -> None
apikey
The access token used to authenticate to the GitHub API. It is required to write files to a GitHub repo. For reading, it is optional; however, using an apikey increases the API access limit from 60 to 5000 requests per hour. To write, the access token must have repository write access.
Optional[str]
basepath
The base folder that the package and resource files will be written to.
Optional[str]
email
Email used while publishing the data to the GitHub repo. It should be set explicitly if the primary email for the GitHub account is not set to public.
Optional[str]
formats
Instructs the plugin to read only the specified types of files. By default it is set to 'csv,xls,xlsx'.
Optional[List[str]]
name
Name of the GitHub user, used while publishing the data. It should be provided explicitly if the name of the user is not set in the GitHub account.
Optional[str]
order
The order in which to retrieve the data sorted by the 'sort' param. It can be one of: 'asc', 'desc'. This parameter is ignored if 'sort' is not provided.
Optional[str]
page
If specified, only the given page is returned.
Optional[int]
per_page
The number of results per page. Default value is 30. Max value is 100.
Optional[int]
repo
Name of the repo to read or write.
Optional[str]
search
Search query containing one or more search keywords and qualifiers to filter the repositories. For example, 'windows+label:bug+language:python'.
Optional[str]
sort
Sorts the result of the query by number of stars, forks, help-wanted-issues or updated. By default the results are sorted by best match in desc order.
Optional[str]
user
Username of the GitHub account.
Optional[str]
filename
Custom data package file name used while publishing the data. By default it will use 'datapackage.json'.
Optional[str]
enable_pages
Optional[bool]
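The constraints described for the sort, order and per_page parameters can be checked before a control is created. This is a client-side sketch only: `check_search_options` is a hypothetical helper that is not part of frictionless, and the allowed values are taken from the descriptions above rather than from the GitHub API itself.

```python
from typing import List, Optional

ALLOWED_SORT = {"stars", "forks", "help-wanted-issues", "updated"}
ALLOWED_ORDER = {"asc", "desc"}


def check_search_options(
    sort: Optional[str] = None,
    order: Optional[str] = None,
    per_page: int = 30,
) -> List[str]:
    """Return a list of human-readable problems; empty means no issues found."""
    problems = []
    if sort is not None and sort not in ALLOWED_SORT:
        problems.append(f"sort must be one of {sorted(ALLOWED_SORT)}")
    if order is not None:
        if order not in ALLOWED_ORDER:
            problems.append("order must be 'asc' or 'desc'")
        elif sort is None:
            problems.append("order is ignored because sort is not set")
    if not 1 <= per_page <= 100:
        problems.append("per_page must be between 1 and 100")
    return problems


print(check_search_options(sort="updated", order="desc", per_page=100))  # []
print(check_search_options(order="asc"))
# ['order is ignored because sort is not set']
```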