With the CKAN portal feature you can load packages from, and publish packages to, CKAN, an open-source data management system.
To install this plugin, run:
pip install frictionless[ckan] --pre
pip install 'frictionless[ckan]' --pre # for zsh shell
To import a dataset from a CKAN instance as a Frictionless Package:
from frictionless.portals import CkanControl
from frictionless import Package
ckan_control = CkanControl()
package = Package('https://legado.dados.gov.br/dataset/bolsa-familia-pagamentos', control=ckan_control)
Here 'https://legado.dados.gov.br/dataset/bolsa-familia-pagamentos' is the URL of the CKAN dataset. This will download the dataset and the metadata of all its resources.
You can pass parameters to the CKAN Control to configure it, such as the CKAN instance base URL (baseurl) and the dataset that you want to download (dataset):
from frictionless.portals import CkanControl
from frictionless import Package
ckan_control = CkanControl(baseurl='https://legado.dados.gov.br', dataset='bolsa-familia-pagamentos')
package = Package(control=ckan_control)
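The dataset URL used in the first example combines the same base URL and dataset name that are passed separately here. As an illustration only, the following sketch splits such a URL with the standard library. `split_dataset_url` is a hypothetical helper, not part of Frictionless, and it assumes the common CKAN URL layout `<baseurl>/dataset/<name>`.

```python
from typing import Tuple
from urllib.parse import urlsplit


def split_dataset_url(url: str) -> Tuple[str, str]:
    """Split a CKAN dataset URL into (baseurl, dataset).

    Hypothetical helper: assumes the layout <baseurl>/dataset/<name>.
    """
    parts = urlsplit(url)
    # Everything before the "/dataset/" path segment belongs to the base URL.
    prefix, sep, name = parts.path.rpartition("/dataset/")
    name = name.strip("/")
    if not sep or not name:
        raise ValueError(f"not a CKAN dataset URL: {url!r}")
    baseurl = f"{parts.scheme}://{parts.netloc}{prefix}"
    return baseurl, name


baseurl, dataset = split_dataset_url(
    "https://legado.dados.gov.br/dataset/bolsa-familia-pagamentos"
)
print(baseurl)  # https://legado.dados.gov.br
print(dataset)  # bolsa-familia-pagamentos
```

The two values correspond to the baseurl and dataset parameters shown above.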
You don't need to pass the dataset parameter to CkanControl. If you pass only the baseurl, you can download a package as follows:
from frictionless.portals import CkanControl
from frictionless import Package
ckan_control = CkanControl(baseurl='https://legado.dados.gov.br')
package = Package('bolsa-familia-pagamentos', control=ckan_control)
If the CKAN dataset has a resource with errors in its schema, you can still load the package by passing the parameter ignore_schema=True to the CKAN Control:
from frictionless.portals import CkanControl
from frictionless import Package
ckan_control = CkanControl(baseurl='https://legado.dados.gov.br', ignore_schema=True)
package = Package('bolsa-familia-pagamentos', control=ckan_control)
This will download the dataset and all its resources, saving the resources' original schemas in original_schema.
To publish a Package to a CKAN instance you will need an API key from a CKAN user that has permission to create datasets. This key can be passed to the CKAN Control as the parameter apikey:
from frictionless.portals import CkanControl
from frictionless import Package
ckan_control = CkanControl(baseurl='https://legado.dados.gov.br', apikey='YOUR-SECRET-API-KEY')
package = Package(...) # Create your package
package.publish(control=ckan_control)
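Hard-coding a secret in a script makes it easy to leak through version control. One common alternative, sketched here, is to read the key from an environment variable before handing it to the CKAN Control. The variable name `CKAN_API_KEY` and the helper `read_api_key` are assumptions made for this illustration; Frictionless does not define them.

```python
import os


def read_api_key(env_var: str = "CKAN_API_KEY") -> str:
    """Return the API key stored in an environment variable.

    Hypothetical helper: the variable name is an assumption of this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"set {env_var} before publishing to CKAN")
    return key


# Demo value only; in practice the variable is set outside the script.
os.environ.setdefault("CKAN_API_KEY", "YOUR-SECRET-API-KEY")
apikey = read_api_key()
# The value can then be passed on as the apikey parameter of CkanControl.
```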
You can download a list of CKAN datasets using the Catalog.
import frictionless
from frictionless import portals, Catalog
ckan_control = portals.CkanControl(baseurl='https://legado.dados.gov.br')
c = Catalog(control=ckan_control)
This will download all datasets from the instance, limited only by the maximum number of datasets returned by the instance's CKAN API. If the instance returns only 10 datasets by default, you can request more packages by passing the parameter num_packages. In the example above, if you want to download 1000 datasets, you can do:
import frictionless
from frictionless import portals, Catalog
ckan_control = portals.CkanControl(baseurl='https://legado.dados.gov.br', num_packages=1000)
c = Catalog(control=ckan_control)
When you request a large number of packages from CKAN, some of them may not have a valid Package descriptor according to the specifications. In that case the default behaviour is to stop downloading and raise an exception. If you want to ignore individual package errors, you can pass the parameter ignore_package_errors=True:
import frictionless
from frictionless import portals, Catalog
ckan_control = portals.CkanControl(baseurl='https://legado.dados.gov.br', ignore_package_errors=True, num_packages=1000)
c = Catalog(control=ckan_control)
The output of the command above lists the IDs of the CKAN datasets with errors and the total number of packages returned by your query to the CKAN instance:
Error in CKAN dataset 8d60eff7-1a46-42ef-be64-e8979117a378: [package-error] The data package has an error: descriptor is not valid (The data package has an error: property "contributors[].email" is not valid "email")
Error in CKAN dataset 933d7164-8128-4e12-97e6-208bc4935bcb: [package-error] The data package has an error: descriptor is not valid (The data package has an error: property "contributors[].email" is not valid "email")
Error in CKAN dataset 93114fec-01c2-4ef5-8dfe-67da5027d568: [package-error] The data package has an error: descriptor is not valid (The data package has an error: property "contributors[].email" is not valid "email") (The data package has an error: property "contributors[].email" is not valid "email")
Total number of packages: 13786
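When many packages are skipped, it can help to collect the failing dataset IDs for later inspection. The sketch below pulls them out of captured output with a regular expression. It assumes the line format of the sample output (`Error in CKAN dataset <id>: ...` and `Total number of packages: <n>`); `summarize_output` is a hypothetical helper, and the error messages are shortened here.

```python
import re
from typing import List, Optional, Tuple

ERROR_RE = re.compile(r"^Error in CKAN dataset ([0-9a-f-]+):")
TOTAL_RE = re.compile(r"^Total number of packages: (\d+)$")


def summarize_output(text: str) -> Tuple[List[str], Optional[int]]:
    """Return (failing dataset ids, reported total) from captured output."""
    failed: List[str] = []
    total: Optional[int] = None
    for line in text.splitlines():
        line = line.strip()
        error = ERROR_RE.match(line)
        if error:
            failed.append(error.group(1))
            continue
        count = TOTAL_RE.match(line)
        if count:
            total = int(count.group(1))
    return failed, total


# Shortened copy of the sample output; the messages are elided.
sample = """\
Error in CKAN dataset 8d60eff7-1a46-42ef-be64-e8979117a378: [package-error] ...
Error in CKAN dataset 933d7164-8128-4e12-97e6-208bc4935bcb: [package-error] ...
Total number of packages: 13786
"""
failed_ids, total = summarize_output(sample)
print(failed_ids)
print(total)  # 13786
```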
The sample output shows that 1000 packages were downloaded from a total of 13786 packages. You can download other packages by passing an offset:
import frictionless
from frictionless import portals, Catalog
ckan_control = portals.CkanControl(baseurl='https://legado.dados.gov.br', ignore_package_errors=True, results_offset=1000)
c = Catalog(control=ckan_control)
This will skip the first 1000 packages and download the ones that follow; combine it with num_packages=1000 to request the next 1000.
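Repeating the request with increasing offsets walks through the whole result set. The sketch below computes the (results_offset, num_packages) pairs needed to cover a given total. It assumes that results_offset counts packages to skip, as the example above suggests, and `batch_windows` is a hypothetical helper rather than part of Frictionless.

```python
from typing import Iterator, Tuple


def batch_windows(total: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (results_offset, num_packages) pairs covering `total` packages."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for offset in range(0, total, batch_size):
        # The last window may be smaller than batch_size.
        yield offset, min(batch_size, total - offset)


windows = list(batch_windows(total=13786, batch_size=1000))
print(len(windows))  # 14
print(windows[1])    # (1000, 1000)
print(windows[-1])   # (13000, 786)
```

Each pair could then be passed to CkanControl as results_offset and num_packages to build one Catalog per batch.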
To fetch all packages from an organization, you can use the CKAN Control parameter organization_name. For example, if you want to fetch all datasets from the organization https://legado.dados.gov.br/organization/agencia-espacial-brasileira-aeb, you can do as follows:
import frictionless
from frictionless import portals, Catalog
ckan_control = portals.CkanControl(baseurl='https://legado.dados.gov.br', organization_name='agencia-espacial-brasileira-aeb')
c = Catalog(control=ckan_control)
Similarly, if you want to download all datasets from a CKAN Group, you can pass the parameter group_id to the CKAN Control:
import frictionless
from frictionless import portals, Catalog
ckan_control = portals.CkanControl(baseurl='https://legado.dados.gov.br', group_id='ciencia-informacao-e-comunicacao')
c = Catalog(control=ckan_control)
You can also fetch only the datasets that are returned by the CKAN Package Search endpoint. Pass the search parameters to the CKAN Control as the parameter search:
import frictionless
from frictionless import portals, Catalog
ckan_control = portals.CkanControl(baseurl='https://legado.dados.gov.br', search={'q': 'name:bolsa*'})
c = Catalog(control=ckan_control)
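The search dictionary uses the parameter names of CKAN's package_search action, so a query can be tried against the API directly before it is used in a Catalog. As a rough sketch, the helper below builds such a request URL with the standard library. `package_search_url` is hypothetical and nothing here is taken from the plugin's implementation; it relies only on the CKAN API path `/api/3/action/package_search` and its `q` and `rows` parameters.

```python
from typing import Any, Dict
from urllib.parse import urlencode


def package_search_url(baseurl: str, search: Dict[str, Any]) -> str:
    """Build a CKAN package_search URL from a dict of search parameters."""
    endpoint = f"{baseurl.rstrip('/')}/api/3/action/package_search"
    # urlencode percent-escapes characters such as ":" and "*".
    return f"{endpoint}?{urlencode(search)}" if search else endpoint


url = package_search_url(
    "https://legado.dados.gov.br", {"q": "name:bolsa*", "rows": 5}
)
print(url)
# https://legado.dados.gov.br/api/3/action/package_search?q=name%3Abolsa%2A&rows=5
```

Opening the printed URL in a browser returns the raw JSON response, which is a quick way to check that a query matches the expected datasets.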
Ckan control representation
(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, baseurl: Optional[str] = None, dataset: Optional[str] = None, apikey: Optional[str] = None, ignore_package_errors: Optional[bool] = False, ignore_schema: Optional[bool] = False, group_id: Optional[str] = None, organization_name: Optional[str] = None, search: Optional[Dict[str, Any]] = None, num_packages: Optional[int] = None, results_offset: Optional[int] = None, allow_update: Optional[bool] = False) -> None
baseurl (Optional[str]): Endpoint URL for the CKAN instance, e.g. https://dados.gov.br
dataset (Optional[str]): Unique identifier of the dataset to read or write.
apikey (Optional[str]): The access token used to authenticate to the CKAN instance. It is required to write files to the CKAN instance.
ignore_package_errors (Optional[bool]): Ignore Package errors in a Catalog. If multiple packages are being downloaded and one fails with an invalid descriptor, continue downloading the rest.
ignore_schema (Optional[bool]): Ignore the schemas of the dataset's resources.
group_id (Optional[str]): CKAN Group id used to get datasets in a Catalog.
organization_name (Optional[str]): CKAN Organization name used to get datasets in a Catalog.
search (Optional[Dict[str, Any]]): CKAN Search parameters as defined on https://docs.ckan.org/en/2.9/api/#ckan.logic.action.get.package_search
num_packages (Optional[int]): Maximum number of packages to fetch.
results_offset (Optional[int]): Results page number.
allow_update (Optional[bool]): Update a dataset on publish if an id is provided in the package descriptor.