(2023-01-25 11:55)

CKAN Portal

With the CKAN portal feature you can load packages from, and publish packages to, a CKAN instance. CKAN is an open-source data management system.

Installation

To install this plugin, run:

pip install frictionless[ckan] --pre
pip install 'frictionless[ckan]' --pre # for zsh shell

Reading a Package

To import a dataset from a CKAN instance as a Frictionless Package, pass the dataset URL together with a CKAN Control:

from frictionless.portals import CkanControl
from frictionless import Package

ckan_control = CkanControl()
package = Package('https://legado.dados.gov.br/dataset/bolsa-familia-pagamentos', control=ckan_control)

Here 'https://legado.dados.gov.br/dataset/bolsa-familia-pagamentos' is the URL of the CKAN dataset. This downloads the metadata of the dataset and of all its resources.

You can pass parameters to CKAN Control to configure it, such as the base URL of the CKAN instance (baseurl) and the dataset that you want to download (dataset):

from frictionless.portals import CkanControl
from frictionless import Package

ckan_control = CkanControl(baseurl='https://legado.dados.gov.br', dataset='bolsa-familia-pagamentos')
package = Package(control=ckan_control)

The dataset parameter is optional. If you pass only baseurl, give the dataset name as the package source:

from frictionless.portals import CkanControl
from frictionless import Package

ckan_control = CkanControl(baseurl='https://legado.dados.gov.br')
package = Package('bolsa-familia-pagamentos', control=ckan_control)
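
The URL form and the baseurl/dataset form identify the same dataset. As a minimal sketch of how the two relate, the helper below splits a dataset URL into its two parts. The name split_dataset_url is hypothetical (it is not part of frictionless), and the code assumes the common CKAN URL layout <baseurl>/dataset/<name>:

```python
from urllib.parse import urlsplit


def split_dataset_url(url):
    """Split a CKAN dataset URL into (baseurl, dataset).

    Hypothetical helper: assumes the URL path ends in /dataset/<name>.
    """
    parts = urlsplit(url)
    # Everything before the last "/dataset/" belongs to the base URL.
    head, marker, name = parts.path.rstrip("/").rpartition("/dataset/")
    if not marker or not name or "/" in name:
        raise ValueError(f"not a CKAN dataset URL: {url}")
    baseurl = f"{parts.scheme}://{parts.netloc}{head}"
    return baseurl, name


baseurl, dataset = split_dataset_url(
    "https://legado.dados.gov.br/dataset/bolsa-familia-pagamentos"
)
print(baseurl, dataset)
```

The two values could then be passed on as CkanControl(baseurl=baseurl, dataset=dataset).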

Ignoring a Resource Schema

If the CKAN dataset has a resource with errors in its schema, you can still load the package by passing ignore_schema=True to CKAN Control:

from frictionless.portals import CkanControl
from frictionless import Package

ckan_control = CkanControl(baseurl='https://legado.dados.gov.br', ignore_schema=True)
package = Package('bolsa-familia-pagamentos', control=ckan_control)

This will download the dataset and all its resources, saving each resource's original schema in original_schema.

Publishing a Package

To publish a Package to a CKAN instance you need an API key from a CKAN user that has permission to create datasets. Pass this key to CKAN Control as the apikey parameter:

from frictionless.portals import CkanControl
from frictionless import Package

ckan_control = CkanControl(baseurl='https://legado.dados.gov.br', apikey='YOUR-SECRET-API-KEY')
package = Package(...) # Create your package
package.publish(control=ckan_control)
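
Hard-coding the key in a script makes it easy to leak. One possible approach, sketched below, reads it from an environment variable instead. The variable name CKAN_API_KEY and the helper load_apikey are assumptions made for this illustration; frictionless does not define them:

```python
import os


def load_apikey(env=None, name="CKAN_API_KEY"):
    """Return the API key stored in the given environment mapping.

    `name` is a hypothetical variable name chosen for this example.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"environment variable {name} is not set")
    return key


# Demonstration with an explicit mapping instead of the real environment.
apikey = load_apikey({"CKAN_API_KEY": "YOUR-SECRET-API-KEY"})

# The key could then be passed on as in the example above:
# ckan_control = CkanControl(baseurl='https://legado.dados.gov.br', apikey=apikey)
```

Calling load_apikey() without arguments reads from the real process environment and fails early, before any request is made, if the variable is missing.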

Reading a Catalog

You can download a list of CKAN datasets using the Catalog.


import frictionless
from frictionless import portals, Catalog
    
ckan_control = portals.CkanControl(baseurl='https://legado.dados.gov.br')
c = Catalog(control=ckan_control)

This will download all datasets from the instance, limited only by the maximum number of datasets returned by the instance's CKAN API. If the instance returns only 10 datasets by default, you can request more packages with the num_packages parameter. For example, to download 1000 datasets:


import frictionless
from frictionless import portals, Catalog
    
ckan_control = portals.CkanControl(baseurl='https://legado.dados.gov.br', num_packages=1000)
c = Catalog(control=ckan_control)

When you request a large number of packages from CKAN, some of them may not have a valid Package descriptor according to the specifications. By default, the download stops and an exception is raised. To ignore individual package errors, pass ignore_package_errors=True:


import frictionless
from frictionless import portals, Catalog
    
ckan_control = portals.CkanControl(baseurl='https://legado.dados.gov.br', ignore_package_errors=True, num_packages=1000)
c = Catalog(control=ckan_control)

The output of the command above lists the ids of the CKAN datasets with errors, followed by the total number of packages returned by your query to the CKAN instance:

Error in CKAN dataset 8d60eff7-1a46-42ef-be64-e8979117a378: [package-error] The data package has an error: descriptor is not valid (The data package has an error: property "contributors[].email" is not valid "email")
Error in CKAN dataset 933d7164-8128-4e12-97e6-208bc4935bcb: [package-error] The data package has an error: descriptor is not valid (The data package has an error: property "contributors[].email" is not valid "email")
Error in CKAN dataset 93114fec-01c2-4ef5-8dfe-67da5027d568: [package-error] The data package has an error: descriptor is not valid (The data package has an error: property "contributors[].email" is not valid "email") (The data package has an error: property "contributors[].email" is not valid "email")
Total number of packages: 13786
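
If you want to follow up on the failing datasets, their ids can be pulled out of this output. The sketch below assumes the messages have been captured as text in the format shown above (how you capture them is up to you). The helpers failed_dataset_ids and total_packages are hypothetical, not part of frictionless, and the sample uses shortened copies of the lines above:

```python
import re

# Matches the "Error in CKAN dataset <id>:" prefix of each message.
ERROR_LINE = re.compile(r"^Error in CKAN dataset (\S+?):", re.MULTILINE)
# Matches the summary line that closes the output.
TOTAL_LINE = re.compile(r"^Total number of packages: (\d+)$", re.MULTILINE)


def failed_dataset_ids(output):
    """Return the dataset ids reported as failing, in order of appearance."""
    return ERROR_LINE.findall(output)


def total_packages(output):
    """Return the reported total, or None if the line is missing."""
    match = TOTAL_LINE.search(output)
    return int(match.group(1)) if match else None


sample = """\
Error in CKAN dataset 8d60eff7-1a46-42ef-be64-e8979117a378: [package-error] The data package has an error: descriptor is not valid
Error in CKAN dataset 933d7164-8128-4e12-97e6-208bc4935bcb: [package-error] The data package has an error: descriptor is not valid
Total number of packages: 13786
"""

print(failed_dataset_ids(sample))
print(total_packages(sample))
```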

In this run, 1000 packages were downloaded out of a total of 13786. To download the remaining packages, pass an offset with results_offset:


import frictionless
from frictionless import portals, Catalog
    
ckan_control = portals.CkanControl(baseurl='https://legado.dados.gov.br', ignore_package_errors=True, num_packages=1000, results_offset=1000)
c = Catalog(control=ckan_control)

This will download the next 1000 packages, after the first 1000.
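
To walk through every package, the same call can be repeated with increasing offsets. The sketch below only computes the (results_offset, num_packages) pairs. The helper page_windows is hypothetical, and the code assumes that results_offset counts the packages to skip, as in the example above:

```python
def page_windows(total, page_size):
    """Yield (results_offset, num_packages) pairs covering `total` packages."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    for offset in range(0, total, page_size):
        # The last window may be smaller than page_size.
        yield offset, min(page_size, total - offset)


windows = list(page_windows(total=13786, page_size=1000))
print(len(windows))  # number of requests needed
print(windows[0], windows[-1])

# Each pair could then be used to build one control, for example:
# for offset, count in windows:
#     control = portals.CkanControl(
#         baseurl='https://legado.dados.gov.br',
#         ignore_package_errors=True,
#         num_packages=count,
#         results_offset=offset,
#     )
#     catalog = Catalog(control=control)
```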

Fetching the datasets from an Organization or Group

To fetch all packages from an organization, use the CKAN Control parameter organization_name. For example, to fetch all datasets from the organization https://legado.dados.gov.br/organization/agencia-espacial-brasileira-aeb:

import frictionless
from frictionless import portals, Catalog
    
ckan_control = portals.CkanControl(baseurl='https://legado.dados.gov.br', organization_name='agencia-espacial-brasileira-aeb')
c = Catalog(control=ckan_control)

Similarly, to download all datasets from a CKAN Group, pass the group_id parameter to CKAN Control:

import frictionless
from frictionless import portals, Catalog
    
ckan_control = portals.CkanControl(baseurl='https://legado.dados.gov.br', group_id='ciencia-informacao-e-comunicacao')
c = Catalog(control=ckan_control)

Using CKAN search

You can also fetch only the datasets returned by the CKAN Package Search endpoint. Pass the search parameters to CKAN Control as the search parameter:

import frictionless
from frictionless import portals, Catalog
    
ckan_control = portals.CkanControl(baseurl='https://legado.dados.gov.br', search={'q': 'name:bolsa*'})
c = Catalog(control=ckan_control)
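
The keys of the search dictionary follow the CKAN package_search parameters (such as q, fq and sort). As a hedged illustration, the sketch below builds such a dictionary and shows the query string it would correspond to if sent to the CKAN API directly. The helper build_search is hypothetical, and the sort value is an example chosen for this sketch, not something taken from the portal above:

```python
from urllib.parse import parse_qs, urlencode


def build_search(q, fq=None, sort=None):
    """Build a search dict, leaving out parameters that were not given."""
    candidates = {"q": q, "fq": fq, "sort": sort}
    return {key: value for key, value in candidates.items() if value is not None}


search = build_search("name:bolsa*", sort="metadata_modified desc")

# Query string equivalent to the dict when calling package_search directly.
query = urlencode(search)
print(query)

# Round trip: decoding the query string gives back the same parameters.
decoded = {key: values[0] for key, values in parse_qs(query).items()}
```

The dictionary could then be passed on as portals.CkanControl(baseurl='https://legado.dados.gov.br', search=search).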

Reference

portals.CkanControl (class)

CKAN control representation

Signature

(*, title: Optional[str] = None, description: Optional[str] = None, baseurl: Optional[str] = None, dataset: Optional[str] = None, apikey: Optional[str] = None, ignore_package_errors: Optional[bool] = False, ignore_schema: Optional[bool] = False, group_id: Optional[str] = None, organization_name: Optional[str] = None, search: Optional[dict] = None, num_packages: Optional[int] = None, results_offset: Optional[int] = None, allow_update: Optional[bool] = False) -> None

Parameters

  • title (Optional[str])
  • description (Optional[str])
  • baseurl (Optional[str])
  • dataset (Optional[str])
  • apikey (Optional[str])
  • ignore_package_errors (Optional[bool])
  • ignore_schema (Optional[bool])
  • group_id (Optional[str])
  • organization_name (Optional[str])
  • search (Optional[dict])
  • num_packages (Optional[int])
  • results_offset (Optional[int])
  • allow_update (Optional[bool])

portals.ckanControl.baseurl (property)

Base URL of the CKAN instance, e.g. https://dados.gov.br

Signature

Optional[str]

portals.ckanControl.dataset (property)

Unique identifier of the dataset to read.

Signature

Optional[str]

portals.ckanControl.apikey (property)

The access token used to authenticate to the CKAN instance. It is required to write files to the CKAN instance.

Signature

Optional[str]

portals.ckanControl.ignore_package_errors (property)

Ignore Package errors in a Catalog. If multiple packages are being downloaded and one fails with an invalid descriptor, continue downloading the rest.

Signature

Optional[bool]

portals.ckanControl.ignore_schema (property)

Ignore the schemas of the dataset's resources

Signature

Optional[bool]

portals.ckanControl.group_id (property)

CKAN Group id to get datasets in a Catalog

Signature

Optional[str]

portals.ckanControl.organization_name (property)

CKAN Organization name to get datasets in a Catalog

Signature

Optional[str]

portals.ckanControl.search (property)

CKAN Search parameters as defined on https://docs.ckan.org/en/2.9/api/#ckan.logic.action.get.package_search

Signature

Optional[dict]

portals.ckanControl.num_packages (property)

Maximum number of packages to fetch

Signature

Optional[int]

portals.ckanControl.results_offset (property)

Results page number

Signature

Optional[int]

portals.ckanControl.allow_update (property)

Update an existing dataset on publish if an id is provided in the package descriptor

Signature

Optional[bool]