(2024-11-07 15:17)

Resource Class

The Resource class is arguably the most important class of the whole Frictionless Framework. It's based on the Data Resource Standard and the Tabular Data Resource Standard.

Creating Resource

Let's create a data resource:

from frictionless import Resource

resource = Resource('table.csv') # from a resource path
resource = Resource('resource.json') # from a descriptor path
resource = Resource({'path': 'table.csv'}) # from a descriptor
resource = Resource(path='table.csv') # from arguments

As you can see, it's possible to create a resource from different kinds of sources, and the source type (e.g. whether it's a descriptor or a path) will be detected automatically. It's possible to make this step more explicit:

from frictionless import Resource

resource = Resource(path='data/table.csv') # from a path
resource = Resource('data/resource.json') # from a descriptor
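
To make the idea of automatic source detection more concrete, here is a minimal standard-library sketch. The `classify_source` helper and its suffix rule are hypothetical and are not Frictionless's actual detection logic; they only illustrate why a bare string is ambiguous and why passing `path=` explicitly removes the guesswork.

```python
from pathlib import PurePosixPath

# Hypothetical rule, for illustration only: treat these suffixes as descriptors.
DESCRIPTOR_SUFFIXES = {".json", ".yaml", ".yml"}


def classify_source(source):
    """Guess whether a source is a descriptor, a descriptor path, or a data path."""
    if isinstance(source, dict):
        return "descriptor"
    if isinstance(source, str):
        suffix = PurePosixPath(source).suffix.lower()
        if suffix in DESCRIPTOR_SUFFIXES:
            return "descriptor path"
        return "data path"
    return "inline data"


for source in ["table.csv", "resource.json", {"path": "table.csv"}]:
    print(classify_source(source))
```

A suffix rule like this would misclassify a JSON file that holds data rather than metadata, which is the kind of ambiguity an explicit argument avoids.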

Describing Resource

The standards support a great deal of resource metadata, all of which can be provided with Frictionless Framework too:

from frictionless import Resource

resource = Resource(
    name='resource',
    title='My Resource',
    description='My Resource for the Guide',
    path='table.csv',
    # it's possible to provide all the official properties like mediatype, etc
)
print(resource)
{'name': 'resource',
 'type': 'table',
 'title': 'My Resource',
 'description': 'My Resource for the Guide',
 'path': 'table.csv',
 'scheme': 'file',
 'format': 'csv',
 'mediatype': 'text/csv'}

If you have created a resource, for example, from a descriptor, you can access these properties:

from frictionless import Resource

resource = Resource('resource.json')
print(resource.name)
# and others
name

And edit them:

from frictionless import Resource

resource = Resource('resource.json')
resource.name = 'new-name'
resource.title = 'New Title'
resource.description = 'New Description'
# and others
print(resource)
{'name': 'new-name',
 'type': 'table',
 'title': 'New Title',
 'description': 'New Description',
 'path': 'table.csv',
 'scheme': 'file',
 'format': 'csv',
 'mediatype': 'text/csv'}

Saving Descriptor

Like any of the Metadata classes, the Resource class can be saved as JSON or YAML:

from frictionless import Resource
resource = Resource('table.csv')
resource.to_json('resource.json') # Save as JSON
resource.to_yaml('resource.yaml') # Save as YAML
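
A saved descriptor is an ordinary JSON document, so it can be inspected with any JSON tool. The sketch below uses only the standard library and the descriptor printed in the Describing Resource section; the indentation and key order that `to_json` actually writes are not shown in this guide, so treat the file layout here as an assumption.

```python
import json
import tempfile
from pathlib import Path

# Descriptor copied from the printed resource in the Describing Resource section.
descriptor = {
    "name": "resource",
    "type": "table",
    "title": "My Resource",
    "description": "My Resource for the Guide",
    "path": "table.csv",
    "scheme": "file",
    "format": "csv",
    "mediatype": "text/csv",
}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "resource.json"
    # Write the descriptor as human-readable JSON.
    target.write_text(json.dumps(descriptor, indent=2), encoding="utf-8")
    # Read it back to confirm nothing was lost on the way.
    restored = json.loads(target.read_text(encoding="utf-8"))

print(restored == descriptor)
```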

Resource Lifecycle

You might have noticed that we had to duplicate the with Resource(...) statement in some examples. The reason is that Resource is a streaming interface. Once it's read you need to open it again. Let's show it in an example:

from pprint import pprint
from frictionless import Resource

resource = Resource('capital-3.csv')
resource.open()
pprint(resource.read_rows())
pprint(resource.read_rows())
# We need to re-open: there is no data left
resource.open()
pprint(resource.read_rows())
# We need to close manually: no context manager is used
resource.close()
[{'id': 1, 'name': 'London'},
 {'id': 2, 'name': 'Berlin'},
 {'id': 3, 'name': 'Paris'},
 {'id': 4, 'name': 'Madrid'},
 {'id': 5, 'name': 'Rome'}]
[]
[{'id': 1, 'name': 'London'},
 {'id': 2, 'name': 'Berlin'},
 {'id': 3, 'name': 'Paris'},
 {'id': 4, 'name': 'Madrid'},
 {'id': 5, 'name': 'Rome'}]

At the same time, you can read data from a resource without opening and closing it explicitly. In this case Frictionless Framework will open and close the resource for you, so it will basically be a one-time operation:

from pprint import pprint
from frictionless import Resource

resource = Resource('capital-3.csv')
pprint(resource.read_rows())
[{'id': 1, 'name': 'London'},
 {'id': 2, 'name': 'Berlin'},
 {'id': 3, 'name': 'Paris'},
 {'id': 4, 'name': 'Madrid'},
 {'id': 5, 'name': 'Rome'}]
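
The same one-pass behaviour can be seen with any streaming reader. The following sketch uses the standard `csv` module on an in-memory file as a stand-in for an opened resource; it is an analogy for the lifecycle described above, not Frictionless code, and the sample data is made up.

```python
import csv
import io

CSV_TEXT = "id,name\n1,London\n2,Berlin\n3,Paris\n"

stream = io.StringIO(CSV_TEXT)  # stands in for an opened file
reader = csv.DictReader(stream)

first = list(reader)   # consumes the stream
second = list(reader)  # nothing is left to read

# "Re-opening": rewind the stream and build a fresh reader.
stream.seek(0)
third = list(csv.DictReader(stream))

print(len(first), len(second), len(third))
```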

Reading Data

The Resource class is also a metadata class which provides various read and stream functions. The extract functions always read rows into memory; Resource can do the same but it also gives a choice regarding output data. It can be rows, data, text, or bytes. Let's try reading all of them:

from pprint import pprint
from frictionless import Resource

resource = Resource('country-3.csv')
pprint(resource.read_bytes())
pprint(resource.read_text())
pprint(resource.read_cells())
pprint(resource.read_rows())
(b'id,capital_id,name,population\n1,1,Britain,67\n2,3,France,67\n3,2,Germany,8'
 b'3\n4,5,Italy,60\n5,4,Spain,47\n')
''
[['id', 'capital_id', 'name', 'population'],
 ['1', '1', 'Britain', '67'],
 ['2', '3', 'France', '67'],
 ['3', '2', 'Germany', '83'],
 ['4', '5', 'Italy', '60'],
 ['5', '4', 'Spain', '47']]
[{'id': 1, 'capital_id': 1, 'name': 'Britain', 'population': 67},
 {'id': 2, 'capital_id': 3, 'name': 'France', 'population': 67},
 {'id': 3, 'capital_id': 2, 'name': 'Germany', 'population': 83},
 {'id': 4, 'capital_id': 5, 'name': 'Italy', 'population': 60},
 {'id': 5, 'capital_id': 4, 'name': 'Spain', 'population': 47}]

It's really handy to read all your data into memory but it's not always possible if a file is really big. For such cases, Frictionless provides streaming functions:

from pprint import pprint
from frictionless import Resource

with Resource('country-3.csv') as resource:
    pprint(resource.byte_stream)
    pprint(resource.text_stream)
    pprint(resource.cell_stream)
    pprint(resource.row_stream)
    for row in resource.row_stream:
        print(row)
<frictionless.system.loader.ByteStreamWithStatsHandling object at 0x7fe60aee8190>
<_io.TextIOWrapper name='country-3.csv' encoding='utf-8'>
<itertools.chain object at 0x7fe60b1285b0>
<generator object TableResource.__open_row_stream.<locals>.row_stream at 0x7fe60cf2af10>
{'id': 1, 'capital_id': 1, 'name': 'Britain', 'population': 67}
{'id': 2, 'capital_id': 3, 'name': 'France', 'population': 67}
{'id': 3, 'capital_id': 2, 'name': 'Germany', 'population': 83}
{'id': 4, 'capital_id': 5, 'name': 'Italy', 'population': 60}
{'id': 5, 'capital_id': 4, 'name': 'Spain', 'population': 47}
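
The benefit of streaming is that an aggregate can be computed while holding only one row at a time. Here is a minimal standard-library sketch of that pattern, using the same bytes as the `read_bytes()` output shown earlier; `iter_rows` is a hypothetical helper and not part of Frictionless.

```python
import csv
import io

# Same bytes as the read_bytes() output shown earlier in this section.
RAW = (
    b"id,capital_id,name,population\n1,1,Britain,67\n2,3,France,67\n"
    b"3,2,Germany,83\n4,5,Italy,60\n5,4,Spain,47\n"
)


def iter_rows(byte_stream):
    """Yield one dict per row without materialising the whole table."""
    text_stream = io.TextIOWrapper(byte_stream, encoding="utf-8", newline="")
    yield from csv.DictReader(text_stream)


total = 0
for row in iter_rows(io.BytesIO(RAW)):
    total += int(row["population"])  # only the current row is held in memory

print(total)
```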

Indexing Data

Indexing a resource, in Frictionless terms, means loading a data table into a database. Let's explore how this feature works in different modes.

All the examples are written for SQLite for simplicity.

Normal Mode

This mode is supported for any database that is supported by sqlalchemy. Under the hood, Frictionless will infer a Table Schema and populate the data table as it normally reads data. This means that type errors will be replaced by null values, and in general it is guaranteed to finish successfully for any data, even very invalid data.

frictionless index table.csv --database sqlite:///index/project.db --name table
frictionless extract sqlite:///index/project.db --table table --json
──────────────────────────────────── Index ─────────────────────────────────────

[table] Indexed 3 rows in 0.204 seconds
──────────────────────────────────── Result ────────────────────────────────────
Succesefully indexed 1 tables
{
  "project": [
    {
      "id": 1,
      "name": "english"
    },
    {
      "id": 2,
      "name": "中国人"
    }
  ]
}
import sqlite3
from frictionless import Resource, formats

resource = Resource('table.csv')
resource.index('sqlite:///index/project.db', name='table')
print(Resource('sqlite:///index/project.db', control=formats.sql.SqlControl(table='table')).extract())
{'project': [{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]}
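
To illustrate what "type errors will be replaced by null values" means in practice, here is a small sketch using only the standard `sqlite3` module. It is not how Frictionless implements indexing; the `to_int_or_none` helper and the invalid cell are made up for the example.

```python
import sqlite3


def to_int_or_none(value):
    """Return an int, or None when the cell is not a valid integer."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


# Hypothetical input: the second row has an invalid "id" cell.
cells = [["1", "english"], ["bad", "中国人"]]

connection = sqlite3.connect(":memory:")
connection.execute('CREATE TABLE "table" (id INTEGER, name TEXT)')
connection.executemany(
    'INSERT INTO "table" (id, name) VALUES (?, ?)',
    [(to_int_or_none(row_id), name) for row_id, name in cells],
)

indexed = connection.execute(
    'SELECT id, name FROM "table" ORDER BY rowid'
).fetchall()
connection.close()
print(indexed)
```

The invalid cell ends up as NULL (`None` in Python) instead of aborting the load, which is why this mode can always finish.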

Fast Mode

Fast mode is supported for SQLite and PostgreSQL databases. It will infer a Table Schema using a data sample and index the data using COPY in PostgreSQL and .import in SQLite. For big data files this mode will be 10-30x faster than normal indexing, but the speed comes at a price: if there is invalid data, the indexing will fail.

frictionless index table.csv --database sqlite:///index/project.db --name table --fast
frictionless extract sqlite:///index/project.db --table table --json
──────────────────────────────────── Index ─────────────────────────────────────

[table] Indexed 30 bytes in 0.208 seconds
──────────────────────────────────── Result ────────────────────────────────────
Succesefully indexed 1 tables
{
  "project": [
    {
      "id": 1,
      "name": "english"
    },
    {
      "id": 2,
      "name": "中国人"
    }
  ]
}
import sqlite3
from frictionless import Resource, formats

resource = Resource('table.csv')
resource.index('sqlite:///index/project.db', name='table', fast=True)
print(Resource('sqlite:///index/project.db', control=formats.sql.SqlControl(table='table')).extract())
{'project': [{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]}

Solution 1: Fallback

To ensure that the data will be successfully indexed, it's possible to use the fallback option. If the fast indexing fails, Frictionless will start over in normal mode and finish the process successfully.

frictionless index table.csv --database sqlite:///index/project.db --name table --fast --fallback
import sqlite3
from frictionless import Resource, formats

resource = Resource('table.csv')
resource.index('sqlite:///index/project.db', name='table', fast=True, fallback=True)
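
The fallback behaviour follows a common try-fast-then-retry pattern. The sketch below shows that control flow with plain Python functions; `index_fast`, `index_normal`, and `index` are hypothetical stand-ins, not the Frictionless implementation.

```python
def index_fast(rows):
    """Strict loader: any invalid cell aborts the whole load."""
    return [(int(row_id), name) for row_id, name in rows]


def index_normal(rows):
    """Lenient loader: invalid cells become None instead of failing."""
    result = []
    for row_id, name in rows:
        try:
            result.append((int(row_id), name))
        except ValueError:
            result.append((None, name))
    return result


def index(rows, *, fast=False, fallback=False):
    """Run the strict loader first and optionally retry with the lenient one."""
    if not fast:
        return index_normal(rows)
    try:
        return index_fast(rows)
    except ValueError:
        if not fallback:
            raise
        return index_normal(rows)  # start over in normal mode


rows = [("1", "english"), ("oops", "中国人")]
print(index(rows, fast=True, fallback=True))
```

Without `fallback=True` the same call raises, mirroring the statement that fast indexing fails on invalid data.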

Solution 2: QSV

Another option is to provide a path to the QSV binary. In this case, the initial schema inference will be done based on the whole data file, which guarantees that the table is valid type-wise:

frictionless index table.csv --database sqlite:///index/project.db --name table --fast --qsv qsv_path
import sqlite3
from frictionless import Resource, formats

resource = Resource('table.csv')
resource.index('sqlite:///index/project.db', name='table', fast=True, qsv_path='qsv_path')
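
Why does whole-file inference help? A sample can miss an invalid cell that appears late in the file. The following sketch shows the difference with a made-up column and a made-up sample size; `infer_type` is a hypothetical helper and does not reflect how Frictionless or QSV infer types.

```python
def infer_type(values):
    """Return "integer" if every value parses as an int, else "string"."""
    try:
        for value in values:
            int(value)
    except ValueError:
        return "string"
    return "integer"


# Hypothetical column: the only non-integer cell is far from the top.
column = [str(number) for number in range(1000)] + ["n/a"]

SAMPLE_SIZE = 100  # assumed sample size, for illustration only
from_sample = infer_type(column[:SAMPLE_SIZE])
from_whole_file = infer_type(column)

print(from_sample, from_whole_file)
```

A schema inferred from the sample would declare the column an integer and the strict load would then fail on the last row; inference over the whole file picks a type that every cell satisfies.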

Scheme

The scheme, also known as the protocol, indicates which loader Frictionless should use to read or write data. It can be file (default), text, http, https, s3, and others.

from frictionless import Resource

with Resource(b'header1,header2\nvalue1,value2', format='csv') as resource:
  print(resource.scheme)
  print(resource.to_view())
buffer
+----------+----------+
| header1  | header2  |
+==========+==========+
| 'value1' | 'value2' |
+----------+----------+

Format

The format, or as it's also called, the extension, helps Frictionless choose a proper parser to handle the file. Popular formats are csv, xlsx, json, and others.

from frictionless import Resource

with Resource(b'header1,header2\nvalue1,value2.csv', format='csv') as resource:
  print(resource.format)
  print(resource.to_view())
csv
+----------+--------------+
| header1  | header2      |
+==========+==============+
| 'value1' | 'value2.csv' |
+----------+--------------+
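
For path-based sources, both the scheme and the format can usually be read off the path itself. Here is a rough standard-library sketch of that idea; `guess_scheme_and_format` is a hypothetical helper and the rules are assumptions, not the detection logic Frictionless uses. The example URLs are made up.

```python
import os
from urllib.parse import urlparse


def guess_scheme_and_format(path):
    """Hypothetical helper: derive (scheme, format) from a path or URL."""
    parsed = urlparse(path)
    # A one-letter "scheme" is most likely a Windows drive letter such as C:
    scheme = parsed.scheme if len(parsed.scheme) > 1 else "file"
    extension = os.path.splitext(parsed.path)[1]
    return scheme, extension.lstrip(".").lower()


print(guess_scheme_and_format("table.csv"))
print(guess_scheme_and_format("https://example.com/data/table.xlsx"))
print(guess_scheme_and_format("s3://bucket/key.json"))
```

Inline sources such as the byte strings in the examples above carry no path, which is why the format has to be passed explicitly there.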

Encoding

Frictionless automatically detects the encoding of files, but sometimes the detection can be inaccurate. It's possible to provide an encoding manually:

from frictionless import Resource

with Resource('country-3.csv', encoding='utf-8') as resource:
  print(resource.encoding)
  print(resource.path)
utf-8
country-3.csv
utf-8
data/country-3.csv
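
To see why a wrong encoding guess matters, consider bytes written as Latin-1 and read as UTF-8. This standard-library sketch uses made-up data and does not involve Frictionless's detector; it only shows the failure that an explicit encoding avoids.

```python
raw = "café,münchen\n".encode("latin-1")  # bytes as stored on disk

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False  # a wrong guess can fail outright

text = raw.decode("latin-1")  # the explicitly provided encoding
print(utf8_ok, text.strip())
```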

Innerpath

By default, Frictionless uses the first file found in a zip archive. It's possible to adjust this behaviour:

from frictionless import Resource

with Resource('table-multiple-files.zip', innerpath='table-reverse.csv') as resource:
  print(resource.compression)
  print(resource.innerpath)
  print(resource.to_view())
zip
table-reverse.csv
+----+-----------+
| id | name      |
+====+===========+
|  1 | '中国人'     |
+----+-----------+
|  2 | 'english' |
+----+-----------+
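
The "first file by default, named file on request" behaviour can be sketched with the standard `zipfile` module. The archive below is built in memory with assumed contents (the real `table-multiple-files.zip` is not shown in this guide), and `read_member` is a hypothetical helper.

```python
import io
import zipfile

# Build a small in-memory archive with two members (assumed contents).
archive_bytes = io.BytesIO()
with zipfile.ZipFile(archive_bytes, "w") as archive:
    archive.writestr("table.csv", "id,name\n1,english\n2,中国人\n")
    archive.writestr("table-reverse.csv", "id,name\n1,中国人\n2,english\n")


def read_member(buffer, innerpath=None):
    """Read one member; default to the first file listed in the archive."""
    with zipfile.ZipFile(buffer) as archive:
        name = innerpath or archive.namelist()[0]
        return name, archive.read(name).decode("utf-8")


default_name, _ = read_member(archive_bytes)
chosen_name, chosen_text = read_member(archive_bytes, innerpath="table-reverse.csv")
print(default_name, chosen_name)
```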

Compression

It's possible to adjust compression detection by providing the algorithm explicitly. For the example below it's not required as it would be detected anyway:

from frictionless import Resource

with Resource('table.csv.zip', compression='zip') as resource:
  print(resource.compression)
  print(resource.to_view())
zip
+----+-----------+
| id | name      |
+====+===========+
|  1 | 'english' |
+----+-----------+
|  2 | '中国人'     |
+----+-----------+

Dialect

Please read Table Dialect Guide for more information.

Schema

Please read Table Schema Guide for more information.

Checklist

Please read Checklist Guide for more information.

Pipeline

Please read Pipeline Guide for more information.

Stats

Resource's stats can be accessed with resource.stats:

from frictionless import Resource

resource = Resource('table.csv')
resource.infer(stats=True)
print(resource.stats)
<frictionless.resource.stats.ResourceStats object at 0x7fe60a078a90>
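
The reference section lists `hash`, `bytes`, `fields`, and `rows` among the resource properties. As a rough illustration of what such numbers represent, the sketch below computes them by hand with the standard library. The file content is an assumption reconstructed from the outputs in this guide, the choice of SHA-256 is also an assumption, and none of this is the Frictionless implementation.

```python
import csv
import hashlib
import io

# Assumed content of table.csv, reconstructed from the outputs in this guide.
raw = "id,name\n1,english\n2,中国人\n".encode("utf-8")

rows = list(csv.reader(io.StringIO(raw.decode("utf-8"))))
header, body = rows[0], rows[1:]

stats = {
    "hash": hashlib.sha256(raw).hexdigest(),  # algorithm choice is an assumption
    "bytes": len(raw),
    "fields": len(header),
    "rows": len(body),
}
print(stats["bytes"], stats["fields"], stats["rows"])
```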

Reference

Resource (class)

Resource representation. This class is one of the cornerstones of the Frictionless Framework. It loads a data source and allows you to stream its parsed contents. At the same time, it's a metadata class for data description.

```python
with Resource("data/table.csv") as resource:
    resource.header == ["id", "name"]
    resource.read_rows() == [
        {'id': 1, 'name': 'english'},
        {'id': 2, 'name': '中国人'},
    ]
```

Signature

(*, source: Optional[Any] = None, control: Optional[Control] = None, packagify: bool = False, name: Optional[str] = , title: Optional[str] = None, description: Optional[str] = None, homepage: Optional[str] = None, profile: Optional[str] = None, licenses: List[Dict[str, Any]] = NOTHING, sources: List[Dict[str, Any]] = NOTHING, path: Optional[str] = None, data: Optional[Any] = None, scheme: Optional[str] = None, format: Optional[str] = None, datatype: Optional[str] = , mediatype: Optional[str] = None, compression: Optional[str] = None, extrapaths: List[str] = NOTHING, innerpath: Optional[str] = None, encoding: Optional[str] = None, hash: Optional[str] = None, bytes: Optional[int] = None, fields: Optional[int] = None, rows: Optional[int] = None, dialect: Union[Dialect, str] = NOTHING, schema: Union[Schema, str] = NOTHING, basepath: Optional[str] = None, detector: Detector = NOTHING, package: Optional[Package] = None) -> None

Parameters

  • source (Optional[Any])
  • control (Optional[Control])
  • packagify (bool)
  • name (Optional[str])
  • title (Optional[str])
  • description (Optional[str])
  • homepage (Optional[str])
  • profile (Optional[str])
  • licenses (List[Dict[str, Any]])
  • sources (List[Dict[str, Any]])
  • path (Optional[str])
  • data (Optional[Any])
  • scheme (Optional[str])
  • format (Optional[str])
  • datatype (Optional[str])
  • mediatype (Optional[str])
  • compression (Optional[str])
  • extrapaths (List[str])
  • innerpath (Optional[str])
  • encoding (Optional[str])
  • hash (Optional[str])
  • bytes (Optional[int])
  • fields (Optional[int])
  • rows (Optional[int])
  • dialect (Union[Dialect, str])
  • schema (Union[Schema, str])
  • basepath (Optional[str])
  • detector (Detector)
  • package (Optional[Package])

resource.source (property)

# TODO: add docs

Signature

Optional[Any]

resource.control (property)

# TODO: add docs

Signature

Optional[Control]

resource.packagify (property)

# TODO: add docs

Signature

bool

resource._name (property)

Resource name according to the specs. It should be a slugified name of the resource.

Signature

Optional[str]

resource.type (property)

Type of the resource

Signature

ClassVar[str]

resource.title (property)

Resource title according to the specs. It should be a human-oriented title of the resource.

Signature

Optional[str]

resource.description (property)

Resource description according to the specs. It should be a human-oriented description of the resource.

Signature

Optional[str]

resource.homepage (property)

A URL for the home on the web that is related to this package. For example, a GitHub repository or a CKAN dataset address.

Signature

Optional[str]

resource.profile (property)

A fully-qualified URL that points directly to a JSON Schema that can be used to validate the descriptor

Signature

Optional[str]

resource.licenses (property)

The license(s) under which the resource is provided. If omitted it's considered the same as the package's licenses.

Signature

List[Dict[str, Any]]

resource.sources (property)

The raw sources for this data resource. It MUST be an array of Source objects. Each Source object MUST have a title and MAY have path and/or email properties.

Signature

List[Dict[str, Any]]

resource.path (property)

Path to data source

Signature

Optional[str]

resource.data (property)

Inline data source

Signature

Optional[Any]

resource.scheme (property)

Scheme for loading the file (file, http, ...). If not set, it'll be inferred from `source`.

Signature

Optional[str]

resource.format (property)

File source's format (csv, xls, ...). If not set, it'll be inferred from `source`.

Signature

Optional[str]

resource._datatype (property)

Frictionless Framework specific data type, such as "table" or "schema"

Signature

Optional[str]

resource.mediatype (property)

Mediatype/mimetype of the resource e.g. “text/csv”, or “application/vnd.ms-excel”. Mediatypes are maintained by the Internet Assigned Numbers Authority (IANA) in a media type registry.

Signature

Optional[str]

resource.compression (property)

Source file compression (zip, ...). If not set, it'll be inferred from `source`.

Signature

Optional[str]

resource.extrapaths (property)

List of paths to concatenate to the main path. It's used for multipart resources.

Signature

List[str]

resource.innerpath (property)

Path within the compressed file. It defaults to the first file in the archive (if the source is an archive).

Signature

Optional[str]

resource.encoding (property)

Source encoding. If not set, it'll be inferred from `source`.

Signature

Optional[str]

resource.hash (property)

# TODO: add docs

Signature

Optional[str]

resource.bytes (property)

# TODO: add docs

Signature

Optional[int]

resource.fields (property)

# TODO: add docs

Signature

Optional[int]

resource.rows (property)

# TODO: add docs

Signature

Optional[int]

resource._dialect (property)

# TODO: add docs

Signature

Union[Dialect, str]

resource._schema (property)

# TODO: add docs

Signature

Union[Schema, str]

resource._basepath (property)

# TODO: add docs

Signature

Optional[str]

resource.detector (property)

File/table detector. For more information, please check the Detector documentation.

Signature

Detector

resource.package (property)

The parent package of this resource. For more information, please check the Package documentation.

Signature

Optional[Package]

resource.stats (property)

# TODO: add docs

Signature

ResourceStats

resource.tabular (property)

Whether the resource is tabular

Signature

ClassVar[bool]

resource.basepath (property)

A basepath of the resource. The normpath of the resource is the joined `basepath` and `/path`.

Signature

Optional[str]

resource.buffer (property)

File's bytes used as a sample. These buffer bytes are used to infer characteristics of the source file (e.g. encoding, ...).

Signature

types.IBuffer

resource.byte_stream (property)

Byte stream in form of a generator

Signature

types.IByteStream

resource.closed (property)

Whether the table is closed

Signature

bool

resource.memory (property)

Whether resource is not path based

Signature

bool

resource.multipart (property)

Whether resource is multipart

Signature

bool

resource.normpath (property)

Normalized path of the resource or raise if not set

Signature

Optional[str]

resource.normpaths (property)

Normalized paths of the resource

Signature

List[str]

resource.paths (property)

All paths of the resource

Signature

List[str]

resource.place (property)

Stringified resource location

Signature

str

resource.remote (property)

Whether resource is remote

Signature

bool

resource.text_stream (property)

Text stream in form of a generator

Signature

types.ITextStream

resource.close (method)

Close the resource as "filelike.close" does

Signature

() -> None

resource.dereference (method)

Dereference underlying metadata. If some of the underlying metadata is provided as a string, it will be replaced by the metadata object.

Resource.describe (method) (static)

Describe the given source as a resource

Signature

(source: Optional[Any] = None, *, name: Optional[str] = None, type: Optional[str] = None, stats: bool = False, **options: Any) -> Metadata

Parameters

  • source (Optional[Any]): data source
  • name (Optional[str]): resource name
  • type (Optional[str]): data type: "package", "resource", "dialect", or "schema"
  • stats (bool): if `True` infer resource's stats
  • options (Any)

resource.infer (method)

Infer metadata

Signature

(*, stats: bool = False) -> None

Parameters

  • stats (bool): stream file completely and infer stats

resource.list (method)

List dataset resources

Signature

(*, name: Optional[str] = None) -> List[Resource]

Parameters

  • name (Optional[str]): limit to one resource (if applicable)

resource.open (method)

Open the resource as "io.open" does

resource.read_bytes (method)

Read bytes into memory

Signature

(*, size: Optional[int] = None) -> bytes

Parameters

  • size (Optional[int])

resource.read_data (method)

Read data into memory

Signature

(*, size: Optional[int] = None) -> Any

Parameters

  • size (Optional[int])

resource.read_text (method)

Read text into memory

Signature

(*, size: Optional[int] = None) -> str

Parameters

  • size (Optional[int])

resource.to_copy (method)

Create a copy from the resource

Signature

(**options: Any) -> Self

Parameters

  • options (Any)

resource.validate (method)

Validate resource

Signature

(checklist: Optional[Checklist] = None, *, name: Optional[str] = None, on_row: Optional[types.ICallbackFunction] = None, parallel: bool = False, limit_rows: Optional[int] = None, limit_errors: int = 1000) -> Report

Parameters

  • checklist (Optional[Checklist]): a Checklist object
  • name (Optional[str]): limit validation to one resource (if applicable)
  • on_row (Optional[types.ICallbackFunction]): callback for every row
  • parallel (bool)
  • limit_rows (Optional[int]): limit amount of rows to this number
  • limit_errors (int): limit amount of errors to this number