The Resource class is arguably the most important class of the whole Frictionless Framework. It's based on the Data Resource Standard and the Tabular Data Resource Standard.
Let's create a data resource:
from frictionless import Resource
resource = Resource('table.csv') # from a resource path
resource = Resource('resource.json') # from a descriptor path
resource = Resource({'path': 'table.csv'}) # from a descriptor
resource = Resource(path='table.csv') # from arguments
As you can see, it's possible to create a resource from different kinds of sources; the source type will be detected automatically (e.g. whether it's a descriptor or a path). It's possible to make this step more explicit:
from frictionless import Resource
resource = Resource(path='data/table.csv') # from a path
resource = Resource('data/resource.json') # from a descriptor
The standards support a great deal of resource metadata, all of which can be used with Frictionless Framework too:
from frictionless import Resource
resource = Resource(
    name='resource',
    title='My Resource',
    description='My Resource for the Guide',
    path='table.csv',
    # it's possible to provide all the official properties like mediatype, etc
)
print(resource)
{'name': 'resource',
'title': 'My Resource',
'description': 'My Resource for the Guide',
'path': 'table.csv'}
If you have created a resource, for example, from a descriptor, you can access these properties:
from frictionless import Resource
resource = Resource('resource.json')
print(resource.name)
# and others
name
And edit them:
from frictionless import Resource
resource = Resource('resource.json')
resource.name = 'new-name'
resource.title = 'New Title'
resource.description = 'New Description'
# and others
print(resource)
{'name': 'new-name',
'title': 'New Title',
'description': 'New Description',
'path': 'table.csv'}
Like any of the Metadata classes, the Resource class can be saved as JSON or YAML:
from frictionless import Resource
resource = Resource('table.csv')
resource.to_json('resource.json') # Save as JSON
resource.to_yaml('resource.yaml') # Save as YAML
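The saved descriptor can be loaded back like any other descriptor path. A minimal sketch, reusing the resource.yaml written above:
from frictionless import Resource
resource = Resource('resource.yaml')  # load the YAML descriptor saved above
print(resource.path)  # table.csv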
You might have noticed that we had to duplicate the `with Resource(...)` statement in some examples. The reason is that Resource is a streaming interface. Once it's read, you need to open it again. Let's show it in an example:
from pprint import pprint
from frictionless import Resource
resource = Resource('capital-3.csv')
resource.open()
pprint(resource.read_rows())
pprint(resource.read_rows())
# We need to re-open: there is no data left
resource.open()
pprint(resource.read_rows())
# We need to close manually: no context manager is used
resource.close()
[{'id': 1, 'name': 'London'},
{'id': 2, 'name': 'Berlin'},
{'id': 3, 'name': 'Paris'},
{'id': 4, 'name': 'Madrid'},
{'id': 5, 'name': 'Rome'}]
[]
[{'id': 1, 'name': 'London'},
{'id': 2, 'name': 'Berlin'},
{'id': 3, 'name': 'Paris'},
{'id': 4, 'name': 'Madrid'},
{'id': 5, 'name': 'Rome'}]
At the same time, you can read data for a resource without opening and closing it explicitly. In this case, Frictionless Framework will open and close the resource for you, so it will basically be a one-time operation:
from pprint import pprint
from frictionless import Resource
resource = Resource('capital-3.csv')
pprint(resource.read_rows())
[{'id': 1, 'name': 'London'},
{'id': 2, 'name': 'Berlin'},
{'id': 3, 'name': 'Paris'},
{'id': 4, 'name': 'Madrid'},
{'id': 5, 'name': 'Rome'}]
The Resource class is also a metadata class which provides various read and stream functions. The extract functions always read rows into memory; Resource can do the same, but it also gives a choice regarding output data. It can be rows, data, text, or bytes. Let's try reading all of them:
from pprint import pprint
from frictionless import Resource
resource = Resource('country-3.csv')
pprint(resource.read_bytes())
pprint(resource.read_text())
pprint(resource.read_cells())
pprint(resource.read_rows())
(b'id,capital_id,name,population\n1,1,Britain,67\n2,3,France,67\n3,2,Germany,8'
b'3\n4,5,Italy,60\n5,4,Spain,47\n')
('id,capital_id,name,population\n'
'1,1,Britain,67\n'
'2,3,France,67\n'
'3,2,Germany,83\n'
'4,5,Italy,60\n'
'5,4,Spain,47\n')
[['id', 'capital_id', 'name', 'population'],
['1', '1', 'Britain', '67'],
['2', '3', 'France', '67'],
['3', '2', 'Germany', '83'],
['4', '5', 'Italy', '60'],
['5', '4', 'Spain', '47']]
[{'id': 1, 'capital_id': 1, 'name': 'Britain', 'population': 67},
{'id': 2, 'capital_id': 3, 'name': 'France', 'population': 67},
{'id': 3, 'capital_id': 2, 'name': 'Germany', 'population': 83},
{'id': 4, 'capital_id': 5, 'name': 'Italy', 'population': 60},
{'id': 5, 'capital_id': 4, 'name': 'Spain', 'population': 47}]
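Each of these read functions also accepts an optional size argument (see the reference at the end of this guide) to limit how much is read. A minimal sketch:
from frictionless import Resource
resource = Resource('country-3.csv')
print(resource.read_rows(size=2))  # only the first two rows are read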
It's really handy to read all your data into memory but it's not always possible if a file is really big. For such cases, Frictionless provides streaming functions:
from pprint import pprint
from frictionless import Resource
with Resource('country-3.csv') as resource:
    pprint(resource.byte_stream)
    pprint(resource.text_stream)
    pprint(resource.cell_stream)
    pprint(resource.row_stream)
    for row in resource.row_stream:
        print(row)
<frictionless.system.loader.ByteStreamWithStatsHandling object at 0x7f0f9465b9a0>
<_io.TextIOWrapper name='country-3.csv' encoding='utf-8'>
<itertools.chain object at 0x7f0f946df130>
<generator object Resource.__prepare_row_stream.<locals>.row_stream at 0x7f0f94497e40>
{'id': 1, 'capital_id': 1, 'name': 'Britain', 'population': 67}
{'id': 2, 'capital_id': 3, 'name': 'France', 'population': 67}
{'id': 3, 'capital_id': 2, 'name': 'Germany', 'population': 83}
{'id': 4, 'capital_id': 5, 'name': 'Italy', 'population': 60}
{'id': 5, 'capital_id': 4, 'name': 'Spain', 'population': 47}
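Because row_stream is a generator, it also suits memory-friendly processing where you stop early. A minimal sketch using the same file:
from frictionless import Resource
with Resource('country-3.csv') as resource:
    for row in resource.row_stream:
        if row['name'] == 'Germany':  # stop as soon as the row of interest is found
            print(row)
            break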
This functionality is released as a feature preview and a request for comments. The implementation is raw and doesn't cover many edge cases.
Indexing a resource, in Frictionless terms, means loading a data table into a database, with or without metadata. Let's explore how this feature works in different modes.
All the examples are written for SQLite for simplicity.
This mode is supported for any database that is supported by SQLAlchemy. Under the hood, Frictionless will infer the Table Schema and populate the data table as it normally reads data. This means that type errors will be replaced by null values, and in general the process is guaranteed to finish successfully for any data, even highly invalid data.
frictionless index table.csv --database sqlite:///index/project.db --table table
frictionless extract sqlite:///index/project.db --table table --json
Indexed 2 rows in 0.319 seconds
[
{
"id": 1,
"name": "english"
},
{
"id": 2,
"name": "中国人"
}
]
import sqlite3
from frictionless import Resource, formats
resource = Resource('table.csv')
resource.index('sqlite:///index/project.db', table_name='table')
print(Resource('sqlite:///index/project.db', control=formats.sql.SqlControl(table='table')).extract())
[{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]
In metadata mode, the indexing process is the same, but the metadata is also stored in the database. This mode is highly experimental and currently not intended for use outside of Frictionless software. Let's explore an example:
frictionless index table.csv --database sqlite:///index/project.db --metadata
frictionless extract sqlite:///index/project.db --table table --json
frictionless extract sqlite:///index/project.db --table _resources --json
Indexed 2 rows in 0.342 seconds
[
{
"_row_number": 2,
"_row_valid": true,
"id": 1,
"name": "english"
},
{
"_row_number": 3,
"_row_valid": true,
"id": 2,
"name": "中国人"
}
]
[
{
"path": "table.csv",
"table_name": "table",
"updated": "2023-01-25T11:57:10",
"resource": "{\n \"name\": \"table\",\n \"type\": \"table\",\n \"path\": \"table.csv\",\n \"scheme\": \"file\",\n \"format\": \"csv\",\n \"encoding\": \"utf-8\",\n \"mediatype\": \"text/csv\",\n \"schema\": {\n \"fields\": [\n {\n \"name\": \"id\",\n \"type\": \"integer\"\n },\n {\n \"name\": \"name\",\n \"type\": \"string\"\n }\n ]\n },\n \"stats\": {\n \"md5\": \"6c2c61dd9b0e9c6876139a449ed87933\",\n \"sha256\": \"a1fd6c5ff3494f697874deeb07f69f8667e903dd94a7bc062dd57550cea26da8\",\n \"bytes\": 30,\n \"fields\": 2,\n \"rows\": 2\n }\n}",
"report": "{\n \"valid\": true,\n \"stats\": {\n \"tasks\": 1,\n \"warnings\": 0,\n \"errors\": 0,\n \"seconds\": 0.004\n },\n \"warnings\": [],\n \"errors\": [],\n \"tasks\": [\n {\n \"valid\": true,\n \"name\": \"table\",\n \"type\": \"table\",\n \"place\": \"table.csv\",\n \"labels\": [\n \"id\",\n \"name\"\n ],\n \"stats\": {\n \"md5\": \"6c2c61dd9b0e9c6876139a449ed87933\",\n \"sha256\": \"a1fd6c5ff3494f697874deeb07f69f8667e903dd94a7bc062dd57550cea26da8\",\n \"bytes\": 30,\n \"fields\": 2,\n \"rows\": 2,\n \"warnings\": 0,\n \"errors\": 0,\n \"seconds\": 0.004\n },\n \"warnings\": [],\n \"errors\": []\n }\n ]\n}"
}
]
import sqlite3
from frictionless import Resource, formats
resource = Resource('table.csv')
resource.index('sqlite:///index/project.db', with_metadata=True)
print(Resource('sqlite:///index/project.db', control=formats.sql.SqlControl(table='table')).extract())
print(Resource('sqlite:///index/project.db', control=formats.sql.SqlControl(table='_resources')).extract())
[{'_row_number': 2, '_row_valid': True, 'id': 1, 'name': 'english'}, {'_row_number': 3, '_row_valid': True, 'id': 2, 'name': '中国人'}]
[{'path': 'table.csv', 'table_name': 'table', 'updated': datetime.datetime(2023, 1, 25, 11, 57, 12, 671396), 'resource': '{\n "name": "table",\n "type": "table",\n "path": "table.csv",\n "scheme": "file",\n "format": "csv",\n "encoding": "utf-8",\n "mediatype": "text/csv",\n "schema": {\n "fields": [\n {\n "name": "id",\n "type": "integer"\n },\n {\n "name": "name",\n "type": "string"\n }\n ]\n },\n "stats": {\n "md5": "6c2c61dd9b0e9c6876139a449ed87933",\n "sha256": "a1fd6c5ff3494f697874deeb07f69f8667e903dd94a7bc062dd57550cea26da8",\n "bytes": 30,\n "fields": 2,\n "rows": 2\n }\n}', 'report': '{\n "valid": true,\n "stats": {\n "tasks": 1,\n "warnings": 0,\n "errors": 0,\n "seconds": 0.004\n },\n "warnings": [],\n "errors": [],\n "tasks": [\n {\n "valid": true,\n "name": "table",\n "type": "table",\n "place": "table.csv",\n "labels": [\n "id",\n "name"\n ],\n "stats": {\n "md5": "6c2c61dd9b0e9c6876139a449ed87933",\n "sha256": "a1fd6c5ff3494f697874deeb07f69f8667e903dd94a7bc062dd57550cea26da8",\n "bytes": 30,\n "fields": 2,\n "rows": 2,\n "warnings": 0,\n "errors": 0,\n "seconds": 0.004\n },\n "warnings": [],\n "errors": []\n }\n ]\n}'}]
Fast mode requires the database's command-line client to be available.
Fast mode is supported for SQLite and PostgreSQL databases. It will infer the Table Schema using a data sample and index the data using COPY in PostgreSQL and .import in SQLite. For big data files this mode will be 10-30x faster than normal indexing, but the speed comes at a price: if there is invalid data, the indexing will fail.
frictionless index table.csv --database sqlite:///index/project.db --table table --fast
frictionless extract sqlite:///index/project.db --table table --json
Indexed 30 bytes in 0.607 seconds
[
{
"id": 1,
"name": "english"
},
{
"id": 2,
"name": "中国人"
}
]
import sqlite3
from frictionless import Resource, formats
resource = Resource('table.csv')
resource.index('sqlite:///index/project.db', table_name='table', fast=True)
print(Resource('sqlite:///index/project.db', control=formats.sql.SqlControl(table='table')).extract())
[{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]
To ensure that the data will be successfully indexed, it's possible to use the fallback option. If the fast indexing fails, Frictionless will start over in normal mode and finish the process successfully.
frictionless index table.csv --database sqlite:///index/project.db --table table --fast --fallback
import sqlite3
from frictionless import Resource, formats
resource = Resource('table.csv')
resource.index('sqlite:///index/project.db', table_name='table', fast=True, use_fallback=True)
Another option is to provide a path to the QSV binary. In this case, the initial schema inference will be done based on the whole data file and will guarantee that the table is valid type-wise:
frictionless index table.csv --database sqlite:///index/project.db --table table --fast --qsv qsv_path
import sqlite3
from frictionless import Resource, formats
resource = Resource('table.csv')
resource.index('sqlite:///index/project.db', table_name='table', fast=True, qsv_path='qsv_path')
The scheme, also known as protocol, indicates which loader Frictionless should use to read or write data. It can be file (default), text, http, https, s3, and others:
from frictionless import Resource
with Resource(b'header1,header2\nvalue1,value2', format='csv') as resource:
    print(resource.scheme)
    print(resource.to_view())
buffer
+----------+----------+
| header1 | header2 |
+==========+==========+
| 'value1' | 'value2' |
+----------+----------+
The format, also called extension, helps Frictionless to choose a proper parser to handle the file. Popular formats are csv, xlsx, json, and others:
from frictionless import Resource
with Resource(b'header1,header2\nvalue1,value2.csv', format='csv') as resource:
    print(resource.format)
    print(resource.to_view())
csv
+----------+--------------+
| header1 | header2 |
+==========+==============+
| 'value1' | 'value2.csv' |
+----------+--------------+
Frictionless automatically detects encoding of files but sometimes it can be inaccurate. It's possible to provide an encoding manually:
from frictionless import Resource
with Resource('country-3.csv', encoding='utf-8') as resource:
    print(resource.encoding)
    print(resource.path)
utf-8
country-3.csv
By default, Frictionless uses the first file found in a zip archive. It's possible to adjust this behaviour:
from frictionless import Resource
with Resource('table-multiple-files.zip', innerpath='table-reverse.csv') as resource:
    print(resource.compression)
    print(resource.innerpath)
    print(resource.to_view())
zip
table-reverse.csv
+----+-----------+
| id | name |
+====+===========+
| 1 | '中国人' |
+----+-----------+
| 2 | 'english' |
+----+-----------+
It's possible to adjust compression detection by providing the algorithm explicitly. For the example below it's not required as it would be detected anyway:
from frictionless import Resource
with Resource('table.csv.zip', compression='zip') as resource:
    print(resource.compression)
    print(resource.to_view())
zip
+----+-----------+
| id | name |
+====+===========+
| 1 | 'english' |
+----+-----------+
| 2 | '中国人' |
+----+-----------+
Please read Table Dialect Guide for more information.
Please read Table Schema Guide for more information.
Please read Checklist Guide for more information.
Please read Pipeline Guide for more information.
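Dialect, schema, checklist, and pipeline objects (or their descriptor paths) can also be passed straight to the Resource constructor, as the signature in the reference below shows. A minimal sketch, where schema.json is a hypothetical Table Schema descriptor:
from frictionless import Resource
# 'schema.json' is a hypothetical Table Schema descriptor file
resource = Resource('table.csv', schema='schema.json')
print(resource.schema)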
Resource's stats can be accessed with resource.stats:
from frictionless import Resource
resource = Resource('table.csv')
resource.infer(stats=True)
print(resource.stats)
{'md5': '6c2c61dd9b0e9c6876139a449ed87933',
'sha256': 'a1fd6c5ff3494f697874deeb07f69f8667e903dd94a7bc062dd57550cea26da8',
'bytes': 30,
'fields': 2,
'rows': 2}
Resource representation. This class is one of the cornerstones of the Frictionless framework. It loads a data source, and allows you to stream its parsed contents. At the same time, it's a metadata class for data description.
```python
with Resource("data/table.csv") as resource:
    resource.header == ["id", "name"]
    resource.read_rows() == [
        {'id': 1, 'name': 'english'},
        {'id': 2, 'name': '中国人'},
    ]
```
(source: Optional[Any] = None, control: Optional[Control] = None, *, name: Optional[str] = None, type: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, homepage: Optional[str] = None, profiles: List[Union[IProfile, str]] = [], licenses: List[dict] = [], sources: List[dict] = [], path: Optional[str] = None, data: Optional[Any] = None, scheme: Optional[str] = None, format: Optional[str] = None, encoding: Optional[str] = None, mediatype: Optional[str] = None, compression: Optional[str] = None, extrapaths: List[str] = [], innerpath: Optional[str] = None, dialect: Optional[Union[Dialect, str]] = None, schema: Optional[Union[Schema, str]] = None, checklist: Optional[Union[Checklist, str]] = None, pipeline: Optional[Union[Pipeline, str]] = None, stats: Optional[Stats] = None, basepath: Optional[str] = None, detector: Optional[Detector] = None, package: Optional[Package] = None)
Resource name according to the specs. It should be a slugified name of the resource.
Optional[str]
Type of the data e.g. "table"
Optional[str]
Resource title according to the specs. It should be a human-oriented title of the resource.
Optional[str]
Resource description according to the specs. It should be a human-oriented description of the resource.
Optional[str]
A URL for the home on the web that is related to this package. For example, a GitHub repository or a CKAN dataset address.
Optional[str]
Strings identifying the profile of this descriptor. For example, `tabular-data-resource`.
List[Union[IProfile, str]]
The license(s) under which the resource is provided. If omitted it's considered the same as the package's licenses.
List[dict]
The raw sources for this data resource. It MUST be an array of Source objects. Each Source object MUST have a title and MAY have path and/or email properties.
List[dict]
Path to data source
Optional[str]
Inline data source
Optional[Any]
Scheme for loading the file (file, http, ...). If not set, it'll be inferred from `source`.
Optional[str]
File source's format (csv, xls, ...). If not set, it'll be inferred from `source`.
Optional[str]
Source encoding. If not set, it'll be inferred from `source`.
Optional[str]
Mediatype/mimetype of the resource e.g. “text/csv”, or “application/vnd.ms-excel”. Mediatypes are maintained by the Internet Assigned Numbers Authority (IANA) in a media type registry.
Optional[str]
Source file compression (zip, ...). If not set, it'll be inferred from `source`.
Optional[str]
List of paths to concatenate to the main path. It's used for multipart resources.
List[str]
Path within the compressed file. It defaults to the first file in the archive (if the source is an archive).
Optional[str]
File/table detector. For more information, please check the Detector documentation.
Detector
Parental to this resource package. For more information, please check the Package documentation.
Optional[Package]
A basepath of the resource. The normpath of the resource is a join of `basepath` and `path`.
Optional[str]
File's bytes used as a sample. These buffer bytes are used to infer characteristics of the source file (e.g. encoding, ...).
IBuffer
Byte stream in form of a generator
IByteStream
Cell stream in form of a generator
ICellStream
Checklist object. For more information, please check the Checklist documentation.
(Optional[Union[Checklist, str]]) -> Optional[Checklist]
Whether the table is closed
bool
File Dialect object. For more information, please check the Dialect documentation.
(Optional[Union[Dialect, str]]) -> Dialect
Table's lists used as fragment. These fragment rows are used internally to infer characteristics of the source file (e.g. schema, ...).
IFragment
Header
ILabels
Lookup
Whether resource is not path based
bool
Whether resource is multipart
bool
Normalized data or raise if not set
Any
Normalized path of the resource or raise if not set
str
Normalized paths of the resource
List[str]
All paths of the resource
List[str]
Pipeline object. For more information, please check the Pipeline documentation.
(Optional[Union[Pipeline, str]]) -> Optional[Pipeline]
Stringified resource location
str
Whether resource is remote
bool
Row stream in form of a generator of Row objects
IRowStream
Table's lists used as sample. These sample rows are used to infer characteristics of the source file (e.g. schema, ...).
ISample
Table Schema object. For more information, please check the Schema documentation.
(Optional[Union[Schema, str]]) -> Schema
Stats object. An object with the following possible properties: md5, sha256, bytes, fields, rows.
(Optional[Union[Stats, str]]) -> Stats
Whether resource is tabular
bool
Text stream in form of a generator
ITextStream
Analyze the resource. This feature is currently experimental, and its API may change without warning.
(: Resource, *, detailed=False) -> dict
Close the resource as "filelike.close" does
() -> None
Describe the given source as a resource
(source: Optional[Any] = None, *, stats: bool = False, **options)
Extract resource rows
(: Resource, *, limit_rows: Optional[int] = None, process: Optional[IProcessFunction] = None, filter: Optional[IFilterFunction] = None, stream: bool = False)
Create a resource from a PETL view
(view, **options)
Index resource into a database
(: Resource, database_url: str, *, table_name: Optional[str] = None, fast: bool = False, qsv_path: Optional[str] = None, on_progress: Optional[Callable[[str], None]] = None, use_fallback: bool = False, with_metadata: bool = False)
Infer metadata
(*, sample: bool = True, stats: bool = False) -> None
Open the resource as "io.open" does
(*, as_file: bool = False)
Read bytes into memory
(*, size: Optional[int] = None) -> bytes
Read lists into memory
(*, size: Optional[int] = None) -> List[List[Any]]
Read data into memory
(*, size: Optional[int] = None) -> Any
Read rows into memory
(*, size=None) -> List[Row]
Read text into memory
(*, size: Optional[int] = None) -> str
Create a copy from the resource
(**options)
Helper to export resource as inline data
(*, dialect=None)
Helper to export resource as a Pandas dataframe
(*, dialect=None)
Export resource as a PETL table
(normalize=False)
Create a snapshot from the resource
(*, json=False)
Create a view from the resource. See PETL's docs for more information: https://platform.petl.readthedocs.io/en/stable/util.html#visualising-tables
(type=look, **options)
Transform resource
(: Resource, pipeline: Optional[Pipeline] = None)
Validate resource
(: Resource, checklist: Optional[Checklist] = None, *, limit_errors: int = 1000, limit_rows: Optional[int] = None, on_row: Optional[ICallbackFunction] = None)
Write this resource to the target resource
(target: Optional[Union[Resource, Any]] = None, *, control: Optional[Control] = None, **options) -> Resource
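To tie the reference together, here is a minimal sketch of the validation flow; the report's valid flag is assumed from the frictionless validation API:
from frictionless import Resource
resource = Resource('table.csv')
report = resource.validate()  # returns a validation report
print(report.valid)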