Let's get started with Frictionless! We will learn how to install and use the framework. The simple example below will showcase the framework's basic functionality.
The framework requires Python3.8+. Versioning follows the SemVer Standard.
pip install frictionless
pip install frictionless[sql] # to install a core plugin (optional)
pip install 'frictionless[sql]' # for zsh shell
The framework supports CSV, Excel, and JSON formats by default. The second command above installs a plugin for SQL support. There are plugins for SQL, Pandas, HTML, and others (all supported plugins are listed in the "File Formats" and schemes in "File Schemes" menu). Usually, you don't need to think about it in advance–frictionless will display a useful error message about a missing plugin with installation instructions.
Did you have an error installing Frictionless? Here are some dependencies and common errors:
pip: command not found
. Please see the pip docs for help installing pip.Still having a problem? Ask us for help on our Discord chat or open an issue. We're happy to help!
The framework can be used:
For instance, both examples below do the same thing:
frictionless extract data/table.csv
from frictionless import extract
rows = extract('data/table.csv')
The interfaces are as much alike as possible regarding naming conventions and the way you interact with them. Usually, it's straightforward to translate, for instance, Python code to a command-line call. Frictionless provides code completion for Python and the command-line, which should help to get useful hints in real time.
Arguments conform to the following naming convention:
missing_values
missingValues
--missing-values
To get the documentation for a command-line interface just use the --help
flag:
frictionless --help
frictionless describe --help
frictionless extract --help
frictionless validate --help
frictionless transform --help
Download
invalid.csv
to reproduce the examples (right-click and "Save link as"). For more examples, please take a look at the Basic Examples article.
We will take a very messy data file:
cat invalid.csv
id,name,,name
1,english
1,english
2,german,1,2,3
with open('invalid.csv') as file:
print(file.read())
id,name,,name
1,english
1,english
2,german,1,2,3
First of all, let's use describe
to infer the metadata directly from the tabular data. We can then edit and save it to provide others with useful information about the data:
The CLI output is in YAML, it is a default Frictionless output format.
frictionless describe invalid.csv
─────────────────────────────────── Dataset ────────────────────────────────────
dataset
┏━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━┓
┃ name ┃ type ┃ path ┃
┡━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━┩
│ invalid │ table │ invalid.csv │
└─────────┴───────┴─────────────┘
──────────────────────────────────── Tables ────────────────────────────────────
invalid
┏━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ id ┃ name ┃ field3 ┃ name2 ┃
┡━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ integer │ string │ integer │ integer │
└─────────┴────────┴─────────┴─────────┘
from pprint import pprint
from frictionless import describe
resource = describe('invalid.csv')
pprint(resource)
{'name': 'invalid',
'type': 'table',
'path': 'invalid.csv',
'scheme': 'file',
'format': 'csv',
'mediatype': 'text/csv',
'encoding': 'utf-8',
'schema': {'fields': [{'name': 'id', 'type': 'integer'},
{'name': 'name', 'type': 'string'},
{'name': 'field3', 'type': 'integer'},
{'name': 'name2', 'type': 'integer'}]}}
Now that we have inferred a table schema from the data file (e.g., expected format of the table, expected type of each value in a column, etc.), we can use extract
to read the normalized tabular data from the source CSV file:
frictionless extract invalid.csv
─────────────────────────────────── Dataset ────────────────────────────────────
dataset
┏━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━┓
┃ name ┃ type ┃ path ┃
┡━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━┩
│ invalid │ table │ invalid.csv │
└─────────┴───────┴─────────────┘
──────────────────────────────────── Tables ────────────────────────────────────
invalid
┏━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ id ┃ name ┃ field3 ┃ name2 ┃
┡━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ 1 │ english │ None │ None │
│ 1 │ english │ None │ None │
│ None │ None │ None │ None │
│ 2 │ german │ 1 │ 2 │
└──────┴─────────┴────────┴───────┘
from pprint import pprint
from frictionless import extract
rows = extract('invalid.csv')
pprint(rows)
{'invalid': [{'field3': None, 'id': 1, 'name': 'english', 'name2': None},
{'field3': None, 'id': 1, 'name': 'english', 'name2': None},
{'field3': None, 'id': None, 'name': None, 'name2': None},
{'field3': 1, 'id': 2, 'name': 'german', 'name2': 2}]}
Last but not least, let's get a validation report. This report will help us to identify and fix all the errors present in the tabular data, as comprehensive information is provided for every problem:
frictionless validate invalid.csv
─────────────────────────────────── Dataset ────────────────────────────────────
dataset
┏━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ name ┃ type ┃ path ┃ status ┃
┡━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━┩
│ invalid │ table │ invalid.csv │ INVALID │
└─────────┴───────┴─────────────┴─────────┘
──────────────────────────────────── Tables ────────────────────────────────────
invalid
┏━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Row ┃ Field ┃ Type ┃ Message ┃
┡━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ None │ 3 │ blank-label │ Label in the header in field at position │
│ │ │ │ "3" is blank │
│ None │ 4 │ duplicate-label │ Label "name" in the header at position "4" │
│ │ │ │ is duplicated to a label: at position "2" │
│ 2 │ 3 │ missing-cell │ Row at position "2" has a missing cell in │
│ │ │ │ field "field3" at position "3" │
│ 2 │ 4 │ missing-cell │ Row at position "2" has a missing cell in │
│ │ │ │ field "name2" at position "4" │
│ 3 │ 3 │ missing-cell │ Row at position "3" has a missing cell in │
│ │ │ │ field "field3" at position "3" │
│ 3 │ 4 │ missing-cell │ Row at position "3" has a missing cell in │
│ │ │ │ field "name2" at position "4" │
│ 4 │ None │ blank-row │ Row at position "4" is completely blank │
│ 5 │ 5 │ extra-cell │ Row at position "5" has an extra value in │
│ │ │ │ field at position "5" │
└──────┴───────┴─────────────────┴─────────────────────────────────────────────┘
from pprint import pprint
from frictionless import validate
report = validate('invalid.csv')
pprint(report.flatten(["rowNumber", "fieldNumber", "type"]))
[[None, 3, 'blank-label'],
[None, 4, 'duplicate-label'],
[2, 3, 'missing-cell'],
[2, 4, 'missing-cell'],
[3, 3, 'missing-cell'],
[3, 4, 'missing-cell'],
[4, None, 'blank-row'],
[5, 5, 'extra-cell']]
Now that we have all this information: