
Detector Class

The Detector object can be used in various places within the Framework. The main purpose of this class is to tweak how different aspects of metadata are detected.

Here is a quick example:

frictionless extract table.csv --field-missing-values 1,2
─────────────────────────────────── Dataset ────────────────────────────────────
           dataset
┏━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┓
┃ name  ┃ type  ┃ path      ┃
┡━━━━━━━╇━━━━━━━╇━━━━━━━━━━━┩
│ table │ table │ table.csv │
└───────┴───────┴───────────┘
──────────────────────────────────── Tables ────────────────────────────────────
      table
┏━━━━━━┳━━━━━━━━━┓
┃ id   ┃ name    ┃
┡━━━━━━╇━━━━━━━━━┩
│ None │ english │
│ None │ 中国人  │
└──────┴─────────┘
from frictionless import Detector, Resource

detector = Detector(field_missing_values=['1', '2'])
resource = Resource('table.csv', detector=detector)
print(resource.read_rows())
[{'id': None, 'name': 'english'}, {'id': None, 'name': '中国人'}]

Many of the options below have CLI equivalents. Please consult the CLI help.

Detector Usage

A Detector instance is accepted by many classes and functions, for example Resource, describe, extract, and validate. You just need to create a Detector instance with the desired options and pass it to the class or function you are using.

Buffer Size

By default, Frictionless uses the first 10000 bytes to detect the file encoding. Including more bytes by increasing buffer_size can improve the inference: detection becomes more accurate, but slower.

from frictionless import Detector, describe

detector = Detector(buffer_size=100000)
resource = describe("country-1.csv", detector=detector)
print(resource.encoding)
utf-8

Sample Size

By default, Frictionless uses the first 100 rows to detect field types. Including more rows by increasing sample_size can improve the inference: the result becomes more accurate, but detection is slower.

from frictionless import Detector, describe

detector = Detector(sample_size=1000)
resource = describe("country-1.csv", detector=detector)
print(resource.schema)
{'fields': [{'name': 'id', 'type': 'integer'},
            {'name': 'neighbor_id', 'type': 'integer'},
            {'name': 'name', 'type': 'string'},
            {'name': 'population', 'type': 'integer'}]}

Encoding Function

By default, encoding_function is None and Frictionless relies on its built-in encoding detection. You can implement your own detection logic instead. The following example simply returns the utf-8 encoding, but you can add more complex logic to the function:

from frictionless import Detector, Resource

detector = Detector(encoding_function=lambda sample: "utf-8")
with Resource("table.csv", detector=detector) as resource:
  print(resource.encoding)
utf-8

Field Type

This option allows manually setting all the field types to a given type. It's useful when you need to skip data casting (set the `any` type) or read everything as strings (set the `string` type):

from frictionless import Detector, describe

detector = Detector(field_type='string')
resource = describe("country-1.csv", detector=detector)
print(resource.schema)
{'fields': [{'name': 'id', 'type': 'string'},
            {'name': 'neighbor_id', 'type': 'string'},
            {'name': 'name', 'type': 'string'},
            {'name': 'population', 'type': 'string'}]}

Field Names

Sometimes you don't want to use the existing header row to compose field names. It's possible to provide custom names:

from frictionless import Detector, describe

detector = Detector(field_names=["f1", "f2", "f3", "f4"])
resource = describe("country-1.csv", detector=detector)
print(resource.schema.field_names)
['f1', 'f2', 'f3', 'f4']

Field Confidence

By default, Frictionless uses a 0.9 (90%) confidence level for data type detection. This means that if a field contains nine integers and one string, it will still be inferred as an integer field. If you want a guarantee that the inferred schema conforms to the data, you can set it to 1 (100%):

from frictionless import Detector, describe

detector = Detector(field_confidence=1)
resource = describe("country-1.csv", detector=detector)
print(resource.schema)
{'fields': [{'name': 'id', 'type': 'integer'},
            {'name': 'neighbor_id', 'type': 'integer'},
            {'name': 'name', 'type': 'string'},
            {'name': 'population', 'type': 'integer'}]}

Field Float Numbers

By default, Frictionless treats all non-integer numbers as decimals. It's possible to read them as floats instead, which is a faster data type:

from frictionless import Detector, describe

detector = Detector(field_float_numbers=True)
resource = describe("floats.csv", detector=detector)
print(resource.schema)
print(resource.read_rows())
{'fields': [{'name': 'number', 'type': 'number', 'floatNumber': True}]}
[{'number': 1.1}, {'number': 1.2}, {'number': 1.3}, {'number': 1.4}, {'number': 1.5}]

Field Missing Values

Missing Values is an important concept in data description. It provides information about what cell values should be considered as nulls. We can customize the defaults:

from frictionless import Detector, describe

detector = Detector(field_missing_values=["", "1", "2"])
resource = describe("table.csv", detector=detector)
print(resource.schema.missing_values)
print(resource.read_rows())
['', '1', '2']
[{'id': None, 'name': 'english'}, {'id': None, 'name': '中国人'}]

As we can see, the textual values "1" and "2" are now considered nulls. This is usually handy when your data contains values like '-', 'n/a', and similar.

Schema Sync

There is a way to sync a provided schema based on the header row's field order. It's very useful when you have a schema that describes a subset or a superset of the resource's fields:

from frictionless import Detector, Resource, Schema, fields

# Note the order of the fields
detector = Detector(schema_sync=True)
schema = Schema(fields=[fields.StringField(name='name'), fields.IntegerField(name='id')])
with Resource('table.csv', schema=schema, detector=detector) as resource:
    print(resource.schema)
    print(resource.read_rows())
{'fields': [{'name': 'id', 'type': 'integer'},
            {'name': 'name', 'type': 'string'}]}
[{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]

Schema Patch

Sometimes we just want to update a few fields or schema properties without providing a brand-new schema. For example, the examples above can be simplified as:

from frictionless import Detector, Resource

detector = Detector(schema_patch={'fields': {'id': {'type': 'string'}}})
with Resource('table.csv', detector=detector) as resource:
    print(resource.schema)
    print(resource.read_rows())
{'fields': [{'name': 'id', 'type': 'string'},
            {'name': 'name', 'type': 'string'}]}
[{'id': '1', 'name': 'english'}, {'id': '2', 'name': '中国人'}]

Reference

Detector (class)

Detector representation. The main purpose of this class is to set the parameters that define how different aspects of metadata are detected.

Signature

(*, buffer_size: int = 10000, sample_size: int = 100, encoding_function: Optional[types.IEncodingFunction] = None, encoding_confidence: float = 0.5, field_type: Optional[str] = None, field_names: Optional[List[str]] = None, field_confidence: float = 0.9, field_float_numbers: bool = False, field_missing_values: List[str] = NOTHING, field_true_values: List[str] = NOTHING, field_false_values: List[str] = NOTHING, schema_sync: bool = False, schema_patch: Optional[Dict[str, Any]] = None) -> None

Parameters

  • buffer_size (int)
  • sample_size (int)
  • encoding_function (Optional[types.IEncodingFunction])
  • encoding_confidence (float)
  • field_type (Optional[str])
  • field_names (Optional[List[str]])
  • field_confidence (float)
  • field_float_numbers (bool)
  • field_missing_values (List[str])
  • field_true_values (List[str])
  • field_false_values (List[str])
  • schema_sync (bool)
  • schema_patch (Optional[Dict[str, Any]])

detector.buffer_size (property)

The amount of bytes to be extracted as a buffer. It defaults to 10000. The buffer_size can be increased to improve the inference accuracy to detect file encoding.

Signature

int

detector.sample_size (property)

The amount of rows to be extracted as a sample for dialect/schema inferring. It defaults to 100. The sample_size can be increased to improve the inference accuracy.

Signature

int

detector.encoding_function (property)

A custom encoding function for the file.

Signature

Optional[types.IEncodingFunction]

detector.encoding_confidence (property)

Confidence value for encoding function.

Signature

float

detector.field_type (property)

Enforce all the inferred types to be this type. For more information, please check "Describing Data" guide.

Signature

Optional[str]

detector.field_names (property)

Enforce all the inferred fields to have provided names. For more information, please check "Describing Data" guide.

Signature

Optional[List[str]]

detector.field_confidence (property)

A number from 0 to 1 setting the infer confidence. If 1 the data is guaranteed to be valid against the inferred schema. For more information, please check "Describing Data" guide. It defaults to 0.9

Signature

float

detector.field_float_numbers (property)

Flag to indicate desired number type. By default numbers will be `Decimal`; if `True` - `float`. For more information, please check "Describing Data" guide. It defaults to `False`

Signature

bool

detector.field_missing_values (property)

Strings to be considered as missing values. For more information, please check the "Describing Data" guide. It defaults to `['']`

Signature

List[str]

detector.field_true_values (property)

Strings to be considered as true values. For more information, please check the "Describing Data" guide. It defaults to `["true", "True", "TRUE", "1"]`

Signature

List[str]

detector.field_false_values (property)

Strings to be considered as false values. For more information, please check the "Describing Data" guide. It defaults to `["false", "False", "FALSE", "0"]`

Signature

List[str]

detector.schema_sync (property)

Whether to sync the schema. If set to `True`, the provided schema will be mapped onto the inferred schema. This means, for example, that you can provide a subset of fields to be applied on top of the inferred fields, or that the provided schema can have a different order of fields.

Signature

bool

detector.schema_patch (property)

A dictionary to be used as an inferred schema patch. The form of this dictionary should follow the Schema descriptor form except for the `fields` property which should be a mapping with the key named after a field name and the values being a field patch. For more information, please check "Extracting Data" guide.

Signature

Optional[Dict[str, Any]]

detector.detect_dialect (method)

Detect dialect from sample

Signature

(sample: types.ISample, *, dialect: Optional[Dialect] = None) -> Dialect

Parameters

  • sample (types.ISample): data sample
  • dialect (Optional[Dialect])

detector.detect_encoding (method)

Detect encoding from buffer

Signature

(buffer: types.IBuffer, *, encoding: Optional[str] = None) -> str

Parameters

  • buffer (types.IBuffer): byte buffer
  • encoding (Optional[str])

Detector.detect_metadata_type (method) (static)

Return a descriptor type as 'resource' or 'package'

Signature

(source: Any, *, format: Optional[str] = None) -> Optional[str]

Parameters

  • source (Any)
  • format (Optional[str])

detector.detect_resource (method)

Detects path details

Signature

(resource: Resource) -> None

Parameters

  • resource (Resource)

detector.detect_schema (method)

Detect schema from fragment

Signature

(fragment: types.IFragment, *, labels: Optional[List[str]] = None, schema: Optional[Schema] = None, field_candidates: List[Dict[str, Any]] = [{type: yearmonth}, {type: geopoint}, {type: duration}, {type: geojson}, {type: object}, {type: array}, {type: datetime}, {type: time}, {type: date}, {type: integer}, {type: number}, {type: boolean}, {type: year}, {type: string}]) -> Schema

Parameters

  • fragment (types.IFragment): data fragment
  • labels (Optional[List[str]])
  • schema (Optional[Schema])
  • field_candidates (List[Dict[str, Any]])