datasetinsights.io

datasetinsights.io.bbox

class datasetinsights.io.bbox.BBox2D(label, x, y, w, h, score=1.0)

Bases: object

Canonical Representation of a 2D bounding box.

label

string representation of the label.

Type

str

x

x pixel coordinate of the upper left corner.

Type

float

y

y pixel coordinate of the upper left corner.

Type

float

w

width (number of pixels) of the bounding box.

Type

float

h

height (number of pixels) of the bounding box.

Type

float

score

Detection confidence score. Defaults to score=1.0 if this is a ground truth bounding box.

Type

float

Examples

Here is an example of how to use this class.

>>> gt_bbox = BBox2D(label='car', x=2, y=6, w=2, h=4)
>>> gt_bbox
"label='car'|score=1.0|x=2.0|y=6.0|w=2.0|h=4.0"
>>> pred_bbox = BBox2D(label='car', x=2, y=5, w=2, h=4, score=0.79)
>>> pred_bbox.area
8
>>> pred_bbox.intersect_with(gt_bbox)
True
>>> pred_bbox.intersection(gt_bbox)
6
>>> pred_bbox.union(gt_bbox)
10
>>> pred_bbox.iou(gt_bbox)
0.6
property area

Calculate area of this bounding box

Returns

width x height of the bounding box

intersect_with(other)

Check whether this box intersects with other bounding box

Parameters

other (BBox2D) – other bounding box object to check intersection

Returns

True if two bounding boxes intersect, False otherwise

intersection(other)

Calculate the intersection area with other bounding box

Parameters

other (BBox2D) – other bounding box object to calculate intersection

Returns

float of the intersection area for two bounding boxes

iou(other)

Calculate intersection over union (IoU) with another bounding box

\[IOU = \frac{intersection}{union}\]
Parameters

other (BBox2D) – other bounding box object to calculate iou

Returns

float of the intersection over union (IoU) of the two bounding boxes

union(other, intersection_area=None)

Calculate union area with other bounding box

Parameters
  • other (BBox2D) – other bounding box object to calculate union

  • intersection_area (float) – pre-calculated area of intersection

Returns

float of the union area for two bounding boxes
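The area/intersection/union/IoU relationships documented above can be sketched in plain Python. This is a standalone illustration of the math on (x, y, w, h) boxes, not the library's actual implementation:

```python
def iou_2d(box_a, box_b):
    """Compute IoU of two (x, y, w, h) boxes with upper-left origins."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Overlap along each axis; zero if the boxes do not intersect.
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0
```

With the doctest's boxes, gt=(2, 6, 2, 4) and pred=(2, 5, 2, 4), this gives intersection 6, union 10, and IoU 0.6, matching the example above.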

class datasetinsights.io.bbox.BBox3D(translation, size, label, sample_token, score=1, rotation: pyquaternion.quaternion.Quaternion = Quaternion(1.0, 0.0, 0.0, 0.0), velocity=(nan, nan, nan))

Bases: object

Class for 3D bounding boxes, which can be either predictions or ground truths. This class is the primary representation of 3D bounding boxes in this repo and is based on the nuScenes-style dataset.

property back_left_bottom_pt

Back-left-bottom point.

Type

float

property back_left_top_pt

Back-left-top point.

Type

float

property back_right_bottom_pt

Back-right-bottom point.

Type

float

property back_right_top_pt

Back-right-top point.

Type

float

property front_left_bottom_pt

Front-left-bottom point.

Type

float

property front_left_top_pt

Front-left-top point.

Type

float

property front_right_bottom_pt

Front-right-bottom point.

Type

float

property front_right_top_pt

Front-right-top point.

Type

float

property p

List of all 8 corners of the box, beginning with the four bottom corners and then the four top corners, both in counterclockwise order (from a bird's-eye view) beginning with the back-left corner.
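Under an axis-aligned reading (ignoring rotation), the eight corners can be derived from the box's center translation and size. This is a hedged sketch: the (x, y, z) axis convention and the exact corner ordering are assumptions for illustration, not taken from the library:

```python
def box_corners(translation, size):
    """Eight corners of an axis-aligned 3D box centered at `translation`
    with dimensions `size` = (width, length, height).
    Axis convention is an assumption for illustration only."""
    cx, cy, cz = translation
    w, l, h = size
    corners = []
    for dz in (-h / 2, h / 2):  # bottom four corners first, then top four
        for dx, dy in ((-w / 2, -l / 2), (w / 2, -l / 2),
                       (w / 2, l / 2), (-w / 2, l / 2)):
            corners.append((cx + dx, cy + dy, cz + dz))
    return corners
```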

datasetinsights.io.bbox.group_bbox2d_per_label(bboxes)

Group 2D bounding boxes with same label.

Parameters

bboxes (list[BBox2D]) – a list of 2D bounding boxes

Returns

a dictionary of 2D bounding box groups. {label1: [bbox1, bbox2, …], label2: [bbox1, …]}

Return type

dict
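The grouping behavior can be sketched with collections.defaultdict; label is assumed to be an attribute of each box object, matching BBox2D above (illustration only, not the library code):

```python
from collections import defaultdict

def group_per_label(bboxes):
    """Group boxes into {label: [boxes, ...]} by their label attribute."""
    groups = defaultdict(list)
    for box in bboxes:
        groups[box.label].append(box)
    return dict(groups)

# Tiny stand-in objects carrying only a label attribute:
boxes = [type("Box", (), {"label": lbl})() for lbl in ("car", "bike", "car")]
groups = group_per_label(boxes)
```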

datasetinsights.io.checkpoint

Save estimator checkpoints

class datasetinsights.io.checkpoint.EstimatorCheckpoint(estimator_name, checkpoint_dir, distributed)

Bases: object

Saves and loads estimator checkpoints.

Assigns an estimator checkpoint writer according to checkpoint_dir, which is responsible for saving estimators. The writer can be a GCS or local writer. Assigns a loader, which is responsible for loading an estimator from a given path. The loader can be a local, GCS, or HTTP loader.

Parameters
  • estimator_name (str) – name of the estimator

  • checkpoint_dir (str) – Directory where checkpoints are stored

  • distributed (bool) – boolean to determine distributed training
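The writer/loader dispatch described above might look like the following scheme-based selection. This is a hypothetical sketch; the actual selection logic is not shown in these docs:

```python
def select_loader_kind(path):
    """Pick a checkpoint loader kind by inspecting the path scheme."""
    if path.startswith("gs://"):
        return "gcs"
    if path.startswith(("http://", "https://")):
        return "http"
    return "local"
```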

checkpoint_dir

Directory where checkpoints are stored

Type

str

distributed

boolean to determine distributed training

Type

bool

load(estimator, path)

Loads estimator from given path.

Path can be either a local path or GCS path or HTTP url.

save(estimator, epoch)

Save estimator to the log_dir.

class datasetinsights.io.checkpoint.GCSEstimatorWriter(cloud_path, prefix, *, suffix='estimator')

Bases: object

Writes (saves) estimator checkpoints on GCS.

Parameters
  • cloud_path (str) – GCS cloud path (e.g. gs://bucket/path/to/directory)

  • prefix (str) – filename prefix of the checkpoint files

  • suffix (str) – filename suffix of the checkpoint files

save(estimator, epoch=None)

Save estimator to checkpoint files on GCS.

Returns

Full GCS cloud path to the saved checkpoint file.

class datasetinsights.io.checkpoint.LocalEstimatorWriter(dirname, prefix, *, suffix='estimator', create_dir=True)

Bases: object

Writes (saves) estimator checkpoints locally.

Parameters
  • dirname (str) – Directory where estimator is to be saved.

  • prefix (str) – Filename prefix of the checkpoint files.

  • suffix (str) – Filename suffix of the checkpoint files.

  • create_dir (bool) – Flag for creating new directory. Default: True.

dirname

directory name of where checkpoint files are stored

Type

str

prefix

filename prefix of the checkpoint files

Type

str

suffix

filename suffix of the checkpoint files

Type

str

save(estimator, epoch=None)

Save estimator locally to log_dir.

Returns

Full path to the saved checkpoint file.

datasetinsights.io.checkpoint.load_from_gcs(estimator, full_cloud_path)

Load estimator from checkpoint files on GCS.

datasetinsights.io.checkpoint.load_from_http(estimator, url)

Load estimator from checkpoint files via an HTTP(S) URL.

datasetinsights.io.checkpoint.load_local(estimator, path)

Loads estimator checkpoints from a local path.

datasetinsights.io.download

class datasetinsights.io.download.TimeoutHTTPAdapter(timeout, *args, **kwargs)

Bases: requests.adapters.HTTPAdapter

send(request, **kwargs)

Sends PreparedRequest object. Returns Response object.

Parameters
  • request – The PreparedRequest being sent.

  • stream – (optional) Whether to stream the request content.

  • timeout (float or tuple or urllib3 Timeout object) – (optional) How long to wait for the server to send data before giving up, as a float, or a (connect timeout, read timeout) tuple.

  • verify – (optional) Either a boolean, in which case it controls whether we verify the server's TLS certificate, or a string, in which case it must be a path to a CA bundle to use.

  • cert – (optional) Any user-provided SSL certificate to be trusted.

  • proxies – (optional) The proxies dictionary to apply to the request.

Return type

requests.Response
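The idea behind TimeoutHTTPAdapter is to inject a default timeout into every request unless the caller supplies one explicitly. A duck-typed sketch of that behavior (the real class subclasses requests.adapters.HTTPAdapter and forwards to its send method):

```python
class TimeoutInjector:
    """Illustrative only: mirrors how a default timeout can be applied
    to per-request keyword arguments before sending."""
    def __init__(self, timeout):
        self.timeout = timeout

    def prepare_kwargs(self, **kwargs):
        # Respect an explicit timeout; otherwise fall back to the default.
        if kwargs.get("timeout") is None:
            kwargs["timeout"] = self.timeout
        return kwargs
```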

datasetinsights.io.download.checksum_matches(filepath, expected_checksum, algorithm='CRC32')

Check if the file checksum matches the expected checksum.

Parameters
  • filepath (str) – the downloaded file path

  • expected_checksum (int) – expected checksum of the file

  • algorithm (str) – checksum algorithm. Defaults to CRC32

Returns

True if the file checksum matches.

datasetinsights.io.download.compute_checksum(filepath, algorithm='CRC32')

Compute the checksum of a file.

Parameters
  • filepath (str) – the downloaded file path

  • algorithm (str) – checksum algorithm. Defaults to CRC32

Returns

the checksum value

Return type

int

datasetinsights.io.download.download_file(source_uri: str, dest_path: str, file_name: str = None)

Download a file from a specified source URI.

Parameters
  • source_uri (str) – source url where the file should be downloaded

  • dest_path (str) – destination path of the file

  • file_name (str) – file name of the file to be downloaded

Returns

String of destination path.

datasetinsights.io.download.get_checksum_from_file(filepath)

Returns the checksum from the file at the given filepath.

Parameters

filepath (str) – Path of the checksum file. Path can be HTTP(s) url or local path.

Raises

ValueError – if filepath is neither a local path nor an HTTP(S) URL.

datasetinsights.io.download.validate_checksum(filepath, expected_checksum, algorithm='CRC32')

Validate checksum of the downloaded file.

Parameters
  • filepath (str) – the downloaded file path

  • expected_checksum (int) – expected checksum of the file

  • algorithm (str) – checksum algorithm. Defaults to CRC32

Raises

ChecksumError – if the file checksum does not match the expected checksum.

datasetinsights.io.exceptions

exception datasetinsights.io.exceptions.ChecksumError

Bases: Exception

Raised when the downloaded file checksum is incorrect.

exception datasetinsights.io.exceptions.DownloadError

Bases: Exception

Raised when a file download fails.

datasetinsights.io.gcs

class datasetinsights.io.gcs.GCSClient(**kwargs)

Bases: object

This class is used to download data from a GCS location and perform functions such as downloading the dataset and checksum validation.

GCS_PREFIX = '^gs://'

KEY_SEPARATOR = '/'
download(*, url=None, local_path=None, bucket=None, key=None)

This method is used to download the dataset from GCS.

Parameters
  • url (str) – This is the downloader-uri that indicates where the dataset should be downloaded from.

  • local_path (str) – This is the path to the directory where the download will store the dataset.

  • bucket (str) – gcs bucket name

  • key (str) – object key path

Examples

    >>> url = "gs://bucket/folder or gs://bucket/folder/data.zip"
    >>> local_path = "/tmp/folder"
    >>> bucket ="bucket"
    >>> key ="folder/data.zip" or "folder"
    

upload(*, local_path=None, bucket=None, key=None, url=None, pattern='*')

Upload a file or a list of files from a directory to GCS.

Parameters
  • url (str) – This is the GCS location that indicates where the dataset should be uploaded.

  • local_path (str) – This is the path to the directory or file where the data is stored.

  • bucket (str) – gcs bucket name

  • key (str) – object key path

  • pattern – Unix glob patterns. Use **/* for recursive glob.

Examples

    For file upload:
    >>> url = "gs://bucket/folder/data.zip"
    >>> local_path = "/tmp/folder/data.zip"
    >>> bucket ="bucket"
    >>> key ="folder/data.zip"
    
    For directory upload:
    >>> url = "gs://bucket/folder"
    >>> local_path = "/tmp/folder"
    >>> bucket ="bucket"
    >>> key ="folder"
    >>> key ="**/*"
    

datasetinsights.io.kfp_output

class datasetinsights.io.kfp_output.KubeflowPipelineWriter(tb_log_dir='/home/docs/checkouts/readthedocs.org/user_builds/datasetinsights/checkouts/0.2.2/runs/20201027-200642', filename='mlpipeline-metrics.json', filepath='/')

Bases: object

Serializes the metrics dictionary generated during model training/evaluation to JSON and stores it in a file.

Parameters
  • filename (str) – Name of the file to which the writer will save metrics

  • filepath (str) – Path where the file will be stored

filename

Name of the file to which the writer will save metrics

Type

str

filepath

Path where the file will be stored

Type

str

data_dict

A dictionary to save metrics name and value pairs

Type

dict

data

Dictionary to be JSON serialized

add_metric(name, val)

Adds a metric to the writer's data dictionary.

Note: using the same name key will overwrite the previous value, as the current strategy is to save only the metrics generated in the last epoch.

Parameters
  • name (str) – Name of the metric

  • val (float) – Value of the metric

create_tb_visualization_json()

write_metric()

Saves all the metrics added previously to a file in the format required by Kubeflow.
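The overwrite-on-same-name behavior of add_metric and the JSON serialization in write_metric can be sketched as below. The exact output schema shown (a "metrics" list of name/numberValue entries) is the common Kubeflow Pipelines v1 metrics format and is an assumption here, not taken from the library:

```python
import json

class MetricsWriter:
    """Minimal sketch of a Kubeflow-style metrics writer."""
    def __init__(self):
        self.data_dict = {}

    def add_metric(self, name, val):
        # Same name overwrites: only the last value per metric is kept.
        self.data_dict[name] = val

    def to_json(self):
        metrics = [{"name": k, "numberValue": v}
                   for k, v in self.data_dict.items()]
        return json.dumps({"metrics": metrics})
```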

datasetinsights.io.loader

datasetinsights.io.loader.create_loader(dataset, *, dryrun=False, batch_size=1, num_workers=0, collate_fn=None)

Create data loader from dataset

Note: The data loader here is a PyTorch data loader object which does not assume the tensor type to be a PyTorch tensor. We only require the input dataset to support the __getitem__ and __len__ methods to iterate over items in the dataset.

Since the collate_fn method in torch.utils.data.DataLoader behaves differently when automatic batching is on, we might need to override this method. If the create_loader method becomes too complicated in order to support different estimators, we might expect each estimator to have its own create_loader method.

https://pytorch.org/docs/stable/data.html#working-with-collate-fn

Parameters
  • dataset (Dataset) – dataset object derived from datasetinsights.data.datasets.Dataset class.

  • dryrun (bool) – indicator whether to use a very small subset of the dataset. This subset is useful to make sure we can quickly run estimator without loading the whole dataset. (default: False)

  • batch_size (int) – how many samples per batch to load (default: 1)

  • num_workers (int) – number of parallel workers used for data loader. Set to 0 to run on a single thread (instead of 1 which might introduce overhead). (default: 0)

Returns

torch.utils.data.DataLoader object as data loader
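The only contract create_loader places on a dataset is __getitem__ and __len__; any object satisfying that protocol works. A minimal sketch in plain Python (no torch dependency; simple_batches is a stand-in for the DataLoader's batching, not the real thing):

```python
class ToyDataset:
    """Minimal dataset: only __getitem__ and __len__, per the contract."""
    def __init__(self, items):
        self._items = items

    def __getitem__(self, idx):
        return self._items[idx]

    def __len__(self):
        return len(self._items)

def simple_batches(dataset, batch_size=2):
    """Naive batching loop, standing in for a DataLoader."""
    n = len(dataset)
    for start in range(0, n, batch_size):
        yield [dataset[i] for i in range(start, min(start + batch_size, n))]
```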

datasetinsights.io.transforms

class datasetinsights.io.transforms.Compose(transforms)

Bases: object

class datasetinsights.io.transforms.RandomHorizontalFlip(flip_prob=0.5)

Bases: object

Flip the image horizontally (from left to right).

Parameters

flip_prob – the probability to flip the image
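On a row-major image (a list of rows), a horizontal flip reverses each row. A standalone sketch of the probabilistic-flip idea (the library operates on real image/target pairs, not nested lists):

```python
import random

def random_hflip(image_rows, flip_prob=0.5, rng=random):
    """Flip a row-major image left to right with probability flip_prob."""
    if rng.random() < flip_prob:
        return [row[::-1] for row in image_rows]
    return image_rows
```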

class datasetinsights.io.transforms.Resize(img_size=-1, target_size=-1)

Bases: object

Resize the (image, target) to the given sizes.

Parameters
  • img_size (tuple or int) – Desired output size. If size is a sequence like (h, w), output size will be matched to this. If size is an int, smaller edge of the image will be matched to this number. i.e, if height > width, then image will be rescaled to (size * height / width, size)

  • target_size (tuple or int) – Desired output size. If size is a sequence like (h, w), output size will be matched to this. If size is an int, smaller edge of the image will be matched to this number. i.e, if height > width, then image will be rescaled to (size * height / width, size)
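The smaller-edge rule described for an int size can be written out directly, using the docstring's own formula, (size * height / width, size) when height > width:

```python
def resized_shape(height, width, size):
    """Output (h, w) per the Resize rule above: a tuple passes through;
    an int matches the smaller edge and scales the other proportionally."""
    if isinstance(size, tuple):
        return size
    if height > width:
        return (int(size * height / width), size)
    return (size, int(size * width / height))
```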

class datasetinsights.io.BBox2D(label, x, y, w, h, score=1.0)

Alias of datasetinsights.io.bbox.BBox2D (documented above).

class datasetinsights.io.EstimatorCheckpoint(estimator_name, checkpoint_dir, distributed)

Alias of datasetinsights.io.checkpoint.EstimatorCheckpoint (documented above).
class datasetinsights.io.KubeflowPipelineWriter(tb_log_dir='/home/docs/checkouts/readthedocs.org/user_builds/datasetinsights/checkouts/0.2.2/runs/20201027-200642', filename='mlpipeline-metrics.json', filepath='/')

Alias of datasetinsights.io.kfp_output.KubeflowPipelineWriter (documented above).

datasetinsights.io.create_downloader(source_uri, **kwargs)

This function instantiates the dataset downloader after finding it with the provided source-uri.

Parameters
  • source_uri – URI used to look up the correct dataset downloader

  • **kwargs

Returns: The dataset downloader instance matching the source-uri.
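A scheme-based registry is one plausible shape for the lookup create_downloader performs. Everything below (names, registry contents, the FakeGCSDownloader class) is hypothetical illustration, not the library's implementation:

```python
_DOWNLOADERS = {}

def register(scheme):
    """Class decorator: map a URI scheme to a downloader class."""
    def wrap(cls):
        _DOWNLOADERS[scheme] = cls
        return cls
    return wrap

@register("gs")
class FakeGCSDownloader:
    """Hypothetical downloader used only for this sketch."""
    def __init__(self, source_uri, **kwargs):
        self.source_uri = source_uri

def create_downloader_sketch(source_uri, **kwargs):
    """Look up a downloader class by the source URI's scheme."""
    scheme = source_uri.split("://", 1)[0]
    try:
        return _DOWNLOADERS[scheme](source_uri, **kwargs)
    except KeyError:
        raise ValueError(f"No downloader registered for scheme: {scheme}")
```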