datasetinsights.io.downloader

datasetinsights.io.downloader.base

class datasetinsights.io.downloader.base.DatasetDownloader(**kwargs)

Bases: abc.ABC

This is the base class for all dataset downloaders. DatasetDownloader can be subclassed in the following way:

class NewDatasetDownloader(DatasetDownloader, protocol="protocol://")

Here, "protocol://" should match the URI prefix that the download method accepts for source_uri, for example http:// or gs://.

abstract download(source_uri, output, **kwargs)

This method downloads the dataset stored at source_uri and stores it in the output directory.

Parameters
  • source_uri – URI that points to the dataset that should be downloaded

  • output – path to local folder where the dataset should be stored

datasetinsights.io.downloader.base.create_dataset_downloader(source_uri, **kwargs)

This function instantiates the dataset downloader after finding it with the source_uri provided.

Parameters
  • source_uri – URI used to look up the correct dataset downloader

  • **kwargs

Returns: The dataset downloader instance matching source_uri.
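A prefix-based lookup of this kind might look like the sketch below. It is not the library's code: the stand-in classes `HTTPSketch` and `GCSSketch`, the `_REGISTRY` mapping, and the function name `create_downloader_sketch` are all hypothetical, and the behaviour on an unknown prefix (raising `ValueError`) is an assumption.

```python
# Hypothetical stand-in classes; the real downloaders live in datasetinsights.
class HTTPSketch:
    def __init__(self, **kwargs):
        self.options = kwargs


class GCSSketch:
    def __init__(self, **kwargs):
        self.options = kwargs


# Assumed mapping from protocol prefix to downloader class.
_REGISTRY = {
    "http://": HTTPSketch,
    "https://": HTTPSketch,
    "gs://": GCSSketch,
}


def create_downloader_sketch(source_uri, **kwargs):
    """Return an instance of the class whose prefix matches the URI."""
    for prefix, downloader_cls in _REGISTRY.items():
        if source_uri.startswith(prefix):
            # Extra keyword arguments are forwarded to the constructor.
            return downloader_cls(**kwargs)
    raise ValueError(f"No downloader registered for: {source_uri}")


downloader = create_downloader_sketch("gs://bucket/folder/data.zip")
print(type(downloader).__name__)
```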

datasetinsights.io.downloader.gcs_downloader

class datasetinsights.io.downloader.gcs_downloader.GCSDatasetDownloader(**kwargs)

Bases: datasetinsights.io.downloader.base.DatasetDownloader

This class downloads data from Google Cloud Storage (GCS).

download(source_uri=None, output=None, **kwargs)

Parameters
  • source_uri – The URI that indicates where on GCS the dataset should be downloaded from. The expected source_uri follows one of these patterns: gs://bucket/folder or gs://bucket/folder/data.zip

  • output – The path to the directory where the downloaded dataset will be stored.
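To make the two URI patterns concrete, the sketch below splits a gs:// URI into a bucket name and an object path using only the standard library. The helper name `split_gcs_uri` is hypothetical, and nothing here is claimed to be how GCSDatasetDownloader parses its input.

```python
from urllib.parse import urlparse


def split_gcs_uri(source_uri):
    """Split a gs:// URI into (bucket, object path)."""
    parsed = urlparse(source_uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"Expected a gs://bucket/... URI, got: {source_uri}")
    # The bucket is the network location; the rest is the folder or file.
    return parsed.netloc, parsed.path.lstrip("/")


print(split_gcs_uri("gs://bucket/folder"))
print(split_gcs_uri("gs://bucket/folder/data.zip"))
```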

datasetinsights.io.downloader.http_downloader

class datasetinsights.io.downloader.http_downloader.HTTPDatasetDownloader(**kwargs)

Bases: datasetinsights.io.downloader.base.DatasetDownloader

This class downloads data from any public HTTP or HTTPS URL and validates the checksum of the downloaded dataset when a checksum file path is provided.

download(source_uri, output, checksum_file=None, **kwargs)

This method downloads the dataset from an HTTP or HTTPS URL.

Parameters
  • source_uri (str) – The URI that indicates where the dataset should be downloaded from.

  • output (str) – The path to the directory where the downloaded dataset will be stored.

  • checksum_file (str) – The path of the txt file that contains the checksum of the dataset to be downloaded. It can be an HTTP or HTTPS URL or a local path.

Raises

ChecksumError – Raised if the checksum does not match.
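The checksum step can be illustrated with a small standalone sketch. The hash algorithm is not stated here, so SHA-256 is an assumption made for the example. The `ChecksumError` class below is a local stand-in for the error named above, and `validate_checksum_sketch` is a hypothetical helper, not a datasetinsights function.

```python
import hashlib
import tempfile
from pathlib import Path


class ChecksumError(Exception):
    """Local stand-in for the error named above."""


def validate_checksum_sketch(file_path, expected_hex):
    """Raise ChecksumError if the file's SHA-256 digest differs (assumed algorithm)."""
    digest = hashlib.sha256(Path(file_path).read_bytes()).hexdigest()
    if digest != expected_hex.strip().lower():
        raise ChecksumError(f"Checksum mismatch for {file_path}")


with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "data.zip"
    data_file.write_bytes(b"example payload")
    expected = hashlib.sha256(b"example payload").hexdigest()
    validate_checksum_sketch(data_file, expected)  # matching digest: no error
```

Stripping whitespace from the expected value tolerates a trailing newline, which is common when the checksum is read from a text file.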

class datasetinsights.io.downloader.GCSDatasetDownloader(**kwargs)

Bases: datasetinsights.io.downloader.base.DatasetDownloader

This class downloads data from Google Cloud Storage (GCS).

download(source_uri=None, output=None, **kwargs)

Parameters
  • source_uri – The URI that indicates where on GCS the dataset should be downloaded from. The expected source_uri follows one of these patterns: gs://bucket/folder or gs://bucket/folder/data.zip

  • output – The path to the directory where the downloaded dataset will be stored.

class datasetinsights.io.downloader.HTTPDatasetDownloader(**kwargs)

Bases: datasetinsights.io.downloader.base.DatasetDownloader

This class downloads data from any public HTTP or HTTPS URL and validates the checksum of the downloaded dataset when a checksum file path is provided.

download(source_uri, output, checksum_file=None, **kwargs)

This method downloads the dataset from an HTTP or HTTPS URL.

Parameters
  • source_uri (str) – The URI that indicates where the dataset should be downloaded from.

  • output (str) – The path to the directory where the downloaded dataset will be stored.

  • checksum_file (str) – The path of the txt file that contains the checksum of the dataset to be downloaded. It can be an HTTP or HTTPS URL or a local path.

Raises

ChecksumError – Raised if the checksum does not match.

datasetinsights.io.downloader.create_dataset_downloader(source_uri, **kwargs)

This function instantiates the dataset downloader after finding it with the source_uri provided.

Parameters
  • source_uri – URI used to look up the correct dataset downloader

  • **kwargs

Returns: The dataset downloader instance matching source_uri.