Package: backend.tweets

Submodules

datasets

Python Module to handle data loading

class backend.tweets.datasets.Dataset(name: str, data_format: str, filename: str)

Bases: object

(Data)class encapsulating the information for a single Dataset. Dataset instances are saved into a global DATA_DICTIONARY (see below), which acts as a proxy for accessing the available datasets.

This mechanism acts as a surrogate for a database and is therefore potentially subject to change in the future.

name

Chosen dataset unique name

Type

str

data_format

Format of the data (e.g. CSV, GeoJSON)

Type

str

filename

Name of the datafile (including file extension)

Type

str

property data

Reads and returns a pd.DataFrame of the data.

Raises

NotImplementedError – When the file type is not supported for reading.

data_format: str = None
filename: str = None
property is_valid

Returns True if the file can be found at the source path provided.

Raises

FileNotFoundError – When the file does not exist at the given path.

name: str = None
property source_path

Gets the source path for the file.

Raises

FileNotFoundError – When the file does not exist at the given path.
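
A minimal usage sketch of the class (the dataset name and filename below are illustrative only, not files shipped with the package):

from backend.tweets.datasets import Dataset

# Hypothetical dataset: the name and filename are illustrative only.
tweets_csv = Dataset(name="example_tweets", data_format="CSV",
                     filename="example_tweets.csv")

if tweets_csv.is_valid:      # FileNotFoundError if the file is missing
    df = tweets_csv.data     # pd.DataFrame; NotImplementedError for unsupported formats
    print(df.head())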

backend.tweets.datasets.generate_la_keys(data_filename: str = 'la_keys.geojson')

Generate the Local Authorities Keys.

This function merges demographics and boundaries of Local Authorities (as retrieved from “buondaries_LAs.geojson”) to generate an LA lookup table used for quicker geographical attribution of tweets.
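
A rough sketch of the kind of merge this performs, assuming a demographics table and the LA boundaries GeoJSON share a Local Authority code column (the demographics path, the join key, and the merge strategy below are assumptions, not the actual implementation):

import geopandas as gpd
import pandas as pd

def generate_la_keys_sketch(demographics_csv: str,
                            boundaries_path: str = "buondaries_LAs.geojson",
                            out_path: str = "la_keys.geojson") -> gpd.GeoDataFrame:
    """Merge LA demographics with LA boundaries into a single lookup table."""
    demographics = pd.read_csv(demographics_csv)    # hypothetical demographics table
    boundaries = gpd.read_file(boundaries_path)     # LA boundary polygons
    la_keys = boundaries.merge(demographics, on="lad19cd", how="left")  # assumed join key
    la_keys.to_file(out_path, driver="GeoJSON")
    return la_keys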

backend.tweets.datasets.load_local_authorities() → backend.tweets.datasets.Dataset

Load the Local Authorities Keys Dataset, or generate it if not found.

backend.tweets.datasets.load_tweets() → backend.tweets.datasets.Dataset

Load the Tweets Dataset.
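
Typical usage of the two loaders (a short sketch; the columns of the resulting DataFrames depend on the underlying files):

from backend.tweets.datasets import load_local_authorities, load_tweets

tweets_ds = load_tweets()            # Dataset wrapping the tweets file
la_ds = load_local_authorities()     # Dataset wrapping la_keys.geojson (generated if missing)

tweets_df = tweets_ds.data
la_df = la_ds.data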

pipelines

Module containing a class for creating custom transformation pipelines, a series of functions used for transforming Twitter data, and a Twitter transformation pipeline.

class backend.tweets.pipelines.Pipeline

Bases: abc.ABC

ABC implementing the general template of a Dynamic Pipeline to process the Twitter dataset.

The pipeline is composed of a sequence of (name, function) pairs. Valid (Pipe) functions must conform to the following signature:

Callable[[pd.DataFrame], pd.DataFrame]

Therefore, only one parameter is expected, i.e. the dataset in the form of a pandas.DataFrame, and a DataFrame is returned (to be fed as input to the next step).

(Concrete) pipeline definitions are obtained via subclassing: steps are hard-coded so as to have pre-defined and controllable behaviour. However, it is always possible to register extra steps into a pipeline via the register method.
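
For illustration, a hypothetical concrete pipeline could be defined as below; the step functions are placeholders that assume a 'text' column, and the only requirement taken from the base class is that create_pipeline returns the ordered list of (name, function) pairs:

from typing import Callable, List, Tuple

import pandas as pd

from backend.tweets.pipelines import Pipeline

Pipe = Callable[[pd.DataFrame], pd.DataFrame]


def drop_empty_text(data: pd.DataFrame) -> pd.DataFrame:
    """Placeholder step: drop rows with a missing 'text' value."""
    return data.dropna(subset=["text"])


def lowercase_text(data: pd.DataFrame) -> pd.DataFrame:
    """Placeholder step: normalise the 'text' column to lower case."""
    data = data.copy()
    data["text"] = data["text"].str.lower()
    return data


class ExamplePipeline(Pipeline):
    """Hypothetical pipeline with hard-coded, pre-defined steps."""

    def create_pipeline(self) -> List[Tuple[str, Pipe]]:
        return [
            ("drop_empty_text", drop_empty_text),
            ("lowercase_text", lowercase_text),
        ]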

apply(data: pandas.core.frame.DataFrame, verbosity: int = 0) → pandas.core.frame.DataFrame

Executes the pipeline on the input data.

Parameters
  • data (pd.DataFrame) – Input data to initialise the pipeline

  • verbosity (int, optional) – Controls the verbosity of the execution of the pipeline, by default 0 (no verbosity)

Returns

New copy of the data after the execution of all the steps of the pipeline.

Return type

pd.DataFrame

abstract create_pipeline() → List[Tuple[str, Callable[[pandas.core.frame.DataFrame], pandas.core.frame.DataFrame]]]
register(op: Tuple[str, Callable[[pandas.core.frame.DataFrame], pandas.core.frame.DataFrame]])

Register an extra (name, function) step into the pipeline.

The op argument must be a (name, function) pair whose function matches the Pipe signature described above.
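
A hedged example of registering an extra step on a pipeline instance (the step itself is hypothetical and assumes a 'text' column; where the step is placed in the sequence is not specified here):

import pandas as pd

from backend.tweets.pipelines import TwitterPipeline


def drop_retweets(data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical extra step: remove retweets based on the 'text' column."""
    return data[~data["text"].str.startswith("RT @")]


pipeline = TwitterPipeline()
pipeline.register(("drop_retweets", drop_retweets))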

class backend.tweets.pipelines.TwitterPipeline

Bases: backend.tweets.pipelines.Pipeline

Implementation of the Pipeline class for reading and preparing tweets.

create_pipeline() → List[Tuple[str, Callable[[pandas.core.frame.DataFrame], pandas.core.frame.DataFrame]]]
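
Putting the loader and the pipeline together, a typical end-to-end call might look like this (a sketch; the exact columns of the cleaned data depend on the pipeline steps):

from backend.tweets.datasets import load_tweets
from backend.tweets.pipelines import TwitterPipeline

tweets = load_tweets().data                                  # raw tweets as a pd.DataFrame
clean_tweets = TwitterPipeline().apply(tweets, verbosity=1)  # returns a new copy of the data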
backend.tweets.pipelines.convert_coordinates(data, col='geo.coordinates')

Takes a pandas DataFrame with a geo.coordinates column (col) and adds the latitude and longitude to their own columns for easy conversion to GeoJSON.
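
A minimal sketch of what such a transformation might do, assuming each non-null entry of the column holds a (lat, long) pair (the ordering of the pair is an assumption about the tweet payload):

import pandas as pd


def convert_coordinates_sketch(data: pd.DataFrame,
                               col: str = "geo.coordinates") -> pd.DataFrame:
    """Split a coordinate-pair column into separate 'lat' and 'long' columns."""
    data = data.copy()
    coords = data[col].dropna()
    data.loc[coords.index, "lat"] = coords.str[0]   # assumed order: latitude first
    data.loc[coords.index, "long"] = coords.str[1]
    return data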

backend.tweets.pipelines.match_local_authorities(bbox: Sequence[Tuple[float, float]], la_df: pandas.core.frame.DataFrame, return_all: bool = False) → Union[Tuple[str, str, str], pandas.core.frame.DataFrame]

Get the Intersection over Union for the Local Authorities that overlap with the bounding box. Requires a ‘geometry’ column in the Local Authorities geopandas DataFrame. Returns a DataFrame of the Local Authorities of interest.

Parameters
  • bbox (BoundingBox) – Bounding box coordinates of the tweet

  • la_df (pd.DataFrame) – (pandas) DataFrame containing information of the Local Authorities (keys) to match

  • return_all (bool) – Flag controlling whether to return only the top matching local authorities or all of them (ranked by likelihood). By default, False.

Returns

A tuple containing the name, the code, and the reference of the top matching LA, or all of them (in the form of a pd.DataFrame).
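
The matching score is the intersection over union between the tweet’s bounding box and each LA geometry. A minimal sketch of that score using shapely (the coordinates and column names below are illustrative, not the actual implementation):

from shapely.geometry import Polygon


def iou(bbox_polygon: Polygon, la_polygon: Polygon) -> float:
    """Intersection over Union of a tweet bounding box and an LA boundary."""
    intersection = bbox_polygon.intersection(la_polygon).area
    union = bbox_polygon.union(la_polygon).area
    return intersection / union if union else 0.0


# Hypothetical bounding box given as (long, lat) corner coordinates.
bbox = Polygon([(-3.3, 51.4), (-3.1, 51.4), (-3.1, 51.6), (-3.3, 51.6)])

# Ranking candidate LAs (assuming a 'geometry' column, as stated above):
# la_df["likelihood"] = la_df["geometry"].apply(lambda geom: iou(bbox, geom))
# best_match = la_df.sort_values("likelihood", ascending=False).iloc[0]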

backend.tweets.pipelines.match_reference_la(data)

Choose the LA with the highest likelihood and add the LA and LHB to the dataset.
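
A hedged sketch of the selection step, assuming the ranked candidates carry a likelihood score as in the sketch above (the column name is an assumption):

import pandas as pd


def pick_top_la(candidates: pd.DataFrame) -> pd.Series:
    """Pick the candidate Local Authority row with the highest likelihood."""
    return candidates.loc[candidates["likelihood"].idxmax()]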