API
The Spark bolt array implements the core API, as well as the extra methods highlighted here. Some are used to implement the core methods (e.g. swap) and may be of interest to developers. Others are Spark-specific convenience functions (e.g. cache). Full documentation can be found below.
stack([size]) | Aggregates records of a distributed array. |
chunk([size, axis]) | Chunks records of a distributed array. |
swap(kaxes, vaxes[, size]) | Swap axes from keys to values. |
cache() | Cache the underlying RDD in memory. |
unpersist() | Remove the underlying RDD from memory. |
toarray() | Returns the contents as a local array. |
tordd() | Return the underlying RDD of the bolt array. |
split | Axis at which the array is split into keys/values. |
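To make the key/value layout concrete, the sketch below emulates it locally with NumPy alone; no Spark or bolt import is involved. The helpers `to_records` and `from_records` are hypothetical names used only for illustration. They show what `split` means: axes before the split index the records (keys), and axes after it form each record's value.

```python
import numpy as np
from itertools import product


def to_records(x, split):
    """Break a local array into (key, value) records at the given split."""
    key_shape = x.shape[:split]
    return [(key, x[key]) for key in product(*(range(n) for n in key_shape))]


def from_records(records, shape):
    """Reassemble records into one local array (the role toarray plays)."""
    out = np.empty(shape, dtype=records[0][1].dtype)
    for key, value in records:
        out[key] = value
    return out


x = np.arange(24).reshape(2, 3, 4)
records = to_records(x, split=1)        # 2 records, each value of shape (3, 4)
restored = from_records(records, x.shape)
```

With `split=2` the same array would instead yield six records, each holding a value of shape `(4,)`.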
Detailed API
- class bolt.spark.array.BoltArraySpark(rdd, shape=None, split=None, dtype=None)
-
- astype(dtype, casting='unsafe')
Cast the array to a specified type.
Parameters: dtype : str or dtype
Typecode or data-type to cast the array to (see numpy)
- chunk(size='150', axis=None)
Chunks records of a distributed array.
Chunking breaks arrays into subarrays, using a specified chunk size along each value dimension. Alternatively, specify an average chunk size in megabytes, and the size of chunks along each dimension (as ints) will be computed automatically.
Parameters: size : tuple, int, or str, optional, default = '150'
A string giving the average chunk size in megabytes, or a tuple with the size of chunks along each dimension.
axis : int or tuple, optional, default = None
One or more axes to chunk the array along; if None, all axes are used.
Returns: ChunkedArray
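The tuple form of `size` can be pictured on a single record's value. The following is a local NumPy sketch, not bolt's implementation: `chunk_value` is a hypothetical helper, and the megabyte-based string form of `size` is not modeled here.

```python
import numpy as np
from itertools import product


def chunk_value(value, size):
    """Split one value array into subarrays of at most size[i] along axis i."""
    # Start offsets of the chunks along every dimension.
    starts = [range(0, n, s) for n, s in zip(value.shape, size)]
    chunks = {}
    for offset in product(*starts):
        index = tuple(slice(o, o + s) for o, s in zip(offset, size))
        chunks[offset] = value[index]   # trailing chunks may be smaller
    return chunks


value = np.arange(24).reshape(4, 6)
chunks = chunk_value(value, size=(2, 4))
```

Here the 4 x 6 value yields four subarrays, keyed by their starting offsets; the chunks at column offset 4 are only two columns wide because the dimension does not divide evenly.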
- filter(func, axis=(0, ))
Filter array along an axis.
Applies a function that should evaluate to a boolean along a single axis or multiple axes. The array will be aligned so that the desired set of axes are in the keys, which may incur a swap.
Parameters: func : function
Function to apply; should return a boolean
axis : tuple or int, optional, default=(0,)
Axis or multiple axes to filter along.
Returns: BoltArraySpark
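As a rough local analogy (plain NumPy, no Spark), filtering along axis 0 treats each slice along that axis as one record and keeps the records for which the function returns True. The predicate below is an arbitrary example, not something taken from bolt.

```python
import numpy as np

x = np.arange(12).reshape(4, 3)


def func(v):
    """Example predicate: keep records whose entries sum to more than 10."""
    return v.sum() > 10


# With axis=(0,), each record is one row of x.
kept = [row for row in x if func(row)]
result = np.array(kept)
```

The first row sums to 3 and is dropped, so the result has three records of shape `(3,)`.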
- map(func, axis=(0, ), value_shape=None)
Apply a function across an axis.
Array will be aligned so that the desired set of axes are in the keys, which may incur a swap.
Parameters: func : function
Function of a single array to apply
axis : tuple or int, optional, default=(0,)
Axis or multiple axes to apply function along.
value_shape : tuple, optional, default=None
Known shape of values resulting from operation
Returns: BoltArraySpark
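A local NumPy analogy may help with `value_shape`: the function is applied to each record's value, and `value_shape` describes the shape of what the function returns for one record. The snippet is only a sketch of those semantics and does not call bolt.

```python
import numpy as np

x = np.arange(24).reshape(2, 3, 4)


def func(v):
    """Example function: collapse a (3, 4) value to shape (4,)."""
    return v.sum(axis=0)


# With axis=(0,), the function sees x[0] and x[1], each of shape (3, 4).
mapped = np.array([func(v) for v in x])

# The per-record output shape is what value_shape would describe.
value_shape = mapped.shape[1:]
```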
- max(axis=None, keepdims=False)
Return the maximum of the array over the given axis.
Parameters: axis : tuple or int, optional, default=None
Axis or axes to compute the statistic over; if None, computes over all axes.
keepdims : boolean, optional, default=False
If True, keep the reduced axes in the result with size 1.
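NumPy's own reductions use the same `axis` and `keepdims` vocabulary, so a local array gives a reasonable picture of the resulting shapes. That the Spark bolt array matches NumPy exactly here is an assumption based on the parameter descriptions, and the same shape rules would then apply to mean, min, std, sum and var.

```python
import numpy as np

x = np.arange(24).reshape(2, 3, 4)

over_all = x.max()                          # axis=None: reduce every axis
over_axes = x.max(axis=(0, 2))              # reduced axes are dropped
kept = x.max(axis=(0, 2), keepdims=True)    # reduced axes stay with size 1
```

With `keepdims=True` the result keeps three dimensions, which lets it broadcast against the original array.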
- mean(axis=None, keepdims=False)
Return the mean of the array over the given axis.
Parameters: axis : tuple or int, optional, default=None
Axis or axes to compute the statistic over; if None, computes over all axes.
keepdims : boolean, optional, default=False
If True, keep the reduced axes in the result with size 1.
- min(axis=None, keepdims=False)
Return the minimum of the array over the given axis.
Parameters: axis : tuple or int, optional, default=None
Axis or axes to compute the statistic over; if None, computes over all axes.
keepdims : boolean, optional, default=False
If True, keep the reduced axes in the result with size 1.
- reduce(func, axis=(0, ), keepdims=False)
Reduce an array along an axis.
Applies a commutative/associative function of two arguments cumulatively to all arrays along an axis. Array will be aligned so that the desired set of axes are in the keys, which may incur a swap.
Parameters: func : function
Function of two arrays that returns a single array
axis : tuple or int, optional, default=(0,)
Axis or multiple axes to reduce along.
keepdims : boolean, optional, default=False
If True, keep the reduced axes in the result with size 1.
Returns: BoltArraySpark
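The requirement that the function be commutative and associative can be illustrated locally: a distributed reduce does not promise any particular combination order, so the result must not depend on it. The sketch uses `functools.reduce` over the records of a NumPy array and is an analogy, not bolt code.

```python
import numpy as np
from functools import reduce

x = np.arange(24).reshape(2, 3, 4)

# axis=(0,): the records are x[0] and x[1], each of shape (3, 4).
records = [x[i] for i in range(x.shape[0])]

# np.add is commutative and associative, so combining the records in a
# different order gives the same array.
total = reduce(np.add, records)
reversed_total = reduce(np.add, records[::-1])
```

A function such as `np.subtract` would fail this check, which is why it is not a safe choice for a distributed reduce.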
- reshape(*shape)
Return an array with the same data but a new shape.
Currently supports only reshapes that act on the keys and the values independently (reshaping the keys, the values, or both).
Parameters: shape : tuple of ints, or n ints
New shape
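One way to read the independence restriction: the new shape must be divisible into a leading part with as many elements as the old keys and a trailing part with as many elements as the old values. The helper `independent_split` below is hypothetical and encodes only that reading; bolt's actual check may be stricter.

```python
from math import prod


def independent_split(shape, split, new_shape):
    """Return a split for new_shape that keeps keys and values apart, or None."""
    key_size = prod(shape[:split])
    value_size = prod(shape[split:])
    for k in range(len(new_shape) + 1):
        if prod(new_shape[:k]) == key_size and prod(new_shape[k:]) == value_size:
            return k
    return None


# Keys (2, 3) -> (6,) and values (4, 5) -> (20,): each side reshaped on its own.
ok = independent_split((2, 3, 4, 5), 2, (6, 20))
# (4, 30) would mix key and value elements, so no valid split exists.
bad = independent_split((2, 3, 4, 5), 2, (4, 30))
```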
- squeeze(axis=None)
Remove one or more single-dimensional axes from the array.
Parameters: axis : tuple or int
One or more singleton axes to remove.
- stack(size=None)
Aggregates records of a distributed array.
Stacking should improve the performance of vectorized operations, but the resulting StackedArray object only exposes a restricted set of operations (e.g. map, reduce). The unstack method can be used to restore the full bolt array.
Parameters: size : int, optional, default=None
The maximum size of each stack, as a number of original records. Groups of records within each partition are aggregated up to this size; if None, all records on each partition are aggregated.
Returns: StackedArray
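The grouping described for `size` can be sketched for a single partition using plain NumPy. `stack_partition` is a hypothetical helper, and the list of `(key, value)` tuples stands in for one Spark partition; the real StackedArray is not used here.

```python
import numpy as np


def stack_partition(records, size=None):
    """Group one partition's (key, value) records into stacked blocks."""
    if size is None:
        size = max(len(records), 1)     # aggregate the whole partition
    stacks = []
    for start in range(0, len(records), size):
        group = records[start:start + size]
        keys = [k for k, _ in group]
        values = np.stack([v for _, v in group])    # new leading axis
        stacks.append((keys, values))
    return stacks


partition = [((i,), np.full((3,), i)) for i in range(5)]
stacks = stack_partition(partition, size=2)

# A vectorized operation now touches one block instead of many small records.
doubled = [(keys, values * 2) for keys, values in stacks]
```

Five records with `size=2` give blocks of two, two and one record; with `size=None` the whole partition becomes a single block.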
- std(axis=None, keepdims=False)
Return the standard deviation of the array over the given axis.
Parameters: axis : tuple or int, optional, default=None
Axis or axes to compute the statistic over; if None, computes over all axes.
keepdims : boolean, optional, default=False
If True, keep the reduced axes in the result with size 1.
- sum(axis=None, keepdims=False)
Return the sum of the array over the given axis.
Parameters: axis : tuple or int, optional, default=None
Axis or axes to compute the statistic over; if None, computes over all axes.
keepdims : boolean, optional, default=False
If True, keep the reduced axes in the result with size 1.
- swap(kaxes, vaxes, size='150')
Swap axes from keys to values.
This is the core operation underlying shape manipulation on the Spark bolt array. It exchanges an arbitrary set of axes between the keys and the values. If either is None, axes are moved in one direction only (from keys to values, or from values to keys). Keys moved to values will be placed immediately after the split; values moved to keys will be placed immediately before the split.
Parameters: kaxes : tuple
Axes from keys to move to values
vaxes : tuple
Axes from values to move to keys
size : tuple, int, or str, optional, default = '150'
Either a string giving the size in megabytes, or a tuple with the number of chunks along each value dimension being moved
Returns: BoltArraySpark
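The placement rule (moved keys land immediately after the split, moved values immediately before it) can be emulated on a local array with a single transpose. The helper `swap_local` is hypothetical. It assumes `kaxes` index into the key axes and `vaxes` index into the value axes, uses empty tuples where the method would accept None, and ignores chunking entirely.

```python
import numpy as np


def swap_local(x, split, kaxes, vaxes):
    """Emulate swap on a local array; returns (new_array, new_split)."""
    keys = list(range(split))
    values = list(range(split, x.ndim))
    moved_k = [keys[k] for k in kaxes]      # key axes going to the values
    moved_v = [values[v] for v in vaxes]    # value axes going to the keys
    stay_k = [a for a in keys if a not in moved_k]
    stay_v = [a for a in values if a not in moved_v]
    # New keys end with the moved value axes (just before the split);
    # new values start with the moved key axes (just after the split).
    order = stay_k + moved_v + moved_k + stay_v
    new_split = len(stay_k) + len(moved_v)
    return x.transpose(order), new_split


x = np.zeros((2, 3, 4, 5))                  # split=2: keys (2, 3), values (4, 5)
y, new_split = swap_local(x, 2, kaxes=(0,), vaxes=(1,))
z, one_way_split = swap_local(x, 2, kaxes=(1,), vaxes=())
```

In the first call the key axis of size 2 and the value axis of size 5 trade sides, giving keys `(3, 5)` and values `(2, 4)`. In the second, only a key moves, so the shape is unchanged and the split shifts from 2 to 1.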
- swapaxes(axis1, axis2)
Return the array with two axes interchanged.
Parameters: axis1 : int
The first axis to swap
axis2 : int
The second axis to swap
- toarray()
Returns the contents as a local array.
Will likely cause memory problems for large objects.
- transpose(*axes)
Return an array with the axes transposed.
This operation will incur a swap unless the desired permutation can be obtained by transposing only the keys or only the values.
Parameters: axes : None, tuple of ints, or n ints
If None, will reverse axis order.
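Whether a given permutation stays within the keys and the values can be checked with a short test. `needs_swap` is a hypothetical helper that assumes the split position is unchanged by the transpose; it illustrates the condition stated above and is not how bolt decides internally.

```python
def needs_swap(perm, split):
    """True if the permutation moves any axis across the key/value split."""
    return sorted(perm[:split]) != list(range(split))


# split=2 on a 4-d array: keys are axes (0, 1), values are axes (2, 3).
keys_only = needs_swap((1, 0, 2, 3), 2)     # keys permuted among themselves
both_sides = needs_swap((1, 0, 3, 2), 2)    # keys and values, each in place
crossing = needs_swap((2, 1, 0, 3), 2)      # axis 2 moves into the keys
```

Only the last permutation crosses the split. Reversing the axis order (the `None` case) on this array would also cross it.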
- var(axis=None, keepdims=False)
Return the variance of the array over the given axis.
Parameters: axis : tuple or int, optional, default=None
Axis or axes to compute the statistic over; if None, computes over all axes.
keepdims : boolean, optional, default=False
If True, keep the reduced axes in the result with size 1.