ML之FE: Explanation and code for the five commonly used dataset-splitting methods in feature engineering (special-type splits, e.g. time-series splitting)

一个处女座的程序猿 · Published: 2021-01-30 19:29:03


 

 

Table of Contents

Special-type data splitting

5.1 Time-series data splitting: TimeSeriesSplit

 

Special-type data splitting: 5.1 Time-series data splitting (TimeSeriesSplit)

class TimeSeriesSplit (found in sklearn.model_selection._split)

 

class TimeSeriesSplit(_BaseKFold):

    """Time Series cross-validator .. versionadded:: 0.18

    

    Provides train/test indices to split time series data samples that are observed at fixed time intervals, in train/test sets. In each split, test indices must be higher than before, and thus shuffling in cross validator is inappropriate. This cross-validation object is a variation of :class:`KFold`.  In the kth split, it returns first k folds as train set and the (k+1)th fold as test set.

    Note that unlike standard cross-validation methods, successive training sets are supersets of those that come before them.

    Read more in the :ref:`User Guide `.

    

    Parameters

    ----------

    n_splits : int, default=5. Number of splits. Must be at least 2. .. versionchanged:: 0.22 . ``n_splits`` default value changed from 3 to 5.

    max_train_size : int, default=None. Maximum size for a single training set.

 

 

 

 

 

 

In translation: provides train/test indices for splitting time-series data samples, observed at fixed time intervals, into train/test sets. In each split, the test indices must be higher than in the previous split, so shuffling in the cross-validator is inappropriate. This cross-validation object is a variation of KFold: in the k-th split, it returns the first k folds as the training set and the (k+1)-th fold as the test set.

Note that, unlike standard cross-validation methods, successive training sets are supersets of those that come before them.

Read more in the User Guide.

 

Parameters
----------
n_splits : int, default=5. Number of splits. Must be at least 2. (Changed in version 0.22: the default value of ``n_splits`` changed from 3 to 5.)

max_train_size : int, default=None. Maximum size for a single training set.
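As a quick sanity check (a minimal sketch, assuming scikit-learn is installed), the fold sizes produced by TimeSeriesSplit can be verified against the formula given in the docstring's Notes section: the i-th training set has ``i * n_samples // (n_splits + 1) + n_samples % (n_splits + 1)`` samples, and every test set has ``n_samples // (n_splits + 1)`` samples.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 10 samples, 4 splits -> 5 folds, so each test fold has 10 // 5 = 2 samples
n_samples, n_splits = 10, 4
X = np.zeros((n_samples, 1))

tscv = TimeSeriesSplit(n_splits=n_splits)
for i, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    expected_train = i * n_samples // (n_splits + 1) + n_samples % (n_splits + 1)
    expected_test = n_samples // (n_splits + 1)
    assert len(train_idx) == expected_train
    assert len(test_idx) == expected_test
    print(f"split {i}: train={len(train_idx)}, test={len(test_idx)}")
```

The training sets grow by one test-fold's worth of samples at each split (2, 4, 6, 8 here), while the test size stays constant.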

    Examples

    --------

    >>> import numpy as np

    >>> from sklearn.model_selection import TimeSeriesSplit

    >>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])

    >>> y = np.array([1, 2, 3, 4, 5, 6])

    >>> tscv = TimeSeriesSplit()

    >>> print(tscv)

    TimeSeriesSplit(max_train_size=None, n_splits=5)

    >>> for train_index, test_index in tscv.split(X):

    ...     print("TRAIN:", train_index, "TEST:", test_index)

    ...     X_train, X_test = X[train_index], X[test_index]

    ...     y_train, y_test = y[train_index], y[test_index]

    TRAIN: [0] TEST: [1]

    TRAIN: [0 1] TEST: [2]

    TRAIN: [0 1 2] TEST: [3]

    TRAIN: [0 1 2 3] TEST: [4]

    TRAIN: [0 1 2 3 4] TEST: [5]

    

    Notes

    -----

    The training set has size ``i * n_samples // (n_splits + 1) + n_samples % (n_splits + 1)`` in the ``i``th split, with a test set of size ``n_samples//(n_splits + 1)``, where ``n_samples`` is the number of samples.

 

    """

    @_deprecate_positional_args

    def __init__(self, n_splits=5, *, max_train_size=None):

        super().__init__(n_splits, shuffle=False, random_state=None)

        self.max_train_size = max_train_size

    

    def split(self, X, y=None, groups=None):

        """Generate indices to split data into training and test set.

 

        Parameters

        ----------

        X : array-like of shape (n_samples, n_features). Training data, where n_samples is the number of samples and n_features is the number of features.

 

        y : array-like of shape (n_samples,). Always ignored, exists for compatibility.

 

        groups : array-like of shape (n_samples,). Always ignored, exists for compatibility.

 

        Yields

        ------

        train : ndarray. The training set indices for that split.

 

        test : ndarray. The testing set indices for that split.

        """

        X, y, groups = indexable(X, y, groups)

        n_samples = _num_samples(X)

        n_splits = self.n_splits

        n_folds = n_splits + 1

        if n_folds > n_samples:

            raise ValueError(
                ("Cannot have number of folds={0} greater than the number of samples: {1}.").format(n_folds, n_samples))

        indices = np.arange(n_samples)

        test_size = n_samples // n_folds

        test_starts = range(test_size + n_samples % n_folds, n_samples,
                            test_size)

        for test_start in test_starts:

            if self.max_train_size and self.max_train_size < test_start:

                yield (indices[test_start - self.max_train_size:test_start],
                       indices[test_start:test_start + test_size])

            else:

                yield indices[:test_start], indices[test_start:test_start + test_size]
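The max_train_size branch above turns the expanding training window into a sliding window: once test_start exceeds max_train_size, only the most recent max_train_size samples are kept for training. A minimal sketch (assuming scikit-learn is installed):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(6, 2)  # 6 samples, 2 features

# max_train_size=2 caps the training window at the 2 most recent samples
tscv = TimeSeriesSplit(n_splits=5, max_train_size=2)
for train_index, test_index in tscv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
# TRAIN: [0] TEST: [1]
# TRAIN: [0 1] TEST: [2]
# TRAIN: [1 2] TEST: [3]
# TRAIN: [2 3] TEST: [4]
# TRAIN: [3 4] TEST: [5]
```

Compare with the doctest output earlier: without max_train_size the third split trains on [0 1 2], whereas here the oldest sample is dropped and it trains on [1 2].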

 
