[Python] Randomly selecting elements from a list or string (random.choice, choices, sample)

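The random module in the Python standard library provides functions for randomly selecting elements from sequence types (list, tuple, str, range).

Pick a single element at random: choice()
Pick several elements at random from a sequence, allowing the same element to be chosen more than once: choices()
Pick several elements at random from a sequence, with no element chosen more than once: sample()

This article explains these functions with concrete examples.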

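Selecting one element at random: random.choice()
Selecting k elements at random (with replacement): random.choices()
1. Setting the argument k
2. Setting the weights (weights) and cumulative weights (cum_weights)
Selecting k elements at random (without replacement): random.sample()
Environment used for this article
Summary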

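Selecting one element at random: random.choice()

random.choice() picks one element at random from a sequence (list, tuple, str, range) and returns it.
It is used as follows.

random.choice(seq)

Pass the source data (a sequence) as the argument seq.

Examples follow. (Each call is repeated three times to confirm that the selection is random.)

List

# Import the random module first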
>>> import random

>>> random.choice(['a', 'b', 'c', 'd', 'e'])
'd'
>>> random.choice(['a', 'b', 'c', 'd', 'e'])
'a'
>>> random.choice(['a', 'b', 'c', 'd', 'e'])
'b'

Tuple

>>> random.choice(('a', 'b', 'c', 'd', 'e'))
'd'
>>> random.choice(('a', 'b', 'c', 'd', 'e'))
'b'
>>> random.choice(('a', 'b', 'c', 'd', 'e'))
'c'

String

>>> random.choice('abcde')
'a'
>>> random.choice('abcde')
'b'
>>> random.choice('abcde')
'c'

range

>>> random.choice(range(5))
3
>>> random.choice(range(5))
2
>>> random.choice(range(5))
1

If the argument seq is empty, IndexError is raised.

>>> random.choice([])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/hibikisan/anaconda3/envs/python3.7/lib/python3.7/random.py", line 261, in choice
    raise IndexError('Cannot choose from an empty sequence') from None
IndexError: Cannot choose from an empty sequence
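
Since an empty sequence raises IndexError, code that may receive empty data needs a guard. The following is a minimal sketch of one way to do that; the helper name choice_or_default and its default parameter are hypothetical and not part of the random module.

```python
import random


def choice_or_default(seq, default=None):
    """Hypothetical helper: return a random element of seq, or default if seq is empty."""
    # random.choice() raises IndexError on an empty sequence,
    # so check the length first and fall back to the default value.
    if len(seq) == 0:
        return default
    return random.choice(seq)


print(choice_or_default(['a', 'b', 'c']))   # one of 'a', 'b', 'c'
print(choice_or_default([], default='?'))   # the fallback value '?'
```

Catching IndexError with try/except would work as well; checking the length first simply keeps the intent visible.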

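Selecting k elements at random (with replacement): random.choices()

random.choices() picks k elements at random from the sequence population and returns them as a new list. Elements are chosen with replacement, so the same element may be selected more than once.
It is used as follows.

random.choices(population, weights=None, *, cum_weights=None, k=1)

The return value is a list of length k.

weights and cum_weights (cumulative weights) set how heavily each element is favored during selection. If both are omitted, every element is selected with equal probability.

Setting the argument k

The argument k is the length of the returned list.
Note that k does not depend on the length of the original sequence; it may be larger than len(population).

Concrete examples follow.
To keep things simple, weights and cum_weights are omitted here.

List

>>> import random
# k=3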
>>> random.choices(['a', 'b', 'c', 'd', 'e'], k=3)
['a', 'c', 'd']

# k=5
>>> random.choices(['a', 'b', 'c', 'd', 'e'], k=5)
['b', 'c', 'a', 'c', 'a']

# k=8
>>> random.choices(['a', 'b', 'c', 'd', 'e'], k=8)
['c', 'c', 'd', 'd', 'd', 'd', 'c', 'b']

Tuple

# k=3
>>> random.choices(('a', 'b', 'c', 'd', 'e'), k=3)
['c', 'c', 'c']

# k=5
>>> random.choices(('a', 'b', 'c', 'd', 'e'), k=5)
['d', 'e', 'b', 'a', 'b']

# k=8
>>> random.choices(('a', 'b', 'c', 'd', 'e'), k=8)
['b', 'b', 'c', 'b', 'd', 'a', 'c', 'e']

String

# k=3
>>> random.choices('abcde', k=3)
['e', 'a', 'c']

# k=5
>>> random.choices('abcde', k=5)
['a', 'b', 'd', 'c', 'b']

# k=8
>>> random.choices('abcde', k=8)
['a', 'e', 'a', 'e', 'd', 'c', 'd', 'b']

range

# k=3
>>> random.choices(range(5), k=3)
[2, 1, 0]

# k=5
>>> random.choices(range(5), k=5)
[3, 0, 2, 2, 0]

# k=8
>>> random.choices(range(5), k=8)
[2, 4, 4, 4, 3, 2, 1, 3]

If the data is empty, IndexError is raised.

>>> random.choices([])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/hibikisan/ProgramFiles/py37env/lib/python3.7/random.py", line 356, in choices
    return [population[_int(random() * total)] for i in range(k)]
  File "/home/hibikisan/ProgramFiles/py37env/lib/python3.7/random.py", line 356, in <listcomp>
    return [population[_int(random() * total)] for i in range(k)]
IndexError: list index out of range
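
To see that k is independent of the population size and that unweighted selection is roughly uniform, the results can be tallied over many draws. This is an illustrative sketch only; the seed value and the draw count of 10000 are arbitrary choices made so the run is repeatable.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed only so that this run is repeatable

population = ['a', 'b', 'c', 'd', 'e']

# k is far larger than len(population); this is fine with choices().
picks = random.choices(population, k=10000)

# Count how often each element was selected.
counts = Counter(picks)
for item in population:
    print(item, counts[item])
```

With no weights given, each of the five counts should come out close to 10000 / 5 = 2000.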

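Setting the weights (weights) and cumulative weights (cum_weights)

Both parameters express a weight for each element of the sequence population as a non-negative number (int, float, etc.). They differ as follows.

weights: the relative weight of each element

# (relative) weights
weights = [2, 5, 10, 3]

cum_weights: the weights expressed as a running total from the leftmost element

# cumulative weights
cum_weights = [2, 7, 17, 20]

Concrete usage examples follow.

# (relative) weights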
>>> w = [5, 100, 25, 50, 70]  
>>> random.choices(['a', 'b', 'c', 'd', 'e'], weights=w, k=8)
['e', 'd', 'c', 'b', 'c', 'b', 'c', 'c']

# cumulative weights (cum_weights)
>>> cum_w = [5, 105, 130, 180, 250]
>>> random.choices(['a', 'b', 'c', 'd', 'e'], cum_weights=cum_w, k=8)
['b', 'e', 'd', 'b', 'c', 'd', 'e', 'b']
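
The two forms describe the same distribution: the cumulative weights are the running totals of the relative weights. As a minimal sketch, itertools.accumulate can build one from the other, and resetting the seed (42 is an arbitrary value) lets the two calls be compared directly.

```python
import random
from itertools import accumulate

population = ['a', 'b', 'c', 'd', 'e']
w = [5, 100, 25, 50, 70]

# Running totals of the relative weights: [5, 105, 130, 180, 250]
cum_w = list(accumulate(w))
print(cum_w)

# Draw with relative weights, then reset the seed and draw with cumulative weights.
random.seed(42)
by_weights = random.choices(population, weights=w, k=8)
random.seed(42)
by_cum_weights = random.choices(population, cum_weights=cum_w, k=8)

# Both calls consume the same random numbers and make the same selections.
print(by_weights == by_cum_weights)  # True
```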

Note that specifying both raises TypeError.

>>> random.choices(['a', 'b', 'c', 'd', 'e'], weights=w, cum_weights=cum_w, k=8)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/hibikisan/ProgramFiles/py37env/lib/python3.7/random.py", line 359, in choices
    raise TypeError('Cannot specify both weights and cumulative weights')
TypeError: Cannot specify both weights and cumulative weights

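(Reference)

Using cum_weights is slightly more efficient internally than using weights, because weights are converted to cumulative weights before any selection is made.
Zero or negative weights still return values, but not necessarily the random selection you expect: an element with weight 0 is never selected, and the behavior with negative weights is not something to rely on. (From Python 3.9, random.choices() raises ValueError when all weights are zero.)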
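
As a small check of the zero-weight case, the sketch below gives the first element a weight of 0 and confirms that it never appears in the result. The population and the draw count are arbitrary illustrative values.

```python
import random

population = ['a', 'b', 'c']

# 'a' has weight 0, so only 'b' and 'c' can be selected.
picks = random.choices(population, weights=[0, 1, 1], k=1000)

print('a' in picks)        # False
print(sorted(set(picks)))  # only elements with a positive weight
```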

Let's confirm this in the source code.

The code below is random.choices() excerpted from random.py.

def choices(self, population, weights=None, *, cum_weights=None, k=1):
    """Return a k sized list of population elements chosen with replacement.
    If the relative weights or cumulative weights are not specified,
    the selections are made with equal probability.
    """
    random = self.random
    n = len(population)
    if cum_weights is None:  # ★➀
        if weights is None:
            _int = int
            n += 0.0    # convert to float for a small speed improvement
            return [population[_int(random() * n)] for i in _repeat(None, k)]
        cum_weights = list(_accumulate(weights))  # ★②
    elif weights is not None:
        raise TypeError('Cannot specify both weights and cumulative weights')
    if len(cum_weights) != n:
        raise ValueError('The number of weights does not match the population')
    bisect = _bisect
    total = cum_weights[-1] + 0.0  # convert to float
    hi = n - 1
    return [population[bisect(cum_weights, random() * total, 0, hi)] for i in _repeat(None, k)]  # ★③

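The branch at ★➀ checks whether cum_weights and weights were given; if weights was given, it is converted to cum_weights at ★②. In other words, all processing from that point on uses cumulative weights.

At ★③, bisect() (binary search) finds the position (index) at which the value random.random() * total, where total is the last cumulative weight, would be inserted into cum_weights.
For this algorithm to work, the cum_weights array must be sorted in non-decreasing order, which holds when every weight is non-negative.
This site is a useful reference.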
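
The lookup at ★③ can be tried in isolation. The following is a simplified sketch rather than the library's code: pick_by_cum_weights is a hypothetical helper, and the number r, which random.random() would normally supply, is passed in by hand so that the result is deterministic.

```python
import bisect


def pick_by_cum_weights(population, cum_weights, r):
    """Hypothetical helper: select one element for a given r in [0.0, 1.0)."""
    total = cum_weights[-1] + 0.0   # sum of all weights, as a float
    hi = len(population) - 1        # upper bound for the search, as in the excerpt
    # Index at which r * total would be inserted into the cumulative weights.
    index = bisect.bisect(cum_weights, r * total, 0, hi)
    return population[index]


population = ['a', 'b', 'c', 'd', 'e']
cum_w = [5, 105, 130, 180, 250]

print(pick_by_cum_weights(population, cum_w, 0.0))   # 'a': 0.0 lies in [0, 5)
print(pick_by_cum_weights(population, cum_w, 0.5))   # 'c': 125.0 lies in [105, 130)
print(pick_by_cum_weights(population, cum_w, 0.99))  # 'e': 247.5 lies in [180, 250)
```

Each element owns a slice of the range [0, total) whose width equals its weight, which is why heavier elements are hit more often.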

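Selecting k elements at random (without replacement): random.sample()

random.sample() picks k random elements from the sequence or set population without replacement and returns them as a new list. (Passing a set works in Python 3.7, which this article uses; it was deprecated in Python 3.9 and removed in Python 3.11.)
The syntax is as follows.

random.sample(population, k)

The return value is a list of length k.

Concrete examples follow.

# k=1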
>>> random.sample(['a', 'b', 'c', 'd', 'e'], k=1)
['b']

# k=3
>>> random.sample(['a', 'b', 'c', 'd', 'e'], k=3)
['d', 'c', 'a']

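Setting k = (length of the original sequence) shuffles the elements.
Note: for shuffling, the function random.shuffle() is also provided; it shuffles a list in place, whereas random.sample() returns a new list.

# k=5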
>>> random.sample(['a', 'b', 'c', 'd', 'e'], k=5)
['b', 'a', 'd', 'c', 'e']

>>> random.sample(['a', 'b', 'c', 'd', 'e'], k=5)
['d', 'a', 'e', 'b', 'c']

>>> random.sample(['a', 'b', 'c', 'd', 'e'], k=5)
['c', 'd', 'a', 'e', 'b']
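
The two ways of shuffling differ in whether the original list is modified. A short sketch of the difference, using illustrative variable names:

```python
import random

original = ['a', 'b', 'c', 'd', 'e']

# sample() with k=len(...) returns a new shuffled list; original is untouched.
shuffled_copy = random.sample(original, k=len(original))
print(original)       # ['a', 'b', 'c', 'd', 'e']
print(shuffled_copy)  # the same elements in random order

# shuffle() reorders the list in place and returns None.
items = list(original)
result = random.shuffle(items)
print(result)         # None
print(items)          # the same elements in random order
```

random.sample() is the option when the input must stay unchanged or is immutable, such as a tuple or a string.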

Note that if k is larger than the length of the original sequence population, ValueError is raised.

>>> random.sample(['a', 'b', 'c', 'd', 'e'], k=7)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/hibikisan/ProgramFiles/py37env/lib/python3.7/random.py", line 321, in sample
    raise ValueError("Sample larger than population or is negative")
ValueError: Sample larger than population or is negative
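
When the requested count may exceed the amount of data, k can be clamped before calling random.sample(). The helper sample_up_to below is a hypothetical name used only for illustration. The last lines also show that "without replacement" applies to positions, not values: if the population itself contains equal values, they can both appear in the result.

```python
import random


def sample_up_to(population, k):
    """Hypothetical helper: sample at most k elements without raising ValueError."""
    # Clamp k into the valid range 0..len(population).
    k = max(0, min(k, len(population)))
    return random.sample(population, k)


print(len(sample_up_to(['a', 'b', 'c', 'd', 'e'], 7)))  # 5
print(sample_up_to([], 3))                              # []

# Each position is used at most once, but equal values at
# different positions can both be selected.
print(sorted(random.sample(['a', 'a', 'b'], k=3)))      # ['a', 'a', 'b']
```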