Analyzing Ieiri-san's tweets with MeCab: an interesting look at business trends

f:id:bloggertech:20180121222337p:plain

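Kazuma Ieiri (Ieiri-san) is famous. His reach on Twitter is enormous, and I would even say his tweets have the power to change the direction of a country.

There is no doubt he has had a huge influence on many entrepreneurs and businesspeople.

That got me thinking.

"How could I build that kind of reach, look at things from many angles, and keep a business running the way Ieiri-san does? I'd love to live with Ieiri-san on board as my advisor."

A dream like that won't come true so easily.

But then I thought: "If I analyze Ieiri-san's tweets from Twitter, that world-famous service, I could learn the trends and where business is heading, and the result would work as a kind of advisor."

So this time I'll run Ieiri-san's tweets through MeCab for morphological analysis, do some data wrangling, and try to get a read on business trends.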

Preparing to use the Twitter API

Before calling the Twitter API to fetch data, you need the following four pieces of information.

Consumer Key
Consumer Secret
Access Token
Access Token Secret

While logged in to Twitter, go to the following URL.

apps.twitter.com

You'll see a screen like this; click through and fill in the fields as appropriate.

f:id:bloggertech:20180119220617p:plain

f:id:bloggertech:20180119221135p:plain

Once the application is created, click the Permissions tab, select "Read, write, and direct messages", and update the settings. f:id:bloggertech:20180119221416p:plain

Finally, click the Key and access token tab and click the Create access token button.

This gives you the four keys and tokens needed to call the API:

Consumer Key
Consumer Secret
Access Token
Access Token Secret
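Since these four values are secrets, one option is to keep them out of the notebook and read them from environment variables. The following is only a minimal sketch of that idea; the environment variable names and the `load_credentials` helper are assumptions made for illustration and are not part of the original program.

```python
import os

# Hypothetical environment variable names; choose whatever fits your setup.
ENV_NAMES = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_key_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
}


def load_credentials(env=None):
    """Return the four credentials as a dict, or raise if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in ENV_NAMES.values() if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {key: env[name] for key, name in ENV_NAMES.items()}
```

The dict keys match the variable names used in the program later in this article, so the result could be unpacked into those variables.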

In my case this error appeared, but the tokens were issued without trouble, so I ignored it. f:id:bloggertech:20180119221725p:plain

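Fetching Ieiri-san's tweets with Python

This time the program is written in Python.

Here is the environment and what I used.

Python 3.6
Jupyter Notebook
MeCab (dictionary: mecab-ipadic-NEologd)
collections module (its Counter class is handy for counting how often each element occurs)

Jupyter Notebook is convenient in many ways. (I'll cover why in a separate article.)

First, here is a program that fetches Ieiri-san's tweets and retweets, joins them, and stores the resulting string in a variable.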

# coding: utf-8
import json
import re
import time
from collections import Counter

import MeCab
from requests_oauthlib import OAuth1Session

# API keys and tokens
access_token = '[Access Token]'
access_token_secret = '[Access Token Secret]'
consumer_key = '[Consumer Key]'
consumer_key_secret = '[Consumer Secret]'

# URL for fetching a user's timeline
# Reference: https://developer.twitter.com/en/docs/api-reference-index
url = "https://api.twitter.com/1.1/statuses/user_timeline.json"

# Request parameters
params = {
    'screen_name': 'hbkr',
    'exclude_replies': True,
    'include_rts': True,
    'count': 200
    }

# Authenticate against the API
twitter = OAuth1Session(consumer_key, consumer_key_secret, access_token, access_token_secret)

timeline_list = []
for _ in range(100):
    res = twitter.get(url, params=params)

    if res.status_code != 200:
        # Stop instead of repeating a failing request
        print("Request failed: " + str(res.status_code))
        break

    # Remaining API calls in the current rate-limit window (the header value is a string)
    limit = res.headers['x-rate-limit-remaining']
    print("API remain: " + limit)
    if int(limit) <= 1:
        time.sleep(60 * 15)

    timeline = json.loads(res.text)
    if not timeline:
        # An empty page means there are no older tweets left
        break

    # Print and store the text of each tweet
    for tweet in timeline:
        print(tweet['text'] + '\n')
        timeline_list.append(tweet['text'])

    # Next time, request only tweets older than the last one on this page
    params['max_id'] = timeline[-1]['id'] - 1

# At this point the list holds the tweets and retweets
timeline_list

# Regex-based cleanup function
def regex(text):
    # Dates and times
    text = re.sub(r'(\d{4})(-|\.|/)(\d{1,2})(-|\.|/)(\d{1,2})', "", text) # 2017/12/31, 2017-12-31, 2017.12.31
    text = re.sub(r'(\d{1,2})/(\d{1,2})', "", text) # 12/31

    text = re.sub(r'\d{4}年', "", text) # 2017年
    text = re.sub(r'\d{1,2}月', "", text) # 12月
    text = re.sub(r'\d{1,2}日(\s?\((月|火|水|木|金|土|日)\))?', "", text) # 31日(水), 31日 (水); the weekday part is optional
    text = re.sub(r'\d{1,2}日', "", text) # 31日

    text = re.sub(r'\d{1,2}:\d{1,2}', "", text) # 10:21
    text = re.sub(r'\d{1,2}時\d{1,2}分', "", text) # 10時21分
    # Amounts of money
    text = re.sub(r'(?<=[0-9]),', '', text) # first remove commas that follow a digit (1,000 -> 1000)
    text = re.sub(r'(¥|\$|€)[0-9]{1,}\s?', "", text) # ¥1000, $10, €10
    text = re.sub(r'[0-9]{1,}\s?(円|ドル|ユーロ)', "", text) # 1000円, 10ドル, 10ユーロ
    # URL
    text = re.sub(r'(https?|ftp)(:\/\/[-_\.!~*\'()a-zA-Z0-9;\/?:\@&=\+\$,%#]+)', "", text) # URL
    return text

# Clean up each tweet with the regexes
for i, tweet_text in enumerate(timeline_list):
    timeline_list[i] = regex(tweet_text)
    print(timeline_list[i])

# Join the list into one string and store it in a variable
ieiri_all_text = ''.join(timeline_list)

# Remove words that occur often but carry no meaning for this analysis
ieiri_all_text = ieiri_all_text.replace('家入一真', '')
ieiri_all_text = ieiri_all_text.replace('家入さん', '')
ieiri_all_text = ieiri_all_text.replace('RT', '')
ieiri_all_text = ieiri_all_text.replace('さん', '')

# This is the finished variable
ieiri_all_text

I used the following article as a reference.

www.randpy.tokyo
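One side effect of removing the literal string 'RT' is that it also cuts into any word that happens to contain those letters. As a possible alternative, which is not what the program above does, the retweet marker could be stripped only when it appears as a prefix. The sketch below assumes retweets begin with "RT @user: " and uses a deliberately simplified URL pattern; `strip_noise` is a hypothetical helper name.

```python
import re

# Matches a leading retweet marker such as "RT @someone: " (assumed format)
RT_PREFIX = re.compile(r'^RT @\w+:\s*')
# Simplified URL pattern for this sketch: it stops at the first whitespace,
# so it would also swallow text glued directly onto a URL.
URL = re.compile(r'https?://\S+')


def strip_noise(tweet):
    """Remove a leading retweet marker and any URLs from one tweet."""
    tweet = RT_PREFIX.sub('', tweet)
    tweet = URL.sub('', tweet)
    return tweet.strip()


# Plain replace cuts into ordinary words; the prefix pattern leaves them alone.
damaged = "START".replace('RT', '')
cleaned = strip_noise("RT @hbkr: START here https://t.co/abc")
```

Here `damaged` ends up as "STA", while `cleaned` keeps the word "START" intact.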

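It also points out the following caveat.

According to the GET statuses/user_timeline documentation, you can only retrieve up to 3,200 of a user's most recent tweets, retweets included. If you then exclude retweets, you end up with roughly 2,000.
[Source: http://www.randpy.tokyo/entry/python_wordcloud]

For reference, this run returned 2,923 tweets and retweets. Presumably the count falls short of 3,200 because replies were excluded.

For installing MeCab and related tools, see the following article.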
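The paging that reaches back through the timeline relies on `max_id`: each request asks only for tweets older than the last one already received. The sketch below isolates that logic and runs it against a fake in-memory timeline instead of the real API, so `fetch_all`, `fake_fetch_page`, and the 450-tweet data set are all assumptions for illustration.

```python
def fetch_all(fetch_page, page_size=200, max_pages=100):
    """Collect tweets page by page, walking backwards with max_id.

    fetch_page(max_id, count) is a hypothetical callable that returns up to
    `count` tweet dicts, newest first, whose 'id' is <= max_id
    (or the newest tweets when max_id is None).
    """
    tweets = []
    max_id = None
    for _ in range(max_pages):
        page = fetch_page(max_id, page_size)
        if not page:
            break  # nothing older is left
        tweets.extend(page)
        # Ask only for tweets strictly older than the last one received
        max_id = page[-1]['id'] - 1
    return tweets


# Stand-in for the API: 450 fake tweets with ids 450..1, newest first.
FAKE_TIMELINE = [{'id': i, 'text': 'tweet %d' % i} for i in range(450, 0, -1)]


def fake_fetch_page(max_id, count):
    visible = [t for t in FAKE_TIMELINE if max_id is None or t['id'] <= max_id]
    return visible[:count]


all_tweets = fetch_all(fake_fetch_page)
```

Subtracting 1 from the last id matters because `max_id` is inclusive; without it the last tweet of each page would be fetched again on the next request.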

blogtech.hatenablog.com

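Now let's analyze the variable that holds the tweet and retweet text.

The flow looks like this.

Run morphological analysis with MeCab
Extract only the nouns
Clean up the data with regexes and similar tools
Output the words in order of frequency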

# Morphological analysis with MeCab
def get_mecab_results(text):
    '''
        Example input:
        ラーメン食べたい。
    '''
    mecab_res_dicts = []

    tagger = MeCab.Tagger("-Ochasen")
    tagger.parse('') # Note: parse an empty string first; otherwise the results are not retrieved correctly (a known bug).
    node = tagger.parseToNode(text)

    # Collect the analysis results as a list of dicts
    while node:
        mecab_res_dicts.append({
            'surface': node.surface, # surface form ( 'ラーメン' )
            'feature': node.feature.split(','), # analysis fields ( ['名詞','固有名詞','一般','*','*','*','ラーメン','ラーメン','ラーメン'] )
        })

        # Move on to the next node
        node = node.next

    return mecab_res_dicts[1:-1] # Drop the BOS and EOS nodes (return elements 1 through second-to-last)

mecab_ieiri_dicts = get_mecab_results(ieiri_all_text)

# Base forms of the nouns only
noun_list = []
# (surface, reading, pronunciation) tuples for every noun
noun_zen_list = []
for item in mecab_ieiri_dicts:
    feature = item['feature']
    # Indexes 6 to 8 (base form, reading, pronunciation) are all needed, so require 9 fields
    if len(feature) > 8 and feature[0] == "名詞":
        noun_zen_list.append((item['surface'], feature[7], feature[8]))
        noun_list.append(feature[6])

counter = Counter(noun_list)
# Sort by number of occurrences with Counter.most_common()
counter_common = counter.most_common()
# This holds the information we are after
counter_common

Done.
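The length check on the feature list matters because not every token carries all nine fields. The sketch below illustrates this with hand-written samples in the IPADIC-style field layout (part of speech, three sub-categories, conjugation type, conjugation form, base form, reading, pronunciation). The samples are not captured MeCab output, and `noun_base_forms` is a hypothetical variation: unlike the program above, which skips short entries, it falls back to the surface form.

```python
# Hand-written (surface, feature) pairs in the IPADIC-style layout.
# Illustrative samples only, not actual MeCab output.
SAMPLE_NODES = [
    ('ラーメン', '名詞,一般,*,*,*,*,ラーメン,ラーメン,ラーメン'),
    ('食べ', '動詞,自立,*,*,一段,連用形,食べる,タベ,タベ'),
    ('たい', '助動詞,*,*,*,特殊・タイ,基本形,たい,タイ,タイ'),
    ('hoge', '名詞,一般,*,*,*,*,*'),  # unknown word: only 7 fields, no reading
]


def noun_base_forms(nodes):
    """Return the base form of each noun, using the surface when it is missing."""
    result = []
    for surface, feature_csv in nodes:
        feature = feature_csv.split(',')
        if feature[0] != '名詞':
            continue
        if len(feature) > 6 and feature[6] != '*':
            result.append(feature[6])
        else:
            result.append(surface)
    return result


sample_nouns = noun_base_forms(SAMPLE_NODES)
```

Whether unknown words should be kept or dropped is a judgment call; keeping them can surface new service names, at the cost of more noise.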

Finding trends and business opportunities in Ieiri-san's tweets

Here is what counter_common contains. f:id:bloggertech:20180121215633p:plain
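For readers without the screenshot, the shape of this result can be reproduced on a toy list: `Counter.most_common()` returns `(word, count)` tuples sorted from most to least frequent. The sample words below are made up for illustration and are not the real counts.

```python
from collections import Counter

# Toy stand-in for noun_list; the real list comes from the MeCab step.
sample_nouns = ['支援', 'CAMPFIRE', '支援', '社会', 'CAMPFIRE', '支援']

sample_common = Counter(sample_nouns).most_common()
# sample_common is [('支援', 3), ('CAMPFIRE', 2), ('社会', 1)]

top_two = Counter(sample_nouns).most_common(2)
# top_two is [('支援', 3), ('CAMPFIRE', 2)]
```

Passing a number to `most_common` limits the output to that many entries, which is handy when only the top of the ranking is of interest.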

I see. Unsurprisingly, words like these show up a lot:

CAMPFIRE
Crowd Funding
支援 (support)

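My impression was that Ieiri-san's services and statements are mostly about his companies and about contributing to society. Since the word 社会 (society) ranks near the top, that impression seems to have been right.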

f:id:bloggertech:20180121220231p:plain

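ポルカ (polca)
声 (voice)
挑戦 (challenge)

Here too we can see words related to services that became popular recently. RIVa presumably refers to リバ邸, the "refuge" house.

As a co-founding director of BASE, he also seems to tweet about it often.

Up to this point most of the words were ones Ieiri-san is personally involved with, but the words with somewhat lower frequency look interesting as business opportunities. f:id:bloggertech:20180121220812p:plain f:id:bloggertech:20180121221154p:plain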
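To look at that lower-frequency band without scrolling through the whole ranking, the `(word, count)` list can be filtered by count. This is only a sketch: `words_in_count_range`, the sample data, and the thresholds are assumptions chosen for illustration, not values taken from the actual result.

```python
def words_in_count_range(common, low, high):
    """Return (word, count) pairs with low <= count <= high, keeping the order."""
    return [(word, count) for word, count in common if low <= count <= high]


# Made-up ranking in the same shape as counter_common
sample_common = [('CAMPFIRE', 120), ('支援', 95), ('社会', 40), ('挑戦', 12), ('声', 9), ('地方', 3)]

mid_band = words_in_count_range(sample_common, 5, 50)
```

With these sample values, `mid_band` contains 社会, 挑戦, and 声, leaving out both the dominant terms and the near one-off words.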