
#D-TW04 Collecting Twitter data and thinking about how to clean it, first part (How do I clean data - First part)

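(Short summary first; the full article follows)

  1. Collect as little as possible

  2. Find the data you want to delete

  3. Find its features and delete it

You can specify the language so that tweets in languages you don't need are not mixed in. You can also leave unnecessary data out of the search by putting "-" in front of it. Refining the search words themselves also gets you much closer to what you need. For this "Don't want to work" INDEX, retweets are not used in the analysis, so I removed them with a search command and restricted the language to Japanese. I feel there is still more I could do... my skills are not there yet.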




The last article covered how we collect data from Twitter and the importance of search commands, queries, and the next page token. This article is about the next step: how to clean the data we collected. It would be long, so I split it into two parts, and this is the first one. Again, I will explain it without showing code, for non-programmers. This first part covers the top part of the chart below.

  1. Try not to collect unnecessary data

  2. Find data which you want to delete

  3. Find features of data which you want to delete

Try not to collect unnecessary data
This is simple. It is better not to collect data you don't need in the first place, if that is possible… So we have to make full use of the search commands and queries I mentioned in the last article (Here). You can specify the language so that tweets in other languages are not mixed in. You can also exclude words you don't need by putting "-" in front of them. Then you also have to refine your search words to get closer to what you really want. For the "Don't want to work" INDEX, I excluded retweets, since I don't use them, and restricted the language to Japanese only.
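For readers who do program, here is a minimal sketch of how such a search query could be assembled. It assumes Twitter API v2 style operators (`-is:retweet`, `lang:ja`); the article does not say which API version was used, and the function name `build_query` and the excluded word "広告" (ad) are purely illustrative.

```python
def build_query(phrase, exclude_words=(), lang=None, exclude_retweets=True):
    """Assemble a search query string from a phrase and a few filters."""
    parts = [phrase]
    # "-word" excludes tweets containing that word
    parts += [f"-{word}" for word in exclude_words]
    if exclude_retweets:
        # assumed Twitter API v2 operator for dropping retweets
        parts.append("-is:retweet")
    if lang:
        # assumed Twitter API v2 operator for restricting the language
        parts.append(f"lang:{lang}")
    return " ".join(parts)


# Hypothetical example: the excluded word is only an illustration
query = build_query("会社行きたくない", exclude_words=["広告"], lang="ja")
print(query)
```

Keeping the query in one small function makes it easy to add another excluded word each time a new kind of unwanted tweet shows up.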

Find data which you want to delete
This is tough. Well, it is not very difficult, but there is just a lot of it. You have to read the raw data every day, delete the data you don't need, and go back to the raw data again and again until you think you have deleted most of the unnecessary data.
Let's look at examples. In the case of the "Don't want to work" INDEX, I don't know why, but there are many indecent tweets even though the search is for "Don't want to work"… There are also many ad tweets. Interestingly, resignation support services are popular in Japan, and many of their ads include "don't want to work". There are also many ads from Rakuten and Amazon, many bot tweets from spiritual accounts, and many tweets from official hotel accounts aimed at workation needs. I also had to delete news tweets and bots to get at how many people actually tweet "Don't want to work" in a day.

Find features of data which you want to delete
I use a variety of methods (soooo many) to delete data (smile). First, the "Source" column in the tweet data helps. "Source" shows the name of the service or app that generated the tweet. When you find exactly the same tweet posted many times, it was generated by a service or a bot, which you can tell from the name in the Source column. For example, some names include "bot". By the way, the important point is that you don't need to aim for 100% accuracy here. It is OK to lose a few useful tweets if that lets you delete 1,000 unnecessary ones. I just feel sorry for those few tweets… Sorry.
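For readers who do program, the Source check could look roughly like the sketch below. It assumes each tweet is a dictionary with `text` and `source` keys; the sample tweets, source names, and function names are made up for illustration and are not the actual data or code behind the INDEX.

```python
from collections import Counter


def sources_of_duplicates(tweets):
    """Count which sources produce tweets whose exact text appears more than once."""
    text_counts = Counter(t["text"] for t in tweets)
    return Counter(t["source"] for t in tweets if text_counts[t["text"]] > 1)


def drop_by_source(tweets, blocked_keywords=("bot",)):
    """Remove tweets whose source name contains any blocked keyword."""
    return [
        t for t in tweets
        if not any(k in t["source"].lower() for k in blocked_keywords)
    ]


# Hypothetical sample data
tweets = [
    {"text": "会社行きたくない", "source": "Twitter for iPhone"},
    {"text": "今日の運勢をチェック", "source": "fortune_bot"},
    {"text": "今日の運勢をチェック", "source": "fortune_bot"},
    {"text": "明日会社行きたくないな", "source": "Twitter Web App"},
]

suspects = sources_of_duplicates(tweets)   # sources worth a closer look
kept = drop_by_source(tweets)              # tweets left after the filter
print(suspects)
print(len(kept))
```

The first function only suggests which sources to inspect; deciding which keywords to block is still a manual judgment made after reading the raw data.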
Of course, the account name also helps in deleting confusing data. For instance, I deleted tweets from accounts whose names include "Company", "Amazon", or "Rakuten". I also deleted tweets with more than 4 hashtags, very short tweets, and very long tweets. Then I deleted entire accounts that post the same tweet repeatedly. Finally, I deleted tweets containing certain "termination" words. Most of the indecent tweets were removed with only a few words…
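For readers who do program, these rules could be combined roughly as in the sketch below. The account-name words and the "more than 4 hashtags" limit come from the text above; the length limits, the repeat threshold, the placeholder termination word `SPAMWORD`, and the sample tweets are assumptions for illustration only.

```python
from collections import Counter

BLOCKED_NAME_WORDS = ("company", "amazon", "rakuten")  # from the article
MAX_HASHTAGS = 4              # from the article: drop tweets with more than 4
MIN_LEN, MAX_LEN = 5, 120     # assumed limits for "very short" / "very long"
TERMINATION_WORDS = ("SPAMWORD",)  # placeholder, not the real word list


def repeat_accounts(tweets, min_repeats=2):
    """Accounts that posted exactly the same text at least min_repeats times."""
    counts = Counter((t["account"], t["text"]) for t in tweets)
    return {account for (account, _), n in counts.items() if n >= min_repeats}


def is_unwanted(tweet, repeaters):
    """Apply the deletion rules to a single tweet."""
    name = tweet["account"].lower()
    text = tweet["text"]
    return (
        any(w in name for w in BLOCKED_NAME_WORDS)
        or text.count("#") > MAX_HASHTAGS
        or not (MIN_LEN <= len(text) <= MAX_LEN)
        or tweet["account"] in repeaters
        or any(w in text for w in TERMINATION_WORDS)
    )


def clean(tweets):
    """Return only the tweets that pass every rule."""
    repeaters = repeat_accounts(tweets)
    return [t for t in tweets if not is_unwanted(t, repeaters)]


# Hypothetical sample data
tweets = [
    {"account": "taro", "text": "明日は会社行きたくない"},
    {"account": "Rakuten_shop", "text": "会社行きたくない人におすすめ"},
    {"account": "hanako", "text": "会社行きたくない #a #b #c #d #e"},
    {"account": "spam1", "text": "会社行きたくないあなたへ"},
    {"account": "spam1", "text": "会社行きたくないあなたへ"},
    {"account": "jiro", "text": "無理"},
    {"account": "saburo", "text": "会社行きたくない SPAMWORD"},
]

cleaned = clean(tweets)
print([t["account"] for t in cleaned])
```

Each rule is deliberately crude. As noted above, losing a few good tweets is acceptable if the rule removes a large amount of noise.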

I couldn't go into the details, but please repeat deleting and searching many times until you are satisfied… This is not spectacular work, but it contributes a lot to improving your analysis… Yes, it's not a sexy job, but I somehow like it, hahahaha.
