Co-occurrence of Anime Genres with Chord Diagrams

Preamble

In [1]:
import numpy as np                   # for multi-dimensional containers 
import pandas as pd                  # for DataFrames
import itertools
from ast import literal_eval
from chord import Chord

Introduction

In this section, we're going to use the MyAnimeList dataset to visualise the co-occurrence of anime genres.

The Dataset

The dataset documentation states that we can expect 31 variables for each of the 14478 entries. Let's download the mirrored dataset and have a look for ourselves.

In [2]:
data_url = 'https://shahinrostami.com/datasets/anime_list.csv'
data = pd.read_csv(data_url)
data.head()
Out[2]:
anime_id title title_english title_japanese title_synonyms image_url type source episodes status ... background premiered broadcast related producer licensor studio genre opening_theme ending_theme
0 11013 Inu x Boku SS Inu X Boku Secret Service 妖狐×僕SS Youko x Boku SS https://myanimelist.cdn-dena.com/images/anime/... TV Manga 12 Finished Airing ... Inu x Boku SS was licensed by Sentai Filmworks... Winter 2012 Fridays at Unknown {'Adaptation': [{'mal_id': 17207, 'type': 'man... Aniplex, Square Enix, Mainichi Broadcasting Sy... Sentai Filmworks David Production Comedy, Supernatural, Romance, Shounen ['"Nirvana" by MUCC'] ['#1: "Nirvana" by MUCC (eps 1, 11-12)', '#2: ...
1 2104 Seto no Hanayome My Bride is a Mermaid 瀬戸の花嫁 The Inland Sea Bride https://myanimelist.cdn-dena.com/images/anime/... TV Manga 26 Finished Airing ... NaN Spring 2007 Unknown {'Adaptation': [{'mal_id': 759, 'type': 'manga... TV Tokyo, AIC, Square Enix, Sotsu Funimation Gonzo Comedy, Parody, Romance, School, Shounen ['"Romantic summer" by SUN&LUNAR'] ['#1: "Ashita e no Hikari (明日への光)" by Asuka Hi...
2 5262 Shugo Chara!! Doki Shugo Chara!! Doki しゅごキャラ!!どきっ Shugo Chara Ninenme, Shugo Chara! Second Year https://myanimelist.cdn-dena.com/images/anime/... TV Manga 51 Finished Airing ... NaN Fall 2008 Unknown {'Adaptation': [{'mal_id': 101, 'type': 'manga... TV Tokyo, Sotsu NaN Satelight Comedy, Magic, School, Shoujo ['#1: "Minna no Tamago (みんなのたまご)" by Shugo Cha... ['#1: "Rottara Rottara (ロッタラ ロッタラ)" by Buono! ...
3 721 Princess Tutu Princess Tutu プリンセスチュチュ NaN https://myanimelist.cdn-dena.com/images/anime/... TV Original 38 Finished Airing ... Princess Tutu aired in two parts. The first pa... Summer 2002 Fridays at Unknown {'Adaptation': [{'mal_id': 1581, 'type': 'mang... Memory-Tech, GANSIS, Marvelous AQL ADV Films Hal Film Maker Comedy, Drama, Magic, Romance, Fantasy ['"Morning Grace" by Ritsuko Okazaki'] ['"Watashi No Ai Wa Chiisaikeredo" by Ritsuko ...
4 12365 Bakuman. 3rd Season Bakuman. バクマン。 Bakuman Season 3 https://myanimelist.cdn-dena.com/images/anime/... TV Manga 25 Finished Airing ... NaN Fall 2012 Unknown {'Adaptation': [{'mal_id': 9711, 'type': 'mang... NHK, Shueisha NaN J.C.Staff Comedy, Drama, Romance, Shounen ['#1: "Moshimo no Hanashi (もしもの話)" by nano.RIP... ['#1: "Pride on Everyday" by Sphere (eps 1-13)...

5 rows × 31 columns

It looks good so far, but let's confirm the 31 variables against 14478 samples from the documentation.

In [3]:
data.shape
Out[3]:
(14478, 31)

Perfect, that's exactly what we were expecting.

Data Wrangling

We need to do a bit of data wrangling before we can visualise our data. We can see from the column names that there's a single column for genres, containing comma-separated values.

Let's convert them to lists of strings.

In [4]:
def get_list(x):
    if isinstance(x, int):
        return []
    if isinstance(x,str):
        result = [s.strip() for s in x.split(',')]
        return sorted(result)

    return []
In [5]:
genres = data['genre'].apply(get_list)
pd.DataFrame(genres)
Out[5]:
genre
0 [Comedy, Romance, Shounen, Supernatural]
1 [Comedy, Parody, Romance, School, Shounen]
2 [Comedy, Magic, School, Shoujo]
3 [Comedy, Drama, Fantasy, Magic, Romance]
4 [Comedy, Drama, Romance, Shounen]
... ...
14473 [Kids]
14474 [Comedy]
14475 [Action, Adventure, Fantasy, Sci-Fi]
14476 [Fantasy, Kids]
14477 [Comedy]

14478 rows × 1 columns

Even without further investigation, the table above shows a few single-entry lists, and the dataset also contains empty lists, []. Neither can form a genre pair, so let's remove every sample whose list has fewer than two entries.

In [6]:
genres = genres[genres.str.len() > 1]
pd.DataFrame(genres)
Out[6]:
genre
0 [Comedy, Romance, Shounen, Supernatural]
1 [Comedy, Parody, Romance, School, Shounen]
2 [Comedy, Magic, School, Shoujo]
3 [Comedy, Drama, Fantasy, Magic, Romance]
4 [Comedy, Drama, Romance, Shounen]
... ...
14467 [Drama, Kids]
14469 [Kids, School]
14471 [Drama, Fantasy, Kids]
14475 [Action, Adventure, Fantasy, Sci-Fi]
14476 [Fantasy, Kids]

10974 rows × 1 columns
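
The filter above leans on the pandas .str.len() accessor, which measures the length of list elements as well as strings. A minimal sketch on a made-up Series (the three sample lists and the names sample, lengths and kept are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical sample: one multi-genre entry, one single-genre entry, one empty list.
sample = pd.Series([["Comedy", "Romance"], ["Kids"], []])

# .str.len() returns the number of items in each list.
lengths = sample.str.len()

# Keep only the entries that can produce at least one genre pair.
kept = sample[lengths > 1]
print(lengths.tolist())
print(kept.tolist())
```

Only the first entry survives, which mirrors how the 14478 rows shrink to 10974 above.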

Our chord diagram will need two inputs: the co-occurrence matrix, and a list of names to label the segments.

We can build a co-occurrence matrix with the following approach. We'll start by getting all combinations within each list.

In [7]:
genres = [list(itertools.combinations(i,2)) for i in genres]
pd.DataFrame(genres)
Out[7]:
0 1 2 3 4 5 6 7 8 9 ... 68 69 70 71 72 73 74 75 76 77
0 (Comedy, Romance) (Comedy, Shounen) (Comedy, Supernatural) (Romance, Shounen) (Romance, Supernatural) (Shounen, Supernatural) None None None None ... None None None None None None None None None None
1 (Comedy, Parody) (Comedy, Romance) (Comedy, School) (Comedy, Shounen) (Parody, Romance) (Parody, School) (Parody, Shounen) (Romance, School) (Romance, Shounen) (School, Shounen) ... None None None None None None None None None None
2 (Comedy, Magic) (Comedy, School) (Comedy, Shoujo) (Magic, School) (Magic, Shoujo) (School, Shoujo) None None None None ... None None None None None None None None None None
3 (Comedy, Drama) (Comedy, Fantasy) (Comedy, Magic) (Comedy, Romance) (Drama, Fantasy) (Drama, Magic) (Drama, Romance) (Fantasy, Magic) (Fantasy, Romance) (Magic, Romance) ... None None None None None None None None None None
4 (Comedy, Drama) (Comedy, Romance) (Comedy, Shounen) (Drama, Romance) (Drama, Shounen) (Romance, Shounen) None None None None ... None None None None None None None None None None
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10969 (Drama, Kids) None None None None None None None None None ... None None None None None None None None None None
10970 (Kids, School) None None None None None None None None None ... None None None None None None None None None None
10971 (Drama, Fantasy) (Drama, Kids) (Fantasy, Kids) None None None None None None None ... None None None None None None None None None None
10972 (Action, Adventure) (Action, Fantasy) (Action, Sci-Fi) (Adventure, Fantasy) (Adventure, Sci-Fi) (Fantasy, Sci-Fi) None None None None ... None None None None None None None None None None
10973 (Fantasy, Kids) None None None None None None None None None ... None None None None None None None None None None

10974 rows × 78 columns
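
The 78 columns follow from the longest genre list: a list of n genres yields n(n-1)/2 pairs, and 78 pairs corresponds to n = 13. A quick check using only the standard library (widest_row, longest and pairs are illustrative names):

```python
import itertools
import math

# Number of unordered pairs from a list of n items is C(n, 2) = n * (n - 1) / 2.
widest_row = 78  # column count reported for the table above

# Find the list length whose pair count matches the widest row.
longest = next(n for n in range(2, 50) if math.comb(n, 2) == widest_row)

# Cross-check by enumerating the pairs of a list of that length.
pairs = list(itertools.combinations(range(longest), 2))
print(longest, len(pairs))
```

So at least one anime in the filtered data carries 13 genres, and every shorter list is padded with None when displayed as a DataFrame.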

Now we will flatten the nested lists, adding each genre pairing in both its original and reversed order.

In [8]:
genres = list(itertools.chain.from_iterable((i, i[::-1]) for c_ in genres for i in c_))
pd.DataFrame(genres)
Out[8]:
0 1
0 Comedy Romance
1 Romance Comedy
2 Comedy Shounen
3 Shounen Comedy
4 Comedy Supernatural
... ... ...
119691 Sci-Fi Adventure
119692 Fantasy Sci-Fi
119693 Sci-Fi Fantasy
119694 Fantasy Kids
119695 Kids Fantasy

119696 rows × 2 columns

We can now use these pairings to create the matrix.

In [9]:
matrix = pd.pivot_table(
    pd.DataFrame(genres), index=0, columns=1, aggfunc="size", fill_value=0
).values.tolist()

We can list this using a DataFrame for better presentation.

In [10]:
pd.DataFrame(matrix)
Out[10]:
0 1 2 3 4 5 6 7 8 9 ... 33 34 35 36 37 38 39 40 41 42
0 0 1022 34 985 12 165 550 201 860 103 ... 13 38 232 141 390 475 32 56 3 2
1 1022 0 22 994 6 97 428 67 1054 80 ... 7 70 134 39 136 229 6 10 0 0
2 34 22 0 11 0 0 12 0 1 8 ... 0 2 1 36 0 1 0 0 0 0
3 985 994 11 0 28 103 499 500 952 65 ... 41 835 79 246 249 439 11 52 12 11
4 12 6 0 28 0 1 21 3 18 1 ... 1 1 2 1 0 16 4 0 0 0
5 165 97 0 103 1 0 41 24 166 8 ... 0 3 1 1 25 194 4 11 0 1
6 550 428 12 499 21 41 0 54 369 19 ... 37 336 137 148 58 236 36 23 23 1
7 201 67 0 500 3 24 54 0 136 11 ... 0 35 10 22 45 97 0 13 0 8
8 860 1054 1 952 18 166 369 136 0 101 ... 11 132 17 11 110 360 7 33 0 3
9 103 80 8 65 1 8 19 11 101 0 ... 0 20 3 7 1 14 6 0 0 0
10 73 19 0 225 0 18 53 156 74 1 ... 1 20 6 1 10 55 0 11 0 2
11 41 15 0 52 2 57 28 0 68 0 ... 0 1 4 3 5 63 0 3 11 32
12 224 214 1 195 3 56 335 7 184 4 ... 7 82 3 6 15 135 3 3 2 0
13 139 62 1 71 30 74 88 14 94 2 ... 3 1 11 2 10 201 21 35 1 0
14 17 8 0 39 0 4 33 0 18 3 ... 3 29 0 3 0 17 0 3 0 0
15 134 476 23 596 1 22 224 3 588 34 ... 0 122 19 35 36 48 0 5 0 0
16 316 278 0 407 0 57 131 89 538 19 ... 4 60 5 0 52 168 7 7 0 0
17 236 103 0 101 1 20 41 26 72 3 ... 1 15 2 20 86 26 0 0 0 1
18 623 302 10 224 8 4 188 35 73 16 ... 0 13 182 13 44 22 1 0 1 0
19 304 90 0 61 3 14 193 19 52 5 ... 0 9 137 4 18 31 2 6 1 1
20 64 35 3 137 51 4 114 9 89 7 ... 3 113 28 19 6 23 0 5 3 1
21 206 154 0 210 11 19 140 10 79 14 ... 5 24 6 3 44 188 48 23 0 0
22 88 27 2 453 8 6 6 40 55 15 ... 3 29 11 11 44 18 1 3 1 2
23 105 73 5 114 2 1 29 6 3 0 ... 1 5 3 5 1 11 9 1 0 0
24 64 24 0 32 29 6 117 6 32 16 ... 4 12 7 2 2 74 38 0 1 0
25 277 244 2 838 6 64 606 243 306 13 ... 45 241 45 39 32 224 9 24 21 3
26 114 38 0 53 0 10 39 8 23 0 ... 0 1 0 0 11 21 0 1 2 0
27 227 43 1 831 3 25 238 209 125 29 ... 13 361 7 135 62 140 4 16 3 6
28 1143 695 10 676 20 32 464 110 262 40 ... 6 72 377 41 143 98 18 6 3 2
29 256 105 21 373 2 20 144 90 65 12 ... 0 153 24 40 26 102 16 13 0 3
30 76 77 0 242 1 31 191 0 205 1 ... 15 119 1 23 9 66 1 14 0 0
31 13 1 0 33 1 0 20 14 9 0 ... 0 22 0 0 2 4 0 0 0 2
32 809 675 19 963 1 73 266 114 431 63 ... 0 85 63 276 183 249 7 24 0 0
33 13 7 0 41 1 0 37 0 11 0 ... 0 10 0 2 1 16 1 4 1 0
34 38 70 2 835 1 3 336 35 132 20 ... 10 0 7 39 8 93 1 1 2 0
35 232 134 1 79 2 1 137 10 17 3 ... 0 7 0 3 9 7 0 0 0 0
36 141 39 36 246 1 1 148 22 11 7 ... 2 39 3 0 11 3 0 0 2 0
37 390 136 0 249 0 25 58 45 110 1 ... 1 8 9 11 0 90 3 8 0 0
38 475 229 1 439 16 194 236 97 360 14 ... 16 93 7 3 90 0 39 86 2 0
39 32 6 0 11 4 4 36 0 7 6 ... 1 1 0 0 3 39 0 1 0 0
40 56 10 0 52 0 11 23 13 33 0 ... 4 1 0 0 8 86 1 0 0 0
41 3 0 0 12 0 0 23 0 0 0 ... 1 2 0 2 0 2 0 0 0 0
42 2 0 0 11 0 1 1 8 3 0 ... 0 0 0 0 0 0 0 0 0 0

43 rows × 43 columns
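
To see the whole pipeline on something small enough to check by hand, here is a sketch that repeats the three steps (combinations, mirrored pairs, pivot table) on three made-up genre lists. The data and the names toy, pairs, both and toy_matrix are illustrative only:

```python
import itertools
import pandas as pd

# Three made-up genre lists, already sorted as in the text.
toy = [["Action", "Comedy"], ["Action", "Comedy", "Drama"], ["Comedy", "Drama"]]

# Every unordered pair within each list.
pairs = [p for genres in toy for p in itertools.combinations(genres, 2)]

# Add each pair in both directions so the counts end up mirrored.
both = list(itertools.chain.from_iterable((p, p[::-1]) for p in pairs))

# Count how often each (row, column) combination occurs.
toy_matrix = pd.pivot_table(
    pd.DataFrame(both), index=0, columns=1, aggfunc="size", fill_value=0
)
print(toy_matrix)
```

Action and Comedy appear together twice, Action and Drama once, and Comedy and Drama twice, so the result is symmetric with zeros on the diagonal, just like the 43 × 43 matrix above.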

Now for the names of our genres.

In [11]:
names = np.unique(genres).tolist()
pd.DataFrame(names)
Out[11]:
0
0 Action
1 Adventure
2 Cars
3 Comedy
4 Dementia
5 Demons
6 Drama
7 Ecchi
8 Fantasy
9 Game
10 Harem
11 Hentai
12 Historical
13 Horror
14 Josei
15 Kids
16 Magic
17 Martial Arts
18 Mecha
19 Military
20 Music
21 Mystery
22 Parody
23 Police
24 Psychological
25 Romance
26 Samurai
27 School
28 Sci-Fi
29 Seinen
30 Shoujo
31 Shoujo Ai
32 Shounen
33 Shounen Ai
34 Slice of Life
35 Space
36 Sports
37 Super Power
38 Supernatural
39 Thriller
40 Vampire
41 Yaoi
42 Yuri
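
With the names in hand, cells of the matrix can be read by genre rather than by position. A small sketch using the first four names and the top-left corner of the matrix shown earlier, assuming (as the approach above does) that the pivot table's rows and columns follow the same alphabetical order as names. The names names_head, block and labelled are illustrative:

```python
import pandas as pd

# First four genre names and the top-left 4x4 block of the matrix shown earlier.
names_head = ["Action", "Adventure", "Cars", "Comedy"]
block = [
    [0, 1022, 34, 985],
    [1022, 0, 22, 994],
    [34, 22, 0, 11],
    [985, 994, 11, 0],
]

# Attach the names as row and column labels so cells can be read by genre.
labelled = pd.DataFrame(block, index=names_head, columns=names_head)
action_comedy = labelled.loc["Action", "Comedy"]
print(action_comedy)
```

Under that ordering assumption, Action and Comedy co-occur in 985 entries.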

We may wish to remove some genres from our visualisation. The example below removes six genres from the co-occurrence matrix and the list of names; to discard others, add their names to the discarded_categories list.

In [12]:
matrix = pd.DataFrame(matrix)
names = pd.DataFrame(names)

discarded_categories = ["Hentai", "Yaoi", "Yuri", "Ecchi",
                        "Shounen Ai", "Shoujo Ai"]

discard_mask = names.isin(discarded_categories).values
discard_indices = names[discard_mask].index

for drop_idx in discard_indices:
    matrix = matrix.drop(drop_idx, axis=1)
    matrix = matrix.drop(drop_idx, axis=0)
    names = names.drop(drop_idx, axis=0)   
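
An alternative sketch of the same idea, assuming the matrix is first labelled with the genre names so that rows and columns can be dropped by label in one call. The 3 × 3 counts and the names toy_names, toy, discarded, trimmed and kept_names are made up for illustration:

```python
import pandas as pd

# Made-up 3x3 co-occurrence counts (illustrative, not from the dataset).
toy_names = ["Action", "Comedy", "Hentai"]
toy = pd.DataFrame(
    [[0, 5, 1], [5, 0, 2], [1, 2, 0]], index=toy_names, columns=toy_names
)

discarded = ["Hentai"]

# Drop the same labels from rows and columns in one call;
# errors="ignore" skips names that are not present.
trimmed = toy.drop(index=discarded, columns=discarded, errors="ignore")
kept_names = trimmed.index.tolist()
print(trimmed)
```

Because the labels travel with the data, the list of names never has to be kept in step with the matrix by hand.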

Chord Diagram

Time to visualise the co-occurrence of genres using a chord diagram. We are going to use a list of custom colours that represent the genres.

In [13]:
colors = ["#660000", "#734139", "#e59173", "#ff4400", "#332b26", "#593000",
          "#998773", "#d97400", "#8c5e00", "#f2ca79", "#ffcc00", "#59562d",
          "#736b00", "#c2cc33", "#245900", "#8cff40", "#269926", "#ace6ac",
          "#40ffa6", "#336655", "#008c5e", "#39e6da", "#ace6e2", "#566d73",
          "#39c3e6", "#1d5673", "#3d9df2", "#163159", "#acc3e6", "#000f73",
          "#565a73", "#000033", "#8273e6", "#6d00cc", "#633366", "#e2ace6",
          "#f23de6", "#cc0088", "#590024", "#cc0036", "#f27999", "#e6acb4"];

Finally, we can put it all together.

In [14]:
Chord(
    matrix.values.tolist(),
    names[0].tolist(),  # flat list of labels rather than a list of one-item lists
    padding=0.03,
    colors=colors,
    wrap_labels=False,
    margin=80,
    font_size="14px",
    font_size_large="14px",
    credit=True,
    noun = "Anime"
).show()
Chord Diagram

Conclusion

In this section, we demonstrated how to conduct some data wrangling on a downloaded dataset to prepare it for a chord diagram. Our chord diagram is interactive, so you can use your mouse or touchscreen to investigate the co-occurrences!

Co-occurrence of Movie Genres with Chord Diagrams

Preamble

In [1]:
import numpy as np                   # for multi-dimensional containers 
import pandas as pd                  # for DataFrames
import itertools
from ast import literal_eval
from chord import Chord

Introduction

In this section, we're going to use the TMDB 5000 Movie Dataset to visualise the co-occurrence of movie genres.

The Dataset

The dataset documentation states that we can expect 20 variables for each of the 4803 movies. Let's download the mirrored dataset and have a look for ourselves.

In [2]:
data_url = 'https://shahinrostami.com/datasets/tmdb_5000_movies.csv'
data = pd.read_csv(data_url)
data.head()
Out[2]:
budget genres homepage id keywords original_language original_title overview popularity production_companies production_countries release_date revenue runtime spoken_languages status tagline title vote_average vote_count
0 237000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... http://www.avatarmovie.com/ 19995 [{"id": 1463, "name": "culture clash"}, {"id":... en Avatar In the 22nd century, a paraplegic Marine is di... 150.437577 [{"name": "Ingenious Film Partners", "id": 289... [{"iso_3166_1": "US", "name": "United States o... 2009-12-10 2787965087 162.0 [{"iso_639_1": "en", "name": "English"}, {"iso... Released Enter the World of Pandora. Avatar 7.2 11800
1 300000000 [{"id": 12, "name": "Adventure"}, {"id": 14, "... http://disney.go.com/disneypictures/pirates/ 285 [{"id": 270, "name": "ocean"}, {"id": 726, "na... en Pirates of the Caribbean: At World's End Captain Barbossa, long believed to be dead, ha... 139.082615 [{"name": "Walt Disney Pictures", "id": 2}, {"... [{"iso_3166_1": "US", "name": "United States o... 2007-05-19 961000000 169.0 [{"iso_639_1": "en", "name": "English"}] Released At the end of the world, the adventure begins. Pirates of the Caribbean: At World's End 6.9 4500
2 245000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... http://www.sonypictures.com/movies/spectre/ 206647 [{"id": 470, "name": "spy"}, {"id": 818, "name... en Spectre A cryptic message from Bond’s past sends him o... 107.376788 [{"name": "Columbia Pictures", "id": 5}, {"nam... [{"iso_3166_1": "GB", "name": "United Kingdom"... 2015-10-26 880674609 148.0 [{"iso_639_1": "fr", "name": "Fran\u00e7ais"},... Released A Plan No One Escapes Spectre 6.3 4466
3 250000000 [{"id": 28, "name": "Action"}, {"id": 80, "nam... http://www.thedarkknightrises.com/ 49026 [{"id": 849, "name": "dc comics"}, {"id": 853,... en The Dark Knight Rises Following the death of District Attorney Harve... 112.312950 [{"name": "Legendary Pictures", "id": 923}, {"... [{"iso_3166_1": "US", "name": "United States o... 2012-07-16 1084939099 165.0 [{"iso_639_1": "en", "name": "English"}] Released The Legend Ends The Dark Knight Rises 7.6 9106
4 260000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... http://movies.disney.com/john-carter 49529 [{"id": 818, "name": "based on novel"}, {"id":... en John Carter John Carter is a war-weary, former military ca... 43.926995 [{"name": "Walt Disney Pictures", "id": 2}] [{"iso_3166_1": "US", "name": "United States o... 2012-03-07 284139100 132.0 [{"iso_639_1": "en", "name": "English"}] Released Lost in our world, found in another. John Carter 6.1 2124

It looks good so far, but let's confirm the 20 variables against 4803 samples from the documentation.

In [3]:
data.shape
Out[3]:
(4803, 20)

Perfect, that's exactly what we were expecting.

Data Wrangling

We need to do a bit of data wrangling before we can visualise our data. We can see from the column names that there's a single column for genres, containing a string representation of a list of dictionaries.

The first thing we need to do is evaluate these from strings into a type we can work with.

In [4]:
genres = data['genres'].apply(literal_eval)
genres
Out[4]:
0       [{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...
1       [{'id': 12, 'name': 'Adventure'}, {'id': 14, '...
2       [{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...
3       [{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...
4       [{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...
                              ...                        
4798    [{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...
4799    [{'id': 35, 'name': 'Comedy'}, {'id': 10749, '...
4800    [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...
4801                                                   []
4802                  [{'id': 99, 'name': 'Documentary'}]
Name: genres, Length: 4803, dtype: object

The genres are now in lists of dictionaries. Let's convert them to lists of strings.

In [5]:
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        return sorted(names)

    return []
In [6]:
genres = genres.apply(get_list)
pd.DataFrame(genres)
Out[6]:
genres
0 [Action, Adventure, Fantasy, Science Fiction]
1 [Action, Adventure, Fantasy]
2 [Action, Adventure, Crime]
3 [Action, Crime, Drama, Thriller]
4 [Action, Adventure, Science Fiction]
... ...
4798 [Action, Crime, Thriller]
4799 [Comedy, Romance]
4800 [Comedy, Drama, Romance, TV Movie]
4801 []
4802 [Documentary]

4803 rows × 1 columns
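
The two conversion steps can be traced on a single value. The sketch below uses a shortened, made-up string in the same shape as the genres column rather than a row copied from the file; raw, parsed and genre_names are illustrative names:

```python
from ast import literal_eval

# Illustrative raw value in the same shape as the 'genres' column.
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# literal_eval safely parses the string into Python lists and dicts.
parsed = literal_eval(raw)

# Keep only the genre names, sorted as in get_list above.
genre_names = sorted(d["name"] for d in parsed)
print(genre_names)
```

The string becomes a list of dictionaries first, and only then a sorted list of genre names.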

Without further investigation, we can see that we have at least a few empty list values, [], in the table above, so we can remove all samples which contain an empty list.

In [7]:
genres = genres[genres.str.len() > 0]
pd.DataFrame(genres)
Out[7]:
genres
0 [Action, Adventure, Fantasy, Science Fiction]
1 [Action, Adventure, Fantasy]
2 [Action, Adventure, Crime]
3 [Action, Crime, Drama, Thriller]
4 [Action, Adventure, Science Fiction]
... ...
4797 [Foreign, Thriller]
4798 [Action, Crime, Thriller]
4799 [Comedy, Romance]
4800 [Comedy, Drama, Romance, TV Movie]
4802 [Documentary]

4775 rows × 1 columns

Our chord diagram will need two inputs: the co-occurrence matrix, and a list of names to label the segments.

We can build a co-occurrence matrix with the following approach. We'll start by getting all combinations within each list.

In [8]:
genres = [list(itertools.combinations(i,2)) for i in genres]
pd.DataFrame(genres)
Out[8]:
0 1 2 3 4 5 6 7 8 9 ... 11 12 13 14 15 16 17 18 19 20
0 (Action, Adventure) (Action, Fantasy) (Action, Science Fiction) (Adventure, Fantasy) (Adventure, Science Fiction) (Fantasy, Science Fiction) None None None None ... None None None None None None None None None None
1 (Action, Adventure) (Action, Fantasy) (Adventure, Fantasy) None None None None None None None ... None None None None None None None None None None
2 (Action, Adventure) (Action, Crime) (Adventure, Crime) None None None None None None None ... None None None None None None None None None None
3 (Action, Crime) (Action, Drama) (Action, Thriller) (Crime, Drama) (Crime, Thriller) (Drama, Thriller) None None None None ... None None None None None None None None None None
4 (Action, Adventure) (Action, Science Fiction) (Adventure, Science Fiction) None None None None None None None ... None None None None None None None None None None
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4770 (Foreign, Thriller) None None None None None None None None None ... None None None None None None None None None None
4771 (Action, Crime) (Action, Thriller) (Crime, Thriller) None None None None None None None ... None None None None None None None None None None
4772 (Comedy, Romance) None None None None None None None None None ... None None None None None None None None None None
4773 (Comedy, Drama) (Comedy, Romance) (Comedy, TV Movie) (Drama, Romance) (Drama, TV Movie) (Romance, TV Movie) None None None None ... None None None None None None None None None None
4774 None None None None None None None None None None ... None None None None None None None None None None

4775 rows × 21 columns
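
The last row above is all None because its list held a single genre, [Documentary], and a single item has no pairs. That is why filtering out only the empty lists is enough here. A quick check with the standard library (single and triple are illustrative names):

```python
import itertools

# A single-genre list has no pairs, so it contributes nothing downstream.
single = list(itertools.combinations(["Documentary"], 2))

# A three-genre list yields three pairs.
triple = list(itertools.combinations(["Action", "Adventure", "Fantasy"], 2))
print(single)
print(triple)
```

The three-genre case reproduces the second row of the table above.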

Now we will flatten the nested lists, adding each genre pairing in both its original and reversed order.

In [9]:
genres = list(itertools.chain.from_iterable((i, i[::-1]) for c_ in genres for i in c_))
pd.DataFrame(genres)
Out[9]:
0 1
0 Action Adventure
1 Adventure Action
2 Action Fantasy
3 Fantasy Action
4 Action Science Fiction
... ... ...
24655 Romance Drama
24656 Drama TV Movie
24657 TV Movie Drama
24658 Romance TV Movie
24659 TV Movie Romance

24660 rows × 2 columns

We can now use these pairings to create the matrix.

In [10]:
matrix = pd.pivot_table(
    pd.DataFrame(genres), index=0, columns=1, aggfunc="size", fill_value=0
).values.tolist()

We can list this using a DataFrame for better presentation.

In [11]:
pd.DataFrame(matrix)
Out[11]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
0 0 465 26 258 276 3 339 62 144 5 58 76 9 57 63 277 1 547 55 35
1 465 0 114 223 56 2 183 211 190 2 27 20 7 37 66 205 0 203 30 22
2 26 114 0 125 0 0 19 195 61 1 0 0 14 1 8 30 1 3 1 2
3 258 223 125 0 180 11 576 299 166 9 11 78 84 27 484 109 4 113 8 17
4 276 56 0 180 0 1 381 8 10 4 13 36 11 105 47 18 1 414 4 9
5 3 2 0 11 1 0 7 5 0 2 6 1 15 1 0 0 0 1 0 0
6 339 183 19 576 381 7 0 121 99 27 175 84 106 175 603 102 5 554 118 34
7 62 211 195 299 8 5 121 0 149 3 1 1 32 6 52 58 3 7 0 3
8 144 190 61 166 10 0 99 149 0 0 0 53 11 19 64 85 0 63 3 2
9 5 2 1 9 4 2 27 3 0 0 3 0 0 0 9 0 0 3 1 0
10 58 27 0 11 13 6 175 1 0 3 0 1 6 3 30 0 0 21 59 8
11 76 20 0 78 36 1 84 1 53 0 1 0 3 91 15 95 1 291 1 1
12 9 7 14 84 11 15 106 32 11 0 6 3 0 3 61 2 1 5 1 3
13 57 37 1 27 105 1 175 6 19 0 3 91 3 0 24 47 0 242 3 2
14 63 66 8 484 47 0 603 52 64 9 30 15 61 24 0 31 3 64 26 12
15 277 205 30 109 18 0 102 58 85 0 0 95 2 47 31 0 0 211 2 1
16 1 0 1 4 1 0 5 3 0 0 0 1 1 0 3 0 0 1 0 0
17 547 203 3 113 414 1 554 7 63 3 21 291 5 242 64 211 1 0 24 7
18 55 30 1 8 4 0 118 0 3 1 59 1 1 3 26 2 0 24 0 3
19 35 22 2 17 9 0 34 3 2 0 8 1 3 2 12 1 0 7 3 0

Now for the names of our genres.

In [12]:
names = np.unique(genres).tolist()
pd.DataFrame(names)
Out[12]:
0
0 Action
1 Adventure
2 Animation
3 Comedy
4 Crime
5 Documentary
6 Drama
7 Family
8 Fantasy
9 Foreign
10 History
11 Horror
12 Music
13 Mystery
14 Romance
15 Science Fiction
16 TV Movie
17 Thriller
18 War
19 Western
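
Once the names are known, the matrix can be sanity-checked and ranked. The sketch below works on the first four genres and the top-left corner of the matrix only, so the ranking is for that excerpt and not for the full dataset. It assumes the matrix rows follow the same order as names; names_head, block, is_symmetric and ranked are illustrative names:

```python
import numpy as np

# First four genre names and the top-left 4x4 block of the matrix above.
names_head = ["Action", "Adventure", "Animation", "Comedy"]
block = np.array([
    [0, 465, 26, 258],
    [465, 0, 114, 223],
    [26, 114, 0, 125],
    [258, 223, 125, 0],
])

# Pairs were added in both directions, so the block should mirror itself.
is_symmetric = np.array_equal(block, block.T)

# Walk the upper triangle once per pair and sort by count, largest first.
rows, cols = np.triu_indices(len(names_head), k=1)
ranked = sorted(
    ((int(block[r, c]), names_head[r], names_head[c]) for r, c in zip(rows, cols)),
    reverse=True,
)
print(is_symmetric, ranked[0])
```

Within this excerpt, Action and Adventure form the strongest pairing with 465 co-occurrences.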

Chord Diagram

Time to visualise the co-occurrence of genres using a chord diagram. We are going to use a list of custom colours that represent the genres.

In [13]:
colors = ["#e6194B", "#3cb44b", "#ffe119", "#4363d8", "#f58231",
    "#911eb4", "#42d4f4", "#f032e6", "#bfef45", "#fabebe",
    "#469990", "#e6beff", "#9A6324", "#fffac8", "#800000",
    "#aaffc3", "#a9a9a9", "#ffd8b1", "#000075", "#a9a9a9",];

Finally, we can put it all together.

In [14]:
Chord(
    matrix,
    names,
    colors=colors,
    wrap_labels=False,
    margin=50
).show()
Chord Diagram
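
A palette written by hand is easy to get slightly wrong, so it can be worth checking it for repeats. The sketch below runs that check on the colours list from this section; counts, repeated and positions are illustrative names:

```python
from collections import Counter

colors = ["#e6194B", "#3cb44b", "#ffe119", "#4363d8", "#f58231",
          "#911eb4", "#42d4f4", "#f032e6", "#bfef45", "#fabebe",
          "#469990", "#e6beff", "#9A6324", "#fffac8", "#800000",
          "#aaffc3", "#a9a9a9", "#ffd8b1", "#000075", "#a9a9a9"]

# Hex codes are case-insensitive, so compare them in lower case.
counts = Counter(c.lower() for c in colors)
repeated = [c for c, n in counts.items() if n > 1]

# Record where the repeated colours sit in the list.
positions = [i for i, c in enumerate(colors) if c.lower() in repeated]
print(repeated, positions)
```

The check reports that positions 16 and 19 share #a9a9a9. If colours are applied in the order of names (an assumption about the package), those are TV Movie and Western, which would then be drawn in the same grey.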

Conclusion

In this section, we demonstrated how to conduct some data wrangling on a downloaded dataset to prepare it for a chord diagram. Our chord diagram is interactive, so you can use your mouse or touchscreen to investigate the co-occurrences!

Interactive Chord Diagrams

Preamble

In [2]:
:dep chord = {version = "0.1.4"}
use chord::{Chord, Plot};

Introduction

In a chord diagram (or radial network), entities are arranged radially as segments with their relationships visualised by arcs that connect them. The size of the segments illustrates the numerical proportions, whilst the size of the arc illustrates the significance of the relationships [1].

Chord diagrams are useful when trying to convey relationships between different entities, and they can be beautiful and eye-catching.

The Chord Crate

I wasn't able to find any Rust crates for plotting chord diagrams, so I ported my own from Python to Rust.

You can get the crate either from crates.io or from the GitHub repository. With your processed data, you should be able to plot something beautiful with just a single expression: Chord { matrix: matrix, names: names, ..Chord::default() }.show()

The Dataset

The focus for this section will be the demonstration of the chord package. To keep it simple, we will use synthetic data that illustrates the co-occurrences between movie genres within the same movie.

In [30]:
let matrix: Vec<Vec<f64>> = vec![
    vec![0., 5., 6., 4., 7., 4.],
    vec![5., 0., 5., 4., 6., 5.],
    vec![6., 5., 0., 4., 5., 5.],
    vec![4., 4., 4., 0., 5., 5.],
    vec![7., 6., 5., 5., 0., 4.],
    vec![4., 5., 5., 5., 4., 0.],
];

let names: Vec<String> = vec![
    "Action",
    "Adventure",
    "Comedy",
    "Drama",
    "Fantasy",
    "Thriller",
]
.into_iter()
.map(String::from)
.collect();

Chord Diagrams

Let's see what the Chord defaults produce when we invoke the show() method.

In [32]:
Chord {
    matrix: matrix.clone(),
    names: names.clone(),
    wrap_labels: true,
    ..Chord::default()
}
.show();
Out[32]:
Chord Diagram

Different Colours

The defaults are nice, but what if we want different colours? You can pass in almost anything from d3-scale-chromatic, or you could pass in a list of hexadecimal colour codes.

In [33]:
Chord {
    matrix: matrix.clone(),
    names: names.clone(),
    wrap_labels: true,
    colors: "d3.schemeSet2".to_string(),
    ..Chord::default()
}
.show();
Out[33]:
Chord Diagram
In [34]:
Chord {
    matrix: matrix.clone(),
    names: names.clone(),
    wrap_labels: true,
    colors: format!("d3.schemeGnBu[{}]", names.len()),
    ..Chord::default()
}
.show();
Out[34]:
Chord Diagram
In [35]:
Chord {
    matrix: matrix.clone(),
    names: names.clone(),
    wrap_labels: true,
    colors: "d3.schemeSet3".to_string(),
    ..Chord::default()
}
.show();
Out[35]:
Chord Diagram
In [36]:
Chord {
    matrix: matrix.clone(),
    names: names.clone(),
    wrap_labels: true,
    colors: format!("d3.schemePuRd[{}]", names.len()),
    ..Chord::default()
}
.show();
Out[36]:
Chord Diagram
In [37]:
Chord {
    matrix: matrix.clone(),
    names: names.clone(),
    wrap_labels: true,
    colors: format!("d3.schemeYlGnBu[{}]", names.len()),
    ..Chord::default()
}
.show();
Out[37]:
Chord Diagram
In [38]:
let hex_colours : Vec<String> = vec!["#222222", "#333333", "#4c4c4c", "#666666", "#848484", "#9a9a9a"].into_iter()
.map(String::from)
.collect();

Chord {
    matrix: matrix.clone(),
    names: names.clone(),
    wrap_labels: true,
    colors: format!("{:?}",hex_colours),
    ..Chord::default()
}
.show();
Out[38]:
Chord Diagram

Label Styling

We can disable wrapped labels, and even change the colour.

In [39]:
Chord {
    matrix: matrix.clone(),
    names: names.clone(),
    wrap_labels: false,
    label_color: "#4c40bf".to_string(),
    ..Chord::default()
}
.show();
Out[39]:
Chord Diagram

Opacity

We can also change the default opacity of the relationships.

In [40]:
Chord {
    matrix: matrix.clone(),
    names: names.clone(),
    opacity: 0.1,
    ..Chord::default()
}
.show();
Out[40]:
Chord Diagram

Width

We can also change the maximum width of the plot.

In [41]:
Chord {
    matrix: matrix.clone(),
    names: names.clone(),
    width: 400.0,
    wrap_labels: true,
    ..Chord::default()
}
.show()
Out[41]:
Chord Diagram

Conclusion

In this section, we've introduced the chord diagram and chord crate. We used the crate and some synthetic data to demonstrate several chord diagram visualisations with different configurations. The chord crate is available for free from crates.io or from the GitHub repository.


  1. Tintarev, N., Rostami, S., & Smyth, B. (2018, April). Knowing the unknown: visualising consumption blind-spots in recommender systems. In Proceedings of the 33rd Annual ACM Symposium on Applied Computing (pp. 1396-1399). 

Co-occurrence of Pokemon Types (Gen 1-8) with Chord Diagrams

Preamble

In [1]:
import numpy as np                   # for multi-dimensional containers 
import pandas as pd                  # for DataFrames
import itertools
from chord import Chord

Introduction

In previous sections, we visualised co-occurrences of Pokémon types. Whilst it was interesting to look at, the dataset only contained Pokémon from the first six generations. In this section, we're going to use the Pokemon with stats Generation 8 dataset to visualise the co-occurrence of Pokémon types from generations one to eight.

The Dataset

The dataset documentation states that we can expect 51 variables for each of the 1028 Pokémon of the first eight generations.

Let's download the mirrored dataset and have a look for ourselves.

In [2]:
data_url = 'https://shahinrostami.com/datasets/pokemon_gen_1_to_8.csv'
data = pd.read_csv(data_url)
data.head()
Out[2]:
Unnamed: 0 pokedex_number name german_name japanese_name generation status species type_number type_1 ... against_ground against_flying against_psychic against_bug against_rock against_ghost against_dragon against_dark against_steel against_fairy
0 0 1 Bulbasaur Bisasam フシギダネ (Fushigidane) 1 Normal Seed Pokémon 2 Grass ... 1.0 2.0 2.0 1.0 1.0 1.0 1.0 1.0 1.0 0.5
1 1 2 Ivysaur Bisaknosp フシギソウ (Fushigisou) 1 Normal Seed Pokémon 2 Grass ... 1.0 2.0 2.0 1.0 1.0 1.0 1.0 1.0 1.0 0.5
2 2 3 Venusaur Bisaflor フシギバナ (Fushigibana) 1 Normal Seed Pokémon 2 Grass ... 1.0 2.0 2.0 1.0 1.0 1.0 1.0 1.0 1.0 0.5
3 3 3 Mega Venusaur Bisaflor フシギバナ (Fushigibana) 1 Normal Seed Pokémon 2 Grass ... 1.0 2.0 2.0 1.0 1.0 1.0 1.0 1.0 1.0 0.5
4 4 4 Charmander Glumanda ヒトカゲ (Hitokage) 1 Normal Lizard Pokémon 1 Fire ... 2.0 1.0 1.0 0.5 2.0 1.0 1.0 1.0 0.5 0.5

5 rows × 51 columns

It looks good so far, but let's confirm the 51 variables against 1028 samples from the documentation.

In [3]:
data.shape
Out[3]:
(1028, 51)

Perfect, that's exactly what we were expecting.

Data Wrangling

We need to do a bit of data wrangling before we can visualise our data. We can see from the column names that the Pokémon types are split between the columns type_1 and type_2.

In [4]:
pd.DataFrame(data.columns.values.tolist()).head(20)
Out[4]:
0
0 Unnamed: 0
1 pokedex_number
2 name
3 german_name
4 japanese_name
5 generation
6 status
7 species
8 type_number
9 type_1
10 type_2
11 height_m
12 weight_kg
13 abilities_number
14 ability_1
15 ability_2
16 ability_hidden
17 total_points
18 hp
19 attack

So let's select just these two columns and work with a DataFrame containing only them as we move forward.

In [5]:
types = pd.DataFrame(data[['type_1', 'type_2']].values)
types
Out[5]:
0 1
0 Grass Poison
1 Grass Poison
2 Grass Poison
3 Grass Poison
4 Fire NaN
... ... ...
1023 Fairy NaN
1024 Fighting Steel
1025 Fighting NaN
1026 Poison Dragon
1027 Poison Dragon

1028 rows × 2 columns

Without further investigation, we can see that we have at least a few NaN values in the table above. We are only interested in co-occurrence of types, so we can remove all samples which contain a NaN value.

In [6]:
types = types.dropna()

There is also an instance where the type Fighting, at index 1014, is followed by a newline character, \n. We'll strip all of these out before continuing.

In [7]:
types = types.replace('\n','', regex=True)
types
Out[7]:
0 1
0 Grass Poison
1 Grass Poison
2 Grass Poison
3 Grass Poison
6 Fire Flying
... ... ...
1021 Dragon Ghost
1022 Fairy Steel
1024 Fighting Steel
1026 Poison Dragon
1027 Poison Dragon

542 rows × 2 columns
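
The cleaning leaves 542 of the 1028 rows, so 486 rows had a missing value in one of the two columns. Both cleaning steps can be seen on a tiny made-up frame in the same shape as the type columns (the rows and the names toy and cleaned are illustrative):

```python
import numpy as np
import pandas as pd

# Made-up rows in the same shape as the two type columns.
toy = pd.DataFrame([
    ["Grass", "Poison"],
    ["Fire", np.nan],         # single-type entry, dropped by dropna()
    ["Fighting\n", "Steel"],  # stray newline, removed by replace()
])

# Drop rows with a missing second type, then strip newline characters.
cleaned = toy.dropna().replace("\n", "", regex=True)
print(cleaned.values.tolist())
```

Note that dropna() keeps the original index labels, which is why the index in the table above skips values such as 4 and 5.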

Our chord diagram will need two inputs: the co-occurrence matrix, and a list of names to label the segments.

We can build a co-occurrence matrix with the following approach. We'll start by creating a list with every type pairing in its original and reversed form.

In [8]:
types = list(itertools.chain.from_iterable((i, i[::-1]) for i in types.values))

We can now use this list to create the matrix.

In [9]:
matrix = pd.pivot_table(
    pd.DataFrame(types), index=0, columns=1, aggfunc="size", fill_value=0
).values.tolist()

We can list this using a DataFrame for better presentation.

In [10]:
pd.DataFrame(matrix)
Out[10]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
0 0 0 0 4 2 4 4 14 1 6 2 2 0 13 2 5 7 5
1 0 0 4 2 3 3 4 5 3 3 3 2 5 5 3 2 2 7
2 0 4 0 3 1 2 3 8 5 5 9 3 1 4 5 2 2 3
3 4 2 3 0 2 0 1 6 1 1 1 2 2 3 1 3 4 3
4 2 3 1 2 0 0 0 2 1 5 0 1 5 1 9 3 5 4
5 4 3 2 0 0 0 7 1 1 3 0 1 4 2 6 1 4 3
6 4 4 3 1 0 7 0 7 5 0 4 1 2 2 3 3 1 1
7 14 5 8 6 2 1 7 0 3 7 4 2 27 3 7 6 3 8
8 1 3 5 1 1 1 5 3 0 12 6 1 0 4 3 0 4 2
9 6 3 5 1 5 3 0 7 12 0 1 3 2 15 3 2 3 3
10 2 3 9 1 0 0 4 4 6 1 0 3 1 2 2 9 6 10
11 2 2 3 2 1 1 1 2 1 3 3 0 0 0 4 2 2 7
12 0 5 1 2 5 4 2 27 0 2 1 0 0 0 5 0 0 1
13 13 5 4 3 1 2 2 3 4 15 2 0 0 0 0 1 0 6
14 2 3 5 1 9 6 3 7 3 3 2 4 5 0 0 2 9 6
15 5 2 2 3 3 1 3 6 0 2 9 2 0 1 2 0 7 11
16 7 2 2 4 5 4 1 3 4 3 6 2 0 0 9 7 0 1
17 5 7 3 3 4 3 1 8 2 3 10 7 1 6 6 11 1 0
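
Adding every pairing in both its original and reversed form is what makes the matrix symmetric. A minimal sketch on three made-up pairings (the toy_ names are illustrative only) shows the same two steps at a small scale.

```python
import itertools
import pandas as pd

# Hypothetical pairings: Fire/Flying appears twice, Grass/Poison once.
toy_pairs = [("Fire", "Flying"), ("Fire", "Flying"), ("Grass", "Poison")]

# Each pairing is added in its original and reversed form.
toy_doubled = list(
    itertools.chain.from_iterable((p, p[::-1]) for p in toy_pairs)
)

# Count how often each (row, column) combination occurs.
toy_table = pd.pivot_table(
    pd.DataFrame(toy_doubled), index=0, columns=1, aggfunc="size", fill_value=0
)
toy_matrix = toy_table.values.tolist()

# A symmetric matrix equals its own transpose.
toy_is_symmetric = toy_matrix == toy_table.T.values.tolist()
print(toy_table)
```

Each pairing is counted once in each direction, so the Fire row has a 2 in the Flying column and the Flying row has a 2 in the Fire column.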

Now for the names of our types.

In [11]:
names = np.unique(types).tolist()
pd.DataFrame(names)
Out[11]:
0
0 Bug
1 Dark
2 Dragon
3 Electric
4 Fairy
5 Fighting
6 Fire
7 Flying
8 Ghost
9 Grass
10 Ground
11 Ice
12 Normal
13 Poison
14 Psychic
15 Rock
16 Steel
17 Water
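
Because the matrix was converted to plain lists, its rows no longer carry labels, so the diagram depends on names being in the same order as the rows of the pivot table. Both are sorted alphabetically. The sketch below checks that on toy data; the pairings and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical pairings, already in original and reversed form.
toy_doubled = [("Rock", "Water"), ("Water", "Rock"),
               ("Bug", "Rock"), ("Rock", "Bug")]

toy_table = pd.pivot_table(
    pd.DataFrame(toy_doubled), index=0, columns=1, aggfunc="size", fill_value=0
)

# Labels as produced by np.unique versus the pivot table's own row labels.
names_from_unique = np.unique(toy_doubled).tolist()
names_from_index = toy_table.index.tolist()

print(names_from_unique == names_from_index)
```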

Chord Diagram

Time to visualise the co-occurrence of types using a chord diagram. We are going to use a list of custom colours that represent the types.

In [12]:
colors = ["#A6B91A", "#705746", "#6F35FC", "#F7D02C", "#D685AD",
          "#C22E28", "#EE8130", "#A98FF3", "#735797", "#7AC74C",
          "#E2BF65", "#96D9D6", "#A8A77A", "#A33EA1", "#F95587",
          "#B6A136", "#B7B7CE", "#6390F0"]

In [13]:
names
Out[13]:
['Bug',
 'Dark',
 'Dragon',
 'Electric',
 'Fairy',
 'Fighting',
 'Fire',
 'Flying',
 'Ghost',
 'Grass',
 'Ground',
 'Ice',
 'Normal',
 'Poison',
 'Psychic',
 'Rock',
 'Steel',
 'Water']
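
Assuming the colours are applied to the segments in the same order as names (which is how the list above is arranged), a quick check that the two lists line up can catch a mismatch early. The three-item lists below are an excerpt used as a placeholder, and the variable names are illustrative.

```python
# Excerpt of the first three names and colours, for illustration only.
example_names = ['Bug', 'Dark', 'Dragon']
example_colors = ["#A6B91A", "#705746", "#6F35FC"]

# One colour per name, paired by position.
lengths_match = len(example_names) == len(example_colors)
colour_of = dict(zip(example_names, example_colors))

print(lengths_match, colour_of['Dragon'])
```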

Finally, we can put it all together.

In [14]:
Chord(matrix, names, colors=colors).show()
Chord Diagram

Chord Diagram with Names

Note

The following example uses a customised version of Chord that supports the presentation of additional information.

It would be nice to show a list of Pokémon names when hovering over co-occurring Pokémon types. To do this, we can make use of the optional details parameter.

Let's clean up our dataset by removing all instances of \n.

In [15]:
data = data.replace('\n','', regex=True)

Next, we'll create an empty multi-dimensional array with the same shape as our matrix.

In [16]:
details = np.empty((len(names),len(names)),dtype=object)

Now we can populate the details array with lists of Pokémon names in the correct positions.

In [17]:
for count_x, item_x in enumerate(names):
    for count_y, item_y in enumerate(names):
        details[count_x][count_y] = data[
            (data['type_1'].isin([item_x, item_y])) &
            (data['type_2'].isin([item_y, item_x]))]['name'].to_list()

details = pd.DataFrame(details).values.tolist()
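
The nested loop filters the whole dataset once per cell. As an alternative sketch (not the approach used above), the same structure can be filled in a single pass over the rows. The toy frame, toy_names, and position are hypothetical names introduced for this example.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the dataset.
toy = pd.DataFrame({
    "name": ["Charizard", "Moltres", "Bulbasaur", "Charmander"],
    "type_1": ["Fire", "Fire", "Grass", "Fire"],
    "type_2": ["Flying", "Flying", "Poison", None],
})
toy_names = ["Fire", "Flying", "Grass", "Poison"]
position = {name: i for i, name in enumerate(toy_names)}

# One empty list per cell of the matrix.
toy_details = [[[] for _ in toy_names] for _ in toy_names]

# Visit each dual-typed row once and record it in both directions.
for row in toy.dropna(subset=["type_2"]).itertuples(index=False):
    i, j = position[row.type_1], position[row.type_2]
    toy_details[i][j].append(row.name)
    if i != j:
        toy_details[j][i].append(row.name)

print(toy_details[0][1])
```

Single-typed rows such as Charmander are skipped, which matches the co-occurrence matrix built from the rows that survived dropna.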

Finally, we can put it all together but this time with the details matrix passed in.

In [18]:
Chord(
    matrix,
    names,
    colors=colors,
    details=details,
    credit=True
).show()
Chord Diagram

Conclusion

In this section, we demonstrated how to conduct some data wrangling on a downloaded dataset to prepare it for a chord diagram. Our chord diagram is interactive, so you can use your mouse or touchscreen to investigate the co-occurrences!

Occurrence and Co-occurrence of Pokémon Types with Chord Diagrams

Preamble

In [1]:
import numpy as np                   # for multi-dimensional containers 
import pandas as pd                  # for DataFrames
import itertools
from chord import Chord

Introduction

In this section, we're going to use the Pokémon for Data Mining and Machine Learning dataset to visualise the occurrences and co-occurrence of Pokémon types.

The Dataset

The dataset documentation states that we can expect 23 variables for each of the 721 Pokémon of the first six generations.

Let's download the mirrored dataset and have a look for ourselves.

In [2]:
data_url = 'https://shahinrostami.com/datasets/pokemon.csv'
data = pd.read_csv(data_url)
data.head()
Out[2]:
Number Name Type_1 Type_2 Total HP Attack Defense Sp_Atk Sp_Def ... Color hasGender Pr_Male Egg_Group_1 Egg_Group_2 hasMegaEvolution Height_m Weight_kg Catch_Rate Body_Style
0 1 Bulbasaur Grass Poison 318 45 49 49 65 65 ... Green True 0.875 Monster Grass False 0.71 6.9 45 quadruped
1 2 Ivysaur Grass Poison 405 60 62 63 80 80 ... Green True 0.875 Monster Grass False 0.99 13.0 45 quadruped
2 3 Venusaur Grass Poison 525 80 82 83 100 100 ... Green True 0.875 Monster Grass True 2.01 100.0 45 quadruped
3 4 Charmander Fire NaN 309 39 52 43 60 50 ... Red True 0.875 Monster Dragon False 0.61 8.5 45 bipedal_tailed
4 5 Charmeleon Fire NaN 405 58 64 58 80 65 ... Red True 0.875 Monster Dragon False 1.09 19.0 45 bipedal_tailed

5 rows × 23 columns

It looks good so far, but let's confirm that the data has the 23 variables and 721 samples stated in the documentation.

In [3]:
data.shape
Out[3]:
(721, 23)

Perfect, that's exactly what we were expecting.

Data Wrangling

We need to do a bit of data wrangling before we can visualise our data. We can see from the column names that the Pokémon types are split between the columns Type_1 and Type_2.

In [4]:
pd.DataFrame(data.columns.values.tolist())
Out[4]:
0
0 Number
1 Name
2 Type_1
3 Type_2
4 Total
5 HP
6 Attack
7 Defense
8 Sp_Atk
9 Sp_Def
10 Speed
11 Generation
12 isLegendary
13 Color
14 hasGender
15 Pr_Male
16 Egg_Group_1
17 Egg_Group_2
18 hasMegaEvolution
19 Height_m
20 Weight_kg
21 Catch_Rate
22 Body_Style

So let's select just these two columns and work with a DataFrame containing only them as we move forward.

In [5]:
data = pd.DataFrame(data[['Type_1', 'Type_2']].values)
data
Out[5]:
0 1
0 Grass Poison
1 Grass Poison
2 Grass Poison
3 Fire NaN
4 Fire NaN
... ... ...
716 Dark Flying
717 Dragon Ground
718 Rock Fairy
719 Psychic Ghost
720 Fire Water

721 rows × 2 columns

Unlike the previous section, this visualisation will also include every occurrence of a type, not just the co-occurrences. Let's extract these single-typed instances before continuing.

In [6]:
single_typed = data[data.isnull().any(axis=1)][0]  # rows with a missing type, first type only
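
The selection above works by building a boolean mask that is True for any row with a missing value, then keeping only the first column of those rows. A minimal sketch on made-up rows (the toy frame and variable names are illustrative) makes the two steps visible.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one dual-typed entry and three single-typed entries.
toy = pd.DataFrame([["Grass", "Poison"], ["Fire", np.nan],
                    ["Fire", np.nan], ["Water", np.nan]])

# True where either column is missing.
has_missing = toy.isnull().any(axis=1)

# Keep the first type of the single-typed rows.
toy_single = toy[has_missing][0]

print(has_missing.tolist())
print(toy_single.tolist())
```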

We can see at least a few NaN values in the table above. Now that the single-typed Pokémon have been set aside, we can remove all samples which contain a NaN value so that only co-occurring types remain.

In [7]:
data = data.dropna()
data
Out[7]:
0 1
0 Grass Poison
1 Grass Poison
2 Grass Poison
5 Fire Flying
11 Bug Flying
... ... ...
716 Dark Flying
717 Dragon Ground
718 Rock Fairy
719 Psychic Ghost
720 Fire Water

350 rows × 2 columns

Our chord diagram will need two inputs: the co-occurrence matrix, and a list of names to label the segments.

We can build a co-occurrence matrix with the following approach. We'll start by creating a list with every type pairing in its original and reversed form.

In [8]:
data = list(itertools.chain.from_iterable((i, i[::-1]) for i in data.values))

We can now use this list to create the matrix.

In [9]:
matrix = pd.pivot_table(
    pd.DataFrame(data), index=0, columns=1, aggfunc="size", fill_value=0
).values.tolist()

We can list this using a DataFrame for better presentation.

In [10]:
pd.DataFrame(matrix)
Out[10]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
0 0 0 0 2 0 1 2 13 1 6 1 0 0 12 0 5 5 1
1 0 0 3 0 0 3 2 5 2 3 3 2 0 3 2 1 2 4
2 0 3 0 1 0 0 1 6 1 0 6 1 0 1 2 2 1 2
3 2 0 1 0 1 0 0 3 1 0 1 0 2 0 0 0 3 2
4 0 0 0 1 0 0 0 2 0 2 0 0 4 0 5 2 2 2
5 1 3 0 0 0 0 6 1 0 3 0 0 0 2 3 1 2 2
6 2 2 1 0 0 6 0 5 3 0 2 0 2 0 2 1 1 1
7 13 5 6 3 2 1 5 0 2 4 3 2 23 3 6 3 1 7
8 1 2 1 1 0 0 3 2 0 4 2 1 0 3 1 0 3 2
9 6 3 0 0 2 3 0 4 4 0 1 2 2 14 3 2 2 3
10 1 3 6 1 0 0 2 3 2 1 0 3 1 2 2 9 2 9
11 0 2 1 0 0 0 0 2 1 2 3 0 0 0 2 2 0 6
12 0 0 0 2 4 0 2 23 0 2 1 0 0 0 2 0 0 1
13 12 3 1 0 0 2 0 3 3 14 2 0 0 0 0 0 0 4
14 0 2 2 0 5 3 2 6 1 3 2 2 2 0 0 2 6 4
15 5 1 2 0 2 1 1 3 0 2 9 2 0 0 2 0 6 10
16 5 2 1 3 2 2 1 1 3 2 2 0 0 0 6 6 0 1
17 1 4 2 2 2 2 1 7 2 3 9 6 1 4 4 10 1 0

We extracted our single-typed instances earlier. Here is the frequency of these occurrences.

In [11]:
single_typed = single_typed.value_counts().sort_index()
pd.DataFrame(single_typed)
Out[11]:
0
Bug 17
Dark 9
Dragon 11
Electric 26
Fairy 15
Fighting 20
Fire 28
Flying 1
Ghost 9
Grass 33
Ground 13
Ice 12
Normal 60
Poison 15
Psychic 32
Rock 9
Steel 4
Water 57
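
As a quick consistency check, the single-typed counts above and the 350 dual-typed rows left after dropping NaN values should together account for every Pokémon in the dataset. The sketch below simply re-enters the figures shown in the outputs; the variable names are illustrative.

```python
# Single-typed counts copied from the output above, in alphabetical type order.
single_counts = [17, 9, 11, 26, 15, 20, 28, 1, 9,
                 33, 13, 12, 60, 15, 32, 9, 4, 57]

# Number of dual-typed rows reported after dropna().
dual_typed_rows = 350

total = sum(single_counts) + dual_typed_rows
print(sum(single_counts), total)
```

The total matches the 721 samples confirmed by data.shape.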

Let's add these to our matrix along the diagonal. This works because our matrix is of size $18 \times 18$ and single_typed has $18$ entries, with both sorted alphabetically by type.

In [12]:
for i in range(len(single_typed)):
    # .iloc selects by position; single_typed is indexed by type name
    matrix[i][i] = int(single_typed.iloc[i])
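
The loop above relies on every type having at least one single-typed Pokémon (Flying only just qualifies, with a count of 1). A more defensive sketch, using hypothetical toy values, aligns the counts to the type labels first so that a missing type becomes 0 instead of shifting the diagonal.

```python
import pandas as pd

# Hypothetical toy values; Flying has no single-typed entries here.
toy_names = ["Bug", "Fire", "Flying"]
toy_matrix = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
toy_counts = pd.Series({"Bug": 4, "Fire": 5})

# Align to the full list of labels, filling missing types with 0.
aligned = toy_counts.reindex(toy_names, fill_value=0)

# Assign by label so the order of toy_counts does not matter.
for i, name in enumerate(toy_names):
    toy_matrix[i][i] = int(aligned[name])

print(toy_matrix)
```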

We can confirm they've been added by displaying the matrix through a DataFrame once again.

In [13]:
pd.DataFrame(matrix)
Out[13]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
0 17 0 0 2 0 1 2 13 1 6 1 0 0 12 0 5 5 1
1 0 9 3 0 0 3 2 5 2 3 3 2 0 3 2 1 2 4
2 0 3 11 1 0 0 1 6 1 0 6 1 0 1 2 2 1 2
3 2 0 1 26 1 0 0 3 1 0 1 0 2 0 0 0 3 2
4 0 0 0 1 15 0 0 2 0 2 0 0 4 0 5 2 2 2
5 1 3 0 0 0 20 6 1 0 3 0 0 0 2 3 1 2 2
6 2 2 1 0 0 6 28 5 3 0 2 0 2 0 2 1 1 1
7 13 5 6 3 2 1 5 1 2 4 3 2 23 3 6 3 1 7
8 1 2 1 1 0 0 3 2 9 4 2 1 0 3 1 0 3 2
9 6 3 0 0 2 3 0 4 4 33 1 2 2 14 3 2 2 3
10 1 3 6 1 0 0 2 3 2 1 13 3 1 2 2 9 2 9
11 0 2 1 0 0 0 0 2 1 2 3 12 0 0 2 2 0 6
12 0 0 0 2 4 0 2 23 0 2 1 0 60 0 2 0 0 1
13 12 3 1 0 0 2 0 3 3 14 2 0 0 15 0 0 0 4
14 0 2 2 0 5 3 2 6 1 3 2 2 2 0 32 2 6 4
15 5 1 2 0 2 1 1 3 0 2 9 2 0 0 2 9 6 10
16 5 2 1 3 2 2 1 1 3 2 2 0 0 0 6 6 4 1
17 1 4 2 2 2 2 1 7 2 3 9 6 1 4 4 10 1 57

Now for the names of our types.

In [14]:
names = np.unique(data).tolist()
pd.DataFrame(names)
Out[14]:
0
0 Bug
1 Dark
2 Dragon
3 Electric
4 Fairy
5 Fighting
6 Fire
7 Flying
8 Ghost
9 Grass
10 Ground
11 Ice
12 Normal
13 Poison
14 Psychic
15 Rock
16 Steel
17 Water

Chord Diagram

Time to visualise the occurrences and co-occurrence of types using a chord diagram. We are going to use a list of custom colours that represent the types.

In [15]:
colors = ["#A6B91A", "#705746", "#6F35FC", "#F7D02C",
          "#D685AD", "#C22E28", "#EE8130", "#A98FF3",
          "#735797", "#7AC74C", "#E2BF65", "#96D9D6",
          "#A8A77A", "#A33EA1", "#F95587", "#B6A136",
          "#B7B7CE", "#6390F0"]

Finally, we can put it all together.

In [16]:
Chord(
    matrix,
    names,
    colors=colors,
    wrap_labels=False,
    margin=10,
    credit=True
).show()
Chord Diagram

Conclusion

In this section, we demonstrated how to conduct some data wrangling on a downloaded dataset to prepare it for a chord diagram. Our chord diagram is interactive, so you can use your mouse or touchscreen to investigate the occurrences and co-occurrences!

Co-occurrence of Pokémon Types (Gen 1-6) with Chord Diagrams

Updated Version Available

An updated version of this section is available which includes Pokémon from Generations 1-8. This visualisation only includes Pokémon from Generations 1-6.

Preamble

In [1]:
import numpy as np                   # for multi-dimensional containers 
import pandas as pd                  # for DataFrames
import itertools
from chord import Chord

Introduction

In this section, we're going to use the Pokémon for Data Mining and Machine Learning dataset to visualise the co-occurrence of Pokémon types.

The Dataset

The dataset documentation states that we can expect 23 variables for each of the 721 Pokémon of the first six generations.

Let's download the mirrored dataset and have a look for ourselves.

In [2]:
data_url = 'https://shahinrostami.com/datasets/pokemon.csv'
data = pd.read_csv(data_url)
data.head()
Out[2]:
Number Name Type_1 Type_2 Total HP Attack Defense Sp_Atk Sp_Def ... Color hasGender Pr_Male Egg_Group_1 Egg_Group_2 hasMegaEvolution Height_m Weight_kg Catch_Rate Body_Style
0 1 Bulbasaur Grass Poison 318 45 49 49 65 65 ... Green True 0.875 Monster Grass False 0.71 6.9 45 quadruped
1 2 Ivysaur Grass Poison 405 60 62 63 80 80 ... Green True 0.875 Monster Grass False 0.99 13.0 45 quadruped
2 3 Venusaur Grass Poison 525 80 82 83 100 100 ... Green True 0.875 Monster Grass True 2.01 100.0 45 quadruped
3 4 Charmander Fire NaN 309 39 52 43 60 50 ... Red True 0.875 Monster Dragon False 0.61 8.5 45 bipedal_tailed
4 5 Charmeleon Fire NaN 405 58 64 58 80 65 ... Red True 0.875 Monster Dragon False 1.09 19.0 45 bipedal_tailed

5 rows × 23 columns

It looks good so far, but let's confirm that the data has the 23 variables and 721 samples stated in the documentation.

In [3]:
data.shape
Out[3]:
(721, 23)

Perfect, that's exactly what we were expecting.

Data Wrangling

We need to do a bit of data wrangling before we can visualise our data. We can see from the column names that the Pokémon types are split between the columns Type_1 and Type_2.

In [4]:
pd.DataFrame(data.columns.values.tolist())
Out[4]:
0
0 Number
1 Name
2 Type_1
3 Type_2
4 Total
5 HP
6 Attack
7 Defense
8 Sp_Atk
9 Sp_Def
10 Speed
11 Generation
12 isLegendary
13 Color
14 hasGender
15 Pr_Male
16 Egg_Group_1
17 Egg_Group_2
18 hasMegaEvolution
19 Height_m
20 Weight_kg
21 Catch_Rate
22 Body_Style

So let's select just these two columns and work with a DataFrame containing only them as we move forward.

In [5]:
data = pd.DataFrame(data[['Type_1', 'Type_2']].values)
data
Out[5]:
0 1
0 Grass Poison
1 Grass Poison
2 Grass Poison
3 Fire NaN
4 Fire NaN
... ... ...
716 Dark Flying
717 Dragon Ground
718 Rock Fairy
719 Psychic Ghost
720 Fire Water

721 rows × 2 columns

Without further investigation, we can see that we have at least a few NaN values in the table above. We are only interested in co-occurrence of types, so we can remove all samples which contain a NaN value.

In [6]:
data = data.dropna()

Our chord diagram will need two inputs: the co-occurrence matrix, and a list of names to label the segments.

We can build a co-occurrence matrix with the following approach. We'll start by creating a list with every type pairing in its original and reversed form.

In [7]:
data = list(itertools.chain.from_iterable((i, i[::-1]) for i in data.values))

We can now use this list to create the matrix.

In [8]:
matrix = pd.pivot_table(
    pd.DataFrame(data), index=0, columns=1, aggfunc="size", fill_value=0
).values.tolist()

We can list this using a DataFrame for better presentation.

In [9]:
pd.DataFrame(matrix)
Out[9]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
0 0 0 0 2 0 1 2 13 1 6 1 0 0 12 0 5 5 1
1 0 0 3 0 0 3 2 5 2 3 3 2 0 3 2 1 2 4
2 0 3 0 1 0 0 1 6 1 0 6 1 0 1 2 2 1 2
3 2 0 1 0 1 0 0 3 1 0 1 0 2 0 0 0 3 2
4 0 0 0 1 0 0 0 2 0 2 0 0 4 0 5 2 2 2
5 1 3 0 0 0 0 6 1 0 3 0 0 0 2 3 1 2 2
6 2 2 1 0 0 6 0 5 3 0 2 0 2 0 2 1 1 1
7 13 5 6 3 2 1 5 0 2 4 3 2 23 3 6 3 1 7
8 1 2 1 1 0 0 3 2 0 4 2 1 0 3 1 0 3 2
9 6 3 0 0 2 3 0 4 4 0 1 2 2 14 3 2 2 3
10 1 3 6 1 0 0 2 3 2 1 0 3 1 2 2 9 2 9
11 0 2 1 0 0 0 0 2 1 2 3 0 0 0 2 2 0 6
12 0 0 0 2 4 0 2 23 0 2 1 0 0 0 2 0 0 1
13 12 3 1 0 0 2 0 3 3 14 2 0 0 0 0 0 0 4
14 0 2 2 0 5 3 2 6 1 3 2 2 2 0 0 2 6 4
15 5 1 2 0 2 1 1 3 0 2 9 2 0 0 2 0 6 10
16 5 2 1 3 2 2 1 1 3 2 2 0 0 0 6 6 0 1
17 1 4 2 2 2 2 1 7 2 3 9 6 1 4 4 10 1 0
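
For comparison, here is an alternative sketch that is not the approach used in this section: instead of doubling the list of pairings, count them once with pd.crosstab and add the transpose. The toy pairings, column names a and b, and other variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical pairings in their original order only.
toy_pairs = pd.DataFrame(
    [("Fire", "Flying"), ("Fire", "Flying"), ("Grass", "Poison")],
    columns=["a", "b"],
)

# Every label that appears in either column, in alphabetical order.
labels = sorted(set(toy_pairs["a"]) | set(toy_pairs["b"]))

# One-directional counts, padded out to a square table.
counts = pd.crosstab(toy_pairs["a"], toy_pairs["b"]).reindex(
    index=labels, columns=labels, fill_value=0
)

# Adding the transpose counts each pairing in both directions.
symmetric = counts + counts.T
toy_matrix = symmetric.values.tolist()

print(symmetric)
```

On this toy input the result is the same symmetric matrix that the doubling approach would give, with the labels kept on the rows and columns until the final conversion to lists.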

Now for the names of our types.

In [10]:
names = np.unique(data).tolist()
pd.DataFrame(names)
Out[10]:
0
0 Bug
1 Dark
2 Dragon
3 Electric
4 Fairy
5 Fighting
6 Fire
7 Flying
8 Ghost
9 Grass
10 Ground
11 Ice
12 Normal
13 Poison
14 Psychic
15 Rock
16 Steel
17 Water
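
With the matrix and the names in hand, the strongest pairing can also be looked up numerically before drawing anything. The sketch below uses a three-type excerpt of the matrix above (Bug, Flying and Normal); the variable names are illustrative.

```python
import numpy as np

# Excerpt of the co-occurrence matrix for three types, for illustration.
toy_names = ["Bug", "Flying", "Normal"]
toy_matrix = np.array([[0, 13, 0],
                       [13, 0, 23],
                       [0, 23, 0]])

# The matrix is symmetric, so only the upper triangle needs searching.
upper = np.triu(toy_matrix, k=1)
i, j = np.unravel_index(np.argmax(upper), upper.shape)

top_pair = (toy_names[i], toy_names[j], int(upper[i, j]))
print(top_pair)
```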

Chord Diagram

Time to visualise the co-occurrence of types using a chord diagram. We are going to use a list of custom colours that represent the types.

In [11]:
colors = ["#A6B91A", "#705746", "#6F35FC", "#F7D02C",
          "#D685AD", "#C22E28", "#EE8130", "#A98FF3",
          "#735797", "#7AC74C", "#E2BF65", "#96D9D6",
          "#A8A77A", "#A33EA1", "#F95587", "#B6A136",
          "#B7B7CE", "#6390F0"]

Finally, we can put it all together.

In [12]:
Chord(
    matrix,
    names,
    colors=colors,
    wrap_labels=False,
    margin=10,
    credit=True
).show()
Chord Diagram

Conclusion

In this section, we demonstrated how to conduct some data wrangling on a downloaded dataset to prepare it for a chord diagram. Our chord diagram is interactive, so you can use your mouse or touchscreen to investigate the co-occurrences!