Learning Japanese Through Music: An Analysis of Ichiko Aoba's Lyrics
I recently started learning Japanese. I believe immersion is necessary for language acquisition; it’s how kids learn their mother tongue. Since it’ll require thousands of hours, I’m trying to make it fun with good films and music.
Almost four years ago I fell in love with the tender melodies of Japanese songwriter Ichiko Aoba (青葉市子) through her magical album
“Music transcends language barriers, reaching places words cannot.”
— 青葉市子 (Ichiko Aoba)
“言葉が通じないところでも音楽は通じていくものだ”
— 青葉市子 (Ichiko Aoba)
Having felt her music, I decided to try to understand the lyrics. It’s hard—I know very little vocabulary. I am learning the most common words of the language, but to make learning more purposeful, I decided to figure out the words she uses most, and study them—even if they are rarer.
Why not listen to her music while reading this post? Here’s a nice compilation.
Counting words
Getting the most used words from an English text is simple:
- Split the text into words
- Count the times each word appears
- Sort the counts
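For a space-delimited language, those three steps fit in a few lines. Here is a minimal sketch, assuming a simple lowercase-and-regex tokenizer (the function name `most_common_words` is only illustrative):

```python
import re
from collections import Counter


def most_common_words(text, n=10):
    """Return the n most frequent words in a space-delimited text."""
    # 1. Split the text into lowercase words (letters and apostrophes only).
    words = re.findall(r"[a-z']+", text.lower())
    # 2. Count how many times each word appears.
    counts = Counter(words)
    # 3. Sort by count, most frequent first.
    return counts.most_common(n)


if __name__ == "__main__":
    print(most_common_words("I see what she sees, and she saw what I see.", n=3))
```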
There are two problems, though. First, Japanese doesn’t use spaces, complicating the word-splitting step. Second, even if it used spaces, I want to group words by their root; I don’t care if I find “see” X times, “sees” Y times and “saw” Z times. I want “see” with a count of X+Y+Z.
A single tool solves both problems: morphological analysis. Morphology examines how words are built from the smallest meaningful units of a language: morphemes. For example, “unbreakable” has three morphemes: the prefix “un-”, the root “break” and the suffix “-able”.
I found a collection of natural language processing tools for Japanese which includes morphological analyzers. Given a text, a morphological analyzer will split it into words and show attributes like their “dictionary form” (e.g. not “ran”, but “run”).
My plan: download all of Aoba’s lyrics, process them with a morphological analyzer, and count how many times each word appears.
Getting the lyrics
If you’ve ever looked up the lyrics to a song, you probably ended up on Genius.com. That site has most of Ichiko Aoba’s lyrics. To download them, I used LyricsGenius.
# Using a fork of LyricsGenius with a bug fix: https://github.com/xathon/LyricsGenius
# pip install git+https://github.com/xathon/LyricsGenius.git
# Create an account and visit https://genius.com/api-clients
# Configuration.
# All her albums except a soundtrack (Amiko) and a field recording album (鮎川のしづく [Ayukawa no shizuku]).
# Avoid re-downloading.
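The download step can be sketched roughly as follows. This is not the original script: `save_lyrics` and `genius_fetcher` are hypothetical helpers, the one-file-per-album layout is an assumption, and the LyricsGenius calls (`Genius`, `search_album`, `album.tracks`, `track.song.lyrics`) reflect my understanding of version 3 of that library, so check them against the version you install.

```python
from pathlib import Path


def save_lyrics(albums, fetch_album, out_dir="lyrics"):
    """Save each album's lyrics as <out_dir>/<album>.txt, skipping existing files.

    `fetch_album` is any callable mapping an album title to a
    {song title: lyrics} dict. Returns the titles that were fetched.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    downloaded = []
    for album in albums:
        path = out / f"{album}.txt"
        if path.exists():
            # Avoid re-downloading.
            continue
        songs = fetch_album(album)
        text = "\n\n".join(f"## {title}\n{lyrics}" for title, lyrics in songs.items())
        path.write_text(text, encoding="utf-8")
        downloaded.append(album)
    return downloaded


def genius_fetcher(access_token, artist="Ichiko Aoba"):
    """Build a fetch_album callable backed by LyricsGenius (imported lazily)."""
    import lyricsgenius  # third-party package

    # Assumption: section headers such as [Chorus] are not wanted in the counts.
    genius = lyricsgenius.Genius(access_token, remove_section_headers=True)

    def fetch_album(title):
        album = genius.search_album(title, artist)
        if album is None:
            raise LookupError(f"Album not found on Genius: {title}")
        return {track.song.title: track.song.lyrics for track in album.tracks}

    return fetch_album
```

With an API token in hand, a call like `save_lyrics(["Windswept Adan"], genius_fetcher("YOUR_TOKEN"))` would fetch that album once and skip it on later runs. Passing the fetcher in as an argument keeps the caching logic easy to try without touching the network.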
I fixed a few mistakes and added lyrics for songs that lacked them.
The lyrics for 血の風 (Chi no kaze) are in Okinawan, and I only found a partial translation, so I removed them.
After trying a few Python libraries, I decided to use Janome for the morphological analysis. I scanned the lyrics of each album, counting how many times each word (in its “dictionary form”) appeared.
# docs: https://mocobeta.github.io/janome/api/janome.html#janome.tokenizer.Token
# Each Token object has the following attributes:
# - surface: the word as it appears in the text
# - part_of_speech: the part of speech of the word, which can be a compound value like "動詞,自立,*,*"
# - infl_type: the type of inflection of the word (e.g., "五段・ラ行" for a verb)
# - infl_form: the form of inflection of the word (e.g., "連用形" for a verb in the continuative form)
# - base_form: the word in its dictionary/base form (e.g., "行く" for the verb "行った")
# - reading: the reading of the word in katakana
# - phonetic: the phonetic representation of the word in katakana
# Function to remove non-word characters (space, comma, newline…)
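The counting step can be sketched like this, under a few assumptions: `count_base_forms` and `tokenize` are hypothetical names, skipping only the 記号 (symbol) category is my guess at what "non-word characters" covers, and the fallback to `surface` is a defensive choice for tokens without a usable base form. The counter accepts any objects exposing the `surface`, `part_of_speech` and `base_form` attributes listed above.

```python
from collections import Counter

# Main part-of-speech categories to skip (assumption: only symbols/punctuation).
SKIPPED_POS = ("記号",)


def count_base_forms(tokens, skipped_pos=SKIPPED_POS):
    """Count tokens by dictionary form, ignoring the skipped categories."""
    counts = Counter()
    for token in tokens:
        # part_of_speech looks like "動詞,自立,*,*"; the first field is the main category.
        main_pos = token.part_of_speech.split(",")[0]
        if main_pos in skipped_pos:
            continue
        # Fall back to the surface form if the base form is missing ("*").
        word = token.base_form if token.base_form != "*" else token.surface
        if not word.strip():
            # Skip whitespace-only tokens such as spaces and newlines.
            continue
        counts[word] += 1
    return counts


def tokenize(text):
    """Tokenize Japanese text with Janome (imported lazily)."""
    from janome.tokenizer import Tokenizer

    return Tokenizer().tokenize(text)
```

Calling `count_base_forms(tokenize(lyrics))` once per album gives one `Counter` per album, and adding those counters together gives the overall frequencies.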
Now I had a list of all the words in Ichiko Aoba’s lyrics and their frequency: here it is. Having the data ready, I couldn’t resist visualising it.
Word clouds
In a word cloud, the size of each word is proportional to its frequency.
I used the word_cloud Python package and the Jisho API to get approximate translations.
# Overall cloud.
# It's translation time!
# Jisho provided too much context for these, or not the right meaning.
# Multiple words can have the same translation (e.g. "僕" & "私" = "I").
# Fetch translations for all words.
# Translated word clouds.
# Note: I used the SVG masks to complete the word clouds with the album covers in Photoshop.
# I got the covers from https://ichikoaoba.com/discography/.
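The translation and rendering steps can be sketched as below. Everything here is illustrative rather than the original code: the helper names are hypothetical, the response shape (`data[0].senses[0].english_definitions`) is what I understand Jisho's unofficial search endpoint (`https://jisho.org/api/v1/search/words?keyword=...`) to return, and the 3000-pixel square canvas with a white background is an assumption.

```python
from collections import Counter


def first_english_definition(jisho_response):
    """Pick the first English definition from a parsed Jisho response, or None."""
    try:
        return jisho_response["data"][0]["senses"][0]["english_definitions"][0]
    except (KeyError, IndexError, TypeError):
        return None


def translate_counts(counts, translations):
    """Re-key word counts by translation, summing words that share one."""
    merged = Counter()
    for word, count in counts.items():
        # Keep the original word when no translation is available.
        merged[translations.get(word) or word] += count
    return merged


def render_cloud(frequencies, font_path, out_path, size=3000):
    """Draw a square word cloud with the word_cloud package (imported lazily)."""
    from wordcloud import WordCloud

    # The untranslated cloud needs a font that contains Japanese glyphs.
    cloud = WordCloud(font_path=font_path, width=size, height=size,
                      background_color="white")
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(out_path)
```

Summing in `translate_counts` is what lets 僕 and 私 show up as a single, larger "I" in the translated cloud. A hand-written dictionary of overrides can be merged into `translations` for the words where Jisho's first definition is too verbose or not the intended sense.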
Here’s the word cloud for the aggregated lyrics. Click on the image to translate it:
I repeated the process for each album using the word cloud as a mask and the cover art as the background. Again, clicking on them will show the translation:
To see an image in full size, right-click on it and “Open image in new tab”.
Some observations:
- Many of the big words are nature-related: 風 (wind), 光 (light), 星 (star), 海 (sea), 空 (sky)… These, along with others like 静か (quiet), 夢 (dream), 消える (disappear), and ふわり (gently), match the feelings her music evokes.
- Over 60% of the extracted words appear only once. These are hapax legomena: words that occur only once in a given text or corpus. This is consistent with Zipf’s law, which predicts that a small number of words will be very common, while the majority of words will appear rarely.
- 言霊 (kotodama) is one of the hapax legomena. Its literal meaning is “word spirit/soul”, and it refers to the spiritual power that words are said to possess. In ancient Japan words were seen as having the same essence as physical objects.
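The hapax share mentioned above is easy to compute from a frequency table. Here is a minimal sketch, assuming the counts are held in a `collections.Counter` (both function names are illustrative, and the sample words are a toy input, not the real data):

```python
from collections import Counter


def hapax_share(counts):
    """Return the fraction of distinct words that occur exactly once."""
    if not counts:
        return 0.0
    hapaxes = [word for word, n in counts.items() if n == 1]
    return len(hapaxes) / len(counts)


def rank_frequency(counts):
    """Return (rank, word, count) rows, most frequent first."""
    return [(rank, word, n)
            for rank, (word, n) in enumerate(counts.most_common(), start=1)]


if __name__ == "__main__":
    toy = Counter("風 風 風 光 光 星 海 空".split())
    print(hapax_share(toy))      # 3 of the 5 distinct words occur once
    print(rank_frequency(toy)[:2])
```

Plotting the rows from `rank_frequency` on log-log axes is the usual way to eyeball how closely a corpus follows Zipf’s law.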
To learn this vocabulary, I’ll create flashcards for the most frequent words and try to spot them when immersed in Japanese media.
Slowly but surely (じわじわ), I’ll be able to understand Ichiko Aoba’s lyrics—or the words she uses, at least.