
Make Money with Python – AZLyrics Scrape

Upwork again offers an opportunity to demonstrate how we can make money with Python. The task proposal is a web-scraping job: extract the lyrics for every artist on AZLyrics.com.

Watch the YouTube tutorial…

Client’s requirements

Homepage

A review of the homepage shows the letters A–Z plus ‘#’ across the top of the page, one link for each group of artists. Otherwise, it’s a very plain site.

Inspect the page

Inspecting the home page, we can see a list of links, one for each letter, listing the artists by (mostly) their last names. They are contained in a div of class ‘btn-group’ and are ‘a’ links of class ‘btn btn-menu’. Searching on that class in the inspect window, we can see that there are 27 instances: A–Z plus ‘#’.

Extract links for letters listed in the header

So we can find the ‘letter links’ using this class as the identifier.
To start this code we’ll need some libraries:
The requests library and BeautifulSoup.
We’ll also need a user agent for this site, which we can get by googling ‘my user agent’. Mine is a Mozilla-based web browser.
Let’s start some code.

    import requests
    from bs4 import BeautifulSoup as bs4
    
    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
    
    headers = {'User-Agent': user_agent}
    
    url = 'https://www.azlyrics.com/'
    
    response = requests.get(url, headers=headers)
    soup = bs4(response.text, 'html.parser')

This soup allows us to extract the letter hrefs using BeautifulSoup. I have limited the list to 2 entries so we do not overload the server. Also, the scraped links are protocol-relative (they start with two forward slashes), so we prepend the string ‘https:’ to each one before appending it to the letter_hrefs list, giving a correctly formatted URL.

    #get list of hrefs for artist letters
    letter_hrefs=[]
    letters = soup.find_all('a' ,{'class': 'btn btn-menu'} )
    for letter in letters[:2]: # 2 entries only
        href = letter.get('href')
        letter_hrefs.append('https:' + href)
    
    letter_hrefs #list output

['https://www.azlyrics.com/a.html', 'https://www.azlyrics.com/b.html']
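As an aside, the standard library’s urljoin handles protocol-relative links like these automatically, so it could replace the manual ‘https:’ concatenation. A minimal sketch, using a sample href of the form scraped above:

```python
from urllib.parse import urljoin

base = 'https://www.azlyrics.com/'
href = '//www.azlyrics.com/a.html'  # protocol-relative, as scraped from the page

# urljoin borrows the scheme from the base URL for '//...' links
full = urljoin(base, href)
print(full)  # https://www.azlyrics.com/a.html
```

This also copes with ordinary relative links, so one call covers both cases.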

Extract the artists

Next, we want to go to each of these links in turn and request the list of artists from each. Again, we are limiting ourselves to 2 links. Inspecting the page, we can see that the artists are listed as hrefs in the ‘a’ tags of divs with class ‘artist-col’. Because these hrefs are relative, we also need to prepend the base URL so that each artist link is fully formed.

    all_artists = []
    for href in letter_hrefs[:2]: # limit to 2 links - testing purposes
        response = requests.get(href, headers=headers)
        artist_soup = bs4(response.text, 'html.parser')
        artists_soup_list = artist_soup.find_all('div', {'class': 'artist-col'})
        for artist in artists_soup_list:
            artist_atags = artist.find_all('a')
            for atag in artist_atags:
                all_artists.append(url + atag.get('href'))
    all_artists[:5] # print the first 5 links

['https://www.azlyrics.com/a/a1.html',
 'https://www.azlyrics.com/f/floyda1bentley.html',
 'https://www.azlyrics.com/a/a1xj1.html',
 'https://www.azlyrics.com/a/a.html',
 'https://www.azlyrics.com/a/a2h.html']

Extract the songs for each artist

Next, we need to request each artist’s page, convert it to ‘soup’ and extract the list of songs for that artist.
Let’s inspect the page and see what we need to extract. We can see that the songs are contained in divs with class name ‘listalbum-item’. Again we will need to extract the ‘href’, add the base URL, and append the result to the list.
Some of the song links do not follow the standard pattern and already include the full URL. These are handled as part of an ‘if’ statement so that each is appended to the song list correctly.
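That branching logic can be checked in isolation with a small helper. A sketch, where make_song_url is a hypothetical name rather than part of the tutorial’s script:

```python
def make_song_url(base, href):
    """Return a full song URL whether href is absolute or site-relative."""
    if 'https' in href:
        return href                      # already a full URL
    return base + href.lstrip('/')       # relative: join onto the base URL

base = 'https://www.azlyrics.com/'
print(make_song_url(base, '/lyrics/a1/foreverinlove.html'))
# https://www.azlyrics.com/lyrics/a1/foreverinlove.html
print(make_song_url(base, 'https://www.azlyrics.com/lyrics/a1/readyornot.html'))
# https://www.azlyrics.com/lyrics/a1/readyornot.html
```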

    songs = []
    for artist in all_artists[:2]: # limit to 2 artists - testing purposes
        response = requests.get(artist, headers=headers)
        song_soup = bs4(response.text, 'html.parser')
        song_soup_list = song_soup.find_all('div', {'class': 'listalbum-item'})
        for song in song_soup_list:
            song_atags = song.find_all('a')
            for atag in song_atags:
                href = atag.get('href')
                if 'https' not in href:
                    songs.append(url + href[1:]) # relative link: strip the leading '/'
                else:
                    songs.append(href) # already a full URL
    songs[:5] # let's look at 5 examples

['https://www.azlyrics.com/lyrics/a1/foreverinlove.html',
 'https://www.azlyrics.com/lyrics/a1/bethefirsttobelieve.html',
 'https://www.azlyrics.com/lyrics/a1/summertimeofourlives.html',
 'https://www.azlyrics.com/lyrics/a1/readyornot.html',
 'https://www.azlyrics.com/lyrics/a1/everytime.html']

Extracting the song’s Title, Artist and Lyrics

We now have a comprehensive list of all of the songs from the website, from which we can extract the Title, Artist and Lyrics.

Inspecting a ‘song’ page, we can see that the page’s title contains the ‘Artist’ name and ‘Song’ title, so we can extract this information with BeautifulSoup.

Further down the page, we can see the lyrics inside an ‘unnamed’ div (one with no class or id). This is only slightly problematic: we can find the preceding div, named ‘ringtone’, and select its ‘next sibling’ to extract the lyrics.
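The ‘next sibling’ technique can be demonstrated on a toy snippet (the HTML below is made up for illustration; the real page is more elaborate):

```python
from bs4 import BeautifulSoup

html = '''
<div class="ringtone">Ringtone ad block</div>
<div>These are the lyrics</div>
'''

soup = BeautifulSoup(html, 'html.parser')
# Find the named div, then step over to the unnamed div that follows it
ringtone_div = soup.find('div', {'class': 'ringtone'})
lyrics_div = ringtone_div.find_next_sibling('div')
print(lyrics_div.text)  # These are the lyrics
```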

Build a tuple for the Song, Artist and Lyrics and add to a list of Results

We will now extract the key details for every song ready for writing to a text file.
Firstly, we get the ‘title’ tag text from the ‘head’ of the HTML structure.
Then, we need to manipulate this string.
We split the text string on the ‘|’ and take element ‘[0]’ of the resulting list to manipulate further.
That substring is then split on the ‘-’, with the artist being the first element.
For the song name, we then remove (replace with nothing) the word ‘Lyrics’ and strip any surrounding white space.
Finally, we create a tuple of these 3 variables, which will be added to a list in the final script.
Here we just print the tuple to demonstrate that this section of code works.
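The string manipulation can be checked in isolation on a sample title string. A sketch, where parse_title is a hypothetical helper and the sample string mimics the site’s title format:

```python
def parse_title(title):
    """Split an AZLyrics <title> string into (artist, song)."""
    before_pipe = title.split('|')[0]   # drop everything after the '|'
    parts = before_pipe.split('-')      # 'artist - song Lyrics'
    artist = parts[0]
    song = parts[1].replace('Lyrics', '').strip()
    return artist, song

artist, song = parse_title('a1 - Forever In Love Lyrics | AZLyrics.com')
# artist is 'a1 ' (trailing space intact), song is 'Forever In Love'
```

Note that the artist keeps its trailing space, matching the ‘a1 ’ visible in the output tuple below; a .strip() on the artist would tidy that up.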

    all_lyrics = []
    for song in songs:
        response = requests.get(song, headers=headers)
        song_soup = bs4(response.text, 'html.parser')
        # title first
        title = song_soup.find('title').text
        title = title.split('|')
        title = title[0].split('-')
        artist = title[0]
        song = title[1].replace('Lyrics', '').strip()

        # Extract the lyrics
        lyrics = song_soup.find('div', {'class': 'ringtone'})
        lyric_tag = lyrics.find_next_sibling('div')
        lyrics = lyric_tag.text
        song_tuple = (song, artist, lyrics)
        print(song_tuple)
        break # demonstrate with the first song only
('Forever In Love', 'a1 ', "\n\r\nLove leads to laughter\nLove leads to pain\nWith you by my side\nI feel good times again\n\nNever have I felt these feelings before\nYou showed me the world\nHow can I ask for more?\n\nAnd although there's confusion\nWe'll find a solution to keep my heart close to you\n\nAnd I know, yes I know\nIf you hold me, believe me\nI'll never, never ever leave\n\nAnd I know\nThere is nothing that I would not do for you\nForever be true\nAnd I know\nAlthough times can be hard\nWe will see it through\nI'm forever in love with you\n\nShow me affection\nIn all different ways\nGive you my heart\nFor the rest of my days\n\nWith you all my troubles are left far behind \nLike heaven on earth\nWhen I look in your eyes\n\nAnd although there's confusion\nWe'll find a solution\nTo keep my heart close to you\n\nAnd I know, yes I know\nIf you hold me, believe me\nI'll never, never ever leave\n\nAnd I know\nThere is nothing that I would not do for you\nForever be true\nAnd I know\nAlthough times can be hard\nWe will see it through\nI'm forever in love with you\n\nNo need to cry\nI'll be right by your side\n(Right by your side)\n\nLet's take our time\nLove won't run dry\nIf you hold me, believe me\nI'll never, never ever leave\n\nAnd I know\nThere is nothing that I would not do for you\nForever be true\nAnd I know\nAlthough times can be hard\nWe will see it through\nI'm forever in love\nAnd I know\nThere is nothing that I would not do for you\n\nForever be true\nAnd I know\n\nOh I know\nAlthough times can be hard\nWe will see it through\nI'm forever in love with you\n")
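Throughout, the lists have been sliced to 2 entries so we do not overload the server. For a fuller run, a polite delay between requests helps too, since lyrics sites may rate-limit or block aggressive scrapers. A minimal sketch, where polite_get is a hypothetical wrapper and the delay value is an assumption:

```python
import time

def polite_get(get_func, url, delay=5):
    """Wait `delay` seconds, then fetch `url` with the supplied get function."""
    time.sleep(delay)
    return get_func(url)

# In the loops above, a call would look something like:
#   response = polite_get(lambda u: requests.get(u, headers=headers), song, delay=5)
```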

Refactor for a single script in modular form

Finally, we now refactor these sections of code into a full script for use.

# AZLyrics Scraper
# Charming Python
# 14 Jan 24

import requests
from bs4 import BeautifulSoup as bs4

def get_response(url, headers):
    response = requests.get(url, headers=headers)
    return response

def get_artists(response):
    artists = []
    artist_soup = bs4(response.text, 'html.parser')
    artists_soup_list = artist_soup.find_all('div',{'class':'artist-col'})
    for artist in artists_soup_list:
        artist_atags = artist.find_all('a')
        for atag in artist_atags:
            artists.append(atag.get('href'))
    return artists

def get_songs(response):
    songs = []
    song_soup = bs4(response.text, 'html.parser')
    song_soup_list = song_soup.find_all('div',{'class':'listalbum-item'})
    for song in song_soup_list:
        song_atags = song.find_all('a')
        for atag in song_atags:
            songs.append(atag.get('href'))
    return songs

def get_lyrics(response):
    song_tuples =[] #title, artist, content
    song_soup = bs4(response.text, 'html.parser')
    title = song_soup.find('title').text
    title = title.split('|')
    title = title[0].split('-')
    artist = title[0]
    song = title[1].replace('Lyrics','').strip()
    # Extract the lyrics
    lyrics = song_soup.find('div', {'class': 'ringtone'})
    lyric_tag = lyrics.find_next_sibling('div')
    lyrics = lyric_tag.text
    song_tuples=(song, artist, lyrics)
    return song_tuples

def main():
    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
    headers = {'User-Agent': user_agent}
    url = 'https://www.azlyrics.com/'
    #first soup
    response = get_response(url, headers)
    soup = bs4(response.text, 'html.parser')
    #get list of hrefs for artist letters
    letter_hrefs=[]
    letters = soup.find_all('a' ,{'class': 'btn btn-menu'} )
    for letter in letters[:2]: # limit to 2 letters - testing purposes
        href = letter.get('href')
        letter_hrefs.append('https:' + href)
    # get the artists list

    all_artists = []
    for href in letter_hrefs[:2]: # limit the list - testing purposes
        response = get_response(href, headers)
        artists = get_artists(response)
        all_artists = all_artists + artists

    all_songs = []
    for artist in all_artists[:2]: # limit the list - testing purposes
        artist_url = url + artist # complete the partial URL
        response = get_response(artist_url, headers)
        songs = get_songs(response)
        all_songs = all_songs + songs

    all_song_tuples = []
    for song in all_songs[:2]: # limit the list - testing purposes
        if 'https' in song:
            song_url = song # already a full URL
        else:
            song_url = url + song[1:] # relative link: strip the leading '/'
        response = get_response(song_url, headers=headers)
        song_tuple = get_lyrics(response)
        all_song_tuples.append(song_tuple)

    # write the results to file
    with open ('songs.txt','w') as f:
        for lyric in all_song_tuples:
            title = lyric[0]
            artist = lyric[1]
            content = lyric[2]
            f.write(f'Song Title: {title}\n')
            f.write(f'Artist: {artist}\n')
            f.write('Lyrics:\n')
            f.write(content)
            f.write('\n')
            f.write('------------------------\n')
            f.write('\n')   

if __name__ == '__main__':
    main()
