Crawler for the IKEA Catalogue (deprecated)

Why

IKEA announced that it would end publication of the IKEA Catalogue after 70 years, and I then ran into the fact that they had uploaded electronic editions of the catalogue to IKEA kataloger. That got me excited, since it gave me another chance to exercise my web scraping skills (although only when I had downloaded almost all of them did I realize that they are all written in Swedish, as implied by the ISO 639-1 code “sv” in the hyperlink).

How

First, open the GUI browser of your choice (for me, it’s Firefox ESR) and press Ctrl + Shift + I to open Developer Tools as always. Inspect the first cover image, which leads us to a class named “grid”, and voilà, we can see the image link. The pattern is almost too obvious: https://ikeamuseum.com/wp-content/themes/ikea-museum/includes/catalogues/external/images/default-1979.jpg. However, that is just a thumbnail with a resolution of 500×675 pixels. If we are brave enough to click the image on the website, the URL becomes https://ikeamuseum.com/sv/ikea-kataloger/?id=1979, so I guess you’ve got the point: the year is the only part that changes. Inspect the image again and we find that its URL is https://ikeamuseum.com/wp-content/themes/ikea-museum/includes/catalogues/external/images/large-1979.jpg, which is an image with a resolution of 1200×1641 pixels. After that it’s just a for or while loop with wget, and we are done with the cover part, since IKEA Museum does not seem to be as sensitive as Google: no captcha and no unbearable download limit.
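As a rough sketch of that loop, the function below only prints the large-cover URLs for a range of years, so the list can be checked before anything is downloaded. The function name cover_urls and the year range in the example are my own assumptions; only the URL pattern comes from the 1979 example above.

```shell
#!/bin/sh
# Base path of the cover images, taken from the 1979 example above.
base='https://ikeamuseum.com/wp-content/themes/ikea-museum/includes/catalogues/external/images'

# cover_urls FIRST LAST (hypothetical helper)
# Print one large-cover URL per year in the inclusive range FIRST..LAST.
cover_urls() {
    year=$1
    while [ "$year" -le "$2" ]; do
        printf '%s/large-%s.jpg\n' "$base" "$year"
        year=$((year + 1))
    done
}

# Example (not run here; the year range is an assumption):
# cover_urls 1951 2021 | wget --wait=1 -i -
```

Piping the list into wget -i - lets wget read the URLs from standard input, and --wait=1 adds a one-second pause between requests to stay polite even though the site does not seem to enforce a limit.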

Now, if you dare, click the book icon and we land on https://ikeacatalogues.ikea.com/sv-1979/page/1, where you may be surprised by the content of the 212-page catalogue. On the left side there is a download icon, suggesting that we can download the original PDF at https://ikeacatalogues.ikea.com/77436/1102679/pdfs/ed0738b6-d395-4379-b0cb-33eae94e3c81.pdf, though the URL itself is obscure this time and cannot be guessed from the year. If you search for the link in the page source, extracting it seems to be just a matter of RegEx. That is not the whole story, however, since the downloaded file ends up with an unreadable name like ed0738b6-d395-4379-b0cb-33eae94e3c81.pdf. To fix that, we can use the --content-disposition option in wget, which saves the file under the name suggested by the server, something like ikea-museum-4kgg93zv6bpb-sv-1979.pdf, and that’s it! Go write a shell script to generate a file that contains all the links, type wget -i your_file, press Enter, and wait until the download finishes.
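The link-generating script could look something like the sketch below. It is not the original crawler: the function name pdf_links and the regular expression are my own, and it assumes the PDF link appears verbatim in the page source in the same shape as the 1979 example. The year range and the file name links.txt in the commented example are placeholders too.

```shell
#!/bin/sh
# pdf_links (hypothetical helper)
# Read page source on stdin and print every catalogue PDF link found,
# one per line, without duplicates. The pattern mirrors the 1979 example:
# two numeric path segments, then /pdfs/ and a UUID-like file name.
pdf_links() {
    grep -oE 'https://ikeacatalogues\.ikea\.com/[0-9]+/[0-9]+/pdfs/[0-9a-f-]+\.pdf' | sort -u
}

# Example (not run here; the year range is an assumption):
# for year in $(seq 1951 2021); do
#     wget -qO- "https://ikeacatalogues.ikea.com/sv-$year/page/1" | pdf_links
# done > links.txt
# wget --content-disposition -i links.txt
```

Keeping the extraction in its own function makes it easy to test against a saved copy of one page before hitting the site for every year.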

What

I am going to cover this part this coming weekend. Before that, I have to download all the catalogues again in English, orz.

[The crawler is done, but the article was never finished.]

Vinfall's Geekademy

Sine īrā et studiō


IKEA catalogue crawler written in Bash (deprecated).


Created 2020-12-14
Updated 2022-05-22

#dev #life #python