Zalando Dress Recommendation and Tagging

Utilize images and textual descriptions to suggest and tag products

Marco Cerliani, May 2

In Artificial Intelligence, Computer Vision techniques are widely applied.

A nice field of application (one of my favourites) is the fashion industry.

The availability of resources in the form of raw images makes it possible to develop interesting use cases.

Zalando knows this (I suggest taking a look at their GitHub repository) and frequently develops amazing AI solutions or publishes juicy ML research studies.

In the AI community, the Zalando research team is also known for releasing Fashion-MNIST, a dataset of Zalando’s article images, which aims to replace the traditional MNIST dataset in the study of machine learning.

Recently they released another interesting dataset: Feidegger.

It is a dataset composed of dress images and related textual descriptions.

Like the previous one, this data was donated by Zalando to the research community to experiment with various text-image tasks such as captioning and image retrieval.

In this post I make use of this data to build: a Dress Recommendation System based on image similarity; a Dress Tagging System based only on textual descriptions.

THE DATASET

The dataset itself consists of 8732 high-resolution images, each depicting a dress available on the Zalando shop against a white background.

Five textual annotations in German are provided for each image, each written by a separate user.

The example below shows 2 of the 5 descriptions for a dress (English translations only given for illustration, but not part of the dataset).

Source: Zalando

Initially the dataset stores, for each single description, the related image (as a URL), so a single dress has multiple entries.

We start by merging the descriptions of the same dress, which makes it easier to work with the images and removes duplicates.

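    import pandas as pd

    # the loading call is missing from the source text
    data = pd. … .fillna(' ')
    # concatenate all descriptions that share the same image URL
    newdata = data.groupby('Image URL')['Description'].apply(lambda x: x.str.cat(sep=' ')).reset_index()

DRESS RECOMMENDATION SYSTEM

In order to build our dress recommendation system we make use of transfer learning.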

In detail, we utilize the pre-trained VGG16 to extract relevant features from our dress images and build a similarity score on them.

    vgg_model = vgg16.VGG16(weights='imagenet')
    # the layer selector is missing from the source text; the post says the second-to-last layer
    feat_extractor = Model(inputs=vgg_model.input, outputs=vgg_model. … .output)

We ‘cut’ VGG at its second-to-last layer, so for every image we obtain a feature vector of dimension 1×4096.

At the end of this process we can plot all our features in a 2D space:

[Figure: TSNE on VGG features]

To test the quality of our system we hold out a part of our dresses (around 10%).

The rest are used to build the similarity score matrix.

We chose cosine similarity as the similarity score.

Every time we pass a dress image to our system, we compute its similarity with all the dresses stored in ‘train’ and then select the most similar ones (those with the highest similarity scores).
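The retrieval step described above can be sketched on toy data without any external library. This is only an illustration under stated assumptions: the helper names `cosine_sim` and `most_similar` and the 3-dimensional vectors are made up here, standing in for the 4096-dimensional VGG features.

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

def most_similar(train, query, k=5):
    """Indices of the k train vectors most similar to the query, best first."""
    scores = [cosine_sim(row, query) for row in train]
    ranked = sorted(range(len(train)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Toy features (hypothetical values, not taken from the dataset).
train = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.2, 0.0]
top2 = most_similar(train, query, k=2)
print(top2)
```

With these toy vectors, the second training vector points in nearly the same direction as the query, so it ranks first. Cosine similarity ignores vector length, which is why it is a common choice for comparing feature vectors of this kind.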

    from sklearn.metrics.pairwise import cosine_similarity

    sim = cosine_similarity(train, test[test_id].reshape(1, -1))

Here I report some examples, where the ‘original’ image is a dress from the test set.

The dresses on the right are the 5 most similar to the ‘original’ dress that we passed in.

Not bad! VGG is very powerful and does a very good job!

DRESS TAGGING SYSTEM

The approach we followed to develop the dress tagging system is different from the one used for dress similarity.

This scenario is also different from a classical tagging problem, where we have images and the related tags in the form of single words.

Here we have only text descriptions of dresses and we have to extract information from them.

This is a little bit tricky because we have to analyze free text written by humans.

Our idea is to extract the most significant words from the descriptions in order to use them as tags for the images.

Our workflow is summarized in the graph below:

The image descriptions are written in basic German… Zum Glück spreche ich wenig Deutsch (luckily, I speak a little German), so I decided to work with the German text and ask Google Translate in case of difficulty.

Our idea is to develop two different models: one for nouns and another one that deals with adjectives.

To make this separation we first run POS tagging on the image descriptions of our original dataset.

    import nltk
    import spacy
    from tqdm import tqdm

    # keep only alphabetic tokens, including German umlauts and ß
    tokenizer = nltk.tokenize.RegexpTokenizer(r'[a-zA-ZäöüßÄÖÜ]+')
    nlp = spacy.load('de_core_news_sm')

    def clean(txt):
        text = tokenizer.tokenize(txt)
        text = nlp(" ".join(text))
        adj, noun = [], []
        for token in text:
            # collect lemmas of adjectives and nouns longer than 2 characters
            if token.pos_ == 'ADJ' and len(token) > 2:
                adj.append(token.lemma_)
            elif token.pos_ in ['NOUN', 'PROPN'] and len(token) > 2:
                noun.append(token.lemma_)
        return " ".join(adj).lower(), " ".join(noun).lower()

    adj, noun = zip(*map(clean, tqdm(data['Description'])))

Afterwards we combine all the adjectives that refer to the same image (and do the same with the nouns).

    newdata = data.groupby('Image URL')['adj_Description'].apply(lambda x: x.str.cat(sep=' XXX ')).reset_index()

At this point, to extract significant tags for every image, we apply TF-IDF and take the most important ADJs/NOUNs according to this score (we select the 3 best ADJs/NOUNs; if no words are found, we return a series of ‘xxx’ placeholders, only for efficiency).

I also compile a list of ambiguous ADJs/NOUNs to exclude.
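The ranking idea can be illustrated with a small pure-Python sketch. It is not the post's implementation: the function name `top_tags`, the toy German descriptions, and the simplified weighting `tf * (log(N / df) + 1)` are assumptions made here, and the weighting omits the smoothing and normalisation that scikit-learn's `TfidfVectorizer` applies. The `min_df` and `remove` parameters mirror the document-frequency threshold and the exclusion list mentioned above.

```python
import math
from collections import Counter

def top_tags(comments, n_word=3, min_df=2, remove=()):
    """Rank words across one dress's descriptions by summed, simplified TF-IDF."""
    docs = [[w for w in c.lower().split() if w not in remove] for c in comments]
    n_docs = len(docs)
    # document frequency: in how many descriptions each word occurs
    df = Counter(w for doc in docs for w in set(doc))
    scores = Counter()
    for doc in docs:
        for word, tf in Counter(doc).items():
            if df[word] >= min_df:  # ignore words used by a single annotator
                scores[word] += tf * (math.log(n_docs / df[word]) + 1.0)
    # highest score first; ties broken alphabetically for determinism
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    tags = ranked[:n_word]
    # pad with the placeholder when fewer than n_word tags survive
    return tags + ["xxx"] * (n_word - len(tags))

# Hypothetical adjective lists from five annotators of the same dress.
comments = ["schwarz kurz elegant", "schwarz elegant", "schwarz lang", "kurz schwarz", "rot"]
tags = top_tags(comments)
tags_filtered = top_tags(comments, remove={"schwarz"})
print(tags, tags_filtered)
```

In this toy case a word has to be used by at least two annotators to become a tag, so `lang` and `rot` are dropped, and excluding `schwarz` leaves only two real tags plus one placeholder.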

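    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    def tagging(comments, remove=None, n_word=3):
        comments = comments.split('XXX')
        try:
            counter = TfidfVectorizer(min_df=2, analyzer='word', stop_words=remove)
            counter.fit(comments)
            # part of this expression is missing from the source text
            score = counter. … .sum(axis=0)
            # use get_feature_names_out() on recent scikit-learn releases
            word = counter.get_feature_names()
            # part of this expression is missing from the source text
            vocab = pd. … .values
            # pad with 'xxx' when fewer than n_word tags are found
            return " ".join(list(vocab) + ['xxx'] * (n_word - len(vocab)))
        except Exception:
            return " ".join(['xxx'] * n_word)

For every dress we end up with at most 3 ADJs and 3 NOUNs… We are ready to build our models!

To feed our models we reuse the features extracted earlier with VGG.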

In our case every dress appears at most 3 times, with at most 3 different labels (corresponding to 3 different ADJs/NOUNs).
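One way to read the paragraph above is that each dress's feature vector is repeated once per tag, and each copy gets a one-hot label. The sketch below builds such pairs on toy data. The function name `build_training_pairs`, the toy vectors, and the choice to skip the ‘xxx’ placeholders are assumptions for illustration and are not taken from the original code.

```python
def build_training_pairs(features, tags, placeholder="xxx"):
    """Repeat each feature vector once per real tag and one-hot encode that tag."""
    vocab = sorted({t for dress_tags in tags for t in dress_tags if t != placeholder})
    index = {t: i for i, t in enumerate(vocab)}
    X, y = [], []
    for vec, dress_tags in zip(features, tags):
        for t in dress_tags:
            if t == placeholder:
                continue  # assumption: placeholders are not used as labels
            one_hot = [0] * len(vocab)
            one_hot[index[t]] = 1
            X.append(vec)
            y.append(one_hot)
    return X, y, vocab

# Toy 2-dimensional features standing in for the 4096-dimensional VGG vectors.
features = [[0.1, 0.2], [0.3, 0.4]]
tags = [["schwarz", "kurz", "xxx"], ["rot", "schwarz", "lang"]]
X, y, vocab = build_training_pairs(features, tags)
print(len(X), vocab)
```

Here the first dress contributes two rows and the second three, so no dress appears more than 3 times, and the width of each one-hot row plays the role of `y.shape[1]` in the model's output layer.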

The models we utilize are very simple and have the same structure, as shown below:

    inp = Input(shape=(4096, ))
    dense1 = Dense(256, activation='relu')(inp)
    dense2 = Dense(128, activation='relu')(dense1)
    drop = Dropout(0.5)(dense2)
    dense3 = Dense(64, activation='relu')(drop)
    out = Dense(y.shape[1], activation='softmax')(dense3)
    model = Model(inputs=inp, outputs=out)
    model.compile(optimizer='adam', loss='categorical_crossentropy')

Let’s see some results!

We test our models on the same dresses as before and show the two labels with the highest probability, for ADJs and NOUNs (I also provide translations).
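Selecting the two most probable labels from a softmax output can be sketched as follows. The vocabulary, the raw scores, and the helper names `softmax` and `top_labels` are made up for illustration; in the post the probabilities would come from the trained models.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_labels(probabilities, vocab, n=2):
    """Return the n labels with the highest probability, best first."""
    ranked = sorted(zip(probabilities, vocab), key=lambda p: p[0], reverse=True)
    return [label for _, label in ranked[:n]]

# Hypothetical label vocabulary and raw output scores for one dress.
vocab = ["kurz", "lang", "rot", "schwarz"]
logits = [1.0, -0.5, 0.2, 2.3]
probs = softmax(logits)
best_two = top_labels(probs, vocab, n=2)
print(best_two)
```

Because softmax preserves the ordering of the raw scores, the two labels with the largest scores are also the two with the highest probability.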

The results are great! Jointly, our models are able to describe the dresses shown in the images well.

SUMMARY

In this post we make use of transfer learning to develop a content-based recommendation system.

In the second stage we try to tag dresses by extracting information only from their textual descriptions.

The results achieved are beautiful and easy to inspect, and they can also give you some advice if you’d like to renew your wardrobe.

CHECK MY GITHUB REPO

Keep in touch: Linkedin.
