# Cosine Similarity, Mind Maps

Author: J. Busse, 6/2021, 2022-04-19

License: public domain / [CC 0](https://creativecommons.org/publicdomain/zero/1.0/deed.de)

For further development by students as part of the course dsci-txt

To cite this program:

* Busse 2021-06-16: Cosinus Ähnlichkeit, Mindmaps. IPYNB-Notebook, April 2022

In [1]:
import numpy as np
import pandas as pd

### Global Parameters

In [2]:
# path to files, incl. glob mask

path_to_files = "mm/*.mm"

# show intermediary results
# 0 none, 1 informative, 2 debug
verbosity = 2

In [3]:
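from IPython.display import display

def verbose(level, item):
    """Display item if level does not exceed the global verbosity."""
    if level <= verbosity:
        display(item)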

Read Filenames
----

In [4]:
# https://stackoverflow.com/questions/3207219/how-do-i-list-all-files-of-a-directory
import glob
files = glob.glob(path_to_files)
verbose(2,files)

[]
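An empty list here means that the glob mask matched no files, in which case every later cell operates on an empty corpus. A minimal sketch of an early check follows; the helper name `find_mindmaps` is hypothetical and not part of the original notebook, and the sorting is an added assumption to make the row order reproducible.

```python
import glob


def find_mindmaps(pattern):
    """Return the sorted list of files matching pattern; fail early if none match.

    Hypothetical helper, not part of the original notebook.
    """
    matches = sorted(glob.glob(pattern))
    if not matches:
        raise FileNotFoundError(
            "no files match {!r}; check the working directory and the glob mask".format(pattern)
        )
    return matches
```

It could be used as `files = find_mindmaps(path_to_files)` in place of the plain `glob.glob` call.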

### read mindmaps

Read each map; the result is a collection of maps.

In [5]:
import xml.etree.ElementTree as ET
from xml.etree.ElementTree import Element, SubElement, Comment, tostring, ElementTree

In [6]:
def walk_and_collect_dict(node, parent_text, resultdict):
    """Walk the mind map recursively and collect n-grams into resultdict."""
    
    # a node without a TEXT attribute would otherwise raise a TypeError below
    myText = node.get('TEXT') or ""
    
    # textAnalysiert = SpaCy.nlp(myText)
    
    # basic bag-of-words (BOW) items: the terms themselves
    resultdict[ "A_" + myText ] = 1
    
    # add n-gram to BOW, e.g. parent<-child
    resultdict[ "B_" + parent_text + "|" + myText ] = 1
    
    # add term plus time stamp of node creation to BOW
    #resultdict[ "C_" + myText + "_" + node.get('CREATED') ] = 1

    # add CREATED to BOW
    #resultdict[ "D_" + "CREATED_" + node.get('CREATED') ] = 1
    
    # add MODIFIED to BOW
    # resultdict[ "E_" + "MODIFIED_" + node.get('MODIFIED') ] = 1
    
    
    for child in node.findall('node'):
        walk_and_collect_dict(child, myText, resultdict)
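
To see which features the recursive walk produces, the following sketch runs a compact stand-in for `walk_and_collect_dict` on a tiny, made-up mind map. The XML layout (a `map` root with nested `node` elements carrying a `TEXT` attribute) is an assumption inferred from the code above; the sample data and the function name `collect` are illustrative only.

```python
import xml.etree.ElementTree as ET

# a tiny, made-up mind map (illustrative only)
SAMPLE = """<map>
  <node TEXT="root">
    <node TEXT="a">
      <node TEXT="b"/>
    </node>
  </node>
</map>"""


def collect(node, parent_text, resultdict):
    """Compact stand-in for walk_and_collect_dict: A_ terms and B_ parent|child pairs."""
    text = node.get('TEXT')
    resultdict["A_" + text] = 1
    resultdict["B_" + parent_text + "|" + text] = 1
    for child in node.findall('node'):
        collect(child, text, resultdict)


root = ET.fromstring(SAMPLE)
tokens = {}
# start below the central node, as read_mm_files does
for n in root.findall('node/node'):
    collect(n, "TOP", tokens)

print(tokens)
# {'A_a': 1, 'B_TOP|a': 1, 'A_b': 1, 'B_a|b': 1}
```

The central node itself ("root") does not appear as a feature, because the walk starts at its children and labels their parent as "TOP".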

In [7]:
def read_mm_files(files):
    corpus = {}
    
    # walk through all files
    for file in files:
        verbose(2, "reading {}".format(file))

        # parse the mindmap file as an XML element tree; passing the path
        # lets the parser take the encoding from the XML declaration
        # https://docs.python.org/3/library/xml.etree.elementtree.html#parsing-xml
        tree = ET.parse(file)

        # point root to xml root-element "/map"
        root = tree.getroot()

        # start below the central node; its children get the parent label "TOP"
        tokens = {}
        for n in root.findall('node/node'):
            walk_and_collect_dict(n, "TOP", tokens)

        corpus[file] = tokens
    return corpus

In [8]:
corpus_dict = read_mm_files(files)

In [9]:
corpus_dict

{}

In [10]:
# https://www.geeksforgeeks.org/how-to-create-dataframe-from-dictionary-in-python-pandas/
# Method 6: Create DataFrame from nested Dictionary.

# do not modify
corpus_df = pd.DataFrame(corpus_dict).T.fillna(0)

# show
corpus_df.T
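
The conversion from the nested dictionary to a document-term table can be checked on a made-up corpus. This is a minimal sketch; the file names and features are invented for illustration.

```python
import pandas as pd

# made-up corpus: file name -> {feature: 1}
toy_corpus = {
    "m1.mm": {"A_x": 1, "A_y": 1},
    "m2.mm": {"A_y": 1, "A_z": 1},
}

# outer keys become columns, so transpose to get one row per document;
# features missing from a document show up as NaN and are set to 0
toy_df = pd.DataFrame(toy_corpus).T.fillna(0)
print(toy_df)
```

Each row is then one mind map and each column one feature, which is the layout the TF-IDF step expects.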

In [11]:
# look at one particular class of columns only,
# here: all columns whose name starts with 'C'
[ c for c in corpus_df.columns if c[0] == 'C']

[]

In [12]:
# optional:
# split the whole corpus into a dictionary of corpora, one per feature class
corpus_df_dict = {}
typliste = ['A', 'B', 'C', 'D']
for t in typliste:
    Auswahl = [ c for c in corpus_df.columns if c[0] == t]
    print(t, Auswahl)
    corpus_df_dict[t] = corpus_df[Auswahl]
    

A []
B []
C []
D []


In [13]:
for t in typliste:
    display(corpus_df_dict[t].T)

In [14]:
# optional: restrict the corpus to selected feature classes?
#corpus_df = pd.concat([corpus_df_dict['C'], corpus_df_dict['D']], axis=1)
#corpus_df

TfIdf
----

  * https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer
  * https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

In [15]:
from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer(smooth_idf=False)
verbose(1,transformer)

In [16]:
tfidf = transformer.fit_transform(corpus_df)
tfidf

ValueError: at least one array or dtype is required
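
The `ValueError` above is consistent with an empty `corpus_df`: no files were found, so there is nothing to transform. To make the weighting itself visible, here is a minimal NumPy sketch of what `TfidfTransformer(smooth_idf=False)` computes according to the scikit-learn documentation: idf(t) = ln(n / df(t)) + 1, followed by L2 normalisation of each row. The count matrix is made up, and the sketch assumes every term occurs in at least one document.

```python
import numpy as np

# made-up document-term counts: 2 documents, 3 terms
counts = np.array([[1.0, 1.0, 0.0],
                   [1.0, 0.0, 1.0]])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)      # number of documents containing each term
idf = np.log(n_docs / doc_freq) + 1.0    # unsmoothed idf (smooth_idf=False)

weighted = counts * idf                  # tf * idf
norms = np.linalg.norm(weighted, axis=1, keepdims=True)
tfidf_manual = weighted / norms          # L2-normalise each row

print(idf)
print(tfidf_manual)
```

The first term occurs in both documents and therefore gets the minimum idf of 1, while the two terms that occur in only one document are weighted more heavily.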

In [None]:
verbose(1,pd.DataFrame(tfidf.toarray()))

Cosine Similarity
-------

  * interesting, possibly worth trying as well: https://stackoverflow.com/questions/17627219/whats-the-fastest-way-in-python-to-calculate-cosine-similarity-given-sparse-mat
  
Here we take a rather low-level approach so that we can look under the hood:
  * https://scikit-learn.org/stable/modules/metrics.html#cosine-similarity
  * didaktische Erklärung: http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(tfidf)
verbose(2,similarity)
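
Going one level lower still, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following sketch computes it for all pairs of rows with plain NumPy; the helper name `cosine_similarity_manual` and the sample vectors are illustrative, and the sketch assumes that no row is all zeros.

```python
import numpy as np


def cosine_similarity_manual(X):
    """Pairwise cosine similarity between the rows of X (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / norms          # assumes no all-zero rows
    return unit @ unit.T      # dot products of unit vectors


vectors = np.array([[1.0, 0.0],
                    [1.0, 1.0],
                    [0.0, 2.0]])
print(cosine_similarity_manual(vectors).round(3))
```

The first and third vectors are orthogonal, so their similarity is 0; each of them forms a 45 degree angle with the second vector, which gives about 0.707. Since TF-IDF rows are already L2-normalised, the similarity of such rows reduces to a plain matrix product.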

In [None]:
similarity_df = pd.DataFrame(similarity)
similarity_df.columns = files 
similarity_df.index = files
similarity_df

In [None]:
import seaborn as sns
# hide the upper triangle (including the diagonal), since the matrix is symmetric
mask = np.zeros_like(similarity, dtype=bool)
mask[np.triu_indices_from(mask)] = True
ax = sns.heatmap(similarity_df, mask=mask, annot=True, cmap='RdBu')

In [None]:
ax = sns.clustermap(similarity_df, annot=True, cmap='RdBu')
ax.savefig("clustermap.png")

https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html