Segmenting a text file: exercises

Prerequisites:

Read the file www.w3schools.com_python_python_strings_methods.txt:

Loading and preparing the dataset

# psm: python_strings_methods
with open("www.w3schools.com_python_python_strings_methods.txt") as my_file:
    psm_text = my_file.read()
psm_text_zeilen = psm_text.splitlines()
#psm_text_zeilen
psm_text_tokens = []

for zeile in psm_text_zeilen[1:]:  # skip the first line, it contains something else
    if len(zeile) >= 1:
        token_list = zeile.split("\t")
        psm_text_tokens.append( token_list )
psm_text_tokens
[['capitalize()', 'Converts the first character to upper case'],
 ['casefold()', 'Converts string into lower case'],
 ['center()', 'Returns a centered string'],
 ['count()',
  'Returns the number of times a specified value occurs in a string'],
 ['encode()', 'Returns an encoded version of the string'],
 ['endswith()', 'Returns true if the string ends with the specified value'],
 ['expandtabs()', 'Sets the tab size of the string'],
 ['find()',
  'Searches the string for a specified value and returns the position of where it was found'],
 ['format()', 'Formats specified values in a string'],
 ['format_map()', 'Formats specified values in a string'],
 ['index()',
  'Searches the string for a specified value and returns the position of where it was found'],
 ['isalnum()',
  'Returns True if all characters in the string are alphanumeric'],
 ['isalpha()',
  'Returns True if all characters in the string are in the alphabet'],
 ['isascii()',
  'Returns True if all characters in the string are ascii characters'],
 ['isdecimal()', 'Returns True if all characters in the string are decimals'],
 ['isdigit()', 'Returns True if all characters in the string are digits'],
 ['isidentifier()', 'Returns True if the string is an identifier'],
 ['islower()', 'Returns True if all characters in the string are lower case'],
 ['isnumeric()', 'Returns True if all characters in the string are numeric'],
 ['isprintable()',
  'Returns True if all characters in the string are printable'],
 ['isspace()', 'Returns True if all characters in the string are whitespaces'],
 ['istitle() ', 'Returns True if the string follows the rules of a title'],
 ['isupper()', 'Returns True if all characters in the string are upper case'],
 ['join()', 'Joins the elements of an iterable to the end of the string'],
 ['ljust()', 'Returns a left justified version of the string'],
 ['lower()', 'Converts a string into lower case'],
 ['lstrip()', 'Returns a left trim version of the string'],
 ['maketrans()', 'Returns a translation table to be used in translations'],
 ['partition()',
  'Returns a tuple where the string is parted into three parts'],
 ['replace()',
  'Returns a string where a specified value is replaced with a specified value'],
 ['rfind()',
  'Searches the string for a specified value and returns the last position of where it was found'],
 ['rindex()',
  'Searches the string for a specified value and returns the last position of where it was found'],
 ['rjust()', 'Returns a right justified version of the string'],
 ['rpartition()',
  'Returns a tuple where the string is parted into three parts'],
 ['rsplit()',
  'Splits the string at the specified separator, and returns a list'],
 ['rstrip()', 'Returns a right trim version of the string'],
 ['split()',
  'Splits the string at the specified separator, and returns a list'],
 ['splitlines()', 'Splits the string at line breaks and returns a list'],
 ['startswith()',
  'Returns true if the string starts with the specified value'],
 ['strip()', 'Returns a trimmed version of the string'],
 ['swapcase()', 'Swaps cases, lower case becomes upper case and vice versa'],
 ['title()', 'Converts the first character of each word to upper case'],
 ['translate()', 'Returns a translated string'],
 ['upper()', 'Converts a string into upper case'],
 ['zfill()',
  'Fills the string with a specified number of 0 values at the beginning']]
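Note that the output above contains the entry 'istitle() ' with a trailing space, because the fields are taken over exactly as they appear between the tabs. The following is a minimal sketch of how the fields could be stripped while splitting. It does not read the real file: sample_text is a made-up stand-in, and its header line "Method\tDescription" is an assumption, since the text only says that the first line contains something else.

```python
# Hypothetical stand-in for the file contents (tab-separated, made-up header line).
sample_text = (
    "Method\tDescription\n"
    "capitalize()\tConverts the first character to upper case\n"
    "\n"
    "istitle() \tReturns True if the string follows the rules of a title\n"
)

sample_tokens = []
for line in sample_text.splitlines()[1:]:  # skip the first line, as in the cell above
    if line.strip():  # ignore empty or whitespace-only lines
        # strip every field so that 'istitle() ' becomes 'istitle()'
        sample_tokens.append([field.strip() for field in line.split("\t")])

print(sample_tokens)
```

Whether stripping is wanted depends on the exercise: the later outputs on this page were produced without it, so they still show 'istitle() '.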

Exercise 1: list of all functions

Given:

  • Our text in the variable psm_text_tokens.

Wanted:

  • A list of all function names (the first entry of each token list)

Example:

psm_fn_list = ['capitalize()', 'casefold()', 'center()', 'count()', 'encode()',
     "viele_sonstige_funktionen",
     'swapcase()', 'title()', 'translate()', 'upper()', 'zfill()']
print(psm_fn_list)
['capitalize()', 'casefold()', 'center()', 'count()', 'encode()', 'viele_sonstige_funktionen', 'swapcase()', 'title()', 'translate()', 'upper()', 'zfill()']

Write your own code here:

# compute psm_fn_list from psm_text_tokens:
psm_fn_list = [ token[0] for token in psm_text_tokens if len(token[0]) >= 1 ]
# print(psm_fn_list): ['capitalize()', 'casefold()', 'center()', 'count()', ...
assert 'lower()' in psm_fn_list

Functions that start with “is”

Given: our text in the variables

  • psm_text_tokens

  • psm_fn_list

Wanted:

  • a list of all functions whose name starts with “is”:

psm_fn_is = [ f for f in psm_fn_list if f.startswith("is") ]
print(psm_fn_is)
['isalnum()', 'isalpha()', 'isascii()', 'isdecimal()', 'isdigit()', 'isidentifier()', 'islower()', 'isnumeric()', 'isprintable()', 'isspace()', 'istitle() ', 'isupper()']
assert 'isdecimal()' in psm_fn_is

Representation as a dict

Given:

  • psm_text_zeilen

Wanted:

  • a representation as a dict psm_text_dict that maps each function name to its description

For example: psm_text_dict == {'capitalize()': 'Converts the first character to upper case',  'casefold()': 'Converts string into lower case',  'center()': 'Returns a centered string', ... }

psm_text_dict = {}

for zeile in psm_text_zeilen[1:] :
    if len(zeile) >= 1:
        token_list = zeile.split("\t")
        psm_text_dict[token_list[0]] = token_list[1]
# psm_text_dict: 
# {'capitalize()': 'Converts the first character to upper case',
# 'casefold()': 'Converts string into lower case', ...
psm_text_dict
{'capitalize()': 'Converts the first character to upper case',
 'casefold()': 'Converts string into lower case',
 'center()': 'Returns a centered string',
 'count()': 'Returns the number of times a specified value occurs in a string',
 'encode()': 'Returns an encoded version of the string',
 'endswith()': 'Returns true if the string ends with the specified value',
 'expandtabs()': 'Sets the tab size of the string',
 'find()': 'Searches the string for a specified value and returns the position of where it was found',
 'format()': 'Formats specified values in a string',
 'format_map()': 'Formats specified values in a string',
 'index()': 'Searches the string for a specified value and returns the position of where it was found',
 'isalnum()': 'Returns True if all characters in the string are alphanumeric',
 'isalpha()': 'Returns True if all characters in the string are in the alphabet',
 'isascii()': 'Returns True if all characters in the string are ascii characters',
 'isdecimal()': 'Returns True if all characters in the string are decimals',
 'isdigit()': 'Returns True if all characters in the string are digits',
 'isidentifier()': 'Returns True if the string is an identifier',
 'islower()': 'Returns True if all characters in the string are lower case',
 'isnumeric()': 'Returns True if all characters in the string are numeric',
 'isprintable()': 'Returns True if all characters in the string are printable',
 'isspace()': 'Returns True if all characters in the string are whitespaces',
 'istitle() ': 'Returns True if the string follows the rules of a title',
 'isupper()': 'Returns True if all characters in the string are upper case',
 'join()': 'Joins the elements of an iterable to the end of the string',
 'ljust()': 'Returns a left justified version of the string',
 'lower()': 'Converts a string into lower case',
 'lstrip()': 'Returns a left trim version of the string',
 'maketrans()': 'Returns a translation table to be used in translations',
 'partition()': 'Returns a tuple where the string is parted into three parts',
 'replace()': 'Returns a string where a specified value is replaced with a specified value',
 'rfind()': 'Searches the string for a specified value and returns the last position of where it was found',
 'rindex()': 'Searches the string for a specified value and returns the last position of where it was found',
 'rjust()': 'Returns a right justified version of the string',
 'rpartition()': 'Returns a tuple where the string is parted into three parts',
 'rsplit()': 'Splits the string at the specified separator, and returns a list',
 'rstrip()': 'Returns a right trim version of the string',
 'split()': 'Splits the string at the specified separator, and returns a list',
 'splitlines()': 'Splits the string at line breaks and returns a list',
 'startswith()': 'Returns true if the string starts with the specified value',
 'strip()': 'Returns a trimmed version of the string',
 'swapcase()': 'Swaps cases, lower case becomes upper case and vice versa',
 'title()': 'Converts the first character of each word to upper case',
 'translate()': 'Returns a translated string',
 'upper()': 'Converts a string into upper case',
 'zfill()': 'Fills the string with a specified number of 0 values at the beginning'}
# For interested readers: what happens in the following line?
{ key: psm_text_dict[key] for key in list(psm_text_dict.keys())[0:5] }
{'capitalize()': 'Converts the first character to upper case',
 'casefold()': 'Converts string into lower case',
 'center()': 'Returns a centered string',
 'count()': 'Returns the number of times a specified value occurs in a string',
 'encode()': 'Returns an encoded version of the string'}
assert psm_text_dict['join()'] == 'Joins the elements of an iterable to the end of the string'
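As a hint for the question to interested readers: the dict comprehension turns all keys into a list, slices off the first five, and looks each of them up again, which works because a dict keeps its insertion order (Python 3.7 and later). The sketch below shows the same idea on a small stand-in dict (demo_dict is a hypothetical name, with entries copied from the output above) next to one alternative based on itertools.islice.

```python
from itertools import islice

# Small stand-in for psm_text_dict, entries copied from the output above.
demo_dict = {
    'capitalize()': 'Converts the first character to upper case',
    'casefold()': 'Converts string into lower case',
    'center()': 'Returns a centered string',
    'count()': 'Returns the number of times a specified value occurs in a string',
}

# Variant from the text: list of all keys, slice, then look up each key again.
first_two_a = {key: demo_dict[key] for key in list(demo_dict.keys())[0:2]}

# Alternative: take the first two (key, value) pairs directly from the items view.
first_two_b = dict(islice(demo_dict.items(), 2))

print(first_two_a == first_two_b)
```

The islice variant avoids building a list of all keys first, which only matters for large dicts.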

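Representation similar to orient = “index”

Given:

  • psm_text_zeilen, psm_text_tokens

Wanted:

  • a dict psm_orient_index that maps each line number to a dict with the keys “Funktion” and “Beschreibung”

psm_orient_index = {} # a dict, not a set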

for zeilennummer in range(len(psm_text_tokens)):
    psm_orient_index[zeilennummer] = \
        { "Funktion": psm_text_tokens[zeilennummer][0], 
          "Beschreibung": psm_text_tokens[zeilennummer][1] }
# For interested readers: what happens in the following line?
{ key: psm_orient_index[key] for key in range(5) }
{0: {'Funktion': 'capitalize()',
  'Beschreibung': 'Converts the first character to upper case'},
 1: {'Funktion': 'casefold()',
  'Beschreibung': 'Converts string into lower case'},
 2: {'Funktion': 'center()', 'Beschreibung': 'Returns a centered string'},
 3: {'Funktion': 'count()',
  'Beschreibung': 'Returns the number of times a specified value occurs in a string'},
 4: {'Funktion': 'encode()',
  'Beschreibung': 'Returns an encoded version of the string'}}

Exercise: express this solution as a comprehension!

psm_orient_index = { zeilennummer : 
             {"Funktion": psm_text_tokens[zeilennummer][0],
              "Beschreibung": psm_text_tokens[zeilennummer][1] }
               for zeilennummer in range(len(psm_text_tokens))
           }
assert psm_orient_index[0] == {'Funktion': 'capitalize()',
 'Beschreibung': 'Converts the first character to upper case'}
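The comprehension above indexes into psm_text_tokens twice per row. A possible variation, sketched here on a two-row stand-in (demo_tokens and demo_orient_index are hypothetical names), uses enumerate() together with tuple unpacking so that no explicit index access is needed. It assumes that every row has exactly two fields.

```python
# Two rows copied from psm_text_tokens as a stand-in.
demo_tokens = [
    ['capitalize()', 'Converts the first character to upper case'],
    ['casefold()', 'Converts string into lower case'],
]

# enumerate() yields (row number, row); each row is unpacked into its two fields.
demo_orient_index = {
    zeilennummer: {"Funktion": funktion, "Beschreibung": beschreibung}
    for zeilennummer, (funktion, beschreibung) in enumerate(demo_tokens)
}

print(demo_orient_index[0])
```

If a row had more or fewer than two fields, the unpacking would raise a ValueError, which can be useful as an early check of the input.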

Creating a “normalised” description

We want to analyse the text of the function descriptions. To do so, we “normalise” the texts:

  • lower case only

  • no punctuation marks (“.,:;?!”) any more

Step 1:

  • define a function normalisiere() that normalises a string

def normalisiere(s):
    s_clean = [ Buchstabe for Buchstabe in s.lower() if Buchstabe not in ".,:;?!" ]
    ... # 
    return ergebnis

normalisiere("Hä? Ach so!")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[15], line 6
      3     ... # 
      4     return ergebnis
----> 6 normalisiere("Hä? Ach so!")

Cell In[15], line 4, in normalisiere(s)
      2 s_clean = [ Buchstabe for Buchstabe in s.lower() if Buchstabe not in ".,:;?!" ]
      3 ... # 
----> 4 return ergebnis

NameError: name 'ergebnis' is not defined
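The NameError above occurs because the stub never assigns ergebnis. One possible way to complete the function is sketched below. It is not necessarily the intended reference solution: it simply joins the filtered characters back into a string.

```python
def normalisiere(s):
    # keep every lower-cased character that is not one of the listed punctuation marks
    s_clean = [buchstabe for buchstabe in s.lower() if buchstabe not in ".,:;?!"]
    # join the remaining characters back into a single string
    ergebnis = "".join(s_clean)
    return ergebnis


print(normalisiere("Hä? Ach so!"))  # prints: hä ach so
```

Spaces are kept as they are, so removing a punctuation mark between two spaces would leave a double space behind.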

Step 2:

Given:

  • psm_orient_index

  • the function normalisiere() from above

Wanted:

  • psm_orient_index with a new key “normalisiert”

for index, Zeile in psm_orient_index.items():
    ... # 
psm_orient_index[0]
{'Funktion': 'capitalize()',
 'Beschreibung': 'Converts the first character to upper case'}
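The output above still lacks the new key because the loop body is left open. A minimal sketch of how the loop could be filled in follows, assuming a working normalisiere() (a compact version is included so the example runs on its own) and using a hypothetical two-row stand-in demo_orient_index instead of the full data.

```python
def normalisiere(s):
    # lower-case the text and drop the punctuation marks . , : ; ? !
    return "".join(c for c in s.lower() if c not in ".,:;?!")


# Stand-in for psm_orient_index, entries copied from the data above.
demo_orient_index = {
    0: {"Funktion": "capitalize()",
        "Beschreibung": "Converts the first character to upper case"},
    1: {"Funktion": "rsplit()",
        "Beschreibung": "Splits the string at the specified separator, and returns a list"},
}

for index, zeile in demo_orient_index.items():
    # add the new key to the inner dict; the outer dict keeps its keys
    zeile["normalisiert"] = normalisiere(zeile["Beschreibung"])

print(demo_orient_index[1]["normalisiert"])
```

Only the inner dicts are modified while iterating, so the loop does not change the size of the dict it iterates over.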

To check the result, view it in a nicer layout:

import pandas as pd
psm_df = pd.DataFrame.from_dict(psm_orient_index, orient = 'index')
psm_df.head()
   Funktion      Beschreibung
0  capitalize()  Converts the first character to upper case
1  casefold()    Converts string into lower case
2  center()      Returns a centered string
3  count()       Returns the number of times a specified value ...
4  encode()      Returns an encoded version of the string

Functions that return True/False

Given:

  • Our text in the various variables above

Wanted:

  • a list of all functions that return True or False

Approach: search the (ideally normalised) description of each function for suitable hints, in particular for the string “True” ;-)

# TBD, 2bd, to be done
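One way this open exercise could be approached is sketched below. It is a heuristic on the description text, not a check of the actual return types: a function is selected if the word "true" occurs in its normalised description. Normalising first matters because the data contains both "Returns True" and "Returns true". The names demo_orient_index and psm_fn_bool_demo are hypothetical, and the three rows are copied from the data above.

```python
def normalisiere(s):
    # lower-case the text and drop the punctuation marks . , : ; ? !
    return "".join(c for c in s.lower() if c not in ".,:;?!")


# Stand-in for psm_orient_index with three rows from the data above.
demo_orient_index = {
    0: {"Funktion": "endswith()",
        "Beschreibung": "Returns true if the string ends with the specified value"},
    1: {"Funktion": "isdigit()",
        "Beschreibung": "Returns True if all characters in the string are digits"},
    2: {"Funktion": "lower()",
        "Beschreibung": "Converts a string into lower case"},
}

# Select every function whose normalised description contains the word "true".
psm_fn_bool_demo = [
    zeile["Funktion"]
    for zeile in demo_orient_index.values()
    if "true" in normalisiere(zeile["Beschreibung"]).split()
]

print(psm_fn_bool_demo)
```

Splitting into words before testing avoids accidental matches inside longer words, at the cost of missing hints that are phrased differently.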