Segmenting a text file: exercises
Prerequisites:
Read in the file www.w3schools.com_python_python_strings_methods.txt:
Load and prepare the dataset
# psm: python_strings_methods
with open("www.w3schools.com_python_python_strings_methods.txt") as my_file:
    psm_text = my_file.read()
psm_text_zeilen = psm_text.splitlines()
#psm_text_zeilen
psm_text_tokens = []
for zeile in psm_text_zeilen[1:]:  # the first line contains something else, skip it
    if len(zeile) >= 1:
        token_list = zeile.split("\t")
        psm_text_tokens.append(token_list)
psm_text_tokens
[['capitalize()', 'Converts the first character to upper case'],
['casefold()', 'Converts string into lower case'],
['center()', 'Returns a centered string'],
['count()',
'Returns the number of times a specified value occurs in a string'],
['encode()', 'Returns an encoded version of the string'],
['endswith()', 'Returns true if the string ends with the specified value'],
['expandtabs()', 'Sets the tab size of the string'],
['find()',
'Searches the string for a specified value and returns the position of where it was found'],
['format()', 'Formats specified values in a string'],
['format_map()', 'Formats specified values in a string'],
['index()',
'Searches the string for a specified value and returns the position of where it was found'],
['isalnum()',
'Returns True if all characters in the string are alphanumeric'],
['isalpha()',
'Returns True if all characters in the string are in the alphabet'],
['isascii()',
'Returns True if all characters in the string are ascii characters'],
['isdecimal()', 'Returns True if all characters in the string are decimals'],
['isdigit()', 'Returns True if all characters in the string are digits'],
['isidentifier()', 'Returns True if the string is an identifier'],
['islower()', 'Returns True if all characters in the string are lower case'],
['isnumeric()', 'Returns True if all characters in the string are numeric'],
['isprintable()',
'Returns True if all characters in the string are printable'],
['isspace()', 'Returns True if all characters in the string are whitespaces'],
['istitle() ', 'Returns True if the string follows the rules of a title'],
['isupper()', 'Returns True if all characters in the string are upper case'],
['join()', 'Joins the elements of an iterable to the end of the string'],
['ljust()', 'Returns a left justified version of the string'],
['lower()', 'Converts a string into lower case'],
['lstrip()', 'Returns a left trim version of the string'],
['maketrans()', 'Returns a translation table to be used in translations'],
['partition()',
'Returns a tuple where the string is parted into three parts'],
['replace()',
'Returns a string where a specified value is replaced with a specified value'],
['rfind()',
'Searches the string for a specified value and returns the last position of where it was found'],
['rindex()',
'Searches the string for a specified value and returns the last position of where it was found'],
['rjust()', 'Returns a right justified version of the string'],
['rpartition()',
'Returns a tuple where the string is parted into three parts'],
['rsplit()',
'Splits the string at the specified separator, and returns a list'],
['rstrip()', 'Returns a right trim version of the string'],
['split()',
'Splits the string at the specified separator, and returns a list'],
['splitlines()', 'Splits the string at line breaks and returns a list'],
['startswith()',
'Returns true if the string starts with the specified value'],
['strip()', 'Returns a trimmed version of the string'],
['swapcase()', 'Swaps cases, lower case becomes upper case and vice versa'],
['title()', 'Converts the first character of each word to upper case'],
['translate()', 'Returns a translated string'],
['upper()', 'Converts a string into upper case'],
['zfill()',
'Fills the string with a specified number of 0 values at the beginning']]
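The parsing step can be tried without the file. The following is a minimal sketch on a hypothetical three-row sample: `sample_text`, its header line `Method\tDescription`, and the helper `parse_rows` are assumptions for illustration and do not come from the original file. Unlike the code above, the sketch also strips surrounding whitespace from every field, which would turn an entry such as 'istitle() ' into 'istitle()'.

```python
# Hypothetical stand-in for the file contents: one header line,
# then tab-separated rows, with an empty line in between.
sample_text = (
    "Method\tDescription\n"
    "capitalize()\tConverts the first character to upper case\n"
    "\n"
    "istitle() \tReturns True if the string follows the rules of a title\n"
)


def parse_rows(text):
    """Skip the first line, drop empty lines, split rows at tabs, strip each field."""
    rows = []
    for line in text.splitlines()[1:]:
        if line.strip():  # ignore empty or whitespace-only lines
            rows.append([field.strip() for field in line.split("\t")])
    return rows


sample_tokens = parse_rows(sample_text)
print(sample_tokens)
```

Whether stripping is wanted depends on the exercise: the outputs on this page keep the trailing space, so later lookups such as `'istitle() '` rely on it.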
Exercise 1: List of all functions
Given:
Our text in the variable psm_text_tokens.
Wanted:
A list of all functions
Example:
psm_fn_list = ['capitalize()', 'casefold()', 'center()', 'count()', 'encode()',
"viele_sonstige_funktionen",
'swapcase()', 'title()', 'translate()', 'upper()', 'zfill()']
print(psm_fn_list)
['capitalize()', 'casefold()', 'center()', 'count()', 'encode()', 'viele_sonstige_funktionen', 'swapcase()', 'title()', 'translate()', 'upper()', 'zfill()']
Write your own code here:
# compute psm_fn_list from psm_text_tokens:
psm_fn_list = [ token[0] for token in psm_text_tokens if len(token[0]) >= 1 ]
# print(psm_fn_list): ['capitalize()', 'casefold()', 'center()', 'count()', ...
assert 'lower()' in psm_fn_list
Functions that start with "is"
Given: our text in the variables
psm_text_tokens
psm_fn_list
Wanted:
A list of all functions that start with "is":
psm_fn_is = [ f for f in psm_fn_list if f.startswith("is") ]
print(psm_fn_is)
['isalnum()', 'isalpha()', 'isascii()', 'isdecimal()', 'isdigit()', 'isidentifier()', 'islower()', 'isnumeric()', 'isprintable()', 'isspace()', 'istitle() ', 'isupper()']
assert 'isdecimal()' in psm_fn_is
Representation as a dict
Given:
psm_text_zeilen
Wanted:
A representation as a dict psm_text_dict,
e.g. psm_text_dict == {'capitalize()': 'Converts the first character to upper case', 'casefold()': 'Converts string into lower case', 'center()': 'Returns a centered string', ... }
psm_text_dict = {}
for zeile in psm_text_zeilen[1:]:
    if len(zeile) >= 1:
        token_list = zeile.split("\t")
        psm_text_dict[token_list[0]] = token_list[1]
# psm_text_dict:
# {'capitalize()': 'Converts the first character to upper case',
# 'casefold()': 'Converts string into lower case', ...
psm_text_dict
{'capitalize()': 'Converts the first character to upper case',
'casefold()': 'Converts string into lower case',
'center()': 'Returns a centered string',
'count()': 'Returns the number of times a specified value occurs in a string',
'encode()': 'Returns an encoded version of the string',
'endswith()': 'Returns true if the string ends with the specified value',
'expandtabs()': 'Sets the tab size of the string',
'find()': 'Searches the string for a specified value and returns the position of where it was found',
'format()': 'Formats specified values in a string',
'format_map()': 'Formats specified values in a string',
'index()': 'Searches the string for a specified value and returns the position of where it was found',
'isalnum()': 'Returns True if all characters in the string are alphanumeric',
'isalpha()': 'Returns True if all characters in the string are in the alphabet',
'isascii()': 'Returns True if all characters in the string are ascii characters',
'isdecimal()': 'Returns True if all characters in the string are decimals',
'isdigit()': 'Returns True if all characters in the string are digits',
'isidentifier()': 'Returns True if the string is an identifier',
'islower()': 'Returns True if all characters in the string are lower case',
'isnumeric()': 'Returns True if all characters in the string are numeric',
'isprintable()': 'Returns True if all characters in the string are printable',
'isspace()': 'Returns True if all characters in the string are whitespaces',
'istitle() ': 'Returns True if the string follows the rules of a title',
'isupper()': 'Returns True if all characters in the string are upper case',
'join()': 'Joins the elements of an iterable to the end of the string',
'ljust()': 'Returns a left justified version of the string',
'lower()': 'Converts a string into lower case',
'lstrip()': 'Returns a left trim version of the string',
'maketrans()': 'Returns a translation table to be used in translations',
'partition()': 'Returns a tuple where the string is parted into three parts',
'replace()': 'Returns a string where a specified value is replaced with a specified value',
'rfind()': 'Searches the string for a specified value and returns the last position of where it was found',
'rindex()': 'Searches the string for a specified value and returns the last position of where it was found',
'rjust()': 'Returns a right justified version of the string',
'rpartition()': 'Returns a tuple where the string is parted into three parts',
'rsplit()': 'Splits the string at the specified separator, and returns a list',
'rstrip()': 'Returns a right trim version of the string',
'split()': 'Splits the string at the specified separator, and returns a list',
'splitlines()': 'Splits the string at line breaks and returns a list',
'startswith()': 'Returns true if the string starts with the specified value',
'strip()': 'Returns a trimmed version of the string',
'swapcase()': 'Swaps cases, lower case becomes upper case and vice versa',
'title()': 'Converts the first character of each word to upper case',
'translate()': 'Returns a translated string',
'upper()': 'Converts a string into upper case',
'zfill()': 'Fills the string with a specified number of 0 values at the beginning'}
# For interested readers: what happens in the following line?
{ key: psm_text_dict[key] for key in list(psm_text_dict.keys())[0:5] }
{'capitalize()': 'Converts the first character to upper case',
'casefold()': 'Converts string into lower case',
'center()': 'Returns a centered string',
'count()': 'Returns the number of times a specified value occurs in a string',
'encode()': 'Returns an encoded version of the string'}
assert psm_text_dict['join()'] == 'Joins the elements of an iterable to the end of the string'
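As a hint for the "interested readers" question above: the expression turns the keys into a list, slices off the first entries, and builds a new dict from them. This works because dicts keep insertion order (guaranteed since Python 3.7). The sketch below uses a small hypothetical stand-in, `demo_dict`, and compares the slicing variant with an alternative based on `itertools.islice`; the alternative is an illustration, not part of the original exercise.

```python
from itertools import islice

# Small stand-in for psm_text_dict (same shape, fewer entries).
demo_dict = {
    "capitalize()": "Converts the first character to upper case",
    "casefold()": "Converts string into lower case",
    "center()": "Returns a centered string",
    "count()": "Returns the number of times a specified value occurs in a string",
}

# Variant from the text: materialize all keys as a list, slice it, rebuild a dict.
first_two = {key: demo_dict[key] for key in list(demo_dict.keys())[0:2]}

# Alternative: take the first two (key, value) pairs lazily, without copying all keys.
first_two_lazy = dict(islice(demo_dict.items(), 2))

print(first_two == first_two_lazy)
```

Both variants give the same two entries; the `islice` form avoids building the full key list, which only matters for large dicts.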
Representation similar to orient = "index"
Given:
psm_text_zeilen, psm_text_tokens
Wanted:
psm_orient_index: a representation in the same data structure as the one returned by pandas' DataFrame.to_dict(orient='index'), see Round 1b: Nested data structures of depth 2.
Column 1: "Funktion"
Column 2: "Beschreibung"
psm_orient_index = {}  # a dict, not a set
for zeilennummer in range(len(psm_text_tokens)):
    psm_orient_index[zeilennummer] = {
        "Funktion": psm_text_tokens[zeilennummer][0],
        "Beschreibung": psm_text_tokens[zeilennummer][1],
    }
# For interested readers: what happens in the following line?
{ key: psm_orient_index[key] for key in range(5) }
{0: {'Funktion': 'capitalize()',
'Beschreibung': 'Converts the first character to upper case'},
1: {'Funktion': 'casefold()',
'Beschreibung': 'Converts string into lower case'},
2: {'Funktion': 'center()', 'Beschreibung': 'Returns a centered string'},
3: {'Funktion': 'count()',
'Beschreibung': 'Returns the number of times a specified value occurs in a string'},
4: {'Funktion': 'encode()',
'Beschreibung': 'Returns an encoded version of the string'}}
Exercise: rewrite this solution as a comprehension!
psm_orient_index = {
    zeilennummer: {
        "Funktion": psm_text_tokens[zeilennummer][0],
        "Beschreibung": psm_text_tokens[zeilennummer][1],
    }
    for zeilennummer in range(len(psm_text_tokens))
}
assert psm_orient_index[0] == {'Funktion': 'capitalize()',
'Beschreibung': 'Converts the first character to upper case'}
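The comprehension can also be written without explicit indexing. The following sketch is one possible variant, not the reference solution: `demo_tokens` is a hypothetical two-row stand-in for psm_text_tokens, and `column_names` is an assumed helper list.

```python
# Stand-in for psm_text_tokens with two rows.
demo_tokens = [
    ["capitalize()", "Converts the first character to upper case"],
    ["casefold()", "Converts string into lower case"],
]

column_names = ["Funktion", "Beschreibung"]

# enumerate() supplies the row number; zip() pairs column names with the row's fields.
demo_orient_index = {
    row_number: dict(zip(column_names, row))
    for row_number, row in enumerate(demo_tokens)
}

print(demo_orient_index[1])
```

Using `zip` keeps the column names in one place, so a third column would only require extending `column_names`.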
Create a "normalized" description
We want to analyze the text of the function descriptions. To do so, we "normalize" the texts:
lowercase only
no special characters (".,:;?!") any more
Step 1:
Define a function normalisiere() that normalizes a string
def normalisiere(s):
    s_clean = [ Buchstabe for Buchstabe in s.lower() if Buchstabe not in ".,:;?!" ]
    ... #
    return ergebnis

normalisiere("Hä? Ach so!")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[15], line 6
      3     ... #
      4     return ergebnis
----> 6 normalisiere("Hä? Ach so!")

Cell In[15], line 4, in normalisiere(s)
      2     s_clean = [ Buchstabe for Buchstabe in s.lower() if Buchstabe not in ".,:;?!" ]
      3     ... #
----> 4     return ergebnis

NameError: name 'ergebnis' is not defined
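The NameError arises because the stub never assigns `ergebnis`: the `...` placeholder still has to be replaced. One possible completion, offered as a sketch rather than the reference solution, joins the filtered characters back into a string:

```python
def normalisiere(s):
    """Lowercase s and drop the characters . , : ; ? !"""
    s_clean = [buchstabe for buchstabe in s.lower() if buchstabe not in ".,:;?!"]
    ergebnis = "".join(s_clean)  # join the kept characters back into one string
    return ergebnis


print(normalisiere("Hä? Ach so!"))  # hä ach so
```

Spaces are kept on purpose, so that the normalized text can still be split into words later.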
Step 2:
Given:
psm_orient_index
the function normalisiere() from above
Wanted:
psm_orient_index with a new key "normalisiert"
for index, Zeile in psm_orient_index.items():
    ... #
psm_orient_index[0]
{'Funktion': 'capitalize()',
'Beschreibung': 'Converts the first character to upper case'}
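The output above still lacks the new key because the loop body is only a placeholder. A minimal sketch of how the loop could be filled in follows; the `normalisiere` shown here is an assumed implementation, and `demo_orient_index` is a hypothetical two-row stand-in for psm_orient_index.

```python
def normalisiere(s):
    # Assumed implementation: lowercase, then drop . , : ; ? !
    return "".join(ch for ch in s.lower() if ch not in ".,:;?!")


demo_orient_index = {
    0: {"Funktion": "capitalize()",
        "Beschreibung": "Converts the first character to upper case"},
    1: {"Funktion": "rsplit()",
        "Beschreibung": "Splits the string at the specified separator, and returns a list"},
}

# Adding a key to the inner dicts is safe while iterating:
# the outer dict itself does not change size.
for index, zeile in demo_orient_index.items():
    zeile["normalisiert"] = normalisiere(zeile["Beschreibung"])

print(demo_orient_index[1]["normalisiert"])
```

Since `index` is not needed in the loop body, iterating over `demo_orient_index.values()` would work just as well.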
To check, view it in a nice layout:
import pandas as pd
psm_df = pd.DataFrame.from_dict(psm_orient_index, orient = 'index')
psm_df.head()
|   | Funktion | Beschreibung |
|---|---|---|
| 0 | capitalize() | Converts the first character to upper case |
| 1 | casefold() | Converts string into lower case |
| 2 | center() | Returns a centered string |
| 3 | count() | Returns the number of times a specified value ... |
| 4 | encode() | Returns an encoded version of the string |
Functions that return True/False
Given:
Our text in the various variables above
Wanted:
A list of all functions that return True or False
Approach: search the (ideally normalized) description of each function for suitable hints, in particular for the string "True" (which reads "true" after normalization) ;-)
# TBD, 2bd, to be done
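One way to approach this open exercise is sketched below. It is a heuristic, not a reference solution: `demo_dict` is a hypothetical four-entry stand-in for psm_text_dict, and `psm_fn_bool` is an assumed name. The descriptions are lowercased before matching because some of them say "Returns true" and others "Returns True".

```python
# Stand-in for psm_text_dict; entries copied from the data shown on this page.
demo_dict = {
    "endswith()": "Returns true if the string ends with the specified value",
    "isdigit()": "Returns True if all characters in the string are digits",
    "lower()": "Converts a string into lower case",
    "istitle() ": "Returns True if the string follows the rules of a title",
}

# Heuristic: the lowercased description starts with "returns true".
psm_fn_bool = [
    funktion.strip()  # also removes the trailing space in 'istitle() '
    for funktion, beschreibung in demo_dict.items()
    if beschreibung.lower().startswith("returns true")
]

print(psm_fn_bool)  # ['endswith()', 'isdigit()', 'istitle()']
```

The result is only as reliable as the descriptions: a function that returns a boolean but is described differently would be missed.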