generalize_names_duplcheck: Generalize names while preventing duplicates among different names

A function that converts a name into a general format <last_name><separator><firstname letter(s)> (all lowercase) in a pandas DataFrame while avoiding duplicate entries.

from mlxtend.text import generalize_names_duplcheck

Overview

Note that using mlxtend.text.generalize_names with few firstname_output_letters can result in duplicate entries. E.g., if your dataset contains the names "Adam Johnson" and "Andrew Johnson", the default setting (i.e., 1 first name letter) will produce the generalized name "johnson a" in both cases.

One solution is to increase the number of first name letters in the output by setting the parameter firstname_output_letters to a value larger than 1.

An alternative solution is to use the generalize_names_duplcheck function if you are working with pandas DataFrames.

By default, generalize_names_duplcheck will apply generalize_names to a pandas DataFrame column with the minimum number of first name letters and append as many first name letters as necessary until no duplicates are present in the given DataFrame column. An example dataset column that contains the names

References

Example 1 - Defaults

Reading in a CSV file that has column Name for which we want to generalize the names:

Samuel Eto'o
Adam Johnson
Andrew Johnson

import pandas as pd
from io import StringIO

simulated_csv = "name,some_value\n"\
                "Samuel Eto'o,1\n"\
                "Adam Johnson,1\n"\
                "Andrew Johnson,1\n"

df = pd.read_csv(StringIO(simulated_csv))
df

	name	some_value
0	Samuel Eto'o	1
1	Adam Johnson	1
2	Andrew Johnson	1

Applying generalize_names_duplcheck to generate a new DataFrame with the generalized names without duplicates:

from mlxtend.text import generalize_names_duplcheck
df_new = generalize_names_duplcheck(df=df, col_name='name')
df_new

	name	some_value
0	etoo s	1
1	johnson ad	1
2	johnson an	1

API

generalize_names_duplcheck(df, col_name)

Generalizes names and removes duplicates.

Applies mlxtend.text.generalize_names to a DataFrame with 1 first name letter by default and uses more first name letters if duplicates are detected.

Parameters

df : pandas.DataFrame

DataFrame that contains a column where generalize_names should be applied.
col_name : str

Name of the DataFrame column where generalize_names function should be applied to.

Returns

df_new : str

New DataFrame object where generalize_names function has been applied without duplicates.

Examples

For usage examples, please see https://rasbt.github.io/mlxtend/user_guide/text/generalize_names_duplcheck/

Keys	Action
`?`	Open this help
`n`	Next page
`p`	Previous page
`s`	Search