Just for fun (part II): Produce real names by nationality using probability

Hi,

See "Just for fun: Produce real names by nationality using probability" for full details leading up to this question.

I am trying to create an accurate (ish) random name generator. I will be using it to performance test my new system. I would like to weight the results of the generator by name popularity.

So far, I have three collections: Male UK Forenames, Female UK Forenames and UK Surnames. Each has a percentage of popularity:

C#

private static Dictionary<string, double> _enMForenames = new Dictionary<string, double>
        {
            {"Oliver",1.939}, //about 499 more
        };
private static Dictionary<string, double> _enFForenames = new Dictionary<string, double>
        {
            {"Amelia",1.638}, //about 499 more
        };
private static Dictionary<string,double> _enSurnames = new Dictionary<string, double>
        {
            {"SMITH",0.0074062771845516}, //about 30k more
        };

Note that these collections are not complete (god help me if they were :S), so the percentages of all the items do not sum to 100% (I think male names are about 68% of the total?)

This is what I have so far for generating one person:

C#

private static Person CreatePerson()
{
    Person newPerson = new Person();

    // Random.Next's upper bound is exclusive, so Next(0, 2) yields 0 or 1
    bool male = rand.Next(0, 2) == 1;

    Dictionary<string, double> forenames;

    if (male)
    {
        forenames = _enMForenames;
        newPerson.Gender = rand.Next(0, 2) == 1 ? "M" : "Male";
    }
    else
    {
        forenames = _enFForenames;
        newPerson.Gender = rand.Next(0, 2) == 1 ? "F" : "Female";
    }

    //This is just rough atm but here is where I need to weight my selection
    //(for now every name is equally likely)
    newPerson.Forename = forenames.Keys.ElementAt(rand.Next(forenames.Count));

    //Surname will be done in a similar way here

    newPerson.Email = string.Format("{0}.{1}@{2}", newPerson.Forename, newPerson.Surname, "test.test");

    //Keep the date of birth between 1900-01-01 and today
    DateTime earliest = new DateTime(1900, 1, 1);
    int days = rand.Next(0, (DateTime.Today - earliest).Days + 1);

    newPerson.DateOfBirth = earliest.AddDays(days);

    return newPerson;
}

So:
How do I use the percent to weight the random name selection?
and
Can I do this more efficiently for large sets of names (2 million)?

Thanks

Andy

Edit: these are my resources:
Forenames
http://www.behindthename.com/top/lists/england-wales/2013
Surnames
http://en.geneanet.org/genealogy/1/Surname.php

Posted 21-May-15 3:35am

Andy Lanng

Updated 21-May-15 4:50am

v2

Comments

Sergey Alexandrovich Kryukov 21-May-15 10:42am

First of all, don't use a dictionary (in contrast to the question of how to store the frequencies). Or, more exactly, use something else for the purpose of generating a name. On input, you have a uniformly distributed random number, not a name, so name keys won't help. You need to create a set of ranges and check which range the number falls in. How to make it efficient? That needs some thought; it is not such a simple thing.
—SA

Sergey Alexandrovich Kryukov 21-May-15 10:44am

Are the figures normalized (does their sum equal 100)?
—SA

Andy Lanng 21-May-15 10:55am

No, but they can be. I don't update the list at run time so I can perform a one-off sum and percentage calc

Sergey Alexandrovich Kryukov 21-May-15 11:36am

Sure. Please see my solution.
—SA

Sascha Lefèvre 21-May-15 10:46am

I might have an idea but don't know if it's feasible with actual data. Is your list of surnames publicly downloadable somewhere?

Andy Lanng 21-May-15 10:54am

Question updated

3 solutions

Solution 3

I haven't tested the other proposed solutions but I could imagine that they take "some time" when generating 2 million random names.

This solution should be a lot faster in exchange for consuming a lot more memory. Maybe it's necessary to sacrifice some accuracy in order to make it work memory-wise, but I assume it will not be necessary for your corpus of names.

1) Sum the weights of all names (e.g. weight = 859017 for "Smith"). Let's call this sum S.

2) Allocate an array with a length of S. Depending on the number of names, it should be an array of UInt16 or Int32.

You might need this in your app.config to allow arrays larger than 2GB (this setting only applies to 64-bit processes):

XML

<runtime>
  <gcAllowVeryLargeObjects enabled="true" />
</runtime>

If you run into an OutOfMemoryException, divide each name's weight by an arbitrary number (e.g. 2). If the new weight would be 0, give it a value of 1 (theoretically; surely not necessary for your data). Start over with step 1. (This is where the accuracy loss would come from.)

3) Initialize that array (pseudo-code):

offset = 0
for nameIdx = 0 to names.length - 1
   for i = 0 to names[nameIdx].weight - 1
      array[offset + i] = nameIdx
   endfor
   offset += names[nameIdx].weight
endfor

4) Generate a random integer rnd with 0 <= rnd < S (S itself is excluded, because the last array index is S - 1). The random name then is: names[array[rnd]]
Repeat step 4 as often as you want.
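
Steps 1 to 4 could be sketched in C# roughly as follows. This is only an illustration under a few assumptions: the class name LookupTableNamePicker is made up for this example, the weights are positive whole-number counts (not percentages), and their sum fits into a single Int32-indexed array.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of the lookup-table idea: one array slot per unit of weight.
public sealed class LookupTableNamePicker
{
    private readonly string[] _names;
    private readonly int[] _table; // _table[slot] = index into _names

    public LookupTableNamePicker(IReadOnlyList<KeyValuePair<string, int>> weightedNames)
    {
        if (weightedNames == null) throw new ArgumentNullException(nameof(weightedNames));
        if (weightedNames.Count == 0) throw new ArgumentException("No names specified.", nameof(weightedNames));

        // Step 1: sum the weights (S). A long avoids overflow while summing.
        long total = 0;
        foreach (var pair in weightedNames)
        {
            if (pair.Value <= 0) throw new ArgumentException("Weights must be positive.", nameof(weightedNames));
            total += pair.Value;
        }
        if (total > int.MaxValue)
            throw new ArgumentException("Total weight too large; scale the weights down first.", nameof(weightedNames));

        // Steps 2 and 3: allocate S slots and fill them name by name.
        _names = new string[weightedNames.Count];
        _table = new int[(int)total];
        int offset = 0;
        for (int nameIdx = 0; nameIdx < weightedNames.Count; nameIdx++)
        {
            _names[nameIdx] = weightedNames[nameIdx].Key;
            int weight = weightedNames[nameIdx].Value;
            for (int i = 0; i < weight; i++)
                _table[offset + i] = nameIdx;
            offset += weight;
        }
    }

    // Step 4: Random.Next(n) returns a value in [0, n), so every slot is reachable
    // and the index never runs past the end of the table.
    public string GetRandomName(Random random)
    {
        if (random == null) throw new ArgumentNullException(nameof(random));
        return _names[_table[random.Next(_table.Length)]];
    }
}
```

Each lookup is then a single array access, which is where the speed comes from; the cost is one array element per unit of weight.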

Posted 21-May-15 7:50am

Sascha Lefèvre

Updated 21-May-15 8:16am

v4


This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Richard Deeming · Accepted Answer · 2015-05-21T05:00:00

Based on Tomas's answer to your previous question, something like this should work:

C#

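using System;
using System.Collections.Generic;
using System.Linq;

public sealed class RandomNameGenerator
{
    private readonly string[] _names;
    // _thresholds[i] is the cumulative share (0..1] of all names up to and including _names[i]
    private readonly double[] _thresholds;

    public RandomNameGenerator(IDictionary<string, double> source)
    {
        if (source == null) throw new ArgumentNullException("source");
        if (source.Count == 0) throw new ArgumentException("No names specified.", "source");

        _names = new string[source.Count];
        _thresholds = new double[source.Count];

        int index = 0;
        double runningTotal = 0D;
        // Dividing by the total normalizes the weights, so they need not sum to 100
        double totalWeight = source.Values.Sum();

        foreach (KeyValuePair<string, double> pair in source)
        {
            runningTotal += pair.Value;
            _thresholds[index] = runningTotal / totalWeight;
            _names[index] = pair.Key;
            index++;
        }
    }

    public string GetRandomName(Random random)
    {
        if (random == null) throw new ArgumentNullException("random");

        double n = random.NextDouble();
        // A negative result is the bitwise complement of the index of the first larger threshold
        int index = Array.BinarySearch(_thresholds, n);
        if (index < 0) index = ~index;
        // Guard against rounding leaving the last threshold fractionally below 1
        if (index >= _names.Length) index = _names.Length - 1;

        return _names[index];
    }
}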

Using the class should be fairly simple:

C#

var names = new RandomNameGenerator(new Dictionary<string, double>
{
    { "Smith", 1.39 },
    { "Jones", 0.7 },
    { "Adams", 42 }
});

var random = new Random();
var result = Enumerable.Range(0, 10000)
    .Select(_ => names.GetRandomName(random))
    .GroupBy(n => n, (key, items) => new { Name = key, Count = items.Count() }, StringComparer.OrdinalIgnoreCase)
    .Dump(); // LINQPad extension method - use your own preferred display method.

/*
Computed thresholds:
Smith: 0.03152642
Jones: 0.04740304
Adams: 1

Output should be something similar to:
Smith: ~315
Jones: ~159
Adams: ~9526
*/

Sergey Alexandrovich Kryukov · Accepted Answer · 2015-05-21T05:22:00

Let's assume you start by generating a uniformly distributed random floating-point number from 0 to 1. You need to convert it to your non-uniform (weighted) distribution by giving each name a range of a different width in the mapping function. This function is very simple: you need an array storing ranges and names, and you need to search for a range in this array.

This is the definition of the array element:

C#

class NameDescriptor {
    internal NameDescriptor(string name, double low, double high) {
        this.Low = low;
        this.High = high;
        this.Name = name;
    }
    internal double Low { get; private set; }
    internal double High { get; private set; }
    internal string Name { get; private set; }
} //class NameDescriptor

Create an array of these elements, one per name. To populate the array, convert your probabilities cumulatively into ranges within 0..1.
Say you have percentage values per name of 1.5, 3, 2… Then the Low and High values should be:

0 to 0.015 (1.5/100)
0.015 to 0.045 (1.5/100 + 3/100)
0.045 to 0.065 (1.5/100 + 3/100 + 2/100)
...
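
Populating the array might look something like the sketch below. It is not part of the original answer: NameRanges.Build is a made-up helper, and NameDescriptor is repeated (with public members) only so the example compiles on its own. It also divides by the sum of the weights rather than by 100, which is an assumption made here so that the last range ends at 1 even when the figures are not normalized.

```csharp
using System;
using System.Collections.Generic;

// Same shape as the NameDescriptor class above, repeated so the sketch is self-contained.
public sealed class NameDescriptor
{
    public NameDescriptor(string name, double low, double high)
    {
        Name = name;
        Low = low;
        High = high;
    }
    public string Name { get; private set; }
    public double Low { get; private set; }
    public double High { get; private set; }
}

public static class NameRanges // hypothetical helper, not from the original answer
{
    public static NameDescriptor[] Build(IList<KeyValuePair<string, double>> weightedNames)
    {
        if (weightedNames == null) throw new ArgumentNullException(nameof(weightedNames));

        double total = 0;
        foreach (var pair in weightedNames) total += pair.Value;
        if (total <= 0) throw new ArgumentException("Weights must sum to a positive value.", nameof(weightedNames));

        var result = new NameDescriptor[weightedNames.Count];
        double running = 0;
        for (int i = 0; i < weightedNames.Count; i++)
        {
            // Each range starts where the previous one ended.
            double low = running / total;
            running += weightedNames[i].Value;
            result[i] = new NameDescriptor(weightedNames[i].Key, low, running / total);
        }
        return result;
    }
}
```

With weights that do sum to 100, this produces the same Low and High values as the list above.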

Now, your uniformly distributed value from 0 to 1 will fall into one of these ranges. Find the instance of NameDescriptor in the array by this criterion. I did not devise an exact algorithm, but of course it should not be as slow as a linear search. The simplest approach would be the pretty fast (but not the fastest possible) divide-by-two algorithm. It is fast because all your ranges are ordered. Roughly speaking, you start with the middle array index and check whether the value fits in that range. If it does not, determine whether the suitable range is to the left or to the right of the element you tried. This way, you halve the number of candidates at each step. And so on…

When the array element is found, you put its Name to the output.
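
The divide-by-two search described above might be written along these lines. This is a sketch, not the answer's own code: RangeSearch.FindRange is a made-up name, and it assumes the ranges are contiguous and sorted, so only the upper bound (High) of each range is needed.

```csharp
using System;

public static class RangeSearch // hypothetical helper, not from the original answer
{
    // highs[i] is the upper bound (High) of range i; its lower bound is
    // highs[i - 1], or 0 for the first range. Returns the index of the range
    // that contains value, assuming 0 <= value < highs[highs.Length - 1].
    public static int FindRange(double[] highs, double value)
    {
        if (highs == null || highs.Length == 0)
            throw new ArgumentException("No ranges.", nameof(highs));

        int lo = 0;
        int hi = highs.Length - 1;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (value < highs[mid])
                hi = mid;     // value lies in range mid or somewhere to its left
            else
                lo = mid + 1; // value lies to the right of range mid
        }
        return lo;
    }
}
```

The returned index can then be used to read the Name from the array of NameDescriptor instances. Each call takes about log2(N) comparisons, so roughly 15 for 30,000 surnames.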

—SA
