Click here to Skip to main content
15,915,042 members
Please Sign up or sign in to vote.
0.00/5 (No votes)
See more:
Hi,

see Just for fun: Produce real names by nationality using probability[^] for full details leading up to this question.

I am trying to create an accurate (ish) random name generator. I will be using it to performance test my new system. I would like to weight the results of the generator by name popularity.

So far, I have three collections: Male UK Forenames, Female UK Forenames and UK Surnames. Each has a percentage of popularity:
C#
private static Dictionary<string, double> _enMForenames = new Dictionary<string, double>
        {
            {"Oliver",1.939}, //about 499 more
        };
private static Dictionary<string, double> _enFForenames = new Dictionary<string, double>
        {
            {"Amelia",1.638}, //about 499 more
        };
private static Dictionary<string,double> _enSurnames = new Dictionary<string, double>
        {
            {"SMITH",0.0074062771845516}, //about 30k more
        };



Note that these collections are not complete (god help me if they were :S) so the percent sub of all of the items does not reach 100% (I think Male names are about 68% of the total?)

This is what I have so far for generating one person:
C#
private static Passenger CreatePerson ()
{
    Person newPerson = new Person ();

    bool male = rand.Next(0,1) == 1;

    Dictionary<string, double> forenames;

    if (male)
    {
        forenames = _enMForenames;
        newPassenger.Gender = rand.Next(0, 1) == 1 ? "M" : "Male";
    }
    else
    {
        forenames = _enFForenames;
        newPassenger.Gender = rand.Next(0, 1) == 1 ? "F" : "Female";
    }

    //This is just rough atm but here is where I need to weight my selection
    newPerson .Forename =
        forenames.Keys.Select((k, i) => new { i, k })
            .Where(n => n.i == rand.Next(0, forenames.Count - 1))
            .Select(n => n.k)
            .First();

    //Surname will be done in a similar way here

    newPerson .Email = string.Format("{0}.{1}@{2}", newPerson .Forename, newPerson .Surname, "test.test");

    int days = rand.Next(0, int.MaxValue);

    newPerson .DateOfBirth = (new DateTime(1900, 1, 1).Date).AddDays(days);

    return newPerson ;
}



So:
How do I use the percent to weight the random name selection?
and
Can I do this more efficiently for large sets of names (2 million)?

Thanks

Andy



Edit: these are my resources:
Forenames
http://www.behindthename.com/top/lists/england-wales/2013[^]
Surnames
http://en.geneanet.org/genealogy/1/Surname.php[^]
Posted
Updated 21-May-15 4:50am
v2
Comments
Sergey Alexandrovich Kryukov 21-May-15 10:42am    
First of all, don't use dictionary (in contrast to the question on how to store frequency). Or, more exactly, use something else for the purpose of generation a name. On input, you have random number in uniform distribution, not name. So, name keys won't help. You need to create a set of ranges and check if the number is in certain range. How to make it efficient? It needs thinking; not so simple thing.
—SA
Sergey Alexandrovich Kryukov 21-May-15 10:44am    
Are the figures normalized (is their sum equals to 100)?
—SA
Andy Lanng 21-May-15 10:55am    
No, but they can be. I don't update the list at run time so I can perform a one-off sum and percentage calc
Sergey Alexandrovich Kryukov 21-May-15 11:36am    
Sure. Please see my solution.
—SA
Sascha Lefèvre 21-May-15 10:46am    
I might have an idea but don't know if it's feasible with actual data. Is your list of surnames publicly downloadable somewhere?

Based on Tomas's answer[^] to your previous question, something like this should work:
C#
public sealed class RandomNameGenerator
{
    private readonly string[] _names;
    private readonly double[] _thresholds;

    public RandomNameGenerator(IDictionary<string, double> source)
    {
        if (source == null) throw new ArgumentNullException("source");
        if (source.Count == 0) throw new ArgumentException("No names specified.", "source");

        _names = new string[source.Count];
        _thresholds = new double[source.Count];

        int index = 0;
        double runningTotal = 0D;
        double totalWeight = source.Values.Sum();

        foreach (KeyValuePair<string, double> pair in source)
        {
            runningTotal += pair.Value;
            _thresholds[index] = runningTotal / totalWeight;
            _names[index] = pair.Key;
            index++;
        }
    }

    public string GetRandomName(Random random)
    {
        if (random == null) throw new ArgumentNullException("random");

        double n = random.NextDouble();
        int index = Array.BinarySearch(_thresholds, n);
        if (index < 0) index = ~index;

        return _names[index];
    }
}


Using the class should be fairly simple:
C#
var names = new RandomNameGenerator(new Dictionary<string, double>
{
    { "Smith", 1.39 },
    { "Jones", 0.7 },
    { "Adams", 42 }
});

var random = new Random();
var result = Enumerable.Range(0, 10000)
    .Select(_ => names.GetRandomName(random))
    .GroupBy(n => n, (key, items) => new { Name = key, Count = items.Count() }, StringComparer.OrdinalIgnoreCase)
    .Dump(); // LINQPad extension method - use your own preferred display method.

/*
Computed thresholds:
Smith: 0.03152642
Jones: 0.04740304
Adams: 1

Output should be something similar to:
Smith: ~315
Jones: ~159
Adams: ~9526
*/
 
Share this answer
 
v2
Comments
Tomas Takac 21-May-15 11:17am    
+5, very nice and complete solution. One question though: any reason to pass Random as parameter?
Richard Deeming 21-May-15 11:24am    
The Random class isn't thread-safe. If you wanted to use this class from multiple threads, you'd need to create a separate Random instance per thread, so storing it at the class level wouldn't work.

The method could create it's own instance, but then you'd run the risk that two calls could end up using the same seed.

You might also want to pass in an instance with a known seed for unit-testing.

And you might want to pass in an instance of a derived class - for example, one that uses a cryptographically-secure PRNG[^]. (If you're going to use that one, take note of my comments to the post as well.)
Andy Lanng 21-May-15 11:52am    
Great - I was 99% there.

I actually inherit from a Dictionary now. I don't change the items after the dictionary is created so I create the "running total" array once and have a get parameter that retrieves the actual string.

being able to use the same seed for successive tests will be useful. I'll defo implement that
Sergey Alexandrovich Kryukov 21-May-15 12:29pm    
Richard,

I did not check up that this solution is mathematically correct (please see my answer), but I'm afraid that your LINQ solution is much slower than it should be (remember, performance is one of the concerns here). As this LINQ does not the use the fact that ranges of input random values are ordered (again, please see my answer), the search must be linear, which gives you very bad O(N) time complexity. Manually written search algorithm using the order will be much faster.

—SA
Richard Deeming 21-May-15 12:34pm    
Sergey,

Not sure I follow your comment.

I've only used LINQ twice in this answer:

* Once in the constructor, to calculate the sum of the "frequency".
* Once in the sample of using the class, to generate some grouped random names to see if the distribution looks correct.

The expected usage is to create and cache an instance of this class for each set of names, and then call the GetRandomName function within the loop that creates the Person objects.

No part of the code called in the loop uses LINQ. It uses a binary search on the _thresholds array, which should be reasonably fast.
Let's assume you are starting with finding a random floating-point number of uniform distribution 0 to 1. You need to convert it to you non-uniform (weighted) distribution, which is a kind of uniform, with giving each name different range on distribution mapping function. This function is very simple, you will need a simple array storing ranges and names. You will need to search a range in this array.

This is the array definition:
C#
class NameDescriptor {
    internal NameDescriptor(string name, double low, double high) {
        this.Low = low;
        this.High = high;
        this.Name = name;
    }
    internal double Low { get; private set; }
    internal double High { get; private set; }
    internal string Name { get; private set; }
} //class NameDescriptor

Create an array of this element, by the number of names. To populate the array, re-work recurrently your probability into ranges withing 0..1.
Say, you have percentage values per name 1.5, 3, 2… Then the Low and High values should be
0 to 0.015 (1.5/100)
0.015 to 0.045 (1.5/100 + 3/100)
0.045 to 0.065 (1.5/100 + 3/100 + 2/100)
...


Now, your uniformly-distributed value 0 to 1 will fall into one of these ranges. Find the instance of NameDescriptor in the array by this criterion. I did not devise exact algorithm, bit of course it should not be that slow the linear search. The simplest algorithm would be pretty fast (but not the fastest possible) divide-by-two algorithm. It is fast, because all your ranges are ordered. Roughly speaking, you start with middle array index and check that the value fits in the range. If it does not, determine is the suitable range is on left or on right of your attempted element. This way, you divided your search variant by two. And so on…

When the array element is found, your put its Name to output.

—SA
 
Share this answer
 
v2
Comments
Andy Lanng 21-May-15 11:49am    
Awesome - Thanks for explaining it so clearly. It makes complete sense now ^_^
Sergey Alexandrovich Kryukov 21-May-15 12:31pm    
You are very welcome.
Please see also my last comment to Solution 1, a big performance warning.
—SA
Andy Lanng 21-May-15 14:41pm    
I saw it ;)
Sergey Alexandrovich Kryukov 21-May-15 15:58pm    
But maybe I was wrong...
—SA
I haven't tested the other proposed solutions but I could imagine that they take "some time" when generating 2 million random names.

This solution should be a lot faster in exchange for consuming a lot more memory. Maybe it's neccessary to sacrifice some accuracy in order to make it work memory-wise but I assume it will not be neccessary for your corpus of names.

1) Sum the weights of all names (e.g. weight = 859017 for "Smith"). Let's call this sum S.

2) Allocate an array with a length of S. Depending on the amount of names it should be an array of UInt16 or Int32.

You might need this in your app.config to allow objects larger than 2GB:
XML
<runtime>
  <gcAllowVeryLargeObjects enabled="true" />
</runtime>

If you run into an OutOfMemoryException, divide each name's weight by an arbitrary number (e.g. 2). If the new weight would be 0, give it a value of 1 (theoretically; surely not neccessary for your data). Start over with step 1. (This would be the accuracy-loss)

3) Initialize that array (pseudo-code):
offset = 0
for nameIdx = 0 to names.length - 1
   for i = 1 to names[nameIdx].weight
      array[offset + i] = nameIdx
   endfor
   offset += names[nameIdx].weight
endfor

4) Generate a random number rnd between 0 and S. The random name then is: names[array[rnd]]
Repeat step 4 as often as you want.
 
Share this answer
 
v4

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



CodeProject, 20 Bay Street, 11th Floor Toronto, Ontario, Canada M5J 2N8 +1 (416) 849-8900