Thursday, November 27, 2008

Pragmatic Compromise

Have you noticed that all the guys who are advocating a compromise code quality vs. speed,
are not the guys who are going to maintain that code? Have you noticed that these guys bully you into delivering features fast, and later they come complaining ' Why is the quality so poor? '.
Even if quality is not seen as corporate asset, make it a part of your professional ethics.

Tuesday, November 11, 2008

DynamicQuery

Has anybody tried to do an EJB DynamicQuery implementation a la groovy, using a DynamicProxy ?
(post inspired by AutoBoundary proxymatic)
Apparently yes: check-out the LiquidForm.
There are some other alternatives, but java is not there yet.

Monday, November 10, 2008

Scrum Log

After reading/watching an impressive amount of presentations from Henrik Kniberg: parleys, slides, technical-debt, bootstraping scrum and xp in a crisis, scrum checklist, and more, I've decided to keep a personal iteration log.
It's amazing how much clarity you get at the end of an iteration: the retrospective.

Monday, November 03, 2008

Dark Syntax Highlight

(in RGB)
Brackets: 128,255,0
Keywords: 217,167,15
Comments: 0,185,0
Tasks,ToDos: 0,255,0

Thursday, October 23, 2008

API Usability

The success of an API/component largely depends on its usability: is it easy to create, is it easy to handle, is it easy to wire? Is it easy to make a mistake? Are some contracts unclear ?

The first feedback we get from (unit-)tests. Unfortunately a lot of people don't do that.
Another point is that the tests (partially) lie about how the component is used: they represent how the author assumes the component will be used. Feedback from customers and refactorings should follow...

Wednesday, October 22, 2008

Filter support for PropertyConfigurator

The PropertyConfigurator class in log4j does not support filters and few other advanced configuration options. There was a need to add logging filters to a legacy Java app, but modifying the application code to use DomConfigurator was out question. The alternate approach was to directly add filter support to the PropertyConfigurator class in log4j.

Filters are chained to an appender in a specified order. However, the runtime representation of a .properties files is basically a hashtable, and there is no ordering defined on the keys of a plain hashmap. Any implicit ordering of the lines in the properties file is lost when it is parsed. To overcome this, filters are identified by unique IDs, and filters are added to the appender in the lexicographic order of the IDs.

Keeping the notations used by the log4j source code, filters are configured as follows:

log4j.appender.appenderName.filter.ID=fully.qualified.name.of.filter.class
log4j.appender.appenderName.filter.ID.option1=value1
...
log4j.appender.appenderName.filter.ID.optionN=valueN


You can download the patch directly from the log4j bug tracker. It can be applied to the current HEAD of the 1.2 branch, more exactly to 1.2.15.

Friday, October 10, 2008

Coder

Written code is a cost, since it has to be maintained. Less code, less maintenance.
Maybe we should call developers 'anti-coders'.

Tuesday, August 19, 2008

Option/Maybe monad is a generic NullObject

I've just reviewed the NullObject and Visitor post. It is so similar with the Option/Maybe monad.
ps. Thought's on Validation: Scala/liftweb has an nice Can.

Monday, August 11, 2008

Scala Marketing Challenge

Josh Suereth would like to improve the Scala's marketing strategy.
This is a tough sell. If we look at the Change Function:

ChangeFunction = F(PerceivedCrisis/PerceivedPainOfAdoption).

(btw. read Erik Meijer's paper: 'Confessions of a Used Programming Language Salesman', it's really, really good.)

So we'll have to ask ourselves what problem is Scala trying to solve, and how much it'll cost. IMHO, Scala is trying to make programming easier. It partially addresses the concurrency problem, but it is not its main-focus, I would say is the ScalaActors are a good side-effect of Scala's good (language + library) design.

What we'll have to pay in order to make the 'switch': the syntax is slightly different. Functions are 1st level citizens. (but functions/closure are scary). Operator overload is also scary (funny enough, method overload is not scary: we are afraid of symbols but we are not afraid of words). Then we have for a good generic support we have 'Generics of a Higher Type', type covariance/contra-variance. Pretty scary... booh! Did I say that IDE support is not yet at java's level?

On the other side we have a huge number of java developers. They are satisfied. Not really happy, not really sad. Ok, java is not that great, but it has been like this for years, why change? To be more effective? Probably to sell that to a manager you'll need 5 independent studies to show that scala is better then java and you'll need at least one big company to push the language. ('if they use it, we could use it too')

Regarding concurrency: a lot of java developers have problems to right correct code, so probably concurrency is the least of their problems. Other developers we'll say 'we have java.util.concurrency'.
If you look at the package you'll see they offer concurrency primitives, and to write shared-state concurrent programs is very hard. You'll have to think a lot more about object isolation and the threading context.

In conclusion:
- UserPerceivedCrisis: small
- PerceivePainOfAdoption: big
ChangeFunction -tends-to-> small.

sorry, scala...


Friday, August 08, 2008

Poka-yoke API Design

API Design should be Poka-yoke.
And if we look what zen aesthetic means to people, a good design is zen: clean, simple, minimalistic.

Thursday, July 31, 2008

hg vs git, pull+merge vs rebase

@pastebin
GIT encourages rebasing, which leads to rebasing of your public and other people's
code. This is bad.

Mercurial, which hashes a changeset's place in history as well as it's
content, discourages rebasing. Rebasing happens in private with mq patch queues.
Public trees are merged. This is good.

Either style is possible with both tools. The difference is the default emphasis.

Defaults matter.

Linus Torvalds on rebasing:

* http://kerneltrap.org/Linux/Git_Management
* http://lwn.net/Articles/291302/
* http://lwn.net/Articles/291303/
* http://lwn.net/Articles/291304/


Matt Mackall on synchronizing (pull + merge, don't push, don't rebase):

* http://www.selenic.com/pipermail/mercurial/2008-July/020116.html
* http://www.selenic.com/pipermail/mercurial/2008-July/020131.html

Thursday, July 24, 2008

Scala Exercises

I have completed Tonny Morris Scala exercises. (maybe I'll put them on the svn as well, or maybe mercurial ?)...
Anyway some observations:
- it felt like doing the 'Little Schemer' in Scala.
- after doing 3-4 exercises, you feel the need to use/implement fold_left/fold_right.
- maximum is buggy: what's the maximum on an empty list? (the maximum found must be contained in the list;reasoning related to 'Noumenal Null'). I believe that this exercise is a 'tricky question' and instead of throwing an exception, or returning Nothing or Int.NaN, probably the signature should be changed to : maximum(x: List[Int]) : Option[Int]

- In the List's implementation of map/filter/... they use instead of pattern matching if x.isEmpty ... else ( do_something(x.head, x.tail). Probably it's faster than pattern decomposition.

Thursday, June 12, 2008

Lack of Fit

Over the J. Paul Morrison's article I found the following quotation:

In the early part of his book "Notes on the Synthesis of Form", Alexander describes the concept of "fit". He makes the point that, as humans, we find it very hard to detect "fit", but are well-adapted to detecting "lack of fit". Think of trying to decide if the edge of a piece of wood is straight: the natural thing to do is to put it against a straight surface and see if any gaps show in between. As we try to improve things by reducing the misfit factors, we soon discover that these factors tend to be interrelated: fixing one may exacerbate another, so there is a continuous process of compromise between a number of less than ideal solutions. Intuitively, the less interrelated the misfit items are, the easier it will be to reach a state of acceptable fit. This means that our only hope of reducing misfit factors is if these factors occur in clusters such that the connections between clusters are relatively loose. In application development, this works best if the infrastructure software components also define boundaries between misfit factors.

maybe that's why I'm so critical....
btw. I've started reading J. Paul Morrison's book. I like it, I recommend it.

Thursday, May 22, 2008

LanguageFear

I've just read Martin Fowler's article ParserFear. So, Why Not a Parser for a DSL?
Martin didn't quite specified why we need a DSL:
- Is it for business people to specify their requirements?
- Is it just 'readable' code, for programmers ?
I assume that in both cases we are trying to solve a customer's problem, by building an easy to maintain.

Let's say we use XML. Well that is a bad idea, and Martin knows it: XML is not the answer. It is not even the question. And to write XML is a difficult thing to sell, to a business-man.

Let's say that we go the ANTLR way. Which means from all the languages which are out there, none fits our needs, so we have to build our own. Ok, let's say we've build one. But now we need for our users an editor. We might need good error messages, syntax highlight, the possibility to reuse blocks of code from our programming language. Later we will need to extend our language, without hurting the users. And on the other side, the code generated will be difficult to maintain or to build abstractions on it. That's because code generation is just a raw, automatic, copy-paste machine: it doesn't build any higher-order abstractions. And the direction seems to be wrong: instead of helping our customer, we are baby-sitting our language/editor/infrastructure.

Or we go with embedded DSLs: some languages are better then others. And some languages can be easily integrated with others. Some have heavier syntax making more difficult to create a clean DSL, some are more lightweight, making a perfect match for DSLs.

Or we can go with a 'building material', a 'programmable programming language', or something similar.

Why do people fear of learning of a new language, but they invest uncountable hours in building sand-castles. Even Martin agrees: be a polyglot.

Friday, May 02, 2008

Validation Combinators, a F# prototype

Suppose we have the following situation: we have a bag/set of named objects which have to be validated. These objects are primitives, let's say strings, for our sample.
Some strings can be validated as 'standalone' some need to be validate against each other. This validation doesn't belong to the domain for several reasons.
Regrettably, the existing frameworks don't suite our needs.

Trying to decompose the problem, we need:
- a way to get object out of the bag, by name
- a validation method/function which can validate one or more objects
- if the validation fails, we want to collect a 'failed validation message'
- we want to compose these validations: and, or, not, xor, etc...

Let's put that in code:

1 #light

2 open System

3 open System.Collections.Generic

4 open System.Text

5

6 type get_by_name = String -> String

7 type accumulate_message = String -> unit

8 type validator = get_by_name -> accumulate_message -> bool

We use F# light syntax, and open some .Net namespaces, define the required function types: I really like this kind of 'aliasing': give name to a type, since being based on type inference, is not obtrusive.It is not like a java/c# Interface, where the interface must be implemented. In our case, if we have another function with a signature matching our type, we can use that function as the required type. (we could say is a type-safe duck-typing)

10 let _and_ (x : validator) (y : validator) = (fun (get_value_from_bag : get_by_name) (accumulate_validation_message : accumulate_message) ->

11 let ok_x = lazy x get_value_from_bag accumulate_validation_message

12 let ok_y = lazy y get_value_from_bag accumulate_validation_message

13 Lazy.force ok_x && Lazy.force ok_y)

14

15 let _or_ (x : validator) (y : validator) = (fun (get_value_from_bag : get_by_name) (accumulate_validation_message : accumulate_message) ->

16 let ok_x = lazy x get_value_from_bag accumulate_validation_message

17 let ok_y = lazy y get_value_from_bag accumulate_validation_message

18 Lazy.force ok_x || Lazy.force ok_y)

19

20 let _not_ (x : validator) = (fun (get_value_from_bag : get_by_name) (accumulate_validation_message : accumulate_message) ->

21 let ok_x = x get_value_from_bag accumulate_validation_message

22 not ok_x)

23

24 let _xor_ (x : validator) (y : validator) = ((_not_ x) |> _and_ y ) |> _or_ (y |> _and_ (_not_ x) )

Having defined what a validator is, we have defined functions to combined them: and, or, xor. In order to simulate the behavior (a and b -> b is evaluated only if a is false), we cache the result of a validator in
a lazy value: the value is computed only once, delayed, on demand. In the xor combinator we have used another DSL-friendly feature: partial-function application: ie. if we have a function f(x, y) -> z we could write this as: y |> f x.

Ok, let's see how it works. We define a validator:

27 let is_starting_with (start : String) (name : String) = (fun (get_value_from_bag : get_by_name) (accumulate_validation_message : accumulate_message) ->

28 let value = get_value_from_bag name

29 let ok = value.StartsWith(start)

30 if not ok then

31 let msg = string.Format("{0} '{1}' doesn't start with {2}", name, value, start)

32 accumulate_validation_message msg

33 ok)


Lets define a map of name -> values, and a message accumulator:

36 let map = new Dictionary()

37 map.Add("customer","ali")

38 map.Add("buyer", "baba")

39 map.Add("vendor", "nono")

40

41 let get_value (key : String) : String = map.Item(key)

42

43 let acc = new StringBuilder()

44 let accumulate (msg: String) : unit = acc.AppendLine(msg) |> ignore


Lets create a more complex validator:

46 let x = ("customer" |> is_starting_with "a") |> _and_ ("buyer" |> is_starting_with "b") |> _and_ ("vendor" |> is_starting_with "v")

That's really readable, isn't it?

And we can evaluate it:

48 printfn "evaluation: %b messages: %s" (x get_value accumulate) (acc.ToString())

49

50 System.Console.ReadKey()


ps. Look at a better example in a Good Video: "Composing Contracts: An Adventure in Financial Engineering".
pps. here is the code.

Thursday, April 10, 2008

Maybe, me too

After my previous post, I've found another one blogging about Option as 'explicit null'.
If you think about, Java/C# always had that Option, is only called Iterable[T]/IEnumerable[T]: either you have one element inside or not.

So I've done another implementation in java. In comparison with the 'other' implementation:
- instead of the slower 'instanceof' matching, we use a boolean hasValue()
- almost as in F#, the Option implements the Iterable interface, allowing the filter/map/reduce operations. (in F# those operations are directly in the Option type)

ps. A lot of people like the 'scala Option' or the 'F# Option'.
pps. My favorite 'What is a Monad?' explanations are the Brian Beckman - "Don't fear the Monads" video, and the Wes Dyer "The Marvels of Monads" blog-entry. Here the other links.

Monday, March 31, 2008

Thoughts on Validation Patterns

During my worked at project X, I've looked for several validation patterns.

The use case is: you have a view with several fields, you need to validate them all, collect all the error messages and focus on the 1st erroneous file.

Since the View represents of a view (projection) over the Domain (Model) the right thing to do is to put the validation in the Model.

The problem arises from the fact the in an ActiveController-PassiveView (aka Model-View-Presenter)
neither the Model nor the View have access to each other:

PassiveView <-- Presenter --> Model.

The latest frameworks use annotations/validators & reflection to map validations ViewField <->
ModelField. (it gets hairier if you need to validate accross several fields, or several objects).

Project X was a .Net project, and .Net 2.0 has a nice feature: implicit type inference for delegates.
Let me explain: if you define an delegate
public delegate R Func<R, A1>(A1 arg1);


and a method
void DoSomething<R,A1>(Func<R,A1> converter) {
... }


then you can pass *any* method which matches the signature of the delegate to the
DoSomething(...)
method.

public R MyConverter<R,A1>(A1 arg);

...
DoSomething(MyConverter) // this is valid
Basically, starting with .Net 2.0, closures/functions are 1st class citizens in C#, like any other 'normal' objects. (closures are objects, afterall), with some remarks:

- .Net/C# properties are not methods and cannot be used as such
- in java you have to write a *lot* of anonymous classes, making more difficult to see any benefit. (that's also why an Enumerating component is not appealing in java, in C# the compiler writes the anon. classes for you)

Having said that, we can do a 'manual' binding in Presenter:

ValidationService.bind(PassiveView.GetField(), Model.ValidateField, PassiveView.SetErrorOnField)

the function signature of bind would be
bind<T>(T candidate, Func<bool, T> validate, Action signalError) {
if (! validate(field)) signalError();
}

We can do some more variations on the thema, where the ValidateField will also give us a message saying what was wrong with the candidate.


Variation within variation: message in several languages, should the model be responsible for that, maybe should provide the id to an message which will be translate...

One smell is that, after we have validated the field, by the Model, we pass again the value back to
the model in order to update/create some objects/fields.

Another discussion we had is that a constructor/factory using exceptions for validating arguments is expensive and it fails too fast, since we want to collect all error messages.

So we could replace that with a 2-step process:
maybeValidArgs = factory.Validate(args);
factory.create(maybeValidArgs);
but this is too 'manual'.

Some static typing functional languages have a construct to explicitely define a nullable value. In a c#/java would be:

interface Option<T> {
boolean HasValue();
T GetValue();
}

class Some<T> {
boolean HasValue() -> always true;
T GetValue() -> always the value;
}

class None<T> {
boolean HasValue() -> always false;
T GetValue() -> always throws an exception;
}

Note: In java we have type erasure, so for None<?> we can use a Singleton object.

Now for validation use case we might:

interface Option<T> {
boolean HasValue();
T GetValue();
IEnumerable<String> Errors();// Danger in using Strings !!! String is a 'primitive/sealed' type, which you cannot extend.
}

class Some<T> {
boolean HasValue() -> always true;
T GetValue() -> always the value;
IEnumerable<String> Errors() -> always Empty;
}

class None<T> {
boolean HasValue() -> always false;
T GetValue() -> always throw;
IEnumerable<String> Errors() -> always the errors;
}

Basically HasValue() has the same semantic with IsEmpty(Errors()).
We could change the pull - with a push:
instead of IEnumerable<String> Errors() (pull errors)
we push the errors in a handler
DoWithErrors(Action<IEnumerable<String>> errorHandler) -> Some will call the the action, None will do not

so we could use it like this:

Option<T> maybe = factory.TryCreate(args);

if (maybe.HasValue()) { ... do something with it }
else {
maybe.doWithErrors(View.DisplayErrors);
}

It is not difficult to associate an extra validation action for each arg/candidate field.

We just add optional args to TryCreate(args, params IAction[] actions) which we'll define as
mapping arg1 -> action1, arg2 -> action2. (if we have more args then actions, we don't call any action for them)

eg.

Option<Address> maybe = addressFactory.TryCreate(street_candidate, number_candidate, city_name_candidate, view.SignalErrorStreetName, view.SignalErrorValidateNumberCandidate, view.SignalValidateCityName)

...probably this will evolve to a 'framework', but at least a safe-type one, not string/reflection-based... And probably the annotation/reflection based are more AOP/crosscutting then this will be.

Thursday, March 27, 2008

I don't get Spring

Like CrazyBob, of guice fame, I also don't get Spring. Hammet also doesn't get it.
I mean what is the whole idea of wiring by xml? To decouple the wiring from the model.
But if you put the wiring.xml in the jar, so that it lands in the maven repository, to be used by others then it's not really decoupled.

Suppose I have 2 services, with their corresponding interfaces

class Producer implements IProducer { ... }

class Consumer implements IConsumer {
____Consumer(IProducer producer) { ... }
____...
}

what do we put in the xml file ? Basically we repeat the code-lines.
[bean id="..." class="Producer][/bean]
[bean id="..." class="Consumer"]
____[constructor-arg ref="..." index="0"/]
[/bean]

So not only we write again the required dependency, but we do it in xml so that we don't get
the benefit of refactoring (any changes due to renaming are lost).
I've intentionally left the ids unspecified. What should we put in there? The name of the class or the name of the interface?
- If we put the name of the class, then the Consumer-owner must know which class implements the IProducer interface, so a large part of the benefit of using interfaces is lost.
- If we put the name of the interface, in a declarative way, 'I am an IConsumer and I need an IProducer' then we have really duplicated the constructor line and the class declaration line in XML.

And now what about setter-injection. When I look at a class, I look at the interface it implements and what services it needs (in the constructor) to fulfill that contract. But Spring favors 'Setter Injection'.

Wednesday, March 26, 2008

blog marketing

I've received a link about how-to make my blog more popular. (thanks Andi ;)
There are good tips in there but I'm not sure I want to go aggressive in 'active marketing':
  • it is more important that my readers say 'hey this is a cool guy with a good blog' that I say 'I am the greatest'
  • more time on marketing means less time on something else (this is probably weak, since I'm not a very effective blogger, but I compensate that by being an avid reader)
  • I would prioritize 'more, better content' higher than 'more aggressive marketing'. (I should really write more)
  • On the other side is really important to reach people and bring them something. An architect said that the real important think to do is to spread knowledge. Know imagine: instead of sitting here and doing a bad copy of bile-blog, without being even funny, winning about how java is crap, C# is crap, (btw. C# is better than java, sic!), I could/should put some interesting stuff in here and move at least a hand of developers to a better programming world. So if I can reach/teach at least 3 people, and every 'student' will reach at most 3 people and so on, maybe we can reach the critical mass, and move something.... (...or maybe this was just a naive rant)
Making it short: help me make the blog more popular: reddit, technorati, dzone me.

Tuesday, March 18, 2008

Memoization in Java

OnJava in 2003, Tim White gave us generic way to build a memoize in java.
IMHO, Mr. White is trying to use big guns (dynamic proxy, reflection) on a small problem: function composition.

If we use/declare an interface:
public interface IFunc1 {
____R apply(A1 arg1);
}

then we can do implement Memoize1 (on 1 argument) as: (I use the Scala notation for generics [,] since blogger cuts out the xml devil quotes \<,\>)

public class Memoize1[A1, R] implements IFunc1[A1, R] {
____private final IFunc1[A1, R] func;
____private final Map[A1, R] cache;

____public Memoize1(IFunc1[A1, R] func) {
________this(func,new HashMap[A1, R]());
____}

____public Memoize1(IFunc1 func, Map[A1,R] cache) {
________this.func = func;
________this.cache = cache;
}

public R apply(A1 arg1) {
____R result = cache.get(arg1);
____if (result == null) {
________result = func.apply(arg1);
____cache.put(arg1, result);
____}
____return result;
____}
}

That's it. Generic. Composable. Reusable. No schnick-schnack.
Similar memoizer could be provide for IFunc2, IFunc3...
Notice that the Memoize has the same signature as the function it encapsulates.

ps. here is the F# version.

Tuesday, February 26, 2008

sooo '98

from maven-surefire-plugin:

By default, the surefire plugin will automatically include all test classes with the following wildcard patterns:

  • "**/Test*.java" - includes all of its subdirectory and all java filenames that start with "Test".
  • "**/*Test.java" - includes all of its subdirectory and all java filenames that end with "Test".
  • "**/*TestCase.java" - includes all of its subdirectory and all java filenames that end with "TestCase".

If the test classes does not go with the naming convention, then configure surefire plugin and specify the tests you want to include.


ahh, those romantic days.
Apparently annotations and specifications are too 'exotic'.

Thursday, February 21, 2008

Code Snippets Project

I've published the C#, F# and python snippets in the code-snippets project.
Check-out get_dilbert_strips in py and F#.

Tuesday, February 19, 2008

Type Inference in F# & Scala (pointers)

Here are some just some links about type inference in F# and Scala.

1st. from F# manual:

F# is statically typed, but frequently code will not contain many type annotations. This is because F# uses type inference to automatically determine the types of expressions. Type inference checks that types "match up" in obvious ways, i.e., that a function definition is consistent and that the definition of a function is sufficiently general to match all the ways in which it is used. Type inference is a global process over an entire source file, or, in the case of entries to F# Interactive, over the scope of a single entry delimited by ';;' terminators.

Type inference automatically generalizes code to be as generic as possible. To see the types inferred for the top-level definitions in your program use the -i compiler switch, or hover your mouse over definitions in a supported interactive environment such as Visual Studio.

Calls and accesses to .NET and F# methods and properties (collectively known as members) involve a form of type-directed overload resolution, resolved based on left-to-right, outside-to-in analysis of a file, i.e., type information subsequent to the use of an overloaded member is not taken into account for its resolution.

In addition, some operators such as + are overloaded. Operator overloading is resolved over the same scope as type inference, typically an entire file. Where operator overloading is not resolved the overloading typically defaults to work over 32-bit integers.

Code that uses .NET library calls (and in particular libraries which make heavy use of the dot-notation) will need to be annotated with extra type annotations in order to assist the type inference process. The F# type checker will indicate when further type annotations are needed.


(for F# I can also recommend the F# books)

2nd Scala: besides the type-inference, has a very nice generic system, with focus on components & scalability, see 'Generics of a Higher Kind', the google-talk (slides here), and the other scala-docs (for the studious). (you can feel a flavor of Haskell in scala (traits, implicits) , bringing the high-order-function elegance)

Wednesday, January 23, 2008

What's the probability of at least two persons among 20 to have the same birthday?



I'm an anti-virus engineer. Recently, we've detected a mobile phone worm that spreads by attaching itself to MMSs that it sends to various phone numbers. Our contact at the mobile phone operator harvested about 1000 MMSs. Some MMSs were sent to the same phone numbers.

The question was: does the worm generate the phone numbers randomly or are the phone numbers hard-coded in its code?

A related question is: IF the numbers are randomly generated, which is the probability to have several identical numbers among a set of 1000 numbers?

The variable part in the 11-digit phone numbers consists of 5 digits. We assume that the phone numbers are equally probable.

There are 10^5 equally probable possible phone numbers. Let us consider each phone number as a letter of alphabet that contains 10^5 letters. There are (10^5)^1000 distinct 1000-letter strings that we can write in this alphabet. (To convince yourself, think that the alphabet is our normal 10-digit decimal system and we want to build a 3-letter string. How many distinct strings can we build? 1000. 000, 001, ..., 009, 010, 011, ..., 019, 020, ..., 999.)

How many strings do we have such that each string contains only distinct letters? In order to answer, think how to build such a string. You pick the first letter from the 10^5 available. Then you pick the second from the 10^5-1 available, etc. If you remember your combinatorics classes, then you'll recognize the "variations". How many variations of 1000 elements from a set of 10^5 are there? (10^5) * (10^5 - 1) * ... * (10^5 - 999).

To summarize, we have an alphabet of 10^5 letters, they are equally probable. We have (10^5)^1000 possible 1000-letter strings. These strings are also equally probable. We have (10^5) * (10^5 - 1) * ... * (10^5 - 999) equally probable strings such that each string contains only distinct letters.

Thus, the probability to pick a randomly build a 1000-letter string that contains only distinct letters (i.e. the probability that all phone numbers among the 1000 harvested phone numbers be different) is

(10^5 / 10^5) * ((10^5 - 1) / 10^5) * ((10^5 - 2) / 10^5) * ... * ((10^5 - 999) / 10^5) = 0.0066594036

The probability that at least two numbers are identical among the 1000 numbers is 1 minus the probability computed above, i.e. 0.9933405964.

As I was not very sure of my reasoning, I wrote a small simulation:

#include <stdio.h>
#include <stdlib.h>

#include <strings.h>
#include <unistd.h>
#include <time.h>
#include <errno.h>
#include <signal.h>

#define DFLT_N 100000
#define DFLT_K 1000
#define DFLT_MAX_ITER 1000000


static
unsigned int max_iter, iter, equals, N, K;

static
unsigned int *a;

static
double
uniform_distrib(double a, double b) {

return
a + (b - a) * ((double)random() / (double)RAND_MAX);
}


static
void
onUSR1(int dummy) {
printf("%f\n", (double)equals / iter);
}


static
void
parse_cmd_line(int argc, char *argv[]) {

int
c;
char
*endp;
extern
char *optarg;

while
((c = getopt(argc, argv, "hN:K:i:")) != EOF)

switch
(c) {
case
'?':
/* error */

break
;

case
'h':
/* help message */

break
;
case
'N':

errno = 0;
if
(optarg == NULL) {

/* error */

}

N = (unsigned int)strtoul(optarg, &endp, 0);

if
(errno != 0 || *endp != '\0') {

/* error */

}

break
;
case
'K':
errno = 0;

if
(optarg == NULL) {
/* error */

}


K = (unsigned int)strtoul(optarg, &endp, 0);
if
(errno != 0 || *endp != '\0') {

/* error */

}

break
;
case
'i':
errno = 0;

if
(optarg == NULL) {
/* error */

}


max_iter = (unsigned int)strtoul(optarg, &endp, 0);
if
(errno != 0 || *endp != '\0') {

/* error */

}

break
;
}

if
(N == 0)

N = DFLT_N;
if
(K == 0)

K = DFLT_K;
if
(max_iter == 0)

max_iter = DFLT_MAX_ITER;
}


int

main
(int argc, char *argv[]) {

unsigned int
i, picked;
struct
sigaction act;

parse_cmd_line(argc, argv);

srandom(time(NULL));

if
((a = malloc(N * sizeof(unsigned int))) == NULL) {

/* error */

}


act.sa_handler = onUSR1;
act.sa_flags = 0;

sigemptyset(&act.sa_mask);
sigaction(SIGUSR1, &act, NULL);

for
(iter = 0; iter < max_iter; ++iter) {

bzero(a, N * sizeof(unsigned int));
for
(i = 0; i < K; ++i) {

picked = (unsigned int)uniform_distrib(0.0, (double)N);
if
(a[picked] == 1) {
++
equals;

break
;
}

a[picked] = 1;
}
}


onUSR1(SIGUSR1);

return
0;
}



It can be used to answer the question in the title. The simulation says 0.411308. The "theory" says 0.4114383836.

If we were to be pedantic, the fact that we make some observations coherent with the predictions of a model does not imply that the model is indeed true, especially when we can easily imagine other models that make the same predictions. Thus, we called three distinct phone numbers out of the 1000 harvested. Two of them were not allocated. But this second method looks like the ultimate solution to measuring the height of a building with a barometer.