Hi Prakash,
In this post I am only going to write about SIMPLE. My observations are MUCH different from what you are saying should happen, so I want to focus on one strategy at a time (and I had a long day).
Mine runs 400 XMLs, not 40. Yes, it runs 40 threads, and the tool tip describes the "Threads" field as "Virtual Users". My DataSource contains 400 XMLs, so why would it only send 40 requests? I have "Set to share this DataSource between running threads during a Load Test" checked, so all 40 threads together use up the 400 XMLs. If I uncheck it, each of the 40 threads sends all 400, so that is 40 x 400 = 16,000 XMLs. This shows up in the "cnt" column for the "HTTP Test Request" step in the load test. Both counts (400 and 16,000) are with Runs Per Thread set to 1.
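To make the arithmetic explicit, here is a small sketch of the request counts I would expect under each DataSource setting. The function name is hypothetical, and the scaling of the shared case with more than one run is an assumption on my part; I only measured the shared case with Runs Per Thread = 1.

```python
def expected_requests(rows: int, threads: int, runs_per_thread: int, shared: bool) -> int:
    """Hypothetical helper: expected "cnt" for the HTTP Test Request step."""
    if shared:
        # All threads draw from one shared pool of rows, so each row is sent once.
        # ASSUMPTION: this repeats once per run; only runs_per_thread=1 was measured.
        return rows * runs_per_thread
    # Not shared: every thread walks the whole DataSource on every run.
    return threads * rows * runs_per_thread

print(expected_requests(400, 40, 1, shared=True))   # 400
print(expected_requests(400, 40, 1, shared=False))  # 16000
print(expected_requests(400, 40, 2, shared=False))  # 32000
```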
Test Delay: you wrote that this is the delay between each thread, but I saw it as the delay between runs. I set "Runs Per Thread" to 2, the Test Delay to 10 seconds, and Random to 0. I timed the gap between the 1st and 2nd run per thread, and it was 10 seconds. With "Set to share this DataSource..." unchecked, 16,000 XMLs were completed after the 1st run and 32,000 after the 2nd. The tool tip on the "Test Delay" field says "Sets the delay between each test run in milliseconds". It says between RUNS, not THREADS. If it were between threads, that would be 40 x 10 seconds = 400 seconds per run, and my test finished in much less than 400 seconds.
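A quick sketch comparing the two readings of "Test Delay" with my settings (40 threads, 10 second delay). The function name and the `between_threads` flag are hypothetical, just for illustration; the "between threads" branch models your reading, the other models the tool tip's wording and what I timed.

```python
THREADS = 40
DELAY_S = 10  # Test Delay, in seconds

def delay_per_run(threads: int, delay_s: float, between_threads: bool) -> float:
    """Hypothetical helper: delay one run would accumulate under each reading."""
    if between_threads:
        # Reading where each thread waits in turn: 40 x 10 s per run.
        return threads * delay_s
    # Tool tip reading: a single delay between consecutive runs.
    return delay_s

print(delay_per_run(THREADS, DELAY_S, between_threads=True))   # 400
print(delay_per_run(THREADS, DELAY_S, between_threads=False))  # 10
```

The 10 second gap I measured between runs matches the second reading, not the 400 seconds the first would predict.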